Introduction to Data Mining
  A.J.M.M. (Ton) Weijters
  (slides are partially based on an introduction by Gregory Piatetsky-Shapiro)
  /faculteit technologie management
Overview
  Why data mining (data cascade)
  Application examples
  Data Mining & Knowledge Discovery
  Data Mining versus Process Mining
Why Data Mining
  Cascade of data
    – Growth rates differ, but about 30% per year is a low estimate
  The possibility of using computers to analyze data
    – 1975: one computer for the whole university (a mainframe) with 1 MB of working memory; now: a PC with 512 MB of working memory
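To make the 30% figure concrete, the sketch below works out what compound growth at that rate implies. The function names and the 10-year horizon are illustrative assumptions, not part of the slides.

```python
import math

ANNUAL_GROWTH = 0.30  # the slide's "low" estimate of yearly data growth


def growth_factor(years: int, rate: float = ANNUAL_GROWTH) -> float:
    """Factor by which the data volume grows after `years` of compound growth."""
    return (1.0 + rate) ** years


def doubling_time(rate: float = ANNUAL_GROWTH) -> float:
    """Number of years needed for the data volume to double."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    print(f"Growth after 10 years: x{growth_factor(10):.1f}")
    print(f"Doubling time: {doubling_time():.1f} years")
```

Under this assumption the data volume doubles roughly every 2.6 years and grows by a factor of about 13.8 in ten years.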
Cascade of data
  Business and government systems (transaction systems, ERP systems, workflow systems, ...)
  Scientific data: astronomy, biology, etc.
  Web, text, and e-commerce (new regulations about data storage to prevent attacks)
  Hospitals, the internal revenue service, ...
Examples of large databases
  AT&T handles billions of calls per day
    – so much data that it cannot all be stored; analysis has to be done “on the fly”
  Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
  Google
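The VLBI figures can be turned into a rough total volume. The sketch below assumes that every telescope records continuously for the whole 25-day session and that "Gigabit" means 10^9 bits; both are assumptions made for illustration, not statements from the slides.

```python
TELESCOPES = 16                 # from the slide
RATE_BITS_PER_SECOND = 1e9      # 1 Gigabit/second per telescope (decimal giga assumed)
SESSION_DAYS = 25               # from the slide
SECONDS_PER_DAY = 24 * 60 * 60


def session_volume_bytes() -> float:
    """Total data volume of one session, assuming continuous recording."""
    total_bits = TELESCOPES * RATE_BITS_PER_SECOND * SESSION_DAYS * SECONDS_PER_DAY
    return total_bits / 8  # 8 bits per byte


if __name__ == "__main__":
    petabytes = session_volume_bytes() / 1e15
    print(f"One observation session: about {petabytes:.2f} PB")
```

Under these assumptions a single session yields about 4.3 petabytes, which illustrates why such data has to be processed by algorithms instead of being inspected by people.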
First conclusion
  Very little data will ever be looked at by a human
  Data mining algorithms and computers are NEEDED to make sense and use of data.
Application examples I
  Customer Relationship Management (CRM)
    – Based on a database with client information and behavior, try to select other potential consumers of a product.
    – Euro miles.
  Profiling tax cheaters
    – Based on the profile of the taxpayer and some figures from the (electronic) tax form, try to predict tax cheating.
Application examples II
  Health care
    – Given the patient profile and the diagnoses, try to predict the number of hospital days. This information is used in the planning system.
  Industry
    – Job shop planning. Based on already accepted jobs, try to predict the delivery time of a newly offered job.
Types of applications
  Classification (supervised)
    – Credit risk: the result of data mining is a set of rules that can be used to classify new clients as high, normal, or low risk
  Estimation (supervised)
    – Credit risk: the output is not a class but a number between -1 and 1 that indicates risk (-1.0 very low, 0.0 normal, 1.0 very high)
  Clustering (unsupervised)
  Associations: e.g. beer, chips, and peanuts frequently occur together in the shopping list of one person
  Visualization: to facilitate human discovery
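The association example can be made concrete with the two measures commonly used for association rules, support and confidence. The sketch below is a minimal illustration; the shopping lists are invented and the function names are assumptions, not taken from the slides.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical shopping lists, invented for illustration.
BASKETS: List[FrozenSet[str]] = [
    frozenset({"beer", "chips", "peanuts"}),
    frozenset({"beer", "chips", "peanuts", "bread"}),
    frozenset({"beer", "chips"}),
    frozenset({"milk", "bread"}),
    frozenset({"beer", "peanuts", "chips", "milk"}),
]


def support(itemset: Iterable[str], baskets: List[FrozenSet[str]] = BASKETS) -> float:
    """Fraction of baskets that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               baskets: List[FrozenSet[str]] = BASKETS) -> float:
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    antecedent = frozenset(antecedent)
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0  # the rule never applies in this data
    return support(antecedent | frozenset(consequent), baskets) / base


if __name__ == "__main__":
    print("support(beer, chips, peanuts):", support({"beer", "chips", "peanuts"}))
    print("confidence(beer, chips -> peanuts):",
          confidence({"beer", "chips"}, {"peanuts"}))
```

In this toy data the three items occur together in 3 of the 5 lists, and 3 of the 4 lists with beer and chips also contain peanuts.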
Supervised versus unsupervised
  Supervised (credit risk)
    – The starting point is a historical database with client information and financial data, including each client's credit history (the classification). This database is used to induce credit risk rules.
  Unsupervised (clustering)
    – Try to cluster customers into similar groups (how many groups? similar in which sense?)
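The two open questions of clustering (how many groups, similar in which sense) show up as explicit choices in even the smallest clustering routine. The sketch below is a one-dimensional k-means written for illustration only: the number of groups `k` and the similarity measure (absolute difference in yearly spend) are assumptions the analyst has to supply, and the customer data is invented.

```python
from typing import List


def kmeans_1d(values: List[float], k: int, iterations: int = 20) -> List[List[float]]:
    """Group `values` into `k` clusters by repeatedly assigning each value
    to its nearest cluster center and recomputing the centers."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    clusters: List[List[float]] = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            # "Similar" here means: smallest absolute difference to a center.
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    return clusters


if __name__ == "__main__":
    yearly_spend = [120, 150, 130, 900, 950, 1000, 400, 420]  # hypothetical customers
    print(kmeans_1d(yearly_spend, k=2))
    print(kmeans_1d(yearly_spend, k=3))
```

No labels are used anywhere, which is what makes the task unsupervised; changing `k` from 2 to 3 splits the low spenders into two groups, so the answer depends on the choice made up front.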
E-commerce – Case Study
  A person buys a book (product) at Amazon.com.
  Task: recommend other books (products) this person is likely to buy
  Amazon does clustering based on books bought:
    – customers who bought “Advances in Knowledge Discovery and Data Mining” also bought “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”
  The recommendation program is quite successful
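A "customers who bought X also bought Y" list can be approximated by simply counting co-purchases. The sketch below does only that; it is not Amazon's actual method, and the customer ids and shortened book titles are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical purchase histories; ids and titles are invented.
PURCHASES: Dict[str, Set[str]] = {
    "c1": {"KDD Advances", "Data Mining with Java", "Statistics Primer"},
    "c2": {"KDD Advances", "Data Mining with Java"},
    "c3": {"KDD Advances", "Cooking Basics"},
    "c4": {"Cooking Basics", "Garden Guide"},
}


def also_bought(book: str, purchases: Dict[str, Set[str]] = PURCHASES,
                top_n: int = 2) -> List[str]:
    """Rank other books by how many buyers of `book` also bought them."""
    counts: Counter = Counter()
    for basket in purchases.values():
        if book in basket:
            counts.update(basket - {book})
    # Sort by descending count, then by title so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(also_bought("KDD Advances"))
```

A production recommender would need far more data and would have to correct for books that are popular with everyone, which plain counting does not do.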
Hands-on project I
  Historical consumer data
    – Age, education, sex, relationship, etc.
    – Income
  Model to predict income above 50K
  Use the model to select consumers for direct mailing
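The two steps of the project (induce a model from historical data, then apply it to select consumers) can be sketched with the simplest possible model, a single threshold rule on one attribute. Everything below is an illustrative assumption: the records are invented, only years of education is used, and the actual project may use a different learner and all available attributes.

```python
from typing import Dict, List, Tuple

# Hypothetical historical records: (years of education, income above 50K?)
HISTORY: List[Tuple[int, bool]] = [
    (9, False), (10, False), (12, False), (12, False), (13, True),
    (14, True), (15, False), (16, True), (16, True), (18, True),
]


def fit_threshold(history: List[Tuple[int, bool]]) -> int:
    """Pick the threshold t for which the rule
    'income above 50K if education >= t' is correct on the most records."""
    candidates = sorted({education for education, _ in history})

    def correct(t: int) -> int:
        return sum((education >= t) == label for education, label in history)

    return max(candidates, key=correct)


def select_for_mailing(consumers: Dict[str, int], threshold: int) -> List[str]:
    """Return the consumers the rule predicts to earn above 50K."""
    return [name for name, education in consumers.items() if education >= threshold]


if __name__ == "__main__":
    threshold = fit_threshold(HISTORY)
    new_consumers = {"Ann": 17, "Bob": 11, "Cas": 13, "Dee": 12}  # hypothetical
    print("Learned rule: education >=", threshold)
    print("Mail to:", select_for_mailing(new_consumers, threshold))
```

The rule is evaluated here on the same records it was fitted to; a real project would measure its quality on separate test data before using it for a mailing.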
Problems Suitable for Data Mining
  have sub-optimal current methods
  have accessible, sufficient, and relevant data
  provide high payoff for the right decisions!
  (have a changing environment)
Knowledge Discovery Definition
  Knowledge Discovery in Data is the non-trivial process of identifying
    – valid
    – novel
    – potentially useful
    – and ultimately understandable
  patterns in data.
  From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Related Fields
  [Diagram] Data Mining and Knowledge Discovery is related to:
    – Machine Learning
    – Visualization
    – Statistics
    – Databases
Statistics, Machine Learning and Data Mining
  Statistics:
    – more theory-based
    – more focused on testing hypotheses
  Machine Learning:
    – more heuristic than theory-based
    – focused on improving the performance of a learning algorithm
  Data Mining and Knowledge Discovery:
    – Data Mining: one step in the Knowledge Discovery process (applying the machine learning algorithm)
    – Knowledge Discovery: the whole process, including data cleaning, learning, and integration and visualization of results
  Distinctions are fuzzy
Knowledge Discovery Process flow, according to CRISP-DM
  [Diagram] Business Understanding
  Data Understanding
  Data Preparation
    – 80% of the time
  Modeling (applying the mining algorithm)
    – 20%
  Monitoring
Phases and Tasks
  Business Understanding
    – Determine Business Objectives: Background, Business Objectives, Business Success Criteria
    – Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
    – Determine Data Mining Goal: Data Mining Goals, Data Mining Success Criteria
    – Produce Project Plan: Project Plan, Initial Assessment of Tools and Techniques
  Data Understanding
    – Collect Initial Data: Initial Data Collection Report
    – Describe Data: Data Description Report
    – Explore Data: Data Exploration Report
    – Verify Data Quality: Data Quality Report
  Data Preparation
    – Data Set, Data Set Description
    – Select Data: Rationale for Inclusion / Exclusion
    – Clean Data: Data Cleaning Report
    – Construct Data: Derived Attributes, Generated Records
    – Integrate Data: Merged Data
    – Format Data: Reformatted Data
  Modeling
    – Select Modeling Technique: Modeling Technique, Modeling Assumptions
    – Generate Test Design: Test Design
    – Build Model: Parameter Settings, Models, Model Description
    – Assess Model: Model Assessment, Revised Parameter Settings
  Evaluation
    – Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
    – Review Process: Review of Process
    – Determine Next Steps: List of Possible Actions, Decision
  Deployment
    – Plan Deployment: Deployment Plan
    – Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
    – Produce Final Report: Final Report, Final Presentation
    – Review Project: Experience Documentation
Other related fields
  Data warehouse
    – A data warehouse does not simply contain data accumulated at a central point; the data is carefully assembled from a variety of information sources around the organization, cleaned up, quality assured, and then released (published).
  Business Intelligence (BI)
    – The use of the data in the data warehouse to provide managers with important information
Data Mining versus Process Mining
  Process Mining is data mining, but with a strong business process view.
  Some of the more traditional data mining techniques can be used in the context of process mining.
  Some new techniques have been developed to perform process mining (mining of process models).
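To give an impression of what mining a process model starts from, the sketch below counts how often one activity directly follows another in an event log. This is only a basic building block that process discovery techniques commonly work with; the slides do not describe a specific algorithm, and the event log and activity names are invented.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical event log: for each case, the executed activities in time order.
EVENT_LOG: Dict[str, List[str]] = {
    "case1": ["register", "check", "approve", "archive"],
    "case2": ["register", "check", "reject", "archive"],
    "case3": ["register", "check", "approve", "archive"],
}


def directly_follows(log: Dict[str, List[str]]) -> Counter:
    """Count how often activity a is immediately followed by activity b."""
    counts: Counter = Counter()
    for trace in log.values():
        counts.update(zip(trace, trace[1:]))  # consecutive pairs within one case
    return counts


if __name__ == "__main__":
    for (a, b), n in sorted(directly_follows(EVENT_LOG).items()):
        print(f"{a} -> {b}: {n}")
```

The resulting counts already hint at the process structure: after "check" the flow splits into "approve" and "reject", and both branches end in "archive".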
Why Process Mining
  Traditional As-Is analysis of business processes is strongly based on the opinions of process experts. The basic idea is to assemble an appropriate team and to organize modeling sessions in which the knowledge of the team members is used to build an adequate As-Is process model.
  The added value of process mining in As-Is analysis:
    – information based on the real performance of the process (objective)
    – more details