Introduction To Data Mining


Introduction to Data Mining
A.J.M.M. (Ton) Weijters
(slides are partially based on an introduction by Gregory Piatetsky-Shapiro)
/faculteit technologie management

Overview
- Why data mining (data cascade)
- Application examples
- Data Mining & Knowledge Discovery
- Data Mining versus Process Mining

Why Data Mining
- Cascade of data
  - Growth rates differ, but about 30% per year is a low estimate
- The possibility to use computers to analyze data
  - 1975: one mainframe computer for the whole university, with 1 MB of working memory; now: a PC with 512 MB of working memory
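
To make the growth estimate concrete, the following minimal sketch compounds a constant 30% yearly growth rate. The function names and the 10-year horizon are illustrative assumptions, not taken from the slides.

```python
import math


def growth_factor(rate: float, years: int) -> float:
    """Factor by which a quantity grows after `years` of compound growth."""
    return (1.0 + rate) ** years


def doubling_time(rate: float) -> float:
    """Years needed to double at a constant compound growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    rate = 0.30  # the slide's low estimate: about 30% per year
    horizon = 10  # assumed horizon, chosen only for illustration
    print(f"Doubling time: {doubling_time(rate):.2f} years")
    print(f"Growth over {horizon} years: x{growth_factor(rate, horizon):.1f}")
```

Under these assumptions the data volume doubles roughly every 2.6 years and grows almost 14-fold in a decade.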

Cascade of data
- Business and government systems (transaction systems, ERP systems, workflow systems, ...)
- Scientific data: astronomy, biology, etc.
- Web, text, and e-commerce (new regulations about data storage to prevent attacks)
- Hospitals, internal revenue service, ...

Examples of large databases
- AT&T handles billions of calls per day
  - so much data that it cannot all be stored; analysis has to be done "on the fly"
- Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
- Google
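
The VLBI figures can be turned into a total data volume with a short calculation. This is a rough sketch that assumes every telescope records continuously for the full 25 days and that "Gigabit" means 10^9 bits; neither assumption is stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def session_volume_bytes(telescopes: int, bits_per_second: int, days: int) -> float:
    """Total bytes produced if all telescopes stream for the whole session."""
    total_bits = telescopes * bits_per_second * days * SECONDS_PER_DAY
    return total_bits / 8


if __name__ == "__main__":
    # 16 telescopes, 1 Gigabit/second each, 25-day session (figures from the slide)
    volume = session_volume_bytes(telescopes=16, bits_per_second=10**9, days=25)
    print(f"{volume / 1e15:.2f} petabytes per session")
```

Under these assumptions one observation session produces about 4.3 petabytes.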

First conclusion
- Very little data will ever be looked at by a human
- Data mining algorithms and computers are NEEDED to make sense and use of data

Overview
- Why data mining (data cascade)
- Application examples
- Data Mining & Knowledge Discovery
- Data Mining versus Process Mining

Application examples I
- Customer Relationship Management (CRM)
  - Based on a database with client information and behavior, try to select other potential consumers of a product.
  - Euro miles.
- Profiling tax cheaters
  - Based on the profile of the taxpayer and some figures from the (electronic) tax form, try to predict tax cheating.

Application examples II
- Health care
  - Given the patient profile and the diagnoses, try to predict the number of hospital days. This information is used in the planning system.
- Industry
  - Job shop planning: based on already accepted jobs, try to predict the delivery time of a newly offered job.

Types of applications
- Classification (supervised)
  - Credit risk: the results of data mining are rules that can be used to classify new clients as high, normal, or low risk
- Estimation (supervised)
  - Credit risk: the output is not a class but a number between -1 and 1 that indicates risk (-1.0 very low, 0.0 normal, 1.0 very high)
- Clustering (unsupervised)
- Associations: e.g. Beer & Chips & Peanuts occur frequently together in the shopping list of one person
- Visualization: to facilitate human discovery
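
The association example can be illustrated by counting how often item combinations occur together. The sketch below is a brute-force count over made-up toy baskets; the function name, the baskets, and the 50% support threshold are assumptions for illustration, and practical tools use more efficient algorithms.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, size, min_support):
    """Return itemsets of `size` items whose share of baskets is >= min_support."""
    counts = Counter()
    for basket in baskets:
        # Count every combination of `size` distinct items in this basket.
        for itemset in combinations(sorted(set(basket)), size):
            counts[itemset] += 1
    total = len(baskets)
    return {items: n / total for items, n in counts.items() if n / total >= min_support}


# Made-up shopping lists, one per person.
baskets = [
    ["beer", "chips", "peanuts"],
    ["beer", "chips", "peanuts", "bread"],
    ["milk", "bread"],
    ["beer", "chips"],
    ["beer", "peanuts", "chips", "milk"],
]

if __name__ == "__main__":
    print(frequent_itemsets(baskets, size=3, min_support=0.5))
```

In this toy data only the combination beer, chips, and peanuts reaches the threshold: it appears in three of the five baskets.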

Supervised versus unsupervised
- Supervised (credit risk)
  - The starting point is a historical database with client information and financial data, including each client's credit history (the classification). This database is used to induce credit risk rules.
- Unsupervised (clustering)
  - Try to cluster customers into similar groups (how many groups, and similar in which sense?)
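
The difference can be sketched on a single numeric attribute. The supervised function uses the known outcomes to induce a threshold rule; the unsupervised function sees only the values and groups them. The data, the attribute (a debt-to-income ratio), the choice of two groups, and absolute distance as the similarity measure are all assumptions made for illustration.

```python
def induce_threshold_rule(values, labels):
    """Supervised: find the threshold for the rule 'value >= threshold -> high'
    that misclassifies the fewest labelled examples."""
    best_threshold, best_errors = None, len(values) + 1
    for threshold in sorted(set(values)):
        errors = sum((v >= threshold) != (label == "high")
                     for v, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled values into two groups (1-D k-means, k=2)."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Made-up toy data: one ratio per client, plus the historical credit outcome.
debt_ratio = [0.10, 0.15, 0.20, 0.55, 0.60, 0.70]
outcome = ["low", "low", "low", "high", "high", "high"]

if __name__ == "__main__":
    print(induce_threshold_rule(debt_ratio, outcome))  # uses the labels
    print(two_means(debt_ratio))                       # ignores the labels
```

The number of groups and the distance measure had to be fixed in advance for the clustering, which is exactly the open question the slide raises for the unsupervised case.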

E-commerce – Case Study
- A person buys a book (product) at Amazon.com.
- Task: recommend other books (products) this person is likely to buy
- Amazon does clustering based on books bought:
  - customers who bought "Advances in Knowledge Discovery and Data Mining" also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
- The recommendation program is quite successful
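
A "customers who bought X also bought Y" list can be approximated by simple co-purchase counting. Note that this is a simpler stand-in technique, not the clustering mentioned on the slide and not Amazon's actual method; the customers, the third title, and the function name are hypothetical.

```python
from collections import Counter

ADVANCES = "Advances in Knowledge Discovery and Data Mining"
PRACTICAL = ("Data Mining: Practical Machine Learning Tools and Techniques "
             "with Java Implementations")
OTHER = "Hypothetical Book C"  # placeholder title, not from the slides


def also_bought(orders, product, top_n=3):
    """Rank products by how many customers bought them together with `product`."""
    counts = Counter()
    for items in orders.values():
        if product in items:
            counts.update(set(items) - {product})
    # Sort by count (descending), then by title so ties have a fixed order.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [title for title, _ in ranked[:top_n]]


# Hypothetical purchase histories, one set of titles per customer.
orders = {
    "customer_1": {ADVANCES, PRACTICAL},
    "customer_2": {ADVANCES, PRACTICAL, OTHER},
    "customer_3": {OTHER},
}

if __name__ == "__main__":
    for title in also_bought(orders, ADVANCES):
        print(title)
```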

Hands-on project I
- Historical consumer data
  - Age, education, sex, relationship, etc.
  - Income
- Model to predict income above 50K
- Use the model to select consumers for direct mailing

Problems Suitable for Data Mining
- have sub-optimal current methods
- have accessible, sufficient, and relevant data
- provide a high payoff for the right decisions!
- (have a changing environment)

Overview
- Why data mining (data cascade)
- Application examples
- Data Mining & Knowledge Discovery
- Data Mining versus Process Mining

Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
- valid
- novel
- potentially useful
- and ultimately understandable
patterns in data.
From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996

Related Fields
[Diagram relating Data Mining and Knowledge Discovery to Machine Learning, Visualization, Statistics, and Databases]

Statistics, Machine Learning and Data Mining
- Statistics
  - more theory-based
  - more focused on testing hypotheses
- Machine Learning
  - more heuristic than theory-based
  - focused on improving the performance of a learning algorithm
- Data Mining and Knowledge Discovery
  - Data Mining: one step in the Knowledge Discovery process (applying the Machine Learning algorithm)
  - Knowledge Discovery: the whole process, including data cleaning, learning, and integration and visualization of results
- Distinctions are fuzzy

Knowledge Discovery Process flow, according to CRISP-DM
[Diagram: Business Understanding, Data Understanding, Data Preparation (80% of the time), Modeling (applying the mining algorithm, 20%), Monitoring]

Phases and tasks

Business Understanding
- Determine Business Objectives: Background, Business Objectives, Business Success Criteria
- Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Determine Data Mining Goal: Data Mining Goals, Data Mining Success Criteria
- Produce Project Plan: Project Plan, Initial Assessment of Tools and Techniques

Data Understanding
- Collect Initial Data: Initial Data Collection Report
- Describe Data: Data Description Report
- Explore Data: Data Exploration Report
- Verify Data Quality: Data Quality Report

Data Preparation
- Outputs: Data Set, Data Set Description
- Select Data: Rationale for Inclusion / Exclusion
- Clean Data: Data Cleaning Report
- Construct Data: Derived Attributes, Generated Records
- Integrate Data: Merged Data
- Format Data: Reformatted Data

Modeling
- Select Modeling Technique: Modeling Technique, Modeling Assumptions
- Generate Test Design: Test Design
- Build Model: Parameter Settings, Models, Model Description
- Assess Model: Model Assessment, Revised Parameter Settings

Evaluation
- Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
- Review Process: Review of Process
- Determine Next Steps: List of Possible Actions, Decision

Deployment
- Plan Deployment: Deployment Plan
- Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
- Produce Final Report: Final Report, Final Presentation
- Review Project: Experience Documentation

Other related fields
- Data warehouse
  - A data warehouse does not simply contain data accumulated at a central point; the data is carefully assembled from a variety of information sources around the organization, cleaned up, quality assured, and then released (published).
- Business Intelligence (BI)
  - The use of the data in the data warehouse to support managers with important information

Overview
- Why data mining (data cascade)
- Application examples
- Data Mining & Knowledge Discovery
- Data Mining versus Process Mining

Data Mining versus Process Mining
- Process Mining is data mining but with a strong business process view.
- Some of the more traditional data mining techniques can be used in the context of process mining.
- Some new techniques are developed to perform process mining (mining of process models).

Why Process Mining
- Traditional As-Is analysis of business processes is strongly based on the opinions of process experts. The basic idea is to assemble an appropriate team and to organize modeling sessions in which the knowledge of the team members is used to build an adequate As-Is process model.
- The added value of process mining in the As-Is analysis is:
  - information based on the real performance of the process (objective)
  - more details
