Introduction To Data Mining Methods And Tools


Introduction to Data Mining: Methods and Tools
by Michael Hahsler

Agenda
- What is Data Mining?
- Data Mining Tasks
- Relationship to Statistics, Optimization, Machine Learning and AI
- Tools
- Data
- Legal, Privacy and Security Issues

What is Data Mining?
One of many definitions: "Data mining is the science of extracting useful knowledge from huge data repositories."
Source: ACM SIGKDD, Data Mining Curriculum: A Proposal, http://www.kdd.org/curriculum

Why Data Mining? Commercial Viewpoint
- Businesses collect and warehouse lots of data.
  - Purchases at department/grocery stores
  - Bank/credit card transactions
  - Web and social media data
  - Mobile and IoT
- Computers are cheaper and more powerful.
- Competition to provide better services.
  - Mass customization and recommendation systems
  - Targeted advertising
  - Improved logistics

Why Mine Data? Scientific Viewpoint
- Data are collected and stored at enormous speeds (GB/hour).
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Data mining may help scientists
  - identify patterns and relationships
  - classify and segment data
  - formulate hypotheses

Knowledge Discovery in Databases (KDD) Process
[Figure: KDD process diagram. Recoverable annotations: understand domain; data normalization; noise/outliers; missing data; data/dim. reduction; feature engineering; feature selection; decide on task & algorithm; performance?]
Source: Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. 1996. From data mining to knowledge discovery: an overview.

CRISP-DM Reference Model
- Cross Industry Standard Process for Data Mining
- Open standard process model
- Industry, tool and application neutral
- Defines tasks and outputs.
- Now developed by IBM as the Analytics Solutions Unified Method for Data Mining/Predictive Analytics (ASUM-DM).
- SAS has SEMMA, and most consulting companies use their own similar process.
Source: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

Tasks in the CRISP-DM Model


Data Mining Tasks
- Descriptive Methods: Find human-interpretable patterns that describe the data.
- Predictive Methods: Use some features (variables) to predict an unknown or future value of another variable.

Data Mining Tasks
[Figure: overview of data mining tasks. Recoverable labels: Regression, Classification]
Source: Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 2006


Clustering
Group points such that
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.
Ideal grouping is not known → Unsupervised Learning
[Figure: Euclidean distance based clustering in 3-D space. Intracluster distances are minimized; intercluster distances are maximized.]
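As an illustration of distance-based grouping, the following is a minimal sketch of k-means, one common clustering algorithm (the slides do not name a specific algorithm, so this choice is an assumption). The function name `kmeans`, its parameters, and the 3-D sample points are all made up for the example.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Tiny k-means for tuples of floats, using Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical 3-D data: two well-separated groups.
points = [
    (1.0, 1.0, 1.0), (1.2, 0.9, 1.1), (0.8, 1.1, 0.9),
    (8.0, 8.0, 8.0), (8.1, 7.9, 8.2), (7.9, 8.2, 7.8),
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

No class labels are supplied; the grouping emerges only from the distances, which is what makes this unsupervised. On less clearly separated data, k-means can settle on a poor grouping depending on the starting centroids.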

Clustering: Market Segmentation
Goal: Subdivide a market into distinct subsets of customers. Use a different marketing mix for each segment.
Approach:
1. Collect different attributes of customers based on their geographical and lifestyle related information and observed buying patterns.
2. Find clusters of similar customers.

Clustering Documents
Goal: Find groups of documents that are similar to each other.
Approach: Identify frequently occurring terms in each document. Define a similarity measure based on term co-occurrences. Use it to cluster.
Gain: Can be used to organize documents or to create recommendations.
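One possible similarity measure over term frequencies is cosine similarity; the sketch below uses it as a stand-in, since the slides do not specify the measure. The helper names and the three example documents are hypothetical.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase, split on whitespace, and count terms."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini-corpus.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data",
    "d3": "the telescope scans the sky",
}
vectors = {name: term_counts(text) for name, text in docs.items()}
sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])  # shared terms
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])  # no shared terms
print(round(sim_12, 3), sim_13)
```

A clustering algorithm can then group documents whose pairwise similarity is high.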

Clustering: Data Reduction
Goal: Reduce the data size for predictive models.
Approach: Group data given a subset of the available information and then use the group label instead of the original data as input for predictive models.


Association Rule Discovery
- Given is a set of transactions. Each contains a number of items.
- Produce dependency rules of the form LHS → RHS, which indicate that if the set of items in the LHS is in a transaction, then the transaction likely will also contain the RHS item.

Transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Discovered rules:
{Milk} → {Coke}
{Diaper, Milk} → {Beer}
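The strength of such rules is commonly quantified with support and confidence. These two measures are not named on the slide and are introduced here as an assumption about how "likely" would be measured. The sketch below computes both for the two discovered rules on the five transactions above; the function names are illustrative.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    return support(lhs | rhs) / support(lhs)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "->", sorted(rhs),
          "support:", support(lhs | rhs),
          "confidence:", round(confidence(lhs, rhs), 2))
```

For {Milk} → {Coke}, three of five transactions contain both items, and three of the four transactions with Milk also contain Coke.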

Association Rule Discovery: Marketing and Sales Promotion
Let the rule discovered be {Potato Chips, ...} → {Soft drink}
- Soft drink as RHS: What should be done to boost sales? Discount Potato Chips?
- Potato Chips in LHS: Shows which products would be affected if the store discontinues selling Potato Chips.
- Potato Chips in LHS and Soft drink in RHS: What products should be sold with Potato Chips to promote sales of Soft drinks!

Association Rule Discovery: Supermarket Shelf Management
Goal: To identify items that are bought together by sufficiently many customers.
Approach:
- Process the point-of-sale data to find dependencies among items.
- Place dependent items
  - close to each other (convenience), or
  - far from each other to expose the customer to the maximum number of products in the store.

Association Rule Discovery: Inventory Management
Goal: Anticipate the nature of repairs to keep the service vehicles equipped with the right parts to speed up repair time.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover co-occurrence patterns.


Regression
- Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Studied in statistics and econometrics.
Applications:
- Predicting sales amounts of a new product based on advertising expenditure.
- Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices (autoregressive models).
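For the linear case with a single predictor, the model can be fitted by ordinary least squares. The following is a minimal sketch under that assumption; `fit_line` and the advertising/sales numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising spend vs. sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_line(advertising, sales)
predicted = intercept + slope * 6.0  # predicted sales for a spend of 6.0
print(intercept, slope, predicted)
```

The fitted line is then used to predict the continuous target for an input value that was not observed.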


Classification
Find a model for the class attribute as a function of the values of other attributes/features.
Class information is available → Supervised Learning
[Figure: table of records (columns include Tid, Refund) used to learn a classifier model.]

Classification
Find a model for the class attribute as a function of the values of other attributes/features.
Goal: Assign new records to a class as accurately as possible.
[Figure: a training set (columns include Tid, Refund, Income, and the class attribute Cheat?) is used to learn a classifier model, which is then applied to a test set.]
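The train/test workflow can be sketched with a 1-nearest-neighbor classifier, chosen here only because it is short; the slides do not prescribe a particular classifier. The records, the feature encoding, and the function names below are hypothetical.

```python
import math


def predict_1nn(train, features):
    """Return the class label of the closest training record (1-nearest neighbor)."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], features))
    return nearest[1]


def accuracy(train, test):
    """Share of test records whose predicted class matches the known class."""
    correct = sum(1 for features, label in test if predict_1nn(train, features) == label)
    return correct / len(test)


# Hypothetical records: (income in $1000s, refund as 0/1) -> class label.
# Income dominates the distance here; real use would scale the features.
training_set = [
    ((125, 1), "no"), ((100, 0), "no"), ((70, 0), "no"), ((120, 1), "no"),
    ((95, 0), "yes"), ((85, 0), "yes"), ((90, 0), "yes"), ((220, 0), "no"),
]
test_set = [((88, 0), "yes"), ((130, 1), "no"), ((65, 0), "no")]

print(accuracy(training_set, test_set))        # evaluate on held-out records
print(predict_1nn(training_set, (92, 0)))      # classify a new record
```

The model is learned from records with known classes and judged on records it has not seen, which is the role of the test set.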

Classification: Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new product.
Approach:
- Use the data for a similar product introduced before or from a focus group. We have customer information (e.g., demographics, lifestyle, previous purchases) and know which customers decided to buy and which decided otherwise. This buy/don’t buy decision forms the class attribute.
- Use this information as input attributes to learn a classifier model.
- Apply the model to new customers to predict if they will buy the product.

Classification: Customer Attrition/Churn
Goal: To predict whether a customer is likely to be lost to a competitor.
Approach:
- Use detailed records of transactions with each of the past and present customers to find attributes (frequency, recency, complaints, demographics, etc.).
- Label the customers as loyal or disloyal.
- Find a model for disloyalty.
- Rank each customer on a loyal/disloyal scale (e.g., churn probability).

Classification: Sky Survey Cataloging
Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
Approach:
- Segment the image to identify objects.
- Derive features per object (40).
- Use known objects to model the class based on these features.
Result: Found 16 new high red-shift quasars.
Source: [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996


Deviation/Anomaly Detection
- Detect significant deviations from normal behavior.
- Applications:
  - Credit Card Fraud Detection
  - Network Intrusion Detection
Typical network traffic at University level may reach over 100 million connections per day.
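A very simple way to flag deviations from normal behavior is a z-score rule: mark values that lie many standard deviations from the mean. This is only one of many possible approaches and is not taken from the slides; the threshold of 2.5 and the sample amounts are assumptions.

```python
import statistics


def find_anomalies(values, threshold=2.5):
    """Return values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical card transaction amounts with one unusually large charge.
amounts = [50, 52, 49, 51, 48, 50, 53, 47, 51, 500]
suspicious = find_anomalies(amounts)
print(suspicious)
```

Because the outlier itself inflates the mean and standard deviation, robust variants (for example, based on the median) are often preferred in practice.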

Other Data Mining Tasks
- Text mining: document clustering, topic models
- Mining spatiotemporal data (e.g., moving objects)
- Graph mining: social networks
- Data stream mining/real time data mining
- Visual data mining
- Distributed data mining

Challenges of Data Mining
- Scalability
- Data ownership and privacy
- Data a


Origins of Data Mining
- Draws ideas from AI, machine learning, pattern recognition, statistics, and database systems.
- There are differences in terms of
  - used data and
  - the goals.
[Figure residue. Recoverable labels: AI; Machine Learning (1959-)]

Relationship to other Fields
[Figure: diagram relating data mining to other fields. Recoverable labels: Online Learning, Application Areas, Math]

Artificial Intelligence: Create an autonomous agent that perceives its environment and takes actions that maximize its chance of reaching some goal.
Areas: reasoning, knowledge representation, planning, learning, natural language processing, and vision.

Optimization: Selection of a best alternative from some set of available alternatives with regard to some criterion.
Techniques: Linear programming, integer programming, nonlinear programming, stochastic and robust optimization, heuristics, etc.

Statistics: Study of the collection, analysis, interpretation, presentation, and organization of data.
Techniques: Descriptive statistics, statistical inference (estimation, testing), design of experiments.

Learning Strategy: From what data do we learn?
- Is a training set with correct answers available? → Supervised learning
- Long-term structure of rewards? → Reinforcement learning
- No answer and no reward structure? → Unsupervised learning
- Do we have to update the model regularly? → Online learning

Statistical learning: Deals with the problem of finding a predictive function based on data.
Tools: (Linear) classifiers, regression and regularization.

Machine Learning: Involves the study of algorithms that can extract information automatically, i.e., without on-line human guidance.
Techniques: Focus on supervised learning.

Data Mining: Manually analyze a given dataset to gain insights and predict potential outcomes.
Techniques: Any applicable technique from databases, statistics, machine/statistical learning. New methods were developed by the Data Mining community.

Data Mining & Analytics
[Figure: chart residue. Recoverable labels: Statistics, OR, Machine Learning, Data Mining / Stats, DB / CS]

Prescriptive Analytics
What decisions should we make now to achieve the best future outcome?
[Figure: flow diagram. Recoverable labels: Data; Predictive Model; Predict what will happen in the future; Decision; Predict what will change; Evaluate predicted outcomes; Optimize]
Issues:
- What are the decision variables? Causality?
- Relationship can be non-linear. Convex?
- Uncertainty about quality and reliability of the predictive model.
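The idea of combining a predictive model with optimization can be sketched as follows: a model predicts the outcome of each candidate decision, and the decision with the best predicted outcome is selected. The linear demand model, the price grid, and all names are hypothetical and chosen only to keep the example small.

```python
def predicted_demand(price):
    """Hypothetical predictive model: demand falls linearly with price."""
    return max(0.0, 100.0 - 4.0 * price)


def predicted_revenue(price):
    """Evaluate the predicted outcome of a pricing decision."""
    return price * predicted_demand(price)


# Decision variable: price. Candidate decisions on a coarse grid (assumption).
candidate_prices = [p * 0.5 for p in range(1, 51)]  # 0.5, 1.0, ..., 25.0

# Optimize: pick the decision with the best predicted outcome.
best_price = max(candidate_prices, key=predicted_revenue)
print(best_price, predicted_revenue(best_price))
```

The recommendation is only as good as the predictive model behind it, which is the uncertainty issue raised in the list above.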

Data Science
[Figure; source: T. Stadelmann, et al., Applied Data Science in Europe]
Good luck finding this person! Probably a team effort!


Tools: Commercial Players
[Figure: Gartner MQ for Data Science and Machine Learning Platforms, 2020 vs 2019 changes.]

Tools: Popularity
Source: https://www.kdnuggets.com/polls/

Tools: Types
- Simple graphical user interface
- Process oriented
- Programming oriented

Tools: Simple GUI
- Weka: Waikato Environment for Knowledge Analysis (Java API)
- Rattle: GUI for Data Mining using R

Tools: Process oriented
- SAS Enterprise Miner
- IBM SPSS Modeler
- RapidMiner
- KNIME
- Orange

Tools: Programming oriented
- R
  - Rattle for beginners
  - RStudio IDE, markdown, shiny
  - Microsoft R Open
- Python
  - NumPy, scikit-learn, pandas
  - Jupyter notebook
- Both have similar capabilities. Slightly different focus:
  - R: statistical computing and visualization
  - Python: scripting, big data
  - Interoperability via rpy2 and reticulate

https://www.dataquest.io/blog/python-vs-r/


Data

Data Warehouse
Source: http://www.fulcrumlogic.com/data warehousing.shtml

Data Warehouse
- Subject Oriented: Data warehouses are designed to help you analyze data (e.g., sales data is organized by product and customer).
- Integrated: Integrates data from disparate sources into a consistent format.
- Nonvolatile: Data in the data warehouse are never overwritten or deleted.
- Time Variant: Maintains both historical and (nearly) current data.

ETL: Extract, Transform and Load
- Extracting data from outside sources
- Transforming data to fit analytical needs. E.g.,
  - Clean missing data, wrong data, etc.
  - Normalize and translate (e.g., 1 → "female")
  - Join from several sources
  - Calculate and aggregate data
- Loading data into the data warehouse
Source: SAS, ETL: What it is and why it matters
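The three ETL stages can be sketched end to end with Python's built-in `sqlite3` module standing in for the warehouse. The source rows, table name, and the 2 → "male" code mapping are assumptions made for the example; only the 1 → "female" translation comes from the slide.

```python
import sqlite3

# Extract: hypothetical rows from two outside sources.
customers = [
    {"id": 1, "gender": 1},
    {"id": 2, "gender": 2},
    {"id": 3, "gender": None},  # missing value
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 12.5},
    {"customer_id": 3, "amount": 7.0},
]

# Transform: translate codes, drop rows with missing data, join, aggregate.
GENDER_LABELS = {1: "female", 2: "male"}  # the 2 -> "male" mapping is an assumption
clean = {c["id"]: GENDER_LABELS[c["gender"]]
         for c in customers if c["gender"] in GENDER_LABELS}
totals = {}
for p in purchases:
    if p["customer_id"] in clean:  # join on customer id
        gender = clean[p["customer_id"]]
        totals[gender] = totals.get(gender, 0.0) + p["amount"]

# Load: write the aggregated result into a warehouse table (in-memory SQLite here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_by_gender (gender TEXT PRIMARY KEY, total REAL)")
warehouse.executemany("INSERT INTO sales_by_gender VALUES (?, ?)", sorted(totals.items()))
warehouse.commit()
rows = warehouse.execute(
    "SELECT gender, total FROM sales_by_gender ORDER BY gender"
).fetchall()
print(rows)
```

Dropping the customer with a missing gender code is just one cleaning policy; imputing a value or keeping an "unknown" category are alternatives.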

OnLine Analytical Processing (OLAP)
Operations:
- Slice
- Dice
- Drill-down
- Roll-up
- Pivot
[Figure: data cube. Recoverable labels: Product (Smart phones), Region (TX)]
Store data in "data cubes" for fast OLAP operations. Requires a special database structure (snowflake schema).
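Several of these operations can be illustrated on a tiny in-memory fact table. This is a sketch of the concepts only, not of how an OLAP engine stores cubes; the facts, dimension names, and helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: (product, region, quarter, units sold).
facts = [
    ("Smart phones", "TX", "Q1", 10),
    ("Smart phones", "TX", "Q2", 15),
    ("Smart phones", "CA", "Q1", 20),
    ("Tablets", "TX", "Q1", 5),
    ("Tablets", "CA", "Q2", 8),
]
DIMS = {"product": 0, "region": 1, "quarter": 2}


def dice(rows, **allowed):
    """Dice: keep rows whose dimension values fall in the allowed sets."""
    return [r for r in rows if all(r[DIMS[d]] in vals for d, vals in allowed.items())]


def slice_(rows, dim, value):
    """Slice: fix a single dimension to one value."""
    return dice(rows, **{dim: {value}})


def roll_up(rows, *dims):
    """Roll-up: total the units over every dimension not listed in dims."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[DIMS[d]] for d in dims)] += r[3]
    return dict(totals)


tx_slice = slice_(facts, "region", "TX")
by_product = roll_up(facts, "product")
by_product_region = roll_up(facts, "product", "region")  # drill-down: one more dimension
print(by_product)
print(by_product_region)
```

Drill-down is the inverse of roll-up: grouping by an additional dimension reveals the detail behind an aggregate.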

Big Data
"Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them." (Wikipedia)
- 3 V's: Volume, velocity, variety, (veracity)
- MapReduce
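The MapReduce programming model can be sketched in a single process with the classic word-count example. Real frameworks distribute the map and reduce phases across many machines; the function names and documents below are illustrative only.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input documents.
documents = ["big data is big", "data mining needs data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can run in parallel on large data sets.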


Legal, Privacy and Security Issues?

Legal, Privacy and Security Issues
- Are we allowed to collect the data?
- Are we allowed to use the data?
- Is privacy preserved in the process?
- Is it ethical to use and act on the data?
Problem: Internet is global, but legislation is local!

Legal, Privacy and Security Issues
Data-Gathering via Apps Presents a Gray Legal Area
By KEVIN J. O’BRIEN. Published: October 28, 2012
BERLIN — Angry Birds, the top-selling paid mobile app for the iPhone in the United States and Europe, has been downloaded more than a billion times by devoted game players around the world, who often spend hours slinging squawking fowl at groups of egg-stealing pigs.
When Jason Hong, an associate professor at the Human-Computer Interaction Institute at Carnegie Mellon University, surveyed 40 users, all but two were unaware that the game was storing their locations so that they could later be the targets of ads.

Here is what the small print says.
USA Today Network, Josh Hafner, 2:38 p.m. EDT July 13, 2016
Pokémon Go’s constant location tracking and camera access required for gameplay, paired with its skyrocketing popularity, could provide data like no app before it.
“Their privacy policy is vague,” Hong said. “I’d say deliberately vague, because of the lack of clarity on the business model.”
The agreement says Pokémon Go collects data about its users as a “business asset.” This includes data used to personally identify players such as email addresses and other information pulled from Google and Facebook accounts players use to sign up for the game.
If Niantic is ever sold, the agreement states, all that data can go to another company.

Conclusion
Data Mining is interdisciplinary and overlaps significantly with many fields:
- Statistics
- CS (machine learning, AI, databases)
- Optimization
Data Mining requires a team effort with members who have expertise in several areas:
- Data management
- Statistics
- Programming
- Communication
- Application domain
