Cse643, Cse590 Data Mining - Stony Brook University

Transcription

cse643, cse590Data MiningProfessor Anita WasilewskaComputer Science DepartmentStony Brook University

Course Textbook Course Textbook:Jianwei Han, Micheline KamberDATA MININGConcepts and TechniquesMorgan Kaufmann, 2006Second Edition

Course structure Part One: Chapter 1: Intuitive Introduction and DMOverview of DM techniques Part Two: Chapter 2 (preprocessing), Chapter 6 (classification) Chapter 5 (association) Chapter 7 (clustering) Part Three: students presentations 1 book chapter 3 (data warehouse) and some parts ofchapters 2, 5, 6, 7, 8, 10, 11 and Student’s Choices Part Four : students presentations 2 Research papers or applications of students’ choiceconnected with their presentation 1

Chapter 1: IntroductionData Mining Main Objectives Indentification of data as asource of useful information Use of discovered informationfor competitive advantageswhen working in businessenviroment

Data – Information - Knowledge Data – as in databases Information or knowledge is ameta information ABOUT thepatterns hidden in the data§ The patterns must be discoveredautomatically

Why Data Mining? Data explosion problemAutomated data collection tools andmature database technology lead totremendous amounts of data stored indatabases, data warehouses and otherdata repositories

Why Data Mining? Data explosion problem We are drowning in data, but starving for knowledge! Solution: Data warehousing and data mining Data Mining:Extraction of interesting knowledge (rules,regularities, patterns, constraints) from data in largedatabases

What is Data Mining? There are many activities with the samename: CONFUSSION DM: Huge volumes of data DM: Potential hidden knowledge DM: Process of discovery of hiddenpatterns in data

DM: Intuitive DefinitionDM is a process to extractpreviously unknown knowledgefrom large volumes of dataRequires both new technologiesand new methods

Data Mining DM creates models (algorithms):classification (chapter 6)association (chapter 5)prediction (chapter 6)clustering (chapter 7)DM often presents the knowledge as a setof rules of the formIF. THEN.In this case it is called a Descriptive DMFinds other relationships in dataDetects deviations

DM: Some Applications Market analysis and management target marketing, customer relationmanagement, market basket analysis,cross selling, market segmentation Risk analysis and management Forecasting, customer retention, improvedunderwriting, quality control, competitiveanalysis

DM: Some Applications More Applications Text mining News groups, emails, documents Web analysis Intelligent query answering Scientific Applications

DM: Business Advantages Data Mining uses gathered data toPredict tendencies and wavesClassifiy new dataFind previously unknown patterns for theuse for business advantages Discover unknown relationships

DM: Technologies Many commercially avaible tools Many methods (models, algorithms)for the same task TOOLS ALONE ARE NOT THESOLUTION The user must often be able tointerpret the results

DM: Technologies One of the requirements of DM is:“the results must be easily comprehensibleto the user” Strenght of Descriptive Methods Most often, especially when dealing withstatistical methods separate analysts are needed to interpretthe results (knowledge) Weakness of Statistical Methods

Data Mining vs Statistics Some statistical methods are consideredas a part of Data Mining i.e. they are used as Data Mining algorithms,or as a part of Data Mining algorithms Some, like statistical prediction methods,different types of regression, andclustering methods are now considered asan integral part of Data Mining researchand applications

Bussiness Applications Buying patternsFraud detectionCustomer CampaingsDecision supportMedical aplicationsMarketingand more

Fraud Detection and Management (B1) Applications– widely used in health care, retail, creditcard services, telecommunications(phone card fraud), etc. Approach– use historical data to build models offraudulent behavior and use data miningto help identify similar instances

Fraud Detection and Management (B2) Examples– auto insurance: detect characteristics ofgroup of people who stage accidents to collecton insurance– money laundering: detect characteristics ofsuspicious money transactions (US Treasury'sFinancial Crimes Enforcement Network)– medical insurance: detect characteristics offraudulent patients and doctors

Fraud Detection and Management (B3) Detecting inappropriate medical treatment– Australian Health Insurance Commission detectedthat in many cases blanket screening tests wererequested (save Australian 1m/yr) Detecting telephone fraud– DM builds telephone call model: destination of thecall, duration, time of day or week.– Detects patterns that deviate from an expected norm.– British Telecom identified discrete groups of callerswith frequent intra-group calls, especially mobilephones, and broke a multimillion dollar fraud

Fraud Detection and Management (B4) Retail– Analysts used Data Mining techniquesto estimate that 38% of retail shrink isdue to dishonest employees– and more .

Data Mining vs Data Marketing Data Mining methods apply to manydomains Data Marketing: Applications of Data Mining methods inwhich the goal is to find buying patternsin Transactional Data Bases APPRIORI Algorithm

Market Analysis and Management (MA1) Data sources for analysis– Credit card transactions, loyalty cards,discount coupons, customer complaint calls,plus (public) lifestyle studies Target marketing– DM finds clusters of “model” customers whoshare the same characteristics: interest,income level, spending habits, etc.

Market Analysis and Management (MA2) Determine customer purchasing patternsover time– Conversion of single to a joint bank account: whenmarriage occurs, etc. Cross-market analysis– Associations/co-relations between product sales– Prediction based on the association information

Market Analysis and Management (MA3) Customer profiling– data mining can tell you what types ofcustomers buy what products (clustering orclassification) Identifying customer requirements identifying the best products for differentcustomers

Corporate Analysis and Risk Management(CA1) Finance planning and asset evaluation– cash flow analysis and prediction– contingent claim analysis to evaluate assets– cross-sectional and time series analysis(financial-ratio, trend analysis, etc.) Resource planning:– summarize and compare the resources andspending

Corporate Analysis and Risk Management(CA2) Competition:– monitor competitors and marketdirections– group customers into classes and aclass-based pricing procedure– set pricing strategy in a highlycompetitive market

Business Summary Data Mining helps to improve competitiveadvantage of organizations in dynamicallychanging environment; it improves clients retention andconversion Different Data Mining methods arerequired for different kind of data anddifferent kinds of goals

Scientific Applications Networks failure detectionControllersGeographic Information SystemsGenome- BioinformaticsIntelligent robotsIntelligent roomsetc etc .

What is NOT Data Mining Once patterns are found Data Miningprocess is finished Use of the patterns is not Data Mining Monitoring is not Data Mining Querries to the database are not DM

Evolution of Database Technology 1960s:– Data collection, database creation, IMS andnetwork DBMS 1970s:– Relational data model, relational DBMSimplementation

Evolution of Database Technology 1980s:– RDBMS, advanced data models (extendedrelational, OO, deductive, etc.) andapplication-oriented DBMS (spatial, scientific,engineering, etc.) 1990s—2000s:– Data mining and data warehousing,multimedia databases, and Web database 2000 --- Big Data

Short History of Data Mining 1989 - KDD term: Knowledge Discovery inDatabases appears in (IJCAI Workshop) 1991 - a collection of research papers editedby Piatetsky-Shapiro and Frawley 1993 – Association Rule Mining AlgorithmAPRIORI proposed by Agraval, Imielinskiand Swami 1996 – present: KDD evolves as aconjuction of different knowledge areas: data bases, machine learning, statistics,artificial intelligence and the term Data Mining becomes popular

Data Mining: Confluence ofMultiple ationScienceStatisticsData MiningVisualizationOtherDisciplines

KDD process Definition[Piatetsky-Shapiro 97] KDD is a non trivial process foridentification of :– Valid– New– Potentially useful– Understable– patterns in data

KDD ProcessINTERPRETATION AND EVALUATIONKnowledgeDATA MINING (proper)Rules, PatternsmodelsData PreparationTransformed dataCLEANINGProcessed DataSELECTIONTarget dataData

DM: Data Mining DM is a step in the KDD process in which algorithms are applied to look forpatterns in data We use term DATA MINING PROPER for DMstep in KDD Process We usually use term DM process term for KDDprocess

DM: Data Mining Process Rememeber It is necessary to apply first the preprocessing operations to clean andpreprocess the data in order to obtain significantpatterns DM Process can be re-iterated- and usually is

Data Mining ProcessINTERPRETATION AND EVALUATIONKnowledgeDATA MINING (proper)Rules, PatternsmodelsData PreparationTransformed dataCLEANINGProcessed DataSELECTIONTarget dataData

KDD vs DM KDD is a term used by academia DM is a often a commercial term DM term is also being used in academia,as it has become a “brand name” for bothKDD process and its DM sub-process The important point is to see Data Mining as a process with DataMining Proper as part of it

Steps of the KDD or DM process Preprocessing: includes all theoperations that have to be performed beforea data mining algorithm is applied() Data Mining (proper): knowledgediscovery algorithms are applied in order toobtain the patterns(8 ) Interpretation: discovered patterns arepresented in a proper format and the userdecides if it is neccesary to re-iterate thealgorthms

Architecture of a Typical Data MiningSystemGraphical user interfacePattern evaluationData mining engineDatabase or datawarehouse serverData cleaning & data ehouse

What Kind of Data? Relational DatabasesData warehousesTransactional databasesAdvanced DB and information repositories––––––Object-oriented and object-relational databasesSpatial databasesTime-series data and temporal dataText databases and multimedia databasesHeterogeneous and legacy databasesWWW

Descriptive Data Mining:concept description Concept – is defined semantically as anysubset of records We often define the concept by distinguishingin our database an attribute c and its value v In this case the concept description is writtensyntactically as : c v We define a concept with the description c v asCONCEPT {records: c v}

Descriptive Data Mining:concept description Let C be a concept with the description c v, i.e.C {records: c v} We call such attribute c a CLASS attribute, or adecision attribute, or a classification attribute The description c v is called a class, ordecision description

Descriptive Data Mining:concept, class description For example: climate wet is a description of the concept of WETCLIMATE and WET CLIMAT {records: climate wet} We use words: decision attribute, classattribute, concept attribute We talk about concept or class description REMEMBER: all definitions are relative tothe database we deal with.

Desctiptive DMConcept, Class Characteristics Let C be a class (concept) with a description c v, i.e.C {records: c v} The class C characteristics is a set of attributes a1, a2, ak, and their respective values v1, v2, . vk that arecharacteristic for a given concept C, i.e. such that {records: a1 v1 & a2 v2& .ak vk} /\ C non empty set Characteristics description of C is then written asa1 v1 & a2 v2& .ak vk

Characterization Describes the process which aim is to findrules that describe characteristic propertiesof a concept. They take the formIf concept then characteristicsIf c v then a2 v2& .ak vkC 1 à A 1 & B 325% (support: there are 25% of the records for whichthe rule is true) C 1 à A 1 & B 417% C 1 à A 0 & B 216%

Discrimination It is the process which aim is to find rulesthat allow us to discriminate the objects(records) belonging to a given concept(one class ) from the rest of recordsIf characteristics then conceptIf a2 v2& .ak vk then c vA 0 & B 1 à C 133% 83% (support, confidence: the conditionalprobability of the concept given the characteristics)A 2 & B 0 à C 127% 80%A 1 & B 1 à C 112% 76%

Classification - Supervised Learning– Classification– Finding models (rules) that describe(characterize) or/ and distinguish(discriminate) classes or concepts for futureprediction– Example: classify countries based on climate(characteristics)– classify cars based on gas mileage and use itto predict classification of a new car

Classification AlgorithmsModels, Basic Classifiers Decision Trees (ID3, C4.5) –descriptiveNeural Networks- statisticalBayesian Networks - statisticalRough Sets - descriptiveGenetic Algorithms – descriptive or statisticalbut mainly an optimization method Classification by Association – descriptive

Classification AlgorithmsModels, Basic Classifiers– Presentation of results:– characteristic and /or discriminant rules– In case of descriptive DM– converged network (Neural, Bayes) in case ofstatistical DM

Statistical DM, Clustering Statistical Prediction - predict some unknownor missing numerical values Cluster analysis (statistical)– Class label is unknown– Goal: group data to form new classes– It is called unsupervised learning– For example: cluster houses to find distributionpatterns– Clustering is based on the principle:– maximizing the intra-class similarity and minimizingthe interclass similarity

Statistical DM Outlier analysis– Outlier: a data object that does not complywith the general behavior of the data– It can be considered as noise or exceptionbut is quite useful in fraud detection, rareevents analysis

Statistical DM Trend and evolution analysis (statistical)– Trend and deviation: regression analysis– Sequential pattern mining, periodicity analysis– Similarity-based analysis Other pattern-directed or statistical analyses

ClassificationSupervised Learning Given a set of objects (concept, class) describedby a class attribute, a classification algorithmsbuilds a set of discriminant and /orcharacterization rules (or other descriptions incase of Statistical methods) in order to be able, as the next step, to classify unknown sets ofobjects This is called a supervised learning

Classification: Chapter 6 Decision Trees (ID3, C4.5) – descriptiveNeural Networks -statisticalRough Sets – descriptiveBayesian Networks- statisticalGenetic Algorithms- can be both, but ismainly an optimization method

Association: Chapter 5Problem Statement I {i1, i2, ., in} a set of itemsTransaction T: set of items, T is subset of IData Base: set of transactionsAn association rule is an implication of theform : X- Y, where X, Y are disjointsubsets of T Problem: Find association rules that havesupport and confidence greater taht userspecified minimum support and minimunconfidence

Association Rules Confidence: a rule X- Y holds in thedatabase D with a confidence c if the c% of transactions in D that contain Xalso contain Y Support: a rule X- Y has a support s inD if s% of transactions contain XUY

Association Rules Association rules presentation (predicate presentation)Multi-dimensional:age(X, “20.29”) income(X, “20.29K”) à buys(X,“PC”) [support 2%, confidence 60%]Single-dimensional:– buys(x, “computer”) à buys(x, “software”) [1%,75%]Association rules presentation (non-predicatepresentation)Age 20.29 /\ income 20.29K à buys PC (2%, 60%)Buys computer à buys software (1%,75%)

Clustering- Chapter 7 Database segmentation Given a set of objects (records) thealgorithm obtains a division of the objectsinto clusters in which the distance ofobjects inside a claster is minimal and thedistance among objects of diferent clustersis maximal Unsupervised learning

Other Statistical Methodschapter 6 Regression Temporal Series Lazy learners Support Vector Machines.

Major Issues in Data Mining (1)Book Slide Mining methodology and user interaction– Mining different kinds of knowledge indatabases– Interactive mining of knowledge at multiplelevels of abstraction– Incorporation of background knowledge– Data mining query languages and ad-hoc datamining– Expression and visualization of data miningresults

Major Issues in Data Mining (2)Book Slide– Handling noise and incomplete data– Pattern evaluation: the interestingnessproblem– Performance and scalabilityEfficiency and scalability of data miningalgorithmsParallel, distributed and incrementalmining methods

Major Issues in Data Mining (3)Book Slide Issues relating to the diversity of data types– Handling relational and complex types of data– Mining information from heterogeneous databases andglobal information systems (WWW) Issues related to applications and social impacts– Application of discovered knowledge Domain-specific data mining tools Intelligent query answering Process control and decision making– Integration of the discovered knowledge with existingknowledge: A knowledge fusion problem– Protection of data security, integrity, and privacy

Summary Data mining: discovering comprehensible, interestingpatterns from large amounts of data A natural evolution of database technology, in greatdemand, with wide applications A KDD process, or DM process includes data cleaning,data integration, data selection, transformation, datamining proper, pattern evaluation, and knowledgepresentation Mining can be performed in a variety of informationrepositories

Summary Data mining functionalities:characterization, discrimination,association, classification, clustering,outlier and trend analysis Classification of data mining systems Major issues in data mining

PreprocesingIntroduction to chapter 2

Preprocesing Select, integrate, and clean the data Decide which kind of patterns are needed Decide which algorithm is the best for yourgoal It depends on many factors Prepare data for algorithms Different algorithms accept differnt dataformat

Implementaion Preparation Identify the problem to be solved.Study problem it in detailExplore the solution spaceFind one acceptable solution (feasible toimplement) Specify the solution Prepare the data

Preparation Remember GIGO! (garbage in gabageout) Add some data, if necessary Structure the data in a proper form Be careful with incomplete and noisy data

Some implementation preparationrules to follow Select the problemSpecify the problemStudy the dataThe problem must guide the search for toolsand technologies Search for the simpliest model (algorithm,method) Define for each data the solution is valid, whereit is not valid at all and where it is valid withsome constraints

Studying the data The surrounding world consists of objects ,(data)and the DM problem is to find the relationshipsamong objects The objects are characterized by properties: attributes, values of attributes that have to be analized The results (rules, descriptions) are valid(true) under certain circumstances (data) and incertain moments (avaible data at the moment)

Measures Type of data decides a way in which dataare analized and preprocessed– Names (attributes)– Categories, classes, class attributes– Ordered values of attributes– Intervals of values of attributes– Types of values of attributes

Types of data Generally we distinguish:– Quantitative Data– Qualitative Data Bivaluated: often very usefulNull Values are not applicableMissing data usually not acceptableNNetworks, and Bayes accepts somemissing data.

What to take into account Eliminate redundant recordsEliminate out of range values of attributesDecide a generalization levelConsistency

Other preprocessing tasks Generalization vs specificationDiscretizationSamplingReducing number of attributes at thepreprocessing stage

Summary The preprocessing is required and is anessential part of the DM process If preprocessing is not performed patternsobtained could be of no use Preprocessing is a tedious task that couldeven take more time that DM proper

APPROACHES TO DATA MINING

Aproaches Mathematics: Consist in the creation ofmathematical models, algorithms, methods, toextract rules, regularities and patterns Rough Sets is the most precise model Statistics: They are focused in the creation ofstatistical models to analise data Regression, Bayesian networks, NN, Clustering

Statistical methods Numerical data are needed Statistical methods are also often used inpreprocessing steps to study the sample Hypothesis validation and regressionanalisys are used in data mining steps ofthe process

Decision trees Discovering discriminant rules Descriptive Data Mining Method: succesive division of the set ofdata This is a classification algorithm Works better when attributeshave a small set of values

Apriori Algorithm Agrawal, Imielinski (IBM S. José. California) It is an intuitive and efficient algorithm to extractassociations from transactions Also used as classification algorithm classification by association Method: Iterates until the associations obtained don’thave the required support

Rough SetsDescriptive Classification Approximation space A (U,IND(B)): Lower Approximation X B {o U /[o] X } Upper Approximation X B {o U /[o] X } Boundary RegionBnd(X)B X B X B Positive Region: POSB(D) { X : X IND ( D )}

Rough SetsBoundaryRegionLowerConceptXBoundary Lower Upper

Variable Precision Rough SetModelConcept XLowerNew objects add to the lower 0c ( X ,Y ) if 1 card ( X Y ) / card ( X )card ( X ) 0card ( X ) 0

Rough Sets in SQLBegin RE clases CLASES FORSELECT C1,.,CN, D, COUNT (*) AS cntFROM RGROUP BY C1,.,CN, DORDER BY C1,.,CN, D, CNT desc”);while not end records() doequ class exec(“FETCH 1 IN cursor”);first decision value get value(equ class(“D”));insert(equ class,upper[first decision value]);while (equ class exec(“FETCH 1 IN cursor”) dodecision value get value(equ class(“D”));insert(equ class,upper[first decision value]);end whileend whileEnd UPPER

Statistical Methods Neural Network: statisticalCLASSIFICATION algorithm the network is trained to obtainclassification patterns Clustering: form groups of objects withoutany previous hypothesis

Genetic Algorithms Optimization method They should be used when the goal is to find anoptimal solution in solution space They often are used together with neuralnetwoks, or other methods to produce moreunderstable (optimal) outputs They also are used to find the optimal set ofdiscriminant and/or characteristic rules for agiven database and a given class

Classification: requirements Decision attribute; called also classattribute, concept attribute Condition attributes: rest of the attributesor its subset Some require numerical data but there arealgorithms to deal with any kind of data

Asociation: requirements Transactional data There is not needed to specify right andleft side of the rules There are algorithms to tackle any kindof data Minimum support Maximum number of rules to beobtained

Clustering: requirements Set of attributesMaximum number of clustersNumber of iterationsMimimun number of elements in anycluster

Data Mining vs Statistics Some statistical methods are considered as a part of Data Mining i.e. they are used as Data Mining algorithms, or as a part of Data Mining algorithms Some, like statistical prediction methods, different types of regression, and clustering methods are now considered as