Data Mining - Uni-due.de

Transcription

Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 1 of Data Mining by I. H. Witten and E. Frank

What’s it all about?
- Data vs information
- Data mining and machine learning
- Structural descriptions
  - Rules: classification and association
  - Decision trees
- Datasets
  - Weather, contact lens, CPU performance, labor negotiation data, soybean classification
- Fielded applications
  - Loan applications, screening images, load forecasting, machine fault diagnosis, market basket analysis
- Generalization as search
- Data mining and ethics
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

Data vs. information
- Society produces huge amounts of data
  - Sources: business, science, medicine, economics, geography, environment, sports, ...
- Potentially valuable resource
- Raw data is useless: need techniques to automatically extract information from it
  - Data: recorded facts
  - Information: patterns underlying the data

Information is crucial
- Example 1: in vitro fertilization
  - Given: embryos described by 60 features
  - Problem: selection of embryos that will survive
  - Data: historical records of embryos and outcome
- Example 2: cow culling
  - Given: cows described by 700 features
  - Problem: selection of cows that should be culled
  - Data: historical records and farmers’ decisions

Data mining
- Extracting implicit, previously unknown, potentially useful information from data
- Needed: programs that detect patterns and regularities in the data
- Strong patterns => good predictions
  - Problem 1: most patterns are not interesting
  - Problem 2: patterns may be inexact (or spurious)
  - Problem 3: data may be garbled or missing

Machine learning techniques
- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly
  - Can be used to predict outcome in new situation
  - Can be used to understand and explain how prediction is derived (may be even more important)
- Methods originate from artificial intelligence, statistics, and research on databases

Structural descriptions
- Example: if-then rules
  If tear production rate = reduced then recommendation = none
  Otherwise, if age = young and astigmatic = no then recommendation = soft
- Example table (mostly lost in extraction). Columns: age, spectacle prescription, astigmatism, tear production rate, recommendation. Surviving rows:
  [...]ermetrope, No, Reduced, None
  Presbyopic, Myope, Yes, Normal, Hard

Can machines really learn?
- Definitions of “learning” from dictionary:
  - To get knowledge of by study, experience, or being taught (difficult to measure)
  - To become aware by information or from observation (difficult to measure)
  - To commit to memory (trivial for computers)
  - To be informed of, ascertain; to receive instruction (trivial for computers)
- Operational definition: things learn when they change their behavior in a way that makes them perform better in the future.
  - Does a slipper learn?
- Does learning imply intention?

The weather problem
- Conditions for playing a certain game
[Table of weather data lost in extraction. Surviving row: Rainy, Mild, Normal, False, Yes]
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
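
The five rules on this slide are meant to be read in order, with the first rule that matches deciding the outcome. A minimal Python sketch of that reading follows; the names `RULES` and `classify` and the dictionary encoding of an example are assumptions made for illustration, not part of the slides.

```python
# Each rule is (conditions, outcome). An empty condition set always matches,
# which encodes the final "none of the above" rule.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
    ({}, "yes"),
]


def classify(example, rules=RULES):
    """Return the outcome of the first rule whose conditions all hold."""
    for conditions, outcome in rules:
        if all(example.get(attr) == value for attr, value in conditions.items()):
            return outcome
    raise ValueError("no rule matched")


if __name__ == "__main__":
    day = {"outlook": "rainy", "temperature": "mild",
           "humidity": "normal", "windy": False}
    print(classify(day))
```

Because the rules are ordered, taking a single rule out of context can be misleading: the fourth rule only applies to days that the first three did not already classify.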

Ross Quinlan
- Machine learning researcher from 1970’s
- University of Sydney, Australia
- 1986 “Induction of decision trees” ML Journal
- 1993 C4.5: Programs for machine learning. Morgan Kaufmann
- 199? Started

Classification vs. association rules
- Classification rule: predicts value of a given attribute (the classification of an example)
  If outlook = sunny and humidity = high then play = no
- Association rule: predicts value of arbitrary attribute (or combination)
  If temperature = cool then humidity = normal
  If humidity = normal and windy = false then play = yes
  If outlook = sunny and play = no then humidity = high
  If windy = false and play = no then outlook = sunny and humidity = high

Weather data with mixed attributes
- Some attributes have numeric values
[Table of weather data with numeric attributes lost in extraction]
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes

The contact lenses data
[Table lost in extraction. Surviving column headers: Age, Spectacle prescription, Astigmatism, Tear production ...]

A complete and correct rule set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
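
One way to probe the "complete" part of this claim is to enumerate every combination of attribute values and check that at least one rule fires and that all firing rules agree. The sketch below does that in Python. The attribute value lists are inferred from the values named in the rules, and the names `RULES`, `recommendations`, and `check_rule_set` are hypothetical. Whether the rules are "correct" with respect to the original data cannot be checked here, since the data table did not survive in this transcription.

```python
from itertools import product

# Attribute values as they appear in the rules on the slide (assumed exhaustive).
AGES = ("young", "pre-presbyopic", "presbyopic")
PRESCRIPTIONS = ("myope", "hypermetrope")
ASTIGMATIC = ("no", "yes")
TEAR_RATES = ("reduced", "normal")

# The nine rules, in slide order, as (conditions, recommendation).
RULES = [
    ({"tear": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def recommendations(example):
    """Return the set of recommendations made by all rules that fire."""
    return {
        outcome
        for conditions, outcome in RULES
        if all(example[attr] == value for attr, value in conditions.items())
    }


def check_rule_set():
    """Return (uncovered, conflicting) examples over all 24 combinations."""
    uncovered, conflicting = [], []
    for age, prescription, astigmatic, tear in product(
        AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES
    ):
        example = {"age": age, "prescription": prescription,
                   "astigmatic": astigmatic, "tear": tear}
        fired = recommendations(example)
        if not fired:
            uncovered.append(example)
        elif len(fired) > 1:
            conflicting.append(example)
    return uncovered, conflicting


if __name__ == "__main__":
    uncovered, conflicting = check_rule_set()
    print("uncovered:", len(uncovered), "conflicting:", len(conflicting))
```

Unlike an ordered decision list, this check treats the rules as an unordered set, so it only passes if overlapping rules never disagree.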

A decision tree for this problem

Classifying iris flowers
      Sepal length  Sepal width  Petal length  Petal width  Type
1     5.1           3.5          1.4           0.2          Iris setosa
2     4.9           3.0          1.4           0.2          Iris setosa
51    7.0           3.2          4.7           1.4          Iris versicolor
52    6.4           3.2          4.5           1.5          Iris versicolor
101   6.3           3.3          6.0           2.5          Iris virginica
102   5.8           2.7          5.1           1.9          Iris virginica
If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
...

Predicting CPU performance
- Example: 209 different computer configurations
[Table of configurations lost in extraction. Surviving column headers: Cycle time (ns), Main ...]
- Linear regression function:
  PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
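
The regression function on this slide can be evaluated directly as a weighted sum. The sketch below assumes that every term without an explicit minus sign is added to the intercept of -55.9. The function name `predict_prp` and the example configuration are hypothetical and are not taken from the 209 configurations in the dataset.

```python
# Coefficients as printed on the slide; only CHMIN carries a minus sign.
INTERCEPT = -55.9
COEFFICIENTS = {
    "MYCT": 0.0489,
    "MMIN": 0.0153,
    "MMAX": 0.0056,
    "CACH": 0.6410,
    "CHMIN": -0.2700,
    "CHMAX": 1.480,
}


def predict_prp(config):
    """Return the predicted performance (PRP) for one configuration."""
    return INTERCEPT + sum(
        weight * config[name] for name, weight in COEFFICIENTS.items()
    )


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to exercise the formula.
    example = {"MYCT": 100, "MMIN": 1000, "MMAX": 8000,
               "CACH": 16, "CHMIN": 1, "CHMAX": 8}
    print(round(predict_prp(example), 3))
```

Keeping the coefficients in a dictionary keyed by attribute name makes it harder to pair a weight with the wrong attribute when the formula is transcribed.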

Data from labor negotiations
Attributes (with type where it survived extraction):
- Duration
- Wage increase first year
- Wage increase second year
- Wage increase third year
- Cost of living adjustment
- Working hours per week: (Number of hours)
- Pension: {none, ret-allw, empl-cntr}
- Standby pay: Percentage
- Shift-work supplement: Percentage
- Education allowance: {yes, no}
- Statutory holidays
- Vacation
- Long-term disability assistance
- Dental plan contribution
- Bereavement assistance
- Health plan contribution
- Acceptability of contract
[Table values mostly lost in extraction; column alignment cannot be recovered. Surviving fragments: "334.3%4.4%?38?4%?12gen?full?fullgood" and "4024.54.0?none40?4?12avgyesfullyeshalfgood"]

Decision trees for the labor data

Soybean classification
Attribute groups and attributes:
- Environment: Time of occurrence, Precipitation, ...
- Seed: Condition, Mold growth, ...
- Fruit: Condition of fruit pods, Fruit spots
- Leaf: Condition, Leaf spot size, ...
- Stem: Condition, Stem lodging, ...
- Root: Condition
- Diagnosis
[Columns "Number of values" and "Sample value" mostly lost in extraction. Surviving entries, in order: 7, 3, July, Above, Yes, 3, 19, Normal, Diaporthe stem canker]

The role of domain knowledge
If leaf condition is normal
and stem condition is abnormal
and stem cankers is below soil line
and canker lesion color is brown
then diagnosis is rhizoctonia root rot

If leaf malformation is absent
and stem condition is abnormal
and stem cankers is below soil line
and canker lesion color is brown
then diagnosis is rhizoctonia root rot

But in this domain, “leaf condition is normal” implies “leaf malformation is absent”!

Fielded applications
- The result of learning (or the learning method itself) is deployed in practical applications
  - Processing loan applications
  - Screening images for oil slicks
  - Electricity supply forecasting
  - Diagnosis of machine faults
  - Marketing and sales
  - Separating crude oil and natural gas
  - Reducing banding in rotogravure printing
  - Finding appropriate technicians for telephone faults
  - Scientific applications: biology, astronomy, chemistry
  - Automatic selection of TV programs
  - Monitoring intensive care patients

Processing loan applications (American Express)
- Given: questionnaire with financial and personal information
- Question: should money be lent?
- Simple statistical method covers 90% of cases
- Borderline cases referred to loan officers
- But: 50% of accepted borderline cases defaulted!
- Solution: reject all borderline cases?
  - No! Borderline cases are most active customers

Enter machine learning
- 1000 training examples of borderline cases
- 20 attributes:
  - age
  - years with current employer
  - years at current address
  - years with the bank
  - other credit cards possessed, ...
- Learned rules: correct on 70% of cases
  - human experts only 50%
- Rules could be used to explain decisions to customers

Screening images
- Given: radar satellite images of coastal waters
- Problem: detect oil slicks in those images
- Oil slicks appear as dark regions with changing size and shape
- Not easy: lookalike dark regions can be caused by weather conditions (e.g. high wind)
- Expensive process requiring highly trained personnel

Enter machine learning
- Extract dark regions from normalized image
- Attributes:
  - size of region
  - shape, area
  - intensity
  - sharpness and jaggedness of boundaries
  - proximity of other regions
  - info about background
- Constraints:
  - Few training examples: oil slicks are rare!
  - Unbalanced data: most dark regions aren’t slicks
  - Regions from same image form a batch
  - Requirement: adjustable false alarm rate

Load forecasting
- Electricity supply companies need forecast of future demand for power
- Forecasts of min/max load for each hour => significant savings
- Given: manually constructed load model that assumes “normal” climatic conditions
- Problem: adjust for weather conditions
- Static model consists of:
  - base load for the year
  - load periodicity over the year
  - effect of holidays

Enter machine learning
- Prediction corrected using “most similar” days
- Attributes:
  - temperature
  - humidity
  - wind speed
  - cloud cover readings
  - plus difference between actual load and predicted load
- Average difference among three “most similar” days added to static model
- Linear regression coefficients form attribute weights in similarity function
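
The correction step described on this slide can be sketched as a small nearest-neighbour lookup: find the three past days whose weather is most similar to today's, average the difference between their actual and predicted loads, and add that average to the static model's prediction. The Python below is only an illustration under stated assumptions. The slide does not give the form of the similarity function, so a weighted absolute difference is assumed here, and the function names, the weight values, and the record layout are all hypothetical.

```python
def weighted_distance(a, b, weights):
    """Assumed similarity measure: weighted sum of absolute differences."""
    return sum(w * abs(a[name] - b[name]) for name, w in weights.items())


def corrected_forecast(static_load, today, history, weights, k=3):
    """Add the mean (actual - predicted) load of the k most similar days."""
    nearest = sorted(
        history,
        key=lambda day: weighted_distance(today, day["weather"], weights),
    )[:k]
    correction = sum(d["actual"] - d["predicted"] for d in nearest) / len(nearest)
    return static_load + correction


if __name__ == "__main__":
    # Hypothetical weights standing in for the regression coefficients.
    weights = {"temperature": 1.0, "humidity": 0.5,
               "wind_speed": 0.2, "cloud_cover": 0.1}
    today = {"temperature": 30, "humidity": 60, "wind_speed": 10, "cloud_cover": 2}
    # Hypothetical past days with the static prediction and the observed load.
    history = [
        {"weather": {"temperature": 30, "humidity": 60, "wind_speed": 10,
                     "cloud_cover": 2}, "actual": 1050, "predicted": 1000},
        {"weather": {"temperature": 29, "humidity": 60, "wind_speed": 10,
                     "cloud_cover": 2}, "actual": 1030, "predicted": 1000},
        {"weather": {"temperature": 31, "humidity": 60, "wind_speed": 10,
                     "cloud_cover": 2}, "actual": 1040, "predicted": 1000},
        {"weather": {"temperature": 5, "humidity": 60, "wind_speed": 10,
                     "cloud_cover": 2}, "actual": 1200, "predicted": 1000},
    ]
    print(corrected_forecast(1000, today, history, weights))
```

The weights matter because the attributes are on different scales: without them, whichever attribute has the largest numeric range would dominate the choice of "most similar" days.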

Diagnosis of machine faults
- Diagnosis: classical domain of expert systems
- Given: Fourier analysis of vibrations measured at various points of a device’s mounting
- Question: which fault is present?
- Preventative maintenance of electromechanical motors and generators
- Information very noisy
- So far: diagnosis by expert/hand-crafted rules

Enter machine learning
- Available: 600 faults with expert’s diagnosis
  - 300 unsatisfactory, rest used for training
- Attributes augmented by intermediate concepts that embodied causal domain knowledge
- Expert not satisfied with initial rules because they did not relate to his domain knowledge
- Further background knowledge resulted in more complex rules that were satisfactory
- Learned rules outperformed hand-crafted ones

Marketing and sales I
- Companies precisely record massive amounts of marketing and sales data
- Applications:
  - Customer loyalty: identifying customers that are
