Statistical learning: An introduction
Emmeke Aarts

Date          Topic                                          Presented by
8 November    Statistical learning: an introduction          Dr. Emmeke Aarts
29 November   Regression from the data science perspective   Dr. Dave Hessen
6 December    Classification                                 Dr. Gerko Vink
10 January    Resampling Methods                             Dr. Gerko Vink
7 February    Regularization                                 Dr. Maarten Cruijf
7 March       Moving beyond linearity                        Dr. Maarten Cruijf
4 April       Tree-Based models                              Dr. Emmeke Aarts
9 May         Support vector Machines                        Dr. Daniel Oberski
6 June        Unsupervised learning                          Prof. Dr. Peter van der Heijden

Emmeke Aarts, ADS lunch lecture - Machine learning, an introduction

Outline
- What is statistical learning
- Accuracy versus interpretability
- Supervised versus unsupervised learning
- Regression versus classification
- Model accuracy & bias-variance trade-off
- Potential benefits for social scientists
- Software

What is statistical learning – Big data
‘[B]ig data’ [...] refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available[.] (NSF, NIH 2012)

What is statistical learning – Big data
[Figure. Source: http://informationcatalyst.com]

What is statistical learning – Big data
So, how is this different from, e.g., the massive data sets arising in physics?
1. ‘Big data’ [is] the amassing of huge amounts of statistical information on social and economic trends and human behavior. (M. Chen) In short: data on people.
2. Granularity: documents of social phenomena at the granularity of individual people and their activities. (M.I. Jordan)
Issues regarding ethics, privacy, bias, fairness, and inclusion.
For a nice overview on this, see Hanna Wallach on Medium: Big data, machine learning and the social sciences: Fairness, accountability, and transparency

Why should social scientists bother
- Science: “minimal evidence of emerging computational social science engaged in quantitative modeling of these new kinds of digital traces.” (Lazer, Science)
- Industry & government: computational social science is occurring on a large scale, in places like
  - Google
  - Yahoo
  - the National Security Agency
See e.g.: D. Lazer et al. (2009). Life in the network: the coming age of computational social science. Science

What is statistical learning
Machine learning (ML): Allowing computers to learn for themselves without being explicitly programmed
- Google: AlphaGo, the computer program that defeated a world champion Go player
- Apple & Android: voice assistants such as Siri
Train a system by showing it examples of input-output behavior, instead of programming it manually by anticipating the desired response for all possible inputs

What is statistical learning
- Artificial intelligence (AI): Constructing machines (robots, computers) to think and act like human beings
- ML is a subset of AI
- Statistical learning (SL): a set of approaches for estimating f, a function that represents our data and that can be used for, e.g., prediction and/or inference
- SL is a subset of ML

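The idea of "estimating f" can be written compactly. The following is a sketch in the notation of James et al. (2013), which the slide itself does not spell out: Y is the output, X the input(s), and ε a random error term with mean zero that is independent of X. All three symbols are assumptions borrowed from that book.

```latex
Y = f(X) + \epsilon, \qquad \hat{Y} = \hat{f}(X)
```

Prediction is mainly concerned with how close \hat{Y} comes to Y, whereas inference is concerned with the form of \hat{f} itself.
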
Supervised versus unsupervised learning
Supervised learning
Input -> Statistical model -> Output
Building a statistical model for predicting / estimating an output based on one or more inputs
Graph from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning

Supervised versus unsupervised learning
Supervised learning
- The most widely used machine-learning methods are supervised
  - spam classifiers of e-mail
  - face recognizers over images
  - medical diagnosis systems for patients
- Methods include
  - decision trees
  - (logistic) regression
  - support vector machines
  - neural networks
  - Bayesian classifiers

Supervised versus unsupervised learning
Unsupervised learning
Input -> Statistical model
Inputs, but no outputs. Try to learn structure and relationships from these data

Supervised versus unsupervised learning
Unsupervised learning
- Relies on assumptions about structural properties of the data
- Dimension reduction methods
  - principal components analysis
  - factor analysis
  - random projections
- Clustering
  - K-means clustering

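As an illustration of learning structure from inputs alone, here is a minimal K-means sketch in plain Python. The simulated two-group data, the function name `kmeans`, and the fixed number of iterations are all assumptions made for this example. In practice one would normally use a library implementation.

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal K-means: alternate between assigning points and updating centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data: two groups of 50 points; no group labels are given to the algorithm.
rng = random.Random(1)
group_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
group_b = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]

centers, clusters = kmeans(group_a + group_b, k=2)
print(sorted(centers))
```

The algorithm only sees the inputs, yet the recovered centers should lie close to the two group means used to simulate the data.
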
How do we learn
Parametric models
- Linear model
- Restrictive
- Inference: interpretable
Non-parametric models
- Smooth thin-plate spline model
- Flexible
- Not so interpretable
Graphs from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning

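The contrast between a restrictive and a flexible approach can be sketched in a few lines of Python. Note the substitution: instead of the thin-plate spline shown on the slide, this example uses k-nearest-neighbours averaging as a simple stand-in for a non-parametric method. The simulated sine-shaped data and the function names are assumptions for illustration.

```python
import math
import random

def fit_linear(x, y):
    """Parametric: assume f(x) = a + b*x and estimate a and b by least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x0: a + b * x0

def fit_knn(x, y, k=5):
    """Non-parametric: no assumed shape; average the outputs of the k nearest inputs."""
    def predict(x0):
        nearest = sorted(range(len(x)), key=lambda i: abs(x[i] - x0))[:k]
        return sum(y[i] for i in nearest) / k
    return predict

# Hypothetical data with a clearly non-linear true f.
rng = random.Random(42)
x = [rng.uniform(0, 2 * math.pi) for _ in range(100)]
y = [math.sin(xi) + rng.gauss(0, 0.2) for xi in x]

linear = fit_linear(x, y)
knn = fit_knn(x, y, k=5)

# Compare each estimate with the true f on an evenly spaced grid.
grid = [i * 2 * math.pi / 50 for i in range(51)]
err_linear = sum((linear(g) - math.sin(g)) ** 2 for g in grid) / len(grid)
err_knn = sum((knn(g) - math.sin(g)) ** 2 for g in grid) / len(grid)
print(f"linear: {err_linear:.3f}  knn: {err_knn:.3f}")
```

The linear model yields two readable coefficients but cannot follow the curve; the flexible method tracks the curve but offers no comparable summary of how x relates to y.
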
Accuracy versus interpretability
[Image from: -black-box/]

Black box example from neuroscience
Deep image reconstruction from human brain activity
G. Shen*, T. Horikawa*, K. Majima*, and Y. Kamitani, 2017

Accuracy versus interpretability
[Illustration from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning]

Regression versus classification
- Quantitative outcomes: predict a quantitative outcome - regression
- Qualitative outcomes: predict to which category an observation belongs - classification

Model accuracy
Model accuracy: Mean Squared Error
To obtain the MSE, we use a training set (to fit the model) and a test set (to evaluate it)

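A minimal sketch of this procedure in plain Python: the MSE is the average of the squared differences between the observed and the predicted outcomes. The simulated data, the 150/50 split, and the simple linear model are assumptions chosen for illustration only.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def fit_line(x, y):
    """Least-squares estimates (a, b) for the model y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical data: a linear signal plus noise with standard deviation 1.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

# The first 150 observations form the training set, the last 50 the test set.
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

# Fit on the training set only, then evaluate on both sets.
a, b = fit_line(x_train, y_train)
train_mse = mse(y_train, [a + b * xi for xi in x_train])
test_mse = mse(y_test, [a + b * xi for xi in x_test])
print(f"training MSE: {train_mse:.3f}  test MSE: {test_mse:.3f}")
```

The test MSE is the quantity of interest, because it is computed on observations the model has not seen while being fitted.
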
Model accuracy
[Graphs from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning]

Bias-variance trade-off
[Graphs from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning]

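The trade-off can be stated as a decomposition of the expected test MSE at a new input x_0. The formula is not on the slide; it is given here in the notation of James et al. (2013), where \hat{f} is the estimated function and ε the irreducible error term.

```latex
E\big[(y_0 - \hat{f}(x_0))^2\big]
  = \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \mathrm{Var}(\epsilon)
```

More flexible methods tend to lower the bias term but raise the variance term, so the expected test MSE is smallest at an intermediate level of flexibility. It can never fall below Var(ε).
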
Potential benefits for social scientists
- Solutions to overfitting (what we call the replication bias)
1. uncover patterns and structure embedded in data
2. test and improve model specification and predictions
3. perform data reduction

Software
- R and Python: core of machine learning development
- MATLAB has an ML toolbox, but lacks customizability
- Some techniques available in SPSS / Stata
- Specific programs for specific techniques, e.g., TensorFlow

References
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. New York: Springer.
- Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., ... & Jebara, T. (2009). Life in the network: the coming age of computational social science. Science, 323(5915), 721.
- Shen, G., Horikawa, T., Majima, K., & Kamitani, Y. (2017). Deep image reconstruction from human brain activity. bioRxiv, 240317.
- learning-and-the-socialsciences-927a8e20460d
