Statistical Learning - Universiteit Utrecht

Statistical learning: An introduction
Emmeke Aarts

Date          Topic                                           Presented by
8 November    Statistical learning: an introduction           Dr. Emmeke Aarts
29 November   Regression from the data science perspective    Dr. Dave Hessen
6 December    Classification                                  Dr. Gerko Vink
10 January    Resampling methods                              Dr. Gerko Vink
7 February    Regularization                                  Dr. Maarten Cruijf
7 March       Moving beyond linearity                         Dr. Maarten Cruijf
4 April       Tree-based models                               Dr. Emmeke Aarts
9 May         Support vector machines                         Dr. Daniel Oberski
6 June        Unsupervised learning                           Prof. Dr. Peter van der Heijden

Emmeke Aarts. ADS lunch lecture - Machine learning, an introduction

Outline
- What is statistical learning
- Accuracy versus interpretability
- Supervised versus unsupervised learning
- Regression versus classification
- Model accuracy & bias-variance trade-off
- Potential benefits for social scientists
- Software

What is statistical learning – Big data
‘[B]ig data’ [...] refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available[.] (NSF, NIH 2012)

What is statistical learning – Big data
[Figure. Source: http://informationcatalyst.com]

What is statistical learning – Big data
So, how is this different from, e.g., the massive data sets arising in physics?
1. ‘Big data’ [is] the amassing of huge amounts of statistical information on social and economic trends and human behavior. (M. Chen)
   Data on people.
2. Granularity: documents of social phenomena at the granularity of individual people and their activities. (M.I. Jordan)
Issues regarding ethics, privacy, bias, fairness, and inclusion.
For a nice overview on this, see Hanna Wallach on Medium: Big data, machine learning and the social sciences: Fairness, accountability, and transparency

Why should social scientists bother
- Science: “minimal evidence of emerging computational social science engaged in quantitative modeling of these new kinds of digital traces.” (Lazer, Science)
- Industry & government: computational social science is occurring on a large scale, in places like
  - Google
  - Yahoo
  - the National Security Agency
See e.g.: D. Lazer et al. (2009). Life in the network: the coming age of computational social science. Science

What is statistical learning
Machine learning (ML): allowing computers to learn for themselves without being explicitly programmed
- Google: AlphaGo, the computer program that defeated a world champion Go player
- Apple & Android: voice assistants such as Siri
Train a system by showing it examples of input-output behavior, instead of programming it manually by anticipating the desired response for all possible inputs

What is statistical learning
- Artificial intelligence (AI): constructing machines (robots, computers) to think and act like human beings
- ML is a subset of AI
- Statistical learning (SL): a set of approaches for estimating f, a function that represents our data and that can be used for, e.g., prediction and/or inference
- SL is a subset of ML
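
To make "estimating f" concrete: James et al. (2013), the textbook these slides draw on, write the relationship between an output Y and inputs X = (X_1, ..., X_p) as below. The notation is theirs and is assumed here, since the slide itself gives no formula.

```latex
Y = f(X) + \varepsilon
```

Here f is the fixed but unknown function that statistical learning tries to estimate, and \varepsilon is a random error term with mean zero that is independent of X.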

Supervised versus unsupervised learning
Supervised learning
Input → Statistical model → Output
Building a statistical model for predicting / estimating an output based on one or more inputs
Graph from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning

Supervised versus unsupervised learning
Supervised learning
- The most widely used machine-learning methods are supervised
  - spam classifiers of e-mail
  - face recognizers over images
  - medical diagnosis systems for patients
- Methods include
  - decision trees
  - (logistic) regression
  - support vector machines
  - neural networks
  - Bayesian classifiers
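
As a minimal sketch of the supervised setting (inputs paired with known outputs), the example below fits a simple least-squares regression line and uses it to predict a new case. The variable names and the data are invented for illustration and are not taken from the lecture.

```python
# Toy supervised learning: learn f from (input, output) examples, then predict.
# The data below are made up for illustration only.
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return least-squares (intercept, slope) for y ~ intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x_new):
    """Apply the fitted line to a new input value."""
    intercept, slope = model
    return intercept + slope * x_new


hours_studied = [1, 2, 3, 4, 5]      # inputs
exam_score = [52, 54, 56, 58, 60]    # outputs, constructed as 50 + 2 * hours

model = fit_simple_linear_regression(hours_studied, exam_score)
prediction = predict(model, 6)
print(model, prediction)
```

Because the toy outputs lie exactly on a line, the fitted slope and intercept recover the values used to construct them; with real data the fit would only approximate f.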

Supervised versus unsupervised learning
Unsupervised learning
Input → Statistical model
Inputs, but no outputs. Try to learn structure and relationships from these data.

Supervised versus unsupervised learning
Unsupervised learning
- Assumptions about structural properties of the data
- Dimension reduction methods
  - principal components analysis
  - factor analysis
  - random projections
- Clustering
  - K-means clustering
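
K-means clustering, named on the slide above, can be sketched in a few lines. This is a bare-bones version of the standard alternating assign/update procedure (Lloyd's algorithm), with made-up points and hand-picked starting centroids; it is not a production implementation and omits random restarts and convergence checks.

```python
# Minimal k-means on invented 2-D points. No output labels are used:
# the algorithm only sees the inputs and looks for group structure.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def k_means(points, initial_centroids, n_iterations=10):
    centroids = list(initial_centroids)
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, initial_centroids=[points[0], points[3]])
print(centroids, labels)
```

The result depends on the starting centroids, which is why practical implementations usually run several random initialisations and keep the best solution.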

How do we learn
Parametric models                  Non-parametric models
Linear model                       Smooth thin-plate spline model
Restrictive                        Flexible
Inference: interpretable           Not so interpretable
Graphs from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
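
The contrast between a restrictive parametric model and a flexible non-parametric one can be sketched on toy data. Note the substitution: instead of the thin-plate spline shown on the slide, this example uses k-nearest-neighbours regression as a simpler non-parametric stand-in. The data follow y = x^2 and are invented for illustration.

```python
# Parametric (straight line) versus non-parametric (k-nearest neighbours)
# on invented data with a curved true relationship, y = x ** 2.
from statistics import mean


def fit_line(x, y):
    """Least-squares (intercept, slope): the model is summarised by 2 numbers."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


def knn_predict(x, y, x_new, k=3):
    """Average the outputs of the k training points closest to x_new."""
    nearest = sorted(zip(x, y), key=lambda pair: abs(pair[0] - x_new))[:k]
    return mean(value for _, value in nearest)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [4.0, 1.0, 0.0, 1.0, 4.0]

intercept, slope = fit_line(x, y)
linear_at_0 = intercept + slope * 0.0   # the line cannot bend towards the dip
knn_at_0 = knn_predict(x, y, 0.0, k=3)  # a local average follows the curve
print(linear_at_0, knn_at_0)
```

The line is easy to interpret (one slope, one intercept) but misses the curvature; the nearest-neighbour fit tracks the data more closely but has no coefficients to interpret, which is the trade-off the slide describes.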

Accuracy versus interpretability
[Figure. Image from: -black-box/]

Black box example from neuroscience
Deep image reconstruction from human brain activity
G. Shen*, T. Horikawa*, K. Majima*, and Y. Kamitani (2017)

Accuracy versus interpretability
[Figure. Illustration from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning]

Regression versus classification
Quantitative outcomes: predict a quantitative outcome - regression
Qualitative outcomes: predict to which category an observation belongs - classification
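
The distinction can be sketched with one method used two ways: the same nearest-neighbour lookup returns an average for a quantitative outcome (regression) and a majority vote for a qualitative one (classification). The variables `age`, `income`, and `votes` are hypothetical toy data, not from the lecture.

```python
# One nearest-neighbour lookup, two kinds of outcome.
from collections import Counter
from statistics import mean


def k_nearest_targets(train_x, train_y, x_new, k):
    """Outcomes of the k training cases whose input is closest to x_new."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: abs(pair[0] - x_new))
    return [target for _, target in ranked[:k]]


def knn_regress(train_x, train_y, x_new, k=3):
    # Quantitative outcome: average the neighbours' values.
    return mean(k_nearest_targets(train_x, train_y, x_new, k))


def knn_classify(train_x, train_y, x_new, k=3):
    # Qualitative outcome: take the most common category among neighbours.
    neighbours = k_nearest_targets(train_x, train_y, x_new, k)
    return Counter(neighbours).most_common(1)[0][0]


age = [20, 25, 30, 50, 55, 60]                    # input
income = [21.0, 24.0, 30.0, 48.0, 51.0, 57.0]     # quantitative outcome
votes = ["no", "no", "yes", "yes", "yes", "no"]   # qualitative outcome

predicted_income = knn_regress(age, income, 27)
predicted_vote = knn_classify(age, votes, 27)
print(predicted_income, predicted_vote)
```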

Model accuracy
Model accuracy: Mean Squared Error (MSE)
To obtain the MSE, we use a training set (to fit the model) and a test set (to evaluate it)
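
For reference, James et al. (2013), the source of the graphs on the following slides, define the mean squared error as below. The notation is theirs and is assumed here: \hat{f}(x_i) is the model's prediction for the i-th observation and n is the number of observations.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{f}(x_i) \bigr)^2
```

Computed on the observations used to fit the model, this is the training MSE; computed on observations that were held out, it is the test MSE, which is the quantity of interest when judging predictions.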

Model accuracy
[Figure. Graphs from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning]
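
A rough simulation of the comparison behind graphs like these: fit models of increasing flexibility on a training set, then compute the MSE on both the training set and a separate test set. Everything here is an assumption made for illustration: the true function (a sine curve), the noise level, the sample sizes, and the use of k-nearest-neighbours regression, where a smaller k means a more flexible fit.

```python
# Training versus test MSE for fits of different flexibility (simulated data).
import math
import random
from statistics import mean


def knn_predict(train, x_new, k):
    """Average the outputs of the k training points closest to x_new."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return mean(y for _, y in nearest)


def mse(train, data, k):
    """Mean squared error of a k-NN fit on `train`, evaluated on `data`."""
    return mean((y - knn_predict(train, x, k)) ** 2 for x, y in data)


def simulate(n):
    # Hypothetical truth: y = sin(x) + Gaussian noise with sd 0.3.
    xs = [random.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]


random.seed(1)
train, test = simulate(50), simulate(50)

# k = 1 is the most flexible fit, k = 25 the least flexible.
results = {k: (mse(train, train, k), mse(train, test, k)) for k in (1, 5, 25)}
for k, (train_mse, test_mse) in results.items():
    print(f"k={k:2d}  training MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training MSE is zero, yet the test MSE is not. That gap is why accuracy has to be judged on data the model has not seen.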

Bias-variance trade-off
[Figure. Graphs from: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning]
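
The trade-off in the title refers to the decomposition given in James et al. (2013): the expected test MSE at a new point x_0 splits into three non-negative parts. The notation below follows that book and is assumed here.

```latex
E\bigl( y_0 - \hat{f}(x_0) \bigr)^2
  = \mathrm{Var}\bigl( \hat{f}(x_0) \bigr)
  + \bigl[ \mathrm{Bias}\bigl( \hat{f}(x_0) \bigr) \bigr]^2
  + \mathrm{Var}(\varepsilon)
```

More flexible methods tend to have lower bias but higher variance, and restrictive methods the reverse. The last term, \mathrm{Var}(\varepsilon), is the irreducible error, which no choice of model can remove.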

Potential benefits for social scientists
- Solutions to overfitting (what we call the replication bias)
1. uncover patterns and structure embedded in data
2. test and improve model specification and predictions
3. perform data reduction

Software
- R and Python: core of machine learning development
- MATLAB has an ML toolbox, but lacks customizability
- Some techniques available in SPSS / Stata
- Specific programs for specific techniques, e.g., TensorFlow

References
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. New York: Springer.
- Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., ... & Jebara, T. (2009). Life in the network: the coming age of computational social science. Science, 323(5915), 721.
- Shen, G., Horikawa, T., Majima, K., & Kamitani, Y. (2017). Deep image reconstruction from human brain activity. bioRxiv, 240317.
- learning-and-the-socialsciences-927a8e20460d
