Human And Machine Learning - Carnegie Mellon University

Transcription

Human and Machine Learning
Tom Mitchell
Machine Learning Department
Carnegie Mellon University
November 20, 2006

How can studies of machine (human) learning inform studies of human (machine) learning?

Learning = improving performance at some task through experience

Outline
1. Machine Learning and Human Learning
2. Aligning specific results from ML and HL
   - Learning to predict and achieve rewards: TD learning / dopamine system in the brain
   - Value of redundancy in data inputs: cotraining / intersensory redundancy hypothesis
3. Core questions and conjectures

Machine Learning - Practice
Application areas: speech recognition, object recognition, mining databases, text analysis, control learning
Methods: reinforcement learning, supervised learning, Bayesian networks, hidden Markov models, unsupervised clustering, explanation-based learning, ...

Machine Learning - Theory
PAC Learning Theory (for supervised concept learning) relates:
- # of examples (m)
- error rate (ε)
- representational complexity (H)
- failure probability (δ)
Similar theories for reinforcement skill learning, active student querying, and unsupervised learning, also relating:
- # of mistakes during learning
- learner's query strategy
- convergence rate
- asymptotic performance
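One concrete instance of these relationships is the classic sample-complexity bound for a consistent learner over a finite hypothesis class H: m ≥ (1/ε)(ln|H| + ln(1/δ)). A minimal sketch of the arithmetic (the hypothesis class and the numbers are illustrative, not from the talk):

```python
# Sample-complexity bound for a finite hypothesis class and a consistent
# learner: m >= (1/epsilon) * (ln|H| + ln(1/delta)). Numbers are illustrative.
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Examples sufficient to guarantee true error <= epsilon
    with probability >= 1 - delta, given |H| = h_size."""
    return math.ceil((1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta)))

# e.g., boolean conjunctions over 10 features: |H| = 3**10 + 1
m = pac_sample_bound(3 ** 10 + 1, epsilon=0.1, delta=0.05)
```

The bound makes the slide's trade-offs explicit: m grows only logarithmically in |H| and 1/δ, but linearly in 1/ε.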

What We Know About ML
- Excellent algorithms for pure induction: SVMs, decision trees, graphical models, neural nets, ...
- Algorithms for dimensionality reduction: PCA, ICA, compression algorithms, ...
- Fundamental information-theoretic bounds relate data and biases to probability of successful learning: PAC learning theory, statistical estimation, grammar induction, ...
- Active learning by querying a teacher is much more data-efficient than random observation
- Algorithms to learn from delayed feedback (reinforcement): temporal difference learning, Q learning, policy iteration, ...

ML Has Little to Say About
- Learning cumulatively over time
- Learning from instruction (lectures, discussion)
- Role of motivation, forgetting, curiosity, fear, boredom, ...
- Implicit (unconscious) versus explicit (deliberate) learning
- ...

What We Know About HL*
Neural level:
- Hebbian learning: the connection between a pre-synaptic and post-synaptic neuron increases if the pre-synaptic neuron is repeatedly involved in activating the post-synaptic one
  - Biochemistry: NMDA channels, Ca2+, AMPA receptors, ...
- Timing matters: the effect is strongest if the pre-synaptic action potential occurs within 0-50 msec before post-synaptic firing
- Time constants for synaptic changes are a few minutes
  - Can be disrupted by protein inhibitors injected after the training experience

* I'm not an expert
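In machine learning terms, the Hebbian rule is often sketched as a correlational weight update, Δw = η · x_pre · x_post. A toy illustration (the learning rate η and the activity values are illustrative assumptions; real synaptic models add decay or normalization to keep weights bounded):

```python
# Toy Hebbian update: the weight between two units grows when pre- and
# post-synaptic activity coincide. eta is an illustrative learning rate.
def hebbian_step(w, x_pre, x_post, eta=0.1):
    return w + eta * x_pre * x_post

w = 0.0
# repeated co-activation strengthens the connection ...
for _ in range(10):
    w = hebbian_step(w, x_pre=1.0, x_post=1.0)
# ... while pre-synaptic activity alone leaves it unchanged
w_idle = hebbian_step(w, x_pre=1.0, x_post=0.0)
```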

What We Know About HL*
System level:
- In addition to single-synapse changes, memory formation involves longer-term "consolidation" involving multiple parts of the brain
- The time constant for consolidation is hours or days: memory of new experiences can be disrupted by events occurring after the experience (e.g., drug interventions, trauma)
  - E.g., injections in the amygdala 24 hours after training can impact recall of the experience, with no impact on recall within a few hours
- Consolidation is thought to involve regions such as the amygdala, hippocampus, and frontal cortex; the hippocampus might orchestrate consolidation without itself being home of the memories
- Dopamine seems to play a role in reward-based learning (and addictions)

* I'm not an expert

What We Know About HL*
Behavioral level:
- Power law of practice: competence vs. training on a log-log plot is a straight line, across many skill types
- Role of reasoning and knowledge compilation in learning: chunking, ACT-R, Soar
- Timing: expanded spacing of stimuli aids memory, ...
- Theories about the role of sleep in learning/consolidation
- Implicit and explicit learning (unaware vs. aware)
- Developmental psychology knows much about the sequence of acquired expertise during childhood
  - Intersensory redundancy hypothesis

* I'm not an expert
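The power law of practice can be checked algebraically: if time per trial is T(N) = a · N^(-b), then log T = log a - b · log N, which is exactly the straight line on a log-log plot. A quick sketch (the constants a and b are illustrative, not fitted to any data):

```python
# Power law of practice: T(N) = a * N**(-b). In log-log space this is a
# line of slope -b. The constants a, b below are illustrative only.
import math

a, b = 10.0, 0.5   # assumed initial trial time and learning exponent

def trial_time(n):
    return a * n ** (-b)

# slope between successive points in log-log space is constant (= -b)
pts = [(math.log(n), math.log(trial_time(n))) for n in (1, 10, 100, 1000)]
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
```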

Models of Learning Processes
Machine Learning:
- # of examples
- Error rate
- Reinforcement learning
- Explanations
- Learning from examples
- Complexity of learner's representation
- Probability of success
- Exploitation / exploration
- Prior probabilities
- Loss functions

Human Learning:
- # of examples
- Error rate
- Reinforcement learning
- Explanations
- Human supervision (lectures, question answering)
- Attention, motivation
- Skills vs. principles
- Implicit vs. explicit learning
- Memory, retention, forgetting

1. Learning to predict and achieve rewards
TD learning / Dopamine in the brain

Reinforcement Learning
[Sutton and Barto 1981; Samuel 1957]

V*(s) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]

Reinforcement Learning in ML
Example: γ = 0.9, reward r = 100 on reaching the goal, 0 elsewhere
S0 → S1 → S2 → S3, with V = 72, V = 81, V = 90, V = 100

V(s_t) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]
V(s_t) = E[r_t] + γ V(s_{t+1})

To learn V, use each transition to generate a training signal.

Reinforcement Learning in ML
training error = r_t + γ V(s_{t+1}) - V(s_t)
- Variants of RL have been used for a variety of practical control learning problems:
  - Temporal Difference learning
  - Q learning
  - Learning MDPs, POMDPs
- Theoretical results too:
  - Assured convergence to optimal V(s) under certain conditions
  - Assured convergence for Q(s,a) under certain conditions
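The training signal above is the standard TD(0) update, V(s) ← V(s) + α[r + γ V(s') - V(s)]. A minimal sketch on the four-state chain from the previous slide (the learning rate α and episode count are illustrative assumptions):

```python
# TD(0) on the chain S0 -> S1 -> S2 -> S3 -> goal, with reward 100 on the
# final transition out of S3 and 0 elsewhere, gamma = 0.9 (as on the slide).
GAMMA = 0.9
ALPHA = 0.1      # illustrative learning rate

def td0(episodes=2000):
    V = [0.0, 0.0, 0.0, 0.0]          # V(S0)..V(S3)
    for _ in range(episodes):
        for s in range(4):
            terminal = (s == 3)        # S3's successor is the absorbing goal
            r = 100.0 if terminal else 0.0
            v_next = 0.0 if terminal else V[s + 1]
            delta = r + GAMMA * v_next - V[s]   # TD error: r + gamma*V(s') - V(s)
            V[s] += ALPHA * delta
    return V
```

Repeated sweeps drive V toward the slide's values 72.9, 81, 90, 100.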

Dopamine As Reward Signal
[Schultz et al., Science, 1997]
(figures: dopaminergic neuron firing rate vs. time t)

The dopaminergic response tracks the TD error: error = r_t + γ V(s_{t+1}) - V(s_t)

RL Models for Human Learning
[Seymour et al., Nature 2004]

[Seymour et al., Nature 2004]
(figure)

Human EEG responses to Pos/Neg Reward
from [Nieuwenhuis et al.]
- Response due to feedback on a timing task (press a button exactly 1 sec after a sound)
- The neural source appears to be in the anterior cingulate cortex (ACC)
- The response is abnormal in some subjects with OCD

One Theory of RL in the Brain
from [Nieuwenhuis et al.]
- Basal ganglia monitor events and predict future rewards
- When a prediction is revised upward (downward), this causes an increase (decrease) in the activity of midbrain dopaminergic neurons, influencing the ACC
- This dopamine-based activation somehow results in revising the reward prediction function, possibly through direct influence on the basal ganglia, and via prefrontal cortex

Summary: Temporal Difference ML Model Predicts Dopaminergic Neuron Activity during Learning
- Evidence now of neural reward signals from:
  - Direct neural recordings in monkeys
  - fMRI in humans (1 mm spatial resolution)
  - EEG in humans (1-10 msec temporal resolution)
- Dopaminergic responses track the temporal difference error in RL
- Some differences, and efforts to refine the HL model:
  - Better information processing model
  - Better localization to different brain regions
  - Study timing (e.g., does the basal ganglia learn faster than PFC?)

2. The value of unlabeled multi-sensory data for learning classifiers
Cotraining / Intersensory redundancy hypothesis

Redundantly Sufficient Features
(figure: "Professor Faloutsos", "my advisor": two redundant views, each sufficient to identify the target concept)

Co-Training
Idea: Train Classifier1 and Classifier2 to:
1. Correctly classify labeled examples
2. Agree on classification of unlabeled examples
(diagram: Classifier1 → Answer1, Classifier2 → Answer2)
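The idea above can be sketched as a small co-training loop in the style of Blum & Mitchell: each classifier trains on its own view of the labeled data, then confidently labels unlabeled examples that become training data for both. Everything concrete here (nearest-class-mean classifiers, the margin-based confidence, the pool sizes) is an illustrative assumption, not the talk's implementation:

```python
# Minimal co-training sketch: two views x1, x2 per example, each view
# alone sufficient to predict the label. Classifiers are toy
# nearest-class-mean models over 1-D views (an illustrative choice).
class MeanClassifier:
    def fit(self, xs, ys):
        self.mu = {c: sum(x for x, y in zip(xs, ys) if y == c) /
                      max(1, sum(1 for y in ys if y == c)) for c in (0, 1)}
    def predict(self, x):
        return min((abs(x - m), c) for c, m in self.mu.items())[1]
    def confidence(self, x):
        # margin between distances to the two class means
        return abs(abs(x - self.mu[0]) - abs(x - self.mu[1]))

def cotrain(labeled, unlabeled, rounds=5, k=2):
    # labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)
    pool = list(unlabeled)
    c1, c2 = MeanClassifier(), MeanClassifier()
    for _ in range(rounds):
        c1.fit([x1 for (x1, _), _ in labeled], [y for _, y in labeled])
        c2.fit([x2 for (_, x2), _ in labeled], [y for _, y in labeled])
        if not pool:
            break
        # each classifier labels the unlabeled examples it is most sure of;
        # those examples become training data for BOTH classifiers
        for clf, view in ((c1, 0), (c2, 1)):
            pool.sort(key=lambda p: -clf.confidence(p[view]))
            for p in pool[:k]:
                labeled.append((p, clf.predict(p[view])))
            pool = pool[k:]
    return c1, c2
```

A real use would substitute any probabilistic classifier per view; the structure of the loop is what matters.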

Co-Training: Where else might this work?
- Learning lexicons and named entity recognizers for people, places, dates, books, ... (e.g., Riloff & Jones; Collins et al.)
(diagram: for the sentence "I flew to New York today.", Classifier1 sees the context "I flew to ___ today" and Classifier2 sees the phrase "New York", producing Answer1 and Answer2)

CoEM applied to Named Entity Recognition
[Rosie Jones, 2005], [Ghani & Nigam, 2000]
(figures: update rules passing label estimates between the noun-phrase (NP) classifier and the context classifier)


Co-Training Theory
[Blum & Mitchell 98; Dasgupta 04, ...]

CoTraining setting: learn f : X → Y, where X = X1 × X2, x is drawn from an unknown distribution, and there exist g1, g2 such that g1(x1) = g2(x2) = f(x)

(figure: final accuracy as a function of # of labeled examples, # of unlabeled examples, number of redundant inputs, and conditional dependence among inputs)

→ want inputs less dependent, increased number of redundant inputs, ...

Theoretical Predictions of CoTraining
- Possible to learn from unlabeled examples
- Value of unlabeled data depends on:
  - How (conditionally) independent X1 and X2 are: the more independent, the better
  - How many redundant sensory inputs Xi there are: expected error decreases exponentially with this number
- Disagreement on unlabeled data predicts true error

Do these predictions hold for human learners?

Co-Training
[joint work with Liu, Perfetti, Zi]
Can it work for humans learning Chinese as a second language?
(diagram: Classifier1 on the written character and Classifier2 on the spoken word each answer "nail")

Examples
- Training fonts and speakers for "nail"
- Testing fonts and speakers for "nail": familiar and unfamiliar
(figure)

Experiment: Cotraining in Human Learning
[with Liu, Perfetti, Zi 2006]
- 44 human subjects learning Chinese as a second language
- Target function to be learned: Chinese word (spoken / written) → English word
  - 16 distinct words, 6 speakers, 6 writers → 16 x 6 x 6 stimulus pairs
- Training conditions:
  1. Labeled pairs: 48 labeled pairs
  2. Labeled pairs plus unlabeled singles: 32 labeled pairs + 192 unlabeled singles + 16 labeled pairs
  3. Labeled pairs plus unlabeled, conditionally independent pairs: 32 labeled pairs + 192 unlabeled pairs + 16 labeled pairs
- Test: 16 test words (single Chinese stimulus), require English label

Results
Does it matter whether X1, X2 are conditionally independent?
(figure: test accuracy, roughly 0.7-1.0, for three training conditions (Labeled; Labeled + unlabeled singles; Labeled + unlabeled pairs) on familiar-speaker and unfamiliar-speaker testing tasks)

Impact of Conditional Independence in Unlabeled Pairs
(figure: test accuracy, roughly 0.2-1.0, for four training conditions (Labeled only; Labeled + unlabeled singles; Labeled + conditionally dependent unlabeled pairs; Labeled + conditionally independent unlabeled pairs) on familiar-font, unfamiliar-font, familiar-speaker, and unfamiliar-speaker testing tasks)


Infant Learning and Intersensory Redundancy
- Infants: 3-month-olds attend to amodal properties (tempo of a hammer) when given multisensory inputs, but not when given single-modality input [Bahrick et al., 2002]
- Animals: quail embryos learned an individual maternal call 4x faster when given multisensory data (synchronizing light with the rate and rhythm of the sound) [Lickliter et al., 2002]

Intersensory Redundancy and Infant Development
[Bahrick & Lickliter, Dev. Psy, 2000]
- Intersensory redundancy: "spatially coordinated and temporally synchronous presentation of the same information across two or more senses"
- Example: sight & sound of a ball bouncing; amodal property: tempo

Intersensory Redundancy Hypothesis [Bahrick & Lickliter]:
1. Learning of amodal properties is facilitated by multimodal stimulation
2. Learning of modality-specific properties is facilitated by unimodal stimulation
3. These effects are most pronounced in early development

Co-Training: Where else might this work?
- Learning to recognize phonemes/vowels [de Sa, 1994; Coen 2006]
(diagram: Classifier1 on audio answers /va/ 0.4, /ba/ 0.6; Classifier2 on video answers /va/ 0.9, /ba/ 0.1)

[Michael Coen, 2006]
(figure: mutual clustering of vowels using formant data (F1, F2) and lip data, for heed (i), hid (ɪ), head (ε), had (æ), hud (ʌ), hod (α), hawed (ɔ), hood (ʊ), who'd (u), heard (ɜ))

CoTraining Summary
- Unlabeled data improves supervised learning when example features are redundantly sufficient and only weakly (conditionally) correlated
- Theoretical results:
  - If X1, X2 are conditionally independent given Y, the target is PAC learnable from a weak initial classifier plus unlabeled data, and disagreement between g1(x1) and g2(x2) bounds the final classifier error
  - Disagreement between classifiers over unlabeled examples predicts true classification error
- Aligns with developmental psychology claims about the importance of multi-sensory input
- Unlabeled conditionally independent pairs improve second-language learning in humans
  - But dependent pairs are also helpful!

Human and Machine Learning
Additional overlaps:
- Learning representations for perception
  - Dimensionality reduction methods, low-level percepts
  - Lewicki et al.: optimal sparse codes of natural scenes yield the Gabor filters found in primate visual cortex
- Learning using prior knowledge
  - Explanation-based learning, graphical models, teaching concepts & skills, chunking
  - VanLehn et al.: explanation-based learning accounts for some human learning behaviors
- Learning multiple related outputs
  - MultiTask learning, teaching multiple operations on the same input
  - Caruana: patient mortality predictions improve if the same predictor must also learn to predict ICU status, WBC, etc.

Some questions and conjectures

One learning mechanism or many?
- Humans:
  - Implicit and explicit learning (unaware / aware)
  - Radically different time constants in synaptic changes (minutes) versus long-term memory consolidation (days)
- Machines:
  - Inductive, data-intensive algorithms
  - Analytical compilation, knowledge + data

Conjecture: In humans, two very different learning processes; implicit is largely inductive, explicit involves self-explanation.
Predicts: if an implicit learning task can be made explicit, it will be learnable from less data.

Can Hebbian Learning Explain It All?
- Humans:
  - It is the only synapse-level learning mechanism currently known
  - It is also known that new neurons grow, travel, and die

Conjecture: Yes, much of human learning will be explainable by Hebbian learning, just as much of computer operation can be explained by modeling transistors, even with two different learning mechanisms. But much will need to be understood at an architectural level. E.g., what architectures could implement goal-supervised learning in terms of Hebbian mechanisms?

What Is Learned, What Must Be Innate?
We don't know. However, we do know:
- Low-level perceptual features can emerge from unsupervised exposure to perceptual stimuli [e.g., M. Lewicki]
  - Natural visual scenes → Gabor filters similar to those in visual cortex
  - Natural sounds → basis functions similar to those in auditory cortex
- Semantic object hierarchies can emerge from observed ground-level facts
  - Neural network model [McClelland et al.]
- ML models can help determine what representations can emerge from raw data.
