Machine Learning For Hackers And Why It Matters For Free Software - Duboue

Transcription

Machine learning for hackers and why it matters for Free Software
Pablo Ariel Duboue
Les Laboratoires Foulab, Montreal, Quebec
Observe, Hack, Make 2013
DrDub, ML4FLOSS@OHM

Outline
1. Lightning Talk
2. Practical Intro to ML
   - Machine Learning
   - Concepts
3. Case Studies
   - Naive Bayes
   - Logistic Regression
   - Maximum Entropy
   - Neural Networks
4. Wrapping Up
   - Free Software
   - Conclusions

Three Potential Attendees
- Activists (lightning talk)
- Hackers (makers, developers, tinkerers)
- Machine learning practitioners

Dream Outcomes
- Creation of a SourceForge-like site but for FLOSS data
  - See http://commoncrawl.org (100 TB of Web data!)
- Creation of a crossover of TikiWiki with Wikipedia but for trained programs
- A hacker approaches ML as a practitioner rather than from a theoretical perspective
- An activist helps create a GPL/CC crossover license that protects community-driven data efforts

ML & FLOSS
Issues:
- Are ML models (discussed next) the preferred form for modification?
- Practical aspects are several orders of magnitude more complicated
Threats:
- Obsolescence (source code is less valuable)
- Yet-another-GPL-circumvention trick
Opportunities:
- Contributing code is difficult; contributing and curating data is easier
- Turning users into contributors

What is Machine Learning
Statistical modelling with a focus on predictive applications.
Common case (supervised learning):
- Training / estimation / "compilation"?
  - input: vectors of features, including the target feature (data)
  - output: trained model
- Execution / prediction / "interpretation"?
  - input: vector of features (without the target feature) plus the trained model
  - output: predicted target feature

Example: Stanford Syntactic Parser
- Java, GPL licensed
- Mature code, surprisingly well-written
- Probabilistic Context Free Grammar (2 MB trained model)
- Source: Penn Treebank, 640 MB (compressed)

    (S (NP (DT An) (VBG operating) (NN system))
       (VP (VBZ is)
           (NP (NP (DT the) (NN set))
               (PP (IN of)
                   (NP (JJ basic) (NNS programs) (CC and) (NNS utilities)))
               (SBAR (WHNP (WDT that))
                     (S (VP (VBP make)
                            (NP (PRP your) (NN computer) (NN run))))))))

What is a Model?
- Depends on the machine learning methodology employed.
- Some models are easy to understand and modify by hand.
- Example from a biology text classification system (to be discussed later):
  - if the word DATA is NOT present before the term and the word DEVELOPMENT is present after the term: class gene [91.7%]
  - if the word FRAGMENT is NOT present before the term and the word ALLELE is present before the term and the word THAT is NOT present after the term: class gene [93.9%]
  - if the word ENCODES is NOT present before the term and the word ENCODES is present after the term: class gene [96.5%]

Incomprehensible Models
Most models being used nowadays are not intended to be understood as such nor modified by hand:
- Neural networks
- Support Vector Machines
- Markov Models
- Conditional Random Fields

Threats to Freedom
- The main threat is obsolescence
  - What are we going to do if the type of applications users grow to expect and enjoy on proprietary platforms relies on large training sets?
  - Not unlike the threat posed by cloud services being addressed by the FreedomBox Foundation.
- Applications such as
  - OCR (book scanning)
  - Speech Recognition (dictation)
  - Computer Vision (automatically tag your friends on photos)
  - Question Answering (Siri / Watson)

Diminishing Value Behind Source Code
- Value on the data
  - Facebook
  - LinkedIn
  - Google
  - Flickr
- Data vendors
  - http://www.infochimps.com/marketplace (general data, including Twitter data)
  - http://www.ldc.upenn.edu (linguistic data)

Yet-another-clever-GPL-circumvention trick?
- Vendor releases the source code but keeps the data behind the trained model closed.
- Not unlike firmware binary blobs?
  - To me, firmware binary blobs are a much better analogy to machine learning models than video game assets.

Threats to Practicality
- Training machine learning models takes a whole different type of build machine
  - 64 GB of RAM for 3 days, sure!
  - Why? Oh my, why?
- Distributing training data involves orders of magnitude more space and bandwidth
  - Comparable to Wikimedia mirroring (or more)

Opportunities for Specific Projects
- The main challenge for Free Software, IMO, is to change users into contributors
- Contributors volunteering new training data can follow the success case of translators
- Data contributors can
  - Annotate more data to fix a bug ("data patches")
  - Curate existing data (think Wikipedia)

Opportunities for Multi-project Collaboration
- Inter-project collaboration opportunities.
- Sharing data is easier than sharing code, as its format seldom changes.
  - Think object-orientation.
  - All syntactic parsers in the last 15 years of work in the field have used the same Penn Treebank data set.
- Sharing annotation work is easier than sharing data patches.
  - Think work on l10n

Sky is the limit
- Machine learning can enable the creation of computer programs without programming
  - Teaching the computer what to do
- While we are still far from that, functions of the type $f(x_1, \ldots, x_n) \to y$ are currently possible

About the Speaker
I am passionate about improving society through language technology and split my time between teaching, doing research and contributing to free software projects.
- Columbia University
  - Natural Language Generation
  - Thesis: "Indirect Supervised Learning of Strategic Generation Logic", defended Jan. 2005
- IBM Research Watson
  - Question Answering
  - DeepQA: Watson system (Jeopardy!)
- Independent Researcher, living in Montreal
  - Collaboration with Université de Montréal
  - Free Software projects and consulting for startups

What is Machine Learning?
- A new way of programming
  - Magic!
- Leaving part of the behavior of your program to be specified by calculating unknown numbers from "data"
- Two phases of execution: training and application

The ultimate TDD
- If you're using a library, you do almost no coding, just testing!
- But every time you test, your data becomes more and more obsolete
  - No peeking!
- Have met people who didn't have any tests
  - They considered bugs in the code to be the same as model issues
- My experience has been quite the opposite: the code you write implementing machine learning algorithms has to be double and triple checked

Taxonomy of Machine Learning Approaches
- Supervised learning
  - Monkey see, monkey do
  - Classification
- Unsupervised learning
  - Do I look fat?
  - Clustering
- Others
  - Reinforcement learning: learning from past successes and mistakes (good for game AIs and politicians)
  - Active learning: asking what you don't know (needs less data)
  - Semi-supervised: annotated + raw data

Major Libraries
- Scikit-learn (Python)
- R packages (R)
- Weka (Java)
- Mallet (CRF, Java)
- OpenNLP MaxEnt (Java)
- Apache Mahout (Java)

Concepts
- Trying to learn a function $f(x_1, \ldots, x_n) \to y$
  - the $x_i$ are the input features
  - $y$ is the target class
- The key here is extrapolation, that is, we want our learned function to generalize to unseen inputs.
  - Linear interpolation is in itself a type of supervised learning.

Data
- Collecting the data
  - Data collection hooks
- Annotating data
  - Annotation guidelines
  - Cross and self agreement
- Representing the data (as features, more on this later)
- Understanding how well the system operates over the data
  - Testing on unseen data
- A DB is a rather poor ML algorithm
  - Make sure your system is not just memorizing the data
  - "Freedom" of the model

Evaluating
- Held-out data
  - Make sure the held-out data is representative of the problem and of the overall population of instances you want to apply the classifier to
- Repeated experiments
  - Every time you run something on eval data, it changes you!
- Cross-validation
  - Training and testing on the same data, but not quite
  - data = {A, B, C}
    - train on A, B; test on C
    - train on A, C; test on B
    - train on B, C; test on A
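
The {A, B, C} rotation on this slide can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name `CrossValidationSketch` and both helper methods are hypothetical, instances are identified by their index, and folds are assigned round-robin rather than shuffled (real data would normally be shuffled first).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the talk): splits instance indices into k folds.
final class CrossValidationSketch {

    /** Returns k folds; fold f holds the indices i with i % k == f. */
    static List<List<Integer>> folds(int numInstances, int k) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < numInstances; i++) {
            result.get(i % k).add(i);
        }
        return result;
    }

    /** Training indices for one round: every fold except the held-out one. */
    static List<Integer> trainingIndices(List<List<Integer>> folds, int heldOut) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != heldOut) {
                train.addAll(folds.get(f));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        // With k = 3 the folds play the role of A, B and C on the slide.
        List<List<Integer>> folds = folds(6, 3);
        for (int heldOut = 0; heldOut < folds.size(); heldOut++) {
            System.out.println("train on " + trainingIndices(folds, heldOut)
                    + ", test on " + folds.get(heldOut));
        }
    }
}
```

Each instance is tested exactly once and never by a model that saw it during training, which is the "same data but not quite" idea.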

Metrics
- Measuring how many times a classifier outputs the right answer (accuracy) is not enough
  - Many interesting problems are very biased towards a background class
  - If 95% of the time something doesn't happen, saying it'll never happen (not a very useful classifier!) will make you only 5% wrong
- Precision, recall and F-measure, where tp, fp and fn are the counts of true positives, false positives and false negatives:
  $\text{precision} = \frac{|\text{correctly tagged}|}{|\text{tagged}|} = \frac{tp}{tp + fp}$
  $\text{recall} = \frac{|\text{correctly tagged}|}{|\text{should be tagged}|} = \frac{tp}{tp + fn}$
  $F = 2 \cdot \frac{P \cdot R}{P + R}$
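
A small sketch of these metrics, applied to the "it never happens" classifier from the slide. The class name `MetricsSketch` and the convention of returning 0 when a denominator is zero are assumptions made for the example, and the counts in `main` are invented.

```java
// Hypothetical helper (not from the talk): metrics from confusion-matrix counts.
final class MetricsSketch {

    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    /** tp / (tp + fp); defined here as 0 when nothing was tagged. */
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** tp / (tp + fn); defined here as 0 when nothing should have been tagged. */
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** F = 2 * P * R / (P + R). */
    static double f1(double p, double r) {
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Invented counts: 1000 instances, 5% positive, classifier never tags anything.
        int tp = 0, fp = 0, fn = 50, tn = 950;
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.println("accuracy  = " + accuracy(tp, tn, fp, fn));
        System.out.println("precision = " + p);
        System.out.println("recall    = " + r);
        System.out.println("F         = " + f1(p, r));
    }
}
```

The accuracy looks excellent while recall and F are zero, which is why accuracy alone is not enough on skewed problems.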

Naive Bayes
- Count and multiply
- How spam filters work
- Very easy to implement
- Works relatively well, but it can seldom solve the problem completely
  - If you add the target class as a feature, it will still have a high error rate
  - It never trusts anything too much

Why Naive?
- Bayes Rule
  $p(C \mid F_1, \ldots, F_n) = \frac{p(C)\, p(F_1, \ldots, F_n \mid C)}{p(F_1, \ldots, F_n)}$
  that is, posterior = prior × likelihood / evidence
- Naive Part
  - Independence assumption of the $F_x$, that is
    $p(F_i \mid C, F_j) = p(F_i \mid C)$
    $p(C \mid F_1, \ldots, F_n) \propto p(C)\, p(F_1 \mid C) \cdots p(F_n \mid C)$
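
"Count and multiply" can be sketched as a tiny word-based classifier. Everything here is illustrative: the class `NaiveBayesSketch` is hypothetical, the training sentences are invented toy data, and add-one (Laplace) smoothing is an assumption added so that unseen words do not zero out the product. The multiplication is done as a sum of logarithms to avoid numeric underflow.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy classifier (not the code from the talk).
final class NaiveBayesSketch {
    private final Map<String, Integer> classCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int numDocs = 0;

    /** "Count": tally the class and the words seen together with it. */
    void train(String[] words, String label) {
        numDocs++;
        classCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    /** "Multiply": log p(C) + sum_i log p(F_i | C), for a label seen in training. */
    double logScore(String[] words, String label) {
        double score = Math.log((double) classCounts.get(label) / numDocs);
        Map<String, Integer> counts = wordCounts.get(label);
        int total = totalWords.getOrDefault(label, 0);
        for (String w : words) {
            int c = counts.getOrDefault(w, 0);
            // Add-one smoothing (an assumption of this sketch).
            score += Math.log((c + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Picks the class with the highest (unnormalized) posterior. */
    String predict(String[] words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : classCounts.keySet()) {
            double s = logScore(words, label);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train(new String[] {"encodes", "sequence"}, "gene");
        nb.train(new String[] {"binds", "mrna"}, "protein");
        nb.train(new String[] {"allele", "encodes"}, "gene");
        System.out.println(nb.predict(new String[] {"encodes"}));
        System.out.println(nb.predict(new String[] {"binds"}));
    }
}
```

The evidence term $p(F_1, \ldots, F_n)$ is the same for every class, so it can be dropped when only the best class is needed.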

Decision Trees
- Find the partition of the data with the highest information gain
  - Value of a piece of gossip
  $IG(\text{splitting } S \text{ at } A \text{ into } T) = H(S) - \sum_{t \in T} p(t)\, H(t)$
- Easy to understand
  - Both algorithm and trained models
- Can overfit badly
- Underperforming
  - Coming back with random forests
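
The information gain formula can be checked with a short sketch. It assumes $H$ is Shannon entropy in bits and that $p(t)$ is the fraction of instances falling in part $t$ of the split; the class name `InformationGainSketch` and the counts in `main` are invented for the example.

```java
// Hypothetical helper (not from the talk): entropy and information gain from class counts.
final class InformationGainSketch {

    /** Shannon entropy, in bits, of a class distribution given as counts. */
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        double h = 0.0;
        for (int c : classCounts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** IG = H(S) - sum_t p(t) H(t); each row of subsets holds the class counts of one part t. */
    static double informationGain(int[] parent, int[][] subsets) {
        int total = 0;
        for (int c : parent) {
            total += c;
        }
        double remainder = 0.0;
        for (int[] t : subsets) {
            int size = 0;
            for (int c : t) {
                size += c;
            }
            remainder += (double) size / total * entropy(t);
        }
        return entropy(parent) - remainder;
    }

    public static void main(String[] args) {
        int[] parent = {5, 5}; // 5 instances of each class
        System.out.println(informationGain(parent, new int[][] {{5, 0}, {0, 5}}));
        System.out.println(informationGain(parent, new int[][] {{3, 2}, {2, 3}}));
    }
}
```

A split that separates the classes perfectly recovers all of $H(S)$, while a split that barely changes the class mix gains almost nothing; a tree learner greedily picks the former.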

Biology: Problem
- "Disambiguating proteins, genes, and RNA in text: a machine learning approach", Hatzivassiloglou, Duboue, Rzhetsky (2001)
- The same term refers to genes, proteins and mRNA:
  "By UV cross-linking and immunoprecipitation, we show that SBP2 specifically binds selenoprotein mRNAs both in vitro and in vivo."
  "The SBP2 clone used in this study generates a 3173 nt transcript (2541 nt of coding sequence plus a 632 nt 3' UTR truncated at the polyadenylation site)."
- This ambiguity is so pervasive that in many cases the author of the text inserts the word "gene", "protein" or "mRNA" to disambiguate it
  - That happens in only 2.65% of the cases though

Biology: Features
- Take a context around the term, use the occurrence of words before or after the term as features.
- Keep a tally of the number of times each word has appeared with which target class

Biology: Methods
Instead of multiplying, operate on logs:

    // ...
    // Start from the log-priors, then add the log-frequencies of each context word.
    float[] predict = priors.clone();
    for (String word : context) {
        // ...
        if (wordfreqs.containsKey(word)) {
            float[] logfreqs = wordfreqs.get(word);
            for (int i = 0; i < predict.length; i++) {
                predict[i] += logfreqs[i];
            }
        }
    }

Biology: Results
- Used a number of variations on the features
  - Removed capitalization, stemming, filtered part-of-speech, added positional information
- Changed the problem from three-way to two-way classification
- Results of tree learning and Naive Bayes were comparable (76% two-way and 67% three-way).
- Distilled some interesting rules from the decision trees:
  - ENCODES is present after the term and ENCODES is NOT present before the term: class gene [96.5%]

Logistic Regression
- Won't explain in detail
  - It is similar to linear regression but in log space
  - [Figure from Wikipedia]
- Can take lots of features and lots of data
- High performance
- Output is a goodness of fit

Weka
- ARFF format
  - Text file, with two sections:

        @relation <training name>
        @attribute <attribute name> numeric     (× number of features)
        @data
        7.0,1.1,...                             (× number of training instances)

- Training classifiers

        java -cp weka.jar weka.classifiers.functions.Logistic -t train.arff

  - Or programmatically:
    - Create an Instances object with certain attributes and create objects of type Instance to add to it
    - Create an empty classifier and train it on the Instances
- Using the trained classifiers
  - classifyInstance(Instance) or distributionForInstance(Instance)
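
To make the two-section layout concrete, here is a sketch that writes an ARFF text using only the Java standard library (it does not call Weka). The class `ArffSketch`, the feature names and the rows are hypothetical. One assumption goes beyond the slide: the last attribute is declared as a nominal class with the `{value1,value2}` syntax, since a classifier needs a class attribute to predict.

```java
// Hypothetical helper (not part of Weka): renders a data set as ARFF text.
final class ArffSketch {

    static String toArff(String relation, String[] featureNames, String[] classValues,
                         double[][] rows, String[] labels) {
        StringBuilder sb = new StringBuilder();
        // Header section: relation name plus one @attribute line per column.
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : featureNames) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        // Assumption of this sketch: the class is a nominal attribute listed last.
        sb.append("@attribute class {").append(String.join(",", classValues)).append("}\n");
        // Data section: one comma-separated line per training instance.
        sb.append("\n@data\n");
        for (int r = 0; r < rows.length; r++) {
            for (double value : rows[r]) {
                sb.append(value).append(',');
            }
            sb.append(labels[r]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toArff("training",
                new String[] {"f1", "f2"},
                new String[] {"gene", "protein"},
                new double[][] {{7.0, 1.1}, {0.5, 3.2}},
                new String[] {"gene", "protein"}));
    }
}
```

The resulting text can be saved as `train.arff` and handed to the command-line invocation on the slide.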

Why Weka?

Jeopardy!: Problem
- Learning to rank
  - Rather than predicting a class, choose the best one among many instances
  - In the Jeopardy! case, the instances were candidate answers
  - Features related to each particular answer candidate ("evidence")
- As logistic regression produces a goodness of fit, it can be used for ranking
  - Other classifiers might just give you 0 or 1, independent of relative goodness
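
A minimal sketch of ranking with a logistic score, assuming the weights have already been learned. The class `LogisticRankingSketch`, the weights and the candidate feature vectors are all invented for illustration; this is not the DeepQA code. The score is the logistic function applied to a weighted sum of the candidate's features, and candidates are sorted by that score.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper (not from the talk): ranks candidates by logistic score.
final class LogisticRankingSketch {

    /** Logistic function applied to bias + sum_i weights[i] * features[i]. */
    static double score(double[] weights, double bias, double[] features) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Returns candidate indices ordered from highest to lowest score. */
    static Integer[] rank(double[] weights, double bias, double[][] candidates) {
        Integer[] order = new Integer[candidates.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order,
                Comparator.comparingDouble((Integer i) -> -score(weights, bias, candidates[i])));
        return order;
    }

    public static void main(String[] args) {
        double[] weights = {2.0, -1.0}; // made-up weights, as if already trained
        double[][] candidates = {{0.1, 0.9}, {0.9, 0.1}, {0.5, 0.5}};
        System.out.println(Arrays.toString(rank(weights, 0.0, candidates)));
    }
}
```

A classifier that only returned 0 or 1 would tie most candidates; the continuous score is what makes the ordering possible.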

Jeopardy!: Deployment
[Figure: DeepQA Architecture, from Ferrucci (2012)]

Jeopardy!: Feature Engineering
[Figure: First four phases of merging and ranking, from Gondek, Lally, Kalyanpur, Murdock, Duboue, Zhang, Pan, Qiu, Welty (2012)]

Maximum Entropy
- Tons and tons of (binary) features
- Very popular at the beginning of the 2000s
  - CRF has taken some of its glamour
- Mature code
- OpenNLP MaxEnt uses strings to represent its input data

      previous=succeeds current=Terrence next=D. currentWordIsCapitalized

- Training with trainModel(dataIndexer, iterations) and using it with double[] eval(String[] context)

KeaText: French POS Tagger
- An existing part-of-speech tagger for the French language was a mixture of Python and Perl
  - Instead of re-engineering it, ran it on many French documents
  - Trained a new MaxEnt model on its output
  - Took less than 2 days of work and produced a Java POS tagger with performance within about 5% of the original
- More complicated than WSD, as it involves more classes and every word has to be tagged
  - With MaxEnt, everything thinkable can be used as a feature
- Features include
  - The word itself, previous words, with their identified tags
  - Suffixes and prefixes, up to the first and last 4 characters of the word
  - Whether the word has special characters or if it is all numbers or uppercase
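
The feature list above can be sketched as a function that turns one token into an array of strings, the kind of input the previous slide describes for `eval(String[] context)`. This is a hedged illustration, not the KeaText code: the class `PosFeatureSketch`, the feature names, the `<s>` start marker and the example tags are all made up, and words are assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical feature extractor (not the KeaText implementation).
final class PosFeatureSketch {

    /** Builds string features for words[i]; previousTags holds the tags assigned so far. */
    static String[] context(String[] words, String[] previousTags, int i) {
        List<String> features = new ArrayList<>();
        String w = words[i];
        // The word itself, plus the previous word and its tag.
        features.add("current=" + w);
        features.add("previous=" + (i > 0 ? words[i - 1] : "<s>"));
        features.add("previousTag=" + (i > 0 ? previousTags[i - 1] : "<s>"));
        // Prefixes and suffixes of up to 4 characters.
        for (int n = 1; n <= Math.min(4, w.length()); n++) {
            features.add("prefix=" + w.substring(0, n));
            features.add("suffix=" + w.substring(w.length() - n));
        }
        // Surface-shape features.
        if (Character.isUpperCase(w.charAt(0))) {
            features.add("currentWordIsCapitalized");
        }
        if (w.chars().allMatch(Character::isDigit)) {
            features.add("allDigits");
        }
        if (w.equals(w.toUpperCase()) && !w.equals(w.toLowerCase())) {
            features.add("allUppercase");
        }
        if (!w.chars().allMatch(Character::isLetterOrDigit)) {
            features.add("hasSpecialChars");
        }
        return features.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] words = {"Le", "chat", "dort"};
        String[] tagsSoFar = {"DET", "NC"}; // illustrative tag names
        System.out.println(String.join(" ", context(words, tagsSoFar, 1)));
    }
}
```

Because every feature is just a string, adding a new kind of evidence means adding one more `features.add(...)` line, which is what makes "everything thinkable" cheap to try.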

Neural Networks
- The original ML
- Second-best algorithm
- Slow
- Most people are familiar with it
- AI winter
- Making a comeback with Deep Learning

How to Train ANNs
[Figure: network with Input, Hidden and Output layers, from Wikipedia]
- Execution: feed-forward
  $y_q = K\left(\sum_i x_i\, w_{iq}\right)$
- Training: backpropagation of errors
- Problem: overfitting; use a separate set as the termination criterion
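
The feed-forward step can be sketched directly from the formula. The sketch assumes that $K$ is the activation function and picks the logistic sigmoid for it; it also assumes a single hidden layer and no bias terms, as in the formula. The class `FeedForwardSketch` and the weights in `main` are invented, and training by backpropagation is not shown.

```java
// Hypothetical helper (not from the talk): forward pass of a small network.
final class FeedForwardSketch {

    /** Assumed activation function K: the logistic sigmoid. */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One layer: y[q] = K(sum_i x[i] * w[i][q]). */
    static double[] layer(double[] x, double[][] w) {
        int outputs = w[0].length;
        double[] y = new double[outputs];
        for (int q = 0; q < outputs; q++) {
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) {
                sum += x[i] * w[i][q];
            }
            y[q] = sigmoid(sum);
        }
        return y;
    }

    /** Input to hidden to output, reusing the same layer computation. */
    static double[] feedForward(double[] input, double[][] hiddenWeights,
                                double[][] outputWeights) {
        return layer(layer(input, hiddenWeights), outputWeights);
    }

    public static void main(String[] args) {
        double[] input = {1.0, 0.0};
        double[][] hiddenWeights = {{0.5, -0.5, 1.0}, {0.25, 0.75, -1.0}}; // 2 inputs, 3 hidden
        double[][] outputWeights = {{1.0}, {-1.0}, {0.5}};                 // 3 hidden, 1 output
        System.out.println(feedForward(input, hiddenWeights, outputWeights)[0]);
    }
}
```

Backpropagation would adjust `hiddenWeights` and `outputWeights` to reduce the output error, stopping when the error on a separate held-out set stops improving.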

K4B: Problem
- Given the bytecodes of a Java method, come up with some terms to describe it
- Use all the Java code in the Debian archive as training data
  - Pairs of bytecodes / javadoc
- Applications in reverse engineering
  - Java malware
- More information:
  - Training data: http://keywords4bytecodes.org
  - Source code: (URL lost in transcription)

K4B: Data
- Final corpus:
  - 1M methods
  - 35M words
  - 24M JVM instructions
- Example training instance:

      Class: net.sf.antcontrib.property.Variable
      Method: public void execute() throws org.apache.tools.ant.BuildException
      JavaDoc: Execute this task.
      Bytecodes: (126 in total)
         0 aload_0
         1 getfield net.sf.antcontrib.property.Variable.remove
         4 ifeq 45
         7 aload_0
         8 getfield net.sf.antcontrib.property.Variable.name
        11 ifnull 26
        14 aload_0
        15 getfield net.sf.antcontrib.property.Variable.name
        18 ldc ""
        20 invokevirtual java.lang.String.equals(java.lang.Object)
        23 ifeq 36
        26 new org.apache.tools.ant.BuildException
        29 dup

Theano
- My current efforts on K4B are centered around Theano, a Deep Learning Python library written at Université de Montréal
- Deep Learning focuses on using neural networks with many layers (deep networks)
  - some layers are trained from the inputs directly
  - they synthesize complex features without access to the output classes
- Key for hackers: Theano creates the neural networks as symbolic structures that can then be compiled to run on GPUs

How to Come Up with Features
1. Throw everything (and the kitchen sink) at it
2. Stop and think
   1. What information would you use to solve that problem?
   2. Look for published work
      - Papers: http://aclweb.org/anthology-new/
      - Blog postings
      - Open source projects
3. Add computable features
   - Learning to sum takes an incredible amount of training!

Improving a Classifier
- More data
- Better features
- Solve a different problem
- Shop around for a different classifier / parametrization
- Procedural overfitting
- Add unlabelled data
- Drop ML and program it by hand

The Bad News
- Difficult to maintain
  - The link between data and trained model is easy to lose
- You'll be dealing with errors (defects) and very few ways to solve them
  - Adding more data, if it helps, will produce lots of regressions (asymptotic behavior)
  - Not all errors are the same, but they look like that in the reported metrics
- Your compile time has begun to be measured in hours (or days)
  - Time to upgrade your cluster.
- Be prepared to stare into the void every time you are asked about odd system behavior

debian-legal circa 2009
Original: 05/msg00028.html
Mathieu Blondel asked two questions:
- Can Debian ship models in main without distributing the original data?
  - Yes, because the model is considered the preferred form for modification.
  - The reasoning followed a pre-existing decision on 2D rendered images for games (rendered from an underlying 3D model).
- Can violations of data licensing be detected?
  - Artificially introduced errors for fingerprinting

Some Quotes
- "Free data is important for the very same reason that free programs are!"
  Mark Weyer (Wed, 27 May 2009 11:36:55 +0200) [...]de
- "[then do not ship] pictures that are initially photographs of an object (the preferred form of modification is the original object; if you want to see it at another angle, you need to take another photograph)"
  Josselin Mouette (Wed, 27 May 2009 10:33:52 +0200) 1243413232.14420.49.camel@shizuru

Training Data vs. Features
Feature vectors are not unlike generated YACC (or Bison) C files.
Examples:
- Speech
  - Training data: transcribed speech
  - Feature data: wave segments with associated transcription
- Spelling correction
  - Training data: Wikipedia history
  - Feature data: edits that modify a word, with fewer than 3 characters of total edit
- Syntactic Parsing
  - Training data: newspaper articles bracketed and annotated with syntactic categories
  - Feature data: trees of height one, with the most important word of each (lexical head)

Some Crazy Ideas
- Ever heard of Tiki Wiki? http://tiki.org
  - A project run like a wiki
  - Everybody is granted commit access and the code is very accessible (PHP)
- Imagine an ML equivalent
  - Users can edit the data and download newly trained models right away
  - The models get combined into an end-to-end software system that solves a given issue
- Otherwise, acquiring data for general use
  - Maybe build a Free Software, volunteer-driven, Mechanical Turk-like tool?

Thoughtland
- My current project, 100% Free Software
- Visualizing n-dimensional error surfaces
  - Input: training data + machine learning algorithm
  - Output: a paragraph of text describing what the error surface looks like in n dimensions
- Machine Learning with Weka (cross-validated error cloud)
- Clustering with Apache Mahout (using model-based clustering)
- Text Generation (using OpenSchema and SimpleNLG)
- http://thoughtland.duboue.net
- Scala
- Open source: (URL lost in transcription)

Summary
- Don't be afraid of getting your hands dirty
  - Try to incorporate some trained models into your existing/new projects
  - But don't forget about testing
  - And keeping track of the input data
  - And don't train on the users' computers
- Pick a library, any library, and give it a try with existing data sets:
  - UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/
  - TunedIT: http://tunedit.org/

Contacting Pablo
- Visit the Foulab Village, the unofficial Canadian Consulate!
- Email: pablo.duboue@gmail.com
- Website: http://duboue.net
- Twitter: @pabloduboue
- IRC: DrDub (FreeNode, ##foulab)
- GitHub: https://github.com/DrDub
- LinkedIn: (URL lost in transcription)
