Building Machine Learning Systems with Python

Transcription

Building Machine Learning Systems with Python

Master the art of machine learning with Python and build effective machine learning systems with this intensive hands-on guide

Willi Richert
Luis Pedro Coelho

BIRMINGHAM - MUMBAI

Building Machine Learning Systems with Python

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2013
Production Reference: 1200713

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78216-140-0

www.packtpub.com

Cover Image by Asher Wishkerman (a.wishkerman@mpic.de)

Credits

Authors
Willi Richert
Luis Pedro Coelho

Reviewers
Matthieu Brucher
Mike Driscoll
Maurice HT Ling

Acquisition Editor
Kartikey Pandey

Lead Technical Editor
Mayur Hule

Technical Editors
Sharvari H. Baet
Ruchita Bhansali
Athira Laji
Zafeer Rais

Copy Editors
Insiya Morbiwala
Aditya Nair
Alfida Paiva
Laxmi Subramanian

Project Coordinator
Anurag Banerjee

Proofreader
Paul Hindle

Indexer
Tejal R. Soni

Graphics
Abhinash Sahu

Production Coordinator
Aditi Gajjar

Cover Work
Aditi Gajjar

About the Authors

Willi Richert has a PhD in Machine Learning and Robotics, and he currently works for Microsoft in the Core Relevance Team of Bing, where he is involved in a variety of machine learning areas such as active learning and statistical machine translation.

This book would not have been possible without the support of my wife Natalie and my sons Linus and Moritz. I am also especially grateful for the many fruitful discussions with my current and previous managers, Andreas Bode, Clemens Marschner, Hongyan Zhou, and Eric Crestan, as well as my colleagues and friends, Tomasz Marciniak, Cristian Eigel, Oliver Niehoerster, and Philipp Adelt. The interesting ideas are most likely from them; the bugs belong to me.

Luis Pedro Coelho is a computational biologist: someone who uses computers as a tool to understand biological systems. Within this large field, Luis works in bioimage informatics, which is the application of machine learning techniques to the analysis of images of biological specimens. His main focus is on the processing of large-scale image data. With robotic microscopes, it is possible to acquire hundreds of thousands of images in a day, and visual inspection of all the images becomes impossible.

Luis has a PhD from Carnegie Mellon University, which is one of the leading universities in the world in the area of machine learning. He is also the author of several scientific publications.

Luis started developing open source software in 1998 as a way to apply to real code what he was learning in his computer science courses at the Technical University of Lisbon. In 2004, he started developing in Python and has contributed to several open source libraries in this language. He is the lead developer of mahotas, the popular computer vision package for Python, and has contributed to several machine learning packages.

I thank my wife Rita for all her love and support, and I thank my daughter Anna for being the best thing ever.

About the Reviewers

Matthieu Brucher holds an engineering degree from the Ecole Superieure d'Electricite (Information, Signals, Measures), France, and has a PhD in Unsupervised Manifold Learning from the Universite de Strasbourg, France. He currently holds an HPC Software Developer position at an oil company and works on next-generation reservoir simulation.

Mike Driscoll has been programming in Python since Spring 2006. He enjoys writing about Python on his blog at http://www.blog.pythonlibrary.org/. Mike also occasionally writes for the Python Software Foundation, i-Programmer, and Developer Zone. He enjoys photography and reading a good book. Mike has also been a technical reviewer for the following Packt Publishing books: Python 3 Object Oriented Programming, Python 2.6 Graphics Cookbook, and Python Web Development Beginner's Guide.

I would like to thank my wife, Evangeline, for always supporting me. I would also like to thank my friends and family for all that they do to help me. And I would like to thank Jesus Christ for saving me.

Maurice HT Ling completed his PhD in Bioinformatics and BSc (Hons) in Molecular and Cell Biology at the University of Melbourne. He is currently a research fellow at Nanyang Technological University, Singapore, and an honorary fellow at the University of Melbourne, Australia. He co-edits The Python Papers and co-founded the Python User Group (Singapore), where he has served as vice president since 2010. His research interests lie in life (biological life, artificial life, and artificial intelligence), using computer science and statistics as tools to understand life and its numerous aspects. You can find his website at http://maurice.vodien.com.

www.PacktPub.com

Support files, eBooks, discount offers, and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read, and search across Packt's entire library of books.

Why subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface

Chapter 1: Getting Started with Python Machine Learning
    Machine learning and Python – the dream team
    What the book will teach you (and what it will not)
    What to do when you are stuck
    Getting started
        Introduction to NumPy, SciPy, and Matplotlib
        Installing Python
        Chewing data efficiently with NumPy and intelligently with SciPy
        Learning NumPy
            Indexing
            Handling non-existing values
            Comparing runtime behaviors
        Learning SciPy
    Our first (tiny) machine learning application
        Reading in the data
        Preprocessing and cleaning the data
        Choosing the right model and learning algorithm
            Before building our first model
            Starting with a simple straight line
            Towards some advanced stuff
            Stepping back to go forward – another look at our data
            Training and testing
            Answering our initial question
    Summary

Chapter 2: Learning How to Classify with Real-world Examples
    The Iris dataset
        The first step is visualization
        Building our first classification model
        Evaluation – holding out data and cross-validation
    Building more complex classifiers
    A more complex dataset and a more complex classifier
        Learning about the Seeds dataset
        Features and feature engineering
        Nearest neighbor classification
    Binary and multiclass classification
    Summary

Chapter 3: Clustering – Finding Related Posts
    Measuring the relatedness of posts
        How not to do it
        How to do it
    Preprocessing – similarity measured as similar number of common words
        Converting raw text into a bag-of-words
            Counting words
            Normalizing the word count vectors
            Removing less important words
            Stemming
                Installing and using NLTK
                Extending the vectorizer with NLTK's stemmer
            Stop words on steroids
        Our achievements and goals
    Clustering
        KMeans
        Getting test data to evaluate our ideas on
    Clustering posts
    Solving our initial challenge
        Another look at noise
    Tweaking the parameters
    Summary

Chapter 4: Topic Modeling
    Latent Dirichlet allocation (LDA)
        Building a topic model
    Comparing similarity in topic space
        Modeling the whole of Wikipedia
    Choosing the number of topics
    Summary

Chapter 5: Classification – Detecting Poor Answers
    Sketching our roadmap
    Learning to classify classy answers
        Tuning the instance
        Tuning the classifier
    Fetching the data
        Slimming the data down to chewable chunks
        Preselection and processing of attributes
        Defining what is a good answer
    Creating our first classifier
        Starting with the k-nearest neighbor (kNN) algorithm
        Engineering the features
        Training the classifier
        Measuring the classifier's performance
        Designing more features
    Deciding how to improve
        Bias-variance and its trade-off
        Fixing high bias
        Fixing high variance
        High bias or low bias
    Using logistic regression
        A bit of math with a small example
        Applying logistic regression to our post classification problem
    Looking behind accuracy – precision and recall
    Slimming the classifier
    Ship it!
    Summary

Chapter 6: Classification II – Sentiment Analysis
    Sketching our roadmap
    Fetching the Twitter data
    Introducing the Naive Bayes classifier
        Getting to know the Bayes theorem
        Being naive
        Using Naive Bayes to classify
        Accounting for unseen words and other oddities
        Accounting for arithmetic underflows
    Creating our first classifier and tuning it
        Solving an easy problem first
        Using all the classes
        Tuning the classifier's parameters
    Cleaning tweets
    Taking the word types into account
        Determining the word types
        Successfully cheating using SentiWordNet
        Our first estimator
        Putting everything together
    Summary

Chapter 7: Regression – Recommendations
    Predicting house prices with regression
    Multidimensional regression
    Cross-validation for regression
    Penalized regression
        L1 and L2 penalties
        Using Lasso or Elastic nets in scikit-learn
    P greater than N scenarios
        An example based on text
        Setting hyperparameters in a smart way
        Rating prediction and recommendations
    Summary

Chapter 8: Regression – Recommendations Improved
    Improved recommendations
        Using the binary matrix of recommendations
        Looking at the movie neighbors
        Combining multiple methods
    Basket analysis
        Obtaining useful predictions
        Analyzing supermarket shopping baskets
        Association rule mining
        More advanced basket analysis
    Summary

Chapter 9: Classification III – Music Genre Classification
    Sketching our roadmap
    Fetching the music data
        Converting into a wave format
    Looking at music
        Decomposing music into sine wave components
    Using FFT to build our first classifier
        Increasing experimentation agility
        Training the classifier
        Using the confusion matrix to measure accuracy in multiclass problems
        An alternate way to measure classifier performance using receiver operator characteristic (ROC)
    Improving classification performance with Mel Frequency Cepstral Coefficients
    Summary

Chapter 10: Computer Vision – Pattern Recognition
    Introducing image processing
        Loading and displaying images
        Basic image processing
            Thresholding
            Gaussian blurring
            Filtering for different effects
        Adding salt and pepper noise
            Putting the center in focus
    Pattern recognition
    Computing features from images
    Writing your own features
    Classifying a harder dataset
    Local feature representations
    Summary

Chapter 11: Dimensionality Reduction
    Sketching our roadmap
    Selecting features
        Detecting redundant features using filters
            Correlation
            Mutual information
        Asking the model about the features using wrappers
        Other feature selection methods
    Feature extraction
        About principal component analysis (PCA)
            Sketching PCA
            Applying PCA
        Limitations of PCA and how LDA can help
    Multidimensional scaling (MDS)
    Summary

Chapter 12: Big(ger) Data
    Learning about big data
    Using jug to break up your pipeline into tasks
        About tasks
        Reusing partial results
        Looking under the hood
        Using jug for data analysis
    Using Amazon Web Services (AWS)
        Creating your first machines
        Installing Python packages on Amazon Linux
        Running jug on our cloud machine
        Automating the generation of clusters with starcluster
    Summary

Appendix: Where to Learn More about Machine Learning
    Online courses
    Books
    Q&A sites
    Blogs
    Data sources
    Getting competitive
    What was left out
    Summary

Index

Preface

You could argue that it is a fortunate coincidence that you are holding this book in your hands (or your e-book reader). After all, there are millions of books printed every year, which are read by millions of readers; and then there is this book read by you. You could also argue that a couple of machine learning algorithms played their role in leading you to this book (or this book to you). And we, the authors, are happy that you want to understand more about the how and why.

Most of this book will cover the how. How should the data be processed so that machine learning algorithms can make the most out of it? How should you choose the right algorithm for a problem at hand?

Occasionally, we will also cover the why. Why is it important to measure correctly? Why does one algorithm outperform another one in a given scenario?

We know that there is much more to learn to be an expert in the field. After all, we only covered some of the "hows" and just a tiny fraction of the "whys". But in the end, we hope that this mixture will help you to get up and running as quickly as possible.

What this book covers

Chapter 1, Getting Started with Python Machine Learning, introduces the basic idea of machine learning with a very simple example. Despite its simplicity, it will challenge us with the risk of overfitting.

Chapter 2, Learning How to Classify with Real-world Examples, explains the use of real data to learn about classification, whereby we train a computer to be able to distinguish between different classes of flowers.

Chapter 3, Clustering – Finding Related Posts, explains how powerful the bag-of-words approach is when we apply it to finding similar posts without really understanding them.

Chapter 4, Topic Modeling, takes us beyond assigning each post to a single cluster and shows us how assigning posts to several topics better matches real text, which can deal with multiple topics.

Chapter 5, Classification – Detecting Poor Answers, explains how to use logistic regression to find whether a user's answer to a question is good or bad. Behind the scenes, we will learn how to use the bias-variance trade-off to debug machine learning models.

Chapter 6, Classification II – Sentiment Analysis, introduces how Naive Bayes works, and how to use it to classify tweets in order to see whether they are positive or negative.

Chapter 7, Regression – Recommendations, discusses a classical topic in handling data that is still relevant today. We will use it to build recommendation systems: systems that can take user input about likes and dislikes and recommend new products.

Chapter 8, Regression – Recommendations Improved, improves our recommendations by using multiple methods at once. We will also see how to build recommendations just from shopping data, without the need for rating data (which users do not always provide).

Chapter 9, Classification III – Music Genre Classification, illustrates how, if someone has scrambled our huge music collection, our only hope of creating order is to let a machine learner classify our songs. It will turn out that it is sometimes better to trust someone else's expertise than to create features ourselves.

Chapter 10, Computer Vision – Pattern Recognition, explains how to apply classification in the specific context of handling images, a field known as pattern recognition.

Chapter 11, Dimensionality Reduction, teaches us what other methods exist that can help us in downsizing data so that it is chewable by our machine learning algorithms.

Chapter 12, Big(ger) Data, explains how data sizes keep getting bigger, and how this often becomes a problem for the analysis. In this chapter, we explore some approaches to deal with larger data by taking advantage of multiple cores or computing clusters. We also have an introduction to using cloud computing (using Amazon Web Services as our cloud provider).

Appendix, Where to Learn More about Machine Learning, covers a list of wonderful resources available for machine learning.

What you need for this book

This book assumes you know Python and how to install a library using easy_install or pip. We do not rely on any advanced mathematics such as calculus or matrix algebra.

To summarize it, we are using the following versions throughout this book, but you should be fine with any more recent one:

- Python: 2.7
- NumPy: 1.6.2
- SciPy: 0.11
- Scikit-learn: 0.13
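If you want to verify your setup, a minimal sketch such as the following prints the installed versions. It assumes only the standard import names (numpy, scipy, and sklearn) and the __version__ attribute that each of these packages exposes:

import sys

import numpy
import scipy
import sklearn

# Print the interpreter version followed by each package version.
print("Python %s" % sys.version.split()[0])
print("NumPy %s" % numpy.__version__)
print("SciPy %s" % scipy.__version__)
print("scikit-learn %s" % sklearn.__version__)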

Who this book is for

This book is for Python programmers who want to learn how to perform machine learning using open source libraries. We will walk through the basic modes of machine learning based on realistic examples.

This book is also for machine learners who want to start using Python to build their systems. Python is a flexible language for rapid prototyping, while the underlying algorithms are all written in optimized C or C++. Therefore, the resulting code is fast and robust enough to be usable in production as well.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "We can include other contexts through the use of the include directive".

A block of code is set as follows:

def nn_movie(movie_likeness, reviews, uid, mid):
    likes = movie_likeness[mid].argsort()
    # reverse the sorting so that the most alike are at the beginning
    likes = likes[::-1]
    # return the rating for the most similar movie available
    for ell in likes:
        if reviews[uid, ell] > 0:
            return reviews[uid, ell]

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

def nn_movie(movie_likeness, reviews, uid, mid):
    likes = movie_likeness[mid].argsort()
    # reverse the sorting so that the most alike are at the beginning
    likes = likes[::-1]
    # return the rating for the most similar movie available
    for ell in likes:
        if reviews[uid, ell] > 0:
            return reviews[uid, ell]

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "clicking on the Next button moves you to the next screen".

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com, and mention the book title via the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
