
Machine Learning with R

Learn how to use R to apply powerful machine learning methods and gain an insight into real-world applications

Brett Lantz

BIRMINGHAM - MUMBAI

Machine Learning with R

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2013

Production Reference: 1211013

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78216-214-8

www.packtpub.com

Cover Image by Abhishek Pandey (abhishek.pandey1210@gmail.com)

Credits

Author
Brett Lantz

Reviewers
Jia Liu
Mzabalazo Z. Ngwenya
Abhinav Upadhyay

Acquisition Editor
James Jones

Lead Technical Editor
Azharuddin Sheikh

Technical Editors
Pooja Arondekar
Pratik More
Anusri Ramchandran
Harshad Vairat

Project Coordinator
Anugya Khurana

Proofreaders
Simran Bhogal
Ameesha Green
Paul Hindle

Indexer
Tejal Soni

Graphics
Ronak Dhruv

Production Coordinator
Nilesh R. Mohite

Cover Work
Nilesh R. Mohite

About the Author

Brett Lantz has spent the past 10 years using innovative data methods to understand human behavior. A sociologist by training, he was first enchanted by machine learning while studying a large database of teenagers' social networking website profiles. Since then, he has worked on interdisciplinary studies of cellular telephone calls, medical billing data, and philanthropic activity, among others. When he's not spending time with family, following college sports, or being entertained by his dachshunds, he maintains dataspelunking.com, a website dedicated to sharing knowledge about the search for insight in data.

This book could not have been written without the support of my family and friends. In particular, my wife Jessica deserves many thanks for her patience and encouragement throughout the past year. My son Will (who was born while Chapter 10 was underway) also deserves special mention for his role in the writing process; without his gracious ability to sleep through the night, I could not have strung together a coherent sentence the next morning. I dedicate this book to him in the hope that one day he is inspired to follow his curiosity wherever it may lead.

I am also indebted to many others who supported this book indirectly. My interactions with educators, peers, and collaborators at the University of Michigan, the University of Notre Dame, and the University of Central Florida seeded many of the ideas I attempted to express in the text. Additionally, without the work of researchers who shared their expertise in publications, lectures, and source code, this book might not exist at all. Finally, I appreciate the efforts of the R team and all those who have contributed to R packages, whose work ultimately brought machine learning to the masses.

About the Reviewers

Jia Liu holds a Master's degree in Statistics from the University of Maryland, Baltimore County, and is presently a PhD candidate in Statistics at Iowa State University. Her research interests include mixed-effects models, Bayesian methods, bootstrap methods, reliability, design of experiments, machine learning, and data mining. She has two years' experience as a student consultant in statistics and two years' internship experience in the agriculture and pharmaceutical industries.

Mzabalazo Z. Ngwenya has worked extensively in the field of statistical consulting and currently works as a biometrician. He holds an MSc in Mathematical Statistics from the University of Cape Town and is at present studying for a PhD (at the School of Information Technology, University of Pretoria) in the field of Computational Intelligence. His research interests include statistical computing, machine learning, and spatial statistics. Previously, he was involved in reviewing Learning RStudio for R Statistical Computing (Van de Loo and de Jong, 2012) and R Statistical Application Development by Example Beginner's Guide (Prabhanjan Narayanachar Tattar, 2013).

Abhinav Upadhyay finished his Bachelor's degree in 2011 with a major in Information Technology. His main areas of interest include machine learning and information retrieval.

In 2011, he worked for the NetBSD Foundation as part of the Google Summer of Code program. During that period, he wrote a search engine for Unix manual pages. This project resulted in a new implementation of the apropos utility for NetBSD.

Currently, he is working as a Development Engineer for SocialTwist. His day-to-day work involves writing system-level tools and frameworks to manage the product infrastructure.

He is also an open source enthusiast and quite active in the community. In his free time, he maintains and contributes to several open source projects.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print and bookmark content
- On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface
Chapter 1: Introducing Machine Learning
    The origins of machine learning
    Uses and abuses of machine learning
        Ethical considerations
    How do machines learn?
        Abstraction and knowledge representation
        Generalization
        Assessing the success of learning
    Steps to apply machine learning to your data
    Choosing a machine learning algorithm
        Thinking about the input data
        Thinking about types of machine learning algorithms
        Matching your data to an appropriate algorithm
    Using R for machine learning
        Installing and loading R packages
            Installing an R package
            Installing a package using the point-and-click interface
            Loading an R package
    Summary
Chapter 2: Managing and Understanding Data
    R data structures
        Vectors
        Factors
        Lists
        Data frames
        Matrixes and arrays

    Managing data with R
        Saving and loading R data structures
        Importing and saving data from CSV files
        Importing data from SQL databases
    Exploring and understanding data
        Exploring the structure of data
        Exploring numeric variables
            Measuring the central tendency – mean and median
            Measuring spread – quartiles and the five-number summary
            Visualizing numeric variables – boxplots
            Visualizing numeric variables – histograms
            Understanding numeric data – uniform and normal distributions
            Measuring spread – variance and standard deviation
        Exploring categorical variables
            Measuring the central tendency – the mode
        Exploring relationships between variables
            Visualizing relationships – scatterplots
            Examining relationships – two-way cross-tabulations
    Summary
Chapter 3: Lazy Learning – Classification Using Nearest Neighbors
    Understanding classification using nearest neighbors
        The kNN algorithm
            Calculating distance
            Choosing an appropriate k
            Preparing data for use with kNN
        Why is the kNN algorithm lazy?
    Diagnosing breast cancer with the kNN algorithm
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
            Transformation – normalizing numeric data
            Data preparation – creating training and test datasets
        Step 3 – training a model on the data
        Step 4 – evaluating model performance
        Step 5 – improving model performance
            Transformation – z-score standardization
            Testing alternative values of k
    Summary
Chapter 4: Probabilistic Learning – Classification Using Naive Bayes
    Understanding naive Bayes
        Basic concepts of Bayesian methods
            Probability
            Joint probability

            Conditional probability with Bayes' theorem
        The naive Bayes algorithm
            The naive Bayes classification
            The Laplace estimator
            Using numeric features with naive Bayes
    Example – filtering mobile phone spam with the naive Bayes algorithm
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
            Data preparation – processing text data for analysis
            Data preparation – creating training and test datasets
            Visualizing text data – word clouds
            Data preparation – creating indicator features for frequent words
        Step 3 – training a model on the data
        Step 4 – evaluating model performance
        Step 5 – improving model performance
    Summary
Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules
    Understanding decision trees
        Divide and conquer
        The C5.0 decision tree algorithm
            Choosing the best split
            Pruning the decision tree
    Example – identifying risky bank loans using C5.0 decision trees
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
            Data preparation – creating random training and test datasets
        Step 3 – training a model on the data
        Step 4 – evaluating model performance
        Step 5 – improving model performance
            Boosting the accuracy of decision trees
            Making some mistakes more costly than others
    Understanding classification rules
        Separate and conquer
        The One Rule algorithm
        The RIPPER algorithm
        Rules from decision trees
    Example – identifying poisonous mushrooms with rule learners
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
        Step 3 – training a model on the data
        Step 4 – evaluating model performance

        Step 5 – improving model performance
    Summary
Chapter 6: Forecasting Numeric Data – Regression Methods
    Understanding regression
        Simple linear regression
        Ordinary least squares estimation
        Correlations
        Multiple linear regression
    Example – predicting medical expenses using linear regression
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
            Exploring relationships among features – the correlation matrix
            Visualizing relationships among features – the scatterplot matrix
        Step 3 – training a model on the data
        Step 4 – evaluating model performance
        Step 5 – improving model performance
            Model specification – adding non-linear relationships
            Transformation – converting a numeric variable to a binary indicator
            Model specification – adding interaction effects
            Putting it all together – an improved regression model
    Understanding regression trees and model trees
        Adding regression to trees
    Example – estimating the quality of wines with regression trees and model trees
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
        Step 3 – training a model on the data
            Visualizing decision trees
        Step 4 – evaluating model performance
            Measuring performance with mean absolute error
        Step 5 – improving model performance
    Summary
Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines
    Understanding neural networks
        From biological to artificial neurons
        Activation functions
        Network topology
            The number of layers
            The direction of information travel
            The number of nodes in each layer
        Training neural networks with backpropagation

    Modeling the strength of concrete with ANNs
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
        Step 3 – training a model on the data
        Step 4 – evaluating model performance
        Step 5 – improving model performance
    Understanding Support Vector Machines
        Classification with hyperplanes
            Finding the maximum margin
                The case of linearly separable data
                The case of non-linearly separable data
        Using kernels for non-linear spaces
    Performing OCR with SVMs
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
        Step 3 – training a model on the data
        Step 4 – evaluating model performance
        Step 5 – improving model performance
    Summary
Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules
    Understanding association rules
        The Apriori algorithm for association rule learning
            Measuring rule interest – support and confidence
            Building a set of rules with the Apriori principle
    Example – identifying frequently purchased groceries with association rules
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
            Data preparation – creating a sparse matrix for transaction data
            Visualizing item support – item frequency plots
            Visualizing transaction data – plotting the sparse matrix
        Step 3 – training a model on the data
        Step 4 – evaluating model performance
        Step 5 – improving model performance
            Sorting the set of association rules
            Taking subsets of association rules
            Saving association rules to a file or data frame
    Summary
Chapter 9: Finding Groups of Data – Clustering with k-means
    Understanding clustering
        Clustering as a machine learning task

        The k-means algorithm for clustering
            Using distance to assign and update clusters
            Choosing the appropriate number of clusters
    Finding teen market segments using k-means clustering
        Step 1 – collecting data
        Step 2 – exploring and preparing the data
            Data preparation – dummy coding missing values
            Data preparation – imputing missing values
        Step 3 – training a model on the data
        Step 4 – evaluating model performance
        Step 5 – improving model performance
    Summary
Chapter 10: Evaluating Model Performance
    Measuring performance for classification
        Working with classification prediction data in R
        A closer look at confusion matrices
        Using confusion matrices to measure performance
        Beyond accuracy – other measures of performance
            The kappa statistic
            Sensitivity and specificity
            Precision and recall
            The F-measure
        Visualizing performance tradeoffs
            ROC curves
    Estimating future performance
        The holdout method
            Cross-validation
            Bootstrap sampling
    Summary
Chapter 11: Improving Model Performance
    Tuning stock models for better performance
        Using caret for automated parameter tuning
            Creating a simple tuned model
            Customizing the tuning process
    Improving model performance with meta-learning
        Understanding ensembles
        Bagging
        Boosting
        Random forests
            Training random forests
            Evaluating random forest performance
    Summary

Chapter 12: Specialized Machine Learning Topics
    Working with specialized data
        Getting data from the Web with the RCurl package
        Reading and writing XML with the XML package
        Reading and writing JSON with the rjson package
        Reading and writing Microsoft Excel spreadsheets using xlsx
        Working with bioinformatics data
        Working with social network data and graph data
    Improving the performance of R
        Managing very large datasets
            Making data frames faster with data.table
            Creating disk-based data frames with ff
            Using massive matrices with bigmemory
        Learning faster with parallel computing
            Measuring execution time
            Working in parallel with foreach
            Using a multitasking operating system with multicore
            Networking multiple workstations with snow and snowfall
            Parallel cloud computing with MapReduce and Hadoop
        GPU computing
        Deploying optimized learning algorithms
            Building bigger regression models with biglm
            Growing bigger and faster random forests with bigrf
            Training and evaluating models in parallel with caret
    Summary
Index

Preface

Machine learning, at its core, is concerned with algorithms that transform information into actionable intelligence. This fact makes machine learning well-suited to the present-day era of Big Data. Without machine learning, it would be nearly impossible to keep up with the massive stream of information.

Given the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there has never been a better time to start using machine learning. R offers a powerful but easy-to-learn set of tools that can assist you with finding data insights.

By combining hands-on case studies with the essential theory that you need to understand how things work under the hood, this book provides all the knowledge that you will need to start applying machine learning to your own projects.

What this book covers

Chapter 1, Introducing Machine Learning, presents the terminology and concepts that define and distinguish machine learners, as well as a method for matching a learning task with the appropriate algorithm.

Chapter 2, Managing and Understanding Data, provides an opportunity to get your hands dirty working with data in R. Essential data structures and procedures used for loading, exploring, and understanding data are discussed.

Chapter 3, Lazy Learning – Classification Using Nearest Neighbors, teaches you how to understand and apply a simple yet powerful learning algorithm to your first machine learning task: identifying malignant samples of cancer.

Chapter 4, Probabilistic Learning – Classification Using Naive Bayes, reveals the essential concepts of probability that are used in cutting-edge spam filtering systems. You'll learn the basics of text mining in the process of building your own spam filter.

Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, explores a couple of learning algorithms whose predictions are not only accurate but easily explained. We'll apply these methods to tasks where transparency is important.

Chapter 6, Forecasting Numeric Data – Regression Methods, introduces machine learning algorithms used for making numeric predictions. As these techniques are heavily embedded in the field of statistics, you will also learn the essential metrics needed to make sense of numeric relationships.

Chapter 7, Black Box Methods – Neural Networks and Support Vector Machines, covers two extremely complex yet powerful machine learning algorithms. Though the mathematics may appear intimidating, we will work through examples that illustrate their inner workings in simple terms.

Chapter 8, Finding Patterns – Market Basket Analysis Using Association Rules, exposes the algorithm for the recommendation systems used at many retailers. If you've ever wondered how retailers seem to know your purchasing habits better than you know them yourself, this chapter will reveal their secrets.

Chapter 9, Finding Groups of Data – Clustering with k-means, is devoted to a procedure that locates clusters of related items. We'll utilize this algorithm to identify segments of profiles within a web-based community.

Chapter 10, Evaluating Model Performance, provides information on measuring the success of a machine learning project, and obtaining a reliable estimate of the learner's performance on future data.

Chapter 11, Improving Model Performance, reveals the methods employed by the teams found at the top of machine learning competition leaderboards. If you have a competitive streak, or simply want to get the most out of your data, you'll need to add these techniques to your repertoire.

Chapter 12, Specialized Machine Learning Topics, explores the frontiers of machine learning. From working with Big Data to making R work faster, the topics covered will help you push the boundaries of what is possible with R.

What you need for this book

The examples in this book were written for and tested with R Version 2.15.3 on both Microsoft Windows and Mac OS X, though they are likely to work with any recent version of R.

Who this book is for

This book is intended for anybody hoping to use data for action. Perhaps you already know a bit about machine learning, but have never used R; or perhaps you know a little R but are new to machine learning. In any case, this book will get you up and running quickly. It would be helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is required. You need only curiosity.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "To fit a linear regression model to data with R, the lm() function can be used."

Any command-line input or output is written as follows:

    pairs.panels(insurance[c("age", "bmi", "children", "charges")])

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Instead, ham messages use words such as can, sorry, need, and time."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
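As a brief illustration of the lm() convention mentioned above — this is a minimal sketch, not an example from the book itself, using R's built-in mtcars dataset rather than the book's insurance data:

```r
# Fit a simple linear regression predicting fuel efficiency (mpg)
# from vehicle weight (wt), using the built-in mtcars dataset
model <- lm(mpg ~ wt, data = mtcars)

# Inspect the estimated intercept and slope
coef(model)

# View the full model summary, including R-squared
summary(model)
```

The formula notation `mpg ~ wt` reads as "model mpg as a function of wt"; the same interface appears throughout the book's regression examples.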

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at questions@packtpub.com if you are having a problem with any aspect of the book, and we will do our best to address it.

Introducing Machine Learning

If science fiction stories are to be believed, teaching machines to learn will inevitably lead to apocalyptic wars between machines and their makers. In the early stages, computers are taught to play simple games of tic-tac-toe and chess. Later, machines are given control of traffic lights and communications, followed by military drones and missiles. The machines' evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then "deleted."

Thankfully, at the time of this writing, machines still require user input.

Your impressions of machine learning may be very heavily influenced by these types of mass media depictions of artificial intelligence. And even though there may be a hint of truth to such tales, in reality, machine learning is focused on more practical applications. The task of teaching a computer to learn is tied more closely to a specific problem than to building a computer that can play games, ponder philosophy, or answer trivial questions. Machine learning is more like training an employee than raising a child.

Putting these stereotypes aside, by the end of this chapter, you will have gained a far more nuanced understanding of machine learning. You will be introduced to the fundamental concepts that define and differentiate the most commonly used machine learning approaches. You will learn:

- The origins and practical applications of machine learning
- How knowledge is defined and represented by computers
- The basic concepts that differentiate machine learning approaches

In a single sentence, you could say that machine learning provides a set of tools that use computers to transform data into actionable knowledge. To learn more about how the process works, read on.

The origins of machine learning

Since birth, we are inundated with data. Our body's sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we are able to share these experiences with others.

The earliest databases recorded information from the observable environment. Astronomers recorded patterns of planets and stars; biologists noted results from experiments crossbreeding plants and animals; and cities recorded tax payments, disease outbreaks, and populations. Each of these required a human being to first observe and second, record the observation. Today, such observations are increasingly automated and recorded systematically in ever-growing computerized databases.

The invention of electronic sensors has additionally contributed to an increase in the richness of recorded data. Specialized sensors see, hear, smell, or taste. These sensors process the data far differently than a human being would, and in many ways, this is a benefit. Without the need for translation into human language, the raw sensory data remains objective.

It is important to note that although a sensor does not have a subjective component to its observations, it does not necessarily report truth (if such a concept can be defined). A camera taking photographs in black and white might provide a far different depiction of its environment than one shooting pictures in color. Similarly, a microscope provides a far different depiction of reality than a telescope.

Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting all manners of information from the monumental to the mundane. Weather sensors record temperature and pressure data, surveillance cameras watch sidewalks and subway tunnels, and all manner of electronic behaviors are monitored: transactions, communications, friendships, and many others.

This deluge of data has led some to state that we have entered an era of Big Data, but this may be a bit of a misnomer. Human beings have always been surrounded by data. What makes the current era unique is that we have easy data. Larger and more interesting data sets are increasingly accessible through the tips of our fingers, only a web search away. We now live in a period with vast quantities of data that can be directly processed by machines. Much of this information has the potential to inform decision making, if only there was a systematic way of making sense of it all.

The field of study interested in the development of computer algorithms for transforming data into intelligent action is known as machine learning. This field originated in an environment where the available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in data necessitated additional computing power, which in turn spurred the development of statistical methods for analyzing large datasets. This created a cycle of advancement allowing even larger and more interesting data to be collected.

A closely related sibling of machine learning, data mining, is concerned with the generation of novel insight from large databases (not to be confused with the pejorative term "data mining," describing the practice of cherry-picking data to support a theory). Although there is some disagreement over how widely the two fields overlap, a potential point of distinction is that machine learning tends to be focused on performing a known task, whereas data mining is about the search for hidden nuggets of information. For instance, you might use machine learning to teach a robot to drive a car, whereas you would utilize data mining to learn what type o
