Data Mining With Python (Working Draft)


Finn Årup Nielsen
November 29, 2017

Contents

List of Figures
List of Tables

1 Introduction
1.1 Other introductions to Python?
1.2 Why Python for data mining?
1.3 Why not Python for data mining?
1.4 Components of the Python language and software
1.5 Developing and running Python
1.5.1 Python, pypy, IPython
1.5.2 Jupyter Notebook
1.5.3 Python 2 vs. Python 3
1.5.4 Editing
1.5.5 Python in the cloud
1.5.6 Running Python in the browser

2 Python
2.1 Basics
2.2 Datatypes
2.2.1 Booleans (bool)
2.2.2 Numbers (int, float, complex and Decimal)
2.2.3 Strings (str)
2.2.4 Dictionaries (dict)
2.2.5 Dates and times
2.2.6 Enumeration
2.2.7 Other containers classes
2.3 Functions and arguments
2.3.1 Anonymous functions with lambdas
2.3.2 Optional function arguments
2.4 Object-oriented programming
2.4.1 Objects as functions
2.5 Modules and import
2.5.1 Submodules
2.5.2 Globbing import
2.5.3 Coping with Python 2/3 incompatibility
2.6 Persistency
2.6.1 Pickle and JSON
2.6.2 SQL
2.6.3 NoSQL
2.7 Documentation
2.8 Testing
2.8.1 Testing for type
2.8.2 Zero-one-some testing
2.8.3 Test layout and test discovery
2.8.4 Test coverage
2.8.5 Testing in different environments
2.9 Profiling
2.10 Coding style
2.10.1 Where is private and public?
2.11 Command-line interface scripting
2.11.1 Distinguishing between module and script
2.11.2 Argument parsing
2.11.3 Exit status
2.12 Debugging
2.12.1 Logging
2.13 Advices

3 Python for data mining
3.1 Numpy
3.2 Plotting
3.2.1 3D plotting
3.2.2 Real-time plotting
3.2.3 Plotting for the Web
3.3 Pandas
3.3.1 Pandas data types
3.3.2 Pandas indexing
3.3.3 Pandas joining, merging and concatenations
3.3.4 Simple statistics
3.4 SciPy
3.4.1 scipy.linalg
3.4.2 Fourier transform with scipy.fftpack
3.5 Statsmodels
3.6 Sympy
3.7 Machine learning
3.7.1 Scikit-learn
3.8 Text mining
3.8.1 Regular expressions
3.8.2 Extracting from webpages
3.8.3 NLTK
3.8.4 Tokenization and part-of-speech tagging
3.8.5 Language detection
3.8.6 Sentiment analysis
3.9 Network mining
3.10 Miscellaneous issues
3.10.1 Lazy computation
3.11 Testing data mining code

4 Case: Pure Python matrix library
4.1 Code listing

5 Case: Pima data set
5.1 Problem description and objectives
5.2 Descriptive statistics and plotting
5.3 Statistical tests
5.4 Predicting diabetes type

6 Case: Data mining a database
6.1 Problem description and objectives
6.2 Reading the data
6.3 Graphical overview on the connections between the tables
6.4 Statistics on the number of tracks sold

7 Case: Twitter information diffusion
7.1 Problem description and objectives
7.2 Building a news classifier

8 Case: Big data
8.1 Problem description and objectives
8.2 Stream processing of JSON
8.2.1 Stream processing of JSON Lines

Bibliography
Index


Preface

Python has grown to become one of the central languages in data mining, offering both a general programming language and libraries specifically targeted at numerical computations.

This book is continuously being written and grew out of a course given at the Technical University of Denmark.


List of Figures

1.1 The Python hierarchy.
2.1 Overview of methods and attributes in the common Python 2 built-in data types plotted as a formal concept analysis lattice graph. Only a small subset of methods and attributes is shown.
3.1 Sklearn classes derivation.
3.2 Comorbidity for ICD-10 disease code (appendicitis).
5.1 Seaborn correlation plot on the Pima data set
6.1 Database tables graph


List of Tables

2.1 Basic built-in and Numpy and Pandas datatypes
2.2 Class methods and attributes
2.3 Testing concepts
3.1 Function for generation of Numpy data structures.
3.2 Some of the subpackages of SciPy.
3.3 Python machine learning packages
3.4 Scikit-learn methods
3.5 sklearn classifiers
3.6 Metacharacters and character classes
3.7 NLTK submodules.
5.1 Variables in the Pima data set


Chapter 1
Introduction

1.1 Other introductions to Python?

Although we cover a bit of introductory Python programming in chapter 2, you should not regard this book as a Python introduction: several free introductory resources exist. First and foremost is the official Python Tutorial at http://docs.python.org/tutorial/. Beginning programmers with little or no programming experience may want to look into Think Python, available from http://www.greenteapress.com/thinkpython/ or as a book [1], while more experienced programmers can start with Dive Into Python, available from http://www.diveintopython.net/.¹ Kevin Sheppard’s presently 381-page Introduction to Python for Econometrics, Statistics and Data Analysis covers both Python basics and Python-based data analysis with Numpy, SciPy, Matplotlib and Pandas, and it is not just relevant for econometrics [2]. Developers already well versed in standard Python development but lacking experience with Python for data mining can begin with chapter 3. Readers in need of an introduction to machine learning may take a look at Marsland’s Machine learning: An algorithmic perspective [3], which uses Python for its examples.

¹ For further free websites for learning Python see sources.html.

1.2 Why Python for data mining?

Researchers have noted a number of reasons for using Python in the data science area (data mining, scientific computing) [4, 5, 6]:

1. Programmers regard Python as a clear and simple language with high readability. Even non-programmers may not find it too difficult. The simplicity exists both in the language itself and in the encouragement to write clear and simple code prevalent among Python programmers. Contrast this with, e.g., Perl, where short-form variable names allow you to write condensed code but also require you to remember nonintuitive variable names. A Python program may also be 2–5 times shorter than corresponding programs written in Java, C++ or C [7, 8].

2. Platform-independent. Python will run on the three main desktop computing platforms, Mac, Linux and Windows, as well as on a number of other platforms.

3. Interactive program. With Python you get an interactive prompt with a REPL (read-eval-print loop) like in Matlab and R. The prompt facilitates exploratory programming convenient for many data mining tasks, while you can still develop complete programs in an edit-run-debug cycle. The Python derivatives IPython and Jupyter Notebook are particularly suited for interactive programming.

4. General purpose language. Python is a general purpose language that can be used for a wide variety of tasks beyond data mining, e.g., user applications, system administration, gaming, web development, and psychological experiment presentation and recording. This is in contrast to Matlab and R. To see how well Python with its modern data mining packages compares with R, take a look at Carl J. V.’s blog posts on Will it Python?² and his GitHub repository, where he reproduces R code in Python based on the R data analyses from the book Machine Learning for Hackers.

5. Python with its BSD-style license falls in the group of free and open source software. Although some large Python development environments may have associated license costs for commercial use, the basic Python development environment may be set up and run with no licensing cost. Indeed, on some systems, e.g., many Linux distributions, basic Python comes readily installed. The Python Package Index provides a large set of packages that are also free software.

6. Large community. Python has a large community and has become more popular. Several indicators testify to this. The Popularity of Programming Language index (PYPL) bases its programming language ranking on Google search volume provided by Google Trends and puts Python in third position after Java and PHP. According to PYPL, the popularity of Python has grown since 2004. TIOBE constructs another indicator, putting Python at rank 6. This indicator is “based on the number of skilled engineers world-wide, courses and third party vendors”.³ Also Python is among the
