CRASH COURSE PYTHON - 3142.nl

Transcription

CRASH COURSE PYTHON‹nr.›Het begint met een idee

This talk Not a programming course For data analysts, who want to learn Python For optimizers, who are fed up with Matlab2Vrije Universiteit Amsterdam

Python Scripting languageexpensive computations typically in compiled modules such as matrix multiplication, optimization, classification Faster Python code: Numba’s @jit construct (or Cython) Support for functions and OOP (classes, abstract classes,polymorphism, inheritance; but no encapsulation) Direct competitors: R, Julia, Matlab3Vrije Universiteit Amsterdam

Zen of PythonBeautiful is better than ugly.Explicit is better than implicit.Simple is better than complex.Complex is better than complicated.Flat is better than nested.Sparse is better than dense.Readability counts.Special cases aren't special enough to break the rules.Although practicality beats purity.Errors should never pass silently.Unless explicitly silenced.4Vrije Universiteit Amsterdam

Python 2 or 3? 51994: Python 12000: Python 2 (backward compatible)2008: Python 3Most pronounced difference: Python 2: print “hello world!” Python 3: print(“hello world!”) Strength of Python: broad availability of modules Many modules have been updated for Python 3 Some people still use Python 2Vrije Universiteit Amsterdam

Installing Python Windows users: use winPython Has MKL for fast linear algebra, and many preinstalled modulesPortable, so extract & goShips with the Spyder editor for coding and debuggingand a compiler for new moduleswinPython 3.4 is currently recommended (3.5 does not (yet) ship witha compiler)Mac users OS X ships with Python 2.7 (and depends on it, do not “update” to 3)Python 3 can be installed alongsideLinux Ubuntu ships with both Python 2.7 and Python 3.4 Commands: python & python36Vrije Universiteit Amsterdam

Installing modules Mac/Linux/POSIX-compatible systems: run pip from theterminale.g.: pip install cylp WinPython: run “WinPython Command Prompt.exe”and use pip For dependencies that require a shell script (“./configure”):add the folder “winPython/share/mingwpy/bin” to the pathinstall msys from mingw.orgstart msys (C:\MinGW\msys\1.0\msys.bat)Configure&compile the dependency7Vrije Universiteit Amsterdam

Running Python The editor probably has a hotkey (F5 in Spyder) Shell command: “python filename.py” Alternative: “python” (runs commands as they areentered)8Vrije Universiteit Amsterdam

Crash courseData typeInitialize emptyInitialize with dataListx []x [1,2,5]Tuple-x (1,2,5)Setx set()x {1,2,5}Dictx {}x {"one": 1, "two": 2, "five": 5}Stringx ""x "hello world"9Vrije Universiteit Amsterdam

Precision Integers have infinite precisionFloats have finite precision use decimal/float/mpmath modules for arbitrary precision 683720566806937610Vrije Universiteit Amsterdam

Creating a listCodeOutputx [0,1,2,3,4,5,6,7,8,9,10]print(x)x range(11)print(x)print(list(x))[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]range(0, 11)[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] range(n) is “list-like” internally it is an object that can be converted to a list range(int(1e10)) requires a few bytes instead of 74.5 GB11Vrije Universiteit Amsterdam

LoopsCodeOutputfor i in [1,2,3]:print(i)12345while i 5:i 1print(i) No curly braces or “end for”Structure is derived from level of indentationOne statement per lineNo semicolons required12Vrije Universiteit Amsterdam

FunctionsCodeOutputdef fun(name, greeting 'Hi', me 'evil caterpillar'):print(greeting ' ' name ', this is ' me)return 0Hi group, this is Pythonfun('group', me 'Python') All arguments are named: fun(name ‘group’) Naming useful for optional arguments Return is optional13Vrije Universiteit Amsterdam

FunctionsCodeOutputdef trick me(a,b,c):a.append('o')b.append('o')c 1['m', 'n', 'o', 'o'] ['m', 'n', 'o', 'o'] 1x ['m','n']y xz 1trick me(x,y,z)print(x,y,z) Behavior depends on whether type is mutable 14variables are pointersmemory gets overwritten for mutable types only String, int, double, tuple are immutableList, set, dict are mutable “y list(x)” creates a shallow copy(y copy.deepcopy(x) when x contains mutabledata)Vrije UniversiteitAmsterdam

List comprehensionsCreating a list with squares: 0, 1, 4, , 100Naive codeIdiomatic Pythonx []for i in range(11):x.append(i*i)x [i*i for i in range(11)]Creating a list of even numbers 6, 8, 10, 12, 14Naive codeIdiomatic Pythonx []for i in range(6,15):if i % 2 0:x.append(i)x [i for i in range(6,15) if i%2 0]# orx [i for i in range(6,15,2)]15Vrije Universiteit Amsterdam

One-liner example Find the last ten digits of the series:11 22 33 . 10001000 (projecteuler.net) print(str(sum([k**k for k in range(1,1001)]))[-10:])9110846700 [k**k for k in range(1,1001)] creates the termssum(.) takes the sumstr(.) converts the argument to a string[-10:] takes a substring16Vrije Universiteit Amsterdam

Modules Matlab replacements scipy (free, linear algebra)matplotlib (free, graphing)Optimization cylp (free, linear and mixed integer optimization)pyipopt (free, convex optimization)gurobi / cplex (academic license)Data mining pandas (free, importing and slicing data)scikit-learn (free, machine learning)xgboost (free, gradient boosting)takes less than 20 lines to create a cross-validated ensemble ofclassifiers17Vrije Universiteit Amsterdam

RecapExample: function, for-loop, range, commentdef take sum(S):sum 0for i in S:sum ireturn sumprint(take sum(range(7)))# outputs 21Example: named argumentsdef fun(name, greeting 'Hi', me 'evil caterpillar'):print(greeting ' ' name ', this is ' me)return 018fun('group', me 'Python')Vrije Universiteit Amsterdam

Data mining Reading data with pandasVisualization with matplotlibMachine learning with scikit-learn19Vrije Universiteit Amsterdam

Reading data Pandas offers read csv, read excel, read sql, read json,read html, read sas, etc read * returns pandas data structure: DataFrame Having data in DataFrame is useful filtering, combining, grouping, sorting to csv, to excel, etc (for, e.g., converting csv to json)20Vrije Universiteit Amsterdam

Example: reading csv fileCSV fileid,feat 1,feat 2,feat 3,feat 4,feat ,0,0,1,6,05,0,0,0,0,0,1Codeimport pandasfilename 'train.csv'X pandas.read csv(filename, sep ",")y X.targetX.drop(['target', 'id'], axis 1, inplace True)21Vrije Universiteit Amsterdam

Filtering dataCSV fileid,feat 1,feat 2,feat 3,feat 4,feat ,0,0,1,6,05,0,0,0,0,0,1Codefilename 'train.csv'data pandas.read csv(filename)print(data[0:2])output:12id23feat 100feat 200feat 300feat 400feat 500target1122Vrije Universiteit Amsterdam

Filtering dataCSV fileid,feat 1,feat 2,feat 3,feat 4,feat ,0,0,1,6,05,0,0,0,0,0,1Codefilename 'train.csv'data pandas.read csv(filename)print(data[data.feat 1 1])output:03id14feat 111feat 200feat 300feat 401feat 506target1123Vrije Universiteit Amsterdam

VisualizationCodedata[data.feat 2 5].feat 2.plot(kind 'hist')# since the data takes few distinct values:data[data.feat 2 5].feat 2.value counts().sort index().plot(kind 'bar')24Vrije Universiteit Amsterdam

GroupingCodeimport numpy as nppandas.set option('display.precision',2)for feat 2 value,group in data.groupby('feat 2'):# group is the DataFrame data[feat 2 feat 2 value]data.groupby('feat 2').aggregate(pandas.Series.nunique)# other aggregation functions: np.min, np.max, np.sum, np.stdfeat 2012345 25idfeat 1feat 3feat 4feat 048363927271315107743999756Vrije Universiteit Amsterdam

Example: time seriesCodeimport pandasimport numpy as npts pandas.Series(np.random.randn(1000), \index pandas.date range('1/1/2000', periods 1000))ts ts.cumsum()ts.plot()print(ts.mean())# output: 28.64280223089867826Vrije Universiteit Amsterdam

Example: large data setSuppose csv file is 100 GB and has thousands of columnsSubset of three columns is manageableCodeinfile 'train.csv'outfile ‘output.xlsx’df pandas.DataFrame()# chunksize is the number of rows to read per iterationfor data in pandas.read csv(infile, chunksize 100):data data[['feat 1', 'feat 2', 'target']]df pandas.concat([df,data])writer pandas.ExcelWriter(outfile)df.to excel(writer, 'Sheet1')writer.save()27Vrije Universiteit Amsterdam

Logistic regressionCodefrom sklearn import cross validation,linear modelfrom sklearn.metrics import log lossfilename 'train.csv'X pandas.read csv(filename, sep ",")y X.targetX.drop(['target', 'id'], axis 1, inplace True)y[y 1] 0y[y 1] 1X,X test,y,y test cross validation.train test split(X, y, test size 0.5)clf linear model.LogisticRegression()clf.fit(X,y)prediction clf.predict proba(X test)print(log loss(y test,prediction))# output: 0.00159227347414; log loss is in in [0, 34.5]#28 0 for “perfect fit”, 0.7 for “constant p 0.5”, 34.5 for “all wrong”Vrije Universiteit Amsterdam

Vrije Universiteit Amsterdam Windows users: use winPython Has MKL for fast linear algebra, and many preinstalled modules Portable, so extract & go Ships with the Spyder editor for coding and debugging and a compiler for new modules winPython 3.4 is currently recommended (3.5 does not (yet) ship with a compiler) Mac users OS X ships with Python 2.7 (and depends .