Python Programming Pandas PDF Free Download

2y ago

47 Views

1 Downloads

200.73 KB

34 Pages

Report/dmca

Download PDF

Transcription

Python programming — PandasFinn Årup NielsenDTU ComputeTechnical University of DenmarkOctober 5, 2013

PandasOverviewPandas?Reading dataSummary statisticsIndexingMerging, joiningGroup-by and cross-tabulationStatistical modelingFinn Årup Nielsen1October 5, 2013

PandasPandas?“Python Data Analysis Library”Young library for data analysisDeveloped from http://pandas.pydata.org/Main author Wes McKinney has written a 2012 book (McKinney, 2012).Finn Årup Nielsen2October 5, 2013

PandasWhy Pandas?A better Numpy: keep track of variable names, better indexing, easierlinear modeling.A better R: Access to more general programming language.Why not pandas?R: Still primary language for statisticians, means most avanced tools arethere.NaN/NA (Not a number/Not available)Support to third-party algorithms compared to Numpy? Numexpr? (NumExpr in 0.11)Finn Årup Nielsen3October 5, 2013

PandasGet some data from RGet a standard dataset, Pima, from R: R library(MASS) write.csv(Pima.te, "pima.csv")pima.csv now contains comma-separated .587,51,"Yes"Finn Årup Nielsen4October 5, 2013

PandasRead data with PandasBack in Python: import pandas as pd pima pd.read csv("pima.csv")“pima” is now what Pandas call a DataFrame object. This object keepstrack of both data (numerical as well as text), and column and rowheaders.Lets use the first columns and the index column: import pandas as pd pima pd.read csv("pima.csv", index col 0)Finn Årup Nielsen5October 5, 2013

PandasSummary statistics pima.describe()Unnamed: 0npregcount 332.000000 0Finn Årup 000023.00000027.00000037.00000081.0000006October 5, 2013

Pandas. . . Summary statisticsOther summary statistics (McKinney, 2012, around page 101):pima.count() Count the number of rowspima.mean(), pima.median(), pima.quantile()pima.std(), pima.var()pima.min(), pima.max()Operation across columns instead, e.g., with the mean method:pima.mean(axis 1)Finn Årup Nielsen7October 5, 2013

PandasIndexing the rowsFor example, you can see the first two rows or the three last rows: pima[0:2]npreg glu bp skinbmiped age type16 148 7235 33.6 0.62750 Yes2185 6629 26.6 0.35131No pima[-3:]npreg glu bp skinbmiped age type33010 101 7648 32.9 0.17163No3315 121 7223 26.2 0.24530No332193 7031 30.4 0.31523NoNotice that this is not an ordinary numerical matrix: We also got text (inthe “type” column) within the “matrix”!Finn Årup Nielsen8October 5, 2013

PandasIndexing the columnsSee a specific column, here ’bmi’ (body-mass index): pima["bmi"]133.6226.6328.1431.0[here I cut out several lines]33032.933126.233230.4Name: bmi, Length: 332The returned type is another of Pandas Series object, — another of thefundamental objects in the library: type(pima["bmi"]) class ’pandas.core.series.Series’ Finn Årup Nielsen9October 5, 2013

PandasConditional indexingGet the fat people (those with BMI above 30): pima.shape(332, 9) pima[pima["bmi"] 30].shape(210, 9)See histogram (with from pylab import *): pima["bmi"].hist() show()Or kernel density estimation plot (McKinney, 2012, p 239) pima["bmi"].plot(kind "kde") show()Finn Årup Nielsen10October 5, 2013

PandasPlotsHistogram and kernel density estimate (KDE) of the “bmi” variable (bodymass index) of the Pima data set.Finn Årup Nielsen11October 5, 2013

PandasRow and column conditional indexingExample by David Marx in R:ABCDE -runif(10)runif(10)runif(10)runif(10)runif(10)df - data.frame(A,B,C,D,E)sliced df - df[ , df[1,] .5 ]That is, select the columns in a dataframe where the values of the firstrow is below 0.5. Here with a 10-by-5 dataset with uniformly-distributedrandom numbers and columns indexed by letters.Finn Årup Nielsen12October 5, 2013

Pandas. . . Row and column conditional indexingEquivalent in Pythonimport pandas as pdfrom pylab import *df pd.DataFrame(rand(10,5), columns ["A", "B", "C", "D", "E"])df.ix[:, df.ix[0, :] 0.5]These variations do not workdf[:, df[0] 0.5]df[:, df[:1] 0.5]df.ix[:, df[:1] 0.5]Finn Årup Nielsen13October 5, 2013

PandasConstructing a DataFrameConstructing a DataFrame from a dictionary where the keys become thecolumn names import pandas as pd import string spam corpus map(string.split, [ "buy viagra", "buy antibody" ]) unique words set([ word for doc in spam corpus for word in doc ]) word counts [ (word, map(lambda doc: doc.count(word), spam corpus))for word in unique words ] spam bag of words pd.DataFrame(dict(word counts)) print(spam bag of words)antibody buy viagra00111110Finn Årup Nielsen14October 5, 2013

PandasConcatenationAnother corpus and then concatenation with the previous dataset other corpus map(string.split, [ "buy time", "hello" ]) unique words set([ word for doc in other corpus for word in doc ]) word counts [ (word, map(lambda doc: doc.count(word), other corpus))for word in unique words ] other bag of words pd.DataFrame(dict(word counts)) print(other bag of words)buy hello time01011010 pd.concat([spam bag of words, other bag of words], ignore index True)antibody buy hello time viagra001NaNNaN1111NaNNaN02NaN101NaN3NaN010NaNFinn Årup Nielsen15October 5, 2013

PandasFilling in missing data(McKinney, 2012, page 145 ) pd.concat([spam bag of words, other bag of words], ignore index True)antibody buy hello time viagra001NaNNaN1111NaNNaN02NaN101NaN3NaN010NaN pd.concat([spam bag of words, other bag of words], ignore index True).fillna(0)antibody buy hello time viagra001001111000201010300100Finn Årup Nielsen16October 5, 2013

PandasCombining datasetsSee tml for otherPandas operations:concatjoinmergecombine firstFinn Årup Nielsen17October 5, 2013

PandasJoin exampleTwo data sets with partially overlapping rows (as not all students answer each questionnaire) where the columns should be concatenated (i.e.,scores for individual questionnaires)import pandas as pdxl pd.ExcelFile("E13 1 Resultater-2013-10-02.xlsx")df1 xl.parse("Resultater", index col [0, 1, 2], header 3)df1.columns map(lambda colname: unicode(colname) " 1", df1.columns)xl pd.ExcelFile("E13 2 Resultater-2013-10-02.xlsx")df2 xl.parse("Resultater", index col [0, 1, 2], header 3)df2.columns map(lambda colname: unicode(colname) " 2", df2.columns)df pd.DataFrame().join([df1, df2], how "outer")df[["Score 1", "Score 2"]].corr()Finn Årup Nielsen18# Score correlationOctober 5, 2013

PandasProcessing after join df.ix[:5,["Score 1", "Score 2"]]Bruger(faan)s06.s07.s07.s07.FornavnFinn Årup.EfternavnNielsenScore 1Score 709000.7418000.569666(edited)Note that the second user (“s06.”) did not solve the second assignment.The joining operation by default adds a NaN to the missing element, —indicating a missing value (not available, NA).Finn Årup Nielsen19October 5, 2013

PandasThe GroupbyGroupby method (McKinney, 2012, chapter 9): splits the dataset basedon a key, e.g., a DataFrame column name.Think of SQL’s GROUP BY.Example with Pima Indian data set splitting on the ’type’ column (elements are “yes” and “no”) and taking the mean in each of the twogroups: 5 108.188341 70.130045Yes4.614679 141.908257 6.5128440.4645650.65896329.21524735.614679The returned object from groupby is a DataFrameGroupBy object whilethe mean method on that object/class returns a DataFrame objectFinn Årup Nielsen20October 5, 2013

Pandas. . . The GroupbyMore elaborate with two aggregating methods: grouped by type pima.groupby("type") grouped by type.agg([np.mean, np.std])npregglumeanstdmeanstdtypeNo2.932735 2.781852 108.188341 22.645932Yes4.614679 3.901349 141.908257 1Finn Årup Nielsen21October 5, 2013

Pandas. . . The GroupbyWithout groupby checking mean (32.889908) and std (9.065951) for’skin’ ’Yes’: np.mean(pima[pima["type"] "Yes"]["skin"])32.889908256880737# Correct np.std(pima[pima["type"] "Yes"]["skin"])9.0242684519300891# ? import scipy.stats scipy.stats.nanstd(pima[pima["type"] "Yes"]["skin"])9.065951207005341# Ok np.std(pima[pima["type"] "Yes"]["skin"], ddof 1)9.065951207005341# Degrees of freedom!Numpy’s std is the biased estimate while Pandas std is the unbiasedestimate.Finn Årup Nielsen22October 5, 2013

PandasCross-tabulationFor categorical variables select two columns and generate a matrix withcounts for occurences (McKinney, 2012, p. 277) pd.crosstab(pima.type, pima.npreg)npreg 012345678typeNo34 56 38 23 19 13 1495Yes15 15 11 1567486948101112131517451511210101Remember: pima[1:4]npreg glu218531894378Finn Årup 1670.24823age type31No21No26 YesOctober 5, 2013

PandasCross-tabulation plot# Wrong orderingpd.crosstab(pima.type, pima.npreg).plot(kind "bar")Finn Årup Nielsen24October 5, 2013

PandasCross-tabulation plot# Transposepd.crosstab(pima.type, pima.npreg).T.plot(kind "bar")Finn Årup Nielsen25October 5, 2013

PandasCross-tabulation plot# Or better:pd.crosstab(pima.npreg, pima.type).plot(kind "bar")Finn Årup Nielsen26October 5, 2013

PandasOther Pandas capabilitiesHierarchical indexing (McKinney, 2012, page 147 )Missing data support (McKinney, 2012, page 142 )Pivoting (McKinney, 2012, chapter 9)Time series (McKinney, 2012, chapter 10)Finn Årup Nielsen27October 5, 2013

PandasStatistical modeling with statsmodelsExample with Longley dataset.Ordinary least squares fitting a dependent variable “TOTEMP” (TotalEmployment) from 6 independent variables:import statsmodels.api as sm# For ’load pandas’ you need a recent statsmodelsdata sm.datasets.longley.load pandas()# Endogeneous (response/dependent) & exogeneous variables (design matrix)y, x data.endog, data.exogresult sm.OLS(y, x).fit() # OLS: ordinary least squaresresult.summary()# Print summaryFinn Årup Nielsen28October 5, 2013

PandasOLS Regression Results Dep. Variable:TOTEMPR-squared:0.988Model:OLSAdj. R-squared:0.982Method:Least SquaresF-statistic:161.9Date:Mon, 17 Jun 2013Prob -117.56No. Observations:16AIC:247.1Df Residuals:10BIC:251.8Df Model:5 coefstd errtP t [95.0% Conf. 70.0219.00387.832 6Jarque-Bera nd. No.4.56e 05 Finn Årup Nielsen29October 5, 2013

PandasStatsmodels 0.5“Minimal example” from statsmodels documentation:import numpy as npimport pandas as pdimport statsmodels.formula.api as smfurl /HistData/Guerry.csv"dat pd.read csv(url)results smf.ols("Lottery Literacy np.log(Pop1831)", data dat).fit()results.summary()Note: 1) Loading of data with URL, 2) import statsmodels.formula.api(possible in statsmodels 0.5), 3) R-like specification of linear modelformula (from patsy).Finn Årup Nielsen30October 5, 2013

PandasMore informationhttp://pandas.pydata.org/The canonical book “Python for data analysis” (McKinney, 2012).Will it Python?: Porting R projects to Python, exemplified though scriptsfrom Machine Learning for Hackers (MLFH) by Drew Conway and JohnMiles White.Finn Årup Nielsen31October 5, 2013

PandasSummaryPandas helps you represent your data (both numerical and categorical)and helps you keep track of what they refer to (by column and row name).Pandas makes indexing easy.Pandas has some basic statistics and plotting facilities.Pandas may work more or less seamlessly with standard statistical models(e.g., general linear model with OLS-estimation)Watch out: Pandas is still below version 1 numbering!Standard packaging not up to date: Newest version of Pandas is 0.11.0,while, e.g., Ubuntu LTS 12.04 is 0.7.0: sudo pip install --upgrade pandasLatest pip-version of statsmodels is 0.4.3, development version is 0.5 withstatsmodels.formula.api that yields more R-like linear modeling.Finn Årup Nielsen32October 5, 2013

ReferencesReferencesMcKinney, W. (2012).ISBN 9781449319793.Finn Årup NielsenPython for Data Analysis.33O’Reilly, Sebastopol, California, first edition.October 5, 2013

Back in Python: import pandas as pd pima pd.read_csv("pima.csv") \pima" is now what Pandas call a DataFrame object. This object keeps track of both data (numerical as well as text), and column and ro