Python For R Users - Robert Richardson

Transcription

Python for R UsersByChandan RoutrayAs a part of internship atwww.decisionstats.com

Basic CommandsFunctionsRPythonDownloading and installing a packageinstall.packages('name')pip install nameLoad a packagelibrary('name')import name as other nameChecking working directorygetwd()import osos.getcwd()Setting working directorysetwd()os.chdir()List files in a directorydir()os.listdir()List all objectsls()globals()Remove an objectrm('name')del('object')Dec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.i

Data Frame CreationRPython(Using pandas package*)Creating a data frame “df” ofdimension 6x4 (6 rows and 4columns) containing randomnumbersA matrix(runif(24,0,1),nrow 6,ncol 4)df data.frame(A)Here, runif function generates 24 randomnumbers between 0 to 1 matrix function creates a matrix fromthose random numbers, nrow and ncolsets the numbers of rows and columnsto the matrix data.frame converts the matrix to dataframeimport numpy as npimport pandas as pdA np.random.randn(6,4)df pd.DataFrame(A)Here, np.random.randn generates amatrix of 6 rows and 4 columns;this function is a part of numpy**library pd.DataFrame converts the matrixin to a data frame*To install Pandas library visit: http://pandas.pydata.org/; To import Pandas library type: import pandas as pd;**To import Numpy library type: import numpy as np;Dec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.1

Data Frame CreationRDec 2014Copyrigt www.decisionstats.comPythonLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.2

Data Frame: Inspecting and Viewing DataRPython(Using pandas package*)Getting the names of rows andcolumns of data frame “df”Seeing the top and bottom “x”rows of the data frame “df”rownames(df)df.indexreturns the name of the rowsreturns the name of the rowscolnames(df)df.columnsreturns the name of the columnsreturns the name of the columnshead(df,x)df.head(x)returns top x rows of data framereturns top x rows of data frametail(df,x)df.tail(x)returns bottom x rows of data framereturns bottom x rows of data frameGetting dimension of data frame“df”dim(df)df.shapereturns in this format : rows, columnsreturns in this format : (rows,columns)Length of data frame “df”length(df)len(df)returns no. of columns in data framesreturns no. of columns in data framesDec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.3

Data Frame: Inspecting and Viewing DataRDec 2014PythonCopyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.4

Data Frame: Inspecting and Viewing DataRPython(Using pandas package*)Getting quick summary(likemean, std. deviation etc. ) ofdata in the data frame “df”summary(df)df.describe()returns mean, median , maximum,minimum, first quarter and third quarterreturns count, mean, standarddeviation, maximum, minimum, 25%,50% and 75%Setting row names and columnsnames of the data frame “df”rownames(df) c(“A”, ”B”, “C”, ”D”,“E”, ”F”)df.index [“A”, ”B”, “C”, ”D”,“E”, ”F”]set the row names to A, B, C, D and Eset the row names to A, B, C, D andEcolnames c(“P”, ”Q”, “R”, ”S”)set the column names to P, Q, R and Sdf.columns [“P”, ”Q”, “R”, ”S”]set the column names to P, Q, R andSDec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.5

Data Frame: Inspecting and Viewing DataPythonRDec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.6

Data Frame: Sorting DataRPython(Using pandas package*)Sorting the data in the dataframe “df” by column name “P”Dec 2014Copyrigt www.decisionstats.comdf[order(df P),]df.sort(['P'])Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.7

Data Frame: Sorting DataRDec 2014PythonCopyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.8

Data Frame: Data SelectionRPython(Using pandas package*)Slicing the rows of a data framefrom row no. “x” to row no.“y”(including row x and y)df[x:y,]df[x 1:y]Python starts counting from 0Slicing the columns name “x”,”Y” myvars c(“X”,”Y”)newdata df[myvars]etc. of a data frame “df”df.loc[:,[‘X’,’Y’]]Selecting the the data from rowno. “x” to “y” and column no. “a”to “b”df[x:y,a:b]df.iloc[x 1:y,a 1,b]Selecting the element at row no.“x” and column no. “y”df[x,y]df.iat[x 1,y 1]Dec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.9

Data Frame: Data SelectionRDec 2014PythonCopyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.10

Data Frame: Data SelectionRPython(Using pandas package*)Using a single column’s valuesto select data, column name “A”RDec 2014subset(df,A 0)df[df.A 0]It will select the all the rows in which thecorresponding value in column A of thatrow is greater than 0It will do the same as the R functionPythonCopyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.11

Mathematical FunctionsFunctionsRPython(import math and numpy library)Dec 2014Sumsum(x)math.fsum(x)Square Rootsqrt(x)math.sqrt(x)Standard dian(x)Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.12

Mathematical FunctionsRDec 2014PythonCopyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.13

Data ManipulationFunctionsRPython(import math and numpy library)Convert character variable to numeric variableas.numeric(x)For a single value: int(x), long(x), float(x)For list, vectors etc.: map(int,x), map(float,x)Convert factor/numeric variable to charactervariablepaste(x)For a single value: str(x)For list, vectors etc.: map(str,x)Check missing value in an objectis.na(x)math.isnan(x)Delete missing value from an objectna.omit(list)cleanedList [x for x in list if str(x) ! 'nan']Calculate the number of characters in charactervaluenchar(x)len(x)Dec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.14

Date & Time ManipulationFunctionsRPython(import lubridate library)(import datetime library)Getting time and date at an instantSys.time()datetime.datetime.now()Parsing date and time in format:YYYY MM DD HH:MM:SSd Sys.time()d format ymd hms(d)d datetime.datetime.now()format “%Y %b %d %H:%M:%S”d format d.strftime(format)Dec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.15

Data VisualizationFunctionsRPython(import matplotlib library**)Scatter Plot variable1 vs Var)plt.show()Histogram for Varhist(Var)plt.hist(Var)plt.show()Pie Chart for Varpie(Var)from pylab import *pie(Var)show()Boxplot for Var** To import matplotlib library type: import matplotlib.pyplot as pltDec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.16

Data Visualization: Scatter PlotPythonRDec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.17

Data Visualization: Box PlotPythonRDec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.18

Data Visualization: HistogramRDec 2014PythonCopyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.19

Data Visualization: Line PlotPythonRDec 2014Copyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.20

Data Visualization: BubbleRDec 2014PythonCopyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.22

Data Visualization: BarRDec 2014PythonCopyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.21

Data Visualization: Pie ChartRDec 2014PythonCopyrigt www.decisionstats.comLicensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.23

Thank YouFor feedback contactDecisionStats.com

Coming up Data Mining in Python and R ( see draft slidesafterwards)

Machine Learning: SVM on Iris DatasetR(Using svm* function)library(e1071)data(iris)trainset iris[1:149,]testset iris[150,]svm.model svm(Species ., data trainset, cost 100, gamma 1, type 'C classification')svm.pred predict(svm.model,testset[ 5])svm.predOutput: VirginicaPython(Using sklearn** library)#Loading Libraryfrom sklearn import svm#Importing Datasetfrom sklearn import datasets#Calling SVMclf svm.SVC()#Loading the packageiris datasets.load iris()#Constructing training dataX, y iris.data[: 1], iris.target[: 1]#Fitting SVMclf.fit(X, y)#Testing the model on test dataprint clf.predict(iris.data[ 1])Output: 2, corresponds to Virginica*To know more about svm function in R visit: http://cran.r-project.org/web/packages/e1071/** To install sklearn library visit : http://scikit-learn.org/, To know more about sklearn svm visit: learn.svm.SVC.html

Linear Regression: Iris DatasetR(Using lm* function)data(iris)total size dim(iris)[1]num target c(rep(0,total size))for (i in 1:length(num target)){if(iris Species[i] 'setosa'){num target[i] 0}else if(iris Species[i] 'versicolor'){num target[i] 1}else{num target[i] 2}}iris Species num targettrain set iris[1:149,]test set iris[150,]fit lm(Species 0 Sepal.Length Sepal.Width Petal.Length Petal.Width , data train set)coefficients(fit)predict.lm(fit,test set)Output: 1.64Python(Using sklearn** library)from sklearn import linear modelfrom sklearn import datasetsiris datasets.load iris()regr linear model.LinearRegression()X, y iris.data[: 1], iris.target[: 1]regr.fit(X, y)print(regr.coef )print regr.predict(iris.data[ 1])Output: 1.65*To know more about lm function in R visit: s/html/lm.html** ** To know more about sklearn linear regression visit : learn.linear model.LinearRegression.html

Random forest: Iris DatasetR(Using randomForest* package)Python(Using sklearn** library)from sklearn import ensemblefrom sklearn import datasetsclf ensemble.RandomForestClassifier(n estimators 100,max depth 10)for (i in 1:length(num target)){if(iris Species[i] 'setosa'){num target[i] 0} iris datasets.load iris()X, y iris.data[: 1], iris.target[: 1]else if(iris Species[i] 'versicolor'){num target[i] 1}clf.fit(X, y)else{num target[i] 2}}print clf.predict(iris.data[ 1])library(randomForest)data(iris)total size dim(iris)[1]num target c(rep(0,total size))iris Species num targettrain set iris[1:149,]test set iris[150,]iris.rf randomForest(Species .,data train set,ntree 100,importance TRUE,proximity TRUE)print(iris.rf)predict(iris.rf, test set[ 5], predict.all TRUE)Output: 1.845Output: 2*To know more about randomForest package in R visit: t/** To know more about sklearn random forest visit : learn.ensemble.RandomForestClassifier.html

Decision Tree: Iris DatasetR(Using rpart* package)library(rpart)data(iris)sub c(1:149)fit rpart(Species ., data iris,subset sub)fitpredict(fit, iris[ sub,], type "class")Output: VirginicaPython(Using sklearn** library)from sklearn.datasets import load irisfrom sklearn.tree importDecisionTreeClassifierclf DecisionTreeClassifier(random state 0)iris datasets.load iris()X, y iris.data[: 1], iris.target[: 1]clf.fit(X, y)print clf.predict(iris.data[ 1])Output: 2, corresponds to virginica*To know more about rpart package in R visit: http://cran.r-project.org/web/packages/rpart/** To know more about sklearn desicion tree visit : learn.tree.DecisionTreeClassifier.html

Gaussian Naive Bayes: Iris DatasetR(Using e1071* package)library(e1071)data(iris)Python(Using sklearn** library)from sklearn.datasets import load irisfrom sklearn.naive bayes import GaussianNBtrainset iris[1:149,]testset iris[150,]classifier naiveBayes(trainset[,1:4],trainset[,5])clf GaussianNB()iris datasets.load iris()X, y iris.data[: 1], iris.target[: 1]clf.fit(X, y)print clf.predict(iris.data[ 1])predict(classifier, testset[, 5])Output: VirginicaOutput: 2, corresponds to virginica*To know more about e1071 package in R visit: http://cran.r-project.org/web/packages/e1071/** To know more about sklearn Naive Bayes visit : learn.naive bayes.GaussianNB.html

K Nearest Neighbours: Iris DatasetR(Using kknn* package)library(kknn)data(iris)trainset iris[1:149,]testset iris[150,]iris.kknn kknn(Species .,trainset,testset, distance 1,kernel "triangular")summary(iris.kknn)fit fitted(iris.kknn)fitOutput: VirginicaPython(Using sklearn** library)from sklearn.datasets import load irisfrom sklearn.neighbors importKNeighborsClassifierknn KNeighborsClassifier()iris datasets.load iris()X, y iris.data[: 1], iris.target[: 1]knn.fit(X,y)print knn.predict(iris.data[ 1])Output: 2, corresponds to virginica*To know more about kknn package in R visit:** To know more about sklearn k nearest neighbours visit : learn.neighbors.NearestNeighbors.html

Thank YouFor feedback please let us know atohri2007@gmail.com

Data Frame: Inspecting and Viewing Data R Python (Using pandas package*) Getting the names of rows and columns of data frame “df” rownames(df) returns the name of the rows colnames(df) returns the name of the columns df.index returns the name of the rows df.columns retu