Python For Data Analysis - IDEAL

PANDAS: Python for Data Analysis. Moshiul Arefin, February 8, 2014. EE 380L Data Mining, University of Texas at Austin

pandas - Outline: Functionality, Data Loading, Plotting, What else can pandas do, Questions

pandas - Overview: Python Data Analysis Library, similar to R, MATLAB, and SAS. Combined with the IPython toolkit. Built on top of NumPy, SciPy, and to some extent matplotlib. Panel Data System. Open source, BSD-licensed. Key components: Series, DataFrame.

pandas - Purpose: Ideal tool for data scientists. Munging data, cleaning data, analyzing data, modeling data, and organizing the results of the analysis into a form suitable for plotting or tabular display.

pandas - Terminology: IPython is a command shell for interactive computing in multiple programming languages, especially focused on the Python programming language, that offers enhanced introspection, rich media, additional shell syntax, tab completion, and rich history. NumPy is the fundamental package for scientific computing with Python.

pandas - Terminology: SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering. Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Data Munging or Data Wrangling means taking data that's stored in one format and changing it into another format.

pandas - Terminology: The Cython programming language is a superset of Python with a foreign function interface for invoking C/C++ routines and the ability to declare the static type of subroutine parameters and results, local variables, and class attributes.

pandas - Data Structures: Series. One-dimensional array-like object containing data and labels (or index). Lots of ways to build a Series.
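A few of the ways to build a Series can be sketched as follows. This is a minimal illustration assuming a current pandas release; the values and variable names are made up for the example and do not come from the slides.

```python
import numpy as np
import pandas as pd

# From a plain list: pandas supplies a default integer index 0..n-1.
from_list = pd.Series([4, 7, -5, 3])

# From a list plus explicit labels.
labeled = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# From a dict: the keys become the index labels.
from_dict = pd.Series({"Ohio": 35000, "Texas": 71000, "Oregon": 16000})

# From a NumPy array.
from_array = pd.Series(np.arange(3) * 1.5)

print(labeled.values)  # the data
print(labeled.index)   # the labels
```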

Series - Working with the index: A Series index can be specified. Single values can be selected by index. Multiple values can be selected with multiple indexes.
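Selection by one label and by several labels might look like this. The Series below is an illustrative assumption, not data from the slides.

```python
import pandas as pd

# Index labels are specified explicitly at construction time.
s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# A single label returns a scalar.
single = s["a"]

# A list of labels returns a new Series, in the order requested.
several = s[["c", "a", "d"]]

print(single)
print(several)
```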

Series - Working with the index: Think of a Series as a fixed-length, ordered dict. However, unlike a dict, index items don't have to be unique.
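The dict analogy, and the point where it breaks down, can be sketched like this (illustrative values, assuming a current pandas release):

```python
import pandas as pd

# The label "a" appears twice; a dict could not hold this.
s = pd.Series([1, 2, 3], index=["a", "b", "a"])

# Membership tests look at the index, like dict keys.
has_b = "b" in s

# Selecting a repeated label returns every matching entry as a Series.
dup = s["a"]

# The index reports whether its labels are unique.
unique = s.index.is_unique

print(has_b, unique)
print(dup)
```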

Series - Operations: Filtering. NumPy-type operations on data.
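Filtering and NumPy-style operations on a Series might look like the sketch below; the data is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Filtering: a boolean mask keeps only the matching entries.
positive = s[s > 0]

# Arithmetic is element-wise, as with a NumPy array.
doubled = s * 2

# NumPy ufuncs accept a Series and preserve its index.
roots = np.sqrt(positive)

print(positive)
print(doubled)
print(roots)
```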

Series - Incomplete data: pandas can accommodate incomplete data.
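One way incomplete data shows up, and a few ways to handle it, are sketched here. The state names and figures are illustrative assumptions.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000}
states = ["California", "Ohio", "Texas"]

# "California" has no entry in sdata, so its value is NaN (missing).
s = pd.Series(sdata, index=states)

missing = s.isnull()   # boolean mask of missing entries
filled = s.fillna(0)   # replace missing values with a default
dropped = s.dropna()   # discard missing values

print(s)
```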

Series - Automatic alignment: Unlike in a NumPy ndarray, data is automatically aligned.
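Alignment by label rather than by position can be sketched as follows (made-up values; `fill_value` is shown as one optional way to treat labels present on only one side):

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])

# Values are matched by label, not by position.
# Labels found in only one operand produce NaN.
total = a + b

# Treat a label missing from one side as 0 instead of NaN.
total_filled = a.add(b, fill_value=0)

print(total)
print(total_filled)
```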

Data Structures: DataFrame. Spreadsheet-like data structure containing an ordered collection of columns. Has both a row and a column index. Consider it a dict of Series (with a shared index).

DataFrame: Creation with dict of equal-length lists
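Creation from a dict of equal-length lists might look like this; the column names and values are assumptions chosen for illustration.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

# Each dict key becomes a column; rows get a default integer index.
frame = pd.DataFrame(data)

# The columns argument fixes the column order.
ordered = pd.DataFrame(data, columns=["year", "state", "pop"])

print(frame)
print(ordered)
```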

DataFrame: Creation with dict of dicts
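With a dict of dicts, the outer keys become columns and the inner keys become row labels. A minimal sketch with illustrative data:

```python
import pandas as pd

pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

# Outer keys -> columns, inner keys -> row index.
# Nevada has no entry for 2000, so that cell is NaN.
frame = pd.DataFrame(pop)

print(frame)
```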

DataFrame: Columns can be retrieved as Series (dict notation or attribute notation). Rows can be retrieved by position or by name (using the ix attribute).
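Column and row retrieval can be sketched as below. Note one swap: the `ix` attribute named on the slide was removed in pandas 1.0, so this sketch uses `loc` (by name) and `iloc` (by position) instead. The data is illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"year": [2000, 2001, 2002], "population": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

# Columns come back as Series.
by_key = frame["year"]   # dict notation
by_attr = frame.year     # attribute notation (valid Python names only)

# Rows by name and by position (loc/iloc replace the older ix).
row_by_name = frame.loc["two"]
row_by_pos = frame.iloc[1]

print(by_key)
print(row_by_name)
```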

DataFrame: New columns can be added (by computation or direct assignment).
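Adding columns by direct assignment and by computation might look like this (column names and values are assumptions for the example):

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Nevada", "Ohio"], "pop": [1.5, 2.4, 3.6]}
)

# Direct assignment: a scalar is broadcast to every row.
frame["country"] = "USA"

# Computation: derive a column from an existing one.
frame["eastern"] = frame["state"] == "Ohio"

# Assigning a Series aligns on the index; unmatched rows get NaN.
frame["debt"] = pd.Series([-1.2, -1.5], index=[0, 2])

print(frame)
```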

DataFrame - Reindexing: Creation of a new object with the data conformed to a new index.
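Reindexing is sketched here on a Series for brevity; `DataFrame.reindex` follows the same idea. The values are illustrative.

```python
import pandas as pd

s = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# reindex returns a new object; labels absent from s become NaN.
reordered = s.reindex(["a", "b", "c", "d"])

# fill_value supplies a default for the new labels.
filled = s.reindex(["a", "b", "c", "d"], fill_value=0)

print(reordered)
print(filled)
```

The original `s` is left unchanged, which is what "creation of a new object" refers to.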

Functionality: Summarizing and Descriptive Statistics
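A few of the summary methods can be sketched as follows, using made-up data that includes missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.4, 7.1, np.nan, 0.75], "two": [np.nan, -4.5, np.nan, -1.3]},
    index=["a", "b", "c", "d"],
)

# Column sums; NaN values are skipped by default.
col_sums = df.sum()

# Row means (axis=1); a row with only NaN yields NaN.
row_means = df.mean(axis=1)

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = df.describe()

print(col_sums)
print(row_means)
print(summary)
```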

Functionality: Boolean indexing
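Boolean indexing on a DataFrame might look like this sketch (illustrative data and column names):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "state": ["Ohio", "Nevada", "Ohio", "Texas"],
        "population": [1.5, 2.4, 3.6, 0.9],
    }
)

# A boolean Series selects the rows where the condition holds.
big = df[df["population"] > 1.0]

# Combine conditions with & and |, wrapping each in parentheses.
big_ohio = df[(df["state"] == "Ohio") & (df["population"] > 2.0)]

print(big)
print(big_ohio)
```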

Data Loading: pandas supports several ways to handle data loading. Text file data: read_csv, read_table. Structured data (JSON, XML, HTML): works well with existing libraries. Excel: depends upon the xlrd and openpyxl packages. Database: pandas.io.sql module (read_frame).
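The text-file and database paths can be sketched with in-memory inputs so that no files are needed. Two assumptions to flag: the sample data is invented, and because the `read_frame` helper named on the slide belongs to older pandas releases, this sketch swaps in `pandas.read_sql_query` against an in-memory SQLite database.

```python
import io
import sqlite3

import pandas as pd

# Comma-separated text via read_csv.
csv_text = "name,score\nann,90\nbob,85\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-separated text via read_table (its default separator is a tab).
tsv_text = "name\tscore\nann\t90\nbob\t85\n"
from_table = pd.read_table(io.StringIO(tsv_text))

# Database query results as a DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("ann", 90), ("bob", 85)])
from_sql = pd.read_sql_query("SELECT * FROM scores", conn)
conn.close()

print(from_csv)
print(from_sql)
```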

Plotting

What else? Data Aggregation: GroupBy, Pivot Tables. Time Series: Periods/Frequencies, Operations with Time Series with Different Frequencies, Downsampling/Upsampling, Plotting with TimeSeries (auto-adjust scale). Advanced Analysis: Decile and Quartile Analysis, Signal Frontier Analysis, Future Contract Rolling, Rolling Correlation and Linear Regression.
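Three of the items listed above (GroupBy, pivot tables, and downsampling a time series) can be sketched briefly. The sales figures and dates are invented for illustration, and a current pandas release is assumed.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "amount": [10, 20, 30, 40],
    }
)

# GroupBy: split by region, then sum each group.
by_region = sales.groupby("region")["amount"].sum()

# Pivot table: regions as rows, quarters as columns.
pivot = sales.pivot_table(
    index="region", columns="quarter", values="amount", aggfunc="sum"
)

# Downsampling: daily values summed into two-day bins.
daily = pd.Series(
    range(6), index=pd.date_range("2014-01-01", periods=6, freq="D")
)
two_day = daily.resample("2D").sum()

print(by_region)
print(pivot)
print(two_day)
```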

Questions?

pandas - Bibliography: Python Data Analysis Library & pandas: Python Data Analysis Library, http://pandas.pydata.org/; pandas - Python Data Analysis, 984889; Getting started with pandas, ed-with-pandas; IPython, http://ipython.org/ and http://en.wikipedia.org/wiki/IPython

pandas - Bibliography: NumPy, http://www.numpy.org/; SciPy, http://scipy.org/; Matplotlib, http://matplotlib.org/; Data Munging or Data Wrangling, /Data wrangling

pandas - Bibliography: Cython, ipedia.
