Getting Started With Analysis In Python: NumPy, Pandas And .

Transcription

Getting Started with Analysis in Python: NumPy, Pandas and Plotting
Bioinformatics and Research Computing (BaRC)
http://barc.wi.mit.edu/hot topics/

Python Packages
Efficient and reusable
– Avoid re-writing code
– More flexibility
Use the "import" command to use a package
import numpy as np
Packages covered in this workshop:
– NumPy
– Pandas
– Graphical: matplotlib, plotly and seaborn
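As a minimal sketch of the "import" command described above: the alias `np` is the usual convention, and the numbers passed in are made up for illustration.

```python
# Import a whole package under a short alias.
import numpy as np

# Functions are then reached through the alias.
average = np.mean([1, 2, 3])
print(average)        # 2.0

# Alternatively, import a single name from a package.
from numpy import sqrt

root = sqrt(16)
print(root)           # 4.0
```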

Harris, C.R., et al. Array Programming with NumPy. Nature (2020)

NumPy
Numerical Python
Efficient multidimensional array processing and operations
– Linear algebra (matrix operations)
– Mathematical functions
An array is a type of data structure
Array elements (objects) must all be of the same type
import numpy as np
np.array([1,2,3,4],float)
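The points above can be sketched as follows. The first array is the one from the slide; the other values are illustrative assumptions chosen so the results are easy to check by hand.

```python
import numpy as np

# Create a 1-D array; the second argument sets the element type (dtype).
a = np.array([1, 2, 3, 4], float)
print(a.dtype)    # float64
print(a)          # [1. 2. 3. 4.]

# Mixed inputs are converted to one common type.
b = np.array([1, 2.5, 3])
print(b.dtype)    # float64

# Linear algebra: product of a 2x2 matrix with a vector.
m = np.array([[1, 2], [3, 4]])
v = np.array([1, 1])
product = m @ v
print(product)    # [3 7]

# Mathematical functions apply to every element.
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))
print(roots)      # [1. 2. 3.]
```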

(NumPy) Array Concepts
Harris, C.R., et al. Array Programming with NumPy. Nature (2020)

(NumPy) Array Concepts
Index: refers to individual elements, or subarrays, and lets users interact with arrays
– slices
Shape: the number of elements along each axis, which determines the dimensions
Vectorization: array programming, where operations act on the entire array rather than on individual elements
Harris, C.R., et al. Array Programming with NumPy. Nature (2020)
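The three concepts above (index, shape, vectorization) might look like this on a small array. The 3x4 array is an assumption made for illustration, not the one from the slide's figure.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, values 0..11

# Shape: number of elements along each axis.
print(x.shape)        # (3, 4)

# Index: a single element, a row, or a column.
print(x[1, 2])        # 6
print(x[0])           # [0 1 2 3]
print(x[:, 1])        # [1 5 9]

# Vectorization: one expression operates on the whole array.
doubled = x * 2
print(doubled[2])     # [16 18 20 22]

# Reductions can run along one axis (here: down each column).
col_sums = x.sum(axis=0)
print(col_sums)       # [12 15 18 21]
```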

NumPy: Slicing
McKinney, W., Python for Data Analysis, 2nd Ed. (2017)
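A short sketch of two-dimensional slicing, assuming a 3x3 array with made-up values. Each slice is written as rows first, then columns.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

upper_right = arr[:2, 1:]    # rows 0-1, columns 1-2
print(upper_right)           # [[2 3]
                             #  [5 6]]

last_row = arr[2]            # a single row
print(last_row)              # [7 8 9]

first_cols = arr[:, :2]      # all rows, first two columns
print(first_cols.shape)      # (3, 2)

# A slice is a view: changing it also changes the original array.
top = arr[:1]
top[0, 0] = 100
print(arr[0, 0])             # 100
```

Use `arr[:1].copy()` when an independent copy is needed.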

Pandas
Efficient for processing tabular, or panel, data
Built on top of NumPy
Data structures: Series and DataFrame (DF)
– Series: one-dimensional, same data type
– DataFrame: two-dimensional, columns of different data types
– index can be integer (0, 1, ...) or non-integer ('GeneA', 'GeneB', ...)
[Figure: example DataFrame with gene names (GeneB, GeneC, GeneD, GeneE) as the index, with axis 0 and axis 1 labeled]
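A minimal sketch of the two data structures. The gene names, tissues and expression values are hypothetical and are not taken from the slide's figure.

```python
import pandas as pd

# Series: one-dimensional, one data type, here indexed by gene name.
s = pd.Series([0.44, 5.21, 0.0], index=["GeneA", "GeneB", "GeneC"])
print(s["GeneB"])      # 5.21

# DataFrame: two-dimensional; each column can have its own data type.
df = pd.DataFrame(
    {"tissue": ["Heart", "Adipose", "Heart"],
     "expression": [0.44, 5.21, 0.0]},
    index=["GeneA", "GeneB", "GeneC"],
)
print(df.dtypes)
print(df.shape)        # (3, 2)

# Without an explicit index, pandas uses integers 0, 1, ...
default_index = pd.Series([10, 20]).index.tolist()
print(default_index)   # [0, 1]
```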

What can you do with a Pandas DataFrame?
Filter
– Select rows/columns
Sort
Numerical or Mathematical operations (e.g. mean)
Group by column(s)
Many more
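The filter, sort and mean operations listed above could be sketched like this, assuming a small hypothetical expression table.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["GeneA", "GeneB", "GeneC", "GeneD"],
    "tissue": ["Heart", "Heart", "Adipose", "Adipose"],
    "expression": [0.5, 12.0, 3.0, 8.5],
})

# Filter rows with a boolean condition, then select columns by name.
high = df[df["expression"] > 2.0]
print(high[["gene", "expression"]])

# Sort by a column, largest first.
ranked = df.sort_values("expression", ascending=False)
print(ranked["gene"].tolist())   # ['GeneB', 'GeneD', 'GeneC', 'GeneA']

# Numerical operation on a column.
mean_expr = df["expression"].mean()
print(mean_expr)                 # 6.0
```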

DataFrame Slicing: Selecting Data
loc: select by row or column names, e.g. "Gene", "GTEX-117F"
iloc: select by integer location, i.e. column or row number, e.g. 1, 2, 3
[Table: example expression DataFrame with columns Ensembl ID, Gene and GTEx sample IDs; rows include ENSG00000223972 (DDX11L1), ENSG00000227232 (WASH7P), ENSG00000243485 (MIR1302-11), ENSG00000237613 (FAM138A), ENSG00000268020 (OR4G4P) and ENSG00000186092 (OR4F5)]
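A sketch of the difference between `loc` and `iloc`. The row labels (`ID1`...), the sample column names and the values are placeholders, not the GTEx data from the slide.

```python
import pandas as pd

df = pd.DataFrame(
    {"Gene": ["GeneA", "GeneB", "GeneC"],
     "Sample1": [0.1, 21.4, 0.0],
     "Sample2": [0.2, 16.8, 0.0]},
    index=["ID1", "ID2", "ID3"],
)

# loc: select by row and column labels.
by_label = df.loc["ID2", "Gene"]
print(by_label)                          # GeneB
print(df.loc[:, ["Gene", "Sample1"]])    # all rows, two named columns

# iloc: select by integer position (row 1, column 0).
by_position = df.iloc[1, 0]
print(by_position)                       # GeneB
print(df.iloc[0:2, 1:3])                 # first two rows, columns 1 and 2

# loc slices include the end label; iloc slices exclude the end position.
print(df.loc["ID1":"ID2"].shape)         # (2, 3)
print(df.iloc[0:1].shape)                # (1, 3)
```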

Data Formatting/Organizing
By default, Pandas, and other packages, expect your data formatted such that each column represents a variable, and each row represents an observation
https://pandas.pydata.org/Pandas Cheat Sheet.pdf

Data Format Example
[Table: example expression values by Gene and tissue (Adipose, Heart); cell values are not recoverable from the transcript]
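One way to reach the column-per-variable, row-per-observation layout is `DataFrame.melt`. This is a sketch with hypothetical values; the slides do not say which method the workshop used.

```python
import pandas as pd

# Wide layout: one row per gene, one column per tissue (made-up values).
wide = pd.DataFrame({
    "Gene": ["GeneA", "GeneB"],
    "Adipose": [0.1, 11.6],
    "Heart": [0.05, 9.9],
})

# Long layout: each row is one observation (gene, tissue, expression).
tidy = wide.melt(id_vars="Gene", var_name="Tissue", value_name="Expression")
print(tidy)
print(tidy.shape)    # (4, 3)
```

In the long layout, Tissue is an ordinary column, so it can be used directly for filtering, grouping or as a plotting variable.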

Pandas - groupby
Split, Apply and Combine
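The split, apply and combine steps can be sketched in a single `groupby` call. The tissue names and values are assumptions chosen so the group means are easy to verify.

```python
import pandas as pd

df = pd.DataFrame({
    "Tissue": ["Heart", "Heart", "Adipose", "Adipose"],
    "Expression": [2.0, 4.0, 10.0, 20.0],
})

# Split rows by Tissue, apply mean to each group, combine into one Series.
means = df.groupby("Tissue")["Expression"].mean()
print(means["Heart"])      # 3.0
print(means["Adipose"])    # 15.0

# Several aggregations at once give one column per function.
summary = df.groupby("Tissue")["Expression"].agg(["mean", "max"])
print(summary)
```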

Plotting
Matplotlib
Seaborn
Plotly
