Programming Principles In Python (CSCI 503)

Transcription

Programming Principles in Python (CSCI 503)
Data
Dr. David Koop
D. Koop, CSCI 503/490, Fall 2021

CPU-Bound vs. I/O-Bound
[Figure: J. Anderson]

Threading
- Threading addresses I/O waits by letting separate pieces of a program run at the same time
- Threads run in the same process
- Threads share the same memory (and global variables)
- The operating system schedules threads; it can manage when each thread runs, e.g. round-robin scheduling
- When one thread blocks for I/O, other threads can run
[J. Anderson]
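A minimal sketch of the idea on this slide, assuming a hypothetical fake_download function in which time.sleep stands in for a blocking I/O call. The function name, the 0.2 s delay, and the pool size are illustrative choices, not part of the lecture.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(n):
    # Hypothetical stand-in for blocking I/O; while this thread sleeps,
    # the other threads in the pool are free to run.
    time.sleep(0.2)
    return n * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    # map returns results in the order of the inputs
    results = list(executor.map(fake_download, range(5)))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f} s")
```

Because the five waits overlap, the elapsed time should be close to one delay rather than the sum of all five, although the exact figure depends on the machine.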

Python Threading Speed
- If I/O bound, threads work great because time spent waiting can now be used by other threads
- Threads do not run simultaneously in standard Python, i.e. they cannot take advantage of multiple cores
- Use threads when code is I/O bound; otherwise there is no real speed-up, plus some overhead for using threads

Python and the GIL
- Solves the problem of keeping reference counts (used for garbage collection) safe across threads
- Could add locking to every value/data structure, but with multiple locks comes possible deadlock
- Python instead has a Global Interpreter Lock (GIL) that must be acquired to execute any Python code
- This effectively makes Python single-threaded (which keeps single-threaded execution fast)
- Python requires threads to give up the GIL after a certain amount of time
- Python 3 improved allocation of the GIL to threads by not allowing a single CPU-bound thread to hog it
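The "certain amount of time" after which a thread is asked to give up the GIL can be inspected in CPython through sys.getswitchinterval(). The sketch below only reads the value; it assumes the standard CPython interpreter.

```python
import sys

# Interval, in seconds, after which the interpreter asks the running
# thread to release the GIL so that another thread can be scheduled.
interval = sys.getswitchinterval()
print(f"switch interval: {interval} s")
```

The value can be changed with sys.setswitchinterval(), but that is rarely needed.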

Multiprocessing
- Multiple processes do not need to share the same memory and interact less
- Python makes the difference between processes and threads minimal in most cases
- Big win: can take advantage of multiple cores!

Multiprocessing using concurrent.futures

    import concurrent.futures
    import multiprocessing as mp
    import time

    def dummy(num):
        # simulate slow work, then return the square
        time.sleep(5)
        return num ** 2

    # five worker processes, started with the 'fork' method
    with concurrent.futures.ProcessPoolExecutor(
            max_workers=5, mp_context=mp.get_context('fork')) as executor:
        results = executor.map(dummy, range(10))

- mp.get_context('fork') changes the start method from 'spawn', which is the default on macOS; 'fork' works in a notebook (it is not available on Windows)

asyncio
- Single event loop that controls when each task is run
- Tasks can be ready or waiting
- Tasks are not interrupted like they are with threading
  - Task controls when control goes back to the main event loop
  - Either waiting or complete
- Event loop keeps track of whether tasks are ready or waiting
  - Re-checks to see if new tasks are now ready
  - Picks the task that has been waiting the longest
- async and await keywords
- Requires support from libraries (e.g. aiohttp)
[J. Anderson]
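A minimal sketch of cooperative multitasking with async and await. The fetch coroutine is hypothetical: asyncio.sleep stands in for a network call made through an async-aware library, and the names and delays are illustrative.

```python
import asyncio

async def fetch(name, delay):
    # await hands control back to the event loop while this task waits
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # gather runs the coroutines as tasks on one event loop and
    # returns their results in the order the coroutines were passed
    return await asyncio.gather(
        fetch("a", 0.2), fetch("b", 0.1), fetch("c", 0.3))

results = asyncio.run(main())
print(results)
```

asyncio.run starts a new event loop, so in an environment that already runs one (such as a notebook) the coroutine would be awaited directly instead.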

When to use threading, asyncio, or multiprocessing?
- If your code has a lot of I/O or network usage:
  - If there is library support, use asyncio
  - Otherwise, multithreading is your best bet (lower overhead)
- If you have a GUI:
  - Multithreading, so your UI thread doesn't get locked up
- If your code is CPU bound:
  - You should use multiprocessing (if your machine has multiple cores)
[J. Anderson]

Concurrency Comparison
Concurrency Type                        Switching Decision                                                       Number of Processors
Pre-emptive multitasking (threading)    The operating system decides when to switch tasks external to Python.    1
Cooperative multitasking (asyncio)      The tasks decide when to give up control.                                1
Multiprocessing (multiprocessing)       The processes all run at the same time on different processors.          Many
[J. Anderson]

pandas
- Contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python
- Built on top of NumPy
- Built with the following requirements:
  - Data structures with labeled axes (aligning data)
  - Support time series data
  - Do arithmetic operations that include metadata (labels)
  - Handle missing data
  - Add merge and relational operations

Pandas Code Conventions
- Universal:
  - import pandas as pd
- Also used:
  - from pandas import Series, DataFrame

Series
- A one-dimensional array (with a type) with an index
- Index defaults to numbers but can also be text (like a dictionary)
- Allows easier reference to specific items
- obj = pd.Series([7, 14, -2, 1])
- Basically two arrays: obj.values and obj.index
- Can specify the index explicitly and use strings
- obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
- Kind of like a fixed-length, ordered dictionary; can create from a dictionary
- obj3 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
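The Series from this slide can be explored as follows. This is a small sketch that reuses obj2 and obj3 as defined above; the boolean filter at the end is an added illustration.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# A Series is essentially two aligned arrays
labels = obj2.index.tolist()    # ['d', 'b', 'a', 'c']
values = obj2.values.tolist()   # [4, 7, -5, 3]

# Label-based lookup, much like a dictionary
a_value = obj2['a']

# Filtering keeps the link between label and value
positive = obj2[obj2 > 0]

# A dictionary's keys become the index
obj3 = pd.Series({'Ohio': 35000, 'Texas': 71000,
                  'Oregon': 16000, 'Utah': 5000})
has_texas = 'Texas' in obj3     # membership tests the index labels

print(labels, values, a_value, has_texas)
print(positive)
```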

Series
- Indexing: s[1]
- Can check for missing data: notnull(s)
- Both index and values can have an associated name:
  - s.name = 'population'; s.index.name = 'state'
- Addition and NumPy operations preserve the index-value link
- Arithmetic operations align:
  - A critical Series feature for many applications is that it automatically aligns differently-indexed data in arithmetic operations
[Figure: book excerpt showing obj3, obj4, and obj3 + obj4; California and Utah, which each appear in only one of the two Series, are NaN in the sum]
[W. McKinney, Python for Data Analysis, Chapter 5: Getting Started with pandas]
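The alignment behavior can be sketched as below. obj3 is taken from the earlier Series slide; obj4 is a reconstruction that assumes the same data reindexed so that 'California' is present and 'Utah' is absent. Positional access uses .iloc here because recent pandas versions deprecate treating an integer key as a position on a label-indexed Series.

```python
import pandas as pd

obj3 = pd.Series({'Ohio': 35000, 'Texas': 71000,
                  'Oregon': 16000, 'Utah': 5000})
# Assumed reconstruction of obj4: 'California' has no data, so it is NaN
obj4 = obj3.reindex(['California', 'Ohio', 'Oregon', 'Texas'])

# Names can be attached to the values and to the index
obj3.name = 'population'
obj3.index.name = 'state'

second = obj3.iloc[1]        # positional access
missing = obj4.isnull()      # True only for 'California'

# Addition aligns on labels; a label missing from either side gives NaN
total = obj3 + obj4
print(total)
```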

Data Frame
- A dictionary of Series (labels for each series)
- A spreadsheet with row keys (the index) and column headers
- Has an index shared with each series
- Allows easy reference to any cell
- df = DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada'], 'year': [2000, 2001, 2002, 2001], 'pop': [1.5, 1.7, 3.6, 2.4]})
- Index is automatically assigned, just as with a Series, but can also be passed in via the index kwarg
- Can set the column order (or, for array input, the column names) by passing the columns kwarg
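A short sketch of the constructor call on this slide, plus the index and columns keyword arguments. The row labels 'one' through 'four' are illustrative and not from the lecture.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2002, 2001],
        'pop': [1.5, 1.7, 3.6, 2.4]}

# Default: integer index 0..3, columns in dictionary order
df = pd.DataFrame(data)

# With dict input, columns= selects and orders the columns;
# index= supplies the row labels (hypothetical labels here)
df2 = pd.DataFrame(data,
                   columns=['year', 'state', 'pop'],
                   index=['one', 'two', 'three', 'four'])

print(df)
print(df2)
```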

DataFrame Constructor Inputs
Table 5-1. Possible data inputs to DataFrame constructor
Type: Notes
- 2D ndarray: A matrix of data, passing optional row and column labels
- dict of arrays, lists, or tuples: Each sequence becomes a column in the DataFrame. All sequences must be the same length.
- NumPy structured/record array: Treated as the "dict of arrays" case
- dict of Series: Each value becomes a column. Indexes from each Series are unioned together to form the result's row index if no explicit index is passed.
- dict of dicts: Each inner dict becomes a column. Keys are unioned to form the row index as in the "dict of Series" case.
- list of dicts or Series: Each item becomes a row in the DataFrame. Union of dict keys or Series indexes become the DataFrame's column labels
- List of lists or tuples: Treated as the "2D ndarray" case
- Another DataFrame: The DataFrame's indexes are used unless different ones are passed
- NumPy MaskedArray: Like the "2D ndarray" case except masked values become NA/missing in the DataFrame result
[W. McKinney, Python for Data Analysis]
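Three of the input types in the table can be sketched as follows. All data values and labels here are made up for illustration.

```python
import numpy as np
import pandas as pd

# dict of dicts: each inner dict becomes a column,
# and the inner keys are unioned to form the row index
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
by_column = pd.DataFrame(pop)

# list of dicts: each dict becomes a row,
# and the union of the keys becomes the column labels
by_row = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'],
                          columns=['x', 'y', 'z'])

print(by_column)
print(by_row)
print(from_array)
```

Entries that a given inner dict or row does not supply (for example Nevada in 2000) come out as missing values.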

DataFrame Access and Manipulation
- df.values: 2D NumPy array
- Accessing a column:
  - df["<column>"]
  - df.<column>
  - Both return Series
  - Dot syntax only works when the column name is a valid identifier
- Assigning to a column:
  - df["<column>"] = <scalar>  # all cells set to same value
  - df["<column>"] = <array>   # values set in order
  - df["<column>"] = <series>  # values set according to match between df and series indexes
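The access and assignment forms listed above can be sketched on the state/year/pop frame from the earlier slide. The new column names 'debt', 'order', and 'bonus' and their values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001],
                   'pop': [1.5, 1.7, 3.6, 2.4]})

years = df['year']    # bracket access returns a Series
states = df.state     # dot access works: 'state' is a valid identifier

# Dot access also fails quietly when a column name collides with a
# DataFrame attribute: df.pop is the DataFrame.pop method, not the column.
pop_is_method = callable(df.pop)

df['debt'] = 0.0                  # scalar: every cell gets the same value
df['order'] = [3, 2, 1, 4]        # list/array: values set in order
# Series: values are matched on the index; unmatched rows become NaN
df['bonus'] = pd.Series([10.0, 20.0], index=[1, 3])

print(df)
```

Bracket access is the safer default because it never collides with method names.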

Data Frame
[Figure: annotated DataFrame display with the following callouts]
- Column Names
- Index
- Row: df.loc[2]
- Column: df['Island']
- Cell: df.loc[341, 'Species']
- Missing Data
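The callouts in the figure correspond to the following lookups. The three-row frame below is a hypothetical stand-in for the dataset on the slide: only the column names 'Species' and 'Island' come from the slide, and the 'Body Mass (g)' column and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real dataset on the slide is much larger
df = pd.DataFrame({'Species': ['Adelie', 'Gentoo', 'Chinstrap'],
                   'Island': ['Torgersen', 'Biscoe', 'Dream'],
                   'Body Mass (g)': [3750.0, np.nan, 3800.0]})

row = df.loc[2]                 # row with index label 2, as a Series
column = df['Island']           # one column, as a Series
cell = df.loc[1, 'Species']     # single cell: row label, column label

# Missing data is stored as NaN and can be detected with isnull()
missing = df['Body Mass (g)'].isnull()

print(row)
print(cell)
print(missing)
```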

DataFrame Index
- Similar to index for Series
- Immutable
- Can be shared with multiple structures (DataFrames or Series)
- in operator works with: 'Ohio' in df.index
- Can choose new index column(s) with set_index()
- reindex creates a new object with the data conformed to the new index
  - obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
  - can fill in missing values in different ways
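A sketch of the index operations named on this slide. The Series values and the small state frame are illustrative, and fill_value is used as one of the ways to fill in missing values.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# reindex returns a new object; the new label 'e' has no data, so it is NaN
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
# One way to fill the gaps: supply a fill value
obj3 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

df = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
by_state = df.set_index('state')     # promote a column to the index
has_ohio = 'Ohio' in by_state.index

# Index objects are immutable: item assignment raises TypeError
try:
    by_state.index[0] = 'Texas'
    index_is_immutable = False
except TypeError:
    index_is_immutable = True

print(obj2)
print(by_state)
```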

Reading & Writing Data in Pandas
Format Type   Data Description         Reader           Writer
text          CSV                      read_csv         to_csv
text          Fixed-Width Text File    read_fwf
text          JSON                     read_json        to_json
text          HTML                     read_html        to_html
text          Local clipboard          read_clipboard   to_clipboard
              MS Excel                 read_excel       to_excel
binary        OpenDocument             read_excel
binary        HDF5 Format              read_hdf         to_hdf
binary        Feather Format           read_feather     to_feather
binary        Parquet Format           read_parquet     to_parquet
binary        ORC Format               read_orc
binary        Msgpack                  read_msgpack     to_msgpack
binary        Stata                    read_stata       to_stata
binary        SAS                      read_sas
binary        SPSS                     read_spss
binary        Python Pickle Format     read_pickle      to_pickle
SQL           SQL                      read_sql         to_sql
SQL           Google BigQuery          read_gbq         to_gbq
[...ser_guide/io.html]

read_csv
- Convenient method to read CSV files
- Lots of different options to help get data into the desired format
- Basic: df = pd.read_csv(fname)
- Parameters:
  - path: where to read the data from (the first argument, named filepath_or_buffer in pandas)
  - sep (or delimiter): the delimiter (',', ' ', '\t', '\s+')
  - header: if None, no header
  - index_col: which column to use as the row index
  - names: list of header names (e.g. if the file has no header)
  - skiprows: number or list of lines to skip
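The parameters above can be combined as in this sketch. To stay self-contained it reads from an in-memory string through io.StringIO rather than a file; the data and the column names 'id', 'state', and 'pop' are made up.

```python
import io
import pandas as pd

# Hypothetical file contents: one junk line, then rows without a header
text = "exported by some tool\n1,Ohio,1.5\n2,Nevada,2.4\n"

df = pd.read_csv(io.StringIO(text),
                 sep=',',
                 skiprows=1,                     # skip the junk line
                 header=None,                    # the data has no header row
                 names=['id', 'state', 'pop'],   # so supply column names
                 index_col='id')                 # use 'id' as the row index

print(df)
```

Passing a filename in place of the StringIO object works the same way.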

Writing CSV data with pandas
- Basic: df.to_csv(<fname>)
- Change delimiter with sep kwarg:
  - df.to_csv('example.dsv', sep=' ')
- Change missing value representation:
  - df.to_csv('example.dsv', na_rep='NULL')
- Don't write row or column labels:
  - df.to_csv('example.csv', index=False, header=False)
- Series may also be written to CSV
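A sketch of the to_csv options above, writing to an in-memory buffer instead of a file so the output can be inspected directly. The frame contents and the '|' delimiter are illustrative choices.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', 'y']})

# Custom delimiter and missing-value marker, without the row index
buf = io.StringIO()
df.to_csv(buf, sep='|', na_rep='NULL', index=False)
delimited_lines = buf.getvalue().splitlines()

# With no path argument, to_csv returns the CSV text as a string;
# here neither row nor column labels are written
bare_lines = df.to_csv(index=False, header=False).splitlines()

# A Series can be written the same way
series_lines = df['b'].to_csv(header=False).splitlines()

print(delimited_lines)
print(bare_lines)
print(series_lines)
```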

Documentation
- pandas documentation is pretty good
- Lots of recipes on Stack Overflow for particular data manipulations/queries

Food Inspections Example

Programming Principles in Python (CSCI