Textual Analysis & Introduction To Python

Transcription

Textual Analysis & Introduction to Python
Feb 18 2016
CSCI 0931 - Intro. to Comp. for the Humanities and Social Sciences

Tuesday’s Class
– Wrap-up Tuesday’s Class
  – Using Matrix Multiplication (MMULT)
  – What if we use counting? So much more difficult to build the 2-D difference and sum map

Today’s Class
– Intro to text analysis problems
– Intro to Python

Text Analysis and Python
We’re starting a new unit in our course!

Textual Analysis
[Diagram: Define Problem, Find Data, Write a set of instructions, Computer, Solution]
Define Problem:
– Build a concordance of a text
– Locations of words
– Frequency of words

Concordances
Alphabetical index of all words in a text
[Table: Word, Page; entry: ate9]

Concordances
– Before computers, building one was a huge pain.
– What texts might have had concordances?
  – The Bible
  – The Quran
  – The Vedas
  – Shakespeare
Not a “New” Problem: First Bible Concordance completed in 1230
http://en.wikipedia.org/wiki/Concordance_(publishing)

Concordances
– How long would the King James Bible take us?
  – 783,137 words
800,000 * (3 min. to look up word and put page #)
= 2,400,000 minutes
= 40,000 hours
≈ 1,667 days
≈ 4.5 years
Takes 70 hours to read the King James Bible aloud
http://agards-bible-timeline.com/q10 bible-facts.html
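
The estimate above can be checked as a short calculation. This is only a sketch of the slide's arithmetic: the variable names are invented here, and the round-the-clock working day and 365-day year are assumptions.

```python
# Rough estimate: time to build a King James Bible concordance by hand.
actual_words = 783_137     # word count quoted on the slide
rounded_words = 800_000    # the slide rounds up to keep the arithmetic easy
minutes_per_word = 3       # look up the word and write down the page number

total_minutes = rounded_words * minutes_per_word
total_hours = total_minutes / 60
total_days = total_hours / 24      # assumption: working 24 hours a day
total_years = total_days / 365     # assumption: 365-day year

print(total_minutes)               # 2400000
print(total_hours)                 # 40000.0
print(round(total_days))           # 1667
print(round(total_years, 1))       # 4.6 (the slide quotes this as 4.5)
```

Using the exact word count instead of the rounded one changes the answer only slightly, which is why the rounding is harmless for a back-of-the-envelope estimate.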

Strong’s Concordance
– Concordance of the King James Bible
– Published in 1890 by James Strong
Wikipedia
http://www.christianbook.com/reader/?item no 563788

From Concordance to Word Frequency
Suppose our text has 1000 words total.
[Table: Word, Page Numbers, #]
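
To make the step from concordance to word frequency concrete, here is a minimal sketch in Python. It is not the course's code: the sample sentence is made up, the names `concordance` and `frequency` are hypothetical, and word positions stand in for page numbers because a plain string has no pages.

```python
import re
from collections import defaultdict

# Hypothetical sample text; a real run would read a whole book from a file.
text = "the cat ate the fish and the dog ate the bone"

# Split the text into lowercase words.
words = re.findall(r"[a-z']+", text.lower())

# Concordance: each word mapped to the list of positions where it occurs.
concordance = defaultdict(list)
for position, word in enumerate(words):
    concordance[word].append(position)

# Word frequency: number of occurrences divided by the total word count.
total = len(words)
for word in sorted(concordance):       # alphabetical, like a printed index
    positions = concordance[word]
    frequency = len(positions) / total
    print(word, positions, len(positions), round(frequency, 3))
```

The frequency column is just the length of each position list divided by the total, so once the concordance exists the frequencies come almost for free.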

Google Ngrams
– Google (verb) “Google n-grams”
– ngram: a sequence of n consecutive words
  – “hello” is a 1-gram
  – “hello there” is a 2-gram
– Click on “Google Ngram viewer” for more information
– Question: what is the data source here?
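
As an illustration of the definition, n-grams can be pulled out of a list of words with a few lines of Python. The function name `ngrams` and the sample phrase are assumptions made for this sketch; it says nothing about how Google builds its own data.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "hello there class".split()

one_grams = ngrams(words, 1)
two_grams = ngrams(words, 2)

print(one_grams)   # [('hello',), ('there',), ('class',)]
print(two_grams)   # [('hello', 'there'), ('there', 'class')]
```

A text of k words has k - n + 1 n-grams, which is why the three-word phrase yields three 1-grams but only two 2-grams.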

Textual Analysis
Define Problem:
– Build a concordance of a text
– Locations of words
– Frequency of words
– Word frequencies across time

The Wizard of OZ
– About 40 books, written by 7 different authors
Lyman Frank Baum (1856-1919)
Ruth Plumly Thompson
[Figure labels: #1, #14, #15, #33, #16; Published in 1921]
http://www.ssc.wisc.edu/ zzeng/soc357/OZ.pdf

The Federalist Papers
– 85 articles written in 1787 to promote the ratification of the US Constitution
– In 1944, Douglass Adair guessed authorship
  – Alexander Hamilton (51)
  – James Madison (26)
  – John Jay (5)
  – 3 were a collaboration
– Confirmed in 1964 by a computer analysis
Wikipedia
http://pages.cs.wisc.edu/ gfung/federalist.pdf

Textual Analysis
[Diagram: Define Problem, Find Data, Write a set of instructions, Computer, Solution]
Define Problem:
– Build a concordance of a text
– Locations of words
– Frequency of words
– Word frequencies across time
– Determine authorship
– Count labels to determine liberal media bias

How are we going to analyze texts?
– Excel: Numerical Data, Textual Data
– Python: A Programming Language. Free! Textual Data
[Image labels: firehow.com; Makita Cordless Chain Saw, 270; 9poundhammer.blogspot.com]

Textual Analysis
[Diagram: Define Problem, Find Data, Write a set of instructions, Python, Solution]
Define Problem:
– Build a concordance of a text
– Locations of words
– Frequency of words
– Word frequencies across time
– Determine authorship
– Count labels to determine liberal media bias

“Python”
– A language for giving the computer instructions.
– It has syntax and semantics.
– Might say “write a Python program”, meaning “write instructions in the Python language”
– There is an interpreter (which we will use through a program called IDLE) that takes Python instructions and executes them with the CPU, etc.

Install
– Let’s install Python 3.5.x
– www.python.org/downloads/

Install – Mac OS X
– On Mac OS X, double-click the pkg file you downloaded. Follow the instructions, agreeing where asked, and click Next.

Install - Windows

Let’s open IDLE
– On Mac OS X: open a terminal window. Type: idle3
– On Windows: Start - All Programs - Python - IDLE

Introduction to Python
Outline:
1. Expressions
2. Assignments
   a) Variables
3. Types
   a) Integers
   b) Floats
   c) Strings
   d) Lists

– Expressions are inputs that Python evaluates
  – Expressions return an output
  – Like using a calculator
Type the expressions below after ‘>>>’ and hit Enter
>>> 6 + 2
8
2.04 24-24*24/2
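
A small sketch of the calculator idea, written as a script rather than typed at the prompt. The variable names are invented for illustration, and the parenthesized variant is an addition of mine to show that multiplication and division are evaluated before subtraction.

```python
# Each right-hand side is an expression; Python evaluates it to a value.
simple_sum = 6 + 2                     # 8
no_parens = 24 - 24 * 24 / 2           # * and / happen before -, giving -264.0
with_parens = (24 - 24) * 24 / 2       # parentheses first, giving 0.0

print(simple_sum)
print(no_parens)
print(with_parens)
```

At the IDLE prompt the print calls are unnecessary, because the interpreter displays the value of each expression as soon as Enter is pressed.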

Introduction to Python
– Assignments do not have an output; the value is stored in memory.
  – We’ve done this kind of thing with spreadsheets: we have assigned the number 1 to cell A1. Let’s rename cell A1 to x.
>>> x = 1
(x is the variable; 1 is the value)
Memory:
Variable Name    Value
x                1
  – We can now use x in expressions!
>>> x + 1
2
>>> (x + 2)*3
9
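
The spreadsheet comparison holds up to a point. The sketch below (the names y and z are my own additions) shows one difference worth knowing: a spreadsheet formula recalculates when its input cell changes, but a Python variable keeps the value it was given at the time of the assignment.

```python
x = 1              # assignment: nothing is displayed, the value is stored as x
y = x + 1          # the expression is evaluated now, and 2 is stored as y
z = (x + 2) * 3    # 9 is stored as z

x = 10             # give x a new value

# Unlike spreadsheet cells, y and z are not recalculated.
print(x, y, z)     # 10 2 9
```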

Introduction to Python
– You can name your variables anything
>>> numberOfEggs = 100
>>> myNumber = 12345
>>> noninteger = 4.75
– Well, almost anything
  – No spaces, operators, punctuation, or a number in the first position
– Variables usually start with a lowercase letter and, if useful, describe something about the value.
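
One way to test a candidate name without triggering a syntax error is to ask Python directly. This sketch uses the built-in string method isidentifier() and the standard keyword module; the candidate names are made up for illustration. Note that the underscore is the one punctuation character that is allowed, and that reserved words such as class are ruled out even though they look like ordinary names.

```python
import keyword

def is_usable_name(name):
    """True if name could be used as a variable name in Python."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ["numberOfEggs", "number of eggs", "2eggs",
              "eggs+1", "number_of_eggs", "class"]

results = {name: is_usable_name(name) for name in candidates}
for name, ok in results.items():
    print(repr(name), ok)
```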

Choices
– Why are those the rules for names?
– Someone thought about it and made a choice
– Usually based on years of experience
– Many choices seem crazy.
  – Until one day you see they’re obviously correct

Introduction to Python
– Try this:
>>> 3/2
– There are two types of numbers in Python. The type() function is useful.
>>> type(3)
<class 'int'>
>>> type(3/2)
<class 'float'>
>>> type(1.5)
<class 'float'>
– Floats are numbers that display with decimal points.
>>> 3.0/2.0
1.5
General Rule: Expressions for a particular type will output that same type! Except for the division operator (/)
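
The exception for division can be seen in a short sketch. The variable names are mine, and the // operator (floor division) does not appear on the slides; it is included only to show how to stay within integers when that is what you want.

```python
whole = 3
quotient = 3 / 2           # 1.5: the / operator always produces a float
even_quotient = 4 / 2      # 2.0: still a float, even though it divides evenly
floor_quotient = 3 // 2    # 1: floor division of two ints stays an int
mixed = 3 + 2.0            # 5.0: mixing an int with a float gives a float

for value in (whole, quotient, even_quotient, floor_quotient, mixed):
    print(value, type(value))
```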

Introduction to Python
– Strings are sequences of characters, surrounded by single quotes.
>>> 'hi'
'hi'
>>> myString = 'hi there'
>>> myString
'hi there'
– The + operator concatenates
>>> endString = ' class!'
>>> myString + endString
'hi there class!'
>>> newString = myString + endString
>>> newString
'hi there class!'
General Rule: Expressions for a particular type will output that same type!
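
A brief sketch of concatenation, reusing the slide's variable names. The second half (the label variable and the use of str() to convert a number) is my own addition, included because joining a string to a number with + is a common first error.

```python
myString = 'hi there'
endString = ' class!'

newString = myString + endString    # string + string gives a new string
print(newString)                    # hi there class!

# A string and a number cannot be joined with + directly.
try:
    label = 'eggs: ' + 100
except TypeError:
    label = 'eggs: ' + str(100)     # convert the number to a string first
print(label)                        # eggs: 100
```

Concatenation builds a new string; myString and endString keep their original values.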

Introduction to Python
– Lists are an ordered collection of items
>>> [5,10,15]
[5, 10, 15]
>>> myList = [5,10,15]
>>> myList
[5, 10, 15]
>>> stringList = ['hi','there','class']
>>> stringList
['hi', 'there', 'class']
– Individual items are elements
– The + operator concatenates
>>> myList + stringList
[5, 10, 15, 'hi', 'there', 'class']

Introduction to Python
– To get an element from a list, use the expression myList[i] where i is the index.
– Often spoken: “myList sub i”
– List indices start at 0!
>>> myList[0]
5
>>> myList[1]
10
>>> myList[2]
15
– What does myList[1] = 4 do?
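
The question about assigning to myList[1] can be answered by trying it. The sketch below reuses the slide's list; the negative index and the out-of-range check are additions of mine, and the names first, last and out_of_range are invented for illustration.

```python
myList = [5, 10, 15]

first = myList[0]      # indices start at 0, so this is 5
last = myList[-1]      # a negative index counts from the end: 15

myList[1] = 4          # replaces the element at index 1 in place
print(myList)          # [5, 4, 15]

# Asking for an index past the end is an error, not an empty result.
try:
    myList[3]
    out_of_range = False
except IndexError:
    out_of_range = True
print(out_of_range)    # True
```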

Introduction to Python
– To get a range of elements from a list, use the expression myList[i:j] where i is the start index (inclusive) and j is the end index (exclusive).
>>> myList
[5, 4, 15]
>>> myList[0:2]
[5, 4]
>>> myList[1:3]
[4, 15]
>>> newList = [2,5,29,1,9,59,3]
>>> newList
[2, 5, 29, 1, 9, 59, 3]
>>> newList[2:6]
[29, 1, 9, 59]
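
A sketch of why the inclusive-start, exclusive-end rule is convenient. The names middle, head and tail are invented here, and leaving out one side of the colon is a shorthand not shown on the slide.

```python
newList = [2, 5, 29, 1, 9, 59, 3]

middle = newList[2:6]     # indices 2, 3, 4, 5
print(middle)             # [29, 1, 9, 59]
print(len(middle))        # 4, which is 6 - 2

head = newList[:3]        # omitted start means "from the beginning"
tail = newList[3:]        # omitted end means "through the last element"
print(head, tail)         # [2, 5, 29] [1, 9, 59, 3]

# Splitting at the same index and concatenating gives back the whole list.
print(head + tail == newList)   # True
```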

Introduction to Python
– Indexing and ranges also work on Strings.
>>> myString
'hi there'
Single-character results shown: 'h', 'e', 'r'
>>> myString[0:6]
'hi the'
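
A sketch of indexing into the string 'hi there'. The particular indices below are my own choice for illustration, as are the variable names. The last part is an addition: unlike a list, a string cannot be changed by assigning to one of its positions.

```python
myString = 'hi there'

first_char = myString[0]      # 'h'
sixth_char = myString[5]      # 'e' (the space at index 2 counts as a character)
seventh_char = myString[6]    # 'r'
prefix = myString[0:6]        # 'hi the'
print(first_char, sixth_char, seventh_char, prefix)

# Strings are immutable: assigning to an index raises an error.
try:
    myString[0] = 'H'
    can_assign = True
except TypeError:
    can_assign = False
print(can_assign)             # False
```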

Introduction to Python
– Remember what assignments do
Memory:
Variable Name    Value
noninteger       4.75
myString         'hi there'
endString        ' class!'
newList          [2,5,29,1,9,59,3]

Class Review
Python So Far (to be updated/refined!)
1. Expressions
   – Evaluate input and return some output (calculator)
2. Variable Assignments: variable = expression
   – Store the value of the expression in the variable instead of outputting the value.
   – There is always an equals sign in an assignment
   – Variables can be named many things
   – List assignments: listvar[index] = expression
3. Types
   – Integers vs. Floats (Decimals)
   – Strings in single quotes
   – Lists are ordered collections of other types
   – We can index into Strings & Lists
Expressions for a particular type will output that same type! Floats have a higher priority

A brief review of things you didn’t know you’d learned
– In a spreadsheet, there are many types of data
  – Numbers (start with +/- or a digit)
  – Strings (nondigit-start, or start with ‘ )
  – Formulas (start with =)
  – Ranges (B2, B2:B4, B2:D5)
  – Errors (#N/A)
  – Blanks

What shows up in a cell
– If a formula evaluates to a number or string, that number or string
– If it evaluates to a range, the value in the first cell of that range... sometimes
  – If you write =A1:A6, you get A1
  – If you write =OFFSET(A1:A6, 0, 0), Gsheets fills in adjacent cells; Excel just fills in one cell
– If evaluation leads to an error, then #N/A
– Mostly, we never notice any of this
– In Python, the rules have greater consistency, and because results aren’t instantly visible, knowing the rules matters more
