Retrieving Time From Scanned Books - UMass Amherst

Transcription

Retrieving Time from Scanned Books
April 1, 2015
John Foley & James Allan
Center for Intelligent Information Retrieval
School of Computer Science
University of Massachusetts Amherst
{jfoley, allan}@cs.umass.edu
@johnfoleyiv

Books?
Books are an important challenge in IR
  ECIR 2015 Keynote, Marti Hearst
Books are important primary sources
  Education
  Historical Research
Relatively unused

Books? Why not KBs?
Knowledge bases are massively skewed toward the present day.

Scanned Books?
Over six million digitized books publicly available
  Languages
  Cultures
  Periods of time
  Subjects

Retrieving Time?

Retrieving Time?
There’s probably good information here, but standard search does not present it to us.

Retrieving Time?
Wait, 1879, 1880? British Inventors?
Easy to pick on snippet/answer generation, and the website has good information.

Retrieving Time?
...it’s hard.
1803?

Invention of the Light Bulb
1803 - Arc lamp demonstrated
1835 - First constant electric light presented
1850 - First Fluorescent demonstration “Lightning in a tube”
1878 - Actual page claims Edison starts here
1879 - Search engine’s claim
1880 - Edison’s Patent
1908 - Manufacturers adopt Edison’s screw design
1962 - First LED is invented
2008 - LED light bulbs are on the market

Retrieving Times from Scanned Books
invention of the light bulb
When?
 1. 1803 - Arc lamp demonstrated
 2. 1835 - First constant electric light presented
 3. 1850 - Lightning in a Tube
 4. 1878 - Edison’s Patent
 5. 1879 - Edison’s Patent
 6. 1880 - Edison’s Patent
 7. 1908 - Manufacturers adopt Edison’s screw design
 8. 1962 - First LED is invented
 9. 2008 - LED light bulbs are on the market
10. 2015 - Graphene Light Bulb (BBC article)

Possible Uses: As Query Facets
invention of the light bulb

How do we intrinsically evaluate such a system?

Retrieving Times from Scanned Books
Query: invention of the light bulb
When?
“Documents” (years): 1908, 1962, 2008, 2015
“qrel” file:
100 0 1803 1
100 0 1835 1
100 0 1878 2
100 0 1880 4
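
The qrel lines on this slide look like the usual TREC layout (query id, iteration, document id, relevance grade), with years standing in as the "documents". The sketch below is a minimal illustration under that assumption: it parses the four lines shown and computes a reciprocal rank for one made-up ranking. The function names and the example ranking are hypothetical, not taken from the talk.

```python
from collections import defaultdict


def parse_qrels(text):
    """Parse TREC-style qrel lines: query_id, iteration, doc_id, grade.

    Assumption: the "documents" are years, so doc_id is read as an int.
    """
    qrels = defaultdict(dict)
    for line in text.strip().splitlines():
        query_id, _iteration, year, grade = line.split()
        qrels[query_id][int(year)] = int(grade)
    return dict(qrels)


def reciprocal_rank(ranked_years, judged):
    """Return 1/rank of the first year with a positive grade, else 0.0."""
    for rank, year in enumerate(ranked_years, start=1):
        if judged.get(year, 0) > 0:
            return 1.0 / rank
    return 0.0


QREL_TEXT = """
100 0 1803 1
100 0 1835 1
100 0 1878 2
100 0 1880 4
"""

qrels = parse_qrels(QREL_TEXT)
# Hypothetical system output for query 100; 1879 is not judged in these lines.
ranking = [1879, 1880, 1803]
rr = reciprocal_rank(ranking, qrels["100"])
```

Averaging this value over all queries would give a mean reciprocal rank; graded measures could use the grade column instead of treating it as binary.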

Wouldn’t it be great if there were some human-curated list of events in history?
http://www.toptensocialmedia.com

http://en.wikipedia.org/wiki/1879

Wikipedia Year Bullets as Queries
Relevant year or source page: 1879
Original, HTML bullet point:
  December 31 – Thomas Edison demonstrates incandescent lighting to the public for the first time in Menlo Park, New Jersey.
Strip dates (so we don’t accidentally cheat!):
  thomas edison demonstrates incandescent lighting to the public for the first time in menlo park new jersey
Lemmatize/Stem/Stop/Otherwise Preprocess:
  thomas edison demonstrate incandescent light public first time menlo park new jersey
thomas edison demonstrate incandescent lighting public
When?
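
The date-stripping and stopping steps on this slide can be sketched roughly as follows. This is only an illustration: the stopword list and the date-prefix pattern are assumptions, and lemmatization or stemming is left out, so "demonstrates" is not reduced to "demonstrate" here.

```python
import re

# Assumption: a tiny illustrative stopword list, not the one used in the talk.
STOPWORDS = {"the", "to", "for", "in", "a", "an", "of", "and", "on", "by"}

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")
# Matches a leading "Month Day" prefix followed by a dash, as on Wikipedia year pages.
DATE_PREFIX = re.compile(r"^\s*(?:%s)\s+\d{1,2}\s*[–-]\s*" % MONTHS,
                         re.IGNORECASE)


def bullet_to_query(bullet):
    """Strip the date prefix, lowercase, drop digits/punctuation and stopwords."""
    text = DATE_PREFIX.sub("", bullet)
    # Keep runs of letters only, so any remaining digits (years) are dropped too.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


BULLET = ("December 31 – Thomas Edison demonstrates incandescent lighting to "
          "the public for the first time in Menlo Park, New Jersey.")
query_terms = bullet_to_query(BULLET)
```

Dropping every digit is a blunt way to make sure the relevant year never leaks into the query text.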

Wikipedia Year Bullets as Queries
thomas edison demonstrate incandescent lighting public
When?
Relevant document: 1879

Wiki-Year-Facts
http://ciir.cs.umass.edu/downloads/
http://cs.umass.edu/ jfoley/datasets.html
Facts covering 500 BC to June 2013
40,000 English facts
Simple JSON format

Generating Ambiguous Queries
Facts are nice, but aren’t ambiguous, like our example
invention of the lightbulb
When?
Relevant document: 1879? 1880? 1803? 2012?

Generating Ambiguous Queries
Generate new queries from shared pairs of entities.
  (1221) The Maya of the Yucatán revolt against the rulers of Chichen-Itza
  (1528) The Maya peoples drive Spanish Conquistadores out of the Yucatán
  (1848) The Independent Republic of Yucatán joins Mexico in exchange for help in suppressing a revolt by the indigenous Maya population.
maya yucatán
When?
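
One way to picture the "shared pairs of entities" idea is the sketch below. It assumes entities have already been identified for each fact (the slide does not say how), and the entity sets, function name, and `min_years` threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Assumption: hand-written entity sets for the three facts on the slide.
FACTS = [
    (1221, {"maya", "yucatán", "chichen-itza"}),
    (1528, {"maya", "yucatán", "conquistadores"}),
    (1848, {"maya", "yucatán", "mexico"}),
]


def ambiguous_queries(facts, min_years=2):
    """Map each entity pair shared by facts from >= min_years years to those years."""
    years_by_pair = defaultdict(set)
    for year, entities in facts:
        for pair in combinations(sorted(entities), 2):
            years_by_pair[pair].add(year)
    return {
        " ".join(pair): sorted(years)
        for pair, years in years_by_pair.items()
        if len(years) >= min_years
    }


queries = ambiguous_queries(FACTS)
```

Each resulting query has several relevant years, which is what makes it ambiguous in the sense of the light bulb example.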

Intermezzo
We’ve motivated the problem:
  invention of the light bulb
  When?
We’ve created an evaluation and some queries
Now we can try to solve the problem
  Extract or identify temporal information
  Evaluate retrieval models for these events

How can we use books to predict times?
Extract Temporal Information
Rank
1908, 1962, 2008, 2015

Time Taggers
Pros
  Good, open-source implementations
  e.g. HeidelTime, StanfordNLP
Cons
  Tuned/Trained for news articles
  e.g. Relative times are based on document dates
  The Writings of Abraham Lincoln: 1905 OCR
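
To see why anchoring relative times on the document date hurts for books, consider "four score and seven years ago" in a speech from 1863 that is printed in a volume dated 1905. The arithmetic below is only an illustration; `resolve_years_ago` is a hypothetical helper, not part of any tagger.

```python
SCORE = 20  # a "score" is twenty years


def resolve_years_ago(years_ago, reference_year):
    """Resolve a relative 'N years ago' expression against a reference year."""
    return reference_year - years_ago


years_ago = 4 * SCORE + 7  # "four score and seven years ago" = 87 years

# Anchored on the date of the speech itself.
from_speech_date = resolve_years_ago(years_ago, 1863)
# Anchored on the book's publication date, as a news-tuned tagger would do.
from_publication_date = resolve_years_ago(years_ago, 1905)
```

The first anchor gives 1776, the second gives 1818, so the publication date is the wrong reference point for text that is reprinted decades later.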

OCR can be difficult
20 The Writings of
ADDRESS AT GETTYSBURG
NOVEMBER IQ, 1863.
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal

OCR can be difficult
“fT O me,“ wrote George Thompson to Mr. Garrison, on hear - Nov. 23, J ing of Lincohi’s election, “it seems that the triumph l8 . j I f just achieved has placed the cause in a new, a critical, and a trying position j demanding if it be possible additional vigi lance, inflexible steadfastness to fundamental moral principles, and unrelaxed energy in the employment of anti-slavery means. You have now to grapple with the new doctrine of Eepublican conservatism, and will be called to [ocric]

High-Precision Sentences
In 1908 Thomas Edison proposed a method of casting two - and three - story houses in one operation .
We add for reference description of the Phonograph taken from The Americana '' : Phonograph , an instrument invented in 1877 , by Thomas Edison of Menlo Park , N. J. , by means of which articulate sounds can be regis - tered permanently , and afterward reproduced from such mechanical register .
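
As a rough stand-in for the time taggers named earlier (this is not how HeidelTime or the Stanford tools work), a sentence can be labeled with high precision by keeping only explicit four-digit years. The pattern, the function name, and the 1000 to 1925 range are assumptions made for this sketch.

```python
import re

# Explicit four-digit years starting with 1; relative expressions are ignored.
YEAR = re.compile(r"\b(1\d{3})\b")


def explicit_years(sentence, lo=1000, hi=1925):
    """Return explicit years in [lo, hi] mentioned in the sentence."""
    return [int(y) for y in YEAR.findall(sentence) if lo <= int(y) <= hi]


s1 = ("In 1908 Thomas Edison proposed a method of casting two - and "
      "three - story houses in one operation .")
s2 = "Phonograph , an instrument invented in 1877 , by Thomas Edison"
s3 = "Four score and seven years ago our fathers brought forth"

labels = [explicit_years(s) for s in (s1, s2, s3)]
```

The third sentence gets no label at all, which trades recall for precision: nothing is guessed from relative expressions.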

Exploring the Solution Space
What we have:
  Time-labeled sentences on certain pages of certain books.
What we want:
  A method to rank years as a result of a query over this data.
  1. 1803
  2. 1879
  3. 1880
  4. 2008
  5. 2015

Existing Approaches that perform poorly
(A) Let’s use publication date of a neighbor document:
  Works really well on news in related work
  Really poorly for books: 0.03 MRR
  The Writings of Abraham Lincoln: 1905
(B) Okay, what if we use the most-commonly mentioned date or time?
  Works okay. (Results in paper).
  We tried pages and full books.
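
Approach (B) can be pictured with the toy sketch below: retrieve a few pages for the query, then rank years by how often those pages mention them. The term-overlap retrieval, the page data, and `top_k` are placeholders chosen for illustration; the talk does not specify these details here.

```python
from collections import Counter

# Toy pages: (set of terms on the page, years mentioned on the page).
PAGES = [
    ({"edison", "light", "bulb"}, [1879, 1880]),
    ({"edison", "patent", "light"}, [1880]),
    ({"lincoln", "gettysburg"}, [1863]),
]


def rank_years_by_mentions(query_terms, pages, top_k=2):
    """Rank years by mention count within the top_k retrieved pages."""
    # Placeholder retrieval: score a page by how many query terms it contains.
    scored = sorted(pages, key=lambda p: len(query_terms & p[0]), reverse=True)
    counts = Counter()
    for _terms, years in scored[:top_k]:
        counts.update(years)
    return [year for year, _count in counts.most_common()]


ranked = rank_years_by_mentions({"edison", "light", "bulb"}, PAGES)
```

Swapping pages for whole books only changes what counts as a retrieved unit; the counting step stays the same.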

Modeling Events
What we have:
  Time-labeled sentences on certain pages of certain books.
What we want:
  A method to rank years as a result of a query over this data.
Diagram labels: Books, Pages, Sentences; Event Model; Predicted Times
  1. 1803
  2. 1879
  3. 1880
  4. 2008
  5. 2015

Books, Sentences and Years
1879, 1880, 1881

Proposed Models
Sentence-Event Model
  Every sentence is an event.
Year-Event Model
  All sentences mentioning the same year together form one event.
Book-Year-Event Model
  All sentences mentioning the same year in the same book together form one event.
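
The three models differ only in how time-labeled sentences are grouped into events before retrieval. The sketch below shows that grouping on toy data. The scoring function (a smoothed query likelihood) and the rule of keeping each year's best-scoring event are assumptions made for illustration, not details confirmed by the slides.

```python
import math
from collections import Counter, defaultdict

# Toy data: (book_id, year, sentence tokens). Purely illustrative.
SENTENCES = [
    ("b1", 1879, ["edison", "demonstrates", "incandescent", "lighting"]),
    ("b1", 1879, ["menlo", "park", "crowd", "gathers"]),
    ("b2", 1879, ["zulu", "war", "begins"]),
    ("b2", 1880, ["edison", "patent", "granted"]),
]


def build_events(sentences, model):
    """Group sentences into events; each event is (year, bag of terms)."""
    groups = defaultdict(Counter)
    for i, (book, year, tokens) in enumerate(sentences):
        if model == "sentence":
            key = (i, year)        # every sentence is its own event
        elif model == "year":
            key = (None, year)     # one event per year
        elif model == "book-year":
            key = (book, year)     # one event per (book, year) pair
        else:
            raise ValueError(model)
        groups[key].update(tokens)
    return [(key[1], bag) for key, bag in groups.items()]


def rank_years(query, events, mu=1.0):
    """Score events by smoothed query likelihood; keep each year's best event."""
    collection = Counter()
    for _year, bag in events:
        collection.update(bag)
    total = sum(collection.values())
    best = {}
    for year, bag in events:
        length = sum(bag.values())
        score = 0.0
        for term in query:
            # Additive smoothing keeps unseen terms from producing log(0).
            background = (collection[term] + 0.5) / (total + 1.0)
            score += math.log((bag[term] + mu * background) / (length + mu))
        best[year] = max(score, best.get(year, -math.inf))
    return sorted(best, key=best.get, reverse=True)


QUERY = ["edison", "incandescent"]
rankings = {m: rank_years(QUERY, build_events(SENTENCES, m))
            for m in ("sentence", "year", "book-year")}
```

On this toy data all three groupings rank 1879 first; they differ in how many events exist and how long each event's text is.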

Evaluation: Dataset
926 years to explore: 1000 to 1925
50,228 books from the INEX book track
  16 million pages, 5.2 billion terms
  Dropped special XML from that track
14 million time expressions
  10.4 million with high precision, and 9.5 million in our target years

Evaluation: Settings
Two types of queries:
  15,739 wiki-facts, 3,235 ambiguous versions
  Split ⅔ train/validate, ⅓ test
Five models
  Three we just presented
  Two loosely based on related work
Joint Book-Year is novel
  Long documents can help
Invariant to the complexity of the retrieval framework
  Results hold on QL vs. SDM

Results on Best 3

Conclusions
We automatically derive events from a collection of historical books using standard NLP tools
We use this data to predict the year(s) associated with a query
  And we evaluate a number of retrieval models on this different domain and task
We create and release an interesting set of automatically derived judgments from Wikipedia

Thank you for listening
Retrieving Time from Scanned Books
John Foley & James Allan
{jfoley, b.com/jjfiv/ecir2015timebooks
