Jure Leskovecand AnandRajaraman - Stanford University

Transcription

CS345a: Data MiningjJure Leskovec and Anand RajaramanStanford University

Friday 5:30 at Gates B12 5:30‐7:30pmYou will learn and get hands on experience on: Login to Amazon EC2 and request a cluster Run Hadoop MapReduce jobs UseU AsterA t nClusterCl t softwareft Amazon have us 12k of computing timeEach students has about 200 worth ofcomputing time1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining2

Ideally teams of 2 students (1 (3) is also ok)Project: Discovers interesting relationships within asignificant amount of data Have some original idea that extends/builds onwhat we learned in class Extend/Improve/SpeedExtend/Improve/Speed‐upup some existing algorithm Define a new problem and solve it1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining3

Answer the following questions: What is the problem you are solving? WhatWh t ddatat willill you use ((wherehwillill you gett it)? How will you do it? WhichWhi h algorithms/techniquesl ith /t h iyou planl tot use? Be as specific as you can! Who will you evaluate,evaluate measure success? What do you expect to submit at the end of thequarter?1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining4

Due on midnight Feb 1 2010 Email the PDF to cs345a‐win0910‐staff@lists.stanford.edu TAs will assign group numbers Name your file: group# proposal.pdf group# proposal pdf1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining5

WikipediaIM buddy graphYahoo Altavista web graphStanford WebBaseTwitter DataBlogs and news dataNetflixRestaurant reviewsYahoo Music Ratings1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining6

Complete edit history of Wikipedia untilJanuary 2008 For every single edit the completesnapshot of the article is saved Each page has a talk page: 1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining7

Talk page: Editors discuss things like:1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining8

Every registereduse has a page:1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining9

Every useruser’ss page has a talk page: Users discuss things:1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining10

1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining11

title Anarchism /title id 12 /id revision id 18201 /id timestamp 2002-02-25T15:00:22Z /timestamp contributor ip Conversion script /ip /contributor minor / comment Automated conversion /comment text xml:space "preserve" ''Anarchism'' istheory that advocates the abolition of allgovernment. /text /revision revision / id 19746 /id timestamp 2002-02-25T15:43:11Z /timestamp contributor ip 140.232.153.45 /ip /contributor comment * /comment text xml:space xml:space "preserve" ''Anarchism'‘preserve Anarchismistheory that advocates the abolition of all.1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Miningthe politicalforms ofthe politicalforms of government.12

Complete edit and talk history of Wikipedia: How do articles evolve? Use string edit distance like approach to measure differencesbetween versions of the article Model the evolution of the content Which users make what types of edits? Big vs. small changes, reorganization? Suggest to a which user should edit the page? HowH dod users talklk andd thenh editdi same pages? Do users first talk and then edit? Is it the other wayy around? Suggest users which pages to edit1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining13

Altavista web graph from 2002: 1/21/2010Nodes are webpagesDi t d edgesDirecteddare hhyperlinksli k1.4 billion public webpagesSSeverall billionbilli edgesdFor each node we also know the page URLJure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining14

SPAM: Use the web‐graph structureto more efficiently extractspam webpages Link farms Spider traps Personalized and topictopic‐sensitive PageRank1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining15

Website structure identification: From the webgraph extract “websites” What are common navigational structures ofwebsites? Cluster website graphs Identify common subgraphs and patterns What are roles pages/links play in the graph: ContentC t t pages Navigational pagesp g Index pages Build a summary/map of the website1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining16

A collection of focused snapshots of the WebData starts in 2004 and continues till today General crawls start from 1000 seed webpages Crawl up to 150150,000000 pager per site Specialized crawls: Universities US Government Hurricane Katrina (2005) – daily crawls Monthly newspaper crawls1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining17

Smaller than Altavista but youalso have the page content Can do topic analysis Topic sensitive PageRank Study the evolution of websites andwebpages1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining18

50 million tweets per month startingJune 2009 (6 months) Format: T2009-06-07 02:07:42Uhttp://twitter.com/redsoxtweetsW#redsox Extra Bases: Sox win, 8-1: The Rangersspoiled Jon Lester's perfecto and his shutout.http://tinyurl.com/pyhgwy T iimportantTwot t thithings: URLs Hash‐tags1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining19

Trending topics: raising,raising falling Inferring links of the who‐follows‐whom network What is the lifecycles of URLs and hash‐tags? Finding early/influential users? Clustering tweets by topic or category Sentiment analysis – are peoplepositive/negativeiti /ti aboutb t somethingthi ((a product?)d t?)1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining20

More than 1 million newsmedia and blogarticles per day since August 2008 Extract phrases (quotes) and links http://memetracker.org Format: -09-01 00:00:13dangerously unprepared to be presidenteven more dangerously unpreparedunderstands the challenges that we faceworked and succeededstill to this day refuses to acknowledge that the surge hassucceededhttp://www.cnn.com1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining21

Find all variants (mutations) of the samephrase – cluster phrases based on editdistance and time: lipstick on a pigyou can put lipstick on a pigyou can put lipstick on a pig but itit'ss still a pigi think they put some lipstick on a pig but it's still a pigputting lipstick on a pigTemporal variations of the phrase volume1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining22

py of a pPredict the ppopularityphrase over time How does information mutate/change over time? Which media sites are the most influential? Build apredictive model of site influence Which nodes are early mentioners, late comers,summarizers?i? Sentiment analysis – are people positive/negative aboutsomething (news, a product) Create a model of political bias (liberal vs. conservative) What is genuine newsnews, what are genuine phrases and what isspam?1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining23

We also have the Wikipedai webserver logs,logsi.e., page visit statistics How does Wiki page visit statistics correlatewith external events,events natural disasters? Use Twitter or MemeTracker data to detect those Compare occurrence of phrases and visits toWikipedia pages1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining24

A large IM buddy graph from March 2005230 million nodes7 340 million undirected edges7,340 Limitations: Only have the buddy graph with random node ids No communication or edge strength1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining25

Find communities,communities clusters in such a big graph Count frequentqsubgraphsg p Design algorithms to characterize thestructure off the network as a whole1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining26

Movie ratings: Netflix prize dataset: http://www.netflixprize.com/ Yahoo Music ratings: Yahoo Music user ratingsg of songsg with artist,,album and genre information 717 million ratings 136,000 songs 1.8 users R tRestaurantt reviewsi1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining27

Collaborative filtering: Predict what ratings will user give to particularsongs/movies,g/, i.e.,, which sons will he/she/like? Supplement the data with additional datasources: Movies ‐‐ IMDB Playlists from the web Lyric (text of the song) Include taste, temporal component,diversity into the model1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining28

Stanford Search QueriesNew York Times articles since 1987 Article are manually annotated by subjectcategories and keywords Entity or relation extraction Extract keywords, predict article category Don’t’ ffeell llimiteddbby thesehYou can collect the dataset yourselfAnd define the project/question yourself1/21/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining29

Project: Discovers interestinginteresting relationshipsrelationships withinwithin aa . Name your file: group# proposal pdf group# _proposal.pdf 1/21/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 5 . Data Mining 29. Title: Microsoft PowerPoint - 06-projects.pptx Author: jure Created Date: 1/21/2010 2:37:04 PM .