Large Graph Mining: Power Tools And A Practitioner's Guide - CMU

Transcription

CMU SCSLarge Graph Mining:Power Tools and a Practitioner’s guideTask 5: Graphs over time & tensorsFaloutsos, Miller, TsourakakisCMUKDD '09Faloutsos, Miller, TsourakakisP5-1

CMU SCSOutline Introduction – MotivationTask 1: Node importanceTask 2: Community detectionTask 3: RecommendationsTask 4: Connection sub-graphsTask 5: Mining graphs over timeTask 6: Virus/influence propagationTask 7: Spectral graph theoryTask 8: Tera/peta graph mining: hadoopObservations – patterns of real graphsConclusionsKDD '09Faloutsos, Miller, TsourakakisP5-2

CMU SCSThanks to Tamara Kolda (Sandia)for the foils on tensordefinitions, and on TOPHITSKDD '09Faloutsos, Miller, TsourakakisP5-3

CMU SCSDetailed outline Motivation Definitions: PARAFAC and Tucker Case study: web miningKDD '09Faloutsos, Miller, TsourakakisP5-4

CMU SCSExamples of Matrices:Authors and termsdataJohnPeterMary .Nick .KDD '09.135.miningclassif.114226.Faloutsos, Miller, Tsourakakistree.55 .7 .P5-5

CMU SCSMotivation: Why tensors? Q: what is a tensor?KDD '09Faloutsos, Miller, TsourakakisP5-6

CMU SCSMotivation: Why tensors? A: N-D generalization of matrix:KDD’09JohnPeterMaryNick.KDD '09data135.miningclassif.114226.Faloutsos, Miller, Tsourakakistree.55 .7 .P5-7

CMU SCSMotivation: Why tensors? A: N-D generalization of DD '09data135.miningclassif.114226.Faloutsos, Miller, Tsourakakistree.55 .7 .P5-8

CMU SCSTensors are useful for 3 or moremodesTerminology: ‘mode’ (or ‘aspect’):Mode#3data135Mode#2KDD '09.miningclassif.114226.tree.Mode ( aspect) #1Faloutsos, Miller, Tsourakakis.55 .7 .P5-9

CMU SCSNotice 3rd mode does not need to be time we can have more than 3 modesDest. port125.80135IP sourceKDD '09.114.226.IP destinationFaloutsos, Miller, Tsourakakis55 .7 .P5-10

CMU SCSNotice 3rd mode does not need to be time we can have more than 3 modes– Eg, fFMRI: x,y,z, time, person-id, task-idFrom DENLAB, Temple U.(Prof. V. Megalooikonomou Faloutsos, Miller, TsourakakisP5-11KDD '09

CMU SCSMotivating Applications Why tensors are useful?– web mining (TOPHITS)–––––KDD '09environmental sensorsIntrusion detection (src, dst, time, dest-port)Social networks (src, dst, time, type-of-contact)face recognitionetc Faloutsos, Miller, TsourakakisP5-12

CMU SCSDetailed outline Motivation Definitions: PARAFAC and Tucker Case study: web miningKDD '09Faloutsos, Miller, TsourakakisP5-13

CMU SCSTensor basics Multi-mode extensions of SVD – recallthat:KDD '09Faloutsos, Miller, TsourakakisP5-14

CMU SCSReminder: SVDnnmA mΣVTU– Best rank-k approximation in L2KDD '09Faloutsos, Miller, TsourakakisP5-15

CMU SCSReminder: SVDnmAσ1u1 v1 σ2u2 v2 – Best rank-k approximation in L2KDD '09Faloutsos, Miller, TsourakakisP5-16

CMU SCSIxJxKIxRCKxRGoal: extension to 3 modesJxRB¼AKDD '09 RxRxRFaloutsos, Miller, TsourakakisP5-17

CMU SCSMain points: 2 major types of tensor decompositions:PARAFAC and Tucker both can be solved with alternating leastsquares’’ (ALS)KDD '09Faloutsos, Miller, TsourakakisP5-18

CMU SCSSpecially Structured Tensors Tucker Tensor Kruskal TensorOurNotationTIxJxKIxRJxSV UKDD '09IxJxK RxSxTFaloutsos, Miller, nJxR VvRR x R x RuRP5-19

CMU SCSIxJxKIxRCKxTTucker Decomposition - intuitionJxSB¼ARxSxT author x keyword x conference A: author x author-group B: keyword x keyword-group C: conf. x conf-group KDDG:howgroupsrelatetoeachother'09Faloutsos, Miller, TsourakakisP5-20

CMU SCSIntuition behind core tensor 2-d case: co-clustering [Dhillon et al. Information-Theoretic Coclustering, KDD’03]KDD '09Faloutsos, Miller, TsourakakisP5-21

CMU SCSnk .5 .50m 0 0 000.5.500l .03 .03 l k .2 .2 0000.5.5KDD '09 .05 .05 .05 0 0 0 .05 .05 .05 0 0 0 m 0 0 0 .05 .05 .05 0 0 0 .05 .05 .05 .04 .04 0 .04 .04 .04 .04 .04 .04 0 .04 .04 n[.36 .36 .28 000000] .28 .36 .36eg, terms x documents .054 .054 00 .036 .036Faloutsos, Miller, 42.042.028.02800.054.054.036.036 00.054.054.036.036P5-22

CMU SCSmed. doc .05 .05 .05 0 0 0 .05 .05 .05 0 0 0 .05 .05 00 00 00 .05 05 .05 .05 .04 .04 0 .04 .04 .04 .04 .04 .04 0 .04 .04 term group xdoc. group .5 0 0 .03 .03 .5 0 0 .2 .2 0 .5 0 00 .05 .05 0 0 .5 term xterm-groupKDD '09cs doc[.36 .36 .28 000000] .28 .36 .36doc xdoc groupmed. termscs termscommon terms .054 .054 00 .036 .036Faloutsos, Miller, 42.042.028.02800.054.054.036.036 00.054.054.036.036P5-23

CMU SCSTensor tools - summary Two main tools– PARAFAC– Tucker Both find row-, column-, tube-groups– but in PARAFAC the three groups are identical ( To solve: Alternating Least Squares )KDD '09Faloutsos, Miller, TsourakakisP5-24

CMU SCSDetailed outline Motivation Definitions: PARAFAC and Tucker Case study: web miningKDD '09Faloutsos, Miller, TsourakakisP5-25

CMU SCSWeb graph mining How to order the importance of web pages?– Kleinberg’s algorithm HITS– PageRank– Tensor extension on HITS (TOPHITS)KDD '09Faloutsos, Miller, TsourakakisP5-26

CMU SCSKleinberg’s Hubs and Authorities(the HITS method)Sparse adjacency matrix and its SVD:authority scoresfor 1st topicauthority scoresfor 2nd topicfromtohub scoresfor 1st topicKDD '09Kleinberg, JACM, 1999Faloutsos, Miller, Tsourakakishub scoresfor 2nd topicP5-27

CMU SCSHITS Authorities on Sample Data.97.24.08.05.02.01.011st Principal Factorwww.ibm.comwww.alphaworks.ibm.com2nd Principal Factorwww-128.ibm.comWe started our crawl from.99 .mcs.anl.gov/neos,.11 www2.lehigh.edu3rd Principal Factorwww.research.ibm.comand crawled 4700 pages,.06 www.lehighalumni.comwww.redbooks.ibm.com.75 java.sun.comresulting in 560.06 www.lehighsports.comnews.com.com.38 www.sun.comcross-linked hosts.02 www.bethlehem-pa.gov.36 developers.sun.com 4th Principal Factor.02 www.adobe.com.24 see.sun.com.60 www.pueblo.gsa.gov.02 lewisweb.cc.lehigh.edu.16 www.samag.com.45 www.whitehouse.gov.02 www.leo.lehigh.edu.13 docs.sun.com.35 www.irs.gov.02 www.distance.lehigh.edu.12 blogs.sun.com .31 travel.state.gov 6th Principal Factor.02 fp1.cc.lehigh.edu.08 sunsolve.sun.com.22 www.gsa.gov.97 mathpost.asu.edu.08 www.sun-catalogue.com.20 www.ssa.gov.18 math.la.asu.edu.08 news.com.com .16 www.census.gov.17 www.asu.eduauthority scoresauthority scoresfor 2nd topicstfor 1 topicfromtohub scoresfor 1 topicKDD '09st.04 www.act.org.14 www.govbenefits.gov.03 www.eas.asu.edu.13 www.kids.gov.02 archives.math.utk.edu.13 www.usdoj.gov.02 www.geom.uiuc.edu.02 www.fulton.asu.edu.02 www.amstat.org.02 www.maa.orghub scores Faloutsos, Miller, Tsourakakisfor 2nd topicP5-28

CMU SCSThree-Dimensional View of the WebObserve that thistensor is very sparse!KDD '09Faloutsos, Miller, TsourakakisKolda, Bader, Kenny, ICDM05P5-29

CMU SCSThree-Dimensional View of the WebObserve that thistensor is very sparse!KDD '09Faloutsos, Miller, TsourakakisKolda, Bader, Kenny, ICDM05P5-30

CMU SCSThree-Dimensional View of the WebObserve that thistensor is very sparse!KDD '09Faloutsos, Miller, TsourakakisKolda, Bader, Kenny, ICDM05P5-31

CMU SCSTopical HITS (TOPHITS)Main Idea: Extend the idea behind the HITS model to incorporateterm (i.e., topical) information.term scoresfor 2nd topictermterm scoresfor 1st topicfromtoauthority scoresfor 1st topichub scoresfor 1st topicKDD '09Faloutsos, Miller, Tsourakakisauthority scoresfor 2nd topichub scoresfor 2nd topicP5-32

CMU SCSTopical HITS (TOPHITS)Main Idea: Extend the idea behind the HITS model to incorporateterm (i.e., topical) information.term scoresfor 2nd topictermterm scoresfor 1st topicfromtoauthority scoresfor 1st topichub scoresfor 1st topicKDD '09Faloutsos, Miller, Tsourakakisauthority scoresfor 2nd topichub scoresfor 2nd topicP5-33

CMU SCSTOPHITS Terms & Authoritieson Sample Data.23.18.17.16.16.15.15.14.12.121st Principal FactorJAVA.86 java.sun.comSUN.38 developers.sun.com2nd Principal FactorPLATFORM.16 docs.sun.comTOPHITS uses 3D analysis to find.20 NO-READABLE-TEXT .99 www.lehigh.eduSOLARIS.14 see.sun.comthe dominant groupings of web.16 FACULTY.06 3rdwww2.lehigh.eduPrincipal FactorDEVELOPER.14 www.sun.com.16 SEARCH.03 www.lehighalumni.compages and terms.15 NO-READABLE-TEXTEDITION.09 www.samag.com .97 www.ibm.com.16 NEWS .15 IBMDOWNLOAD.07 developer.sun.com .18 www.alphaworks.ibm.com.16 LIBRARIESPrincipal Factor.12 SERVICESwww-128.ibm.comINFO.06 sunsolve.sun.com .07 4th.16 COMPUTING.26 INFORMATION.87 www.pueblo.gsa.gov.12 WEBSPHERE.05 www.developer.ibm.comSOFTWARE.05 access1.sun.com.12 LEHIGH.12 WEB .24 FEDERAL .02 www.redbooks.ibm.com.24 www.irs.govNO-READABLE-TEXT.05 iforce.sun.com.23 CITIZEN.23 6thwww.whitehouse.gov.11 DEVELOPERWORKS.01 www.research.ibm.comPrincipal Factorwk # unique links using term k.11 LINUX .22 OTHER.26 PRESIDENT.19 travel.state.gov.87 www.whitehouse.gov.19 CENTER.18 www.gsa.gov.11 RESOURCES.25 NO-READABLE-TEXT.18 www.irs.gov.19 LANGUAGES.09 www.consumer.gov.11 TECHNOLOGIES.25 BUSH.16 12thtravel.state.govPrincipal Factor.15 U.S.09 www.kids.gov.10 DOWNLOADS.25 WELCOME.10www.gsa.gov.75 OPTIMIZATION.35 www.palisade.com.15 PUBLICATIONS.07 www.ssa.gov.17 WHITE.58 SOFTWARE.08 www.ssa.gov.35 www.solver.com.14 CONSUMER.05www.forms.gov.16 U.S.05 www.govbenefits.govPrincipal Factor.08 DECISION.33 13thplato.la.asu.edu.13 FREE.04 www.govbenefits.gov.15 HOUSE.07 NEOS .46 ADOBE.04 www.census.gov.99 www.adobe.com.29 www.mat.univie.ac.at.13 BUDGET.04 www.usdoj.gov.06 TREE .45 READER.28 www.ilog.com.13 PRESIDENTS.04 www.kids.gov16th Principal Factor.05 GUIDE .45 ACROBAT.26 www.dashoptimization.com.11 OFFICE.02 www.forms.gov.50 WEATHER.30 FREE.05 SEARCH.26 www.grabitech.com.81 www.weather.gov.30 NO-READABLE-TEXT.24 OFFICE.05 ENGINE.25 www-fp.mcs.anl.gov.41 www.spc.noaa.gov.29 HERE .23 CENTER.30 lwf.ncdc.noaa.gov.05 CONTROL.22 www.spyderopts.com19thPrincipal 29COPY.05ILOG.17www.mosek.comterm scoresterm scoresfor 1 topic.22 TAX.73 www.irs.govfor 2 topic.05 DOWNLOAD.17 ORGANIZATION.14 www.nhc.noaa.gov.43 travel.state.gov.15 NWS .17 TAXES.09 www.prh.noaa.gov.15 CHILD.22 www.ssa.govto.15 SEVERE.07 .gov.15 FIRE.06 www.nohrsc.nws.govauthority scoresauthority scores.14BENEFITS.06www.usdoj.gov.15 POLICY.06 www.srh.noaa.govfor 2 topicfor 1 topic.14STATE.03www.census.gov.14 CLIMATEhub scores.14INCOME.03www.usmint.govfor 2 topichub scoresKDD '09Faloutsos, Miller, Tsourakakis.13 SERVICE.02 www.nws.noaa.govfor 1 topic.13 REVENUE.02 www.gsa.gov.12 CREDIT.01 www.annualcreditreport.comTensor PARAFACstfromtermndndstndstP5-34

CMU SCSConclusions Real data are often in high dimensions withmultiple aspects (modes) Tensors provide elegant theory andalgorithms– PARAFAC and Tucker: discover groupsKDD '09Faloutsos, Miller, TsourakakisP5-35

CMU SCSReferences T. G. Kolda, B. W. Bader and J. P. Kenny.Higher-Order Web Link Analysis UsingMultilinear Algebra. In: ICDM 2005, Pages 242249, November 2005. Jimeng Sun, Spiros Papadimitriou, Philip Yu.Window-based Tensor Analysis on Highdimensional and Multi-aspect Streams, Proc. ofthe Int. Conf. on Data Mining (ICDM), HongKong, China, Dec 2006KDD '09Faloutsos, Miller, TsourakakisP5-36

CMU SCSResources See tutorial on tensors, KDD’07 (w/ TamaraKolda and Jimeng Sun):www.cs.cmu.edu/ christos/TALKS/KDD-07-tutorialKDD '09Faloutsos, Miller, TsourakakisP5-37

CMU SCSTensor tools - resources Toolbox: from Tamara Kolda:csmr.ca.sandia.gov/ tgkolda/TensorToolbox T. G. Kolda and B. W. Bader. Tensor Decompositions andApplications. SIAM Review, Volume 51, Number 3, September 2009csmr.ca.sandia.gov/ tgkolda/pubs/bibtgkfiles/TensorReview-preprint.pdf T. KoldaTensorDecompositionfor Multi-AspectKDDICDE’09'09 and J. Sun: ScalableFaloutsos,Miller, Data Mining (ICDM 2008)

6th Principal Factor.75 OPTIMIZATION .35 www.palisade.com.58 SOFTWARE .35 www.solver.com.08 DECISION .33 plato.la.asu.edu.07 NEOS .29 www.mat.univie.ac.at.06 TREE .28 www.ilog.com.05 GUIDE .26 www.dashoptimization.com authority scores for 1 st topic hub scores for 1 st topic hub scores for 2 nd topic authority scores for 2 nd topic from to