Truth Discovery For Passive And Active Crowdsourcing

Transcription

Truth Discovery for Passive and Active Crowdsourcing
Jing Gao (SUNY Buffalo), Qi Li (SUNY Buffalo), and Wei Fan (Baidu Research Big Data Lab)

Overview
1. Introduction of Crowdsourcing
2. Crowdsourced Data Aggregation for Reliable Information Discovery
3. Passive Crowdsourcing Scenarios
4. Active Crowdsourcing Scenarios
5. Applications and Open Questions
6. Resources
7. References

Motivation
- Huge amounts of data are contributed by users (user-generated content, user behavioral data, sensory data, ...).
- Crowdsourced data contains valuable information and knowledge.
- Errors, noise, and conflicts in the data are inevitable.
- Objective: obtain reliable information from crowdsourced data.

Passive Crowdsourcing
- "My girlfriend always gets a bad dry skin, rash on her upper arm, cheeks, and shoulders when she is on [Depo]. ..."
- "I have had no side effects from [Depo] (except ...), but otherwise no rashes."
- Extracted claims: (Depo, user 1: bad dry skin), (Depo, user 1: rash), (Depo, user 2: no rashes); aggregated result: (Depo, rash).
- "Made it through some pretty bad traffic! (John F. Kennedy International Airport (JFK) in New York, NY)"
- "Good news. No traffic on George Washington bridge approach from Jersey."
- Extracted claims: (JFK airport, traffic: bad), (JFK airport, traffic: good); aggregated result: (JFK, bad traffic).

Passive Crowdsourcing
- Description: users/data sources share information on their own.
- Goal: extract and integrate relevant information regarding a specific task.

[Figure: users 1-5 share information about an object; relevant claims are extracted and integrated.]

Active Crowdsourcing
- Example task: are the two images of the same person?
- Answer options: Definitely Same / Maybe Same / Not Sure / Maybe Different / Definitely Different.
- Annotation results from multiple workers: Definitely Same, Maybe Same, Not Sure, Maybe Different, ...
- Final aggregated answer: Same.

Active Crowdsourcing
- Description: users/data sources generate information based on requests.
- Goal: actively design and collect data for a specific task, and then integrate the information.

[Figure: a task is sent as queries to users 1-5.]

[Figure: the answers from users 1-5 are integrated to complete the task.]

Overview
1. Introduction of Crowdsourcing
2. Crowdsourced Data Aggregation for Reliable Information Discovery
3. Passive Crowdsourcing Scenarios
4. Active Crowdsourcing Scenarios
5. Applications and Open Questions
6. Resources
7. References

A Straightforward Fusion Solution
- Voting/Averaging: take the value claimed by the majority of the sources, or compute the mean of all the claims.
- Limitation: ignores source reliability.
- Source reliability is crucial for finding the true fact, but it is unknown.

Truth Discovery & Crowdsourced Data Aggregation
- Problem input: multiple conflicting pieces of information about the same set of objects, provided by various information sources.
- Goal: discover trustworthy information (i.e., the truths) from the conflicting data on each object.

Truth Discovery & Crowdsourced Data Aggregation
- Principle: infer both the truths and the source reliabilities from the data.
  - A source is reliable if it provides many pieces of true information.
  - A piece of information is likely to be true if it is provided by many reliable sources.

Truth Discovery & Crowdsourced Data Aggregation
- A common goal: improve the quality of the aggregation/fusion results.
- Via a common method: aggregate by estimating source reliabilities.
- Similar principles: data from reliable sources are more likely to be accurate; a source is reliable if it provides accurate information.
- Mutual challenge: prior knowledge and labels are rarely available.

Data Collection and Generation
- Truth discovery: we cannot control the data generation step; we only collect.
- Crowdsourced data aggregation: we can control data generation to a certain degree (what to ask, how to ask, how many labels per question).

Data Format of Claims
- Truth discovery: data is collected from the open domain; we cannot define the data space (type of data, range of data).
- Crowdsourced data aggregation: data generation is controlled; for easier validation of answers, requesters usually choose multiple-choice questions or scoring within a range.

Model Categories
- Statistical model (STA)
- Generative model (GM)
- Optimization model (OPT)

Statistical Model (STA)
- General goal: find the (conditional) probability of a claim being true.
- Source reliability: the probability(ies) of a source/worker making a true claim.

STA - TruthFinder
Different websites often provide conflicting information on a subject, e.g., the authors of "Rapid Contextual Design":

Online Store     | Authors
Powell's books   | Holtzblatt, Karen
Barnes & Noble   | Karen Holtzblatt, Jessamyn Wendell, Shelley Wood
A1 Books         | Karen Holtzblatt, Jessamyn Burns Wendell, Shelley Wood
Cornwall books   | Holtzblatt-Karen, Wendell-Jessamyn Burns, Wood
Mellon's books   | Wendell, Jessamyn
Lakeside books   | WENDELL, JESSAMYN; HOLTZBLATT, KAREN; WOOD, SHELLEY
Blackwell online | Wendell, Jessamyn; Holtzblatt, Karen; Wood, Shelley

[Yin et al., TKDE'08]

STA - TruthFinder
- Each object has a set of conflicting facts, e.g., different author lists for a book.
- Each web site provides some facts.
- How do we find the true fact for each object?
[Figure: bipartite graph linking web sites $w_1$-$w_4$ to facts $f_1$-$f_5$, which belong to objects $o_1$ and $o_2$.]

STA - TruthFinder
Heuristics:
1. There is usually only one true fact for a property of an object.
2. This true fact appears to be the same or similar on different web sites, e.g., "Jennifer Widom" vs. "J. Widom".
3. False facts on different web sites are less likely to be the same or similar; false facts are often introduced by random factors.
4. A web site that provides mostly true facts for many objects will likely provide true facts for other objects.

STA - TruthFinder
- Confidence of facts and trustworthiness of web sites:
  - A fact has high confidence if it is provided by (many) trustworthy web sites.
  - A web site is trustworthy if it provides many facts with high confidence.
- Iterative steps:
  - Initially, each web site is equally trustworthy.
  - Based on the four heuristics, infer fact confidence from web site trustworthiness, and then vice versa.
  - Repeat until a stable state is reached.

STA - TruthFinder
[Figure: iterative propagation between web site trustworthiness and fact confidence on the bipartite graph of web sites $w_1$-$w_4$, facts $f_1$-$f_4$, and objects $o_1$, $o_2$ (shown over several animation steps).]

STA - TruthFinder
- The trustworthiness of a web site $w$, $t(w)$: the average confidence of the facts it provides:
  $t(w) = \frac{\sum_{f \in F(w)} s(f)}{|F(w)|}$
  where $F(w)$ is the set of facts provided by $w$.
- The confidence of a fact $f$, $s(f)$: one minus the probability that all web sites providing $f$ are wrong:
  $s(f) = 1 - \prod_{w \in W(f)} \big(1 - t(w)\big)$
  where $W(f)$ is the set of web sites providing $f$ and $1 - t(w)$ is the probability that $w$ is wrong.
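A minimal sketch of these two updates (not the original TruthFinder implementation, which also uses dampening factors and fact similarity); the dictionaries `facts_by_source` and `sources_by_fact` are hypothetical input names:

```python
# Minimal TruthFinder-style iteration (illustrative sketch only).
# Assumed inputs (hypothetical):
#   facts_by_source:  {source: set of fact ids it provides}
#   sources_by_fact:  {fact id: set of sources providing it}

def truthfinder(facts_by_source, sources_by_fact, n_iter=20, init_trust=0.9):
    trust = {w: init_trust for w in facts_by_source}   # t(w)
    conf = {f: 0.5 for f in sources_by_fact}           # s(f)
    for _ in range(n_iter):
        # s(f) = 1 - prod_{w in W(f)} (1 - t(w))
        for f, ws in sources_by_fact.items():
            p_all_wrong = 1.0
            for w in ws:
                p_all_wrong *= (1.0 - trust[w])
            conf[f] = 1.0 - p_all_wrong
        # t(w) = average confidence of the facts provided by w
        for w, fs in facts_by_source.items():
            trust[w] = sum(conf[f] for f in fs) / len(fs)
    return trust, conf

# Toy usage
facts_by_source = {"w1": {"f1", "f2"}, "w2": {"f1"}, "w3": {"f3"}}
sources_by_fact = {"f1": {"w1", "w2"}, "f2": {"w1"}, "f3": {"w3"}}
print(truthfinder(facts_by_source, sources_by_fact))
```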

STA - TruthFinder
[Table: numbers of book author lists by error type (correct, missing authors, incomplete names, wrong first/middle names, redundant names, added incorrect names, no information) for Voting, TruthFinder, and Barnes & Noble.]
- An author list is viewed as a fact.

Generative Model (GM)
[Figure: generative view in which source reliability and the truth jointly generate the observed claims.]

Generative Model (GM)
One of the most popular model families:
- GTM [Zhao & Han, QDB'12]
- LTM [Zhao et al., VLDB'12]
- MSS [Qi et al., WWW'13]
- LCA [Pasternack & Roth, WWW'13]
- TEM [Zhi et al., KDD'15]
- DS [Dawid & Skene, 1979]
- GLAD [Whitehill et al., NIPS'09]
- ...

GM - Maximum Likelihood Estimation
- Multiple-choice questions with a fixed answer space.
- For each worker, the reliability is a confusion matrix (correct answer vs. the worker's answer, e.g., over options A-D).
- $\pi^{(k)}_{jl}$: the probability that worker $k$ answers $l$ when $j$ is the correct answer.
- $p_j$: the probability that a randomly chosen question has correct answer $j$.
[Dawid & Skene, 1979]

GM - Maximum Likelihood Estimation
For a single question whose correct answer is $q$, the probability of worker $k$'s responses is
$P(\text{data}_i \mid q \text{ is correct}) = \prod_{l=1}^{J} \big(\pi^{(k)}_{ql}\big)^{n^{(k)}_{l}}$,
where $n^{(k)}_{l}$ is the number of times worker $k$ answers $l$ on this question. Including the prior over the correct answer, the joint probability is
$P(\text{data}_i, q) = \prod_{j=1}^{J}\Big[p_j \prod_{k=1}^{K}\prod_{l=1}^{J}\big(\pi^{(k)}_{jl}\big)^{n^{(k)}_{l}}\Big]^{\mathbb{1}[j=q]}$.

GM - Maximum Likelihood Estimation
The full likelihood over all questions $i$ is
$L = \prod_{i}\prod_{j=1}^{J}\Big[p_j \prod_{k=1}^{K}\prod_{l=1}^{J}\big(\pi^{(k)}_{jl}\big)^{n^{(k)}_{il}}\Big]^{\mathbb{1}[q_i = j]}$.
- This is the likelihood if the correct answers (i.e., the $q_i$'s) are known.
- What if we don't know the correct answers? The unknown parameters are $p_j$, $q$, and $\pi^{(k)}_{jl}$: use the EM algorithm.
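A compact EM sketch of the Dawid-Skene model, assuming the labels are integer-coded (worker, question, answer) triples and initializing the posteriors by soft majority voting; this illustrates the standard algorithm, not the tutorial authors' code:

```python
import numpy as np

def dawid_skene(labels, n_workers, n_questions, n_classes, n_iter=50):
    """EM sketch for the Dawid-Skene model.
    labels: list of (worker k, question i, answer l) with 0-based integer ids."""
    # n[k, i, l] = number of times worker k answered l on question i
    n = np.zeros((n_workers, n_questions, n_classes))
    for k, i, l in labels:
        n[k, i, l] += 1
    # Initialize the posterior over true answers by soft majority voting
    T = n.sum(axis=0) + 1e-9
    T /= T.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class priors p_j and confusion matrices pi[k, j, l]
        p = T.mean(axis=0) + 1e-9
        pi = np.einsum('ij,kil->kjl', T, n) + 1e-6
        pi /= pi.sum(axis=2, keepdims=True)
        # E-step: posterior of each question's correct answer
        log_T = np.log(p) + np.einsum('kil,kjl->ij', n, np.log(pi))
        log_T -= log_T.max(axis=1, keepdims=True)
        T = np.exp(log_T)
        T /= T.sum(axis=1, keepdims=True)
    return T, p, pi

# Toy usage: 3 workers, 2 questions, 2 classes
labels = [(0, 0, 1), (1, 0, 1), (2, 0, 0), (0, 1, 0), (1, 1, 0), (2, 1, 0)]
T, p, pi = dawid_skene(labels, n_workers=3, n_questions=2, n_classes=2)
print(T.argmax(axis=1))   # estimated correct answers
```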

GM - Extensions and Theoretical Analysis
- Extensions:
  - Naïve Bayes [Snow et al., EMNLP'08]
  - Finding a good initial point [Zhang et al., NIPS'14]
  - Adding instances' feature vectors [Raykar et al., 2010] [Lakkaraju et al., 2015]
  - Using priors over worker confusion matrices [Raykar et al., 2010] [Liu et al., NIPS'12] [Lakkaraju et al., SDM'15]
  - Clustering workers/instances [Lakkaraju et al., SDM'15]
- Theoretical analysis:
  - Error bounds [Li et al., 2013] [Zhang et al., NIPS'14]

GM - GLAD Model
- Each image belongs to one of two possible categories of interest, i.e., binary labeling.
- Known variables: the observed labels.
[Whitehill et al., NIPS'09]

GM - GLAD Model
- Observed label $L_{ij}$ (worker $i$, image $j$); true label $Z_j$.
- $p(L_{ij} = Z_j \mid \alpha_i, \beta_j) = \frac{1}{1 + e^{-\alpha_i \beta_j}}$; $\alpha_i \beta_j$ is the log-odds of the obtained label being correct.
- $\alpha_i$: the worker's accuracy (always correct: $\alpha_i \to +\infty$; always wrong: $\alpha_i \to -\infty$).
- $1/\beta_j$: the difficulty of the image (very ambiguous: $1/\beta_j \to \infty$; very easy: $1/\beta_j \to 0$).
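A tiny illustration of the labeling probability above; the parameter values are made up:

```python
import math

def glad_correct_prob(alpha_i, beta_j):
    """GLAD: probability that worker i's label on image j equals the true label,
    given worker accuracy alpha_i and inverse image difficulty beta_j."""
    return 1.0 / (1.0 + math.exp(-alpha_i * beta_j))

print(glad_correct_prob(2.0, 1.5))   # skilled worker, easy image -> close to 1
print(glad_correct_prob(0.0, 1.5))   # alpha = 0 -> 0.5, i.e., random guessing
```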

GM - Latent Truth Model (LTM)
- Multiple facts can be true for each entity (object): one book may have 2 authors.
- A source can make multiple claims per entity, where more than one of them can be true: a source may claim a book with 3 authors.
- Sources and objects are each assumed independent: assume book websites and books are independent.
- The majority of the data coming from many sources is not erroneous: trust the majority of the claims.
[Zhao et al., VLDB'12]

GM - Latent Truth Model (LTM)
Example facts:

RID | Entity (book)                         | Attribute (author)
1   | Data Mining: Concepts and Techniques  | Jiawei Han
2   | Data Mining: Concepts and Techniques  | Micheline Kamber
3   | Introduction to Algorithms            | Thomas H. Cormen

[Table: claims on these facts from sources such as Brett's Books and Ecampus.com (marked True/False), together with the latent truth of each fact.]

GM - Latent Truth Model (LTM)
[Figure: graphical model linking each source's false positive rate and sensitivity to the truth of the facts and the observed claims.]

GM - Latent Truth Model (LTM)
- For each source $k$:
  - Generate its false positive rate (with strong regularization, believing most sources have low FPR): $\phi_k^0 \sim \text{Beta}(\alpha_{0,1}, \alpha_{0,0})$
  - Generate its sensitivity (1 - FNR) with a uniform prior, indicating low FNR is more likely: $\phi_k^1 \sim \text{Beta}(\alpha_{1,1}, \alpha_{1,0})$
- For each fact $f$:
  - Generate its prior truth probability with a uniform prior: $\theta_f \sim \text{Beta}(\beta_1, \beta_0)$
  - Generate its truth label: $t_f \sim \text{Bernoulli}(\theta_f)$
- For each claim $c$ of fact $f$, generate the observation of $c$:
  - If $f$ is false, use the false positive rate of the source: $o_c \sim \text{Bernoulli}(\phi_{s_c}^0)$
  - If $f$ is true, use the sensitivity of the source: $o_c \sim \text{Bernoulli}(\phi_{s_c}^1)$
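A simulation of this generative process can be useful for sanity-checking an inference implementation; the sketch below follows the steps on this slide, with illustrative hyperparameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ltm(n_sources, n_facts, claims, a0=(1, 9), a1=(1, 1), beta=(1, 1)):
    """Sample one dataset from the LTM generative process described above.
    claims: list of (source k, fact f) pairs saying which source claims which fact.
    The hyperparameter values here are illustrative placeholders."""
    fpr = rng.beta(a0[0], a0[1], size=n_sources)      # phi_k^0, pushed toward low FPR
    sens = rng.beta(a1[0], a1[1], size=n_sources)     # phi_k^1, sensitivity (1 - FNR)
    theta = rng.beta(beta[0], beta[1], size=n_facts)  # prior truth probability per fact
    truth = rng.random(n_facts) < theta               # t_f ~ Bernoulli(theta_f)
    obs = []
    for k, f in claims:
        p = sens[k] if truth[f] else fpr[k]           # pick the source's rate by the fact's truth
        obs.append((k, f, int(rng.random() < p)))     # o_c ~ Bernoulli(p)
    return truth, obs

truth, obs = sample_ltm(n_sources=3, n_facts=4,
                        claims=[(0, 0), (1, 0), (2, 1), (0, 2), (1, 3)])
print(truth, obs)
```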

GM - Latent Truth Model (LTM)
[Table: results on the book dataset, comparing LTM with baseline methods.]

Optimization Model (OPT)
- General model:
  $\arg\min_{\{w_s\},\{v_o\}} \sum_{o \in O}\sum_{s \in S} g(w_s, v_o) \quad \text{s.t. } \delta_1(\{w_s\}) = 1,\ \delta_2(\{v_o\}) = 1$
- What does the model mean?
  - The optimal solution minimizes the objective function.
  - Jointly estimate the true claims $v_o$ and the source reliabilities $w_s$ under some constraints $\delta_1, \delta_2, \dots$
  - The objective function $g(\cdot,\cdot)$ can be a distance, an entropy, etc.

Optimization Model (OPT)
- General model (as above):
  $\arg\min_{\{w_s\},\{v_o\}} \sum_{o \in O}\sum_{s \in S} g(w_s, v_o) \quad \text{s.t. } \delta_1(\{w_s\}) = 1,\ \delta_2(\{v_o\}) = 1$
- How to solve the problem?
  - Convert the primal problem to its (Lagrangian) dual form.
  - Use block coordinate descent to update the parameters.
  - If each sub-problem is convex and smooth, then convergence is guaranteed.

OPT - CRH Framework
$\min_{\mathcal{X}^{(*)},\mathcal{W}} f(\mathcal{X}^{(*)},\mathcal{W}) = \sum_{k=1}^{K} w_k \sum_{i=1}^{N}\sum_{m=1}^{M} d_m\big(v_{im}^{(*)}, v_{im}^{(k)}\big) \quad \text{s.t. } \delta(\mathcal{W}) = 1,\ \mathcal{W} \ge 0$
- Basic idea: the truths should be close to the observations from reliable sources.
- Minimize the overall weighted distance to the truths, in which reliable sources have high weights.
[Li et al., SIGMOD'14]

OPT - CRH Framework
- Loss function $d_m$: the loss for the data type of the $m$-th property.
  - Outputs a high score when an observation deviates from the truth.
  - Outputs a low score when an observation is close to the truth.
- Constraint function:
  - Without constraints, the objective function may go to $-\infty$.
  - Regularizes the weight distribution.

OPT - CRH Framework
Run the following two steps until convergence (a code sketch follows below):
- Truth computation: minimize the weighted distance between the truth and the sources' observations:
  $v_{im}^{(*)} \leftarrow \arg\min_{v} \sum_{k=1}^{K} w_k\, d_m\big(v, v_{im}^{(k)}\big)$
- Source reliability estimation: assign a weight to each source based on the difference between the truths and the observations made by that source:
  $\mathcal{W} \leftarrow \arg\min_{\mathcal{W}} f\big(\mathcal{X}^{(*)}, \mathcal{W}\big)$
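A minimal sketch of the two alternating steps for a single continuous property, assuming a squared-loss $d_m$ and the weight update that follows from the constraint $\sum_k \exp(-w_k) = 1$ (one common instantiation of CRH, not the full framework):

```python
import numpy as np

def crh_continuous(X, n_iter=20):
    """CRH-style iteration for one continuous property (illustrative sketch).
    X[k, i] is source k's claimed value for entity i (no missing values here)."""
    K, N = X.shape
    w = np.ones(K)                                    # start with equal source weights
    for _ in range(n_iter):
        # Truth computation: a weighted average minimizes the weighted squared loss
        truths = w @ X / w.sum()
        # Source weight update: smaller total loss -> larger weight
        loss = ((X - truths) ** 2).sum(axis=1) + 1e-12
        w = -np.log(loss / loss.sum())                # one common instantiation of the update
    return truths, w

X = np.array([[10.0, 20.0, 30.0],      # a reliable source
              [10.2, 19.8, 30.1],      # another reliable source
              [ 3.0, 55.0, 90.0]])     # an unreliable source
truths, w = crh_continuous(X)
print(truths, w)
```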

OPT - Minimax Entropy
- Workers: $i = 1, 2, \dots, m$; items: $j = 1, 2, \dots, n$; categories: $k = 1, 2, \dots, c$.
- Input: response tensor $Z \in \{0, 1, \text{unknown}\}^{m \times n \times c}$
  - $z_{ijk} = 1$ if worker $i$ labels item $j$ as category $k$
  - $z_{ijk} = 0$ if worker $i$ labels item $j$ as some other category (not $k$)
  - $z_{ijk}$ unknown if worker $i$ does not label item $j$
- Goal: estimate the ground truth $y_{jl}$.
[Zhou et al., NIPS'12]

OPT - Minimax Entropy
[Matrix: observed labels $z_{ij}$, with items $1, \dots, n$ as rows and workers $1, \dots, m$ as columns.]

OPT - Minimax Entropy
[Matrix: distributions $\pi_{ij}$, with items as rows and workers as columns.]
$\pi_{ij}$ is a vector that represents the underlying distribution of the observation, i.e., $z_{ij}$ is drawn from $\pi_{ij}$.

OPT - Minimax Entropy
Column constraint: the number of votes per class per item, $\sum_i z_{ijk}$, should match $\sum_i \pi_{ijk}$.

OPT - Minimax Entropy
Row constraint: the empirical confusion matrix per worker, $\sum_j y_{jl}\, z_{ijk}$, should match $\sum_j y_{jl}\, \pi_{ijk}$. (A sketch of these two constraint statistics follows below.)
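The two constraint statistics can be computed directly from the label tensor; the sketch below only computes these moment-matching quantities and is not the full minimax entropy solver (the labels and the "truth" here are made up):

```python
import numpy as np

# Sketch: the two moment-matching statistics used by the constraints above.
m, n, c = 3, 4, 2                                     # workers, items, categories
rng = np.random.default_rng(1)
z = np.eye(c)[rng.integers(0, c, size=(m, n))]        # z[i, j, :] one-hot worker labels
y = np.eye(c)[rng.integers(0, c, size=n)]             # y[j, :] one-hot (assumed) true labels

col_stat = z.sum(axis=0)                              # per-item vote counts:  sum_i z_ijk
row_stat = np.einsum('jl,ijk->ilk', y, z)             # per-worker confusion:  sum_j y_jl z_ijk
print(col_stat.shape, row_stat.shape)                 # (n, c) and (m, c, c)
# Any model distribution pi[i, j, :] is required to reproduce these same statistics.
```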

OPT - Minimax Entropy
- If we knew the true labels $y_{jl}$: maximize the entropy of $\pi_{ijk}$ under the constraints.

OPT - Minimax Entropy
- To estimate the true labels $y_{jl}$: minimize the maximum entropy of $\pi_{ijk}$.
- Minimizing the entropy is equivalent to minimizing the KL divergence.

Overview
1. Introduction of Crowdsourcing
2. Crowdsourced Data Aggregation for Reliable Information Discovery
3. Passive Crowdsourcing Scenarios
4. Active Crowdsourcing Scenarios
5. Applications and Open Questions
6. Resources
7. References

Aggregation of Passively Crowdsourced Data
More challenges:
- Source correlations
- Spatial-temporal data (e.g., weather)
- Fine-grained worker expertise (e.g., physics, music)

Source Correlations
- Many truth discovery methods assume independent sources: sources provide information independently, and source correlation can be hard to model.
- However, this assumption may be violated in real life.
- Copy relationships between sources: sources can copy information from one or more other sources.
- General correlations of sources:
  - Sources may provide data from complementary domains (negative correlation).
  - Sources may apply common rules in extraction (positive correlation).

Source Dependency
- Known relationships: Apollo-Social [Wang et al., IPSN'14]
  - For a claim, a source may copy from a related source with a certain probability.
  - Uses MLE to estimate the probability of a claim being correct.
- Unknown relationships: Accu-Copy [Dong et al., VLDB'09a] [Dong et al., VLDB'09b]; MSS [Qi et al., WWW'13]
  - Modeled as a PGM.
  - Related sources are grouped together and assigned a group weight.

Copy Relationships between Sources
High-level intuitions for copying detection:
- Common errors imply a copying relation, e.g., many identical errors on $s_1 \cap s_2$ imply that sources 1 and 2 are related.
- Source accuracy inconsistency implies the copy direction, e.g., if $s_1 \cap s_2$ and $s_1 - s_2$ have similar accuracy, but $s_1 \cap s_2$ and $s_2 - s_1$ have different accuracy, then source 2 may be the copier.
- Notation: $s_1 - s_2$ are objects covered by source 1 but not by source 2; $s_1 \cap s_2$ are the common objects; $s_2 - s_1$ are objects covered by source 2 but not by source 1.
[Dong et al., VLDB'09a] [Dong et al., VLDB'09b] [Pochampally et al., SIGMOD'14]

Copy Relationships between Sources
- Incorporate copying detection into truth discovery: alternate among three steps (truth discovery, copy detection, and source accuracy estimation) until convergence.

Spatial-Temporal Data
Challenges of dynamic data:
- Efficiency
- Correlation among entities
- Data smoothness

Real Time Truth Discovery
[Figure: conflicting claims arrive at time $t$ and $t+1$; the truths are evolving over time.]

Real Time Truth Discovery - DynaTD
Challenges of dynamic data:
- Efficiency: when data comes sequentially, the iterative procedure is time-costly.
- Temporal relations exist among entities.
- Source reliability changes: the observed source reliability fluctuates around a certain value.
[Li et al., KDD'15]

Real Time Truth Discovery - DynaTD
- Loss function (similar to [Li et al., SIGMOD'14]): a cumulative loss $L_T = \sum_{t=1}^{T} l_t$, where each per-time-step loss $l_t$ sums, over sources $s$ and the $c_t$ objects claimed at time $t$, the source-weighted squared distance between the claimed values $v_{o,t}^{s}$ and the estimated truths $v_{o,t}$, together with a $\log w_s$ regularization term on the source weights.
- Solution:
  - Establish the equivalence between the optimization problem and the maximization of the error likelihood.
  - Derive an incremental truth discovery algorithm that dynamically updates source weights and computes truths upon the arrival of new data.

Real Time Truth Discovery - DynaTD
- Source reliability evolves over time.
- Update source reliability based on continuously arriving data:
  $p\big(w_s \mid e^{s}_{1:T}\big) \propto p\big(e^{s}_{T} \mid w_s\big)\, p\big(w_s \mid e^{s}_{1:T-1}\big)$

Correlation Among Entities
- Examples: temporal correlation, spatial correlation, etc.
- E.g., traffic condition, weather condition, gas price.

Mobile Sensing
[Figure: mobile sensing, with humans acting as sensors.]

[Figure: what is the PM2.5 value at location $e_1$?]

[Figure: conflicting PM2.5 readings (250, 198, 275) reported for $e_1$.]

Correlation Among Entities
[Figure: entities $e_1, \dots, e_{10}$; one entity is observed by only one sensor, so there is insufficient information for estimation.]

Correlation Among Entities
[Figure: entities $e_1, \dots, e_{10}$; correlation information between entities can help improve the estimation accuracy.]

Correlated Entities – TD-corr
- Input: observations $x_i^{(k)}$ for $N$ entities from $K$ sensors; correlation information among the entities.
- Output: the truth $x_i^{(*)}$ of each entity; the reliability $w_k$ of each sensor.
[Meng et al., SenSys'15]

Correlated Entities – TD-corr
- Variable 1, sensor weight $w_k$: the reliability degree of the information provided by the sensor.
- Variable 2, truth $x_i^{(*)}$: the true value of an entity.

$\min_{X^{(*)},W} f\big(X^{(*)},W\big) = \sum_{k=1}^{K} w_k \sum_{i=1}^{N}\big(x_i^{(*)} - x_i^{(k)}\big)^2 + \alpha \sum_{i}\sum_{i' \in N(i)} S(i,i')\big(x_i^{(*)} - x_{i'}^{(*)}\big)^2 \quad \text{s.t. } \sum_{k=1}^{K}\exp(-w_k) = 1$

- The first term is the weighted deviation from the observations; the second is regularization with correlation information, where the similarity function $S(i,i')$ measures the similarity between correlated entities (e.g., a Gaussian kernel); the constraint function regularizes the weights.

Correlated Entities – TD-corr
- To optimize the objective above efficiently, partition the entities into disjoint independent sets $\{I_1, I_2, \dots, I_J\}$ (there are no correlations within the same set) and solve the problem block by block over these sets (see the sketch below).
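A rough sketch of a TD-corr-style alternation for continuous, fully observed data, assuming a squared loss and a CRH-style weight update; the fixed-point truth update below is a simplified (Jacobi-style) version of the exact block updates over independent sets described in the paper:

```python
import numpy as np

def td_corr(X, S, alpha=1.0, n_iter=30):
    """Sketch of a TD-corr-style update (continuous data, fully observed).
    X[k, i]: sensor k's reading for entity i; S[i, i']: entity similarity (0 if uncorrelated)."""
    K, N = X.shape
    w = np.ones(K)
    truths = X.mean(axis=0)
    for _ in range(n_iter):
        # Truth update: weighted readings plus correlation-regularized neighbor values
        num = w @ X + alpha * (S @ truths)
        den = w.sum() + alpha * S.sum(axis=1)
        truths = num / den
        # Sensor weight update (same exp-constraint form as CRH)
        loss = ((X - truths) ** 2).sum(axis=1) + 1e-12
        w = -np.log(loss / loss.sum())
    return truths, w

# Toy example: entities 0 and 1 are strongly correlated, entity 2 stands alone
X = np.array([[10.0, 10.5, 50.0],
              [ 9.8, 10.2, 49.0],
              [30.0, 31.0, 80.0]])
S = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
print(td_corr(X, S))
```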

Experiments on an Air Quality Sensing System
- Monitor particulate matter with diameter less than 2.5 microns (PM2.5).
- 14 participants equipped with mini-AQM devices.
- Ground truth is collected with a Thermo instrument.
- PM2.5 sensing is conducted in 4 areas of Tsinghua University.

Correlated Entities – TD-corr
[Figure: experimental results; the proposed method performs better, especially when the coverage rates of the sensors are low.]

Long-tail Phenomenon
- Challenge when most sources make only a few claims:
  - Source weights are usually estimated as proportional to the accuracy of the sources.
  - If the long-tail phenomenon occurs, most source weights are not properly estimated.
- Challenge when most entities get only a few claims:
  - If an entity gets very few claims, the estimate of its truth may not be accurate.
- Confidence-aware approaches consider the confidence interval of the estimation.

Long-tail Phenomenon on Sources Side – CATD
- Assume that sources are independent and that the error made by source $s$ is $\epsilon_s \sim N(0, \sigma_s^2)$.
- $\epsilon_{aggregate} = \frac{\sum_{s \in S} w_s \epsilon_s}{\sum_{s \in S} w_s} \sim N\Big(0, \frac{\sum_{s \in S} w_s^2 \sigma_s^2}{(\sum_{s \in S} w_s)^2}\Big)$
- Without loss of generality, we constrain $\sum_{s \in S} w_s = 1$.
- Optimization: minimize the variance of the aggregated error under this constraint.
[Li et al., VLDB'15]

Long-tail Phenomenon on Sources Side – CATD
- The sample variance is not accurate with a small number of samples.
- Find a range of values that can act as good estimates: calculate a confidence interval based on
  $\frac{N_s\, \hat\sigma_s^2}{\sigma_s^2} \sim \chi^2(N_s)$

Long-tail Phenomenon on Sources Side – CATD
- Consider the possibly worst scenario of $\sigma_s^2$: use the upper bound of the 95% confidence interval of $\sigma_s^2$:
  $u_s^2 = \frac{\sum_{n \in N_s}\big(x_n^{s} - x_n^{(0)}\big)^2}{\chi^2_{(0.05,\, N_s)}}$
  where $x_n^{(0)}$ is the current truth estimate for entity $n$ and $N_s$ indexes the claims made by source $s$.

Long-tail Phenomenon on Sources Side – CATD
- Closed-form solution (see the sketch below):
  $w_s \propto \frac{1}{u_s^2} = \frac{\chi^2_{(0.05,\, N_s)}}{\sum_{n \in N_s}\big(x_n^{s} - x_n^{(0)}\big)^2}$
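A sketch of the closed-form weights using SciPy's chi-squared quantile function, assuming every source claims every entity (with missing data, $N_s$ would differ per source); the toy data is made up:

```python
import numpy as np
from scipy.stats import chi2

def catd_weights(X, truths, alpha=0.05):
    """CATD-style source weights (sketch).
    X[s, n]: source s's claim on entity n (fully observed); truths[n]: current truth estimates.
    w_s is proportional to chi2.ppf(alpha, N_s) / sum_n (x_n^s - x_n^(0))^2, i.e. the reciprocal
    of the upper bound of the 95% confidence interval of the source's error variance."""
    S, N = X.shape
    sq_err = ((X - truths) ** 2).sum(axis=1) + 1e-12
    n_claims = np.full(S, N)                      # with missing data this would differ per source
    w = chi2.ppf(alpha, df=n_claims) / sq_err
    return w / w.sum()                            # normalize so the weights sum to 1

X = np.array([[10.0, 20.0, 30.0],
              [10.1, 19.9, 30.2],
              [15.0, 28.0, 40.0]])
truths = X.mean(axis=0)
print(catd_weights(X, truths))
```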

Long-tail Phenomenon on Sources Side – CATD
[Figure: worked example of calculating the confidence interval.]

Long-tail Phenomenon on Sources Side – CATD
[Figure: worked example of calculating the source weight.]

Long-tail Phenomenon on Sources Side – CATD
[Table: error rates of Majority Voting and CATD on the Game dataset by question level (a higher level indicates harder questions).]

Long-tail Phenomenon on Claim Side – ETCIBoot
- Provides estimates of confidence intervals (CIs) for each entity's truth.
- Bootstrap: from the observations $X = \{x_1, x_2, \dots, x_n\}$, draw bootstrap samples $X^{*1}, X^{*2}, \dots, X^{*B}$ and compute an estimate $\hat\theta(X^{*b})$ from each.
[Xiao et al., KDD'16]

Long-tail Phenomenon on Claim Side – ETCIBoot
- Derive confidence intervals from the bootstrap samples (see the sketch below).
[Figure: confidence intervals obtained on the indoor floorplan dataset.]
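ETCIBoot combines truth discovery with bootstrapping; the sketch below shows only the generic percentile-bootstrap confidence-interval step, using a plain mean in place of the weighted truth-discovery estimator:

```python
import numpy as np

def bootstrap_ci(x, B=2000, level=0.95, estimator=np.mean, seed=0):
    """Percentile-bootstrap confidence interval for one entity's truth (sketch).
    x: the claims observed for this entity; estimator: how a single sample is aggregated."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    stats = np.array([estimator(rng.choice(x, size=len(x), replace=True))
                      for _ in range(B)])
    lo, hi = np.percentile(stats, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return estimator(x), (lo, hi)

# An entity in the "long tail" with only five claims gets a wide interval
print(bootstrap_ci([10.2, 9.8, 10.5, 30.0, 10.1]))
```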

Fine-Grained Truth Discovery - FaitCrowd
- Goal: learn fine-grained (topic-level) user expertise and the truths from conflicting crowd-contributed answers.
- Topics are learned from the question and answer texts (e.g., politics, physics, music).
[Ma et al., KDD'15]

Fine-Grained Truth Discovery - FaitCrowd
- Input: a question set, a user set, and an answer set.
- Output: each question's topic, topic-level user expertise, and the truth for each question.
[Tables: example questions $q_1$-$q_6$ with users' answers, inferred topics, estimated truths, and the ground truth.]

Fine-Grained Truth Discovery - FaitCrowd
[Figure: plate diagram of the model, with a question-content part (words $w_{qm}$ and topic assignments) and an answer part (users' answers $a_{qu}$, truths $t_q$, and topical expertise $e$), together with hyperparameters and intermediate variables.]
- Jointly model question content and users' answers by introducing latent topics.
- Modeling question content helps estimate reasonable user reliability, and in turn, modeling answers leads to the discovery of meaningful topics.
- Learn topics, topic-level user expertise, and truths simultaneously.

Fine-Grained Truth Discovery - FaitCrowd
- Answer generation: the correctness of a user's answer may be affected by the question's topic, the user's expertise on that topic, and the question's bias.
  - Draw the user's expertise.
  - Draw the truth.
  - Draw the bias.
  - Draw the user's answer.

Fine-Grained Truth Discovery - FaitCrowd
[Table: error rates on the Game dataset by question level, comparing FaitCrowd with baseline methods.]

Overview
1. Introduction of Crowdsourcing
2. Crowdsourced Data Aggregation for Reliable Information Discovery
3. Passive Crowdsourcing Scenarios
4. Active Crowdsourcing Scenarios
5. Applications and Open Questions
6. Resources
7. References

Active Crowdsourcing
[Figure: a requester issues tasks to workers and collects their answers.]

Active Crowdsourcing Scenarios
Challenges:
- Budget allocation (across Task 1, Task 2, Task 3, ...)
- Privacy protection
- Incentive mechanisms

Budget Allocation
- Since active crowdsourcing costs money, we need to use the budget wisely.
- Budget allocation questions:
  - Which instance should we query for labels?
  - Which worker should we choose for a certain task?
- Goal: maximize utility (e.g., overall accuracy).

Maximize Accuracy – Opt-KG
- Estimate the labeling ambiguity of each instance on the fly. Intuition:
  - Avoid spending much budget on fairly easy instances.
  - Avoid spending much budget on a few highly ambiguous instances; ideally, set those few highly ambiguous instances aside to save budget.
- Estimate the reliability of each worker on the fly, and allocate as many labeling tasks as possible to reliable workers.
[Chen et al., ICML'13]

Problem Settings
- $N$ independent binary instances with true labels $Z_i \in \{+1, -1\}$.
- Instance difficulty: $\theta_i = P(Z_i = +1)$, the relative frequency of $+1$ as the number of workers approaches infinity; $P(Z_i = +1) \approx 0.5$ means the instance is hard.
- Workers are noiseless (in the basic model): $P(y_{ij} = +1) = \theta_i$, where $y_{ij}$ is worker $j$'s label for instance $i$.
- Labels for instance $i$ are drawn i.i.d. from Bernoulli($\theta_i$).

Bayesian Setting
- $\theta_i$ is drawn from a known Beta prior distribution Beta$(a_i^0, b_i^0)$: it is as if we had $a_i^0$ positive and $b_i^0$ negative pseudo-labels for the $i$-th instance at the initial stage.
- Posterior update after querying instance $i_t$ at stage $t$:
  $\text{Beta}\big(a_{i_t}^{t+1}, b_{i_t}^{t+1}\big) = \begin{cases} \text{Beta}\big(a_{i_t}^{t}+1,\ b_{i_t}^{t}\big) & \text{if } y_{i_t} = +1 \\ \text{Beta}\big(a_{i_t}^{t},\ b_{i_t}^{t}+1\big) & \text{if } y_{i_t} = -1 \end{cases}$

Maximize Accuracy – Opt-KG
- Formally, maximize the expected accuracy taken over the sample paths $(i_0, y_{i_0}), \dots, (i_{T-1}, y_{i_{T-1}})$ generated by a policy $\pi$.
- Stage-wise rewards (a sketch follows below):
  - If the next label is $+1$: $R_i^{+1,t}(a,b) = h\big(I(a+1,b)\big) - h\big(I(a,b)\big)$
  - If the next label is $-1$: $R_i^{-1,t}(a,b) = h\big(I(a,b+1)\big) - h\big(I(a,b)\big)$
  where $h(x) = \max(x, 1-x)$ and $I(a,b)$ is the CDF of Beta$(a,b)$ evaluated at $x = 0.5$.
- Greedy (optimistic knowledge-gradient) strategy: $R(S^t, i) = \max\big(R_i^{+1,t}, R_i^{-1,t}\big)$; query the instance with the largest $R(S^t, i)$.
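A sketch of the stage-wise reward and the greedy selection, using SciPy's Beta CDF; the posterior counts in the toy example are made up:

```python
from scipy.stats import beta

def h(x):
    return max(x, 1.0 - x)

def I(a, b):
    # CDF of Beta(a, b) at 0.5, i.e. the posterior probability that theta <= 0.5
    return beta.cdf(0.5, a, b)

def optkg_reward(a, b):
    """Optimistic knowledge-gradient score for one instance with posterior Beta(a, b)."""
    r_pos = h(I(a + 1, b)) - h(I(a, b))   # reward if the next label is +1
    r_neg = h(I(a, b + 1)) - h(I(a, b))   # reward if the next label is -1
    return max(r_pos, r_neg)

# Greedy selection over three instances with different posteriors
posteriors = {"Q1": (9, 2), "Q2": (4, 2), "Q3": (3, 3)}
scores = {q: optkg_reward(a, b) for q, (a, b) in posteriors.items()}
pick = max(scores, key=scores.get)
print(scores, "-> query", pick)
```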

Maximize Accuracy – Opt-KG
[Figure: a running example with three questions Q1, Q2, Q3 whose current accuracies are 81%, 69%, and 50%; Opt-KG computes the stage-wise reward for each question and selects the one with the highest reward.]

Challenges Under a Tight Budget
- Quantity and quality trade-off: spread the budget over all of Q1, Q2, Q3, or concentrate it on fewer questions?
- Different requirements of quality, e.g., "I want my results to not be randomly guessed." or "I will approve a result if more than 75% of the workers agree on that label."
[Li et al., WSDM'16]

Maximize Quantity – Requallo
- Inputs:
  - The requester's requirement.
  - The budget $T$: the maximum number of labels that can be afforded.
- Goal: label as many instances as possible that achieve the requirement, under the budget.

Examples of Requirements
- Minimum ratio: approve the result on an instance if $a_i : b_i \ge c$ or $b_i : a_i \ge c$; this is equivalent to setting a threshold on the entropy of the votes (see the sketch below).
- Hypothesis test: use Fisher's exact test to check whether the labels are randomly guessed; calculate the p-value and approve the result if the p-value $\le \alpha$.
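A small sketch of the minimum-ratio requirement and the corresponding vote-entropy view (the threshold value c = 4 is an arbitrary example):

```python
import math

def meets_min_ratio(a, b, c=4):
    """Minimum-ratio requirement: approve if a:b >= c or b:a >= c."""
    return a >= c * b or b >= c * a

def vote_entropy(a, b):
    """Entropy of the empirical vote distribution; the minimum-ratio rule corresponds
    to approving when this entropy falls below a threshold."""
    p = a / (a + b)
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for a, b in [(8, 2), (6, 4)]:
    print((a, b), meets_min_ratio(a, b), round(vote_entropy(a, b), 3))
```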

Completeness
- The ratio between the observed total vote count and the minimum count of labels needed to achieve the requirement, denoted as
  $\frac{a_i + b_i}{r(a_i, b_i \mid Z_i)}$
  where $a_i + b_i$ is the observed total vote count and $r(a_i, b_i \mid Z_i)$ is the minimum count needed to achieve the requirement given the true label $Z_i$.

Completeness
- Example: $a_i = 3$, $b_i = 1$, and the requirement is a minimum ratio of 4.
  - If $Z_i = +1$: the requirement needs $a_i \ge 4 b_i$, so at least $4 + 1 = 5$ votes; completeness $= \frac{3+1}{5} = \frac{4}{5}$.
  - If $Z_i = -1$: the requirement needs $b_i \ge 4 a_i$, so at least $3 + 12 = 15$ votes; completeness $= \frac{3+1}{15} = \frac{4}{15}$.

Expected Completeness
$V_i(a_i, b_i) = P(Z_i = +1 \mid a_i, b_i)\,\frac{a_i + b_i}{r(a_i, b_i \mid Z_i = +1)} + P(Z_i = -1 \mid a_i, b_i)\,\frac{a_i + b_i}{r(a_i, b_i \mid Z_i = -1)}$
i.e., the completeness given each possible true label, weighted by that label's posterior probability.

Maximize Quantity – Requallo
- The goal is to label as many instances as possible that achieve the quality requirement.
- Stage-wise reward (a sketch follows below):
  $R_i^{+1,t} = V_i\big(a_i^t + 1,\ b_i^t\big) - V_i\big(a_i^t,\ b_i^t\big)$, $\quad R_i^{-1,t} = V_i\big(a_i^t,\ b_i^t + 1\big) - V_i\big(a_i^t,\ b_i^t\big)$
- Greedy strategy: $R(S^t, i) = \max\big(R_i^{+1,t}, R_i^{-1,t}\big)$; query the instance with the largest reward.
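A sketch of completeness, expected completeness, and the stage-wise reward for the minimum-ratio requirement; the posterior P(Z_i = +1 | a_i, b_i) below uses a uniform Beta(1,1) prior, which is an assumption of this sketch rather than the paper's exact choice:

```python
def min_count(a, b, z, c=4):
    """Minimum total number of votes needed to satisfy the minimum-ratio requirement
    (ratio c), given current counts (a, b) and assuming the true label is z."""
    if z == +1:
        return max(a, c * b) + b       # need a >= c*b
    return a + max(b, c * a)           # need b >= c*a

def completeness(a, b, z, c=4):
    return (a + b) / min_count(a, b, z, c)

def expected_completeness(a, b, c=4):
    # Posterior P(Z=+1 | a, b) under a uniform Beta(1,1) prior (an assumption of this sketch)
    p_pos = (a + 1) / (a + b + 2)
    return p_pos * completeness(a, b, +1, c) + (1 - p_pos) * completeness(a, b, -1, c)

def requallo_reward(a, b, c=4):
    """Stage-wise reward: the larger expected-completeness gain over the two possible labels."""
    base = expected_completeness(a, b, c)
    return max(expected_completeness(a + 1, b, c) - base,
               expected_completeness(a, b + 1, c) - base)

# Reproducing the slide's example: a=3, b=1, minimum ratio 4
print(completeness(3, 1, +1), completeness(3, 1, -1))   # 0.8 (=4/5) and 0.2666... (=4/15)
print(requallo_reward(3, 1))
```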

Maximize Quantity – Requallo
[Figure: a running example with a requirement of a minimum ratio of 3 and three questions Q1, Q2, Q3; Requallo computes each question's completeness and stage-wise reward, and selects the question with the highest reward.]

Crowdsourcing for Machine Learning
- Crowdsourced labels for machine learning.
- Labeling by machine can save more money.
- Pros: labeling is cheap. Cons: workers are noisy.
- Solution: reduce the noise in the annotations.

Crowdsourcing for Active Learning
- Active learning:
  - Motivation: budget. Goal: query as few instances as possible to train a good classifier.
  - Which to query? The most "informative" instances (uncertainty, density, influence, ...).
- Active learning with crowdsourced labels:
  - Workers are weak oracles; instances can be queried multiple times.
  - Which to query? How to query?

Active Learning with Crowdsourced Labels
- Strategy 1: query the "best" worker [Yan et al., ICML'11].
- Strategy 2: repeated labeling [Mozafari et al., VLDB'14]: once an instance is queried, query multiple workers.
- Strategy 3: joint design: jointly consider model uncertainty and label uncertainty, combined in different ways [Sheng et al., KDD'08] [Zhao et al., PASSAT'11].

Incentive Mechanism
- Goal: design payment mechanisms to incentivize workers.
- A win-win strategy:
  - For requesters: get the optimal utility (e.g., quality, profit, etc.) for their expense.
  - For workers: get the maximal payment if they follow the rules.

Incentive Mechanism – Double or Nothing
- Incentive compatibility: encourage the worker to skip the questions about which she is unsure. Reason: for the questions a worker is not sure of, her answers could be very unreliable.
- No-free-lunch: if all the questions attempted by the worker are answered incorrectly, then the payment must be zero.
[Shah & Zhou, NIPS'15]

Incentive Mechanism – Double or Nothing
- Input:
  - Confidence threshold $T$
  - Budget $\mu$
  - Evaluations $x_1, \dots, x_G \in \{-1, 0, +1\}^G$ of the worker's answers to the $G$ gold standard questions (wrong / skipped / correct)
- Let $C = \sum_{i=1}^{G} \mathbb{1}[x_i = +1]$ be the number of correct answers and $W = \sum_{i=1}^{G} \mathbb{1}[x_i = -1]$ the number of wrong answers.
- Payment: [transcription truncated]
