Tweet URL Analysis 2018/05/01 Virginia Tech, Blacksburg VA

Transcription

Tweet URL Analysis
Guoxin Sun, Kehan Lyu, Liyan Li
CS 4624 Multimedia, Hypertext, and Information Access
Dr. Edward Fox
Virginia Tech, Blacksburg VA 24061
2018/05/01

Overview
Recap
Issues
Results
Future plan
Acknowledgement
References

Recap
Analyze the characteristics of URLs embedded in tweets.
Figure 1: Architecture of the URL Analysis System [1]

Issues
1. Bad separator for long URL files
   http://www.theictm.org/big-diabetes. paugustine ekb3vfg74m thei.
2. Halt caused by using articleDateExtractor library
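The slides do not say how the halt in issue 2 was resolved. One common way to keep a single slow extraction from stalling a whole run is to put a time limit on each call. The sketch below is only an illustration under that assumption: `run_with_timeout` is a hypothetical helper, and `func` stands for whatever function wraps the date extraction for one URL.

```python
import threading

def run_with_timeout(func, arg, timeout_seconds):
    """Run func(arg) in a daemon thread; return its result, or None on timeout or error."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(arg)
        except Exception:
            # Treat any extraction failure the same as "no date found".
            outcome["value"] = None

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    # If the worker has not finished yet, "value" is absent and we give up on this URL.
    return outcome.get("value")
```

The worker thread is not killed on timeout; it is only abandoned, so a stuck call keeps running in the background until the process exits. A process-based variant would be needed for a hard stop.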

URL Characteristic Analysis
Percentage of URL(s) with Keyword per year
Percentage of Tweets with URL(s) per year
Tweets with Different Numbers of URL(s)
Percentage of Unique URL(s) in Tweet Collections
Percentage of Unique URL(s) with different status codes
Percentage of successfully retrieved URL(s) per year
Time interval between Tweet Post Date and Webpage Date
Time interval between Tweet Post Date and Wayback Machine Archive Date
Top 10 Domains in Tweets/Retweets
Top 10 URLs in Tweets/Wayback Machine
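As an illustration of one of these measures, a "Top 10 Domains" ranking could be computed roughly as follows. This is a minimal sketch, not the project's code: `top_domains` is a hypothetical name, the input is assumed to be a list of already expanded URLs, and stripping a leading "www." is an assumed normalization.

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, n=10):
    """Return the n most frequent host names as (domain, count) pairs."""
    counts = Counter()
    for url in urls:
        # hostname is lowercased and excludes any port; it is None if no host was parsed.
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]  # assumed normalization: treat www.example.com as example.com
        if host:
            counts[host] += 1
    return counts.most_common(n)
```

For example, `top_domains(["http://www.example.com/a", "https://example.com/b"])` counts both URLs under `example.com`.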

Percentage of Tweets with URL(s) per year
Statistics:
50% of Tweets have URLs on average
People were more interested in embedding URLs in Tweets from 2013 to 2015
The interest faded away from 2015 to 2017
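A per-year percentage like this one could be computed along the following lines. The sketch assumes each tweet is available as a `(year, text)` pair and uses a simple `https?://` pattern to detect URLs; both the input shape and the pattern are assumptions, not details taken from the slides.

```python
import re
from collections import defaultdict

# Assumed URL detector: any http:// or https:// token up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def url_share_by_year(tweets):
    """tweets: iterable of (year, text). Returns {year: percent of tweets with at least one URL}."""
    totals = defaultdict(int)
    with_url = defaultdict(int)
    for year, text in tweets:
        totals[year] += 1
        if URL_PATTERN.search(text):
            with_url[year] += 1
    return {year: 100.0 * with_url[year] / totals[year] for year in sorted(totals)}
```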

Tweets with Different Number of URL(s)
Statistics:
90% of Tweets have 1 URL
10% of Tweets have 2 URLs
Less than 1% of Tweets have 3 or more URLs
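The breakdown by number of URLs can be sketched as a small counting step. Since the three figures add up to roughly 100%, the sketch assumes the percentages are taken over tweets that contain at least one URL; that reading, the function name, and the URL pattern are all assumptions.

```python
import re
from collections import Counter

# Assumed URL detector: any http:// or https:// token up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def url_count_distribution(texts):
    """Return percent of URL-bearing tweets with 1, 2, or 3+ URLs."""
    buckets = Counter()
    for text in texts:
        n = len(URL_PATTERN.findall(text))
        if n == 0:
            continue  # tweets without URLs are left out of the denominator (assumption)
        buckets["3+" if n >= 3 else str(n)] += 1
    total = sum(buckets.values())
    if total == 0:
        return {}
    return {label: 100.0 * buckets[label] / total for label in ("1", "2", "3+")}
```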

Percentage of Unique URL(s) with different status code
Statistics:
55% to 70% of URLs have status code 2xx
25% to 42% of URLs have status code 4xx
Around 1% of URLs have other status codes
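Grouping HTTP status codes into classes such as 2xx and 4xx is a matter of integer division by 100. The following sketch assumes the status codes for the unique URLs have already been collected into a list of integers; `status_class_shares` is a hypothetical name.

```python
from collections import Counter

def status_class_shares(status_codes):
    """Return {class label: percent of codes}, e.g. {'2xx': 50.0, '4xx': 25.0}."""
    classes = Counter(f"{code // 100}xx" for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        return {}
    return {label: 100.0 * count / total for label, count in classes.items()}
```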

Percentage of successfully retrieved URL(s) per year
Statistics:
URLs in earlier Tweets have a higher chance of being archived by the Wayback Machine

Time interval between Tweet Post Date and Webpage Date
Statistics:
Most Tweets were posted on the same day as the Webpage was posted.

Time interval between Tweet Post Date and Wayback Machine Archive Date
Statistics:
Most archived URLs were archived within 5 days of the Tweet's post date
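An interval of this kind reduces to a date subtraction. The sketch below assumes the tweet date is stored as `YYYY-MM-DD` (an assumption) and that the archive date is a 14-digit Wayback Machine timestamp of the form `YYYYMMDDhhmmss`; `days_until_archived` is a hypothetical helper name.

```python
from datetime import datetime

def days_until_archived(tweet_date, wayback_timestamp):
    """Whole days from the tweet's post date to the archive capture (negative if archived earlier)."""
    posted = datetime.strptime(tweet_date, "%Y-%m-%d").date()
    archived = datetime.strptime(wayback_timestamp, "%Y%m%d%H%M%S").date()
    return (archived - posted).days
```

A result between 0 and 5 would fall into the "within 5 days" group reported above; a negative result means the page was captured before the tweet was posted.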

Future Plan
Finalizing the report
Analyzing more collections

Future Plan - Possible Improvement
Utilizing idle machines

Acknowledgement
Liuqing Li
Graduate Research Assistant in DLRL (Digital Library Research Laboratory)
Ph.D. candidate, Department of Computer Science, Virginia Tech
Thanks go to NSF for support by grant IIS-1619028.

References
1. Liuqing Li and Edward A. Fox. 2018. A Study of Historical Short URLs in Event Collections of Tweets. Web Archiving and Digital Libraries (WADL 2018), a workshop held in conjunction with ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL
2. ayback/tree/master/wayback-cdx-server, access date: 10 April 2018
3. https://archive.org/help/wayback_api.php, access date: 10 April 2018
4. http://urlex.org, access date: 10 April 2018

Thank you!
Questions?
