

Large Scale Data Analytics of User Behavior for Improving Content Delivery

Athula Balachandran

CMU-CS-14-142
December 2014

School of Computer Science
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Srinivasan Seshan, Co-Chair
Vyas Sekar, Co-Chair
Hui Zhang
Peter Steenkiste
Aditya Akella, University of Wisconsin-Madison

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2014 Athula Balachandran

This research was sponsored by the National Science Foundation under grant numbers CNS-0721857, CNS-0905277, and CNS-1040801; and the U.S. Army Research Office under grant number W911NF0910273.

The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.

Keywords: data analytics, machine learning, user behavior, user experience, content delivery, peer-to-peer, video streaming, web browsing

To Achan and Amma


Abstract

The Internet is fast becoming the de facto content delivery network of the world, supplanting TV and physical media as the primary method of distributing larger files to ever-increasing numbers of users at the fastest possible speeds. Recent trends have, however, posed challenges to various players in the Internet content delivery ecosystem. These trends include exponentially increasing traffic volume, increasing user expectation for quality of content delivery, and the ubiquity and rise of mobile traffic.

For example, exponentially increasing traffic, primarily caused by the popularity of Internet video, is stressing the existing Content Delivery Network (CDN) infrastructures. Similarly, content providers want to improve user experience to match the increasing user expectation in order to retain users and sustain their advertisement-based and subscription-based revenue models. Finally, although mobile traffic is increasing, cellular networks are not as well designed as their wireline counterparts, causing poorer quality of experience for mobile users. These challenges are faced by content providers, CDNs and network operators everywhere and they seek to design and manage their networks better to improve content delivery and provide better quality of experience.

This thesis identifies a new opportunity to tackle these challenges with the help of big data analytics. We show that large-scale analytics on user behavior data can be used to inform the design of different aspects of the content delivery systems. Specifically, we show that insights from large-scale analytics can lead to better resource provisioning to augment the existing CDN infrastructure and tackle increasing traffic. Further, we build predictive models using machine learning techniques to understand users’ expectations for quality. These models can be used to improve users’ quality of experience. Similarly, we show that even mobile network operators who do not have access to client-side or server-side logs on user access patterns can use large-scale data analytics techniques to extract user behavior from network traces and build machine learning models that help configure the network better for improved content delivery.


Acknowledgments

This dissertation marks the end of a very long chapter in my life, and serves as the stepping stone to another exciting and maybe a tad intimidating chapter. I’ve been in “school” of some form for as long as I can remember, going on 24 years now, and this dissertation is my cue to say goodbye to school and to step into the “real world”. While working on this research, I have learned so much from my mentors, colleagues, and friends, and I feel strangely confident that this journey has prepared me for any challenge I may face in work or life.

I’m lucky to have such a stellar group of researchers on my thesis committee. My advisor, Srini Seshan, was the best advisor I could have hoped for to guide me through this PhD. As a green first-year, Srini showed me the ropes on the hot topics in networking, and gave me the freedom to work on what I found interesting. Srini sees the field from a vantage point that few others do, and he was able to teach me which trees yield fruits and which ones do not. His positive attitude gave me the confidence to keep going after paper rejections.

Vyas Sekar has been involved in my research since the day I arrived at CMU: first as a mentor, then a collaborator, and now as my co-advisor. Vyas was always available to help with writing or ideas. With his razor-sharp intuition, Vyas would have a solution to any problem I faced, and he’d be ready to jump on Skype or Hangouts to help me day or night. I also have Vyas to thank for helping me improve my shoddy writing over the years.

I’m fortunate that I’ve been able to pick Aditya Akella’s brain during our several calls; his suggestions largely paved the way for our IMC 2013 paper. I’m also immensely grateful to Hui Zhang; he trusted a novice PhD student with production data at Conviva, which was the keystone of the IMC and SIGCOMM papers, and indeed the turning point for my PhD. Peter Steenkiste kept tabs on my research, encouraged me to present my work during Tuesday seminars, and more lately, helped greatly improve this dissertation with his detailed feedback.

Outside of my thesis committee, I’m fortunate to have met and collaborated with several luminaries in the field. In my first summer at Intel Research, Nina Taft, Kevin Fall, and Gianluca Iannacone took me under their wing. Ion Stoica gave guidance for the QoE work during my visit to Conviva. I am also indebted to folks at Conviva, especially Dilip, Xi, Dima and Jibin, for answering several questions about the Conviva data and cluster. Jeff Pang, my mentor during my internship at AT&T Labs, guided me through the mobile QoE work that became our Mobicom paper. I am also indebted to my other co-authors: Ashok Anand, Emir Halevopic, Shobha Venkataraman, He Yan and Vaneet Agarwal.

Gloomy winters and research troubles were more than neutralized by an amazing set of friends within and outside the department. My officemates Soonho, Dongsu, George, and Carol helped take my mind off work through numerous fun conversations. I’ll also fondly remember chats with other GHC 7th floor denizens: Anvesh, Ankit, Mehdi, Sarah, Dana, Yair, Gabriel, Joao, David, Matt, Junchen, and Richard. I am also indebted to Deborah Cavlovich, Angela Miller and the rest of the excellent staff and faculty at CMU for their support and help over the years. My housemates and neighbors made my life at home fun. Sunayana, Bhavana, Meghana, Anjali, Kriti and Ruta have been great company and have let me partake in their delicious home-cooked food more times than I can remember. My friends from undergrad at CMU (Leela, Vivek, and Srivatsan) filled me in on insti news and memories. Though separated by longer distances, my undergrad buddies Nitya, Pranava and Deepa helped me let off steam over the phone, Skype, or occasional visits.

Through my entire academic journey, my parents have encouraged me even when I did not believe in myself; this work is as much their effort as it is mine. My grandmother and uncle have been pillars of support from when I was young. My relatives and cousins in the US were my home away from home over the past five years. Over the past year, I’m also fortunate to have had the encouragement of my parents-in-law. I am extremely blessed to have an amazing companion. Having done a PhD in networking himself, Anirudh was very understanding of the ways of a PhD student. He was the source of my sanity, a sounding board for my random ideas and an excellent proof-reader for my drafts and slides.

Contents

1 Introduction
  1.1 Background and Scope
    1.1.1 Content Delivery Ecosystem
    1.1.2 Big Data Analytics
    1.1.3 Thesis Scope
  1.2 Thesis Statement and Approach
  1.3 Thesis Contributions
  1.4 Dissertation Outline

2 Large-Scale Data Analytics for CDN Resource Management
  2.1 Dataset
  2.2 Analyzing Telco-CDN federation
    2.2.1 User Access Patterns
    2.2.2 System Model
    2.2.3 Global provisioning problem
    2.2.4 Evaluation
    2.2.5 Main observations
  2.3 Analyzing hybrid P2P-CDN
    2.3.1 User Access Patterns
    2.3.2 Revisiting P2P-CDN benefits
    2.3.3 Main observations
  2.4 Related Work
  2.5 Chapter Summary

3 Developing a Predictive Model for Internet Video Quality-of-Experience
  3.1 Motivation and Challenges
    3.1.1 Problem scope
    3.1.2 Dataset
    3.1.3 Challenges in developing video QoE
  3.2 Approach Overview
    3.2.1 Roadmap
    3.2.2 Machine learning building blocks
    3.2.3 Limitations
  3.3 Identifying Confounding Factors
    3.3.1 Approach
    3.3.2 Analysis results
    3.3.3 Summary of main observations
  3.4 Addressing confounding factors
    3.4.1 Candidate approaches
    3.4.2 Results
    3.4.3 Proposed predictive model
  3.5 Implications for system design
    3.5.1 Overview of a video control plane
    3.5.2 Quality model
    3.5.3 Strategies
    3.5.4 Evaluation
  3.6 Discussion
  3.7 Related Work
  3.8 Chapter Summary

4 Predictive Analytics for Extracting and Monitoring Web Performance over Cellular Networks
  4.1 Background
    4.1.1 Cellular Network Architecture
    4.1.2 Data Collection Apparatus
    4.1.3 Applications of Web QoE Model
  4.2 Related Work
  4.3 Extracting User Experience Metrics
    4.3.1 Detecting Clicks
    4.3.2 Measuring User Experience
  4.4 Analyzing Network Factors
    4.4.1 How network factors impact web QoE
    4.4.2 Analysis on Other Websites
    4.4.3 Comparison with Other Mobile Applications
    4.4.4 Dependencies and Other Factors
  4.5 Modeling Web QoE
    4.5.1 Evaluation
    4.5.2 Insights and Discussion
  4.6 Discussion
  4.7 Chapter Summary

5 Conclusions and Future Work
  5.1 Summary of Approach
    5.1.1 CDN Resource Management for Handling Increasing Traffic
    5.1.2 Predictive Model for Improving Video Quality of Experience
    5.1.3 Predictive Analytics for Extracting and Monitoring Cellular Web QoE
    5.1.4 Summary of Thesis Contributions
  5.2 Lessons Learned
  5.3 Future Work
    5.3.1 Improved techniques to re-learn and refresh models
    5.3.2 Fine-grained video quality metrics using intra-session analysis
    5.3.3 Web QoE model for Cellular Network Operations
    5.3.4 Predictive Analytics for Other Aspects of Content Delivery

Bibliography


List of Figures

1.1 Overview of the Internet content delivery ecosystem
1.2 Flow of information during content delivery from CDNs to ISPs to Users. We look at how we can use large-scale data analytics to help improve content delivery at each point in the flow.
2.1 The result shows the CDF of the correlation coefficient between the #views and the population of the regio
