Marko Grobelnik Marko.grobelnik@ijs.si Jozef Stefan . - PlanetData

Transcription

Marko Grobelnik
marko.grobelnik@ijs.si
Jozef Stefan Institute, Ljubljana, Slovenia
Stavanger, May 8th 2012

Introduction
  What is Big-Data?
  Why Big-Data?
  When is Big-Data really a problem?
Techniques
Tools
Applications
Literature

‘Big-data’ is similar to ‘Small-data’, but bigger. Having bigger data, however, requires different approaches (techniques, tools & architectures) to solve new problems, and old problems in a better way.

From “Understanding Big Data” by IBM

Big-Data

Key enablers for the growth of “Big Data” are:
  Increase of storage capacities
  Increase of processing power
  Availability of data

NoSQL Databases: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
MapReduce: Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
Storage: S3, Hadoop Distributed File System
Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku
Processing: R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop

Big-data is really a problem when the operations on data are complex:
  e.g. simple counting is not a complex problem
  Modeling and reasoning with data of different kinds can get extremely complex
Good news about big-data:
  Often, because of the vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics), as long as we deal with the scale

Research areas (such as IR, KDD, ML, NLP, SemWeb) are subcubes within the data cube.
Diagram labels: Usage, Quality, Context, Dynamicity, Scalability

Good recommendations can make a big difference when keeping a user on a web site.
The key is how rich a context model the system uses to select information for a user.
Bad recommendations: 1% of users, good ones: 5%; generated in 20ms

Domain, Sub-domain, Page URL, URL sub-directories
Page Meta Tags, Page Title, Page Content, Named Entities
Has Query, Referrer Query
GeoIP Country, GeoIP State, GeoIP City
Referring Domain, Referring URL, Outgoing URL
Absolute Date, Day of the Week, Day period, Hour of the day, User Agent
Zip Code, State, Income, Age, Gender, Country, Job Title, Job Industry

[Diagram: Log Files (100M page clicks per day) -> Stream of clicks -> User profiles -> Stream of profiles -> Trend Detection System -> Trends and updated segments. Other elements: NYT articles; Sales -> Segments -> Campaign to sell segments -> Advertisers]

Segment: Keywords
  Stock Market: Stock Market, mortgage, banking, investors, Wall Street, turmoil, New York Stock Exchange
  Health: diabetes, heart disease, disease, heart, illness
  Green Energy: Hybrid cars, energy, power, model, carbonated, fuel, bulbs
  Hybrid cars: Hybrid cars, vehicles, model, engines, diesel
  Travel: travel, wine, opening, tickets, hotel, sites, cars, search, restaurant

50 GB of uncompressed log files
10 GB of compressed log files
0.5 GB of processed log files
50-100M clicks
4-6M unique users
7000 unique pages with more than 100 hits
Index size: 2 GB
Pre-processing & indexing time:
  10 min on workstation (4 cores & 32 GB)
  1 hour on EC2 (2 cores & 16 GB)

[Diagram: Telecom Network (25 000 devices) -> Alarms, 10-100/sec -> Alarms Server -> Live feed of data -> Alarms Explorer Server -> Operator, Big board display]

Alarms Explorer Server implements three real-time scenarios on the alarms stream:
1. Root-Cause Analysis: finding which device is responsible for an occasional “flood” of alarms
2. Short-Term Fault Prediction: predicting which device will fail in the next 15 minutes
3. Long-Term Anomaly Detection: detecting unusual trends in the network
The system is used in British Telecom.
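The root-cause scenario can be illustrated with a small sketch. This is not the Alarms Explorer implementation (the slides do not describe it); it is a minimal sliding-window heuristic, assuming alarms arrive as time-ordered (timestamp, device) pairs. The function name `flood_suspects` and all thresholds are hypothetical.

```python
from collections import Counter, deque

def flood_suspects(alarms, window_s=60.0, flood_rate=10.0, share=0.5):
    """Yield (timestamp, device) whenever one device dominates an alarm flood.

    alarms: iterable of (timestamp_seconds, device_id), sorted by timestamp.
    window_s, flood_rate (alarms/sec) and share are illustrative assumptions,
    not values from the talk.
    """
    window = deque()    # alarms currently inside the sliding window
    counts = Counter()  # per-device alarm counts inside the window
    for ts, device in alarms:
        window.append((ts, device))
        counts[device] += 1
        # Drop alarms that have fallen out of the window.
        while window and window[0][0] <= ts - window_s:
            _, old = window.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        total = len(window)
        # A "flood" here means the windowed alarm rate exceeds flood_rate.
        if total / window_s >= flood_rate:
            top, n = counts.most_common(1)[0]
            if n / total >= share:
                yield ts, top
```

A device is reported only while the overall alarm rate is high and that device accounts for at least `share` of the alarms in the window; a real system would also need to account for network topology.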

Presented in “Planetary-Scale Views on a Large Instant-Messaging Network” by Jure Leskovec and Eric Horvitz, WWW 2008

Observe social and communication phenomena at a planetary scale
Largest social network analyzed to date
Research questions:
  How does communication change with user demographics (age, sex, language, country)?
  How does geography affect communication?
  What is the structure of the communication network?

We collected the data for June 2006
Log size: 150 GB/day (compressed)
Total: 1 month of communication data: 4.5 TB of compressed data
Activity over June 2006 (30 days):
  245 million users logged in
  180 million users engaged in conversations
  17.5 million new accounts activated
  More than 30 billion conversations
  More than 255 billion exchanged messages
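As a quick consistency check on the figures above, the following sketch derives the monthly volume and two ratios from the quoted numbers. The derived ratios are not stated on the slide, and since two inputs are quoted as "more than", they are approximate.

```python
# Figures quoted on the slide (June 2006, 30 days).
days = 30
log_gb_per_day = 150             # compressed log size per day
conversations = 30e9             # "more than 30 billion"
messages = 255e9                 # "more than 255 billion"
users_logged_in = 245e6
users_in_conversations = 180e6

# Decimal units assumed: 1 TB = 1000 GB.
total_tb = log_gb_per_day * days / 1000
msgs_per_conversation = messages / conversations
active_share = users_in_conversations / users_logged_in

print(f"total log volume: {total_tb:.1f} TB")
print(f"messages per conversation: {msgs_per_conversation:.1f}")
print(f"share of logged-in users who conversed: {active_share:.0%}")
```

The daily log size times 30 days reproduces the 4.5 TB total; the other two values are rough averages only.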


Count the number of users logging in from a particular location on the Earth
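A minimal sketch of such a count, assuming each login has already been resolved to a latitude/longitude pair (how the study located users is not described on the slide). The function name `login_counts` and the one-degree grid are hypothetical choices.

```python
import math
from collections import Counter

def login_counts(logins, cell_deg=1.0):
    """Count logins per latitude/longitude grid cell.

    logins: iterable of (lat, lon) pairs in degrees.
    Returns a Counter keyed by (lat_index, lon_index), where each index is
    the coordinate divided by cell_deg and rounded down.
    """
    counts = Counter()
    for lat, lon in logins:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts

# Two logins near Ljubljana and one near Stavanger (illustrative coordinates).
sample = [(46.05, 14.51), (46.9, 14.1), (58.97, 5.73)]
print(login_counts(sample))
```

Plotting the counts per cell, for example on a log color scale, gives the kind of world map the slide refers to.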

Logins from Europe

6 degrees of separation [Milgram ’60s]
Average distance between two random users is 6.6
90% of nodes can be reached in 8 hops
[Table residue, hops: nodes] 14: 13740, 15: 4476, 16: 1542, 17: 536, 18: 167, 19: 71, 20: 29, 21: 16, 22: 10, 23: 3, 24: 2, 25: 3
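Statistics of this kind (average distance, share of nodes reachable within a given number of hops) can be computed from a breadth-first search. The sketch below is illustrative only and is not the procedure from the paper; the toy graph and the function names are hypothetical.

```python
from collections import Counter, deque

def hop_counts(graph, source):
    """Return Counter {hops: number of nodes at that shortest-path distance}.

    graph: dict mapping each node to an iterable of neighbours (undirected
    edges must be listed in both directions).
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return Counter(d for n, d in dist.items() if n != source)

def average_distance(counts):
    """Mean hop count over all nodes reached from the source."""
    total = sum(counts.values())
    return sum(h * c for h, c in counts.items()) / total if total else 0.0

def reach_within(counts, max_hops):
    """Fraction of reached nodes that lie within max_hops of the source."""
    total = sum(counts.values())
    within = sum(c for h, c in counts.items() if h <= max_hops)
    return within / total if total else 0.0

# Toy path graph a-b-c-d-e.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
counts = hop_counts(toy, "a")
print(average_distance(counts), reach_within(counts, 2))
```

On a graph with hundreds of millions of nodes, one would typically run the search from a sample of source nodes and aggregate the hop counts rather than search from every node.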
