Transcription
Marko Grobelnik
marko.grobelnik@ijs.si
Jozef Stefan Institute, Ljubljana, Slovenia
Stavanger, May 8th 2012
- Introduction
- What is Big data?
- Why Big-Data?
- When Big-Data is really a problem?
- Techniques
- Tools
- Applications
- Literature
‘Big-data’ is similar to ‘Small-data’, but bigger.
Having bigger data consequently requires different approaches (techniques, tools & architectures) to solve new problems, and old problems in a better way.
From “Understanding Big Data” by IBM
Big-Data
Key enablers for the growth of “Big Data” are:
- Increase of storage capacities
- Increase of processing power
- Availability of data
NoSQL databases: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
MapReduce: Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
Storage: S3, Hadoop Distributed File System
Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku
Processing: R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
Big-Data is really a problem when the operations on data are complex:
- e.g. simple counting is not a complex problem
- Modeling and reasoning with data of different kinds can get extremely complex
Good news about big-data:
- Often, because of the vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics), as long as we deal with the scale
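The remark about counting at scale can be illustrated with a minimal sketch, assuming the data is split into shards that are counted independently and then merged, in the spirit of the MapReduce tools listed earlier. The shard contents and the function names `count_shard` and `merge_counts` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from functools import reduce


def count_shard(lines):
    """Map step: count word occurrences within one shard of text."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Toy shards standing in for data spread over several machines (assumption).
shards = [
    ["big data is big", "data is everywhere"],
    ["small data is data too"],
]

total = reduce(merge_counts, map(count_shard, shards), Counter())
print(total.most_common(3))
```

Because each shard is counted on its own and the merge is a plain sum, the same two functions could in principle run over any number of shards; the counting logic stays simple and only the scale changes.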
Research areas (such as IR, KDD, ML, NLP, SemWeb, ...) are subcubes within the data cube.
Figure labels (data cube): Usage, Quality, Context, Dynamicity, Scalability
Good recommendations can make a big difference when keeping a user on a web site; the key is how rich a context model the system uses to select information for a user.
Bad recommendations: 1% of users; good ones: 5%.
... generated in 20ms
- Domain, Sub-domain, Page URL, URL sub-directories
- Page Meta Tags, Page Title, Page Content, Named Entities
- Has Query, Referrer Query
- GeoIP Country, GeoIP State, GeoIP City
- Referring Domain, Referring URL, Outgoing URL
- Absolute Date, Day of the Week, Day period, Hour of the day, User Agent
- Zip Code, State, Income, Age, Gender, Country, Job Title, Job Industry
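A few of the context features listed above can be derived directly from a page URL and a click timestamp. The following is a minimal sketch, not the system's actual feature extractor: the function name `context_features`, the example URL, and the naive rule that the domain is the last two host labels (which fails for hosts such as `example.co.uk`) are all assumptions made for illustration.

```python
from datetime import datetime
from urllib.parse import urlparse


def context_features(page_url, timestamp):
    """Derive a handful of URL and time features for one click."""
    parsed = urlparse(page_url)
    host_parts = parsed.hostname.split(".")
    path_parts = [part for part in parsed.path.split("/") if part]
    return {
        # Naive split (assumption): the last two labels form the domain.
        "domain": ".".join(host_parts[-2:]),
        "sub_domain": ".".join(host_parts[:-2]),
        # Every path component except the final page name.
        "url_sub_directories": path_parts[:-1],
        "has_query": bool(parsed.query),
        # 0 = Monday ... 6 = Sunday, as defined by datetime.weekday().
        "day_of_week": timestamp.weekday(),
        "hour_of_day": timestamp.hour,
    }


features = context_features(
    "http://www.example.com/business/markets/story.html?q=stocks",
    datetime(2012, 5, 8, 14, 30),
)
print(features)
```

Features such as GeoIP location, named entities, or demographics need external data sources and are left out of this sketch.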
Trend Detection System
Diagram components: log files (100M page clicks per day), stream of clicks, user profiles, stream of profiles, Trend Detection System, NYT articles, trends and updated segments, Sales, segments, campaign to sell segments, Advertisers.

Segment: Keywords
Stock Market: Stock Market, mortgage, banking, investors, Wall Street, turmoil, New York Stock Exchange
Health: diabetes, heart disease, disease, heart, illness
Green Energy: Hybrid cars, energy, power, model, carbonated, fuel, bulbs
Hybrid cars: Hybrid cars, vehicles, model, engines, diesel
Travel: travel, wine, opening, tickets, hotel, sites, cars, search, restaurant
- 50Gb of uncompressed log files
- 10Gb of compressed log files
- 0.5Gb of processed log files
- 50-100M clicks
- 4-6M unique users
- 7000 unique pages with more than 100 hits
- Index size 2Gb
- Pre-processing & indexing time:
  - 10min on workstation (4 cores & 32Gb)
  - 1hour on EC2 (2 cores & 16Gb)
Diagram components: Telecom Network (25 000 devices), alarms at 10-100/sec, Alarms Server, live feed of data, Alarms Explorer Server, Operator, big board display.

Alarms Explorer Server implements three real-time scenarios on the alarms stream:
1. Root-Cause-Analysis: finding which device is responsible for an occasional “flood” of alarms
2. Short-Term Fault Prediction: predicting which device will fail in the next 15 mins
3. Long-Term Anomaly Detection: detecting unusual trends in the network
The system is used in British Telecom.
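To make the first scenario concrete, here is a minimal sketch of spotting a "flood" on an alarm stream and naming the device that contributes most alarms to it. It is not the Alarms Explorer's method, which the slides do not describe: the class `FloodDetector`, its parameters, the device names, and the assumption that alarms arrive in timestamp order are all hypothetical, and a real root-cause analysis would likely need more than counting.

```python
from collections import Counter, deque


class FloodDetector:
    """Sliding-window alarm counter (illustrative sketch only)."""

    def __init__(self, window_seconds=60.0, flood_threshold=5):
        self.window_seconds = window_seconds
        self.flood_threshold = flood_threshold
        self.recent = deque()        # (timestamp, device_id) in arrival order
        self.per_device = Counter()  # alarms per device inside the window

    def observe(self, timestamp, device_id):
        """Add one alarm; return the dominant device if the window holds a flood."""
        self.recent.append((timestamp, device_id))
        self.per_device[device_id] += 1
        # Evict alarms that have fallen out of the time window.
        while self.recent and self.recent[0][0] <= timestamp - self.window_seconds:
            _, old_device = self.recent.popleft()
            self.per_device[old_device] -= 1
            if self.per_device[old_device] == 0:
                del self.per_device[old_device]
        if len(self.recent) >= self.flood_threshold:
            return self.per_device.most_common(1)[0][0]
        return None


detector = FloodDetector(window_seconds=10, flood_threshold=4)
alarms = [(0, "router-1"), (1, "switch-7"), (2, "switch-7"),
          (3, "switch-7"), (30, "router-2")]
results = [detector.observe(t, device) for t, device in alarms]
print(results)
```

Keeping only the alarms inside the window bounds memory by the alarm rate times the window length, which matters for a live feed of 10-100 alarms per second.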
Presented in “Planetary-Scale Views on a Large Instant-Messaging Network” by Jure Leskovec and Eric Horvitz, WWW 2008
Observe social and communication phenomena at a planetary scale
Largest social network analyzed to date
Research questions:
- How does communication change with user demographics (age, sex, language, country)?
- How does geography affect communication?
- What is the structure of the communication network?
We collected the data for June 2006
Log size: 150Gb/day (compressed)
Total: 1 month of communication data: 4.5Tb of compressed data
Activity over June 2006 (30 days):
- 245 million users logged in
- 180 million users engaged in conversations
- 17.5 million new accounts activated
- More than 30 billion conversations
- More than 255 billion exchanged messages
Count the number of users logging in from a particular location on the earth
Logins from Europe
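Counting users per location, as in the login maps above, can be sketched by snapping each login's coordinates to a grid cell and collecting the distinct users seen in each cell. This is a minimal illustration under stated assumptions: the record layout `(user_id, latitude, longitude)`, the one-degree cell size, and the names `grid_cell` and `users_per_cell` are hypothetical and not taken from the original study.

```python
import math
from collections import defaultdict


def grid_cell(latitude, longitude, cell_degrees=1.0):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(latitude / cell_degrees),
            math.floor(longitude / cell_degrees))


def users_per_cell(logins, cell_degrees=1.0):
    """Count distinct users that logged in from each grid cell."""
    users = defaultdict(set)
    for user_id, latitude, longitude in logins:
        users[grid_cell(latitude, longitude, cell_degrees)].add(user_id)
    return {cell: len(ids) for cell, ids in users.items()}


# Toy login records (assumption); user "u1" logs in twice from the same place.
logins = [
    ("u1", 46.05, 14.51),
    ("u1", 46.05, 14.51),
    ("u2", 46.90, 14.20),
    ("u3", 58.97, 5.73),
    ("u4", -33.87, 151.21),
]
counts = users_per_cell(logins)
print(counts)
```

Using a set per cell keeps repeated logins by the same user from inflating the count; for very large data an approximate distinct counter could replace the set.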
6 degrees of separation [Milgram ’60s]
Average distance between two random users is 6.6
90% of nodes can be reached in 8 hops
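The two statistics quoted above (average distance and the hop count that reaches 90% of nodes) can be computed on a small graph with breadth-first search. The sketch below is a toy illustration, not the procedure used in the cited paper: the four-node chain graph and all function names are assumptions, and on a network with hundreds of millions of nodes one would presumably run the search from a sample of source nodes rather than from every node.

```python
from collections import Counter, deque


def hop_counts(graph, source):
    """Breadth-first search: hops from source to every reachable node."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def distance_distribution(graph):
    """Histogram of shortest-path lengths over ordered pairs of distinct nodes."""
    histogram = Counter()
    for source in graph:
        for target, hops in hop_counts(graph, source).items():
            if target != source:
                histogram[hops] += 1
    return histogram


def average_distance(histogram):
    """Mean shortest-path length over all connected pairs."""
    total_pairs = sum(histogram.values())
    return sum(hops * n for hops, n in histogram.items()) / total_pairs


def hops_to_reach(histogram, fraction):
    """Smallest hop count within which the given fraction of pairs is connected."""
    total_pairs = sum(histogram.values())
    reached = 0
    for hops in sorted(histogram):
        reached += histogram[hops]
        if reached >= fraction * total_pairs:
            return hops
    return None


# Toy undirected chain a - b - c - d, stored as adjacency lists (assumption).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
histogram = distance_distribution(graph)
print(average_distance(histogram), hops_to_reach(histogram, 0.9))
```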