Big Data Analytics Spark Perspective - Cleveland State University

Transcription

Global Journal of Computer Science and Technology: CSoftware & Data EngineeringVolume 15 Issue 1 Version 1.0 Year 2015Type: Double Blind Peer Reviewed International Research JournalPublisher: Global Journals Inc. (USA)Online ISSN: 0975-4172 & Print ISSN: 0975-4350Big Data Analysis: Ap Spark PerspectiveBy Abdul Ghaffar Shoro & Tariq Rahim SoomroSZABIST Dubai Campus, United Arab EmiratesAbstract- Big Data have gained enormous attention in recent years. Analyzing big data is verycommon requirement today and such requirements become nightmare when analyzing of bulk datasource such as twitter twits are done, it is really a big challenge to analyze the bulk amount of twits toget relevance and different patterns of information on timely manner. This paper will explore theconcept of Big Data Analysis and recognize some meaningful information from some sample bigdata source, such as Twitter twits, using one of industries emerging tool, known as Spark by Apache.Keywords : big data analysis, twitter, apache spark, apache hadoop, open source.GJCST-C Classification : D.2.11, H.2.8BigDataAnalysisApSparkPerspectiveStrictly as per the compliance and regulations of: 2015. Abdul Ghaffar Shoro & Tariq Rahim Soomro. This is a research/review paper, distributed under the terms of the CreativeCommons Attribution-Noncommercial 3.0 Unported License http://creativecommons.org/licenses/by-nc/3.0/), permitting all noncommercial use, distribution, and reproduction inany medium, provided the original work is properly cited.

Big Data Analysis: Ap Spark PerspectiveKeywords: big data analysis, twitter, apache spark,apache hadoop, open source.II.Introductionn today’s computer age, our life has become prettymuch dependent on technological gadgets and moreor less all aspects of human life, such as personal,social and professional are fully covered withtechnology. More or less all the above aspects aredealing with some sort of data; due to immenseincrease in complexity of data due to rapid growthrequired speed and variety have originated newchallenges in the life of data management. This is whereBig Data term has given a birth. Accessing, Analyzing,Securing and Storing big data are one of most spokenterms in today’s technological world. Big Data analysisis a process of gathering data from different resourcesand then organizing that data in meaning full way andthen analyzing those big sets of data to discovermeaningful facts and figures from that data collection.This analysis of data not only helps to determine thehidden facts and figures of information in bulk of bigdata, but also it provides with categorize the data orrank the data with respect to important of information itprovides. In short big data analysis is the process offinding knowledge from bulk variety of data. Twitter asorganization itself processes approximately 10k tweetsper second before publishing them for public, theyanalyze all this data with this extreme fast rate, to ensureevery tweet is following decency policy and restrictedwords are filtered out from tweets. All this analyzingprocess must be done in real time to avoid delays inpublishing twits live for public; for example business likeForex Trading analyze social data to predict futurepublic trends. To analyze such huge data it is requiredto use some kind of analysis tool. This paper focuseson open source tool Apache Spark. Spark is a clustercomputing system from Apache with incubator status;this tool is specialized at making data analysis faster, itAuthor α σ: Department of Computing, SZABIST Dubai Campus,Dubai, UAE. e-mails: shoroghaffar@gmail.com, tariq@szabist.ac.aeII.Literature Reviewa) Big DataA very popular description for the exponentialgrowth and availability of huge amount of data with allpossible variety is popularly termed as Big Data. This isone of the most spoke about terms in today’sautomated world and perhaps big data is becoming ofequal importance to business and society as theInternet has been. It is widely believed and proved thatmore data leads to more accurate analysis, and ofcourse more accurate analysis could lead to morelegitimate, timely and confident decision making, as aresult, better judgment and decisions more likely meanshigher operational efficiencies, reduced risk and costreductions [2]. Big Data researchers visualize big dataas follows:i. Volume-wiseThis is the one of the most important factors,contributed to emergence of big data. Data volume ismultiplying to various factors. Organizations andgovernments has been recording transactional data fordecades, social media continuously pumping steams ofunstructured data, automation, sensors data, machineto-machine data and so much more. Formerly, datastorage was itself an issue, but thanks to advance andaffordable storage devices, today, storage itself is not abig challenge but volume still contributes to otherchallenges, such as, determining the relevance withinmassive data volumes as well as collecting valuableinformation from data using analysis [3].ii. Velocity-wiseVolume of data is challenge but the pace atwhich it is increasing is a serious challenge to be dealtwith time and efficiency. The Internet streaming, RFID 2015 Global Journals Inc. (US)Yearis pretty fast at both running programs as well as writingdata. Spark supports in-memory computing, thatenables it to query data much faster compared to diskbased engines such as Hadoop, and also it offers ageneral execution model that can optimize arbitraryoperator graph [1]. This paper organized as follows:section 2 focus on literature review exploring the BigData Analysis & its tools and recognize somemeaningful information from some sample big datasource, such as Twitter feeds, using one of industriesemerging tool, Apache Spark along with justification ofusing Spark; section 3 will discuss material and method;section 4 will discuss the results of analyzing of big datausing Spark; and finally discussion and future work willbe highlighted in section 5.7Global Journal of Computer Science and Technology (C ) Volume XV Issue I Version IAbstract- Big Data have gained enormous attention in recentyears. Analyzing big data is very common requirement todayand such requirements become nightmare when analyzing ofbulk data source such as twitter twits are done, it is really a bigchallenge to analyze the bulk amount of twits to get relevanceand different patterns of information on timely manner. Thispaper will explore the concept of Big Data Analysis andrecognize some meaningful information from some sample bigdata source, such as Twitter twits, using one of industriesemerging tool, known as Spark by Apache.2015Abdul Ghaffar Shoro α & Tariq Rahim Soomro σ

Big Data Analysis: Ap Spark PerspectiveYear2015tags, automation and sensors, robotics and much moretechnology facilities, are actually driving the need to dealwith huge pieces of data in real time. So velocity of dataincrease is one of big data challenge with standing infront of every big organization today [4].Global Journal of Computer Science and Technology (C ) Volume XV Issue I Version I8iii. Variety-wiseRapidly growing huge volume of data is a bigchallenge but the variety of data is bigger challenge.Data is growing in variety of formats, structured, unstructured, relational and non-relational, different filessystems, videos, images, multimedia, financial data,aviation data and scientific data etc. Now the challengeis to find means to correlate all variety of data timely toget value from this data. Today huge numbers oforganizations are striving to get better solutions to thischallenge [3].iv. Variability-wiseRapidly growing data with increasing variety iswhat makes big data challenging but ups and downs inthis trend of big data flow is also a big challenge, socialmedia response to global events drives huge volumesof data and it is required to be analyzed on time beforetrend changes. Global events impact on financialmarkets, this overhead increase more while dealing withun-structured data [5].v. Complexity-wiseAll above factors make big data a reallychallenge, huge volumes, continuously multiplying withincreasing variety of sources, and with unpredictedtrends. Despite all those facts, big data much beprocessed to connect and correlate and createmeaningful relational hierarchies and linkages right ontime before this data go out of control. This pretty muchexplains the complexity involved in big data today [5].To precise, any big data repository withfollowing characteristics can be termed big data. [6]: Accessible — highly available commercial or opensource product with good usability. Central management and orchestration Distributed redundant data storage Extensible — basic capabilities can be augmentedand altered Extremely fast data insertion Handles large amounts (a petabyte or more) of data Hardware agnostic Inexpensive (relatively) Parallel task processing Provides data processing capabilitiesb) Big Data Analysis ToolsThe following are brief introduction of some ofselected big data analysis tools along with briefoverview of Apache Spark and finally justification ofapache spark with other competitors to distinguish andjustify use of Apache Spark. 2015 Global Journals Inc. (US)i. Apache HiveHive is a data warehousing infrastructure, whichruns on top of Hadoop. It provides a language calledHive QL to organize, aggregate and run queries on thedata. Hive QL is similar to SQL, using a declarativeprogramming model [7]. This differentiates the languagefrom Pig Latin, which uses a more procedural approach.In Hive QL as in SQL the desired final results aredescribed in one big query. In contrast, using Pig Latin,the query is built up step by step as a sequence ofassignment operations.Apache Hive enablesdevelopers specially SQL developers to write queries inHive Query Language HQL. HQL is similar to standardquery language. HQL queries can be broken down byHive to communicate to MapReduce jobs executedacross a Hadoop Cluster.ii. Apache PigPig is a tool or in fact a platform to analyze hugevolumes of big data. Substantial parallelization of tasksis a very key feature of Pig programs, which enablesthem to handle massive data sets [7]. While Pig andHive are meant to perform similar tasks [8]. The Pig isbetter suited for the data preparation phase of dataprocessing, while Hive fits the data warehousing andpresentation scenario better. The idea is that as data isincrementally collected, it is first cleaned up using thetools provided by Pig and then stored. From that pointon Hive is used to run ad-hoc queries analyzing thedata. During this work the incremental buildup of a datawarehouse is not enabled and both data preparationand querying are performed using Pig. The feasibility ofusing Pig and Hive in conjunction remains to be tested.iii. Apache ZebraApache Zebra is a kind of storage layer for dataaccess at high level abstraction and especially tabularview for data available in Hadoop and relief’s users ofpig coming up with their own data storage models andretrieval codes. Zebra is a sub-project of Pig whichprovides a layer of abstraction between Pig Latin andthe Hadoop Distributed File System [9]. Zebra allows aPig programmer to save relations in a table-orientedfashion (as opposed to flat text files, which are, normallyused) along with meta-data describing the schema ofeach relation. The tests can be run using J Unit or asimilar Java testing framework [10].iv. Apache H BaseApache H Base is a data base engine builtusing Hadoop and modeled after Google's Big Table. Itis optimized for real time data access from tables ofmillions of columns and billions of rows. Among otherfeatures, H Base offers support for interfacing with Pigand Hive. The Pig API features a storage function forloading data from an H Base data base, but during thiswork the data was read from and written to flat HDFSfiles, because the data amounts were too small tonecessitate the use of H Base [11].

vii. Apache SparkApache Spark is a general purpose clustercomputing engine which is very fast and reliable. Thissystem provides Application programing interfaces invarious programing languages such as Java, Python,Scala. Spark is a cluster computing system fromApache with incubator status, this tool is specialized atmaking data analysis faster, it is pretty fast at bothrunning programs as well as writing data. Sparksupports in-memory computing, that enables it to querydata much faster compared to disk-based engines suchas Hadoop, and also it offers a general execution modelthat can optimize arbitrary operator graph. Initiallysystem was developed at UC Berkeley’s as researchproject and very quickly acquired incubator status inApache in June 2013 [9]. Generally speaking, Spark isadvance and highly capable upgrade to Hadoop aimedat enhancing Hadoop ability of cutting edge analysis.Spark engine functions quite advance and different thanHadoop. Spark engine is developed for in-memoryprocessing as well a disk based processing. This inmemory processing capability makes it much faster thanany traditional data processing engine. For exampleproject sensors report, logistic regression runtimes inSpark 100 x faster than Hadoop Map Reduce. Thissystem also provides large number of impressive highlevel tools such as machine learning tool M Lib,structured data processing, Spark SQL, graphprocessing took Graph X, stream processing enginecalled Spark Streaming, and Shark for fast interactivequestion device. As shown in Figure-2-1 below.c) Why Apache Spark?Following are some important reasons whyApache Spark is distinguished amongst other availabletools: Apache Spark is a fastest and general purposeengine for large-scale data processing [1]. Apache Spark is a data parallel general purposebatch processing engine Workflows are defined in a similar and reminiscentstyle of Map Reduce, however, is much morecapable than traditional Hadoop Map Reduce. Apache Spark is a full, top level Apache project Simple to Install Spark is implemented in Scala, which is power fullobject oriented languages and with ampleresources [10] Spark is relatively much junior compared to Strombut it achieved incubator status with few months ofits first production to Share through in early 2013[9]. Both Map R’s distributions and Cloudera’sEnterprise data platform support Spark Streaming.Also, very large company known as Databricksprovides support for the Spark stack, includingSpark Streaming. Spark Reliability can be judged from Intelrecommendation for spark to be used in healthcaresolutions [12]. Open source contributors, Cloudera, Databricks,IBM, Intel, and Map R has openly announced tosupport and fund standardization of Apache Sparkas Standard general purpose engine for big dataanalysis [1]. Host on works, the first company to provide supportfor Apache storm recommends Apache Spark asData Science tool [11]. One of the favorite features of Spark is the ability tojoin datasets across multiple disparate datasources.d) When Not to Use Apache SparkApache Spark is fasted General purpose bigdata analytics engine and it is very suitable for any kindof big data analysis. Only following two scenarios, canhinder the suitability of Apache spark [13]. 2015 Global Journals Inc. (US)Yearvi. Apache StormA dependable tool to process unbound streamsof data or information. Storm is an ongoing distributedsystem for computation and it is an open source tool,currently undergoing incubation assessment withApache. Storm performs the computation on livestreams of data in same way traditional Hadoop doesfor batch processing. Storm was originally aimed atprocessing twitter streams, and now available as opensource and being utilized in many organizations asstream processing tool. Apache spark is quick andreliable, scalable, and makes sure to transforminformation. It is also not very complex to be deployedand utilized [1].Figure-2-1 : Apache Spark9Global Journal of Computer Science and Technology (C ) Volume XV Issue I Version Iv. Apache Chu kwaA Map Reduce based data collection andmonitoring system called Chu kwa has been developedon top of Hadoop. Chu kwa is mainly aimed atprocessing log files, especially from Hadoop and otherdistributed systems [11]. Because Chu kwa is meantmostly for the narrow area of log data processing, notgeneral data analysis, the tools it offers are not asdiverse as Pig's and not as well suited for the tasksperformed in this work.2015Big Data Analysis: Ap Spark Perspective

Big Data Analysis: Ap Spark PerspectiveLow Tolerance to Latency requirements: If big dataanalysis are required to be performed on datastreams and latency is the most crucial point ratheranything else. In this case using Apache Storm mayproduce better results, but again reliability to bekept in mind. Shortage of Memory resources: Apache Spark isfasted general purpose engine due to the fact that itmaintains all its current operations inside Memory.Hence requires access amount of memory, so inthis case when available memory is very limited,Apache Hadoop Map Reduce may help better,considering huge performance gap.Year2015 Global Journal of Computer Science and Technology (C ) Volume XV Issue I Version I10III.Material and MethodsThe nature of this paper is to cope with hugeamount of data and process / analyze huge volume ofdata to extract some meaningful information from thatdata in real time.The big data is modern daytechnology term that have changed the way world havelooked at data and all of methods and principlestowards data. The Data gather of big data is totallydifferent than our traditional ways of data gathering andtechniques. Coping with big data specially analyzing inreal time has become almost impossible with traditionaldata warehousing techniques. This limitation haveresulted a race of new innovations in data handling andanalyzing field. Number of new technologies and toolshave emerged and claiming to resolve big dataanalyzing challenges. So technically speaking, Twitterstreaming API is used to access twitter’s big data usingApache Spark.a) Research Instrument Twitter Stream API: The Streaming APIs providepush deliveries of Tweets and other events, for realtime or low-latency applications. Twitter API is wellknown source of big data and used worldwide innumerous applications of a number of objectives.In fact there are some limitation in free Twitter APIthat should be considered while analyze the results. Apache Spark: As an open source computingframework to analyze the big data. Though apachespark is claiming to be fastest big data analyzingtool in market, but the trust level and validation ofresults will still be subject to comparison with someexisting tools like Apache storm, for example.In this paper the data processing is happeningusing Twitter streaming API and Apache Spark asshown in Figure-3-1 bellow. 2015 Global Journals Inc. (US)Figure-3-1: Apache Spark data processingIV.ResultsThis section illustrates and analysis the datacollected for the experiment purpose by Apache Sparkusing twitter streaming API. The amount of dataprocessed for each scenario, processing time andresults are given in tabular as well as graphical format.Following scenarios were executed for experimentpurpose on live streams of twits on twitter.1. Top ten words collected during a particular period oftime. (10 minutes)2. Top ten languages collected during a particularperiod of time. (10 minutes)3. Number of times a particular “word” being used intwits, twitted in a particular period of time.Scenario 1: Top ten words collected in last 10 minutesStatistics: The total number of tweets analyzed during thistime 23865 The total number of unique words 77548 The total number of words 160989 Total time duration 10 minutes (600 seconds). See Table 4-1 for top ten words in tabular form. See Figure 4-1 for top ten words shown graphicallyin charts

Big Data Analysis: Ap Spark 84 ىلع 26195Love862886Что290027 م هلل ا 111Global Journal of Computer Science and Technology (C ) Volume XV Issue I Version IS. No.2015Table 4-1 : Top ten words in last 10 minutesFigure 4-1: Top ten words in last 10 minutesScenario 2:Top ten languages collected in last 10minutes.Statistics: The total number of tweets analyzed during thistime 23311 The total number of unique languages 42 Total time duration 10 minutes (600 seconds). See Table 4-2 for top ten languages in tabular form See Figure 4-2 for top ten languages showngraphically in charts 2015 Global Journals Inc. (US)

Big Data Analysis: Ap Spark PerspectiveS. 154Turkish491YearTable 4-2 : top ten languages in last 10 an21099Japanese695710English8114Global Journal of Computer Science and Technology (C ) Volume XV Issue I Version I12Figure 4-2 : Top ten languages in last ten minutes Number of twits 42119SeeTable 4-3 for number of twits posted using wordtwits twitted in last 10 minutes.“mtvstars”in tabular formStatistics: SeeFigure4-3 for number of twits posted using Search String mtvstarsword“mtvstars”shown graphically in charts Time duration 10 minutesTable 4-3 : Number of twits “mtvstars” used to post a twit in last 10 minutesScenario 3: Number of times “mtvstars” being used inTwitsfrequencyTime durationin secondsTwitsfrequencyTime duration insecondsTwitsfrequencyTime duration 116689240309394423570471710424631601448 2015 Global Journals Inc. (US)

409Figure 4-3 : number of twits using “mtvstars” in last minutes 2015 Global Journals Inc. (US)Year5313Global Journal of Computer Science and Technology (C ) Volume XV Issue I Version I41002015Big Data Analysis: Ap Spark Perspective

Big Data Analysis: Ap Spark PerspectiveYear2015V.Global Journal of Computer Science and Technology (C ) Volume XV Issue I Version I14Discussion & Future WorkAs not many organizations share their big datasources. So study was limited to twitter free feed APIand all limitations of this API, such as amount of dataper request and performance etc. and that directlyimpact the results presented. Also a common laptopwas used to analyze tweets as compare to dedicatedServer. As a result of this study, following Scenarioswere considered and analyzed and their results werepresented in previous section.1. Top ten words twitted during last specific period oftime.2. Top ten languages used to twit during specificperiod of time.3. A list of twitted items matching a given searchkeyword.Considering the above mentioned limitations,Apache Spark was able to analyze streamed tweets withvery minor latency of few seconds. Which proves that,despite being big general purpose, Interactive andflexible big data processing engine, Spark is verycompetitive in terms of stream processing as well.During the process of analyzing big data using spark,couple of improvement areas were identified as ofutmost importance should be persuaded as future work.Firstly, like most open source tools, Apache Spark is notthe easiest tool to work with. Especially deploying andconfiguring apache spark for custom requirements. Aflexible, user friendly configuration and programmingutility for apache spark will be a great addition to apachespark developer community. Secondly, analyzed datarepresentation is poor, there is a very strong need tohave powerful data representation tool to providepowerful reporting and KPI generation directly fromSpark results, and having this utility in multiplelanguages will be a great added value.References Références Referencias1. Community effort driving standardization of ApacheSpark through expanded role in Hadoop Project,Cloudera, Databricks, IBM, Intel, and Map R, 26.html, Retrieved July 1 2014.2. Big Data: what I is and why it mater, 2014,http://www.sas.com/en us/insights/big-data/whatis-big-data.html3. Nick Lewis, 2014, information security threatquestions.4. Michael Goldberg, 2012, Cloud Security AllianceLists 10 Big data security Challenges, sts-10-bigdata-security-challenges/5. Securosis, 2012, Securing Big Data: SecurityRecommendations for Hadoop and No SQL 2015 Global Journals Inc. sis.com/assets/library/reports/SecuringBigData FINAL.pdfSteve Hurst, 2013, To 10 Security Challenges for2013, s-for-2013/article/281519/,Mark Hoover, 2013, Do you know big data’s top ends,http://people.apache.org/ rdonkin/hadooptalk/hadoop.html , RetrievedMay ache.org/, Retrieved May 2014.Casey Stella, 2014, Spark for Data Science: A ience-case-study/Abhi Basu, Real-Time Healthcare Analytics ers/big-data-real time health care-analyticswhite paper .pdf, Retrieved December 2014.Spark MLib, Apache Spark performance,https://spark.apache.org/mllib/ , Retrieved October2014.

Big Data Analysis: Ap Spark Perspective bdul Ghaffar Shoro α Tariq Rahim Soomro σ Abstract- Big Data have gained enormous attention in recent years. Analyzing big data is very common requirement today and such requirements become nightmare when analyzing of bulk data source such as twitter twits are done, it is really a big