Big Data - A Nightmare to Data Scientist

Volume 2, Issue 5, May – 2017
International Journal of Innovative Science and Research Technology
ISSN No: 2456-2165

B. Lakshmi
Asst. Professor, Department of Computer Applications, V.R.S.E.C, Vijayawada-7, Andhra Pradesh, India

K. Anji Reddy
Head of the Department, Dept. of Computer Applications, V.R.S.E.C, Vijayawada-7, Andhra Pradesh, India

Abstract - The rise of social media and mobile devices has brought rapid growth in data generation; as a result, data management has become increasingly challenging. The term Big Data was coined for data at large scale, containing information that includes audio and video files. Because such data is prone to rapid change, traditional database management techniques are not capable of maintaining it, and data analysis has become a nightmare to data scientists. New analytics tools are emerging in the IT world to analyze big data. Hadoop (IOP) and BigInsights together provide a software platform for visualizing, discovering, and analyzing data from disparate sources. This paper spotlights big data characteristics and compares old and new architectures of data management. This survey paper concludes with a discussion of Hadoop as a solution to big data and promising future directions.

Index Terms – Big Data, Zetta Byte, SaaS, Variability.

I. INTRODUCTION

The problem of working with big data that exceeds available computing power is not new, but this type of computing has spread greatly in recent years. Big data is a term used for large-scale data that comes in all shapes and sizes and is prone to rapid change, which makes it difficult to capture.

An exact definition of big data is difficult to nail down because projects, vendors, practitioners, and business professionals use the term quite differently. With that in mind, big data is: large datasets, and the set of computing strategies and technologies used to maintain large datasets. A large dataset is one too large to reasonably process or store with traditional tooling or on a single computer. This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization.

II. CATEGORIES OF BIG DATA

Big data is broadly classified into three categories: structured, unstructured, and semi-structured.

Figure: 1. Big Data

A. Structured

Data that can be stored in a fixed format is called structured data. There are many database management techniques to maintain and analyze this type of data. When the size of this data grows to a large extent, the challenges involved in its discovery, management, and processing also increase. Tabular data stored in an RDBMS is an example of structured data.

Example:

Figure: 2. Example of Structured Data

IJISRT17MY203 | www.ijisrt.com
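Tabular data in an RDBMS, as in the example above, is the canonical structured case: a fixed schema makes storage and querying straightforward. A minimal sketch using Python's built-in sqlite3 module (the table and rows below are invented for illustration, not taken from the paper's figure):

```python
import sqlite3

# In-memory database; the schema and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)
conn.executemany(
    "INSERT INTO employees (id, name, dept) VALUES (?, ?, ?)",
    [(1, "Asha", "CSE"), (2, "Ravi", "ECE"), (3, "Meena", "CSE")],
)

# A fixed schema makes filtering and ordering trivial -- the defining
# property of structured data.
rows = conn.execute(
    "SELECT name FROM employees WHERE dept = ? ORDER BY id", ("CSE",)
).fetchall()
print([r[0] for r in rows])  # ['Asha', 'Meena']
```

The challenge the paper describes arises when tables like this grow beyond what a single RDBMS instance can discover, manage, and process.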

B. Un-Structured

Data with an unknown format or structure is called unstructured data. Processing this data and deriving value from it are major challenges for large unstructured datasets. Heterogeneous datasets containing a mix of simple text files, images, videos, etc., are good examples of unstructured data.

Example: Result of a Search Engine

Figure: 3. Examples of Un-Structured Data

C. Semi-Structured

A combination of both structured and unstructured data is called semi-structured data. Semi-structured data has a particular format, but that format is not formally defined with a schema; compare an RDBMS table definition with the same data stored in an XML file.

Example:

Figure: 4. Example of Semi-Structured Data

III. CHARACTERISTICS OF BIG DATA

The basic requirements for working with big data are the same as the requirements for working with datasets of any size. However, the massive scale, the speed of ingesting and processing, and the characteristics of the data that must be dealt with at each stage of the process present significant new challenges when designing solutions. The goal of most big data systems is to surface insights and connections from large volumes of heterogeneous data that would not be possible using conventional methods.

A. Volume

The sheer scale of the information processed helps define big data systems. Data at rest in big data can be orders of magnitude larger than traditional datasets, which requires more specialized techniques to store. The Hadoop distributed file system and new cluster management algorithms become prominent here, as they break tasks into smaller pieces.

B. Velocity

Data in motion (stream data, with milliseconds to respond): the speed with which information moves through the system. Data frequently flows into the system from multiple sources; for example, audio and video data is constantly being added, massaged, processed, and analyzed in social media. These systems require robust and reliable components to guard against failures and to maintain data integrity and security across networks.

C. Variety

Data in many forms (structured, unstructured, text, multimedia): the formats and types of media can vary significantly as well. Big data is a set of different formats of data, such as images, video files, audio, text files, and structured logs.

Figure: 5. Big Data as 3Vs

D. Other Characteristics

Various individuals and organizations have suggested expanding the original three Vs, though these proposals have tended to describe challenges rather than qualities of big data. Some common additions are:

- Veracity: The variety of sources and the complexity of the processing can lead to challenges in evaluating the quality of the data (and consequently, the quality of the resulting analysis).
- Variability: Variation in the data leads to wide variation in quality. Additional resources may be needed to identify, process, or filter low-quality data to make it more useful.
- Value: The ultimate challenge of big data is delivering value. Sometimes, the systems and processes in place are complex enough that using the data and extracting actual value can become difficult.

IV. ARCHITECTURE

Traditional database management systems use client-server architecture to process data.

Figure: 6. Client Server Architecture

Ordinary, small data can be processed easily for decision making, but big data with larger datasets requires parallel processing, so big data is processed with a master/slave architecture.

Figure: 7. Master/Slave Architecture

A. Hadoop As A Solution

Apache Hadoop is an open-source programming framework that supports distributed storage and processing of large datasets using the MapReduce programming model, a software framework in which an application is broken down into various parts.

Figure: 8. Hadoop Architecture

B. Main Components of HADOOP

1. HDFS (Hadoop Distributed File System: storage for the data)
2. MR (MapReduce: business logic to process the data, written in core Java)

3. SQOOP (SQL-to-Hadoop: can export or import SQL data into Hadoop and vice versa)
4. HIVE (data warehouse)
5. HBASE (NoSQL component)
6. OOZIE (workflow)
7. FLUME (continuous streaming data, e.g., from Twitter or Facebook)
8. PIG (predefined components used for processing, like MapReduce)

V. HDFS ARCHITECTURE

HDFS stands for Hadoop Distributed File System; it is the fault-tolerant storage component of the Hadoop framework.

Figure: 9. HDFS Architecture

HDFS can store information at large scale, upgrade incrementally, and check the system periodically to protect data from loss and maintain integrity. Hadoop uses a master/slave architecture that forms clusters (sets of computers) and coordinates work among them. This master/slave system prevents data loss and work interruption because it redistributes the workload to other machines in the cluster. HDFS replicates the pieces of incoming files, called "blocks," and stores them across multiple machines in the cluster and on different servers.

A. Map Reducer Architecture

The processing part of big data (the business logic) is done by MapReduce.

Figure: 10. MapReducer Architecture

The MapReduce framework is the processing backbone of the Hadoop architecture. The framework divides the specifications (processing logic or business logic) of the operations used to process large datasets and runs them in parallel. In Hadoop, specifications are written as MapReduce jobs in Java; operations can also be written in Hive and Pig. HDFS stores all the combined results of the MapReducer. The two major functions of the MapReducer are MAP and REDUCE.

MAP - This function receives input from the user, processes the input using the specifications of the operations, and generates an intermediate set of output pairs.

REDUCE - This function merges all the intermediate values generated by MAP. Finally, the status and results are stored in HDFS.

B. Key Points of Hadoop

Hadoop is one of the emerging technologies in the IT industry for overcoming the challenges in processing the large, rapidly changing, and varied datasets of big data. It is an open-source framework and has many advantages. The five key points of Hadoop are:

1. Built on Java technology
2. Cost effective
3. Fault tolerant
4. Scalability and capacity increased by adding nodes
5. Industry-chosen technology for performing analytics on unstructured data
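The MAP and REDUCE functions described above can be simulated in plain Python. This is a sketch of the programming model only, with the framework's shuffle step made explicit; a real Hadoop job would be written as a Java MapReduce job (or in Hive or Pig) and run over HDFS:

```python
from collections import defaultdict

def map_phase(text):
    # MAP: emit an intermediate (key, value) pair for each word.
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    # Between phases, the framework groups intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # REDUCE: merge all intermediate values generated for each key.
    return {key: sum(values) for key, values in groups.items()}

# Two toy "input splits"; on a cluster each would be mapped in parallel.
docs = ["big data big insights", "data scientist"]
intermediate = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(intermediate))
print(counts["big"], counts["data"])  # 2 2
```

Because each document is mapped independently and each key is reduced independently, both phases parallelize naturally across the cluster's slave nodes; that independence is what the framework exploits.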

C. Cloudera Distribution for Hadoop Vs IBM InfoSphere BigInsights

Cloudera Distribution for Hadoop (CDH) is the world's most complete, tested, and popular distribution of Apache Hadoop and related projects. CDH is 100% Apache-licensed open source and is the only Hadoop solution to offer unified batch processing, interactive SQL, interactive search, and role-based access controls. More enterprises have downloaded CDH than all other such distributions combined.

IBM BigInsights delivers a rich set of advanced analytics capabilities that allows enterprises to analyze massive volumes of structured and unstructured data in its native format. The software combines open-source Apache Hadoop with IBM innovations, including sophisticated text analytics, IBM BigSheets for data exploration, IBM Big SQL for SQL access to data in Hadoop, and a range of performance, security, and administrative features. The result is a cost-effective and user-friendly solution for complex big data analytics.

D. InfoSphere BigInsights

InfoSphere BigInsights v3.0 is a software platform designed to help organizations discover and analyze business insights hidden in large volumes of a diverse range of data. Examples of such data include log records, online shopping click streams, social media data, news feeds, and electronic sensor output. To help firms derive value from such data in an efficient manner, BigInsights incorporates several open-source projects (including Apache Hadoop) and a number of IBM-developed technologies. The basic word count problem is solved through BigInsights in the following manner:

Figure: 11. BigInsights v3.0

Click Start BigInsights to start all required services.

Figure: 12. Icon of BigInsights

To verify that all required BigInsights services are up and running, issue this command from a terminal window: $BIGINSIGHTS_HOME/bin/status.sh. The components hdm, zookeeper, hadoop, catalog, hive, bigsql, oozie, console, and httpfs should all start successfully.

Figure: 13. Progress – Status of components

To find the word count: hadoop fs -ls WordCount output
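The output files listed by the command above pair each word with its count; Hadoop's default text output format writes one key, a tab, and its value per line (the reducer output is conventionally named part-r-00000). A small sketch of reading such a file back in Python (the sample contents are invented; real output depends on the job's input):

```python
# Hypothetical contents of a part-r-00000 file: Hadoop's TextOutputFormat
# writes "key<TAB>value" per line.
sample = "big\t2\ndata\t3\nhadoop\t1\n"

counts = {}
for line in sample.strip().splitlines():
    word, count = line.split("\t")
    counts[word] = int(count)

print(counts)  # {'big': 2, 'data': 3, 'hadoop': 1}
```

In practice one would stream the file via `hadoop fs -cat` or an HDFS client library rather than embed it as a string.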

Figure: 14. Running the command

To view the contents of the part-r-0000 file: hadoop fs -cat WordCount output/*00

Figure: 15. Partial output

Figure: 16. Execution of word count

VI. CONCLUSION

In this paper the categories of big data were presented, and the characteristics of big data to be dealt with were discussed. Hadoop was explained as a solution because it enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner. This paper also covered the basic word count problem through InfoSphere BigInsights, an analytics platform for big data. The other side of the coin is that today's threat environment imposes the three Vs of big data: volume, variety, and velocity. Each of these is increasing at an astonishing rate and has required a shift in how security vendors manage these threats.

