Big Data Security Challenges: Hadoop Perspective


International Journal of Pure and Applied Mathematics, Volume 120, No. 6, 2018, 11767-11784
ISSN: 1314-3395 (on-line version)
url: http://www.acadpubl.eu/hub/ (Special Issue)

Gayatri Kapil (1), Alka Agrawal (2), Raees Ahmad Khan (3)
(1,2,3) Department of Information Technology, Babasaheb Bhimrao Ambedkar University (A Central University), Lucknow, India.
(1) gayatri1258@gmail.com, (2) alka_csjmu@yahoo.co.in, (3) khanraees@yahoo.com

July 29, 2018

Abstract

With the exponential growth of big data, it has become increasingly vulnerable and has been exposed to malicious attacks. These attacks can damage the essential qualities of privacy, integrity and availability of information systems. In order to deal with these malicious intentions, it is necessary to develop effective security mechanisms. This paper first describes Hadoop, its components and its current security mechanisms, and then analyzes its security problems and their risks. In addition, some important aspects of big data Hadoop security and privacy are proposed to increase its trust and safety; ultimately, based on these details, the paper concludes with the Hadoop security challenges.

Key Words: Hadoop, MapReduce, HDFS, Hadoop Components, Hadoop Security, Data Encryption, HDFS Encryption.

1 Introduction

A challenge that is gradually emerging in developing big data technology is to initiate new business opportunities for all eminent industries. Nowadays, almost everything is becoming digital, which is why this era is called the digital era; a major concern for IT industries is to keep this data secure and to process it, as it is being produced from various different sources. Also, many IT industries still face problems converting the data generated from different sources, or unstructured data, into a usable format so that it can be processed and used in other applications. Hadoop has emerged as a solution to almost all big data problems. As big data differs from other data in terms of volume, velocity, variety and value [1], its processing has also become difficult for most government and business applications. Because of the huge volume of big data, traditional methods for managing, extracting and analysing it are not very useful, as they may not provide accurate results for decision making. Therefore, its management in real time has become a major concern for research. For using big data in a managed way, researchers and practitioners have explored various tools and techniques. Thus, big data is a moving target and requires more attention to capture, curate, handle and process it. Initially, it was expected that the data was small enough to be easily handled by an RDBMS, but RDBMS tools have now failed to manage big data. To overcome this, the Apache Software Foundation has developed a system tool called Apache Hadoop. It is one of the most highly used technologies which can handle a large volume of data as well as provide high-speed access to the data within the current application. It is used for distributing, processing and running applications over large datasets. It is a Java-based tool and works on a master-slave technique to handle the large volume of continuous data traveling at high speed from different sources like events, emails, social media, external feeds, etc. [1]. Also, it is an easily available tool to store and process a large volume of data and provide high-speed access within the application, and is used by big industries like Google, Yahoo, Facebook, etc. [2]. About 63% of various communities and organizations are using Hadoop to manage a huge amount of structured, semi-structured and unstructured data [2].

Several enterprises and organizations rapidly depend with trust and confidence on Hadoop for storing their previous data and processing it. But security and protection of the stored data is somewhat lacking in Hadoop; this is the major limitation of this processor. To understand this, consider the case of storage and processing system security, which is now very popular. As storage and processing system nodes often exchange data, the risk of a privacy and security breach arises, and the security of this data requires a strong security mechanism. For these reasons, this paper presents details about Hadoop and its components, because it remains the essential framework for processing and managing large data, and then about some existing mechanisms to increase security and privacy. The rest of the paper is organized as follows: Section 2 defines the architecture of Hadoop, including its components and how it stores, processes and manages big data. Big data security challenges and existing Hadoop security mechanisms are discussed in Section 3 and Section 4, respectively; Section 4 also enumerates the directions to be taken while using big data Hadoop, including security and privacy measures. Finally, the authors conclude their work in Section 5.

2 Hadoop: Big Processing Solution

Apache Hadoop is an open source platform that has introduced a new, easy way of storing and processing data. It was actually influenced by Google's published documents, which highlighted its attempt at handling the barrage of data. Consequently, it has become the basic standard for storing, processing and analysing enormous amounts of data, in terabytes and petabytes [3]. Hadoop even provides the same processing services as expensive hardware on affordable, industry-standard servers which can store and process data without any limits. By using straightforward programming models, it processes data varying from gigabytes to petabytes produced by a series of computers. Nowadays, the scenario has changed and data is increasing rigorously, hence the RDBMS is not able to perform efficiently because of the large volume of data. Vikram S. et al. have defined big data in terms of its five dimensions, namely Volume, Velocity, Variety, Value and Complexity, and also suggested the basic idea of handling big data with the help of the Hadoop architecture, i.e., name node, data node, edge node and HDFS [17]. The authors have also introduced the issues faced by different users of big data, i.e., data privacy and search analysis, which require urgent attention from researchers. In [16], the authors discussed the relation between the key components of Hadoop, i.e., MapReduce and the Hadoop Distributed File System. MapReduce is used for large-scale distributed processing, whereas the Hadoop Distributed File System is used to store all input data and generate data for further applications. In [16], HDFS performance issues are further divided into three categories, i.e., software architectural bottlenecks, portability limitations, and portability assumptions. A typical Hadoop architecture is shown in Figure-1.

Figure-1: Hadoop Architecture

2.1 Hadoop Distributed File System (HDFS): Storage of Data

HDFS has been developed using a distributed file system design. It is highly fault tolerant, holds a significant amount of data and provides easy access to that data. HDFS is a core component of the Hadoop architecture, used to store various input and output data for the application. HDFS is a block-structured file system [2-3]. Currently, the default block size is 128 MB (previously 64 MB) and the default replication factor is 3. Block size and replication factor are configurable parameters. An individual file is divided into fixed-size blocks with the following characteristics. 1) Blocks are stored in a cluster of one or many machines with enough data storage capacity, and the data node manages the data storage of its system. 2) HDFS is responsible for recovery of a Data Node and distributes data across the data nodes in groups. 3) HDFS plays an administrative role to add or remove nodes from a cluster, as shown in Figure-2.

Figure-2: HDFS Storage
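Because block size and replication factor are configurable, a client can override them per job as well as per cluster. The following is a minimal sketch in Java using Hadoop's Configuration and FileSystem APIs; the cluster URI and the file path are illustrative assumptions, not values from the paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsBlockConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side overrides of the defaults discussed above:
        // 128 MB block size and a replication factor of 3.
        conf.setLong("dfs.blocksize", 128 * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        // "hdfs://namenode:9000" is an assumed cluster address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Inspect how an existing (assumed) file was actually stored.
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());
    }
}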

HDFS Terminologies

Name Node
The name node is the core part of the Hadoop system. If the name node crashes, the entire Hadoop system goes down. The name node manages the file system namespace and stores the metadata information along with the location of the data blocks.

Secondary Name Node
The secondary name node is responsible for copying and merging the namespace image and the edit log. If the name node crashes, the namespace image stored in the secondary name node can be used to restart the name node. The secondary name node is the backbone of the name node.

Data Node
It stores the blocks of data and retrieves them. The data nodes also report the blocks' information to the name node periodically.

2.2 MapReduce: Distributed Data Processing Framework

Hadoop MapReduce is a Java-based system, based on the programming paradigm published by Google, in which data from the HDFS store is processed using the MapReduce model [2-3]. In the MapReduce paradigm, each job has a user-defined map phase, which is processed in a completely parallel manner by splitting the input data set into independent chunks, and a reduce phase, which consumes that data for user output or further processing. HDFS is the storage system for both the input and the output of a MapReduce job. The main components of MapReduce are described as follows: 1) The Job Tracker is the master of the system; it manages the jobs and resources in the cluster, i.e., the Task Trackers. 2) Task Trackers are the slaves, deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the Job Tracker. 3) The Job History Server is a daemon that provides historical information about completed applications, as shown in Figure-3.

MapReduce Process

Figure-3: MapReduce Process

Step 1: Input data in the form of image, video or text files is converted into Key 1 (K1) and Value 1 (V1) pairs by the input record reader.
Step 2: The output (K1, V1) is converted into Key 2, Value 2 (K2, V2) pairs by the Mapper.
Step 3: The second-stage output, i.e., (K2, V2), is converted into (K2, list(V2)) with the help of shuffle and sort techniques.
Step 4: The Reducer takes the (K2, list(V2)) pairs and generates the output (K3, V3).
Step 5: The final output is generated by the Output Record Writer, which takes the output of the Reducer (K3, V3) as input.
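To make Steps 1-5 concrete, here is a minimal word count sketch using the org.apache.hadoop.mapreduce API; it is an illustrative example, not code from the paper, and the input/output paths are assumptions. Here (K1, V1) is a byte offset and a line of text, (K2, V2) is a word and the count 1, and the Reducer emits (K3, V3) as a word and its total.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Step 2: the Mapper turns (K1, V1) = (offset, line) into (K2, V2) = (word, 1).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Step 4: the Reducer receives (K2, list(V2)) after shuffle and sort (Step 3)
    // and emits (K3, V3) = (word, total count).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Assumed HDFS paths for illustration only.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}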

2.3 Other Hadoop Components

Hadoop is neither a single tool nor only a programming language. Hadoop is a software library written in Java and used for processing large amounts of data in a distributed environment. Hadoop alone cannot provide all the services or facilities required to process big data. Its ecosystem is a set of tools which help to process large data, ranging in size from gigabytes to petabytes, simultaneously. Hadoop is an Apache project which provides many facilities, such as MapReduce for parallel computing. However, there is much more to do if one wants to create a recommendation engine over big data, to run clustering algorithms over big data, or to get near real-time access using big data itself. To meet these requirements, one has to add more components to Hadoop. Apache Pig, Hive, HBase, HDFS, MapReduce, Mahout, Oozie, Zookeeper and Sqoop are several components which, when combined with the original Hadoop, help to make the ecosystem much more scalable for a robust solution.

Table-1: Hadoop Components
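As one hedged illustration of how an ecosystem component builds on Hadoop, the sketch below queries Hive through its HiveServer2 JDBC interface from Java; the connection URL, credentials and the "web_logs" table are all assumptions made for illustration, not details from the paper.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 host, port and database.
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // "web_logs" is a hypothetical table used only for illustration.
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}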

3 Big Data: Hadoop Security Challenge

To achieve high-quality performance in the field of availability and scalability, IT organizations are depending on Hadoop and its components. Amazon uses it to build its product search indices and process its millions of sessions. Facebook uses it for its data warehouse, log processing and recommendation systems [8]. Cloudspace uses Hadoop and its components for its customer projects. Twitter also uses it to manage the data generated on its website daily. The New York Times uses it for video and image analysis; in addition to these great performers, IBM, LinkedIn and the University of Freiburg use it as well [15]. The Hadoop ecosystem is evolving to satisfy the needs of many organizations, researchers and governments.

At present, some organizations and enterprises analyze the information and location data collected from various customers in different areas. Later, they organize the collected data for marketing activity, so personal data can be disclosed when analyzing customer data. That has created a new target for hackers and other cyber criminals. This data, which was previously used only within organizations, is extremely valuable and subject to privacy laws and regulations. Consequently, these companies need security mechanisms to secure the data and protect its privacy. That means the demand for data scientists, and for stronger security and privacy in protecting users' personal information, continues its ascent.

4 Hadoop Security

Initially, at the time of the creation of Hadoop, security issues were not a top priority [23]. The only thing in the minds of the developers was to build a system for distributed and parallel processing of huge data. To solve these problems, there needs to be strong security in Hadoop for securing sensitive information [23]. Later on, some mechanisms were proposed to secure Hadoop clusters. Authorization, authentication, encryption and key management are the available and feasible pillars for securing a Hadoop cluster. Firstly, Hadoop distributions performed much of the integration and setup work with central security services such as Active Directory or LDAP through the Apache Knox Gateway [23]. It is a system that provides a single point of authentication and access for Hadoop services. It provides access to the Hadoop cluster over HTTP/HTTPS and eliminates the need for end users to interact directly with the cluster nodes.
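As a hedged sketch of what such a single point of access over HTTPS looks like in practice, the Java snippet below lists an HDFS directory through a Knox gateway using the WebHDFS REST API; the gateway host, topology name ("default"), path and credentials are all assumptions made for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KnoxWebHdfsExample {
    public static void main(String[] args) throws Exception {
        // Assumed gateway host, topology, path and credentials.
        URL url = new URL(
            "https://knox-gateway:8443/gateway/default/webhdfs/v1/data?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Knox authenticates the caller (e.g., against LDAP/Active Directory);
        // here we pass HTTP Basic credentials.
        String auth = Base64.getEncoder()
                .encodeToString("user:password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // The cluster's internal nodes are never contacted directly:
        // Knox proxies the WebHDFS call on the client's behalf.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}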
