A Survey Paper On Machine Learning Approaches To Intrusion Detection


Oyeyemi Osho and Sungbum Hong (PhD)
Computational Data and Enabled Science & Engineering, Jackson State University, Jackson, Mississippi, USA

Abstract—For any nation, government, or city to compete favorably in today's world, it must operate smart cities and e-government. As trendy as that may seem, it comes with challenges, chief among them cyber-attacks. A lot of data is generated by the constant communication among the technologies involved. Initial attacks aimed at the cyber city were for destruction; this has changed dramatically toward revenue generation and incentives. Cyber-attacks have become lucrative for criminals, who attack financial institutions and cart away billions of dollars, and they have led to identity theft and many more cyber terror crimes. This puts an onus on government agencies to forestall the impact, or it may eventually ground the economy. The dependence on cyber-networked systems is impending, and it has brought a rise in cyber threats; cyber criminals have become more inventive in their approach. This proposed dissertation discusses the classification of various security attacks and intrusion detection tools which can detect intrusion patterns and then forestall a break-in, thereby protecting the system from cyber criminals. This research discusses intrusion detection approaches that address challenges faced by cyber security and e-government; it proffers intrusion detection solutions to create cyber peace, and it discusses how to leverage big data analytics to curb security challenges emanating from the internet of things. This survey paper discusses machine learning approaches to an efficient intrusion detection model using big data analytic technology to enhance computer cyber security systems.

Keywords—intrusion detection; cyber security; machine learning; cyber attacks; security

I. INTRODUCTION

The effects of cyber-attacks are felt around the world in different sectors of the economy, not just as a plot against government agencies. According to McAfee and the Center for Strategic and International Studies (2014), nearly one percent of global GDP is lost to cybercrime each year; the world economy suffered 445 billion dollars in losses from cyber-attacks in 2014. Adversaries in the cyber realm include spies from nation-states who seek our secrets and intellectual property, organized criminals who want to steal our identities and money, terrorists who aspire to attack our power grid, water supply, or other infrastructure, and hacktivist groups who are trying to make a political or social statement (Deloitte 2014). According to Dave Evans (2011), the explosive growth of smartphones and tablet PCs brought the number of devices connected to the internet to 12.5 billion in 2010, while the world's human population increased to 6.8 billion, making the number of connected devices per person more than 1 (1.84 to be exact) for the first time in history.
Reports show that the number of internet-connected devices will reach 31 billion worldwide by 2020. Internet and web technologies have advanced over the years, and the constant interaction of these devices has led to the generation of big data. According to John Walker (2014), using big data leads to better decisions: current technology generates huge amounts of data, which enables us to analyze it from different angles.

Because of the amount of information put out by these technologies, the security of data has become a major concern. New security concerns keep emerging and cyber-attacks never cease; according to Wing Man Wynne Lam (2016), "it is common to see software providers releasing vulnerable alpha versions of their products before the more secure beta versions". Vulnerability refers to the loopholes in the systems created; all technologies have weak points, which may not be openly known to the user until they are exploited by hackers.

Cyber security concerns affect all facets of society, including retail, financial organizations, the transportation industry, and communication. According to H. Teymourlouei et al. [32], better actionable security information reduces the critical time from detection to remediation, enabling cyber specialists to predict and prevent attacks without delays. The rate of increase in devices which require an internet connection has led to the emergence of the internet of things. This makes the world truly global and in one space; although the internet of things has provided many opportunities, such as new jobs, better revenue for government and the people involved in the industry, reduced cost of doing business, and increased efficiency, handling the big data associated with this trend has become the issue.

Almost all internet of things applications have sensors which monitor discrete events and mine the data generated from transactions. The data generated through these devices can be used in investigative research, which will eventually impact decision making on the part of the industries concerned.

The vulnerability market is a huge one, because some software developers sell vulnerabilities to hackers, and the hackers then prey on users of the software. Hackers used to be destructive in their approach; what we have seen in recent times has been purely about making money. Some ask you to call them so that they can offer you support at a certain bargained fee. Sometimes hackers access government systems through the network, seize important information stored on the system, and demand a ransom. At other times they detect bugs in software purchased by government agencies and demand a ransom, or else they release the flaw to the public, which may lead to a huge loss of data and money.

The emergence of these technologies has led to smart cities, a term which simply refers to the application of electronic data collection to supply the information required to manage available resources effectively. The concept of smart cities has been adopted by many states and nations, and web-based government services bring about an efficient running of government; this is evident in the first-world nations. To be highly competitive in today's world, no reasonable government will shy away from e-governance. As beautiful as this may sound, there are challenges militating against it; one prominent problem is hacking and e-terrorism. Because users put out information on the web that includes tax information, social security numbers, and other personal details, governments must be careful to secure the information being posted by citizens on government websites.

Cyber security for government facilities, including software, websites, and networks, is very challenging and in most cases very expensive to maintain. That is why this paper proposes big data analytic tools and techniques as a solution to cyber security. According to Tyler Moore (2010), "Economics puts the challenges facing cybersecurity into perspective better than a purely technical approach does"; systems often fail because the organizations that defend them do not bear the full costs of failure. Many of the problems faced by cyber security are economic in nature, and solutions can be proffered economically. In this paper, I offer a big data analytics perspective and recommendations that can help to ameliorate the state of the internet of things and cyber security. Looking at the business side of cyber security, I think it is either not properly funded or underfunded; if corporations and government agencies pump enough funding into the cyber city, they will become better for it, and this might reduce cyber-attacks. According to Wamba et al. (2017), Elgendy & Elragal (2014), and Holsapple et al. (2014), big data analytics is a holistic approach and system to manage, process, and analyze huge amounts of data in order to create value, from providing useful information about hidden patterns to measuring performance and increasing competitive advantage.

Some techniques mentioned in this paper include biometric authentication, data privacy and integrity policy, the Apache Storm algorithm, continuous monitoring surveillance, log monitoring, data compression, and event viewers. In this paper I implement big data analytics using R and Python, with Gephi, Tableau, and RapidMiner for analysis and data visualization.

II. INTRUSION DETECTION WITH GDA-SVM APPROACH

Some earlier research on cyber security intrusion detection through machine learning analytical tools is described below. Zulaiha et al. [13] used the Great Deluge algorithm (GDA) to implement feature selection and a Support Vector Machine (SVM) for classification. The Great Deluge algorithm was proposed by Dueck in 1993; it is a generic algorithm applied to optimization problems.
It has similarities to the hill-climbing and simulated annealing algorithms; the main difference between the Great Deluge algorithm and simulated annealing is the deterministic acceptance function for the neighboring solution.

The inspiration for the Great Deluge algorithm is a person climbing a hill while trying to keep his feet dry as the water level rises. Finding the optimum of an optimization problem is seen as finding the highest point in a landscape. The GDA accepts a neighboring solution whenever the absolute value of its cost function is equal to or better than the 'level', which is initialized to the value of the initial objective function. The advantage of the GDA is that it depends only on the 'up' value, which represents the speed of the rain: if 'up' is high, the algorithm is fast but gives poor results, while if 'up' is small, the algorithm produces better results at a good computational time.

Zulaiha et al. [18] used an SVM classifier in which the fitness of every feature subset is measured by means of 10-fold cross validation (10FCV), which generates the classification accuracy of the SVM. In 10FCV the data is divided into 10 subsets; one subset is used for testing while the remaining nine are used for training, and the accuracy rate is computed over 10 trials. Zainal et al. [21] describe the complete fitness function as:

Fitness = α · γ_R(D) + β · (C − R) / C

where γ_R(D) is the average accuracy rate obtained by conducting ten-fold cross-validation with SVM, D is the decision, R is the number of '1' positions, i.e., the length of the selected feature subset, C is the total number of features, and α and β are two parameters corresponding to the importance of classification quality and subset length, with α ∈ [0,1] and β = (1 − α).

Zulaiha et al. [18] proposed the following GDA-SVM feature selection approach:

Step 1: (Initialization) Randomly generate an initial solution. All features are represented by a binary string of length N (the original number of features), where '1' is assigned to a feature that will be kept and '0' to a feature that will be discarded.
Step 2: Measure the fitness of the initial solution, where the accuracy of the SVM classification and the number of chosen features are used to calculate the fitness function.
Step 3: Generate a random solution in the neighborhood of the initial solution by a mutation operator.
Step 4: Evaluate the fitness of the new solution and accept the solution if its fitness is equal to or greater than the level. Update the best solution if the fitness of the new solution is higher than the current best, and raise the level by a fixed increase rate.
Step 5: Repeat these steps until a stopping criterion is met. If the stopping condition is satisfied, the solution with the best fitness is chosen; otherwise, the algorithm generates a new solution.
Step 6: Train the SVM based on the best feature subset, and then evaluate on the testing data sets.

Pseudo code of the GDA-SVM approach for feature selection:
1. Initialize a random solution
2. Evaluate the fitness of the solution
3. While (stopping criteria not met)
4.   Generate at random a new solution about the initial solution
5.   Evaluate the fitness of the new solution; accept the solution where fitness is equal to or more than the level
6. End while
7. Output the best feature subset
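The loop above can be sketched in a few lines of Python. This is a minimal illustration, assuming scikit-learn is available; the mutation operator (a single bit flip), the fitness weights, and the parameter names (alpha, up_rate, n_iters) are illustrative assumptions rather than the authors' exact settings.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, alpha=0.9):
    # alpha * (10-fold CV accuracy) + (1 - alpha) * fraction of discarded features
    if not mask.any():
        return 0.0
    acc = cross_val_score(SVC(), X[:, mask], y, cv=10).mean()
    return alpha * acc + (1 - alpha) * (mask.size - mask.sum()) / mask.size

def gda_svm(X, y, n_iters=100, up_rate=0.01, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.integers(0, 2, X.shape[1]).astype(bool)      # Step 1: random binary string
    best_mask, best_fit = mask.copy(), fitness(mask, X, y)  # Step 2: initial fitness
    level = best_fit                                        # the 'water level'
    for _ in range(n_iters):                                # Steps 3-5
        cand = mask.copy()
        flip = rng.integers(0, cand.size)                   # mutation: flip one feature bit
        cand[flip] = not cand[flip]
        f = fitness(cand, X, y)
        if f >= level:                                      # deterministic acceptance
            mask = cand
            level += up_rate                                # the 'rain' raises the level
            if f > best_fit:
                best_mask, best_fit = cand.copy(), f
    return best_mask, best_fit                              # Step 6 then trains on best_mask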

Zulaiha et al. [18] used the KDD-CUP 99 data subset that was pre-processed by Columbia University and distributed as part of the UCI KDD Archive. The full training data contains about 5 million connection records, and the 10% training subset has 494,012 connection records.

Chebrolu et al. [20] assigned a label (A to AO) to each feature for easy referencing. The training data has 24 attack types falling into four main categories: 1) DOS: denial of service; 2) R2L: unauthorized access from a remote machine (remote to local); 3) U2R: unauthorized access to local superuser privileges (user to root); 4) Probing: surveillance. The features used by Almori & Othman (2011) were implemented and trained on this dataset; the intrusion detection outcome is either an attack or normal.

GDA performance based on the highest fitness function: the experiment conducted by Almori & Othman based on the highest fitness was used to train the SVM classifier. The features involved in the training process are B, G, H, J, N, S, W, and L. According to Zulaiha et al. [18], Table I lists the seven feature subsets produced by previous techniques whose classification performance was compared; the mean gives the average performance of each feature subset on three different test sets.

TABLE I. COMPARISON OF CLASSIFICATION RATE

Technique (feature subset):
• LGP (C, E, L, AA, AE, AI)
• SVDF (B, D, E, W, X, AG)
• MARS (E, X, AA, AG, AH, AI)
• Rough Set (D, E, W, X, AI, AJ)
• Rough-DSPO (B, D, X, AA, AH, AI)
• BA (C, L, X, Y, AF, AK)
• GDA (B, G, H, J, N, S, W, L)

Pratik et al. [22] proposed the first three row subsets; across the dataset, the mean values show that the Great Deluge algorithm (GDA) has the second-highest average classification rate. This shows that GDA performs better than every other technique aside from BA.

SVMs are based on the idea of structural risk minimization, which results in minimal generalization error [44]; the number of parameters does not depend on the number of input features but on the margin between data points. Hence, SVMs do not require a reduction of the feature size to avoid overfitting; they provide a mechanism to fit the surface of the hyperplane to the data using a kernel function. The advantage of the SVM is binary classification and regression with a low expected probability of generalization errors. SVMs possess real-time speed performance and scalability, and they are insensitive to the number of data points and the dimension of the data.

The support vector machine (SVM) approach is a classification technique based on Statistical Learning Theory (SLT). It is based on the idea of a hyperplane classifier, or linear separability. The goal of the SVM is to find a linear optimal hyperplane so that the margin of separation between the two classes is maximized [45]. Suppose we have training data points {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_N, y_N)}, where x_i ∈ R^d and y_i ∈ {−1, +1}. Consider a hyperplane defined by (w, b), where w is a weight vector and b is a bias. A new object x can be classified with the following function:

f(x) = sign( Σ_{i=1}^{N} α_i y_i K(x_i, x) + b )

where K(x_i, x) is the kernel function.
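As a concrete illustration of the decision function above, the following minimal Python sketch (assuming scikit-learn and an invented toy two-class dataset) trains a kernel SVM; the learned dual coefficients and intercept correspond to the α_i · y_i terms and the bias b.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # 200 connection records, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy "normal vs. attack" labels

clf = SVC(kernel="rbf", C=1.0).fit(X, y)     # f(x) = sign(sum_i a_i y_i K(x_i, x) + b)
print(clf.predict(X[:5]))                    # classify new objects
print(clf.dual_coef_.shape, clf.intercept_)  # the a_i * y_i coefficients and bias b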
III. SEQUENTIAL PATTERN MINING APPROACH

Wenke Lee et al. [49] propose that, to build an effective base classifier, enough data must be trained on to identify meaningful features. Some algorithms that can be used to mine patterns from big data are listed below; a short sketch of the support and confidence computation follows the list.

• Association rules: The goal of mining association rules is to determine feature correlations from big data. An association rule is an expression X → Y, [confidence, support] [49], where X and Y are subsets of the items in a record, support is the percentage of records that contain X ∪ Y, and confidence is support(X ∪ Y) / support(X).

• Frequent episodes: A frequent episode is a set of events that occur frequently within a time window. Events in a serial episode must occur in partial order over time, while events in a parallel episode have no such constraint. For X and Y, X → Y is a frequent episode rule with confidence = support(X ∪ Y) / support(X) and support = frequency(X ∪ Y) [49].
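Support and confidence are simple counting operations, as the following Python sketch shows; it computes both quantities for a hypothetical rule over a handful of audit records, where the record contents and the rule X → Y are invented purely for illustration.

records = [
    {"service=http", "flag=SF", "label=normal"},
    {"service=http", "flag=SF", "label=attack"},
    {"service=ftp",  "flag=SF", "label=normal"},
    {"service=http", "flag=REJ", "label=attack"},
]

def support(itemset, records):
    # fraction of records that contain every item in the itemset
    return sum(itemset <= r for r in records) / len(records)

X, Y = {"service=http"}, {"flag=SF"}
sup = support(X | Y, records)          # support(X u Y)
conf = sup / support(X, records)       # confidence = support(X u Y) / support(X)
print(f"X -> Y: support={sup:.2f}, confidence={conf:.2f}")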

• Using the discovered patterns: The association rules and frequent episode rules can be combined into one aggregate rule set using a merge process: find a match in the aggregate rule set, where a match occurs when both the LHS and RHS of the rules match and the support and confidence values match within a tolerance ε.

Shi-Jie Song et al. [47] proposed a misuse intrusion detection model based on sequential pattern mining using two steps. The first step is to search for one-large sequences:

• Each item in the original database is a candidate one-large-sequence-one-itemset C_1. A C_1 is taken out and the items in the database are scanned vertically; if an instance contains the C_1, the support of the C_1 is increased by 1. If the support is greater than the given minimal support, the C_1 is a one-large-sequence-one-itemset L_1. This process is repeated, and binary records of the L_1's are stored in a temporary database.
• Two one-large-sequence-(k−1)-itemsets L_{k−1} are combined into a candidate one-large-sequence-k-itemset C_k; in the C_k, the k−1 bits keep their original order.
• When a C_k is taken out, each L_{k−1} is scanned vertically in the temporary database and horizontally for the k bits. If the C_k exists in an instance, the support of the C_k is increased by 1; if the support is greater than the minimal support, then the C_k is an L_k. The L_k are found in turn and listed in the one-large-sequence itemset.

The second step is to search for k-large sequences SL_k:

• Two (k−1)-large sequences SL_{k−1} in the temporary database are combined to form a candidate k-large sequence SC_k: two SL_{k−1}'s with the same former k−2 bits are sorted out and their bits combined into two k−1 bits, which can be arranged in four orders.
• When the SC_k is taken out, each SL_{k−1} is searched vertically in the temporary database; if the k bits exist in an instance, the support of the SC_k is increased by 1. If the support is greater than the minimal support, the SC_k is an SL_k, and the process is repeated to find all SL_k.

Fidalcastro et al. [46] proposed using fuzzy logic with sequential data mining in intrusion detection systems. Fuzzy logic is used as a filter for feature selection, to avoid overfitting of patterns and to reduce the dimensional complexity of the data. The filter-based fuzzy logic approach is applied for feature selection from the training data, with the following filtering criteria:

Information gain (IG) measures the expected reduction in the entropy of the class before and after observing a feature; features with a larger difference are selected. It is measured as [46]:

InfoGain(S, F) = Entropy(S) − Σ_{v ∈ V} (|S_v| / |S|) · Entropy(S_v)

where S is the pattern set, S_v is the subset of S for which feature F has value v, and |S| is the number of samples in S. The entropy of the class before observing features is defined as:

Entropy(S) = − Σ_{c ∈ C} (|S_c| / |S|) · log2(|S_c| / |S|)

where S_c is the subset of S belonging to class c and C is the class set; IG is the fastest and simplest ranking method [46].

Gain ratio (GR) normalizes the IG by dividing it by the entropy of S with respect to feature F; the gain ratio is used to discourage the selection of features with uniformly distributed values. It is defined as:

GainRatio(S, F) = InfoGain(S, F) / SplitInfo(S, F)
where

SplitInfo(S, F) = − Σ_{i=1}^{n} (|S_i| / |S|) · log2(|S_i| / |S|)

S_i is the subset of S for which feature F takes its i-th possible value, and n is the number of subclasses split by feature F.

Chi-square (CS) measures the chi-square value of each feature with respect to the classes; the values are ranked, and large chi-square values of the features indicate strong correlation with the classes. The chi-square of feature F is defined as:

ChiSquare(F) = Σ_{i=1}^{m} Σ_{j=1}^{k} (A_ij − E_ij)^2 / E_ij,  with  E_ij = (R_i × C_j) / |S|

where k is the number of classes, C_j is the number of samples in the j-th class, m is the number of intervals discretized from the numerical values of F, R_i is the number of samples in the i-th interval, A_ij is the number of samples in the i-th interval with the j-th class, and E_ij is the expected occurrence of A_ij [46].
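The three filter criteria can be computed directly from counts. Below is a compact Python sketch for a single discrete feature; the six-record dataset is invented purely for illustration.

import math
from collections import Counter

labels  = ["normal", "attack", "normal", "attack", "normal", "normal"]
feature = ["http", "http", "ftp", "http", "ftp", "ftp"]   # values of feature F

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def subset_labels(v):
    return [y for f, y in zip(feature, labels) if f == v]

# InfoGain(S, F) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
info_gain = entropy(labels) - sum(
    len(subset_labels(v)) / len(labels) * entropy(subset_labels(v))
    for v in set(feature))

# SplitInfo(S, F) is the entropy of F's own value distribution
gain_ratio = info_gain / entropy(feature)

# ChiSquare(F): observed count A_ij vs. expected count E_ij = R_i * C_j / |S|
n = len(labels)
chi_square = sum(
    (sum(1 for f, y in zip(feature, labels) if f == v and y == c)
     - feature.count(v) * labels.count(c) / n) ** 2
    / (feature.count(v) * labels.count(c) / n)
    for v in set(feature) for c in set(labels))

print(info_gain, gain_ratio, chi_square)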

IV. CLUSTERING APPROACH

The objective of cluster analysis is to find groups in data [53]; the groups are based on similar characteristics. For a dataset D whose records are described by p attributes, D_j = [A_1, A_2, …, A_p], D is partitioned into k clusters {C_i, i = 1, …, k} so that each record is assigned to the cluster with which its similarity is maximal:

D_j ∈ C_i  ⟺  SM(D_j, C_i) = max_{l = 1,…,k} SM(D_j, C_l)

where SM is the similarity between a record D_j and a cluster, and the clusters {C_i} must meet the following conditions:

C_i ≠ ∅;  C_i ∩ C_l = ∅ for i ≠ l;  ∪_{i=1}^{k} C_i = D.

Clustering has three main partitioning approaches. In the hierarchical approach, a hierarchy of clusters is built: each observation starts in its own cluster, and pairs are merged as one moves up the hierarchy. The non-hierarchical approach requires a random initialization of the clusters and a set of rules to define the criterion. The biomimetic approach is inspired by ethology, for example ant colonies; the models developed from these ideas provide better solutions to problems in artificial intelligence [53].

Clustering analysis is a pattern recognition procedure whose goal is to find patterns in a dataset. It identifies clusters and builds a typology [54] of sets from the data; a cluster is a collection of data objects that are like one another. A good clustering method produces high-quality clusters such that inter-cluster similarity is low and intra-cluster similarity is high; this means members of a cluster are more like each other than they are like members of different clusters [54].

Fixed-width clustering procedure: The fixed-width clustering algorithm operates on a set of network connections C_T used for training [56]; each connection c_i is represented by a d-dimensional feature vector. Fixed-width clustering involves the following stages:

Normalization: This ensures that features have the same influence when calculating the distance between connections. Each continuous feature x_j is normalized in terms of the number of standard deviations from the mean of that feature.

Cluster formation: After normalization, the distance between each connection c_i in the training set C_T and the centroid of each cluster is measured. If the distance [56] to the closest cluster is less than the threshold w, the centroid of the closest cluster is updated and the total number of points in that cluster is incremented; otherwise a new cluster is formed.

Cluster labelling: Clusters are labelled based on their size: if a cluster contains more than the classification threshold fraction τ of the total points in the data set, the cluster is labelled normal; otherwise it is labelled anomalous.

Test phase: In this final stage of the fixed-width clustering approach, each new connection is compared to each cluster to determine whether it is normal or anomalous. If the calculated distance from the connection to its closest cluster is less than the cluster width parameter w, the connection shares the label of its closest cluster; otherwise the connection is labelled anomalous.
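The procedure is easy to sketch in Python. The fragment below is a minimal illustration, assuming already-normalized feature vectors; the width w and the threshold fraction tau are illustrative parameter values, not settings from the cited work.

import numpy as np

def fixed_width_cluster(X, w=1.0, tau=0.05):
    centroids, counts = [], []
    for x in X:                                       # cluster formation
        if centroids:
            d = [np.linalg.norm(x - c) for c in centroids]
            i = int(np.argmin(d))
            if d[i] < w:                              # update the closest cluster
                counts[i] += 1
                centroids[i] += (x - centroids[i]) / counts[i]
                continue
        centroids.append(np.asarray(x, dtype=float).copy())  # start a new cluster
        counts.append(1)
    # labelling: clusters holding more than a fraction tau of all points are normal
    labels = ["normal" if c / len(X) > tau else "anomalous" for c in counts]
    return centroids, labels

def classify(x, centroids, labels, w=1.0):
    # test phase: a new connection takes its closest cluster's label if within w
    d = [np.linalg.norm(x - c) for c in centroids]
    i = int(np.argmin(d))
    return labels[i] if d[i] < w else "anomalous"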
S. Sathya et al. [56] proposed a clustering method in which the frequency of common pairs in each cluster is found using a cluster index, and the cluster having the maximum number of common pairs among the k-nearest neighbors is chosen for merging. This process reduces the number of computations considerably and performs better.

Nong Ye et al. [57] proposed a scalable clustering technique titled CCA-S (Clustering and Classification Algorithm-Supervised). CCA-S treats the dataset as data points in a dimensional space; for intrusion detection, the target variable is binary with two possible values: 0 for normal and 1 for intrusion. CCA-S clusters data based on two parameters: the distance between data points and the class label of the data points. Each cluster represents a pattern of normal or intrusive activity, depending on the class label of the data points in the cluster. The three stages of CCA-S are as follows:

Training (supervised clustering): It takes two steps to incrementally group the N data points in the training data set into clusters.
• The first step calculates the correlation between the predictor variables X_i and the target variable Y_i. Two dummy clusters are formed, one for the normal activities and the other for the intrusive activities; the centroid of each cluster is determined by the mean vector of all corresponding activities in the training dataset.
• Each training data point is then grouped incrementally into clusters: given a data point X, the nearest cluster L to this data point is determined using a distance metric weighted by the correlation coefficient of each dimension. If L has the same class as X, then X is grouped with L; otherwise a new cluster is created with this data point as its centroid.

Classification: There are two methods to classify a data point X in a testing dataset [57].
• Assign the data point X the class dominant in the k nearest clusters, which are found using a distance metric weighted by the correlation coefficient of each dimension.
• Use the weighted sum of the distances of the k nearest clusters to this data point to calculate a continuous value for the target variable in the range [0, 1].

Incremental update: At this stage the correlation and clustering of the data points are calculated and the result is stored; as new training data are presented, each step of the training is repeated for the new data points and the clusters are updated incrementally.

V. HADOOP APPROACH TO INTRUSION DETECTION

Apache Hadoop is an open-source software framework; it processes big data and manages programs on a distributed system. It has two components: MapReduce and the Hadoop Distributed File System (HDFS).

M. Mazhar et al. [57] proposed a Hadoop system to process network traffic in real time for intrusion detection with higher accuracy in a high-speed big data environment. Traffic is captured with a high-speed capturing device, and the captured traffic is sent to the next layer, the filtration and load-balancing server (FLBS). Only the undetermined traffic passes the filter, after efficient searching and comparison against an in-memory intruders database; the unidentified network flows and packet header information are sent to the third-layer (Hadoop layer) master servers. The role of the FLBS is to decide which packets are sent to which master server, depending on the IP addresses, thereby balancing the load. When the network traffic reaches the master, it generates a sequence file for each flow so that it can be processed by the Hadoop data nodes; at the Hadoop nodes the packet is parsed to extract the information it carries using Pcap (packet capture) [57]. This process continues over a period of network flow and results in a set of sequence files which are lined up in parallel; the processed sequence files are then analyzed by Apache Spark, a third-party tool in the Hadoop ecosystem. The feature values are sent to layer four, the decision servers, which classify the packets into normal or intrusion based on their parameter values; the decisions are then stored in the in-memory intrusion database.

Sanraj et al. [58] proposed a system based on the integration of two different technologies, Hadoop and CUDA, that work together to boost a network intrusion detection system. The system is separated into different modules, in which Hadoop takes care of data organization and a GPGPU (general-purpose graphics processing unit) takes care of intrusion analytics and identification. Network packets and data logs enter the Hadoop system through Flume, whose function is to provide real-time streaming data collection and routing. Traffic ingestion is done over HDFS (the Hadoop distributed file system), and pre-organization is done on server logs and packet data. Data compiled on HDFS is then moved forward to the GPGPU for network intrusion detection. The NIDS (network intrusion detection system) was designed with five phases, explained below:
1. Traffic ingestion: This stage refers to how packet data is streamed from the various onsite servers to the Hadoop cluster via the DFS (distributed file system). To make this trans
