An Overview Of Intrusion Detection Using Soft Computing

Transcription

CS 591, Fall 2009 – PROJECT REPORT1An overview of Intrusion DetectionUsing Soft ComputingArchana Sapkota and Palden Lama Abstract— Many hacking tools and intrusive methods haveappeared along with a widespread use of computer networks.Using an intrusion detection system (IDS) is one way of dealingwith suspicious activities within a network. It monitors activitiesof a given environment and decides whether these activities aremalicious (intrusive) or legitimate (normal). Intrusion detectionin general is either anomaly based or misuse based. In both cases,an IDS needs to make an intelligent decision usually based onpattern recognition and probabilistic reasoning. With thismotivation, a lot of research efforts have explored the use of softcomputing in the intrusion detection field. Soft computing is aninnovative approach to construct a computationally intelligentsystem which parallels the extraordinary ability of the humanmind to reason and learn in an environment of uncertainty andimprecision. Typically, soft computing consists of severalcomputing paradigms, including neural networks, fuzzy sets,approximate reasoning, genetic algorithms, simulated annealing,etc. This paper will present an overview of various softcomputing techniques used for intrusion detection and elaborateon a few recently applied techniques.Index Terms— Intrusion Detection, Soft computing, SVM, DT,Fuzzy, LGP, ClassifierI. INTRODUCTIONThe first level of protection techniques in computer andnetwork security used in various organizations are userauthentication, data encryption, avoiding programming errorsand firewalls. As network attacks have increased in numberand severity over the past few years, the security infrastructurementioned above are not sufficient to provide the level ofsecurity needed. Therefore, intrusion detection is required asan additional wall for protecting systems despite theprevention techniques. Intrusion detection is the process ofmonitoring the events occurring in a computer system ornetwork and analyzing them for signs of intrusions, defined asattempts to compromise the confidentiality, integrity,availability, or to bypass the security mechanisms of acomputer or network.Intrusion detection systems (IDSs) are software or hardwaresystems that automate the process of monitoring the eventsoccurring in a computer system or network, analyzing themfor signs of security problems. Many ISDs available in theliterature contain mainly three functional components. A)information sources B) Analysis C) Response/Notification.InformationSource AInformationSource AAnalysisNotificationInformationSource AFigure 1: Components of IDSInformation Sources: The different sources of eventinformation can be used to determine whether an intrusion hastaken place. The information can be extracted from differentlevels of sources like network, host , application monitoring.Analysis: This is the component of intrusion detectionsystems that actually organizes makes the decision based onthe information source whether a particular event is anintrusion or not. The commonly used analysis approaches aremisuse detection and anomaly detection.Response: The set of actions that the system takes once theintrusion is detected. These are typically grouped into activeand passive measures. Active measures involve someautomated intervention on the part of the system, and passivemeasures involve reporting IDS findings to humans interactingwith the systems.Based on the information sources, there are mainly two kindsof IDS.Host Based IDS( HIDS) : Host-based IDSs operate oninformation collected from within an individual computersystem that is being monitored. Host-based IDSs normallyutilize information sources of two types, operating systemaudit trails, and system logs. They periodically analyze logs,perform file system integrity check. Some examples of HIDSare: ISS RealSecure Server Sensor(Generic); Tripwire,AIDE(Check host file system); BlackICE, PortSentry (Checkhost network connections); LogSentry, Swatch (Check host‟slog files).Network Based IDS( NIDS): NIDSs detect attacks bycapturing and analyzing network traffic contents and patterns.

CS 591, Fall 2009 – PROJECT REPORTSome examples of NIDS are Snort, Cisco IDS4235. Based onthe analysis techniques, there are mainly two kinds of IDSs:Misuse Detection and Anomaly Detection.Misuse Detection: Uses well-defined patterns of the attackthat exploit weaknesses in system and application software toidentify the intrusions. These patterns are encoded in advanceand used to match against the user behavior to detect intrusion.Anomaly Detection: Uses the normal usage behaviorpatterns to identify the intrusion. The normal usage patternsare constructed from the statistical measures of the systemfeatures. The behavior of the user is observed and anydeviation from the constructed normal behavior is detected asintrusionWith this background and different types of IDSs available inthe literature, we are interested more in the „Analysis‟Component of the IDS where the main decision regardingwhether a particular event is an intrusion or not. Thiscomponent is very important because this determines theperformance of IDS on the basis of false positives and falsenegatives. In out context the false positives mean somethingoccurs that causes IDS to incorrectly identify an intrusionwhen none has occurred and the false positives meansomething occurs that causes IDS to incorrectly fail to identifyan intrusion when one has in fact occurred. Accuracy of IDS isdetermined by the number of false positives and completenessis determined by the number of false negatives.Many techniques have been used for the analysis of intrusiondetection. The main techniques used are statistical approaches,predictive pattern generation, expert systems, keystrokemonitoring, model-based Intrusion detection, state transitionanalysis, pattern matching, and data mining techniques.Statistical approaches compare the recent behavior of a user ofa computer system with observed behavior and any significantdeviation is considered as intrusion. This approach requiresconstruction of a model for normal user behavior. Any userbehavior that deviates significantly from this normal behavioris flagged as an intrusion. Intrusion detection expert system(IDES) exploited the statistical approach for the detection ofintruders. IDES maintains profiles, which is a description of asubject‟s normal behavior with respect to a set of intrusiondetection measures. Profiles are updated periodically, thusallowing the system to learn new behavior as users alter theirbehavior. These profiles are used to compare the user behaviorand informing significant deviation from them as the intrusion.IDES also uses the expert system concept to detect misuseintrusions. The advantage of this approach is that it adaptivelylearns the behavior of users, which is thus potentially moresensitive than human experts. This system has severaldisadvantages. The system can be trained for certain behaviorgradually making the abnormal behavior as normal, whichmay make the intruders undetected. Determining the thresholdabove which an intrusion should be detected is a difficult task.Setting the threshold too low results in false positives andsetting it too high results in false negatives.2Predictive pattern generation uses a rule base of user profilesdefined as statistically weighted event sequences (Teng andChen, 1990). This method of intrusion detection attempts topredict future events based on events that have alreadyoccurred. This system develops sequential rules of the form:E1 - E2 - E3 - (E4 94%, E5 6%)This would mean that for the sequence of observed events E1followed by E2 followed by E3, the probability of event E4occurring is 94% and that of E5 is 6%. The rules are generatedinductively with an information theoretic algorithm thatmeasures the applicability of rules in terms of coverage andpredictive power. An intrusion is detected if the observedsequence of events matches the left-hand side of the rule butthe following events significantly deviate from the right-handside of the rule. The main advantages of this approach includeits ability to detect and respond quickly to anomalousbehavior, easier to detect users who try to train the systemduring its learning period. The main problem with the systemis its inability to detect some intrusions if that particularsequence of events have not been recognized and created intothe rules.The Keystroke monitoring technique utilizes a user‟skeystrokes to determine the intrusion attempt. The mainapproach is to pattern match the sequence of keystrokes tosome predefined sequences to detect the intrusion. The mainproblems with this approach is a lack of support from theoperating system to capture the keystroke sequences.Furthermore, there are also many ways of expressing thesequence of keystrokes for the same attack. Some shellprograms like bash, ksh have the user definable aliases utility.These aliases make it difficult to detect the intrusion attemptsusing this technique unless some semantic analysis of thecommands is used. Automated attacks by maliciousexecutables cannot be detected by this technique as they onlyanalyze keystrokes.In literature, intrusion detection systems have been also beenapproached by various soft computing techniques. We areparticularly interested in learning the role of soft computing inintrusion detection systems because unlike the traditional, hardcomputing, it is aimed at an accommodation with thepervasive imprecision of the real world. Thus, the guidingprinciple of soft computing is: '.exploit the tolerance forimprecision, uncertainty and partial truth to achievetractability, robustness, low solution cost and better rapportwith reality'. In the final analysis, the role model for softcomputing is the human mind.Section II describes some soft computing techniques that havebeen used in the literature for intrusion detection systems.Section III discusses various approaches of designingclassifiers for intrusion detection. Section IV presents thedatabase used for experiments. The experimental setup thathave been used and results obtained in the different techniquesare presented in Section V. The conclusion is given in SectionVI.

CS 591, Fall 2009 – PROJECT REPORTII. SOFT COMPUTING TECHNIQUES FOR INTRUSIONDETECTIONSoft computing is not just one methodology but a consortiumof various methodologies including machine learning,evolutionary computing, fuzzy logic, etc. These techniquescan be applied separately or in combination for intrusiondetection. In this section, we present a few selected techniquesindependently.A. Fuzzy Rule Based SystemsFuzzy logic has proved to be a powerful tool for decisionmaking to handle and manipulate imprecise and noisy data.The notion central to fuzzy systems is that truth values (infuzzy logic) or membership values (in fuzzy sets) are indicatedby a value on the range [0.0, 1.0], with 0.0 representingabsolute falseness and 1.0 representing absolute truth. Afuzzy system is characterized by a set of linguistic statementsbased on expert knowledge. The expert knowledge is usuallyin the form of if-then rules. It is commonly known as aparadigm for computing with words instead of numbers. Itmimics the intuitive decision making capability of a humanexpert based on fuzzy information available. For instance, ahuman driver does not need to know the exact numeric valuesof the speed, direction and location of his car to park itsuccessfully. His decisions and actions are based on fuzzyinformation such as “car is moving fairly slow, it is facingslightly right, it is near the parking spot”, etc. A simple rulestructure of a fuzzy system is: “If antecedent then consequent”In the context of intrusion detection, a simple rule might be:“If variable1 is low and variable2 is high then output is benignelse output is malignant”. In a fuzzy classification system, acase or an object can be classified by applying a set of fuzzyrules based on the linguistic values of its attributes. Every rulehas a weight, which is a number between 0 and 1 and this isapplied to the number given by the antecedent. It involves 2distinct parts. First the antecedent is evaluated, which in turninvolves fuzzifying the input and applying any necessaryfuzzy operators and second applying that result to theconsequent known as inference.3histogram mik (x) of class k patterns for the ith attribute iscalculated using the 20 membership functions fh (.) as follows:𝑚𝑖𝑘 𝑥𝑖 1𝑚𝑘𝑓ℎ 𝑥𝑝𝑖(1)𝑥 𝑝 𝐶𝑙𝑎𝑠𝑠 𝐾The smoothed histogram in (1) is normalized so that itsmaximum value is 1. A single fuzzy if-then rule is generatedfor each class. The fuzzy if-then rule for the kth class can bewritten as:If x1 is Ak1 and . and xn is Ak1 then class k,where Ak1 is an antecedent fuzzy set for the ith attribute. Themembership function of Ak1is specified as𝐴𝑘𝑖 (𝑥𝑖 ) 𝑒 ((𝑥 𝑖 𝜇 𝑖𝑘 )2)2(𝜎𝑖𝑘 )2Where µik is the mean of the ith attribute values xpi of class kpatterns, and σik is the standard deviation. Fuzzy if-then rulesfor the two-dimensional two class pattern classificationproblem are written as follows:If x3 is A31 and x4 is A41then class 2If x3 is A32 and x4 is 𝑎2 𝑏 2 then class 3For a new pattern xp (xp3,xp4), the winner rule is determinedas follows:𝐴 3 𝑥𝑝3 . 𝐴 2 𝑥𝑝4 max 𝐴1𝑘 𝑥𝑝3 . 𝐴𝑘2 𝑥𝑝4 𝑘 1,2}Rule Generation Based on Partition of Overlapping Areas(FR2)Figure 2 demonstrates a simple fuzzy partition, where the twodimensional pattern space is partitioned into 25 fuzzysubspaces by five fuzzy sets for each attribute (S: small, MS:medium small, M: medium, ML: medium large, L: large). Asingle fuzzy if-then rule is generated for each fuzzy subspace.Thus the number of possible fuzzy if-then rules in Figure 1 is25.To build a fuzzy classification system, the most difficult taskis to find a set of fuzzy rules pertaining to the specificclassification problem. We present three fuzzy rule generationmethods for intrusion detection systems. Let us assume thatwe have a n dimensional c-class pattern classification problemwhose pattern space is an n-dimensional unit cube [0, 1]n. Wealso assume that m patterns xp (xpl ,.,xpn) , p 1,2,.,m, are given for generating fuzzy if-then rules where xpє [0,1] for p 1,2,., m, i 1,2,.,n .Rule Generation Based on the Histogram of Attribute Values(FR1):In this method, use of histogram itself is an antecedentmembership function. Each attribute is partitioned into 20membership functions fh(.), h 1,2,.,20. The smoothedFigure 2: An example of fuzzy partition.One disadvantage of this approach is that the number ofpossible fuzzy if-then rules exponentially increases with the

CS 591, Fall 2009 – PROJECT REPORTdimensionality of the pattern space. Because the specificationof each membership function does not depend on anyinformation about training patterns, this approach usesfuzzy if-then rules with certainty grades. The FR2 approachgenerates fuzzy if-then rules in the same manner as the simplefuzzy grid approach except for the specification of eachmembership function. Because this approach utilizes theinformation about training patterns for specifying eachmembership function, the performance of generated fuzzy ifthen rules is good even when we do not use the certainty gradeof each rule in the classification phase. In this approach, theeffect of introducing the certainty grade to each rule is not soimportant when compared to conventional grid partitioning.Neural learning of fuzzy rules (FR3)The derivation of if-then rules and corresponding membershipfunctions depends heavily on the a priori knowledge about thesystem under consideration. However there is no systematicway to transform experiences of knowledge of human expertsto the knowledge base of a Fuzzy Inference System (FIS). In afused neuro-fuzzy architecture, neural network learningalgorithms are used to determine the parameters of fuzzyinference system (membership functions and number of rules).Fused neuro-fuzzy systems share data structures andknowledge representations. A common way to apply alearning algorithm to a fuzzy system is to represent it in aspecial neural network-like architecture.Figure 3: An integrated Neuro-Fuzzy systemFigure 3 shows a Mamdani type neuro-fuzzy system. It uses asupervised learning technique (backpropagation learning) tolearn the parameters of the fuzzy membership functions.Thedetailed function of each layer is as follows:Layer-1(input layer): No computation is done in this layer.Each node in this layer, which corresponds to one inputvariable, only transmits input values to the next layer directly.The link weight in layer 1 is unity.4Layer-2 (fuzzification layer): Each node in this layercorresponds to one linguistic label (excellent, good, etc.) toone of the input variables in layer 1. In other words, the outputlink represents the membership value, which specifies thedegree to which an input value belongs to a fuzzy set, iscalculated in layer 2. A clustering algorithm will decide theinitial number and type of membership functions to beallocated to each of the input variable. The finalshapes of the MFs will be fine tuned during network learning.Layer-3 (rule antecedent layer): A node in this layerrepresents the antecedent part of a rule. Usually a T-normoperator is used in this node. The output of a layer 3 noderepresents the firing strength of the corresponding fuzzy rule.Layer-4 (rule consequent layer): This node basically hastwo tasks. To combine the incoming rule antecedents anddetermine the degree to which they belong to the outputlinguistic label (high, medium, low, etc.). The numberof nodes in this layer will be equal to the number of rules.Layer-5 (combination and defuzzification layer): This nodedoes the combination of all the rules consequents using a Tconorm operator and finally computes the crisp output afterdefuzzification.B. Linear Genetic ProgrammingLinear genetic programming is a variant of the GP (GeneticProgramming) technique that acts on linear genomes [3]. Itsmain characteristics in comparison to tree-based GP lies inthat the evolvable units are not the expressions of a functionalprogramming language (like LISP), but the programs of animperative language (like C/C ). An alternate approach is toevolve a computer program at the machine code level, usinglower level representations for the individuals. This cantremendously hasten up the evolution process as, no matterhow an individual is initially represented, finally it always hasto be represented as a piece of machine code, as fitnessevaluation requires physical execution of the individuals. Thebasic unit of evolution is a native machine code instructionthat runs on the floating-point processor unit (FPU). Sincedifferent instructions may have different sizes, hereinstructions are clubbed up together to form instruction blocksof 32 bits each. The instruction blocks hold one or more nativemachine code instructions, depending on the sizes of theinstructions. A crossover point can occur only betweeninstructions and is prohibited from occurring within aninstruction. However the mutation operation does not have anysuch restriction.The settings of various LGP parameters are of utmostimportance for successful performance of the system.Typically, the population space is subdivided into multiplesubpopulation or demes. Migration of individuals among thesubpopulations causes evolution of the entire population. Ithelps to maintain diversity in the population, as migration isrestricted among the demes. Moreover, the tendency towards abad local minimum in one deme can be countered by otherdemes with better search directions. The various LGP searchparameters are the mutation frequency, crossover frequencyand the reproduction frequency: The crossover operator acts

CS 591, Fall 2009 – PROJECT REPORTby exchanging sequences of instructions between twotournament winners. A constant crossover rate of 90% is usedfor most experiments.C. Decision TreeA decision tree is made of decision nodes and leaf nodes. Eachdecision node corresponds to a test X over a single attribute ofthe input data and has a number of branches, each of whichhandles an outcome of the test X. Each leaf node represents aclass that is the result of decision for a case. The process ofconstructing a decision tree is basically a divide and conquerprocess. Suppose a set T of training data consists of k classes (C1 , C2 , , Ck ). If T only consists of cases of one singleclass, T will be a leaf. If T contains no case, T is a leaf and theassociated class with this leaf will be assigned with the majorclass of its parent node. If T contains cases of mixed classes(i.e. more than one class), a test based on some attribute ai ofthe training data will be carried and T will be split into nsubsets (T1 , T2 , , Tn ), where n is the number ofoutcomes of the test over attribute ai . The same process ofconstructing decision tree is recursively performed over eachT j , where 1 j n , until every subset belongs to a singleclass. The problem here is how to choose the best attribute foreach decision node during construction of the decision tree.One possible criterion for choice is Gain Ratio Criterion. Thebasic idea of this criterion is to, at each splitting step, choosean attribute which provides the maximum information gainwhile reducing the bias in favor of tests with many outcomesby normalization. Once a decision tree is built, it can be usedto classify testing data that has the same features as thetraining data. Starting from the root node of decision tree, thetest is carried out on the same attribute of the testing case asthe root node represents. The decision process takes thebranch whose condition is satisfied by the value of testedattribute. This branch leads the decision process to a child ofthe root node. The same process is recursively executed until aleaf node is reached. The leaf node is associated with a classthat is assigned to the test case.Intrusion detection can be considered as classification problemwhere each connection or user is identified either as one of theattack types or normal based on some existing data. Decisiontrees work well with large data sets. This is important as largeamounts of data flow across computer networks. The highperformance of decision trees makes them useful in real-timeintrusion detection. Decision trees construct easilyinterpretable models, which is useful for a security officer toinspect and edit. These models can also be used in the rulebased models with minimum processing. Generalizationaccuracy of decision trees is another useful property forintrusion detection model. There will always be new attackson the system, which are small variations of known attacksafter the intrusion detection models are built. The ability todetect these new intrusions is possible due to thegeneralization accuracy of decision trees.5D. Support Vector MachineSupport Vector Machines have been proposed as a noveltechnique for intrusion detection. SVM maps input (realvalued) feature vectors into a higher dimensional feature spacethrough some nonlinear mapping. SVMs are powerful toolsfor providing solutions to classification, regression anddensity estimation problems. These are developed on theprinciple of structural risk minimization. Structural riskminimization seeks to find a hypothesis for which one can findthe lowest probability of error. The structural riskminimization can be achieved by finding the hyper planewith maximum separable margin for the data. Computing thehyper plane to separate the data points, i.e. training a SVM,leads to a quadratic optimization problem. SVM uses a featurecalled a kernel to solve this problem. A kernel transformslinear algorithms into nonlinear ones via a map into featurespaces. SVMs classify data by using these support vectors,which are members of the set of training inputs that outline ahyper plane in feature space. Figure 4 shows a simple exampleof separable hyper plane between two data sets.Figure 4. Separable hyper plane between two datasets.III. CLASSIFIER DESIGN AND EVALUATIONA. Classifier DesignThe classifiers for intrusion detection can be designed invarious ways. In the literature people have used singleclassifier, hybrid classifiers and ensemble classifiers.Single classifier: Single classifier system use one of theclassification techniques like K – Nearest Neighbor, ArtificialNeural Networks, Support Vector Machines, Self OrganizingMap, Decision Tree, Bayes‟ Networks, Genetic Algorithms,Fuzzy Logic etc. Single classifiers are considered less robustthan the ensemble or hybrid classifiers.

CS 591, Fall 2009 – PROJECT REPORTHybrid Classifier: A hybrid intelligent system uses theapproach of integrating different learning or decision-makingmodels. Each learning model works in a different manner andexploits different set of features. Integrating different learningmodels gives better performance than the individual learningor decision-making models by reducing their individuallimitations and exploiting their different mechanisms. In ahierarchical hybrid intelligent system each layer providessome new information to the higher level. The overallfunctioning of the system depends on the correct functionalityof all the layers.Ensemble Classifier: Empirical observations show thatdifferent classifiers provide complementary information aboutthe patterns to be classified Although for a particular problemone classifier works better than the other, a set of misclassifiedpatterns would not necessarily overlap. This differentinformation combined together yields better performance thanindividual classifiers. The idea is not to rely on a singleclassifier for decision on an intrusion; instead informationfrom different individual classifiers is combined to take thefinal decision, which is popularly known as the ensembleapproach. The effectiveness of the ensemble approachdepends on the accuracy of the base classifiers.Figure 5: Example of hybrid classifierFigure 6: Example of Ensemble ClassifierB. Evaluation strategiesThe three main steps involved in the classification andevaluation process using soft computing techniques are :Feature Selection, Training and Testing. Though there exist6many techniques which do not require all three stages, butmost of the literatures have used all three.Feature Selection/AttributeReductionTrainingTestingFigure 7: General Evaluation StrategyBefore training, the step of feature (or variable) selection maybe considered. The process of feature selection identifieswhich features are more discriminative than the others. Thishas the benefit of generally improving system performance byeliminating irrelevant and redundant features.Featureselection is not very popular procedure in intrusion detection,however, few studies use different feature selection methodsfor their experiments and proved that that feature selectioncould improve some certain level of classification accuracy inintrusion detection. Generally while using the training andtesting methods, a subset of the whole dataset is selected as atraining set either randomly or using some criteria. Theclassifier is trained using this data and a subset of the wholedata usually exclusive from the training set is selected as a testdata. And the evaluation is based on its performance on testdata give a trained classifier.

CS 591, Fall 2009 – PROJECT REPORTIV. DATABASE FOR EXPERIMENTSA. 1998/1999 DARPA IDS dataset:In 1998 DARPA recognized the need to be able to performquantitative evaluations on intrusion detection systems. MIT‟sLincoln Labs was contracted to work with the Air ForceResearch Laboratory in Rome, NY to build an evaluationdataset and perform an evaluation of the then current IDSresearch being funded by DARPA. They built a small networkintending to simulate an Air Force base connected to theInternet, producing background activity with scripts, andinjecting attacks at well defined points, and gathered tcpdump,Sun BSM, process and file system information. Lincoln Labsmade the datasets available after the evaluations and manyresearchers used this data as the basis of evaluating theirsystems. However there have been many criticism against thisdataset. The main criticism against this dataset was the failureto verify that the network in which data was collectedrealistically simulated a real-world network.However,researches say DARPA IDS evaluation dataset is still usefulfor testing intrusion detection systems for good performancebut it is a necessary but not sufficient condition to demonstratethe capabilities of an advanced IDS[8].B. KDD99The 1999 KDD intrusion detection contest uses a version ofDARPA IDS dataset. All the network traffic including theentire payload of each packet was recorded in tcpdump formatand was provided for evaluation. The raw training data wasabout four gigabytes of compressed binary TCP dump datafrom seven weeks of network traffic. This was processed intoabout five million connection records. Similarly, the twoweeks of test data yielded around two million connectionrecords. A connection is a sequence of TCP packets startingand ending at some well defined times, between which dataflows to and from a source IP address to a target IP addressunder some well defined protocol. Each connection is labeledas either normal, or as an attack, with exactly one specificattack type. Each connection record consists of about 100bytes. Attacks fall into four main categories:- DOS: denial-of-service, e.g. syn flood;- R2L: unauthorized access from a remote machine,e.g. guessing password;- U2R: unauthorized access to local superuser (root)privileges, e.g., various buffer overflow'' attacks;- probing: surveillance and other probing, e.g., portscanning.But like DARPA 1999/1998 this database has also beencriticized al.0,t

Using an intrusion detection system (IDS) is one way of dealing with suspicious activities within a network. It monitors activities of a given environment and decides whether these activities are malicious (intrusive) or legitimate (normal). Intrusion detection in general is either anomaly based or misuse based. In both cases,