Generations of Machine Learning in Cybersecurity


Artificial Intelligence Is More Than a Feature; It Must Be in Your DNA

WHITE PAPER

Generations of Machine Learning in Cybersecurity

Summary

In this white paper, we aim to define generations of machine learning and to explain the maturity levels of artificial intelligence (AI) and machine learning (ML) that are being applied to cybersecurity today. In addition, the paper seeks to explain that while a great deal of progress has been made in the evolution of machine learning's application to cybersecurity challenges, there remains an immense amount of opportunity for innovation and advancement in the field, and we expect the sophistication of applications of machine learning to continue to evolve over time.

This white paper is organized into sections that provide the following information:

• An introduction which briefly summarizes the context of machine learning's application within cybersecurity, and the case for an official categorization of cybersecurity machine learning models into generations
• A review of key machine learning concepts and considerations when drawing distinctions between generations
• Definitions for five distinct cybersecurity machine learning generations
• The greater implications of this machine learning generational framework
• A brief conclusion

Introduction

The Defense Advanced Research Projects Agency (DARPA) has defined AI in three foundational ways, referring to these as the Three Waves of AI.

The first wave is handcrafted knowledge, which defines rules that humans use to carry out certain functions, and from which computers can learn to automatically apply these rules to create logical reasoning. However, within this first wave there is no learning applied to higher levels. One example of cybersecurity inside this first wave is the DARPA Cyber Grand Challenge.

The second wave is statistical learning. Often used in self-driving cars, smartphones, or facial recognition, this wave of AI uses machine learning to perform probabilistic decision making on what it should or should not do. In this second wave, the systems are good at learning, but their weakness lies in their ability to perform logical reasoning. In other words, the systems classify and predict data, but don't understand the context.

This is where the third wave, known as contextual adaptation, comes into play. In this wave, the systems themselves construct explanatory models for the real world. In the third wave, the systems should be able to describe exactly why a characterization occurred, just as a human would.

Machine learning has been quickly adopted in cybersecurity for its potential to automate the detection and prevention of attacks, particularly for next-generation antivirus (NGAV) products. ML models in NGAV have fundamental advantages compared to traditional AV, including a higher likelihood of identifying novel, zero-day attacks and targeted malware, an increased difficulty of evasion, and continued efficacy during prolonged offline periods.

Most attempts to apply ML and AI to cybersecurity fall into DARPA's first wave, handcrafted knowledge, using human-defined rules and patterns. A scant few cybersecurity technologies can claim involvement, much less maturity, in DARPA's second wave, statistical learning. The first wave ML models inevitably suffer from high false positive rates and can be easily bypassed. Since there are now several iterations of ML applications for AV, it is no longer sufficient to differentiate only between the current version or release of an AV and the forthcoming one. Instead, the time has come to provide a high-level description of the evolving generations of ML, both as it has been and will be applied to cybersecurity in the future.

In this paper, we explore the sub-categories of machine learning generations inside DARPA's second wave, statistical learning. We aim to explain the maturity levels of AI represented in applications within cybersecurity today, and how we expect them to evolve over time.

Concepts and Considerations

This section explains the terms and concepts employed in this document that assist in drawing distinctions between generations of ML models, and also provides commentary on why these concepts are relevant to security.

Runtime

Machine learning algorithms universally involve two fundamental steps:

• Training, when a model learns from a data set of known samples
• Prediction, when a trained model makes an educated guess about a new, unknown sample

The training step is the much more intense computational operation — modern deep neural networks can take months to train even on large clusters of high-performance cloud servers. Once a model has been trained, prediction is comparatively straightforward, although prediction often still requires significant memory and CPU usage. To train a classifier, samples from the input dataset must have associated labels (e.g. malicious or non-malicious).

Runtime is the environment where training or prediction could occur: local (e.g. on the endpoint) or remote (e.g. in the cloud). The runtime for each ML step informs how quickly a model can be updated with new samples, the impacts of decision making, and dependence on resources such as CPU, memory, and IO. For supervised models, note that labels must be available during training, so training can only occur where labels are available. In practice, training is typically done in a cluster of distributed servers in the cloud. Prediction is more common in the cloud as well, but increasingly performed locally. Distributed training on local user or customer devices is an emerging technology. Although there are major possible benefits, including reduced IO and protection of sensitive data, there are many challenges such as heterogeneous resources, unreliable availability, and slower experimental iterations.

Features

The set of features, or feature space, specifies precisely what properties of each example are taken into consideration by a model. For portable executable (PE) files, the feature set could include basic statistics such as file size and entropy, as well as features based on parsed sections of the PE, for example, the names of each entry in the section table. We could include the base-2 logarithm of file size as another derived feature. Some features could be extracted conditionally based on other features; other features could represent combinations. The space of possible features is very large, considering that there are innumerable transformations that can be applied to the features.

The features are critical to any ML model because they determine what and how information is exposed. Besides the important question of what information to include, it also matters how to encode the information. The process of creating model-amenable features is called feature engineering. Some models are more sensitive than others to how features are encoded. Although it is often tempting to provide as many features as possible, there are disadvantages in using too many features: greater risk of overfitting, higher resource consumption, and possibly more vulnerability to adversarial attacks. The efficacy, interpretability, and robustness of the model all hinge on the features.
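To make the feature engineering discussion concrete, the short Python sketch below (not part of the original paper) computes a few of the whole-file features mentioned above: file size, its base-2 logarithm as a derived feature, and byte entropy. The function names and the choice of features are illustrative assumptions, not a description of any particular product's feature set.

```python
import math
from collections import Counter
from pathlib import Path


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(path: str) -> dict:
    """Hypothetical feature vector for a single executable file.

    A real NGAV feature set would also parse the PE structure
    (section table entries, imports, and so on); this sketch sticks
    to simple whole-file statistics for illustration.
    """
    data = Path(path).read_bytes()
    size = len(data)
    return {
        "file_size": size,
        "log2_file_size": math.log2(size) if size > 0 else 0.0,  # derived feature
        "byte_entropy": shannon_entropy(data),
    }


# Example usage (the path is a placeholder):
# features = extract_features("sample.exe")
```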
Data Sets

The data used to train and evaluate the model fundamentally and hugely impacts its performance. If the data used to train the model are not representative of the real world, then the model will fail to do well in the field. Labels for each sample, such as benign or malicious, are necessary for training classifiers. The labels need to be vetted carefully, since any mislabeling, also known as label noise, can bias the model. As more data is gathered, the labeled datasets must be continuously monitored to ensure consistency and ongoing hygiene. In practice, the data may come from a wide variety of sources. Each source must be evaluated for the degree of trust and reliability so that downstream uses can take these factors into account.

A common problem which is present for many security applications is how to handle unbalanced data, which occurs when one label (benign) is much more common than others (malicious). Unbalanced labeled data can be mitigated by various modelling strategies, but ideally, there are many representative samples for each label. The feature set and dataset are closely related, since many features will be generated using the training set. The dataset also impacts crucial feature pre-processing, such as normalization, or weighting schemes, such as term frequency-inverse document frequency (TF-IDF).

For a sophisticated model, it's necessary to have a very large dataset. However, it is easy to fall into the trap of assuming that a sufficiently large dataset will lead to better performance. While, in general, larger datasets enable training of more sophisticated models, a huge dataset does not guarantee performance. A good dataset should have a wide variety and should fairly represent the samples that a model might see when deployed. The desired variety can be represented quantitatively as rough balance in feature values among labeled examples.
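As one illustration of how unbalanced labels can be mitigated in practice, the following sketch (an assumption for illustration, not taken from the paper) trains a classifier on synthetic, heavily unbalanced data and uses scikit-learn's class_weight="balanced" option to reweight the rare malicious class. In a real pipeline, the feature vectors and labels would come from a curated, vetted dataset as described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, heavily unbalanced dataset: ~95% benign (label 0), ~5% malicious (label 1).
n_benign, n_malicious = 9500, 500
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_benign, 8)),
    rng.normal(loc=1.5, scale=1.0, size=(n_malicious, 8)),
])
y = np.concatenate([np.zeros(n_benign), np.ones(n_malicious)])

# Stratified split keeps the benign/malicious ratio consistent in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency,
# so the rare malicious class is not drowned out during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```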

Human Interaction

Models are often thought of as black boxes, but they need not be. Models which can support modes of interaction with people have several advantages. They can receive expert feedback more readily, which can be useful for improving both labels and features, and allowing the model to improve in otherwise difficult ways. Human confidence and trust in the model can be built more quickly when there is some way of understanding how the model decisions are made.

Having methods for exploring the model can also help to validate the underlying data. Figure 1 shows Anscombe's quartet, in which four very different input datasets yield precisely the same linear regression model. Based on the summary statistics and model parameters, the four cases are practically indistinguishable. When plotted, it is immediately clear that only the upper-left quadrant model is fit appropriately to its dataset. The other models, with more dimensions and parameters, are much more difficult to explore and understand. However, without some type of human validation, it is likely that qualitative model or data issues could go quietly unnoticed, and lead to poor efficacy or vulnerabilities.

Figure 1 — Four very different datasets (Data Set I through Data Set IV), shown by points, which result in the same fitted predictive model, represented by each line. This famous set of datasets is known as Anscombe's quartet.

Supporting modes of human interaction is also important in cases where the model fails. If the model is a black box, it can be difficult to identify the cause of systematic modelling errors. Tools for inspecting and understanding the model enable troubleshooting and diagnostics. Such tools need to be carefully controlled and may not be integrated into the end product, since they leak intellectual property and could potentially expose vulnerabilities to adversaries.

Goodness of Fit

Some models represent the real world better than others. When a model is oversimplified, it has poor efficacy but generalizes well to new data. These models are called "underfit", in the sense that there is more information available to the model which it is not fully taking into account. Conversely, a model can memorize, or "overfit". When overfitting, the model learns too much about the specific samples on which it was trained, but does not transfer its representation well to new samples in the real world.

In Figure 2, the dashed line represents the decision boundary of an overfit classifier for green vs. blue points. The green line represents an appropriate decision boundary. Although it does not classify perfectly on the shown points, its performance will be better for new samples.

Figure 2 — Data points from two classes, each class indicated by its color. The two lines show alternative decision boundaries from hypothetical classifiers.

A well-fit model will maintain its validation performance after deployment. Concept drift is a related concept which occurs when there are nonstationary changes in the data over time, e.g. the set of PE files on endpoints changes from year to year. As the population of sample PE files changes, the model should be prepared to adapt to the changes in the population it targets.
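To show how overfitting is typically detected in practice, the hypothetical sketch below (not from the paper) compares an unconstrained decision tree with a depth-limited one on the same synthetic data; a large gap between training and validation accuracy is the usual symptom of an overfit model. The data generation and model choices are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic two-class data with irreducible label noise, so a perfect fit
# to the training set can only come from memorization.
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=2000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

for name, model in [
    ("overfit (unlimited depth)", DecisionTreeClassifier(random_state=1)),
    ("regularized (max_depth=4)", DecisionTreeClassifier(max_depth=4, random_state=1)),
]:
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    # The overfit model shows near-perfect training accuracy but noticeably
    # worse validation accuracy; the regularized model keeps the two close.
    print(f"{name}: train={train_acc:.3f}, validation={val_acc:.3f}")
```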

How Generations Are Defined

Cybersecurity machine learning generations are distinguished from one another according to five primary factors, which reflect the intersection of data science and cybersecurity platforms.

• Runtime: Where does the ML training and prediction occur (e.g. in the cloud, or locally on the endpoint)?
• Features: How many features are generated? How are they pre-processed and evaluated?
• Datasets: How is trust handled in the process of data curation? How are labels generated, sourced, and validated?
• Human Interaction: How do people understand the model decisions and provide feedback? How are models overseen and monitored?
• Goodness of Fit: How well does the model reflect the datasets? How often does it need to be updated?

These factors enable us to separate cybersecurity technologies into five distinct generations of ML, each defined by its progression in each category. Typically, a productized model takes two to three years to advance from one generation to the next, and the majority of technologies that integrate machine learning will become trapped in the first or second generations. Only a few have

The Greater Implications of Each Generation of Machine Learning

The table below lays out
