2005 Privacy Data Mining Report - Dhs.gov


Data Mining Report
DHS Privacy Office Response to House Report 108-774
July 6, 2006

Report to Congress on the Impact of Data Mining Technologies on Privacy and Civil Liberties

Respectfully submitted
Maureen Cooney
Acting Chief Privacy Officer
U.S. Department of Homeland Security
Washington, DC
July 6, 2006

TABLE OF CONTENTS

I. EXECUTIVE SUMMARY
   A. DEFINITION OF DATA MINING
   B. DATA MINING PROCESS STEPS AND ATTENDANT PRIVACY ISSUES
   C. DHS DATA MINING ACTIVITIES
I. INTRODUCTION
II. DESCRIPTION OF DATA MINING TECHNOLOGY
   A. DEFINITION OF DATA MINING
   B. THE PROCESS OF DATA MINING
      1. DEFINITION OF THE PROBLEM TO BE SOLVED
      2. DATA IDENTIFICATION AND COLLECTION
      3. DATA QUALITY ASSESSMENT AND DATA CLEANSING
      4. MODEL BUILDING
      5. MODEL VALIDATION
      6. MODEL DEPLOYMENT
   C. DATA MINING TECHNIQUES
III. PRIVACY AND CIVIL LIBERTIES CONCERNS IN THE USE OF DATA MINING TECHNOLOGIES FOR HOMELAND SECURITY
   A. PURPOSES OF DATA MINING
      1. INAPPROPRIATE DATA MINING
      2. FUNCTION OR MISSION “CREEP”
   B. DATA IDENTIFICATION AND COLLECTION
      1. INAPPROPRIATE ACCESS TO INFORMATION
      2. DUPLICATION OF DATA
      3. DATA RETENTION
      4. USE OF DATA FOR PURPOSES INCOMPATIBLE WITH PURPOSES OF DATA COLLECTION
      5. SUBJECT SYNOPSIS
   C. DATA QUALITY ASSESSMENT AND CLEANSING
      1. INTRODUCTION OF ERRORS DURING DATA PREPARATION
   D. MODEL BUILDING AND EVALUATION
      1. DATA LEAKAGE
      2. IMPROPER MODEL VALIDATION
   E. MODEL DEPLOYMENT
      1. NEW PERSONAL INFORMATION ABOUT INDIVIDUALS
      2. FALSE POSITIVES
      3. LACK OF APPROPRIATE REVIEW AND REDRESS
   F. CONCLUSION
IV. DHS DATA MINING ACTIVITIES
   A. DATA ANALYSIS FOR IMPROVING OPERATIONAL EFFICIENCY, U.S. CUSTOMS AND BORDER PROTECTION
      1. PURPOSES OF THE PROGRAM
      2. DATA SOURCES
      3. DEPLOYMENT DATES
      4. POLICIES, PROCEDURES, AND GUIDANCE
   B. LAW ENFORCEMENT ANALYTIC DATA SYSTEM (NETLEADS), IMMIGRATION AND CUSTOMS ENFORCEMENT
      1. PURPOSES OF THE PROGRAM
      2. DATA SOURCES
      3. DEPLOYMENT DATES
      4. POLICIES, PROCEDURES AND GUIDANCE
   C. ICE PATTERN ANALYSIS AND INFORMATION COLLECTION SYSTEM (ICEPIC), IMMIGRATION AND CUSTOMS ENFORCEMENT
      1. PURPOSES OF THE PROGRAM
      2. DATA SOURCES
      3. DEPLOYMENT DATES
      4. POLICIES, PROCEDURES, AND GUIDANCE
   D. INTELLIGENCE AND INFORMATION FUSION (I2F), OFFICE OF INTELLIGENCE AND ANALYSIS
      1. PURPOSES OF THE PROGRAM
      2. DATA SOURCES
      3. DEPLOYMENT DATES
      4. POLICIES, PROCEDURES AND GUIDANCE
   E. FRAUD DETECTION AND NATIONAL SECURITY DATA SYSTEM (FDNS-DS), US CITIZENSHIP AND IMMIGRATION SERVICES
      1. PURPOSES OF THE PROGRAM
      2. DATA SOURCES
      3. DEPLOYMENT DATES
      4. POLICIES, PROCEDURES AND GUIDANCE
   F. NATIONAL IMMIGRATION INFORMATION SHARING OFFICE (NIISO), US CITIZENSHIP AND IMMIGRATION SERVICES
      1. PURPOSES OF THE PROGRAM
      2. DATA SOURCES
      3. DEPLOYMENT DATES
      4. POLICIES, PROCEDURES AND GUIDANCE
V. CONCLUSIONS AND RECOMMENDATIONS
VI. APPENDIX A

Data Mining Report, DHS Privacy Office, July 6, 2006

I. Executive Summary

This report is prepared pursuant to the requirements of House Report 108-774 – Making Appropriations for the Department of Homeland Security for the Fiscal Year ending September 30, 2005, and for Other Purposes. It describes the status of DHS data mining activities, the issues they raise, and the programs involved.

A. Definition of Data Mining

There is no agreed-upon definition for the term “data mining.” Based on the definitions used by the Congressional Research Service and the Government Accountability Office, data mining is defined in this report as follows:

    Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. Data mining consists of more than collecting and managing data; it also includes analysis and prediction.

The application of patterns, relationships, and rules to searches, whether these are derived through data mining, observation, intelligence, or theoretical models, is not addressed in this report.[1]

[1] Thus, this report would exclude searches using patterns, relationships, and rules focused on a particular individual, such as those used in a threat and risk assessment vetting program.

B. Data Mining Process Steps and Attendant Privacy Issues

Data mining is a process that consists of a series of steps. Privacy and civil liberties issues arise in every step of the data mining process.

The first step in the data mining process is to define the business need that data mining is expected to address. As with any activity undertaken by a Federal agency, a data mining project must be performed for a lawful purpose, consistent with the agency’s mission.

After an agency determines the problem that data mining may be useful in solving, and finds that it has the mission authority to perform the project, it needs to identify and then collect or aggregate the data for analysis. The privacy and civil liberties issues that may arise during this step include inappropriate access to information, duplication of data and the resulting inability of the original data collector to control subsequent uses or maintain quality of the data, inappropriate data retention policies, use of data incompatible with purposes for which it was originally collected, and profiling of individuals.

After data is collected or aggregated, it undergoes a “cleansing” process. Inaccuracy of data is a significant concern in data mining. If data is inaccurate or incomplete, then the patterns, relationships, or rules detected in the data may be meaningless or wrong. Worse from a privacy and civil liberties perspective, if patterns, relationships, or rules used for law enforcement or intelligence are determined through mining inaccurate data, such patterns, relationships, or rules may implicate innocent individuals. For this reason, data intended for data mining usually undergoes a “cleansing” or validation process prior to the start of analysis. However, the data cleansing process can itself introduce inaccuracies into the data.

After data is cleansed and validated, the model building process begins. This is the step during which patterns in the data are detected and validated and rules for predicting future events or behaviors are created. Potential privacy and civil liberties issues during this step of the process include security risks, such as access to data by unauthorized persons, as well as inappropriate disclosures by authorized users. Additional concerns arise if the model is inappropriately validated before deployment.

The final step in data mining involves the deployment of the model to the field. It is at this step of the data mining process that concerns arise about false positives and appropriate due process for individuals who are flagged by the model. There are also questions about ownership and uses of new information about individuals produced through the use of data mining models.

C. Recommendations for DHS Data Mining Activities

Several components of DHS engage or plan to engage in data mining activities, as defined by this report. Based on our analysis of DHS activities that involve current and projected future uses of data mining, we note that data mining is usually only one part of a larger set of analytic activities and tools. Such analytic activities include searches and traditional analyses.

DHS programs that employ data mining tools and technologies also employ traditional privacy and security protections, such as Privacy Impact Assessments, Memoranda of Understanding between agencies that own source data systems, privacy and security training, and role-based access. Even so, we recommend additional protections that are aimed specifically at addressing the privacy concerns raised by data mining, and we will take steps to implement these recommendations within the Department.

1. Prior to the start of any data mining activity, the authority of the agency to undertake such activity should be determined to be consistent with the purposes of the data mining project or program. The authority to collect or aggregate data required to perform the data mining project should also be ascertained, whether the project involves collection of new data or aggregation of existing data from various sources. While oversight functions exist in different components within DHS,[2] the Department as a whole could benefit from more centralized oversight with a broader view of DHS activities and data holdings. One such body, which could assist the function of overseeing DHS data mining programs, is the DHS Privacy and Data Integrity Board, an internal privacy board that is charged under the Privacy Act to examine and approve data matching agreements between DHS and other departments and that considers Departmental privacy issues. The Board, which includes representatives from all DHS components, the Office of the Inspector General, the Chief Information Officer, and the Office of General Counsel, and is chaired by the DHS Chief Privacy Officer, could provide oversight and confirmation to ensure responsible application of data mining tools and technologies. The Board assisted with the collection of data for this report concerning current and planned data mining programs within DHS.

[2] Including the DHS Office of Civil Rights and Civil Liberties.

2. As discussed in the report, data mining searches for patterns, relationships, and rules in the data without basing this search on observations or a theoretical model. Because the existence of patterns in the data may not reflect cause and effect, data mining tools should be used principally for investigative purposes. DHS components that use data mining tools should have written policies stating that no decisions may be made automatically regarding individual rights or benefits solely on the basis of the results produced by patterns or rules derived from data mining.

3. Because the patterns, relationships, and rules in the data may not be derived from specific personal identifiers, such as a name or Social Security Number, when a data set includes personally identifiable information, data mining projects should give explicit consideration to using anonymized data in data mining activities. A discussion of the extent to which anonymization was considered should be included in the Privacy Impact Assessment for such a data mining project.

4. Data quality plays an important role in the ability of data mining techniques to produce accurate results. DHS should adopt data quality standards for data used in data mining. Application of these standards, which should apply to systems using data from both government and commercial sources, should be ensured prior to the deployment of data mining models or predictive rules for use in the field.

5. In order to ensure that data mining models produce useful and accurate results, DHS should adopt standards for the validation of models or rules derived from data mining. The model validation process and its ability to meet these standards should be reviewed and documented prior to the deployment of the model to the field.

6. Each DHS component that employs data mining should implement policies and procedures that provide an appropriate level of review and redress for individuals identified for additional investigation by patterns, relationships, and rules derived from data mining. Because data mining algorithms often produce highly complex patterns, relationships, and rules, the particular reasons for identifying an individual may not be fully understandable. A complete procedure should therefore include a step in which a person, acting independently of the data mining process, substantiates the particular identification of individuals prior to any determinative processes and procedures. To ensure a complete understanding of the capabilities and limitations of data mining in this regard, employees who use data mining processes should be required to complete training on these policies and procedures.

7. In order to provide demonstrable accountability, each component that employs data mining should include strong, automatic audit capabilities to record access to source data systems, data marts, and data mining patterns and rules. Programs should conduct random audits at regular intervals, and all employees should be given notice that their activities are subject to such audits. These actions help underline the importance of transparency.
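As an illustration of the audit recommendation, the following minimal Python sketch records each access automatically and draws a random subset of entries for periodic review. The class name, field names, and resource labels are hypothetical and chosen only for this example; the report does not describe how any DHS system implements auditing.

```python
import json
import random
from datetime import datetime, timezone


class AuditLog:
    """Append-only, in-memory access log (illustrative only, not a DHS design)."""

    def __init__(self):
        self._entries = []

    def record(self, user, resource, action):
        # Record who touched which resource, and when, at the point of access.
        entry = {
            "seq": len(self._entries) + 1,
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "resource": resource,
            "action": action,
        }
        self._entries.append(entry)
        return dict(entry)

    def entries(self):
        # Return copies so callers cannot alter the stored log.
        return [dict(e) for e in self._entries]

    def sample_for_review(self, k, seed=None):
        # Draw a random subset of entries for a periodic manual audit.
        rng = random.Random(seed)
        k = min(k, len(self._entries))
        return [dict(e) for e in rng.sample(self._entries, k)]


# Hypothetical users and resources.
log = AuditLog()
log.record("analyst_a", "data_mart/claims", "read")
log.record("analyst_b", "source/applications", "read")
log.record("analyst_a", "rules/pattern_17", "view")
for entry in log.sample_for_review(2, seed=0):
    print(json.dumps(entry))
```

A production system would write to tamper-evident storage rather than memory; the sketch only shows the two ideas named in the recommendation, automatic recording and random sampling for review.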

I. Introduction

This report is prepared pursuant to the requirements of House Report 108-774 – Making Appropriations for the Department of Homeland Security for the Fiscal Year ending September 30, 2005, and for Other Purposes. The House report includes the following requirements:

    The conferees direct the DHS Privacy Officer, in consultation with the head of each Department of Homeland Security agency that is developing or using data-mining technology, to submit a report no later than 90 days after the end of fiscal year 2005 that provides (1) a thorough description of the data-mining technology, the plans for use of such technology, the data that will be used, and the target dates for the deployment of the technology; (2) an assessment of the likely impact of the implementation of the technology on privacy and civil liberties; and (3) a thorough discussion of the policies, procedures, and guidelines that are to be developed and applied in the use of such technology for data-mining in order to protect the privacy and due process rights of individuals and to ensure that only accurate information is collected and used.

The Department of Homeland Security (“DHS”) Privacy Office is the first statutorily required comprehensive privacy office in any U.S. federal agency. It operates under the direction of the Chief Privacy Officer, who is appointed by and reports directly to the Secretary. The DHS Privacy Office serves as a steward of Section 222 of the Homeland Security Act of 2002, and has programmatic responsibilities involving the Privacy Act of 1974, the Freedom of Information Act (“FOIA”), the privacy provisions of the E-Government Act of 2002, and DHS policies that protect the collection, use, and disclosure of personal information. Additionally, the Privacy Office develops privacy policy and oversees certain information disclosure issues. The Office is also statutorily required to evaluate all new technologies used by the Department for their impact on personal privacy.

The Privacy Office wishes to acknowledge the generous assistance it received from U.S. Customs and Border Protection (“CBP”), Immigration and Customs Enforcement (“ICE”), U.S. Citizenship and Immigration Services (“USCIS”), and the Office of Intelligence and Analysis in writing this report. We further wish to acknowledge consultation with other offices within the Department, including Civil Rights and Civil Liberties, the Science and Technology Directorate, and the Policy Office.

The report contains the following sections. Section II describes data mining technologies and how these technologies can be used in homeland security applications. Section III addresses privacy and civil liberties concerns that have been raised with regard to data mining technologies. Section IV discusses current and anticipated DHS data mining activities, including the policies, procedures, and guidelines designed to protect privacy, civil liberties and due process rights when data mining technologies are used. The final section presents the conclusions of the report.

II. Description of Data Mining Technology

This section of the report defines data mining and examines the process and technologies for conducting data mining.

A. Definition of Data Mining

There is no universally agreed-upon definition for the term “data mining.” Some definitions of the term are quite broad. For example, the Technology and Privacy Advisory Committee (“TAPAC”) of the Department of Defense defined data mining as:

    [S]earches of one or more electronic databases of information concerning U.S. persons, by or on behalf of an agency or employee of the government.[3]

[3] Technology and Privacy Advisory Committee, “Safeguarding Privacy In the Fight Against Terrorism,” March 2004, p. viii.

This presents too broad a definition of data mining. While searches, particularly pattern-based searches and searches of multiple databases, do raise privacy concerns, data retrieval via computerized search is, in many cases, a faster and more efficient way to perform an activity that could be performed manually. Additionally, the TAPAC definition covers activities requested by the individual who is a subject of the information, such as searches of a single database in response to a customer service query or a request under FOIA, and it is our conclusion that such simple data retrievals should not be included in the context of a discussion about data mining.

Authors of other reports use narrower definitions. For example, the Congressional Research Service (“CRS”) defines data mining as follows:

    Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data, it also includes analysis and prediction.[4]

[4] J.W. Seifert, Data Mining: An Overview, Congressional Research Service, RL31798, June 2005, p. 1.

The Government Accountability Office (“GAO”) defines data mining similarly, as

    [T]he application of database technology and techniques—such as statistical analysis and modeling—to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results.[5]

[5] United States Government Accountability Office, Data Mining: Agencies Have Taken Key Steps To Protect Privacy in Selected Areas, but Significant Compliance Issues Remain, GAO-05-866, August 2005, p. 4.

There are two important components in the definitions used by CRS and GAO. The first is the discovery of hidden patterns in the data and the second is the use of these patterns to predict future results. Only the first of these, the search of databases for hidden, valid patterns, relationships, and rules, is unique to data mining. As such, data mining would not include searches for connections, direct or indirect, between data points focused on a known subject.

Looking for rules[6] that allow prediction of future behavior or results is an important part of many branches of data analysis. For example, probability theory, a branch of mathematics that has been studied since the seventeenth century, is a study of ways to predict future events from past occurrences. The significant difference between data mining and other analytic techniques is in the way the prediction rules are determined. Generally, analytic techniques test hypotheses generated through observation or theory.[7] In data mining, the analysis of the data itself is expected to produce patterns, relationships, and rules that are not known and that are not based on observation or a theoretical model, but are nevertheless valid.

[6] In this report, “rules” specify a set of actions that are expected to follow a particular set of conditions. An example of a rule might be, “If an individual sponsors more than one fiancée for immigration at the same time, there is likelihood of immigration fraud.”

[7] This type of analysis is generally described as the scientific method. See, for example, “Steps of the Scientific Method” at http://www.cdc.gov/ncbddd/folicacid/excite/Files in use/steps of the scientific method.htm , last visited December 27, 2005.

It is important to note that because data mining is not based on a theoretical underpinning, it can only identify patterns in the data; it cannot reveal whether any discovered pattern is meaningful or significant.[8] Only someone who understands the business problem under analysis can determine the significance of a discovered pattern. Most importantly, from a privacy and civil liberties point of view, the patterns, relationships, or rules produced through data mining do not reveal specifically the reason that such a pattern, relationship, or rule exists. That makes it essential that someone familiar with the reason for the analysis review and confirm the results.

[8] Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, Third Edition, 1999, p. 1.
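To make the notion of a “rule” concrete, the example rule given in the footnotes above (a sponsor with more than one fiancée petition pending at the same time) can be written as a condition over records. The sketch below is illustrative only: the record layout, identifiers, and dates are assumptions made for this example, not a description of any DHS system. Applying a known rule in this way is a search rather than data mining as this report defines it, and any match is a lead for human review, not a finding.

```python
from collections import defaultdict
from datetime import date


def sponsors_with_overlapping_petitions(petitions):
    """Return sponsor IDs with two or more petitions, for different
    beneficiaries, that were pending over overlapping date ranges.

    Each petition is a tuple (sponsor_id, beneficiary_id, filed, closed);
    this layout is an assumption made for the sketch.
    """
    by_sponsor = defaultdict(list)
    for sponsor, beneficiary, filed, closed in petitions:
        by_sponsor[sponsor].append((filed, closed, beneficiary))

    flagged = set()
    for sponsor, items in by_sponsor.items():
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                start_a, end_a, ben_a = items[i]
                start_b, end_b, ben_b = items[j]
                # Two date ranges overlap when each starts before the other ends.
                if ben_a != ben_b and start_a <= end_b and start_b <= end_a:
                    flagged.add(sponsor)
    return sorted(flagged)


# Synthetic records: S1 has two overlapping petitions, S2 has two in sequence.
petitions = [
    ("S1", "B1", date(2005, 1, 10), date(2005, 6, 30)),
    ("S1", "B2", date(2005, 3, 1), date(2005, 9, 15)),
    ("S2", "B3", date(2005, 1, 5), date(2005, 4, 1)),
    ("S2", "B4", date(2005, 5, 1), date(2005, 8, 1)),
]
print(sponsors_with_overlapping_petitions(petitions))  # ['S1']
```

The rule says nothing about why a sponsor matches, which is the point made in the text: a person familiar with the purpose of the analysis still has to review each match.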

Because analysis designed to predict future behavior or results is not unique to data mining, this report focuses on the feature of data mining that is unique: discovering new patterns, relationships, and rules in data. Therefore, in this report data mining is defined as follows:

    Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. Data mining consists of more than collecting and managing data; it also includes analysis and prediction.[9]

[9] Thus, this report would exclude searches using patterns, relationships, and rules focused on a particular individual, such as those used in a threat and risk assessment vetting program.

This means data mining covers the collection and management of data together with the analysis and prediction of future outcomes. The application of patterns, relationships, and rules to searches, whether these are derived through data mining, observation, intelligence, or theoretical models, is not addressed in this report.

The term “data mining” is often used to describe analysis of numerical or structured data. The term “text mining” is often used to describe analysis of unstructured text. Following the definition of the CRS, the definition in this report includes the analysis of data in all forms: quantitative, textual and digitized images.

B. The Process of Data Mining

Data mining is an analytic process that involves a series of steps:[10]

- Definition of the problem to be solved
- Data identification and collection
- Data quality assessment and data cleansing
- Model building
- Model validation
- Model deployment

The data mining process is iterative, with information learned in later steps leading the analyst back to earlier steps for clarification and adjustment.

[10] The Appendix contains examples of data modeling processes used in the U.S. and Europe.

1. Definition of the Problem to be Solved

It is essential that data mining begin with the understanding of a business need which will be served by the data mining analysis. Without this understanding, it is not possible to determine what data is needed or whether the patterns detected in the data are useful or meaningful.

2. Data Identification and Collection

Once the business need is understood, the data can be identified, collected and prepared for analysis. These activities can take the majority of the time and effort in the data mining process.[11] The data to be analyzed is generally copied into a separate database, usually called a data warehouse or data mart, although techniques for distributed data mining[12] and the use of “virtual” data warehouses are being developed. The use of separate data warehouses or data marts can help prevent accidental changes in source data, and allows analysts to work with the data without reducing performance of other applications being run on source databases.

[11] Two Crows Corporation, p. 23.

[12] “Distributed data mining” is a technique for doing data mining on databases that reside on different computers or in different organizations. Data mining techniques are now being developed that permit analysis of these databases without first combining data into one large database.

3. Data Quality Assessment and Data Cleansing

Data quality assessment and data cleansing are essential for preparing data for analysis. Aggregating data from different sources into a single database brings with it several potential concerns.

- Individual data fields may have incorrect values. Some of these may be obvious, for example “Age 200”, but others may not be.
- There may be incorrect combinations of data values, such as associating data with an incorrect individual’s name. There may also be logically impossible combinations of values, such as “City New York,” “State New York,” “Population 2,500.”
- There may be missing data values.
- Different databases may use the same term to describe data values that have different meaning. For example, in one database the field labeled “Address” may refer to home address, but in another database it may refer to shipping address. This problem can be particularly severe when data collected for one purpose is used for other purposes.

The issues listed above may be present in a single database, but can be exacerbated when data from different sources are combined for analysis. The data cleansing process is the process of looking for and, when possible, correcting potential errors in the data.

4. Model Building

Often, when people talk about data mining, they mean the step of building the models that correspond to the underlying information in the data. The goal of the data modeling process is to discover valid relationships between data elements. These relationships, sometimes referred to as patterns or rules, can then be used to predict future behavior or search for additional cases where the relationship between variables holds. For example, data mining may indicate that fraudulent applications for benefits have particular characteristics, which would lead to an investigation of future benefits applications with similar characteristics. Techniques used in building data mining models are discussed in the next section of this report.

5. Model Validation

Finally, before a model is deployed, it must be evaluated and validated. As mentioned above, just because a pattern exists in the data does not mean that the pattern is meaningful or valid. Some correlations between data elements can be spurious, such as when two people attend the same university at the same time. Correlations by themselves do not provide any information about cause and effect. For example, data mining may demonstrate a correlation between high average family income and high quality of education in local public schools, but it will not explain whether families with high incomes move to areas with good public schools or public school quality improves because families with high incomes have more resources to devote to education.

To validate patterns discovered through data mining, model builders often divide the data into separate data sets: one set to build the model and the other set to validate the model’s predictive ability. If a model cannot make predictions with a pre-specified degree of accuracy, it is generally rejected.

6. Model Deployment

The final step in data mining is the deployment of the model to the field. Models can be used to make recommendations or to analyze new data. In cases where the model becomes part of a set of analytic tools, new users must be trained on appropriate uses and limitations of the model and on the process that must be followed with the results produced by the model. Performance of the model must also be monitored over time as the changes in the external environment affect the patterns of behavior that the model was built to analyze.
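Two of the steps above lend themselves to a short sketch: the record-level checks of step 3 (such as catching “Age 200” or a missing value) and the split-and-validate procedure of step 5. The Python below is a minimal illustration on synthetic data. The field names, the plausible-age range, the 70/30 split, and the toy threshold model are all assumptions made for this example and are not taken from the report.

```python
import random


def quality_issues(record):
    """Return a list of data quality problems found in one record (a dict)."""
    issues = []
    age = record.get("age")
    if age is None:
        issues.append("missing age")
    elif not 0 <= age <= 120:  # plausible range is an assumption
        issues.append("implausible age")
    if not record.get("name"):
        issues.append("missing name")
    return issues


def holdout_validate(records, label_of, build_model, min_accuracy, seed=0):
    """Build a model on one part of the data and accept it only if its
    accuracy on the held-out part meets the pre-specified threshold."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # 70/30 split is an assumption
    build_set, validation_set = shuffled[:cut], shuffled[cut:]
    model = build_model(build_set)
    correct = sum(model(r) == label_of(r) for r in validation_set)
    accuracy = correct / len(validation_set)
    return accuracy >= min_accuracy, accuracy


def learn_threshold(build_set):
    """Toy model builder: pick the claim-count cutoff that best separates
    flagged from unflagged records in the build set."""
    def hits(cutoff):
        return sum((r["claims"] > cutoff) == r["flagged"] for r in build_set)
    best = max(range(10), key=hits)
    return lambda r: r["claims"] > best


# Synthetic records: "flagged" is True exactly when claims > 3.
records = [
    {"name": "n%d" % i, "age": 30, "claims": i % 10, "flagged": i % 10 > 3}
    for i in range(100)
]
clean = [r for r in records if not quality_issues(r)]
accepted, accuracy = holdout_validate(
    clean, lambda r: r["flagged"], learn_threshold, min_accuracy=0.9
)
print(accepted, accuracy)
```

Keeping the validation records out of model building is what makes the accuracy figure meaningful; a model scored only on the data it was built from can look accurate while capturing nothing but noise.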

C. Data Mining Techniques

Data mining techniques look for various types of patterns, relationships, and rules in the data:

- Association or link analysis (i.e., patterns in which events and/or people are associated with one another)[13]
- Sequence or path analysis (i.e., patterns where one event leads to another event)
- Classification (i.e., looking for events, objects, or people with shared characteristics)
- Clustering (i.e., finding and documenting groups of people or entities whose attributes are similar to each other but different from those in other clusters)
- Forecasting (i.e., discovering patterns from which one can make reasonable predictions regarding future activities or events)

[13] Again, this does not include searches predicated upon a known subject.

Visualization techniques, while not analytic techniques in themselves, are often used to assist analysts by displaying analytic results in easily comprehensible form. For example, link analysis can be presented as a group of objects connected by lines. By looking at the visual representation of analytic results, an analyst can focus on a particular object, examine the underlying data, or look at ways in which connections between objects evolved over time.

III. Privacy and Civil Liberties Concerns in the Use of Data Mining Technologies for Homeland Security

Data mining provides a set of analytic tools. In conjunction with other tools, data mining can provide the capability to explore and fully exploit enormous quantities of available transaction, operational, and other data. As is true of all tools, data mining can be used appropriately to enhance security, reduce fraud and increase operational efficiency. Data mining can be used as a tool to provide insight and access into information not available through other means. In particular, if the patterns, relationships, and rules discovered validate other means of making determinations, especially in subject identification, data mining can enhance security.[14]

[14] Deployed appropriately, data mining can provide an effectual means to reduce not only false positives, but also false negatives. In a security setting, any tool and its capabilities should be viewed using a risk assessment model in order to recognize essential protections based upon the risks associated. Nonetheless,

When data mining is used to analyze information about individuals so that decisions can be made about these individuals, there are greater risk
