Discovering The Truth And The Unknown

Transcription

Discovering the truth and the unknown
Ton de Waal, October 26, 2017

Let me introduce myself
- Studied mathematics ("everything" except statistics)
- Since 1993 methodologist at Statistics Netherlands
  o statistical disclosure control
  o statistical data editing & imputation
  o data integration
- Since 2014 also professor "data integration" at Tilburg University

Contents
- What is statistical data editing?
- Why is it useful or interesting?
- Overview of statistical data editing
- A little bit on imputation
- The future of statistical data editing
- How do we do it in practice?

What is statistical data editing?
Discovering the truth

What is statistical data editing?
- Statistical data editing is the process of detecting and correcting errors
- Statistical data editing can be subdivided into three steps:
  o finding erroneous records
  o finding erroneous fields in those records
  o replacing erroneous fields by better values (imputation)

Why is statistical data editing useful?
- Statistical data editing leads to data of higher quality

Why is statistical data editing useful?
- In its traditional form statistical data editing is quite costly and time-consuming
  o modern forms save money and time

Why is statistical data editing interesting?
- Difficult to answer

Why do I like statistical data editing?
- Easier to answer
- Statistical data editing has many different aspects
  o statistics
  o process control & process optimization
  o teaching, communicating with humans
  o mathematics
    - operations research (optimization under constraints)
    - combinatorics
    - mathematical logic

Statistical data editing in the old days
- In the very old days statistical data editing was a manual process, with people checking and correcting data
- Since the 1950s computers have been used for checking data
  o correcting data was still done manually by people
  o often several cycles of checking and correcting were needed
  o very expensive and slow process

How do you check data?
- Outlier detection
- Edit rules
  o edit rules capture subject-matter knowledge of admissible (or plausible) values and combinations of values in each record
  o inconsistency of data values with edit rules means that there is an error, or in any case that the values are implausible
  o Examples:
    - profit of enterprise should be equal to turnover minus costs
    - male cannot be pregnant
    - profit of enterprise should be less than 50% of its total turnover
    - female cannot have given birth to more than 20 children
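
As a small illustration (my own sketch, not from the presentation), the two enterprise edit rules mentioned above can be expressed and checked in a few lines of Python; the field names and figures are made up for the example:

```python
# Checking records against a few edit rules; a record "fails" an edit if the rule is False.
records = [
    {"turnover": 1000, "costs": 800, "profit": 200},   # consistent record
    {"turnover": 1000, "costs": 800, "profit": 900},   # violates both rules
]

edit_rules = {
    "profit equals turnover minus costs":
        lambda r: r["profit"] == r["turnover"] - r["costs"],
    "profit less than 50% of turnover":
        lambda r: r["profit"] < 0.5 * r["turnover"],
}

for i, record in enumerate(records):
    failed = [name for name, rule in edit_rules.items() if not rule(record)]
    print(f"record {i}: failed edits -> {failed}")
```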

Modern forms of statistical data editing
- Interactive editing
- Selective editing
- Automatic editing
- Macro-editing

Interactive editing
- When PCs became popular in the 1980s, interactive editing became popular
- Checking and correcting can be done at the same time
  o effects of adjusting data in terms of failed edits or distributional aspects can be seen immediately on the computer screen
- This immediate feedback, in combination with the data themselves, directs the subject-matter specialist to potential errors
- Still, all records need to be checked

Selective editing
- Selective editing aims to identify records with influential errors
- The most common form of selective editing up to now is based on score functions that are used to split the data into two streams
  o critical stream: records that are likely to contain influential errors
    - records in the critical stream are edited interactively
  o non-critical stream: records that are unlikely to contain influential errors
    - records in the non-critical stream are edited automatically, or, in some cases, not at all

Selective editing
- The score for a record (global score) is a combination of scores for several important target parameters (local scores)
- Local scores are often defined as the product of two components
  o likelihood of a potential error ("risk"), measured by comparing the raw value with an "anticipated" value
  o contribution to the estimated target parameter ("influence"), measured as the (relative) contribution of the anticipated value to the estimated total
- Records with scores above a certain threshold are directed to interactive editing

Risk component
- The risk component can, for instance, be defined as

  $R_{ij} = \frac{|x_{ij} - \tilde{x}_{ij}|}{\tilde{x}_{ij}}$

  where $x_{ij}$ is the observed value of variable $j$ in unit $i$ and $\tilde{x}_{ij}$ is the corresponding "anticipated" value
- Large deviations from the "anticipated" value are taken as an indication that the raw value may be in error
- Small deviations indicate that there is no reason to suspect that the value is in error

Influence component
- The influence component can, for instance, be defined as

  $F_{ij} = w_i \tilde{x}_{ij}$

  where $\tilde{x}_{ij}$ is again the anticipated value and $w_i$ is the design weight of unit $i$

Local scores
- Multiplying the risk factor by the influence factor results in a measure for the effect of editing a field on the estimated total. In our example, the local score would be given by

  $s_{ij} = R_{ij} F_{ij} = w_i |x_{ij} - \tilde{x}_{ij}|$

  which measures the effect of editing variable $j$ in unit $i$ on the total for variable $j$

Global scores
- Local scores are scaled before combining them into a global score, e.g. by
  o dividing local scores by the (approximated) total of the variable
  o dividing local scores by the standard deviation of the "anticipated" values
- Scaled local scores can be combined to form a global score in several different ways
  o sum of the local scores
  o maximum of the local scores
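
To make the risk, influence and score formulas of the last few slides concrete, here is a minimal sketch of my own (not the speaker's code); the variable names, weights, anticipated values and threshold are all hypothetical:

```python
# Selective editing with local and global scores for one record.
raw = {"turnover": 5200.0, "costs": 48.0}             # observed (raw) values
anticipated = {"turnover": 5000.0, "costs": 4500.0}   # "anticipated" values, e.g. last year's edited data
weight = 2.5                                          # design weight of the unit
approx_totals = {"turnover": 1.2e6, "costs": 1.0e6}   # approximated totals used for scaling

local_scores = {}
for var in raw:
    risk = abs(raw[var] - anticipated[var]) / anticipated[var]   # risk component R
    influence = weight * anticipated[var]                        # influence component F
    local_scores[var] = risk * influence                         # local score s = R * F

# scale the local scores and combine them into a global score (maximum here, sum is an alternative)
scaled = {var: s / approx_totals[var] for var, s in local_scores.items()}
global_score = max(scaled.values())

THRESHOLD = 0.005  # hypothetical cut-off
stream = "critical (interactive editing)" if global_score > THRESHOLD else "non-critical (automatic editing)"
print(scaled, global_score, stream)
```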

Automatic editing
- Let the computer do all the work
- The main role of the human is to provide the computer with metadata, such as edits and imputation models

Automatic editing: two kinds of errors
- Systematic errors
  o errors reported consistently among (some) responding units and for which we know the underlying mechanism
- Random errors
  o errors for which we do not know the underlying mechanism and which seem to occur randomly

Automatic editing of systematic errors
- "Thousand-errors": values reported in units instead of thousands of units
  o can be detected by comparing present values with values from previous years or from other sources
- Typing errors (interchanged or mistyped digits, forgotten minus signs or interchanged pairs of revenues and costs) and rounding errors
  o can be detected and corrected by using mathematical techniques
- Other errors that are specific for a certain topic or branch of industry
  o can be detected and corrected by applying correction rules specified by subject-matter experts
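
A minimal sketch of my own (not from the presentation) of how a thousand-error might be detected and corrected by comparing the reported value with a reference value such as last year's figure; the factor-of-1000 check and the tolerance are hypothetical choices:

```python
def correct_thousand_error(reported: float, reference: float, tol: float = 0.5) -> float:
    """Return a corrected value if the reported value looks like it was given in
    units instead of thousands of units; otherwise return the reported value."""
    if reference > 0 and reported > 0:
        ratio = reported / reference
        # reported value is roughly 1000 times the reference -> divide by 1000
        if abs(ratio / 1000.0 - 1.0) < tol:
            return reported / 1000.0
    return reported

print(correct_thousand_error(reported=5_100_000, reference=5_000))  # corrected to 5100.0
print(correct_thousand_error(reported=5_100, reference=5_000))      # left unchanged
```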

Automatic editing of random errors
- Random errors occur by accident, not for a systematic reason
- Methods can be subdivided into three classes:
  o methods based on deterministic checking rules
    - "if components do not sum up to the total, the total is erroneous"
  o methods based on statistical models and outlier detection
  o methods based on solving a mathematical optimization problem
    - often based on the paradigm proposed by Fellegi & Holt (1976)
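
As an illustration of the first class, here is a minimal sketch (mine, with hypothetical field names) of the deterministic checking and correcting rule quoted above:

```python
def apply_sum_rule(record, components=("wages", "other_costs"), total="total_costs"):
    """If the components do not sum up to the total, declare the total erroneous
    and replace it by the sum of the components."""
    if sum(record[c] for c in components) != record[total]:
        record[total] = sum(record[c] for c in components)
    return record

print(apply_sum_rule({"wages": 300, "other_costs": 200, "total_costs": 450}))
# -> total_costs corrected to 500
```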

Fellegi-Holt paradigm
- Data in each record should be made to satisfy all edits by changing the fewest possible number of fields
  o for instance, if we can satisfy all edits by changing the value of only one variable, we change that value

Using Fellegi-Holt paradigm
- To use the Fellegi-Holt paradigm, we have to construct a mathematical optimization problem
  o target function: change as few fields as possible
  o constraints: the modified record satisfies all specified edits
- The mathematical optimization problem can be solved in several ways:
  o original algorithm proposed by Fellegi and Holt (1976)
  o using a branch-and-bound algorithm
  o using a standard solver for mathematical optimization problems

Using standard solver
- $x_i^0$ $(i = 1, \ldots, n)$: known observed values
- $x_i$ $(i = 1, \ldots, n)$: "corrected" values (to be determined)
- Assume $\alpha_i \le x_i \le \beta_i$ for certain constants $\alpha_i$ and $\beta_i$ $(i = 1, \ldots, n)$

Using standard solver
- Minimize

  $\sum_i y_i$

  subject to
  o the $x_i$'s satisfy the edits
  o $x_i^0 - (x_i^0 - \alpha_i) y_i \le x_i \le x_i^0 + (\beta_i - x_i^0) y_i$
  o $y_i \in \{0, 1\}$

Using standard solver
- $y_i = 0$ if $x_i = x_i^0$
  o we assume that $x_i^0$ is correct
- $y_i = 1$ if $x_i \neq x_i^0$
  o we assume that $x_i^0$ is incorrect
  o we delete $x_i^0$ and later impute that field
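
A minimal sketch of my own of this mixed-integer formulation for a single edit rule, assuming the open-source PuLP package (with its default CBC solver) is installed; the record, the edit and the bounds are hypothetical:

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

observed = {"turnover": 100.0, "costs": 60.0, "profit": 70.0}   # observed values x_i^0 (violate the edit)
alpha, beta = 0.0, 1e6                                           # assumed bounds alpha_i, beta_i

prob = LpProblem("fellegi_holt", LpMinimize)
x = {v: LpVariable(f"x_{v}", lowBound=alpha, upBound=beta) for v in observed}   # corrected values
y = {v: LpVariable(f"y_{v}", cat=LpBinary) for v in observed}                   # 1 = field is changed

prob += lpSum(y.values())                              # target function: change as few fields as possible
prob += x["costs"] + x["profit"] == x["turnover"]      # the edit rule
for v, x0 in observed.items():
    # if y_v = 0 then x_v is fixed at its observed value x_v^0
    prob += x[v] >= x0 - (x0 - alpha) * y[v]
    prob += x[v] <= x0 + (beta - x0) * y[v]

prob.solve()
print({v: (x[v].varValue, int(y[v].varValue)) for v in observed})   # one field is changed
```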

Macro-editing
- Macro-editing offers a solution to some of the problems of micro-editing
  o in particular, macro-editing can deal with editing tasks related to the distributional aspects of the data
- Macro-editing checks whether the data set as a whole is plausible

Macro-editing
- Two forms of macro-editing:
  o aggregation method
    - compares quantities in tables with the same quantities in previous publications, or with related quantities from other sources
  o distribution method
    - available data are used to characterize the data distribution
    - individual values are compared with this distribution
    - visualizations are often used
    - once suspicious data have been detected in a visualization, one can usually drill down to individual records and edit these records interactively
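
A minimal sketch of my own (not from the presentation) of the distribution method: the distribution of a variable is summarized by its median and median absolute deviation, and units lying far from it are flagged as suspicious; the data and the cut-off of 5 are hypothetical:

```python
values = {"unit_1": 98.0, "unit_2": 102.0, "unit_3": 99.0, "unit_4": 101.0, "unit_5": 410.0}

v = sorted(values.values())
median = v[len(v) // 2]
mad = sorted(abs(x - median) for x in v)[len(v) // 2]   # median absolute deviation

suspicious = {u: x for u, x in values.items() if abs(x - median) > 5 * mad}
print(suspicious)   # unit_5 is flagged and would be drilled down to and edited interactively
```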

Imputation
Discovering the unknown

Imputation
- The main problem of imputation is to preserve the statistical distribution of the (complete, but partly unknown) data as well as possible
- Many imputation techniques exist and are described in many excellent books and articles

Univariate imputation methods
- Mean (mode) imputation
  o impute the mean (or mode) in a class defined by auxiliary variables
- Ratio imputation
  o impute the value of an auxiliary variable times an estimated ratio
- Regression imputation
  o impute the value predicted by an estimated regression model (with or without a stochastic term)
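
A minimal sketch of my own of mean and ratio imputation on toy data (the figures are made up); regression imputation would replace the ratio by the prediction of an estimated regression model:

```python
turnover = [100.0, 200.0, None, 400.0]   # variable to impute (None = missing)
employees = [2.0, 4.0, 6.0, 8.0]         # auxiliary variable, fully observed

observed = [(t, e) for t, e in zip(turnover, employees) if t is not None]

# mean imputation: impute the mean of the observed values
mean_value = sum(t for t, _ in observed) / len(observed)

# ratio imputation: impute employees * estimated ratio (observed turnover total / employees total)
ratio = sum(t for t, _ in observed) / sum(e for _, e in observed)

imputed_mean  = [t if t is not None else mean_value for t in turnover]
imputed_ratio = [t if t is not None else ratio * e for t, e in zip(turnover, employees)]
print(imputed_mean)    # third value imputed with the mean
print(imputed_ratio)   # third value imputed as 50 * 6 = 300
```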

Multivariate imputation methods
- Random hot deck
  o impute values from a random donor record from a class defined by auxiliary variables
- Nearest-neighbor hot deck
  o impute values from the donor record that is closest to the record to be imputed
- Maximum likelihood imputation
  o approach to estimate model parameters for, e.g., multivariate normal data
  o values predicted by the model are imputed (with or without a stochastic term)
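
A minimal sketch of my own of a nearest-neighbor hot deck; the records, the auxiliary variable and the distance (absolute difference in number of employees) are hypothetical choices:

```python
records = [
    {"employees": 10, "turnover": 500.0, "costs": 400.0},
    {"employees": 52, "turnover": 2600.0, "costs": 2100.0},
    {"employees": 48, "turnover": None,   "costs": None},   # record to be imputed
]

donors = [r for r in records if None not in r.values()]     # complete records act as donors
recipient = records[2]

# pick the donor closest to the recipient on the observed auxiliary variable
donor = min(donors, key=lambda d: abs(d["employees"] - recipient["employees"]))
for field, value in recipient.items():
    if value is None:
        recipient[field] = donor[field]
print(recipient)   # turnover and costs are copied from the donor with 52 employees
```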

Variance of imputed data
- Standard variance formulas underestimate the variance when applied to imputed data
- The variance can be estimated correctly by multiply imputing the data
  o in multiple imputation the model parameters are varied by drawing from a prior
- Alternatively, resampling methods may be used, such as the bootstrap or the jackknife
  o in resampling methods the data are varied by resampling from the observed data
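
A minimal sketch of my own of a bootstrap variance estimate that repeats the imputation inside every replicate, so the extra uncertainty due to imputation is reflected; mean imputation and the toy data are purely illustrative:

```python
import random

random.seed(1)
data = [100.0, None, 150.0, None, 200.0, 250.0, None, 300.0]   # toy data with missing values

def impute_and_estimate(sample):
    observed = [v for v in sample if v is not None]
    if not observed:                                    # degenerate resample: fall back to the full observed data
        observed = [v for v in data if v is not None]
    mean = sum(observed) / len(observed)
    completed = [v if v is not None else mean for v in sample]
    return sum(completed) / len(completed)              # estimator of interest: the mean

replicates = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]      # resample, then re-impute
    replicates.append(impute_and_estimate(resample))

m = sum(replicates) / len(replicates)
bootstrap_variance = sum((r - m) ** 2 for r in replicates) / (len(replicates) - 1)
print(impute_and_estimate(data), bootstrap_variance)
```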

National Statistical Institutes and imputation
- Specific imputation problems at National Statistical Institutes (NSIs):
  o satisfying edits
  o preserving totals

NSIs and imputation: satisfying edits
- For NSIs "logically consistent" data, i.e. data without impossible combinations ("pregnant males"), is an important issue
- Imputations should preferably satisfy the edit restrictions

NSIs and imputation: preserving totals
- Some NSIs have adopted a one-figure policy: publish only one figure on the same phenomenon
- Imputations should preferably also preserve known or previously estimated totals
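
One simple way to make imputations preserve a known total is to rescale only the imputed values; a minimal sketch of my own, with made-up figures:

```python
observed = {"A": 120.0, "B": 80.0}    # reported values, left untouched
imputed  = {"C": 40.0, "D": 60.0}     # initial (model-based) imputations
known_total = 320.0                    # total known from, e.g., an administrative source

remainder = known_total - sum(observed.values())   # what the imputed units must add up to
factor = remainder / sum(imputed.values())
adjusted = {k: v * factor for k, v in imputed.items()}
print(adjusted, sum(observed.values()) + sum(adjusted.values()))   # completed data add up to 320.0
```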

The future of statistical data editing

The future of statistical data editing
- Many technical points need to be answered, for instance
  o how to combine selective and automatic editing as efficiently as possible?

The future of statistical data editing
- Developing effective and efficient editing techniques for Big Data
- No generally accepted definition of Big Data exists
- Recurring descriptions include Volume, Velocity and Variety
  o Volume is what makes data sets big: larger than regular systems can handle smoothly
  o Velocity refers to the short time lag between the occurrence of an event and it being available in the data set, or to the frequency at which data become available
  o Variety refers to the wide diversity of data sources and formats

Big Data and statistical data editing
- Traditional editing methods generally require data to be structured and require a lot of knowledge about these data
  o Big Data are often not very structured and knowledge about Big Data is often limited
  o in many cases traditional editing techniques are not suitable for Big Data
- Some new approaches have been developed, for instance the use of signal processing techniques to edit and impute traffic loop data
- Possibly data science methods, such as machine learning, may be useful here

The future of statistical data editing
- "Statistical data editing is part of the total quality improvement process, not the whole quality process" (Leopold Granquist, 1997)
  o the focus of statistical data editing has thus far been on detecting and correcting errors
  o possibly more important are
    - assessing the quality of the edited data
    - obtaining more knowledge on the sources of errors in the data, and subsequently using this information to improve future versions of the survey (process)

The future of statistical data editing
- Moving from single-source statistics to multi-source statistics
  o traditionally only survey data were available
  o nowadays much more data are available (administrative data, Big Data)
- Instead of editing each data source separately it is probably more effective and more efficient to edit them jointly
- Availability of several data sources on the same topic also opens opportunities for other statistical methods
  o latent variable models when you have multiple observations of the same variable, where the true value is considered as a latent variable
  o imputation satisfying edits and preserving totals

Finally, how do we do it in practice?
- "In theory, practice is simple"
- "But, is it simple to practice theory?"

Finally, how do we do it in practice?
- All statistical data editing techniques have their advantages and disadvantages
- At NSIs we often integrate various statistical data editing techniques in order to overcome the drawbacks of the individual techniques

Typical data editing strategy for business data
- Detect and correct certain systematic and/or obvious errors
- Use selective editing to split records into a critical and a non-critical stream
  o records in the critical stream are edited interactively
  o records in the non-critical stream are edited automatically
- After the data have been edited on a micro-level: apply macro-editing
- This strategy is, for instance, used for the annual Structural Business Statistics at Statistics Netherlands

Typical data editing strategy for social data
- Detect and correct certain systematic and/or obvious errors
- Use deterministic checking and correcting rules (micro-aggregation)
- After the data have been edited on a micro-level: apply macro-editing

How do we do it in practice? (part 2)
- We try to keep it simple: no complicated models
- NSIs have always been reluctant to use model-based techniques
  o NSIs don't want to be accused of partiality and of manipulating their results by favoring a certain model
  o NSIs produce many statistics as part of their day-to-day routine
    - finding the best model would be very time-consuming
- Keeping it simple
  o for some enterprises we have administrative data that can be used during selective editing and imputation
  o for some other enterprises we have data from a previous period
  o for some other enterprises we have data from other surveys
  o even if you keep it simple, it will become complicated

Thank you for your attention
Any questions?