## Can ‘Dirty’ Data Be Your Friend?

### Considering the Impact on Disclosure Risk as Illustrated in the Education Data Context

When collecting and disseminating survey data, the goals for a federal agency are to maintain the confidentiality of data provided by respondents and limit any confidentiality-based edits to the data to minimize the impact on data quality. Some agencies provide guidelines and standardized procedures for ensuring data confidentiality and quality, as discussed in the Federal Committee on Statistical Methodology Working Paper 22. (Detailed standards on confidentiality can be found on federal agency websites such as the National Center for Education Statistics [Section 4.2].)

In the context of education studies, the data collected are most often representative samples of schools, students, and/or teachers, whether at the national, state, or local-district level. For public-school surveys, there is concern about whether matching key survey data (e.g., school enrollment, percentage of students who are Hispanic) to publicly available school-level universe data files can be used to identify the school. Such key survey data are referred to as the “matching key.” The survey values for the matching key may come directly from the universe collection used as the sampling frame.

For the public-school survey example, the universe file could be the Common Core of Data (CCD), a publicly available database of U.S. public schools. If the survey values do not come from the CCD or some other publicly available frame source, the values may be estimated from the data collected from the sampled students in the school, such as the percentage of Hispanic students in the school, if some information is present in the survey data set that provides the connection between the students and their school.

This article focuses on how best to account for “dirty” data using an empirically grounded approach. The term “dirty data” does not demean the validity or usability of the data or survey estimates; it is used in the context of the high precision required for matching at the data-record level. Typically, guidance or risk measures for re-identification (i.e., linking a school or a student to survey data) that do not incorporate an actual matching process make the lofty assumption that the data sleuth’s matching key is 100% noise-free, which overestimates the risk and triggers more confidentiality edits than necessary. Even though the data sets (or available information) may be of high quality, there are several inherent risk-reducing elements to consider:

- Either data set may not have complete coverage of its population.
- The definition of the target population for a survey may be specific when it comes to details relating to exclusions.
- Mobility rate among the student population may cause changes in the demographics between universe frame and the survey collections, with some uncertainty occurring as to the timing of the move.
- Depending on when it was last collected, the public-use universe database may have records that are out of date; for example, schools may provide a total student enrollment count, but actual enrollment changes continually as new students enter the school and others graduate or drop out.
- Either data set may contain missing data or use imputed data to fill in for missing values.
- A data set may rely on self-reported information, which is not always accurate or at least can be different from administrative data, to provide its content.
- There may have been keying errors when creating the data set.
- Survey estimates are subject to sampling variance, which adds noise if used in the matching key.

#### Rationale and Rules for Probabilistic Matching in Disclosure Risk Assessment

Because in some agencies, including the National Center for Education Statistics (NCES), public-use data cannot be used to attempt to identify any of the respondents within a study, safeguards are necessary to protect against accidental disclosure, complacency, or malicious attacks. A reasonable approach is needed to address the degree of dirty-ness in the data with respect to measuring the disclosure risk. Therefore, besides the standard confidentiality edits such as data suppression, top- and bottom-coding, and collapsing of categories, confidentiality standards may (depending on the agency) generally require two separate procedures for public-use microdata files for sample surveys:

- Identify high-risk variables and records and mask them using a directed (deterministic) approach, such as data-swapping, where select data values are exchanged between two closely matched records.
- Introduce an additional measure of uncertainty into the data such as using random swapping as discussed by Fienberg and McIntyre in their *Journal of Official Statistics* article, “Data Swapping: Variations on a Theme by Dalenius and Reiss.”
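As a rough illustration of the second procedure, the sketch below swaps one variable between randomly chosen pairs of records that agree on a grouping variable. It is only a minimal sketch and not the specific algorithm discussed by Fienberg and McIntyre; the function name `random_swap`, the field names, and the swap rate are hypothetical.

```python
import random

def random_swap(records, swap_var, group_var, rate, seed=0):
    """Swap `swap_var` between random pairs of records sharing `group_var`.

    `rate` is the approximate share of records in each group to perturb.
    Returns the masked copy and the list of swapped index pairs.
    """
    rng = random.Random(seed)
    out = [dict(r) for r in records]  # copy so the input stays untouched

    # Group record indices by the variable that swap partners must share.
    groups = {}
    for i, r in enumerate(out):
        groups.setdefault(r[group_var], []).append(i)

    swapped = []
    for idx in groups.values():
        rng.shuffle(idx)
        n_pairs = int(len(idx) * rate / 2)  # each pair perturbs two records
        for k in range(n_pairs):
            a, b = idx[2 * k], idx[2 * k + 1]
            out[a][swap_var], out[b][swap_var] = out[b][swap_var], out[a][swap_var]
            swapped.append((a, b))
    return out, swapped

# Hypothetical school records: six "city" schools and four "rural" schools.
schools = [
    {"id": i, "locale": "city" if i < 6 else "rural", "enrollment": 100 + 10 * i}
    for i in range(10)
]
masked, pairs = random_swap(schools, "enrollment", "locale", rate=0.5)
```

Swapping within groups leaves the distribution of the swapped variable unchanged inside each group, which is one reason swapping tends to distort marginal estimates less than other perturbation methods.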

The results of the risk assessment in the first procedure can be used to inform the degree of random swapping in the second procedure. An approach can be outlined that limits data perturbation (and potential data distortion) based on the risk assessment, but what is the best approach to conduct the risk assessment given a degree of dirty-ness in the data?

Returning to the public-school survey example, protecting the identity of a school significantly protects the identity of an individual respondent from the school. Thus, if one can identify a school by matching the data against external public-use data files (e.g., CCD), one or more of the matching key variables on the sample survey must be perturbed to prevent such identification. The “rule of three” guideline is sometimes used to determine whether a match is treated as a disclosure risk. In this context, the rule of three means that a school participating in a sample survey cannot have its public-use record as the first- or second-highest match when compared to the related public-use universe data file.
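The rule of three can be sketched as a simple rank check. The example assumes that a match score has already been computed between one survey school and every candidate universe record; the identifiers and scores are made up, and counting ties against the school (the conservative choice) is an assumption of this sketch rather than part of the guideline.

```python
def true_match_rank(scores, true_id):
    """Rank (1 = best) of the school's own universe record among all candidates.

    Ties are resolved in the intruder's favor: only strictly higher
    scores push the true record down the ranking.
    """
    target = scores[true_id]
    return 1 + sum(1 for s in scores.values() if s > target)

def violates_rule_of_three(scores, true_id):
    """Flag the survey school if its true record is the first- or second-highest match."""
    return true_match_rank(scores, true_id) <= 2

# Hypothetical match scores of one survey school against four universe records.
scores = {"U1": 4.2, "U2": 7.9, "U3": 6.1, "U4": 1.0}

flag_if_u3 = violates_rule_of_three(scores, "U3")  # true record ranks second
flag_if_u1 = violates_rule_of_three(scores, "U1")  # true record ranks third
```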

The matching approach, software, and methodology are critical in identifying any records that can be matched (and, thus, need confidentiality edits).

An accepted approach in disclosure analysis is to compare the matching key variables from the sample survey with publicly available universe data using a probabilistic, record-linkage approach, as proposed by Fellegi and Sunter in their 1969 *Journal of the American Statistical Association* article, “A Theory for Record Linkage.” This approach does not require a 100% successful matching of variables between the sample survey and public-use universe data, but instead calculates match rates by measuring how closely the variables agree. For survey-data matching, exact matching procedures and Euclidean distance measures were not deemed sufficient for disclosure analyses in most studies. Since the 1990s, probabilistic record-linkage matching has been a preferred methodology for identifying disclosure risk data for public-use sample-survey microdata release.

The most recent probabilistic linkage methods used by federal agencies rely on extensions to Fellegi and Sunter’s theory, due to Winkler’s and Yancey’s works: “Overview of Record Linkage and Current Research Directions,” *Research Report Series* (*Statistics* #2006-2), and “The BigMatch Program for Record Linkage,” *Proceedings of the Section on Survey Research Methods of the American Statistical Association*, respectively. Linkage of sample survey data to other publicly available universe data sources begins with a comparison vector. The vector represents, for a product space *A* × *B*, the agreement of attributes (e.g., variables) of data sources *A* and *B*.

In the education survey context, *A* refers to the sample survey records for schools, and *B* to public-use universe records for schools. All possible pairs of records from *A* and *B* partition into a set *M* of correct links (where the pair of records represents the same school) and a set *U* of incorrect links (where the pair of records represents different schools). A data sleuth seeks to separate set *M* from set *U*. It is the goal of data-disclosure control to make the risk of that happening quantifiably small.

An agreement pattern is defined as the outcomes of a series of comparisons of variables *v* in common between sample survey data being released (*A*) and in universe data already available to intruders (*B*). The agreement pattern can be notated as $(A.v_i \approx B.v_i), (A.v_j \approx B.v_j), \ldots$, where the subscripts denote specific variables taken in any order from *A* and *B* and the $\approx$ symbol is a comparison operator that yields, in the original sense of a comparison vector, a binary (0,1) agreement pattern that depends on the similarity of the values of the variables.
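A comparison vector of this kind could be computed along the following lines. This is a minimal sketch: the variable names, the tolerances, and the choice of an absolute-difference rule for numeric variables are all assumptions made for illustration.

```python
def agree_exact(a, b):
    """Comparison operator for categorical values: 1 if identical, else 0."""
    return int(a == b)

def agree_within(tol):
    """Build a comparison operator for numeric values: 1 if |a - b| <= tol."""
    return lambda a, b: int(abs(a - b) <= tol)

def comparison_vector(rec_a, rec_b, comparators):
    """Binary agreement pattern for one (survey, universe) record pair."""
    return tuple(cmp(rec_a[v], rec_b[v]) for v, cmp in comparators.items())

# Hypothetical matching key and tolerances.
comparators = {
    "state": agree_exact,
    "enrollment": agree_within(25),
    "pct_hispanic": agree_within(5.0),
}
survey_rec = {"state": "MD", "enrollment": 512, "pct_hispanic": 31.0}
universe_rec = {"state": "MD", "enrollment": 530, "pct_hispanic": 24.5}

pattern = comparison_vector(survey_rec, universe_rec, comparators)
```

Here the state and enrollment comparisons agree while the percentage of Hispanic students falls outside its tolerance, so the pattern is `(1, 1, 0)`.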

After comparisons of all pairings of records in *A* and *B*, a summary of the results reveals distinct agreement patterns with frequencies. This (*c*) represents the outcomes of comparisons of common variables (*v*) in files *A* and *B* as a comparison vector populated by a *k*-tuple of binary values (or, more generally, in the range {0,1}). Each distinct agreement pattern has unobserved *m* (match) and *u* (unmatch) probabilities given the record (*r*): $m_c = \Pr(c \mid r \in M)$; $u_c = \Pr(c \mid r \in U)$. The likelihood that a pairing of rows from *A* and *B* with agreement pattern *c* belongs in *M* is defined as $m_c / u_c$, so agreement patterns can be ranked by $\text{score}_c = \log(m_c / u_c)$ and decision rules defined to control Type I and Type II errors. Approaches to approximate or infer the *m* and *u* probabilities have been developed since they cannot be observed directly.
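The scoring of agreement patterns can be illustrated with a short sketch. It assumes conditional independence across variables, a common simplification that is not stated in the text above, so that the pattern-level ratio factors into per-variable terms. The per-variable *m* and *u* values are invented for the example; in practice they would have to be estimated.

```python
import math
from itertools import product

def pattern_score(pattern, m_probs, u_probs):
    """log(m_c / u_c) for a binary agreement pattern, assuming independence.

    m_probs[k] = Pr(variable k agrees | pair is a true match)
    u_probs[k] = Pr(variable k agrees | pair is not a match)
    """
    score = 0.0
    for agree, m, u in zip(pattern, m_probs, u_probs):
        if agree:
            score += math.log(m / u)
        else:
            score += math.log((1 - m) / (1 - u))
    return score

# Illustrative (made-up) probabilities for three matching-key variables.
m = [0.95, 0.80, 0.70]
u = [0.02, 0.10, 0.20]

# Rank all 2^3 agreement patterns from most to least match-like.
ranked_patterns = sorted(
    product((0, 1), repeat=3),
    key=lambda c: pattern_score(c, m, u),
    reverse=True,
)
```

A noisier variable has an *m* probability closer to its *u* probability, so agreement on it contributes less to the score. This is how differential dirtiness across variables enters the ranking.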

#### Matching Methodology and Dirty Data

There are two components for reviewing and preparing data for evaluating the risk factors with matching data. The first component entails identifying all of the possible variables that could be matched, whether they are defined or derived. The second entails evaluating the reliability of each of the variables selected for matching, based on the variable type (categorical, continuous, or ordinal). Certain categorical variables, such as geographic-based categorical variables, should always or almost always match, while most continuous numeric variables and character string variables are nearly always different.

A question that arises, however, is whether data should be masked, withheld, or otherwise perturbed to the extent that they are if the source data have inherent masking. Most, if not all, data collected in education surveys and educational administrative offices have some differences, mainly due to the various reasons for dirty data mentioned above. What becomes important to incorporate in the model is the cause of the differences and the degree of the differences. These differences can be considered as noise or error and can be incorporated into probabilistic matching.

Some variables will be more reliable than others; for example, gender will be more reliable than race, since self-identification and/or administrative categories are not always consistent on race, but are with gender. Also, albeit outside of the context of education data, in an evaluation by DiSogra, et al., in their *Proceedings of the Section on Survey Research Methods of the American Statistical Association* paper, “On the Quality of Ancillary Data Available for Address-Based Sampling,” the correlation between the ancillary data (such as race/ethnicity from commercial files) and self-report survey data is not very high (ranges are from 0.26 to 0.63). This may help to reduce the threat of some explicit information being accessed from other publicly available files.

Several approaches can be used to develop a measure of data reliability. If universe data records are available, a subset of the data can be verified and an error rate calculated. If the target and universe databases containing the same respondents can be compared, the agreement/disagreement between variables can be calculated. If neither approach is available, a model is needed that can inform on the potential reliability of data based on given factors.
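The second approach, comparing records known to refer to the same respondents, might look like the sketch below. The record layout, the tolerance, and the decision to skip missing values rather than count them as disagreements are assumptions of this example.

```python
def agreement_rates(linked_pairs, variables, tolerances=None):
    """Share of known-true record pairs on which each variable agrees.

    `linked_pairs` holds (survey_record, universe_record) tuples for the
    same entity. Variables listed in `tolerances` agree when the absolute
    difference is within the tolerance; others require exact equality.
    """
    tolerances = tolerances or {}
    rates = {}
    for v in variables:
        agree = 0
        usable = 0
        for a, b in linked_pairs:
            if a[v] is None or b[v] is None:
                continue  # skipped here; counting as disagreement is another option
            usable += 1
            tol = tolerances.get(v)
            if tol is None:
                agree += a[v] == b[v]
            else:
                agree += abs(a[v] - b[v]) <= tol
        rates[v] = agree / usable if usable else float("nan")
    return rates

# Hypothetical pairs of survey and universe records for the same four schools.
linked = [
    ({"locale": "city", "enrollment": 500}, {"locale": "city", "enrollment": 510}),
    ({"locale": "rural", "enrollment": 120}, {"locale": "rural", "enrollment": 180}),
    ({"locale": "town", "enrollment": 300}, {"locale": "town", "enrollment": None}),
    ({"locale": "city", "enrollment": 900}, {"locale": "suburb", "enrollment": 905}),
]
rates = agreement_rates(linked, ["locale", "enrollment"], {"enrollment": 25})
```

Agreement rates computed on true pairs in this way could serve as rough empirical stand-ins for the per-variable match probabilities used in probabilistic linkage.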

Regardless of the source of error, it is possible to look into a matching methodology to factor in the divergence between survey and public-use universe data files. The design of Jaro’s AutoMatch probabilistic record-linkage software, for example, incorporates a reliability measure (matching weight) for each matching key variable based on the agreement/disagreement between the matching pairs. Using a similarly derived dynamic measure for agreement/disagreement when conducting the risk analyses, allowing for distance measures when matching continuous variables, may produce a more-reliable risk measure.

A number of probabilistic-matching software packages are commercially available. One notable package is the Centers for Disease Control and Prevention’s Fine-grained Record Integration and Linkage (FRIL) software. It provides all the functionality needed for handling exact (fuzzy and distance-based deterministic matching) and delta/tolerance-based variables, inherently calculating data reliability (matching and non-matching probabilities) to generate matching weights to identify disclosure risk entities.

#### Accounting for Dirty Data in Other Risk-Measure Approaches

As mentioned in the introduction, guidance and risk measures have been developed, such as by the Office for Civil Rights (OCR) and El Emam, et al., to estimate re-identification risk without the existence or use of a matching file. Suppose $r_i$ is the re-identification risk for sample survey record $i$. Once estimated, the risk value is compared to a pre-specified tolerance $\tau$. If $r_i > \tau$, then other statistical disclosure control (SDC) treatments have to occur to reduce the risk below the threshold. To incorporate the risk-reducing features of “dirty” data, applying a risk-reducing factor $f$ may be considered as follows: $f \times r_i$. If $f$ is different for each record, or for groups of records, the risk-reducing factor could be applied as $f_i \times r_i$. An example of this approach can be found in Krenzke and Hubble’s 2009 *Proceedings of the Section on Survey Research Methods of the American Statistical Association* paper, “Toward Quantifying Disclosure Risk for Area-Level Tables When Public Microdata Exists.”

Conversely, instead of applying the risk-reducing factor to the risk measure, one could equally apply it to the threshold (τ/*f*). In their 2010 *Annals of Applied Statistics* paper, “Assessing the Protection Provided by Misclassification-based Disclosure Limitation Methods for Survey Microdata,” Shlomo and Skinner incorporated the concept of misclassification error from data processes or from SDC treatments in calculating risk using a log-linear model. Note that, when missing values are not addressed in the risk estimation, it may cause an overestimate of the risk, as discussed recently by Krenzke, Li, and Li in “An Evaluation of the Impact of Missing Data on Disclosure Risk,” published in the *Proceedings of the Survey Research Methods Section of the American Statistical Association*.
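The arithmetic of the risk-reducing factor can be made concrete with a small sketch. The risk values, factors, and tolerance below are invented, and the factor is assumed to lie in the interval (0, 1], with smaller values standing for noisier data.

```python
def needs_treatment(r, f, tau):
    """True when the adjusted risk f * r still exceeds the tolerance tau."""
    return f * r > tau

def needs_treatment_scaled_threshold(r, f, tau):
    """Equivalent check (for f > 0) that scales the threshold instead of the risk."""
    return r > tau / f

# Hypothetical per-record risks and record-level risk-reducing factors.
risks = [0.02, 0.08, 0.30]
factors = [0.9, 0.5, 0.5]
tau = 0.05

unadjusted = [r > tau for r in risks]
adjusted = [needs_treatment(r, f, tau) for r, f in zip(risks, factors)]
```

In this made-up example the second record is flagged when its risk is taken at face value (0.08 > 0.05) but not after the factor is applied (0.5 × 0.08 = 0.04), while the third record remains above the tolerance either way.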

#### Summary

The one-size-fits-all approach to identifying and ameliorating disclosure risk should be reviewed in terms of how the approach used can be adjusted to account for dirty data to enhance the overall statistical disclosure-control process for sample surveys. Overly conservative risk assessments unnecessarily reduce data utility, due to overly cautious masking of data.

Probabilistic matching is a great solution that can handle differential dirty-ness across variables. This approach is useful not only for identifying potential disclosure risk for individual data records, but also for providing insights into the overall risk level of the sample survey file. Records with outlier data, unusual combinations of data, and/or a large matching key often lead to a high-risk file, and therefore, require a certain level of perturbation, such as random swapping. However, studies that have either no or a very limited number of actual matches should, perhaps, relax the level or need for random swapping, collapsing of data, and data suppression.

Still, dirty data may not help in the case of outliers. For example, while income may not be reported or measured with 100% perfection, the magnitude of some income value may be in a “neighborhood” by itself. Therefore, outliers may still exist and risk-reducing practices such as top-coding (such as assigning a maximum value of $75,000 to any income values greater than $75,000) or grouping continuous values into a small number of categories are common.
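Both treatments are simple to express in code. The sketch below uses the $75,000 cap from the example above; the income values and the category cut points are hypothetical.

```python
from bisect import bisect_right

def top_code(values, cap):
    """Replace every value above `cap` with `cap`."""
    return [min(v, cap) for v in values]

def band(value, edges):
    """Index of the category a value falls in, given ascending cut points.

    A value equal to a cut point is placed in the higher category.
    """
    return bisect_right(edges, value)

# Hypothetical incomes, including one outlier.
incomes = [18_000, 52_500, 75_000, 240_000]

capped = top_code(incomes, 75_000)
bands = [band(v, [25_000, 50_000, 75_000]) for v in incomes]
```

After either treatment the outlier of 240,000 is indistinguishable from other values at or above the cap.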

This means the functionality of probabilistic linkage matching can be considered as two-fold. The risk assessment using the matching approach should consider the sources and amount of error in the data. Once the risk levels are determined, the amount of error can influence the perturbation rates and the variables that may or may not be perturbed.

For example, if a variable selected for perturbation already has a significant amount of noise or uncertainty in its values, then the perturbation rate for that variable could be lowered. Overall, through the use of probabilistic matching procedures, measures could be developed that quantify the overall disclosure risk in the file. These measures could be used to signal and potentially lessen the amount of perturbation required to ensure that the data meet the standards for dissemination.

#### Further Reading

DiSogra, C., J.M. Dennis, and M. Fahimi. 2010. On the quality of ancillary data available for address-based sampling. *Proceedings of the Section on Survey Research Methods of the American Statistical Association*: 4174–4783. Alexandria, VA: American Statistical Association.

El Emam, K. 2011. Methods for the de-identification of electronic health records for genomic research. *Genome Medicine* 3:1–9.

Elliot, M.J., C.J. Skinner, and A. Dale. 1998. Special uniques, random uniques, and sticky populations: Some counterintuitive effects of geographical detail on disclosure risk. *Research in Official Statistics* 1(2).

Federal Committee on Statistical Methodology Working Paper 22.

Fellegi, I.P., and A. Sunter. 1969. A theory for record linkage. *Journal of the American Statistical Association* 64(328):1183–1210.

Fienberg, S., and J. McIntyre. 2005. Data swapping: Variations on a theme by Dalenius and Reiss. *Journal of Official Statistics* 21(2):309–323.

Jaro, M.A. 1989. Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. *Journal of the American Statistical Association* 84:414–420.

Krenzke, T., and D. Hubble. 2009. Toward quantifying disclosure risk for area-level tables when public microdata exists. *Proceedings of the Section on Survey Research Methods of the American Statistical Association*: 4707–4717. Alexandria, VA: American Statistical Association.

Krenzke, T., J. Li, and L. Li. 2014. An evaluation of the impact of missing data on disclosure risk. *Proceedings of the Survey Research Methods Section of the American Statistical Association*: 548–557. Alexandria, VA: American Statistical Association.

Office for Civil Rights (OCR). 2012. Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Washington, DC: Office for Civil Rights.

Shlomo, N., and C. Skinner. 2010. Assessing the protection provided by misclassification-based disclosure limitation methods for survey microdata. *The Annals of Applied Statistics* 4(3):1291–1310.

Winkler, W. 2006. Overview of record linkage and current research directions. *Research Report Series* (Statistics #2006-2). Washington, DC: Statistical Research Division, U.S. Census Bureau.

Yancey, W.E. 2004. The BigMatch program for record linkage. *Proceedings of the Section on Survey Research Methods of the American Statistical Association*. Alexandria, VA: American Statistical Association.

#### About the Authors

Stephen “Shep” Roey is a senior systems analyst at Westat. He has, over the past 20 years, been involved in developing and implementing Statistical Disclosure Control (SDC) measures for the U.S. Department of Education and other federal agencies. He has managed a number of federal studies and is currently the project director for the NAEP High School Transcript Study. He has provided his database and statistical expertise in workshops and seminars nationally and internationally. His undergraduate degree is from Muhlenberg College and his graduate degree is from Schiller International University (SIU), Paris campus.

Tom Krenzke is a senior statistician and associate director of Westat’s Statistical Staff, and has more than 20 years of experience in survey sampling and estimation techniques. He leads Westat’s Confidentiality Work Group and serves on the Westat Institutional Review Board. He is an appointed member of the American Statistical Association’s Committee on Privacy and Confidentiality, 2016 Program-chair elect for the Survey Research Methods Section, and at-large representative of the Washington Statistical Society, where he serves on the Mentoring Subcommittee of WSS and co-coordinates the member Spotlight series.

Robert Perkins is a senior systems analyst at Westat. He has worked with Stephen “Shep” Roey in the development and implementation of statistical disclosure control measures for various national and international education surveys. He previously worked at the U.S. Census Bureau, where he was involved with both the decennial census and population estimates and projections. He has a master’s degree in statistics from Virginia Tech.

**O Privacy, Where Art Thou?** takes a statistical look at data privacy and confidentiality. If you are interested in submitting an article, please contact Aleksandra Slavkovic, column editor, at *sesa@stat.psu.edu*.