LibGuides: How to Work with Sensitive Data: Reducing Data Sensitivity (Text)

Overview

Text-based datasets can contain dozens to hundreds of variables. These can include both structured (e.g., tabular) and unstructured (e.g., interview transcripts) data. Even text-based data that are more like accessory metadata than the focal research dataset (e.g., a spreadsheet of participant demographics) can still be sensitive. In large datasets (thousands of entries or more) or datasets that sample broad scales (e.g., country-level sampling), many variables or attributes are not identifying in isolation (e.g., there are millions of Americans who identify as 'male'). However, in combination with other variables or other information (e.g., a small geographic sampling region specified in the Methods section of an associated paper), a given variable may be the "missing piece" that leads to re-identification of a participant (e.g., a 43-year-old white male who works in real estate in Jackson Hole, Wyoming and is married with 2 children).
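One rough way to see how combinations of variables narrow down identities is to count how many records share each combination of values. The following is a minimal sketch, similar in spirit to a k-anonymity check; the function name `small_groups`, the threshold `k`, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def small_groups(records, columns, k=5):
    """Return each combination of values in `columns` shared by fewer than k records."""
    counts = Counter(tuple(record[col] for col in columns) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical sample data; real datasets would be far larger.
records = [
    {"gender": "male", "occupation": "real estate"},
    {"gender": "male", "occupation": "teacher"},
    {"gender": "male", "occupation": "teacher"},
    {"gender": "female", "occupation": "teacher"},
    {"gender": "female", "occupation": "teacher"},
]

# Gender alone does not single anyone out at k=2 ...
print(small_groups(records, ["gender"], k=2))
# ... but gender combined with occupation isolates one record.
print(small_groups(records, ["gender", "occupation"], k=2))
```

An appropriate threshold depends on the dataset and its context, so the value of `k` here should not be read as a recommendation.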

Examples of potentially identifying variables

The following is a non-exhaustive list of common variables that are typically not directly identifying (unlike, e.g., Social Security Number or name) but that may become identifying in tandem with other information. Keep in mind that the context of both the scope of the data (the target group) and the data themselves (the sampled group) influences the relative likelihood of a variable contributing to re-identification.

Qualitative variables

Race, ethnicity, Indigenous or tribal status/affiliation
Gender, sex
Sexual orientation, marital/relationship status
Religious affiliation
Political affiliation, voting history
Medical conditions and diagnoses
Location-based attributes (e.g., birthplace, current residence [e.g., city], facility for medical treatments)
Physically observable traits (e.g., eye color)
Educational attributes (e.g., highest degree, field of degree, granting institution)
Qualitative professional attributes (e.g., job title, employer)
Competencies / proficiencies (e.g., spoken languages)
Qualitative ownership (e.g., do you own a house or not)

Quantitative variables

Year of birth, exact age
Physically observable traits (e.g., height)
Household composition (e.g., number of minors/children, number of pets/livestock)
Quantitative professional attributes (e.g., salary, years in position)
Quantifiable ownership (e.g., number of vehicles)

Examples of high-risk single variables

In some instances, a single variable may carry a high risk of being identifying on its own, or at least of significantly narrowing down the possible identities. This is often because a variable, or at least one value of that variable, is very specific or is an outlier. In many instances, a given value or variable may not be sufficient to definitively re-identify a single person, but it could still rule out 99% of the possibilities, so that only a little more information is needed for re-identification. Some examples are given below (all of these are based on real-world examples):

Numerical outlier (salary): multi-million dollar salaries at a university
Numerical outlier (household composition): more than six children in a household
Numerical outlier (ownership): more than four vehicles in a household
Numerical outlier (age): a 40-year-old professional NBA player
Unique values (job title): Open Research Coordinator for Data and Software (the job title of Bryan Gee, one of the maintainers of the LibGuide)

Context matters, which also makes it difficult to establish absolute rules around the treatment of demographic data. For example, large households are more common in some geographic regions (city/town, state/province, country) than in others.

Removal

The most robust way of reducing the sensitivity of data is to simply remove variables that are unnecessary for reproducing the results of associated publications. In some studies, there may be variables that are automatically or manually collected but that are not analyzed in a specific study. For example, most online survey platforms record the exact timestamp at which someone submitted a form, but this is often irrelevant unless data are being collected over a protracted period as part of a longitudinal study (although a timestamp could also be used, for instance, to differentiate students who filled out a course survey in class from those who did so after class). In other cases, some variables might have been collected with the intent to analyze them but were not analyzed for a certain project, although they could be useful for someone else or for another study. Those variables can also be removed.

If a single dataset may or will be used in multiple studies, with different sets of variables being analyzed, be aware of the potential for 'jig-sawing,' which is when someone is able to combine multiple datasets related to the same participants that, in isolation, are not identifying, but when recombined become identifying.
It is often useful to indicate when a variable was collected but has been entirely removed. This could be done in a variety of ways (e.g., list of variables in a README file, blank column with a column header in the data file, column with uniform value like 'REDACTED' in the data file).
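The two approaches to removal mentioned above (dropping a column entirely, or keeping the column with a uniform placeholder) can be sketched as follows. This is a minimal illustration that assumes the data are held as a list of dictionaries; the function name `remove_variables` and the column names are hypothetical.

```python
def remove_variables(rows, to_remove, placeholder=None):
    """Drop the named columns, or overwrite them with `placeholder` if one is given."""
    cleaned = []
    for row in rows:
        if placeholder is None:
            # Remove the column entirely.
            cleaned.append({k: v for k, v in row.items() if k not in to_remove})
        else:
            # Keep the column header but replace every value uniformly.
            cleaned.append(
                {k: (placeholder if k in to_remove else v) for k, v in row.items()}
            )
    return cleaned


# Hypothetical survey rows with a platform-recorded timestamp.
rows = [
    {"participant": "P01", "submitted_at": "2023-03-01T10:15:00", "score": 4},
    {"participant": "P02", "submitted_at": "2023-03-01T18:42:00", "score": 5},
]

dropped = remove_variables(rows, {"submitted_at"})
redacted = remove_variables(rows, {"submitted_at"}, placeholder="REDACTED")
print(dropped[0])
print(redacted[0])
```

If the column is dropped outright, listing it in a README file keeps a record that the variable was collected.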

Some variables must be removed before data are publicly disseminated. This includes data subject to legal restrictions, such as data covered by HIPAA or FERPA. In general, researchers should be careful about variables that are widely accepted to be sensitive (e.g., HIV status, rare medical diagnosis).

Generalization

Sometimes, variables are collected in one form (e.g., exact age) but analyzed in a different form (e.g., 10-year age bins). In these cases, the publicly shared data should use the form of the variable that was analyzed (which is usually the more generalized form). For variables that were not analyzed but that a research team wants to share, generalization might be an acceptable means of including them in a publicly disseminated dataset.

Both qualitative and quantitative variables can be generalized, although the methods will naturally vary. Some examples are provided for each:

Quantitative

Age: converting exact age (e.g., 37) to binned age (e.g., 30-39 years)
Salary: converting exact salary (e.g., $50,061) to binned salary (e.g., $50,000-59,999)
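The two quantitative examples above can be reproduced with simple integer arithmetic. This is a minimal sketch that assumes non-negative whole-number values and fixed-width bins; the function names and default bin widths are hypothetical, and a discipline's own binning standards should take precedence where they exist.

```python
def bin_value(value, width):
    """Return the inclusive (low, high) bounds of the fixed-width bin containing value."""
    low = (value // width) * width
    return low, low + width - 1


def age_bin(age, width=10):
    low, high = bin_value(age, width)
    return f"{low}-{high} years"


def salary_bin(salary, width=10_000):
    low, high = bin_value(salary, width)
    return f"${low:,}-{high:,}"


print(age_bin(37))        # 30-39 years
print(salary_bin(50061))  # $50,000-59,999
```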

Qualitative

Job titles: converting possibly one-of-a-kind titles (e.g., Open Research Coordinator for Data and Software) to more general categories / classes (e.g., Research Data Librarian)
Location: converting specific names to a more generic categorization (e.g., employment at Target to employment at large, multinational retail company)
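Qualitative generalization usually comes down to a lookup table that a research team builds by hand. The sketch below uses the two examples above as the only entries; the mapping names, the `generalize` function, and the "Other" fallback are hypothetical choices made for illustration.

```python
# Hand-built lookup tables; a real project would list every value in the dataset.
JOB_TITLE_CATEGORIES = {
    "Open Research Coordinator for Data and Software": "Research Data Librarian",
}
EMPLOYER_CATEGORIES = {
    "Target": "large, multinational retail company",
}


def generalize(value, mapping, fallback="Other"):
    """Replace a specific value with its general category, or with a fallback if unmapped."""
    return mapping.get(value, fallback)


print(generalize("Open Research Coordinator for Data and Software", JOB_TITLE_CATEGORIES))
print(generalize("Target", EMPLOYER_CATEGORIES))
print(generalize("Some unmapped title", JOB_TITLE_CATEGORIES))
```

Falling back to a generic label, rather than passing unmapped values through unchanged, avoids accidentally releasing a specific value that was missed when the table was built.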

Some disciplines may have standards, either for analysis or reporting, in terms of how generalization is done. If you are unsure whether these exist for your type of data, we recommend talking to some senior researchers and examining previously published datasets.

Transformation

Sometimes quantitative variables can be transformed in a way that preserves the information in the original data but presents it in a different, more generalized form. Some examples are provided below:

Birth year: can be converted to exact age (if the dates of data collection do not allow for retro-calculation) or binned age
Timestamps and dates: can be date-shifted by a uniform value that is not made public, which maintains the relationships between data points (182 days is usually considered the safe minimum)
- If there are multiple timestamps for one entry/participant (e.g., multiple medical treatment dates), these can be converted to the duration/interval between them; an alternative scheme is to set the oldest/first date as the "anchor" of 0 and then assign values to subsequent related dates based on the time since (e.g., 7 days)
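The date-shifting and anchor-date schemes described above can be sketched with Python's standard `datetime` module. The function names, the sample treatment dates, and the offset of 200 days backward are hypothetical; in practice the offset would be chosen privately and never published.

```python
from datetime import date, timedelta


def shift_dates(dates, offset_days):
    """Shift every date by the same number of days, preserving the intervals between them."""
    delta = timedelta(days=offset_days)
    return [d + delta for d in dates]


def days_since_first(dates):
    """Express each date as the number of days since the earliest date (the anchor, 0)."""
    anchor = min(dates)
    return [(d - anchor).days for d in dates]


# Hypothetical treatment dates for one participant.
treatments = [date(2021, 3, 1), date(2021, 3, 8), date(2021, 4, 5)]

SECRET_OFFSET = -200  # hypothetical value; keep the real offset private
shifted = shift_dates(treatments, SECRET_OFFSET)

print(days_since_first(treatments))  # [0, 7, 35]
print(days_since_first(shifted))     # [0, 7, 35]
```

Because the same offset is applied to every date, the intervals computed from the shifted dates match those computed from the originals.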

In general, there are a variety of ways to modify data to de-identify them. The fact that some kind of modification (e.g., jittering/adding noise, date-shifting, rounding, substituting synthetic data with a distribution identical to that of the original data) was performed should be clearly stated, although the specifics should not be made publicly available, to avoid retro-calculation of the original values. The original values should always be retained by a researcher or research group. Explicit mention of data modification to protect participants is essential; otherwise, potential re-users may assume that the data are "real-world" (e.g., Evans et al., 2023).