Text-based datasets can contain dozens to hundreds of variables. These datasets can include both structured (e.g., tabular) and unstructured (e.g., interview transcripts) data. Even if the text-based data you are including are more like accessory metadata than the focal research dataset (e.g., a spreadsheet of participant demographics), this information can still be sensitive. In large datasets (thousands of entries or more) or datasets that sample broad scales (e.g., country-level sampling), many variables or attributes are not identifying in isolation (e.g., there are millions of Americans who identify as 'male'). However, in combination with other variables or other information (e.g., specifying a small geographic region of sampling in the Methods section of an associated paper), a given variable may be the "missing piece" that leads to re-identification of a participant (e.g., a 43-year-old white male who works in real estate in Jackson Hole, Wyoming and is married with 2 children).
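As a rough illustration of how combinations of variables narrow down identities, the sketch below counts how many records share each combination of values. The records, field names, and the `smallest_group_size` helper are all hypothetical and invented for this example; it is a minimal check, not a complete disclosure-risk assessment.

```python
from collections import Counter

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"age": 43, "gender": "male", "occupation": "real estate", "town": "Jackson Hole"},
    {"age": 43, "gender": "male", "occupation": "teacher", "town": "Jackson Hole"},
    {"age": 29, "gender": "female", "occupation": "teacher", "town": "Jackson Hole"},
    {"age": 29, "gender": "female", "occupation": "teacher", "town": "Jackson Hole"},
]


def smallest_group_size(rows, columns):
    """Return the size of the smallest group of rows sharing the same values in `columns`."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(counts.values())


# One variable alone: every value is shared by at least two records.
print(smallest_group_size(records, ["gender"]))  # 2

# Several variables combined: at least one record becomes unique.
print(smallest_group_size(records, ["age", "gender", "occupation"]))  # 1
```

A smallest group size of 1 means that at least one participant is unique on that combination of variables, which is the situation described above.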
The following is a non-exhaustive list of common variables that, unlike direct identifiers (e.g., Social Security Number, name), are typically not identifying on their own but may become identifying in tandem with other information. It's important to keep in mind that the context of both the scope of the data (the target group) and the data themselves (the sampled group) influences the relative likelihood of a variable contributing to re-identification.
Qualitative variables
Quantitative variables
In some instances, a single variable may carry a high risk of being identifying on its own, or at least significantly narrowing down the possible identities. This is often because a variable or at least one value in a variable is very precise/specific or is an outlier. In many instances, a given value or variable may not be sufficient to definitively re-identify a single person, but it could still rule out 99% of the possibilities and thus require only a little bit more information to re-identify. Some examples are given below (all of these are based on real-world examples):
Context matters, which also makes it difficult to establish absolute rules around the treatment of demographic data. For example, large households are more common in some geographic regions (city/town, state/province, country) than in others.
The most robust way of reducing the sensitivity of data is to simply remove variables that are unnecessary for reproducing the results of associated publications. In some studies, there may be variables that are automatically or manually collected but that are not analyzed in a specific study. For example, most online survey platforms record the exact timestamp at which someone submitted a form. This is often irrelevant for a study unless data are being collected over a protracted time as part of a longitudinal study, or unless the timing itself matters (e.g., to differentiate students who filled out a course survey in class from those who did so after class). In other cases, some variables might have been collected with the intent to analyze them but were not analyzed for a certain project, although they could be useful for someone else or for another study. In that case, those data can also be removed.
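A minimal sketch of this kind of removal is shown below, using only the Python standard library. The column names, the sample rows, and the `keep_columns` helper are hypothetical; the list of analyzed columns is an assumption that would need to be set per study.

```python
import csv
import io

# Hypothetical raw survey export; column names and values are invented.
raw_csv = """participant_id,submitted_at,ip_address,age_group,score
p01,2023-03-02T14:05:11,192.0.2.10,30-39,17
p02,2023-03-02T14:07:48,192.0.2.44,40-49,21
"""

# Assumption: only these columns are needed to reproduce the published results.
ANALYZED_COLUMNS = ["participant_id", "age_group", "score"]


def keep_columns(csv_text, columns):
    """Return rows restricted to `columns`, dropping every other variable."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{c: row[c] for c in columns} for row in reader]


public_rows = keep_columns(raw_csv, ANALYZED_COLUMNS)
for row in public_rows:
    print(row)
```

Listing the columns to keep, rather than the columns to drop, means that a variable added to the export later (e.g., by the survey platform) is excluded from the public file unless someone deliberately adds it.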
Some variables must be removed before data are publicly disseminated. This includes data protected by laws such as HIPAA and FERPA. In general, researchers should be careful about variables that are widely accepted to be sensitive (e.g., HIV status, rare medical diagnosis).
Sometimes, variables are collected in one form (e.g., exact age) but analyzed in a different form (e.g., 10-year age bins). In these cases, the publicly shared data should use the form of the variable that was analyzed (which is usually the more generalized form). For variables that were not analyzed but that a research team wants to share, generalizing the variables might be an acceptable means of including them in a publicly disseminated dataset.
Both qualitative and quantitative variables can be generalized, although the methods will naturally vary. Some examples are provided for each:
Quantitative
Qualitative
Some disciplines may have standards, either for analysis or reporting, in terms of how generalization is done. If you are unsure whether these exist for your type of data, we recommend talking to some senior researchers and examining previously published datasets.
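The sketch below illustrates one way each kind of generalization might look: binning exact ages into 10-year bins (quantitative) and collapsing rare categories into an "Other" group (qualitative). The function names, the bin width, the top-coding threshold of 80, and the minimum count of 2 are assumptions chosen for illustration, not recommended values or disciplinary standards.

```python
from collections import Counter


def age_to_bin(age, width=10, top_code=80):
    """Map an exact age to a bin label such as '40-49'.

    Ages at or above `top_code` share a single open-ended bin (an assumption
    made here so that very old participants are not left in tiny bins).
    """
    if age >= top_code:
        return f"{top_code}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def collapse_rare(values, min_count=2, other_label="Other"):
    """Replace categories that occur fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical values, invented for illustration.
ages = [43, 47, 19, 82]
occupations = ["teacher", "teacher", "real estate", "teacher"]

print([age_to_bin(a) for a in ages])  # ['40-49', '40-49', '10-19', '80+']
print(collapse_rare(occupations))     # ['teacher', 'teacher', 'Other', 'teacher']
```

If the analysis in the associated publication used a particular set of bins or categories, those should take precedence over the defaults in a sketch like this one.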
Sometimes quantitative variables can be transformed in a way that preserves the original data but that presents it in a different, more generalized form. Some examples are provided below:
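One possible reading of such a transformation, sketched here as an assumption rather than as a method named in this guide, is to replace calendar dates with the number of days elapsed since each participant's first recorded date. The intervals between events are kept, but the calendar dates themselves no longer appear. The participant IDs, dates, and the `to_elapsed_days` helper are hypothetical.

```python
from datetime import date

# Hypothetical visit dates per participant, invented for illustration.
visits = {
    "p01": [date(2022, 1, 10), date(2022, 1, 24), date(2022, 3, 1)],
    "p02": [date(2022, 2, 5), date(2022, 2, 6)],
}


def to_elapsed_days(dates):
    """Express each date as days since the earliest date in the list."""
    baseline = min(dates)
    return [(d - baseline).days for d in dates]


relative = {pid: to_elapsed_days(ds) for pid, ds in visits.items()}
print(relative)  # {'p01': [0, 14, 50], 'p02': [0, 1]}
```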
In general, there are a variety of ways to modify data to de-identify them. The fact that some kind of modification (e.g., jittering/adding noise, date-shifting, rounding, substituting synthetic data with identical distribution to the original data) was performed should be clearly stated, although the specifics should not be made publicly available, to avoid retro-calculation of the original values. The original values should always be retained by a researcher or research group. Explicit mention of data modification to protect participants is essential; otherwise, potential re-users may assume that the data are "real-world" (e.g., Evans et al., 2023).
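Two of the modifications mentioned above, date-shifting and jittering, might be sketched as follows. The function names, the 30-day shift window, the noise scale, and the sample values are all hypothetical. The seed appears here only so the example is repeatable; in practice the seed, offsets, and noise parameters are exactly the specifics that should stay private.

```python
import random
from datetime import date, timedelta


def shift_dates(dates_by_participant, max_shift_days, rng):
    """Shift all of a participant's dates by one random offset.

    Using a single offset per participant keeps the intervals between
    that participant's dates unchanged.
    """
    shifted = {}
    for pid, dates in dates_by_participant.items():
        offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
        shifted[pid] = [d + offset for d in dates]
    return shifted


def jitter(values, scale, rng):
    """Add uniform random noise in [-scale, +scale] to each value."""
    return [v + rng.uniform(-scale, scale) for v in values]


# Illustration only: a real seed would be kept private, not published.
rng = random.Random(12345)

# Hypothetical data, invented for illustration.
visits = {"p01": [date(2022, 1, 10), date(2022, 1, 24)]}
heights_cm = [170.0, 182.5]

shifted_visits = shift_dates(visits, max_shift_days=30, rng=rng)
noisy_heights = jitter(heights_cm, scale=1.0, rng=rng)

# The 14-day gap between the two visits survives the shift.
print((shifted_visits["p01"][1] - shifted_visits["p01"][0]).days)
```

Whichever modification is used, the unmodified values would be stored separately by the research team, and the public documentation would state that dates were shifted and values jittered without giving the offsets or noise scale.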
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.