LibGuides: How to Work with Sensitive Data: Reducing Data Sensitivity (Geospatial/Location)

Overview

Location-based data is another common source of sensitivity that may introduce risk for both human and non-human (e.g., archaeological sites, organisms) entities. Location-based data can come in various forms, ranging from precise GPS coordinates to postal information (e.g., zip codes) to proximity to a landmark (e.g., living within a 5-minute walk of Times Square in New York City). This page provides an overview of some strategies for reducing sensitivity of this type of data, divided into three general categories: coordinates; addresses/postal information; and physical entities/landmarks.

Coordinates

Coordinates are typically the most easily recognized location-based data with potential to be sensitive. Increasing precision of GPS instruments and routine incorporation of GPS into common consumer devices like cellphones make this data both ubiquitous and highly identifying in many instances. Many disciplines also emphasize reporting of high-precision coordinates as metadata for documentation of occurrences of important sites or organisms.

The majority of coordinate data generated by scholarly research pertains to non-humans, either to culturally relevant sites like archaeological localities or to non-human organisms like animals. Some common considerations include:

Is the site or organism at risk for poaching or over-collection (e.g., there is a known commercial market)?
- Conversely, if the organism is considered at-risk but not due to hunting/poaching/collection (e.g., habitat loss), providing precise coordinates may be important for furthering conservation efforts.
Is the organism immobile (sites are all presumably immobile)?
Is the organism or site culturally significant to a local group?
When were the location data collected?

Whether coordinates can be published, and if so, to what precision, is highly context-dependent. Consider the following examples:

Euphorbia obesa (sometimes called the 'baseball plant') is a flowering plant native to South Africa. It is a highly sought plant and is now classified as endangered because of over-collecting and poaching. As plants cannot move, reporting even a relatively general set of coordinates documenting a wild occurrence of E. obesa could permit a poacher to find and harvest individuals of this species.
Balaenoptera physalus (better known as the 'fin whale') is the second-largest living whale and can be found in nearly all of the world's oceans. It is one of the whale species that is commercially hunted, which contributes in part to its conservation designation as 'Vulnerable.' Given the extensive range of even somewhat restricted populations, reporting precise coordinates of a sighting is likely less risky than for an immobile plant (and might benefit some groups, such as whale-watchers).

Handling sensitive coordinate data

In general, it is advisable to reduce unnecessary precision (decimal places or seconds) in public reporting of coordinates. Another approach is to generalize points to a centroid or to use an alternative, generalized framework (e.g., township-range in the United States).
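As a rough illustration of the two approaches above, the following Python sketch rounds decimal-degree coordinates to fewer decimal places and collapses a set of points to a single centroid. The function names and the sample coordinates are hypothetical, and the right number of decimal places depends on context (0.01 degrees of latitude is roughly 1.1 km). The centroid here is a simple average, which is only a reasonable approximation for points that are close together and do not span the antimeridian or a pole.

```python
from statistics import mean


def reduce_precision(lat, lon, decimals=2):
    """Round decimal-degree coordinates to a fixed number of decimal places."""
    return round(lat, decimals), round(lon, decimals)


def centroid(points):
    """Return the arithmetic mean of a list of (lat, lon) pairs.

    Assumption: the points are close together, so a plain average is an
    acceptable stand-in for a true geographic centroid.
    """
    lats, lons = zip(*points)
    return mean(lats), mean(lons)


# Hypothetical occurrence records (made-up coordinates, illustration only)
sightings = [
    (30.28412, -97.73391),
    (30.29105, -97.74120),
    (30.27998, -97.72850),
]

# Option 1: keep one point per record, but at lower precision
generalized = [reduce_precision(lat, lon, decimals=2) for lat, lon in sightings]

# Option 2: report a single, coarsely rounded centroid for all records
center = reduce_precision(*centroid(sightings), decimals=1)

print(generalized)
print(center)
```

Rounding and centroids reduce precision but do not by themselves address the habitat-based or description-based identification risks discussed in this section.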

Generalizing location-based data does not guarantee that the location is anonymized. Many organisms are restricted to certain types of habitat such that even if coordinates are highly generalized, it may be possible for someone to confirm and essentially locate the organism because such habitat is rare in a broad geographic area (e.g., bodies of water, caves, certain types of forest or vegetation). Sites with generalized coordinates may be similarly identifiable if they are shown (e.g., photos) or described as being near prominent features (e.g., buttes, canyons).

Coordinate data for human participants is less common than other location-based data and can rarely be shared publicly, so it requires careful handling. For both humans and non-human entities, there may be various legal prohibitions imposed by funding, regulatory, or collaborative entities on sharing precise coordinate data (e.g., federal land management agencies, personal data protection frameworks).

Postal information

In addition to coordinates, postal information can also be identifying and thus sensitive. It is relatively intuitive that full addresses, especially those that relate to a small group of people or a single person (e.g., a residential address), are highly sensitive. However, other attributes of postal information, such as zip codes and town/city names, can also be identifying, especially when combined with either intrinsic information about those locations (e.g., racial demographics) or other attributes.

Zip code or equivalent postal code: Although some postal codes encompass large populations, others can be highly specific (e.g., referring to a specific organization like 78712 [only for UT Austin]) or encompass a very small population (e.g., 98222 is for Blakely Island, a privately owned island in Washington state with a population < 50 as of the 2020 census).
Administrative unit names: In the same vein as zip codes, some cities are very populous (e.g., Delhi, Mexico City, Tokyo) such that being listed as a resident of, visitor to, or employee in that city is unlikely to be identifying. In other instances, small towns are likely to be highly identifying. These considerations apply across a range of administrative levels (e.g., census tracts, counties, states or provinces).

Handling postal data

For zip codes or similar postal codes, it may be possible to truncate the value to preserve a more generalized form of the location. For zip codes, one option is truncating to the first three digits (e.g., 787 for 78712), as these digits represent only the mail sorting and distribution center for the area.
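The truncation described above could be sketched in Python as follows. The function name `truncate_zip` is hypothetical, and the sketch assumes U.S. five-digit codes (optionally in ZIP+4 form) stored as strings, since storing them as numbers drops leading zeros. Some three-digit prefixes may still cover small populations, so truncation alone may not be sufficient in every case.

```python
def truncate_zip(zip_code, keep=3):
    """Return the first `keep` digits of a 5-digit ZIP code or ZIP+4 string."""
    # Drop any +4 extension (e.g., "98222-1234" -> "98222")
    digits = str(zip_code).strip().split("-")[0]
    if len(digits) != 5 or not digits.isdigit():
        raise ValueError(f"not a 5-digit ZIP code: {zip_code!r}")
    return digits[:keep]


print(truncate_zip("78712"))       # first three digits of the example above
print(truncate_zip("98222-1234"))  # ZIP+4 input is reduced the same way
```

Rejecting malformed values rather than guessing at them keeps unexpected entries (e.g., a four-digit code that lost its leading zero) visible for manual review.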

For qualitative location-based data like names, there are a few options.

Categorize units: It may be possible to generalize the geographic unit into categories (e.g., 'urban' vs. 'rural' or binning cities by coarse-grained population size and using those bins in lieu of names).
Replace with a qualitative descriptor: Especially in unstructured data, it may be possible to substitute a specific name (e.g., Houston) with a descriptor that preserves some generic information (e.g., 'large metropolitan city in Texas'). This can potentially allow for customized modification of different cities (i.e., it avoids the need for a formulaic approach, as would probably be needed in structured data).
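For structured data, the categorization option above might look something like the following Python sketch, which replaces city names with coarse population bins. The bin boundaries, the place names, and the population figures are all made up for illustration; real thresholds should be chosen for the dataset at hand and real figures taken from a census source.

```python
def population_bin(population):
    """Map a population count to a coarse, illustrative size category."""
    if population < 10_000:
        return "small (<10k)"
    if population < 100_000:
        return "medium (10k-100k)"
    if population < 1_000_000:
        return "large (100k-1M)"
    return "very large (1M+)"


# Hypothetical lookup table of place names and populations
city_population = {"Cityville": 2_300_000, "Smalltown": 4_200}

records = [
    {"id": 1, "city": "Cityville"},
    {"id": 2, "city": "Smalltown"},
]

# Drop the city name entirely and keep only the size category
generalized = [
    {"id": r["id"], "city_size": population_bin(city_population[r["city"]])}
    for r in records
]

print(generalized)
```

Whether a bin is coarse enough depends on how many places fall into it; a bin containing a single town in the study area is no less identifying than the town's name.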

Landmarks/Physical Entities

In addition to postal information, names of landmarks and physical entities are another form of location-based data. These are more likely to occur in unstructured and qualitative datasets where data collection is not always narrowly constrained (e.g., open-ended, conversational questions). Predictably, people often mention notable landmarks to help situate others, whether this is a globally recognized landmark like the Eiffel Tower or a locally recognized landmark like a local park. A landmark need not be an officially designated or labeled entity like a business or monument. These landmarks might be related to where a person lives, works, regularly commutes, or experienced an event. Businesses and other types of branded, physical locations are among the most common landmarks:

Unique entity: A one-of-a-kind entity that is unambiguously identifiable.
- Example: Texas Science & Natural History Museum
Small or uncommon chain: Entities with a small number of locations at a local scale (could be a local brand or not)
- Example: Proud Mary Coffee has one storefront in Austin and one storefront in Portland (OR); it may be possible to easily discern which one is being referenced from other context (e.g., mention of a street name), functionally rendering it a "one-of-a-kind" landmark.
- Example: Buc-ee's is regarded as a regional chain such that simply mentioning that one was at a Buc-ee's in Texas is not identifying without additional context. However, there is only one Buc-ee's in some other states (e.g., Colorado), where a mention would point to a single location.
Large national or international chain: Entities with a large number of locations; note that some large or geographically expansive entities may have a small number of locations at a local scale.
- Example: there are nearly a dozen Wal-Mart Supercenters in Austin, making mere mention of one without additional context likely not identifying.
- Example: IKEA is an international company, but there is only one in/around Austin (Round Rock); mentioning an IKEA in the Austin area would be immediately identifiable.

It's important to keep in mind that physical entities that don't have an official name/designation (e.g., a particularly large tree) or that are only named generically (e.g., "the market") can still be precisely identifying based on local context (e.g., if there is only one market in a town). If researchers are not local to the area, this may be harder to know.

Handling sensitive physical entity data

When location-based data is mentioned as an aside and has little to no relevance to the primary dataset, it may be possible to fully remove location references.
When location-based data is important only as a general attribute (e.g., to indicate that someone goes to the market once a week on Sunday, but the specific market is not important), it may be possible to edit the data to generalize it (e.g., to change mentions of "H-E-B" in a transcript to "[market]").
When location-based data is essential in the level of detail provided, researchers should assess whether other attributes might lead to the identification of a participant. Knowing that someone visited a major tourist location like the Grand Canyon is unlikely to be identifying without additional information (nearly 5 million people visited the Grand Canyon in 2024). If location data is essential but carries a moderate to high risk of re-identification, restricted access may be the only appropriate approach.
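For the generalization case above, a replacement pass over a transcript could be sketched in Python as follows. The mapping of names to placeholders is hypothetical and would need to be built by someone familiar with the local context; the sample sentence is invented. A pass like this only catches names that are listed, so manual review is still needed for unnamed or generically named places.

```python
import re

# Hypothetical mapping from specific names to generic placeholders
REPLACEMENTS = {
    "H-E-B": "[market]",
    "Proud Mary Coffee": "[coffee shop]",
}


def generalize_mentions(text, replacements=REPLACEMENTS):
    """Replace each listed name (case-insensitive) with its placeholder."""
    if not replacements:
        return text
    # Try longer names first so a short name cannot pre-empt a longer one
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        "|".join(re.escape(name) for name in names), flags=re.IGNORECASE
    )
    lookup = {name.lower(): repl for name, repl in replacements.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


transcript = "I go to H-E-B every Sunday, then grab a drink at Proud Mary Coffee."
print(generalize_mentions(transcript))
# prints: I go to [market] every Sunday, then grab a drink at [coffee shop].
```

Using bracketed placeholders, as in the "[market]" example, makes it clear to later readers that the text was edited.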