University of Texas Libraries

Research Data Services


Documentation Overview

Overview

This page provides an overview of the two most common forms of data documentation, a data dictionary and a README file, and explains why data documentation is essential, even if your data are associated with a paper.

Why do I have to describe my data when the paper is right there?

Researchers often wonder why they have to provide documentation with their data when those data were collected for, and are associated with, a research publication. There are a few reasons why an associated publication is not sufficient documentation:

  • Many openly available datasets have paywalled articles: Journal paywalls remain a barrier to accessing information. Even if you respond promptly to requests for your paper by email or through a platform like ResearchGate, the paywall still stands between a reader and the information, and not every researcher responds to such requests. If a dataset is openly available but the critical information needed to access and understand it appears only in a paywalled article, the dataset is, in effect, paywalled too.
  • Most publications describe interpretations of data, not data themselves: Data are not immutable and are inherently subject to the biases, preferences, and ideas of the people who collected them. Even so, data are relatively neutral compared to downstream products like papers, which present interpretations of data that can vary widely between individuals. Most of the scholarly literature analyzes data and then draws interpretations of varying robustness but does not describe the data themselves. Common examples of data attributes that are not described in papers include the meaning of variable formatting (e.g., empty cells; colored font or cells); abbreviations and acronyms used in column headers or as entered data; and conditions for access and reuse of the data.
  • Data may be of interest outside of their relationship to a paper: Many datasets can be reused in ways that go beyond replicating the results of the associated paper. Some data might be repurposed into larger datasets, reanalyzed in different ways (e.g., bibliometric studies; methodological surveys; literature reviews), or used for teaching. Some data can even be "consumed" by the broader public (e.g., videos, 3D models), especially when the data pertain to topics of broad public interest or when the data are about people themselves (for example, NASA data have been used for art). In these instances, someone who wants to reuse the data should not have to read a technical, potentially paywalled article (see first point) to find the necessary information.

 

What are data dictionaries?

Data dictionaries are structured descriptions of datasets or databases. They are sometimes called codebooks. These terms have a range of connotations and may have specific meanings in certain programs, research organizations, or disciplines. Data dictionaries are often tabular in structure and often describe tabular data, although they can be used for all types of data. Data dictionaries commonly include the following elements:

  • Variable name as given in the data file (e.g., 'geo')
  • Full / human-readable translation of the variable name (e.g., 'geographic region' for 'geo')
  • Definition / description of the variable (e.g., 'state, province, or similar geopolitical entity')
  • Units of measurement (if appropriate)
  • Allowable values (e.g., [0,122] for human age in years)

Data dictionaries should typically be saved in a non-proprietary format to maximize human and machine accessibility and readability. CSV, TSV, and plain text (.txt) files are the preferred formats for tabular data dictionaries. Plain text and PDF files are well suited to non-tabular data dictionaries. Excel files (a proprietary format) should be avoided unless specific formatting is required. Where the data themselves need to be in Excel format (e.g., where macros are essential), it is common to include the data dictionary as a separate tab (usually the first tab).
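
As a rough illustration of the elements listed above, the following Python sketch builds a small tabular data dictionary, serializes it as CSV, and checks a value against an allowable range. The column names, the second variable ('age') and its description, and the helper functions to_csv and is_allowed are hypothetical choices made for this example, not a required schema or standard.

```python
import csv
import io

# Hypothetical column names for the data dictionary (not a required standard).
FIELDS = ["variable", "label", "description", "units", "allowed_min", "allowed_max"]

# One row per variable in the data file. 'geo' follows the example in the text;
# 'age' and its description are made up for illustration.
DATA_DICTIONARY = [
    {
        "variable": "geo",
        "label": "geographic region",
        "description": "state, province, or similar geopolitical entity",
        "units": "",
        "allowed_min": "",
        "allowed_max": "",
    },
    {
        "variable": "age",
        "label": "age",
        "description": "human age at time of data collection",
        "units": "years",
        "allowed_min": "0",
        "allowed_max": "122",
    },
]


def to_csv(rows):
    """Serialize the data dictionary to CSV text (a non-proprietary format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def is_allowed(rows, variable, value):
    """Return True if value falls within the variable's allowable range.

    Variables without a numeric range are treated as unconstrained.
    """
    for row in rows:
        if row["variable"] == variable:
            if row["allowed_min"] == "" or row["allowed_max"] == "":
                return True
            return float(row["allowed_min"]) <= value <= float(row["allowed_max"])
    raise KeyError(f"{variable!r} is not described in the data dictionary")


csv_text = to_csv(DATA_DICTIONARY)
print(csv_text)
```

Saving csv_text to a file such as data_dictionary.csv alongside the data keeps the dictionary readable by both people and programs. The range check is only one way to use the "allowable values" element; categorical variables would need a list of permitted values instead.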

What are README files?

Most people have encountered a README file (although they might not have actually read it); one of the core best practices for distributing software of all sorts, regardless of whether it is free or not, is to provide a README file. This file typically includes information such as:

  • What files are included and how they relate to each other
  • Instructions for installing and operating the software
  • Copyright and licensing information
  • Known bugs, troubleshooting, and a changelog (version history / summary)
  • Contact information for the creator(s)/maintainer(s)

README files for data should contain similar information but will differ in certain areas where software-specific content is not relevant (e.g., how to install) or where there is data-specific content (e.g., information on standards and calibration). High-quality READMEs for research data should contain:

  • Author information (names, affiliations, ORCIDs, emails)
  • Funder information
  • Spatiotemporal information (e.g., years of data collection)
  • List of files with a short description of contents and relationship(s) to other files
  • Copyright and licensing information
  • Links to materials describing how data were collected and processed (e.g., filtering, cleaning); if none exist or are not sufficiently detailed, those details should be included in the README
  • Data-specific information (e.g., description of what blank cells or 'N/A' values mean; description of abbreviations and acronyms; units of measurement; scoring key)

README files should typically be formatted in a non-proprietary format to maximize human and machine accessibility and readability. Text (.txt) and Markdown (.md) files are the most common as they can be opened on any operating system and with a wide range of freely available programs. PDF files are also acceptable.
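
One possible way to start a README for a dataset is to generate a plain-text skeleton with one section per element listed above. The sketch below is only an illustration: the section headings, the placeholder prompts, and the function build_readme are hypothetical, and the bracketed prompts are meant to be replaced with real project details.

```python
# Section headings mirror the README elements listed above. The prompts are
# placeholders (hypothetical) that the dataset author is expected to replace.
README_SECTIONS = [
    ("AUTHORS", "names, affiliations, ORCIDs, emails"),
    ("FUNDING", "funder information"),
    ("SPATIOTEMPORAL COVERAGE", "e.g., years of data collection"),
    ("FILES", "each file, a short description, and its relationship to other files"),
    ("LICENSE", "copyright and licensing information"),
    ("METHODS", "links to, or a description of, data collection and processing"),
    ("DATA-SPECIFIC INFORMATION", "meaning of blank cells, abbreviations, units, scoring key"),
]


def build_readme(title, sections):
    """Return a plain-text README skeleton with one underlined heading per section."""
    lines = [title, "=" * len(title), ""]
    for heading, prompt in sections:
        lines.append(heading)
        lines.append("-" * len(heading))
        lines.append(f"[{prompt}]")
        lines.append("")
    return "\n".join(lines)


readme_text = build_readme("README for <dataset title>", README_SECTIONS)
print(readme_text)
```

The resulting text can be saved as README.txt (or adapted to Markdown as README.md) and filled in by hand.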

Need help with documentation?

Bryan Gee
he/him

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 Generic License.