LibGuides: Research Data Services: Documentation How-To

Overview

This page provides an overview of some best practices and tips for creating good data documentation.

Data dictionaries

In many instances, you may not need to create a data dictionary from scratch, or maybe even at all! As part of designing your experiment and guiding data collection, you may already have created a file that describes important information, such as how different variables are coded (e.g., 0 = no; 1 = yes; 2 = not answered), units of measurement, or allowable values. For data collected through a survey or similar mechanism, a text file with the questions may also work as a data dictionary (provided that allowable answers are listed for non-open-ended questions). Some software programs (e.g., SPSS) can also automatically generate a data dictionary or codebook from a given dataset.
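If your software does not generate a codebook for you, a short script can produce a first draft to fill in by hand. The Python sketch below is one possible approach, not a description of how SPSS or any other program works. The function name `draft_data_dictionary`, the output column names, and the choice to treat 'N/A' and empty cells as missing are all assumptions made for illustration.

```python
import csv
import io


def draft_data_dictionary(csv_text, max_levels=10):
    """Return one draft entry per column of a CSV file.

    Each entry records the column name, a guessed type, and the values
    actually observed. Description and Units are left blank on purpose:
    only the researcher can supply them.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        # Assumption: empty cells and 'N/A' mark missing values.
        values = [(row.get(name) or "").strip() for row in rows]
        values = [v for v in values if v not in ("", "N/A")]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            numbers = None  # at least one value is not numeric
        if numbers:
            kind = "numeric"
            observed = f"[{min(numbers)}, {max(numbers)}]"
        else:
            kind = "text"
            observed = "; ".join(sorted(set(values))[:max_levels])
        entries.append({
            "Variable": name,
            "Type": kind,
            "Observed values": observed,
            "Description": "",
            "Units": "",
        })
    return entries
```

Note that the observed range of a variable is not the same as its allowable range, so a draft like this still needs review before it is shared.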

Below is a short example of a hypothetical data dictionary. Refer to the right sidebar for some real-world examples from UT Austin researchers.

Variable	Description	Allowable values	Units	Notes
SL	Skull length	[0,15]	cm	only measured for complete specimens; 'N/A' for incomplete specimens.
m	Mass	[0,55]	kg	only measured for complete specimens; 'N/A' for incomplete specimens.
taxon	Species of Dasypus	['D. hybridus'; 'D. kappleri'; 'D. pilosus']	N/A	none
stage	Developmental stage	['pup'; 'juvenile'; 'sub-adult'; 'adult']	N/A	none
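
A data dictionary like the one above can also be used to check a dataset for entry errors. The Python sketch below is a minimal illustration that assumes each row has been read as a dictionary of strings (for example with `csv.DictReader`). The names `NUMERIC_RANGES`, `CATEGORIES`, and `check_row` are hypothetical; only the SL, m, and stage rules are encoded, and taxon would follow the same pattern as stage.

```python
# Rules transcribed from the example data dictionary above.
NUMERIC_RANGES = {"SL": (0.0, 15.0), "m": (0.0, 55.0)}  # cm, kg
CATEGORIES = {"stage": {"pup", "juvenile", "sub-adult", "adult"}}
MISSING = "N/A"  # recorded for incomplete specimens


def check_row(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    for name, (low, high) in NUMERIC_RANGES.items():
        value = row.get(name, "")
        if value == MISSING:
            continue  # explicitly allowed for SL and m
        try:
            number = float(value)
        except ValueError:
            problems.append(f"{name}: {value!r} is not a number or {MISSING!r}")
            continue
        if not low <= number <= high:
            problems.append(f"{name}: {number} is outside [{low}, {high}]")
    for name, allowed in CATEGORIES.items():
        value = row.get(name, "")
        if value not in allowed:
            problems.append(f"{name}: {value!r} is not an allowable value")
    return problems
```

Calling `check_row` on every row and printing any non-empty result gives a quick list of values to double-check before depositing the data.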

Some additional guides to creating data dictionaries:

README files

The information that should be included in a README will vary by discipline and data type, but some information should always be provided. Essential metadata to include are:

Author information (names, affiliations, ORCIDs, indication of who to contact about the dataset)
Terms of use (copyright/licensing, any other restrictions on re-use or re-distribution)
A list of files with brief descriptions of what each file (or group of similar files) contains
- If your dataset contains many folders, the folder organization should also be explained
Any information needed to process or analyze the data (either to reproduce results or for novel analyses); this could include information on any associated scripts and information on programs that can access certain formats (especially for proprietary/less common formats)
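
The essential items above can be assembled into a plain-text README skeleton. The Python sketch below is one possible layout, not a required or standard format; the function name `build_readme`, the section headings, and the argument names are all assumptions made for illustration.

```python
def build_readme(title, authors, contact, terms_of_use, files, processing_notes):
    """Assemble a plain-text README from the essential metadata.

    authors: list of dicts with 'name', 'affiliation', and 'orcid' keys
    files:   dict mapping each file (or folder) path to a brief description
    """
    lines = [title, "=" * len(title), ""]

    lines.append("AUTHORS")
    for author in authors:
        lines.append(
            f"- {author['name']}, {author['affiliation']} (ORCID: {author['orcid']})"
        )
    lines += [f"Contact: {contact}", ""]

    lines += ["TERMS OF USE", terms_of_use, ""]

    lines.append("FILES")
    for path, description in sorted(files.items()):
        lines.append(f"- {path}: {description}")
    lines.append("")

    lines += ["PROCESSING AND ANALYSIS", processing_notes, ""]
    return "\n".join(lines)
```

The returned string can be written to a file named README.txt in the top-level folder of the dataset so that it is downloaded together with the data files.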

Additional metadata will vary by the type of data but could include information such as:

Links to original sources of data if data were re-used from somewhere else
Links to related deposits, articles, preprints, etc.
File-level information (e.g., description of column headers for a CSV file)
Version history, if a dataset has been versioned
Any modifications made to a dataset from its 'raw' form that would be important (e.g., if modifications have been made to protect patients' identities)

Q: Isn't some of this information redundant with what I fill out in the submission form?

A: Yes. Researchers often have to enter metadata for data deposits in repositories, much as they enter metadata for manuscripts in journal systems. Most of this information (like author names) should still go in the README. One of the major benefits of a README file is that it is downloaded with the data files, which makes it harder to separate or lose that metadata; most people are not going to download the XML of a webpage, or take a screenshot and save it with the data files, for example.

Cornell has an excellent guide to creating README-style metadata, including a template.

Need help with documentation?

Bryan Gee

he/him


Subjects: Data Management

Need examples?

Below are some examples of datasets published by UT researchers with good data dictionaries:

Below are some examples of datasets published by UT researchers with good README files: