This page provides an overview of the two most common forms of data documentation, a data dictionary and a README file, and explains why data documentation is essential, even if your data are associated with a paper.
Researchers often wonder why they have to provide documentation with their data when those data were collected for, and are associated with, a research publication. There are a few reasons why an associated publication is not sufficient documentation:
Data dictionaries are structured descriptions of datasets or databases. They are sometimes called codebooks. These terms have a range of connotations and may have specific meanings in certain programs, research organizations, or disciplines. Data dictionaries are often tabular in structure and often describe tabular data, although they can be used for all types of data. Data dictionaries commonly include the following elements:
Data dictionaries should typically be saved in a non-proprietary format so that both people and software can open and read them. CSV, TSV, and plain text (.txt) files are the preferred formats for tabular data dictionaries. Plain text and PDF files work well for non-tabular data dictionaries. Excel files (a proprietary format) should be avoided unless specific formatting is required. Where the data themselves need to be in Excel format (e.g., where macros are essential), it is common to include the data dictionary as a separate tab, usually the first one.
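As a rough illustration of a tabular data dictionary in a non-proprietary format, the sketch below builds one row per variable and writes it out as CSV using only the Python standard library. The column names (`variable`, `description`, `example_value`, `n_missing`) and the sample records are assumptions made for this example, not a required set of elements; adapt them to your own dataset and discipline.

```python
import csv
import io


def build_data_dictionary(rows, descriptions=None):
    """Return one dictionary entry per column found in `rows`.

    `rows` is a list of dicts (one per record). `descriptions` is an
    optional, hand-written mapping of column name to description.
    """
    descriptions = descriptions or {}

    # Collect column names in the order they first appear.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    entries = []
    for name in columns:
        # Treat None and empty strings as missing values.
        values = [row[name] for row in rows if row.get(name) not in (None, "")]
        entries.append({
            "variable": name,
            "description": descriptions.get(name, ""),
            "example_value": values[0] if values else "",
            "n_missing": len(rows) - len(values),
        })
    return entries


def to_csv(entries):
    """Serialize the entries as CSV text (hypothetical column layout)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["variable", "description", "example_value", "n_missing"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


# Made-up sample records, for illustration only.
sample_rows = [
    {"site_id": "A1", "temp_c": "12.5"},
    {"site_id": "A2", "temp_c": ""},
]
sample_entries = build_data_dictionary(
    sample_rows, {"temp_c": "Water temperature in degrees Celsius"}
)
sample_csv = to_csv(sample_entries)
```

The resulting text can be saved as a `.csv` file alongside the data. Descriptions still need to be written by someone who knows the dataset; the script only fills in what can be derived mechanically.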
Most people have encountered a README file, although they might not have actually read it. Providing one is a core best practice for distributing software of any kind, whether free or commercial. This file typically includes information such as:
README files for data should contain similar information but will differ in certain areas where software-specific content is not relevant (e.g., how to install) or where there is data-specific content (e.g., information on standards and calibration). High-quality READMEs for research data should contain:
README files should typically be formatted in a non-proprietary format to maximize human and machine accessibility and readability. Text (.txt) and Markdown (.md) files are the most common as they can be opened on any operating system and with a wide range of freely available programs. PDF files are also acceptable.
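One possible way to keep README files consistent across datasets is to generate a plain-text skeleton and fill it in by hand. The sketch below is only an illustration: the function name `render_readme` and the section headings in the example are hypothetical placeholders, not a prescribed list of contents.

```python
def render_readme(title, sections):
    """Render a plain-text README from a title and a dict of sections.

    `sections` maps a heading to its body text. Empty bodies are
    marked TODO so that gaps are easy to spot before sharing.
    """
    lines = [title, "=" * len(title), ""]
    for heading, body in sections.items():
        lines.append(heading.upper())
        lines.append(body.strip() or "TODO")
        lines.append("")
    return "\n".join(lines)


# Hypothetical headings and content, for illustration only.
readme_text = render_readme(
    "Stream data",
    {
        "Contact": "Jane Doe, jane@example.org",
        "Standards and calibration": "",
    },
)
```

Writing `readme_text` to a file named `README.txt` produces a document that opens on any operating system without special software.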
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.