Metadata is often an afterthought in the process of sharing research data and software, but it can be one of the most important parts of any deposit because it is the primary means of ensuring that data are findable. This page provides best-practice guidance and real-world examples of rich metadata.
Many researchers wonder why they have to provide high-quality metadata for research data and software, since these outputs are usually (though not always) associated with a scholarly output such as a preprint, book, or journal article that links to the related data and/or software. Quality metadata is essential for ensuring Findability (the 'F' in the FAIR Principles), which requires that materials be findable by someone who does not already know that they exist. In other words, data and software should be discoverable by someone who hasn't read the related output and who may not even care about it. Let's consider a real-world example to demonstrate why metadata are important.
Consider a Google search for chocolate chip cookie recipes. Whether you make this search frequently or this is the first time you've looked it up, you are virtually assured of finding a website with a recipe that you didn't know existed before you searched. Metadata like the visible text on the page and the invisible tags in the HTML code help search engines filter the vast expanse of digital content for your specific query and omit irrelevant or minimally relevant content. For example, if you are looking for chocolate chip cookie recipes, you probably don't want to see recipes for banana bread at the top of your search results. If someone labeled their recipe only as something generic like "Mom's recipe" or "Cookie recipe," it would be harder to find and would likely rank lower in search results than more specific pages.
Like manuscript titles, dataset titles should be descriptive and specific. Generic titles like 'Raw data' or 'Data for paper' (yes, these are real, frequently used dataset titles) do not help anyone find the data or understand what a deposit contains. Similarly, a common practice is to use the title of the associated manuscript as the title of the dataset. This is a poor practice for several reasons:
The ideal dataset title should be descriptive but concise (i.e., not a full sentence) and include descriptors and details that will help it stand out from similar datasets. Some examples of good titles in UT-authored datasets:
There can be instances in which creating a unique dataset title is difficult because the dataset pertains to a common topic (e.g., Drosophila genomics). It is therefore also acceptable in some instances to use the manuscript title with a prefix. Some examples of UT-authored datasets:
It's also possible to use a more descriptive prefix. Examples:
Just like preprints, theses/dissertations, books, and journal articles, research datasets should have informative, specific keywords. These should be words that do not appear in the title or the abstract/description (if one is present); those fields are already indexed, so repeating their terms is redundant. The keywords should pertain to the dataset itself rather than to any associated paper.
Many data repositories, especially generalist ones, have a separate field for selecting one or more disciplines (e.g., biological sciences, social sciences), so these types of terms often do not need to be included as keywords.
It is common for researchers to simply reuse words from the title or abstract, but this is not a good way to pick keywords! Titles and abstracts are already scraped by web search engines, so choosing words that are not found in these fields increases the discoverability of the dataset instead of duplicating what is already indexed. Using a different form of a word is perfectly fine, though (e.g., if the title uses 'isotopic,' a keyword could be 'isotope,' which is probably more likely to be queried in a search).
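The keyword guidance can be turned into a quick self-check before depositing. The following is a minimal sketch in Python, not a feature of any repository: the function name `redundant_keywords` and the sample title and keywords are hypothetical, invented for illustration. It assumes a keyword counts as redundant only when every word in it already appears verbatim in the title or description, so a different form of a word (such as 'isotope' when the title uses 'isotopic') is not flagged.

```python
import re


def redundant_keywords(keywords, title, description=""):
    """Return the keywords whose words all already appear in the title or description.

    Matching is case-insensitive and on whole words only, so a different
    form of a word (e.g., 'isotope' vs. 'isotopic') is not flagged.
    """
    indexed_words = set(re.findall(r"[a-z0-9]+", f"{title} {description}".lower()))
    flagged = []
    for keyword in keywords:
        words = re.findall(r"[a-z0-9]+", keyword.lower())
        if words and all(word in indexed_words for word in words):
            flagged.append(keyword)
    return flagged


# Hypothetical dataset title and candidate keywords, invented for illustration.
title = "Isotopic composition of groundwater samples"
candidates = ["isotopic", "isotope", "groundwater", "aquifer recharge"]
print(redundant_keywords(candidates, title))  # prints ['isotopic', 'groundwater']
```

Any keyword the check returns is a candidate for replacement with a term that does not already appear in the indexed fields.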

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

