LibGuides: Research Data Services: Share and Preserve

What should I share?

It may not be required or possible to share all data collected as part of a project. At a bare minimum, you should share the data (and associated materials such as code) necessary to reproduce the findings of any published articles, books, or other written outputs (e.g., technical reports). Some funders may stipulate that data be shared by the completion date of the supporting grant, even if an associated output is not yet published. Studies that generate particularly large volumes of data (terabytes to petabytes) may not be able to share all of those data in a standard, publicly accessible repository. In those instances, it may be acceptable to share only the more manageable outputs (e.g., intermediate data instead of unprocessed raw data), provided that you explain how the full dataset can be accessed (e.g., tape storage, AWS Glacier). Data that cannot be shared publicly should still be hosted on a secure platform (rather than only on local storage) and, where possible, follow the 3-2-1 backup rule: keep at least three copies, on two different types of storage media, with one copy off-site. Researchers who anticipate generating large volumes of data should address these scenarios in their Data Management and Sharing Plan (DMSP).
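The 3-2-1 backup practice mentioned above (at least three copies, on at least two types of storage media, with at least one copy off-site) can be checked against a written backup plan. The following is only a minimal sketch: the BackupCopy class, the meets_3_2_1 function, and the example storage locations are hypothetical names chosen for illustration, not part of any library or university service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset (hypothetical structure for illustration)."""
    label: str     # where the copy lives, e.g. "lab workstation"
    media: str     # type of storage medium, e.g. "local disk"
    offsite: bool  # True if stored at a different physical site


def meets_3_2_1(copies):
    """Return True if the plan has >= 3 copies, >= 2 media types, >= 1 off-site copy."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Example plan; the locations are assumptions, not recommendations.
plan = [
    BackupCopy("lab workstation", "local disk", offsite=False),
    BackupCopy("department server", "network storage", offsite=False),
    BackupCopy("cloud archive", "cloud object storage", offsite=True),
]

print(meets_3_2_1(plan))
```

A check like this only confirms that a plan is complete on paper; it does not verify that each copy is intact or up to date.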

Where should I share data?

We have several separate pages that provide guidance on data repositories and on how to determine the best location for sharing your data.

What should I keep?

Regardless of whether you are able to publish all of your data in a repository, you may want to retain a local copy as well. Be selective about what data you plan to retain, as every file carries some overhead in long-term storage and maintenance. As a general rule of thumb, it’s a good idea to:

keep anything irreproducible, such as observations specific to a particular time and place (e.g., written or video field notes);
retain results that are tied to a specific publication or presentation;
review the university's Record Retention Schedule, which adheres to federal and state legislation regarding what must be retained and for how long;
review any funding agency policies on retention of materials (e.g., NSF guidelines).

These are only general guidelines, and the best approach will vary by project.

How long should I keep data?

Check with your funding agency to find out if there is a specific policy that spells out a data retention period. For publicly funded research in the US, this is often a minimum of three years but may be more. It is better to aim for even longer, if possible, in case you or someone else needs the data later on. Five to ten years is a good rule of thumb. For data that are not hosted publicly, you should seek to extend the retention period for as long as possible to avoid permanent loss; as infrastructure develops, a solution that is able to host previously un-hostable data may arise.

How should I preserve data?

It is important to think about a long-term plan from the outset of your project so that you can set aside enough time and resources to ensure that your data will be accessible long after your project is over.

Keep files readable

Making sure your data remain accessible for the long term is a big challenge, especially since technology changes so quickly (e.g., programs become obsolete, restricted to a single operating system, or subscription-based). Choosing the right file formats can help avoid obsolescence and minimize the chance that you need to convert data for reuse down the road. Use formats that are:

Non-proprietary, open, documented standards (e.g., .tif, .txt, .csv, .pdf)
Used commonly in your research community
Encoded with standard character encodings (e.g., ASCII, UTF-8)
See the Library of Congress guide to file formats that are likely to have long-term support
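One way to apply the format criteria above is to scan a project folder before archiving it. The sketch below flags files whose extensions fall outside a preferred list and text files that do not decode as UTF-8 (plain ASCII files pass, since ASCII is a subset of UTF-8). The extension sets simply reuse the examples given above, and the function names audit_formats and is_utf8 are hypothetical; adjust the lists to whatever your research community actually uses.

```python
from pathlib import Path

# Assumed lists for illustration, based on the example formats named above.
PREFERRED_EXTENSIONS = {".tif", ".tiff", ".txt", ".csv", ".pdf"}
TEXT_EXTENSIONS = {".txt", ".csv"}


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        # Reads the whole file into memory; fine for a sketch,
        # but very large files would need chunked reading.
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def audit_formats(root):
    """Walk a folder and collect files that may need attention before archiving."""
    report = {"other_format": [], "not_utf8": []}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext not in PREFERRED_EXTENSIONS:
            report["other_format"].append(path)
        elif ext in TEXT_EXTENSIONS and not is_utf8(path):
            report["not_utf8"].append(path)
    return report
```

Calling audit_formats on a data folder returns two lists of paths to review. A file in the "other_format" list is not necessarily a problem; it is only a prompt to consider whether an open equivalent exists.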

Preserve supporting documentation

It's good practice to publish data with a README file. Any other supporting documentation that is important to understanding nuances of the data (e.g., folder organization, file-naming conventions, non-rectangular data where a rectangular table is expected) should also be preserved, particularly for any data that have not been shared publicly (e.g., intermediate files, outcomes of trial runs).
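A README can be started from a generated skeleton and then completed by hand. The sketch below lists every file in a folder alongside a stated file-naming convention and leaves a placeholder for known quirks. The function name build_readme and the section headings are hypothetical choices for illustration, not a required README template.

```python
from pathlib import Path


def build_readme(root, title, naming_convention):
    """Return README text listing the files under root (illustrative skeleton only)."""
    root = Path(root)
    lines = [
        title,
        "=" * len(title),
        "",
        "File-naming convention:",
        f"  {naming_convention}",
        "",
        "Folder contents:",
    ]
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            lines.append(f"  {relative} ({path.stat().st_size} bytes)")
    lines += [
        "",
        "Known quirks (e.g., non-rectangular tables):",
        "  TODO: describe here",
    ]
    return "\n".join(lines) + "\n"
```

The returned text could be written to a file such as README.txt in the same folder, for example with Path(root, "README.txt").write_text(text, encoding="utf-8"). The generated listing is only a starting point; descriptions of variables, methods, and units still need to be written by someone who knows the data.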