Over the course of a research project, a researcher may generate an enormous amount of data. While it is important to preserve data, preserving all of it is rarely feasible. Instead, select which data to preserve.
The selection process can be aided by asking questions of the data.
Your data may be appropriate for preservation yet require redaction before it can be shared publicly. Depending on how sensitive the data is, some researchers choose to create two copies: one to be redacted and shared publicly, and the other to be retained for internal preservation. Examples of material that may need redaction include consent forms, DNA data, and personally identifying information.
The above questions begin to outline some of the considerations. For a fuller set of considerations, please see the following checklist.
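The two-copy approach can be sketched in a few lines of Python. This is a minimal illustration, not a complete anonymization method: the column names in `SENSITIVE_COLUMNS` and the sample record are hypothetical, and dropping columns alone does not guarantee that individuals cannot be re-identified.

```python
import csv
import io

# Hypothetical column names treated as sensitive; adjust to your own dataset.
SENSITIVE_COLUMNS = {"name", "email", "date_of_birth"}


def redact_csv(source_text, sensitive=SENSITIVE_COLUMNS):
    """Return CSV text with the sensitive columns removed.

    The source text is not modified, so it can be retained as the
    internal preservation copy.
    """
    reader = csv.DictReader(io.StringIO(source_text))
    rows = list(reader)
    kept = [c for c in (reader.fieldnames or []) if c not in sensitive]

    out = io.StringIO()
    # extrasaction="ignore" silently skips the columns we are dropping.
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up record for illustration only.
internal_copy = "participant_id,name,email,score\n1,Ada,ada@example.org,42\n"
public_copy = redact_csv(internal_copy)
print(public_copy)
```

Here `internal_copy` stays intact for internal preservation, while `public_copy` contains only the columns considered safe to share.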
Assessing External Requirements | Yes | No | Unsure |
---|---|---|---|
Are there funder requirements? | | | |
Are there repository requirements (stability, reliability, usage by others, security, appropriate, terms, license)? | | | |
Are there publisher requirements? | | | |
Are there any disciplinary expectations/norms? | | | |
Are there institutional requirements? | | | |
Are there any legal requirements? | | | |
Do any of the above include a retention schedule? | | | |

Assessing Scientific Value and Reuse Potential | Yes | No | Unsure |
---|---|---|---|
Does this data enable other researchers to verify or reproduce your published findings? | | | |
Is your data original? | | | |
Is the data available elsewhere? | | | |
Could the data have value for future research? | | | |
Could you or your colleagues use this data again for future outputs? | | | |

Ethical Considerations | Yes | No | Unsure |
---|---|---|---|
Do you have permission to share the data from all stakeholders (collaborators, study participants, etc.)? | | | |
Can the data be anonymized? | | | |
Are there any copyright or license restrictions on any part of the data? | | | |

Practical Considerations | Yes | No | Unsure |
---|---|---|---|
Are there storage costs? | | | |
Are appropriate metadata and documentation in place? | | | |
Is there any risk in preserving or sharing this data (e.g., could this data be used to target a particular population or community)? | | | |

UT Libraries, CC License
The ten-year reproducibility challenge is a practice in which researchers try to run code that was written ten years ago. Often the difficulty comes not from changes in the available hardware but from a lack of description. While the challenge refers to written code, it applies just as readily to data. Describing the data and what its variables mean is often an afterthought, and years later it can be difficult to reconstruct the intended meaning. Now imagine trying this process with someone else's data from ten years ago. Whether you plan to publish your data openly so that others can view, analyze, and reuse it, or to preserve it internally to meet preservation standards, the description should be present and thorough enough that a third party can understand the data.
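One way to make variable descriptions less of an afterthought is to keep a small data dictionary alongside the data and check it against the dataset's columns. The sketch below is only an illustration: the variable names, the entry fields (`variable`, `type`, `units`, `description`), and the helper `missing_descriptions` are all hypothetical choices, not an established standard.

```python
import json

# Hypothetical data dictionary; every entry here is made up for illustration.
data_dictionary = [
    {
        "variable": "participant_id",
        "type": "integer",
        "units": None,
        "description": "Unique identifier assigned at enrollment.",
    },
    {
        "variable": "temp_c",
        "type": "float",
        "units": "degrees Celsius",
        "description": "Body temperature measured at intake.",
    },
]

# Fields that every entry must fill in to count as documented (an assumption).
REQUIRED_FIELDS = ("variable", "type", "description")


def missing_descriptions(columns, dictionary):
    """Return the dataset columns that lack a complete dictionary entry."""
    documented = {
        entry["variable"]
        for entry in dictionary
        if all(entry.get(field) for field in REQUIRED_FIELDS)
    }
    return [column for column in columns if column not in documented]


# Column names as they might appear in the dataset's header row.
columns = ["participant_id", "temp_c", "flag2"]
undocumented = missing_descriptions(columns, data_dictionary)
print("Columns still needing a description:", undocumented)

# The dictionary can be stored next to the data in a plain, long-lived format.
print(json.dumps(data_dictionary, indent=2))
```

A column such as `flag2` is exactly the kind of name whose meaning is hard to reconstruct years later, so a check like this flags it before the data is deposited.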
It is important to think through what data you will preserve and what data you will share. These won't always be the same. You'll want to consider:
It is important to be selective about what data you plan to retain, as every file requires some measure of overhead in terms of storage and maintenance for the long term. It’s a good idea to:
This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.