LibGuides: Research Data Services: Where Not to Share Data

Overview

Data and other supporting research outputs can be, and have been, shared through a wide range of outlets that are not data repositories. Most of these outlets are not appropriate for long-term preservation and may not comply with publisher or funder requirements. Two of the most common examples (supplemental information and GitHub) are detailed here to explain why they are sub-optimal avenues for data sharing.

Why not Supplemental Information?

Supplemental Information (SI), along with its many nominal permutations (Supplementary Materials, etc.), is probably the most common location for researchers to share supporting materials, ranging from raw data to supplemental figures to appendices to code. However, SI is no longer a good location for sharing most, if not all, of these materials.

SI has been acceptable for years, so what's the problem now?

Both SI and data sharing are products of the rapid evolution of digital technology and its application to research. A hundred years ago, everything had to go into a print copy or be left out. When digital publishing started, the only data repositories were physical entities like museums. SI arose as a means for authors to provide more information without increasing print costs, sometimes in formats that cannot be included in article PDFs (e.g., videos); SI is often critical for journals with page/word limits, as most of the substance may be found in the SI (e.g., Nature, Science) (Sacher, 2011). For most researchers, the second most popular option was maintaining supporting material on a personal website or a university website, both of which are now regarded as poor practices because these pages can disappear overnight if a researcher leaves an institution or forgets to pay the bill for their website domain ('link rot'; see Briney et al., 2024 [PLOS ONE], for an assessment of how much link rot occurs with research data). Data repositories are the solution to many of these potential problems, among others (e.g., ensuring content is available beyond the career or lifespan of an individual researcher or research group).

Limitations of SI

Publishers do not necessarily commit to maintaining access/integrity: In general, journals do not assign DOIs to SI, which means there is no persistent identifier for tracking the content and it is impossible to discern what happened if it disappears. This can be a particularly acute problem when journals switch publishers and have to move content to a new website. It usually cannot be resolved through archived snapshots of a webpage, because externally linked content is not downloaded.
Terms of (re)use and (re)distribution are unclear: Journal articles range in copyright status from fully copyrighted (paywalled articles) to licensed for certain reuse through licenses like CC BY (open access). However, the status of their associated SI is often unclear. For one, 'data' as a factual entity (e.g., specimen measurements, geographic points) are not copyrightable, so neither the copyright status of the associated article nor any other copyright claim is applicable. Other materials may be copyrightable but should not be licensed under the same terms as the associated article (e.g., Creative Commons [CC] licenses should not be used for software, per Creative Commons). For paywalled articles where the author(s) transfer copyright to the publisher, the copyright of the SI may also be considered to have been transferred. The SI is then at the whim of the publisher, who may use it in ways that researchers may not like or at least did not agree to (e.g., for training AI models).

“The NIH DMS Policy expects that researchers will share scientific data through established data repositories. Sharing data through publications, local servers, or lab websites is not the same as using a repository and does not meet the Policy’s expectations.” – NIH Office of Data Science Strategy

For researchers supported by U.S. federal research grants, there is an expectation that data are shared through proper data repositories; use of SI is considered to fall under "through publications" and is not acceptable for meeting funder requirements.

Why not GitHub?

GitHub is a common tool used for software development and, increasingly, in academic research. Researchers are often confused when they hear that GitHub does not qualify as a proper repository for publishing their research materials, but it is important to keep a few things in mind:

GitHub is a cloud-based platform where you can store, share, and work together with others to write code. -GitHub Docs

GitHub was developed for non-academic software developers, who remain its primary user base. It is not itself a repository in the research sense. Individual 'repositories' in GitHub are defined only as "a location to store code" (see GitHub's own description), whereas a repository for publishing research data and software is a platform specifically designed for long-term preservation of academic research data.

Why do GitHub repositories not meet the standards for publishing academic research data or software?

Repositories can be deleted at any point: Any platform where a depositor can freely and immediately delete any publicly distributed content does not qualify as a proper research data and software repository. For proper repositories, de-accessioning must be requested by a researcher and requires certain protocols to be followed and validated by the repository; it is done only in exceptional cases (e.g., copyright violation, exposure of sensitive data). In this regard, GitHub is no different from a temporary cloud sharing link.
Repositories do not receive persistent identifiers: Unlike proper research repositories, GitHub does not issue DOIs, ARKs, or other persistent identifiers. GitHub is well aware of this and recommends using one of its integrations with a proper repository like Zenodo to obtain a DOI.
GitHub is not committed to perpetual, free access to content: Unlike a data repository like Dryad or a general-purpose repository like Zenodo, GitHub is a commercial product (owned by Microsoft) and has not made any organizational commitment to ensuring that users will always be able to deposit, maintain, and access content for free or in perpetuity. For example, Microsoft could decide to make GitHub a subscription-based service tomorrow.

Journals are increasingly prohibiting exclusive sharing of content through GitHub

Some researchers may have seen published papers in which data and/or code are shared exclusively through GitHub, or may have done this themselves. Journals have increasingly been burned when GitHub repositories are made private or, worse, deleted, rendering the data/code referenced in the paper inaccessible. An increasing number of journals are thus prohibiting data and code from being shared exclusively in this way.

For some of our journals, material may be provided via GitHub, Google drives, Dropbox, or similar services for the review stage, but they must be moved to a permanent, publicly accessible repository during revision. -The Royal Society of London

Are you saying that researchers can't use GitHub?

Not at all! The Research Data Services team routinely uses GitHub in our library and non-library work, and the UT Open Source Program Office recommends GitHub as a key tool in software development. The key point is to use GitHub as it was designed, which is as a great collaborative tool during the research process but not for the long-term preservation of academic research material. Many data repositories, including the Texas Data Repository, have GitHub integrations that facilitate the deposition of content hosted on GitHub into proper repositories. For researchers who are developing code but who may need to archive a static version for a publication, many of these integrations also enable linked versioning of the deposit in a repository (e.g., through making a new release). Code can thus be continually updated in a public platform while ensuring that requirements for scholarly publishing are also met by depositing a static version in a repository. If you have both a GitHub repository and a linked DOI-backed deposit in a proper research repository, you can link/cite both in an associated paper so that readers know that a specific version was used for the paper but that the product is still being worked on (see Jones et al., 2023 [Methods Ecol Evol] for an example).
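As a rough illustration of citing both a live GitHub repository and an archived, DOI-backed release, the minimal Python sketch below builds a code availability sentence from the two. Everything in it is an assumption made for illustration: the `CodeRelease` and `availability_statement` names, the example repository URL, the version tag, and the placeholder DOI are hypothetical, and the DOI check is only a loose syntax test, not a full validator. The one fixed point is that DOIs resolve through https://doi.org/.

```python
import re
from dataclasses import dataclass

# Loose DOI syntax check ("10.", registrant code, "/", suffix).
# This is an illustrative assumption, not an authoritative validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


@dataclass(frozen=True)
class CodeRelease:
    """A development repository paired with an archived, DOI-backed release."""
    repo_url: str  # live development copy (e.g., on GitHub)
    version: str   # release tag that was archived for the paper
    doi: str       # DOI issued by the archiving repository


def doi_url(doi: str) -> str:
    """Return the resolver URL for a DOI, rejecting obviously malformed input."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI: {doi!r}")
    return f"https://doi.org/{doi}"


def availability_statement(release: CodeRelease) -> str:
    """Build a sentence that cites the archived version and the live repository."""
    return (
        f"The code used in this paper (version {release.version}) is archived at "
        f"{doi_url(release.doi)}. Ongoing development continues at {release.repo_url}."
    )


if __name__ == "__main__":
    example = CodeRelease(
        repo_url="https://github.com/example-lab/example-tool",  # hypothetical repository
        version="v1.2.0",                                        # hypothetical release tag
        doi="10.5281/zenodo.0000000",                            # placeholder, not a real deposit
    )
    print(availability_statement(example))
```

Keeping the DOI and the repository URL as separate fields reflects the distinction drawn above: the DOI points to the static version used for the paper, while the repository URL points to work that may keep changing.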

Have questions?

Bryan Gee

he/him
