University of Texas Libraries

Scan Tech Studio (STS)

This guide provides orienting information and tutorials for the Digitization and Text Recognition Hub in the PCL Scholars Lab.

Preparing Your Project in the STS

Project Flow Diagram

Develop a Research Question

When considering a digital scholarship project, it is essential to identify your research question first. Some researchers try to approach digital scholarship by first choosing a method or tool and then deciding on a research question. However, this approach is not sustainable in the long term.

Be curious and ask why things are the way they are. Identify a general topic of interest that you would like to research. Carefully examine the existing literature on that topic to learn what others have already done or are doing. A research question should be clear, focused, complex, and arguable:

Is your research question clear?

  • It should provide enough specifics that your audience can easily understand its purpose without needing additional explanation.

Is your research question focused?

  • It should be narrow enough that it can be answered thoroughly in the space that the assignment allows.

Is your research question complex?

  • It should not be answerable with a simple “yes” or “no,” but should instead require synthesis and analysis of ideas and sources before any answer can be produced.

Is your research question arguable?

  • Its potential answers should be open to debate rather than exist as accepted facts.

Adapted from: “How to Write a Research Question.” n.d. The Writing Center. https://writingcenter.gmu.edu/writing-resources/research-based-writing/how-to-write-a-research-question.

Copyright & Digitizing

Copyright

Before digitizing your materials, consider the end-use case and the legality of creating a digital copy. As of 1 January 2024, books published in the US before 1929 and sound recordings published before 1924 are in the public domain. If your material is still under copyright, or its copyright status is unknown, you may still be able to create a digital reproduction under the Fair Use Doctrine. In short, this doctrine allows the use and reproduction of some copyrighted material under specific circumstances, such as academic research and educational purposes, decided on a case-by-case basis. Cornell University Library has a good checklist that can help determine fair use status when using copyrighted materials.

 

When seeking to digitize materials owned by a library, archive, or other institution, there may be other restrictions that apply. If in doubt, refer back to the owning institution’s policies and contact their staff for specific questions.

 

For further general information about fair use and copyright in libraries and archives, see the American Library Association’s resource guide and this easy-to-follow chart from Cornell University Library on what is considered public domain in the United States.

 

For professional legal advice, contact an intellectual property attorney.

Digitizing

When scanning material, consider the following:

  • Take several test scans to determine what would be the best way to capture your image; be prepared to re-adjust throughout the scanning process. Further instructions on how to use the Scholars Lab equipment can be found here.
  • How will you manage your data in the long term?
    • Maintain consistent file-naming practices.
    • Save files in stable, non-proprietary, and lossless file formats whenever possible.
    • Document your workflow and decisions for future reference and to maintain consistency throughout the project.
    • Save data in at least two places, ideally a physical drive and a cloud-based location.
      • The cloud storage option provided by UT can be found here.
      • Check the health of the drive on a regular basis and transfer or copy files to a new drive as needed.
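
The file-naming and two-copy advice above can be sketched in Python. This is only a minimal illustration using the standard library, not a Scholars Lab tool. The naming pattern and the helper names `scan_filename`, `sha256_of`, and `copies_match` are assumptions made for this example; adapt them to your own project conventions.

```python
import hashlib
import re
from pathlib import Path


def scan_filename(project: str, item: str, page: int, ext: str = "tif") -> str:
    """Build a consistent, sortable file name (hypothetical naming pattern)."""
    def slug(text: str) -> str:
        # Lowercase the text and replace runs of other characters with a hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # Zero-padded page numbers keep files in page order when sorted by name.
    return f"{slug(project)}_{slug(item)}_p{page:04d}.{ext}"


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, backup: Path) -> bool:
    """True when the backup is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(backup)
```

For example, `scan_filename("Austin Gazette", "1885 March", 3)` returns `austin-gazette_1885-march_p0003.tif`. Running `copies_match` on a scan and its backup after each transfer is one simple way to confirm that a copy has not been corrupted.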

Text Recognition (OCR) + Analysis

Text Recognition & OCR

Text recognition, also known as Optical Character Recognition (OCR), is the conversion of images of printed, typed, or handwritten text into machine-encoded text. In other words, it's the process that turns your physical item (book, newspaper, pamphlet, etc.) into something that a computer can understand and manipulate.

The following tools allow you to perform OCR on a variety of textual materials, such as newspapers, handwritten documents, and computer-generated texts. You can find even more tool recommendations here.

To learn more about text recognition, see the OCR LibGuide.

Analysis

Once your text is transcribed, you might want to apply text analysis methods to analyze and visualize the data extracted from your texts. These methods include:

  • Sentiment analysis
  • Network analysis
  • Named entity recognition
  • Topic modeling

Analysis of a text can also be based on various linguistic features, such as word frequencies, sentence lengths, and other peculiarities of an author’s style. Text analysis can be performed using a variety of tools.

Additionally, exploring the programming language Python can be a great place to start. There are many existing Python packages and tutorials focused on text analysis that can help you get started.
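
As a small illustration of the linguistic features mentioned above, the following sketch counts word frequencies and sentence lengths using only Python's standard library. The function names `word_frequencies` and `sentence_lengths` and the sample text are assumptions made for this example, and the tokenizing is deliberately naive; dedicated text analysis packages handle punctuation, abbreviations, and other languages far more carefully.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each word occurs, ignoring case."""
    # Naive tokenizer: \w+ matches runs of letters, digits, and underscores,
    # so a contraction such as "don't" is split into two tokens.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


def sentence_lengths(text: str) -> list[int]:
    """Return the number of words in each sentence (naive split on . ! ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


sample = "The scan is clear. The text is not! Is the old text searchable now?"
print(word_frequencies(sample).most_common(3))
print(sentence_lengths(sample))
```

With this sample, the three most common words are "the" and "is" (three times each) and "text" (twice), and the sentence lengths are 4, 4, and 6 words. On OCR output, uncorrected recognition errors will skew counts like these, so it is worth reviewing the transcription first.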


Preserve & Publish Your Work

Why You Should Share Your Work

It is important to think about a long-term plan from the outset of your project so that you can set aside enough time and resources to ensure that your data will be accessible long after your project is over.

Publication in a digital repository can provide persistent URLs or a digital object identifier (DOI), full-text indexing and long-term preservation.

Publishing your work outside of a journal’s paywall will help your work become more discoverable to a wider audience. Need more reasons?

  • Repository content is included in Google Scholar results, expanding its reach and impact.
  • Publications in repositories are openly accessible, allowing people without expensive institutional journal subscriptions to engage with your work.
  • You secure a stable link or identifier for your work that is easy to share in your portfolio or CV.

This guide page on Archiving and Sharing Your Work provides more information on increasing access to your work.

Repositories

When looking to store your work in a repository, consider using one provided by UT. Some benefits of using a UT repository for your work are:

  • The library commits to the long-term preservation of content deposited in Texas ScholarWorks (TSW), and commits to at least 10 years of access to content deposited in Texas Data Repository (TDR).
  • Each item uploaded gets a digital object identifier (DOI) that makes citing work easier and more persistent.
  • TSW and TDR are indexed by major search engines like Google.
  • Usage data for your uploaded works are available.
  • Texas Data Repository (TDR) was purpose-built for data, so it has functionality like version control and access options that are more robust than a typical publications repository.
  • Texas ScholarWorks (TSW) offers a library-managed option that frees up your time for other responsibilities.

Texas ScholarWorks (TSW)

  • TSW is UT’s web-accessible DSpace repository, managed by UT Libraries. A free and secure place for archiving and sharing faculty research output, it provides persistent URLs, searchable metadata, full-text indexing and long-term preservation.

Texas Data Repository (TDR)

  • Hosted by the Texas Digital Library and based on Harvard University’s Dataverse platform, TDR is a long-term solution for the preservation and dissemination of UT’s research data. Affiliates of UT-Austin may deposit and publish datasets of up to 4GB each in TDR free of charge. Published datasets in the TDR are assigned Digital Object Identifiers (DOIs), are publicly accessible, and are free to access and download.

Discipline-Specific Repositories:

  • Archive of the Indigenous Languages of Latin America (AILLA) - AILLA's primary mission is to preserve materials in and about the indigenous languages of Latin America.
  • Inter-university Consortium for Political and Social Research (ICPSR) - ICPSR maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. It hosts data collections in education, aging, criminal justice, substance abuse, terrorism, and other fields. Free with UT institutional membership.
  • Qualitative Data Repository (QDR) - QDR curates, stores, preserves, publishes, and enables the download of digital data generated through qualitative and multi-method research in the social sciences.
  • Re3data.org - A global registry of data repositories organized by academic discipline. A rating system and faceted browsing can help you find the best place to deposit your data. Free with UT institutional membership.
  • Scientific Data Recommended Repositories - A list of disciplinary and open repositories that meet the data access, preservation and stability requirements of Nature's Scientific Data journal.
  • NIH Data Repositories - National Institutes of Health-supported data repositories that make data accessible for reuse. Most accept submissions of appropriate data from NIH-funded investigators (and others), but some restrict data submission to only those researchers involved in a specific research network.
  • Open Access Directory - A list of data repositories worldwide.

Make an Appointment

Please contact us for assistance with your project using this form.

Data Curation Tools

  • Humanities Data Curation Checklist
    Created by Adriana Cásarez in spring 2020, this checklist guides humanities researchers and humanities liaison librarians on key considerations for making their data findable, accessible and clear to interested scholars and institutions.

  • Data Curation in the Texas Data Repository
    Spring 2020 capstone report from Brenna Wheeler. Brenna created a data curation workflow based on the Data Curation Network’s CURATE(D) Model to improve the findability and reusability of datasets. The workflow is localized to the Texas Data Repository using needs identified in interviews with academic librarians and assessment of datasets currently in the repository. The final product is a specialized Data Curation workflow and a list of recommendations that may be used by a team of liaison librarians to curate newly deposited datasets in the future.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.