University of Texas Libraries

Scan Tech Studio

This guide provides orienting information and tutorials for the Digitization and Text Recognition Hub in the PCL Scholars Lab.

Understanding Terms & Tools

Transcription, OCR, HTR, NLP

At the Scholars Lab Scan Tech Studio, you can make digital versions of print materials and apply digital methods to the study of the new versions. For example: you can scan a poetry collection, make the text machine-readable (which means that a computer can understand the text), and then explore your research questions with text analysis methods. These are some key terms and techniques that you'll want to keep in mind as you think about your project in the Scan Tech Studio:

Transcription is the process of copying text from one medium to another; in the Scan Tech Studio it means copying text from digitized images into a machine-readable format (like .docx). This method of creating machine-readable texts is the most manual: users type what they read into a new document, usually following standard conventions of transcription. The By the People project at the Library of Congress is a great example of successful crowd-sourced transcription.

OCR stands for optical character recognition and is an automated method of creating machine-readable texts. An OCR program interprets the text on a digital image and attempts to render it in a known alphabet. Often you need to tell the OCR software which language(s) appear in the image, and sometimes you also need to help it understand the formatting of the text (for example, the columns of text, photographs, and advertisements in a newspaper). Notable OCR-focused projects include Mapping Texts, Eighteenth Century Collections Online, and the Nusus Corpus.
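
To give a concrete sense of how OCR slots into a script, here is a minimal sketch assuming the Pytesseract wrapper (listed under Tools to Consider below) and the Tesseract binary are installed; the image filename and language code are placeholders for your own scan:

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Collapse the stray line breaks and repeated spaces OCR output often contains."""
    return re.sub(r"\s+", " ", raw).strip()


def ocr_page(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one scanned page image.

    Requires `pip install pytesseract pillow` plus the Tesseract binary itself;
    image_path is a placeholder for a scan you made in the studio.
    """
    import pytesseract  # imported here so clean_ocr_text works without it
    from PIL import Image

    return clean_ocr_text(pytesseract.image_to_string(Image.open(image_path), lang=lang))
```

Calling `ocr_page("page_01.png")` would return the recognized text with whitespace normalized; `lang` accepts Tesseract language codes such as `"eng"` or `"fra"`.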

HTR stands for handwritten text recognition and, much like OCR, is an automated method of creating machine-readable texts from handwritten sources. An HTR program renders handwritten text into machine-readable "print" text, which is not only easier for researchers to read but also to manipulate and explore with digital methods. Ottoman studies has had much success implementing HTR on Ottoman archives and literature, and you can check out some of those projects here.

NLP stands for natural language processing: methods for programming computers to process and analyze large amounts of language data, such as digital images of texts that have been OCR'd or HTR'd. Because those images have been rendered machine-readable, they are ready for a computer to try to understand. NLP software helps researchers explore their questions about context, organization, and language use.
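
As a minimal plain-Python sketch of the kind of processing NLP pipelines automate (dedicated libraries such as NLTK and spaCy, listed below, do this far more robustly), here is a simple tokenize-and-count pass over a machine-readable text:

```python
import re
from collections import Counter


def word_frequencies(text: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Lowercase the text, split it into word tokens, and return the most common ones."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)


sample = "The rose is a rose is a rose."
print(word_frequencies(sample))  # [('rose', 3), ('is', 2), ('a', 2)]
```

Even this toy example shows the basic shape of text analysis: turn a document into tokens, then ask quantitative questions about them.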

Tools to Consider

We recommend considering these helpful tools based on the type of project you're working on (transcription, OCR/HTR, NLP, text analysis, or other methods).

Transcription
  • eScriptorium

    Open-source platform used to automatically transcribe and recognize text in images of handwritten or printed documents, particularly for historical or archival documents.
  • FromThePage

    Web-based platform used for collaborative transcription of historical or archival documents.
  • Python - Anaconda

    Specific libraries/tools such as Google Cloud Speech-to-Text API or Tesseract OCR engine can help with transcription.
  • Transkribus

    For transcribing handwritten or difficult-to-read historical documents, including large volumes of text.

OCR / HTR

  • ABBYY FineReader

    Has advanced OCR/HTR technology to recognize text and characters in scanned documents and images; supports multiple languages, including Asian languages with complex character sets.
  • Adobe Acrobat Pro

    Includes built-in OCR technology to convert scanned documents or images into searchable and editable text; useful for digitizing historical documents or extracting text from images.
  • eScriptorium

    Open-source platform used to automatically transcribe and recognize text in images of handwritten or printed documents, particularly for historical or archival documents.
  • OCRopus

    For processing scanned documents, images, and other sources of text data, especially useful for recognizing text in low-quality scans, noisy images, and handwritten text.
  • Pytesseract

    Python wrapper for the Tesseract OCR engine. Allows usage of Tesseract's OCR capabilities in Python applications, making it easy to recognize text in scanned documents, images, and other sources of text data.
  • Python - Anaconda

    Specific libraries/tools such as the Tesseract OCR engine (via the Pytesseract wrapper) can also help with OCR/HTR.
  • Tesseract

    For recognizing printed and handwritten text in scanned documents, images, and other sources of text data; supports a wide range of languages, including many non-Latin scripts and character sets; can be trained on new fonts and languages.
  • Transkribus

    Includes a powerful OCR/HTR engine that can recognize handwriting and other difficult-to-read text.

NLP & Text Analysis

  • Natural Language Toolkit (NLTK)

    Python library used for natural language processing tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, and sentiment analysis.
  • Python - Anaconda

    Specific libraries/tools such as NLTK (Natural Language Toolkit), spaCy, gensim, and TextBlob can be used for text classification, sentiment analysis, named entity recognition, and topic modeling.
  • SpaCy

    Python library for advanced natural language processing (NLP) tasks; industrial-strength toolset for processing large volumes of text data, with features such as named entity recognition, dependency parsing, and tokenization.
  • Stanza

    For working with human language data, including tokenization, part-of-speech tagging, named entity recognition, dependency parsing, sentiment analysis, text classification, and machine translation; has pre-trained models optimized for speed and accuracy.
  • Spyder

    An environment for Python incorporating advanced editing, data visualization, and data exploration.
  • TextBlob

    Python library for processing textual data; covers NLP tasks such as sentiment analysis, part-of-speech tagging, and named entity recognition. Built on top of the NLTK library, it provides a simple, easy-to-use interface for common NLP tasks.
  • Adobe Acrobat Pro

    Provides a range of tools for analyzing PDF content, including the ability to search for specific words or phrases, extract text and data, and create custom reports and summaries, which can be useful for text analysis.
  • OCRopus

    Includes a range of tools for analyzing and processing text data, such as language identification, text segmentation, and text normalization.
  • OpenRefine

    For cleaning and transforming messy or inconsistent data; supports a variety of data formats, including CSV, Excel, and JSON; can handle large datasets with ease. Can also be used to prepare data for analysis in R or Python.
  • Python - Anaconda

    Specific libraries/tools such as 'Pandas' and 'NumPy' for data manipulation and analysis, 'Matplotlib' and 'Seaborn' to create visualizations and plots, 'Jupyter Notebook' for interactive data analysis and visualization.
  • R/RStudio

    For statistical computing and data analysis. Also provides libraries such as tm (Text Mining), quanteda, and tidytext for tokenization, stemming, sentiment analysis, and topic modeling, ggplot2 to create plots and visualizations to explore and analyze text data.
  • Voyant Tools

    For analyzing text, such as word frequency analysis, keyword extraction, and collocation analysis. Can also identify trends and patterns in text data, such as sentiment analysis and topic modeling.
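
Much of the interactive clean-up that OpenRefine performs can also be scripted in Python before analysis; as a rough sketch (the column name and messy values here are invented for illustration), normalizing an inconsistent field might look like:

```python
import csv
import io


def normalize_name(value: str) -> str:
    """Trim stray whitespace and unify casing so variant spellings cluster together."""
    return " ".join(value.split()).title()


# A stand-in for a messy exported spreadsheet (invented example data).
messy = "author\n herman   melville \nHERMAN MELVILLE\n"
rows = [{field: normalize_name(v) for field, v in row.items()}
        for row in csv.DictReader(io.StringIO(messy))]
print(rows)  # both variant spellings collapse to "Herman Melville"
```

Consolidating variants like this before counting or topic modeling keeps the same entity from being split across several spellings.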

Other Methods

  • With this software, you can create interactive web maps from imported data and visualize patterns in the data in a geographical context.
  • Google Earth’s project mode aids in the creation of presentations of geographical data.
  • Input data from text, images, or video to create visual presentations in graph, table, and chart forms.
  • A desktop environment for iterative analysis that enables you to see the process of different algorithms as they are run.
  • Geospatial visualization and analysis software.
  • Create and share appealing data visualizations online.
  • Software with a vast array of built-in functions, with capabilities to visualize data, handle geographical information, and even perform natural language processing.
  • Adobe’s PDF reader with text editing and tools to convert to and from other formats.
  • Adobe’s asset manager enables metadata editing and management of collections.
  • An open-source e-book manager that allows for edits and markup and supports a wide range of text types.
  • Adobe’s webpage creation application, including tools to learn HTML coding and templates.
  • An open-source Java development environment that allows for numerous plug-ins.
  • Adobe’s vector imaging application for creating logos, web graphics, and other elements of UX design.
  • InDesign: Adobe’s layout tool for publications. Can be used to create and publish ebooks, magazines, and PDFs.
  • Adobe’s photo touch-up software may be applied to the scans created in the studio to enhance clarity for OCR software.
  • Adobe’s media encoder allows you to input audio and video files and output them in a variety of formats.
  • An advanced photo editing and image creation software that can be applied to scans to make them more legible or easier to interpret for OCR software.
  • Professional video editing software that can be applied in creating media for digital scholarship presentations.
  • Open-source file management software for Windows that enables file transfer between a computer and a remote server.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.