University of Texas Libraries

Scan Tech Studio (STS)

This guide provides orienting information and tutorials for the Digitization and Text Recognition Hub in the PCL Scholars Lab.

Understanding Terms

Transcription, OCR, HTR, NLP

Transcription

The process of copying text from one medium to another. In the Scan Tech Studio, it means copying text from digitized images into a machine-readable format (like .docx). This is the most manual method of creating machine-readable texts: users must type what they read into a new document, usually following standard conventions of transcription.

OCR

Optical Character Recognition, or OCR, is an automated method of creating machine-readable texts. An OCR program interprets the text in a digital image and attempts to render it in a known alphabet. Often, you need to tell the OCR software which language(s) appear in the image, and sometimes you also need to help the software understand how the text is laid out (for example, the columns of text, photographs, and advertisements in a newspaper).
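
As a rough illustration of specifying languages and layout, the sketch below assembles a command line for Tesseract, an open-source OCR engine. Tesseract is an assumption made for this example only; this guide does not say which OCR software the studio uses. The function names and the file name newspaper_page.png are hypothetical. The `-l` option names the language(s) and `--psm` sets the page segmentation mode, which tells the engine how to treat the page layout.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, languages, page_segmentation_mode=3):
    """Assemble a Tesseract command line. This function does not run anything."""
    if not languages:
        raise ValueError("at least one language code is required")
    return [
        "tesseract",
        image_path,
        "stdout",                    # print recognized text instead of writing a file
        "-l", "+".join(languages),   # several languages are joined with "+"
        "--psm", str(page_segmentation_mode),  # 3 = fully automatic page segmentation
    ]


def run_ocr(image_path, languages, page_segmentation_mode=3):
    """Run Tesseract and return the recognized text (requires Tesseract installed)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract is not installed or not on PATH")
    command = build_tesseract_command(image_path, languages, page_segmentation_mode)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical input: a scanned newspaper page containing English and Spanish text.
command = build_tesseract_command("newspaper_page.png", ["eng", "spa"])
print(" ".join(command))
# prints: tesseract newspaper_page.png stdout -l eng+spa --psm 3
```

Keeping the command builder separate from `run_ocr` makes it possible to check the language and layout settings before any image is processed. Other segmentation modes (for example, 4 for a single column of text) may suit a page better than the default.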

HTR

Handwritten Text Recognition, or HTR, is an automated method of creating machine-readable texts from handwritten sources, and it works much like OCR. An HTR program renders handwritten text into machine-readable, "print" text, which is easier for researchers not only to read but also to manipulate and explore with digital methods.

NLP

NLP stands for Natural Language Processing, a set of methods for programming computers to process and analyze large amounts of language data, such as the text of digital images that have been OCR'd or HTR'd. Once that text has been rendered machine-readable, it is ready for a computer to try to interpret. NLP software helps researchers explore their questions about context, organization, and language use.
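
To give a sense of what exploring language use and context can look like, here is a minimal sketch using only the Python standard library. It counts word frequencies and lists each occurrence of a keyword with its neighboring words (a "keyword in context" view). The sample text and all function names are made up for illustration; dedicated NLP software offers far more than this.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word appears in the text."""
    return Counter(tokenize(text))


def keyword_in_context(text, keyword, window=2):
    """Return each occurrence of keyword with up to `window` words on either side."""
    tokens = tokenize(text)
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [token.upper()] + right))
    return hits


# Made-up sample standing in for text produced by OCR or HTR.
sample_text = (
    "The river rose in spring. The river fell in autumn. "
    "Farmers watched the river."
)

print(word_frequencies(sample_text)["river"])  # prints: 3
for line in keyword_in_context(sample_text, "river"):
    print(line)
# prints:
# the RIVER rose in
# spring the RIVER fell in
# watched the RIVER
```

Frequency counts speak to questions about language use, while the keyword-in-context lines show how a term is actually used in the surrounding text.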

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.