At the Scholars' Lab Scan Tech Studio, you can make digital versions of print materials and apply digital methods to the study of the new versions. For example, you can scan a poetry collection, make the text machine-readable (that is, stored as characters a computer can search and process rather than as a picture of a page), and then explore your research questions with text analysis methods. These are some key terms and techniques that you'll want to keep in mind as you think about your project in the Scan Tech Studio:
Transcription is the process of copying text from one medium to another, and in the Scan Tech Studio it means copying text from digitized images into a machine-readable format (like .docx). This is the most manual method of creating machine-readable texts: users must type what they read into a new document, usually following standard conventions of transcription. The By the People project at the Library of Congress is a great example of a successful crowd-sourced transcription project.
OCR stands for optical character recognition, and is an automated method of creating machine-readable texts. An OCR program interprets the text in a digital image and attempts to render it in a known alphabet. Often, you need to tell the OCR software which language(s) appear in the digital image, and sometimes you also need to help the software understand the layout of the page (for example, the columns of text, photographs, and advertisements in a newspaper). There are many notable OCR-focused projects, including Mapping Texts, Eighteenth Century Collections Online, and the Nusus Corpus.
HTR stands for handwritten text recognition and, much like OCR above, is an automated method of creating machine-readable texts, in this case from handwritten sources. An HTR program renders handwritten text into machine-readable, "print" text, which is easier for researchers not only to read, but also to manipulate and explore with digital methods. Ottoman Studies has had much success with implementing HTR on Ottoman archives and literature, and you can check out some of those projects here.
NLP stands for natural language processing. It covers methods for programming computers to process and analyze large amounts of language data, such as the text of digital images that have been OCR'd or HTR'd. Because that text has been rendered machine-readable, it is ready for a computer to analyze. NLP software helps researchers explore their questions about context, organization, and language use.
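To make the idea of exploring language use a little more concrete, here is a minimal sketch of two very simple text analysis steps: counting the most frequent words and showing a keyword with its surrounding words. It uses only the Python standard library. The function names (`tokenize`, `word_frequencies`, `keyword_in_context`) and the sample sentence are illustrative assumptions, not part of any tool used in the Scan Tech Studio, and a real project would probably rely on a dedicated NLP library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a "word" is any run of letters, digits, or underscores.
    Real projects usually need a language-aware tokenizer.
    """
    return re.findall(r"\w+", text.lower())


def word_frequencies(text, top_n=3):
    """Return the top_n most common words with their counts."""
    return Counter(tokenize(text)).most_common(top_n)


def keyword_in_context(text, keyword, window=2):
    """Return each occurrence of keyword with `window` words on either side."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Hypothetical sample standing in for the output of an OCR or HTR step.
sample = "The sea is calm tonight. The tide is full, the moon lies fair."

print(word_frequencies(sample, top_n=2))
print(keyword_in_context(sample, "tide"))
```

Word counts give a rough picture of what a text talks about, while the keyword-in-context view shows how a particular word is actually used, which is often closer to the kinds of questions researchers bring to a corpus.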
We recommend considering these helpful tools based on the type of project you're working on (transcription, OCR/HTR, NLP, text analysis, or other methods).
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 Generic License.