University of Texas Libraries

Scan Tech Studio

This guide provides orienting information and tutorials for the Digitization and Text Recognition Hub in the PCL Scholars Lab.

Understanding Terms & Tools

Transcription, OCR, HTR, NLP

At the Scholars Lab Scan Tech Studio, you can make digital versions of print materials and apply digital methods to the study of the new versions. For example: you can scan a poetry collection, make the text machine-readable (which means that a computer can understand the text), and then explore your research questions with text analysis methods. These are some key terms and techniques that you'll want to keep in mind as you think about your project in the Scan Tech Studio:

Transcription is the process of copying text from one medium to another; in the Scan Tech Studio it means copying text from digitized images into a machine-readable format (like .docx). This method of creating machine-readable texts is the most manual: users type what they read into a new document, usually following standard conventions of transcription. The By the People project at the Library of Congress is a great example of successful crowd-sourced transcription.

OCR stands for optical character recognition and is an automated method of creating machine-readable texts. An OCR program interprets the text on a digital image and attempts to render it in a known alphabet. Often you need to tell the OCR software which language(s) appear in the image, and sometimes you also need to help it understand the formatting of the text (for example, the columns of text, photographs, and advertisements in a newspaper). Notable OCR-focused projects include Mapping Texts, Eighteenth Century Collections Online, and the Nusus Corpus.
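
To give a concrete sense of how OCR slots into a script, here is a minimal sketch assuming the Pytesseract wrapper (listed under Tools to Consider below) and the Tesseract binary are installed; the image filename and language code are placeholders for your own scan:

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Collapse the stray line breaks and repeated spaces OCR output often contains."""
    return re.sub(r"\s+", " ", raw).strip()


def ocr_page(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one scanned page image.

    Requires `pip install pytesseract pillow` plus the Tesseract binary itself;
    image_path is a placeholder for a scan you made in the studio.
    """
    import pytesseract  # imported here so clean_ocr_text works without it
    from PIL import Image

    return clean_ocr_text(pytesseract.image_to_string(Image.open(image_path), lang=lang))
```

Calling `ocr_page("page_01.png")` would return the recognized text with whitespace normalized; `lang` accepts Tesseract language codes such as `"eng"` or `"fra"`.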

HTR stands for handwritten text recognition and, much like OCR, is an automated method of creating machine-readable texts from handwritten sources. An HTR program renders handwritten text into machine-readable "print" text, which is not only easier for researchers to read but also to manipulate and explore with digital methods. Ottoman studies has had much success implementing HTR on Ottoman archives and literature, and you can check out some of those projects here.

NLP stands for natural language processing: methods for programming computers to process and analyze large amounts of language data, such as digital images of texts that have been OCR'd or HTR'd. Because those images have been rendered machine-readable, they are ready for a computer to try to understand. NLP software helps researchers explore their questions about context, organization, and language use.
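
As a minimal plain-Python sketch of the kind of processing NLP pipelines automate (dedicated libraries such as NLTK and spaCy, listed below, do this far more robustly), here is a simple tokenize-and-count pass over a machine-readable text:

```python
import re
from collections import Counter


def word_frequencies(text: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Lowercase the text, split it into word tokens, and return the most common ones."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)


sample = "The rose is a rose is a rose."
print(word_frequencies(sample))  # [('rose', 3), ('is', 2), ('a', 2)]
```

Even this toy example shows the basic shape of text analysis: turn a document into tokens, then ask quantitative questions about them.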

Tools to Consider

We recommend considering these helpful tools based on the type of project you're working on (transcription, OCR/HTR, NLP, text analysis, or other methods).

Transcription
  • eScriptorium

    Open-source platform used to automatically transcribe and recognize text in images of handwritten or printed documents, particularly for historical or archival documents.
  • FromThePage

    Web-based platform used for collaborative transcription of historical or archival documents.
  • Python - Anaconda

    Specific libraries/tools such as Google Cloud Speech-to-Text API or Tesseract OCR engine can help with transcription.
  • Transkribus

    For transcribing handwritten or difficult-to-read historical documents, including large volumes of text.

OCR / HTR

  • ABBYY FineReader

    Has advanced OCR/HTR technology to recognize text and characters in scanned documents and images; supports multiple languages, including Asian languages with complex character sets.
  • Adobe Acrobat Pro

    Includes built-in OCR technology to convert scanned documents or images into searchable and editable text; useful for digitizing historical documents or extracting text from images.
  • eScriptorium

    Open-source platform used to automatically transcribe and recognize text in images of handwritten or printed documents, particularly for historical or archival documents.
  • OCRopus

    For processing scanned documents, images, and other sources of text data, especially useful for recognizing text in low-quality scans, noisy images, and handwritten text.
  • Pytesseract

    Python wrapper for the Tesseract OCR engine. Allows usage of Tesseract's OCR capabilities in Python applications, making it easy to recognize text in scanned documents, images, and other sources of text data.
  • Python - Anaconda

    Specific libraries/tools such as the Tesseract OCR engine (via the Pytesseract wrapper) can also help with OCR/HTR.
  • Tesseract

    For recognizing printed and handwritten text in scanned documents, images, and other sources of text data; supports a wide range of languages, including many non-Latin scripts and character sets; can be trained on new fonts and languages.
  • Transkribus

    Includes a powerful OCR/HTR engine that can recognize handwriting and other difficult-to-read text.

NLP & Text Analysis

  • Natural Language Toolkit (NLTK)

    Python library used for natural language processing tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, and sentiment analysis.
  • Python - Anaconda

    Specific libraries/tools such as NLTK (Natural Language Toolkit), spaCy, gensim, and TextBlob can be used for text classification, sentiment analysis, named entity recognition, and topic modeling.
  • SpaCy

    Python library for advanced natural language processing (NLP) tasks; industrial-strength toolset for processing large volumes of text data, with features such as named entity recognition, dependency parsing, and tokenization.
  • Stanza

    For working with human language data, including tokenization, part-of-speech tagging, named entity recognition, dependency parsing, sentiment analysis, text classification, and machine translation; has pre-trained models optimized for speed and accuracy.
  • Spyder

    An environment for Python incorporating advanced editing, data visualization, and data exploration.
  • TextBlob

    Python library for processing textual data; covers NLP tasks such as sentiment analysis, part-of-speech tagging, and named entity recognition. Built on top of the NLTK library, it provides a simple, easy-to-use interface for common NLP tasks.
  • Adobe Acrobat Pro

    Provides a range of tools for analyzing PDF content, including the ability to search for specific words or phrases, extract text and data, and create custom reports and summaries, which can be useful for text analysis.
  • OCRopus

    Includes a range of tools for analyzing and processing text data, such as language identification, text segmentation, and text normalization.
  • OpenRefine

    For cleaning and transforming messy or inconsistent data; supports a variety of data formats, including CSV, Excel, and JSON; can handle large datasets with ease. Can also be used to prepare data for analysis in R or Python.
  • Python - Anaconda

    Specific libraries/tools such as 'Pandas' and 'NumPy' for data manipulation and analysis, 'Matplotlib' and 'Seaborn' to create visualizations and plots, 'Jupyter Notebook' for interactive data analysis and visualization.
  • R/RStudio

    For statistical computing and data analysis. Also provides libraries such as tm (Text Mining), quanteda, and tidytext for tokenization, stemming, sentiment analysis, and topic modeling, ggplot2 to create plots and visualizations to explore and analyze text data.
  • Voyant Tools

    For analyzing text, such as word frequency analysis, keyword extraction, and collocation analysis. Can also identify trends and patterns in text data, such as sentiment analysis and topic modeling.
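
Much of the interactive clean-up that OpenRefine performs can also be scripted in Python before analysis; as a rough sketch (the column name and messy values here are invented for illustration), normalizing an inconsistent field might look like:

```python
import csv
import io


def normalize_name(value: str) -> str:
    """Trim stray whitespace and unify casing so variant spellings cluster together."""
    return " ".join(value.split()).title()


# A stand-in for a messy exported spreadsheet (invented example data).
messy = "author\n herman   melville \nHERMAN MELVILLE\n"
rows = [{field: normalize_name(v) for field, v in row.items()}
        for row in csv.DictReader(io.StringIO(messy))]
print(rows)  # both variant spellings collapse to "Herman Melville"
```

Consolidating variants like this before counting or topic modeling keeps the same entity from being split across several spellings.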

Other Methods

  • With this software, you can create interactive web maps from imported data and visualize patterns in the data in a geographical context.
  • Google Earth’s project mode aids in the creation of presentations of geographical data.
  • Input data from text, images, or video to create visual presentations in graph, table, and chart forms.
  • A desktop environment for iterative analysis that enables you to see the process of different algorithms as they are run.
  • Geospatial visualization and analysis software.
  • Create and share appealing data visualizations online.
  • Software with a vast array of built-in functions, with capabilities to visualize data, handle geographical information, and even perform natural language processing.
  • Adobe’s PDF reader with text editing and tools to convert to and from other formats.
  • Adobe’s asset manager enables metadata editing and management of collections.
  • An open-source e-book manager that allows for edits and markup and supports a wide range of text types.
  • Adobe’s webpage creation application, including tools to learn HTML coding and templates.
  • An open-source Java development environment that allows for numerous plug-ins.
  • Adobe’s vector imaging application for creating logos, web graphics, and other elements of UX design.
  • InDesign: Adobe’s layout tool for publications. Can be used to create and publish ebooks, magazines, and PDFs.
  • Adobe’s photo touch-up software may be applied to the scans created in the studio to enhance clarity for OCR software.
  • Adobe’s media encoder allows you to input audio and video files and output them in a variety of formats.
  • An advanced photo editing and image creation software that can be applied to scans to make them more legible or easier to interpret for OCR software.
  • Professional video editing software that can be applied in creating media for digital scholarship presentations.
  • Open-source file management software for Windows that enables file transfer between a computer and a remote server.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.