University of Texas Libraries

Digital Humanities Workshops @PCL

Schedule and course content from the Digital Humanities Workshops @PCL series

Optical Character Recognition (OCR) for Non-Roman Texts

Workshop Description

Attendees will be introduced to the basics of optical character recognition (OCR), which allows for full-text searching and other kinds of text manipulation in a digitized document, with a particular focus on OCR for materials in languages other than English and in scripts other than Roman/Latin. OCR is commonplace for English and other Roman-script languages such as French or Spanish, but it works less reliably for languages such as Arabic, Hindi, or Chinese. This workshop will be an opportunity to explore an open source OCR tool (Kraken) that has demonstrated success with some non-Roman scripts. The workshop will look at a few different non-Roman scripts; participants are also encouraged to bring a digitized, highly legible sample text of interest.

This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.