University of Texas Libraries


Corpus Linguistics Resources

Resources for Working with Corpora

For a useful overview, please see the Digital Humanities tools and resources guide. Other starting points:

HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library.  Helpful guide from UCB. 

Linguistic Data Consortium (LDC): see the LDC libguide for specifics on access.

UT College of Liberal Arts Linguistics Research Center offers resources dedicated to the Indo-European language family, such as the online Indo-European Lexicon (IELEX).

See also:

The Caselaw Access Project, making 360 years of case law freely available online as a machine-readable text corpus, digitized from the collections of the Harvard Law School Library. Here are some ways to access the data, including:

Chinese-English Parallel Corpora: TranslateFX researchers and linguists have developed these corpora, composed of aligned sentence pairs from quality bilingual texts, covering the financial and legal domains in Hong Kong.

Data is Plural: "useful/curious" datasets, sign up for email digest

Digitized Archives from Digital Libraries - from UIUC 

Digital Humanities Libguide-Datasets: humanities, social sciences, and government datasets.

Glottobank: five global databases documenting variation in language structure.

Open GLAM: OA data from GLAM institutions (galleries, libraries, archives, museums)

Social Media corpora:

reddit APIs: Access data from posts, threads, comments, and users from reddit and subreddits.

Social Sciences Data: subscription and free social science data resources

Scraping Twitter Libguide

Twitter Streaming APIs: public streams provide public access to public data flowing through Twitter.

Yelp Fusion API: Access to business data, including location, photos, Yelp rating, price, hours, types of transactions.   

Sociocultural Anthropology and Linguistics Liaison

Susan Macicak
PCL 3.316 S5482
Austin, TX 78712
(512) 495-4335

Library Support

We can provide:

  • This research guide with resources, contact information, other details
  • Consultation with librarians 
  • Annual payment for membership in HathiTrust and LDC, and some one-time purchases of datasets, depending on funding availability
  • Negotiation with vendors to include general text mining provision in licenses for library resources

Alas, we are not able to provide:

  • Storage other than UTBox for vendor data 
  • Licensing for individual text mining projects. In most cases, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
  • Guarantees on enforcing user behavior and handling of vendor data

This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.