University of Texas Libraries


Corpus Linguistics Resources

Resources for Working with Corpora

For a useful overview, please see the UT Libraries Digital Humanities LibGuide

UTL has licensed access to:

BYU Corpus Data 

Corpus of Historical American English (COHA) and Corpus of Global Web-Based English (GloWbE). Additional information.

HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library.  Helpful guide from UCB. 

Linguistic Data Consortium (LDC): see the LDC LibGuide for specifics on access.

Additional e-resources for Textual Analysis: UTL-licensed content with options that support textual analysis projects.


See also:

The Caselaw Access Project makes 360 years of case law freely available online as a machine-readable text corpus, digitized from the collections of the Harvard Law School Library. Here are some ways to access the data.

Here are some of the ways people have been using the data 


Chinese-English Parallel Corpora: TranslateFX researchers and linguists have developed these corpora, composed of aligned sentence pairs from quality bilingual texts, covering the financial and legal domains in Hong Kong.

Data is Plural: "useful/curious" datasets; sign up for the email digest.

Digitized Archives from Digital Libraries - from UIUC 

Digital Humanities Libguide-Datasets: humanities, social sciences, and government datasets.

Open GLAM: OA data from GLAM institutions (galleries, libraries, archives, museums)

Social Media corpora:

reddit APIs: Access data from posts, threads, comments, and users from reddit and subreddits.

Social Sciences Data LibGuide: subscription and free social science data resources

Scraping Twitter Libguide

Twitter Streaming APIs: public streams provide public access to public data flowing through Twitter.

Yelp Fusion API: Access to business data, including location, photos, Yelp rating, price, hours, types of transactions.   

Sociocultural Anthropology and Linguistics Liaison

Susan Macicak
PCL 3.316 S5482
Austin, TX 78712
(512) 495-4335

Library Support

We can provide:

  • This research guide with resources, contact information, other details
  • Consultation with librarians 
  • Annual payment for membership in HathiTrust and LDC, and some one-time purchases of datasets, depending on funding availability
  • Negotiation with vendors to include a general text-mining provision in licenses for library resources

Alas, we are not able to provide:

  • Storage other than UTBox for vendor data 
  • Licensing for individual text-mining projects. In most cases, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
  • Guarantees on enforcing user behavior and handling of vendor data

This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.