University of Texas Libraries

Linguistics

Corpus Linguistics Resources

Resources for Working with Corpora

For a useful overview, please see the UT Libraries Digital Humanities LibGuide

UTL has licensed access to:

BYU Corpus Data 

Corpus of Historical American English (COHA) and Corpus of Global Web-Based English (GloWbE). Additional information.

HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library.  Helpful guide from UCB. 

Linguistic Data Consortium: see the LDC LibGuide for specifics on access.

Additional eresources for Textual Analysis: UTL-licensed content with options supporting textual analysis projects.

 

See also:

The Caselaw Access Project makes 360 years of case law freely available online as a machine-readable text corpus, digitized from the collections of the Harvard Law School Library. Here are some ways to access the data.

Here are some of the ways people have been using the data.

 

Chinese-English Parallel Corpora: TranslateFX researchers and linguists have developed these corpora, which consist of aligned sentence pairs from quality bilingual texts covering the financial and legal domains in Hong Kong.

Data is Plural: "useful/curious" datasets; sign up for the email digest.

Digitized Archives from Digital Libraries - from UIUC 

Digital Humanities Libguide-Datasets: humanities, social sciences, and government datasets.

Open GLAM: OA data from GLAM institutions (galleries, libraries, archives, museums)

Social Media corpora:

Reddit APIs: Access data from posts, threads, comments, and users across Reddit and its subreddits.

Social Sciences Data LibGuide: subscription and free social science data resources

Scraping Twitter Libguide

Twitter Streaming APIs: public streams provide access to the public data flowing through Twitter.

Yelp Fusion API: Access to business data, including location, photos, Yelp rating, price, hours, and types of transactions.

Sociocultural Anthropology and Linguistics Liaison

Susan Macicak
Contact:
PCL 3.316 S5482
Austin, TX 78712
(512) 495-4335

Library Support

We can provide:

  • This research guide with resources, contact information, and other details
  • Consultation with librarians 
  • Annual payment for membership in HathiTrust and LDC, plus some one-time purchases of datasets, depending on funding availability
  • Negotiation with vendors to include a general text mining provision in licenses for library resources

Alas, we are not able to provide:

  • Storage other than UTBox for vendor data 
  • Licensing for individual text mining projects. In most cases, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
  • Guarantees on enforcing user behavior and handling of vendor data

This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.