University of Texas Libraries

Linguistics

Corpus Linguistics Resources

Resources for Working with Corpora

For a useful overview, please see the Digital Humanities tools and resources guide. Other starting points:

HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library. See also the helpful guide from UCB.

Linguistic Data Consortium: see the LDC libguide for specifics on access.

UT College of Liberal Arts Linguistics Research Center offers resources dedicated to the Indo-European language family, such as the online Indo-European Lexicon (IELEX).

See also:

The Caselaw Access Project makes 360 years of case law freely available online as a machine-readable text corpus, digitized from the collections of the Harvard Law School Library. The project offers several ways to access the data.

Chinese-English Parallel Corpora: TranslateFX researchers and linguists have developed these corpora, composed of aligned sentence pairs from quality bilingual texts, covering the financial and legal domains in Hong Kong.

Data is Plural: "useful/curious" datasets, sign up for email digest

Digitized Archives from Digital Libraries - from UIUC 

Digital Humanities Libguide-Datasets: humanities, social sciences, and government datasets.

Glottobank: five global databases documenting variation in language structure.

Open GLAM: OA data from GLAM institutions (galleries, libraries, archives, museums)

Social Media corpora:

reddit APIs: Access data from posts, threads, comments, and users from reddit and subreddits.

Social Sciences Data: subscription and free social science data resources

Scraping Twitter Libguide

Twitter Streaming APIs: public streams provide access to the public data flowing through Twitter.

Yelp Fusion API: Access to business data, including location, photos, Yelp rating, price, hours, types of transactions.   
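For researchers new to working with API data, here is a minimal sketch of how posts retrieved from a social media API might be turned into a small text corpus and a word-frequency list. It assumes the data has already been downloaded and parsed as JSON in the nested "Listing" shape that reddit's endpoints return. The sample posts are invented for illustration, and the function names `listing_to_documents` and `word_frequencies` are hypothetical helpers, not part of any API.

```python
import re
from collections import Counter

# Invented sample data, shaped like a reddit "Listing" response (assumption):
# posts sit under data -> children, each with its own "data" dictionary.
SAMPLE_LISTING = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "Corpus tools for beginners",
                                    "selftext": "Which corpus tools do you use?"}},
            {"kind": "t3", "data": {"title": "Tokenizing tweets",
                                    "selftext": ""}},
        ]
    },
}


def listing_to_documents(listing):
    """Return one text string per post: the title plus the body, if any."""
    docs = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        parts = [post.get("title", ""), post.get("selftext", "")]
        text = " ".join(part for part in parts if part).strip()
        if text:
            docs.append(text)
    return docs


def word_frequencies(docs):
    """Count lowercase word tokens across all documents (a very simple tokenizer)."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts


docs = listing_to_documents(SAMPLE_LISTING)
freqs = word_frequencies(docs)
print(freqs.most_common(3))
```

The regular-expression tokenizer is deliberately simple and handles only unaccented Latin letters. Real projects would likely substitute a proper tokenizer and should check each platform's terms of use before collecting data.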

Sociocultural Anthropology and Linguistics Liaison

Susan Macicak
she/her
Contact:
PCL 3.316 S5482
Austin, TX 78712
(512) 495-4335

Library Support

We can provide:

  • This research guide with resources, contact information, other details
  • Consultation with librarians 
  • Annual payment for membership in HathiTrust and LDC, and some one-time purchases of datasets, depending on funding availability
  • Negotiation with vendors to include a general text mining provision in licenses for library resources

Alas, we are not able to provide:

  • Storage other than UTBox for vendor data 
  • Licensing for individual text mining projects. In most cases, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
  • Guarantees on enforcing user behavior and handling of vendor data

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.