University of Texas Libraries

Linguistics

Corpus Linguistics Resources

Linguistic corpora and related resources

Linguistic Data Consortium: see the LDC libguide for specifics on access.

Chinese-English Parallel Corpora: developed by TranslateFX researchers and linguists, these corpora consist of aligned sentence pairs from quality bilingual texts covering the financial and legal domains in Hong Kong.

Data is Plural: "useful/curious" datasets, sign up for email digest

Digitized Archives from Digital Libraries - from UIUC 

Digital Humanities Libguide-Datasets: humanities, social sciences, and government datasets.

Glottobank: five global databases documenting variation in language structure.

HathiTrust Research Center (HTRC): provides computational research access to the HathiTrust Digital Library. See also the helpful guide from UCB.

Open GLAM: OA data from GLAM institutions (galleries, libraries, archives, museums)

Social Media corpora:

Reddit APIs: access data from posts, threads, comments, and users across Reddit and its subreddits.

Social Sciences Data: subscription and free social science data resources

Scraping Twitter Libguide

Twitter Streaming APIs: public streams provide access to public data flowing through Twitter.

US Census Language Data: data on language use tied to location in the United States

UT College of Liberal Arts Linguistics Research Center: offers resources dedicated to the Indo-European language family, such as the online Indo-European Lexicon (IELEX).

UW Libguide for Computational Linguistics

Yelp Fusion API: access to business data, including location, photos, Yelp rating, price, hours, and types of transactions.

Sociocultural Anthropology and Linguistics Liaison

Susan Macicak
she/her
Contact:
PCL 3.316 S5482
Austin, TX 78712
(512) 495-4335

Library Support

We can provide:

  • This research guide, with resources, contact information, and other details
  • Consultation with librarians 
  • Annual payment for membership in HathiTrust and LDC, and some one-time purchases of datasets, depending on funding availability
  • Negotiation with vendors to include a general text-mining provision in licenses for library resources

Alas, we are not able to provide:

  • Storage other than UTBox for vendor data 
  • Licensing for individual text-mining projects. In most cases, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
  • Guarantees on enforcing user behavior and handling of vendor data

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 Generic License.