LibGuides: Linguistics: Corpus Linguistics Resources

Linguistic corpora and related resources

Linguistic Data Consortium: see the LDC LibGuide for specifics on access.

Chinese-English Parallel Corpora: developed by TranslateFX researchers and linguists, these corpora consist of aligned sentence pairs from high-quality bilingual texts covering the financial and legal domains in Hong Kong.

Data is Plural: "useful/curious" datasets; sign up for the email digest

Digitized Archives from Digital Libraries - from UIUC

Digital Humanities Libguide-Datasets: humanities, social sciences, and government datasets.

Glottobank: five global databases documenting variation in language structure.

HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library. Helpful guide from UCB.

Open GLAM: OA data from GLAM institutions (galleries, libraries, archives, museums)

Social Media corpora:

reddit APIs: Access data from posts, threads, comments, and users from reddit and subreddits.

Social Sciences Data: subscription and free social science data resources

Scraping Twitter Libguide

Twitter Streaming APIs: public streams provide access to the public data flowing through Twitter.

US Census Language Data: Data on language use tied to location in the United States

UT College of Liberal Arts Linguistics Research Center offers resources dedicated to the Indo-European language family, such as the online Indo-European Lexicon (IELEX).

UW Libguide for Computational Linguistics

Yelp Fusion API: Access to business data, including location, photos, Yelp rating, price, hours, types of transactions.

Librarian

Ian Goodale

Contact:

Office: PCL 2.312L

(512) 495-4226

Subjects: Czech Literature, East European Studies, European Studies, French Studies, Iberian Studies, Italian Studies, Linguistics, Russian Literature, Slavic and Eurasian Studies

Library Support

We can provide:

This research guide with resources, contact information, and other details
Consultation with librarians
Annual payment for membership in HathiTrust and LDC, and some one-time purchases of datasets, depending on funding availability
Negotiation with vendors to include a general text-mining provision in licenses for library resources

Alas, we are not able to provide:

Storage other than UTBox for vendor data
Licensing for individual text-mining projects. In most cases, the researcher needs to negotiate the license directly with the vendor, unless the vendor requests an addendum to the library-wide license
Guarantees on enforcing user behavior and handling of vendor data