Linguistic Data Consortium: LDC libguide for specifics on access.
Chinese-English Parallel Corpora: TranslateFX researchers and linguists developed these corpora, which consist of aligned sentence pairs from quality bilingual texts covering the financial and legal domains in Hong Kong.
Data is Plural: "useful/curious" datasets; sign up for the email digest.
Digitized Archives from Digital Libraries - from UIUC
Digital Humanities Libguide-Datasets: humanities, social sciences, and government datasets.
glottobank: five global databases documenting variation in language structure.
HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library. Helpful guide from UCB.
Social Media corpora:
reddit APIs: Access data from posts, threads, comments, and users from reddit and subreddits.
Social Sciences Data: subscription and free social science data resources
Twitter Streaming APIs: public streams provide access to public data flowing through Twitter.
US Census Language Data: Data on language use tied to location in the United States
UT College of Liberal Arts Linguistics Research Center offers resources dedicated to the Indo-European language family, such as the online Indo-European Lexicon (IELEX).
UW Libguide for Computational Linguistics
Yelp Fusion API: Access to business data, including location, photos, Yelp rating, price, hours, types of transactions.
We can provide:
Alas, we are not able to provide:
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.