For a useful overview, please see the Digital Humanities tools and resources guide. Other starting points:
HathiTrust Research Center (HTRC) provides computational research access to the HathiTrust Digital Library. Helpful guide from UCB.
Linguistic Data Consortium: see the LDC libguide for specifics on access.
UT College of Liberal Arts Linguistics Research Center offers resources dedicated to the Indo-European language family, such as the online Indo-European Lexicon (IELEX).
See also:
The Caselaw Access Project makes 360 years of case law freely available online as a machine-readable text corpus, digitized from the collections of the Harvard Law School Library. Here are some ways to access the data.
Chinese-English Parallel Corpora: TranslateFX researchers and linguists have developed these corpora, composed of aligned sentence pairs from quality bilingual texts, covering the financial and legal domains in Hong Kong.
Data is Plural: "useful/curious" datasets; sign up for the email digest.
Digitized Archives from Digital Libraries - from UIUC
Digital Humanities Libguide-Datasets: humanities, social sciences, and government datasets.
glottobank: five global databases documenting variation in language structure.
Social Media corpora:
reddit APIs: Access data from posts, threads, comments, and users from reddit and subreddits.
Social Sciences Data: subscription and free social science data resources
Twitter Streaming APIs: public streams provide access to public data flowing through Twitter.
Yelp Fusion API: Access to business data, including location, photos, Yelp rating, price, hours, types of transactions.
We can provide:
Alas, we are not able to provide:
This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.