For a useful overview, please see the Digital Humanities tools and resources guide. Other starting points:
Caselaw Access Project: 360 years of case law made freely available online as a machine-readable text corpus, digitized from the collections of the Harvard Law School Library. The project provides several ways to access the data.
Chinese-English Parallel Corpora: developed by TranslateFX researchers and linguists, these corpora are composed of aligned sentence pairs from quality bilingual texts covering the financial and legal domains in Hong Kong.
Digitized Archives from Digital Libraries - from UIUC
Digital Humanities Libguide-Datasets: humanities, social sciences, and government datasets.
glottobank: five global databases documenting variation in language structure.
Social Media corpora:
Reddit APIs: Access data from posts, threads, comments, and users across Reddit and its subreddits.
Social Sciences Data: subscription and free social science data resources
Twitter Streaming APIs: public streams provide access to the public data flowing through Twitter.
Yelp Fusion API: Access to business data, including location, photos, Yelp rating, price, hours, types of transactions.
This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.