Below is an alphabetical list of UT-licensed e-resources that offer textual analysis options for UT-affiliated students, faculty and staff. Contact me to discuss options or speak with your subject liaison librarian about your textual analysis research projects and needs. For more information, please see:
Accessible Archives databases
See webinar http://www.accessible-archives.com/webinars/text-data-mining/ for an introduction. Currently UT has access to Accessible Archives content: https://guides.lib.utexas.edu/db/aatrial.
An example project with this content is NCSU Libraries' partnership with Accessible Archives to enable the creation of the Nineteenth-Century Newspaper Analytics project.
Adam Matthew Databases | AM Explorer
Adam Matthew is a provider of unique primary source digital collections that are readily available for mining. Researchers may contact Adam Matthew directly at info@amdigital.co.uk to discuss data mining requests. For more information, see http://www.amdigital.co.uk/files/amdigital/data-mining.pdf and the Adam Matthew API.
Early English Books Online EEBO Text Creation Partnership (TCP)
The Text Creation Partnership creates standardized XML/SGML encoded electronic text editions of early print books by transcribing and marking up the text from the millions of page images in ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints. This work, and the resulting text files, are jointly funded and owned by more than 150 libraries worldwide. All of the TCP's work will be released into the public domain for anyone to use.
EEBO-TCP Phase I: 25,000 texts freely available for anyone to use without restriction to text mine, modify or share with others. You can download the files from box.com here. The readme.txt file describes the file formats.
EEBO-TCP Phase II (ongoing): More than 30,000 texts created. Access to and use of these texts are subject to restrictions specified in UTL's license. You may download these files for local use, but you may not share or redistribute them to users at non-TCP partner institutions without permission.
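Once a TCP file has been downloaded, a first pass at textual analysis can be as simple as counting words in the transcribed text. The following is a minimal sketch, not the TCP's own tooling: it assumes the file is well-formed XML using the TEI P5 namespace (consult readme.txt for the actual formats; SGML files would need conversion first), and the sample document and function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the file uses the TEI P5 namespace. Files in other formats
# fall back to an un-namespaced <text> element, then to the whole document.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def body_text(xml_string):
    """Return the transcribed text with markup removed and whitespace collapsed."""
    root = ET.fromstring(xml_string)
    node = root.find(f".//{TEI_NS}text")
    if node is None:
        node = root.find(".//text")
    if node is None:
        node = root
    return " ".join("".join(node.itertext()).split())


def word_counts(text):
    """Count lowercase alphabetic tokens; early modern spellings are left unnormalized."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical miniature document standing in for a downloaded TCP file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>The Quene and the <hi>Kynge</hi> and the court.</p></body></text>
</TEI>"""

counts = word_counts(body_text(SAMPLE))
print(counts.most_common(2))
```

Restricting the count to the text element keeps header metadata such as the title out of the frequencies. Remember that Phase II files remain subject to the license restrictions described above.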
Collections currently available to the UT Community include:
GALE Cengage Forthcoming: Digital Scholarship Lab
Primary source databases can be searched simultaneously in a single cross-search interface with Term Frequency and Term Clusters features. Available as Artemis Primary Sources on the UT Libraries A-Z eresource list, these include 17th and 18th Century Burney Collection Newspapers, Eighteenth Century Collections Online (ECCO), Nineteenth Century Collections Online (NCCO), Nineteenth (19th) Century U.S. Newspapers and others.
For the cost of a hard drive, and with a license addendum, Gale will deliver a hard drive with the XML from each of the resources that the library requests. Gale will not provide this drive to an individual user.
Data Mining the Gale Digital Collections Frequently Asked Questions
Gale Data Mining and Textual Analytics
Using Term Frequency & Term Clusters
HathiTrust makes collections of works available for research purposes, including the public domain works digitized by Google in the Google Books project. See HathiTrust datasets for more information about the process of establishing research access. The HathiTrust Research Center supports researchers using text and data mining (TDM) computation to plumb the HathiTrust collection by developing innovative tools and infrastructure. To learn more about their services, support, and community, visit their website.
JSTOR Data for Research (DfR)
This service, freely available to the public, provides text and data mining tools for selecting and interacting with the content in JSTOR. The tools include faceted searching, topic modeling, and data visualization. Researchers can obtain, view and bulk download document-level datasets, including word frequencies, citations, key terms and ngrams. JSTOR will work with you individually to tailor datasets to your needs. For more information, see the Data for Research FAQ
See also:
JSTORr, a package of simple functions in R to work with DFR output.
JSTOR's Text Analyzer, a reverse search engine that analyzes documents that you upload (your own, or other articles) to find related materials in JSTOR.
NexisUni (recently rebranded from Lexis-Nexis Academic)
Text files can be downloaded in batches of up to 500 articles at a time to build a corpus for text mining. Some restrictions apply to automated search techniques; refer to the LexisNexis Terms & Conditions, clause 2.2 for more information. “2.2 Use of the Online Services via mechanical, programmatic, robotic, scripted or any other automated means is strictly prohibited. Unless otherwise agreed to by LN in writing, use of the Online Services is permitted only via manually conducted, discrete, individual search and retrieval activities.”
Oxford University Press grants research access to the Oxford English Corpus for academic projects that can demonstrate a strong practical need for this data. To apply for research access to the Corpus, fill out and email this application form. See also the Oxford English Corpus Sketch Engine Documentation.
Oxford Scholarship Online:
OUP offers consultation with a technical project manager to assist in planning your project. To request a consultant for your TDM project, please email Data.Mining@oup.com
ProQuest
Certain slices of ProQuest History Vault and other ProQuest databases outlined below can be acquired on a hard drive as structured XML files that contain basic metadata fields (newspaper title, date, page number, article type) and fully searchable OCR text, as well as page images (if applicable). Please note that delivery takes a minimum of 6 months.
These include:
Austin American Statesman (1871-1925)
Chicago Defender (1877-1936)
ProQuest Congressional Legislative and Executive:
ProQuest DNSA Digital National Security Archive sections:
EEBO: Early English Books Online 1473-1700 (note other access points above)
Entertainment Industry Magazine Archive (EIMA) (1880-2000)
Los Angeles Times (1877-1936)
Patrologia Latina (200-1216)
Times of India (1838-2005)
UK Parliamentary Papers (1688-2005)
Vogue Archive (1892-)
Permission to text mine Readex resources requires an additional form providing a detailed explanation of the related project. UTL’s Readex primary source collections include:
Elsevier automatically enables researchers at subscribing institutions to text mine for non-commercial research purposes and to gain access to full text content in XML for this purpose. Researchers can obtain a personal API key via the developers portal, which requires them to self-register first. For access to data not available through the API, researchers must contact Elsevier directly at integrationsupport@elsevier.com to negotiate.
To get an API key:
Go to the "My Projects" page on the Elsevier developer portal. Log in with your ScienceDirect username or create a new profile (a separate account from your EID and password must be created). On the "My Projects" page, click on "Register a New Text Mining Project." Enter a project name and description and accept the text mining user agreement. You will now see your newly registered project listed under "My Text Mining Projects". Click "View API Key" to get your API key.
For further instructions, see: Text Mining Policies, Technical Documentation for Text Mining, Developer's Portal starting page, Text and Datamining FAQs, Text and Data Mining page on Elsevier.com, Elsevier User Support for Developers.
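With an API key in hand, a full-text request is usually a single HTTP call carrying the key in a header. The sketch below only builds the request and does not send it. The endpoint path and the X-ELS-APIKey header name are assumptions based on Elsevier's public developer documentation and should be checked against the Technical Documentation for Text Mining; the DOI and key are placeholders.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: article retrieval endpoint keyed by DOI; verify in Elsevier's docs.
API_BASE = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key):
    """Build (but do not send) a request for one article's full text as XML."""
    url = API_BASE + quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the personal API key
        "Accept": "text/xml",     # ask for XML rather than the default format
    }
    return Request(url, headers=headers)


# Placeholder DOI and key, for illustration only.
req = build_article_request("10.1016/j.example.2020.01.001", "YOUR-API-KEY")
print(req.full_url)
```

Passing the request to urllib.request.urlopen from a campus network would perform the download. Keep the key out of shared code and pause between requests so that usage stays within the text mining user agreement.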
TDM rights, for non-commercial research, are now included in new and renewed subscription agreements. "Individual researchers are encouraged to download subscription and open access content for TDM purposes directly from the SpringerLink platform. No registration or API key is required. Full-text content can be accessed easily and programmatically at friendly URLs based on the content’s Digital Object Identifier (DOI)." via Springer's Text and Data Mining Policy.
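Because no key is needed, building a download list for SpringerLink comes down to turning DOIs into URLs. The sketch below illustrates the idea; the PDF URL pattern is an assumption to confirm against Springer's Text and Data Mining Policy, the doi.org resolver address is the standard DOI link, and the sample DOI is a placeholder.

```python
from urllib.parse import quote


def springer_urls(doi):
    """Return candidate URLs for one DOI; the 'pdf' pattern is an assumption."""
    safe_doi = quote(doi, safe="/")
    return {
        "landing": f"https://doi.org/{safe_doi}",
        "pdf": f"https://link.springer.com/content/pdf/{safe_doi}.pdf",
    }


# Placeholder DOI, for illustration only.
urls = springer_urls("10.1007/s00000-000-0000-0")
for kind, url in urls.items():
    print(kind, url)
```

When fetching many items, add a delay between requests and download only content covered by the subscription or published open access.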
"In order to maximize platform stability and security for all users, Wiley asks that access to content for TDM purposes takes place through an approved API service, rather than through crawling Wiley Online Library. Our preferred access solution for TDM is the Crossref Text and Data Mining Service. Academic subscribers can register with Crossref and will then be able to access subscribed content once they have accepted the Wiley click-through TDM license and received an API token. Further details on the Crossref Text and Data Mining Service at http://tdmsupport.crossref.org/researchers/." See also Wiley’s Text and Data Mining Policy and Text and Data Mining Agreement.
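Crossref metadata records list full-text links that publishers have flagged for text mining, which is how a script typically finds what to download. The sketch below filters such links out of a record shaped like the "message" object returned by the Crossref REST API for a single work. The sample record is hypothetical, and the step of sending the API token with the download is omitted because the required header depends on the publisher's current documentation.

```python
def text_mining_links(work):
    """Return (content type, URL) pairs flagged for text mining in a Crossref work record."""
    return [
        (link.get("content-type", "unspecified"), link["URL"])
        for link in work.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


# Hypothetical record; real ones come from https://api.crossref.org/works/<DOI>.
SAMPLE_WORK = {
    "DOI": "10.1002/example.12345",
    "link": [
        {
            "URL": "https://example.org/fulltext.pdf",
            "content-type": "application/pdf",
            "content-version": "vor",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://example.org/landing",
            "content-type": "text/html",
            "content-version": "vor",
            "intended-application": "similarity-checking",
        },
    ],
}

links = text_mining_links(SAMPLE_WORK)
print(links)
```

Records without a "link" field simply yield an empty list, so the function can be mapped over a whole batch of DOIs.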
This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.