Linguistics

Resources for Textual Analysis

Licensed Electronic Resources Supporting Textual Analysis

Below is an alphabetical list of UT-licensed e-resources that offer textual analysis options for UT-affiliated students, faculty, and staff. Contact me to discuss options, or speak with your subject liaison librarian about your textual analysis research projects and needs.


Accessible Archives databases 

See the webinar at http://www.accessible-archives.com/webinars/text-data-mining/ for an introduction. UT currently has access to Accessible Archives content: https://guides.lib.utexas.edu/db/aatrial

An example project with this content is NCSU Libraries' partnership with Accessible Archives to enable the creation of the Nineteenth-Century Newspaper Analytics project.

Adam Matthew Databases | AM Explorer

Adam Matthew is a provider of unique primary source digital collections that are readily available for mining. Researchers may contact Adam Matthew directly at info@amdigital.co.uk to discuss data mining requests. For more information see: http://www.amdigital.co.uk/files/amdigital/data-mining.pdf and Adam Matthew API

Early English Books Online (EEBO) Text Creation Partnership (TCP)

The Text Creation Partnership (TCP) creates standardized XML/SGML-encoded electronic text editions of early printed books. It transcribes and marks up the text from the millions of page images in ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints. This work, and the resulting text files, are jointly funded and owned by more than 150 libraries worldwide. All of the TCP's work will be released into the public domain for anyone to use.

EEBO-TCP Phase I: 25,000 texts freely available for anyone to use without restriction to text mine, modify, or share with others. You can download the files from box.com. The readme.txt file describes the file formats.
EEBO-TCP Phase II (ongoing): More than 30,000 texts created. Access to and use of these texts are subject to restrictions specified in UTL's license. You may download these files for local use, but you may not share or redistribute them to users at non-TCP partner institutions without permission.

Collections currently available to the UT Community include:

  • ECCO-TCP Full text available to everyone
  • EEBO-TCP 25,000 texts available to everyone + 35,000 texts available only to EEBO-TCP partners (UTL is a funding partner for Phases I and II)
  • Evans-TCP Full text available to everyone
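
As a starting point for working with the downloaded TCP files, the sketch below pulls the running text out of one XML document and counts word frequencies. It assumes a well-formed TEI-style XML file with a top-level `<text>` element; the embedded sample is invented for illustration, and older SGML files would need conversion before this would work.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample standing in for one downloaded file (assumption: TEI-style XML).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A sample tract</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Of the Truth, and of the love of Truth.</p></body></text>
</TEI>"""


def local_name(tag):
    """Strip any XML namespace so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def body_text(xml_string):
    """Return the text inside the first <text> element, skipping the header."""
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        if local_name(elem.tag) == "text":
            # Join all nested text nodes and normalize whitespace.
            return " ".join("".join(elem.itertext()).split())
    return ""


def word_counts(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


counts = word_counts(body_text(SAMPLE))
print(counts.most_common(3))
```

To use it on a real file, read the file contents into a string and pass that to `body_text`. Early modern spelling variation is not handled here; tokens are counted exactly as transcribed.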

EBSCOhost API FAQ

Gale Cengage (forthcoming: Digital Scholarship Lab)

Primary source databases can be searched simultaneously in a single cross-search interface with Term Frequency and Term Clusters features. Available as Artemis Primary Sources on the UT Libraries A-Z e-resource list, these include 17th and 18th Century Burney Collection Newspapers, Eighteenth Century Collections Online (ECCO), Nineteenth Century Collections Online (NCCO), Nineteenth (19th) Century U.S. Newspapers, and others.

For the cost of a hard drive, and with a license addendum, Gale will deliver a hard drive with the XML from each of the resources that the library requests. Gale will not provide this drive to an individual user.

Data Mining the Gale Digital Collections Frequently Asked Questions  

Gale Data Mining and Textual Analytics

Using Term Frequency & Term Clusters  

HathiTrust

HathiTrust makes collections of works available for research purposes, including the public domain works digitized by Google in the Google Books project. See HathiTrust datasets for more information about the process of establishing research access.   The HathiTrust Research Center supports researchers using TDM computation to plumb the HathiTrust collection by developing innovative tools and infrastructure. To learn more about their services, support, and community, visit their website.  

JSTOR Data for Research (DfR)

This service, freely available to the public, provides text and data mining tools for selecting and interacting with the content in JSTOR. The tools include faceted searching, topic modeling, and data visualization. Researchers can obtain, view, and bulk download document-level datasets, including word frequencies, citations, key terms, and n-grams. JSTOR will work with you individually to tailor datasets to your needs. For more information, see the Data for Research FAQ.
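
Document-level word frequencies are usually combined into corpus-wide totals before analysis. The sketch below shows one way to do that, assuming each document arrives as a tab-separated file of `term<TAB>count` rows; that layout is an assumption for illustration, so check the actual DfR download before relying on it.

```python
import csv
import io
from collections import Counter


def read_counts(handle):
    """Read 'term<TAB>count' rows from an open text file into a Counter."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2 or not row[1].strip().isdigit():
            continue  # skip header lines and malformed rows
        counts[row[0].strip().lower()] += int(row[1])
    return counts


def merge_counts(handles):
    """Sum the per-document counts into one corpus-level Counter."""
    total = Counter()
    for handle in handles:
        total.update(read_counts(handle))
    return total


# Two in-memory stand-ins for per-document files (hypothetical data).
doc_a = io.StringIO("language\t12\ncorpus\t5\n")
doc_b = io.StringIO("corpus\t7\nsyntax\t3\n")
totals = merge_counts([doc_a, doc_b])
print(totals.most_common(2))
```

With real downloads, replace the `io.StringIO` objects with files opened via `open(path, newline="", encoding="utf-8")`.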

See also:

JSTORr, a package of simple functions in R to work with DfR output.

JSTOR's Text Analyzer, a reverse search engine that analyzes documents that you upload (your own, or other articles) to find related materials in JSTOR.

Nexis Uni (the recently rebranded LexisNexis Academic)

Text files can be downloaded in batches of up to 500 articles at a time to build a corpus for text mining. Some restrictions apply to automated search techniques; refer to the LexisNexis Terms & Conditions, clause 2.2, for more information: “2.2 Use of the Online Services via mechanical, programmatic, robotic, scripted or any other automated means is strictly prohibited. Unless otherwise agreed to by LN in writing, use of the Online Services is permitted only via manually conducted, discrete, individual search and retrieval activities.”
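
Because downloads are capped at 500 articles per batch and must be carried out by hand, it can help to work out the batch boundaries in advance. The sketch below is only a planning aid: it does the arithmetic and does not contact the service in any way. The function name is hypothetical.

```python
def batch_ranges(total_results, batch_size=500):
    """Return (first, last) result numbers for each manual download batch."""
    if total_results < 0 or batch_size < 1:
        raise ValueError("total_results must be >= 0 and batch_size >= 1")
    return [
        (start, min(start + batch_size - 1, total_results))
        for start in range(1, total_results + 1, batch_size)
    ]


# A search returning 1,234 articles needs three manual downloads.
for first, last in batch_ranges(1234):
    print(f"{first}-{last}")
```

How a range is selected in the interface is left to the researcher; the list simply records which results belong in each download.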

Oxford English Dictionary

Oxford University Press grants research access to the Oxford English Corpus for academic projects that can demonstrate a strong practical need for this data. To apply for research access to the Corpus, fill out and email this application form. See also the Oxford English Corpus Sketch Engine Documentation.

Oxford Scholarship Online

OUP offers consultation with a technical project manager to assist in planning your project. To request a consultant for your TDM project, please email Data.Mining@oup.com

ProQuest

Certain slices of ProQuest History Vault and the other ProQuest databases outlined below can be acquired on a hard drive as structured XML files that contain basic metadata fields (newspaper title, date, page number, article type) and fully searchable OCR text, as well as page images (if applicable). Please note that delivery takes a minimum of 6 months.

These include: 

Austin American Statesman  (1871-1925)

Chicago Defender (1877-1936) 

ProQuest Congressional Legislative and Executive: 

Executive Orders & Presidential Proclamations 1789-2014

Hearings - Part A - 1824-1979

Hearings - Part B - 1980-2003

Hearings - Part C - 2004-2010

Hearings - Part D - 2011

Hearings - Part E - 2012

Hearings - Part F - 2013

Hearings - Part G - 2014

Hearings - Part H - 2015

Historic Digital Bills & Resolutions 1789-2013

Historic Digital Bills & Resolutions 2014

Historic Digital Bills & Resolutions 2015

ProQuest DNSA Digital National Security Archive sections:

  • Afghanistan: The Making of U.S. Policy, 1973–1990 
  • Argentina, 1975-1980: The Making of U.S. Human Rights Policy
  • Chile and the United States: U.S. Policy toward Democracy, Dictatorship, and Human Rights, 1970-1990
  • China and the United States: From Hostility to Engagement, 1960–1998 
  • CIA and Covert Operations: From Carter to Obama
  • CIA Covert Operations, Part II: The Year of Intelligence, 1975
  • Colombia and the United States: Political Violence, Narcotics, and Human Rights, 1948-2010
  • Cuban Missile Crisis Revisited: An International Collection of Documents, from the Bay of Pigs to the Brink of Nuclear War
  • Death Squads, Guerrilla War, Covert Operations, and Genocide: Guatemala and the United States, 1954-1999
  • El Salvador: The Making of U.S. Policy, 1977–1984 
  • El Salvador: War, Peace, and Human Rights, 1980–1994 
  • Electronic Surveillance: From the Cold War to the War on Terror
  • Iran: The Making of U.S. Policy, 1977–1980 
  • Iraqgate: Saddam Hussein, U.S. Policy and the Prelude to the Persian Gulf War, 1980–1994 
  • Japan and the U.S.: Diplomatic, Security, and Economic Relations, Part II: 1977 – 1992 
  • Japan and the United States: Diplomatic, Security, and Economic Relations, 1960–1976 
  • Japan and the United States: Diplomatic, Security, and Economic Relations, Part III, 1961-2000
  • Mexico-United States counternarcotics policy, 1969-2013
  • National Security Agency: Operations and Organization, 1945-2009
  • Nicaragua: The Making of U.S. Policy, 1978–1990 
  • Peru: Human Rights, Drugs, and Diplomacy: 1980-2000
  • Presidential Directives on National Security, Part I:  From Truman to Clinton 
  • Presidential Directives on National Security, Part II:  From Truman to George W. Bush
  • South Africa: The Making of U.S. Policy, 1962–1989 
  • Terrorism and U.S. Policy, 1968-2002 
  • The Berlin Crisis, 1958–1962 
  • The Cuban Missile Crisis, 1962 
  • The Iran-Contra Affair: The Making of a Scandal 
  • The Kissinger Conversations, Supplement II: A Verbatim Record of U.S. Diplomacy, 1969-1977
  • The Kissinger Conversations, Supplement: A Verbatim Record of U.S. Diplomacy, 1969-1977
  • The Kissinger Telephone Conversations: A Verbatim Record of U.S. Diplomacy, 1969-1977 
  • The Kissinger Transcripts: A Verbatim Record of U.S. Diplomacy, 1969-1977 
  • The Philippines: U.S. Policy during the Marcos Years, 1965–1986 
  • The President's Daily Brief: Kennedy, Johnson and the CIA 1961-1969
  • The Soviet Estimate: U.S. Analysis of the Soviet Union, 1947–1991 
  • The U.S. Intelligence Community, 1947–1989 
  • The United States and the Two Koreas, Part II: 1969-2010
  • U.S. Espionage and Intelligence, 1947–1996 
  • U.S. Intelligence on Weapons of Mass Destruction: From World War II to Iraq
  • U.S. Military Uses of Space, 1945–1991 
  • U.S. Nuclear History, 1969-1976: Weapons, Arms Control, and War Plans in an Age of Strategic Parity
  • U.S. Nuclear History: Nuclear Arms and Politics in the Missile Age, 1955–1968 
  • U.S. Nuclear Non-Proliferation Policy, 1945–1991 
  • U.S. Policy in the Vietnam War, Part I: 1954-1968  
  • U.S. Policy in the Vietnam War, Part II: 1969 – 1975   
  • United States and The Two Koreas, 1969-2000
  • US Intelligence and China: Collection, Analysis, and Covert Action
  • US Intelligence Community after 9/11

EEBO: Early English Books Online 1473-1700 (note other access points above)

Entertainment Industry Magazine Archive (EIMA) (1880-2000) 

Los Angeles Times  (1877-1936)

Patrologia Latina (200-1216)

Times of India  (1838-2005)

UK Parliamentary Papers (1688-2005)

Vogue Archive (1892-)

 

Readex Databases 

Permission to text mine Readex resources requires an additional form providing a detailed explanation of the related project. UTL’s Readex primary source collections include:

ScienceDirect / Elsevier

Elsevier automatically enables researchers at subscribing institutions to text mine for non-commercial research purposes and to gain access to full-text content in XML for this purpose. Researchers obtain a personal API key via the developers portal, which requires them to self-register first. For access to data not available through the API, researchers must contact Elsevier directly at integrationsupport@elsevier.com to negotiate access.

To get an API key:

Go to the "My Projects" page on the Elsevier developer portal. Log in with your ScienceDirect username or create a new profile (a separate account from your EID and password must be created). On the "My Projects" page, click on "Register a New Text Mining Project." Enter a project name and description and accept the text mining user agreement. You will now see your newly registered project listed under "My Text Mining Projects". Click "View API Key" to get your API key.
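
Once you have a key, a full-text request can be prepared along the following lines. This is a minimal sketch, not Elsevier's official client: the endpoint path and the `X-ELS-APIKey` header reflect the publicly documented article retrieval API as best understood here and should be checked against the technical documentation. The DOI and key are placeholders, and nothing is sent until you call `urlopen`.

```python
import urllib.request
from urllib.parse import quote

# Assumption: article retrieval endpoint keyed by DOI; verify in Elsevier's docs.
API_BASE = "https://api.elsevier.com/content/article/doi/"


def build_fulltext_request(doi, api_key):
    """Prepare (but do not send) a request for an article's full-text XML."""
    url = API_BASE + quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # personal key from the developer portal
        "Accept": "text/xml",     # ask for XML rather than JSON
    }
    return urllib.request.Request(url, headers=headers)


request = build_fulltext_request("10.1016/j.example.2020.01.001", "YOUR-API-KEY")
print(request.full_url)

# To actually send it from a subscribing institution's network:
# with urllib.request.urlopen(request) as response:
#     xml_bytes = response.read()
```

Keeping the request construction separate from the network call makes it easy to check URLs and headers before any quota is spent.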

For further instructions, see: Text Mining Policies, Technical Documentation for Text Mining, Developer's Portal starting page, Text and Data Mining FAQs, Text and Data Mining page on Elsevier.com, and Elsevier User Support for Developers.

Springer LINK

TDM rights, for non-commercial research, are now included in new and renewed subscription agreements.   "Individual researchers are encouraged to download subscription and open access content for TDM purposes directly from the SpringerLink platform. No registration or API key is required. Full-text content can be accessed easily and programmatically at friendly URLs based on the content’s Digital Object Identifier (DOI)." via Springer's Text and Data Mining Policy.

Wiley Online Library   

"In order to maximize platform stability and security for all users, Wiley asks that access to content for TDM purposes takes place through an approved API service, rather than through crawling Wiley Online Library. Our preferred access solution for TDM is the Crossref Text and Data Mining Service. Academic subscribers can register with Crossref and will then be able to access subscribed content once they have accepted the Wiley click-through TDM license and received an API token.  Further details on the Crossref Text and Data Mining Service at http://tdmsupport.crossref.org/researchers/." See also Wiley’s Text and Data Mining Policy and  Text and Mining Agreement.

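For the Crossref route described above, publishers advertise full-text locations in the `link` field of a work's Crossref metadata record. The sketch below filters such a record for links marked for text mining. The sample record is invented, and the field names (`link`, `URL`, `content-type`, `intended-application`) follow the Crossref REST API as best understood here; fetching the full text itself would additionally require sending your token as described in Crossref's documentation.

```python
def text_mining_links(work_message):
    """Return (content_type, url) pairs that a publisher flagged for TDM."""
    links = work_message.get("link", [])
    return [
        (link.get("content-type", "unspecified"), link["URL"])
        for link in links
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]


# Invented stand-in for the "message" part of a Crossref works response.
sample_message = {
    "DOI": "10.1002/example.12345",
    "link": [
        {
            "URL": "https://example.org/tdm/10.1002/example.12345.pdf",
            "content-type": "application/pdf",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://example.org/check/10.1002/example.12345",
            "content-type": "unspecified",
            "intended-application": "similarity-checking",
        },
    ],
}

for content_type, url in text_mining_links(sample_message):
    print(content_type, url)
```

A real record can be obtained from `https://api.crossref.org/works/` followed by the DOI; the JSON response wraps the record in a `message` object.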
 

Central & Electronic Collections Strategist, Scholarly Resources Licensing and Linguistics Liaison

Susan Macicak
Contact:
PCL 3.316 S5482
Austin, TX 78712
(512) 495-4335
Subjects:Linguistics

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.