Digital Scholarship Research Guide: Text Mining

The UW Libraries does not provide support for these tools.

What is text mining?

Text and data mining (TDM) uses computational analysis to process large quantities of information. Researchers rely on computers' pattern-finding ability to identify patterns in texts and data sets. Text mining usually focuses on natural language texts, while data mining focuses on large data sets. The results of text mining can be used to understand the relationships between words and documents.
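As a rough illustration of how word counts can relate words to documents, the Python sketch below counts word frequencies in two short, made-up documents and lists the words they share. The tokenizer, the function names, and the sample sentences are assumptions chosen for illustration; real projects would more likely use one of the toolkits listed under Software.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(documents):
    """Return a Counter of word counts for each named document."""
    return {name: Counter(tokenize(text)) for name, text in documents.items()}


def shared_terms(freqs, a, b):
    """Return the words that occur in both documents, sorted alphabetically."""
    return sorted(set(freqs[a]) & set(freqs[b]))


# Two tiny made-up documents, used only for illustration.
docs = {
    "doc1": "Text mining finds patterns in text.",
    "doc2": "Data mining finds patterns in large data sets.",
}
freqs = term_frequencies(docs)
```

Here freqs["doc1"] maps each word in the first document to its count, and shared_terms(freqs, "doc1", "doc2") returns the vocabulary the two documents have in common.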

Software

Open source tools to help with text mining:

(Knowledge of coding needed)

  Mallet

  Mallet Website

  Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

  

Natural Language Toolkit (NLTK)

   NLTK Website

   A suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing.

  

tm

   tm Website

   R package that provides a framework for text mining applications within R.

Online Tools

Online collections with supporting text mining tools:

JSTOR Text Analyzer

JSTOR Analyzer website

A new tool from JSTOR that allows you to upload your own text or document. Text Analyzer will process the text to find the most significant topics and then recommend similar content on JSTOR.

 

 

HathiTrust Research Center Analytics

HTRC Analytics Website

HathiTrust Digital Library has a collection of easy-to-use computational tools ideal for beginners, as well as more complex tools for more advanced data analysis. HTRC now provides access to the text of the complete 16.7 million-item HathiTrust collection for non-consumptive research. These tools allow you to assemble collections of digitized text from the HathiTrust corpus, perform text analysis on them, and more.

What is an API?

An Application Programming Interface (API) is a set of clearly defined methods of communication that allows two applications to talk to each other. Just as humans need a structured mechanism to share information (i.e. spoken or written language), so do computers. An API is a set of instructions for how a particular machine/information source is able to share the data it contains.

 

APIs are often used to extract the data used for text and data mining. Many databases and publishers have their own APIs that allow researchers to access information. These APIs are often needed because the databases and publishers prohibit web scrapers and crawlers. Always check the individual policies for text mining and API use before starting your project.
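A minimal Python sketch of this request-and-response pattern, using CrossRef (one of the free APIs listed in this guide) as the example. The endpoint and the message/items/title field names reflect CrossRef's public REST API as commonly documented, but treat them as assumptions and check the current documentation and usage policies before relying on them. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public endpoint for searching works; verify against current docs.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(query, rows=5):
    """Build a search URL; `query` is free text and `rows` caps the result count."""
    return CROSSREF_WORKS + "?" + urlencode({"query": query, "rows": rows})


def extract_titles(payload):
    """Pull (DOI, title) pairs out of a decoded JSON response."""
    items = payload.get("message", {}).get("items", [])
    # Titles are assumed to arrive as a list; records without one get "".
    return [(item.get("DOI"), (item.get("title") or [""])[0]) for item in items]


def search(query, rows=5):
    """Send the request and return (DOI, title) pairs. Requires network access."""
    with urlopen(build_query_url(query, rows), timeout=30) as response:
        return extract_titles(json.load(response))
```

Calling search("text mining") would contact the live service, so it is defined here but not run. Keeping URL construction and response parsing in separate functions makes each step easy to check without sending any requests.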

Open Source APIs

 

Resource

Description                          

Fee?

Result Format      

Limitations                     

Registration

Help Contact

arXiv

Provides access to metadata and article abstracts for the e-prints hosted on arXiv.org.

Free

Atom

None

None

arXiv help

SAO/NASA Astrophysics Data System (ADS)

Provides access to the ADS database of bibliographic data on astronomy and physics publications.

Free

JSON

Rate limits apply

Key required

adshelp@cfa.harvard.edu

BioMed Central

Provides access to both metadata and full-text content for the 260,000 open access articles published on BioMed Central.

Free

XML, JSON

None

Key required

info@biomedcentral.com

Chronicling America

Access to information about historic newspapers and select digitized newspaper pages.

Free

HTML (default), JSON, Atom

None

None

ndi@loc.gov

CrossRef


 

Provides access to metadata records with CrossRef DOIs, covering about 75 million scholarly works from around 5000 publishers.

Free

JSON

None

None

tdm@crossref.org

Digital Public Library of America

Provides metadata on items and collections indexed by the DPLA. Also includes partner data from Harvard, New York Public Library, ARTstor, and others.

Free

JSON-LD

None

Key required

codex@dp.la

 

Troubleshooting & FAQ

HathiTrust (Bibliographic API)

Provides bibliographic and rights information for items in the HathiTrust Digital Library. Please note that this API is not intended for bulk-retrieval of records.

Free

MARC-XML, JSON

No specific limits; however, the API is intended only for small numbers of items. Permission must be sought for bulk retrieval.

None

 feedback@issues.hathitrust.org

HathiTrust (Data API)

Provides access to HathiTrust and Google digitized texts of public domain works. Volumes digitized by Google will require agreement with Google.

Free

XML, JSON

No specific limits; however, please see HathiTrust's policies on data use

Key required

feedback@issues.hathitrust.org

JSTOR Data for Research


 

Not a true API, but provides access to content on JSTOR for research and teaching.

Free

Zip files, XML

Max 25,000 documents per dataset; users can get larger datasets by special request


 

Requires MyJSTOR account registration

Data for Research help

Library of Congress

Multiple APIs available to download bibliographic data and search Library of Congress digital collections, including images, public radio and television, and historic newspapers

Free

Varies

varies

Most APIs do not require key

ndi@loc.gov

Nature

Provides access to metadata and to more than 460,000 open access full-text documents from Springer Nature.

Free

XML, JSON, and more

No specific limits; however, downloads should be limited to “reasonable rates”

 

Springer Nature TDM Policy

Varies

tdm@springernature.com

National Library of Medicine

NLM offers 29 separate APIs for accessing a wide variety of content from various NLM databases.

 

varies

varies

varies

varies

National Center for Biotechnology Information

Several public APIs to access many databases and tools including PubMed, PMC, Gene, Nuccore and Protein.

Free

Varies

Varies

Key required for some

NCBI Help Manual

OECD

Provides access to a selection of top used OECD datasets.

Free

JSON, XML

Max 1,000,000 results per query, max URL length of 1,000 characters

None

OECD.Stats help

Open Academic Graph

Downloadable datasets for citations drawn from two large academic graphs: Microsoft Academic Graph (MAG) and AMiner. (Not an API)

Free

Zip, JSON

None

None

 

ORCID

Queries and searches the ORCID researcher identifier system to obtain researcher profile data

Free, with subscription options

HTML, XML, or JSON

Two options: 1) use the free Public API, which only returns data marked as “public”; or 2) become an ORCID member to receive API credentials

ORCID ID Account required


 

ORCID API FAQ

Oxford English Dictionary (OED)

Oxford University Press grants research access to the Corpus for academic projects that can demonstrate a strong practical need for this data.

Free, with subscription options

JSON

3,000 requests per month and 60 calls per minute with the free option; other options available

Key required. Academic Researchers can request free access.

API FAQ

 

Oxford Dictionaries Contact

 

API Forum

PLoS Article-Level Metrics

Retrieves article-level metrics (including usage statistics, citation counts, and social networking activity) for articles published in PLOS journals and articles added to PLOS Hubs: Biodiversity

Free

XML, JSON, CSV

Results limited to batches of 50 at a time

 

Contact api@plos.org for high-volume use requests

Key required


 

api@plos.org ; Questions can also be posted in PLoS API Google Group

PLOS Search

Allows PLoS content to be queried for integration into web, desktop, or mobile applications

Free

XML, JSON

Max 7,200 requests per day, 300 per hour, and 10 per minute; users should wait 5 seconds for each query to return results; requests should not return more than 100 rows; no more than five concurrent connections are allowed from a single IP address; high-volume users should contact api@plos.org

Key required

api@plos.org ; Questions can also be posted in PLoS API Google Group

World Bank

Provides access to World Bank statistical databases, indicators, projects, loans, credits, financial statements, and other data related to financial operations

Free

Varies

Request volume limits are unspecified, but should be “reasonable”

None

data@worldbankgroup.org

The information in this table has been adapted from MIT, Purdue, and USC's API LibGuides.
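Several of the limitations listed above are rate limits. One way to stay under such a limit is to space requests evenly, as in the Python sketch below. The Throttle class is hypothetical and not part of any listed API, and even spacing is only one conservative strategy; a service's own guidance takes precedence.

```python
import time


class Throttle:
    """Space out calls so that no more than `max_calls` happen per `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds to leave between consecutive calls.
        self.min_interval = period / max_calls
        # The clock and sleep functions are injectable so the class can be
        # exercised without real waiting.
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

For example, the PLOS Search limit of 300 requests per hour listed above corresponds to Throttle(max_calls=300, period=3600), which leaves 12 seconds between requests; call wait() immediately before each request.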

Subscription Based APIs

These APIs require an affiliation with a subscribed institution. The University of Washington has licenses with these services.

Resource

Description

Fee?

Result Format

Limitations

Registration

Help Contact

IEEE Xplore

Provides metadata and DOIs for IEEE Xplore articles.

Cost negotiated per request

JSON, XML

Max 200 results per query

Key required; users must subscribe to IEEE Xplore or belong to an institution that subscribes (UW subscribes)

onlinesupport@ieee.org

ScienceDirect and Scopus

Provides access to full-text content from ScienceDirect and Scopus, as well as seven other APIs with various functionalities.

Free (with subscription)

XML

None

Key required

integrationsupport@elsevier.com

STAT!Ref OpenSearch

Bibliographic search service for displaying STAT!Ref results on a website.

Free (with subscription)

RSS, ATOM, HTML

Limits exist but are not specified; high-volume users should contact STAT!Ref

Free to register for users at a subscribing institution

support@statref.com

Wiley

Allows text- and data-mining access to content in the Wiley Online Library

Free (with subscription)

JSON

Rate limits are implemented through CrossRef rate-limiting headers; exact limits are not specified

Must be part of a subscribing institution to have full text access. Users will encounter a click-through agreement and will receive a Client API Token, which is needed when requesting full text of articles

TDM@wiley.com

Wiley TDM help page

The information in this table has been adapted from MIT, Purdue, and USC's API LibGuides.
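Most of the services above require a key or token. The Python sketch below shows one common way to keep a key out of source code and attach it to a request. The environment variable name, the header name, and the api.example.org URL in the usage note are all placeholders, not the conventions of any listed service; each provider documents its own header or parameter name.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request


def load_api_key(env_var="TDM_API_KEY"):
    """Read the key from an environment variable so it stays out of source code."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def build_request(base_url, params, api_key, header_name="X-API-Key"):
    """Return a Request with query parameters in the URL and the key in a header.

    `header_name` is a placeholder; substitute whatever the provider specifies.
    """
    url = base_url + "?" + urlencode(params)
    return Request(url, headers={header_name: api_key, "Accept": "application/json"})
```

A call such as build_request("https://api.example.org/search", {"q": "genomics"}, load_api_key()) only prepares the request; passing the result to urllib.request.urlopen would send it.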

ProQuest Content

ProQuest allows users associated with a subscribed institution to extract and compile data from purchased content for teaching, learning, and research purposes. Users can arrange to text mine specific ProQuest content; the cost is negotiated per request. Members of the UW community can get more information about text mining data from ProQuest by submitting a request through AskUs!.

News Sources

If you are looking for news sources to text mine, consider Chronicling America, which gives free access to information about historic newspapers and select digitized newspaper pages. By default the API returns results in HTML; JSON and Atom formats are also available.
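When an API returns Atom, the feed can be read with Python's standard library. The sketch below parses a small made-up feed; the sample entries and the function name are assumptions for illustration and do not reproduce actual Chronicling America output.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace, mapped to a prefix for use in element paths.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def parse_atom_entries(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <entry> in the feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM).strip()
        link = entry.find("atom:link", ATOM)
        entries.append({
            "title": title,
            # Entries without a <link> element get None.
            "link": link.get("href") if link is not None else None,
        })
    return entries


# A tiny hand-written feed, used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample results</title>
  <entry>
    <title>First page</title>
    <link href="https://example.org/page/1"/>
  </entry>
  <entry>
    <title>Second page</title>
  </entry>
</feed>"""
```

Calling parse_atom_entries(SAMPLE_FEED) yields one dictionary per entry, which can then be written out or passed to a text analysis tool.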

 

Library news database vendors are exploring ways to allow text mining of their aggregated news content.  For more information about data/text mining in library news databases, contact News Librarian Jessica Albano.

Twitter APIs

Twitter provides access to tweets via API. Twitter has several different APIs with varying degrees of access that range from free to $2,499+/month. To find the right API package for your project, visit Twitter's API comparison chart and policies.

Additional Tool for acquiring tweets: