Text and Data Mining Guide: Home

What You'll Find in This Guide

In this guide you will find the following:

Getting Started with a Text Mining Project - A step-by-step guide to starting your text mining project, with examples of past text mining projects by UW researchers and students.
Text Mining Tools - An overview of tools for data collection, web scraping, text cleaning, and analytics that you can use in your text mining project.
Data Collection Methods - Learn about the various ways to acquire data for your Text and Data Mining (TDM) projects.
Text Analytics & Visualizations - A beginner's guide to working with text data to perform analysis and generate insightful visualizations.
TDM Licenses - Learn which Libraries subscription tools are licensed to support TDM.
Text Mining Crash Course - Learn basic concepts in Text Mining and Analysis with modules.

What is Text Mining?

Text mining, also known as text data mining, is a branch of artificial intelligence focused on extracting high-quality information and insights from unstructured textual data. It is widely used for information retrieval, data mining, and knowledge discovery. It uses computational analysis to process large quantities of information, which lets researchers identify patterns across texts and data sets that would be impractical to find by hand. Text mining is often focused on natural language texts, while data mining is focused on large data sets. The results of text mining can be used to understand the relationships between words and documents.

A few common text mining tasks are as follows:

Text Classification
Sentiment Analysis
Topic Modeling
Document Clustering & Classification
Entity & Information Extraction
Text Summarization
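
As a rough illustration of what "relationships between words and documents" can mean in practice, the sketch below counts how often each word appears in each document and then lists the words two documents share. It uses only the Python standard library. The example sentences and the function names tokenize and term_document_counts are made up for this illustration and are not part of any tool described in this guide.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_document_counts(documents):
    """Return one Counter of word frequencies per document."""
    return [Counter(tokenize(doc)) for doc in documents]


# Hypothetical example documents, not taken from any real corpus.
docs = [
    "Text mining finds patterns in text.",
    "Data mining finds patterns in large data sets.",
]

counts = term_document_counts(docs)

# Words that occur in both documents hint at what the documents have in common.
shared = set(counts[0]) & set(counts[1])
print(sorted(shared))  # ['finds', 'in', 'mining', 'patterns']
```

Word counts like these are typically the starting point for the tasks listed above, such as classification, clustering, and topic modeling.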

How to Get Started with a Text Mining Project

Before starting a text mining project, it is important to work through the following questions to gauge the feasibility of the project.

What research question are you trying to answer?
What kind of data is required to address this question?
Do you need textual data? Where can you find it? A website? PDF? API?
What is the format of the text data at the identified sources?
Is the data quality adequate for your purpose? If not, how can you improve it or acquire higher-quality data?

Sample Text Mining Projects

Which 2020 US Democratic Presidential Nominee said this? Warren? Biden? Sanders?

Using quotes extracted from debates between the candidates, this project aimed to predict which candidate said a given quote. Because the existing quotes are labeled by speaker and the goal is to predict the speaker of a new, unseen quote, this is an example of text classification. The project used supervised and semi-supervised learning techniques.

Concepts: Text Classification, Text Preprocessing, Feature Engineering
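
To make the idea of text classification concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with only the Python standard library. Naive Bayes is chosen here purely for illustration; the guide does not state which supervised or semi-supervised models the project actually used. The training quotes and the labels speaker_a and speaker_b are invented placeholders, not real debate data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_quotes):
    """Count words per label and quotes per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_quotes:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counter in word_counts.values() for word in counter}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_quotes = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior: how common this label is in training.
        score = math.log(label_counts[label] / total_quotes)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training examples with placeholder speaker labels.
training = [
    ("we must expand health care for every family", "speaker_a"),
    ("health care is a right for working families", "speaker_a"),
    ("we need to cut taxes for small business", "speaker_b"),
    ("lower taxes help business grow", "speaker_b"),
]

model = train(training)
print(predict("health care for families", model))
print(predict("cut taxes for business", model))
```

In a real project the labeled quotes would come from debate transcripts, and the text would usually go through the preprocessing and feature engineering steps named in the concepts above before a model is trained.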

Natural Language Recipe Parser

To build a database of recipes and their ingredients, we scraped websites with BeautifulSoup. However, almost no food-blogging or recipe-curating website separates the name of an ingredient from its measurement, quantity, and additional description. To extract each ingredient's name, quantity, and unit, we used a custom named entity recognition model that identifies these entities in a given sentence and returns them for easy use. Food-blogging websites and apps can use this natural language recipe ingredient parser to convert free-text ingredient lines into strings that are easy to manipulate and store in tabular format.

Concepts: Web Scraping, Entity & Information Extraction, Text Cleaning & Analytics
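
The sketch below shows the kind of output such a parser aims for: a quantity, a unit, and a name pulled out of one free-text ingredient line. Note that it swaps in a simple regular expression for the project's custom named entity recognition model, so it only handles lines that start with a number. The UNITS list, the PATTERN, and the parse_ingredient function are hypothetical names introduced for this example.

```python
import re

# Small illustrative unit list; a real parser would need many more units.
UNITS = [
    "cups", "cup", "tablespoons", "tablespoon",
    "teaspoons", "teaspoon", "grams", "gram", "g", "ml",
]

# quantity: whole number, fraction, mixed number, or decimal (e.g. 2, 1/2, 1 1/2, 0.5)
# unit:     optional, must be one of UNITS
# name:     everything that remains on the line
PATTERN = re.compile(
    r"^\s*(?P<quantity>\d+(?:\s+\d+/\d+|/\d+|\.\d+)?)\s+"
    r"(?:(?P<unit>" + "|".join(UNITS) + r")\b\s+)?"
    r"(?P<name>.+?)\s*$",
    re.IGNORECASE,
)


def parse_ingredient(line):
    """Split one ingredient line into quantity, unit, and name.

    Returns None when the line does not start with a quantity.
    """
    match = PATTERN.match(line)
    if match is None:
        return None
    return {
        "quantity": match.group("quantity"),
        "unit": match.group("unit"),  # None when no known unit is present
        "name": match.group("name"),
    }


for line in ["2 cups all-purpose flour, sifted", "1 1/2 tablespoons olive oil", "3 eggs"]:
    print(parse_ingredient(line))
```

Rules like these break down quickly on lines such as "salt to taste" or "a pinch of sugar", which is one reason a trained named entity recognition model is a better fit for real recipe text.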

COVID-19 Twitter Sentiment Analysis

Twitter is a rich mine of opinions. In this project, tweets from two states, Washington and Florida, were analyzed to understand how people reacted to preventative measures and policies during COVID-19. The project hypothesizes that this sentiment is captured in Twitter data and can therefore be used by governments at all levels to create effective strategies for better informing the public.

Concepts: Sentiment Analysis, Data Extraction using API
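
One simple way to approach sentiment analysis is to count positive and negative words from a lexicon. The sketch below assumes that approach only for illustration; the guide does not say which sentiment method the project used. The word lists and example sentences are invented, and a real analysis would rely on an established sentiment lexicon or a trained model.

```python
import re

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"good", "great", "safe", "support", "thankful", "helpful"}
NEGATIVE = {"bad", "angry", "unsafe", "oppose", "terrible", "confusing"}


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((token in POSITIVE) - (token in NEGATIVE) for token in tokens)


def sentiment_label(text):
    """Map the score to a coarse positive, negative, or neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Invented example texts, not real tweets.
examples = [
    "Masks are a great way to stay safe",
    "The new rules are confusing and bad",
    "Schools reopen on Monday",
]

for text in examples:
    print(sentiment_label(text), "-", text)
```

Labels like these could then be grouped by state or by date to compare reactions to different policies.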

Have a Text Mining Question?

The Libraries and eScience Institute offer limited office-hour support for text mining. To learn more or schedule a consultation, email uwtextmine@uw.edu.