In this guide you will find the following:
Text Mining, also known as Text Data Mining, is a branch of Artificial Intelligence focused on extracting high-quality information and insights from unstructured textual data. It is widely used for information retrieval, data mining, and knowledge discovery, and it relies on computational analysis to process large quantities of text. By taking advantage of computers’ ability to find patterns, researchers can identify recurring structures in texts and data sets. Text mining typically focuses on natural language texts, while data mining focuses on large data sets. The results of text mining can be used to understand the relationships between words and documents.
A few common text mining tasks are as follows:
Before starting a text mining project, it is important to go through the following steps to gauge the feasibility of the project.
Which 2020 US Democratic Presidential Nominee said this? Warren? Biden? Sanders?
Using quotes extracted from debates between the candidates, we want to classify which candidate said a given quote. Because we have existing labels and want to predict who said a new, unseen quote, this is an example of text classification. This project used supervised and semi-supervised learning techniques.
Concepts: Text Classification, Text Preprocessing, Feature Engineering
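As a rough illustration of supervised text classification, the sketch below trains a small multinomial Naive Bayes classifier on bag-of-words counts. It is not the project's actual model or data: the choice of Naive Bayes, the function names, the labels `speaker_a` and `speaker_b`, and the training sentences are all invented placeholders.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_quotes):
    """Count documents per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_quotes:
        doc_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability under Naive Bayes."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # skip words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented placeholder sentences, NOT real debate quotes.
TRAINING = [
    ("we need a wealth tax on the largest fortunes", "speaker_a"),
    ("a wealth tax would fund child care for families", "speaker_a"),
    ("we will restore our alliances and lead abroad", "speaker_b"),
    ("our alliances abroad keep the country safe", "speaker_b"),
]

model = train(TRAINING)
prediction = predict(model, "the wealth tax pays for child care")
print(prediction)
```

A real project would hold out part of the labeled quotes for evaluation and would likely add preprocessing and feature engineering steps, such as stop-word removal or TF-IDF weighting, before training.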
Natural Language Recipe Parser
To build a database of recipes and their ingredients, we scraped websites with BeautifulSoup. However, almost no food-blogging or recipe-curating websites separate the name of an ingredient from its measurement, quantity, and additional description. To extract each ingredient's name, quantity, and unit, we used a custom named entity recognition model that identifies these entities in a given sentence and returns them for easy use. Food-blogging websites and apps can use this natural language recipe ingredient parser to manage opaque ingredient text by converting it into strings that are easy to manipulate and store in tabular format.
Concepts: Web Scraping, Entity & Information Extraction, Text Cleaning & Analytics
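To make the target output concrete, here is a minimal rule-based baseline that splits an ingredient line into quantity, unit, name, and comment. It is a stand-in for illustration only: the project describes a trained custom named entity recognition model, whereas this sketch uses a regular expression and a short, hypothetical unit list. The names `UNITS`, `parse_quantity`, and `parse_ingredient` are assumptions.

```python
import re
from fractions import Fraction

# Hypothetical unit list; a real parser would need far more coverage.
UNITS = {
    "cup", "cups", "tablespoon", "tablespoons", "tbsp",
    "teaspoon", "teaspoons", "tsp", "gram", "grams", "g",
    "ounce", "ounces", "oz", "pound", "pounds", "lb",
    "clove", "cloves",
}

# Leading quantity: mixed number ("1 1/2"), fraction ("1/2"), or decimal ("2", "0.5").
QUANTITY = re.compile(r"^\s*(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)")


def parse_quantity(text):
    """Convert '2', '1/2', '1 1/2', or '0.5' to a float."""
    return float(sum(Fraction(part) for part in text.split()))


def parse_ingredient(line):
    """Split one ingredient line into quantity, unit, name, and comment."""
    quantity, unit = None, None
    rest = line.strip()

    match = QUANTITY.match(rest)
    if match:
        quantity = parse_quantity(match.group(1))
        rest = rest[match.end():].strip()

    # Treat the first remaining word as the unit only if it is a known unit.
    words = rest.split()
    if words and words[0].lower().rstrip(".") in UNITS:
        unit = words[0].lower().rstrip(".")
        rest = " ".join(words[1:])

    # Anything after the first comma is kept as a free-text comment.
    name, _, comment = rest.partition(",")
    return {
        "quantity": quantity,
        "unit": unit,
        "name": name.strip(),
        "comment": comment.strip(),
    }


for line in ["1 1/2 cups all-purpose flour, sifted", "2 cloves garlic, minced", "salt to taste"]:
    print(parse_ingredient(line))
```

Rules like these break quickly on real recipe text (ranges such as "2-3", units written after the name, parenthetical sizes), which is one reason to train a named entity recognition model on labeled ingredient lines instead.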
COVID-19 Twitter Sentiment Analysis
Twitter is a rich mine of opinions. In this project, tweets from two states, Washington and Florida, were analyzed to understand how people reacted to preventative measures and policies during COVID-19. The project hypothesizes that this sentiment is captured in Twitter data and can therefore be used by government at all levels to create effective strategies for better informing the public.
Concepts: Sentiment Analysis, Data Extraction using API
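One simple way to approach sentiment analysis is a lexicon-based scorer, sketched below. The project description does not say which method was used, so treat this as an assumed baseline: the word lists, the negation rule, the function names, and the sample tweets are all invented for illustration, and no tweets are fetched from an API here.

```python
import re
from collections import Counter

# Tiny hypothetical lexicons; real work would use a published sentiment lexicon.
POSITIVE = {"good", "great", "safe", "support", "helpful", "thank", "thanks", "glad", "relief"}
NEGATIVE = {"bad", "worse", "unsafe", "angry", "useless", "afraid", "fear", "ridiculous", "tired"}
NEGATIONS = {"not", "no", "never"}


def tokenize(tweet):
    """Lowercase, drop URLs and @mentions, and return word tokens."""
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return re.findall(r"[a-z']+", tweet)


def score(tweet):
    """Sum word polarities; a negation flips only the token right after it."""
    total = 0
    negate = False
    for tok in tokenize(tweet):
        if tok in NEGATIONS:
            negate = True
            continue
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        total += -polarity if negate else polarity
        negate = False
    return total


def label(tweet):
    """Map a numeric score to a coarse sentiment label."""
    s = score(tweet)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"


def summarize(tweets_by_state):
    """Count sentiment labels per state."""
    return {
        state: Counter(label(t) for t in tweets)
        for state, tweets in tweets_by_state.items()
    }


# Invented example tweets, NOT real data.
SAMPLE = {
    "Washington": ["Masks are helpful, thanks @GovExample", "So tired of these rules"],
    "Florida": ["This policy is not helpful https://example.com", "Feeling safe at the store today"],
}

summary = summarize(SAMPLE)
print(summary)
```

Comparing label counts across states gives a first, coarse view of reactions. Sarcasm, slang, and emoji are common in tweets and are not handled by a word list like this one.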
The Libraries and eScience Institute offer limited office hour support for text mining. To learn more or schedule a consultation, email uwtextmine@uw.edu