Text and Data Mining Guide: Module 2: Data Preparation & Cleaning

A step-by-step guide to getting started with your text mining project, along with examples of past text mining projects from UW researchers and students.

Module 2: Data Preparation & Cleaning

Before you can analyze text, it's crucial to collect and clean your data so that it can be processed effectively. This module will walk you through how to gather text data and prepare it for analysis.

  1. Data Acquisition

    • Web Scraping (e.g., using tools like Beautiful Soup in Python; see the scraping sketch after this list)
    • APIs (e.g., Twitter API, PubMed API)
    • Library Databases and Institutional Repositories
  2. Text Cleaning & Normalization

    • Removing Noise: punctuation, numbers, URLs
    • Case Conversion: making all text lowercase for uniformity
    • Tokenization: splitting text into words (tokens)
    • Stopword Removal: filtering out common words such as “the,” “and,” and “of”
    • Stemming or Lemmatization: reducing words to their base form (e.g., “running” → “run”); see the cleaning pipeline sketch after this list
  3. Document Representation

    • Bag-of-Words Model: treats a text as an unordered collection of words
    • TF-IDF (Term Frequency - Inverse Document Frequency): scores each term by how often it appears in a document and how rare it is across the collection, so words common to most documents carry little weight; see the representation sketch after this list
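
Below is a minimal web-scraping sketch in Python, assuming the requests and beautifulsoup4 packages are installed. The URL is a placeholder; before scraping a real site, check its terms of service and robots.txt file.

```python
# Minimal web-scraping sketch. Assumes `requests` and `beautifulsoup4`
# are installed; the URL is a placeholder for a page you may scrape.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/article"  # placeholder; substitute your own page
response = requests.get(url, timeout=10)
response.raise_for_status()  # raise an error on 4xx/5xx responses

# Parse the HTML and pull the readable text out of every <p> element.
soup = BeautifulSoup(response.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
text = "\n".join(paragraphs)
print(text[:500])  # preview the first 500 characters
```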
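The next sketch chains the cleaning and normalization steps above into one small pipeline. It uses NLTK's stopword list and Porter stemmer, one common choice rather than the only one (spaCy is a popular alternative); the sample sentence is invented for illustration.

```python
# Cleaning-and-normalization sketch. Assumes the `nltk` package is installed.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # fetch the stopword list on first use

raw = "Running faster than 100 runners, she shared 2 links: https://example.com!"

# 1. Remove noise: URLs, numbers, and punctuation.
text = re.sub(r"https?://\S+", " ", raw)   # URLs
text = re.sub(r"\d+", " ", text)           # numbers
text = re.sub(r"[^\w\s]", " ", text)       # punctuation

# 2. Case conversion: lowercase everything for uniformity.
text = text.lower()

# 3. Tokenization: a plain whitespace split works here because the
#    punctuation is already gone; nltk.word_tokenize handles messier input.
tokens = text.split()

# 4. Stopword removal.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 5. Stemming: reduce words to their base form ("running" -> "run").
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

print(stems)  # expected: ['run', 'faster', 'runner', 'share', 'link']
```

Stemming chops suffixes mechanically, so its output is not always a dictionary word; swap in NLTK's WordNetLemmatizer if you need real word forms back.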
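Finally, here is a sketch of both document representations using scikit-learn, again a common choice rather than the only one. The three toy documents are invented.

```python
# Document-representation sketch. Assumes `scikit-learn` is installed.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

# Bag-of-words: each row is a document, each column a vocabulary term,
# each cell a raw count. Word order is discarded entirely.
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: same matrix layout, but terms that appear in many documents
# (like "the") get lower weights than rarer, more distinctive terms.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```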

Short Exercise

  • Take a short piece of text (e.g., a paragraph from an online article). Convert it to lowercase, remove punctuation, and list the unique tokens you end up with. Note any interesting observations.
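
If you would rather do the exercise in code than by hand, one possible starting point is sketched below; the paragraph is a stand-in for whatever text you choose.

```python
# Exercise sketch: lowercase, strip punctuation, and list the unique tokens.
# Replace `paragraph` with a paragraph from an article of your choice.
import string

paragraph = ("Text mining, broadly speaking, turns unstructured text "
             "into structured data. Text mining shows up everywhere!")

lowered = paragraph.lower()
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
unique_tokens = sorted(set(no_punct.split()))

print(unique_tokens)
# One observation to look for: "Text" and "text" collapse into a single
# token once everything is lowercased.
```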

Once your text data is thoroughly cleaned and properly represented, you're ready to move on to the analysis phase covered in Module 3!

Work through this Google Colab notebook to practice the techniques from this module.