Text and Data Mining Guide: Module 6: Practice Questions

A step-by-step guide to getting started with your text mining project, along with examples of past text mining projects from UW researchers and students.

Questions

  Below are some review questions spanning all modules. Feel free to review them to test your knowledge!

Q: Why are stopwords typically removed during text preprocessing?
A: Stopwords (e.g., “the,” “and,” “of”) appear very frequently but add little semantic information. Removing them reduces noise and can improve the efficiency of text mining tasks such as topic modeling or sentiment analysis.
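For instance, here is a minimal sketch of stopword removal using NLTK's English stopword list (NLTK and its stopword corpus are assumed to be installed/downloaded; the sample sentence is made up):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download of the stopword lists

text = "The quick brown fox jumps over the lazy dog."
stop_set = set(stopwords.words("english"))

# Naive whitespace split for illustration; a real pipeline would tokenize first
tokens = [t.strip(".,!?").lower() for t in text.split()]
filtered = [t for t in tokens if t and t not in stop_set]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```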

Q: What is tokenization, and why does it come first?
A: Tokenization breaks text into smaller units (tokens), which are often words or phrases. All subsequent processes—like removing stopwords, counting word frequencies, or applying machine learning models—depend on having clearly defined tokens.
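A small sketch of sentence and word tokenization with NLTK (assumed installed, with its Punkt tokenizer data downloaded; newer NLTK releases use the 'punkt_tab' resource name):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models for older NLTK releases
nltk.download("punkt_tab", quiet=True)  # resource name used by newer NLTK releases

text = "Text mining turns raw text into data. It starts with tokenization!"
print(sent_tokenize(text))
# ['Text mining turns raw text into data.', 'It starts with tokenization!']
print(word_tokenize(text))
# ['Text', 'mining', 'turns', 'raw', 'text', 'into', 'data', '.', 'It', ...]
```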

Q: How does stemming differ from lemmatization?
A: Stemming chops off the ends of words (e.g., “running” → “run”), sometimes creating non-standard forms. Lemmatization reduces words to their dictionary form (lemma) based on morphological analysis (e.g., “mice” → “mouse,” “children” → “child”). Lemmatization usually produces more accurate root forms.
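A quick comparison using NLTK's PorterStemmer and WordNetLemmatizer (the WordNet data is assumed to be downloaded):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

words = ["running", "mice", "children", "studies"]

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])
# ['run', 'mice', 'children', 'studi']  <- crude suffix stripping

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="n") for w in words])
# ['running', 'mouse', 'child', 'study']  <- dictionary forms (treated as nouns)
print(lemmatizer.lemmatize("running", pos="v"))
# 'run'  <- the part-of-speech tag matters for lemmatization
```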

Q: What does TF-IDF measure?
A: TF-IDF (Term Frequency–Inverse Document Frequency) weights how often a word appears in a document (term frequency) but reduces the impact of words that are common across many documents (inverse document frequency). This approach highlights words that characterize specific documents.
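A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer on a made-up toy corpus (scikit-learn and pandas are assumed to be available):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Common words (here removed via the built-in English stopword list) get little weight,
# while words concentrated in one document score highest for that document.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

print(pd.DataFrame(tfidf.toarray(),
                   columns=vectorizer.get_feature_names_out()).round(2))
```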

Q: What are n-grams, and why are they useful?
A: N-grams are sequences of n consecutive words (e.g., “New York City” is a trigram). They capture context and multi-word expressions that single-word frequency counts miss, offering deeper insight into patterns and meanings within text.
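A short sketch of extracting unigrams, bigrams, and trigrams with scikit-learn's CountVectorizer (the sample sentence is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I visited New York City last year"]

# ngram_range=(1, 3) keeps single words plus two- and three-word sequences
vectorizer = CountVectorizer(ngram_range=(1, 3))
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# includes multi-word features such as 'new york' and 'new york city'
```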

Q: How does a lexicon-based approach to sentiment analysis differ from a machine-learning approach?
A: A lexicon-based approach uses predefined dictionaries of positive/negative words to calculate sentiment scores. A machine-learning approach trains a classifier on labeled data, learning more nuanced patterns that correlate with various sentiment categories.
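A minimal lexicon-based sketch using NLTK's VADER analyzer (assumed installed, with its lexicon downloaded); a machine-learning approach would instead fit a classifier on labeled examples:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # the word list behind VADER's scores

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I absolutely love this library!"))
# e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}  -> positive compound score
print(sia.polarity_scores("The service was slow and disappointing."))
# compound score below zero -> negative overall sentiment
```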

Q: What is topic modeling?
A: Topic modeling is an unsupervised technique (e.g., Latent Dirichlet Allocation) used to discover hidden thematic structures in a corpus. It groups words that tend to appear together, helping identify overarching topics without manual labeling.
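A small LDA sketch with scikit-learn on a tiny made-up corpus (two topics, purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the election results and the new senator",
    "voters went to the polls for the election",
    "the team won the championship game last night",
    "the coach praised the players after the game",
]

# LDA works on raw word counts rather than TF-IDF weights
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the top words associated with each discovered topic
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"Topic {i}: {top}")
```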

Q: What is named entity recognition (NER), and what is it used for?
A: NER automatically identifies and classifies named entities—such as people, locations, or organizations—in text (e.g., detecting “Albert Einstein,” “Los Angeles,” or “Google”). It’s useful for extracting structured information from unstructured text, facilitating tasks like relationship mapping and trend analysis.
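A minimal sketch using spaCy, assuming the small English model en_core_web_sm has been installed (e.g., via python -m spacy download en_core_web_sm); the sentence is synthetic:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The article mentions Albert Einstein, Los Angeles, and Google.")

# Each detected entity carries a label such as PERSON, GPE (place), or ORG
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Albert Einstein PERSON / Los Angeles GPE / Google ORG
```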

Q: What do precision, recall, and F1-score measure?
A: Precision measures how many of the predicted positives are correct, recall measures how many of the actual positives were caught, and the F1-score is the harmonic mean of the two. These metrics offer a more nuanced view of model performance than accuracy alone.
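A quick illustration with scikit-learn's metric functions on toy binary labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # actual labels (1 = positive class)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # model predictions

print(precision_score(y_true, y_pred))  # 0.75: 3 of 4 predicted positives are correct
print(recall_score(y_true, y_pred))     # 0.75: 3 of 4 actual positives were caught
print(f1_score(y_true, y_pred))         # 0.75: harmonic mean of precision and recall
```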

Q: What might a text mining research question look like, and how would you approach it?
A: Example: “How did public sentiment around a major political event evolve over time?”

Possible Steps (a minimal code sketch follows the list):

  • Collect articles by publication date
  • Clean the text (remove stopwords, tokenize, etc.)
  • Apply sentiment analysis to each article
  • Aggregate sentiment scores by month/year to identify trends
  • Visualize changes in sentiment and interpret findings in a historical context
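A minimal end-to-end sketch of these steps, assuming NLTK's VADER for sentiment scoring and pandas for aggregation; the articles below are made-up placeholders for a real, dated corpus:

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

# Hypothetical articles; in practice these would be collected by publication date
articles = pd.DataFrame({
    "date": ["2020-01-15", "2020-02-03", "2020-02-20", "2020-03-10"],
    "text": [
        "The announcement was welcomed with great enthusiasm.",
        "Critics called the decision a serious mistake.",
        "Supporters rallied, praising the bold new policy.",
        "Confusion and frustration grew as details emerged.",
    ],
})
articles["date"] = pd.to_datetime(articles["date"])

# Score each article, then aggregate compound scores by month to see the trend
sia = SentimentIntensityAnalyzer()
articles["compound"] = articles["text"].apply(
    lambda t: sia.polarity_scores(t)["compound"])
monthly = articles.groupby(articles["date"].dt.to_period("M"))["compound"].mean()
print(monthly)  # monthly.plot() would visualize the trend over time
```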

Complete this Google Colab Notebook to practice the techniques in this module.