Below are some review questions spanning all modules. Feel free to use them to test your knowledge!
Stopwords (e.g., “the,” “and,” “of,” etc.) typically appear very frequently but do not add meaningful semantic information. Removing them reduces noise and can improve the efficiency of various text mining tasks, such as topic modeling or sentiment analysis.
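For instance, here is a quick sketch of stopword removal using NLTK's English stopword list (the NLTK choice, the data download, and the example sentence are assumptions for illustration):

```python
import re
import nltk

nltk.download("stopwords", quiet=True)  # fetch the stopword list on first run
from nltk.corpus import stopwords

text = "The results of the study were clear, and the findings held up."
stop_words = set(stopwords.words("english"))

# Lowercase, pull out alphabetic tokens, then drop the stopwords.
tokens = re.findall(r"[a-z]+", text.lower())
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # e.g., ['results', 'study', 'clear', 'findings', 'held']
```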
Tokenization breaks text into smaller units (tokens), which are often words or phrases. All subsequent processes—like removing stopwords, counting word frequencies, or applying machine learning models—depend on having clearly defined tokens.
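A minimal tokenization sketch using a plain regular expression (a simple baseline assumed here; dedicated tokenizers such as NLTK's or spaCy's handle more edge cases):

```python
import re

text = "Text mining isn't hard. Tokenization comes first!"

# One simple convention: keep word characters and internal apostrophes,
# drop punctuation, and lowercase everything.
tokens = re.findall(r"\b\w+(?:'\w+)?\b", text.lower())
print(tokens)
# ['text', 'mining', "isn't", 'hard', 'tokenization', 'comes', 'first']
```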
Stemming chops off the ends of words (e.g., “running” → “run”), sometimes creating non-standard forms. Lemmatization reduces words to their dictionary form (lemma) based on morphological analysis (e.g., “mice” → “mouse,” “children” → “child”). Lemmatization usually produces more accurate root forms.
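A short sketch contrasting the two with NLTK's Porter stemmer and WordNet lemmatizer (the NLTK choice and the required data downloads are assumptions):

```python
import nltk
nltk.download("wordnet", quiet=True)   # WordNet data used by the lemmatizer
nltk.download("omw-1.4", quiet=True)   # extra WordNet data some NLTK versions expect
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "mice", "children"]:
    print(word, "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word))
# "mice" and "children" stem unchanged but lemmatize to "mouse" and "child";
# "running" stems to "run" but lemmatizes (treated as a noun) to "running".
```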
TF-IDF (Term Frequency–Inverse Document Frequency) weights how often a word appears in a document (term frequency) but reduces the impact of words that are common across many documents (inverse document frequency). This approach highlights words that characterize specific documents.
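A minimal TF-IDF sketch, assuming scikit-learn's TfidfVectorizer and a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are popular pets",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Terms shared across documents (e.g., "sat") receive lower weights than
# terms that characterize a single document (e.g., "pets").
weights = pd.DataFrame(tfidf.toarray(),
                       columns=vectorizer.get_feature_names_out()).round(2)
print(weights)
```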
N-grams are sequences of n words (e.g., “New York City” is a trigram). They capture context and multi-word expressions that single-word frequency counts might miss, offering deeper insights into patterns and meanings within text.
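A small sketch that builds bigrams and trigrams in plain Python (the helper function and example sentence are illustrative; vectorizers such as scikit-learn's offer the same idea via an ngram_range parameter):

```python
def ngrams(tokens, n):
    """Return all consecutive n-word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york city is in new york state".split()
print(ngrams(tokens, 2))  # bigrams, e.g., 'new york', 'york city', ...
print(ngrams(tokens, 3))  # trigrams, e.g., 'new york city', ...
```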
A lexicon-based approach uses predefined dictionaries of positive/negative words to calculate sentiment scores. A machine-learning approach trains a classifier on labeled data, learning more nuanced patterns that correlate with various sentiment categories.
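A sketch of both approaches, assuming NLTK's VADER lexicon for the dictionary-based score and a tiny scikit-learn classifier trained on made-up labels for the machine-learning side:

```python
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1) Lexicon-based: look up word polarities in a predefined dictionary.
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I absolutely love this product!"))

# 2) Machine-learning: learn patterns from labeled examples (toy data here).
texts = ["great movie", "terrible service", "loved it", "awful and boring"]
labels = ["pos", "neg", "pos", "neg"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["what a great experience"]))
```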
Topic modeling is an unsupervised technique (e.g., Latent Dirichlet Allocation) used to discover hidden thematic structures in a corpus. It groups words that tend to appear together, helping identify overarching topics without manual labeling.
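A minimal LDA sketch with scikit-learn (the library choice and the toy four-document corpus are assumptions; real corpora are far larger):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the election results and the senate vote",
    "voters and candidates debate policy",
    "the team won the championship game",
    "players scored in the final game",
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-4:][::-1]]
    print(f"Topic {i}: {top}")
```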
NER automatically identifies and classifies named entities—such as people, locations, or organizations—in text (e.g., detecting “Albert Einstein,” “Los Angeles,” or “Google”). It’s useful for extracting structured information from unstructured text, facilitating tasks like relationship mapping and trend analysis.
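A short NER sketch, assuming spaCy and its small English model are installed (e.g., via `pip install spacy` and `python -m spacy download en_core_web_sm`); the example sentence is made up:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Albert Einstein later lectured in Los Angeles at an event hosted by Google.")

# Each entity comes with a label such as PERSON, GPE (place), or ORG.
for ent in doc.ents:
    print(ent.text, ent.label_)
```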
Precision measures how many of the predicted positives are correct, recall measures how many of the actual positives were caught, and F1-score is the harmonic mean of both. These metrics offer a more nuanced view of model performance than accuracy alone.
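A quick sketch of the three metrics with scikit-learn, using made-up true and predicted labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions (toy data)

print("precision:", precision_score(y_true, y_pred))  # correct positives / predicted positives
print("recall:   ", recall_score(y_true, y_pred))     # correct positives / actual positives
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```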
Example: “How did public sentiment around a major political event evolve over time?”
Possible Steps:
1. Collect timestamped text about the event (e.g., social media posts or news articles).
2. Preprocess the text: tokenize, remove stopwords, and lemmatize.
3. Score each document with a lexicon-based or machine-learning sentiment model.
4. Aggregate the sentiment scores by time period (e.g., day, week, or month).
5. Visualize the trend, optionally using topic modeling or NER to explain shifts (see the sketch after these steps).
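A minimal end-to-end sketch of these steps, assuming NLTK's VADER for sentiment scoring and pandas for the time aggregation; the posts and dates below are made up for illustration:

```python
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer
import pandas as pd

# Toy dataset of timestamped posts about a (hypothetical) political event.
posts = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10", "2024-02-25"]),
    "text": [
        "Excited about the announcement!",
        "This policy is a disaster.",
        "Cautiously optimistic after the debate.",
        "Great progress, very encouraging news.",
    ],
})

# Score each post with VADER's compound sentiment score (-1 to 1).
sia = SentimentIntensityAnalyzer()
posts["sentiment"] = posts["text"].apply(lambda t: sia.polarity_scores(t)["compound"])

# Average sentiment per month shows how opinion evolves over time.
posts["month"] = posts["date"].dt.to_period("M")
monthly = posts.groupby("month")["sentiment"].mean()
print(monthly)
```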
Complete this Google Colab Notebook to practice the techniques in this module.