Below are some review questions spanning all modules. Feel free to use them to test your knowledge!
Stopwords (e.g., “the,” “and,” “of,” etc.) typically appear very frequently but do not add meaningful semantic information. Removing them reduces noise and can improve the efficiency of various text mining tasks, such as topic modeling or sentiment analysis.
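For instance, here is a quick sketch of stopword removal using NLTK's English stopword list (the NLTK choice, the data download, and the example sentence are assumptions for illustration):

```python
import re
import nltk

nltk.download("stopwords", quiet=True)  # fetch the stopword list on first run
from nltk.corpus import stopwords

text = "The results of the study were clear, and the findings held up."
stop_words = set(stopwords.words("english"))

# Lowercase, pull out alphabetic tokens, then drop the stopwords.
tokens = re.findall(r"[a-z]+", text.lower())
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # e.g., ['results', 'study', 'clear', 'findings', 'held']
```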
Tokenization breaks text into smaller units (tokens), which are often words or phrases. All subsequent processes—like removing stopwords, counting word frequencies, or applying machine learning models—depend on having clearly defined tokens.
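A minimal tokenization sketch using a plain regular expression (a simple baseline assumed here; dedicated tokenizers such as NLTK's or spaCy's handle more edge cases):

```python
import re

text = "Text mining isn't hard. Tokenization comes first!"

# One simple convention: keep word characters and internal apostrophes,
# drop punctuation, and lowercase everything.
tokens = re.findall(r"\b\w+(?:'\w+)?\b", text.lower())
print(tokens)
# ['text', 'mining', "isn't", 'hard', 'tokenization', 'comes', 'first']
```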
Stemming chops off the ends of words (e.g., “running” → “run”), sometimes creating non-standard forms. Lemmatization reduces words to their dictionary form (lemma) based on morphological analysis (e.g., “mice” → “mouse,” “children” → “child”). Lemmatization usually produces more accurate root forms.
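A short sketch contrasting the two with NLTK's Porter stemmer and WordNet lemmatizer (the NLTK choice and the required data downloads are assumptions):

```python
import nltk
nltk.download("wordnet", quiet=True)   # WordNet data used by the lemmatizer
nltk.download("omw-1.4", quiet=True)   # extra WordNet data some NLTK versions expect
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "mice", "children"]:
    print(word, "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word))
# "mice" and "children" stem unchanged but lemmatize to "mouse" and "child";
# "running" stems to "run" but lemmatizes (treated as a noun) to "running".
```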
TF-IDF (Term Frequency–Inverse Document Frequency) weights how often a word appears in a document (term frequency) but reduces the impact of words that are common across many documents (inverse document frequency). This approach highlights words that characterize specific documents.
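A minimal TF-IDF sketch, assuming scikit-learn's TfidfVectorizer and a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are popular pets",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Terms shared across documents (e.g., "sat") receive lower weights than
# terms that characterize a single document (e.g., "pets").
weights = pd.DataFrame(tfidf.toarray(),
                       columns=vectorizer.get_feature_names_out()).round(2)
print(weights)
```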
N-grams are sequences of n words (e.g., “New York City” is a trigram). They capture context and multi-word expressions that single-word frequency counts might miss, offering deeper insights into patterns and meanings within text.
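A small sketch that builds bigrams and trigrams in plain Python (the helper function and example sentence are illustrative; vectorizers such as scikit-learn's offer the same idea via an ngram_range parameter):

```python
def ngrams(tokens, n):
    """Return all consecutive n-word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york city is in new york state".split()
print(ngrams(tokens, 2))  # bigrams, e.g., 'new york', 'york city', ...
print(ngrams(tokens, 3))  # trigrams, e.g., 'new york city', ...
```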
A lexicon-based approach uses predefined dictionaries of positive/negative words to calculate sentiment scores. A machine-learning approach trains a classifier on labeled data, learning more nuanced patterns that correlate with various sentiment categories.
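A sketch of both approaches, assuming NLTK's VADER lexicon for the dictionary-based score and a tiny scikit-learn classifier trained on made-up labels for the machine-learning side:

```python
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1) Lexicon-based: look up word polarities in a predefined dictionary.
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I absolutely love this product!"))

# 2) Machine-learning: learn patterns from labeled examples (toy data here).
texts = ["great movie", "terrible service", "loved it", "awful and boring"]
labels = ["pos", "neg", "pos", "neg"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["what a great experience"]))
```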
Topic modeling is an unsupervised technique (e.g., Latent Dirichlet Allocation) used to discover hidden thematic structures in a corpus. It groups words that tend to appear together, helping identify overarching topics without manual labeling.
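A minimal LDA sketch with scikit-learn (the library choice and the toy four-document corpus are assumptions; real corpora are far larger):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the election results and the senate vote",
    "voters and candidates debate policy",
    "the team won the championship game",
    "players scored in the final game",
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-4:][::-1]]
    print(f"Topic {i}: {top}")
```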
NER automatically identifies and classifies named entities—such as people, locations, or organizations—in text (e.g., detecting “Albert Einstein,” “Los Angeles,” or “Google”). It’s useful for extracting structured information from unstructured text, facilitating tasks like relationship mapping and trend analysis.
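A short NER sketch, assuming spaCy and its small English model are installed (e.g., via `pip install spacy` and `python -m spacy download en_core_web_sm`); the example sentence is made up:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Albert Einstein later lectured in Los Angeles at an event hosted by Google.")

# Each entity comes with a label such as PERSON, GPE (place), or ORG.
for ent in doc.ents:
    print(ent.text, ent.label_)
```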
Precision measures how many of the predicted positives are correct, recall measures how many of the actual positives were caught, and F1-score is the harmonic mean of both. These metrics offer a more nuanced view of model performance than accuracy alone.
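A quick sketch of the three metrics with scikit-learn, using made-up true and predicted labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions (toy data)

print("precision:", precision_score(y_true, y_pred))  # correct positives / predicted positives
print("recall:   ", recall_score(y_true, y_pred))     # correct positives / actual positives
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```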
Example: “How did public sentiment around a major political event evolve over time?”
Possible Steps:
1. Collect timestamped text about the event (e.g., social media posts or news articles).
2. Preprocess the text: tokenize, remove stopwords, and lemmatize.
3. Score each document with a lexicon-based or machine-learning sentiment model.
4. Aggregate the sentiment scores by time period (e.g., day, week, or month).
5. Visualize the trend, optionally using topic modeling or NER to explain shifts (see the sketch after these steps).
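A minimal end-to-end sketch of these steps, assuming NLTK's VADER for sentiment scoring and pandas for the time aggregation; the posts and dates below are made up for illustration:

```python
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer
import pandas as pd

# Toy dataset of timestamped posts about a (hypothetical) political event.
posts = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10", "2024-02-25"]),
    "text": [
        "Excited about the announcement!",
        "This policy is a disaster.",
        "Cautiously optimistic after the debate.",
        "Great progress, very encouraging news.",
    ],
})

# Score each post with VADER's compound sentiment score (-1 to 1).
sia = SentimentIntensityAnalyzer()
posts["sentiment"] = posts["text"].apply(lambda t: sia.polarity_scores(t)["compound"])

# Average sentiment per month shows how opinion evolves over time.
posts["month"] = posts["date"].dt.to_period("M")
monthly = posts.groupby("month")["sentiment"].mean()
print(monthly)
```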
Complete this Google Colab Notebook to practice the techniques in this module.