
Text and Data Mining Guide: Web Scraping

A step-by-step guide on getting started with your text mining project, along with examples of past text mining projects from UW researchers and students.

How to get started with Web Scraping?

  1. Identify the kind of information you want for your project
  2. Identify reliable websites for extracting this information
    • You may not find everything you need on a single website, and may have to scrape several
    • NOTE: Ensure that it is legal to scrape the website
  3. Collect the URLs of all the pages you wish to scrape
  4. Locate the data you wish to extract within each page by exploring it with your browser's inspect element tool
  5. Identify the package you wish to use (Beautiful Soup, Selenium, or Scrapy) depending on your use case
  6. Extract and clean the data (a minimal sketch covering steps 2-6 follows this list)
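Here is a minimal Python sketch of that workflow, using requests and Beautiful Soup. The target URL and the h2.title selector are hypothetical placeholders; substitute whatever your own exploration with inspect element reveals.

```python
import urllib.robotparser
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/articles"  # hypothetical target page

# Step 2: check that robots.txt permits scraping this URL
parsed = urlparse(URL)
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
robots.read()
if not robots.can_fetch("*", URL):
    raise SystemExit("robots.txt disallows scraping this URL")

# Steps 3-4: fetch the page and parse its HTML
response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Steps 5-6: extract the elements located with inspect element
# (hypothetical selector: suppose titles live in <h2 class="title">)
for heading in soup.select("h2.title"):
    print(heading.get_text(strip=True))
```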

How to pick between Selenium, Scrapy or Beautiful Soup?

These three are the most frequently used tools for scraping websites programmatically.

Comparison of Beautiful Soup, Selenium, Scrapy

|                 | Beautiful Soup | Selenium | Scrapy |
| --------------- | -------------- | -------- | ------ |
| Performance     | Slow compared with the other two | Faster than Beautiful Soup, but with its own limitations | The fastest of the three |
| Best suited for | Small, low-complexity projects and beginners | JavaScript-heavy websites | Large, complex projects; makes a project robust and flexible |
| Ecosystem       | Has a number of dependencies in the ecosystem (a parser backend, plus an HTTP client such as requests to fetch pages) | Good ecosystem, but configuring proxies is cumbersome | Supports proxies and VPNs, making it suitable for complex projects |
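To make the comparison concrete, here is a minimal Scrapy spider. It targets https://quotes.toscrape.com/, a public sandbox site built for practicing scraping; the spider name and CSS selectors match that site and would change for any other target.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # Minimal spider for the quotes.toscrape.com sandbox
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Selectors found by exploring the page with inspect element
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Save this as quotes_spider.py and run `scrapy runspider quotes_spider.py -o quotes.json` to write the results to a file. For comparison, here is a Selenium sketch against the JavaScript-rendered variant of the same site, assuming a local Chrome installation (Selenium 4 can download a matching driver automatically):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome is installed locally
try:
    # The /js/ variant renders its quotes with JavaScript, so a plain
    # HTTP fetch would return a page with no quote text in it
    driver.get("https://quotes.toscrape.com/js/")
    for quote in driver.find_elements(By.CSS_SELECTOR, "div.quote span.text"):
        print(quote.text)
finally:
    driver.quit()
```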


How to clean messy scraped data?

Scraped data will more often than not be extremely messy, and this is expected: it tends to be full of extra whitespace and other irregularities. There are two common reasons for this:

  1. The HTML is not well formatted, which makes it difficult to parse and extract the relevant information

  2. The text contains stray whitespace and special characters

A good strategy is always – PRINT EVERYTHING!

Printing lets you see what the extracted data looks like and identify issues with its formatting. Once the issue has been identified, we can use regular expressions to extract our desired information from the messy scraped data. 
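As a small sketch of this print-then-clean workflow (the raw string below is made up for illustration):

```python
import re

# Hypothetical raw value pulled from a scraped page
raw = "\n    Price:\xa0 $19.99  \n\t(in stock)   "

# Print everything first, with repr() so whitespace characters are visible
print(repr(raw))

# Collapse runs of whitespace (including non-breaking spaces) into single spaces
cleaned = re.sub(r"\s+", " ", raw).strip()
print(cleaned)  # Price: $19.99 (in stock)

# Then pull out the specific value you need with a targeted pattern
match = re.search(r"\$(\d+\.\d{2})", cleaned)
if match:
    print(float(match.group(1)))  # 19.99
```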

Software & Tools

  • Programming based 
    • Python - Scrapy, Beautiful Soup
    • Selenium (bindings available for Python, R, Java, and other languages)
    • R - rvest, RCrawler
  • Software
    • ParseHub
    • Dexi.io
    • Scraping-bot.io