
Text and Data Mining Guide: Web Scraping

A step-by-step guide on getting started with your text mining project, along with examples of past text mining projects from UW researchers and students.

How to get started with Web Scraping?

  1. Identify the kind of information you want for your project
  2. Identify reliable websites for extracting this information
    • You may not find everything you need on a single website, and may have to scrape several
    • NOTE: Ensure that it is legal to scrape the website
  3. Collect the URLs of all the pages you wish to scrape
  4. Locate the data you wish to extract within each page by exploring it with your browser's inspect element tool
  5. Identify the package you wish to use (Beautiful Soup, Selenium, or Scrapy) depending on your use case
  6. Extract and clean the data (a minimal sketch covering steps 2-6 follows this list)
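Here is a minimal Python sketch of that workflow, using requests and Beautiful Soup. The target URL and the h2.title selector are hypothetical placeholders; substitute whatever your own exploration with inspect element reveals.

```python
import urllib.robotparser
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/articles"  # hypothetical target page

# Step 2: check that robots.txt permits scraping this URL
parsed = urlparse(URL)
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
robots.read()
if not robots.can_fetch("*", URL):
    raise SystemExit("robots.txt disallows scraping this URL")

# Steps 3-4: fetch the page and parse its HTML
response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Steps 5-6: extract the elements located with inspect element
# (hypothetical selector: suppose titles live in <h2 class="title">)
for heading in soup.select("h2.title"):
    print(heading.get_text(strip=True))
```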

How to pick between Selenium, Scrapy or Beautiful Soup?

These three are the most frequently used tools for scraping websites programmatically.

Comparison of Beautiful Soup, Selenium, Scrapy

|                 | Beautiful Soup | Selenium | Scrapy |
| --------------- | -------------- | -------- | ------ |
| Performance     | Slow compared with the other two | Faster than Beautiful Soup, but with its own limitations | The fastest of the three |
| Best suited for | Small, low-complexity projects and beginners | JavaScript-heavy websites | Large, complex projects; makes a project robust and flexible |
| Ecosystem       | Has a number of dependencies in the ecosystem (a parser backend, plus an HTTP client such as requests to fetch pages) | Good ecosystem, but configuring proxies is cumbersome | Supports proxies and VPNs, making it suitable for complex projects |
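To make the comparison concrete, here is a minimal Scrapy spider. It targets https://quotes.toscrape.com/, a public sandbox site built for practicing scraping; the spider name and CSS selectors match that site and would change for any other target.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # Minimal spider for the quotes.toscrape.com sandbox
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Selectors found by exploring the page with inspect element
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Save this as quotes_spider.py and run `scrapy runspider quotes_spider.py -o quotes.json` to write the results to a file. For comparison, here is a Selenium sketch against the JavaScript-rendered variant of the same site, assuming a local Chrome installation (Selenium 4 can download a matching driver automatically):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome is installed locally
try:
    # The /js/ variant renders its quotes with JavaScript, so a plain
    # HTTP fetch would return a page with no quote text in it
    driver.get("https://quotes.toscrape.com/js/")
    for quote in driver.find_elements(By.CSS_SELECTOR, "div.quote span.text"):
        print(quote.text)
finally:
    driver.quit()
```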


How to clean messy scraped data?

Scraped data will more often than not be extremely messy, and this is expected: it tends to be full of extra whitespace and other irregularities. There are two common reasons for this:

  1. The HTML is not well formatted, which makes it difficult to parse and extract the relevant information

  2. The text contains stray whitespace and special characters

A good strategy is always – PRINT EVERYTHING!

Printing lets you see what the extracted data looks like and identify issues with its formatting. Once the issue has been identified, we can use regular expressions to extract our desired information from the messy scraped data. 
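As a small sketch of this print-then-clean workflow (the raw string below is made up for illustration):

```python
import re

# Hypothetical raw value pulled from a scraped page
raw = "\n    Price:\xa0 $19.99  \n\t(in stock)   "

# Print everything first, with repr() so whitespace characters are visible
print(repr(raw))

# Collapse runs of whitespace (including non-breaking spaces) into single spaces
cleaned = re.sub(r"\s+", " ", raw).strip()
print(cleaned)  # Price: $19.99 (in stock)

# Then pull out the specific value you need with a targeted pattern
match = re.search(r"\$(\d+\.\d{2})", cleaned)
if match:
    print(float(match.group(1)))  # 19.99
```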

Software & Tools

  • Programming based 
    • Python - Scrapy, Beautiful Soup
    • Selenium (bindings available for Python, R, Java, and other languages)
    • R - rvest, RCrawler
  • Software
    • ParseHub
    • Dexi.io
    • Scraping-bot.io