Criterion | Beautiful Soup | Selenium | Scrapy |
Performance | Slow for some tasks | Faster than Beautiful Soup, but has its limitations | Fast; the best performance of the three |
Extensibility | Best suited for small, low-complexity projects and beginners | Best suited for JavaScript-heavy websites | Best suited for large, complex projects; makes a project robust and flexible |
Ecosystem | Has a lot of dependencies in the ecosystem | Good ecosystem, but proxies are harder to set up | Can use proxies and VPNs, and hence suitable for complex projects |
Scraped data is usually messy, and this is expected: it tends to contain stray whitespace and other irregularities. The main reasons are:
The HTML is not well formed, which makes it difficult to parse and to extract the relevant information.
The text contains extra whitespace and special characters.
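The whitespace and special-character problem can often be handled with a small normalization step before any further processing. The following is a minimal sketch using only the Python standard library; the helper name `clean_text` and the sample string are assumptions made for illustration, not part of any scraping library.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize one scraped text fragment (hypothetical helper)."""
    # Decode HTML entities such as &amp; and &nbsp; into real characters.
    text = html.unescape(raw)
    # Fold compatibility characters; among other things this turns a
    # non-breaking space into a plain space. NFKC also rewrites some
    # symbols (ligatures, for example), so skip it if that matters.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces, tabs, and newlines into a single space.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample of the kind of text a scraper might return.
sample = "\n\t  Price:&nbsp;&nbsp;$19.99 &amp; up \r\n"
print(clean_text(sample))  # Price: $19.99 & up
```

Running every extracted string through one such function keeps the cleanup rules in a single place, which makes them easier to adjust as new irregularities show up.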
A good strategy is always: PRINT EVERYTHING!
Printing lets you see what the extracted data looks like and identify issues with its formatting. Once the issue has been identified, we can use regular expressions to extract our desired information from the messy scraped data.
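As a rough illustration of this print-then-extract workflow, the sketch below prints a fragment with `repr()` so that hidden characters become visible, then uses a regular expression to pull out one value. The scraped string, the pattern `PRICE_RE`, and the helper `extract_price` are all hypothetical; a real page will need its own pattern.

```python
import re
from typing import Optional

# Hypothetical scraped fragment; the invisible characters are the problem.
scraped = "  Item:\u00a0Widget\t\n  Price : $1,299.50 \n"

# print() renders tabs and non-breaking spaces invisibly;
# repr() shows them as escapes such as \t and \xa0.
print(scraped)
print(repr(scraped))

# Once the shape of the data is visible, a regular expression can
# tolerate the irregular spacing and capture just the number.
PRICE_RE = re.compile(r"Price\s*:\s*\$([\d,]+(?:\.\d+)?)")


def extract_price(text: str) -> Optional[float]:
    """Return the first price found in text, or None if there is none."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting to a number.
    return float(match.group(1).replace(",", ""))


print(extract_price(scraped))  # 1299.5
```

Returning `None` when nothing matches, rather than raising an error, makes it easy to print and inspect the fragments that the pattern failed on.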