When starting a project, identify the objective and the type of data you will need to answer your research question as thoroughly as possible.
Thus, it should be your top priority to:
Identify the kind of data you want to collect: text, numerical, PDFs, etc.
Identify reliable data sources for the data you want to collect
Look for industry-accepted or professional data sources in your domain
Look for highly cited sources
Make a list of all possible sources & weigh the pros and cons
Identify how the data might be collected legally from these sources
Once you have these answers, you can move on to setting up data collection pipelines/procedures.
Many companies that acquire or collect data on a regular basis choose to expose that data via an Application Programming Interface (API).
In simple terms, an API lets you send structured requests to a service and receive the data you need in return. To prevent any one user from abusing the service or hogging all the bandwidth, providers usually limit the number of API requests you can make per unit of time (rate limiting).
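As a minimal sketch of what an API request with rate-limit handling might look like, the snippet below uses Python's requests library against a hypothetical endpoint (the URL and parameters are placeholders, not a real service):

import time
import requests

# Hypothetical endpoint and query parameters -- substitute the API you are actually using.
API_URL = "https://api.example.com/v1/records"
PARAMS = {"q": "climate", "page": 1}

def fetch_page(params):
    """Request one page of results, backing off if the rate limit is hit."""
    response = requests.get(API_URL, params=params, timeout=30)
    if response.status_code == 429:          # 429 = Too Many Requests
        wait = int(response.headers.get("Retry-After", 60))
        time.sleep(wait)                     # respect the provider's rate limit
        response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

data = fetch_page(PARAMS)
print(len(data.get("results", [])), "records retrieved")

Real APIs differ in authentication, pagination, and response structure, so always consult the provider's documentation before building a collection pipeline around it.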
There are several APIs available for different domains; here we highlight a few:
Social Media - Twitter API, Instagram API, Facebook Graph API
Social Science - Census Data API
Finance - WorldBank API
Many data aggregation and distribution websites make public information available as downloadable files (CSV, JSON, XLS, etc.) that you can save directly to your local machine. This data can then be manipulated and analyzed with the tools or platforms of your choice. To explore text analytics tools, see the Text Analytics Guide.
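Once a file is downloaded, loading it into an analysis tool is usually a one-liner. A short sketch using pandas is shown below; the filename is a placeholder for whichever file you downloaded:

import pandas as pd

# "census_2020.csv" is a placeholder for your downloaded file.
df = pd.read_csv("census_2020.csv")

print(df.shape)        # number of rows and columns
print(df.head())       # first few records
print(df.describe())   # summary statistics for numeric columns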
Data can also be scraped directly from websites using tools such as BeautifulSoup or ParseHub. With the rise of digitized collections and online articles, relevant information often lives only on web pages; in such cases, one of the many web scraping tools can be used, depending on the needs of the project. To learn more about web scraping, refer to the Web Scraping Guide.
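As an illustration, a minimal BeautifulSoup sketch is below; the URL is a placeholder, and you should always check a site's robots.txt and terms of service before scraping:

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- replace with a page you are permitted to scrape.
URL = "https://example.com/articles"

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <h2> heading on the page as a simple example.
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
for title in headings:
    print(title)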