When starting a project, identify the objective and the type of data you will need to answer your research question as thoroughly as possible.
Thus, it should be your top priority to:
Identify the kind of data you want to collect: text, numerical, PDFs, etc.
Identify reliable data sources for the data you want to collect
Look for industry-accepted or professional data sources in your domain
Look for highly cited sources
Make a list of all possible sources & weigh the pros and cons
Identify how the data might be collected legally from these sources
Once you have these answers, you can move on to setting up data collection pipelines/procedures.
Many companies that acquire or collect data on a regular basis choose to expose that data via an Application Programming Interface (API).
In simple terms, an API lets you send structured requests to a service and receive the data you need in return. To prevent any one user from abusing the service or hogging all the bandwidth, providers usually limit the number of API requests you can make per unit of time (rate limiting).
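As a minimal sketch of what an API request with rate-limit handling might look like, the snippet below uses Python's requests library against a hypothetical endpoint (the URL and parameters are placeholders, not a real service):

import time
import requests

# Hypothetical endpoint and query parameters -- substitute the API you are actually using.
API_URL = "https://api.example.com/v1/records"
PARAMS = {"q": "climate", "page": 1}

def fetch_page(params):
    """Request one page of results, backing off if the rate limit is hit."""
    response = requests.get(API_URL, params=params, timeout=30)
    if response.status_code == 429:          # 429 = Too Many Requests
        wait = int(response.headers.get("Retry-After", 60))
        time.sleep(wait)                     # respect the provider's rate limit
        response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

data = fetch_page(PARAMS)
print(len(data.get("results", [])), "records retrieved")

Real APIs differ in authentication, pagination, and response structure, so always consult the provider's documentation before building a collection pipeline around it.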
There are several APIs available for different domains; here we highlight a few:
Social Media - Twitter API, Instagram API, Facebook Graph API
Social Science - Census Data API
Finance - WorldBank API
Many data aggregation and distribution websites make public information available as downloadable files (CSV, JSON, XLS, etc.) that you can save directly to your local machine. This data can then be manipulated and analyzed with the tools or platforms of your choice. To explore text analytics tools, see the Text Analytics Guide.
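Once a file is downloaded, loading it into an analysis tool is usually a one-liner. A short sketch using pandas is shown below; the filename is a placeholder for whichever file you downloaded:

import pandas as pd

# "census_2020.csv" is a placeholder for your downloaded file.
df = pd.read_csv("census_2020.csv")

print(df.shape)        # number of rows and columns
print(df.head())       # first few records
print(df.describe())   # summary statistics for numeric columns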
Data can also be scraped directly from websites using tools such as BeautifulSoup or ParseHub. With the rise of digitized collections and online articles, relevant information often lives only on web pages; in such cases, one of the many web scraping tools can be used, depending on the needs of the project. To learn more about web scraping, refer to the Web Scraping Guide.
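As an illustration, a minimal BeautifulSoup sketch is below; the URL is a placeholder, and you should always check a site's robots.txt and terms of service before scraping:

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- replace with a page you are permitted to scrape.
URL = "https://example.com/articles"

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <h2> heading on the page as a simple example.
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
for title in headings:
    print(title)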