Let me start by noting that there is a clear difference between a data analyst and a data scientist, though it is not the goal of this article to spell out those differences. What matters here is that both data analysts and data scientists collect data and perform analysis, albeit to varying degrees. For this reason, I may use the terms interchangeably. Read this article to learn more about a data scientist.
Secondly, there are multiple ways to work with your data. You can use Excel, R or Python, among other options. We have elected to use the Python programming language due to its popularity and simplicity.
In this article, we will explore different ways you can collect data for the purposes of analysis, focusing on the main ones. I am assuming you are new to data collection, so this article covers the basics, but enough to get you started on your data collection journey.
That said, let us dig deeper. As a data analyst, you may be interested in a research topic. Whichever subject you are working on, you will definitely need data to support your conclusions. How do you gather this data and use a computer to perform analysis? The following are ways you can collect data:
- Uploading a flat file readily available on your machine
- Using an API to get data from a website
- Scraping a web page
1.) Working with flat files
According to an article by Aditya Sharma, flat files are data files that contain records with no structured relationships between them, and no indexing structure of the kind you typically find in relational databases. These files contain only basic formatting, have a small, fixed number of fields, and may or may not have a file format.
Examples of flat files include CSV, TSV and plain TXT files. In Python, we can use pandas to load flat files into our code for analysis, for example with the read_csv() function. More about this function can be found in the official pandas documentation. The function can be used as follows:
```python
import pandas as pd

pd.read_csv('example.csv')
```
Note that we need to import pandas before we can use it to load files. Also note that you need to match the reader to the file type: CSV, TSV, TXT, Excel, XML, et cetera (pandas provides dedicated readers such as read_excel() for Excel files). The example above is for CSV files. read_csv() can also load other delimited formats; for example, let us use the function to load a tab-separated values (TSV) file:
```python
df = pd.read_csv('filename.tsv', sep='\t')
```
In the code above, we have specified that the file is tab-separated rather than comma-separated. We have also stored the data in a DataFrame named df. This is helpful because whenever we want to access the data, we can simply refer to it by name, i.e. df.
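To make the idea concrete without needing a file on disk, here is a minimal, self-contained sketch: io.StringIO lets read_csv() treat an in-memory string as if it were a file. The data itself is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
csv_data = io.StringIO("name,score\nAda,90\nGrace,85")
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 2): two rows, two columns

# The same function reads tab-separated data once we pass sep='\t'.
tsv_data = io.StringIO("name\tscore\nAda\t90")
df_tsv = pd.read_csv(tsv_data, sep="\t")
print(list(df_tsv.columns))  # ['name', 'score']
```

In real use you would pass a filename such as 'example.csv' instead of the StringIO object; everything else stays the same.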
2.) Web scraping
Before we look at web scraping as a means of collecting data, let us first make a few things clear. Whether web scraping is legal is a matter of debate: some websites allow their data to be scraped, others forbid it, and still others are silent on the whole issue. It is also important to note that scraping a website consumes that website's server resources.
To put that into perspective, scraping a single page once a week should not be a problem. However, scraping tens of pages from one website every 10 minutes can be a real concern, especially for sites that do not have huge resources. Before settling on web scraping, I advise exploring alternatives such as APIs. It is also good practice to seek permission from the website's administrators before you start scraping.
That said, let us define web scraping. Web scraping is an automated method of extracting large amounts of data from websites. When you run web scraping code, a request is sent to the URL you specify. In response, the server sends back the data and allows you to read the HTML or XML page. The code then parses the HTML or XML, finds the data and extracts it.
Python is preferred to other solutions due to the simplicity of its syntax. Python also has a large number of libraries useful in many applications; in short, there is little you cannot do with Python. For web scraping, we use the requests library, the de facto standard for making HTTP requests in Python. It is an integral part of Python workflows for sending HTTP requests to a specified URL, whether for REST APIs or for web scraping.
Another useful library for collecting data via web scraping in Python is Beautiful Soup, a library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching and modifying the parse tree, and it commonly saves programmers hours or days of work.
Let us combine the two in an example.
```python
import requests

page = requests.get("https://thetargeturl.com")
page
```
The code above fetches the contents of the web page and stores them in a response object we have named page. The last line evaluates page, which displays the response object; the raw HTML itself is available via page.content. We can also check whether the download was successful via page.status_code, which should return a status code of 200, meaning success.
Parsing a page with BeautifulSoup
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')
```
We can further display the HTML with proper indentation using soup.prettify().
BeautifulSoup allows us to use HTML classes and IDs to find specific paragraphs. To learn more about BeautifulSoup, check out its documentation.
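As a minimal sketch of searching by class and ID, here is a self-contained example that parses a small, made-up HTML snippet instead of a downloaded page (the tags, classes and IDs are all hypothetical):

```python
from bs4 import BeautifulSoup

# A hypothetical HTML snippet standing in for a downloaded page.
html = """
<html><body>
  <p class="intro" id="first">Hello</p>
  <p class="intro">World</p>
  <p class="outro">Bye</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Find every paragraph with a given class...
intros = soup.find_all("p", class_="intro")
print([p.get_text() for p in intros])  # ['Hello', 'World']

# ...or a single element by its id.
first = soup.find(id="first")
print(first.get_text())  # Hello
```

On a real page you would pass page.content from requests instead of the inline string, then inspect the page's own classes and IDs in your browser's developer tools to decide what to search for.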
3.) Use of API (Application Programming Interface)
Finally, let us check out the API method. Where a website provides an API, this method should be preferred to scraping: it is simpler to use and not as brittle. APIs allow us to perform a number of actions on the website's service using HTTP methods, often loosely called commands. Examples include:
- GET: enables users to fetch data from the API onto their system, usually in JSON format.
- POST: enables us to add data to the service behind the API.
- DELETE: enables us to delete certain information from the API service.
- PUT: enables us to update existing data or information in the API service.
APIs also have a way of letting us know whether a request succeeded: status codes. It is worth understanding what these codes mean. The most common ones include:
- 200: a healthy connection with the API was made.
- 204: the connection succeeded, but no data was returned.
- 401: authentication failed.
- 403: access is forbidden by the API service.
- 404: the requested resource was not found on the server.
- 500: an internal server error occurred.
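In practice it helps to branch on the status code rather than assume success. The small helper below is an illustrative sketch (the function name and wording are my own) that maps the codes listed above to human-readable messages:

```python
def describe_status(code: int) -> str:
    """Translate the common API status codes above into short messages."""
    meanings = {
        200: "OK: healthy connection made",
        204: "No Content: connection succeeded but no data returned",
        401: "Unauthorized: authentication failed",
        403: "Forbidden: access denied by the API service",
        404: "Not Found: the requested resource does not exist",
        500: "Internal Server Error",
    }
    return meanings.get(code, "Unrecognised status code")

print(describe_status(200))  # OK: healthy connection made
print(describe_status(418))  # Unrecognised status code
```

With requests you would pass response.status_code into such a helper, or simply call response.raise_for_status() to raise an exception on any 4xx/5xx code.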
Let us use the Gmail API as an example. We start by importing the necessary libraries. Remember the requests library we used in web scraping? We will be needing it here too. The code will look something like this:
```python
import requests
import json

ourResult = requests.get('https://gmail.googleapis.com/$discovery/rest?version=v1')
print(ourResult.status_code)
```
The code above should return a status code of 200, which means we successfully connected to the server. Next, we can extract data from the JSON response, such as the description of the API and the description of the key parameter.
```python
data = ourResult.text
parse_json = json.loads(data)

info = parse_json['description']
print("Info about API:\n", info)

key = parse_json['parameters']['key']['description']
print("\nDescription about the key:\n", key)
```
Let us know in the comment section what response you get.
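Because the live response can change over time (and requires a network connection), the same parsing steps can be rehearsed against a hard-coded JSON string. The sample below is a made-up stand-in that merely mimics the shape of the discovery document, not its real contents:

```python
import json

# Hypothetical JSON mimicking the structure of the API response.
sample = '''
{
  "description": "Example API description",
  "parameters": {
    "key": {"description": "API key for the request"}
  }
}
'''
parsed = json.loads(sample)

# The same lookups as in the live example work on the sample.
print(parsed["description"])
print(parsed["parameters"]["key"]["description"])
```

If a key is missing from the real response, these lookups raise KeyError, so parsed.get('description') is a safer choice when you are unsure of the payload's shape.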
We have covered the main ways to gather data for your data science or data analytics project. We hope to see your thoughts in the comments section; should you have alternatives or more efficient approaches, feel free to dive into the discussion.