The term Data Science, first introduced in the 1980s, remains unclear to many people. Ironically, it is often touted as the saviour of modern businesses. It is therefore a puzzle to many how Data Science is supposed to solve their problems when they have minimal, if any, understanding of the subject. This article strives to demystify the term. We will attempt to lay a fairly solid foundation for our readers to build on, without digging too deep; the aim is to introduce the concepts. Before we dive into what data science is all about, let us first define a few terms.
Definition of terms.
- Artificial intelligence (AI). The theory and development of computer systems able to perform tasks normally requiring human intelligence. Such tasks include visual perception, speech recognition, decision-making, and translation between languages.
- Machine learning (ML). The use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyse and draw inferences from patterns in data.
- Deep learning. A type of machine learning based on artificial neural networks in which multiple layers of processing are used to extract progressively higher level features from data.
- Statistics. The practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.
- Data Analytics. The process of examining data sets in order to find trends and draw conclusions about the information they contain.
- Algorithm. A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.
The terms above are important to our understanding of Data Science. Some of them are mistakenly used interchangeably with data science. Bear in mind from the outset that data science is a relatively new field, which explains much of the uncertainty around its terminology.
What is Data Science?
Data science overlaps heavily with artificial intelligence, but it is an interdisciplinary field in its own right. Its foundations are in statistics, mathematics, computer science, and business. It is therefore difficult to draw a clear line between what lies in the realm of data science and what does not. Simply put, data science is the study of data. It involves developing methods of recording or collecting, cleaning, storing, and analysing data to effectively extract useful information. The goal of data science is to gain insights and knowledge from data. The data scientist then has to present these insights in a way that can easily be understood.
Data science now plays a role in almost every industry. Hardly any industry today operates without data, and data science has become the fuel that drives it. Industries such as banking, finance, manufacturing, transport, e-commerce, and education rely on data science to make data-driven decisions. An organisation that bases its decisions on data is generally more likely to succeed than competitors that do not. It is therefore paramount to acquire the skills needed to extract insights from data.
The volume of data available is growing exponentially. Each day, new data is collected and stored. According to projections from Statista, 74 zettabytes of data will be created in 2021. That’s up from 59 zettabytes in 2020 and 41 zettabytes in 2019. All this data is useless unless we can derive insights from it. Every organisation is working around the clock to minimise losses resulting from poor decisions. This is where a data scientist comes in. Next, let us examine how data scientists go about creating solutions.
A day in the life of a Data Scientist
There is a common joke among data scientists that if you were to partition the time taken to deal with data, 80% of the time is spent cleaning the data while 20% is spent complaining about the data. The joke underlines how important cleaning the data is. If you work with the wrong data, you are bound to get the wrong results. As the well-known saying goes, “Garbage in, garbage out.” So what steps are taken to derive insights from raw data?
1.) Problem formulation
Once presented with a problem, the data scientist must frame it in a way that allows data science techniques to be applied. The problem may call for predictive analytics, where one tries to predict the future, or exploratory analytics, where one seeks to answer how and why something happened. A keen look at the data will also guide the data scientist towards an appropriate algorithm.
2.) Data cleaning
Next, we need to clean the data. This is where the bulk of the work lies. Data cleaning ensures that the data has been entered correctly. The data scientist works to eliminate spelling mistakes, ensure that the correct units of measurement have been used, and fix columns or rows with missing data, among other things. This process is guided by the problem statement we formulated and the kind of data we are dealing with. It is very important to get this step right, and data scientists will spend most of their allocated time at this stage.
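To make the cleaning tasks above concrete, here is a minimal sketch using only the Python standard library. The records, the field names (city, height, unit) and the choice of filling gaps with the median are all assumptions made for illustration; in practice a library such as pandas is commonly used, and the right way to handle missing values depends on the problem.

```python
from statistics import median

# Hypothetical raw records: inconsistent spelling, mixed units, a missing value.
raw_records = [
    {"city": " nairobi", "height": "172", "unit": "cm"},
    {"city": "Nairobi ", "height": "1.75", "unit": "m"},
    {"city": "NAIROBI", "height": None, "unit": "cm"},
    {"city": "mombasa", "height": "165", "unit": "cm"},
]


def clean_records(records):
    """Return cleaned copies of the records with heights in centimetres."""
    cleaned = []
    for record in records:
        # Normalise whitespace and capitalisation so variants of a name match.
        city = record["city"].strip().title()
        height_cm = None
        if record["height"] is not None:
            value = float(record["height"])
            # Convert metres to centimetres so every row uses one unit.
            height_cm = value * 100 if record["unit"] == "m" else value
        cleaned.append({"city": city, "height_cm": height_cm})

    # Fill missing heights with the median of the known heights
    # (one simple strategy among many; an assumption of this sketch).
    known = [r["height_cm"] for r in cleaned if r["height_cm"] is not None]
    fill_value = median(known)
    for r in cleaned:
        if r["height_cm"] is None:
            r["height_cm"] = fill_value
    return cleaned


cleaned_records = clean_records(raw_records)
for row in cleaned_records:
    print(row)
```

After cleaning, the three spellings of Nairobi collapse into one value, every height is expressed in centimetres, and the missing height has been filled in.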
3.) Exploratory Data Analysis (EDA)
At this point, we look for relationships in the data. We can calculate statistics such as the mean, median, and quartiles to get an understanding of the data. Data visualisation libraries such as seaborn come in handy. By displaying the data graphically, we can notice relationships that we can start to build on. The goal at this stage is to get a bird’s eye view of the data. You are encouraged to ask as many questions as necessary to get a good understanding of these relationships. Such questions may include:
- Which values are the most common?
- Which values are rare?
- Does that match your expectations?
- Can you see any unusual patterns?
- What might explain them?
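The first two questions, along with the summary statistics mentioned above, can be answered with a few lines of Python. The sketch below uses only the standard library and made-up data (customer ages and product categories are hypothetical); treating a value seen only once as "rare" is likewise an assumption for this toy example.

```python
from collections import Counter
from statistics import mean, median, quantiles

# Hypothetical sample: customer ages and the product category each one bought.
ages = [23, 25, 25, 31, 35, 35, 35, 42, 58]
categories = ["books", "toys", "books", "games", "books",
              "toys", "books", "garden", "books"]

summary = {
    "mean": mean(ages),
    "median": median(ages),
    # quantiles(..., n=4) returns the three cut points between the quartiles.
    "quartiles": quantiles(ages, n=4),
}

counts = Counter(categories)
# Which value is the most common?
most_common_value, most_common_count = counts.most_common(1)[0]
# Which values are rare? Here: values that appear exactly once.
rare_values = sorted(value for value, count in counts.items() if count == 1)

print(summary)
print("most common:", most_common_value, most_common_count)
print("rare:", rare_values)
```

Whether the output matches your expectations, and what might explain any surprises, remains a judgement call that usually benefits from plotting the data as well.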
4.) Feature selection
At this stage, you have a good idea of which features are necessary for your model. Not all the features will be used in creating the model. If you are not sure which features are insignificant, consult an expert in the field. We can also rely on the observations made during EDA. Choosing the right features reduces the number of iterations needed before arriving at the best results.
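One simple way to support this choice with numbers is to rank features by how strongly each one correlates with the target. The sketch below is only one of many possible approaches: the house-price data, the feature names and the 0.5 cut-off are all hypothetical, and correlation only captures linear relationships, so it does not replace expert judgement.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical data: house price (the target) and three candidate features.
features = {
    "floor_area": [50, 60, 80, 100, 120],
    "bedrooms": [1, 2, 2, 3, 4],
    "house_number": [7, 3, 9, 5, 6],
}
price = [100, 120, 160, 200, 240]

# Score each feature by the absolute value of its correlation with the target.
scores = {name: abs(pearson(values, price)) for name, values in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

# Keep features above an arbitrary cut-off (an assumption of this sketch).
selected = [name for name in ranked if scores[name] >= 0.5]

print(ranked)
print(selected)
```

In this toy data the house number carries almost no information about the price, so it falls below the cut-off and is dropped.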
5.) Model Development
Here we apply the available machine learning algorithms to our problem. The choice of algorithm will be influenced by the insights gained up to this step. Depending on whether we are dealing with labelled or unlabelled data, one may need to use a supervised learning algorithm or an unsupervised learning algorithm. There are numerous algorithms to choose from, and it is beyond the scope of this article to dig into their pros and cons. However, here are a few popular algorithms in use today:
- Linear regression
- Logistic regression
- Classification and regression trees (CART)
- Naive Bayes
- KNN
- K-means
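To give a feel for what developing a model involves, here is a minimal sketch of the first algorithm in the list, linear regression with a single feature, fitted by ordinary least squares. The data (hours studied against exam score) and the function names fit_line and predict are hypothetical; in practice a library such as scikit-learn is commonly used instead of hand-written code.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: hours studied and the exam score obtained.
hours = [1, 2, 3, 4, 5]
exam_scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, exam_scores)


def predict(x):
    """Predict a score for x hours using the fitted line."""
    return slope * x + intercept


predicted_score = predict(6)
print(slope, intercept, predicted_score)
```

Because the labels (exam scores) are known in advance, this is an example of supervised learning. An algorithm such as K-means, by contrast, groups unlabelled data.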
6.) Model evaluation
We now want to test our model. This is an equally important stage, because it tells us whether we did a good job. Model evaluation aims to estimate the generalisation accuracy of a model on future (unseen) data. Simply put, we want to know how well our model can make predictions when it is subjected to real-world problems. Several metrics can be used here. Again, we will not go into the mathematics behind these metrics. To list just a few:
- Accuracy
- Precision
- Recall or Sensitivity
- F1 score
- ROC curve
- Log loss
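The first four metrics in the list can all be computed from counts of correct and incorrect predictions. The sketch below assumes a binary classification problem with labels 1 (positive) and 0 (negative); the labels, the predictions and the function name classification_metrics are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions on unseen data.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy is the share of all predictions that are correct, precision is the share of predicted positives that are truly positive, recall is the share of actual positives that the model found, and the F1 score is the harmonic mean of precision and recall.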
Conclusion
We have taken a brief look at what data science is, and we have touched on the steps a data scientist takes in solving a problem. Note that during the problem-solving process, the data scientist may have to go back and revisit the stages multiple times, until the model is capable of producing the desired results. The data scientist must also work hard to eliminate any bias. Head to the comments section and let us know what you think about data science. Feel free to ask questions and participate in the discussion.