## Discovering the ‘Science’ in Data Science – Part 2

Let us first consider the steps a scientist follows when using the scientific method to study and understand the natural world. We can then attempt to apply the same steps to data science.

## Scientific Method

Typically, scientists undertake the following steps when they want to explore the natural world.

1. #### Define Objective

This is where you get acquainted with the purpose of the scientific study. In most cases, this objective translates into a problem statement or question.

Let us consider this objective: determine the shape of the Earth.

2. #### Acquire Information

At this stage, you come up with various ways to make observations. These observations must provide detailed information about the problem or objective. For example, you look at the horizon from the seashore and observe its shape.

3. #### Formulate Hypothesis

This is the most crucial step. Here you make an educated guess about the solution or the answer to the question. How can you do this? You employ logical reasoning to make a statement that you can test and either prove or disprove by an experiment.

Hypothesis: The Earth is spherical in shape.

However, remember that a hypothesis is an educated guess based on observations and the inferences drawn from them. In our example, you observed that the horizon is curved, and the statement above is an inference from that observation. Hence it is a hypothesis.

4. #### Conduct Experiment

Here you may need engineering skills to design an experiment that can measure the independent and dependent variables accurately. Einstein was famous for his ‘thought experiments’. If interested, you can read more about his five famous thought experiments. Experimentation is the fuel that drives the machine.

So, in our example, let’s keep things simple by observing and recording a lunar eclipse with the naked eye on a clear night. It will clearly show the curved shadow of the Earth on the Moon. To be 100% sure, let’s observe this phenomenon from both hemispheres and from the four corners of the Earth.

Credit: https://images-cdn.9gag.com

5. #### Analyse Results

I believe this is the most exciting part of the process – ‘the proof of the pudding is in the tasting’. However, a word of caution: in any scientific endeavour, one must have the courage to accept both success and failure. At this point, you review the measured values, use mathematical or statistical models to make sense of the data, and tabulate it in an understandable manner.

If the tabulation and subsequent analysis are inconclusive, you decide whether further experimentation is needed. You then iterate between experimentation and analysis until you get conclusive results.

In our example, we review the multiple observations we made of the lunar eclipse across the globe. We record that in every observation the shadow is round. Additionally, we could diagram the data, analyse photographs, and so on to support our analysis.

6. #### Draw Conclusion

If a scientifically derived conclusion is to stand up to scrutiny, it is critical that every step above is not just followed, but followed without bias, whether personal or professional. An unbiased and obvious conclusion should then emerge from the analysis. Here we either prove or disprove the hypothesis and support the conclusion with our analyses.

The way the hypothesis is framed, the experiment is conducted, and the data is collected, tabulated and analysed will determine the success of the experiment as well as the credibility of the conclusion.

In our example, here is where we would proclaim that the Earth is spherical in shape.

Note: The simplistic method of measurement we used indicates that the Earth is spherical in shape. In reality, the Earth is an ‘Oblate Spheroid’. We would need more advanced techniques to prove that.

## Data Science (Scientific) Method

Now that we know how to apply science to study the natural world, let’s attempt to apply the above method to a Data Science problem and see how far we can go.

Let’s take a universal problem of every CEO: boosting sales. We will review an existing solution and apply the above steps to see if they fit.

Boosting Sales using Machine Learning, written by Per Harald Borgen, is the article we will refer to.  The article covers two important concepts in Machine Learning: Natural Language Processing (NLP) and Prediction.

1. #### Define Objective

The key objective here is to target the right leads, which should improve conversion and hence boost sales. Typically, the objective for a business problem is outlined in a few sentences and may sometimes be ambiguous. It is critical to study and understand the objective and then ask the right questions to break the problem definition into multiple smaller parts. Having the required domain knowledge is a big advantage, since it helps in asking the relevant questions.

2. #### Acquire Information

This involves data capture, collection and preparation for downstream use. As described in the article, the company information is extracted from a website through an API. However, some preparatory work went into coming up with the input URL. More often than not, data integration is hard due to a lack of standards, and it is extremely difficult to templatise. In our example, the data size is not that large. However, for large-scale data extraction and preparation, proper data engineering practices must be followed. Especially when dealing with big data, an advanced technology stack is needed to build a data pipeline that processes the data.

3. #### Formulate Hypothesis

In our example, we want to pick the companies that are better leads than others and thus have a higher chance of converting into customers. Since we have the company descriptions, how would we go about figuring out which are the better leads? This question leads us to the hypothesis: given a company description, we can predict the likelihood of that company being a potential customer. The work involved would be to build a classifier.
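Before a classifier can work with company descriptions, the text has to be turned into numbers. The sketch below shows one common way to do that, a simple bag-of-words encoding, using only the Python standard library. The function names and the sample descriptions are hypothetical and are not taken from the referenced article, which may encode its text differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(descriptions):
    """Map every distinct word across all descriptions to a column index."""
    words = sorted({word for d in descriptions for word in tokenize(d)})
    return {word: index for index, word in enumerate(words)}


def vectorize(description, vocabulary):
    """Turn one description into word counts, one entry per vocabulary word."""
    counts = Counter(tokenize(description))
    # The vocabulary dict preserves sorted insertion order, so positions
    # in the returned list line up with the vocabulary indices.
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical company descriptions, made up for illustration only.
descriptions = [
    "Global freight and shipping logistics",
    "Family bakery selling bread and cakes",
]
vocabulary = build_vocabulary(descriptions)
vectors = [vectorize(d, vocabulary) for d in descriptions]
```

Each description becomes a fixed-length list of counts, and words that are not in the vocabulary are simply ignored. These vectors, paired with a label such as "became a customer" or "did not", are the kind of input a classifier can learn from.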

4. #### Conduct Experiment

This happens to be one of the most interesting steps in the Data Science journey and generates the most value, if done right. You can get a head start by understanding which algorithm can be used to solve which problem. You will have to test multiple algorithms and pick the best-performing one. Typically, you divide the available data into a ‘training dataset’ and a ‘test dataset’. Going back to our example, the author has used a split of 70% for training data and 30% for test data.
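A 70/30 split like the one described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the author's code: the `train_test_split` helper, the fixed seed and the placeholder rows are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (description, label) rows; label 1 marks a good lead.
rows = [(f"company {i}", i % 2) for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))  # prints "7 3"
```

Shuffling before splitting matters: if the rows are ordered in some way, for example with all customers first, an unshuffled split would leave the test set unrepresentative of the training set.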

Since this is a classification problem, you can pick, for example, a Random Forest algorithm.

You train a model on the training dataset using this algorithm and then evaluate the trained model on the test dataset. The training phase is analogous to the experimental setup required for conducting an experiment in the natural world. The testing phase is analogous to conducting the experiment and recording the results.

5. #### Analyse Results

At this point, you compare the predictions made by the model against the test data and determine the level of accuracy. For a classifier, a confusion matrix is a better way to measure the performance of the model. If the results do not satisfy you, you can iterate over steps 4 and 5 with different algorithms and parameters. You then deploy the model into a production environment, where it faces real data and makes predictions. The accuracy of those predictions is measured to understand whether the expected outcomes are realised.

In our example, the author has tweaked the algorithm parameters to reach an acceptable level of accuracy.
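To see why a confusion matrix tells you more than a single accuracy figure, here is a small sketch for a binary lead classifier. The labels and predictions are made-up values for illustration, and the helper names are hypothetical. It assumes label 1 means "good lead" and label 0 means "not a good lead".

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1  # good lead, correctly flagged
        elif a == 0 and p == 1:
            counts["fp"] += 1  # poor lead, wrongly flagged as good
        elif a == 1 and p == 0:
            counts["fn"] += 1  # good lead that the model missed
        else:
            counts["tn"] += 1  # poor lead, correctly rejected
    return counts


def accuracy(c):
    """Share of all predictions that were correct."""
    return (c["tp"] + c["tn"]) / sum(c.values())


def precision(c):
    """Share of flagged leads that really were good leads."""
    return c["tp"] / (c["tp"] + c["fp"])


def recall(c):
    """Share of good leads that the model managed to flag."""
    return c["tp"] / (c["tp"] + c["fn"])


# Hypothetical test-set labels and model predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_matrix(actual, predicted)
print(matrix)  # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
print(accuracy(matrix), precision(matrix), recall(matrix))  # 0.75 0.75 0.75
```

For a sales team, the two kinds of error cost different things: a false positive wastes a call on a poor lead, while a false negative is a missed customer. The matrix keeps those two cases separate, which a single accuracy number does not.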

6. #### Draw Conclusion

Just like experimenting in the natural world, here too you prove or disprove the hypothesis based on the results. If the model can maintain good accuracy, it proves the hypothesis. If not, it disproves the hypothesis and it is back to the drawing board.

Solving a Data Science problem is very similar to how a scientist would undertake a scientific inquiry. To some extent, the techniques used for processing and analysing data are the same in both cases. The IT world is slowly moving from a ‘programming’ culture to a ‘learning’ culture. And the learning is not just for humans but for machines too. Machines have evolved from simple transistors and capacitors to learning, and eventually to thinking. In the future, machines will have the intelligence to develop collective intelligence, much as human beings have evolved through learning over thousands of years.

Humans and machines can collaborate to solve complex unsolved mysteries like the origin of life and origin of the universe.  This is the promise of the ‘SCIENCE’ in Data Science.
