Components of Data Science

Mehmet Akturk

6 min read · Oct 17, 2020

This article series, “What is data science?”, consists of 4 main parts, and this article is the second in the series.

Photo by https://www.techiexpert.com/top-data-science-use-cases-in-finance/

Note: I will often use the following abbreviations below:

Data Science — DS
Artificial Intelligence — AI
Machine Learning — ML
Big Data — BD
Deep Learning — DL

Let’s talk about the components of DS:

1. Statistics:

We can say that modern DS is based on statistical modeling of the world. The discipline called Statistical Learning Theory expresses the problems we want to solve as statistical models and aims to solve them by finding optimum parameters. To make this a little more understandable, let’s give a small example:

Suppose we want to understand how the dollar rate rises and falls. Statistical Learning Theory offers us some methods for doing this. Let’s say we decide to model the dollar rate with a basic linear regression model. Taking interest and inflation as the variables that can affect the rate, we set up an equation:

dollar rate = constant + parameter1 x INTEREST + parameter2 x INFLATION + error

According to Statistical Learning Theory, it is possible to use historical data to find the optimum values of the constant, parameter1 and parameter2. These values should minimize the error in some sense, which is what gives the word “optimum” its meaning. I won’t go into how this is done right now. But once we find these optimum parameter values, we have a model of the dollar exchange rate.
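To make the idea of “optimum parameters” concrete, here is a minimal sketch. It assumes that “minimizing the error” means ordinary least squares (the smallest sum of squared errors), which is one common choice and not necessarily the only one. The data is randomly generated for illustration and is not real market data; the function name fit_least_squares and the “true” values used to generate the fake history are my own assumptions. In practice you would usually reach for a library such as NumPy or scikit-learn instead of solving the equations by hand.

```python
import random


def fit_least_squares(rows, targets):
    """Return the parameters that minimize the sum of squared errors.

    rows: list of feature lists, each starting with 1.0 for the constant.
    Solves the normal equations (X^T X) b = X^T y by Gaussian elimination.
    """
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(k)
    ]
    # Forward elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, k):
            factor = a[row][col] / a[col][col]
            for c in range(col, k + 1):
                a[row][c] -= factor * a[col][c]
    # Back substitution.
    params = [0.0] * k
    for i in reversed(range(k)):
        known = sum(a[i][j] * params[j] for j in range(i + 1, k))
        params[i] = (a[i][k] - known) / a[i][i]
    return params


# Fake "historical data": invented values, for illustration only.
random.seed(0)
interest = [random.uniform(5.0, 25.0) for _ in range(200)]
inflation = [random.uniform(5.0, 20.0) for _ in range(200)]

# Assumed values used only to generate the fake history.
TRUE_CONSTANT, TRUE_P1, TRUE_P2 = 3.0, -0.05, 0.20
dollar_rate = [
    TRUE_CONSTANT + TRUE_P1 * i + TRUE_P2 * f + random.gauss(0.0, 0.1)
    for i, f in zip(interest, inflation)
]

# dollar rate = constant + parameter1 x INTEREST + parameter2 x INFLATION + error
rows = [[1.0, i, f] for i, f in zip(interest, inflation)]
constant, parameter1, parameter2 = fit_least_squares(rows, dollar_rate)
print(f"constant={constant:.3f} parameter1={parameter1:.3f} parameter2={parameter2:.3f}")
```

The recovered values should land close to the ones used to generate the data, which is exactly what “finding the optimum parameters from historical data” means here.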

Photo by https://www.futbaltalktics.com/single-post/2018/06/12/Statistics-in-Football-Relevant-or-Not

Today, with the rapid development of computers and other devices with processing power, training machines indirectly as above, instead of coding them directly, makes Statistical Learning Theory a very important tool. We can actually say that the field we call ML is Statistical Learning Theory applied to programmable devices. To give a short definition:

“ML is the discipline that aims to have computers learn to do a task without being explicitly and directly programmed.”

ML does its “learning” by looking at data. In other words, the field we call ML consists of all the algorithms that take data as input and represent a task as a model. By task I mean things like categorizing a text, recognizing people in a photograph, or estimating the next day’s dollar rate.
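As a small illustration of “learning a task from data”, here is a sketch of a text categorizer. I chose a tiny multinomial Naive Bayes classifier with add-one smoothing as the example algorithm; that choice, the function names train and predict, and the six training sentences are all my own assumptions for illustration. Notice that the code contains no rule like “if the text mentions the dollar, it is about finance”; the categories come only from the labeled examples.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Learn word counts per category from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of training texts
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the most probable label under multinomial Naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_texts = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        score = math.log(label_counts[label] / total_texts)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no information
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only.
examples = [
    ("the central bank raised interest rates", "finance"),
    ("inflation pushed the dollar rate higher", "finance"),
    ("stocks fell as the dollar rose", "finance"),
    ("the striker scored a late goal", "sports"),
    ("the team won the final match", "sports"),
    ("the coach praised the goalkeeper", "sports"),
]

model = train(examples)
print(predict(model, "the dollar rate and inflation"))
print(predict(model, "the team scored a goal"))
```

With this toy data the first sentence is assigned to finance and the second to sports. Change the training examples and the same code learns a different task.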

2. Computer Science:

DS, of course, does not only make use of Statistics. Programming is just as important. While defining ML above, I mentioned that computers learn without being directly programmed. This may look like a contradiction, but it is not: DS uses programming to implement the algorithms that let computers learn a task without being programmed for that task directly. Of course, we don’t use programming languages just for this purpose. The purposes for which we use programming languages can be listed as follows:

To pull data from repositories, databases or files.
To manipulate, clean and produce data.
To visualize the data.
To generate descriptive statistics by performing mathematical operations on data.
To translate ML methods into code that computers can understand.
To train our models with data.
To transfer our models to production systems that serve the outside world, and to keep our models alive in the production environment.
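The first few items in the list above can be sketched with nothing but Python’s standard library. This is only a sketch under stated assumptions: the inline CSV stands in for a real file or database export, its values are invented, and the helper names load_rows, clean_rows and describe are hypothetical. Visualization, model training and deployment are left out to keep it short.

```python
import csv
import io
import statistics

# Inline stand-in for a file or database export; the values are invented.
RAW_CSV = """date,interest,inflation
2020-01,11.25,12.15
2020-02,10.75,
2020-03,9.75,11.86
2020-04,8.75,10.94
2020-05,n/a,11.39
"""


def load_rows(source):
    """Pull data: read CSV records into a list of dicts."""
    return list(csv.DictReader(source))


def clean_rows(rows, columns):
    """Clean data: keep only rows whose listed columns parse as numbers."""
    cleaned = []
    for row in rows:
        try:
            numeric = {c: float(row[c]) for c in columns}
        except (TypeError, ValueError):
            continue  # drop rows with missing or malformed values
        cleaned.append({**row, **numeric})
    return cleaned


def describe(rows, column):
    """Descriptive statistics for one numeric column."""
    values = [row[column] for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


rows = clean_rows(load_rows(io.StringIO(RAW_CSV)), ["interest", "inflation"])
summary = describe(rows, "interest")
print(summary)
```

With a real data source you would replace io.StringIO(RAW_CSV) with an opened file or a database query; the cleaning and summarizing steps stay the same.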

The list above covers just a few of the things we do with programming languages. But it is obvious how important each of them is for today’s DS.
What about the choice of programming language?
Python or R?
What about Julia?
Java, Scala and Go?

Photo by https://medium.com/agileactors/top-10-programming-languages-to-learn-in-2020-infographic-9760263abb27

A lot has been written about this and I will not go into it. Let me mention just one principle and close the subject:

Programming languages are tools. You don’t use a wrench when you need a screwdriver, or pliers when you need a wrench. The same goes for the choice of programming language in DS. However, if you are just starting out, you will have to pick one and start. From what I have seen so far, I side with Python in this ongoing debate. I will write a separate short article on the Data Scientist later. For now, I will just share a saying I believe in and move on.

“A Data Scientist is a person who knows more statistics than a programmer, who knows more programming than a statistician.”

On the other hand, programming is not the only thing DS benefits from in Computer Science. In fact, a lot of things from distributed architectures to BD technologies fall into the toolset DS uses. To give an interesting example, GPUs, which were once developed to provide performance in computer games, have become an indispensable technology for ML.

3. Field knowledge:
If you have done research on what DS is, you will have seen that most of the explanations talk about Statistics and programming, but not about field knowledge. There are many reasons why I insist on field knowledge.

Photo by https://evolytics.com/blog/three-reasons-web-analytics-data-wrong/

“If you don’t have specialist knowledge specific to your field, you run the risk of being deceived by data.”

Actually, we do not need to go very deep into this. With simple logic, we can think of it like this: field knowledge guides us on which data is useful, and it also sheds light on questions of causality. Expertise about which factors can lead to which results is certainly worth more than any data or method.

So, when we discuss DS at an introductory level, we are not talking about areas of expertise. We cover the basics precisely because they pave the road to expertise.

4. Intuition:
Is DS a science?

Everything that can be falsified is scientific… Karl Popper.

Photo by https://medium.com/@connorwforsyth/music-is-a-universal-language-of-intuition-905f5a1b426c

Of course, if DS makes falsifiable claims and produces falsifiable results, it can be considered a science. However, you will hear from some that DS is also an art. Why, you ask? Let’s illustrate briefly:

Today, models created with Artificial Neural Networks produce satisfactory outputs in certain tasks. But unfortunately, we cannot explain exactly what these neural networks have learned or how; we can only catch clues. That’s why you have read many articles describing Artificial Neural Networks as “black box” models. But we also don’t know exactly how our brains work, right? Still, no one doubts that the human brain is doing very special things.

Photo by https://simplysuccessful-llc.com/business-success-and-the-art-of-better-thinking/

If you develop models with Artificial Neural Networks, you will see that creating something original, and predicting which architectures will work better for the task at hand, is sometimes a matter of intuition. Yes, your intuition, which grows out of your own past experiences and those of others, is one of a Data Scientist’s most helpful friends. Work done without intuition is work in which time slips through your fingers like water, spent on trial and error across a thousand and one possibilities.

My next post is “What do we aim to do with Data Science?” See you…

Do not forget!
“In life, the most real guide is science … MKA”
“and you can explain this with data… MA”
