The Data Science Project Cycle series consists of five articles; this is the third. In this part, we will talk about “Feature Engineering”.
Note: I will use the following abbreviations below:
- Data Science — DS
- Artificial Intelligence — AI
- Machine Learning — ML
- Big Data — BD
- Deep Learning — DL
- Statistical Learning — SL
1. Feature Engineering
Since we are talking about an ML-oriented project development process, we can gather the tidy process, exploratory data analysis, and data manipulation steps under this heading. Each of these steps has its own importance. Detailed articles will follow both for these steps and for the feature engineering work that feeds into the modeling step.
1.1. Tidy Process:
We will not go into much detail here, but in summary, the tidy process means structuring the data before any analytical work. “Tidy” here means that each row represents the unit of observation of interest and each column represents a variable. In other words, each row should contain only the variable values belonging to a single observation.
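The idea above can be sketched in a few lines of pandas. This is a minimal illustration with a hypothetical wide table (the city and sales columns are invented for the example); `melt` reshapes it so that each row holds exactly one observation:

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Istanbul", "Ankara"],
    "2019": [150, 90],
    "2020": [170, 95],
})

# Tidy form: each row is one (city, year) observation,
# and each column is exactly one variable.
tidy = wide.melt(id_vars="city", var_name="year", value_name="sales")
print(tidy)
```

After the reshape, every row answers one question (“sales of this city in this year”), which is exactly the structure the analytical steps below assume.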
1.2. Exploratory Data Analysis
Let’s say we are going to buy a car. We will hand over our money and, in a way, entrust our life to it. Would you buy that car after glancing only at the hood?
We would be curious, right? How many kilometers does it have? Has it been repainted? Have any parts been replaced? Automatic or manual? Diesel or gasoline? How is the fuel consumption?
If the goal is simply to own a car and only that outcome matters, congratulations: you bought a car.
The outputs of DS projects that skip SL and Exploratory Data Analysis, unfortunately, amount to exactly that kind of purchase.
This process, which statisticians have for years called Descriptive Statistics, is one of the most critical steps of a DS project. Understanding the current situation and revealing the structures within the data, skills we listed while enumerating the characteristics of a Data Scientist, are carried out at this step. It covers all of the pattern detection and data visualization techniques.
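Staying with the car-buying analogy, a first descriptive pass might look like the sketch below. The listings table is hypothetical; the point is the habit of summarizing, counting, and checking relationships before modeling anything:

```python
import pandas as pd

# Hypothetical used-car listings, in the spirit of the analogy above.
cars = pd.DataFrame({
    "km": [45000, 120000, 80000, 30000, 200000],
    "fuel": ["diesel", "gasoline", "diesel", "gasoline", "diesel"],
    "price": [18000, 9000, 14000, 21000, 6000],
})

# Descriptive statistics: the numeric summary statisticians start from.
print(cars.describe())

# Frequencies of a categorical variable.
print(cars["fuel"].value_counts())

# A first look at the pairwise linear relationship between km and price.
print(cars[["km", "price"]].corr())
```

Even on toy data, the summary already tells a story: higher mileage goes with lower price, which is the kind of “current situation” insight this step is meant to surface.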
Do not memorize; learn the logic! There are those who simply grab ready-made scripts and apply them: “I gave the algorithm the inputs, and the outputs keep coming.” These friends, whom we define as “Script Scientists” and even call SS among ourselves, were called “lamers” in the web world of the ’90s.
Some of you will remember: there was a concept of respecting the effort, and articles used to end with “respect the effort” GIFs. So let’s respect the effort, friends! Share the article, comment, give feedback; interaction is the only diesel that keeps our car going.
And if you are one of those who think data is just rows and columns, repent immediately. They are not mere rows and columns; data is alive and contains life. What you call a row is not a line: it is an observation, and it has a quality of its own. Start there.
“But they are called features,” some will say, and some pretty big names do call them that. Quite right, a perfect point. There are also those who say variable. The distinction will be laid out in the article on the causality principle. And yes, they are not always called observations either: sometimes they stay rows, sometimes they become features. The important thing is to be aware of the distinction.
This step, where we examine the variables in the data set and how they relate to one another, is perhaps the most underestimated part of a data science project. Very detailed articles on it will follow. For now, let us note that the analysis of the current situation is done here, and moving on to the other steps without it would be a real deficiency.
1.3. Data Manipulation & Feature Engineering
This step applies multivariate statistical approaches, such as missing data analysis, outlier examination, and inconsistency checks, to data that has been tidied and then understood in the data visualization step. All items under this topic will be covered. The most striking part of this step is that the Data Scientist uses their creativity for feature engineering, trying to create variables that do not physically exist, with the causal context in mind.
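The three activities named above can be sketched together in one short pass. This is a minimal illustration on an invented table: median imputation and 1.5×IQR fences are just two common choices among many, and `km_per_year` stands in for the kind of variable that “does not physically exist” until the analyst derives it:

```python
import numpy as np
import pandas as pd

# Hypothetical listings: one missing km value and one suspicious one.
df = pd.DataFrame({
    "km": [45000, np.nan, 80000, 30000, 900000],
    "year": [2018, 2015, 2016, 2020, 2019],
    "price": [18000, 9000, 14000, 21000, 6000],
})

# 1) Missing data: fill km with the median (one simple strategy of many).
df["km"] = df["km"].fillna(df["km"].median())

# 2) Outliers: flag values outside the 1.5 * IQR fences.
q1, q3 = df["km"].quantile([0.25, 0.75])
iqr = q3 - q1
df["km_outlier"] = ~df["km"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3) Feature engineering: derive variables that were not measured directly,
#    keeping the causal context in mind (usage intensity, not raw mileage).
df["age"] = 2024 - df["year"]
df["km_per_year"] = df["km"] / df["age"]
```

The derived `km_per_year` column is the creative part: nobody recorded it, yet it often carries more causal signal about wear than `km` alone.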
Let’s finish the topic, and the article, with a well-known line from any Statistics 101 course.
Of course, we do not mean feeding the mortality rate of seals into the model to estimate aircraft delays! Nor do we mean chasing raw correlation: correlation does not always imply causation!
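The seals-and-delays warning can be simulated rather than argued. Below, two independent random walks stand in for the two series; this is pure simulation, not real flight or seal data, and the variable names are only labels for the joke. Trending series like these routinely show sizable correlations despite having no causal link at all:

```python
import numpy as np

# Two independent random walks: no causal connection whatsoever.
rng = np.random.default_rng(42)
aircraft_delays = np.cumsum(rng.normal(size=500))   # simulated, not real data
seal_mortality = np.cumsum(rng.normal(size=500))    # simulated, not real data

# Their Pearson correlation can still be far from zero,
# which is exactly why correlation alone proves nothing about causation.
corr = np.corrcoef(aircraft_delays, seal_mortality)[0, 1]
print(f"correlation between two unrelated series: {corr:.2f}")
```

Rerunning with different seeds makes the point even more vividly: the correlation swings widely while the causal link remains exactly zero.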
See you in our next topic “Modelling and Model Evaluation”.