This is the second and final part of my series on this topic, in which I try to bring some clarity to the area. By the end, you will know what to do next when data lands in front of you!
Note: I will use the following abbreviations below:
- Data Science — DS
- Exploratory Data Analysis — EDA
Then let’s go…
The first article addressed the questions you may have about how to approach a new data set, under these headings:
1. What’s the Purpose?
2. Tidy Data Process
3. Determining and Setting Variable Types
4. Summary Statistics: Showing the Basic Structure of the Data Set
We tried to explain each of these headings. Together, they can be thought of as the stage of preparing the data for processing.
Now let’s have a look at the basic headings encountered while processing data.
5. Exploratory Data Analysis
Now it is time to dive into a data set that is tidy, whose variable types are set, and whose structure is more or less known. EDA aims to reveal basic statistics and structures that cannot be detected by eye, by applying multivariate statistical analysis and data visualization approaches.
To this end, the variables in the data set are examined in a univariate, bivariate and multivariate manner. Univariate work can address the variance, mean, outliers and missing values of each variable. The point not to forget here is that outliers and missing observations should also be examined in a multivariate way before reaching a final decision. The aim of univariate analysis is to understand the structure of each variable on its own.
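As a minimal sketch of the univariate step, the snippet below summarizes a single numeric variable using only the Python standard library. The function name `univariate_summary`, the sample `ages` list, the use of `None` for a missing observation, and the 1.5 × IQR rule for flagging outlier candidates are all assumptions made for illustration; they are not prescribed by this article.

```python
import statistics

def univariate_summary(values):
    """Summarize one numeric variable; None marks a missing observation."""
    present = [v for v in values if v is not None]
    # Quartiles of the observed values (default "exclusive" method).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed 1.5 * IQR fences
    return {
        "n_missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "variance": statistics.variance(present),  # sample variance
        "outlier_candidates": [v for v in present if v < low or v > high],
    }

# Hypothetical variable with one missing value and one suspicious value.
ages = [23, 25, 24, None, 26, 22, 25, 24, 95]
summary = univariate_summary(ages)
print(summary)
```

The values flagged here are only candidates. In line with the point above, they would still need to be looked at together with the other variables before anything is removed or corrected.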
After univariate analysis, bivariate analyses are performed. Here, the relationships between variables are examined with correlation analysis. The behavior of categorical and continuous variables with respect to each other can also be examined. Again, the study is shaped by the researcher’s curiosity and the things they want to examine.
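The two bivariate checks mentioned above can be sketched as follows: a Pearson correlation between two continuous variables, and the mean of a continuous variable within each level of a categorical one. The helper names `pearson` and `group_means` and the small car-themed sample lists are hypothetical, and Pearson is just one possible choice of correlation measure.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def group_means(categories, values):
    """Mean of a continuous variable within each category level."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    return {c: sum(v) / len(v) for c, v in groups.items()}

# Hypothetical data: two continuous variables and one categorical variable.
engine_size = [1.0, 1.4, 1.6, 2.0, 3.0]
fuel_use = [4.5, 5.5, 6.0, 7.5, 10.0]
fuel_type = ["petrol", "petrol", "diesel", "diesel", "petrol"]

r = pearson(engine_size, fuel_use)
means = group_means(fuel_type, fuel_use)
print(round(r, 3), means)
```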
The multivariate statistical analysis part focuses on the structures that the variables exhibit together. For example, cluster analysis can be performed on the basis of observations or variables. If we work on the basis of observations, we cluster the observations; if we work on the basis of variables, we perform principal component analysis / factor analysis. Decisions about moving to the next steps are made according to the structures discovered in this way.
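To make the observation-based case concrete, here is a minimal sketch of clustering observations with plain k-means, written with the standard library only. K-means is just one possible clustering method, swapped in here for illustration. The `kmeans` helper, the choice of the first `k` points as initial centers, and the toy `observations` are assumptions. The variable-based case (principal component analysis / factor analysis) is not shown.

```python
import math

def kmeans(points, k, n_iter=20):
    """Plain k-means; the first k points are used as initial centers (an assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical observations measured on two variables.
observations = [(1.0, 1.1), (5.0, 5.2), (1.2, 0.9), (5.1, 4.9), (0.9, 1.0), (4.8, 5.0)]
labels, centers = kmeans(observations, k=2)
print(labels, centers)
```

In practice the variables would usually be scaled first and the number of clusters would itself be explored, so this should be read as an outline of the idea rather than a recipe.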
One of the most important aspects of EDA is data visualization. Data visualization methods, which greatly support the univariate, bivariate and multivariate analyses mentioned above, should definitely be applied to the data set. Descriptions expressed as numbers become hard to interpret as the size of the data grows. In that case, data visualization lets us carry out all of the above operations much more effectively and accurately.
With data visualization, we can effectively reveal the relationships between categorical and continuous variables, as well as create graphs such as histograms, probability density plots and bar charts. Where necessary, 3, 4 or more dimensions can be used in a single visualization. Here, “dimension” means a layer of information shown in the display, not a spatial axis, which is the first thing that comes to mind. An example image:
First dimension: variables
Second dimension: distribution of variables (color scatter plot)
Third dimension: densities (lines inside the diagonal from left to right)
Fourth dimension: relations of variables with each other, correlation values
Fifth dimension: scatter plot
Sixth dimension: model line (the line on the scatter plot)
I think being able to express all of the above in a single visual is enough to show the power of data visualization. In fact, carrying out everything we described under EDA with data visualization techniques is, quite literally, an EDA. The modeling step should not be started without examining the changes, correlations and structures in the data through cross-comparisons. If this step is skipped, the blame for results that lead to failure may end up being wrongly placed on the algorithms.
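A histogram, one of the basic graphs mentioned earlier, comes down to counting how many values fall into each bin. The sketch below computes those counts and prints a rough text version. The `histogram_counts` helper, the equal-width binning, and the `prices` sample are assumptions for illustration; a plotting library would normally draw the actual figure.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins (assumes max(values) > min(values))."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical continuous variable.
prices = [10, 12, 13, 15, 18, 21, 22, 22, 25, 30]
edges, counts = histogram_counts(prices, n_bins=4)

# One row per bin: left edge followed by one mark per observation.
for left, count in zip(edges, counts):
    print(f"{left:5.1f} | {'#' * count}")
```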
6. Continuing to Work According to the Targeted Analysis Type or Application
Now we know the data set well. Suppose we are going to buy a car: would we look only at the hood and then buy it? We wouldn’t. In the same way, a data scientist, whose job is data, should not skip these steps. So what do we do from here? We will not continue under a new heading, since what comes after this point can differ greatly.
From this point on, progress depends on the type of analysis desired or the algorithm to be used, which is determined by the data scientist’s preferences. In short, we have learned what to do first when we get data.
In future articles, the EDA steps described here will be applied in several sample applications.
That is all I have written for “The Data Is Ahead, What Will I Do Now?”. If you want to know more about DS and related topics, you can check out my other article series. For example:
What is Data Science(DS) and How can it be learned?
You can reach me through my LinkedIn account with any questions and requests.
Hope to see you in other series and articles…🖖🏼