I have split this topic into two articles so as not to bore you, and this is the first part of the series. We have obtained the data and read it in with the pandas library, so what do we do next? Here we will look for the answer to that question!
Note: I will use the following abbreviations below:
- Data Science — DS
- Exploratory Data Analysis — EDA
I do not want to open with a problem at the very start of the data preparation phase, but unfortunately there is one! If we discussed it at length here, we would drift away from our noble purpose.
First of all, being handed a ready-made data set is a problem in itself. In a proper DS project cycle, you do not receive the data all at once like this. For this reason, this article is not written in the context of the DS project cycle. It is organized around a set of questions and is meant for cases where you need to get into the data and out again quickly, such as a homework assignment or a quick application test.
Let’s assume that you have either collected the data set systematically yourself or obtained it directly in some way, and proceed accordingly. The purpose of this article is to show what can be done first once we have the data.
1. What’s the purpose?
First of all, we should determine what the purpose is. Okay, we have the data, but what do we plan to do with it? This question should have been asked before the data was obtained, but we are assuming that step was skipped. In that case, the purpose should now be stated clearly. Once the question “What’s the purpose?” has been answered, our view of the data set becomes a little more orderly.
2. Tidy Data Process
Does the data set have the structure needed for analysis? What is tidy data? Briefly: does each row in the data set correspond to an observation, each column to a variable, and each cell (the intersection of an observation and a variable) to a single value? If so, go ahead. If not, the data set should be brought into this format before going any further.
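As a small illustration of bringing a data set into tidy format, here is a minimal sketch in pandas. The table, its column names (`city`, `sales`) and its values are made up for the example; the idea is only that year columns hold values of a variable rather than being variables themselves, so they are reshaped into rows with `melt`.

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per year.
# The years are values of a variable, not variables themselves.
wide = pd.DataFrame({
    "city": ["Ankara", "Izmir"],
    "2019": [10, 20],
    "2020": [11, 21],
})

# melt turns the year columns into (year, sales) pairs, so that
# each row is one observation: a city in a given year.
tidy = wide.melt(id_vars="city", var_name="year", value_name="sales")

# The year labels were column names (strings); store them as integers.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

After reshaping, every row is one observation and every column is one variable, which is the layout the later steps assume.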
3. Determining and Setting Variable Types
Once the data set is in a tidy, regular format, we need to look at it more closely, decide what the type / scale of each variable should be, and declare this to the data set. Is a variable continuous, categorical, ordered categorical, or a date? In our opinion, these known variable types should be set explicitly. This step is important. For example, if we take an ordinal variable, whose classes have a rank order (for example, military ranks), and declare it as nominal, as if its classes had no order, we make a mistake. As another example, when a variable whose values are character strings such as city names is read into R, R recognizes it as “character” type. If these city names are a categorical variable for us and there is no order among the classes, that is, the classes are on an equal footing, then this variable should be declared to R as a “factor” variable. Therefore, we must know the types of the variables in the data set and declare them to the data set.
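Since the data in this series is read with pandas, the same idea can be sketched there as well; the pandas counterpart of an R factor is the `category` dtype. The data frame below, including the column names and the order chosen for the ranks, is a made-up example and not taken from a real data set.

```python
import pandas as pd

# Hypothetical raw data: every column arrives as plain text.
df = pd.DataFrame({
    "city": ["Ankara", "Izmir", "Ankara"],
    "military_rank": ["captain", "lieutenant", "major"],
    "joined": ["2020-01-15", "2019-06-01", "2021-03-30"],
    "salary": ["1000.5", "900", "1200"],
})

# Nominal variable: classes without any order.
df["city"] = df["city"].astype("category")

# Ordinal variable: the order of the classes is declared explicitly
# (lowest first; this order is an assumption made for the example).
rank_order = ["lieutenant", "captain", "major"]
df["military_rank"] = pd.Categorical(
    df["military_rank"], categories=rank_order, ordered=True
)

# Date variable and continuous variable.
df["joined"] = pd.to_datetime(df["joined"])
df["salary"] = pd.to_numeric(df["salary"])

print(df.dtypes)

# Comparisons are meaningful only because the variable is ordered.
print(df["military_rank"].min())
print((df["military_rank"] > "lieutenant").sum())
```

Declaring the order matters: operations such as `min` or `>` are accepted on the ordered variable, while an unordered categorical like `city` rejects them.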
Variables and statistical scale types will be discussed in detail in another article.
4. Summary Statistics: Showing the Basic Structure of the Data Set
This step is a first, rough look at the data. The steps above were only checks; here, the variable types are reviewed and basic statistics such as mean, mode, median, standard deviation and variance are examined to gain preliminary knowledge. In the next step, EDA, we try to understand how these measures of central tendency and spread come about.
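A rough first look of this kind might be sketched in pandas as follows. The small data frame and its column names (`age`, `city`) are invented for the example; note that pandas computes the sample standard deviation and variance (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical data set with one continuous and one categorical variable.
df = pd.DataFrame({
    "age": [23, 35, 35, 41, 52],
    "city": ["Ankara", "Izmir", "Ankara", "Bursa", "Ankara"],
})
df["city"] = df["city"].astype("category")

# Variable types and a one-call overview of all columns.
print(df.dtypes)
print(df.describe(include="all"))

# The individual statistics named in the text, for a single variable.
age = df["age"]
summary = {
    "mean": age.mean(),
    "median": age.median(),
    "mode": age.mode().tolist(),  # mode() can return several values
    "std": age.std(),             # sample standard deviation (ddof=1)
    "var": age.var(),             # sample variance (ddof=1)
}
print(summary)
```

`describe` is convenient for a quick overview, while the individual calls make it explicit which statistic is being looked at.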
Let’s leave it here for now. See you in my next and last article on this topic.