This series of articles consists of four main chapters, and this article, “Who Is A Data Scientist, What Is It, What Does It Do?”, is the third part of the series.
The photo above depicts a data scientist at work, so absorbed as to be in meditation mode. When you love your job, every application or feature you build, or will build, gives you a kind of pleasure, and you enjoy the work.
Now, Briefly, What Is Data Science?
Across every kind of asset that can be defined as data, data science is the entire process of information-oriented work that describes the current situation, makes discoveries, categorizes, classifies, and makes predictions about the future from the current situation. Statistics, machine learning, programming, big data, and open source software are used extensively in data science processes. From a social point of view, however, “Data Science” deserves the attention it currently gets because it must also include skills such as research, inquiry, connecting problems to technical solutions, and interpreting and presenting findings.
What About a Data Scientist? What Do They Do?
A data scientist can describe the data at hand with various tools and draw inferences from the structures within it. They are the person standing at the intersection of the areas shown in the visual, able to build predictive, preventive, and prescriptive models.
They manage the process of extracting useful information from data. They are an artist of knowledge: the person who changes data from something that merely occupies physical space into something meaningful. In other words, an “Information Explorer”.
They know where to get the data and, when it does not exist, arrange for it to be produced. They know the structure of the incoming data, what it means, and what limitations and gaps it has. They shape the data to fit the problem they want to solve. If there is no problem to solve, they are curious enough to find, through exploration, the hidden structures that everyone else has overlooked. They know which descriptive, inferential, clustering, classification, predictive, and preventive approaches to apply in which situations.
Then What Are the Responsibilities of a Data Scientist?
The responsibility is to produce useful information, action recommendations, decision support systems, and data-oriented products from data, using all kinds of tools and scientific techniques.
What Skills Should a Data Scientist Have?
At a minimum:
- Mathematics, Statistics, Machine Learning
- Personal Skills
- Business Knowledge (the most important item, though of course this competence cannot be expected at the beginner level)
In a little more detail:
- Individual Abilities: Enthusiasm, Curiosity, Asking the Right Questions, Analytical Perspective, Problem Solving Ability, Effective Communication, Storytelling and Presentation Skills (these are very important)
- Scientific Foundations: Mathematics, Statistics, Probability, Linear Algebra
- Programming: Algorithmic Approach, Programming Logic, SQL (databases), NoSQL, Bash scripting, R, Python, Scala, SPSS, SAS, MATLAB etc.
- Big Data Technologies: Understanding of the Big Data Concept, Hadoop, Spark, Hive, Impala, DBs, PySpark, SparkR, sparklyr and others.
- Cloud Technologies: AWS, Google Cloud, Microsoft Azure, IBM etc.
- Statistical Learning (SL):
- Data Tidying and Pre-Processing (missing data, outliers, inconsistency checks, etc.)
- Exploratory Data Analysis (Descriptive Statistics, Data Visualization)
- Inferential Statistics (sampling theory, probability distributions, random variables, hypothesis testing, Bayesian inference, robust methods)
- Multivariate Statistical Methods (correlation, dimension reduction (PCA, LDA, Kernel PCA), analysis of variance, cluster analysis, factor analysis, correspondence analysis, path analysis, discriminant analysis etc.)
- Regression Models: Linear regression, logit and probit, multinomial logit and probit, quantile regression etc.
- Resampling Methods (cross-validation, bootstrap)
- Linear Model Selection and Regularization
- Linearity and Causality
- Machine Learning (ML):
- Regression Models: Multiple Regression, Polynomial Regression, SVR, Regression Trees, Random Forest Regression…
- Classification: Logistic Regression, K-NN, SVM, Naive Bayes, Decision Trees, Ensemble Learning Methods (bagging, boosting, random forests, …)
- Clustering: Hierarchical and Non-Hierarchical Clustering Methods (hierarchical clustering, K-Means)
- Association Rule Learning (Apriori, Eclat)
- Text Mining, NLP
- Reinforcement Learning
- Deep Learning
- Model Selection (validation, test error methods, model performance evaluation, parameter tuning) and Recognizing Fitting Problems (underfitting, overfitting, good fit)
- Awareness that the simpler option is generally better, and of the words “All models are wrong, but some are useful” (George E.P. Box)
- A very good understanding of whether the goal is predictive accuracy or causation.
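The resampling and model selection items in the list above (cross-validation, underfitting, overfitting) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the data is synthetic, the function names (`fit_constant`, `fit_line`, `fit_nearest`, `k_fold_mse`) are hypothetical, and only the Python standard library is used. It compares a model that is too simple, a suitable one, and one that memorises the noise, scoring each on held-out folds.

```python
import random
import statistics


def fit_constant(xs, ys):
    """Deliberately too simple (underfits): always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_nearest(xs, ys):
    """Deliberately too flexible (overfits): 1-nearest-neighbour copies the noise."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def k_fold_mse(fit, xs, ys, k=5, seed=0):
    """Mean squared error on held-out data, averaged over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    fold_errors = []
    for f in range(k):
        test_idx = indices[f::k]  # every k-th shuffled index forms one fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(
            statistics.fmean([(ys[i] - model(xs[i])) ** 2 for i in test_idx])
        )
    return statistics.fmean(fold_errors)


# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

constant_mse = k_fold_mse(fit_constant, xs, ys)
line_mse = k_fold_mse(fit_line, xs, ys)
nearest_mse = k_fold_mse(fit_nearest, xs, ys)

print(f"constant model (underfit) held-out MSE: {constant_mse:.2f}")
print(f"straight line             held-out MSE: {line_mse:.2f}")
print(f"1-nearest neighbour (overfit) held-out MSE: {nearest_mse:.2f}")
```

On data like this, the straight line should have the lowest held-out error. The 1-nearest-neighbour model fits its training data perfectly, yet does worse on unseen points, which is the practical meaning of overfitting.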
The placement, order, and titles of these items can be changed. Generally speaking, these abilities define a good data scientist. Given this, we can accept that concepts such as data mining, machine learning, and data science are intertwined.
Some statisticians will think that what is listed under ML actually belongs under SL. Indeed, SL and ML are intertwined and largely express the same things, with a few distinctions. Drawing on the relevant article, we can separate the two, for now, with a single sentence:
“If the aim is causal, action-oriented modeling, that is, a study focused on human and institutional behavior that seeks to understand the reasons behind events, use SL; if the only concern is predictive accuracy, use ML; if both causality and predictive accuracy are of concern, use SL first and then ML = SL * ML”
In this case, the question really comes down to whether or not we depart from linearity, because causal questions are much harder to examine in nonlinear models.
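A small sketch may make this contrast more tangible. It is an illustration only: the data is synthetic, and the helper names `fit_line` and `knn_predict` are hypothetical. A linear model yields a slope that can be read directly ("one more unit of x goes with about this much more y"), while a nonlinear method such as k-nearest neighbours yields only predictions. Note that even the slope is an association. Reading it causally requires further assumptions that code alone cannot supply.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def knn_predict(xs, ys, x_new, k=5):
    """Predict by averaging the targets of the k closest training points."""
    neighbours = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return statistics.fmean([y for _, y in neighbours])


# Synthetic data (an assumption for illustration): y = 3x + 5 plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 5.0 + rng.gauss(0, 1) for x in xs]

intercept, slope = fit_line(xs, ys)
x_new = 4.0
linear_prediction = intercept + slope * x_new
knn_prediction = knn_predict(xs, ys, x_new)

# The linear model exposes an interpretable parameter...
print(f"estimated slope: {slope:.2f} (the data was generated with slope 3)")
# ...while both models can produce a prediction for a new point.
print(f"linear prediction at x=4: {linear_prediction:.2f}")
print(f"k-NN prediction at x=4:   {knn_prediction:.2f}")
```

Both approaches predict similar values here, but only the linear fit hands back a parameter to reason about, which is the SL side of the distinction drawn above.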
You can find many articles and posts with titles like “The Importance of Statistics in Data Science”. I will not go into it here, as it is a field in its own right.
Econometric Modeling could actually be added alongside SL and ML, but I have included it under SL. So why is this important? Causality and econometric modeling are part of any thorough data scientist training. The causality principle is a field of its own.
My next post will be about “Data Scientist with Questions”. See you soon…
Do not forget!
“The truest guide in life is science… MKA”
“and you can explain this with data… MA”