In this blog post, I would like to highlight a few differences between the work of data scientists and that of their close neighbors, statisticians, even though the two share more or less similar technical skills.


Who Are Data Scientists And What Do They Do?

Someone gave a very succinct definition of a data scientist, and it could not have been said any better:

“A data scientist is someone who is better at programming than any statistician, and is better at statistics than any programmer.”

Being a data scientist means you need technical knowledge of both statistics and computer science. You not only have to know how to extract useful data from a huge data set, but also have to tell a story with that data, so that it turns into a useful piece of information that can serve a business purpose or simply improve the efficiency of operations. A hidden third requirement is in-depth knowledge of the field you are working in, be it a particularly niche industry or a generalized horizontal business model. Without the third factor, the first two lose most of their value. To be an excellent data scientist, the three factors have to work closely together.

But enough about the requirements. What do data scientists actually do?
A data scientist is able to fetch and read huge amounts of data efficiently, scrape it, mine it, transform it when needed, and most importantly extract the useful information that lies in it (sometimes this is done with simple visualization techniques, and sometimes with rigorous programming that digs hidden information out of layers and layers of data). A data scientist has a solid foundation in statistics, knows the principles, and understands how to interpret the results of various statistical models. Building on that, a data scientist is also able to develop machine learning algorithms that find patterns in a “target” variable, predict its future values more accurately, and drive business decisions accordingly.
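To make the "fetch, clean, transform, extract" steps above a little more concrete, here is a minimal Python sketch. The data, the column names (`region`, `units_sold`), and the helper function names are all made up for illustration; real projects would typically read from a database or files and use libraries built for much larger data.

```python
import csv
import io
import statistics

# Hypothetical raw data (made up for illustration): a tiny CSV with messy values.
RAW_CSV = """region,units_sold
north,120
south,
north,95
east,n/a
south,80
"""

def load_rows(text):
    """Fetch and read: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_rows(rows):
    """Clean and transform: keep only rows whose units_sold is a whole number."""
    cleaned = []
    for row in rows:
        value = row["units_sold"].strip()
        if value.isdigit():
            cleaned.append({"region": row["region"], "units_sold": int(value)})
    return cleaned

def mean_by_region(rows):
    """Extract information: average units_sold for each region."""
    groups = {}
    for row in rows:
        groups.setdefault(row["region"], []).append(row["units_sold"])
    return {region: statistics.mean(values) for region, values in groups.items()}

rows = clean_rows(load_rows(RAW_CSV))
summary = mean_by_region(rows)
print(summary)
```

The rows with a missing or non-numeric value are dropped before the summary is computed, which is the kind of small decision a data scientist makes (and should document) at every step.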

To summarize in brief, a data scientist knows at least one dominant programming language (Python, R, SAS, etc.) and SQL, has a strong foundation in statistics (model building, various regressions, etc.), and can build machine learning models to estimate outcomes and drive future business decisions more accurately.

Data Scientists vs. Statisticians

Even though a data scientist and a statistician do a lot of similar tasks, there are a few major differences between their responsibilities. A statistician typically handles data sets that are smaller than those a data scientist handles. A statistician uses traditional, small-scale methods of data collection, such as conducting experiments and surveys, and so builds their own data set, while a data scientist deals with existing data (generally in huge amounts). A data scientist is expected to mine, clean, and transform the data, while a statistician is expected to understand the data and the relationships between the variables in it.

A data scientist works with many different statistical and machine learning models, and selects the one with the highest accuracy on the data set so that it can be used for accurate predictions. A statistician, however, typically works with simpler models such as regressions, and aims to improve the model so that it best fits the data. Statisticians also aim to determine the relationship between each factor and the response variable, check the consistency of the model and the data, and check for any violations of the model's assumptions and address them. A data scientist, on the other hand, is typically less concerned with such violations, since many machine learning algorithms do not rely on the same assumptions. A statistician's predictions generally cover a shorter period into the future and are mostly for “internal usage”, while a data scientist works on heavier, machine learning based predictions that shape and drive business needs and decisions.
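The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is simulated, the two candidate models (`mean_only` and `linear`) are hypothetical stand-ins for a larger pool of models, and "accuracy" is measured here as mean squared error on held-out points. The first part mirrors the data scientist's habit of picking whichever model predicts unseen data best; the last part mirrors one check a statistician might run on a fitted regression.

```python
import random
import statistics

random.seed(0)

# Simulated data (made up for illustration): y is roughly linear in x plus noise.
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + random.gauss(0, 3) for x in xs]

# Hold out every fourth point for testing; fit on the rest.
pairs = list(zip(xs, ys))
train = [p for i, p in enumerate(pairs) if i % 4 != 0]
test = [p for i, p in enumerate(pairs) if i % 4 == 0]

def fit_mean(data):
    """Baseline model: always predict the average response."""
    mean_y = statistics.mean(y for _, y in data)
    return lambda x: mean_y

def fit_line(data):
    """Simple linear regression fitted by ordinary least squares."""
    mean_x = statistics.mean(x for x, _ in data)
    mean_y = statistics.mean(y for _, y in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(model, data):
    """Mean squared prediction error of a model on a data set."""
    return statistics.mean((y - model(x)) ** 2 for x, y in data)

# Data scientist's view: fit several candidates, keep the best on held-out data.
candidates = {"mean_only": fit_mean(train), "linear": fit_line(train)}
scores = {name: mse(model, test) for name, model in candidates.items()}
best = min(scores, key=scores.get)
print("best model on held-out data:", best)

# Statistician's view: inspect the residuals of the fitted regression.
residuals = [y - candidates["linear"](x) for x, y in train]
print("mean training residual:", statistics.mean(residuals))
```

In practice a statistician would go further than the mean residual, for example by plotting residuals against fitted values, but the split between "pick the best predictor" and "check that the model is sound" is the point of the sketch.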

A few differences in terminology also show up between the two fields from time to time: while statisticians refer to a predictor variable as a “factor”, data scientists refer to it as a “feature”.

Who do I relate to more closely, a Data Scientist or a Statistician?

From my personal industry experience, I am passionate about programming that helps solve real-world problems, and working with various programming languages (Python, R, Java, SAS) and SQL has helped me gain even better insights into operations. To add to my competencies, I have taken courses at NCSU to further learn Python, SQL, statistical modelling, and data science. Over the summer of 2022, I interned at a major railroad company as an ‘Operations Research’ intern, working on an end-to-end data science project. There I was exposed to multiple statistical and machine learning models using Python, R, and SQL, and gained a deeper knowledge of algorithms and of the day-to-day tasks and processes of a data scientist. I am also more inclined to go into the data science field, and with my recent industry experience, I believe that I am closer to a data scientist than I am to a statistician.

