Statistics 101: When Simple Becomes Tricky - Issue 157
How not to get trapped in descriptive statistics during data science interviews
Did you know that to get the median value of an array, you must first sort the array?
Now you do!
The median is defined by position in the ordered data, so the standard way to calculate it starts with a sort (the mean and the mode don’t need one). Welcome to another “let me remind you of some basics” data science and analysis newsletter.
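To make the sorting step concrete, here is a minimal Python sketch of a median written without any statistics library. The function name `median` and the choice to average the two middle values for even-length input are my assumptions for illustration; an interviewer may ask for a different convention.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)  # the step you cannot skip in this approach
    n = len(ordered)
    if n == 0:
        raise ValueError("median is undefined for empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element is the median.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

For example, `median([5, 1, 3])` picks the middle of the sorted list `[1, 3, 5]`, while `median([4, 1, 3, 2])` averages the two middle values `2` and `3`.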
We are so accustomed to built-in functions in SQL and Python that I barely remember what variance means anymore. I just type VAR() and don’t overthink it. A few weeks ago, I was asked to help replicate STDDEV() step by step, and let me tell you, I had to take a good minute to collect my thoughts.
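As a refresher, here is a rough Python sketch of what a step-by-step replication could look like. The function names and the `sample` flag are hypothetical, and whether a built-in STDDEV() divides by n or by n - 1 depends on the database, so check the convention of the tool you are trying to match.

```python
import math


def variance(values, sample=True):
    """Average squared distance from the mean (n - 1 divisor if sample)."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough data points")
    mean = sum(values) / n                                 # step 1: the mean
    squared_deviations = [(x - mean) ** 2 for x in values]  # step 2: squared gaps
    divisor = n - 1 if sample else n                       # step 3: pick the divisor
    return sum(squared_deviations) / divisor


def stddev(values, sample=True):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance(values, sample=sample))
```

With the data `[2, 4, 4, 4, 5, 5, 7, 9]`, the mean is 5 and the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the population standard deviation is 2.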
In technical interviews, you are often asked to break down common built-in functions, e.g. imagine there is no mode() function and write code to get the mode. (Rumor has it the exercises below were asked in data science interviews at Reddit, GitHub, Amazon, and YouTube.) These questions are not meant to be difficult. They test your understanding of statistics and whether you can recognize when to apply which function.
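For a sense of what such an answer might look like, here is one possible Python sketch of the mode without a built-in. The name `modes` and the decision to return every tied value as a list are my assumptions; this is not the specific solution asked for at any of the companies above.

```python
def modes(values):
    """Return all values that occur most often, in first-seen order."""
    counts = {}
    for value in values:
        # Tally how many times each value appears.
        counts[value] = counts.get(value, 0) + 1
    if not counts:
        raise ValueError("mode is undefined for empty data")
    highest = max(counts.values())
    # Keep every value that reaches the highest count (data can be multimodal).
    return [value for value, count in counts.items() if count == highest]
```

Returning a list is a design choice: `modes([1, 2, 2, 3, 3])` yields both `2` and `3`, which is worth raising with the interviewer, since a function that returns a single value silently hides ties.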
There is no excuse for failing to replicate the mean or median.
I assume everyone is familiar with MEAN() (AVG() in SQL), MIN(), and MAX(). Below, I’ll focus on the must-know basics you will need for A/B test analysis, time series analysis, probability estimation, and prediction: variance, standard deviation, mode, median, and distribution.