Outliers: To Drop or Not to Drop? - Issue 196
From Analytics to ML: How to detect outliers and what to do with them.
There is a common misconception that outliers are bad: they skew the distribution, so the thinking goes that we should detect and remove them early before moving on to modeling or analysis.
Here is what data scientists typically do when working on ML (sketched in code after the list):
Check for null values. If they are sparse, drop those rows. If too many values are missing, find a way to fill them in.
Plot the distribution of values. Locate outliers. Remove them.
Convert categorical values into numerical ones for modeling.
Group values into features. The more user attributes the dataset has, the better the model performs.
The dataset is now clean - there are no outliers, null values, or categorical columns - and ready for modeling.
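To make the list concrete, here is a minimal sketch of that typical workflow in pandas. The DataFrame and column names (signup_days, plan, transactions) are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical user-level data, purely for illustration.
df = pd.DataFrame({
    "signup_days": [3, 5, np.nan, 7, 210, 4, 6, np.nan, 5, 8],
    "plan": ["free", "pro", "free", "pro", "pro", "free", "free", "pro", "free", "pro"],
    "transactions": [1, 2, 1, 3, 95, 2, 1, 2, 3, 2],
})

# 1. Handle nulls: drop rows if missing values are sparse, otherwise impute.
if df["signup_days"].isna().mean() < 0.05:
    df = df.dropna(subset=["signup_days"])
else:
    df["signup_days"] = df["signup_days"].fillna(df["signup_days"].median())

# 2. Locate and drop "outliers" outside 1.5 * IQR - the step this post questions.
q1, q3 = df["transactions"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["transactions"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 3. Encode categorical values as numbers.
df = pd.get_dummies(df, columns=["plan"], drop_first=True)

# 4. Derive additional features from existing attributes.
df["transactions_per_day"] = df["transactions"] / df["signup_days"].clip(lower=1)
```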
Each of the steps above may be flawed. Some are easier to troubleshoot and improve, while others are more complex and require more context.
Today, I will focus on outliers.
Outliers are not necessarily bad and do not always have to be removed. It depends on the use case:
Certain ML models handle outliers quite well, while others will degrade in performance.
Some KPIs and metrics, like DAU, ARR, or Churn, remain largely unaffected by outliers, while others, such as Time-to-Value, Transactions Per User, or average actions per user, can become misleading (see the toy example after this list).
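To see why, compare a count-style metric with an average on a made-up set of per-user transaction counts (the numbers are purely illustrative):

```python
import numpy as np

# Toy data: six typical users plus one power user with 500 transactions.
transactions = np.array([2, 3, 1, 4, 2, 3, 500])

# A count-style metric (e.g., number of active users) ignores the magnitude
# of each value, so the outlier does not move it.
active_users = (transactions > 0).sum()   # 7, with or without the outlier

# An average-style metric (e.g., transactions per user) is pulled far away
# from what a typical user actually does.
print(transactions.mean())                # ~73.6
print(np.median(transactions))            # 3.0 - much closer to typical behavior
```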
Below, I will discuss the different types of outliers, show how to detect them, and explain how to decide when you should remove, keep, or adjust them - why, in some cases, outliers are harmful, and in others you have to keep them in your dataset to make your analysis or model more precise and accurate.
Techniques to detect outliers
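One widely used first-pass rule (a sketch of a standard approach, not necessarily every technique covered here) is the z-score rule: flag values that sit more than 3 standard deviations from the mean. A minimal sketch, assuming a numeric pandas Series of synthetic per-user session counts:

```python
import numpy as np
import pandas as pd

def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask of values whose |z-score| exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return z.abs() > threshold

# Synthetic data, purely for illustration: 50 typical users plus one
# extreme power user with 500 sessions.
rng = np.random.default_rng(0)
sessions = pd.Series(np.append(rng.poisson(5, size=50), 500))

print(sessions[zscore_outliers(sessions)])   # flags the 500-session user
```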