Data Analysis Journal

Machine Learning

Outliers: To Drop or Not to Drop? - Issue 196

From Analytics to ML: How to detect outliers and what to do with them.

Olga Berezovsky
Apr 10, 2024

There is a common misconception that outliers are bad: they skew the distribution, so we should detect and remove them early, before proceeding with modeling or analysis.

Here is what data scientists typically do when working on ML:

  1. Check for null values. If they are sparse, remove them. If too many values are missing, find a way to fill them in.

  2. Create a distribution of values. Locate outliers. Remove outliers.

  3. Convert categorical values into numerical ones for modeling.

  4. Group values into features. The more user attributes the dataset has, the better the model performs.

  5. The dataset is now clean - there are no outliers, null values, or categorical data - and ready for modeling.
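
To make steps 1-3 concrete, here is a minimal sketch of that routine in plain Python. The toy dataset, the field names, and the helper functions are all hypothetical, and the 1.5 x IQR fence is just one common way to "locate outliers"; the workflow above does not prescribe a specific method. Treat it as an illustration of the habit, not as a recommendation.

```python
import statistics

# Hypothetical toy dataset: session length in minutes and a plan label.
rows = [
    {"user": "a", "minutes": 12.0, "plan": "free"},
    {"user": "b", "minutes": 15.0, "plan": "paid"},
    {"user": "c", "minutes": None, "plan": "free"},   # missing value
    {"user": "d", "minutes": 14.0, "plan": "paid"},
    {"user": "e", "minutes": 13.0, "plan": "free"},
    {"user": "f", "minutes": 480.0, "plan": "paid"},  # extreme value
    {"user": "g", "minutes": 16.0, "plan": "free"},
]

def drop_nulls(rows, field):
    """Step 1: nulls are sparse here, so simply drop those rows."""
    return [r for r in rows if r[field] is not None]

def iqr_bounds(values, k=1.5):
    """Step 2a: fences at k * IQR beyond the quartiles (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(rows, field, k=1.5):
    """Step 2b: remove every row that falls outside the fences."""
    low, high = iqr_bounds([r[field] for r in rows], k)
    return [r for r in rows if low <= r[field] <= high]

def one_hot(rows, field):
    """Step 3: replace a categorical field with 0/1 indicator columns."""
    categories = sorted({r[field] for r in rows})
    encoded = []
    for r in rows:
        new_row = {key: value for key, value in r.items() if key != field}
        for category in categories:
            new_row[f"{field}_{category}"] = 1 if r[field] == category else 0
        encoded.append(new_row)
    return encoded

clean = one_hot(drop_outliers(drop_nulls(rows, "minutes"), "minutes"), "plan")
for row in clean:
    print(row)
```

On this toy data, the routine silently drops user "f" along with the null row. Whether that 480-minute session is noise or the most interesting record in the dataset is exactly the question the pipeline never asks.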

Each of the steps above may be flawed. Some are easier to troubleshoot and improve, while others are more complex and require more context. 

Today, I will focus on outliers.

Outliers are not necessarily bad and do not always have to be removed. It depends on the use case:

  • Certain ML models handle outliers quite well, while others will degrade in performance. 

  • While some KPIs and metrics, like DAU, ARR, or Churn, remain unaffected by outliers, others can become misleading, such as Time-to-Value, Transactions Per User, Average actions, etc.
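
A small sketch of the second point, using made-up numbers: a count-based metric such as active users treats an extreme account as just one more user, while an average such as Transactions Per User is pulled far away from what a typical user does. The user IDs, the counts, and the `summarize` helper are hypothetical and only serve the illustration.

```python
import statistics

# Hypothetical daily data: user_id -> number of transactions that day.
transactions_per_user = {"u1": 2, "u2": 3, "u3": 1, "u4": 2, "u5": 2}

# Same day with one extreme account added (for example, a bot or a reseller).
with_outlier = {**transactions_per_user, "u6": 500}

def summarize(per_user):
    """Compute one count-based metric and two per-user aggregates."""
    counts = list(per_user.values())
    return {
        "active_users": len(per_user),              # DAU-style count
        "mean_tx": statistics.mean(counts),         # Transactions Per User
        "median_tx": statistics.median(counts),     # robust alternative
    }

before = summarize(transactions_per_user)
after = summarize(with_outlier)

print(before)  # active_users 5, mean_tx 2, median_tx 2
print(after)   # active_users 6, mean_tx 85, median_tx 2
```

The active-user count moves by exactly one and the median does not move at all, but the mean jumps from 2 to 85, a number that describes none of the six users.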

Below, I will discuss the different types of outliers, show how to detect them, and explain how to decide when to remove, keep, or adjust them. I will also cover why outliers are harmful in some cases, while in others you have to keep them in your dataset to make your analysis or model more precise and accurate.

Techniques to detect outliers
