5 Comments
Kirill

Hi Olga, thank you for your article.

Would you mind sharing some materials on ways to improve logistic regression when dealing with an imbalanced dataset? My problem is that only 5% of observations have the value 1, and the rest are 0. How should I approach cases like this? Thank you.

Olga Berezovsky

Hi Kirill, when you say "the rest is 0", do you mean (a) the model output doesn't make sense or is incorrect, or (b) the data you used for modeling is incomplete and has missing values?

Kirill

Hi Olga,

My apologies for being unclear in my question. I was wondering whether the ratio between positive and negative outcomes of the dependent variable requires some balancing.

For example, I want to predict a user's chance of buying something. However, because my store's conversion rate is only 2%, 98% of my training set will be users who didn't buy anything.

I have faced a similar problem, and logistic regression was inconclusive in predicting the chance to buy.

How would you recommend dealing with problems like that? I appreciate your advice :)

Olga Berezovsky

First things first: make sure the users behind your 2% conversion rate are a large enough sample to be meaningful.

Spend more time on EDA before modeling. Logistic regression WILL FAIL if:

(1) There is no correlation between your features and the outcome. If the features are random and not connected to conversion, there is no signal in the data to work with.

(2) Converted users come "in a bulk" instead of being spread out. E.g. 90% of converted users made a purchase within 1 day of some campaign. There should be healthy variance in your dataset.
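
One rough way to check point (2) during EDA is to see how much of your conversions fall on the single busiest day. This is only a minimal sketch, assuming you have one purchase date per converted user; the helper `top_day_share` and the sample dates are made up for illustration.

```python
from collections import Counter
from datetime import date


def top_day_share(purchase_dates):
    """Return (busiest_day, share of all conversions that fall on that day)."""
    counts = Counter(purchase_dates)
    day, n = counts.most_common(1)[0]
    return day, n / len(purchase_dates)


# Hypothetical data: 9 of 10 conversions land on one campaign day.
purchases = [date(2024, 3, 1)] * 9 + [date(2024, 3, 8)]

day, share = top_day_share(purchases)
print(day, f"{share:.0%}")  # prints: 2024-03-01 90%
```

A share this high suggests the conversions mostly reflect one event rather than ongoing user behavior.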

For your model to work, you need a few strong features to train it on. By strong, I mean there must be a solid correlation between a user property or action and the transaction. The more such connections you find, the stronger your features will be, and the more conclusive the model's response.
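
A quick way to look for such connections is to correlate each candidate feature with the conversion flag before fitting anything. The sketch below runs on simulated data with roughly a 2% conversion rate; the feature names (`viewed_pricing`, `noise`) and all probabilities are assumptions chosen for illustration, not real store data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


random.seed(0)
n_users = 20_000
viewed_pricing, noise, converted = [], [], []

for _ in range(n_users):
    # Hypothetical action: 10% of users view the pricing page.
    viewed = 1 if random.random() < 0.10 else 0
    # Assumed link to purchase: viewers convert far more often,
    # which puts the overall conversion rate near 2%.
    p_buy = 0.15 if viewed else 0.005
    viewed_pricing.append(viewed)
    noise.append(random.gauss(0, 1))  # a feature unrelated to purchase
    converted.append(1 if random.random() < p_buy else 0)

corr = {
    "viewed_pricing": pearson(viewed_pricing, converted),
    "noise": pearson(noise, converted),
}
for name, r in corr.items():
    print(f"{name}: {r:+.3f}")
```

In this simulation the action tied to purchase shows a clearly positive correlation, while the noise feature stays near zero. That is the difference between a feature with signal and one without.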

Kirill

Brilliant, thank you 😸
