<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Data Analysis Journal: Machine Learning]]></title><description><![CDATA[Supervised Machine Learning algorithms basics and implementation]]></description><link>https://dataanalysis.substack.com/s/machine-learning</link><image><url>https://substackcdn.com/image/fetch/$s_!WdsI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7029b3-f274-4215-ac43-d275f496ecf8_200x200.png</url><title>Data Analysis Journal: Machine Learning</title><link>https://dataanalysis.substack.com/s/machine-learning</link></image><generator>Substack</generator><lastBuildDate>Thu, 16 Apr 2026 15:33:10 GMT</lastBuildDate><atom:link href="https://dataanalysis.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Olga Berezovsky]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dataanalysis@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[dataanalysis@substack.com]]></itunes:email><itunes:name><![CDATA[Olga Berezovsky]]></itunes:name></itunes:owner><itunes:author><![CDATA[Olga Berezovsky]]></itunes:author><googleplay:owner><![CDATA[dataanalysis@substack.com]]></googleplay:owner><googleplay:email><![CDATA[dataanalysis@substack.com]]></googleplay:email><googleplay:author><![CDATA[Olga Berezovsky]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[From Forecast to Action - Issue 279]]></title><description><![CDATA[Why communicating, packaging, and embedding forecasts is where most data scientists 
fail.]]></description><link>https://dataanalysis.substack.com/p/from-forecast-to-action</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/from-forecast-to-action</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 03 Sep 2025 12:02:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!81hK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4fde48d-c9d0-4bea-a70a-489716bc6454_1444x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Please welcome the final section of my forecasts series, where I&#8217;ll share the challenges, pitfalls, and caveats of what is allegedly the easiest part of the forecasting lifecycle - delivering results and communicating them to stakeholders.</p><p>If you are new, start here:</p><ol><li><p><a href="https://dataanalysis.substack.com/p/forecasting-in-analytics">Forecasts Part 1: Choosing the Right Approach</a> - How to predict revenue, user growth, and &#8230;</p></li></ol>
      <p>
          <a href="https://dataanalysis.substack.com/p/from-forecast-to-action">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[From Analytics to Data Science: Building Forecasts - Issue 257]]></title><description><![CDATA[A step-by-step guide to ARIMA and Prophet to forecast events, transactions, and customer growth.]]></description><link>https://dataanalysis.substack.com/p/from-analytics-to-data-science-forecasting</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/from-analytics-to-data-science-forecasting</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 07 May 2025 12:02:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe95f1254-a53c-4baa-bb43-dec69715c1fe_1600x962.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>Forecasting is one of the most over-discussed topics in data science, and yet it&#8217;s still hard to get right. 
It&#8217;s one thing to build a model for Kaggle, but it&#8217;s entirely different to deploy it in a real business or product context and fine-tune it while accounting for anomalies, seasonality, external factors, ad spend fluctuations, new feature rollouts, and a hundred other variables we&#8217;re expected to quantify.</p><p>This publication is a follow-up to <em><a href="https://dataanalysis.substack.com/p/forecasting-in-analytics">Forecasting in Analytics: Choosing the Right Approach</a></em>, published a few weeks ago, where I introduced different types of forecasts and ML models, as well as common use cases for predictive modeling in analytics. I went over common financial and revenue forecasts, including methods like:</p><ol><li><p>Historical Growth Rate (or Straight Line)</p></li><li><p>Moving Average</p></li><li><p>Simple Linear Regression</p></li><li><p>Multiple Linear Regression</p></li></ol><p>Today, I take it a step further into data science, covering more complex ML modeling and discussing time series forecasting methods used to predict events, transactions, subscriptions, downloads, and more. 
I&#8217;ll walk you through my approach to forecasting, steps, and code.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yEd_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yEd_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!yEd_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!yEd_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!yEd_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yEd_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png" width="164" height="164" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:164,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yEd_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!yEd_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!yEd_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!yEd_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba1c49b-f1ba-4728-9d43-ac483f48a216_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>My approach to forecasting</h1><p>I typically use 4 different methods (3 ML and 1 projection) to model multiple predictions, ranging from more conservative (lower bound) to more aggressive (upper bound):</p><ol><li><p>ARIMA - Moving Average to smooth time series data, w/o adjusting for seasonality.</p></li><li><p>SARIMAX - Moving Average to smooth time series data while 
incorporating seasonality. This is typically the safest and most conservative forecast. Expect it to sit at the lower bound.</p></li><li><p>PROPHET - Forecasting for non-linear trends, incorporating seasonality. It&#8217;s more aggressive in its predictions, but it is also more advanced than most other forecasting methods and is often accurate.</p></li><li><p>Projection - <em>Olga&#8217;s secret, overly complicated manual projection.</em> I plot every available metric&#8217;s historical D/D, W/W, M/M, and Y/Y % change and analyze their (a) correlations and relationships and (b) seasonal thresholds. It takes ages and a kidney to complete, but it consistently delivers the most precise forecast. IF done right. IF I can account for everything the teams are doing.</p></li></ol><p>When reporting, I typically present only Prophet alongside my Projection, keeping ARIMA and its variations for myself as checks.</p><h3>Prophet forecast</h3>
      <p>
          <a href="https://dataanalysis.substack.com/p/from-analytics-to-data-science-forecasting">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Forecasting in Analytics: Choosing the Right Approach - Issue 249]]></title><description><![CDATA[How to predict revenue, user growth, and key business metrics using moving averages, regression, and ML]]></description><link>https://dataanalysis.substack.com/p/forecasting-in-analytics</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/forecasting-in-analytics</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 12 Mar 2025 12:03:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c0145b-8b92-4b9c-a566-d3b9996c018d_1182x848.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>I will be in Seattle next week - reach out if you&#8217;d like to meet and chat about all things analytics.</p><p>Also, this year, I am partnering with <a href="https://www.datacouncil.ai/bay-2025">Data Council</a>, a &#8220;<em>no BS data conference,</em>&#8221; and have a 20% discount to share with my subscribers. Use the code <strong>daj20</strong> to save on your ticket. The event takes place April 22-24 in Oakland, California. After quite a few years, it&#8217;s finally back home in the Bay Area. Hope to see you there.</p><div><hr></div><p>One of my recent learnings is that it&#8217;s surprisingly difficult to find analysts experienced in forecasting and prediction. 
Which is unexpected, given that regressions and time series forecasting are extensively taught in most academic programs today. But I&#8217;ve noticed that analysts tend to either overcomplicate things - applying complex ML to forecast a moving average (?) - or simply plot regressions without deeper analysis or an understanding of the coefficients.</p><p>There are many models for forecasting and predictions, each with its own caveats and context. I won&#8217;t be going over the basics of forecasts and modeling. Instead, I will focus on the practical side - how forecasting varies depending on the type of project. Forecasting revenue, for example, is very different from forecasting LTV, Churn, or subscription growth, even when using the same model and dataset.</p><p>Today, I will cover the different types of forecasts and ML, and the common use cases that require predictive modeling in analytics. I&#8217;ll walk you through the steps and models to forecast revenue, ARR, and paid customers to answer questions like:</p><ul><li><p>How many page views do we need to double signups?</p></li><li><p>How many more games, streaks, or logs must users complete before converting to paid?</p></li><li><p>How do we forecast MRR for the next 2 years?</p></li><li><p>When will we reach 1M subscriptions?</p></li><li><p>When will we hit $1M ARR with our current baselines and ad spend?</p></li></ul><p>In another follow-up piece, I&#8217;ll share examples of my forecasts and explain how to adapt them to different use cases.</p>
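<p>To make a question like &#8220;When will we reach 1M subscriptions?&#8221; concrete, here is a minimal sketch of a constant growth-rate projection. It is only an illustration, not the method used in this series: the function names <code>average_growth_rate</code> and <code>periods_to_target</code> are hypothetical, the monthly numbers are made up, and the sketch assumes growth keeps compounding at its historical average.</p>

```python
import math


def average_growth_rate(history):
    """Compound average period-over-period growth rate of a series."""
    periods = len(history) - 1
    if periods < 1 or history[0] <= 0:
        raise ValueError("need at least two points and a positive starting value")
    return (history[-1] / history[0]) ** (1 / periods) - 1


def periods_to_target(current, target, growth_rate):
    """Whole periods until `current`, compounding at `growth_rate`, reaches `target`."""
    if current >= target:
        return 0
    if current <= 0 or growth_rate <= 0:
        raise ValueError("target is never reached without a positive base and growth")
    return math.ceil(math.log(target / current) / math.log(1 + growth_rate))


# Hypothetical month-end subscription counts (made-up numbers).
subscriptions = [400_000, 420_000, 441_000, 463_050]

monthly_growth = average_growth_rate(subscriptions)
months_left = periods_to_target(subscriptions[-1], 1_000_000, monthly_growth)

print(f"average monthly growth: {monthly_growth:.1%}")
print(f"months until 1M subscriptions: {months_left}")
```

<p>A projection like this ignores seasonality, churn shifts, and changes in ad spend, so it works best as a quick baseline to compare against more careful models.</p>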
      <p>
          <a href="https://dataanalysis.substack.com/p/forecasting-in-analytics">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Refresher on Statistics - Issue 237]]></title><description><![CDATA[Key statistical concepts often used in data analytics.]]></description><link>https://dataanalysis.substack.com/p/refresher-on-statistics</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/refresher-on-statistics</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 18 Dec 2024 13:02:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcff0487d-5b9b-4ce3-8db8-9cd035184a5c_1600x950.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to my<a href="https://dataanalysis.substack.com/"> Data Analytics Journal</a>, where I write about data science and analytics.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Wrapping up the year with another &#8220;refresher&#8221;. 
This time, it&#8217;s about statistics - compiling all publications on statistics into one concise &#8220;take-home&#8221; guide.</p><p>Applying core statistics in analytics typically involves:</p><ol><li><p>Distributions</p></li><li><p>Significance and Confidence</p></li><li><p>Correlations and Regressions</p></li><li><p>Causal inference</p></li><li><p>Law of Large Numbers and Central Limit Theorem</p></li></ol><p>This publication is targeted at analysts, data scientists, and product owners who work with data and ML products.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VOJW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VOJW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!VOJW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!VOJW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!VOJW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!VOJW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png" width="200" height="200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VOJW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!VOJW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!VOJW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!VOJW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d0fc2-98cc-4660-be36-779fcb5dcbef_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h2>Why do we need statistics? 
To bring certainty and confidence</h2><p>The whole purpose of statistics (statistical theory, methods, and analysis) is to provide <a href="https://www.statisticshowto.com/uncertainty-in-statistics/">certainty out of uncertainty</a>. In other words, when you don&#8217;t have a high degree of trust (in either the quantity or the quality of the data), how do you make sure you make the right decision?</p>
      <p>
          <a href="https://dataanalysis.substack.com/p/refresher-on-statistics">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[When and Why You Need Randomly Distributed Users - Issue 226]]></title><description><![CDATA[Statistics 101 or Intro to Random Sampling: Simple and advanced techniques for applying sampling methods for your analysis.]]></description><link>https://dataanalysis.substack.com/p/when-and-why-you-need-randomly-distributed</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/when-and-why-you-need-randomly-distributed</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 09 Oct 2024 12:01:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qL6A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3fffefe-e5d7-4d66-976e-3e90e4ee434a_694x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Welcome back to the <em><a href="https://dataanalysis.substack.com/s/machine-learning">Statistics 101</a></em> series. Today, let&#8217;s talk about:</p><ol><li><p>How to get randomly distributed users, using both basic and advanced methods.</p></li><li><p>When to apply sampling techniques to randomize users.</p></li><li><p>Why these techniques matter.</p></li></ol><p><em>Sampling </em>helps uncover patterns and trends within larger datasets. Samples can be random, uniform, structured, or sorted in specific ways. 
Here, <em>random </em>does not mean a different number every time, but rather that <em>it can&#8217;t be predicted</em>.</p><h2>When do you need to get randomly distributed users?</h2>
      <p>
          <a href="https://dataanalysis.substack.com/p/when-and-why-you-need-randomly-distributed">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Applying Regressions in Analytics - Issue 221]]></title><description><![CDATA[How to interpret regression plots and understand the equation that drives analysis and predictions]]></description><link>https://dataanalysis.substack.com/p/applying-regressions-in-analytics</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/applying-regressions-in-analytics</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 11 Sep 2024 12:01:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DlfG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7bfaa09-ee16-47a2-b00f-bb3a98443ec6_1294x930.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Today, I want to talk about the most commonly used type of analysis, which is overused by executives, PMs, and engineers yet somehow underused by data scientists and analysts: <strong>regressions</strong>.</p><p>People who borrow data-hat often like to plug numbers into regression models and share plots left and right, using them to back up their hypotheses or assumptions. What they might not realize is that linear regression is a form of ML, which requires data cleaning, feature engineering, addressing <a href="https://towardsdatascience.com/overfitting-and-underfitting-principles-ea8964d9c45c">underfitting and overfitting</a>, and more. 
Simply reading scatterplots is not enough. They apply regression without understanding what the scores and equation actually mean or how to interpret them.</p><p>Statisticians, on the other hand, often use more advanced regression techniques and ML models, sometimes making the analysis more complex than necessary.</p><p>I am absolutely guilty of the latter.</p><p>In this publication, let&#8217;s recap:</p><ul><li><p>Use cases across different regression types.</p></li><li><p>Examples of product and marketing hypotheses that can be tested with regression.</p></li><li><p>What does the regression equation mean, and how do we interpret the regression scores?</p></li><li><p>How to break down scatterplot slope using the regression equation.</p></li><li><p>Does regression prove causation?</p></li><li><p>How to apply causal inference in analytics.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AHHy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AHHy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!AHHy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!AHHy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!AHHy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AHHy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png" width="160" height="160" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e253746b-3409-4b11-9e9b-c19649de5786_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:160,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AHHy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!AHHy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!AHHy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!AHHy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe253746b-3409-4b11-9e9b-c19649de5786_200x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h1>When and why to use regression</h1>
      <p>
          <a href="https://dataanalysis.substack.com/p/applying-regressions-in-analytics">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Handling Missing Data: Should You Drop or Impute? - Issue 219]]></title><description><![CDATA[Exploratory Data Analysis: Techniques for handling NULL values in modeling and analysis]]></description><link>https://dataanalysis.substack.com/p/handling-missing-data-for-ml</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/handling-missing-data-for-ml</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 28 Aug 2024 12:02:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa34a4ac9-5f28-4dcf-9511-0cb27fa18b5d_1360x1082.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here&#8217;s what I&#8217;ve noticed about analysts:</p><p>Beginners often ignore missing values, even when the percentage is impactful. (It&#8217;s actually funny - they might report that 86% of the data is missing and then proceed with the analysis or modeling as if nothing is wrong).</p><p>Senior data scientists tend to rush to impute missing values with averages, regardless of whether it&#8217;s necessary.</p><p>There are guidelines for when it&#8217;s acceptable to ignore missing data and when it&#8217;s not. Obviously, this depends on the volume of missing values. However, it also depends on their distribution, significance, variance, dependence, and the type of modeling you are performing.</p><p>Today, I want to recap the most common and, in my opinion, underrated issue in analytics: handling missing values.</p>
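<p>As a small illustration of why imputing with averages is not a free choice, the sketch below compares dropping missing values with mean imputation. The column name <code>session_minutes</code> and its values are made up, and the example is only a toy: mean imputation keeps the mean unchanged but shrinks the variance, which matters for any model that relies on spread.</p>

```python
from statistics import mean, pvariance

# Hypothetical column with missing values recorded as None (made-up data).
session_minutes = [10, None, 12, None, 14, None, 16, None, 18, None]

observed = [v for v in session_minutes if v is not None]
missing_share = 1 - len(observed) / len(session_minutes)

# Option 1: drop the missing rows and describe only what was observed.
dropped_variance = pvariance(observed)

# Option 2: impute every missing value with the observed mean.
fill_value = mean(observed)
imputed = [fill_value if v is None else v for v in session_minutes]
imputed_variance = pvariance(imputed)

print(f"missing share: {missing_share:.0%}")
print(f"mean (observed vs imputed): {mean(observed)} vs {mean(imputed)}")
print(f"variance after dropping: {dropped_variance}")
print(f"variance after mean imputation: {imputed_variance}")
```

<p>The larger the missing share, the more the imputed column understates variability, which is one reason the volume and distribution of missing values should drive the decision.</p>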
      <p>
          <a href="https://dataanalysis.substack.com/p/handling-missing-data-for-ml">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Predicting LTV with ML - Issue 217]]></title><description><![CDATA[ML models for predicting LTV in freemium apps: research and case studies]]></description><link>https://dataanalysis.substack.com/p/predicting-ltv-with-ml</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/predicting-ltv-with-ml</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 14 Aug 2024 12:02:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ddee98-3893-4500-b800-c1307f90bacb_1096x998.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Thank God for LinkedIn and <a href="https://www.linkedin.com/in/paullevchuk/">Paul Levchuk</a>, who enriched us this summer with <a href="https://www.linkedin.com/posts/paullevchuk_ltv-activity-7221845053260894208-Ypgb/">his research</a> on numerous academic studies on LTV. Now, it&#8217;s my turn to pick up the baton.</p><p>During my blogging tenure, I&#8217;ve tried to stay away from LTV in my newsletter because:</p><ol><li><p>LTV is the most over-discussed and over-studied KPI. It feels like <em>everyone</em> is writing about it.</p></li><li><p>There are 100 ways to forecast LTV, 90 of which are very complex. 
There isn&#8217;t one perfect and universally appropriate method to calculate LTV; each has a lot of nuance.</p></li><li><p>You&#8217;re not an analyst until you deliver an LTV prediction. Sooner or later, everyone does it. It&#8217;s like graduating into a dark world of under-budgeting and overspending. Fortunately, marketing analytics is not my passion.</p></li><li><p>Most importantly, you don&#8217;t need LTV to build a great product and bring it to market. Investors need LTV. Like I said, it&#8217;s a dark world of under-budgeting and overspending.</p></li></ol><p>So today, instead of offering my own method for forecasting LTV (and I have 4 main models running, each returning completely different values for the same dataset &#128579;), I decided to share 3 case studies on using ML to predict LTV for freemium apps, inspired by and borrowed from Paul Levchuk&#8217;s research.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qAPY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qAPY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!qAPY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!qAPY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qAPY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qAPY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png" width="170" height="170" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:170,&quot;bytes&quot;:2197,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qAPY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!qAPY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!qAPY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qAPY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0a56545-bca2-425d-acb8-8c12ea86d2a3_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>What I appreciate about Paul is that he&#8217;s one of the rare analysts today with strong attention to detail, critical thinking, and thorough due diligence. In other words, you can&#8217;t bullshit him with uncontrolled A/B testing, sloppy research, or flawed logic. I enjoy watching him call out inaccuracies in studies, incomplete analyses, or shallow comprehension, whether it&#8217;s a junior analyst or an operating partner at OpenView with the largest PLG newsletter.</p>
      <p>
          <a href="https://dataanalysis.substack.com/p/predicting-ltv-with-ml">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Outliers: To Drop or Not to Drop? - Issue 196]]></title><description><![CDATA[From Analytics to ML: How to detect outliers and what to do with them.]]></description><link>https://dataanalysis.substack.com/p/outliers-to-drop-or-not-to-drop</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/outliers-to-drop-or-not-to-drop</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 10 Apr 2024 12:00:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9fdfc3b-dd7a-435c-84c9-4e7e97df3fdc_1028x1054.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is a common misconception that outliers are bad: they skew the distribution, so we should detect and remove them early to proceed with modeling or analysis.</p><p>Here is what data scientists typically do when working on ML:&nbsp;</p><ol><li><p>Check null values. If they are sparse, remove them. If too many values are missing, find a way to fill them in.</p></li><li><p>Create a distribution of values. Locate outliers. Remove outliers.</p></li><li><p>Convert categorical values into numerical ones for modeling.</p></li><li><p>Group values into features. The more user attributes the dataset has, the better the model performs.</p></li><li><p>The dataset is now clean - there are no outliers, null values, or non-numerical data - and ready for modeling.</p></li></ol><p>Each of the steps above may be flawed. Some are easier to troubleshoot and improve, while others are more complex and require more context.&nbsp;</p><p>Today, I will focus on outliers.</p><p>Outliers are not necessarily bad and do not always have to be removed. 
It depends on the use case:&nbsp;</p><ul><li><p>Certain ML models handle outliers quite well, while others will degrade in performance.&nbsp;</p></li><li><p>While some KPIs and metrics, like DAU, ARR, or Churn, remain unaffected by outliers, others, such as Time-to-Value, Transactions Per User, or Average Actions, can become misleading.</p></li></ul><p>Below, I will discuss the different types of outliers, show how to detect them, and explain how to decide when to remove, keep, or adjust them. I will also cover why outliers are harmful in some cases, while in others you have to keep them in your dataset to make your analysis or model more accurate.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Kob!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Kob!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!0Kob!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!0Kob!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!0Kob!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0Kob!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png" width="148" height="148" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2f66185-606d-4ce2-9b4a-46833875d453_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:148,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0Kob!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!0Kob!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!0Kob!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!0Kob!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f66185-606d-4ce2-9b4a-46833875d453_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h2>Techniques to detect outliers</h2>
      <p>
          <a href="https://dataanalysis.substack.com/p/outliers-to-drop-or-not-to-drop">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Data Portfolio Done In Notebook - Issue 191]]></title><description><![CDATA[How to use notebooks as your portfolio to showcase your skills, and why notebooks are great for analytics.]]></description><link>https://dataanalysis.substack.com/p/data-portfolio-done-in-notebook</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/data-portfolio-done-in-notebook</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 06 Mar 2024 13:01:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32efb2-8a2b-485f-9fd5-885102128250_1240x938.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p>
      <p>
          <a href="https://dataanalysis.substack.com/p/data-portfolio-done-in-notebook">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How To Measure Data Quality - Issue 185]]></title><description><![CDATA[Does data quality have ROI? Ways to measure and quantify data governance metrics]]></description><link>https://dataanalysis.substack.com/p/how-to-measure-data-quality-issue</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/how-to-measure-data-quality-issue</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 07 Feb 2024 13:00:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Typically, I avoid data engineering and data governance topics in my newsletter. While the success of analytics is directly linked to data governance initiatives, measuring data quality often falls outside the primary responsibilities of data scientists. Also, the subject of data quality is well-covered today, and there is no need for yet another <em>how-to-improve-data-quality</em> publication.</p><p>However, a few weeks ago, during the <a href="https://go.thoughtspot.com/webinar-data-and-analytics-trends-2024.html">ThoughtSpot webinar on data trends and data quality</a>, a question arose about <strong>whether there are common criteria for defining poor-quality data vs. 
good-quality data</strong>. To my surprise, <a href="https://www.linkedin.com/in/sonnyrivera/">Sonny Rivera</a>'s response was that (a) there is no industry standard to define criteria for data quality, and (b) it has to be &#8220;good enough.&#8221;</p><p>I respectfully disagree with (a), and the only context where I&#8217;d settle for &#8220;good enough&#8221; is when I am preparing my homemade fajitas.&nbsp;</p><p>That being said, I haven&#8217;t seen actual KPIs related to data quality. So, I embarked on a quest to find <strong>data quality metrics and KPIs I could use to scale the data governance initiatives and measure data quality.</strong> Below, after long hours of research, I share what industry leaders have to offer on this subject and my consolidated list of the top metrics you can use to measure data governance ROI and the state of data quality.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v3ec!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v3ec!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!v3ec!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!v3ec!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v3ec!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v3ec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png" width="200" height="200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v3ec!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!v3ec!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!v3ec!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v3ec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c9efaf7-2928-4dbd-a7e7-4b26fa889d52_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>What data quality experts offer</h1><p>First, I checked <a href="https://www.selectstar.com/">(*)SELECT STAR</a> blog, the company that is budgeted to solve your data challenges. With the over-promising title of <a href="https://www.selectstar.com/resources/how-to-build-a-modern-data-governance-framework">How to Build a Modern Data Governance Framework</a>, here is what they say:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Drey!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Drey!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png 424w, https://substackcdn.com/image/fetch/$s_!Drey!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png 848w, https://substackcdn.com/image/fetch/$s_!Drey!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Drey!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Drey!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png" width="1456" height="915" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:915,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Drey!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png 424w, https://substackcdn.com/image/fetch/$s_!Drey!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png 848w, https://substackcdn.com/image/fetch/$s_!Drey!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Drey!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46e2f79-6772-4ea5-bd0e-18f856972613_1600x1005.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This is the best measurement of data quality the company, which is solving data governance for $900/month, came up with through their 74 blog articles and videos. I found it hard to believe that a company of that scale and funding couldn&#8217;t cover the basics of data governance. 
So, I took the time to watch their videos, but most of them weren&#8217;t helpful.</p><p>There was one exception, though. During a live <a href="https://www.youtube.com/watch?v=1qjp3v9N0-w&amp;t=99s">YouTube stream</a> with SELECT STAR CEO <a href="https://www.linkedin.com/in/shinjikim/">Shinji Kim</a>, a question arose about measuring the impact of data governance. The host, <a href="https://www.linkedin.com/in/georgefirican/">George Firican</a>, brought up several metrics, including:&nbsp;</p><ul><li><p>time spent searching for data&nbsp;</p></li><li><p>time spent handling data quality&nbsp;</p></li><li><p>cost-per-decision</p></li><li><p>the impact of data governance on some business-related metrics&nbsp;</p></li></ul><p>Shinji mentioned the <em>time to get compliant</em>, <em>the number of deprecated views and reports</em>, <em>the number of tickets the data team received</em>, <em>the number of solved tickets</em>, and <em>how many times systems go down</em> because of data changes. However, these metrics primarily measure data team performance rather than the impact of data quality.</p><p>Like many of you, when I think of data quality, I immediately think of <a href="https://www.linkedin.com/in/chad-sanderson/">Chad Sanderson</a>. He developed the concept of data as a product, introduced data contracts, founded a data quality camp community, and eventually co-founded a company solving data quality issues. For years, Chad has been publishing about and advocating for the importance of data quality. 
I&#8217;ve reviewed <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Data Products&quot;,&quot;id&quot;:887230,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/dataproducts&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/7dfc8bc7-6378-4adb-8342-a4f2f7d8b045_512x512.png&quot;,&quot;uuid&quot;:&quot;d4387250-7e65-48c1-8c2d-c2bfb194bcf0&quot;}" data-component-name="MentionToDOM">Data Products</span>, read his guides, watched several of his talks, and am fascinated by the legacy he has created and his detailed approach to understanding and scaling data quality. Yet, I couldn&#8217;t find a consolidated high-level list of data quality metrics. I must be a bad detective.</p><p><a href="https://www.linkedin.com/in/ergestx/">Ergest Xheblati</a> frequently discusses metrics, so I also went through his posts in <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Data Patterns&quot;,&quot;id&quot;:20473,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/ergestx&quot;,&quot;photo_url&quot;:null,&quot;uuid&quot;:&quot;9f96a9ea-c5e0-4589-b0ac-73b5923422e1&quot;}" data-component-name="MentionToDOM">Data Patterns</span>. I picked up a lot of good insights and bookmarked a bunch. However, none of them focused on data quality metrics. Ergest often mentions <a href="https://github.com/Levers-Labs/SOMA-B2B-SaaS/tree/main">SOMA</a> in his publications (I encourage you to check this list of <a href="https://github.com/Levers-Labs/SOMA-B2B-SaaS/blob/main/definitions/metrics/csv_to_json.py">metrics definitions</a>). 
While it&#8217;s really awesome, it primarily consists of business metrics and KPIs rather than indicators for measuring data quality.</p><p>The blog of <a href="https://www.fivetran.com/">Fivetran</a>, the &#8220;one platform for all your data movement,&#8221; pointed me to articles on AI and database replication, which, unfortunately, were not particularly relevant. <a href="https://www.fivetran.com/blog/master-the-mds-balancing-act-data-governance-vs-self-serve">Data Governance vs. Self-serve</a> was an entertaining read, though.</p><p>Perhaps Sonny was right, and there is no industry standard to define criteria for data quality.&nbsp;</p><p><a href="https://www.linkedin.com/in/kevinzenghu/">Kevin Hu</a>, CEO at <a href="https://www.metaplane.dev/">Metaplane</a>, the data observability platform, offers a guide - <a href="https://www.metaplane.dev/blog/data-quality-metrics-for-data-warehouses">Data Quality Metrics for Data Warehouses (or: KPIs for KPIs)</a>:&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LwkJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LwkJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png 424w, https://substackcdn.com/image/fetch/$s_!LwkJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png 848w, 
https://substackcdn.com/image/fetch/$s_!LwkJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png 1272w, https://substackcdn.com/image/fetch/$s_!LwkJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LwkJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png" width="1200" height="853" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LwkJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png 424w, https://substackcdn.com/image/fetch/$s_!LwkJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png 848w, 
https://substackcdn.com/image/fetch/$s_!LwkJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png 1272w, https://substackcdn.com/image/fetch/$s_!LwkJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F806c9537-c44a-44b3-a853-58c3bb9acb3b_1200x853.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>He goes a step further by offering extrinsic vs. 
intrinsic data quality measurements, which, to me, seemed more complicated than necessary. The article discusses data quality dimensions and their importance, yet it still lacks specific metrics I could use.&nbsp;</p><p><a href="https://www.montecarlodata.com/">Monte Carlo</a>&#8217;s guide, <a href="https://www.montecarlodata.com/blog-data-quality-metrics/">12 Data Quality Metrics That ACTUALLY Matter</a>, presents metrics such as <em>number of data incidents</em>, <em>time-to-detection</em>, <em>time-to-resolution, table uptime, importance score, table health, table coverage, custom monitors created, number of unused tables and dashboards, deteriorating queries, and status update rate.</em></p><p><a href="https://seattledataguy.substack.com/p/how-and-why-we-need-to-implement">How And Why We Need To Implement Data Quality Now!</a> by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;SeattleDataGuy&quot;,&quot;id&quot;:4963622,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/1ec905aa-9a7b-4f21-b0ff-fec92e8916d1_512x512.jpeg&quot;,&quot;uuid&quot;:&quot;116857da-e28a-4185-aaa7-57ed8efd1f34&quot;}" data-component-name="MentionToDOM">SeattleDataGuy</span> serves as a good introduction to data governance. 
He breaks down data quality into 6 measurable pillars, offering examples and use cases for each:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ojed!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ojed!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png 424w, https://substackcdn.com/image/fetch/$s_!Ojed!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png 848w, https://substackcdn.com/image/fetch/$s_!Ojed!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png 1272w, https://substackcdn.com/image/fetch/$s_!Ojed!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ojed!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png" width="321" height="322.3246217331499" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:727,&quot;resizeWidth&quot;:321,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ojed!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png 424w, https://substackcdn.com/image/fetch/$s_!Ojed!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png 848w, https://substackcdn.com/image/fetch/$s_!Ojed!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png 1272w, https://substackcdn.com/image/fetch/$s_!Ojed!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3e5b732-2858-472a-a443-dacd2b386fa8_727x730.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>I am going to adopt his approach in creating my list of top metrics for measuring data governance.</p><h1>How To Measure Data Quality</h1><p>I noticed that data observability platforms tend to over-complicate the whole concept of data governance by relying on <a href="https://www.montecarlodata.com/blog-the-right-way-to-measure-roi-on-data-quality/">confusing formulas</a>, introducing weird <a href="https://www.montecarlodata.com/blog-data-roi-pyramid">data ROI Pyramids</a>, or developing <a href="https://www.montecarlodata.com/blog-data-quality-maturity-curve/">data Maturity Curves</a>.</p><p>As entertaining as that might be, I want to keep the measurement framework simple and clear, so I decided to condense the multiple data quality pillars into only 3 sections with the following KPIs:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!K4aP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K4aP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png 424w, https://substackcdn.com/image/fetch/$s_!K4aP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png 848w, https://substackcdn.com/image/fetch/$s_!K4aP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png 1272w, https://substackcdn.com/image/fetch/$s_!K4aP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K4aP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png" width="1356" height="730" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1356,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K4aP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png 424w, https://substackcdn.com/image/fetch/$s_!K4aP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png 848w, https://substackcdn.com/image/fetch/$s_!K4aP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png 1272w, https://substackcdn.com/image/fetch/$s_!K4aP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02eb30e6-6763-4a27-81d9-63df9b34d01f_1356x730.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2>Data Accuracy and Integrity</h2><p>Accuracy ensures that your data is error-free, precise, and has the right structure, form, size, or range. Measuring it involves checking for errors, duplications, and anomalies.</p><h3>Metrics that can be used to measure accuracy and integrity:</h3><ol><li><p>Number of identified data discrepancies and inconsistencies.</p></li><li><p>% of records containing anomalies.</p></li><li><p>% of events with missing attributes or properties.</p></li><li><p>Number of missing rows or columns in a dataset.</p></li><li><p>% of duplicated records.</p></li><li><p>Daily AVG of missing values.</p></li><li><p>Volume of data/events ready for consumption.</p></li><li><p>Volume of data/events used by consumer teams.</p></li><li><p>% of incorrectly inserted data.</p></li><li><p>% of data that was delayed or failed to load.</p></li></ol><h2>Data Consistency and Completeness</h2><p>You measure completeness by comparing data against trusted sources. Consistency represents the uniformity in data formats, labels, and definitions across many data sources. 
Your DAU in Mixpanel should have the same value as your DAU in Postgres, and Signups in GA should match Signups in Snowflake.</p><h3>Metrics that can be used to measure consistency and completeness:</h3><ol><li><p>% cross-platform unique user variance.</p></li><li><p>Total number of transactions (or events) loaded.</p></li><li><p>Total number of transactions (or events) processed.</p></li><li><p>Number of tests implemented / Airflow tasks / dbt tests.</p></li><li><p>Number of audits completed or alerts created.</p></li><li><p>Number of data checks monitored.</p></li><li><p>Number of successfully executed jobs in the pipeline.</p></li><li><p>Number of business KPIs affected by data anomalies.</p></li><li><p>Number of dashboards affected by data issues.</p></li><li><p>% of user IDs obfuscated.</p></li></ol><h2>Data Timeliness and Freshness</h2><p>Timeliness is about data freshness, meaning having data when you need it. Ensure that your data is not only accurate but also up to date, so that insights are relevant and timely.</p><h3>Metrics that can be used to measure timeliness:</h3><ol><li><p>AVG time (min/hours) for data collection.</p></li><li><p>AVG time for data processing.</p></li><li><p>Frequency of data refreshes.</p></li><li><p>AVG time for error detection.</p></li><li><p>AVG time for error resolution.</p></li><li><p>AVG query execution run time.</p></li></ol><div><hr></div><p>I am not a data engineer (thank god), so this is likely an incomplete list that misses some important measurements. But at least it&#8217;s something I can now pass to the Head of Data to measure and scale the team&#8217;s work.</p><p>I am also frustrated that the data experts who shape and influence modern data stacks today publish books, start companies, grow communities, and have enormous reach, yet they don&#8217;t have these basics and essentials ready and available to borrow. (Most likely, they do, and it&#8217;s an issue of discoverability. 
That is partly why we keep repeating each other over and over again.)</p><p>Thanks for reading, everyone. Until next Wednesday!</p>]]></content:encoded></item><item><title><![CDATA[When To Use Mean Or Median - Issue 171]]></title><description><![CDATA[Working with descriptive statistics: how to know when to use Mean or Median in your reporting]]></description><link>https://dataanalysis.substack.com/p/when-to-use-mean-or-median-issue</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/when-to-use-mean-or-median-issue</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 08 Nov 2023 13:00:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffa339e7-5da3-4406-bab9-43fc08423337_986x902.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today&#8217;s publication is a continuation of my Statistics series:</p><ol><li><p><a href="https://dataanalysis.substack.com/p/what-statistics-are-used-in-data">What Statistics Are Used In Data Analysis?</a> - An introduction to the top four statistical concepts we apply and rely on in analytics.</p></li><li><p><a href="https://dataanalysis.substack.com/p/when-simple-becomes-tricky-passing">Statistics 101: When Simple Becomes Tricky</a> - Descriptive statistics basics (and a reminder of how easily simple mistakes can slip in).</p></li><li><p><a href="https://dataanalysis.substack.com/p/applying-statistics-in-product-analytics">Applying Statistics In Product Analytics</a> - A deep dive into distributions, their types, and use cases.</p></li></ol><p>Now that you know that most of the analytical world is built on distributions, I will walk you through the cases and examples where it&#8217;s acceptable to use the Mean vs. the Median and where it&#8217;s okay to use both.</p><div class="captioned-image-container"><figure><a 
class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LoOM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LoOM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!LoOM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!LoOM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!LoOM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LoOM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png" width="200" height="200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LoOM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!LoOM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!LoOM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!LoOM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f904c3-5324-4140-9cb8-00c35f2026ea_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h2><strong>Central Tendency Theory recap&nbsp;</strong></h2><p><em>If you skipped statistics back in school</em></p><p>There is a lot written out there on the Central Tendency. 
Here is what you need to know when working with descriptive statistics:</p><ul><li><p><a href="https://www.statisticshowto.com/central-tendency/">Central Tendency</a> is a theory about <strong>the center of the dataset</strong>. Why is it important? Because once you locate the center of the dataset, you can describe it, graph it, and model it. If your center is off or incorrectly defined (e.g., using a flawed Mean), your whole distribution will be off, leading to wrong outputs that affect your analysis, model, A/B test, etc. In statistics, the truth (and confidence) starts with the center.</p></li><li><p><strong>Mean, Mode, and Median</strong> are the top 3 measures of Central Tendency.</p></li><li><p>There are more Central Tendency measures (e.g., weighted average, geometric average, midrange, trimean, root mean square, simplicial depth, etc.). For analytics, most of your work will be around the top 3: Mean, Mode, and Median.</p></li><li><p>You would use the top 3 to (a) describe the dataset (hence the name &#8220;descriptive&#8221; statistics) and (b) locate the center of the dataset.</p></li><li><p>Once you know the center of the dataset, you can work out its range, variance, and spread.</p></li></ul><h2><strong>Median vs Mean</strong></h2><p>Choosing Mean vs. 
Median mostly depends on two factors:&nbsp;</p><ol><li><p>The data type you work with and&nbsp;</p></li><li><p>The data distribution.</p></li></ol><h4><strong>When you can use both Mean and Median:</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e55L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e55L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png 424w, https://substackcdn.com/image/fetch/$s_!e55L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png 848w, https://substackcdn.com/image/fetch/$s_!e55L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png 1272w, https://substackcdn.com/image/fetch/$s_!e55L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e55L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png" width="422" height="316.0548523206751" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:710,&quot;width&quot;:948,&quot;resizeWidth&quot;:422,&quot;bytes&quot;:110480,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e55L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png 424w, https://substackcdn.com/image/fetch/$s_!e55L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png 848w, https://substackcdn.com/image/fetch/$s_!e55L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png 1272w, https://substackcdn.com/image/fetch/$s_!e55L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b60c8a-867c-4d64-8e4b-352d5e345a32_948x710.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://www.scribbr.com/statistics/central-tendency/">Scribbr: Central Tendency | Understanding the Mean, Median &amp; Mode</a></figcaption></figure></div>
<ul><li><p>When the data is normally distributed (or close to a normal distribution), so the values are spread evenly around a central value.</p></li><li><p>When the data is symmetrically distributed.</p></li><li><p>When the data is continuous or discrete:</p><ul><li><p>Discrete: (1, 2, 3, 4&#8230;)</p></li><li><p>Continuous: (1.1, 2.45, 3.543&#8230;), e.g., 26.5%, 30&#176;C, 65&#176;F, 12.3 miles, 2.5 hours, 10/3/2023&#8230;</p></li></ul></li></ul><p>As you already know, the Mean is the most commonly used summary statistic. It is essentially a mini model of your dataset.</p><h4><strong>When not to use Mean:</strong></h4>
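<p>A minimal sketch of the usual reason to be careful with the Mean, using made-up numbers (the <code>session_minutes</code> values are hypothetical and not taken from any real dataset): a single extreme value drags the Mean far away from where most observations sit, while the Median barely moves.</p>

```python
from statistics import mean, median

# Hypothetical session lengths in minutes; the last value is a deliberate outlier.
session_minutes = [4, 5, 5, 6, 7, 8, 120]

# Without the outlier, the data is roughly symmetric and both measures agree.
typical = session_minutes[:-1]
print("no outlier  -> mean:", round(mean(typical), 2), "median:", median(typical))

# With the outlier, the mean is pulled toward 120 while the median stays put.
print("with outlier -> mean:", round(mean(session_minutes), 2),
      "median:", median(session_minutes))
```

<p>Here the Mean moves from under 6 minutes to over 22 once the outlier is included, while the Median only shifts from 5.5 to 6. A large gap between the two is a quick hint that the data is skewed.</p>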
      <p>
          <a href="https://dataanalysis.substack.com/p/when-to-use-mean-or-median-issue">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Applying Statistics In Product Analytics - Issue 165]]></title><description><![CDATA[Distributions 101: Introduction to statistical distributions, their types, use cases, and examples]]></description><link>https://dataanalysis.substack.com/p/applying-statistics-in-product-analytics</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/applying-statistics-in-product-analytics</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 04 Oct 2023 12:00:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6737e7-6384-4e15-b782-4dd1d875d905_1000x661.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly data science and analytics newsletter.</em></p><div><hr></div><p>Earlier this year, I published <a href="https://dataanalysis.substack.com/p/applying-ml-in-product-analytics">Applying ML in Product Analytics</a>, where I covered the types of ML models we apply to predict user behavior, along with their use cases.</p><p>Today, I want to zoom out from ML and talk about the statistical methods we use to <em>describe</em> user behavior. Obviously, you first have to <em>describe</em> and <em>learn</em> events before you can <em>infer</em> and <em>predict</em> them.</p><p>Applying core statistics in product analytics mostly comes down to understanding:</p><ol><li><p>Causal 
inference&nbsp;</p></li><li><p>Distributions&nbsp;</p></li><li><p>Significance and Confidence</p></li><li><p>Correlations and Regressions</p></li><li><p>Law of Large Numbers and Central Limit Theorem&nbsp;</p></li></ol><blockquote></blockquote><p>This week&#8217;s publication is a continuation of my earlier <a href="https://dataanalysis.substack.com/p/what-statistics-are-used-in-data">What Statistics Are Used In Data Analysis?</a> and is also a deep dive into <strong>distributions</strong> aimed at analysts, data scientists, and product owners working with data and ML products.</p><p>Below, I&#8217;ll share a guide to help you decide which distribution to apply for your use case, analysis, or forecast.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EUzJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EUzJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!EUzJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!EUzJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EUzJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EUzJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png" width="156" height="156" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:156,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EUzJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!EUzJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!EUzJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EUzJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa401e8ed-aac6-4b44-992a-3a50aff17c69_200x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>I have a secret fascination with distributions. They circulate all around us in decision-making, analysis, coincidences, probabilities, chances, estimations, forecasts, and more. I believe that reading distributions is a way to understand such phenomena as luck, success, failure, or karma.</p><p>Distributions are so underrated in analytics, and it feels like many schools are on a mission to present them as tedious, complex, and unnecessary. Most students don&#8217;t quite understand why they need to learn distributions or what to do with them. While researching this publication, I went through dozens of publications, videos, and tutorials on distributions to find any guidance on &#8220;applied&#8221; use cases and examples. There is a lot of information about what a distribution is and how to understand it, but not much on <em>when and why</em> you would apply each distribution type.</p><p>Below is my simplified and practical guide, condensed to the essential must-know concepts, so you can understand the types of distributions and when to apply each one in your analysis or inference.</p><h2>When to use distributions</h2>
      <p>
          <a href="https://dataanalysis.substack.com/p/applying-statistics-in-product-analytics">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Statistics 101: When Simple Becomes Tricky - Issue 157]]></title><description><![CDATA[How not to get trapped in descriptive statistics during data science interviews]]></description><link>https://dataanalysis.substack.com/p/when-simple-becomes-tricky-passing</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/when-simple-becomes-tricky-passing</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 16 Aug 2023 12:00:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!LXaY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaccc4e1-483d-4d24-b473-8771cbbe5c1c_1112x738.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Did you know that to get the median value of an array, you must first sort the array?&nbsp;</p><p>Now you do!&nbsp;</p><p>You can&#8217;t calculate the median of an unsorted dataset (but you <em>can </em>get the mean and the mode). 
Welcome to another &#8220;let me remind you of some basics&#8221; data science and analysis newsletter.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m9-g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m9-g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!m9-g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!m9-g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!m9-g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m9-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png" width="170" height="170" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:170,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m9-g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!m9-g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!m9-g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!m9-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ef2d6f-2017-4089-b302-4e87a4ff75be_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>We are so accustomed to built-in functions in SQL and Python. I don&#8217;t even remember what variance means anymore. I just type VAR() and don&#8217;t overthink it. 
A few weeks ago, I was asked to help replicate STDDEV() step by step, and let me tell you, I had to take a good minute to collect my thoughts.&nbsp;</p><p>Interviewers often ask you to break down common built-in functions during technical interviews, e.g. <em>imagine there is no mode() function - write code to get the mode </em>(rumor has it the exercises below were asked in data science interviews at Reddit, GitHub, Amazon, and YouTube). They are not meant to be difficult. Instead, they test your understanding of statistics and whether you can recognize which function to apply and when. </p><p>There is no excuse for failing to replicate the mean or median.&nbsp;&nbsp;</p><p>I assume everyone is familiar with MEAN(), MIN(), and MAX(). Below, I&#8217;ll focus on the basic must-know functions that you will definitely need for A/B test analysis, time series analysis, probability estimations, and predictions: variance, standard deviation, mode, median, and distribution.</p>
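<p>As a warm-up, here is a minimal Python sketch of how these functions could be replicated by hand. The function names and the sample list are made up for illustration. The variance uses the sample (n - 1) denominator by default, which is an assumption: check what VAR() or STDDEV() computes in your own SQL dialect.</p>

```python
import math
from collections import Counter


def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    if not values:
        raise ValueError("median requires at least one value")
    ordered = sorted(values)  # sort a copy; the input list is left untouched
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def variance(values, sample=True):
    """Mean squared deviation from the mean; n - 1 denominator when sample=True."""
    n = len(values)
    if n < 2:
        raise ValueError("variance requires at least two values")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


def stddev(values, sample=True):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance(values, sample))


def mode(values):
    """Most frequent value(s); a list is returned because ties are possible."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


data = [4, 1, 7, 4, 9, 2]  # hypothetical sample, for illustration only
print(median(data), variance(data), stddev(data), mode(data))
```

<p>In day-to-day work, Python&#8217;s built-in <code>statistics</code> module already covers these (<code>median</code>, <code>variance</code>, <code>stdev</code>, <code>multimode</code>). The point of the sketch is only to show what happens under the hood.</p>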
      <p>
          <a href="https://dataanalysis.substack.com/p/when-simple-becomes-tricky-passing">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[An Introduction To Feature Engineering - Issue 152]]></title><description><![CDATA[Elevate your analysis and improve modeling with feature engineering.]]></description><link>https://dataanalysis.substack.com/p/an-introduction-to-feature-engineering</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/an-introduction-to-feature-engineering</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 12 Jul 2023 12:01:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!iRVR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61af1562-5e1b-4e56-b879-897aa7fefd3c_954x460.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p>
      <p>
          <a href="https://dataanalysis.substack.com/p/an-introduction-to-feature-engineering">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How To Prove Causation - Issue 148]]></title><description><![CDATA[Using causal inference and regression analysis for informed decision-making]]></description><link>https://dataanalysis.substack.com/p/how-to-prove-causation-issue-148</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/how-to-prove-causation-issue-148</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 14 Jun 2023 12:01:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6plT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6575b5e3-7feb-4364-a542-7c59e100ef57_1216x668.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>Today&#8217;s issue is the final publication in my regression series and probably one of the most important topics in analytics: proving causation.&nbsp;&nbsp;</p><p>Over the last two weeks, I shared an introduction to correlation and regressions, where I covered:&nbsp;</p><ul><li><p><a href="https://www.lennysnewsletter.com/p/linear-regression-and-correlation-analysis">How to do linear regression and correlation analysis</a> - what a regression analysis is, when and how to run it, how linear regression is different from correlation analysis, and use cases for each.</p></li><li><p><a href="https://dataanalysis.substack.com/p/decoding-regression-scores-issue">Decoding Regression Scores</a> - what the different types of regression are, when to run which regression method, and how to read regression scores and an equation.&nbsp;</p></li></ul><p>You already know that correlation does not prove causation. Regression analysis exists in a gray area between cause and effect. 
</p><p>I was taught back in school that <a href="https://en.wikipedia.org/wiki/Linear_regression">regression may be used to attempt to estimate causal relationships from observational data.</a> And you will find quite a few <a href="https://towardsdatascience.com/causal-inference-with-linear-regression-endogeneity-9d9492663bac#.com/causal-inference-with-linear-regression-endogeneity-9d9492663bac#:~:text=A%20linear%20regression%20model%20is,X)%2C%20shown%20as%20follows.">examples</a> shared by statisticians and researchers where linear regression is used for causal inference.</p><p>And yet, if you take an economic theory course or an MBA, you <a href="https://hbr.org/2015/11/a-refresher-on-regression-analysis">will learn</a> that <a href="https://en.wikibooks.org/wiki/Econometric_Theory/Regression_versus_Causation_and_Correlation#">regression doesn&#8217;t prove cause-effect</a>, and the relationship you see on the linear regression graph <a href="https://hbr.org/2015/11/a-refresher-on-regression-analysis">does not imply causation</a>.</p><h4>&#129300; So wait, does regression prove or not prove causation?</h4><p>I like the <a href="https://statisticalhorizons.com/prediction-vs-causation-in-regression-analysis/">take on this</a> from <a href="https://statisticalhorizons.com/our-instructors/paul-allison/">Paul Allison</a>, Ph.D., a professor of statistical methods: &#8220;regression can be used for both causal inference and prediction&#8221;, but it all comes down to &#8220;how the methodology is used&#8221;, or whether it should be used at all for a particular problem or question. </p><p>Bringing this into applied data science and analytics, no type of regression (or any other type of ML) proves causation on its own. However, it can give you clues and higher confidence that there&#8217;s a strong connection between variables, and show how this connection will change if we increase or decrease the input values. 
That being said, if it is used incorrectly and error scores are ignored, it can lead you further away from the truth by misrepresenting data and failing to return the true relationship pattern.&nbsp;</p><p>In this publication, I&#8217;ll share some examples of causation analysis and show you how to avoid being tricked by flawed regression output and how to recognize when the regression pattern you see is correct, trustworthy, and causal.</p>
      <p>
          <a href="https://dataanalysis.substack.com/p/how-to-prove-causation-issue-148">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Decoding Regression Scores - Issue 147]]></title><description><![CDATA[How to read linear regression plots and understand the equation that drives analysis and predictions]]></description><link>https://dataanalysis.substack.com/p/decoding-regression-scores-issue</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/decoding-regression-scores-issue</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 07 Jun 2023 12:00:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86e51aa9-fb2f-4281-aaf0-6613dd9c44e1_1294x930.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Last week I shared my recent publication in <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Lenny Rachitsky&quot;,&quot;id&quot;:1849774,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/afba5161-65bb-4d99-8d6b-cce660917fa1_1540x1540.png&quot;,&quot;uuid&quot;:&quot;d660701c-7b09-4a73-a82b-990f9be9d36c&quot;}" data-component-name="MentionToDOM"></span> newsletter about correlation analysis and linear regression. 
It was more of an introductory piece on regression, and there is still so much more to cover.</p><p>So today, it is <strong>Linear Regression Part 2</strong>, dedicated to my great analysts who are expected to work with ML and to understand the concept of regression well.&nbsp;</p><p>Linear regression is the most commonly used analysis for finding relationships between values and running predictions. It has been widely adopted across product, growth, business, finance, and marketing by professionals at all levels.&nbsp;</p><p>What many people might not realize is that linear regression is <strong>a machine-learning model</strong>, so you have to treat it accordingly (considering data cleaning, feature engineering, addressing <a href="https://towardsdatascience.com/overfitting-and-underfitting-principles-ea8964d9c45c">underfitting and overfitting</a>, all of that). Reading its scatterplot alone is not conclusive. Many analysts apply regression without understanding what the scores and equation actually mean or how to read them. 
Let&#8217;s fix that today.</p><p>In this publication, I focus on:</p><ul><li><p>Breaking down use cases across different regression types.</p></li><li><p>Examples of product and marketing questions and hypotheses that you can solve with regression analysis.</p></li><li><p>Explain the linear regression equation, what it means, and how to interpret the regression scores.</p></li><li><p>How to break down scatterplot slope using the regression equation.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gFMx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gFMx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!gFMx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!gFMx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!gFMx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gFMx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png" width="200" height="200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gFMx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!gFMx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!gFMx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!gFMx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03cf1556-1365-48f2-af52-2b0995c3cffb_200x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>Introduction to regressions and their use cases</h2><p>Once again, regression shows how much one variable affects another and 
whether you can use the pattern of one variable to predict and estimate the behavior of another. On its own, <strong>regression doesn&#8217;t prove causation</strong>, whatever its type. Rather, it gives us more clues and higher confidence that there&#8217;s a strong connection between variables, and it shows how this connection will change if we increase or decrease the input values.&nbsp;</p><h3><strong>Possible use cases for regression:</strong></h3><ul><li><p>How many page views do we need to at least double signups?</p></li><li><p>Is showing more recommendations positively affecting trials?</p></li><li><p>How many more games/trials/clicks do users need to complete before converting to Paid?</p></li><li><p>If we send 3x more notifications, how much will this increase DAU?</p></li></ul>
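<p>To make the notifications question above concrete, here is a minimal sketch of a simple (one-variable) least-squares fit in plain Python. The numbers are invented for illustration, and the names <code>fit_line</code>, <code>r_squared</code>, <code>notifications</code>, and <code>dau</code> are hypothetical. The fitted slope describes an association in this toy data, not proof that notifications cause DAU growth.</p>

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-movement of x and y around their means, and the spread of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variance in y that the fitted line explains."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data: notifications sent per user per week vs. daily active users.
notifications = [1, 2, 3, 4, 5]
dau = [1100, 1300, 1450, 1700, 1850]

slope, intercept = fit_line(notifications, dau)
score = r_squared(notifications, dau, slope, intercept)
print(f"DAU = {intercept:.0f} + {slope:.0f} * notifications (R^2 = {score:.3f})")
```

<p>Keep in mind that predicting DAU at 3x the highest observed notification volume would be extrapolation outside the range the line was fitted on, so treat any such number with caution.</p>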
      <p>
          <a href="https://dataanalysis.substack.com/p/decoding-regression-scores-issue">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Applying ML in Product Analytics - Issue 131]]></title><description><![CDATA[Using machine learning to solve common challenges in product and data analytics. How to figure out which model to use for which business case.]]></description><link>https://dataanalysis.substack.com/p/applying-ml-in-product-analytics</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/applying-ml-in-product-analytics</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 08 Feb 2023 13:00:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!C9_Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p>&#8220;You don&#8217;t have to know the code, but you do need to know what the code can do.&#8221; </p><p>from <a href="https://dschool.stanford.edu/resources/i-love-algorithms">Stanford DS School&nbsp;</a></p></div><p>There are over 50 types of machine learning algorithms with different levels of complexity, similarity, and application.&nbsp;</p><p>How many of these 50 models will you realistically be developing for your projects?&nbsp;</p><p>In fact, mostly two: Linear Regression and Random Forest. Joking! &#8230;<em>Or am I?</em>&nbsp;&nbsp;&nbsp;&nbsp;</p><p>With so many algorithms out there, it can be challenging to figure out which model to choose for your analysis (or if you even need to use one at all). 
This week&#8217;s publication is a continuation of <a href="https://dataanalysis.substack.com/p/what-statistics-are-used-in-data">What Statistics Are Used In Data Analysis?</a>, and is also an introduction of sorts to Machine Learning 101 (but aimed at analysts).</p><p>In today&#8217;s publication, I&#8217;ll share a guide to help you decide which ML model to pick to solve a problem, and how deep into the woods you&#8217;ll need to go with statistics to create a customer churn prediction model or forecast subscription revenue.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OoNQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OoNQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!OoNQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!OoNQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!OoNQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!OoNQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png" width="186" height="186" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:186,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OoNQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!OoNQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!OoNQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!OoNQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f72674-b311-4aab-8190-9d11aa1fa66e_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>To start with, here&#8217;s a quick recap of the main types of ML algorithms:&nbsp;</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C9_Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C9_Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png 424w, https://substackcdn.com/image/fetch/$s_!C9_Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png 848w, https://substackcdn.com/image/fetch/$s_!C9_Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png 1272w, https://substackcdn.com/image/fetch/$s_!C9_Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C9_Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png" width="1234" height="656" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:656,&quot;width&quot;:1234,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C9_Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png 424w, https://substackcdn.com/image/fetch/$s_!C9_Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png 848w, https://substackcdn.com/image/fetch/$s_!C9_Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png 1272w, https://substackcdn.com/image/fetch/$s_!C9_Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a13818c-8bf5-41c7-aeb3-23b0a9009c93_1234x656.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption"><em>From <a href="https://www.mathworks.com/discovery/reinforcement-learning.html">MathWorks</a></em></figcaption></figure></div><p><strong>Supervised learning</strong> - You know how to classify the input data and the type of behavior you want to predict, but you need the algorithm to calculate it for you on new data. The data you work with is labeled, tagged, and has all the inputs and outputs. If you are an analyst, 98% of your work will stay here.</p><p><strong>Unsupervised learning</strong> - You do not know how to classify the data, and you want the algorithm to find patterns and classify the data for you. A lot of things are happening here. Algorithms process a large volume of data and organize it into clusters (mostly) or some type of structure. The more data you pipe in, the more precise and refined the model will be. 
If you are a lucky analyst, 2% of your work will be in unsupervised learning, mostly around clusters and recommender systems.</p><p><strong>Semi-supervised learning</strong> - The line between unsupervised and supervised learning is blurry. Often you have to work with a mix of labeled and unlabeled data. The algorithms learn from the available tags and then mark the unlabeled data. This is for cases where you have to use all available data, not only the labeled part. It also covers data that is only partially labeled, where you have to make the approach work for the whole volume.</p><p><strong>Reinforcement learning</strong> - You don&#8217;t have a lot of training data, you cannot clearly define the ideal end state, or the only way to learn about the environment is to interact with it. These algorithms work as an ecosystem that is built on a &#8220;trial and error&#8221; approach. They learn from errors in already-processed data, adapt, and look for the best result over and over again.&nbsp;</p><h2><strong>Using Supervised ML for data analysis</strong></h2><p>Because it&#8217;s all about analytics here, I&#8217;ll try to list the most common models used in data analytics and lay out use cases for each.&nbsp;</p><p>There are 4 main outputs and types of analysis we are expected to work with:</p><ol><li><p>Regression&nbsp;</p></li><li><p>Classification&nbsp;</p></li><li><p>Clustering&nbsp;(unsupervised)</p></li></ol>
      <p>
          <a href="https://dataanalysis.substack.com/p/applying-ml-in-product-analytics">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[What Statistics Are Used In Data Analysis? - Issue 126]]></title><description><![CDATA[Introduction and must-know statistical concepts for data analysis and data science]]></description><link>https://dataanalysis.substack.com/p/what-statistics-are-used-in-data</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/what-statistics-are-used-in-data</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 04 Jan 2023 13:01:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!b0do!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;d like to start a new year by reminding you about the foundational concepts of any analytical role that requires data interpretation. Wearing the hat of an analyst takes more than knowing how to wrangle data. It takes the essential knowledge of statistics, business, and requires <a href="https://dataanalysis.substack.com/p/how-to-develop-critical-thinking">critical thinking</a>. 
In other words - <em>many </em>hats.</p><p>Analysts don&#8217;t work with advanced statistics (unlike data scientists) and are unlikely to deal with complex distributions like exponential, Weibull, or Beta:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b0do!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b0do!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif 424w, https://substackcdn.com/image/fetch/$s_!b0do!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif 848w, https://substackcdn.com/image/fetch/$s_!b0do!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif 1272w, https://substackcdn.com/image/fetch/$s_!b0do!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b0do!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif" width="443" height="421" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:443,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b0do!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif 424w, https://substackcdn.com/image/fetch/$s_!b0do!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif 848w, https://substackcdn.com/image/fetch/$s_!b0do!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif 1272w, https://substackcdn.com/image/fetch/$s_!b0do!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F39398dc7-4b69-4ed7-a05f-0bb5dd5ab311_443x421.gif 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><em>From <a href="https://accendoreliability.com/weibull/">Adam Bahret: What is Weibull?</a></em></figcaption></figure></div><p>And that&#8217;s the biggest downside of most statistical classes and tutorials out there. Without the context and foundation, your search for the right solution will take ages and will pull you far off your original path.</p><p>Applying core statistics to real-world problems mostly comes down to understanding causal inference, distributions, significance, correlations, and experimentation. 
In this issue, I will summarize and cover must-know statistical concepts and their use cases that data analysis couldn&#8217;t happen without.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lVbO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lVbO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!lVbO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!lVbO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!lVbO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lVbO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png" width="200" height="200" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/ed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lVbO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!lVbO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!lVbO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!lVbO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fed6f8b4d-482d-41c0-a048-8413d71f8e7d_200x200.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><h1>Statistics 101&nbsp;</h1><p>The whole purpose of statistics (statistical theory, methods, and analysis) is to provide<a href="https://www.statisticshowto.com/uncertainty-in-statistics/"> certainty out of uncertainty</a>. 
In other words, when you don&#8217;t have a high degree of trust (in either the quantity or the quality of data), how can you make sure you make the right decision?&nbsp;</p><p>There are <a href="https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/">descriptive statistics</a> and <a href="https://www.statisticshowto.com/inferential-statistics/">inferential statistics</a>. They are two very different approaches to describing and solving the same problems.&nbsp;</p><p>We apply <strong>descriptive statistics</strong> in big data analytics when working with sums, medians, skewed data, variance, and deviations. It&#8217;s called &#8220;descriptive&#8221; because it&#8217;s meant to <em>describe</em> the data. It&#8217;s used to understand data trends. Every time you have to plot a distribution, you are working with descriptive stats.</p><p><strong>Inferential statistics</strong> are used to make predictions or <em>infer</em> data trends and patterns. In a nutshell, they let you take a sample of data and generalize from it to a larger population, or run hypothesis tests. Every time you work with A/B tests, modeling, or any type of hypothesis or forecast, you are dealing with inferential stats.</p><h2>Must-know statistical concepts for data analysis</h2><p>Here are statistical concepts from both descriptive and inferential statistics that you need to know and be able to use comfortably to be proficient at analytics:&nbsp;</p><ol><li><p>Estimations and significance</p></li></ol>
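<p>As a rough illustration of the two branches, here is a minimal Python sketch that uses only the standard library. The sample values are made up, and the interval relies on a normal approximation (z = 1.96), an assumption that is only approximate for a sample this small:</p><pre><code class="language-python">import math
import statistics

# Hypothetical sample: daily session counts for 10 users (made-up numbers).
sample = [3, 5, 4, 6, 2, 5, 7, 4, 5, 4]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1)

# Inferential statistics: use the sample to estimate the population mean.
# Rough 95% interval via the normal approximation (z = 1.96);
# with only 10 observations a t-based interval would be wider.
std_error = stdev / math.sqrt(len(sample))
ci_low = mean - 1.96 * std_error
ci_high = mean + 1.96 * std_error

print(f"mean={mean:.2f}, median={median}, stdev={stdev:.2f}")
print(f"approx. 95% CI for the population mean: {ci_low:.2f} to {ci_high:.2f}")
</code></pre><p>The first three numbers only describe the ten observations in hand. The interval is a statement about the wider population they were drawn from, which is what makes it inferential.</p>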
      <p>
          <a href="https://dataanalysis.substack.com/p/what-statistics-are-used-in-data">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Behold! How You Can Handle Missing Data - Issue 108]]></title><description><![CDATA[A recap on the proper ways to handle the missing data in your modeling or analysis]]></description><link>https://dataanalysis.substack.com/p/behold-how-you-can-handle-missing</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/behold-how-you-can-handle-missing</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 24 Aug 2022 16:30:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/h_600,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to my <a href="https://dataanalysis.substack.com/">Data Analytics Journal</a> newsletter, where I write about data analysis, data science, and business intelligence.</p><p>If you&#8217;re not a paid subscriber, here&#8217;s what you missed this month:</p><ul><li><p><a href="https://dataanalysis.substack.com/p/how-we-optimized-the-onboarding-funnel">How we optimized the onboarding funnel by 220%</a> - a recap and deep dive into a complete redesign of a user onboarding flow with learnings, data benchmarks, and recommendations.&nbsp;</p></li><li><p><a href="https://dataanalysis.substack.com/p/how-to-get-paid-subscriptions-in">How To Get Paid Subscriptions In SQL</a> - advanced SQL for extracting and computing missing data to report on premium and growth KPIs. 
If you work in SaaS or B2B and deal with incomplete transactions/subscription data, this approach will save you many hours.&nbsp;&nbsp;</p></li><li><p><a href="https://dataanalysis.substack.com/p/engagement-and-retention-part-4-how">Engagement and Retention, Part 4: How To Visualize and Read Cohorted Retention</a> - a continuation of the User Engagement and Retention series with a focus on cohorted retention reporting. This time I offer multiple SQL solutions for calculating retention for different business types, visualizing it, and reading different retention charts.&nbsp;</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-cZD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-cZD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!-cZD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!-cZD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-cZD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-cZD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png" width="200" height="200" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-cZD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!-cZD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!-cZD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-cZD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5afeaf2c-73fa-45a6-bed4-ee2fea27cffb_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>Today I want to do a recap on the most common and, I believe, underrated issue of <strong>handling missing values</strong> in data analytics and data science.&nbsp;</p><p>If your dataset is sparse or incomplete, you already know the 2 main ways to deal with missing values: you either ignore them or impute averages in place of N/As and proceed as if nothing bad had ever happened. You may also use more powerful methods, e.g., forecasting the missing values or applying ML algorithms. In this issue, I want to offer a more structured approach that will help you recognize when it&#8217;s okay to ignore N/As in your analysis and when you should do more thorough data cleaning.</p><p><em>*note: this all assumes that you can&#8217;t obtain the missing data from other sources. If you can, you should do that first and skip this article.&nbsp;</em></p><p>Your approach to handling missing values depends on the following questions:</p><ol><li><p>What are you working on? (e.g. multivariate analysis, regression, ML)&nbsp;</p></li><li><p>How much data is missing? (e.g. only a few values, critical values, most values)</p></li><li><p>What is the pattern of missing data? 
(random, partially random, in pairs, etc.)</p></li></ol><h3><strong>Ignore or drop missing values:</strong></h3><ul><li><p>If your data is missing at random, you can remove NULLs.&nbsp;</p></li><li><p>If your analysis is multivariate (the data contains more than two variables) and a large number of values are missing, it might be better to drop them rather than impute. If variance is not a factor, either approach works.&nbsp;</p></li></ul><p>Deletion can be done listwise (any row containing a missing value is deleted): </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SbNY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SbNY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png 424w, https://substackcdn.com/image/fetch/$s_!SbNY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png 848w, https://substackcdn.com/image/fetch/$s_!SbNY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SbNY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SbNY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png" width="1456" height="428" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:428,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156527,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SbNY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png 424w, https://substackcdn.com/image/fetch/$s_!SbNY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png 848w, https://substackcdn.com/image/fetch/$s_!SbNY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SbNY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F820891a5-1442-4371-aa9f-57a632bf45c5_1740x512.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Or pairwise (only the missing observations are deleted):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!hP2L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hP2L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png 424w, https://substackcdn.com/image/fetch/$s_!hP2L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png 848w, https://substackcdn.com/image/fetch/$s_!hP2L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png 1272w, https://substackcdn.com/image/fetch/$s_!hP2L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hP2L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png" width="1456" height="421" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/ab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154857,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hP2L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png 424w, https://substackcdn.com/image/fetch/$s_!hP2L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png 848w, https://substackcdn.com/image/fetch/$s_!hP2L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png 1272w, https://substackcdn.com/image/fetch/$s_!hP2L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab2dccb6-7be2-41e9-a5e9-7297605193cc_1744x504.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>From <a href="https://clevertap.com/blog/how-to-treat-missing-values-in-your-data-part-i/">How to treat missing values in your data</a></em></p><p><em>As a rule of thumb, to check whether values are missing at random, you can run a t-test on two data partitions, one with missing values and one without, and compare the two means. If there&#8217;s no significant difference, you can ignore the NULLs.</em></p><p>If missing values are not random, you are dealing with a more complicated case. Your data might still be missing at random within one or more sub-samples. 
You can also perform regression or nearest neighbor imputation on the column to predict the missing values (I&#8217;ll cover more on this in one of my next issues).</p><h3><strong>Can&#8217;t drop missing values, have to impute:&nbsp;</strong></h3><ul><li><p>If your analysis is not multivariate, and the values are missing at random, imputation is a good choice, as it will decrease the amount of bias in the data.&nbsp;</p></li><li><p>If the values are not missing at random, you need to do data imputation.&nbsp;</p></li></ul><p>Check this guide to set up the approach for handling missing values in your analysis:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D6Tg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D6Tg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png 424w, https://substackcdn.com/image/fetch/$s_!D6Tg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png 848w, https://substackcdn.com/image/fetch/$s_!D6Tg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png 1272w, 
https://substackcdn.com/image/fetch/$s_!D6Tg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D6Tg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png" width="1360" height="1082" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1082,&quot;width&quot;:1360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D6Tg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png 424w, https://substackcdn.com/image/fetch/$s_!D6Tg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png 848w, https://substackcdn.com/image/fetch/$s_!D6Tg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png 1272w, 
https://substackcdn.com/image/fetch/$s_!D6Tg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a9aa80-343a-4df4-af9e-4d53ff280b59_1360x1082.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Handling missing values in Python&nbsp;</strong></h3><p><em>(pandas, dataframe as df)</em></p><p>To count missing values per column, sorted from most to fewest:</p><p><code>df.isnull().sum().sort_values(ascending=False)</code></p><p>If you have to fill in missing values:&nbsp;</p><p><code>df['column'] = df['column'].fillna(1)</code> where 1 is the value you want to impute. Pass a number rather than the string <code>'1'</code> so that a numeric column keeps its numeric type.</p><h3><strong>Handling missing values in SQL</strong></h3><p>In SQL, you can leverage DateTime functions, window functions, and GROUP BY to locate missing values or other data anomalies. If you have to rewrite or impute missing values in SQL, there are several ways to do it:&nbsp;</p><ol><li><p>During the schema design step, you can set NULL values to 1, for example, or drop the column, rename it, or set its parameters if needed.</p></li><li><p>If you create a VIEW, a temporary table with merged datasets, or a CTE, you can use a <a href="https://dataanalysis.substack.com/p/sql-case-conditional-expression-guide?utm_source=url">CASE expression</a> to rename or redefine some values.</p></li><li><p>In your SELECT statement, you can use a CASE expression to impute data the same way.&nbsp;</p></li><li><p>You can use ALTER to run transformations on already created tables. This probably won&#8217;t work if you are dealing with a huge volume of data. Keep in mind that ALTER can be costly.&nbsp;</p></li></ol><p>You can check more Python methods on how to change a data format, impute values, and rename fields in my <a href="https://www.kaggle.com/olgaberezovsky/predicting-titanic-survival-using-most-common-ml">analysis</a>, or in this guide: <a href="https://www.timescale.com/blog/postgresql-vs-python-for-data-cleaning-a-guide/">PostgreSQL vs Python for data cleaning: A guide</a>.&nbsp;</p><p>Thanks for reading, everyone!</p>]]></content:encoded></item></channel></rss>