

Analytical Claustrophobia (Or Customer Data Platform) - Issue 91
Or how to make a sandwich in 587 steps.
Hello analysts, hope you are well. I’m back as usual with prepared insights for you in this week’s Data Analysis Journal, an advice column about data and product analytics.
If you’re not a paid subscriber, here’s what you missed this month:
This Is Why You Over-Count Your Daily Active Users - an explanation of why you are most likely over-reporting your DAU and MAU metrics, especially if you leverage Amplitude for analytics. A walkthrough of why activity reporting is tricky, what the common pitfalls are, and proposed recommendations on how to make your DAU cleaner and more accurate.
5 Must-Have Board Slides For Sales and Revenue Leaders - read this if you are lucky enough to put together decks reporting growth and revenue for an audience of executives. I review examples of effective slides and a deck structure that works, and share expert recommendations.
User Research - Get To Know Your Users and Their Motives - an introduction to qualitative analysis, which sits at the intersection of data analytics, marketing, and design. It covers its methods and practice, how the data team can support user research effectively, the difference between qualitative and quantitative analysis, and why you have to leverage both for analytics.
All revenue from subscriptions is donated to UCARE - Ukrainian Children's Aid and Relief Effort charity.
Today I will thrill you with a popular and mildly controversial topic - Customer Data Platform. It is one of the best-selling data products on the market, and I am about to tell you why it’s not always effective and might actually be damaging to your data-driven organization.
What CDP is and what problems it solves
Customer Data Platform is an all-in-one marketing and data infrastructure. In a nutshell, it’s a database for all your user information with a connected activation layer to help you leverage the data for marketing.
What the CDP does: it pulls data from multiple sources, cleans it, and combines it to create a single customer profile. CDPs can cost anywhere from $100K to $300K annually. For marketers, a single customer profile sure sounds intriguing. Data engineers and analysts, however, are likely to raise an eyebrow.
Every CDP has a similar structure:
Data ingestion - most commonly, data is loaded into the CDP via APIs from all usable sources - websites, mobile apps, third-party vendors, and whatever tool you want to leverage for analytics.
User or identity mapping - the core of any CDP is a user graph, where a user profile is generated based on all IDs loaded from all usable sources. A unique user ID is created based on any available user identifiers loaded via APIs - cookie, IDFA, device ID, etc.
User segmentation - an interface for marketers to create user segments (without SQL) and connect them to various marketing platforms to run targeted campaigns.
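To make the identity mapping step less abstract, here is a minimal sketch of how a user graph can be built: identifiers that show up together on the same event are merged into one profile. This is an illustration under my own assumptions, not how any particular CDP vendor implements it. The function name, the event shape, and the sample identifiers are all hypothetical.

```python
from collections import defaultdict


def resolve_identities(events):
    """Group identifiers that co-occur on the same event into one profile.

    `events` is a list of dicts such as {"cookie": "c1", "device_id": "d1"}
    (a hypothetical shape). Returns a list of sets, one set of identifiers
    per resolved user.
    """
    parent = {}

    def find(x):
        # Register unseen identifiers as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the path as we go
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for event in events:
        # Prefix each value with its kind so "cookie:1" and "idfa:1" never collide.
        ids = [f"{kind}:{value}" for kind, value in event.items() if value]
        if not ids:
            continue
        find(ids[0])  # make sure single-identifier events are registered too
        for other in ids[1:]:
            union(ids[0], other)

    profiles = defaultdict(set)
    for identifier in list(parent):
        profiles[find(identifier)].add(identifier)
    return list(profiles.values())


# Hypothetical sample data: c1, d1, and i1 are linked through shared events.
events = [
    {"cookie": "c1", "device_id": "d1"},
    {"device_id": "d1", "idfa": "i1"},
    {"cookie": "c2"},
]
profiles = resolve_identities(events)
```

Note that the cookie `c2` stays a separate profile because nothing links it to the others. If the source never sends a shared identifier, no amount of graph logic can merge the two.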
The age of Customer Data Platforms arrived a few years ago, and the demand for them keeps growing. The CDP market size is projected to grow from USD 2.4 billion in 2020 to USD 10.3 billion by 2025! Every data vendor is trying to sell you on buying their SaaS to build a “single view of the customer”. Optimizing marketing campaigns, saving costs on user acquisition, and creating a complete funnel of the user lifecycle are the main objectives for many companies now. And CDP vendors claim to offer exactly that while selling you the tools to get there.
But does it really work?
Why CDP might not solve the problem you have
In the marketing community, an all-in-one platform to solve our countless data problems sounds like the holy grail. In the data community, on the other hand, we’ve been trying to do this all along using the data warehouse. Yet, most people don’t make the connection between the two.
- Tejas Manohar, a former Segment developer and founder of HightouchData, in his piece Why Your CDP Should Be the Data Warehouse.
In my opinion, that quote nails it. He is right on point. Your data team struggles to complete a customer funnel because of a lack of data and attribution on the source layer, not because of low storage, limited resources, or data dissemination problems.
If you don’t have the events you need coming in from Google Analytics, Branch, social media, or similar sources, it doesn’t matter what fancy infrastructure you buy to support it downstream. It doesn’t matter if you load it into a data warehouse, data mart, data lake, or CDP. You will pay more, and still end up with a high percentage of missing users and missing attribution.
If you have the data coming from the source, it’s fairly easy to map it to your internal user ID. You can recreate any CDP in your own data warehouse with ETL or simple SQL. It will work just as effectively, and it will be cheaper to develop and maintain. Plus, you have full ownership of the events.
How CDP hurts your analytics
From CDPs are dead
Customer Data Platforms are not the single source of truth
Your user data is already in your data warehouse. With CDP, you simply add more layers downstream, locking your data into a specific structure and format. However, you can’t end up with more users in CDP than you already have in the database. In a way, CDP is like a database inside of a database. It complicates your architecture, adding unnecessary steps for data processing.
In CDP, all user data is locked into a “single view of the customer” or whatever the selling pitch may have been at the time of persuasion. You can’t simply break it down if needed or extract some attributes. Analysts lose flexibility and ownership of the data ecosystem.
Mapping users between all the sources is the tricky part
Some apps don’t have device IDs, some don’t send cookies or browser data. How do you map them together? CDPs don’t do the mapping any better than your analyst via SQL. If it can’t be done in SQL, then it’s unlikely to be done in CDP either.
This is the reason why in Google Analytics you might see 50K views, but in your database, you end up with only 10K for the same page and the same time range. Your own system can recognize (thus, confirm) only 10K actual users out of a blob of 50K visitors that might contain duplicates, bots, tests, blackjack, and confetti. This is not something CDP can solve for you. Good luck explaining this to your marketing folks.
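The gap between raw views and confirmed users can be illustrated with a toy example. The sketch below is only an assumption about what such a reconciliation might look like: the field names, the bot check, and the `known_users` mapping are all hypothetical, and real bot detection is far messier than a substring match.

```python
def looks_like_bot(user_agent):
    # Naive, hypothetical heuristic for illustration only.
    return any(marker in user_agent.lower() for marker in ("bot", "crawler", "spider"))


# Hypothetical raw page hits, the kind of blob a web analytics tool counts as views.
hits = [
    {"visitor_id": "v1", "user_agent": "Mozilla/5.0", "is_test": False},
    {"visitor_id": "v1", "user_agent": "Mozilla/5.0", "is_test": False},    # repeat view
    {"visitor_id": "v2", "user_agent": "Googlebot/2.1", "is_test": False},  # bot
    {"visitor_id": "v3", "user_agent": "Mozilla/5.0", "is_test": True},     # QA traffic
    {"visitor_id": "v4", "user_agent": "Mozilla/5.0", "is_test": False},    # never identified
]

# Hypothetical mapping from visitor IDs to internal user IDs in your database.
known_users = {"v1": "user_1", "v3": "qa_account"}

raw_views = len(hits)

# Keep only real, non-test traffic that can be tied to an internal user ID.
confirmed_users = {
    known_users[hit["visitor_id"]]
    for hit in hits
    if not hit["is_test"]
    and not looks_like_bot(hit["user_agent"])
    and hit["visitor_id"] in known_users
}

print(f"raw views: {raw_views}, confirmed users: {len(confirmed_users)}")
```

Five views shrink to one confirmed user here. The visitor `v4` is a real person, but without an identifier that ties them to an internal user ID, neither your SQL nor a CDP can count them as a known user.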
Having different versions of the same metric
My biggest pet peeve with CDPs is multiple definitions for the same KPIs across the data sources.
You will end up with, for example, 10M DAU in Amplitude, 8M DAU in your database, and 6M DAU in CDP. Which one do you trust? The numbers differ because CDP will create its “own” definition of Daily Active Users or Churn which you might not be able to replicate in Amplitude, Heap, Mixpanel, Mode, Looker, or whatever analytics you use.
The trickiest part is that the same event name (like app open or session start) which you use for defining a metric can have a completely different meaning in CDPs - and your analysts might not be aware of it. I can’t stress enough how damaging this can be for your reporting. This completely kills the trust in your data team.
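Here is a small sketch of how the same day of events produces three different DAU numbers depending on which events count as "active". The event names and the three definitions are hypothetical and are not taken from Amplitude or any CDP vendor. The point is only that the definition drives the number.

```python
from datetime import date

# Hypothetical event log for illustration.
events = [
    {"user_id": "u1", "event": "app_open", "day": date(2022, 3, 1)},
    {"user_id": "u1", "event": "purchase", "day": date(2022, 3, 1)},
    {"user_id": "u2", "event": "app_open", "day": date(2022, 3, 1)},
    {"user_id": "u3", "event": "push_received", "day": date(2022, 3, 1)},  # passive event
    {"user_id": "u4", "event": "app_open", "day": date(2022, 3, 2)},       # different day
]


def dau(events, day, qualifying_events=None):
    """Count distinct users on `day`; None means any event counts as activity."""
    return len({
        e["user_id"]
        for e in events
        if e["day"] == day
        and (qualifying_events is None or e["event"] in qualifying_events)
    })


day = date(2022, 3, 1)
dau_any_event = dau(events, day)               # every event counts, even passive ones
dau_app_open = dau(events, day, {"app_open"})  # only users who opened the app
dau_purchase = dau(events, day, {"purchase"})  # only users who bought something

print(dau_any_event, dau_app_open, dau_purchase)
```

All three numbers are "DAU" and all three are computed correctly. If each tool picks a different one without telling you, you get the 10M vs 8M vs 6M conversation.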
How to replicate a “single view of a customer” without developing a CDP?
Easy! Assuming you have the data coming in from the source, 2 analysts and 1 data engineer will make it work in a few weeks. If you don’t have the data, no CDP in the world will solve it.
Step 1. Data engineering - load data from sources into separate schemas: marketing, registrations, users, payments, activity, and more. Set up primary and foreign keys, design the table structure and connectors.
Step 2. Data analytics - create metrics definitions and calculations in SQL based on events and available tables. Using the created keys, analysts can connect registrations, purchases, and activity data by joining on user_id.
Step 3. Data engineering - create a materialized view using the calculations, logic, and SQL provided by analysts, then work on optimizing cost and run time.
Step 4. Data analytics - QA data to confirm the match between the sources and expected data volume. Use this view in your BI tools for reporting and data dissemination.
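The four steps above can be sketched end to end with Python's built-in sqlite3 module. Treat this as a toy illustration under my own assumptions, not the setup we used at any of the companies mentioned: the table names, columns, and sample rows are hypothetical. SQLite has no schemas or materialized views, so plain tables stand in for the separate schemas and a table built with CREATE TABLE ... AS stands in for the materialized view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 1: load source data into separate tables with keys (hypothetical layout).
CREATE TABLE registrations (
    user_id TEXT PRIMARY KEY,
    signup_date TEXT,
    source TEXT
);
CREATE TABLE payments (
    payment_id INTEGER PRIMARY KEY,
    user_id TEXT REFERENCES registrations(user_id),
    amount REAL
);
CREATE TABLE activity (
    user_id TEXT REFERENCES registrations(user_id),
    event_date TEXT,
    event_name TEXT
);

INSERT INTO registrations VALUES
    ('u1', '2022-03-01', 'google_ads'),
    ('u2', '2022-03-02', 'organic');
INSERT INTO payments (user_id, amount) VALUES ('u1', 10.0), ('u1', 5.0);
INSERT INTO activity VALUES
    ('u1', '2022-03-01', 'app_open'),
    ('u1', '2022-03-03', 'app_open'),
    ('u2', '2022-03-02', 'app_open');

-- Steps 2 and 3: metric logic joined on user_id, stored as one wide table.
-- Each source is aggregated before the join so rows are not multiplied.
CREATE TABLE customer_360 AS
SELECT
    r.user_id,
    r.signup_date,
    r.source,
    COALESCE(p.total_paid, 0) AS total_paid,
    COALESCE(p.payment_count, 0) AS payment_count,
    COALESCE(a.active_days, 0) AS active_days,
    a.last_active
FROM registrations r
LEFT JOIN (
    SELECT user_id, SUM(amount) AS total_paid, COUNT(*) AS payment_count
    FROM payments
    GROUP BY user_id
) p ON p.user_id = r.user_id
LEFT JOIN (
    SELECT user_id, COUNT(DISTINCT event_date) AS active_days,
           MAX(event_date) AS last_active
    FROM activity
    GROUP BY user_id
) a ON a.user_id = r.user_id;
""")

# Step 4: QA that the view has exactly one row per registered user.
registered = conn.execute("SELECT COUNT(*) FROM registrations").fetchone()[0]
in_view = conn.execute("SELECT COUNT(*) FROM customer_360").fetchone()[0]

rows = conn.execute("""
    SELECT user_id, source, total_paid, payment_count, active_days, last_active
    FROM customer_360
    ORDER BY user_id
""").fetchall()
for row in rows:
    print(row)
```

In a real warehouse you would swap the CREATE TABLE ... AS for a materialized view or a scheduled job, and the QA step would also compare totals against each source system.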
And voilà! Your 360-view of a customer is ready! My team did it at least 3 times at different companies - Change.org, First Republic Bank, and VidIQ, and every time we made it work within a 2-3 week timeline.
What’s more, I am opposed to creating data marts or lakes. I find them unnecessary. With data marts, every simple data request becomes a multiple-week project requiring 5 more JIRA tickets and ETL changes. And again, analysts lose flexibility and ownership of the data ecosystem. They make sense for enterprises or big companies with over 100 analysts-consumers. I don’t see their benefits for a small company with a team of 5-10 analysts.
The modern data warehouse offers tools and instruments to allow flexible reporting with optimal cost and scale (data mesh, for example). It requires, of course, higher analytics expertise to recognize the needed logic and structure for the reporting, and the ability to replicate it in SQL. I have seen many times how analysts nailed reporting based on the raw data in the data warehouse without using data marts. Unfortunately, I have also seen analytics leaders requesting the development of a CDP or a data lake within a data warehouse, spending many months and resources on development, only to end up with slightly better-optimized querying or more flexible formatting.
That’s why I advocate for investing in strong data analytics leadership with a coherent strategy for organizing, governing, and analyzing your data rather than spending your budget on another SaaS data service.
Thanks for reading, everyone. Until next Wednesday!
Great post! Building customer data platform on the warehouse also facilitates richer customer data from billing, surveys, support and offline channels.
Hey Olga, I've been meaning to respond to this issue of yours. While a CDP has limitations, it is certainly not going away anytime soon and in fact, has the potential to co-exist with Reverse ETL. I wrote about this last year and recently updated the post so sharing it here: https://arpitc.substack.com/data-activation