How To Make A Sandwich in 587 Steps - Issue 208

Or why Customer Data Platform may not be the solution you need.

Jun 19, 2024

Welcome to my Data Analytics Journal, where I write about data science and analytics.

“A few years ago, this company decided that it wanted to create an analytics platform, following the decision to become more "data driven". They hired some incredibly talented people to make this happen, and then like five times as many idiots… they hired a bunch of Big Firm Consultants. You can see where this is going already…

We were told that we just had to wait for the Advanced Analytics Platform (AAP) to be deployed. You see, it's December, and it's launching in January.

Then in January I was told to be patient, it was coming in March.

In June, I was told it had been put on hold due to Covid… We skip ahead three years. The AAP is finally ready to launch…

It's an insane dumpster fire spiderweb of technical debt and it's only like one week old.”

From Ludicity.

Sound familiar?

This is a controversial POV, but when I hear requests for "Analytics Platform", "Business Data Platform", or "Customer Data Platform", I suspect I am dealing with analytics leadership of low quality who lack experience with analytics at scale, have never stitched users together across multiple sources, and haven’t worked with mapping third-party and first-party data together. As a result, I believe nothing good will come from it because the world is flooded with stories of leadership failures like these, and I can personally add a few more.

As you might have guessed, today I will discuss a popular yet painful topic: the Customer Data Platform.

It is one of the best-selling data products on the market, and I am about to explain why it is not always effective and might actually be damaging to your data-driven organization.

What CDP is and what problems it solves

A Customer Data Platform (CDP) is an all-in-one marketing and data infrastructure. In a nutshell, it’s a database for all your user information with a connected activation layer to help you leverage the data for marketing.

What the CDP ~~does~~ intends to do: data is pulled from multiple sources, cleaned, and combined to create a single customer profile.

Every CDP has a similar structure:

Data ingestion - most commonly, data into CDPs is loaded from all usable sources via APIs - websites, mobile apps, third-party vendors, and whatever tool you want to leverage for analytics.
User or identity mapping - the core of any CDP is a user graph, where a user profile is generated based on all IDs loaded from all usable sources. A unique user ID is created based on any available user identifiers loaded via APIs - cookie, IDFA, device ID, etc.
User segmentation - an interface for marketers to create user segments (without SQL) and connect them to various marketing platforms to run targeted campaigns.
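To make the three layers concrete, here is a minimal sketch in Python. It is not how any particular vendor implements it: the records, field names, and the `build_profiles` helper are all invented for illustration, and the identity mapping is a plain union-find over identifiers that appear together on the same record.

```python
from collections import defaultdict

# Hypothetical ingested records; every field name and value is invented.
records = [
    {"source": "web", "cookie": "c1", "email": "ann@example.com"},
    {"source": "ios", "idfa": "i9", "email": "ann@example.com"},
    {"source": "ads", "idfa": "i9"},
    {"source": "web", "cookie": "c2"},
]
ID_FIELDS = ("cookie", "idfa", "email")


def identifiers(rec):
    """Return the (field, value) identifiers present on a record."""
    return [(f, rec[f]) for f in ID_FIELDS if rec.get(f)]


def build_profiles(records):
    """Group records into profiles: identifiers seen together share a user."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Identity mapping: link all identifiers that co-occur on one record.
    for rec in records:
        ids = identifiers(rec)
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])

    # Profile building: attach each record to its identifier group.
    profiles = defaultdict(list)
    for rec in records:
        ids = identifiers(rec)
        if ids:  # records with no identifier cannot be attached to anyone
            profiles[find(ids[0])].append(rec)
    return list(profiles.values())


profiles = build_profiles(records)

# Segmentation: profiles seen on both web and iOS.
segment = [
    p for p in profiles
    if {"web", "ios"} <= {rec["source"] for rec in p}
]
```

Notice that the second web visitor (`c2`) stays a separate profile because nothing links that cookie to any other identifier. The graph can only connect what the sources actually send.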

The age of Customer Data Platforms arrived a few years ago, and the demand for them keeps growing. The CDP market size is projected to grow from USD 2.4 billion in 2020 to USD 10.3 billion by 2025! Every data vendor is trying to sell you on buying their SaaS to build a “single view of the customer”. Optimizing marketing campaigns, saving costs on user acquisition, and creating a complete funnel of the user lifecycle are the main objectives for many companies. And CDP claims to offer exactly that while selling you the tools to get there.

In reality, does it really work?

CDPs are a little different but all the same

There are many types of CDPs: traditional standalone CDPs, “headless” or composable CDPs, hybrid CDPs, marketing clouds, etc. Now, companies are switching to an alternative that costs far less and delivers the same: leveraging their data warehouse. So now there are CDPs sitting on top of the data warehouse, and there are bundled and unbundled approaches to CDPs.

For example, Segment is stronger in tag management, and Amperity is more of a vertical solution that supports retail analytics. Reverse ETL tools (e.g., Hightouch and Census) overlap with CDPs: data is collected in the warehouse, and audiences are built there and exported to marketing platforms.

There is a close overlap between CRM tools and CDPs. CRM tools may store some customer data but remain disconnected from a data warehouse.

Similarly, there is a close overlap between Amplitude and CDPs. Many CDPs are based on an event structure, which limits aggregate analytics and makes it difficult to change historical data (e.g., if a user changes their subscription status). That doesn’t mean there are no ways to solve it.

Why CDP might not solve the problem you have

While legacy CDPs claim to provide a single source of truth for data, the reality is they only consolidate a fragmented copy of customer data.

- Tejas Manohar, a former Segment developer and founder of Hightouch, in Friends Don’t Let Friends Buy a CDP.

Your data team struggles to complete a customer lifecycle view not because of low storage, lack of talent, or data distribution problems but due to a lack of data and attribution at the source layer.

To support marketing effectively, you need a full picture of your customers - not just their recent transactions but their entire history of interactions with the product or app bundled with their demographic, social, and household data.

If you don’t have the necessary events coming in and properly mapped between Google Analytics, Braze, and product analytics tools, it doesn’t matter what advanced infrastructure you invest in downstream. It doesn’t matter if you load it into a data warehouse, data mart, data lake, or CDP. You will end up paying more and still face a high % of missing users and missing attribution.
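Before evaluating any downstream tool, it is worth measuring how complete the source data is. A rough sketch, assuming events arrive as dictionaries; the `coverage` helper and the field names `user_id` and `utm_source` are hypothetical stand-ins for whatever your tracking plan uses:

```python
def coverage(events, field):
    """Share of events where `field` is present and non-empty."""
    if not events:
        return 0.0
    filled = sum(1 for e in events if e.get(field))
    return filled / len(events)


# Invented sample: some events lack a user, some lack attribution.
events = [
    {"user_id": "u1", "utm_source": "google"},
    {"user_id": "u2", "utm_source": None},
    {"user_id": None, "utm_source": "braze"},
    {"user_id": "u3"},
]

user_coverage = coverage(events, "user_id")            # 3 of 4 events
attribution_coverage = coverage(events, "utm_source")  # 2 of 4 events
```

If these numbers are low at the source, they will be just as low in the warehouse, the lake, or the CDP.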

How CDP hurts your analytics

1. CDPs are not the single source of truth.

Your user data is already in your data warehouse. With CDP, you simply add more layers, locking your data into a specific structure and format. However, you can’t end up with more users in CDP than you already have in the database. In a way, CDP is like a database inside of a database. It complicates your architecture, adding unnecessary steps for data processing.

In CDP, all user data is locked into a “single view of the customer” or whatever the selling pitch might have been. You can’t easily break it down if needed or extract specific attributes. Analysts lose flexibility and ownership of the data ecosystem.

On top of it, you add a workload to ensure:

The data in the CDP is accurate and up-to-date.
The frequency of data sync into your CDP is maintained.
Accuracy and consistency between different systems are maintained.

By adding yet another data consumer layer to your overall data infrastructure, you make it more complicated. Eventually, you will either compromise the quality of your data (resulting in poor performance of your marketing campaigns) or triple the total cost of ownership.

2. Mapping users across all sources is the tricky part.

Identity resolution and data unification are challenging.

When data lacks a common key or identifier and arrives from various sources that weren't designed to interact or integrate, stitching it together correctly becomes incredibly complex. No CDP offers a magic solution to resolve customer identities and create a unified 360-degree customer view unless you already have it.

How do you map users together? Some apps lack device IDs, others do not send cookies or browser data, and some rely on IP addresses. Your ability to make data actionable depends on joining together the touchpoints in the user’s journey. CDPs do not perform this mapping any better than your analyst or data engineer. If it can’t be done in SQL, then it’s unlikely to be accomplished in a CDP either.
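A quick way to see the problem is to list which identifiers each source actually sends and check where they overlap. The sketch below uses an invented inventory (the source names and identifier sets are assumptions, not a real setup) and a hypothetical `shared_keys` helper:

```python
from itertools import combinations

# Hypothetical inventory of the identifiers each source sends.
source_ids = {
    "web": {"cookie", "email", "ip"},
    "ios_app": {"idfa", "email"},
    "ad_network": {"idfa"},
    "smart_tv": {"ip"},
}


def shared_keys(source_ids):
    """Map each pair of sources to the identifiers they have in common."""
    return {
        (a, b): source_ids[a] & source_ids[b]
        for a, b in combinations(sorted(source_ids), 2)
    }


overlap = shared_keys(source_ids)

# Pairs with no common identifier cannot be joined directly.
unjoinable = [pair for pair, keys in overlap.items() if not keys]
```

Here `ad_network` and `web` share nothing directly and can only be connected through `ios_app` (IDFA to email), so every user who never logged in on iOS stays unmatched. A CDP faces exactly the same gap.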

3. CDPs introduce different versions of the same entities and metrics.

Having different versions of the same entities and metrics is my biggest pet peeve with CDPs.

For instance, you will likely end up with 10M DAU in Amplitude, 8M DAU in your database, and 5M DAU in the CDP. Which one should you trust? The discrepancy arises because the CDP creates its own definition of DAU or churn, which you might not be able to replicate in other applications, such as Heap, Mixpanel, Mode, Looker, or whatever analytics you use.

The trickiest part is that the same event name (such as "app open" or "session start"), used to define an activity metric, can have a completely different meaning in a CDP depending on whether it’s based on the account, user, or entity, and your analysts might not be aware of this. I can’t stress enough how frustrating this can be.
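A toy example shows how the same "app open" events produce three different DAU numbers depending on the entity you count. The events, IDs, and the `daily_active` helper are made up for illustration:

```python
from datetime import date

DAY = date(2024, 6, 19)

# Invented events: one account with two users, one user with two devices.
events = [
    {"day": DAY, "event": "app open", "account_id": "a1", "user_id": "u1", "device_id": "d1"},
    {"day": DAY, "event": "app open", "account_id": "a1", "user_id": "u1", "device_id": "d2"},
    {"day": DAY, "event": "app open", "account_id": "a1", "user_id": "u2", "device_id": "d3"},
    {"day": DAY, "event": "app open", "account_id": "a2", "user_id": "u3", "device_id": "d4"},
]


def daily_active(events, day, key, event_name="app open"):
    """Count distinct values of `key` among matching events on `day`."""
    return len({
        e[key] for e in events
        if e["day"] == day and e["event"] == event_name
    })


dau_by_account = daily_active(events, DAY, "account_id")
dau_by_user = daily_active(events, DAY, "user_id")
dau_by_device = daily_active(events, DAY, "device_id")
```

Four events, and "DAU" is 2, 3, or 4 depending on the key. None of the numbers is wrong, but comparing them across tools without knowing the key is.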

4. CDPs are expensive.

Apart from (a) the cost of the CDP itself, (b) dedicating engineering and analytical resources to its setup and maintenance, and (c) the cost of ETLs to ingest the data into the CDP, you will also have to double the storage cost because data has to be replicated to the CDP.

Think about it: why do you need to keep multiple copies of the same data?

For most CDPs, figuring out the total pricing is not easy, as its cost depends on:

Unique tracked users (which may or may not be similar to active users).
The total number of user records or events.
API calls and the volume of incoming data.
The components and features used.
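To get a feel for how these drivers add up, you can model them in a few lines. This is a back-of-the-envelope sketch only: the `Usage` and `Rates` structures and every number below are invented, and real pricing varies by vendor and contract.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    tracked_users: int  # unique tracked users, not necessarily active ones
    events: int         # total user records or events
    api_calls: int      # incoming API calls
    addons: int         # extra components or features enabled


@dataclass
class Rates:
    # All rates are hypothetical placeholders, not any vendor's price list.
    per_1k_users: float
    per_1m_events: float
    per_1m_api_calls: float
    per_addon: float


def monthly_cost(usage, rates):
    """Sum the four usage-based drivers into one monthly figure."""
    return (
        usage.tracked_users / 1_000 * rates.per_1k_users
        + usage.events / 1_000_000 * rates.per_1m_events
        + usage.api_calls / 1_000_000 * rates.per_1m_api_calls
        + usage.addons * rates.per_addon
    )


rates = Rates(per_1k_users=10.0, per_1m_events=50.0,
              per_1m_api_calls=5.0, per_addon=500.0)
usage = Usage(tracked_users=200_000, events=40_000_000,
              api_calls=100_000_000, addons=2)
estimate = monthly_cost(usage, rates)
```

The useful part is not the total but the sensitivity: plug in your own growth assumptions for tracked users and event volume and see which driver dominates before you sign.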

5. CDPs require resources to maintain.

CDPs promised fast ROI and self-service, but the reality is quite different.

Beyond being very expensive, they require a significant initial investment of data and platform engineering time and resources for setup. Your team has to evaluate every usage metric upfront and then develop an integration strategy for each source you have.

Analysts will need to replicate KPIs and metrics in the CDP, QA the data, and map it to the values in a data warehouse, which is not a quick task. The match is never close, leading to many hours spent investigating why the CDP reports 35% fewer users than the dim_users table in Postgres.
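The investigation usually starts with a plain set comparison of user IDs from both systems. A minimal sketch, assuming you can export both ID lists; the `reconcile` helper and the sample IDs are hypothetical:

```python
def reconcile(warehouse_ids, cdp_ids):
    """Compare user IDs from the warehouse with those in the CDP."""
    warehouse_ids, cdp_ids = set(warehouse_ids), set(cdp_ids)
    missing = warehouse_ids - cdp_ids   # in the warehouse but not in the CDP
    extra = cdp_ids - warehouse_ids     # in the CDP but not in the warehouse
    gap = len(missing) / len(warehouse_ids) if warehouse_ids else 0.0
    return {"missing_in_cdp": missing, "only_in_cdp": extra, "gap": gap}


# Invented IDs standing in for an export of dim_users and of the CDP.
report = reconcile(["u1", "u2", "u3", "u4"], ["u1", "u2", "u9"])
```

The IDs in `only_in_cdp` are often the more interesting half: they usually point to profiles the CDP created from identifiers your warehouse never resolved to a user.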

Many companies also underestimate the ongoing resources needed to maintain a CDP. Every time you launch a new feature, add a new screen, or set up a new campaign, you will need to re-calibrate definitions.

I don’t recall who said it, but the reason why standalone CDPs failed is that they were marketed to marketing teams as a way to get independence from data engineering and analytics teams. However, maintaining a data layer that can ingest data at scale from any source and ensure it is clean and accurate requires everything from data engineering and analytics: APIs, SDKs, pipelines, monitoring, governance, and QA.

That’s why I advocate for investing in strong data analytics leadership with a coherent long-term strategy for organizing, governing, distributing, and analyzing your data.

To wrap up, my message is this: Don’t fall for the CDP hype or trap:

The setup is hard, and maintenance will require more support from every engineering team.
Your user identity stitching should be built and stored in your data warehouse, not with a vendor. It allows you to control costs, run ML, ensure security, and more.
If your director of analytics requests a BDP or CDP to unlock reporting, you need a new leader who understands the dependencies of analytics output.
Analysts should not lose flexibility and ownership of the data ecosystem, regardless of which CDP you adopt.

Thanks for reading, everyone. Until next Wednesday!

Data Analysis Journal
