How To Move Beyond A Monolithic Data Lake To A Distributed Data Mesh
20 May 2019
Zhamak Dehghani
ENTERPRISE ARCHITECTURE
DATA ANALYTICS
Almost every client I work with is either planning or building their 3rd generation
data and intelligence platform, while admitting the failures of the past
generations: the first generation of proprietary enterprise data warehouse and
business intelligence platforms, and the second generation of big data ecosystems
with a data lake as the silver bullet.
The third and current generation data platforms are more or less similar to the
previous generation, with a modern twist towards (a) streaming for real-time data
availability with architectures such as Kappa, (b) unifying the batch and stream
processing for data transformation with frameworks such as Apache Beam, as
well as (c) fully embracing cloud based managed services for storage, data
pipeline execution engines and machine learning platforms. It is evident that the
third generation data platform is addressing some of the gaps of the previous
generations such as real-time data analytics, as well as reducing the cost of
managing big data infrastructure. However, it suffers from many of the underlying
characteristics that led to the failures of the previous generations.
To unpack the underlying limitations that all generations of data platforms carry,
let's look at their architecture and their characteristics. In this writeup I use the
domain of internet media streaming business such as Spotify, SoundCloud, Apple
iTunes, etc. as the example to clarify some of the concepts.
At 30,000 feet the data platform architecture looks like Figure 1 below; a
centralized piece of architecture whose goal is to:
Ingest data from all corners of the enterprise, ranging from operational and
transactional systems and domains that run the business, or external data
providers that augment the knowledge of the enterprise. For example in a
media streaming business, the data platform is responsible for ingesting a large
variety of data: the 'media players performance', how their 'users interact with
the players', 'songs they play', 'artists they follow', as well as 'labels and artists'
that the business has onboarded, the 'financial transactions' with the artists,
and external market research data such as 'customer demographic'
information.
Cleanse, enrich, and transform the source data into trustworthy data that can
address the needs of a diverse set of consumers. In our example, one of the
transformations turns the click streams of user interaction to meaningful
sessions enriched with details of the user. This attempts to reconstruct the
journey and behavior of the user into aggregate views.
Serve the datasets to a variety of consumers with a diverse set of needs. This
ranges from analytical consumption to exploring the data looking for insights,
machine learning based decision making, to business intelligence reports that
summarize the performance of the business. In our media streaming example,
the platform can serve near real-time error and quality information about the
media players around the globe through distributed log interfaces such as
Kafka, or serve static aggregate views of a particular artist's records being
played to drive financial payment calculations for artists and labels.
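To make the "serve" side concrete, here is a minimal sketch of a consumer reading near real-time media player quality events from a distributed log; the topic name, broker address and event fields are hypothetical illustrations, not part of the architecture described above, and the kafka-python client is just one possible choice.

```python
# A minimal sketch: consume near real-time media player quality events
# from a Kafka topic. Topic name, broker address and event fields are
# hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "media-player.quality-events",           # hypothetical topic name
    bootstrap_servers=["kafka:9092"],        # hypothetical broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="latest",
)

# Each message carries a timestamped quality signal emitted by a media player.
for message in consumer:
    event = message.value
    if event.get("error_rate", 0.0) > 0.05:
        print(f"degraded playback in {event.get('region')}: {event['error_rate']:.1%}")
```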
While over the last decade we have successfully applied domain driven design
and bounded context to our operational systems, we have largely disregarded the
domain concepts in a data platform. We have moved away from domain oriented
data ownership to a centralized domain agnostic data ownership. We pride
ourselves on creating the biggest monolith of them all, the big data platform.
Figure 2 Centralized data platform with no clear data domain boundaries and ownership of domain
oriented data
While this centralized model can work for organizations that have a simpler
domain with a smaller number of diverse consumption cases, it fails for
enterprises with rich domains, a large number of sources and a diverse set of
consumers.
There are two pressure points on the architecture and the organizational
structure of a centralized data platform that often lead to its failure: the
proliferation of data sources on the one hand, and the proliferation of consumers
and their diverse needs on the other.
While I don't want to give my solution away just yet, I need to clarify that I'm not
advocating for fragmented, siloed, domain-oriented data often hidden in the
bowels of operational systems; siloed domain data that is hard to discover, make
sense of and consume. I am not advocating for multiple fragmented data
warehouses that are the results of years of accumulated tech debt. This is a
concern that leaders in the industry have voiced. But I argue that the response to
these accidental silos of unreachable data is not creating a centralized data
platform, with a centralized team who owns and curates the data from all
domains. It does not organizationally scale as we have learned and demonstrated
above.
Though this model provides some level of scale, by assigning teams to different
stages of the pipeline, it has an inherent limitation that slows the delivery of
features. It has high coupling between the stages of the pipeline to deliver an
independent feature or value. It's decomposed orthogonally to the axis of change.
Let's look at our media streaming example. Internet media streaming platforms
have a strong domain construct around the type of media that they offer. They
often start their services with 'songs' and 'albums', and then extend to 'music
events', 'podcasts', 'radio shows', 'movies', etc. Enabling a single new feature, such
as visibility to the 'podcasts play rate', requires a change in all components of the
pipeline. Teams must introduce new ingestion services, new cleansing and
preparation as well as aggregates for viewing podcast play rates. This requires
synchronization across implementation of different components and release
management across teams. Many data platforms provide generic,
configuration-based ingestion services that can cope with extensions such as
adding new sources or modifying existing ones, minimizing the
overhead of introducing new sources. However, this does not remove the
end-to-end dependency management of introducing new datasets from the consumer's
point of view. Though on paper, the pipeline architecture might appear as if we
have achieved an architectural quantum of a pipeline stage, in practice the whole
pipeline i.e. the monolithic platform, is the smallest unit that must change to
cater for a new functionality: unlocking a new dataset and making it available for
new or existing consumption. This limits our ability to achieve higher velocity
and scale in response to new consumers or sources of the data.
Figure 4 Architecture decomposition is orthogonal to the axis of change when introducing or enhancing
features, leading to coupling and slower delivery
The third failure mode of today's data platforms is related to how we structure
the teams who build and own the platform. When we zoom close enough to
observe the life of the people who build and operate a data platform, what we find
is a group of hyper-specialized data engineers siloed from the operational units
of the organization; where the data originates or where it is used and put into
actions and decision making. The data platform engineers are not only siloed
organizationally but also separated and grouped into a team based on their
technical expertise of big data tooling, often absent of business and domain
knowledge.
Figure 5 Siloed hyper-specialized data platform team
I personally don't envy the life of a data platform engineer. They need to consume
data from teams who have no incentive in providing meaningful, truthful and
correct data. They have very little understanding of the source domains that
generate the data and lack the domain expertise in their teams. They need to
provide data for a diverse set of needs, operational or analytical, without a clear
understanding of the application of the data and access to the consuming
domain's experts.
In the media streaming domain, for example, on the source end we have cross-
functional 'media player' teams that provide signals around how users interact
with a particular feature they provide e.g. 'play song events', 'purchase events',
'play audio quality', etc.; and on the other end sit the consumer cross-functional
teams such as 'song recommendation' team, 'sales team' reporting sales KPIs,
'artists payment team' who calculate and pay artists based on play events, and so
on. Sadly, in the middle sits the data platform team that through sheer effort
provides suitable data for all sources and consumptions.
Figure 6 Convergence: the paradigm shift for building the next data platforms
Though this might sound like a lot of buzzwords in one sentence (domain driven
distributed architecture, self-serve platform design, and product thinking applied
to data), each of these techniques has had a specific and incredibly positive
impact in modernizing the technical foundations of operational systems. Let's
deep dive into how we can apply each of these disciplines to the world of data to
escape the current paradigm, carried over from years of legacy data warehousing
architecture.
This requires shifting our thinking from a push and ingest model, traditionally
through ETLs and more recently through event streams, to a serving and pull
model across all domains.
Some domains naturally align with the source, where the data originates. The
source domain datasets represent the facts and reality of the business. The source
domain datasets capture the data that is mapped very closely to what the
operational systems of their origin, systems of reality, generate. In our example
facts of the business such as 'how the users are interacting with the services', or
'the process of onboarding labels' lead to creation of domain datasets such as
'user click streams', 'audio play quality stream' and 'onboarded labels'. These facts
are best known and generated by the operational systems that sit at the point of
origin. For example the media player system knows best about the 'user click
streams'.
The business facts are best presented as business Domain Events, and can be
stored and served as distributed logs of time-stamped events for any authorized
consumer to access.
In addition to timed events, source data domains should also provide easily
consumable historical snapshots of the source domain datasets, aggregated over
a time interval that closely reflects the interval of change for their domain. For
example in an 'onboarded labels' source domain, which shows the labels of the
artists that provide music to the streaming business, aggregating the onboarded
labels on a monthly basis is a reasonable view to provide in addition to the events
generated through the process of onboarding labels.
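As an illustration of the 'onboarded labels' example, the sketch below shows immutable, time-stamped domain events alongside a monthly snapshot derived from them; the class name, event fields and aggregation are assumptions made purely for illustration.

```python
# A minimal sketch of a source-aligned domain dataset: immutable, timestamped
# 'label onboarded' domain events, plus an easily consumable monthly snapshot
# aggregated from them. Field names are illustrative, not a prescribed schema.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable

@dataclass(frozen=True)          # immutable: events are facts, never updated
class LabelOnboarded:
    label_id: str
    label_name: str
    onboarded_at: datetime

def monthly_snapshot(events: Iterable[LabelOnboarded]) -> dict:
    """Aggregate onboarding events into a monthly view of onboarded labels."""
    snapshot = defaultdict(list)
    for event in events:
        snapshot[event.onboarded_at.strftime("%Y-%m")].append(event.label_id)
    return dict(snapshot)

# Example: two labels onboarded in one month, one in the next.
events = [
    LabelOnboarded("l1", "Blue Note", datetime(2019, 4, 2)),
    LabelOnboarded("l2", "Verve", datetime(2019, 4, 20)),
    LabelOnboarded("l3", "Deutsche Grammophon", datetime(2019, 5, 11)),
]
print(monthly_snapshot(events))   # {'2019-04': ['l1', 'l2'], '2019-05': ['l3']}
```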
Note that the source aligned domain datasets must be separated from the
internal source systems' datasets. The nature of the domain datasets is very
different from the internal data that the operational systems use to do their job.
They have a much larger volume, represent immutable timed facts, and change
less frequently than their systems. For this reason the actual underlying storage
must be suitable for big data, and separate from the existing operational
databases. Section Data and self-serve platform design convergence describes
how to create big data storage and serving infrastructure.
Source domain datasets are the most foundational datasets and change less
often, as the facts of business don't change that frequently. These domain
datasets are expected to be permanently captured and made available, so that as
the organization evolves its data-driven and intelligence services they can always
go back to the business facts, and create new aggregations or projections.
Note that source domain datasets represent closely the raw data at the point of
creation, and are not fitted or modeled for a particular consumer.
Consumer oriented and shared domain data
Some domains align closely with the consumption. The consumer domain
datasets, and the teams who own them, aim to satisfy a closely related group of
use cases. For example the 'social recommendation domain', which focuses on
providing recommendations based on users' social connections to each other,
creates domain datasets that fit this specific need; perhaps through a 'graph
representation of the social network of users'. While this graph dataset is useful
for the recommendation use case, it might also be useful for a 'listener
notifications' domain, which provides data regarding the different types of
notifications that are sent to the listener, including what people in their social
network are listening to. So it is possible that 'user social network' can become a
shared and newly reified domain dataset for multiple consumers to use. The 'user
social network' domain team focuses on providing an always curated and
up-to-date view of the 'user social network'.
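A tiny sketch of how such a reified 'user social network' dataset might be shared by two consuming domains; the adjacency list, user names and both consumer functions are hypothetical and only illustrate the reuse of one curated dataset.

```python
# A minimal sketch of a shared, reified 'user social network' domain dataset
# consumed by two different domains. Users and 'follows' edges are illustrative.
FOLLOWS = {                      # adjacency list curated by the domain team
    "ana": {"ben"},
    "ben": {"cal", "dia"},
    "cal": set(),
    "dia": set(),
}

# The 'social recommendation' domain suggests follows-of-follows...
def suggest_follows(user):
    follows = FOLLOWS[user]
    candidates = set().union(*(FOLLOWS[f] for f in follows)) if follows else set()
    return candidates - follows - {user}

# ...while the 'listener notifications' domain asks whose followers should be
# notified when a user plays a song.
def followers(user):
    return {u for u, f in FOLLOWS.items() if user in f}

print(suggest_follows("ana"))    # {'cal', 'dia'}
print(followers("ben"))          # {'ana'}
```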
While dataset ownership is delegated from the central platform to the domains,
the need for cleansing, preparing, aggregating and serving data remains, and so
does the use of data pipelines. In this architecture, a data pipeline is simply an
internal complexity and implementation detail of the data domain, and is handled
internally within the domain. As a result we will see a distribution of the data
pipeline stages into each domain.
For example the source domains need to include the cleansing, deduplicating,
enriching of their domain events so that they can be consumed by other domains,
without replication of cleansing. Each domain dataset must establish Service
Level Objectives for the quality of the data it provides: timeliness, error rates, etc.
For example our media player domain, providing the audio 'play clickstream', can
include a cleansing and standardizing data pipeline in its domain that provides a
stream of de-duped, near real-time 'play audio click events' conforming to the
organization's standards for encoding events.
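A minimal sketch of such a domain-internal cleansing pipeline, using Apache Beam (already mentioned earlier); the raw event shape, the "standard" encoding, and the de-duplicate-by-event-id rule are all illustrative assumptions rather than a prescribed implementation.

```python
# A minimal sketch of a cleansing/standardizing pipeline owned inside the
# 'media player' domain: drop malformed clicks, de-duplicate by event id,
# and emit events in an (assumed) standard encoding.
import apache_beam as beam

raw_events = [
    {"event_id": "e1", "song": "s1", "ts": "2019-05-20T10:00:00Z"},
    {"event_id": "e1", "song": "s1", "ts": "2019-05-20T10:00:00Z"},  # duplicate
    {"event_id": "e2", "song": None, "ts": "2019-05-20T10:00:05Z"},  # malformed
]

def to_standard_event(event):
    # Conform to the organization's (assumed) standard event encoding.
    return {"id": event["event_id"], "song_id": event["song"], "played_at": event["ts"]}

with beam.Pipeline() as pipeline:
    (pipeline
     | "read raw clicks" >> beam.Create(raw_events)
     | "drop malformed"  >> beam.Filter(lambda e: e["song"] is not None)
     | "key by event id" >> beam.Map(lambda e: (e["event_id"], e))
     | "collapse dupes"  >> beam.GroupByKey()
     | "keep first"      >> beam.Map(lambda kv: list(kv[1])[0])
     | "standardize"     >> beam.Map(to_standard_event)
     | "serve"           >> beam.Map(print))
```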
Equally, we will see that aggregation stages of a centralized pipeline move into
implementation details of consuming domains.
Figure 8 Distribute the pipelines into the domains as a second class concern and the domain's internal
implementation detail
One might argue that this model might lead to duplicated effort in each domain
to create their own data processing pipeline implementation, technology stack
and tooling. I will address this concern shortly as we talk about the Convergence
of Data and Platform Thinking with Self-serve shared Data Infrastructure as a
Platform.
Over the last decade operational domains have built product thinking into the
capabilities they provide to the rest of the organization. Domain teams provide
these capabilities as APIs to the rest of the developers in the organization, as
building blocks of creating higher order value and functionality. The teams strive
for creating the best developer experience for their domain APIs; including
discoverable and understandable API documentation, API test sandboxes, and
closely tracked quality and adoption KPIs.
For a distributed data platform to be successful, domain data teams must apply
product thinking with similar rigor to the datasets that they provide; considering
their data assets as their products and the rest of the organization's data
scientists, ML and data engineers as their customers.
In this case our 'played songs' domain provides two different datasets as its
products to the rest of the organization; real-time play events exposed on event
streams, and aggregated play events exposed as serialized files on an object store.
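A sketch of what those two 'played songs' data products could look like in code; the topic name, bucket path, schemas and the kafka-python, pandas and s3fs tooling are all assumptions chosen for illustration.

```python
# A minimal sketch of the two hypothetical 'played songs' data products:
# (1) real-time play events on an event stream, (2) a daily aggregate
# written as Parquet files to an object store.
import json
import pandas as pd
from kafka import KafkaProducer

# Product 1: near real-time play events on an event stream (hypothetical topic).
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("played-songs.events", {"song_id": "s1", "user_id": "u1",
                                      "played_at": "2019-05-20T10:00:00Z"})
producer.flush()

# Product 2: aggregated play counts, serialized as Parquet on object storage.
plays = pd.DataFrame([
    {"song_id": "s1", "date": "2019-05-20"},
    {"song_id": "s1", "date": "2019-05-20"},
    {"song_id": "s2", "date": "2019-05-20"},
])
daily_counts = plays.groupby(["date", "song_id"]).size().reset_index(name="plays")
# With s3fs installed, pandas can write straight to an (assumed) S3 bucket.
daily_counts.to_parquet("s3://played-songs/daily_play_counts/2019-05-20.parquet")
```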
An important quality of any technical product, in this case domain data products,
is to delight their consumers; in this case data engineers, ML engineers or data
scientists. To provide the best user experience for consumers, the domain data
products need to have the following basic qualities:
Discoverable
A data product must be easily discoverable. A common implementation is a
registry, or data catalogue, of all available data products with their meta
information such as owners, source of origin and sample datasets. Note the
perspective shift here is from a single platform extracting and owning the data for
its use, to each domain providing its data as a product in a discoverable fashion.
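The sketch below shows a domain registering its own product with a central data catalogue so that it is discoverable from day one; the catalogue endpoint, payload fields and product name are hypothetical, not a real service's API.

```python
# A minimal sketch of self-registration with a (hypothetical) central data
# catalogue so consumers can discover the product.
import json
import urllib.request

data_product = {
    "name": "played-songs.daily-play-counts",
    "domain": "played songs",
    "owner": "played-songs-team@example.com",
    "address": "s3://played-songs/daily_play_counts/",   # see 'Addressable' below
    "description": "Daily play counts per song, aggregated from play events.",
    "sample": "s3://played-songs/daily_play_counts/sample.parquet",
}

request = urllib.request.Request(
    "https://catalog.example.com/api/v1/data-products",  # hypothetical endpoint
    data=json.dumps(data_product).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(request)  # register, making the product discoverable
```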
Addressable
A data product, once discovered, should have a unique address following a global
convention that helps its users to programmatically access it. Organizations may
adopt different naming conventions for their data, depending on the underlying
storage and format of the data. With ease of use as an objective, in a decentralized
architecture it is necessary for common conventions to be developed. Different
domains might store and serve their datasets in different formats: events might be
stored and accessed through streams such as Kafka topics, while columnar
datasets might be stored as CSV files or as AWS S3 buckets of serialized Parquet
files. A standard for the addressability of datasets in a polyglot environment
removes friction when finding and accessing information.
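One possible shape for such a convention is sketched below: a logical, storage-agnostic address that resolves to the physical location regardless of format. The scheme, names and mapping are assumptions for illustration only.

```python
# A minimal sketch of a global addressing convention:
#   datamesh://{domain}/{data product}/{serving format}
# The scheme and the mapping to physical storage are illustrative assumptions.
ADDRESSES = {
    "datamesh://played-songs/play-events/stream":
        {"kind": "kafka", "topic": "played-songs.events"},
    "datamesh://played-songs/daily-play-counts/files":
        {"kind": "s3", "uri": "s3://played-songs/daily_play_counts/"},
}

def resolve(address: str) -> dict:
    """Map a logical, storage-agnostic address to its physical location."""
    try:
        return ADDRESSES[address]
    except KeyError:
        raise LookupError(f"unknown data product address: {address}")

print(resolve("datamesh://played-songs/play-events/stream"))
# {'kind': 'kafka', 'topic': 'played-songs.events'}
```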
Trustworthy and truthful
No one will use a product that they can't trust. In traditional data platforms it's
acceptable to extract and onboard data that has errors, does not reflect the truth
of the business and simply can't be trusted. This is where the majority of the
effort of centralized data pipelines is concentrated: cleansing data after ingestion.
The target value or range of a data integrity (quality) indicator varies between
domain data products. For example, the 'play event' domain may provide two
different data products: one near real-time, with a lower level of accuracy,
including missing or duplicate events; and one with a longer delay and a higher
level of event accuracy. Each data product defines and assures the target level of
its integrity and truthfulness as a set of SLOs.
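The sketch below shows what per-product integrity SLOs for those two hypothetical 'play event' products could look like; the product names, indicators and thresholds are illustrative assumptions.

```python
# A minimal sketch of per-product integrity SLOs for two hypothetical
# 'play event' data products; thresholds are illustrative only.
SLOS = {
    "play-events.near-real-time": {"max_lag_seconds": 60,    "max_error_rate": 0.02},
    "play-events.curated":        {"max_lag_seconds": 86400, "max_error_rate": 0.001},
}

def meets_slo(product: str, observed_lag_seconds: float, observed_error_rate: float) -> bool:
    """Check observed timeliness and error rate against the product's SLO."""
    slo = SLOS[product]
    return (observed_lag_seconds <= slo["max_lag_seconds"]
            and observed_error_rate <= slo["max_error_rate"])

# The near-real-time product tolerates some missing or duplicate events;
# the curated product does not.
print(meets_slo("play-events.near-real-time", 45, 0.015))   # True
print(meets_slo("play-events.curated", 45, 0.015))          # False
```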
Interoperable and governed by global standards
One of the main concerns in a distributed domain data architecture is the ability
to correlate data across domains and stitch them together in wonderful,
insightful ways: join, filter, aggregate, etc. The key to effective correlation of
data across domains is following certain standards and harmonization rules. Such
standardizations should belong to a global governance, to enable interoperability
between polyglot domain datasets. Common concerns of such standardization
efforts are field type formatting, identifying polysemes across different domains,
dataset address conventions, common metadata fields, event formats such as
CloudEvents, etc.
For example in the media streaming business, an 'artist' might appear in different
domains and have different attributes and identifiers in each domain. The 'play
eventstream' domain may recognize the artist differently to the 'artists payment'
domain that takes care of invoices and payments. However, to be able to correlate
the data about an artist across different domain data products, we need to agree
on how we identify an artist as a polyseme. One approach is to consider 'artist' as
a federated entity with a unique global federated entity identifier, similarly to how
federated identities are managed.
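A small sketch of the polyseme idea: each domain keeps its local artist identifier, and a globally governed mapping supplies one federated identifier used for cross-domain correlation. All identifiers, record shapes and the mapping itself are hypothetical.

```python
# A minimal sketch of 'artist' as a polyseme: local ids per domain, one
# globally governed federated identifier used to correlate across domains.
GLOBAL_ARTIST_ID = {                      # governed globally, not per domain
    ("play-events", "artist-321"): "urn:artist:9f2c",
    ("artist-payments", "ACC-0042"): "urn:artist:9f2c",
}

plays    = [{"domain": "play-events", "artist": "artist-321", "plays": 120_000}]
payments = [{"domain": "artist-payments", "artist": "ACC-0042", "amount": 840.0}]

def federate(records):
    """Attach the federated identifier to each domain-local record."""
    for record in records:
        yield {**record, "artist_urn": GLOBAL_ARTIST_ID[(record["domain"], record["artist"])]}

# Now the two domains' data products can be correlated on the federated id.
by_artist = {}
for record in list(federate(plays)) + list(federate(payments)):
    by_artist.setdefault(record["artist_urn"], {}).update(record)
print(by_artist["urn:artist:9f2c"])   # combined play and payment facts
```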
Interoperability and standardization of communications, governed globally, is one
of the foundational pillars for building distributed systems.
Section Data and self-service platform design convergence describes the shared
infrastructure that enables the above capabilities for each data product easily and
automatically.
Domains that provide data as products need to be augmented with new skill sets:
(a) the data product owner and (b) data engineers.
A data product owner makes decisions around the vision and the roadmap for the
data products, concerns herself with satisfaction of her consumers and
continuously measures and improves the quality and richness of the data her
domain owns and produces. She is responsible for the lifecycle of the domain
datasets, when to change, revise and retire data and schemas. She strikes a
balance between the competing needs of the domain data consumers.
Figure 10 Cross functional domain data teams with explicit data product ownership
Data and self-serve platform design convergence
One of the main concerns of distributing the ownership of data to the domains is
the duplicated effort and skills required to operate the data pipelines technology
stack and infrastructure in each domain. Luckily, building common infrastructure
as a platform is a well understood and solved problem; though admittedly the
tooling and techniques are not as mature in the data ecosystem.
Figure 11 Extracting and harvesting domain agnostic data pipeline infrastructure and tooling into a
separate data infrastructure as a platform
The key to building the data infrastructure as a platform is (a) to not include any
domain specific concepts or business logic, keeping it domain agnostic, and (b) to
make sure the platform hides all the underlying complexity and provides the data
infrastructure components in a self-serve manner. There is a long list of
capabilities that a self-serve data infrastructure as a platform provides to its
users, a domain's data engineers, ranging from big data storage and pipeline
execution engines to data product scaffolding and catalog registration.
A success criterion for self-serve data infrastructure is lowering the 'lead time to
create a new data product' on the infrastructure. This leads to the automation
required for implementing the capabilities of a 'data product', as covered in the
section Domain data as a product: for example, automating data ingestion
through configurations and scripts, data product creation scripts to put
scaffolding in place, auto-registering a data product with the catalog, etc.
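The sketch below illustrates what such a self-serve "create a new data product" capability might look like: a small declarative spec plus scaffolding that provisions the pieces a domain team would otherwise assemble by hand. Every service call is a stub and every name is an assumption.

```python
# A minimal sketch of a self-serve 'create data product' capability.
# The spec fields and all provisioning steps are hypothetical stubs.
product_spec = {
    "name": "podcasts-play-rate",
    "domain": "podcasts",
    "outputs": [{"kind": "stream"}, {"kind": "parquet", "partition_by": "date"}],
    "slo": {"max_lag_seconds": 300},
}

def create_data_product(spec: dict) -> None:
    provision_storage(spec)            # e.g. create a topic / bucket prefix
    scaffold_pipeline(spec)            # generate a pipeline skeleton
    register_in_catalog(spec)          # discoverable from day one
    configure_monitoring(spec["slo"])  # wire SLO dashboards and alerts

# Stub implementations so the sketch runs end to end.
def provision_storage(spec):   print(f"storage provisioned for {spec['name']}")
def scaffold_pipeline(spec):   print(f"pipeline scaffolded for {spec['name']}")
def register_in_catalog(spec): print(f"{spec['name']} registered in catalog")
def configure_monitoring(slo): print(f"alerts configured: {slo}")

create_data_product(product_spec)
```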
Using cloud infrastructure as a substrate reduces the operational costs and effort
required to provide on-demand access to the data infrastructure; however, it
doesn't completely remove the higher abstractions that need to be put in place in
the context of the business. Regardless of the cloud provider there is a rich and
ever growing set of data infrastructure services that are available to the data infra
team.
The paradigm shift towards a data mesh
It's been a long read. Let's bring it all together. We looked at some of the
underlying characteristics of the current data platforms: centralized, monolithic,
with highly coupled pipeline architecture, operated by silos of hyper-specialized
data engineers. We introduced the building blocks of a ubiquitous data mesh as a
platform; distributed data products oriented around domains and owned by
independent cross-functional teams who have embedded data engineers and data
product owners, using common data infrastructure as a platform to host, prep
and serve their data assets.
Accordingly, the data lake is no longer the centerpiece of the overall architecture.
We will continue to apply some of the principles of data lake, such as making
immutable data available for explorations and analytical usage, to the source
oriented domain data products. We will continue to use the data lake tooling,
however either for internal implementation of data products or as part of the
shared data infrastructure.
This, in fact, takes us back to where it all began: James Dixon in 2010 intended a
data lake to be used for a single domain; multiple data domains would instead
form a 'water garden'.
The main shift is to treat domain data product as a first class concern, and data
lake tooling and pipeline as a second class concern - an implementation detail.
This inverts the current mental model from a centralized data lake to an
ecosystem of data products that play nicely together, a data mesh.
The same principle applies to the data warehouse for business reporting and
visualization. It's simply a node on the mesh, and possibly on the consumer
oriented edge of the mesh.
I admit that though I see data mesh practices being applied in pockets at my
clients, enterprise scale adoption still has a long way to go. I don't believe
technology is the limitation here; all the tooling that we use today can
accommodate distribution and ownership by multiple teams. Particularly the
shift towards unification of batch and streaming, and tools such as Apache Beam
or Google Cloud Dataflow, easily allow processing addressable polyglot datasets.
Data catalog platforms such as Google Cloud Data Catalog provide central
discoverability, access control and governance of distributed domain datasets. A
wide variety of cloud data storage options enables domain data products to
choose fit-for-purpose polyglot storage.
The needs are real and the tools are ready. It is up to the engineers and leaders in
organizations to realize that the existing paradigm of big data, and one true big
data platform or data lake, is only going to repeat the failures of the past, just
using new cloud based tools.
This paradigm shift requires a new set of governing principles accompanied with
a new language:
Let's break down the big data monolith into a harmonized, collaborative and
distributed ecosystem of data mesh.