The “Extend” Cloud Data Pattern
As part of my re:Invent 2024 Innovation Talk, I shared three data patterns that many of our largest AWS customers have adopted. This article focuses on “Extend” which is an emerging data pattern. You can also watch this four-minute video clip on the Extend data pattern if interested.
Many companies find great success with the Aggregate or Curate data patterns. Some companies are now moving from Curate, which is presenting direct access to curated data via a table, simple API, or data marketplace for Amazon S3 objects, to Extend, which means presenting that data as a first-class service over an API. Rather than directly accessing the curated data sets, a data API acts as an intermediary, helping to control and govern the usage of the underlying data, and extends the capabilities of the curated data set. It can be used to standardize the formats for data ingestion and retrieval, which removes the need for application teams to build and own their own pipelines. Extend takes the centralized model of Curate one step further. In Curate, you have a centralized team that owns data products. In Extend, the centralized team owns not only the data as a product but also the experience of interacting with the data through an API.
Making this transition to a data API is often a substantial engineering effort, but it can provide immense value back to the business. With a rich set of capabilities in a data API, you have centralized control over data governance, security, access policies, usage pattern monitoring, rate limiting, and many other data operations. Companies like Salesforce are transforming their product experience with a data API. Raveendrnathan Loganathan, EVP of the Salesforce Data Cloud, and team are expert practitioners of building a Data API at scale and share how it has transformed their customer experience in this nine-minute talk.
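To make the idea of centralized control more concrete, here is a minimal sketch of a data API that sits in front of a curated data set and applies an access policy, a per-caller rate limit, and usage logging in one place. Everything in it is an assumption made for illustration: the DataAPI class, its field names, and the in-memory dictionary standing in for the data product are hypothetical, and a production data API would more likely run as a managed service than as an in-process class.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DataAPI:
    """Hypothetical in-process facade over a curated data set."""
    dataset: dict                    # curated records keyed by record id
    policies: dict                   # caller name -> set of fields it may read
    max_calls: int = 3               # calls allowed per caller per window
    window_seconds: float = 60.0
    usage_log: list = field(default_factory=list)
    _calls: dict = field(default_factory=dict)

    def _check_rate(self, caller, now):
        # Keep only the timestamps that still fall inside the window.
        recent = [t for t in self._calls.get(caller, [])
                  if now - t < self.window_seconds]
        if len(recent) >= self.max_calls:
            self._calls[caller] = recent
            raise PermissionError(f"rate limit exceeded for {caller}")
        recent.append(now)
        self._calls[caller] = recent

    def get(self, caller, record_id, now=None):
        """Return only the fields the caller's policy allows."""
        now = time.monotonic() if now is None else now
        allowed = self.policies.get(caller)
        if allowed is None:
            raise PermissionError(f"{caller} has no access policy")
        self._check_rate(caller, now)
        record = self.dataset[record_id]
        self.usage_log.append((caller, record_id))   # usage monitoring
        return {k: v for k, v in record.items() if k in allowed}
```

Because every read goes through get, a change to a policy or a rate limit takes effect for all consumers at once, which is the kind of central control this pattern is meant to provide.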
Using Extend as a data pattern takes commitment. Centralizing an organization of any size on a data API means that you are taking ownership of the quality, availability, and interaction model of your business's data, and you have to make sure you have the talent and resources in your workforce to do so. If you are using the Aggregate data pattern today, try moving to Curate before you make the investment into Extend. When you adopt Curate, you gain expertise in building a high-quality data product. You will need that expertise whether you stick with the Curate data pattern or try Extend. With Extend, you are building an extensive data API on top of your data products, which means you are controlling how the data is processed and used, as well as the data set quality itself.
One of the benefits of Extend is engineering leverage. You can adopt the latest technology innovations to build your data API, and those innovations benefit every user of the API. For example, your engineers can build an API that automatically enriches metadata using an LLM. Users of the data enrichment API don’t have to know how to work with LLMs themselves because that understanding is built into the API. Many types of data APIs can take advantage of foundation models, such as finding personally identifiable information in a data set or using agentic workflows to standardize data into a common schema. Many AWS customers build these types of data engineering capabilities today using Amazon Bedrock, which offers support for data automation for processing multimodal content, along with Bedrock agents, which support multiple agents collaborating on a common goal. When you leverage these technologies as part of your data API, you are bringing the benefits to your API users without requiring that the API users understand how the underlying AI works.
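As a rough illustration of that leverage, the sketch below shows an enrichment function whose callers never see how detection is done. It is a sketch under stated assumptions: the enrich function, the _metadata field, and the regex_pii_detector are hypothetical names, and the regular expressions are a deliberately simple stand-in for a foundation model call. No Amazon Bedrock API is used or implied here.

```python
import re
from typing import Callable, Dict, List

# Stand-in for a model call: two simple regexes, NOT a real LLM.
_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def regex_pii_detector(text: str) -> List[str]:
    """Return the names of the PII patterns found in the text."""
    return sorted(name for name, pat in _PATTERNS.items() if pat.search(text))


def enrich(records: List[Dict],
           detector: Callable[[str], List[str]] = regex_pii_detector) -> List[Dict]:
    """Attach PII metadata to each record without modifying the input."""
    enriched = []
    for record in records:
        found = set()
        for value in record.values():
            if isinstance(value, str):
                found.update(detector(value))
        metadata = {"pii_types": sorted(found), "contains_pii": bool(found)}
        enriched.append({**record, "_metadata": metadata})
    return enriched
```

The detector is passed in as a parameter, so the team that owns the API could swap the regex stand-in for a model-backed detector later without any change on the consumer side.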
In Curate, the data product ownership is centralized but the data interaction model is decentralized. In Extend, both the data product and its interaction model are centralized in the data API. So how do you choose between the Curate and Extend data patterns? Often, the choice depends on the culture of your organization. With Curate, different organizations might prefer to build expertise and their own interaction models when using the data. In other cases, an organization might prefer the Extend data pattern because it means that teams that use the data API don't have to deeply understand the underlying data infrastructure.
Whatever your organizational preference, factor in how data processing as a field is evolving. Data engineering organizations are already successfully taking advantage of LLMs that help with specific tasks, like generating SQL queries given a schema. In 2023, Pinterest shared that they saw a 40% productivity gain for team members working with data when they used AI to generate SQL queries to their Iceberg tables with Amazon S3.
In 2024, we have seen customers use AI agents to improve data quality and develop agentic workflows that consume data APIs. For example, Moody's, a leading provider of financial analysis tools, offers specialized data APIs like Orbis that give detailed information on over 550 million companies and entities, including company financials, corporate structures, risk scores, patents, and more. Under the hood of their data API, Moody's is using Amazon Bedrock's multi-agent collaboration to build a workflow that involves a cohort of specialized agents collaborating to synthesize insights from multiple data sources, including Moody's data product APIs and the customers' own data sources. These collaborative AI agents generate comprehensive risk reports and recommendations by analyzing volumes of disparate data, surfacing the most relevant and contextual findings from the full data landscape.
In 2025, we will see AI agents increasingly power the workflows in a data API or even use the data API itself to accomplish a goal. And in less time than you think, you can imagine giving a goal like “create a data API that does...” to an agentic workflow that then builds the data API for you, self-correcting along the way as needed.
The three data patterns of Aggregate, Curate, and Extend “stack,” which means they layer on one another. They are essentially loosely coupled systems in the distributed systems sense. Data patterns on AWS are a two-way door: you can adopt one data pattern and change it easily to another. AWS customers almost always start with Aggregate, and many customers successfully scale with this pattern. Other AWS customers move to Curate if their business wants to centralize on a subset of data products. Some customers adopt Extend if it is important to standardize both usage and data interaction through a single data API.
Because the AWS data patterns “stack,” you don’t have to rearchitect your data model or applications to change a pattern. You can move from Curate to Extend easily, or from Extend to Aggregate simply by redirecting your data consumers to different parts of your data stack. And, as I mentioned when I introduced the three data patterns, we have lots of customers who apply different data patterns to different parts of their business in a "mix-and-match" approach. With AWS, you always have choice and flexibility in your data strategy, and you know we are always innovating across AWS on new capabilities to make using data to drive your business easier.
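One way to picture the stacking is as three thin layers, each built on the one below, with consumers choosing which layer to read from. The sketch below is only an illustration under assumptions: the class names, the sample curation rule (drop rows without an amount, normalize names), and the total_by_name operation are all hypothetical and do not describe any AWS service.

```python
class AggregateLayer:
    """Raw data gathered in one place (hypothetical sample rows)."""
    def __init__(self, raw_rows):
        self.raw_rows = raw_rows

    def read(self):
        return list(self.raw_rows)


class CurateLayer:
    """A curated data product built on top of the aggregate layer."""
    def __init__(self, aggregate):
        self.aggregate = aggregate

    def read(self):
        # Curation here means: drop incomplete rows, normalize names.
        return [
            {"name": row["name"].strip().title(), "amount": row["amount"]}
            for row in self.aggregate.read()
            if row.get("amount") is not None
        ]


class ExtendLayer:
    """A data API on top of the curated product, adding an operation."""
    def __init__(self, curated):
        self.curated = curated

    def total_by_name(self):
        totals = {}
        for row in self.curated.read():
            totals[row["name"]] = totals.get(row["name"], 0) + row["amount"]
        return totals


def build_stack(raw_rows):
    """Wire the layers together; each one only knows the layer below it."""
    aggregate = AggregateLayer(raw_rows)
    curated = CurateLayer(aggregate)
    return {"aggregate": aggregate, "curate": curated,
            "extend": ExtendLayer(curated)}
```

In this sketch, moving a consumer from one pattern to another is a matter of handing it a different entry from the dictionary that build_stack returns; the lower layers stay in place and keep serving anyone who still reads from them.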