The Analytics Setup Guidebook
We are Holistics. We've been making data analytics tools for over four
years, and have helped more than a hundred companies build their business
intelligence capabilities, sometimes from scratch.
www.holistics.io
www.holistics.io/blog
We would appreciate it if you find the book helpful and spread the word about it. But please
don't share the book or use our content anywhere else without giving us appropriate
credit, or at least a link to the book!
https://www.holistics.io/books/setup-analytics/
I love that you went into more detail in the later chapters around modeling,
transformations, and providing real-world case studies like the Snowplow
case which I'd love to hear more about!
I thought ELT was just another cool kids’ jargon [...] Chapter 2 slapped me
hard in the face telling me that I was concluding too soon and I know nothing
about it.
I love the bits of personality woven in [...] It's definitely getting me excited for
my analytics degree and really looking to be versed on the technicalities.
We know how that feels, because we've been there. The truth is that
much knowledge of modern data analytics is locked up in the heads of
busy practitioners. Very little of it is laid out in a self-contained
format.
This book is suitable for technical team members who are looking
into setting up an analytics stack for their company for the very first
time.
Book Content
What this book is about, who it is for, and who it is not for.
If you're just starting out and don't need all the bells and whistles, you
might be able to get going with this very simple setup.
Next, what does a full analytics stack look like? We give you a high
level overview of a modern analytics system, and lay out the structure
of the rest of the book.
Everyone has their own biases about what a good analytics setup looks like.
Here are ours.
2.2 Understanding The Data Warehouse
Learn about the data warehouse, the central place to store and process
all your data. Understand why the modern data warehouse is at the
core of contemporary data analytics — why it's so important, and how
to pick one.
Learn more about ETL, and its more modern cousin, ELT. Learn why
we advocate for ELT over ETL in our book.
Learn how to turn raw data into clean, reliable and reusable data
components within the ELT paradigm.
3.3 Modeling Example: A Real-world Use Case
Chapter 5: Conclusion
The End
Chapter 1:
High-level Overview of
an Analytics Setup
The questions:
How can I start small but still follow best practices that help me
scale the system up easily later?
Our hope is that this book will help you answer the above questions.
This book is not about what metrics to track for your industry. It is
about how you can build an adequate system for your business to
produce those metrics in a timely manner.
The field of business intelligence has been around for about 60 years.
It is incredibly confusing. There are many vendors, fads, trends,
technologies and buzzwords in the market — and it's been this way for
most of those six decades. It is impossible to expect new data
professionals to be familiar with all that has come before, or to
identify new developments as trends that will repeat in the future.
This book will give you the bare minimum you need to orient
yourself in the contemporary BI environment. It assumes some
technical knowledge, but won't get mired in technical detail.
Our goal is to give you 'just enough so you no longer feel lost'.
This book is also not written for experienced data engineers who
manage large-scale analytics systems and want deeper knowledge
about one particular problem. If you are familiar with cloud-first
environments, you probably already know most, if not all of the
content that is covered in this book. That said, you might still find
some parts of the book useful as a refresher.
As much as we'd like to cover it, there's an entire topic on the human and
organizational aspects of business intelligence that we won't go into in
this book: questions about people, team structure, and process.
Let's start
Are you ready to read the book? If so, let's begin.
If your data comes from only one source (which is most likely your
production database) then you can skip the data loading process.
In short, your initial analytics setup can be very simple: just hook a
business intelligence tool up to the production database of your
application.
When you interact with dashboards in your BI tool, the data will be
queried live from your application database. As a result, you also
happen to be getting the data in real-time.
The highest risk you will face with the above setup is performance.
Besides your normal production workload, your database will now also
have to handle analytical queries.
Check with your dev team to see if they have a replica that you can
connect to. There's a good chance that your dev team has already set
something up.
Of course, you could do some really bizarre things like export a dump,
load that into a local database, and then query that — but most
companies we know outgrow such a workflow in a matter of weeks.
Since SQL has become the de facto standard for analytics, most BI
tools are designed to work with SQL, which limits the choice of BI
tools you may pick.
You might even get a surprise visit from your dev team for hogging
their production database.
Alright, them's the basics. Now let's talk about the vast majority of data
analytics stacks we see out there.
In this chapter, we will talk about the most common setup for an
analytics stack. Granted, you may see other data practitioners doing
certain parts of this setup differently, but if you take a step back and
squint, nearly all data analytics systems boil down to the same basic
approach.
2. You must process data: that is, transform, clean up, aggregate and
model the data that has been pushed to a central data warehouse.
In the past, this may have been a 'staging area' — that is, a random
server where everyone dumped their data. A couple of years ago,
someone had the bright idea of calling this disorganized staging area a
'data lake'. We believe that the idea is more important than the name
(and we also believe that a dump by any other name would smell just as sweet).
Chapter 2 of the book will go into more detail about this step.
Since we're talking about a big picture view in this chapter, there are
only two key components you need to understand.
This is when your raw source data is loaded into a central database.
To discuss the nuances of our approach, we shall first talk about data
consolidation in general, before discussing the pros and cons between
ETL and ELT. Yes, you're probably thinking "Wow! This sounds like a
boring, inconsequential discussion!" — but we promise you that it isn't:
down one path lies butterflies and sunshine, and down the other is
pestilence and death.
This is the place where most of your analytics activities will take place.
In this book we'll talk about:
After going through the above two concepts, what you will get at the
end of this step is:
You will have a process in place that syncs all raw data from
multiple sources (CRM, app, marketing, etc) into the central data
warehouse.
Once you have these two pieces set up, the next step is to turn your
raw data into meaningful gold for analytics.
Modeling data: apply business logic and formulae to the data
Chapter 3 goes into more detail about these two operations, and
compares a modern approach (which we prefer) to a more traditional
approach that was developed in the 90s. Beginner readers take note:
usually, this is where you'll find most of the fun — and complexity! —
of doing data analytics.
At the end of this step, you'll have a small mountain of clean data that's
ready for analysis and reporting to end users.
Most people think of this step as just being about dashboarding and
visualization, but it involves quite a bit more than that. In this book
we'll touch on a few applications of using data:
Data exploration: how letting end users freely explore your data
lightens the load on the data department.
Since this step involves the use of a BI/visualization tool, we will also
discuss:
Alright! You now have an overview of this entire book. Let's take a
brief moment to discuss our biases, and then let's dive into data
consolidation, in Chapter 2.
We think that SQL-based analytics will win over non-SQL-based
analytics.
Some of these terms might not be clear to you right now, but we will
clarify what each of these terms and statements mean as we go deeper
into the book.
Onward!
Chapter 2:
Centralizing Data
In this chapter, we will walk you through the basics of the data
consolidation process.
If you are a company that does most of its business through a website
or application (like Amazon, for instance), this is the data that exists
within your main app. Such application databases typically contain all
the transactional data that is critical to your business operations.
For example:
The data loading process is the work of extracting data from multiple
sources and loading them into your data warehouse. This process is
often referred to as Extract & Load (EL). Written as a naive script, it
might look something like this:
source_db = connect_to_source_db();
dest_db = connect_to_dest_datawarehouse();

# Extract step
source_records = source_db.query(
  "SELECT id, email, user_id, listing_id, created_at FROM bookings"
);
# Load step: write the extracted records into the warehouse table
dest_db.insert("bookings", source_records);  # insert() here is pseudocode
Over time, the human cost to maintain your scripts will far outweigh
the actual value they bring you.
The good news is that there are a large number of free and paid data
loading tools in the market. These tools often behave in a plug-and-
play manner. They provide a user-friendly interface to connect to
your data sources and data storage, set loading intervals and load
modes, and they likely also deliver reports about all your data loading
jobs.
These data load tools are commonly known as ETL tools. This might
confuse beginner readers, though, since most of the modern tools we
look at don't do the Transform step, and instead ask you to use a
dedicated transformation tool.
That isn't to say that you need such sophisticated tools all the time,
though. Basic data loading capabilities are also often bundled into data
analytics platforms.
Anyway, let's wrap up. These are some proprietary data loading tools
on the market for you to consider:
Alooma
HevoData
StitchData
Talend
Pentaho
And here are a couple of great open source options (though — as with
all things open source, caveat emptor):
Prefect
Airflow
Meltano
If you still need to be convinced that writing your own data load &
ETL scripts is a bad idea, check out this great article by Jeff
Magnusson, who wrote it in his capacity as the VP of Data Platform at
Stitch Fix.
Common Concepts
Let's revisit the earlier example of loading bookings data, but this
time, let's look at how to run this load incrementally.
We can see that when an incremental loading job runs, only data for
2020-01-04 will be queried and copied over to the new table.
source_db = connect_to_source_db();
dest_db = connect_to_dest_datawarehouse();
# ...then extract only the records that are new since the last run, and
# append them to the warehouse table.
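The core idea, sketched as two plain SQL steps (the table and column names follow the bookings example above; the surrounding orchestration code is omitted, and the literal timestamp is only an illustration):

-- Step 1: run on the data warehouse to find the newest record already loaded.
SELECT max(created_at) FROM bookings;

-- Step 2: run on the source database, plugging in the value from step 1,
-- then append the returned rows to the warehouse's bookings table.
SELECT id, email, user_id, listing_id, created_at
FROM bookings
WHERE created_at > '2020-01-03 23:59:59';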
How much performance gain does incremental load get you? A lot.
Imagine that you have 100M booking records and that those records
are growing at a pace of 10,000 records a day: an incremental load only
has to copy the 10,000 new records each run, instead of re-copying all
100M records.
It's also important to note that unless your use case absolutely requires
it, real-time data is not very important for most business analytics. To
understand why, think about this: if you want to view sales data over
the last seven weeks, is it really necessary for the data to be accurate up
to the minute you're requesting it?
Most business use cases just need a daily refresh of analytics data.
Summary
There are three different types of source data systems: application
data, third-party data and manual data.
To store and process data for analytics, you need a data warehouse.
We went over what the Extract & Load process looks like in
practice, and recommended that you use off-the-shelf tools. We also
talked about how incremental load can help you increase the
performance of your EL process.
In the next section, we talk about the next logical piece of the puzzle:
understanding the data warehouse.
Why do you need one? You will need a data warehouse for two main
purposes:
However, if you are still not sure if a data warehouse is the right thing
for your company, consider the below pointers:
For example, the vast majority of BI tools do not work well with
NoSQL data stores like MongoDB. This means that applications
that use MongoDB on the backend need their analytical data to be
transferred to a data warehouse, in order for data analysts to work
effectively with it.
If you answered yes to any of the above questions, then chances are
good that you should just get a data warehouse.
That said, in our opinion, it's usually a good idea to just go get a data
warehouse, as data warehouses are not expensive in the cloud era.
Amazon Redshift
Google BigQuery
Snowflake
ClickHouse (self-hosted)
Presto (self-hosted)
BigQuery is free for the first 10GB storage and first 1TB of queries.
After that it's pay-per-usage.
(Note: we don't have any affiliation with Google, and we don't get paid
to promote BigQuery).
"Hey isn't a data warehouse just like a relational database that stores
data for analytics? Can't I just use something like MySQL, PostgreSQL,
MSSQL or Oracle as my data warehouse?"
SELECT
category_name,
count(*) as num_products
FROM products
GROUP BY 1
(The above query scans the entire products table to count how many
products are there in each category)
Each query is heavy and takes a long time (minutes, or even hours)
to finish
What are the differences in architecture you ask? This will take a
dedicated section to explain, but the gist of it is that analytical
databases use the following techniques to guarantee superior
performance:
If you're just starting out with a small set of data and few analytical use
cases, it's perfectly fine to pick a normal SQL database as your data
warehouse (the most popular ones are MySQL, PostgreSQL, MSSQL and
Oracle). If you're relatively big, with lots of data, you still can, but it will
require proper tuning and configuration.
That said, with the advent of low-cost data warehouses like BigQuery
and Redshift mentioned above, we would recommend that you go ahead
with a proper data warehouse.
Summary
In this section, we zoomed in on the data warehouse and spoke about:
Do note that some of this will make more sense after you read
Transforming Data in the ELT paradigm (in chapter 3).
In this process, an ETL tool extracts the data from different data
source systems, transforms the data by applying calculations,
concatenations, and the like, and finally loads the data into the data
warehouse.
The key things to note here are that raw data is transformed outside of
the data warehouse, usually with the help of a dedicated "staging
server", and that only transformed data is loaded into the warehouse.
The key things to note here are that raw data is transformed inside the
data warehouse, without the need for a staging server, and that your
data warehouse now contains both raw data and transformed data.
In this context, the ETL model made perfect sense: raw data was
properly transformed in a staging server (or ETL pipelining tool)
before being loaded into your ridiculously expensive data warehouse.
The volume of data that was handled by such tools back then was
relatively small, and thus manageable for most staging servers.
As data sizes increased, the ETL approach became more and more
problematic. Specifically, the staging server — that is, the machine
that orchestrated all the loading and transforming of data — began
to be a bottleneck for the rest of the stack.
And finally, we saw the rise of lean and agile software development
practices. Such practices meant that people began to expect more
from their data departments, the same way that they were used to
quick execution speeds in their software development teams.
And so at some point, people began to realize: the cost of storing and
processing data had become so cheap, it was now a better idea to just
dump all your data into a central location, before applying any
transformations.
And thus lay the seed that grew into the ELT approach.
With proper transform and modeling tools, ELT did not require
data engineers to be on standby to support any transformation
request from the analytics team. This empowered data analysts,
and increased execution speed.
Pros/Cons

ETL: The data warehouse only contains cleaned, transformed data, which
maximizes utilization of the warehouse. It doesn't work well when data
volume increases, since the staging server becomes a bottleneck. Changing
the process usually takes weeks or months, due to the waterfall approach.

ELT: All data is stored in the cloud data warehouse, which makes it very
easy to change to a new data warehouse, and no additional staging servers
are needed. Assuming a modern data warehouse, it works well when data
volume increases. It takes only days to transform or introduce new data.
A data lake is a fancy term for a central staging area for raw data. The
idea is to have everything in your organization dumped into a central
lake, before loading it into your data warehouse. Unlike data
warehouses (which we have talked about extensively in our discussion
of ELT, above), lakes are often object storage buckets into which you
may dump raw files of almost any format.
Summary
To conclude, when you are picking analytics tools, ask yourself: does
this tool assume an ETL approach, or does it assume an ELT
approach? Anything that requires a data transformation step outside
the data warehouse should set off alarm bells in your head; it means
that it was built for the past, not the future.
Pick ELT. As we will soon see in Chapter 3, ELT unlocks a lot more
than just the operational advantages we've talked about above.
just one transform and is reused in multiple reports. This helps you
avoid the scenario where two different data reports produce two
different numbers for the same metric.
Our raw bookings data would look like this after being loaded into the
data warehouse:
Table bookings {
  id integer [pk]
  email varchar
  country_code varchar
  platform varchar
  user_id integer
  listing_id integer
  created_at timestamp
}

Table countries_code {
  country_code varchar [pk]
  country_name varchar
}
So to implement this, the code for the transform job will be:
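A sketch of what transforms/bookings_daily.sql might contain (the exact operational boilerplate varies from setup to setup; the column names follow the tables above):

-- transforms/bookings_daily.sql
BEGIN;
DROP TABLE IF EXISTS bookings_daily;
CREATE TABLE bookings_daily AS
  SELECT
    B.created_at::date AS date_d,
    C.country_name,
    B.platform,
    count(*) AS total
  FROM bookings B
  LEFT JOIN countries_code C ON B.country_code = C.country_code
  GROUP BY 1, 2, 3;
COMMIT;

You can then schedule this file to run against the warehouse, for example with: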
$ psql -f transforms/bookings_daily.sql
In the above example, the main transform logic is only within the
SELECT statement at the end of the code block. The rest is considered
metadata and operational boilerplate.
Besides using an SQL table to store the transform results, we may also
opt to create a database view (which means we store the definition
only), or we can create a materialized view for it.
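In plain SQL terms, the three options look roughly like this (the column choices are illustrative, and the exact syntax varies slightly between warehouses):

-- Option 1: persist the results as a regular table (as in the job above).
CREATE TABLE bookings_daily AS
  SELECT created_at::date AS date_d, platform, count(*) AS total
  FROM bookings
  GROUP BY 1, 2;

-- Option 2: store only the definition; the query re-runs every time the
-- view is read.
CREATE VIEW bookings_daily_v AS
  SELECT created_at::date AS date_d, platform, count(*) AS total
  FROM bookings
  GROUP BY 1, 2;

-- Option 3: store the definition plus a physical copy of the results that
-- can be refreshed on demand (e.g. REFRESH MATERIALIZED VIEW in PostgreSQL).
CREATE MATERIALIZED VIEW bookings_daily_mv AS
  SELECT created_at::date AS date_d, platform, count(*) AS total
  FROM bookings
  GROUP BY 1, 2;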
For example, the below screenshots show how this is done using
Holistics:
For example, let's say that you have two transform jobs: one to
calculate sales revenue, and the other to calculate sales commissions
based on revenue. You will want the commissions job to be run only
after the revenue calculation is done.
Each node inside the data warehouse represents a table, with the
left column (A, B, C, D) being tables loaded from source systems
into the data warehouse.
You can clearly see that job E should run after job A, B, C have
finished, while job I should run only after both D and F finish. This
is the dependency property of a DAG workflow in action.
In practice, most data transformation tools will have support for DAG
workflows. This is especially true for classical ETL tools from the older
era of data warehousing.
The main drawback of this approach is that the majority of the load is
now on the single computer that runs the script (which has to process
millions of data records).
This worked well when data warehouses were slow and expensive. It
also worked well at a time when data volumes were comparatively low.
Given these restrictions, data professionals would look for ways to
offload all processing outside of the data warehouse, so that they may
only store cleaned, modeled data in the warehouse to cut down on
costs.
You may read more about this in the Consolidating data from source
systems section of our book.
In the above diagram, you can see that when the transform job runs,
only data for 2020-01-04 will be processed, and only two new records
will be appended to the bookings_daily table.
How much cost does this save us? Quite a bit, as it turns out.
Imagine that your bookings table has 100M records and is growing at a
pace of 10,000 records a day: an incremental transform only has to
process the 10,000 new records each run, instead of re-aggregating all
100M. To make the transform incremental, the job definition gains an
incremental setting:
destination: bookings_daily
incremental:
  enabled: true
  column: date_d
---
SELECT
  ts::date as date_d,
  C.country_name,
  platform,
  count(*) as total
FROM bookings B
LEFT JOIN countries C ON B.country_code = C.country_code
WHERE [[ ts::date > {{max_value}} ]] -- this is added to the code
GROUP BY 1, 2, 3
The [[ ts::date > {{max_value}} ]] condition is added so that the tool will pull the
latest value of the incremental column from the destination table and
substitute it into the SQL query. With this, only newer data is
materialized into the destination table.
When your old transformed data keeps changing, and would need
to be reloaded
Chapter 3:
Data Modeling for Analytics
If you don't have any experience with data modeling, then buckle in.
We're going to take you on a ride. Let's get started.
When the CEO has a question about data, she goes to the data
analyst and asks: "Hey Daniel, can you help me get the sales
commission numbers for bookings in this region?"
Daniel listens to the CEO's request, goes to his computer, and comes
back with the data, sometimes in the form of a short written note,
sometimes as a spreadsheet.
"She doesn't know how to do it", you might say, or "This is serious
technical stuff". These are common responses that you might get
when you pose this question.
The CEO (or any other business user, for that matter) thinks in
business terms, using business logic. Your actual data, on the other
hand, is stored in a different format. It follows different rules — often
rules imposed by the implementation of the application.
The CEO can't translate her mental model of the business into code in
order to run the numbers. She doesn't know how the data is
organized. But Daniel does.
For example: when asking about sales commissions, the CEO will
think "sales commissions is 5% of closed deals". However, the data
analyst will think "closed deals are stored in table closed_deals , so I
need to take the amount column and multiply that with months to
figure out the final amount; oh, and I need to check the
payment_received column to make sure that only the received payment
is counted".
He runs the query, exports the results to Excel, formats them, and then
sends the file over to the CEO.
This process works fine for some companies. But it will not scale up
beyond a few people, and is an incredibly inefficient way to do things.
Why? Well:
Every time the CEO wants something, she needs to wait hours (or
even days) for Daniel to crunch the numbers and get back to her.
At one point, Daniel might be too busy crunching out numbers for
different business stakeholders, instead of focusing his time on
more valuable, long-term impact work.
These tools include Holistics, dbt, Dataform and Looker.
These tools share a couple of similar characteristics:
There isn't a good name for such tools right now. For the sake of
convenience, we will call them 'data modeling layer' tools.
Conceptually, they present a 'data modeling layer' to the analytics
department.
The CEO can just log in to the BI application, ask questions and get
the right numbers that she needs, without waiting for the data analyst
to crunch the numbers for her.
Now that we know what data modeling is at a high level and why it's
important, let's talk about specific concepts that exist in the data
modeling layer paradigm.
These are the basic database tables that you pulled into your data
warehouse.
Now, let's look at a few modeling operations that you can apply to the
above data.
We'll be using Holistics for the examples below. That said, these
concepts map pretty well across any data modeling layer-type tool in
the market. (We'll tell you if we're introducing an idea that is specific
to Holistics, with no clear external analogs).
The most common types of metadata at the level of a data model are
textual descriptions, calculated dimensions that capture some business
logic, and relationship mappings between a model and some other
model (or table). Data modeling layers often also include
housekeeping metadata alongside the data model, such as a full
history of user accounts who have modified the model, when that
model was first created and last modified, when the underlying data
was last refreshed, and so on.
While database tables hold data, data models often contain metadata to
provide extra context for that data.
This is useful because you may want to create models that derive from
other models, instead of deriving from just underlying database tables.
We have defined two custom fields here. The first is the sum of nights
stayed from a successful booking (that is, a booking that has been seen
to check out). If the booking has reached the 'checked out' state, then
the nights are counted. Otherwise, it returns 0.
The second is a calculated field that returns the total number of guests
per booking. This latter field is a simple sum, because the
homestay.bookings table stores the number of children, babies, and
adults in separate columns.
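Written as plain SQL expressions, the two custom fields would look something like this (the column names here are our assumptions, based on the description above):

SELECT
  id,
  -- nights only count once the booking has reached the 'checked out' state
  CASE WHEN status = 'checked out' THEN nights ELSE 0 END AS nights_stayed,
  -- total guests is a simple sum, since guest counts sit in separate columns
  n_adults + n_children + n_babies AS total_guests
FROM homestay.bookings;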
We have then taken the two models, and created a new transformed
model on top of them.
This is powerful for two reasons. First, all of this has happened via
SQL. We do not need to wait for data engineering to set up a new
transformation pipeline; we can simply ask an analyst to model the
data within the Holistics data modeling layer, by writing the following
SQL:
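The shape of that SQL is just a simple join between the two underlying models; a sketch, with illustrative model and column names:

SELECT
  b.id AS booking_id,
  b.created_at::date AS booking_date,
  b.nights * l.nightly_price AS revenue   -- pricing column is an assumption
FROM bookings b
LEFT JOIN listings l ON b.listing_id = l.id;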
Second, the fact that our model exists as a simple join means that our
bookings_revenue model will be updated whenever the two underlying
models (or their underlying tables) are updated!
Our analysts do this until they have a tapestry of data models that
represent everything our business people would ever want to know
about our company. Then, creating a report simply becomes an issue
of picking the right models to present to the business.
What happens behind the scenes is this: the CEO's selection will be
translated into a corresponding SQL query (thanks to the modeling
layer), and this query will be sent to the analytical database. The
retrieved results will then be displayed for the CEO to view.
Summary
So what have we covered? We've looked at data modeling in the ELT
paradigm. The modern approach to data modeling in this paradigm is
to use what we call a 'data modeling layer', though this is a name that
we've adopted out of convenience. Such tools include Dataform, dbt,
Looker, and Holistics itself.
We then discussed several ideas that exist within this approach to data
modeling:
In our next section, we shall talk about how this approach to data
modeling works when it is combined with the classical method of
dimensional data modeling.
Why Kimball?
There are many approaches to data modeling. We have chosen to
focus on Kimball's because we think his ideas are the most widespread,
and therefore the most resonant amongst data professionals. If you
hire a data analyst today, it is likely that they will be familiar with the
ideas of dimensional data modeling. So you will need to have a handle
on the approach to work effectively with them.
We think that many of these approaches are valuable, but that all of
them are in need of updates given the rapid progress in data
warehousing technology.
A fact table, which acts as the primary table for the schema. A fact
table contains the primary measurements, metrics, or 'facts' of a
business process.
These dimensional tables are said to 'surround' the fact table, which is
where the name 'star schema' comes from.
Let's say that you're running a store, and you want to model the data
from your Point of Sales system. A naive approach to this is to use
your order transaction data as your fact table. You then place several
dimension tables around your order table — most notably products
and promotions. These three tables are linked by foreign keys — that
is, each order may reference several products or promotions stored in
their respective tables.
This basic star schema would thus look something like this:
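Sketched as table definitions (the column names are illustrative, not prescriptive):

-- Dimension tables: descriptive attributes that 'surround' the fact table.
CREATE TABLE dim_date (
  date_key       INT PRIMARY KEY,   -- e.g. 20200104
  full_date      DATE,
  fiscal_year    INT,
  is_holiday     BOOLEAN,
  selling_season VARCHAR            -- e.g. 'Thanksgiving', 'Christmas'
);

CREATE TABLE dim_product (
  product_key  INT PRIMARY KEY,
  product_name VARCHAR,
  department   VARCHAR
);

CREATE TABLE dim_promotion (
  promotion_key  INT PRIMARY KEY,
  promotion_name VARCHAR,
  discount_pct   NUMERIC(5, 2)
);

-- Fact table: one row per order transaction, holding foreign keys to the
-- dimensions plus the numeric measurements of the business process.
CREATE TABLE fact_orders (
  order_id      INT,
  date_key      INT REFERENCES dim_date (date_key),
  product_key   INT REFERENCES dim_product (product_key),
  promotion_key INT REFERENCES dim_promotion (promotion_key),
  quantity      INT,
  amount        NUMERIC(12, 2)
);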
Notice how our fact table will grow very quickly over time, as we may
see hundreds of orders per day. By way of comparison, our products
table and promotions table would contain far fewer entries, and would
be updated at a frequency much lower than the fact table.
But the star schema is only useful if it is easily applicable within your
company. So how do you come up with a star schema for your
particular business?
2. Decide on the grain. The grain here means the level of data to store
as the primary fact table. It should be the most atomic level
possible — that is, a level of data that cannot be split further. For
instance, in our Point of Sales example earlier, the grain should
actually be the line items inside each order, instead of the order
itself. This is because in the future, business users may want to ask
questions like "what are the products that sold the best during the
day in our stores?" — and you would need to drop down to the
line-item level in order to query for that question effectively. In
Kimball's day, if you had modeled your data at the order level, such
a question would have taken a huge amount of work to answer,
because you would have been running the query on slow database
systems (we sketch such a line-item-level query after this list).
3. Choose the dimensions that apply to each fact table row. This is
usually quite easy to answer if you have 'picked the grain' properly.
Dimensions fall out of the question "how do business people
describe the data that results from the business process?" You will
decorate fact tables with a robust set of dimensions representing all
possible descriptions.
4. Identify the numeric facts that will populate each fact table row.
The numeric data in the fact table falls out of the question "what
are we answering?" Business people will ask certain obvious
business questions (e.g. what's the average profit margin per
product category?), and so you will need to decide on what are the
most important numeric measures to store at the fact table layer, in
order to be recombined later to answer their queries. Facts should
be true to the grain defined in step 2; if a fact belongs to a different
grain, it should live in a separate fact table.
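For instance, the question from step 2 ("what are the products that sold the best during the day in our stores?") falls out naturally once the fact table sits at the line-item grain. A sketch, assuming a fact_order_line_items table carrying the same dimension keys as the schema above:

SELECT
  d.full_date,
  p.product_name,
  sum(f.quantity) AS units_sold
FROM fact_order_line_items f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY 1, 2
ORDER BY units_sold DESC;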
Notice how the dimension tables are oriented out from around the
fact table. Note also how fact tables consist of foreign keys to the
dimensional tables, and also how 'numeric facts' — fields that can be
aggregated for business metric purposes — are carefully chosen at the
line item fact table.
This might be surprising to you. Why would you have something like
a date dimension, of all things? The answer is to make things easier to
query for the business user. Business users might like to query in
terms of fiscal year, special holidays, or selling seasons like
Thanksgiving and Christmas. Since these concepts aren't captured in
the date field of an RDBMS system, we need to model date as an
explicit dimension.
This short example gives you all the flavor of dimensional data
modeling. We can see that:
4. The star schema works well given the performance constraints that
Kimball worked with. Remember that memory was relatively
expensive during Kimball's time, and that analytical queries were
either run on top of RDBMSes, or exported into OLAP cubes. Both
approaches benefited from a well-structured dimensional data
model.
You will notice that this setup is vastly more complicated than our
approach. Why is this the case?
Again, the answer lies in the technology that was available at the time.
Databases were slow, computer storage was expensive, and BI tools
needed to run on top of OLAP cubes in order to be fast. This
demanded that the data warehouse project be composed of a number
of separate data processing steps.
Today, things are much better. Our approach assumes that you can do
away with many elements of Kimball's approach.
Kimball then demonstrates that data analysis can happen using the
aggregated snapshot tables, and only go down to the inventory fact
table for a minority of queries. This helps the business user because
running such queries on the full inventory table is often a
performance nightmare.
With a modern data warehouse, you can throw out the entire chapter
on snapshot techniques and still get relatively good results.
(Yes, we can hear you saying "but snapshotting is still a best practice!"
— the point here is that it's now an optional one, not a hard must.)
The simplest strategy you may adopt is what Kimball calls a 'Type 1'
response: you update the dimension naively. This is what has
happened above. The good news is that this response is simple. The
bad news is that updating your dimension tables this way will mess up
your old reports.
The first, 'Type 1', is to update the dimension column naively. This
approach has problems, as we've just seen.
The second, 'Type 2', is to add a new row to your product table, with a
new product key. This looks as follows:
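A sketch of the Type 2 response in SQL (only the two product keys come from this example; the column names and other values are illustrative):

-- Keep the old row (product_key = 12345) untouched; add a new row with a new
-- surrogate key that carries the updated department.
INSERT INTO dim_product (product_key, product_name, department)
VALUES (25984, 'Acme Widget', 'New Department');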
With this approach, all new orders in the fact table will refer to the
product key 25984, not 12345. This allows old reports to return the
same numbers.
The final approach, 'Type 3', is to add a new column to the dimension
table to capture the previous department. This setup supports the
ability to view an 'alternate reality' of the same data. The setup thus
looks like this:
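And a sketch of the Type 3 response (again, the specific values are illustrative):

-- Add a column that preserves the previous value, then overwrite the current one.
ALTER TABLE dim_product ADD COLUMN previous_department VARCHAR;

UPDATE dim_product
SET previous_department = department,   -- the old value is kept in the new column
    department          = 'New Department'
WHERE product_key = 12345;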
The key insight here is that storage is really cheap today. When
storage is cheap, you can get away with 'silly' things like partitioning
every dimension table every day, in order to get a full history of slowly
changing dimensions.
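In its crudest form, that can be as simple as one scheduled statement per day (the naming convention here is just an illustration):

-- Snapshot the whole dimension once a day; cheap storage makes this viable.
CREATE TABLE dim_product_snapshot_2020_01_04 AS
SELECT * FROM dim_product;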
Let's give credit where credit is due: Kimball's ideas around the star
schema, his approach of using denormalized data, and the notion of
dimension and fact tables are powerful, time-tested ways to model
data for analytical workloads. We use it internally at Holistics, and we
recommend you do the same.
We think that the question isn't: 'is Kimball relevant today?' It's clear to
us that the approach remains useful. The question we think is worth
By this we mean that you should model when you have to.
Start with generating reports from the raw data tables from your
source systems — especially if the reports aren't too difficult to create,
or the queries not too difficult to write. If they are, model your tables
to match the business metrics that are most important to your users —
without too much thought for future flexibility.
developed his ideas. The Data Warehouse Toolkit was written at a time
when one had to create new ETL pipelines in order to change the
shape of one's data models. This was expensive and time consuming.
This is not the case with our approach: because we recommend that
you centralize your raw data within a data warehouse first, you are
able to transform them into new tables within the same warehouse,
using the power of that warehouse.
This is even easier when coupled with tools that are designed for this
paradigm.
What are some of these tools? Well, we've introduced these tools in the
previous section of the book. We called these tools 'data modeling
layer tools', and they are things like Holistics, dbt, and Looker.
These tools then generate the SQL required to create new data models
and persist them into new tables within the same warehouse. Note
how there is no need to request data engineering to get involved to set
up (and maintain!) external transformation pipelines. Everything
happens in one tool, leveraging the power of the underlying data
warehouse.
project. With 'data modeling layer tools', you no longer need data
engineering to get involved — you may simply give the task of
modeling to anyone on your team with SQL experience. So: do it 'just-
in-time', when you are sure you're going to need it.
Data architects trained in the old paradigm are likely to balk at this
approach. They look at potential cloud DW costs, and gasp at the extra
thousands of dollars you might have to pay if you push the heavy-
lifting to the data warehouse. But remember this: it is usually far more
costly to hire an extra data engineer than it is to pay for the marginal
cost of DW functionality. Pushing BigQuery to aggregate terabytes of
data might cost you an extra 1000 dollars of query time a month. But
hiring an extra data engineer to set up and maintain a pipeline for you
is going to cost many times more than that, especially if you include
the full cost of employee benefits.
Conclusion
Think holistically about your data infrastructure. The best companies
we work with do more with fewer people. They use the power of their
data warehouses to increase the impact of the people they have, and
choose to hire data analysts (who create reusable models) over data
engineers (who create extra infra).
1. We want to give you a taste of what it's like to model data using a
data modeling layer tool. Naturally, we will be using Holistics, since
that is what we use internally to measure our business. But the
general approach we present here is what is important, as the ideas
we apply are similar regardless of whether you're using Holistics, or
some other data modeling layer tool like dbt or Looker.
By the end of this segment, we hope to convince you that using a data
modeling layer-type tool along with the ELT approach is the right way
to go.
The Problem
In the middle of 2019, we began to adopt Snowplow as an alternative
to Google Analytics for all our front-facing marketing sites. Snowplow
is an open-source data delivery platform. It allows us to define and
record events for any number of things on https://www.holistics.io/ —
Our Snowplow data pipeline captures and delivers such event data
to BigQuery. And our internal Holistics instance sits on top of this
BigQuery data warehouse.
Note that there are over 130 columns in the underlying table, and
about 221 fields in the data model. This is a large fact table by most
measures.
Our data team quickly realized two things: first, this data was going to
be referenced a lot by the marketing team, as they checked the
performance of our various blog posts and landing pages. Second, the
cost of processing gigabytes of raw event data was going to be
significant given that these reports would be assessed so regularly.
Notice a few things that went into this decision. In the previous section
on Kimball data modeling we argued that it wasn't strictly necessary to
write aggregation tables when working with large fact tables on
modern data warehouses. Our work with the Snowplow data
happened within BigQuery — an extremely powerful MPP data
warehouse — so it was actually pretty doable to just run aggregations
off the raw event data.
with
page_view_stats as (
select
{{#e.domain_userid}}
, {{#e.domain_sessionidx}}
, {{#e.session_id}} as session_id
, wp.id as page_view_id
, {{#e.app_id}}
, min({{#e.derived_tstamp}}) as pv_start_at
, max({{#e.derived_tstamp}}) as pv_stop_at
, timestamp_diff(max({{#e.derived_tstamp}}), min({{#e.derived_tstamp}}),
second) as time_spent_secs
, timestamp_diff(max({{#e.derived_tstamp}}), min({{#e.derived_tstamp}}),
second) / 60 as time_spent_mins
unnest({{#e.contexts_com_snowplowanalytics_snowplow_web_page_1_0_0}})
as wp
left join {{#internal_visitors iv}} on {{#e.domain_userid}} =
{{#iv.domain_userid}}
where {{#e.app_id}} in ('landing', 'docs', 'blog')
and cast(derived_tstamp as date) >= '2019-07-30'
and {{#iv.domain_userid}} is null
group by 1, 2, 3, 4, 5
)
, session_stats as (
select
p.domain_userid
, p.session_id as session_id
, min(p.pv_start_at) as session_started_at
, cast(min(p.pv_start_at) as date) as session_started_date
, sum(time_spent_mins) as session_time_mins
, round(sum(time_spent_mins) / 60) as session_time_hours
from page_view_stats p
group by 1, 2
)
, visitor_stats as (
select
p.domain_userid
, cast(min(case when app_id in ('landing', 'docs') then p.pv_start_at else null
end) as date) as first_visited_landing_date
, cast(min(case when app_id = 'blog' then p.pv_start_at else null end) as date)
as first_visited_blog_date
from page_view_stats p
group by 1
)
select
{{#e.app_id}}
, {{#e.domain_userid}}
, vs.first_visited_landing_date
, vs.first_visited_blog_date
, {{#e.domain_sessionidx}}
, {{#e.session_id}} as session_id
, ss.session_started_at
, ss.session_started_date
, {{#e.mkt_source_grouping}} as mkt_source_grouping
, {{#e.utm_source_grouping}} as utm_source_grouping
, {{#e.utm_referrer_grouping}} as utm_referrer_grouping
, {{#e.mkt_medium}}
, {{#e.mkt_campaign}}
, {{#e.os_timezone}}
, {{#e.geo_country}}
, {{#e.referrer_grouping}} as referrer_grouping
, {{#e.page_urlhost}}
, {{#e.page_urlpath}}
, concat({{#e.page_urlhost}}, coalesce({{#e.page_urlpath}}, '/')) as page
, {{#e.page_grouping}} as page_grouping
, wp.id as page_view_id
, pvs.pv_start_at
, pvs.pv_stop_at
, coalesce(pvs.max_scroll_depth_pct, 0) as max_scroll_depth_pct
, pvs.time_spent_secs as time_on_page_secs
, pvs.time_spent_mins as time_on_page_mins
-- Actions aggregation
, {{#e.count_click_vid_how_holistics_works}} as
count_click_video_how_holistics_works
, {{#e.count_submit_demo_email}} as count_submit_demo_email
, {{#e.count_book_demo}} as count_book_demo
, {{#e.count_submit_trial_email}} as count_submit_trial_email
, {{#e.count_request_trial}} as count_request_trial
from {{#snowplow_holistics e }}
, unnest(
{{#e.contexts_com_snowplowanalytics_snowplow_ua_parser_context_1_0_0}})
as ua
left join unnest(
{{#e.contexts_com_snowplowanalytics_snowplow_web_page_1_0_0}}) as wp
where
{{#e.app_id}} in ('landing', 'docs', 'blog')
and {{#e.event}} != 'page_ping'
and cast({{#e.derived_tstamp}} as date) >= '2019-07-30'
and {{#e.is_test}} = false
and {{#iv.domain_userid}} is null
-- Remove bots
and not regexp_contains(ua.useragent_family,'(?i)
(bot|crawl|slurp|spider|archiv|spinn|sniff|seo|audit|survey|pingdom|worm|capture|
(browser|screen)shots|analyz|index|thumb|check|YandexBot|Twitterbot|a_archiver|
facebookexternalhit|Bingbot|Googlebot|Baiduspider|360(Spider|User-agent))')
and coalesce(regexp_contains( {{#e.refr_urlhost}}, 'seo'), false ) is false
and {{#e.page_urlhost}} != 'gtm-msr.appspot.com'
and ({{#e.refr_urlhost}} != 'gtm-msr.appspot.com' or {{#e.refr_urlhost}} is null)
group by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29
Within the Holistics user interface, the above query generated a model
that looked like this:
We could also sanity check the data lineage of our new model, by
peeking at the dependency graph generated by Holistics:
Our reports were then switched over to this data model, instead of the
raw event fact table that they used earlier. The total time taken for this
a session in which the visitor did not scroll down the page, or
a session in which the visitor scrolled down but spent less than 20
seconds on the page.
The solution our data team settled on was to create a new data model
— one that operated at a higher grain than the pageview model. We
wanted to capture 'sessions', and build reports on top of this session
data.
And the SQL used to generate this new data model from the pageview
model was as follows (again, skim it, but don't worry if you don't
understand):
#standardsql
with
session_ts as (
select
{{#s.domain_userid}}
, {{#s.domain_sessionidx}}
, {{#s.session_id}} as session_id
, min( {{#s.session_started_at}}) as session_started_at
, max( {{#s.pv_stop_at}}) as session_latest_ts
from {{#session_pages_aggr s}}
group by 1, 2, 3
)
, first_page as (
select * from (
select
{{#p1.domain_userid}}
, {{#p1.domain_sessionidx}}
, {{#p1.session_id}}
, {{#p1.mkt_source_grouping}} as mkt_source
, {{#p1.mkt_medium}}
, {{#p1.mkt_campaign}}
, {{#p1.page_urlhost}} as first_page_host
, {{#p1.page}} as first_page
, {{#p1.page_grouping}} as first_page_grouping
, {{#p1.refr_urlhost}} as first_referrer_host
, {{#p1.referrer}} as first_referrer
, {{#p1.referrer_grouping}} as first_referrer_grouping
select
{{#p.domain_userid}}
, {{#p.domain_sessionidx}}
, {{#p.session_id}}
, st.session_started_at
, st.session_latest_ts
, cast(st.session_started_at as date) as session_started_date
, fp.mkt_source
, fp.mkt_medium
, fp.mkt_campaign
, first_referrer_host
, first_referrer
, first_referrer_grouping
, first_page_host
, first_page
, first_page_grouping
, {{#p.count_pageviews}} as count_pageviews
, {{#p.count_unique_pages_viewed}} as count_unique_pages_viewed
, {{#p.count_pages_without_scroll}} as count_pages_without_scroll
, {{#p.sum_time_on_page_secs}} as total_session_time_secs
, {{#p.sum_time_on_page_mins}} as total_session_time_mins
-- demo
, sum({{#p.count_submit_demo_email}}) > 0 as submitted_demo_email
, sum( {{#p.count_book_demo}}) > 0 as booked_demo
-- trial
, sum({{#p.count_submit_trial_email}}) > 0 as submitted_trial_email
, sum({{#p.count_request_trial}}) > 0 as requested_trial
And in the Holistics user interface, this is what that query looked like
(note how certain fields were annotated by our data analysts; this
made it easier for marketing staff to navigate in our self-service UI
later):
With this new session data model, it became relatively easy for our
analysts to create new reports for the marketing team. Their queries
were now very simple SELECT statements from the session data
model, and contained no business logic. This made it a lot easier to
create and maintain new marketing dashboards, especially since all the
hard work had already been captured at the data modeling layer.
Data modeling serves a similar purpose for us. We don't think it's very
smart to have data analysts spend all their time writing new reports for
business users. It's better if their work could become reusable
components for business users to help themselves.
On the leftmost column are the fields of the models collected within
the data set. These fields are usually self-describing, though analysts
take care to add textual descriptions where the field names are
ambiguous.
Takeaways
What are some lessons we may take away from this case study? Here
are a few that we want to highlight.
Is this ideal? No. But is it enough for the reports that our marketing
team uses? Yes, it is.
The truth is that if our current data model poses problems for us
down the line, we can always spend a day or two splitting out the
dimensions into a bunch of new dimension tables according to
Kimball's methodology. Because all of our raw analytical data is
captured in the same data warehouse, we need not fear losing the data
required for future changes. We can simply redo our models within
Holistics's data modeling layer, set a persistence setting, and then let
the data warehouse do the heavy lifting for us.
Notice how we modeled pageviews first from our event data, and
sessions later, only when we were requested to do so by our marketing
colleagues. We could have speculatively modeled sessions early on in
our Snowplow adoption, but we didn't. We chose to guard our data
team's time judiciously.
Use such speed to your advantage. Model only what you must.
Most of the data modeling layer tools out there encourage you to pre-
calculate business metrics within your data model. This allows you to
keep your queries simple. It also prevents human errors from
occurring.
Going Forward
We hope this case study has given you a taste of data modeling in this
new paradigm.
Use a data modeling layer tool. Use ELT. And what you'll get from
adopting the two is a flexible, easy approach to data modeling. We
think this is the future. We hope you'll agree.
Chapter 4:
Using Data
It will explain to you certain shifts that have happened in the past
three decades, so that you will have a basic historical understanding
of the tools you will encounter throughout your career. This will
help orient you, and prevent you from feeling lost.
Much of DPB's data was stored in a central data warehouse. This was a
set of massive RDBMS databases that had been bought in a wave of
digitization that DPB underwent in the late 80s. Daniel didn't know
what these databases were — in fact, he never interacted directly with
them. Instead, an 'IT services' team was assigned to him, and he
interacted primarily with Cognos — at the time, one of the most
dominant business intelligence tools on the market.
A typical day would look like this: Daniel's boss would ask for a set of
numbers, and Daniel would go to his Cognos system to check the
PowerPlay cubes that were available to him. Most of the time, the data
would be in one or more cubes that had already been built by the 'IT
Services' team (the cube would be built with a subset of the main data
in the data warehouse). Daniel would point his PowerPlay client to the
cubes he wanted on the bank's internal Cognos server, and then slice
and dice the data within the cube to extract the desired numbers for
his boss. Most of the time, this went out in the form of an Excel
spreadsheet — because Daniel's boss would want to do some
additional analysis of his own.
The problems with Daniel's job emerged whenever Daniel's boss asked
for numbers that he didn't have access to. Whenever that happened,
Daniel would have to start a process which he quickly learned to hate.
The process went something like this:
4. At the end of three weeks, Daniel would be notified that the work
order had been processed, and that his new data was waiting for
him within the Cognos system. He might have to wait a few hours
for the data to be refreshed, because Cognos Transformer servers
took four hours on average to build a new PowerPlay cube.
Naturally, Daniel had to obsess over his work orders. The cost of a
single delayed or botched request was incredibly high, because his boss would
be expecting numbers by the end of the reporting period. Daniel lived
under constant fear that the IT Services department would assign him
a dim-witted data engineer; he also felt helpless that he had to rely on
someone else to give him the resources he needed to do his job well.
What made things worse was when Daniel's boss's boss (yes, he of the
fearsome data-driven reputation) made requests that crowded out those of the other
data analysts in Daniel's department. During such events, both the data
analysts and the IT Services department would prioritize the big boss's
request, leaving Daniel to fight over leftover resources at the services
scrap table. It was during times like these that he was most likely to be
assigned a dim-witted data engineer; over the course of a few years,
Daniel learned to be SUPER careful with his work order requests
whenever the big boss went on one of his data-requesting sprees.
Daniel thought back to all the hell he went through in his previous job,
and agreed that this sounded like a good idea.
From his time at YouDoo, Daniel learned that Tableau was essentially
the best tool in a new breed of BI tools, all of which represented a new
approach to analytics. This new approach assumed that the data team's
job was to prepare data and make it available to business users.
Then, capable business users could learn and use intuitive tools like
Tableau to generate all the reports they needed.
It was six months into Daniel's tenure at YouDoo that he was first
dragged into a metric knife fight. Apparently, marketing and sales
had been at loggerheads over something numbers-related for a couple of
weeks. Daniel and Joe were booked for a meeting with the
respective heads of sales and marketing. They learned quickly that
marketing's numbers (presented in a beautiful Tableau visualization,
natch) didn't match sales's. Sales had exported their prospects from
the same data sources as marketing's — what was going on?
Daniel dug into the data that week. Over the course of a few hours, he
realized that marketing was using a subtly different formula to
calculate their numbers. Sales was using the right definitions for this
particular dispute — but they, too, had made subtle errors in a few
other metrics. To his dawning horror, Daniel realized that multiple
business departments had defined the same metrics in slightly
different ways ... and that there was no company-wide standardization
of measures at all.
Daniel alerted Joe. Joe alerted the CEO. And the CEO called them into
his office and exploded at both of them, because he had just presented
the wrong numbers to the board in a quarterly meeting that had
concluded the previous week. Daniel and Joe were forced to work
overtime that day to get the numbers right. The CEO had to update
the board members with a follow-up email, and then he issued a
directive that all metric definitions were to be stored in a central
location, to be maintained by the business intelligence team.
Daniel realized that this new Tableau workflow may have solved some
problems ... but it led to others as well.
Looker was amazing. Unlike the Cognos workflow Daniel started in, or
the Tableau workflow he grappled with at his previous company,
Looker assumed a completely different approach to data analytics. At
PropertyHubz, Looker was hooked up to Redshift, and ran SQL
queries directly against the database. Daniel's team of data analysts
spent most of their time creating data models in LookML, the
proprietary data modeling language in Looker. They then handed
those models off to less-technical members of the organization to turn
into dashboards, reports, and self-service interfaces.
paradigm would bring — and in turn, what the next paradigm would
do to solve them.
Let's recap Bien's argument, before picking apart the parts that don't
fully match reality. Those parts have real implications for
how you view the business intelligence tool landscape today.
subsets of data and build them out into cubes: data warehouses were
simply too expensive and too slow to be used for day-to-day analysis.
Of course, as we've seen from Daniel's story, these tools came with
their own set of problems.
It's important to note that even as these 'second-wave' tools came into
prominence, the Cognos-type first-wave environments continued to
gain ground within large corporations. It wasn't as if people adopted
Tableau, and then the monolithic workflows went away. In fact,
Cognos still exists today, albeit under IBM's umbrella of business
intelligence tools. (Even the PowerPlay cubes that Daniel used at the
beginning of his career are still part of the product!)
But for new companies — and large tech companies like Google, Uber,
Facebook and Amazon — business intelligence that is implemented
today is often built entirely within the third wave. This is the
viewpoint that this book has attempted to present.
Daniel is, of course, not real. But his story is a pastiche of real events
and real problems that were taken from analysts we know. The pains
that Daniel felt were real pains experienced by thousands of analysts
in the previous decade.
SQL vs Non-SQL
Some business intelligence tools demand knowledge of SQL. Others
do not.
the data into a form that is suitable for analysis before handing it to a
business user. The Tableau user then loads such data into Tableau, in
order to generate visualizations and reports. Tableau assumes no
knowledge of SQL in order to produce such visualizations.
Similarly, BI tools that operate on top of OLAP cubes tend to not use
SQL. For example, Microsoft's MDX language was developed
specifically for operations on a data cube, and came to prominence
with its SSAS (SQL Server Analysis Services) suite. The language was then adopted by many other
OLAP cube vendors in the market, and is today considered a solid
alternative to SQL.
This was not always obvious. Starting around 2010, there was a huge
amount of hype around NoSQL datastores like MongoDB, Cassandra,
and CouchDB. These datastores promised superior performance, but
did not use SQL. There was also an earlier wave of excitement over big
data technologies like Hadoop and Spark, the majority of which
eschewed SQL for proprietary APIs.
Both trends have died in the years since. The vast majority of
analytical datastores today have standardized around SQL. Even non-
SQL datastores like the proprietary Spanner database from Google —
and even Hadoop today! — have adopted SQL as a query language.
In an essay titled The Rise of SQL-Based Data Modeling and Data Ops,
Holistics co-founder Thanh argued that the standardization around
SQL happened for a few reasons:
Proficient SQL users were able to utilize the power of any SQL
cloud-based data warehouse to produce beautiful charts and
dashboards ... without the need to learn a new proprietary language
or tool. This meant transferrable skills. It also meant that it was
easier for companies to hire and train new analysts.
We do not think this trend will reverse anytime soon, for the simple
reason that standards only become more entrenched over time. The
upshot here is that if you work in data, you should assume that SQL is
the lingua franca of data analytics in the future. Pick tools accordingly.
(This was also made easier because modern data warehouses have all
standardized on SQL. See how awesome standardization is?)
The short answer is no: they're not. We've covered this in a
previous section, in Understanding The Data Warehouse, but let's go
through this quickly, again. When we say 'modern data warehouse', we
really mean data warehouses that have two properties:
This is not to say that you can't use Holistics and other similar tools
with RDBMSes like MySQL and PostgreSQL. In fact, we've seen many
startups start out with a Postgres replica as their primary analytical
database.
In-Memory vs In-Database
Another variant of this 'embedded datastore' vs 'external datastore'
spectrum is the idea of 'in memory' vs 'in database' tools.
Tools like Holistics, Redash, Chartio, Metabase and Looker run SQL
queries on top of a powerful database. The heavy lifting is done by the
database itself; the connected BI tool merely grabs the results of
the generated queries and displays them to the user.
The Cluvio, Redash and Mode Analytics tools we've mentioned don't
provide any modeling capabilities whatsoever. In practice, many
contemporary data teams that we know either implement ELT
techniques using data modeling layers like dbt or Dataform, or use a
more traditional ETL approach using tools like Pentaho or Talend.
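For example, a minimal dbt-style transformation is just a SQL file that the tool runs inside the warehouse and materializes as a table, which is the heart of the ELT approach. The model below is only a sketch: the shop source, the raw_orders table, and the column names are assumptions, and the source would still need to be declared in a dbt sources file.

    -- models/orders_cleaned.sql (hypothetical dbt model)
    {{ config(materialized='table') }}

    SELECT
        order_id,
        customer_id,
        LOWER(TRIM(status))                 AS status,      -- normalize messy values
        CAST(amount_cents AS NUMERIC) / 100 AS amount_usd,  -- convert units
        created_at
    FROM {{ source('shop', 'raw_orders') }}
    WHERE created_at IS NOT NULL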
Holistics and Looker are somewhat unique in this sense, in that they
include a modeling layer alongside BI functionality. Because of that,
your entire logic is centralized in the data modeling layer, which greatly
increases metric consistency and logic reusability across the
organization.
We think that SQL-based analytics will win over non-SQL-based
analytics.
Given what we've covered over the past couple of chapters, and given
what we've shown you about the landscape of BI tools in this section,
you should now be able to place these biases against a larger context.
We prefer ELT over ETL because we think that ELT gives you the
power of more flexible data modeling practices. You may choose to
skip the up-front modeling costs of the original Kimball paradigm,
and take them on only when your reporting requirements call for
it.
We think that SQL-based analytics will win over non-SQL-based
analytics, because the entire industry has standardized on SQL in
the last five years.
not bottleneck on its data team in order to get the insights they
need.
Wrapping Up
So, let's recap. When you're in the market for a business intelligence
tool, you may categorize the tools you see in the following ways:
1. SQL vs Non-SQL: Does the tool assume SQL as its primary query
interface? Or does it export data out to a non-SQL data engine?
Does it use cubes? The answer to this will tell you a lot about the
paradigm the tool is from.
We think that the list we've laid out above is the shortest possible
taxonomy that gives you the most interesting information about the
tools you are evaluating. We hope it gives you a way to orient
yourself the next time you survey the landscape of business
intelligence.
How you serve these queries depends heavily on the tools you have
available to you. If you have access to a centralized data warehouse, it
is likely that you would write an ad-hoc SQL query to generate the numbers required.
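For instance, a one-off request such as "how many signups did last week's campaign bring in?" might be answered with a throwaway query along these lines (the signups table and its columns are hypothetical):

    -- Ad-hoc query run once against the warehouse to answer a single request.
    SELECT COUNT(DISTINCT user_id) AS campaign_signups
    FROM signups
    WHERE utm_campaign = 'spring_launch'
      AND signed_up_at >= CURRENT_DATE - INTERVAL '7 days';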
This, too, is inevitable. Over time, data people will learn that there is a
steady, predictable cadence to some of the requests they receive. For
instance, at an early-stage company that we worked with, the head of
data quickly realized that the product team had a certain set of metrics
they wanted to look at on a weekly basis, while the marketing team
had a natural tempo of requests once every three weeks.
This head of data began to look for a BI tool to create dashboards for
those predictable metrics, in order to free up his team for the more
ad-hoc requests that they received from other parts of the company.
Once he had created those reports, his data team immediately began
to feel less overwhelmed.
"We're very happy," he told us, "The product team and the marketing
team each got their own dashboard, and once we set everything up,
the number of requests we got from those two teams went down. We
now try and give them a new report every time they ask for
something, instead of running ad-hoc queries all the time for them."
Three: Self-Service
It is perhaps ironic that more dashboard usage leads to more data-
driven thinking ... which in turn leads to more ad-hoc requests! As
time passes, business operators who lean on their dashboards begin to
adopt more sophisticated forms of thinking. They learn to rely less on
their gut to make calls like "let's target Japanese businessmen golfers in
Ho Chi Minh City!", or "let's invest in fish instead of dogs!" This leads
to an increase in ad-hoc, exploratory data requests.
The data team finds itself overwhelmed yet again. The data lead
begins to think: "if only there was some way to let our business people
explore metrics on their own!"
"And they will be able to get more sophisticated data from the SQL-
oriented tools we have (like Redash, Mode or Metabase)."
Both approaches have problems, but the biggest problem is that they
often lead to the metrics knife fight we've talked about at the
beginning of this chapter. Different business users may accidentally
introduce subtly different metric definitions to their analyses. These
inconsistencies often lead to miscommunication, or — worse — errors
of judgment at the executive level.
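To see how easily this happens, imagine two teams each computing "monthly revenue" with their own hand-written query. Both queries below are hypothetical sketches with made-up table and column names; the only difference between them is whether refunded and cancelled orders count, which is exactly the kind of subtle divergence described above.

    -- Marketing's version: counts every order, including refunds.
    SELECT DATE_TRUNC('month', created_at) AS month,
           SUM(amount_usd)                 AS monthly_revenue
    FROM orders
    GROUP BY 1;

    -- Sales's version: excludes refunded and cancelled orders.
    SELECT DATE_TRUNC('month', created_at) AS month,
           SUM(amount_usd)                 AS monthly_revenue
    FROM orders
    WHERE status NOT IN ('refunded', 'cancelled')
    GROUP BY 1;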
1. Reporting — This is the lowest level. As Schario puts it: when you
have no answers, you never get beyond looking for facts. Example
questions at this level are things like: ‘how many new users visited
our website last week?’ and ‘how many leads did we capture this
month?’ Some companies do not get to this level, because they lack
Such models can be useful because they show business leaders what is
possible within their organizations. The fact that so many data
maturity models exist tells us that there is some base truth here. It is
interesting to ask why these models exist.
Most people are not data-driven by nature. They have to learn it, like
they learned reading or writing. In a sufficiently large company,
however, you will find certain people who are naturally data-driven in
their thinking; others that seem data-driven from the beginning may
have come from more data-mature organizations and therefore seek
to continue the practices that they were used to. Depending on the
culture of your company, these attitudes will spread throughout your
organization (or not!).
When viewed through this lens, the data capabilities that you build out
in your team will have an impact on the spread of data-driven
thinking in your organization. The more data capabilities you have,
the more people will be exposed to the potential of using data to
advance their arguments. The more data capabilities you build up, the
more leverage you give to data-driven people in your company's
culture to spread their way of thinking.
As a result, the amount of work that your data team has to do increases
linearly with the spread of data-driven thinking in your company!
The upshot is that if all goes well, your data team will find itself buried
under a wave of ad-hoc requests. You will seek solutions to this
problem. You will discover that dashboards and automated reports
will buy you some time. But eventually, as your organization moves
from reporting to insights to predictions, you will have to tackle this
problem head-on.
This arc shouldn't surprise us. Spend even a small amount of time
looking at industry conferences, or data thought leadership articles, or
marketing materials from vendors, and you'll find that much of the
industry is obsessed with self-service as an ultimate goal. "Listen
to us!" the thought leaders cry, "We will show you a way out of this
mess!" To be clear, this is an understandable desire, because data-
driven decision-making so often bottlenecks on the data team. Also to
be clear: a majority of companies do not succeed in this effort. True
self-service is a difficult challenge.
The most important takeaway from this section is that the arc is real.
We've talked about the challenges of scaling data infrastructure in the
past, but the flip side of that discussion is the idea that you must scale
your BI tools to match the data consumption patterns in your
company. Keep the arc of adoption in mind; nobody really escapes
from it.
If you've read this far in the book, you can probably guess at what we
think about this: unlike 'second wave' business intelligence tools, we
think that data modeling at a central data warehouse is a solution to
this problem. Define your business definitions once, in a modeling
layer, and then parcel these models out for self-service. In this way,
you get all the benefits of self-service without the problems of ill-
defined, inconsistent metric definitions.
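A minimal sketch of that idea, with hypothetical names: the canonical definition lives in one model maintained by the data team, and every self-service query builds on that model rather than on the raw tables.

    -- The single, canonical definition of an "active user".
    CREATE VIEW model_active_users AS
    SELECT DISTINCT
        DATE(event_at) AS activity_date,
        user_id
    FROM events
    WHERE event_type IN ('login', 'purchase', 'comment');

    -- Self-service consumers query the shared model, so everyone gets the same numbers.
    SELECT activity_date, COUNT(*) AS daily_active_users
    FROM model_active_users
    GROUP BY activity_date;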
As far as we can tell, the only business intelligence tools to adopt this
approach are Looker and Holistics. We expect more tools to follow in the years to come.
Will this approach win out in the end? We'd like to think so. We think
that there are many advantages to this approach. However, as our
intrepid data analyst Daniel has shown us, we cannot know what
problems will fall out of this new paradigm. We will have to wait a
couple of years to see.
Chapter 5:
Conclusion
The End
So what have we shown you?
We've shown you that all analytical systems must do three basic things.
Those three things give us a useful framework for talking about and
thinking about building data infrastructure. The three things that you
must do are, again:
1. You must collect data: that is, consolidate data from all your
sources into a single central data warehouse.
2. You must process data: that is, transform, clean up, aggregate and
model the data that has been pushed to a central data warehouse.
3. You must present data: that is, serve the processed data back to the
business in the form of reports, dashboards, and self-service tools.
Within these three steps, we've taken a look at each with a fair amount
of nuance:
We've shown you how ELT provides powerful benefits that stretch
beyond minor operational improvements.
Finally, we've sketched for you the shape of the entire BI tool
landscape.
The Future
What do we think the future holds for data analytics?
In the short term, we think that many of the trends we've pointed out
throughout this book will continue to proliferate. The industry's
current shift to standardize around SQL will only strengthen; new BI
tools will continue to adapt to the power of the modern data
warehouse. We (rather biasedly, it must be said) believe that the
approach we have described in this book is The Way Of The Future,
and that it would eventually seep into large enterprises — that is,
If you are a data analyst, one implication is that you should have
passing familiarity with all the approaches from all three paradigms in
the business intelligence world. This doesn't mean that you must
master them — but it does mean that you should be aware of the
alternative approaches that exist. This is because — as we've pointed
out — old approaches in business intelligence stick around for a long
time.
We hope you enjoyed this book, and that you've learned a great deal
from it. If you thought this was useful, we would be very grateful if
you shared this book with the people who need it — new data
professionals, startup founders looking to set up a data analytics
capability for the first time, product people who just want the bare
minimum to hit the ground running.
We'd also love to hear from you if you have feedback or thoughts on
the book. You can:
Try Holistics for your company today!
Holistics helps you set up, run and maintain your analytics stack without
data engineering help.
www.holistics.io →