Oracle Histograms and Why

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 8

Oracle Histograms and Why???

In this short series on histograms we will be looking at the reasons why we need histograms and the
methods Oracle has to create them. We will examine the costs of creating them, and the potential
they introduce for undesirable overheads, and then review their potential for giving you stability
problems in your execution plans. This overview will be limited to versions of Oracle prior to 12c,
where new types of histograms and collection methods aimed at reducing the overheads and
improving stability have appeared

A simple example
Some time ago a client of mine saw some strange performance problems in what seemed to be a
very simple part of their application. They were dealing with an online sales system, and during the
course of the day they needed to run off regular reports of the form: show me order that have
recently been placed but have not yet been dispatched. This requirement turned into a very simple
piece of SQL:

select {list of columns}


from orders
where
status = R
order by
order_id
;

At any one instant during the day there would be only a small number of orders matching this
predicate one or two hundred out of a table holding millions of rows. There was an index on the
status column to allow for an efficient access path and since new orders would, for simple mechanical
reasons, be in a portion of the table that was the most recent group of blocks added to the table and
those blocks would be cached in memory, it was assumed that the query would inevitably execute
very quickly. The problem was that some days the report would take tens of seconds to run rather
than being virtually instantaneous.
The first diagnostic, of course, was to check that the execution plan was the one expected was the
query trying to do something efficient. There was no need for a second diagnostic when the query
ran quickly Oracle was using the expected index, when it ran slowly Oracle was doing a tablescan. So
the question changed from why is the query slow to why does the optimizer think, some of the time,
that a tablescan is a good idea.
Reading the description of the business activity, and given the title of the article, youve probably
already got a very good idea. This data set is very skewed, and when the optimizer sees the skew
we get the right plan and when it doesnt see the skew we get the wrong plan.
Heres a query (against a model of the data) highlighting the type of problem:

select status, count(*)


from orders
group by status
order by
status
;

S
C
P
R
S
X

COUNT(*)
529,100
300
300
300
500,000

As you can see, most of the data has ended up in one of two states (depending on how it finally got to
the customer) and a tiny volume of data is scattered across a handful of other values. When you see
data like this and know that you have a requirement to access the rare or non-popular values (the
latter is Oracles preferred terminology) your thoughts are likely to turn in one of two directions
virtual columns (which might mean a function-based index, or the 11g implementation of virtual
columns, or even the 11g extended stats), or histograms.

Virtual Columns etc.


To my mind the nicest solution comes from virtual columns (or function-based indexes pre-11g)
because this allows us to maintain a very small index despite the size of the data set. So we might
create something like the following:

create index ord_new on orders(


case status when 'R' then 'R' else null end
);
begin
dbms_stats.gather_table_stats(
user,
'orders',
method_opt => 'for all hidden columns size 1'
);
end;
/

Although I will need to collect statistics on the hidden column holding the index definition, collecting
stats on all hidden columns after creating the index could be expensive way of doing it I could check

the view user_tab_colsfor the most recent column name, which will be something
like sys_nc00037$, and collect stats on just that specific column. (Note: technically the else null is
redundant but I prefer to include the final option explicitly.)
Of course, I might want to run similar queries for the other non-popular values, so I could create two
more indexes like the above, or I could create an index that covers the three values heres an
example using the 11g virtual column approach:

alter table orders


add (
interesting_status
generated always as (
case status
when 'C' then null
when 'X' then null
else status
end
) virtual
)
/
begin
dbms_stats.gather_table_stats(
user,
'orders',
method_opt => 'for columns interesting_state size 1'
);
end;
/
create index ord_is on orders(interesting_status);

There is a limitation to the virtual column / function-based index approach (whichever method you
use) you have to change the application code to take advantage of it the proper virtual columns
in 11g make the code look neater than the FBI code, but there still has to be a change, e.g (for the FBI
approach):

select {list of columns}


from orders
where
case status when 'R' then 'R' else null end = R
order by

order_id
;

Histograms
So what do we do if we If we cant change the code? We have to make sure that the optimizer is
aware of the problem, because if we dont then the basic optimizer model will produce a bad estimate
for the cardinality (row count) and choose a bad execution path. At the simplest level the statistics we
collect for the optimizer will say: there are 1,030,000 rows in the table, this column has 5 distinct
values and no nulls and the values are distributed evenly from C to X. With this information the
optimizers cardinality estimate for the predicate status = C will be derived as: total rows / number
of distinct values = 206,000. This assumes, of course, that you used a 100% sample
(estimate_percent => 100) to gather stats, the results could be a little less predictable if youre using
a version older than 11g, or havent yet converted to the approximate NDV mechanism in 11g.
This is where histograms come into play they allow us to give the optimizer more detailed
information about the distribution of values in a column. Until 12c they come in two flavours: frequency
histograms and height-balanced histograms and in our example we need a frequency histogram.
(Note: 12c has two new types of histogram: Top-N and hybrid).
In principle a frequency histogram is an exact picture of the data at a moment in time (and that last
phrase is very important) whereas a height-balanced histogram is an approximate image of the data
distribution that tries to capture details of popular values and the uneven spread of the rest. A
frequency histogram can be created when a column holds no more that 254 distinct values (2,000 in
12c), whereas a height balanced histogram is much less precise and cant really capture information
about more than 127 popular values. For the rest of this article Ill restrict myself to frequency
histograms.

Frequency Histogram
In our example we have only five distinct values, and my model dataset holds just over 1M rows. I can
ask Oracle to collect a histogram on the column by gathering table stats with the following setting for
the method_opt parameter:for columns status size 254. (Note, although I know there are only 5
values I might as well ask for the maximum, Oracle will discover that 5 is sufficient). If I also set
the estimate_percent to 100 I will end up with the following in the user_tab_histograms view for
this column:

select
ep_let
status,
endpoint_number - ep_lag
ep_val
from
(
select

ct,

to_char(endpoint_value,'FMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
chr(
to_number(

ep_val,

substr(to_char(endpoint_value,'FMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'),1,2),
'XX'
)
)
ep_let,
endpoint_number,
lag(endpoint_number,1,0) over(order by endpoint_number) ep_lag
from
user_tab_histograms
where
table_name = 'ORDERS'
and
column_name = 'STATUS'
)
order by
ep_let
/
S
CT EP_VAL
- ---------- ------------------------------C
529100 43202020202009e7ff86dab2800000
P
300 50202020202016f62c60904c200000
R
300 52202020202029a671551080e00000
S
300 53202020202032fe93cf509b400000
X
500000 5820202020202b817684cb40800000

Ive shown a method of converting the endpoint_value as its stored for character columns from its
numeric form to its character equivalent, if you have ASCII codes at your finger tips youll spot lots of
spaces (020) appended to the hex representation of the endpoint value for details of how the
transformation is actually performed see:http://jonathanlewis.wordpress.com/2010/10/05/frequencyhistogram-4/
Note that the histogram is effectively stored as a cumulative frequency, which Ive unwound by using
the lag()analytic function thus allowing you to see that Oracle holds exact counts for each of the
distinct values in the data.
With this information in place and assuming that my SQL really does use literal values when the
optimizer calculates the cardinality from the predicate status = R it can check that this is a value in
the histogram and report the count it finds recorded there. The frequency histogram is an enormous
aid in this particular case so, you may ask, why dont we simply create histograms for every single
column in the application (or possibly just the columns that appear in where clauses)?

Threats with Frequency Histograms


There are four main threats with histograms which I can label with the following bullet points, which Ill
then examine in order:
o
o
o
o

They dont mix nicely with bind variables


Theyre expensive to compute
They can be very unstable when sampled
You have to collect them at the right moment

I made the comment about the optimizer being able to pick the correct entry from the histogram if your
query uses the literal value. If youve used a bind variable for this query then the optimizer will use
bind-peeking on the first parse call and still produce the correct cardinality (and execution plan); but
until the advent of adaptive cursor sharing in 11g, and adaptive execution plans in 12c, that one
plan was (essentially) the plan you kept for all subsequent executions of the query no matter how
the value of the bind variable changes. In my example, matching the clients system, a bind variable
would have been okay because the ONLY queries against this table were very simple ones for status
= {rare value}, and the plan for status R, would have been fine for P and S but generally you
wont be that lucky. If youve created a histogram on a column then you should expect to do
something in the application that lets the optimizer handle the histogram well and that may mean
using literals in the where clause (which can have its own problems), it may mean doing something
more subtle such as writing application code to check the users requests and pick the most
appropriate SQL from a short list to run in each case.
Assuming youve worked out how to make best use of a frequency histogram in your code, you still
have the problem of making sure that the histogram is accurate when the optimizer wants to read it.
Heres the SQL Oracle ran when I told it to collect the frequency histogram for the above data with a
100% sample size:

select
substrb(dump(val,16,0,32),1,120) ep,
cnt
from (
select
/*+
no_expand_table(t) index_rs(t)
no_parallel(t)
no_parallel_index(t) dbms_stats cursor_sharing_exact
use_weak_name_resl dynamic_sampling(0) no_monitoring
no_substrb_pad
*/
max("STATUS") val,
count(*) cnt
from
"TEST_USER"."ORDERS" t
where
"STATUS" is not null
group by

nlssort("STATUS", 'NLS_SORT = binary')


)
order by
nlssort(val,'NLS_SORT = binary')
;

The exact query will depend on the version of Oracle and whether Oracle thinks the column needs a
frequency histogram or a height-balanced histogram, but the general principle is that youre going to
see an aggregate query thats going to crunch a large volume of data and a variant of the query will
appear for each column that you identify as the target for a histogram. Gathering histograms
(accurately, at least) is an expensive operation.
You could reduce the cost of gathering the histogram by sampling, rather than computing. You will see
similar SQL appearing when you do this, though there are variations in particular Oracle will often
copy a sample of the original rows into a global temporary table that it creates for the purpose and
then run the queries against the global termporary table. This can result in a lot less work being done
to build the histogram but it introduces a different threat. Heres the content of the histogram on my
original data when I gave Oracle the auto_sample_size option to collect a histogram:

S
CT EP_VAL
- ---------- ------------------------------C
2868 43202020202009e7ff86dab2800000
P
2 50202020202016f62c60904c200000
S
1 53202020202032fe93cf509b400000
X
2627 5820202020202b817684cb40800000

If you add up the figures youll see that Oracle has sample 5,498 rows from the table so when it
estimates the number of rows for any given value is till check the histogram and multiple by 1,030,000
/ 5498 (the numerator is the number of rows according
to user_tables.num_rows minus user_tab_cols.num_nulls), so the estimate for status = S will be
187, and for P it will be 375 both of which are fairly reasonable (especially when compared to the
1.030.000/5 that we would see in the absence of a histogram).
But what do we do about status R it hasnt appeared in the sample so it hasnt appeared in the
histogram? In this case the optimizer will simply halve the cardinality of the least popular value that it
sees in the histogram so the cardinality would be calculated as 94. Again, in this case, thats not too
bad and it wont change the critical execution plan, but you can appreciate that if you get a little
unlucky in the rows that Oracle samples from day to day your execution plans could change fairly
randomly. Can you spot the major threat in this particular set of data ?
What happens if Oracle doesnt spot ANY of the rare values when sampling the data and ends up with
a histogram that says the data is split roughly 50/50 between C and X at about 500,000 rows each? A
query for status = R will use half the least popular value giving an estimate of about 250,000; and
thats exactly what happened to my client. From time to time the stats gathering routine (which was

doing the default 10g overnight collection of stale statistics) would gather stats on this table and miss
all the rare values, and for the next 24 hours (or until the next stats collection) the optimizer would
decide to use a tablescan on a very large table rather than using a highly appropriate index.
The idea of failing to capture critical information in the histogram leads us to the last critical problem of
histograms what if the critical information is never there when you collect the stats. Imagine that the
rare values in my order processing system appear only between 6:00 a.m. and 6:00 p.m. and by
10:00 p.m. they have all been processed out of the system. When the default stats collection runs
some time late at night the only values in the table are C and X, but when the queries run in the
daytime the only values were interested are exactly the ones that werent there when the stats were
collected. Even with a 100% sample there may be parts of your system that get misleading stats if you
gather them at the wrong time. You need to know your system well enough to know when the
application code itself should be responsible for the quality of the statistics. This may mean that you
write code to gather statistics at a specific time of day; it may mean that you write code that
manipulates the stored statistics directly (and we will be looking at that strategy when we look at
height-balanced histograms).

Conclusion
When the distribution of data values is highly skewed you need to do something about it if you want to
ensure that the optimizer doesnt produce very bad execution plans as a consequence. If you have
control of the application code features like virtual columns or function-based indexes may help you
deal with special values; but if you cant change the code you may need to depend on histograms.
Even with histograms in place, though, bind variables can easily cause problems even with the
newer features in 11g and 12c for adaptive cursors and adaptive execution plans.
A brief examination of frequency histograms (the simpler type) showed us how useful they can be for
a column with a small number of distinct values especially if your SQL uses literals. Histograms,
even the simple frequency histograms, can be expensive for Oracle to create unless it samples a
small fraction of the data, and then its possible that the histogram will introduce instability if the really
interesting data is transient and a very small fraction of the total. In fact, even if you use 100% sample
at the wrong time the resulting histogram can still cause problems thanks to the optimizers treatment
of missing values.

You might also like