Be cautious when the optimizer estimates a cardinality of 1 – Part 1

Some time ago I came across a problem which inspired me to write this article.
It also made me realize how important it is to understand the client’s business model.
Not that you need to dive deep into it – in most cases we remain within the specifics of SQL.
The optimistic side of it: with a bit of patience there is a good chance you can figure out enough business detail to then apply it in SQL tuning.

I’ll be more specific very soon.
Still, before jumping to the real matter, I feel obliged to start with some semi-theoretical introduction.

Let’s start with the problem query.
This is the query (table names changed) which gave us a headache:

SELECT
(some columns from the tables below)
FROM
    VERYSECOWNR.pr_secrtb_secrtpos_info  inf,
    VERYSECOWNR.pr_secrtb_secrtpos_value val,
    VERYSECOWNR.pr_secrtb_my_versec      sta,
    VERYSECOWNR.pr_secrtb_det1           uni,
    VERYSECOWNR.pr_rverysecr             res,
    VERYSECOWNR.pr_realsec               dev,
    VERYSECOWNR.pr_reallysec             loc
WHERE
        inf.hist_id = val.hist_secret_id (+)
    AND inf.myversecret_column = sta.secrid (+)
    AND inf.sensor_unit_autoid = uni.secrid
    AND inf.secretlo = loc.secrid (+)
    AND uni.secdev = dev.secrid
    AND dev.mysecret_ref = res.secrid (+)
    AND inf.hist_id >= :1
    AND val.hist_id (+) >= :2
order by inf.hist_id , inf.verysecret_timstmp
FETCH NEXT 50000 ROWS WITH TIES

The predicates of special importance:

AND inf.hist_id >= :1
AND val.hist_id (+) >= :2

and also (will be examined later):

inf.hist_id = val.hist_secret_id (+)

Next, it is crucial to be aware of how the business data processing determined the (mis)behavior of this critical SQL.
For starters, let’s take the hist_id from the table PR_SECRTB_SECRTPOS_INFO and see how it “travelled in time”, leaving behind the values gathered during the (very early: at 22:02) stats-gathering window.
The hist_id columns are of type NUMBER and serve as primary keys in both tables.

Gathering of statistics (22:02): 
Min ------------------------- Max

Reality at T0 (say, 22:10): 
    Min (+) --------------------------- Max (+)


Reality at T1 (say, 22:30): 
        Min (+) --------------------------- Max (+)


Reality at T2 (say, 23:30): 
                    Min (+) -------------------------------------------------- Max (+)


Reality at T3 (say, 00:30, next day): 
                                    Min (+) -------------------------------------------------- Max (+)

First (this is paramount), the value of HIST_ID in the predicate
inf.hist_id >= :1
lies outside of the “official” (i.e. derived from the statistics) min–max boundary. We will come to that shortly and show why this is crucial.

Secondly, simple monitoring of how the HIST_ID “moves” over time (where “time” is to be understood as “several hours during a business day”) revealed the following :

Min (+) --------------------------- Max (+)
                                  Min (+) ------------------------------------------------------------ Max (+)
                                          Min (+) ----------------------------------------------------------------------------- Max (+)

I traced the “movement” of HIST_ID with a trivial SQL against the business table, which I executed roughly every 10 minutes:

SQL> select min(HIST_ID)                min_hist_id_info,
            max(HIST_ID)                max_hist_id_info,
            max(HIST_ID) - min(HIST_ID) min_max_range
     from   VERYSECOWNR.PR_SECRTB_SECRTPOS_INFO ;

Indeed, I could observe that both min and max of HIST_ID were steadily “moving to the right”, obviously leaving the CBO more and more helpless over time, as no stats were refreshed to reflect these changes.
This movement was caused by very frequent DELETEs followed by (equally frequent) accompanying INSERTs.

As a result, what had worked fine close to the point the statistics were gathered gradually got worse as the HIST_IDs kept increasing.
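To make the drift tangible, here is a minimal sketch (plain Python, not Oracle internals – all numbers are made up) of how frequent DELETE-plus-INSERT churn on a sequence-driven key moves the real min/max away from the frozen statistics:

```python
# Simulate DELETE + INSERT churn on a table whose PK comes from a sequence.
# The stats snapshot is taken once and never refreshed, while the real
# min/max of HIST_ID keep "moving to the right".

stats_min, stats_max = 1, 10_000        # captured at 22:02, never refreshed
ids = list(range(1, 10_001))            # table content at stats-gathering time
next_id = 10_001                        # next value the sequence will hand out

def churn(ids, next_id, n):
    """Delete the n oldest rows, then insert n new rows with fresh IDs."""
    ids = ids[n:]                                  # DELETE the oldest rows
    ids = ids + list(range(next_id, next_id + n))  # INSERT the newest rows
    return ids, next_id + n

for t in range(3):                      # a few hours of business activity
    ids, next_id = churn(ids, next_id, 3_000)
    print(f"T{t}: real min={min(ids)}, real max={max(ids)}, "
          f"stats still say [{stats_min}, {stats_max}]")
```

After a few rounds the entire real ID range lies partly or wholly beyond the statistics’ max, which is exactly the situation the diagrams above depict.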
Why?
Let’s build a theoretical example.

If you consider a predicate like “WHERE column_num > :some_numerical_value”, your query gets a range of values stretching from :some_numerical_value to max(column_num).
Probably a better (and arguably more popular) example is the predicate “WHERE date_col >= sysdate - :nr_days”.
It returns the set of rows which can be roughly described as “give me all rows reaching :nr_days into the past”.
The number of rows returned by such a predicate carries great importance for the optimizer, as it essentially determines two things:
the method of accessing the underlying table (the access path – basically index range scan or full table scan) and the join strategy (essentially nested loop or hash join).

Now, why did I mention this :
“As a result, what had worked fine close to the point the statistics had been gathered, has been gradually getting worse along with the constant increase of the HIST_IDs”

Look at the line-like diagram, representing this constant movement of the IDs used in the predicates.
Let us assume that, according to the optimizer’s knowledge, the min and max values of the queried column are 1 <-> 10’000.

Statistics :
Min (1) ------------------------- Max (10’000)

Now, if your query applies the predicate col >= 9990, it returns 10 rows out of the range 1 <-> 10’000, so a fairly small part.
Provided there is an index on column col, it’s very likely that an INDEX RANGE SCAN will be chosen, with a calculated cardinality of 10 (out of 10’000); under normal circumstances the CBO would definitely not go for a FULL TABLE SCAN.
There is yet another crucial outcome: provided the table behind this predicate is joined to another one, it’s very probable that a NESTED LOOP will be calculated (with this table as the driving table) in place of the seemingly much more expensive HASH JOIN.
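The arithmetic behind that estimate can be sketched with the textbook range-selectivity formula for “col >= x” without a histogram – (max − x) / (max − min) – keeping in mind that the real CBO math has many more refinements:

```python
# Textbook range-predicate estimate for "col >= x" (no histogram):
# selectivity = (max - x) / (max - min), cardinality = num_rows * selectivity.
# A simplified sketch, not Oracle's exact arithmetic.

def range_card(num_rows, lo, hi, x):
    """Estimated cardinality for predicate col >= x, with x inside [lo, hi]."""
    selectivity = (hi - x) / (hi - lo)
    return round(num_rows * selectivity)

print(range_card(10_000, 1, 10_000, 9_990))   # the example above: ~10 rows
```

With 10 rows estimated out of 10’000, both the index range scan and the nested loop look (and at this point genuinely are) cheap.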

Pictorially :

Statistics :
Min (1) ------------------------- Max (10’000)

Predicate (x) :
Min (1) -------------- x(9’990)-- Max (10’000)

This is completely fine: with these statistics adhering to reality, both the INDEX RANGE SCAN and the NESTED LOOP are legitimate and, most importantly, the query runs fast.
Fast... but gradually slower, up to the point where it completely stopped behaving.
The reason it ran slower over time is that min and max moved to “the right” while the value of the predicate stayed (for some longer time) the same, so it was sort of anchored while the actual range of rows to be visited was steadily growing.

To illustrate how this evolved, let’s consider the following scenario.

Statistics :
Min (1) ------------------------- Max(10’000)

Let’s assume predicate (col > x) :
Min (1) ------------------------- Max – x(10’100)

So, the value of x lies outside of the boundary (it is higher than max).
Thus, the expected cardinality of a rowset based on such a predicate is 0.
Despite this (in theory) correct assumption, the optimizer calculates the cardinality as 1, most probably to avoid 0 being used as a divisor.
This fact is clearly communicated in many places of the 10053 trace:

Using prorated density: <density> of col #<number> as selectivity of out-of-range/non-existent value pred
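The floor-at-1 behavior described above can be sketched as follows (an illustration only – it deliberately ignores Oracle’s prorated-density decay mentioned in the trace line and just shows why 0 never comes out):

```python
# Hedged sketch of the out-of-range floor: whatever selectivity comes out
# of the range formula, the estimated cardinality is never reported below 1
# (an estimate of 0 would be useless as a divisor in later cost formulas).
# Not Oracle's real out-of-range math (no prorated-density decay here).

def out_of_range_card(num_rows, lo, hi, x):
    sel = max((hi - x) / (hi - lo), 0.0)   # x beyond max -> 0 rows expected
    return max(round(num_rows * sel), 1)   # ...but the estimate is floored at 1

print(out_of_range_card(10_000, 1, 10_000, 10_100))   # out of range -> 1
```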

Continuing: all is fine as long as the state of the optimizer’s knowledge (that is, the statistics) reflects reality.
It’s not good if it doesn’t, and the real disaster comes if it is off by far.
Let’s add the “factor of reality” to the picture.
So, let’s assume there was a bunch of DELETEs and INSERTs which completely changed the min and max, thereby invalidating the CBO’s state of knowledge.

Statistics :
Min (1) ------------------------- Max (10’000)

Predicate (x) "where col > :x"                    
Min (1) ------------------------- Max – x(10’100)

Reality:                    
                                 Min (9’000) ------------------------- Max (20’000)

According to the CBO’s knowledge, the value 10’100 in
WHERE table.col > 10’100
lies outside of the boundary, so a cardinality of 1 will be calculated.
If there is an index on col, it will surely be used, since the INDEX RANGE SCAN will appear very cheap.
Here comes the first part of the problem: in reality, this range scan will turn out extremely expensive, traversing almost the entire table (the whole range: 20’000 – 9’000 = 11’000).
If you add possible row chaining, a high clustering factor and the obvious fact that the physical reads (lookups) supporting such index range scans are done with single-block reads, things may look even worse.
Needless to say, a full table scan would in the vast majority of cases be much more efficient.
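A back-of-the-envelope comparison shows why: the “cheap” index range scan pays one single-block read per visited row in the worst case (poor clustering factor), while a full scan reads the table in multiblock chunks. All the numbers below are made up for illustration:

```python
# Rough I/O comparison: index range scan with bad clustering factor
# (worst case: one single-block read per visited row) versus a full
# table scan using multiblock reads. Illustrative numbers only.

rows_visited = 11_000            # real range 20'000 - 9'000, as above
blocks_in_table = 5_000          # assumed table size in blocks
multiblock_read_count = 32       # blocks fetched per full-scan read request

index_scan_ios = rows_visited                                 # worst case
full_scan_ios = -(-blocks_in_table // multiblock_read_count)  # ceil division

print(f"index range scan: ~{index_scan_ios} single-block reads")
print(f"full table scan : ~{full_scan_ios} multiblock reads")
```

Under these assumptions the “cheap” plan issues roughly 70x more read requests than the full scan it was chosen over.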
There is yet another matter lurking in the bushes which is too rarely taken into account (and I was hit by it).
Namely, the longer such a query runs, the higher the chances it will need to rebuild the before-images of blocks that have been changed by transactions active in the meantime.
That (in extreme cases: vastly) increases CPU utilization, wastes buffer cache memory and increases the risk of snapshot too old (ORA-01555).
The “extreme cases” are the situations in which such rebuilding of the before-image (rewinding the currently read buffer back to the required SCN) involves cascaded rollback.
There is also the second part of the – I dare to state – disaster.
If the table is involved in a join (which was also my case), there is a great chance it will be nominated as the driving table of a NESTED LOOP (I will come to that).

Continue to part 2
