Clustering Factor – theory

Some time ago I wrote a post diving into, what I like to call the “younger siblings” of Clustering Factor (aka CLUF – this is how it’s called in the 10053 trace):
AVG_DATA_BLOCKS_PER_KEY and AVG_LEAF_BLOCKS_PER_KEY—
While they represent close relationship to Clustering Factor, the CLUF tends to be more recognized in the tuning world and has its part in the costing formula when calculating the cost of a SQL query (see: Calculations of I/O cost in 10053 trace vs. SQL Plan for more on this).
This series of 4 posts is an attempt to demystify Clustering Factor itself.

I’ll thoroughly explain its internals, aiming to make the tests comprehensive with focus on clearing up any possible confusion about its meaning and its impact on SQL performance within the three areas: elapsed time, memory and CPU.
This exploration dives “block-dump” deep, using index, index tree and table block dumps, DBMS_ROWID package outputs and analysis of 10046 trace files.
Toward the end, I’ll showcase its potential impact through four simple test scenarios.
Somewhere in the middle I will present some short analysis considering an oddity (well, I thought of it as an oddity. In reality, I just didn’t make enough effort to engage little perception..).

I will be making references to the post about AVG_DATA_BLOCKS_PER_KEY and AVG_LEAF_BLOCKS_PER_KEY– as the CLUF and these two parameters, under certain circumstances show close similarities.

Introduction to Clustering Factor – some theory

This is where the story begins.
For starter, let’s reach out to the Oracle 19c documentation:

CLUSTERING_FACTOR:

Indicates the amount of order of the rows in the table based on the values of the index.

If the value is near the number of blocks, then the table is very well ordered. In this case, the index entries in a single leaf block tend to point to rows in the same data blocks.

If the value is near the number of rows, then the table is very randomly ordered. In this case, it is unlikely that index entries in the same leaf block point to rows in the same data blocks.

For bitmap indexes, this column is not applicable.

Noteworthy, the above excerpt, although clearly emphasizing the “rows in the table“, it can be found in the doc. of DBA_INDEXES (or DBA_IND_PARTITIONS, etc).
Reason being, this statistic cannot be and should not be treated in separation from the underlying table.

Let’s make up an example.
Your SQL plan picks an index block range scan and needs to read, say, 4 index blocks for a value of MYSTRING = ‘ABC’.
Then it calculates the rowids and jumps to the table’s blocks.
Now, ideally (almost impossible) if it required just 4 table blocks in order to satisfy that range scan to retrieve the selected values (not being present in that index) from that table, e.g. some COLOR column.
So, again keeping it simple – ideally, if the entire SQL would need just 9 physical reads (I added the root index block: 1 + 4 + 4).

But it also may need many more physical reads from the corresponding table, say, 16.
That means, in such case the values (present in the index) must be scattered around in the table’s blocks, so they are not so densely packed (sorted along the index).
With such assumption (16 blocks of a table) that case the query would need 21 physical reads (1 + 4 + 16).

To be more pictorial, consider a query :
select COLOR from Table where MYSTRING = ‘ABC’ ;
Looking at my masterpiece below and assuming index range scan, this query will need to visit just one table block in order to retrieve the values of COLOR based on the predicate MYSTRING = ‘ABC’ since all required data is located in just one table block :

Data in the index leaf block (obviously sorted):
________________
|              |
|rowidx ABC    |
|rowidy ABC    |
|rowidz ABC    |
|ABD           |
|ABE           |
|ABE           |
|ABF           |
|______________|
 
Corresponding table block (block #n)
______________________
|                    |
|rowidx ABC 1 Blue   |
|rowidz ABC 2 Black  |
|rowidy ABC 2 Grey   |
|(....) ABD 1 White  |
|…                   |
|…___________________|

Now, sticking to this scenario and assuming scattered data in the table below, the same query:
select COLOR from Table where MYSTRING = ‘ABC‘ ;
will now need three physical reads from the table “Table” in order to fetch the values of column COLOR based on the same predicate ‘ABC’ :

Data in the index leaf block (index is sorted – nothing changes here):
 ______________
|              |
|rowidx ABC    |
|rowidy ABC    |
|rowidz ABC    |
|ABD           |
|ABE           |
|ABE           |
|ABF           |
|______________|
 
 
But the corresponding data in the table now spans 3 blocks :
 
table block #5
 ____________________
|                    |
|XYZ 1 Green         |
|rowidx ABC 2 Black  |
|AVF 5 Brown         |
|ABD 1 White         |
|____________________|
 
table block #804
 ____________________
|                    |
|rowidz ABC 1 Blue   |
|HGZ 0 Purple        |
|ZUF 9 Brown         |
|ABD 1 White         |
|____________________|
 
table block #14264
 ____________________
|                    |
|XRT 4 Pink          |
|ABR 3 Black         |
|AVU 8 Brown         |
|rowidy ABC 2 Grey   |
|____________________|

The rows in the table containing ‘ABC’ are no longer so nicely accommodated – they are scattered around these three blocks in the table.
Note, this circumstance may lead to even less optimistic scenario: these blocks themselves don’t need to be adjacent on disk, they may be “away” from one another. This will surely worsen the situation, since – without getting into details of e.g. possible caching on storage level, etc. – while being read directly from HDD, such reads may involve the prolonged HDD head move and seek time.

I should not have mentioned that ..
Continuing introductory word: the tests carried out in the subsequent parts are essentially very detailed mirror (on the block dump level) of the above “masterpiece”.
Meaning, you can basically skip the rest of this series of articles if the above explanation managed to clarify the possible doubts of “what Clustering Factor is” and you simply were looking for the explanation without the proof and the estimation of its impact on various resource areas.

One last word and I will cease the theoretical part.
Taking opportunity, let’s clear out another fairly common misunderstanding that rebuilding of the index will help to lower the CLUF.
Unfortunately, it won’t.
All in all what the rebuilding of the index might bring? The problem at the first place lies in the table, not the index – it’ll be clear while looking at various dumps, SQL Plans and traces in the next parts.
Anticipating, only the change (improving the ordering) carried out on the level of the underlying table can truly lower the CLUF so hopefully improve the performance.
There is still a trap in that statement above.
Think e.g. of recreating of an affected table and most of all: affected column with such CTAS:
CREATE TABLE_NOW_LOWCLUF as SELECT * from TABLE_HIGHCLUF order by HIGH_CLUF_COL.
No doubt, the HIGH_CLUF_COL will turn into low (actually: perfect) CLUF, but what about the other (indexed) columns whose CLUF was fairly low up to now?
They may end up pretty messy.
All that above will hopefully be obvious soon enough beyond any doubt.

Clustering factor – part2

Piotr Sajda Oracle Tuning Blog

Clustering Factor (CLUF) – the anatomy and its impact on performance – part1

Clustering Factor – theory

Leave a Reply Cancel reply

All tags

Latest posts