This post contains 3 parts.
Recently I witnessed an interesting case which gave a headache to the application owners and the business.
Namely, one of the important reports which used to run smoothly all of a sudden stopped behaving.
Talking numbers, something that people used to get within a few seconds (up to 8) unexpectedly took close to 2 hours.
To add a bit of flavor, all of this took place on a fat Exadata platform with 4 nodes dedicated exclusively to this database, each instance assigned a large portion of SGA (close to 0.5TB) and decent PGA – you’d say, “what possibly could go wrong?!”
If I told you that the large SGA indirectly caused that problem, you’d hardly believe it, would you?
While I can’t share too many details, I will simulate this scenario in a test environment (also on Exadata). Although my SGA won’t come close to the one in question, the core nature of the issue remains the same.
INTRO:
In short, the query which used to run smoothly was handled with a very efficient offload, so a full table scan (accompanied by direct path read) must have been involved.
The predicate which made that offload so efficient was on a DATE column, so you may assume the query looked something like this:
SELECT *
FROM TABLE T
WHERE T.coldate BETWEEN :date1 AND :date2
While the actual SQL was more complex, the essence of it could be simplified to the example above.
The time range between date1 and date2 represented just a small fraction (not even 1%) of the total data volume.
You’d probably immediately spot the opportunity, “so create an index on coldate”.
Well, yes/no.
Why “no”?
Imagine concurrent insert like this executed by 20 sessions at the same time:
INSERT INTO TABLE (coldate) VALUES (SYSDATE)
How would that index be populated? Nearly every such INSERT would generate the same SYSDATE value (DATE has a 1-second granularity), so nearly all of these 20 sessions would compete for the same right-most index leaf block, each trying to acquire an exclusive lock on it.
One second later there would be the same problem, as SYSDATE grows monotonically (with a 1-second tick).
This issue is often referred to as the “right-most leaf growing index” problem caused by monotonically increasing values being inserted concurrently.
I am sure that you’ll be able to google it quickly if you use any reasonable combination of these keywords.
The situation becomes significantly worse when these inserts are executed across RAC nodes, as the hot block may then need (and it surely will need) to be transferred over the interconnect to the requesting RAC node.
Note, this behavior or lack thereof is one of the key factors that distinguishes a RAC-aware application from one that is not.
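If you want to check whether you are hitting this pattern, a quick way (it requires the Diagnostics Pack) is to look at recent ASH samples for the tell-tale wait events; the time window and the event list below are just an illustrative starting point:
-- illustrative check for right-most leaf block contention symptoms (last hour)
select inst_id, event, count(*) as samples
from   gv$active_session_history
where  sample_time > sysdate - 1/24
and    event in ('buffer busy waits',
                 'enq: TX - index contention',
                 'gc buffer busy acquire',
                 'gc buffer busy release')
group  by inst_id, event
order  by samples desc;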
Fine then – let’s create a hash-partitioned index on coldate.
No, that won’t work either: values from the same 1-second tick still hash to the same index leaf block, so the contention remains.
A reverse key index? The same story, not to mention the drawback that range predicates can no longer be serviced by an index range scan on such an index.
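Just to visualize the two variants dismissed above, here is roughly what they would look like (table and index names are placeholders, and you would obviously create at most one of them):
-- global hash-partitioned index: rows inserted within the same second carry the same
-- coldate value, hash to the same partition and hence to the same hot leaf block
CREATE INDEX mytab_coldate_hix ON mytab (coldate)
  GLOBAL PARTITION BY HASH (coldate) PARTITIONS 16;

-- reverse key index: identical values still end up in the same leaf block,
-- and range predicates on coldate can no longer use an index range scan
CREATE INDEX mytab_coldate_rix ON mytab (coldate) REVERSE;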
What could help, though not completely, is redesigning the application to use TIMESTAMP instead of DATE, with all the drawbacks of such a change.
A bigger index and a bigger table, to name just two.
I can also imagine the business-related hassle behind such a change.
Anyway, such a redesign would generate more distinct (but still granular) TIMESTAMP values and thus reduce the contention to some extent.
Another viable approach would be to direct all INSERTs to a single RAC node. While the contention issue would still exist, this would at least eliminate the need for inter-node block transfers, reducing gc-traffic.
Since this isn’t the main focus of this post, I will skip further digressions.
As you can see, this “SYSDATE caveat” seems relatively tough to handle. If anyone happens to have any good suggestions, feel free to share.
What’s important to highlight, and this should be the key takeaway of this introduction, is that Exadata handles that problem pretty well, as it needs no indexes to tackle the issue.
I will also largely skip another extremely useful Exadata feature, the storage index. While still being called an “index”, it produces no contention hassle whatsoever.
The storage index is a broader topic which I need to skip here; for simplicity, you may consider the way a storage index is utilized to be somewhat similar to partition pruning.
Now, let’s come back to the main topic and let us start from the basics.
The offload happens under the following circumstances:
there must be a full table (or index) scan *and* the underlying segment needs to be accessed by means of a “direct path read”, i.e. bypassing the buffer cache.
What happens, then, when there are enough blocks cached in the SGA?
The engine decides at runtime (the CBO has already done its job by then!) to revert to block shipping (SGA access).
In such a case there will be no direct path read, and thus no chance for the offload to kick in.
Intuitively, the bigger the SGA, the higher the chances of caching large segments in larger portions, and, what’s worse, across nodes if this is RAC.
That would likely not happen quickly. It may take days, weeks or perhaps even months, but the breaking point (i.e. enough cached blocks) may ultimately be reached.
Once that “breaking point” is reached, the affected query will start crawling without a warning.
Why “without a warning”? I need to emphasize it again: such a query, even if already parsed and with the parsed cursor present in the shared pool, will still have this decision re-evaluated every time it is executed.
Should the caching state of the segment of interest change (enough), you may be caught out by vastly prolonged execution times.
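As a side note, a quick and only indicative way to check whether executions of a given cursor have recently benefited from the offload is to look at the offload-related columns in V$SQL (the &sql_id below is just a placeholder):
-- indicative only: non-zero eligible bytes combined with low returned/interconnect
-- bytes suggest the executions of this child cursor were offloaded
select sql_id, child_number,
       io_cell_offload_eligible_bytes as eligible_bytes,
       io_cell_offload_returned_bytes as returned_bytes,
       io_interconnect_bytes          as interconnect_bytes
from   v$sql
where  sql_id = '&sql_id';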
Let’s go through the entire test scenario and the content of this three-part post:
- I will create a test table TESTOF of size ~13GB.
- I will run a query against it retrieving just a fraction of it (efficient filtering with an FTS serviced by direct path read, invoking the offload). As a result, the query will be back in ~1 sec. During that run I will enable tracing of the nsmtio sub-component of rdbms and show the decision paths and options which are considered before accessing an object with an FTS on Exadata. I will support the analysis with what its creator (Tanel Poder) calls the Most Advanced Oracle Troubleshooting Script on the Planet (and I can’t really argue with that).
- Then I will force caching of the TESTOF table while reading it with a full table scan; as a result ~4.6GB will get cached (one possible way of achieving this is sketched right after this list). Caching is good, isn’t it…?
- Next, I will run the same query and find that… it took 27 sec.
- Subsequently, I will again enable tracing of nsmtio and show that too much caching may be (and in my case it was) an evil.
- Then we will take a look at the output of the 10046 trace and will see why it took so long.
- Next, I will force caching of roughly the same volume of TESTOF in the buffer cache, this time by means of single block reads, and will run the same test query. Poor performance will still be observed, but with an interesting difference compared to the earlier run after caching via the full table scan.
- Lastly (this is to be found in part 3 of this post) I will suggest a method to set up a trap to intercept the sql_id of interest and establish the reason why the offload did or did not take place.
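For reference, one possible way (not necessarily the exact method used in this test) to coax a full table scan into going through the buffer cache is to disable serial direct path reads at session level via the hidden parameter _serial_direct_read:
-- hidden parameter, use on a test system only; values: AUTO (default), NEVER, ALWAYS
alter session set "_serial_direct_read" = never;

-- a full scan now goes through the buffer cache, populating it with TESTOF blocks
select /*+ full(T) */ count(*) from TESTUS.TESTOF T;

-- back to the default behavior
alter session set "_serial_direct_read" = auto;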
I have populated the test table (with a CTAS) so as to achieve a data distribution similar to this:
OWNER TABLE_NAME COL_ID COLUMN_NAME DATA_TYPE NDV DENSITY NUM_NULLS AVG_COL_LEN LOW_VAL HI_VAL HISTOGRAM NUM_BUCKETS LAST_ANALYZED SAMPLE_SIZE
------------------------- ------------------------------ ------ ------------------------------ ------------ ------------ ----------- ---------- ----------- ----------------------------------- ----------------------------------- --------------- ----------- -------------------- ------------
TESTUS TESTOF 1 ID NUMBER 5,000,000 .00000020 0 6 1 5000000 NONE 1 21-05-2023 21:00:17 5,000,000
TESTUS 2 MYDATE DATE 6,751 .00010200 0 8 2022-12-14/15:45:21 2023-01-31/13:31:36 HYBRID 254 21-05-2023 21:00:17 5,413
TESTUS 3 PADDING VARCHAR2 1 1.00000000 0 2049 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX NONE 1 21-05-2023 21:00:17 5,000,000
CTAS (not exactly the command I used, but it is good enough to achieve the result):
Create Table TESTOF
AS
select rownum as ID, sysdate-mod(rownum,365) as MYDATE, rpad('X',2048,'X') as PADDING
from dual
connect by level <= 5e6 ;
It is important to stick to a similar data distribution of ID, as we’ll be filtering 10,000 rows out of 5 million.
If you are about to repeat this test, just pay attention to keep proportions similar to the ones below.
My SGA size: sga_target = 20G.
My table size: ~13GB
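If you want to verify your own proportions, a simple sketch like this should do (object names as used in this test, adjust to your environment):
-- SGA target vs. size of the test table segment
select (select display_value
        from   v$parameter
        where  name = 'sga_target')        as sga_target,
       (select round(sum(bytes)/1024/1024)
        from   dba_segments
        where  owner = 'TESTUS'
        and    segment_name = 'TESTOF')    as testof_mb
from   dual;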
DDL:
CREATE TABLE "TESTUS"."TESTOF"
( "ID" NUMBER,
"MYDATE" DATE,
"PADDING" VARCHAR2(2048)
) SEGMENT CREATION IMMEDIATE
PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255
NOCOMPRESS LOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
TABLESPACE "TS01"
Size of the table TESTOF:
SEGMENT_NAME SEGMENT_TYPE TABLESPACE_NAME SEG_MB LOBSIZE_MB TOTAL_MB
------------------------------------------------------------ ------------------ ------------------------------ ------------ ------------ ------------
TESTUS.TESTOF TABLE TS01 13,047 13,047
With an emptied buffer cache (see the @fb script), I am running my test SQL:
select /*+ full (T) */ * from TESTUS.TESTOF T where ID <= 10e3 ;
20:03:17 SYS@MYTEST1->PDBXX> @off03
20:03:19 SYS@MYTEST1->PDBXX>
20:03:19 SYS@MYTEST1->PDBXX> @fb
20:03:19 SYS@MYTEST1->PDBXX> set echo on
20:03:19 SYS@MYTEST1->PDBXX> alter system flush buffer_cache ;
System altered.
Elapsed: 00:00:00.28
20:03:20 SYS@MYTEST1->PDBXX>
20:03:20 SYS@MYTEST1->PDBXX> set echo off
INST_ID SID SERIAL# OSPID PID TRACEFILE
------- ----------- ------- ---------- ---------- -------------------------------------------------------------------------
1 2428 45574 239589 302 /u01/app/oracle/diag/rdbms/MYTEST/MYTEST1/trace/MYTEST1_ora_239589.trc
Elapsed: 00:00:00.07
20:03:20 SYS@MYTEST1->PDBXX>
20:03:20 SYS@MYTEST1->PDBXX> ALTER SESSION SET TRACEFILE_IDENTIFIER='&trfile_suffix';
Enter value for trfile_suffix: ZEROCACHE
old 1: ALTER SESSION SET TRACEFILE_IDENTIFIER='&trfile_suffix'
new 1: ALTER SESSION SET TRACEFILE_IDENTIFIER='ZEROCACHE'
Session altered.
Elapsed: 00:00:00.00
20:03:35 SYS@MYTEST1->PDBXX>
20:03:35 SYS@MYTEST1->PDBXX>
20:03:35 SYS@MYTEST1->PDBXX> alter session set events 'trace[nsmtio]';
Session altered.
Elapsed: 00:00:00.00
20:03:35 SYS@MYTEST1->PDBXX>
20:03:35 SYS@MYTEST1->PDBXX> @ssqlid
20:03:35 SYS@MYTEST1->PDBXX> set echo on
20:03:35 SYS@MYTEST1->PDBXX> set feedback on sql_id
20:03:35 SYS@MYTEST1->PDBXX> set echo off
20:03:35 SYS@MYTEST1->PDBXX> SET AUTOTRACE TRACE EXPLAIN STATISTICS
20:03:35 SYS@MYTEST1->PDBXX>
20:03:35 SYS@MYTEST1->PDBXX>
20:03:35 SYS@MYTEST1->PDBXX> select /*+ full (T) */ * from TESTUS.TESTOF T where ID <= 10e3 ;
10000 rows selected.
SQL_ID: 8s6pdjnxuzsx3
Elapsed: 00:00:00.21
SQL_ID: g72kdvcacxvtf
Execution Plan
----------------------------------------------------------
SQL_ID: 7xtbscac8c0ta
Plan hash value: 3294238222
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10000 | 19M| 289K (1)| 00:00:12 |
|* 1 | TABLE ACCESS STORAGE FULL| TESTOF | 10000 | 19M| 289K (1)| 00:00:12 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - storage("ID"<=10e3)
filter("ID"<=10e3)
SQL_ID: 3s1hh8cvfan6w
SQL_ID: g72kdvcacxvtf
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
1666677 consistent gets
1666669 physical reads
0 redo size
318414 bytes sent via SQL*Net to client
7750 bytes received via SQL*Net from client
668 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
10000 rows processed
MYTEST1_ora_239589_ZEROCACHE.trc (relevant sections of the output of trace[nsmtio]) :
Partial short call stack signature: 0xd57df041b0226129
NSMTIO: kcbism: islarge 1 next 0 nblks 1668786 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 4368 keep_nb 0 nbuf 945194 kcbnwp 8 pstt 0 Exadata 1 bsz 8192 objd 75105
NSMTIO: kcbimd: nblks 1668786 kcbstt 4368 kcbnbh 94519 kcbisdbfc 3 is_medium 0 objd 0
NSMTIO: kcbivlo: nblks 1668786 vlot 500 pnb 945194 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 8, objd: 75105, objn: 75105
ckpt: 1, nblks: 1668786, ntcache: 0, ntdist:0, Exadata:1, bsz:8192, rule 0, threshold 104857600, rule_enabled 0
Direct Path for pdb 3 tsn 8 objd 75105 objn 75105
Direct Path 1 ckpt 1, nblks 1668786 ntcache 0 ntdist 0
Direct Path mndb 0 tdiob 6 txiob 0 tciob 123
Direct path diomrc 128 dios 2 kcbisdbfc 0
kcbdpc: kx 1.500000 kc 2.000000 lhs 3337572.000000 rhs_ckpt 1668786.000000 rhs_dp 1668786.000000 rio 1.000000 cached 0 Exadata 1 bsz 8192 offload_rule 0 threshold 104857600 rule_enabled 0
NSMTIO: Additional Info: VLOT=4725970
Object# = 75105, Object_Size = 1668786 blocks
SqlId = 8s6pdjnxuzsx3, plan_hash_value = 3294238222, Partition# = 0
There are several interesting points in the trace file produced by: alter session set events 'trace[nsmtio]';
I will skip most of them and highlight two which are important for us, namely:
ntcache: the number of blocks of the object cached in the buffer cache (here: 0)
NSMTIO: kcbdpc:DirectRead: tsn: 8, objd: 75105, objn: 75105
clearly, direct path read was chosen over buffer cache access. Consequently, the offload took place.
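If you would like to see for yourself how many blocks of the segment are currently cached (which is essentially what ntcache reflects), a sketch like the one below can help; on RAC you would query GV$BH to cover all instances:
-- count of TESTOF blocks currently cached in this instance's buffer cache
select count(*) as cached_blocks
from   v$bh
where  objd = (select data_object_id
               from   dba_objects
               where  owner = 'TESTUS'
               and    object_name = 'TESTOF')
and    status <> 'free';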
Now, why was it so fast?
After all, that table is about 13GB in size, yet the query came back well below 1 sec.
I will use the invaluable Snapper by Tanel Poder and pick out the interesting excerpts.
Taking a glance at the STAT(istics) portion:
Firstly, apparently all the physical reads were satisfied from flash cache.
Secondly, apparently the storage index had been built earlier, which can often be quite a large contributor to the time reduction.
Were we lucky? Not really.
You can be sure that over time your (entire) segment will end up in flash cache and stay there virtually for good, and the storage index will get created as well.
As for the latter, I have to admit it was a bit of luck. But even without it, the elapsed time is still extremely short (still around 1 sec).
How to check it? Try experimenting with disabling the use of the storage index and re-run your SQL:
SQL> alter session set "_kcfis_storageidx_disabled" = true ;
Most of all, the offload efficiency of the smart scan is brilliant: close to 99%, calculated as:
[ 100 - "cell physical IO interconnect bytes returned by smart scan" / "cell physical IO bytes eligible for predicate offload" * 100 ].
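For reference, the same ratio can be computed for your own session straight from v$mystat, a simple sketch using the two statistic names above:
-- offload efficiency of the current session, in percent
select round(100 - 100 * ret.value / nullif(elig.value, 0), 2) as offload_efficiency_pct
from   (select s.value
        from   v$mystat s join v$statname n on n.statistic# = s.statistic#
        where  n.name = 'cell physical IO interconnect bytes returned by smart scan') ret,
       (select s.value
        from   v$mystat s join v$statname n on n.statistic# = s.statistic#
        where  n.name = 'cell physical IO bytes eligible for predicate offload') elig;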
Taking a look at the WAIT section, the actual work can be attributed to basically three wait events, the largest contributor being SQL*Net message from client, which can’t really be considered a wait here, as it covers the FETCH phase (i.e. returning the 10,000 rows back to me).
I shortened that considerably by setting SET AUTOTRACE TRACE EXPLAIN STATISTICS: the query obviously still got executed, but I didn’t have to wait for all the rows to roll over my screen.
Output of Snapper (the output tailored to our needs):
SYS@MYTEST1->PDBXX> @sn4 all 10 1 2428
Sampling SID 2428 with interval 10 seconds, taking 1 snapshots...
-- Session Snapper v4.34 - by Tanel Poder ( https://tanelpoder.com/snapper ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! 🙂
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SID @INST, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH , NUM_WAITS, WAITS/SEC, AVERAGES
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2428 @1, SYS , STAT, physical read total IO requests , 13047, 1.2k, , , , , 1.05M bytes per request
2428 @1, SYS , STAT, physical read total multi block requests , 13039, 1.2k, , , , , 8 total single block read requests
2428 @1, SYS , STAT, physical read requests optimized , 13047, 1.2k, , , , , 652.35 per execution
2428 @1, SYS , STAT, physical read total bytes optimized , 13653336064, 1.25G, , , , , 682.67M per execution
2428 @1, SYS , STAT, physical read total bytes , 13653336064, 1.25G, , , , , 682.67M per execution
…
2428 @1, SYS , STAT, cell physical IO bytes saved by storage index , 13625532416, 1.25G, , , , , 681.28M per execution
2428 @1, SYS , STAT, cell physical IO bytes eligible for predicate offload , 13653336064, 1.25G, , , , , 682.67M per execution
2428 @1, SYS , STAT, cell physical IO bytes eligible for smart IOs , 13653336064, 1.25G, , , , , 682.67M per execution
2428 @1, SYS , STAT, cell physical IO bytes saved by storage index , 13625532416, 1.25G, , , , , 681.28M per execution
2428 @1, SYS , STAT, cell physical IO interconnect bytes returned by smart scan, 21425328, 1.97M, , , , , 1.07M per execution
..
2428 @1, SYS , STAT, cell flash cache read hits , 36, 3.31, , , , , 1.8 per execution
2428 @1, SYS , STAT, cell flash cache read hits for smart IO , 36, 3.31, , , , , 1.8 per execution
..
2428 @1, SYS , TIME, hard parse elapsed time , 1705, 156.68us, .0%, [ ], , ,
2428 @1, SYS , TIME, repeated bind elapsed time , 22, 2.02us, .0%, [ ], , ,
2428 @1, SYS , TIME, parse time elapsed , 2047, 188.11us, .0%, [ ], , ,
2428 @1, SYS , TIME, PL/SQL execution elapsed time , 2893, 265.85us, .0%, [ ], , ,
2428 @1, SYS , TIME, DB CPU , 145496, 13.37ms, 1.3%, [@ ], , ,
2428 @1, SYS , TIME, sql execute elapsed time , 136088, 12.51ms, 1.3%, [# ], , ,
2428 @1, SYS , TIME, hard parse (sharing criteria) elapsed time , 1712, 157.32us, .0%, [ ], , ,
2428 @1, SYS , TIME, DB time , 145473, 13.37ms, 1.3%, [# ], , , 4.71 % unaccounted-for time*
2428 @1, SYS , WAIT, enq: KO - fast object checkpoint , 1234, 113.4us, .0%, [ ], 3, .28, 411.33us average wait
2428 @1, SYS , WAIT, cell smart table scan , 9988, 917.83us, .1%, [ ], 93, 8.55, 107.4us average wait
2428 @1, SYS , WAIT, SQL*Net message to client , 1019, 93.64us, .0%, [ ], 675, 62.03, 1.51us average wait
2428 @1, SYS , WAIT, SQL*Net message from client , 10224654, 939.58ms, 94.0%, [WWWWWWWWWW], 675, 62.03, 15.15ms average wait
2428 @1, SYS , WAIT, events in waitclass Other , 4058, 372.9us, .0%, [ ], 86, 7.9, 47.19us average wait
Keep reading here