Lack of offload with large SGA (part 3)

Note: this is the continuation of the previous post.

A natural question arises: how can I set a trap for a specific SQL_ID which causes issues?

Update: near the end of this part 3 I have also included some immediate solutions to mitigate the problem, supported by two other posts.

The goal is to intercept an execution of a given SQL_ID, produce the trace, and examine the content of the NSMTIO trace file, specifically navigating to the excerpt which tells us what decision (lack of “direct path read”?) was taken at runtime. Last but not least: which reason was behind it.
In addition, since such a trace file can be pretty large, we need to limit its output (size) and stop processing once we are able to establish what we are looking for.
I am referring to the case when the decision was made to revert to block shipping (so without applying “direct path read”). In such a case the trace file reports the populating and/or reading of/from the buffer cache. The file seems to contain an entry for each such call, which may produce an enormous size, e.g.

*** 2024-09-10T21:45:50.219573+01:00 (PDBXXX(3))

NSMTIO: kcbzib: Cache Scan triggered for tsn = 8, rdba=0x04892c49, fn = 18, kobjd = 75105, block cnt = 951, noncontig = FALSE pre-warm = FALSE, prefetch = FALSE
NSMTIO: kcbzib: Cache Scan triggered for tsn = 8, rdba=0x04892c91, fn = 18, kobjd = 75105, block cnt = 879, noncontig = FALSE pre-warm = FALSE, prefetch = FALSE
NSMTIO: kcbzib: Cache Scan triggered for tsn = 8, rdba=0x04892ca4, fn = 18, kobjd = 75105, block cnt = 860, noncontig = FALSE pre-warm = FALSE, prefetch = FALSE

(many more lines…)

NSMTIO: kcbzib: Cache Scan triggered for tsn = 8, rdba=0x05075c88, fn = 20, kobjd = 75105, block cnt = 1400, noncontig = FALSE pre-warm = FALSE, prefetch = FALSE
NSMTIO: kcbzib: Cache Scan triggered for tsn = 8, rdba=0x05075d08, fn = 20, kobjd = 75105, block cnt = 1272, noncontig = FALSE pre-warm = FALSE, prefetch = FALSE

There may obviously be many thousands of entries like the ones above, yet they do not bring anything valuable to our analysis.


Anyhow, here is an example of how to accomplish the goal specified above.

Let’s assume you have a SQL_ID which causes problems (9ur4ytc0n99wr):

21:43:40 SYS@MYTEST1->PDBXXX> @setts
21:43:41 SYS@MYTEST1->PDBXXX> SET AUTOTRACE TRACE EXPLAIN STATISTICS
21:43:42 SYS@MYTEST1->PDBXXX>
21:43:42 SYS@MYTEST1->PDBXXX> @ssqlid
21:43:44 SYS@MYTEST1->PDBXXX> set  echo on
21:43:44 SYS@MYTEST1->PDBXXX>  set feedback on sql_id
21:43:44 SYS@MYTEST1->PDBXXX> set echo off
21:43:44 SYS@MYTEST1->PDBXXX>
21:43:44 SYS@MYTEST1->PDBXXX> Select /*+ full (T) */ * from delme.test02 T where ID <= 10e3 ;
 
10000 rows selected.
 
SQL_ID: 9ur4ytc0n99wr --> this is “our problematic SQL”
Elapsed: 00:00:25.72
SQL_ID: g72kdvcacxvtf
 
Execution Plan
----------------------------------------------------------
SQL_ID: 5g1dwm359vh2k
Plan hash value: 3294238222
 
------------------------------------------------------------------------------------
| Id  | Operation                 | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |        | 10000 |    19M|   289K  (1)| 00:00:12 |
|*  1 |  TABLE ACCESS STORAGE FULL| TEST02 | 10000 |    19M|   289K  (1)| 00:00:12 |
------------------------------------------------------------------------------------

Set up a trap to trace the NSMTIO subcomponent, which will intercept the problematic SQL_ID once it is executed, and limit the output:

SQL> alter system set events 'trace[rdbms.nsmtio][sql:9ur4ytc0n99wr] {occurence:start_after 0,  end_after 1000}' ;
 
System altered.
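
If you do not want to arm the trap system-wide, the same event and filter syntax also works at session scope; a minimal sketch (switch it off with the same event set to off once done):

SQL> alter session set events 'trace[rdbms.nsmtio][sql:9ur4ytc0n99wr] {occurence:start_after 0, end_after 1000}' ;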

You may want to change to the “Diag Trace” directory and probe for the new trace file (the trace file will contain our sql_id: 9ur4ytc0n99wr).
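
If you are not sure where that directory is for your instance, a quick query against v$diag_info returns the path; the find loop below then probes it for fresh trace files mentioning our sql_id:

SQL> select value from v$diag_info where name = 'Diag Trace' ;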

[oracle @ MYTEST1 @ testhost ]$ # while true; do find . -name "*.trc" -mmin -2 -exec grep -n 9ur4ytc0n99wr {} \; -ls ; echo "-----------------------"; sleep 5; done
110:SqlId = 9ur4ytc0n99wr, plan_hash_value = 3294238222, Partition# = 0
 12182   12 -rw-r-----   1 oracle   asmadmin    10574 Sep 10 21:38 ./MYTEST1_ora_106790.trc
44:SqlId = 9ur4ytc0n99wr, plan_hash_value = 3294238222, Partition# = 0
 12334  464 -rw-r-----   1 oracle   asmadmin   471996 Sep 10 21:45 ./MYTEST1_ora_151789.trc
37:SqlId = 9ur4ytc0n99wr, plan_hash_value = 3294238222, Partition# = 0
 12248    4 -rw-r-----   1 oracle   asmadmin     2142 Sep 10 21:42 ./MYTEST1_ora_136475.trc
-----------------------
110:SqlId = 9ur4ytc0n99wr, plan_hash_value = 3294238222, Partition# = 0
 12182   12 -rw-r-----   1 oracle   asmadmin    10574 Sep 10 21:38 ./MYTEST1_ora_106790.trc
44:SqlId = 9ur4ytc0n99wr, plan_hash_value = 3294238222, Partition# = 0
 12334  464 -rw-r-----   1 oracle   asmadmin   471996 Sep 10 21:45 ./MYTEST1_ora_151789.trc
37:SqlId = 9ur4ytc0n99wr, plan_hash_value = 3294238222, Partition# = 0
 12248    4 -rw-r-----   1 oracle   asmadmin     2142 Sep 10 21:42 ./MYTEST1_ora_136475.trc
-----------------------

As you can see, I executed that SQL three times.
Let’s check two of them.

MYTEST1_ora_106790.trc :

NSMTIO: kcbism: islarge 1 next 0 nblks 1668786 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 4368 keep_nb 0 nbuf 945194 kcbnwp 8 pstt 0 Exadata 1 bsz 8192 objd 75105
NSMTIO: kcbimd: nblks 1668786 kcbstt 4368 kcbnbh 94519 kcbisdbfc 3 is_medium 0 objd 0
NSMTIO: kcbivlo: nblks 1668786 vlot 500 pnb 945194 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 8, objd: 75105, objn: 75105
ckpt: 1, nblks: 1668786, ntcache: 350933, ntdist:1052799, Exadata:1, bsz:8192, rule 0, threshold 104857600, rule_enabled 0\nDirect Path for pdb 3 tsn 8  objd 75105 objn 75105
Direct Path 1 ckpt 1, nblks 1668786 ntcache 350933 ntdist 1052799
Direct Path mndb 0 tdiob 6 txiob 0 tciob 134
Direct path diomrc 128 dios 2 kcbisdbfc 0
kcbdpc: kx 1.500000 kc 2.000000 lhs 2109306.500000 rhs_ckpt 1668786.000000 rhs_dp 1668786.000000 rio 1.000000 cached 1403732 Exadata 1 bsz 8192 offload_rule 0 threshold 104857600 rule_enabled 0
NSMTIO: Additional Info: VLOT=4725970
Object# = 75105, Object_Size = 1668786 blocks
SqlId = 9ur4ytc0n99wr, plan_hash_value = 3294238222, Partition# = 0

Direct Path Read was in place (NSMTIO: kcbdpc:DirectRead) and indeed, the elapsed time for that execution was below 1 sec (with a very efficient offload).
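
As a quick cross-check, the offload-related columns of v$sql can confirm that the cell offload actually kicked in for that cursor; a minimal sketch:

SQL> select sql_id, io_cell_offload_eligible_bytes, io_interconnect_bytes from v$sql where sql_id = '9ur4ytc0n99wr' ;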

Then I populated the buffer cache as I did before and executed the SQL again.
The trace file produced was MYTEST1_ora_151789.trc.
After a short glance it is clear there were enough buffers in the buffer cache, so the direct path read was not engaged (NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]) and therefore no offload took place (with the elapsed time close to 25 sec).
The relevant section of the trace file is below:

MYTEST1_ora_151789.trc :

NSMTIO: kcbism: islarge 1 next 0 nblks 1668786 type 2, bpid 3, kcbisdbfc 0 kcbnhl 16384 kcbstt 4368 keep_nb 0 nbuf 945194 kcbnwp 8 pstt 0 Exadata 1 bsz 8192 objd 75105
NSMTIO: kcbimd: nblks 1668786 kcbstt 4368 kcbnbh 94519 kcbisdbfc 3 is_medium 0 objd 0
NSMTIO: kcbivlo: nblks 1668786 vlot 500 pnb 945194 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 8, objd: 75105, objn: 75105
ckpt: 0, nblks: 1668786, ntcache: 757528, ntdist:911258, Exadata:1, bsz:8192, rule 0, threshold 104857600, rule_enabled 0\nDirect Path for pdb 3 tsn 8  objd 75105 objn 75105
Direct Path 0 ckpt 0, nblks 1668786 ntcache 757528 ntdist 911258
Direct Path mndb 0 tdiob 6 txiob 0 tciob 134
Direct path diomrc 128 dios 2 kcbisdbfc 0
kcbdpc: kx 1.500000 kc 2.000000 lhs 1366887.000000 rhs_ckpt 1668786.000000 rhs_dp 1668786.000000 rio 1.000000 cached 1668786 Exadata 1 bsz 8192 offload_rule 0 threshold 104857600 rule_enabled 0
NSMTIO: Additional Info: VLOT=4725970
Object# = 75105, Object_Size = 1668786 blocks
SqlId = 9ur4ytc0n99wr, plan_hash_value = 3294238222, Partition# = 0


Once done, disable the tracing :

SQL> alter system set events 'trace[rdbms.nsmtio][sql:9ur4ytc0n99wr] off' ;


Worth mentioning: the problem described in these posts does not seem to be that infrequent.
Just some days back I got a message from Osman DİNÇ reporting the same issue occurring under slightly different circumstances, yet with the same final result – the offload was expected and had proven to be more efficient, but it did not kick in.
Osman described the circumstances he experienced in his two articles, which also offer immediate solutions.
Here they are:
Direct Path Read Decision in Oracle: Beyond the Optimizer
Optimizing Direct Path Read Decision with Table_Stats Hint


Solutions? – Briefly.

In a nutshell, any way of forcing a segment to be accessed with direct path read may be a viable approach (note that “direct path read” is still not a guarantee for the offload to kick in – see below).
You just need to adjust the method to your needs and/or possible constraints.
It could be running the query in parallel (a degree of 2 is enough for our purpose, and a SQL Patch will do it), or forcing the serial direct path read by setting _serial_direct_read to ALWAYS, e.g. with a trigger or at the session level.
Even fooling the optimizer by setting the number of blocks of the segment of interest to some high value for a given SQL_ID with a SQL Patch will do the trick.
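
For illustration, a minimal sketch of the session-level variant: force serial direct path reads just for the current session, run the problematic statement, then revert to the default AUTO heuristic (note that _serial_direct_read is a hidden parameter, so it is best to check with Oracle Support before relying on it).

SQL> alter session set "_serial_direct_read" = always ;
SQL> -- ... run the problematic statement here ...
SQL> alter session set "_serial_direct_read" = auto ;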

The approach of using a SQL Patch to inject hints like PARALLEL(2) or modify CBO statistics with table_stats(<table_name> set blocks=N) is thoroughly detailed in Osman’s posts. The impact of applying a SQL Patch is limited to a specific SQL_ID, making it relatively safe.

Regarding PARALLEL(2): As you might expect, the primary goal isn’t necessarily to execute the query in parallel, but rather to enforce direct path reads (since parallel query slaves operate with direct path reads). That’s why the lowest possible level of parallelism, 2, is used.
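
For completeness, here is a minimal sketch of such a SQL Patch, assuming Oracle 12.2 or later (where DBMS_SQLDIAG.CREATE_SQL_PATCH is a documented function accepting a SQL_ID). The patch name used here is arbitrary, and Osman’s posts cover the full treatment, including the table_stats variant.

declare
  l_patch varchar2(128);
begin
  -- attach the PARALLEL(2) hint to our problematic statement only
  l_patch := dbms_sqldiag.create_sql_patch(
               sql_id      => '9ur4ytc0n99wr',
               hint_text   => 'PARALLEL(2)',
               name        => 'force_dpr_9ur4ytc0n99wr',
               description => 'force direct path reads via minimal parallelism');
  dbms_output.put_line('created SQL patch: ' || l_patch);
end;
/

-- drop it once it is no longer needed:
-- exec dbms_sqldiag.drop_sql_patch(name => 'force_dpr_9ur4ytc0n99wr')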

Last but not least.
If your SGA is in fact that huge, it may be worth reconsidering its size. Reducing it could help minimize the risk of gradually accumulating cached buffers until the threshold that flips the direct path read decision to CACHE_READ is crossed (as we’ve seen, buffers from index range scans in particular tend to persist for quite some time).
After all, I/O operations on Exadata (particularly from Flash Cache and PMEM-based I/O) are highly efficient, and reallocating some of that memory to the PGA might be a better strategy.

Speaking of which (the “which” being the PGA and Flash Cache in this context): if you’re dealing with queries that require a significant amount of PGA for operations like sorting, aggregation, hash joins, unions or distinct operations, and you’ve exhausted your tuning options or aren’t allowed to adjust the queries, take a look at my other article on creatively using Flash Cache as TEMP storage. It includes an analysis of the potential gains from this strategy.
Overall, it outlines an approach for identifying the most resource-intensive areas of your query (ASH and the wait events are utilized) and helps assess the potential improvements from tuning. Hopefully, it’ll spark some new ideas.


Some theory for the end..

This should have been brought up at the beginning, but I decided not to beat around the bush too much and to get straight to the point instead.
Anyway, the point of interest throughout the whole analysis was the KXD component:

SQL> oradebug doc component kxd 
 
  KXD                          Exadata specific Kernel modules (kxd)
    KXDAM                      Exadata Disk Auto Manage (kxdam)
    KCFIS                      Exadata Predicate Push (kcfis)
    NSMTIO                     Trace Non Smart I/O (nsmtio)
    KXDBIO                     Exadata Block level Intelligent Operations (kxdbio)
    KXDRS                      Exadata Resilvering Layer (kxdrs)
    KXDOFL                     Exadata Offload (kxdofl)
    KXDMISC                    Exadata Misc (kxdmisc)
    KXDCM                      Exadata Metrics Fixed Table Callbacks (kxdcm)
    KXDBC                      Exadata Backup Compression for Backup Appliance (kxdbc)

This might seem a bit misleading at first glance: on the one hand we need to determine whether a Smart Scan occurred or not, and why, yet we are tracing “Trace Non Smart I/O”? The reasoning is that tracing the nsmtio subcomponent of the KXD component (Exadata-specific kernel modules) helps reveal the decisions made about how a serial full table scan will be executed.

It’s important to note that if, for example, an index range scan had been chosen during the initial CBO costing phase, nsmtio wouldn’t be invoked at all, and thus no trace file would be generated in the diagnostic directory. So, the Full Table Scan (FTS) is a prerequisite for Smart Scan—but that’s not the whole story.

For Smart Scan to occur, the FTS must be performed using direct path read (bypassing the buffer cache). Why is this necessary? Technically, it’s because the kcfis C function (kernel cache file intelligent storage – in short, “smart scan”) is called by kcbldrget (kernel cache block direct read get), which represents the direct path read access method.

It’s worth noting that while direct path read is a prerequisite for Smart Scan, it’s not a guarantee. The subcomponent NSMTIO of KXD (Exadata-specific kernel modules) is designed to handle “non-smart I/O,” which still involves direct path reads. This is why the term “NSMTIO: Trace Non Smart I/O (nsmtio)” can be misleading, as we’ve observed SQL queries undergoing Smart Scan even when traced under NSMTIO.

One final detail: If you’re curious, you can explore this further by navigating to $ORACLE_HOME/bin and running the strings command on the Oracle binary (maybe not on your PROD database.. : ).

[oracle @ testhost ]$ # strings oracle | grep NSMTIO
[NSMTIO]:qkaBloomPredsSplit: #predicates:%d scan:%x push:%s sizechk:%s dop:%d SmartIO: %s, quota(slave):%dKB, quota(total):%dKB, SqlID: %s
(...)
[NSMTIO]: kcfis_updstats_at_dealloc: Smart I/O was not used for Appstate: %p, phv(rwsid): %d as rdbms_blockio was enabled
[NSMTIO]: kcfis_updstats_at_dealloc: Smart I/O was not used for Appstate: %p, Permflags: 0x%x phv(rwsid): %d due to big payload
[NSMTIO]: kcfis_updstats_at_dealloc: Smart I/O was not used for Appstate: %p, Permflags: 0x%x, phv(rwsid): %d because a cell(s) decided to use cell passthru mode
[NSMTIO]: kcfis_updstats_at_dealloc: Smart I/O was not used for Appstate: %p, permflags: 0x%x, phv(rwsid): %d as timezone was not available
[NSMTIO]:kcfis_updstats_at_dealloc: Smart I/O not used for Appstate: %p, phv(rwsid): %d as cell_passthru was enabled
[NSMTIO]:SQL for this (non)Smart I/O session is:

This is more speculative, but with “[NSMTIO]: SQL for this (non)Smart I/O session is:” the Oracle developers essentially seem to be saying: “The SQL in this session may or may not have used Smart Scan, but either way it followed the direct path read access path (or at least was considered for direct path read).”
