Performance Engineering

Friday, August 2, 2013

Script to identify performance bottlenecks in a query using SQL (and estimate potential tuning benefits)

The script needs query rowsource statistics to produce output, so make sure the query is run with STATISTICS_LEVEL = ALL or gather_plan_statistics hint before analyzing it with the script.

Script output:

id, operation — id and description of the plan operation
potential_savings — how much performance benefits can be expected from tuning this operation (e.g. by creating or enabling a better access path)
origin_of_inefficiency — an educated guess as to the origin of inefficiency based on how many times the operation was executed (STARTS), whether or not there are filter predicates for this operation, and some additional information about parent operations
starts — “starts” column from V$SQL_PLAN_STATISTICS

rem  (c) Nikolay Savvinov, 2013
rem  a script to identify performance bottleneck
rem  and estimate potential benefits from removing them
rem  using rowsource statistics (will only return meaningful results
rem  if the plan is still in the cache and if rowsource statistics were populated.
rem  takes sql_id and child_number as parameters

set verify off
set linesize 400
set pagesize 9999

column id format 999
column operation format A40
column potential_savings format A30
column origin_of_inefficiency format A30
column comments format A30
column starts format 999999
column filter_predicates format A30
column repetition_ratio format 999999


spool report.txt

    with myplan as                                                                                                                                                                                      
    (                                                                                                                                                                                                   
        select p.sql_id, id, parent_id, operation, options, object_owner, object_name, object_type, cardinality, depth,                                                                                        
                S.LAST_CR_BUFFER_GETS gets, S.last_starts STARTS, s.last_output_rows outrows, p.filter_predicates, p.access_predicates                                                                                                       
        from v$sql_plan p,                                                                                                                                                                              
                v$sql_plan_statistics s                                                                                                                                                                 
        where p.sql_id = s.sql_id                                                                                                                                                                       
        and P.CHILD_NUMBER = S.CHILD_NUMBER                                                                                                                                                             
        and p.id = s.operation_id                                                                                                                                                                       
        and (p.sql_id, p.child_number) = (SELECT * FROM (SELECT SQL_ID, CHILD_NUMBER FROM V$SQL WHERE SQL_TEXT like '&&1' ORDER BY LAST_ACTIVE_TIME DESC) WHERE ROWNUM=1)
--        and p.sql_id = '&1'
--        and p.child_number = &2                                                                                                                                                                                                                                                                                       
     ),                                                                                                                                                                                                 
    extended_plan as                                                                                                                                                                                    
    (                                                                                                                                                                                                   
        select mp.*,   
        (                                                                                                                                                                                               
            select max(id)                                                                                                                                                                              
            from myplan                                                                                                                                                                                 
            connect by prior id = parent_id                                                                                                                                                             
            start with id = mp.id
            and 'TABLE ACCESS' = mp.operation
            and mp.options like 'BY%ROWID'                                                                                                                                                                       
        ) rowid_provider_id,                                                                                                                                                                              
        (                                                                                                                                                                                               
            select min(id)
            from myplan
            where parent_id = mp.id            
        ) older_child_id,                                                                                                                                                                              
        (select min(id) from myplan where parent_id = mp.id) older_child,
        (select count(*) from myplan where parent_id = mp.id) num_children
        from myplan mp                                                                                                                                                                                  
    ),
    opstats as
    (
      select owner, table_name, num_rows, blocks, t.stale_stats, t.user_stats
      from dba_tab_statistics t
      where table_name in (select object_name from myplan)
      and partition_name is null
      union
      select owner, index_name, num_rows, leaf_blocks blocks, I.STALE_STATS, I.USER_STATS
      from dba_ind_statistics i
      where index_name in (select object_name from myplan)
      and partition_name is null
    ),
    segstats as
    (
        select owner, segment_name, sum(blocks) actual_blocks
        from dba_segments s
        where segment_name in (select object_name from myplan)
        and rownum>0
        group by owner, segment_name
    ),    
    inrows as
(    
    select lpad(' ', ep.depth, ' ') || operation || ' ' || options || ' ' plan_output,
            (select min(id) from myplan where operation like 'PARTITION%' and id != ep.id connect by prior parent_id = id and num_children<=1 start with id = ep.id) partition_iterator_id,                                                                                                                                                                                                                                                                                                                          
    case when operation in ('FILTER', 'INDEX', 'VIEW', 'TABLE ACCESS', 'COUNT') then gets        
            when operation = 'HASH JOIN' then gets - (select sum(gets) from extended_plan ep2 where parent_id = ep.id)    
    end own_gets, 
            case when operation = 'TABLE ACCESS' and options like 'BY%ROWID' then (select outrows from myplan where id = rowid_provider_id)  
            --        when operation  = 'TABLE ACCESS' and options = 'FULL' then ss.num_rows*ep.starts
                    when operation = 'TABLE ACCESS' and options = 'FULL' then ceil(num_rows*gets/os.blocks)
                    when (operation = 'INDEX' and options like '%SCAN' and options != 'FULL SCAN') then ceil(num_rows*gets/os.blocks)
                    else (select outrows from myplan where id = older_child_id)  
            end inrows, 
            ep.*,
            ceil(gets/os.blocks) rpt_ratio,
            os.num_rows, 
            os.blocks,
            ss.actual_blocks,
            os.user_stats,
            os.stale_stats
    from extended_plan ep,
            opstats os,
            segstats ss
    where ep.object_owner = os.owner (+)
    and ep.object_name = os.table_name (+)
    and ep.object_owner = ss.owner (+)
    and ep.object_name = ss.segment_name (+)
),
leakage as
(
    select inrows.*, 
             case when nvl(inrows,0) <= 0 then null
                     when inrows1 then 'inefficient filtering operation (probably a subquery could not be unnested)'
              when operation = 'FILTER' and filter_predicates is not null and num_children=1 then 'restrictive single-source filtering operation'
              when operation = 'INDEX' and starts = 1 and filter_predicates is not null then 'index filter predicates' 
              when operation = 'INDEX' and starts > 1 and partition_iterator_id is null and filter_predicates is null then 'inefficient driving operation (join, subquery etc.) and/or B-tree navigation overhead'  
              when operation = 'INDEX' and starts > 1 and partition_iterator_id is not null  and filter_predicates is null and (select operation || ' ' || options from leakage where id = l.partition_iterator_id) not like '%SINGLE%' then 'navigating multiple partitions of a local index (' || (select operation || ' ' || options from leakage where id = l.partition_iterator_id) || ')'
              when operation = 'INDEX' and starts > 1 and partition_iterator_id is not null  and filter_predicates is null and (select operation || ' ' || options from leakage where id = l.partition_iterator_id)  like '%SINGLE%' then 'inefficient driving operation'
              when operation = 'INDEX' and starts > 1 and partition_iterator_id is not null  and filter_predicates is null then 'navigating multiple partitions of a local index (' || (select operation || ' ' || options from leakage where id = l.partition_iterator_id) || ')'
              when operation = 'INDEX' and starts > 1 and partition_iterator_id is not null  and filter_predicates is not null then 'inefficient index access predicates, navigating multiple partitions of a local index (' || (select operation || ' ' || options from leakage where id = l.partition_iterator_id) || ')'
              when operation = 'INDEX' and starts > 1 and filter_predicates is not null then 'inefficient join order and/or access predicates' 
              when operation = 'TABLE ACCESS' and options = 'FULL' and starts = 1 and filter_predicates is not null then 'rows rejected by filter predicate after a full table scan' 
              when operation = 'TABLE ACCESS' and options = 'FULL' and starts > 1 then 'multiple executions of a full table scan'
              when operation = 'TABLE ACCESS' and options like 'BY%ROWID' and starts > 1 and filter_predicates is null then 'inefficient driving operation (join, subquery etc.)'  
              when operation = 'TABLE ACCESS' and options like 'BY%ROWID' and starts>1 and filter_predicates is not null then 'inefficient driving operation (join, subquery etc.) and/or access predicates'  
              when operation = 'TABLE ACCESS' and options like 'BY%ROWID' and starts=1 and filter_predicates is not null then 'inefficient access predicates'
              WHEN COST_LEAKAGE>0 THEN 'unknown'
              end leakage_reason,             
              round(abs(100*(actual_blocks/blocks - 1))) stats_error,
             nullif(rpt_ratio, 1) repetition_ratio,
             max(gets) over () total_gets 
    from leakage l
)
select id, 
         operation || ' ' || options || ' ' || object_owner ||  decode(object_name, null, null, '.') || object_name operation,  --num_rows, gets, inrows, outrows,
        nvl2(cost_leakage,  cost_leakage || ' (' || round(100*cost_leakage/gets) || '% operation, ' || round(100*cost_leakage/total_gets) || '% query)', 'UNKNOWN') potential_savings, 
    --        round(abs(100*(actual_blocks/blocks - 1)))  || case when user_stats = 'YES' then ' caution: user-defined stats!' when stale_stats = 'YES' then ' caution: stale stats!' end "stats",
--        stale_stats,
--        user_stats,         
        case when cost_leakage is null then notes else leakage_reason end origin_of_inefficiency,
        case when stats_error>10 then 'caution: actual blocks differ from estimated blocks by ' || stats_error || ' %'  end || case when user_stats = 'YES' then '(user-defined stats!)' when stale_stats = 'YES' then '(stale stats!)' end comments,
        starts/*,
        repetition_ratio,
        filter_predicates*/ 
from summary s
where gets>0
--where cost_leakage > 0
order by cost_leakage desc nulls first;        
    
spool off

Reading AWR reports

Since this is a very popular subject on the OTN forum, I decided to put together a few points about analyzing AWR reports.

1. Choosing time period for the AWR report

When troubleshooting a specific problem, one should try and chose the period as close to the duration of the incident as possible. Including snapshots beyond that period would dilute the symptoms of the problem. For example, if the incident occured between 5:49 pm and 7:06 pm, then it’s reasonable to pick 7 pm as the start snapshot and 8 pm as the end snapshot. Choosing 5 pm and 8 pm will result in the AWR report being diluted by 1 hour and 55 minutes of normal running.

If the AWR report is generated to get a general feel of the database profile, then it’s preferable to chose the period of a peak load, since potential performance bottlenecks are more likely to manifest themselves at such times. On the other hand one should avoid any untypical activity(e.g. huge reports that are only run once a year) or any maintenance (e.g. an rman backup).

Of course, the AWR report cannot include an instance restart.

2. Choosing a baseline report

When using AWR report to troubleshoot a specific issue, it is a good idea to generate a second report to as a point of reference. When choosing start and end snapshots for such report, one should take into account application workload periodicity. E.g. if Mondays are busier than other days of week, then an incident that occured on a Monday between 2 and 3 am should be compared to a similar period for another Monday, etc.

3. Most informative sections of the report

I find the following sections most useful:
summary
top 5 timed events
top SQL (by elapsed time, by gets, sometimes by reads)

4. Things to look for
general workload profile (redo per sec, transactions per sec)
abnormal waits (first of all, concurrency and commit)
clear leaders in the top SQL (suggestive of plan-flip kind of a performance issue)

5. Things to keep in mind when interpreting the report

It is important not to get obsessed by the ratios in the report, especially ones that you don’t fully understand. Normally AWR doesn’t contain enough evidence to do the full analysis of a performance problem, it’s just a departing point. The next logical step is to use high-resolution tools to pinpoint the root cause of the problem, such as:

1) query AWR views(DBA_HIST%) directly

2) query ASH views (V$ACTIVE_SESSION_HISTORY, DBA_HIST_ACTIVE_SESS_HISTORY) to link suspicious waits to specific sessions

3) take a closer look at top SQL, using rowsource statistics and cardinality feedback analysis; if necessary, use SQL extended trace

It is a bad idea to use AWR reports when the scope of a performance problem is limited and known (and yet some people do that). E.g. if users complain about procedure DOSOMETHING being slow, it’s fine to generate an AWR report to see if the database is experiencing extra workload, or query AWR views to see if there are changes in the way users call the procedure, but other than that one needs to use more specific things: DBMS_PROFILER, rowsource stats, SQL trace etc.

Another bad idea is to get obsessed by some obscure ratio not being perfect in the AWR report, especially when users are generally happy with the performance. It is quite common that people run an AWR report just in case, find something that supposedly shouldn’t be there and then start to plan a potentially expensive and risky fix for a problem that may not even exist.

For example, when people see log file related waits, they tend to jump to conclusion that something needs to be immediately done to the redo buffer (of course, making it bigger is the 1st thing that comes to mind). Before doing anything, one should answer following questions:

1. What is the size of the problem, indicated by the suspicious wait event (‘wrong’ ratio, etc.)? Is it big enough to mean a problem? If already experiencing a problem — is the effect commensurate with its size? E.g. if anything in the database runs 5 times slower than normal and you see ‘buffer busy waits’ with 3% in the top-5 wait list, then clearly buffer busy waits are irrelevant (even though everyone knows they’re bad and shouldn’t be there… in a perfect world).

2. What is it linked to? Could it be a one-time thing? E.g. someone running a huge report that only runs once a quarter or uploading huge amount of data that will only happen once?