Friday, August 2, 2013

AWR Reporting troubleshooting CPU Usage

Introduction

Let’s start with some basic concepts. AWR reports deal with several kinds of time. The simplest kind is the elapsed time , it’s just the interval of time between the start and end snapshots. Another important quantity is DB time, which is defined as time in user calls during that period. It can be (and for a busy system typically is) greater than the elapsed time. However, the reason for that is not the number of CPUs as some experts incorrectly state (apparently, they confuse it with CPU time that we’ll discuss below), it’s that this time is a sum over all active user processes which are using CPU or waiting for something. Note that it only counts time spent in user calls, i.e. background processes are not included in that.

Another important quantity is database CPU time. It can also exceed the elapsed time, because the database can use more than one CPU. Unfortunately, AWR reports use up to 3 different names for it: CPU time, DB CPU, and CPU used by this session. Normally, they should have close values, and differences can probably be attributed to connection management (e.g. establishing or tearing down a session). And of course “CPU used by this session” is an odd name for an instance-level metric, but that’s understandable: it’s just a sum of a session-level metric over all sessions.

CPU time represents time spent on CPU and does not include time waiting for CPU. Unfortunately, the latter quantity is not accessible via AWR (but there are indirect ways of extracting in via ASH).

Finally, CPU consumption in the host operating system can also be important for trobleshooting high CPU usage. AWR provides these numbers in the “Operating Sysem Statistics” section (as “BUSY” and “IDLE”, the units are centiseconds).

DB time and DB CPU define two important timescales: wait times should be measured against the former, while CPU consumption during certain activity (e.g. CPU time parsing) should be measurd against the latter.

High CPU time


CPU usage is described by “CPU time” (or “DB CPU”) statistics. Somewhat counterintuitively, AWR report showing CPU time close to 100% in the top timed events section does not necessarily indicate a problem. It simply means that database is busy using CPU to do work for its users. However, if CPU time (expressed in CPU seconds) becomes commensurate to the total CPU power available on the host (or shows consistent growth patterns), then it becomes a problem, and a serious one: this means that at best, Oracle processes will wait lots of time to get on CPU runqueue. In the worst case scenario, the host OS won’t have adequate resources to run and may eventually hang.

Unfortunately, AWR reports only provide CPU time estimates either in absolute units or as a percentage of DB time, but not in terms of the overall capacity. It’s not wrong: you need to know what percentage of user calls falls on CPU time to see whether or not it’s contributing appreciably to response times. But it’s not complete, because when talking about resource usage you need to know what % of total resource available is being used. Fortunately, it’s quite simple to calculate that:DB CPU usage (% of CPU power available) = CPU time / NUM_CPUS / elapsed time

where NUM_CPUS is found in the Operating System statistics section. Of course, if there are other major CPU users in the system, the formula must be adjusted accordingly. To check that, look at OS CPU usage statistics either directly in the OS (using sar or other utility available on the host OS) or by looking at IDLE/(IDLE+BUSY) from the Operating System statistics section and comparing it to the number above.

If DB CPU usage is at 80-90% of the capacity (or 70-80% and growing) then you try to reduce CPU usage or if not possible, buy more CPU power before the system freezes.

To reduce high CPU usage one needs to find its source within the database. The first thing to check is parsing, not only because this is a CPU-intensive activity, but also because high parsing means lack of cursor sharing, which makes diagnostics very difficult: each statement is parsed to its own sql_id, spreading database workload over thousands of statements which only differ by parameter values. Of course, this makes all “SQL ordered by” lists in the AWR report useless.

If parsing is reasonable, then one needs to look at SQL statements consuming most CPU (“SQL ordered by CPU time” in the CPU section of the report) to see if there is excessive logical I/O that could be reduced by tuning, or some expensive sorts that could be avoided, etc. It could also be useful to check “segments by logical reads” to see if partitioning or a different indexing strategy would help.

Unaccounted CPU time

Occasionally, CPU time may underestimate the actual CPU usage because of errors and holes in database and OS kernel code instrumentation — then one needs to rely on OS statistics to figure out how much of the OS CPU capacity the database is using.

In this case, when looking for the source of high CPU usage within the database, in addition to OS tools (top, sar, vmstat etc.) one can use indirect indications of high CPU consumption, such as:

- missing time in the “timed events” section (sum of percentages in top-5 significantly below 100%)
- high parsing (ideally CPU usage during parsing should be accounted for in “CPU time”, but that’s not always the case)
- mutex-related wais, such as “cursor: pin S wait on X” etc. (either because of high parsing, or bugs, or both)
- logon storms (high number of logons in short time)
- resource manager events (“resmgr: cpu quantum”),

or look in ASH for sessions with the “ON CPU” state and see what they are doing.

Examples


Let’s consider a few examples.
Example 1WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxx         xxxxxxxxx   xxxx               1 10.2.0.4.0  NO  xxxxxxxxx

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     66607 02-Mar-12 12:00:52       648      19.6
  End Snap:     66608 02-Mar-12 12:30:54       639      21.4
   Elapsed:               30.04 (mins)
   DB Time:            3,436.49 (mins)

...
Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
resmgr: cpu quantum                 475,956     152,959    321   74.2  Scheduler
CPU time                                         47,879          23.2
db file sequential read           3,174,880      15,866      5    7.6 User I/O
db file scattered read              196,255       4,078     21    2.0 User I/O
log file sync                       157,730       4,579     29    4.4 Commit
...
-> Total time in database user-calls (DB Time): 104720.3s
...
Operating System Statistics           DB/Inst: ****/****  Snaps: 66607/66608

Statistic                                       Total
-------------------------------- --------------------
...
BUSY_TIME                                   5,707,941
IDLE_TIME                                           1
...
NUM_CPUS                                           32
-------------------------------------------------------------

This is a simple case: the report has “CPU starvation” written all over it. CPU time (47,879s) — even though not the largest wait event in the database — is close to the maximum capacity (32 x 30 min x 60 sec/min = 57,600). The top wait event (resmgr: cpu quantum) indicates that the database user calls are spending most of their time waiting for the Resource Manager to allocate CPU resource to them — that’s another symptom of extreme CPU starvation. And finally, OS stats are confirming that CPU is completely maxed out: 1 centisecond of idle time versus 5,707,941 busy!

Fortunately, SQL ordered by CPU time is just as unambiguous: it showed one SQL statement responsible for 60.99% of DB time, and fixing it (it was a bad plan with poor table ordering and millions of context switching because of a PL/SQL function calls) fixed the entire database.

Now let’s consider something less trivial.


Example 2
  WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxx         xxxxxxxxx   xxxx               1 10.2.0.5.0  NO  xxxxxxxxx
Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     38338 08-Mar-12 02:00:40       673       6.7
  End Snap:     38339 08-Mar-12 04:29:22       760       5.6
   Elapsed:              148.70 (mins)
   DB Time:           77,585.95 (mins)
...
Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
cursor: pin S                  ############   2,072,643      2   44.5      Other
cursor: pin S wait on X          76,424,627     929,303     12   20.0 Concurrenc
latch free                            1,958     246,702 ######    5.3      Other
CPU time                                         58,596           1.3
log file sync                       746,839      44,076     59    0.9     Commit
          -------------------------------------------------------------
...
-> Total time in database user-calls (DB Time): 4655157.1s
...
          -------------------------------------------------------------
Operating System Statistics
Statistic                                       Total
-------------------------------- --------------------
...
BUSY_TIME                                   6,327,511
IDLE_TIME                                      24,053
...
NUM_CPUS                                            7
          -------------------------------------------------------------

There are quite a few remarkable things in this report. And there is a good story to it, too, but I’m hoping to make a separate post about it, so let’s focus on CPU stuff here.

The time period of the report spans 148 min, but DB time is 77,586 min, which means that there were ~524 active sessions on the average. If we compare that to the number of sessions (673/760 beginning/end), we can see that even the database was terribly busy, or, yet more likely, most of the sessions were waiting on something. The list of timed event confirms this: it shows massive mutex contention in the library cache.
Now let’s look at the CPU time here. It’s 58,596 s, or just 1.3% of DB time… negligible! Or is it?… Let’s compare it to the total CPU time available: 148 minutes times 7 CPUs times 60 seconds per minute equals 62,454 s — i.e. the database alone was responsible for 93.7% of the CPU time during a 2.5 hour interval! More likely, it started off at a moderate level, and then for a good portion of the interval stayed close at 100%, which averaged to 93.7%.

If we look again at the wait events, we don’t find any mention of CPU time at all! However, if we do the math, we can find an indirect indication: 44.5+20+5.3+1.3+0.9=72, so where did the remaining 28% go?… Also, cursor: pin S wait on X, cursor: pin S are both mutext waits, which can burn CPU at a very high rate  This gives us a good idea of how the CPU is wasted (and if one looks in ASH, one can find where exactly it happens, but that’s beyond the scope of this post).

In this case, “SQL ordered by CPU time” was useless for finding the source of high CPU usage, because many SQL statements were not using binds. The culprit was found by looking in the ASH (actually, that requires a bit of work, too, but I’m hoping to make a separate post about it), and fixing it fixed the problem.

Let’s consider another case.

Example 3
WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxx         xxxxxxxxx   xxxx               1 10.2.0.4.0  NO  xxxxxxxxx
              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     33013 02-Apr-12 10:00:00       439      27.1
  End Snap:     33014 02-Apr-12 11:00:12       472      24.4
   Elapsed:               60.20 (mins)
   DB Time:              520.72 (mins)
...
Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                         15,087          48.3
db file sequential read          28,442,386       8,758      0   28.0   User I/O
enq: TX - row lock contention         1,459       3,633   2490   11.6 Applicatio
log file sync                        89,026       2,922     33    9.4     Commit
db file parallel write              169,289       2,783     16    8.9 System I/O
...
Operating System Statistics
Statistic                                       Total
-------------------------------- --------------------
...
BUSY_TIME                                   5,707,941
IDLE_TIME
...
NUM_CPUS                                           64

Here, CPU time is responsible for almost half of the DB time. This looks big. Does this mean we should rush to buy more (or faster) CPUs? Probably not, since the CPU time (15,087) is only a small fraction of available CPU resource (64 CPUs x 60 mins x 60 s = 230,000s). OS stats also show that CPU is not a scarce resource on the system (211,335s idle vs 19,831s busy).

Of course, this doesn’t mean that tuning SQL to reduce CPU consumption won’t help here — it will, it just won’t be a global effect. Therefore, it would make sense to tune based on business priority, not on the amount of CPU usage.

Conclusion


Troubleshooting high CPU usage with AWR reports can be tricky and may require other tools (like ASH). While most waits are compared to DB time, CPU time should also be compared to the total CPU capacity on the host.

No comments: