Friday, August 2, 2013

Reading AWR reports

Since this is a very popular subject on the OTN forum, I decided to put together a few points about analyzing AWR reports.

1. Choosing time period for the AWR report

When troubleshooting a specific problem, one should try and chose the period as close to the duration of the incident as possible. Including snapshots beyond that period would dilute the symptoms of the problem. For example, if the incident occured between 5:49 pm and 7:06 pm, then it’s reasonable to pick 7 pm as the start snapshot and 8 pm as the end snapshot. Choosing 5 pm and 8 pm will result in the AWR report being diluted by 1 hour and 55 minutes of normal running.

If the AWR report is generated to get a general feel of the database profile, then it’s preferable to chose the period of a peak load, since potential performance bottlenecks are more likely to manifest themselves at such times. On the other hand one should avoid any untypical activity(e.g. huge reports that are only run once a year) or any maintenance (e.g. an rman backup).

Of course, the AWR report cannot include an instance restart.

2. Choosing a baseline report

When using AWR report to troubleshoot a specific issue, it is a good idea to generate a second report to as a point of reference. When choosing start and end snapshots for such report, one should take into account application workload periodicity. E.g. if Mondays are busier than other days of week, then an incident that occured on a Monday between 2 and 3 am should be compared to a similar period for another Monday, etc.

3. Most informative sections of the report


I find the following sections most useful:
summary
top 5 timed events
top SQL (by elapsed time, by gets, sometimes by reads)

4. Things to look for

general workload profile (redo per sec, transactions per sec)
abnormal waits (first of all, concurrency and commit)
clear leaders in the top SQL (suggestive of plan-flip kind of a performance issue)

5. Things to keep in mind when interpreting the report


It is important not to get obsessed by the ratios in the report, especially ones that you don’t fully understand. Normally AWR doesn’t contain enough evidence to do the full analysis of a performance problem, it’s just a departing point. The next logical step is to use high-resolution tools to pinpoint the root cause of the problem, such as:

1) query AWR views(DBA_HIST%) directly

2) query ASH views (V$ACTIVE_SESSION_HISTORY, DBA_HIST_ACTIVE_SESS_HISTORY) to link suspicious waits to specific sessions

3) take a closer look at top SQL, using rowsource statistics and cardinality feedback analysis; if necessary, use SQL extended trace

It is a bad idea to use AWR reports when the scope of a performance problem is limited and known (and yet some people do that). E.g. if users complain about procedure DOSOMETHING being slow, it’s fine to generate an AWR report to see if the database is experiencing extra workload, or query AWR views to see if there are changes in the way users call the procedure, but other than that one needs to use more specific things: DBMS_PROFILER, rowsource stats, SQL trace etc.

Another bad idea is to get obsessed by some obscure ratio not being perfect in the AWR report, especially when users are generally happy with the performance. It is quite common that people run an AWR report just in case, find something that supposedly shouldn’t be there and then start to plan a potentially expensive and risky fix for a problem that may not even exist.

For example, when people see log file related waits, they tend to jump to conclusion that something needs to be immediately done to the redo buffer (of course, making it bigger is the 1st thing that comes to mind). Before doing anything, one should answer following questions:

1.
What is the size of the problem, indicated by the suspicious wait event (‘wrong’ ratio, etc.)? Is it big enough to mean a problem? If already experiencing a problem — is the effect commensurate with its size? E.g. if anything in the database runs 5 times slower than normal and you see ‘buffer busy waits’ with 3% in the top-5 wait list, then clearly buffer busy waits are irrelevant (even though everyone knows they’re bad and shouldn’t be there… in a perfect world).

2. What is it linked to? Could it be a one-time thing? E.g. someone running a huge report that only runs once a quarter or uploading huge amount of data that will only happen once?

No comments: