Thursday 1 August 2013

Monitors: What metrics/counters to monitor for Windows System Resource?

The following counters are worth monitoring for Windows System Resource:
  • Processor(_Total)\% Processor Time
  • Processor(_Total)\% Privileged Time
  • Processor(_Total)\% User Time
  • Processor(_Total)\Interrupts/sec
  • Process(instance)\% Processor Time
  • Process(instance)\Working Set
  • PhysicalDisk(instance)\Disk Transfers/sec
  • System\Processor Queue Length
  • System\Context Switches/sec
  • Memory\Pages/sec
  • Memory\Available Bytes
  • Memory\Cache Bytes
  • Memory\Transition Faults/sec
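
Outside LoadRunner, the same counters can be sampled with the Windows `typeperf` command-line tool. The snippet below is only a sketch: it builds the argument list for such a run without executing it. The function name, the 5-second interval, the sample count and the output file name are all assumptions chosen for illustration, and the per-process counters are left out because their instance names depend on your application.

```python
# System-wide counter paths taken from the list above.
# "_Total" for the disk instance is an illustrative choice.
COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\Processor(_Total)\% Privileged Time",
    r"\Processor(_Total)\% User Time",
    r"\Processor(_Total)\Interrupts/sec",
    r"\PhysicalDisk(_Total)\Disk Transfers/sec",
    r"\System\Processor Queue Length",
    r"\System\Context Switches/sec",
    r"\Memory\Pages/sec",
    r"\Memory\Available Bytes",
    r"\Memory\Cache Bytes",
    r"\Memory\Transition Faults/sec",
]


def build_typeperf_command(counters, interval_s=5, samples=12,
                           out_file="counters.csv"):
    """Return the argument list for a typeperf run (nothing is executed)."""
    return [
        "typeperf", *counters,
        "-si", str(interval_s),   # sampling interval in seconds
        "-sc", str(samples),      # number of samples to collect
        "-f", "CSV",              # output format
        "-o", out_file,           # output file
    ]


if __name__ == "__main__":
    for arg in build_typeperf_command(COUNTERS):
        print(arg)
```

On a Windows host the returned list could be handed to `subprocess.run` to produce a CSV that can be compared with the LoadRunner graphs.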

As you may have noticed, some of them are already selected by LoadRunner. When designing the scenario for the load test, set the following counters for Windows System Resource. They are useful in the following areas: Server/Service Availability, Processor, Hardware Functionality, RAM and Disks.

Server/Service Availability

System\System Up Time

This counter reports the time, in seconds, that has elapsed since the server was last rebooted.

Process(instance)\Elapsed Time

This counter reports the time, in seconds, that a given process instance has been running. You can also use Process(instance)\Elapsed Time to monitor the processes associated with specific applications and services, and so track the availability of those applications and services.
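
Both availability counters grow for as long as the server or process stays up, so a value that is lower than the previous sample means a restart happened in between. A minimal sketch of that check, assuming the samples have already been exported as a chronological list of seconds (the function name is hypothetical):

```python
def detect_restarts(samples):
    """Return the indices where an uptime-style counter went backwards.

    `samples` is a chronological list of System Up Time or Elapsed Time
    values in seconds. A drop between two samples implies a restart.
    """
    return [i for i in range(1, len(samples)) if samples[i] < samples[i - 1]]


if __name__ == "__main__":
    # The fourth sample (index 3) is lower than the third: one restart.
    print(detect_restarts([100, 160, 220, 15, 75]))
```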

Processor

Processor(_Total)\% Processor Time


Processor(_Total)\% Processor Time is useful for measuring the total utilization of your processor by all running processes. Note that on a multiprocessor machine, Processor(_Total)\% Processor Time actually measures the average processor utilization of the machine (i.e. utilization averaged over all processors).
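
To see what that averaging means in practice, here is a small worked example. The per-processor figures are made up for illustration, and the function name is hypothetical.

```python
def total_processor_time(per_cpu_percent):
    """Average per-processor utilization, which is how _Total behaves
    on a multiprocessor machine according to the description above."""
    return sum(per_cpu_percent) / len(per_cpu_percent)


if __name__ == "__main__":
    # Illustrative values for a four-processor machine:
    # one processor is saturated, yet the average looks moderate.
    print(total_processor_time([100.0, 20.0, 40.0, 0.0]))
```

A moderate _Total value can therefore hide a single saturated processor, which is one reason to also look at individual instances.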

Process(instance)\% Processor Time

When you know which specific process/instance the application uses, it is recommended to drill down and monitor that process/instance. This counter can also be used to detect a suspicious process/instance that is utilizing the processor.

Processor(_Total)\% Privileged Time and Processor(_Total)\% User Time

Of these two counters, Processor(_Total)\% Privileged Time represents the share of processor time spent on kernel-mode operations such as OS housekeeping, while Processor(_Total)\% User Time represents the share spent running application code in user mode; a high value can indicate that the server is running too many roles. If Privileged Time is too high, consider reducing the amount of OS housekeeping work; if User Time is too high, consider upgrading the hardware.

System\Processor Queue Length

If this counter is consistently higher than about 5 while processor utilization approaches 100%, that is a good indication that more threads are ready to run than the machine's processors can handle. Note, however, that it is not the best indicator of processor contention caused by overloaded threads; counters such as ASP\Requests Queued or ASP.NET\Requests Queued can be added for that purpose.
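
The rule of thumb above can be sketched as a simple check over paired samples. The queue limit of 5 comes from the text; treating "approaches 100%" as 90% or more and "consistently" as at least 80% of the samples are assumptions of this sketch, as is the function name.

```python
def cpu_saturated(queue_lengths, cpu_percents,
                  queue_limit=5, cpu_limit=90.0, min_fraction=0.8):
    """True if most samples show a long queue while the CPU is nearly full.

    cpu_limit and min_fraction are illustrative assumptions, not values
    taken from any official guidance.
    """
    pairs = list(zip(queue_lengths, cpu_percents))
    if not pairs:
        return False
    hits = sum(1 for q, c in pairs if q > queue_limit and c >= cpu_limit)
    return hits / len(pairs) >= min_fraction


if __name__ == "__main__":
    print(cpu_saturated([8, 9, 7, 10, 6], [98, 99, 97, 100, 95]))
    print(cpu_saturated([8, 1, 2, 1, 0], [98, 40, 30, 20, 10]))
```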

Hardware Functionality

System\Context Switches/sec


This counter measures how frequently the processors switch from one thread to another, which includes switching from user to kernel mode to handle a request from a thread running in user mode. The heavier the workload running on your machine, the higher this counter will generally be, but over the long term its value should remain fairly constant. If this counter suddenly starts increasing, however, it may be an indication of a malfunctioning device, especially if you are seeing a similar jump in the Processor(_Total)\Interrupts/sec counter on your machine.

You may also want to check the Processor(_Total)\% Privileged Time counter to see whether it shows a similar unexplained increase, as this may indicate a problem with a device driver that is adding to kernel-mode processor utilization.
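
One way to turn "suddenly starts increasing" into a check is to compare recent samples against an earlier baseline. The sketch below is an assumption-laden illustration: the five-sample baseline, the factor of 2 and both function names are arbitrary choices, not thresholds from the text.

```python
from statistics import mean


def jumped(samples, baseline_len=5, factor=2.0):
    """True if the mean after the baseline window exceeds
    `factor` times the mean of the baseline window."""
    if len(samples) <= baseline_len:
        return False
    baseline = mean(samples[:baseline_len])
    recent = mean(samples[baseline_len:])
    return baseline > 0 and recent > factor * baseline


def suspect_device(context_switches, interrupts, privileged_time):
    """Flag the pattern described above: a jump in Context Switches/sec
    accompanied by a jump in Interrupts/sec or % Privileged Time."""
    return jumped(context_switches) and (
        jumped(interrupts) or jumped(privileged_time)
    )


if __name__ == "__main__":
    cs = [1000] * 5 + [3000] * 3     # illustrative samples
    ints = [500] * 5 + [1500] * 3
    priv = [10] * 8
    print(suspect_device(cs, ints, priv))
```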

Processor(_Total)\Interrupts/sec

This counter measures the number of interrupts the processor has to handle per second.

RAM

Memory\Pages/sec


This counter indicates the number of paging operations to disk per second, and it is a primary indicator that your server may not have enough RAM to meet its needs. A recommended benchmark is to watch for the number of pages per second exceeding 50 per paging disk on your system.
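
Because the benchmark is stated per paging disk, the limit scales with the number of disks that hold a page file. A minimal sketch, with a hypothetical function name and averaged samples as an assumption about how "exceeds" is judged:

```python
def paging_excessive(pages_per_sec_samples, paging_disks=1, limit_per_disk=50):
    """True if the average Pages/sec exceeds 50 per paging disk."""
    if not pages_per_sec_samples:
        return False
    average = sum(pages_per_sec_samples) / len(pages_per_sec_samples)
    return average > limit_per_disk * paging_disks


if __name__ == "__main__":
    samples = [70, 80, 90]  # illustrative values, average 80
    print(paging_excessive(samples, paging_disks=1))  # limit is 50
    print(paging_excessive(samples, paging_disks=2))  # limit is 100
```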

Memory\Available Bytes

If this counter is greater than 10% of the actual RAM in your machine, then you probably have more than enough RAM and don't need to worry. If it drops below 2% of the installed RAM, you might want to dig a little deeper with the next counter.
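
As a worked example of the two percentages, the sketch below classifies a single Available Bytes reading. The text only describes the bands above 10% and below 2%; the "watch" label for anything in between is an assumption of this sketch, as is the function name.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def classify_available(available_bytes, installed_bytes):
    """Classify Memory\\Available Bytes against installed RAM."""
    percent = 100.0 * available_bytes / installed_bytes
    if percent > 10:
        return "ok"
    if percent < 2:
        return "investigate"
    return "watch"  # assumed label for the band the text leaves open


if __name__ == "__main__":
    print(classify_available(4 * GIB, 16 * GIB))    # 25% of RAM
    print(classify_available(1 * GIB, 16 * GIB))    # 6.25% of RAM
    print(classify_available(200 * MIB, 16 * GIB))  # about 1.2% of RAM
```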

Process(instance)\Working Set

This counter measures the size of the working set of each process, that is, the allocated pages the process can address without generating a page fault. It helps determine which process is consuming the most RAM when Memory\Available Bytes shows a downward trend.

Memory\Cache Bytes

This counter, on the other hand, measures the working set of the system, i.e. the allocated pages that kernel threads can address without generating a page fault.

Memory\Transition Faults/sec

This counter measures how often recently trimmed pages on the standby list are re-referenced. If this counter slowly starts to rise over time, it could also indicate that you are reaching a point where you no longer have enough RAM for your server to function well.
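
A slow rise is easier to spot as a trend than by eye. One possible approach, shown here as a sketch rather than a prescribed method, is to fit a least-squares line through the samples and look at its slope; the function name is hypothetical and how large a slope counts as "rising" is left to the reader.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (change per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative Transition Faults/sec samples rising by 2 per interval.
    print(slope([10, 12, 14, 16]))
```

A positive slope that persists across several test runs matches the "slowly starts to rise" pattern described above.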


Disks

PhysicalDisk(instance)\Disk Transfers/sec

We can use this counter to monitor disk activity. When the measurement goes above 25 disk I/Os per second, your disk has poor response time, which may well translate into a potential bottleneck. To uncover the root cause, we use the next counter.

PhysicalDisk(instance)\% Idle Time


This counter measures the percentage of time that your hard disk is idle during the measurement interval. If you see it fall below 20%, you have likely got read/write requests queuing up for a disk that is unable to service them in a timely fashion. In this case it's time to upgrade your hardware to use faster disks or scale out your application to better handle the load.
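
The two disk thresholds can be combined into one check per disk instance. This is a sketch only: the limits of 25 transfers/sec and 20% idle come from the text, while the function name and the wording of the findings are assumptions.

```python
def disk_findings(transfers_per_sec, idle_percent,
                  transfer_limit=25, idle_floor=20):
    """Return a list of findings for one disk instance."""
    findings = []
    if transfers_per_sec > transfer_limit:
        findings.append("high transfer rate")
    if idle_percent < idle_floor:
        findings.append("disk rarely idle, requests likely queuing")
    return findings


if __name__ == "__main__":
    # Illustrative readings for a busy disk and a quiet one.
    print(disk_findings(transfers_per_sec=40, idle_percent=12))
    print(disk_findings(transfers_per_sec=10, idle_percent=85))
```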
