Saturday 27 July 2013

What is benchmark testing?

Benchmark testing is the process of load testing a component or an entire end-to-end IT system to determine the performance characteristics of the application. The benchmark test is repeatable: the performance measurements captured will vary by only a few percent each time the test is run. This enables single changes to be made to the application or infrastructure so that any resulting performance improvement or degradation can be measured.

Benchmark testing can combine aspects of security testing. A case in point is benchmark testing firewalls, which requires system and/or user loads combined with concurrently executed security violations to determine the component's performance benchmark.

The goals of benchmark testing typically fall into two categories:

To test the system to measure how a change affects its performance characteristics.

To test and tune the system to reach a performance requirement or service level agreement (SLA). In this case a series of benchmark tests are conducted in conjunction with iterative cycles of performance tuning.

What is Reliability Testing?

A system's reliability is a measure of its stability and overall performance, collated over an extended period of time under specific sets of test conditions. This type of testing incorporates the results from non-functional testing, such as stress testing, security testing and network testing, along with functional testing. It is a combined metric that defines a system's overall reliability. A measure of reliability should be defined by business requirements in the form of service levels. These requirements should then be used to assess test results and the overall reliability metric of the system under test.

Testing Firewalls

The firewall is your company's defence system, protecting vulnerable applications from outsiders. This defence system is clever: it lets in friends and keeps out enemies. Security testing of your firewalls is a vital aspect of your business security.

Performance is just one aspect of the quality assurance work that must take place on a new or upgraded firewall. It is important that the firewall performs under load and during sustained security attacks.
There are a number of considerations when planning a Performance Test and Security Test of your firewall.

Ideally the workload generated in a Performance Test would include a scenario whereby friendly and unfriendly requests are generated. The firewall may exert a lot more effort when it is fending off an attack. A number of scenarios should be undertaken. These include tests that would determine:

The maximum number of TCP connections created per second
The maximum number of concurrent users that could be connected
The maximum HTTP rate (hits per second) that can be achieved
The maximum bandwidth utilisation that can be serviced

In order to run the above tests a target application will have to be used. The application used is not important; it serves as a reflector, receiving requests and responding to them. The security testing of firewalls needs to monitor the reflector application to ensure that it does not become a bottleneck (a so-called artificial bottleneck), thus limiting the ability to properly find the limits of the firewall itself.

A tool will need to be chosen. While tools such as Loadrunner, Silk Performer and QAload are very good at generating large workloads and simulating hundreds or thousands of users, they tend to do this well only for HTTP-type message protocols.

Spirent’s Avalanche performance test tool can generate HTTP-type traffic as well as a wide range of non-HTTP protocols such as:

802.1Q and 802.1 Q-in-Q
FTP (Active/Passive)
SMTP
ICMP
CIFS
SIP over TCP
SIP over UDP
Unicast Streaming Quicktime RTSP/RTP
Unicast Streaming RealNetwork RTSP/RTP
Unicast Streaming Microsoft MMS
Multicast Streaming IGMPv2
IGMPv3
RTMP and MLDv2

Rather than record a single user session, Avalanche can be used to generate realistic network traffic consisting of multiple message protocols and types. Avalanche is not a tool that you install; it comes on a pre-configured box that can be slotted straight into your data centre, ready to go. Avalanche can be complemented by Spirent’s Threatex product, which can be used to generate actual attacks against your defences. As little as five days are required to execute a Security Test combined with a Performance Test against your firewall.

Spirent’s Avalanche can be used to test the performance of network components other than the firewall including your load balancers, routers and switches.
Your network implementation can be subjected to performance validation. Quality of Service (QoS) can be tested to ensure that your most important message types are not impacted by users browsing the internet. Fail-over testing and redundancy to ensure continuous service under high workloads can be validated.

Vugen Script Generated from a Trace File

Where the application front end cannot be recorded, Vugen can analyse server traffic files to create automated test scripts using the Analyze Traffic feature. This will create what is known as a Server Traffic Script.

The Java protocol is a case in point. Loadrunner Java scripts for load testing can only be played back; they cannot be recorded. The question is, will the Analyze Traffic feature help a performance tester develop a working script?

As usual, the Vugen user guide makes it sound easy. Step 1: generate a capture file. Is this as easy as it sounds, or is it a bit more complicated?

Apparently the capture file is a trace that contains a log of TCP network traffic. I hate the term ‘sniffer application’; it is meaningless. A dump of the network traffic can be obtained by a network tracing tool. My favourite tool is Ethereal. In mid-2006, after a long and bitter dispute involving the trademark, Ethereal was renamed Wireshark, which in my opinion is a much better name anyway. Wireshark is a free tool and I know lots of network people who use it on a regular basis. Take some time to get to know Wireshark; with a bit of familiarisation you can figure out how to generate a network trace.

If you do not want to use Wireshark, then Vugen has a capture facility that can be used. To do this, a command is issued from the command line. There are separate commands for different platforms: for Windows the command is lrtcpdump.exe, whereas for HP’s version of Unix it is lrtcpdump.hp9.

Now that we have a capture file all we have to do is generate a Loadrunner script. It is at this point in the process that I had very little confidence that the capture file would be valid. Still, all you can do is give it a go. As I started up Vugen, the instructions indicated that after processing the capture file I would have either web services or web SOAP functions generated. So, after all that effort, I realised that this is just another feature aimed at web applications. Sorry, Java or some other bespoke application, you will have to wait another day.

Still, I can press on and evaluate how it works for an unrecordable web-based application.

So, quickly finding a web application (www.afcwimbledon.co.uk), I captured a trace file and saved it away somewhere safe.

Starting Vugen, I created a new script, selecting a single protocol for web services. The Analyze Traffic button was easy to spot on the main menu bar.

Selecting Analyze Traffic, I was presented with a new window that requested WSDLs.

I did not have any of those, so I just selected Next and was presented with a screen that showed ‘Finish’ rather than ‘Next’, which was a good sign.

The first problem: Loadrunner expected my capture file to have an extension of .cap, e.g. the file name would look like capturefile.cap. Wireshark produced a capture file with a .pcap file extension. With a sense of foreboding, I selected the Wireshark trace anyway.

The second problem: did I have incoming traffic or outgoing traffic? Well, I figured that if I was sending a request and receiving a response I probably had both. Unfortunately, the incoming and outgoing traffic options are mutually exclusive; you have to choose one or the other. In the end I tried both, and neither produced satisfactory results.

Vugen accepted the capture file, populated a recording log and a generation log, but failed to produce a single statement of code.

My instincts were correct. Despite capturing what looked like a valid capture file against a website, a Vugen script was not generated.

Performance Testing Active Directory

Active Directory is Microsoft’s authentication software. Siteminder by Computer Associates is a similar product to Active Directory. Within z/OS, ACF2 & RACF are the major authentication products.

Interestingly ACF2 and RACF have completely different philosophies for achieving the same aim, i.e. ensuring access to only those users who should have access. ACF2 requires the administrator to explicitly grant access to a resource. RACF requires the administrator to explicitly block access to a resource. Which approach is better? Well, I’m a performance tester, thankfully I can dodge that one.

Requirements to performance test authentication software are rare. In most cases the performance tester is looking at a new or enhanced application, an upgrade to system software or hardware, or some other similar requirement. This normally involves little change to, or little extra load on, the security policy.

There is normally just one implementation of Authentication software that covers all development and test environments. The performance tester normally excludes authentication software from scope as there are many users across the development infrastructure which would make it difficult to correlate workload with resource utilisation.

There are some cases though where it can be very useful to test Authentication software. For instance:

1. Implementation of single sign-on where authentication of users is handled by a centrally located authentication software.

2. Addition of large numbers of security objects, such as users or devices, to an authentication application. This can add to the complexity of the structure of security objects, which, when combined with increased demand, can cause greatly extended response times.

3. Any major reorganisation of authentication such as merging two or more domains into one.

Loadrunner can be used to test authentication software. Simply choose an application protocol type that you are licensed for and create a logon script. Obtain sufficient user IDs and passwords and away you go.
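As a minimal sketch, assuming a hypothetical application URL and form field names, and a parameter file supplying {UserID} and {Password}, such a logon script for a web-based application might look like this:

vuser_init()
{
    // Open the logon page of the application under test.
    web_url("logon_page",
        "URL=http://myapp.example.com/login",
        "Mode=HTML",
        LAST);

    lr_start_transaction("Logon");

    // Submit credentials; {UserID} and {Password} are substituted from a
    // Loadrunner parameter file so each virtual user logs on as a different user.
    web_submit_data("logon",
        "Action=http://myapp.example.com/logon",
        "Method=POST",
        "Mode=HTML",
        ITEMDATA,
        "Name=username", "Value={UserID}", ENDITEM,
        "Name=password", "Value={Password}", ENDITEM,
        LAST);

    lr_end_transaction("Logon", LR_AUTO);

    return 0;
}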

Life isn’t always that easy. Sometimes (especially with single sign-on), logon takes place at boot-up, or when a machine is connected to the network. Once the user is logged in and another application is started, for instance Exchange, the user does not need to enter additional credentials; single sign-on takes care of it.

Loadrunner can be used to simulate authentication against Radius. This is a protocol that allows centralised authentication for machines to connect and use a networking service. Actual authentication would still be carried out by reference to Active Directory, Site Minder or some other similar product.

The most common usage of Radius is by ISPs that provide wireless networks. While these wireless networks can be connected to and used, unless authentication takes place (which may involve the use of a credit card), access to any network location other than the portal cannot take place.

Radius also allows an organisation to maintain user profiles in a central location that all remote servers can share. This allows a company to set up a policy that can be applied at a single administered network point using Active Directory or similar.

The Radius protocol can only be played back; it cannot be recorded. There is limited support for this protocol; for instance, it does not support certificates. The following details are set up in the runtime settings:

Property - Description

Network Type - Accounting network type: GPRS (General Packet Radio Service) or CSD (Circuit-Switched Data)

IP Address - IP address of the Radius server

Authentication port number - Authentication port of the Radius server

Accounting port number - Accounting port of the Radius server

Secret Key - The secret key of the Radius server

Connection Timeout (sec) - The time in seconds to wait for the Radius server to respond. The default is 120 seconds.

Retransmission retries - The number of times to retry after a failed transmission. The default is 0.

Store attributes returned by the server to parameters - Allow Vusers to save attributes returned by the server as parameters, which can be used at a later time. The default is False.

Radius client IP - Radius packets source IP, usually used to differentiate between packets transmitted on different NIC cards on a single Load Generator machine

There are really only two statements that can be used with the Radius protocol.

This is an example, as shown in the Loadrunner documentation, of the radius_account statement.

radius_account("AccountName=account1",

"Action=Start",

"Username=joe123",

"CallingId=123456", // MSISDN

LAST);

The other statement is the radius_authenticate statement:

radius_authenticate("Username=jim",

"Password=doe123",

"CallingId=999", // the MSISDN

LAST);

Active Directory can be performance tested directly with a number of tools. The tool that makes most sense is the free Microsoft tool called ADTest.exe.

This is a reasonably good product that gets the job done. While it takes some time to get going with this tool, it is flexible. While it is always better to use production data, if you don’t have any, then ADTest can be used to set it up for you. A large range of sample automated functions comes pre-set up, which makes it a lot easier to start familiarising yourself with the tool.

Once ADTest is installed, it is initiated by executing a command at the run prompt, for example:

adtest -run inter -f adtest.ats -loop 5 -t 20 -m -bt

The above command stresses Active Directory with the test variation inter, which is found in the file adtest.ats. Five loops of inter are executed with 20 threads. The following command runs the LogonUsers variation in the same way:

adtest -run LogonUsers -f adtest.ats -loop 5 -t 20 -m -bt

The file adtest.ats contains the following statements:

LogonUsers
{
    TEST [LOGON]
    LOOP RAND
    RANGE #(1-10)
    DN P18121##
    PWD PassWord
    OP LOGON32_LOGON_INTERACTIVE
    SCOPE LOGON32_PROVIDER_DEFAULT
}

Vugen Logging Options

This is useful when you are using lots of script blocks/actions and want to turn on full logging for specific actions or blocks. I could just have added the appropriate code within the script, but instead I have two actions: one that turns full logging on and one that turns it off. I then add the ‘on’ action ahead of the action or block that I want full logging for, and switch it off afterwards.

Here is the ‘turn on’ action:

fFullLogging()
{
    // Switch on extended logging: result data, full trace and parameter substitution.
    lr_set_debug_message(LR_MSG_CLASS_EXTENDED_LOG | LR_MSG_CLASS_RESULT_DATA | LR_MSG_CLASS_FULL_TRACE | LR_MSG_CLASS_PARAMETERS, LR_SWITCH_ON);

    return 0;
}

and here is the ‘turn off’ action:


fNormalLogging()
{
    // Switch the same extended logging classes back off, returning to normal logging.
    lr_set_debug_message(LR_MSG_CLASS_EXTENDED_LOG | LR_MSG_CLASS_RESULT_DATA | LR_MSG_CLASS_FULL_TRACE | LR_MSG_CLASS_PARAMETERS, LR_SWITCH_OFF);

    return 0;
}
You can of course change the level of logging required.

Another trick you may need to employ with Loadrunner, or indeed most load testing tools, is to set the timeout option. For Loadrunner, this is in Tools > Options > Advanced settings. By default, the timeout is set to 2 minutes. When 2 minutes are reached, Loadrunner gives up, issues an error message and aborts execution of the automation. When logging vast amounts of data, the speed at which the data is returned to Loadrunner varies depending on network speed. Surely, you say, this should not make the blindest bit of difference; the page size is the page size. Well, this is not the case. When a page is received by a browser, compression is often used, which can shrink a 1.5MB web page to a few hundred KB. With full Loadrunner logging switched on, every byte is slowly and painfully returned to the automated test tool. By increasing the timeout to 4 or 5 minutes, you give the application and the test tool time to receive the message.

Once you have your full trace, you can sit back and enjoy a few minutes of trawling through the data as you attempt to correlate a 2 byte field.

When executing performance or load tests, test tool logging should be set very low, if not switched off. Logging does not directly slow users down, but it can cause injectors to use much more CPU, and of course, if the load generator is running slow, then this can cause the users to slow down. An option is to run just one user per automated function with logging switched on if necessary.

Substitution data and parameter files in automated test scripts for performance and stress testing

When generating an automated test script, it is designed in a very specific way. The design needs to incorporate logon, key steps for the business function and logoff.

Logon and logoff may occur just once. A good example is a helpdesk user who is receiving and logging calls throughout the day. They log in when they arrive at the office and use the application throughout the day.

Logon and logoff may need to be simulated for each business function executed. An expenses application is a good example. Once a month, a user connects, enters their expense claim then logs off.

Whether logon & logoff occur once per test execution or once per iteration, the automated test script usually needs to use different authentication details for each logon.

On occasions, there are no restrictions to the number of times a user can log in to an application. Take webmail for instance, usually it is possible to access your mail from more than one workstation at the same time. It is more common that a user can only have a single session with an application. When trying to log in to an application when already connected, the user may be returned a message ‘already logged in’, or may find that the 2nd login works, but the first session is terminated.

The performance tester needs to be able to substitute the recorded details in an automated test script with values from a large list.

User1/password1, User2/password2, User3/password3...

This is a basic requirement in performance test tools. TPNS (now called WSIM or Workload Simulator) was the original performance test tool; it used a concept called user tables. The market-leading tool, HP’s Loadrunner, does more or less exactly the same thing as TPNS, calling it parameter files instead. As Loadrunner is the market-leading tool, ‘parameter’ has become the accepted term for the substitution of generic data.

So how does parameterisation of common data work? The following steps may help:

Determine which data needs to be parameterised.
When entering a postcode or zipcode, does the application determine that the postcode/zipcode is valid? What is the impact on the application of entering the same value for every user on every iteration? This question needs to be considered for every piece of entered data in the automated test script.
Be clever: a userID may need to be different each time it is used, but the password for all users may be the same. In that case only the userID needs to be parameterised.
Where is the data going to come from?
Some data will be supplied by system administrators such as userids or standing data.
Some data can be found on the internet, for instance, postcode files of variable correctness can sometimes be located.
Other data needs to be created. Before an order can be amended, it must first be created. Your automated test tool can be used to generate this data. On occasions, automated test scripts are created just to build data and are not subsequently used in a load test.
Correlation can be used to reference data which has been returned to the user's screen. A good example of this is a function whereby a manager approves an expense claim. The manager searches for any outstanding actions he is required to do. One or more approval requests may be returned. The automated test tool can be told to look for specific information returned to the screen. If an approval request is returned, then code in the automated test script can be built to trap this information so it can be used in a subsequent action (this is correlation). With the returned data correlated, the approvals can be actioned (a sketch follows this list).
Extract from the database. Using SQL queries, data can be extracted from the database to build into a parameter file. This is a good method as data on the database is generally accurate. An example here is amending an order. An automated test script has already built some orders, but you don’t know which orders were correctly built, and which ones failed part way through. An SQL query with a properly defined 'where' clause can extract a valid list of data that can be used with confidence.
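To illustrate the correlation step above, here is a minimal sketch using Loadrunner's web protocol. The URLs, form field names and text boundaries are hypothetical; in practice the boundaries must be taken from the actual server response.

Action()
{
    // Register the correlation before the request whose response contains the data.
    web_reg_save_param("ClaimID",
        "LB=claimId=",
        "RB=&",
        "Ord=1",
        LAST);

    // The manager searches for outstanding approval requests.
    web_url("search_approvals",
        "URL=http://myapp.example.com/approvals/search",
        "Mode=HTML",
        LAST);

    // The captured value is fed into the subsequent approval step.
    web_submit_data("approve_claim",
        "Action=http://myapp.example.com/approvals/approve",
        "Method=POST",
        "Mode=HTML",
        ITEMDATA,
        "Name=claimId", "Value={ClaimID}", ENDITEM,
        "Name=decision", "Value=approved", ENDITEM,
        LAST);

    return 0;
}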

Problems and Issues when creating Loadrunner scripts with the Java protocol

When writing a Loadrunner script that interacts with MQ, Websphere or Weblogic, as a scripter, you need to use the Java protocol.

This is a bit of a nuisance for those of us who have become familiar with the C programming language and can use it with our eyes shut. The question is often asked: why do I need to use Java? I want to use C. Java can take a long time to use and logically seems to work quite differently to C.

Well, the answer is a little obvious (but only when you know the answer!) In order to develop Automation for MQ, Weblogic or Websphere, you often need to use the same libraries that the developers have used. The developers have often written the application in Java, and the supporting object libraries are the Java versions for the middleware products MQ, Weblogic and Websphere.

In many cases, the Loadrunner automation is simulating a device which runs the Java application. This could be a desktop, laptop or a handheld terminal (HHT). The device contains a compiled version of the code. This code is executed in one of three circumstances:

• The device is switched on and the operating system is configured to execute at start up

• An input is received from the device such as barcode being scanned

• A message is received from a middleware application such as MQ, Websphere or Weblogic

When executing an automated test script with Vugen, Loadrunner always compiles the script first. With a Java script, this will create a compiled version of the application similar to that which in the real world is located on the desktop, laptop or a handheld terminal. The main difference between the Loadrunner script and the real application is that the Loadrunner script would normally be written to process a distinct business function, i.e. it would contain only a subset of the functionality of the real application.

In order to compile the application, the Loadrunner scripter needs to have access to the same common java files as the developer, the so called JAR files. The JAR files need to be accessible to Loadrunner when it compiles. This is done by entering the information into the runtime settings. By specifying the location of the classpath files in the runtime settings, you are telling Loadrunner where to find the Classpath files so that compilation will work.

While this seems straightforward, the way Loadrunner works with the compiler means that it detects the names of the Classpath files, but does not necessarily determine where they are.

To get around this we can change the path statement in the environment variables of the machine running Vugen. This also does not always work. What should work is to physically place the JAR files in the Loadrunner directory under Program Files and set the Classpath statements accordingly.

Bottlenecks that you will not encounter in production

As discussed in the Load Test Environments article, performance and load test environments tend to be smaller versions of the production environment. These differences can include:

Less memory
Fewer and smaller physical CPUs
Fewer and less efficient disk arrays


Single or fewer instances of servers, e.g. 2 database servers in performance test, but 3 database servers in production or even no clustering at all

There can be other differences as well. System software & hardware versions and specifications can vary. Data can be older and less comprehensive than in production. There also may be different authentication, firewall or load balancing configurations in place. Data volumes may be less, the O/S may be 32bit instead of 64bit or vice versa.

The performance test itself is full of compromise. For instance:

Normally only a subset of functions are automated. The path through the function may vary slightly, but generally the same steps are repeated each iteration

Data can be mass produced in a uniform manner, affecting the way data is stored and accessed on the database. Some database tables can contain too little data, others too much

The behaviour of users can be unrealistic, for instance, if a requisition is raised, it is not generally fulfilled in reality until some days down the track. In a performance test, it may be 5 minutes later.

Workload being processed varies in the normal course of events, during a performance test it can remain uniform

This can create many problems for the performance tester. How accurate is the performance test? When the application goes into production and falls over in a heap, will I look incompetent? Understanding the differences between any performance test environment and the production environment is essential. This can help detect and understand an artificial bottleneck, which is something quite different from a performance bottleneck.

An artificial bottleneck is essentially a performance problem that is a direct result of a difference between the production environment or workload and the performance test environment or workload. It is not a performance bottleneck. A performance bottleneck is something that could or is happening in production.

When a performance bottleneck is found, the performance tester must investigate the symptoms in an attempt to try and pinpoint the cause of the issue. Care must be taken to distinguish between a genuine performance bottleneck and an artificial bottleneck brought about purely because of differences in performance test compared to production.

Examples of artificial bottlenecks include:

Database locking - Performance testing results in a subset of functionality being automated. The entire workload is then spread across this subset of automation, resulting in some functions being driven at a higher frequency during the performance test than would be seen in production. This can result in database locking which would not occur in production. The solution? The problem function should be driven no higher than the maximum peak workload expected in production.

Poor response times and excessive database physical or logical I/O - If a large amount of data has been recently added to the database via a data creation exercise, database tables can become disorganised, resulting in inefficient use of indexes. Additionally, if performance testing started with the database tables nearly empty of data, the optimiser can incorrectly decide that index use is not required. The solution? Run a database reorganisation and refresh optimiser statistics at regular intervals if large amounts of data are being added to the database.

Poor response times and excessive database physical I/O - The assumption here is that the database bufferpool is smaller in performance test than it is in production due to a shortage of memory in the performance test environment. Unfortunately, this is a case of poor performance of some users impacting the performance of all users. The solution? Once the problem function (or functions) is identified, attempt to decrease the workload for those functions to minimise physical I/O. If this is not possible, it may be time for a memory upgrade of the server. Memory upgrades are usually relatively straightforward in terms of cost and time.
Memory leak - This is a terrible term describing how memory is gradually consumed over time, implying that at some point there will be no free physical memory available on the server. Performance test environments often differ from production environments in terms of housekeeping. In production, the application could, for example, be refreshed each night; in performance test it may have been nine weeks since it was last refreshed. The amount of memory in production is often substantially more than in performance test. The solution? Base memory leak calculations on the production memory availability with reference to the production housekeeping regime. Remember that once physical memory is fully consumed, virtual memory is still available, so it’s not the end of the world as we know it.

Last but not least, when a genuine problem is found, the initial reaction from the developers, the DBAs, the architects, the project management, pretty much everyone really, is that the problem observed is not a real problem; it is an artifact of the test tool. There seems to be a disconnect from reality, a lack of understanding that the behaviour of the application changes depending on the workload. Every performance tester out there should know what I am talking about. In fact, please quote this article when discussing these issues with the project. It is down to the performance tester to come up with an approach that will satisfy everyone concerned that the problem being observed is or is not due to the automated test tool, i.e. whether or not it is an artificial bottleneck. This is a scientific approach: develop a theory, then design a test that should prove or disprove that theory. It does not really matter if the theory is right or wrong; every time you devise a new test and execute it, you learn something new about the behaviour of the application. The performance tester has to work hard to gain the trust of the project and, through thorough performance analysis, demonstrate that a performance problem is just that: a performance bottleneck that will cause an impact in production.

Examples of test tool generated performance issues:


Ramping up the workload too quickly. If virtual users start too quickly, the application struggles to open connections and service requests. For most applications, this is abnormal behaviour that you would not normally observe in a production environment.

Configuring the workload in such a way that it is spread unevenly across the period of an hour, for instance users with just 2 iterations to execute waiting an excessive amount of time between iterations, causing a quiet patch in the middle of a Load Test.

Restarting the application just before performance testing starts. While this can help to maintain consistency of results, throwing a large number of users at an application where the data cache is not populated and software and hardware connections are not established can cause an unrealistic spike in demand as well as much longer than expected response times. In reality, application restarts normally happen around 05:00 in the morning when few users are around. The first few users of the day will experience longer responses, but these will be limited. By the time a much larger workload starts to ramp up (such as in a performance test), the data cache, threads and connections are already open and available and will not cause users these long response times.

Putting a think time inside the start and end transaction markers that denote response time, so that the recorded response time for a transaction is much longer than it actually is.
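As a sketch of that last point (the transaction name and URL are hypothetical), the think time should sit outside the transaction markers:

    // Wrong: the 30 second think time is included in the measured response time.
    lr_start_transaction("submit_order");
    web_url("submit_order",
        "URL=http://myapp.example.com/orders/submit",
        "Mode=HTML",
        LAST);
    lr_think_time(30);
    lr_end_transaction("submit_order", LR_AUTO);

    // Right: measure only the server interaction, then pause.
    lr_start_transaction("submit_order");
    web_url("submit_order",
        "URL=http://myapp.example.com/orders/submit",
        "Mode=HTML",
        LAST);
    lr_end_transaction("submit_order", LR_AUTO);
    lr_think_time(30);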

What are Service Level Agreements?

SLAs, or Service Level Agreements, are often a sought-after piece of information when gathering requirements before performance testing begins.

The premise is that the SLA constitutes a measurable requirement that can be tested and marked as 'pass' or 'fail' accordingly.

SLAs take two main forms:
1. System availability. A typical SLA here could be: 'the application is required to be available 23.5 hours per day, every day except Sundays.'
2. Response time. A typical SLA here could be: 'all user transactions must respond to the user within 2 seconds.'

Performance testers are normally looking towards the 2nd type of SLA, the response time, as the system availability SLA cannot easily be tested.

Often an SLA for response times can be found, usually as a reference in a design document. Caution must be exercised. When an application is designed, high-level requirements are captured. An SLA at this stage is not normally a mandatory requirement, merely a guideline, a statement of understanding.

At Testing Performance, we do not treat SLAs as a measurable requirement. Let's take a typical user journey - it involves:

1. Login

2. Menu selection

3. Completion of form ‘A’

4. Completion of form ‘B’

5. Completion of form ‘C’

6. Submission of all data to be updated onto the database.

While it may be reasonable for steps 2 to 5 to take only a couple of seconds to respond, the login and data submission steps will almost certainly take significantly longer than 2 seconds.

By the time the performance tester gets their hands on the application, it is almost certainly too late to take either the login or the data submission steps and rework them so they take less than 2 seconds. In fact, designers given the task of reducing the response time would look at taking that single user action and separating it out into 2 or more user actions. This would of course be no quicker for the end user, but it would meet the response time SLA.

Performance testing to an SLA requirement is really a red herring. The project is much better off looking at the efficiency of the code and the application, ensuring that the application responds as quickly as possible given the amount of work that each user action is required to do. This can take place in two ways:

1. The performance tester can analyse the response times of user actions at a low workload. Any user action where the response time seems to be higher than expected can be traced, monitored and checked to determine if there are any inefficiencies.

2. As the workload is increased, the performance tester can look at how the response times of transactions deviate from the baseline as the workload increases.

Load Test Environments

On occasions, the Load Tester has the luxury of Load Testing in the production environment. This is rare - normally this would happen before an application goes into production for the first time. In other circumstances, the system administrators can take down the production instance of an application, and start in its place an environment that can be used for Load Tests. Again, this is rare as it normally requires Load Testing to take place during unsociable hours and it causes an outage to users or customers. There is also a risk that on bringing the production instance back up, problems will occur.

The best a Load Tester should realistically hope for is a production like disaster recovery environment that can be used for Load Testing. While there will always be differences with production, the fact that it could one day be asked to support production gives the Load Tester a high degree of confidence.

The norm for a Load Test environment is a halfway house between a small sized functional test environment and the full sized production environment. Often the Load Test environment is on a reasonably sized hardware platform which is shared with a number of other test environments. In many cases, more test environments are squeezed onto a configuration than should really be the case. A limitation of memory means that paging can and will occur. This can cause response times to increase hugely. It can also cause CPU utilisation to increase if it is not already at 100%.

TP recommend that in this circumstance, the Load Test environment is looked at very carefully. While Load Test preparations can take place in an environment like this, execution of a Load Test would be fraught with problems. There are a number of solutions that the Load Tester can recommend to the project, including:

1. Purchase hardware and relocate the Load Test environment. This option is not realistic. The cost of an environment would likely be so high that any self-respecting infrastructure manager would rule it out without a second thought. Lead times for purchase and installation of hardware are normally weeks, if not months. A project manager is more likely to cancel Load Testing than go for this option.

2. Wait until some of the other testing phases are complete, then shut those test environments down, thus freeing up system resources. This is a more sensible option; however, project timescales and implementation dates are often set well in advance. Delay can cause many problems, increasing the cost of implementation. Again, this is not something a project manager would be likely to accept.

3. Plan for Load Test execution to take place out of hours. Functional testing normally takes place during normal working hours, i.e. from 08:00 till 18:00 though this is not always the case. Testers could be located in other time zones, extending the period which functional testing takes place. Testing often runs late, overtime is used to decrease the elapsed time for testing a code drop, again extending the test window. Even considering these exceptions, the Load Tester should always be able to find a window of opportunity when there is no functional testing taking place. This may be at the weekend or possibly weekdays between 00:00 & 04:00. These unsociable hours are not much fun, but they can provide a cost effective opportunity to Load Test on halfway decent kit without the interference of other users. Request that the system administrators maintain two configurations of the Load Test environment:

a. The first configuration would be defined to use a minimum of resources that could co-exist with other test environments during the day. This would enable the Load Tester to develop scripts, create and extract data, set up monitoring and carry out all those other preparations necessary for building a Load Test pack.

b. The 2nd configuration would be large sized and used for Load Test execution. Before Load Test execution begins, the system administrators would take down all other test environments that share hardware with the Load Test and bring up the large sized Load Test environment. This should enable a much better Load Test to be run, with a greatly increased chance of meeting Load Testing objectives and understanding the performance profile of the application.

Database Tuning

Performance testing is an ideal way to generate a workload against a database. Great, now what? How about looking at the performance of the database? Many performance testers treat the database as a black box. Leave it to the experts they say. Unfortunately, if a performance problem is to occur anywhere in an n-tier application, it is most likely to occur in the database. Evaluation of the database is not really that difficult and comes down to understanding just a couple of key concepts.

Physical I/O

This must be minimised as much as possible. Physical I/O to disk is not only slow, it also consumes CPU. There are many ways of reducing physical I/O; traditionally this would involve:

* Ensuring that the database has sufficient memory available for it to be able to store commonly accessed data in a data cache.
* Ensuring that data retrieving queries efficiently find target data without having to trawl through large tables.

CPU Utilisation

While a low CPU utilisation for a database server may appear to be good, it could be symptomatic of a poorly configured database. If the database is I/O bound, or lacks a sufficiently tuned configuration, it is possible that the database is able to service more requests for data but is simply not receiving those requests very quickly, i.e. the requests for data are queuing.

In a well configured database server, busy periods will see CPU utilisation around 80%, with occasional peaks in workload taking the CPU utilisation closer to 100%.

CPU utilisation can be reduced by investigating the following areas:

As a request for data arrives at the database, the SQL query must be parsed. This basically means that Oracle investigates the statement and decides how it will access the database tables so as to satisfy the query request. If the query has already been executed, it may still be in the area of the cache where queries are stored. Reusing a query which has already been parsed and is in the cache uses much less CPU than parsing a query for the first time. To ensure this reduction in CPU utilisation, the performance tester should ensure two things:

The cache is large enough to store the required number of queries.
Bind variables should be used so that the where clause of the query is not affected by different data values, e.g. WHERE name = :b1 rather than WHERE name = 'smith'.

Once the query has been parsed, execution of the query will take place. It is essential that the database is able to find the requested data simply and easily. The access method that will be used to locate and retrieve the data is determined when the query is parsed. By using explain plan, the performance tester can determine what the access path is. An efficient query will require only a small number of rows to be read. Hopefully, these rows will be located in the database's buffer cache rather than on disk. An inefficient query will require a large number of rows to be accessed. Even if the much larger number of rows reside in the buffer cache, it still requires a certain amount of CPU to locate, retrieve and possibly sort the data.

What is Disaster Recovery Testing?

This is the process of verifying the success of the restoration procedures that are executed after a critical IT failure or disruption occurs. This could include the following testing:
What happens to the workload if some of the infrastructure for whatever reason becomes unavailable?
How long does it take to recover data if a corruption or data loss occurs?
How does the application cope if part of the network goes down?

This type of testing is sometimes carried out in conjunction with Load Testing. Disaster Recovery Testing is much more realistic if the tests are carried out while the application is busy servicing a user workload.

What is Spike Testing?

Spike testing is a type of load test. The object of this type of performance test is to verify a system's stability during bursts of concurrent user and/or system activity at varying degrees of load over varying time periods.

Examples of business situations that this type of test looks to verify a system against:
A fire alarm goes off in a major business centre - all employees evacuate. The fire alarm drill completes and all employees return to work and log into an IT system within a 20 minute period
A new system is released into production and multiple users access the system within a very small time period
A system or service outage causes all users to lose access to a system. After the outage has been rectified all users then log back onto the system at the same time

Spike testing should also verify that an application recovers between periods of spike activity.
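One way to create such a spike in Loadrunner is with a rendezvous point, so that all virtual users perform the critical step at the same moment. Here is a minimal sketch; the rendezvous name, transaction name and URL are hypothetical, and the release policy is configured in the Controller.

Action()
{
    // All virtual users block here until the rendezvous policy in the
    // Controller releases them together, producing the spike of activity.
    lr_rendezvous("logon_spike");

    lr_start_transaction("Logon");

    // Hypothetical logon page; in practice this would be the recorded,
    // parameterised logon sequence.
    web_url("logon_page",
        "URL=http://myapp.example.com/login",
        "Mode=HTML",
        LAST);

    lr_end_transaction("Logon", LR_AUTO);

    return 0;
}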

Memory Leaks in Load runner

Prevention Techniques:

Adequate Developer Staffing in Test Phase - Most developers would rather develop new code than fix defects, so it all too often happens that there are not enough developers to fix bugs as fast as the test group can find them. This leads to the software being delivered with defects not yet fixed. The solution is to plan for bugs to be found and to retain adequate resources to fix at least the high priority ones prior to release.

Adequate Ratio of Testers to Developers - Tester to developer ratios can vary with industry and project phase, but a general rule of thumb is to have no fewer than 1 tester for every 5 developers. A higher ratio of developers to testers may mean inadequate testing is accomplished, causing the product to be released with a high percentage of latent defects.

Adequate Staffing in Development Phase - Lack of sufficient staff may preclude assuring this attribute is not in the software. Using an estimation tool based on past performance may help allocate sufficient staff at appropriate times to preclude this attribute. However, simply adding staff, especially late in the project, may not help (Brooks's Law).

Coding Standards - A coding standard could limit the techniques used to allocate and de-allocate memory to those that are less error prone (a sketch follows this list of prevention techniques). The choice of language can also help lessen the probability of memory leaks, for example C++ versus Java (which has garbage collection to free unused memory).

Estimation Tool or Technique- Lack of sufficient time may preclude assuring this attribute is not in the software. Using an estimation tool based on past performance may help allocate sufficient time to preclude this attribute.
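As promised above, here is an illustrative sketch of the kind of allocation pattern a coding standard would target. This is a generic Loadrunner-style C fragment, not taken from any particular project.

Action()
{
    // Error-prone pattern: memory is allocated on every iteration and the
    // matching free() is easily forgotten, so the Vuser process consumes
    // more memory the longer the test runs.
    char *buffer = (char *)malloc(10240);

    if (buffer == NULL)
        return -1;

    strcpy(buffer, "working data built up during the iteration");
    lr_output_message("%s", buffer);

    // A coding standard that pairs every malloc() with a free() in the same
    // action is one of the 'less error prone' techniques referred to above.
    free(buffer);

    return 0;
}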



Detection Techniques:

Automated Nightly Build and Test- Nightly functional testing may detect memory leaks, especially if some of the tests are long running or repetitive.

Code and Test Peer Inspections- Peer reviews may catch exits where allocated memory has not been released.

Continuous Integration- Continuous integration testing may detect memory leaks, especially if some of the tests are long running or repetitive.

Dynamic Analysis - Dynamic analyzers include leak detectors which can detect memory not being returned to the heap.

Functional / Regression Testing- Functional testing may detect memory leaks, especially if some of the tests are long running or repetitive.

Fuzzing or Automated Random Testing- Random testing, if allowed to run for long periods of time may be a good approach to find memory leaks or resources not available or heavily loaded at certain times.

Graphics User Interface (GUI) Testing - GUI tests that run for many hours automatically may be able to find memory leaks, which show up as low-memory messages or system crashes.

Load, Stress, Failover, Performance Test- LSP testing may show how the system handles situations which meet or exceed specifications (number of users, amount of data stored, transfer rates, etc.) to determine if the system meets specification and degrades gracefully beyond specified levels of performance. Unhandled exceptions may be exposed under these types of tests which lock up the system (for instance running out of memory due to a memory leak).

Negative Testing- Negative testing includes deliberately inputting incorrect information into the system under test which should produce meaningful error messages that correlate to the negative inputs. System lock ups or crashes due to memory leaks may be detected and corrected so that an error message will be issued and/or the system will degrade gracefully.

Release Testing (alpha, beta)- Release testing may detect memory leaks, especially if some of the tests are long running or repetitive.

Smoke (Check In) Testing- Smoke testing may detect memory leaks, especially if some of the tests are long running or repetitive.

Static Analysis- Static analysis can detect situations where memory is allocated and not deallocated.

Thin Client Application Tests

An internet browser that is used to run an application is said to be a thin client. But even thin clients can consume substantial amounts of CPU time on the computer that they are running on. This is particularly the case with complex web pages that utilise many recently introduced features to liven up a web page. Rendering a page after hitting a SUBMIT button may take several seconds, even though the server may have responded to the request in less than one second. Testing tools such as WinRunner can be used to drive a thin client, so that response time can be measured from a user's perspective, rather than from a protocol level.

Thick Client Application Tests

A Thick Client (also referred to as a fat client) is a purpose built piece of software that has been developed to work as a client with a server. It often has substantial business logic embedded within it, beyond the simple validation that is able to be achieved through a web browser. A thick client is often able to be very efficient with the amount of data that is transferred between it and its server, but is also often sensitive to any poor communications links. Testing tools such as WinRunner are able to be used to drive a Thick Client, so that response time can be measured under a variety of circumstances within a testing regime.

Developing a load test based on thick client activity usually requires significantly more effort for the coding stage of testing, as VUGen must be used to simulate the protocol between the client and the server. That protocol may be database connection based, COM/DCOM based, a proprietary communications protocol or even a combination of protocols.

Protocol Tests

Protocol tests involve the mechanisms used in an application, rather than the applications themselves. For example, a protocol test of a web server would typically involve a number of HTTP interactions that would occur if a web browser were to interact with the web server - but the test would not be done using a web browser. LoadRunner is usually used to drive load into a system using VUGen at a protocol level, so that a small number of computers (Load Generators) can be used to simulate many thousands of users.

Tuning Cycle Tests


A series of test cycles can be executed with a primary purpose of identifying tuning opportunities. Tests can be refined and re-targeted 'on the fly' to allow technology support staff to make configuration changes so that the impact of those changes can be immediately measured.

Sociability (sensitivity) Tests

Sensitivity analysis testing can determine the impact of activities in one system on another related system. Such testing involves a mathematical approach to determine the impact that one system will have on another. For example, web enabling a customer 'order status' facility may impact the performance of telemarketing screens that interrogate the same tables in the same database. The issue with web enabling can be that it is more successful than anticipated and results in many more enquiries than originally envisioned, which loads the IT systems with more work than had been planned.

Network Sensitivity Tests in load runner

Network sensitivity tests are variations on Load Tests and Performance Tests that focus on Wide Area Network (WAN) limitations and network activity (e.g. traffic, latency, error rates). Network sensitivity tests can be used to predict the impact of a given WAN segment or traffic profile on various applications that are bandwidth dependent. Network issues often arise at low levels of concurrency over low bandwidth WAN segments. Very 'chatty' applications can appear to be more prone to response time degradation under certain conditions than other applications that actually use more bandwidth. For example, some applications may degrade to unacceptable levels of response time when a certain pattern of network traffic uses 50% of available bandwidth, while other applications are virtually unchanged in response time even with 85% of available bandwidth consumed elsewhere.

This is a particularly important test for deployment of a time critical application over a WAN.

Also, some front end systems such as web servers, need to work much harder with 'dirty' communications compared with the clean communications encountered on a high speed LAN in an isolated load and performance testing environment.
Why execute Network Sensitivity Tests

The three principal reasons for executing Network Sensitivity tests are as follows:
Determine the impact on response time of a WAN link. (Variation of a Performance Test)
Determine the capacity of a system based on a given WAN link. (Variation of a Load Test)
Determine the impact on the system under test that is under 'dirty' communications load. (Variation of a Load Test)

Execution of performance and load tests for analysis of network sensitivity requires the test system configuration to emulate a WAN. Once a WAN link has been configured, performance and load tests conducted will become Network Sensitivity Tests.

There are two ways of configuring such tests.

Use a simulated WAN and inject appropriate background traffic.
This can be achieved by putting back to back routers between a load generator and the system under test. The routers can be configured to allow the required level of bandwidth, and instead of connecting to a real WAN, they connect directly through to each other.



When back to back routers are configured to be part of a test, they will basically limit the bandwidth. If the test is to be more realistic, then additional traffic will need to be applied to the routers.

This can be achieved by a web server at one end of the link serving pages and another load generator generating requests. It is important that the mix of traffic is realistic. For example, a few continuous file transfers may impact response time in a different way to a large number of small transmissions.





By forcing extra traffic over the simulated WAN link, the latency will increase and some packet loss may even occur. While this is much more realistic than testing over a high speed LAN, it does not take into account many features of a congested WAN such as out of sequence packets.

Use the WAN emulation facility within LoadRunner.
The WAN emulation facility within LoadRunner supports a variety of WAN scenarios. Each load generator can be assigned a number of WAN emulation parameters, such as error rates and latency. WAN parameters can be set individually, or WAN link types can be selected from a list of pre-set configurations. For detailed information on WAN emulation within LoadRunner follow this link -mercuryinteractive.com/products/LoadRunner/wan_emulation.html.



It is important to ensure that measured response times incorporate the impact of WAN effects both at an individual session, as part of a performance test, and under load as part of a load test, because a system under WAN affected load may work much harder than a system doing the same actions over a clean communications link.

Where is the WAN?

Another key consideration in network sensitivity tests is the logical location of a WAN segment. A WAN segment is often between a client application and its server. Some application configurations may have a WAN segment to a remote service that is accessed by an application server. To execute a load test that determines the impact of such a WAN segment, or the point at which the WAN link saturates and becomes a bottleneck, one must test with a real WAN link, or a back to back router setup, as described above. As the link becomes saturated, response time for transactions that utilise the WAN link will degrade.

Response Time Calculation Example.

A simplified formula for predicting response time is as follows:

Response Time = Transmission Time + Delays + Client Processing Time + Server Processing Time.

Where:
Transmission Time = Data to be transferred divided by Bandwidth.

Delays = Number of Turns multiplied by 'Round Trip' response time.

Client Processing Time = Time taken on the user's software to fulfil the request.

Server Processing Time = Time taken on server computer to fulfil request.
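As a small worked example of the formula (the figures are purely illustrative, not taken from any measured system):

#include <stdio.h>

int main(void)
{
    /* Illustrative figures: a 100 KB page over a 512 kbit/s WAN link, with
       10 application turns at a 50 ms round trip, plus nominal client and
       server processing times. */
    double data_bits       = 100.0 * 1024 * 8;   /* data to be transferred   */
    double bandwidth_bps   = 512.0 * 1000;       /* link bandwidth           */
    double turns           = 10.0;               /* application turns        */
    double round_trip_secs = 0.050;              /* round trip response time */
    double client_secs     = 0.3;                /* client processing time   */
    double server_secs     = 0.4;                /* server processing time   */

    double transmission = data_bits / bandwidth_bps;   /* 1.6 s */
    double delays       = turns * round_trip_secs;     /* 0.5 s */
    double response     = transmission + delays + client_secs + server_secs;

    printf("Predicted response time: %.1f seconds\n", response);   /* ~2.8 s */
    return 0;
}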

Volume Tests using Load Runner

Volume Tests are often most appropriate to Messaging, Batch and Conversion processing type situations. In a Volume Test, there is often no such measure as Response time. Instead, there is usually a concept of Throughput.

A key to effective volume testing is the identification of the relevant capacity drivers. A capacity driver is something that directly impacts on the total processing capacity. For a messaging system, a capacity driver may well be the size of messages being processed.
Volume Testing of Messaging Systems

Most messaging systems do not interrogate the body of the messages they are processing, so varying the content of the test messages may not impact the total message throughput capacity, but significantly changing the size of the messages may have a significant effect. However, the message header may include indicators that have a very significant impact on processing efficiency. For example, a flag saying that the message need not be delivered under certain circumstances is much easier to deal with than a flag saying that the message must be held for delivery for as long as necessary and must not be lost. In the former case, the message may be held in memory, but in the latter case the message must be physically written to disk multiple times (a normal disk write and another write to a journal mechanism of some sort, plus possible mirroring writes and remote failover system writes).

Before conducting a meaningful test on a messaging system, the following must be known:
The capacity drivers for the messages (as discussed above).
The peak rate of messages that need to be processed, grouped by capacity driver.
The duration of peak message activity that needs to be replicated.
The required message processing rates.

A test can then be designed to measure the throughput of a messaging system, as well as the internal messaging system metrics while that throughput rate is being processed. Such measures would typically include CPU utilisation and disk activity.
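
To illustrate the measurement side, the Python sketch below drives a hypothetical send_message() function with messages grouped by a size-based capacity driver and reports the achieved throughput. The function, message sizes and counts are all stand-ins for whatever interface and capacity drivers apply to the actual messaging system.

    import time

    def send_message(payload):
        # Stand-in for the real messaging interface (e.g. a queue put or an MQ call).
        pass

    def measure_throughput(message_size_bytes, count):
        """Send `count` messages of a given size and return messages per second."""
        payload = b"x" * message_size_bytes
        start = time.perf_counter()
        for _ in range(count):
            send_message(payload)
        elapsed = time.perf_counter() - start
        return count / elapsed

    # Group the test by capacity driver (message size), as discussed above.
    for size in (1_024, 10_240, 102_400):
        rate = measure_throughput(size, count=10_000)
        print(f"{size:>7} byte messages: {rate:,.0f} msg/sec")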

It is important that a test be run at peak load for a period of time equal to or greater than the expected duration of peak load in production. To run the test for less time would be like trying to test a freeway system with peak hour vehicular traffic but limiting the test to five minutes: the traffic would be absorbed into the system easily, and you would not be able to determine a realistic forecast of the peak hour capacity of the freeway. You would intuitively know that a reasonable test of a freeway system must include the entire 'morning peak' and 'evening peak' traffic profiles, as the two peaks are very different. (Morning traffic generally converges on a city, whereas evening traffic is dispersed into the suburbs.)
Volume Testing of Batch Processing Systems

Capacity drivers in batch processing systems are also critical as certain record types may require significant CPU processing, while other record types may invoke substantial database and disk activity. Some batch processes also contain substantial aggregation processing, and the mix of transactions can significantly impact the processing requirements of the aggregation phase.

In addition to the contents of any batch file, the total amount of processing effort may also depend on the size and makeup of the database that the batch process interacts with. Also, some details in the database may be used to validate batch records, so the test database must 'match' test batch files.

Before conducting a meaningful test on a batch system, the following must be known:
The capacity drivers for the batch records (as discussed above).
The mix of batch records to be processed, grouped by capacity driver.
Peak expected batch sizes (check end of month, quarter & year batch sizes).
Similarity of production database and test database.
Performance Requirements (e.g. records per second).

Batch runs can be analysed and the capacity drivers identified, so that large batches can be generated to validate processing within batch windows. Volume tests are also executed to ensure that the anticipated volumes of transactions can be processed and that they satisfy the stated performance requirements.
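
To make the idea of generating representative batches concrete, the Python sketch below writes a test batch file from a specified mix of record types. The record types, their proportions and the CSV layout are purely illustrative; in practice they would come from analysis of production batch runs.

    import csv
    import random

    # Hypothetical capacity drivers: each record type exercises the batch process differently.
    RECORD_MIX = {
        "simple_update": 0.70,        # light CPU, little database activity
        "complex_calculation": 0.20,  # heavy CPU
        "aggregation_input": 0.10,    # feeds the aggregation phase
    }

    def generate_batch(path, total_records):
        """Write a batch file whose record mix matches the expected production profile."""
        record_types = list(RECORD_MIX)
        weights = list(RECORD_MIX.values())
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            for record_id in range(total_records):
                record_type = random.choices(record_types, weights)[0]
                writer.writerow([record_id, record_type, "payload-placeholder"])

    # A peak (e.g. end of month) batch size would be taken from production analysis.
    generate_batch("test_batch.csv", total_records=500_000)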

Performance Tests using LoadRunner

Performance Tests are tests that determine the end to end timing (benchmarking) of various time critical business processes and transactions while the system is under low load, but with a production sized database. This sets the 'best possible' performance expectation under a given infrastructure configuration. It also highlights, very early in the testing process, whether changes need to be made before load testing is undertaken. For example, a customer search may take 15 seconds in a full sized database if indexes have not been applied correctly, or if an SQL 'hint' is embedded in a statement that was optimised with a much smaller database. Performance testing would highlight such a slow customer search transaction, which could then be remediated prior to a full end to end load test.

It is 'best practice' to develop performance tests with an automated tool, such as WinRunner, so that response times from a user perspective can be measured in a repeatable manner with a high degree of precision. The same test scripts can later be re-used in a load test and the results can be compared back to the original performance tests.
Repeatability

A key indicator of the quality of a performance test is repeatability. Re-executing a performance test multiple times should give the same set of results each time. If the results vary from run to run without any changes being made, then differences observed after a change cannot reliably be attributed to the application, configuration or environment.
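
One simple way to quantify repeatability is to compare summary statistics across repeated runs, as in this Python sketch; the timings are invented, and the acceptable level of variation is whatever the test team has agreed.

    from statistics import mean, pstdev

    # Invented response times (seconds) from three executions of the same performance test.
    runs = [
        [2.1, 2.3, 2.2, 2.4],
        [2.2, 2.2, 2.3, 2.5],
        [2.1, 2.4, 2.2, 2.3],
    ]
    run_means = [mean(run) for run in runs]
    variation = pstdev(run_means) / mean(run_means) * 100
    print(f"Run-to-run variation: {variation:.1f}% of the mean response time")
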
Performance Tests Precede Load Tests

The best time to execute performance tests is at the earliest opportunity after the content of a detailed load test plan has been determined. Developing performance test scripts at such an early stage provides the opportunity to identify and remediate serious performance problems, and to reset expectations, before load testing commences.

For example, management expectations of response time for a new web system that replaces a block mode terminal application are often articulated as 'sub second'. However, a web system, in a single screen, may perform the business logic of several legacy transactions and may take 2 seconds. Rather than waiting until the end of a load test cycle to inform the stakeholders that the test failed to meet their formally stated expectations, a little education up front may be in order. Performance tests provide a means for this education.

Another key benefit of performance testing early in the load testing process is the opportunity to fix serious performance problems before even commencing load testing.

A common example is one or more missing indexes. When performance testing of a "customer search" screen yields response times of more than ten seconds, there may well be a missing index or a poorly constructed SQL statement. By raising such issues prior to commencing formal load testing, developers and DBAs can check that indexes have been set up properly.

Performance problems that relate to the size of data transmissions also surface in performance tests when low bandwidth connections are used. For example, some data, such as images and "terms and conditions" text, is not optimised for transmission over slow links.
Pre-requisites for Performance Testing

A performance test is not valid until the data in the system under test is realistic and the software and configuration are production like. The following list describes the pre-requisites for valid performance testing, along with the caveats that apply to any testing conducted before a pre-requisite is satisfied:
Pre-requisite: Production Like Environment
Comment: Performance tests need to be executed on equipment of the same specification as production if the results are to have integrity.
Caveat where this is not satisfied: Lightweight transactions that do not require significant processing can be tested, but only substantial deviations from expected transaction response times should be reported. Low bandwidth testing of high bandwidth transactions, where communications processing contributes most of the response time, can also be conducted.

Pre-requisite: Production Like Configuration
Comment: The configuration of each component needs to be production like, for example the database configuration and the operating system configuration.
Caveat where this is not satisfied: While system configuration will have less impact on performance testing than on load testing, only substantial deviations from expected transaction response times should be reported.

Pre-requisite: Production Like Version
Comment: The version of software to be tested should closely resemble the version to be used in production.
Caveat where this is not satisfied: Only major performance problems, such as missing indexes and excessive communications, should be reported when testing a version substantially different from the proposed production version.

Pre-requisite: Production Like Access
Comment: If clients will access the system over a WAN, dial-up modems, DSL, ISDN, etc., then testing should be conducted using each communication access method. See Network Sensitivity Tests for more information on testing WAN access.
Caveat where this is not satisfied: Only tests using production like access are valid.

Pre-requisite: Production Like Data
Comment: All relevant tables in the database need to be populated with a production like quantity and a realistic mix of data. For example, having one million customers, 999,997 of whom are named "John Smith", would produce some very unrealistic responses to customer search transactions.
Caveat where this is not satisfied: Low bandwidth testing of high bandwidth transactions, where communications processing contributes most of the response time, can still be conducted.

Documenting Response Time Expectations.

Rather than simply stating that all transactions must be 'sub second', a more comprehensive specification for response time needs to be defined and agreed with the relevant stakeholders.

One suggestion is to state an average and a 90th percentile response time for each group of transactions that are time critical. In a set of 100 values sorted from best to worst, the 90th percentile is simply the 90th value in the list.
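
As a small Python sketch of that definition, using an invented sample of timings:

    from statistics import mean

    # Invented sample of 100 response times (seconds), sorted from best to worst.
    response_times = sorted(0.5 + 0.03 * i for i in range(100))
    average = mean(response_times)
    ninetieth = response_times[89]   # the 90th value in the sorted list, per the definition above
    print(f"Average: {average:.2f} s, 90th percentile: {ninetieth:.2f} s")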

Executing Performance Tests.

Performance testing involves executing the same test case multiple times with data variations for each execution, and then collating response times and computing response time statistics to compare against the formal expectations. Often, performance is different when the data used in the test case is different, as different numbers of rows are processed in the database, different processing and validation come into play, and so on.

By executing a test case many times with different data, a statistical measure of response time can be computed that can be directly compared against a formal stated expectation.
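
A minimal Python sketch of this approach is shown below. The run_customer_search() function, the customer identifiers and the target figures are all hypothetical; in a real test the function would be a scripted transaction replayed by the test tool.

    import random

    def run_customer_search(customer_id):
        """Stand-in for replaying the scripted search transaction and timing it;
        here it simply returns a simulated response time in seconds."""
        return random.uniform(1.5, 3.5)

    # Vary the data on every iteration so different rows and code paths are exercised.
    test_data = [f"CUST{n:05d}" for n in range(200)]
    timings = sorted(run_customer_search(customer) for customer in test_data)

    average = sum(timings) / len(timings)
    ninetieth = timings[int(len(timings) * 0.9) - 1]
    print(f"Average {average:.2f} s, 90th percentile {ninetieth:.2f} s "
          f"(hypothetical targets: 2.5 s average, 4.0 s 90th percentile)")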

Targeted Infrastructure Tests in LoadRunner

Targeted Infrastructure Tests are isolated tests of each layer and/or component in an end to end application configuration. They cover the communications infrastructure, load balancers, web servers, application servers, crypto cards, Citrix servers, databases and so on, allowing identification of any performance issue that would fundamentally limit the overall ability of the system to deliver at a given performance level.

Each test can be quite simple. For example, a test ensuring that 500 concurrent (idle) sessions can be maintained by the web servers and related equipment should be executed prior to a full 500 user end to end performance test, because a configuration file somewhere in the system may limit the number of users to fewer than 500. It is much easier to identify such a configuration issue in a Targeted Infrastructure Test than in a full end to end test.
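
As a conceptual illustration only (not a substitute for a tool-driven test), the Python sketch below opens a number of idle TCP sessions against a component and reports how many it managed to hold; the host name and port are placeholders for the component under test.

    import socket

    def open_idle_sessions(host, port, count):
        """Open `count` TCP connections and hold them idle, to check connection limits in isolation."""
        sessions = []
        for n in range(count):
            try:
                sessions.append(socket.create_connection((host, port), timeout=5))
            except OSError as exc:
                print(f"Failed while opening session {n + 1}: {exc}")
                break
        return sessions

    # 'webserver.test' and port 80 are placeholders for the component under test.
    held = open_idle_sessions("webserver.test", 80, 500)
    print(f"{len(held)} idle sessions held open")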

The following diagram shows a simple conceptual decomposition of load to four different components in a typical web system.

[Diagram: load in a typical web system decomposed across four components.]

Targeted infrastructure testing separately generates load on each component, and measures the response of each component under load. The following diagram shows four different tests that could be conducted to simulate the load represented in the above diagram.

[Diagram: four targeted tests, one generating load against each component.]

Different infrastructure tests require different protocols. For example, VUGen supports a number of database protocols, such as DB2 CLI, Informix, MS SQL Server, Oracle and Sybase.
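
As an illustration of testing the database layer in isolation, the Python sketch below times queries driven directly against a database, using the built-in sqlite3 module purely as a stand-in for whichever database and protocol the targeted test would actually exercise.

    import sqlite3
    import time

    # sqlite3 is only a stand-in for the real database protocol (DB2 CLI, Oracle, etc.).
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, surname TEXT)")
    connection.executemany("INSERT INTO customer (surname) VALUES (?)",
                           [("Smith",)] * 10_000)

    start = time.perf_counter()
    for _ in range(1_000):
        connection.execute("SELECT COUNT(*) FROM customer WHERE surname = ?",
                           ("Smith",)).fetchone()
    elapsed = time.perf_counter() - start
    print(f"{1_000 / elapsed:,.0f} queries per second against the database layer in isolation")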