Friday, July 26, 2013

Failover Test

Failover tests verify that redundancy mechanisms work while the system is under load. This contrasts with load tests, which are conducted under anticipated load with no component failure during the test.

For example, in a web environment, failover testing determines what happens when multiple web servers are running under peak anticipated load and one of them dies.

Does the load balancer react quickly enough?


Can the other web servers handle the sudden dumping of extra load?
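The first question can be given a rough number before any test is run. The sketch below is a back-of-the-envelope estimate, not a model of any particular load balancer: it assumes load is spread evenly across the web servers and that the balancer ejects a server only after a fixed number of consecutive failed health checks. The function name and all parameters are hypothetical.

```python
def requests_at_risk(total_rps, servers, check_interval_s, failures_to_eject):
    """Estimate requests sent to a dead server before the balancer ejects it.

    Assumes an even spread of load and a balancer that ejects a server only
    after `failures_to_eject` consecutive failed health checks.
    """
    detection_window_s = check_interval_s * failures_to_eject
    per_server_rps = total_rps / servers
    return detection_window_s * per_server_rps


# Hypothetical figures: 1000 requests/s over 4 servers, checks every 5 s,
# ejection after 3 consecutive failures.
print(requests_at_risk(1000, 4, 5, 3))
```

With those made-up figures the estimate is 3750 requests directed at the dead server before it is taken out of rotation. A failover test then shows whether the real system does better or worse than that.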

Failover testing allows technicians to address problems in advance, in the comfort of a test environment, rather than in the heat of a production outage. It also provides a baseline of failover capability, so that a 'sick' server can be shut down with confidence, in the knowledge that the remaining infrastructure will cope with the surge of failover load.

Explanatory Diagrams:

The following is a configuration where failover testing would be required.



This is just one of many failover configurations. Some failover configurations can be quite complex, especially when there are redundant sites as well as redundant equipment and communications lines.

In this type of configuration, when one of the application servers goes down, the two web servers configured to communicate with it cannot take load from the load balancer, and all of the load must pass to the remaining two web servers. See the diagram below:




When such a failover event occurs, the surviving web servers come under substantial stress: they must quickly absorb the failed-over load, which will probably double their number of HTTP connections and application server connections in a very short time. The remaining application server is also subjected to a severe increase in load, along with the overheads of catering for it.
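The doubling can be checked with a little arithmetic. The sketch below assumes the layout described above (four web servers, each tied to one of two application servers) and a load balancer that spreads requests evenly over whichever web servers can still serve them. The server names, the `redistribute` function and the figure of 1000 requests are all made up for illustration.

```python
def redistribute(total_load, web_to_app, failed_app):
    """Return per-web-server load before and after an application server fails.

    `web_to_app` maps each web server to the application server it depends on.
    Load is assumed to be spread evenly over the web servers that can serve.
    """
    web_servers = list(web_to_app)
    before = {w: total_load / len(web_servers) for w in web_servers}

    # Web servers tied to the failed application server drop out of rotation.
    survivors = [w for w in web_servers if web_to_app[w] != failed_app]
    if not survivors:
        raise RuntimeError("no web server is left to take load")

    share = total_load / len(survivors)
    after = {w: (share if w in survivors else 0.0) for w in web_servers}
    return before, after


# Hypothetical topology matching the description: 4 web servers, 2 app servers.
topology = {"web1": "app1", "web2": "app1", "web3": "app2", "web4": "app2"}
before, after = redistribute(1000, topology, failed_app="app1")
for server in topology:
    print(server, before[server], "->", after[server])
```

Under these assumptions web3 and web4 each go from 250 to 500 requests, while app2 goes from serving half of the traffic to serving all of it.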

It is crucial to the design of any meaningful failover testing that the failover design is understood, so that the implications of a failover event while under load can be scrutinized.

Fail-back Testing:

After verifying that a system can sustain a component outage, it is also important to verify that when the component comes back up it is available to take load again, and that it can sustain the influx of activity when it comes back online.
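The shape of a fail-back check can be sketched against a toy model. The `Pool` class below is a hypothetical stand-in for a round-robin load balancer, not a real product's API. It only illustrates the assertion a fail-back test makes: a recovered server must go from receiving no traffic to receiving its fair share again.

```python
import itertools


class Pool:
    """Toy round-robin balancer used only to illustrate a fail-back check."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(servers)
        self._counter = itertools.count()

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def route(self):
        # Only servers currently marked healthy are eligible for traffic.
        live = [s for s in self.servers if s in self.healthy]
        if not live:
            raise RuntimeError("no healthy servers")
        return live[next(self._counter) % len(live)]


def share(pool, server, requests=100):
    """Fraction of `requests` routed to `server`."""
    hits = sum(1 for _ in range(requests) if pool.route() == server)
    return hits / requests


pool = Pool(["web1", "web2", "web3", "web4"])
pool.mark_down("web1")
print("during outage:", share(pool, "web1"))
pool.mark_up("web1")
print("after fail-back:", share(pool, "web1"))
```

In a real test the two measurements would come from load balancer statistics or server logs taken under load, and the second one should also be checked against the recovered server's response times, since a cold server may struggle with the sudden influx.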


4 comments:

Akhilesh said...

Thank you. Nice article. Useful.

Renjith said...

The post was really informative. Thank you very much.

Anonymous said...

Do you think failover testing has similarities with the Chaos Monkey that Netflix implements? If not, what are the differences? Thanks.

Unknown said...

It's informative, but it does not answer whether LoadRunner can handle a failover and react quickly enough. Can you suggest the best way to design and maintain this in LoadRunner? We need to keep a persistent connection from LR to the server. Because the connection was not staying persistent, I created a loop that runs once the connection is established. If there is a failover, the script should notice the connection error, exit the current while loop and open a fresh connection. This does not happen during a site failover, but within one site with multiple VMs it works well.