A typical scenario: a business launches a new website or hosts a large, well-publicised online event. Soon after the event starts, the site develops serious performance issues and complaints about the service appear on social media. In the worst case, the issues are reported on national television and in the wider media.

 

There have been numerous examples of this in the past few years. Ticketmaster in November 2022 is a notable example of what can happen without sufficient non-functional testing. In this case the volume of traffic was up to 4 times higher than Ticketmaster had expected, causing major issues for customers trying to buy Taylor Swift tickets.

 

What can be done to reduce the risk of your website failing, and to avoid the expensive and embarrassing consequences for your business?
 

Analysis 

An initial analysis is needed to fully understand the potential risks and issues around the planned launch/event. This should be approached from three different views:

 

a) Business View:

What are the details of the business launch/event?

What are the business risks? 

When will the business event take place? 

What are the projected user/customer numbers, transactions etc.? 

Is this a local business event or global? 

Has the business launch/event been widely advertised? 

 

b) Infrastructure View:

What is the capacity of the system? Does this match the projected business view? 

What are the known operational risks? 

Can the system be scaled up (and down) quickly if needed? 

What are the cost implications of auto-scaling? 

Is the system a single instance or distributed? 

Does the system have a Disaster Recovery capability? 

What are the known single points of failure? 

Are there any known limits or constraints in the infrastructure or network?
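The question of whether system capacity matches the business projections comes down to simple arithmetic. The sketch below is a minimal illustration only: the function name, the 70% target utilisation and all of the figures are assumptions made for the example, not measurements from any real system.

```python
import math

def instances_needed(projected_peak_rps: float,
                     per_instance_rps: float,
                     target_utilisation: float = 0.7) -> int:
    """Instances required to serve the projected peak while keeping each
    instance at or below the target utilisation."""
    if per_instance_rps <= 0 or not 0 < target_utilisation <= 1:
        raise ValueError("capacity must be positive and utilisation in (0, 1]")
    usable_rps = per_instance_rps * target_utilisation
    return math.ceil(projected_peak_rps / usable_rps)

# Hypothetical figures for illustration only.
projected_peak_rps = 12_000   # from the business view
per_instance_rps = 800        # from an earlier performance test
current_instances = 10

required = instances_needed(projected_peak_rps, per_instance_rps)
shortfall = max(0, required - current_instances)
print(f"required={required}, current={current_instances}, shortfall={shortfall}")
```

A non-zero shortfall is a prompt to revisit either the infrastructure plan or the business projection before the event.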

 

c) System/Application View:

Are there any limits or constraints in the system?

What are the user/customer number limits?

Are there any connection limits? 

Can the system queue requests? 

What happens when the system is overloaded? 
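One possible answer to the queueing and overload questions is a bounded queue that rejects new requests once the backlog is full, rather than letting it grow without limit. The sketch below uses Python's standard `queue` module; the class name, the backlog size and the request labels are hypothetical and chosen only for illustration.

```python
import queue

class BoundedRequestQueue:
    """Accepts requests up to a fixed backlog, then rejects new ones."""

    def __init__(self, max_backlog: int):
        # Note: queue.Queue treats maxsize <= 0 as "unbounded".
        self._queue = queue.Queue(maxsize=max_backlog)
        self.rejected = 0

    def submit(self, request) -> bool:
        """Return True if queued, False if shed because the backlog is full."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def next_request(self):
        """Return the oldest queued request, or None if the queue is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None

backlog = BoundedRequestQueue(max_backlog=3)
results = [backlog.submit(f"order-{i}") for i in range(5)]
print(results)            # the first three are queued, the last two are shed
print(backlog.rejected)   # number of requests turned away
```

Rejecting early with a clear "please try again" response is usually kinder to customers than letting every request time out.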

  

Mitigation Approaches 

Based on the information gathered from the risks and issues analysis, there are a number of mitigation approaches that can be taken. 

 

a) Design for Performance, Scalability and Resilience:

As part of the design, aspects of performance, scalability and resilience should be key considerations. 

Avoid any single points of failure if possible. 

Are there any bottlenecks in the service and data flow designs? 

Can the system scale horizontally or vertically if needed? 

Can key services be load balanced, or designed as high availability clusters? 

 

b) Performance Testing, Peak, Load and Stress tests:

Based on the projected system usage and volumes, execute appropriate performance tests, including worst case scenarios with stress testing. 
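In practice a dedicated load testing tool would normally be used, but the idea behind a stepped load test can be sketched in a few lines. The example below is an assumption-laden illustration: `fake_request` stands in for a real call to the system under test, and the user counts are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles

def timed_call(send_request) -> float:
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    send_request()
    return time.perf_counter() - start

def run_load_step(send_request, concurrent_users: int, requests_per_user: int):
    """Fire requests from a pool of worker threads and summarise latency."""
    total = concurrent_users * requests_per_user
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = list(pool.map(lambda _: timed_call(send_request), range(total)))
    # The last of the 19 cut points is the 95th percentile.
    p95 = quantiles(latencies, n=20, method="inclusive")[-1]
    return {"requests": total, "p95_seconds": p95, "max_seconds": max(latencies)}

def fake_request():
    # Stand-in for a real HTTP call; an assumption for illustration only.
    time.sleep(0.01)

# Step the load up towards (and beyond) the projected peak.
for users in (5, 10, 20):
    print(users, run_load_step(fake_request, users, requests_per_user=5))
```

Watching how the 95th percentile latency changes as the user count rises gives an early indication of where the system starts to degrade.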

 

c) Operational Acceptance Testing, DR, Capacity & Resilience tests:

Ensure that all Disaster Recovery tests are completed successfully. Execute extended soak tests to ensure resilience in normal operation. Allow enough headroom capacity for projected system growth. Test scaling up and down to ensure there are no bottlenecks. Ensure that monitoring and alerting are in place to allow for prompt corrective action if there are any issues.
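Headroom can be estimated roughly by projecting how long it takes for growing peak load to reach an alert threshold. The sketch below assumes simple compound monthly growth and an 80% alert threshold; both are illustrative assumptions, as are the figures used.

```python
def months_until_threshold(current_peak: float,
                           capacity: float,
                           monthly_growth: float,
                           alert_utilisation: float = 0.8) -> int:
    """Months of compound growth before peak load reaches the alert threshold."""
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    limit = capacity * alert_utilisation
    months = 0
    load = current_peak
    while load < limit:
        load *= 1 + monthly_growth
        months += 1
    return months

# Hypothetical figures: 500 requests/s peak today, 1,000 requests/s capacity,
# 5% growth per month.
headroom_months = months_until_threshold(500, 1000, 0.05)
print(f"Alert threshold reached in about {headroom_months} months")
```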

 

d) Implementation Options, Pilot, Phased, Big Bang:

Decide how the system/event will be implemented. This will depend on the business requirements. A phased or pilot implementation should ideally be preferred, as it allows any issues to be contained and corrected before rolling out to a wider audience. However, for business events, “big bang” is usually the only option. In this case, prior risk analysis and appropriate testing are key to a successful event. In a global event, launch timing can be staggered to avoid activity being concentrated in a specific time period.
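Where a phased implementation is possible, one common way to control who sees the new system is to assign each user to a stable bucket and widen the range of enabled buckets over time. The sketch below is one such approach under stated assumptions: the function name, user identifiers and percentages are hypothetical.

```python
import hashlib

def in_rollout(user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets and enable
    the buckets below rollout_percent."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent

users = [f"user-{i}" for i in range(1000)]  # hypothetical user IDs
for percent in (5, 25, 100):
    enabled = sum(in_rollout(user, percent) for user in users)
    print(f"{percent}% phase: {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the user ID, a user enabled in an early phase stays enabled in every later phase.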

Even when all of these approaches are taken into consideration, there is still the chance of an unexpected issue affecting the event, such as an attack on the software or infrastructure layers during the event, or business projections that turn out to be significant underestimates. How the system handles an overload will be critical in this case, including any mitigating actions that can be taken to stop bad actors from continuing to affect the event.
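One widely used mitigating action against abusive clients is rate limiting. The sketch below shows a token bucket, which allows short bursts but caps the sustained request rate. It is an illustration rather than a production implementation: the class, its parameters and the injected fake clock are all assumptions for the example.

```python
import time

class TokenBucket:
    """Allows bursts up to `burst` requests, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A fake clock makes the example deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate_per_second=2, burst=3, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(5)]
fake_now[0] = 1.0  # one second later, two tokens have been refilled
after_wait = [bucket.allow() for _ in range(3)]
print(burst, after_wait)
```

In a real deployment one bucket would typically be kept per client or per IP address, so that a single bad actor is throttled without affecting everyone else.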

2i, a market leader in non-functional testing, can help you with these activities through the application of our services.

 

Reach out to us on LinkedIn or Twitter