Summary of ResDiary Service Disruption in the UK Region - 10th and 11th March 2018

We’d like to give you some additional information about the ResDiary service disruption that occurred over Mother’s Day weekend (10th and 11th March 2018), what we did to resolve the issue at the time, and what we’re doing to mitigate the risk of similar disruption in the future.

Incident Summary

On both days, the high volume of traffic generated by Mother’s Day weekend meant that the UK web servers were unable to respond to requests in a timely manner. This resulted in UK-based diaries running slowly and/or becoming unresponsive. An on-call emergency support engineer was alerted promptly on the 10th and attempted a low-risk procedure on the UK servers to alleviate some of the stress they were under. This procedure helped the situation but did not completely resolve the issue. As the servers were becoming responsive again by this point, the decision was taken that it was not in the best interests of our customers to perform higher-risk fixes in the middle of service on Mother’s Day weekend.

On the 11th, a similar incident occurred at around lunchtime and our emergency support engineers, having anticipated the recurrence, performed a procedure that routed a portion of the traffic to other servers. This immediately reduced the load on the original web servers, and UK diaries became responsive again.

Mitigation Measures

We are currently in the process of changing the system so that all UK traffic is shared across a greater number of servers. Given that routing only a portion of the traffic to the new servers vastly improved the situation, we anticipate that sharing all of the traffic will offer a large improvement. To put the improvement into perspective, we went from operating at almost 100% capacity to operating at roughly 20-25% capacity on an extremely busy day, which means traffic would have to grow to around four to five times that peak before the servers came under the same stress. In addition, if traffic were to increase by an extreme amount, we now have the ability to quickly scale up the number of servers.
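
As a rough illustration of that headroom figure, using the approximate peak CPU readings quoted in the technical breakdown below, and assuming CPU load scales roughly linearly with traffic (a simplification):

```python
# Rough headroom estimate based on the approximate peak CPU figures from this
# incident. Illustrative only -- it assumes CPU load scales roughly linearly
# with traffic.

peak_cpu_before = 1.00   # ~100% CPU at peak before traffic was shared
peak_cpu_after = 0.20    # ~20-25% CPU at the same traffic level afterwards

headroom_factor = peak_cpu_before / peak_cpu_after

print(f"Traffic could grow roughly {headroom_factor:.0f}x before reaching the old peak load")
# -> Traffic could grow roughly 5x before reaching the old peak load
```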

Detailed Technical Breakdown

At 17:51 on the 10th, our support department alerted an on-call engineer that the system was running slowly. The engineer checked our monitoring system and saw that the servers running the UK diaries had very high CPU utilisation. The engineer immediately recycled the IIS app pool in an attempt to clear the request queue. This helped for long enough that the load began to drop and the servers became responsive again. At this point, the emergency support team began preparing to send a portion of traffic to a load balancer sitting in front of the original UK web servers and several new virtual machines, rather than directly to the UK servers.
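
For context on that step: recycling an IIS application pool restarts its worker process, which clears the backed-up request queue without rebooting the server, but it does nothing to reduce the incoming load. The recycle itself may well have been performed through IIS Manager; the sketch below simply shows the equivalent command-line operation using IIS’s appcmd tool, with a hypothetical pool name.

```python
import subprocess

# Hypothetical application pool name -- the real pool name isn't part of this post.
APP_POOL = "ResDiaryUK"

# appcmd.exe is IIS's built-in command-line administration tool.
APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"

# Recycling the pool restarts its worker process, discarding the queued
# requests. This gives temporary relief but doesn't reduce incoming traffic.
subprocess.run([APPCMD, "recycle", "apppool", f"/apppool.name:{APP_POOL}"], check=True)
```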

At 12:48 on the 11th, the emergency support team was alerted that UK diaries had become unresponsive and again immediately recycled the IIS app pool. This had no effect, so the engineers routed all consumer API traffic, along with any user who logged in after that point, to the load balancer. This resulted in an immediate drop in load on the servers, and diaries became responsive again. CPU utilisation on all servers fell from 80-100% to 20-25% after traffic was routed through the load balancer to the new servers.
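
Conceptually, that routing change can be thought of as a simple predicate: consumer API requests, and any session started after the cut-over, go through the load balancer pool, while existing sessions stay on the original servers. The sketch below only illustrates that logic; the path prefix, cut-over time and host names are placeholders, not our actual configuration.

```python
from datetime import datetime, timezone
from typing import Optional

# Placeholder values -- the real cut-over time, API path prefix and host
# names are not part of this post.
CUTOVER = datetime(2018, 3, 11, 13, 0, tzinfo=timezone.utc)
LOAD_BALANCER = "uk-lb.example.internal"
ORIGINAL_SERVERS = "uk-web.example.internal"


def choose_backend(path: str, login_time: Optional[datetime]) -> str:
    """Send consumer API traffic, and any user who logged in after the
    cut-over, through the load balancer; leave existing sessions on the
    original UK web servers."""
    if path.startswith("/api/consumer"):
        return LOAD_BALANCER
    if login_time is not None and login_time >= CUTOVER:
        return LOAD_BALANCER
    return ORIGINAL_SERVERS


# A consumer API call and a freshly logged-in user both go to the load
# balancer; a session that predates the cut-over stays where it was.
print(choose_backend("/api/consumer/availability", None))
print(choose_backend("/diary", datetime(2018, 3, 11, 14, 30, tzinfo=timezone.utc)))
print(choose_backend("/diary", datetime(2018, 3, 10, 9, 0, tzinfo=timezone.utc)))
```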
