We’d like to give you some additional information about the ResDiary service disruption that occurred on 7th April 2018, what we did immediately to resolve the issue, and what we’re doing to mitigate the risk of a similar disruption occurring in the future.
At 18:17 on 7th April 2018, ResDiary support were notified of intermittent errors when users attempted to log in to diaries in the UK region. An on-call support engineer immediately checked internal logging for any alerts that would indicate the reason for the errors. The logs indicated a problem with the main diary application connecting to the servers that we use for storing information related to logged-in users. The engineer restarted the affected servers and monitored the web traffic to confirm that connectivity had been restored. At this point normal traffic started to flow, and internal logging showed that the errors were no longer occurring.
The ResDiary infrastructure engineering team are working to move the user session servers, currently hosted on physical Rackspace machines, onto cloud-based machines. This will result in greater resilience and much quicker recovery in the event of any future failure.
In addition, a new version of the third-party software library that handles the management of logged-in users will be introduced.
The team are currently investigating the cause of the initial communication failure between the servers. The results of this investigation will form the basis of a more detailed action plan for preventing this situation in the future.