1st August 2019 - Command Generation Lag and Downtime Post Mortem

Author:

Liam Wiltshire, CTO

Start Time:

Command Generation Lag: 1700 UTC
Downtime: 2100 UTC

Finish Time:

Downtime: 2120 UTC
Command Generation Lag: 2155 UTC

Reason For Downtime:

In the afternoon/evening of the 1st of August, a few webstores on the Tebex platform experienced higher than anticipated levels of traffic, due to specific events within those communities.

The elevated level of purchases resulted in our command generation daemon being unable to keep up. A race condition within the check to ensure that commands were not double-generated exacerbated the problem as it forced the daemons to hang for longer than expected. As a result of this, the lag in command generation started to increase, peaking at around 1,100 behind realtime shortly before 2100 UTC. We attempted to boot additional daemons, but due to the previously mentioned race-condition these had limited effect.

As a result of the higher than usual traffic and the queue for command generation, our databases were put under increased load. Our read replicas maxed out their CPU capacity from 2050 UTC, causing server timeouts and gateway errors caused by a lack of DB connections being available. This 100% event did trigger our DB scaling processes, however the process to scale up the DB is more involved than scaling up web nodes - a snapshot of the DB has to be taken, then once the new server is provisioned, this snapshot has to be loaded onto the new server, replica logs need to be replayed to bring it up to date from the time between the snapshot and the time the server becomes available, and then it needs to be added to the load balancer.

As this process took some time, while that scaling event was taking place there were a large number of errors being returned due to the lack of DB capacity. The additional DB replicas came online shortly before 2120 UTC, at which point the error levels reduced and webstore availability started settling back towards normal. Alongside this, we made some changes to our command generation daemon to resolve the race condition, meaning that it was able to more effectively work through the backlog of payments, resulting in the command generation lag reaching 0 (realtime) shortly after 2150 UTC.

Future Work:

Given the significant lag in the scaling event in our database layer, we will investigate ways to improve this. We will also perform additional testing on our command generation daemon to ensure the race condition cannot re-occur in any other situations.

Signed Off By:

Lee McNeil, CEO