As most of you know, we experienced an unacceptable level of downtime and instability over the Black Friday weekend and the week that followed. We published a blog post explaining some of the issues and what we’d done to resolve them, assuring you that we had things in place to make sure the Christmas period would run smoothly. We believed this to be the case; however, it’s become clear that we didn’t do enough.
Over the holidays we again experienced periods of slowness and system unavailability. While the fixes we put in place after Black Friday prevented the issues from getting even worse, we still dropped the ball - our platform didn’t perform up to the level that you, our customers, rightly expect, and for this we are sorry.
What Happened This Time?
As we’ve said previously, we know that Q4 is a critical time for our customers - a large proportion of your sales take place during this time, particularly between Black Friday and New Year. As we explained in our previous blog post, we migrated all our platforms to a new service provider (Digital Ocean) to allow us to scale better for this critical period.
Scaling databases can be particularly hard. The options for scaling a database are limited (vertical scaling, sharding or clustering are about all you can do), and fitting any of them to an existing application adds to the complexity. As such, we initially decided to move to a clustered solution with multiple master-master replication nodes sitting behind a load balancer.
The availability of data in real-time is important – because of this, we implemented Galera for near-synchronous replication. However, this caused problems: the nodes were never fully ‘catching up’, which led Galera to lock the databases in order to resync. This caused requests to hang, load to increase, and ultimately downtime for the platform.
We then decided to move to a vertically-scaled solution. Rather than lots of small nodes, we went with one large DB node (the biggest we could get), and an asynchronous read replica for redundancy. In theory this should have given us plenty of resources to handle everything that Christmas could throw at us.
Indeed, for what is normally the most critical asset (RAM) it worked – at no point did we go above 50% RAM usage. However, we found that when traffic ramped up we were becoming CPU-bound. Despite having a node with theoretically 20 CPU cores, we discovered that our steal time (the share of time in which the hypervisor makes the physical cores unavailable to our virtual machine) was regularly topping 50%. This is a ‘feature’ of the way our supplier's infrastructure works, but it is obviously something that doesn’t work for us. Needless to say, had we been aware of this limitation before we moved, we would never have done so.
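For readers who want to check steal time on their own Linux hosts, here is a minimal sketch of how the figure can be derived. It assumes the standard `/proc/stat` layout, where the aggregate `cpu` line lists cumulative ticks for user, nice, system, idle, iowait, irq, softirq and steal, in that order. The function names are ours, chosen for illustration, and this is not the monitoring we actually ran.

```python
def parse_cpu_line(line):
    """Return (total_ticks, steal_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    if len(fields) < 9:
        raise ValueError("this kernel does not report a steal column")
    # Columns: user, nice, system, idle, iowait, irq, softirq, steal.
    # guest and guest_nice are skipped: they are already counted in user/nice.
    ticks = [int(value) for value in fields[1:9]]
    return sum(ticks), ticks[7]


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    total_before, steal_before = parse_cpu_line(before)
    total_after, steal_after = parse_cpu_line(after)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal_after - steal_before) / elapsed


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()
```

In practice you would call `read_cpu_line()` twice, a few seconds apart, and pass both samples to `steal_percent`. A value that stays near zero is what you would hope for; ours was regularly above 50.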
In order to stop this from bringing down the entire platform, we had to make some difficult and unpopular decisions. On a normal day, 30%-50% of our traffic comes from the plugins – when the DB was CPU-bound at its worst levels, we had to disable the plugin and API endpoints to protect the availability of the control panel and your webstores.
Typically we were able to reinstate the endpoints fairly quickly (within 15 minutes or so), as soon as the load settled. On the 29th December in particular, however, every time we re-enabled the plugin and API endpoints it generated a huge amount of load, which resulted in us having to pull them offline again.
Since this point the load has stabilised and, except for a few instances of the plugin or API being unavailable for c. 5 minutes, we’ve seen no further downtime.
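The kind of decision described above can be sketched as a small gate that sheds lower-priority traffic while the database is overloaded. This is an illustration only, not our production code: the class name, the thresholds and the `/plugin` and `/api` path prefixes are all assumptions. Using two separate thresholds (hysteresis) is also our choice for the sketch; it is one common way to avoid switching endpoints back on the moment load dips.

```python
class EndpointGate:
    """Sheds low-priority endpoints while database CPU utilisation is too high."""

    def __init__(self, shed_above=0.90, restore_below=0.60,
                 low_priority_prefixes=("/plugin", "/api")):
        # All defaults are hypothetical values chosen for illustration.
        if restore_below >= shed_above:
            raise ValueError("restore_below must be lower than shed_above")
        self.shed_above = shed_above
        self.restore_below = restore_below
        self.low_priority_prefixes = tuple(low_priority_prefixes)
        self.shedding = False

    def observe(self, db_cpu_utilisation):
        """Update the gate from a utilisation sample in the range 0.0 to 1.0."""
        if not self.shedding and db_cpu_utilisation >= self.shed_above:
            self.shedding = True
        elif self.shedding and db_cpu_utilisation <= self.restore_below:
            self.shedding = False
        return self.shedding

    def allow(self, path):
        """Return True if a request for this path should be served right now."""
        if not self.shedding:
            return True
        # While shedding, only low-priority paths are refused.
        return not path.startswith(self.low_priority_prefixes)
```

With these example thresholds, a sample of 0.95 starts shedding, a later sample of 0.75 keeps the plugin and API paths closed, and they only reopen once utilisation falls to 0.60 or below. Control panel and webstore paths are served throughout.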
What Have We Done to Fix It?
It’s become clear that Digital Ocean isn’t suitable for our needs, so once again we are undertaking a project to migrate our platform. We will be working with Amazon Web Services (AWS) going forward - particularly because their Aurora platform is well suited to our database load profile. Migrating to AWS represents a significant additional investment, both in terms of the migration and the ongoing costs; however, we think it’s worth doing to ensure that our platform runs at the high level both we and our customers expect. We also want to assure our customers that we are dedicated to making sure the performance issues that have caused huge amounts of pain for so many of them do not repeat.
This migration will be taking place in the next week or so. We will do everything we can to ensure any downtime is kept to a minimum. Amazon provide an amazing database migration tool that should help keep our downtime as low as possible. We will make sure our status page is kept up-to-date with any required downtime.
The past month or so has been tough, not just for us but also for our customers. We’ve tried to ensure that our status page was constantly kept up to date, but equally we know that this doesn’t help in the middle of selling season! We are sorry that we’ve let you down, but we are confident that this time our new infrastructure with AWS will ensure that these are just unpleasant memories.