Over the last month, the Buycraft platform has experienced several days of unreliability. Since we started Buycraft, one of our main priorities has been the performance of the website. Unfortunately, the last few weeks have been unacceptable. We know this, and we cannot apologise enough. We've been working around the clock to resolve the issues, and we are now at a point where things are stable again.
What caused the issues?
We knew that Black Friday would be important to our customers, so we wanted to ensure that our infrastructure was able to handle it. We decided to move to new hosting infrastructure the week before Thanksgiving, using servers on the East Coast of the US. We did this for three reasons: improved application performance, lower latency, and the ability to scale faster in the event of high load.
Buycraft has grown organically over time, which means that some of the codebase is around 4-5 years old. The new infrastructure revealed that some of this legacy code wasn't up to today's standards and was putting a strain on our database servers.
Rectifying and updating this legacy code of course took time. Even with our developers working full time on it, the work ran into the Black Friday weekend. Combining unstable code with high load (averaging 45,000 requests per minute vs. our usual 5,000) put us in a situation that we should never be in.
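To put those figures in perspective, the short calculation below converts the per-minute averages quoted above into per-second rates and a load multiplier. It uses only the two numbers given here; the variable names are purely illustrative.

```python
# Averages quoted above, in requests per minute.
usual_rpm = 5_000
black_friday_rpm = 45_000

# Convert to requests per second and compute the load multiplier.
usual_rps = usual_rpm / 60
black_friday_rps = black_friday_rpm / 60
multiplier = black_friday_rpm / usual_rpm

print(f"Usual load: {usual_rps:.0f} requests/second")
print(f"Black Friday load: {black_friday_rps:.0f} requests/second")
print(f"Increase: {multiplier:.0f}x")
```

In other words, the platform went from roughly 83 requests per second to an average of 750, a ninefold increase, while the legacy code was still being reworked.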
The next week involved doing our utmost to keep the website running while refactoring significant parts of our codebase. The result: a more stable application with more consistent response times. Something still didn't seem right - deep down we were still not happy, and we knew our customers felt the same.
On Sunday evening, everything went down again. We have developers on call 24/7/365, so we were investigating within 5 minutes. We assumed this was caused by the new infrastructure again, but then we noticed something else. Over the past few days, it became clear that we were being targeted with a large DDoS attack that had been ongoing for 10 days or more.
What have we done to fix it?
As well as refactoring our codebase where needed, we've invested in extra hardware and caching to handle any increases in load.
We've also worked with Cloudflare to mitigate these attacks. This has involved adding systems to detect and block high-volume attacks against our platform.
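For readers curious about what detecting and blocking high-volume traffic can look like in principle, here is a minimal sketch of a sliding-window rate limiter. It is not our production system or Cloudflare's implementation: the class name, the limits, and the example IP address are all assumptions made for illustration, and real DDoS mitigation works across many sources and layers rather than per client.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: blocks a client that exceeds
    max_requests within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps a client id to the timestamps of its recent accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: block this request
        hits.append(now)
        return True


# Illustrative limit: at most 3 requests per second from one client.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)
results = [limiter.allow("203.0.113.7", now=t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)]
print(results)  # [True, True, True, False, True]
```

The fourth request is rejected because three requests already fall inside the one-second window; by the fifth, the oldest has expired and the client is allowed through again.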
During the times of worst performance, it was necessary to reduce the frequency at which our plugins 'phone home' to once every 15 minutes. This has now been reset to its standard interval.
Is anybody there?
While we were doing everything we could to solve these issues, there was one thing we didn't do enough of - keeping you informed. We know you were being asked difficult questions by your players, and we were not giving you the answers. We need to be better at letting you know what's going on. We dropped the ball on this, and we're sorry.
To ensure that we communicate better, we have changed most of our internal policies:
- We've improved our status page to provide more detail, and given all our developers access so they can post updates on any ongoing issues. The status page is available at http://status.buycraft.net/
- We've improved our deployment process, reducing the amount of downtime required to deploy new code from 5+ minutes to 30 seconds
- We've made changes to procedures for our support team, so they know which developer to contact to get ongoing status updates which they can pass on to you
- We've committed to making better use of Twitter to provide real-time updates of any issues
The last couple of weeks have been tough, not just for us but for our customers, who have faced tough questions from their players. We are sorry to everyone who was affected. At the same time, we thank you for your patience and loyalty throughout this, and we are pleased that we've ended up with a stronger, more stable platform for everyone.