
Postmortem: Inability To Add To Cart


by Liam Wiltshire Posted on 10 October 2021

In the early hours (UTC) of this morning, Tebex suffered a technical difficulty that resulted in customers being unable to add products to their cart. Once we became aware of the issue, we worked as quickly as possible to resolve the matter; however, we are aware that it took longer than it should have done for us to become aware of the problem. At Tebex, we strive to build and maintain resilient, scalable systems, and indeed our uptime is usually very good. That said, on this occasion we've fallen significantly short of our usual high standards, and for that we apologise to everyone who uses Tebex - both Sellers and Customers.

For the sake of complete transparency, we are sharing our postmortem below. The actions that we are taking from this will be worked on as a high-priority issue tomorrow, and in the meantime we have additional on-call staff to ensure there are no knock-on effects for the rest of today. Once again, I personally wish to apologise to anyone affected by this, and if you have any further questions, please contact our support team, who will be happy to help.

Date

October 10, 2021

Title

Inability to add products to cart from Tebex Webstores affecting all users

Summary

Due to a failure to handle a certain edge condition in our legacy code, the internal calls required to add products to a cart from a webstore were unavailable. Because our alerting system focuses primarily on infrastructure rather than code, it didn’t detect this failure, resulting in a longer period of downtime than should have been required.

Customer Impact

During the time of this incident, most customers were unable to add items to their cart. Between 00:00 UTC and 06:00 UTC on October 10th, we only saw c. 30% of the expected value of successful purchases.

Incident Response Analysis

In the vast majority of cases, downtime or unavailability is caused by infrastructure issues (servers, networking, database, etc.) rather than logic or code-based issues. As such, most of our monitoring and alerting tools are focused on our overall infrastructure, so we didn’t receive an alert for this issue and instead relied on customer reports. Coupled with the fact that it happened at a time when our staff were out of office (c. 1am local time), this meant it took far longer to identify the issue than should have been the case.

Once we were aware of the issue, we were able to identify and rectify it quickly (within 20 minutes).

Post-Incident Analysis

Once aware of the issue, we were able to identify it quickly through our existing logging and system knowledge. The overall time from becoming aware of the issue to resolution of the primary problem was 20 minutes. As such, there is little that could be done to improve the time to diagnosis.

There was no code or infrastructure change that caused this issue; rather, it was the result of an edge case (a race condition) that wasn’t well handled in a certain area of our codebase. This edge case had not been considered, and as such wasn’t something we were aware of in terms of being able to prevent or reduce its impact. In particular, the system that triggered the initial issue has been in place for 20 months and has operated without issue during that time.

Timeline

00:00 UTC A race condition occurred in the generation of a daily dataset, resulting in two copies of the same dataset being generated - some sections of our codebase received the ‘original’ dataset, and some received the ‘regenerated’ dataset.

00:10 UTC Due to the conflict between the two datasets, calls to add items to baskets from webstores began failing. Because we did not expect such a business-logic error to occur without a code change being present first, no alerts were triggered by our monitoring system.

06:30 UTC Management saw customer messages in relation to the issue - a quick check confirmed this was a widespread and ongoing issue, and identification started immediately.

06:49 UTC A deploy went out to fix the immediate issue, and customers could start adding items to baskets again. Some related errors required additional configuration changes, which went out during the following 20 minutes.

07:15 UTC All error rates had returned to normal levels, and all functionality was working as expected.

Deep Dive On Contributing Factors

The initial problem was caused by an unexpected (and thus unaccounted-for) edge case in a daily process that generates platform-wide datasets. In many cases, the sort of conflict that occurred can be detected and dealt with by the platform itself, but in this specific case the existence of some legacy code meant that no handling for this type of problem existed.
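
We haven't gone into the internals of that system here, but as a rough illustration of the kind of guard that prevents this class of race (the table and function names below are illustrative, not our real schema), a generation job can atomically ‘claim’ a given day before building anything, so that a second concurrent run becomes a no-op instead of producing a conflicting copy.

```python
import sqlite3
from datetime import date

# Illustrative sketch only: an atomic "claim" so that if two generation runs
# race at midnight, exactly one of them builds the day's dataset.
conn = sqlite3.connect("datasets.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS daily_dataset ("
    " dataset_date TEXT PRIMARY KEY,"  # one row per day, enforced by the database
    " status TEXT NOT NULL)"
)

def generate_daily_dataset(day: date) -> bool:
    """Return True if this run won the claim and built the dataset."""
    cur = conn.execute(
        # INSERT OR IGNORE is atomic: only one concurrent caller inserts the row.
        "INSERT OR IGNORE INTO daily_dataset (dataset_date, status) VALUES (?, ?)",
        (day.isoformat(), "generating"),
    )
    conn.commit()
    if cur.rowcount == 0:
        # Another run already claimed this day; bail out rather than
        # producing a second, conflicting copy of the dataset.
        return False
    # ... build and store the platform-wide dataset here ...
    conn.execute(
        "UPDATE daily_dataset SET status = 'ready' WHERE dataset_date = ?",
        (day.isoformat(),),
    )
    conn.commit()
    return True
```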

Within programming, it is virtually impossible to predict every possible failure scenario; however, a culture of defensive programming should mean that even unanticipated fail states can be handled and recovered from. This didn’t happen in this situation.
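
As a sketch of what that defensive stance can look like at the point of consumption (again, the names below are illustrative rather than our real code), the code that loads the day's dataset can refuse to silently tolerate duplicates: it resolves them deterministically and logs loudly, so every caller agrees on the same copy and the duplication itself is surfaced immediately rather than manifesting as failed basket calls.

```python
import logging
from dataclasses import dataclass
from datetime import date, datetime

logger = logging.getLogger("daily_dataset")

@dataclass
class Dataset:
    dataset_date: date
    generated_at: datetime
    payload: dict

def load_dataset_for(day: date, candidates: list[Dataset]) -> Dataset:
    """Defensively pick the dataset for `day` from whatever copies exist."""
    matches = [d for d in candidates if d.dataset_date == day]
    if not matches:
        # Fail fast with a clear error instead of letting callers limp along.
        raise RuntimeError(f"No dataset has been generated for {day}")
    if len(matches) > 1:
        # Unanticipated state: more than one copy exists. Resolve it
        # deterministically (newest wins) so every caller sees the same data,
        # and log at error level so the duplication itself gets noticed.
        logger.error("Found %d datasets for %s; using the newest", len(matches), day)
    return max(matches, key=lambda d: d.generated_at)
```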

Once the failure happened, an assumption that most out-of-hours outages will be caused by infrastructure problems (given that no code changes are taking place) meant that insufficient monitoring of code-based systems was in place, resulting in no alert being raised for our internal teams.

Once the problem was identified, because it was an unaccounted-for edge case, there was no defined process or tool to recover the dataset into a valid state. A manual change had to be made to do so, which caused some of the knock-on impact over the 20 minutes following resolution of the main issue.
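
The manual fix itself isn't detailed above, but the broad shape of a recovery tool for this kind of state is simple enough to sketch (with a hypothetical table that can hold multiple copies per day): pick one copy of the day's dataset as canonical and retire the rest inside a single transaction, so the repair cannot leave the data half-fixed.

```python
import sqlite3

def retire_duplicate_datasets(conn: sqlite3.Connection, day: str) -> int:
    """Keep the newest copy of `day`'s dataset and mark the others superseded."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute(
            "UPDATE daily_dataset_copies "
            "SET status = 'superseded' "
            "WHERE dataset_date = ? "
            "  AND id != (SELECT MAX(id) FROM daily_dataset_copies "
            "              WHERE dataset_date = ?)",
            (day, day),
        )
    return cur.rowcount  # number of stale copies retired
```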

What we’ve learnt

While we have expanded our monitoring over the past 12 months to include aspects of business logic and overall code monitoring, the majority of our monitoring is still focused on infrastructure. Our monitoring needs to be more widespread and cover all potential points of failure, even if we believe such a failure to be unlikely.

We further need to address any situations, particularly in older code, where a culture of defensive programming wasn’t in place, ensuring that even in the event of an unpredictable fail state, the platform can identify, resolve and recover in an automated manner as often as possible.

How we’re going to improve

Implement new monitoring of ‘baseline’ data points to raise alerts for logic (and other non-infrastructure) anomalies - a rough sketch of what such a check might look like follows below.

Audit legacy code to ensure that defensive programming principles are being used to detect and recover from fail states.
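
To illustrate the first of those improvements: a ‘baseline’ check doesn't need to be sophisticated. Comparing a business-level signal - for example, the rate of successful add-to-basket calls - against the same window on previous days, and alerting when it drops well below that baseline, would have flagged this incident within minutes. The metric source and thresholds below are illustrative, not our production values.

```python
from statistics import mean

def add_to_basket_rate_alert(current_rate: float,
                             recent_daily_rates: list[float],
                             min_ratio: float = 0.6) -> bool:
    """Return True if the current success rate falls well below its baseline."""
    baseline = mean(recent_daily_rates)
    if baseline == 0:
        return False  # nothing to compare against yet
    return (current_rate / baseline) < min_ratio

# Roughly 30% of the expected successful purchases (as seen in this incident)
# would comfortably trip an illustrative 60%-of-baseline threshold.
if add_to_basket_rate_alert(0.27, [0.90, 0.92, 0.88]):
    print("ALERT: add-to-basket success rate is far below its baseline")
```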
