Tech
Ibbad Hafeez
Mar 21, 2024
How our engineering teams prepare for scalability events at Wolt
At Wolt, seamless scalability of our services isn’t just a technical requirement; it’s vital for our business. We now operate in 27 countries around the world, serving over 36 million customers, 230,000+ courier partners, and 140,000+ merchants, so we need to be ready for high volumes of orders on our platform at all times. Given the immediate and direct impact of any issues on customer satisfaction, platform reliability throughout the year is a key priority for our engineering team.
Over the years, we’ve put a lot of engineering and operations effort into ensuring that none of our customers experience issues when using our application, especially on the days when they’re using it the most. Holidays and celebrations like Valentine’s Day and New Year’s Day mean that a lot of people use Wolt to order food and other goods. From a technical perspective, we call these days ‘scalability events’, as our systems are exposed to significantly higher loads. After an incident in 2021, when our customer application went down for several hours, we learned that we needed to improve the way we prepare for such events. In this blog post, we’ll go through the learnings and processes we’ve built to ensure our platform functions seamlessly even through the highest peaks.
Throwback to a not-so-happy Valentine’s Day
Valentine’s Day 2021 was a pivotal moment that intensified our focus on our platform’s reliability during high-traffic events. On that day, a record-breaking surge in user activity led to an unforeseen database failure, triggering a cascading effect on several critical components of our system. The incident disrupted our delivery operations for about an hour and required almost four hours for a full recovery. Incidents like this can significantly damage our reputation with customers, courier partners, and merchants over the long term. As we always strive to deliver an excellent experience for all our customers and partners, we knew we had to prepare better for scalability events in the future.
The Valentine’s Day incident refueled our engineering efforts to improve our platform’s stability so that we’re prepared for high-demand days and can protect the user experience for all of Wolt’s customers, merchants, and courier partners. As with any incident, our engineering team did a thorough postmortem covering all affected systems to identify the root cause. We focused on the immediate actions needed to prevent the issue from recurring in the near future, as well as the improvements needed to future-proof our existing systems against similar issues. In addition to the engineering action items, we identified the need for a more formal process to ensure engineering readiness for high-demand days, aiming to maximize platform-wide availability. Let’s explore what was born from that process.
Identifying the types of scalability events we have at Wolt
At Wolt, scalability events come in various shapes and sizes, and distinguishing between the nature of these events helps us focus on the necessary preparations. For the sake of common understanding, these events have been categorized as:
High Volume Events: These events apply to all markets. Examples include New Year’s Day and Mother’s Day, when overall traffic on the platform is higher than average.
Flash Crowd Events: These events are mostly localized to a single market or even a single city, and bring a sudden spike in user traffic over a short window of time. Examples include local cultural and sporting events, such as football matches and concerts. These events have an interesting profile: to most of our systems they might look like a Denial of Service (DoS) attack or an anomaly, but they are in fact legitimate users trying to use our platform.
This categorization helps us track important events and set up the requirements for engineering readiness. While the goal remains the same - minimize any issues that can impact customers, merchants, or courier partners - the nature of the actions may vary depending on the traffic pattern and duration of the event.
Ensuring production readiness for all our services
To help our engineering teams prepare for scalability events, we’ve built a set of engineering guidelines for the production readiness of our services. Each service is reviewed in the context of these guidelines and the outcome is a list of actions that we should complete before the event to mark the given service ready.
1. We use a tiering model to assess the criticality of services 🔎
At Wolt, we use a tiering model to categorize services based on their criticality in supporting the critical flow, for example, order creation and delivery. The tier levels range from 0 to 3, where a lower tier means higher criticality. This tier information is used to outline the reliability requirements for the service, for example, Mean Time to X (MTTx) expectations, on-call coverage, observability, and disaster recovery. With an ever-increasing number of services powering the Wolt platform, tier information also helps us prioritize the actions required to prepare for any given scalability event.
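To make the idea concrete, here’s a minimal sketch of what such a tier-to-requirements mapping could look like. The tier boundaries, MTTR targets, and fields are hypothetical; our actual model covers more dimensions than this.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical reliability requirements per tier; the real model covers more
# dimensions (observability, disaster recovery, escalation paths, etc.).
@dataclass(frozen=True)
class TierRequirements:
    mttr_target: timedelta         # expected mean time to recovery
    on_call_coverage: str          # e.g. "24/7" or "business hours"
    load_test_before_events: bool  # must be load tested before scalability events

TIER_REQUIREMENTS = {
    0: TierRequirements(timedelta(minutes=15), "24/7", True),         # most critical
    1: TierRequirements(timedelta(hours=1), "24/7", True),
    2: TierRequirements(timedelta(hours=4), "business hours", False),
    3: TierRequirements(timedelta(days=1), "business hours", False),  # least critical
}

def requirements_for(tier: int) -> TierRequirements:
    """Look up the reliability bar a service has to meet for its tier."""
    return TIER_REQUIREMENTS[tier]

# Example: what a tier-0 service on the order creation flow must satisfy.
print(requirements_for(0))
```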
2. Comprehensive observability for the win 🏆
Comprehensive observability - including logging, monitoring, alerting, and escalations - is another key requirement for every service we run. To ensure that robust monitoring and alerting are in place, we conduct workshops for each service, where we review the existing observability and identify the improvements needed for the business case. These reviews are performed for all services with the help of the owner (engineering) teams and subject matter experts (SMEs) to ensure the required observability and escalation paths are in place.
Given our unique model of how we do on-call at Wolt, we rely heavily on documentation and tooling provided by engineering teams to resolve any incidents. These reviews also cover documentation and tooling to ensure that all information is up-to-date and correct.
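As an illustration of the kind of instrumentation these reviews check for, the sketch below exposes two golden signals (request rate and latency) using the open-source prometheus_client library. The metric names, port, and simulated handler are made up for this example and aren’t Wolt’s actual setup.

```python
import random
import time

# pip install prometheus-client
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; a real service would also track errors and saturation.
REQUESTS = Counter("orders_requests_total", "Total order requests", ["status"])
LATENCY = Histogram("orders_request_latency_seconds", "Order request latency")

@LATENCY.time()
def handle_order() -> None:
    """Simulated request handler: records latency and outcome for each call."""
    time.sleep(random.uniform(0.01, 0.1))
    REQUESTS.labels(status="ok").inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics become scrapeable at :8000/metrics
    while True:
        handle_order()
```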
3. Load testing helps us understand service behavior under stress 💪
Load testing is another major activity when preparing for scalability events. As described earlier, the traffic patterns may vary based on the type of event we’re dealing with. For high volume events, we have a good understanding of traffic distribution, courier partner and venue availability, and expected user traffic based on the data from past years. In case of flash crowd events, we collaborate with our relevant country teams to gain insights about the timing of events, expected user activity, merchant and courier availability, and so forth.
We use an in-house tool to load test our services in a production environment. During a load test, we not only track the golden signals for the service but also observe the impact on critical resources and dependencies, and how our scaling strategies perform. Each load test is accompanied by a report outlining the load patterns that were tested, the observations, and the action items, if any.
Load testing has proven to be extremely valuable to understand service behavior under stress. It helps us validate our architectural choices and identify performance bottlenecks. Depending on the criticality of these bottlenecks and their impact on the critical flow, necessary improvements are made. Load testing also provides good insights for capacity planning on any service and helps us choose the optimal scaling strategy for our services to handle any surges in user traffic.
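Our in-house tool isn’t public, but the shape of a load test is similar to what you could write with the open-source Locust library. The user behavior, endpoints, and task weights below are purely illustrative, not Wolt’s real API.

```python
# pip install locust
# Run with: locust -f loadtest.py --host https://api.example.com
from locust import HttpUser, between, task

class ConsumerUser(HttpUser):
    """Simulated customer who mostly browses and occasionally orders."""
    wait_time = between(1, 3)  # think time between actions, in seconds

    @task(5)
    def browse_venues(self):
        # Illustrative endpoint only.
        self.client.get("/v1/venues?city=helsinki")

    @task(1)
    def create_order(self):
        self.client.post("/v1/orders", json={"venue_id": "demo", "items": ["item-1"]})
```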
4. Atomic changes over code freezes
At Wolt, we don’t impose code freezes before scalability events. Instead, our guidelines suggest releasing atomic changes with the ability to roll back any change immediately if it’s not working as expected. We also make heavy use of experiments and feature flags to roll out new features gradually in production so that a change doesn't impact *all* customers at the same time (check out our Wolt Tech Talks Podcast on how we’ve developed our experimentation platform using FastAPI).
Building in enough guardrails, rolling out gradually, having safe fallbacks, and maintaining an effective rollback strategy enable us to continue releasing new changes even during peak traffic hours.
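As a simplified sketch of the gradual-rollout idea (not our actual experimentation platform), a feature flag check can hash a stable user id into a bucket so that only a configured percentage of users hits the new code path, and that percentage can be raised or dropped to zero at any time:

```python
import hashlib

def is_feature_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user so a feature can be rolled out gradually.

    The same user always lands in the same bucket, so raising rollout_percent
    only ever adds users to the feature, and setting it to 0 rolls it back.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Example: expose a hypothetical new checkout flow to 10% of users first.
if is_feature_enabled("new-checkout-flow", "user-1234", rollout_percent=10):
    print("serve the new code path")
else:
    print("serve the existing, known-good fallback")
```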
How we prepare for an upcoming scalability event
For some of the high volume days, such as New Year's Day, we run the preparation activities as a project. The preparations start a few weeks before the actual date. During this time, we review the overall state of reliability over the past several months and identify any action items that should be completed before the event day.
Before the event: Getting our ducks in a row 🦆
Preparation activities start with calculating the estimated increase in traffic across the different verticals: consumers, merchants, and courier partners. We assemble a team of individuals from these verticals within our engineering organization. Each member is responsible for collecting the necessary information from platform and engineering teams, covering recent load testing, observability workshops, business expectations, and metrics. We analyze this information to collect known risks and create action items ahead of the event day. Team members also work together with the relevant engineering teams to complete and document the action items. The process and information are available for everyone across the company to contribute to, and are also reviewed by engineering leadership.
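As a rough illustration of where such an estimate starts (the numbers and formula here are invented for this example), the expected peak load can be projected from last year’s peak and year-over-year growth, plus some headroom:

```python
def projected_peak_rps(last_year_peak_rps: float, yoy_growth: float,
                       safety_margin: float = 1.2) -> float:
    """Project the request rate to prepare for, with headroom on top of growth."""
    return last_year_peak_rps * (1 + yoy_growth) * safety_margin

# Example with made-up numbers: 5,000 RPS at last year's peak, 30% growth.
print(f"Prepare for roughly {projected_peak_rps(5000, 0.30):.0f} requests per second")
```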
During the event: Ready for the Zero Hour 📈
On the day of the event, we engage additional on-call support from key functions like infrastructure, if needed. During the event, we continuously monitor our services and work closely with the operations team to spot early signs of any issues and mitigate them as soon as possible to minimize customer impact.
After the event: Documentation is key 📝
As the saying goes, “if it wasn’t documented, it wasn’t done”. After a scalability event, we share a summary of business and engineering metrics with the engineering organization. We identify the things that went well and the things that could be improved before the next big scalability event, and share postmortems, if any.
We’ve used this process for scalability events since 2021, and it has opened up opportunities for innovation. For instance, we’ve improved our communications, minimized single points of failure, and integrated graceful degradation mechanisms into our systems. We originally implemented these improvements to benefit our day-to-day incident management and improve our emergency readiness in general. Going forward, our goal is to make this a continuous exercise instead of a few-times-a-year activity. The reason is the scale at which Wolt operates today: with over 36 million users placing orders in 27 countries, any day can be a scalability event, so we aim to always be prepared for the next one.