
What we learned from production incidents

The talk was accepted to the conference program

Ruslan Shakhaev

Yandex Delivery


About four years ago, when we started developing Yandex Delivery, we applied all the main patterns for building stable and reliable applications:

- canary release
- retries and timeouts
- rate limiting
- circuit breaker
- feature toggling

Even if one of our datacenters is unavailable, our users will not notice anything. We can enable, disable, and configure our features in production in real time, and much more.

But all this was not enough to keep the system from occasionally experiencing downtime.

I'll tell you about the non-obvious problems we encountered and the lessons we learned from various production incidents.

Main sections:
- architectural solutions that lead to problems (inter-service interaction, entity processing, etc.)
- problems when developing an external API
- specifics of working with mobile clients
- problems with PostgreSQL and what we did wrong