
Night porter's notes on how to provide a non-cloud-native service as a service

The talk was accepted to the conference program

Dmitrii Nekrylov

Yandex 360

Abstract

We will discuss concrete examples from Yandex 360:

* In Yandex Telemost, when we run broadcasts or handle incoming calls from meeting rooms, we need to allocate heavy VMs as resources. They require warmup, authorization, and healthchecks, because at scale any of these VMs can break or go rogue at any time. And only one such instance may serve a given stream at any moment.

We need to maintain up to 99.99% availability of these services. We have specific rules for how availability is calculated, and we can formally minimize downtime from planned updates with a wisely chosen rollout strategy. We have historical data at our disposal to test theories, and we have been using it.
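To put the 99.99% target in perspective, here is a back-of-the-envelope downtime budget calculation. The helper below is purely illustrative and is not the calculation rule mentioned in the abstract:

```python
# Illustrative only: minutes of allowed downtime per period at a given
# availability target (the actual Yandex 360 rules are not public here).

def downtime_budget_minutes(availability: float, period_days: float = 365.0) -> float:
    """Minutes of allowed downtime per period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)

print(round(downtime_budget_minutes(0.9999), 1))      # ~52.6 min per year
print(round(downtime_budget_minutes(0.9999, 30), 1))  # ~4.3 min per month
```

At four nines, a single planned update that takes the service down for a few minutes can consume the whole monthly budget, which is why the update strategy matters so much.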

* Sometimes these services are so imbalanced in their CPU/RAM ratio that we need to host multiple user sessions within one container; otherwise RAM consumption and PaaS overhead would be enormous. In this case we need to orchestrate a two-layered service with all of the requirements above.
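To make the packing argument concrete, here is a hedged capacity sketch. All numbers are invented for illustration and are not Telemost figures:

```python
# Invented numbers: each container carries a fixed RAM overhead (runtime,
# agents, sidecars), so packing several light sessions into one container
# amortizes that overhead instead of paying it once per session.

CONTAINER_RAM_MB = 4096
RUNTIME_OVERHEAD_MB = 1024   # fixed per-container overhead (hypothetical)
SESSION_RAM_MB = 256         # per-session footprint (hypothetical)

def sessions_per_container() -> int:
    """How many sessions fit after the fixed overhead is paid once."""
    return (CONTAINER_RAM_MB - RUNTIME_OVERHEAD_MB) // SESSION_RAM_MB

def containers_needed(sessions: int, per_container: int) -> int:
    """Ceiling division: containers required for a given session count."""
    return -(-sessions // per_container)

packed = sessions_per_container()
print(packed, containers_needed(1000, packed))  # 12 84
# One session per container would instead need 1000 containers, each paying
# the full overhead -- roughly an order of magnitude more RAM and pods.
```

The trade-off is that the orchestrator now has to manage two layers: the containers themselves, and the sessions living inside each container.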

* Yandex Telemost is based on Jitsi. It holds multi-user sessions in memory across several distinct components, and registers in several discovery systems to organize calls. Special care must be taken to prevent a conference from unintentionally splitting into several independent rooms, or to prevent a rogue pod from intercepting one of the traffic channels, which would make it impossible for users to join a particular conference at all.

Based on these examples, we are going to discuss:
* the problem of stateless single-pod services, and our approach to their management and maintenance;
* how we calculate and minimize downtime of these stateful in-memory components and enable more frequent releases;
* why there should be only one service discovery;
* how split-brain-like situations emerge when you scale a component that provides multi-user sessions but was not built for world-wide scaling, and what we did to address the issue.
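To illustrate why "only one service discovery" matters: the sketch below is hypothetical (none of these names come from Telemost or Jitsi) and shows how two discovery systems with slightly different views of the pod set can route participants of the same conference to different pods, silently splitting it into two rooms:

```python
import hashlib

def route(conference_id: str, pod_list: list[str]) -> str:
    """Deterministically map a conference to one pod from a discovery view."""
    digest = hashlib.sha256(conference_id.encode()).digest()
    return pod_list[int.from_bytes(digest[:4], "big") % len(pod_list)]

# Two discovery systems with diverging views of the same cluster:
view_1 = ["pod-a", "pod-b", "pod-c"]
view_2 = ["pod-b", "pod-c"]   # pod-a missed a healthcheck in system 2

conf = "team-standup"
print(route(conf, view_1), route(conf, view_2))
# Whenever the two results differ, participants of one conference land on
# two different pods; each pod hosts its own in-memory room, so the
# conference is split without any component reporting an error.
```

With a single authoritative discovery, every participant resolves the conference through the same view, so this failure mode cannot arise from view divergence alone.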