Vladislav Shpilevoy
VirtualMinds
Enhancing server performance typically means achieving one or more of the following objectives: reducing latency, increasing the number of requests per second (RPS), and minimizing CPU and memory usage. These goals can be pursued through architectural changes, such as eliminating some network hops, distributing data across multiple servers, upgrading to more powerful hardware, and so forth. This talk is not about that.
I categorize the primary sources of code performance degradation into three groups:
[*] Thread contention. For instance, overly contended mutexes, unnecessarily strict memory ordering in lock-free operations, and false sharing.
[*] Heap utilization. Loss is often caused by frequent allocation and deallocation of large objects, and by not having intrusive containers at hand.
[*] Network IO. Socket reads and writes are expensive because they are system calls. They can also block the thread for a long time, which leads to workarounds like adding tens or hundreds more threads. Such measures intensify contention as well as CPU and memory usage while neglecting the underlying issue.
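The network IO point can be illustrated with a minimal sketch (my own example, not from the talk): instead of dedicating a blocking thread to each connection, a single thread multiplexes sockets with readiness notifications, so reads never block.

```python
import selectors
import socket

# One selector serves many connections on a single thread,
# avoiding the "add a hundred more threads" workaround.
sel = selectors.DefaultSelector()

a, b = socket.socketpair()  # stands in for real client connections
for s in (a, b):
    s.setblocking(False)
    sel.register(s, selectors.EVENT_READ)

b.sendall(b"ping")  # pretend a peer sent us data

received = []
for key, _ in sel.select(timeout=1):
    # The selector reported readiness, so recv() returns immediately.
    received.append(key.fileobj.recv(4096))

print(received)  # [b'ping']
```

The same pattern underlies epoll/kqueue-based event loops; the point is that the thread only issues a system call when it is guaranteed not to block.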
I present a series of concise, straightforward low-level recipes for gaining performance through code optimizations. While often requiring just a handful of changes, these proposals can improve performance severalfold.
The suggestions target the bottlenecks above, which are caused by certain typical mistakes. The proposed optimizations might render architectural changes unnecessary, or even allow simplifying the setup if the existing servers start coping with the load effortlessly. As a side effect, the changes can make the code cleaner and reveal further bottlenecks to investigate.
The talk was accepted to the conference program
Aleksey Uchakin
EdgeCenter
In general, a CDN is a very simple thing: you just need a bunch of servers, several fat links to the best ISPs, and nginx. Is that enough?
And how do you choose the best option for your project from several candidates?
Abstract
* what issues you can fix with CDN;
* questions you have to ask before onboarding;
* black magic of routing: which node is really the nearest;
* how to rule the world with BGP and DNS.
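To illustrate the "real nearest node" point above (a hypothetical example of mine, not from the talk): the geographically closest node is not necessarily the latency-closest one, so node selection should be driven by measurements rather than by the map.

```python
# Hypothetical RTT measurements (ms) from one client to CDN nodes.
# Routing detours and congested peerings can make a farther node
# respond faster than the geographically nearest one.
rtt_ms = {
    "frankfurt": 48.0,   # closest on the map
    "amsterdam": 31.5,   # better peering with the client's ISP
    "warsaw": 55.2,
}

def nearest_node(measurements):
    """Pick the node with the lowest measured RTT."""
    return min(measurements, key=measurements.get)

print(nearest_node(rtt_ms))  # amsterdam
```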
The talk was accepted to the conference program
Ivan Potapov
Qrator Labs, Product Manager
You need to scale the network to handle increasing traffic and ensure the required service quality.
There are two methods for balancing surging user traffic: GeoDNS and BGP Anycast (with a brief description of these technologies).
We'll examine how large companies tackle this task.
Let's move on to how you can solve these problems yourself, using open utilities and route information data.
Following this, we'll discuss an approach for expanding the BGP Anycast network.
For a global network expansion, two key questions arise:
- Where (in which country) should a new node be installed?
- Which local provider should it connect to for optimal service quality?
To answer the first question, we'll utilize the RIPE Atlas public toolkit to create an RTT map, highlighting regions with maximum network delays. This reveals the weak points in our current network.
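The RTT map step can be sketched roughly like this (toy data standing in for real RIPE Atlas measurement results; region names and values are illustrative):

```python
from collections import defaultdict
from statistics import median

# Toy measurement results; in practice these would come from
# pings issued by RIPE Atlas probes around the world.
measurements = [
    {"region": "South America", "rtt_ms": 180.0},
    {"region": "South America", "rtt_ms": 210.0},
    {"region": "Western Europe", "rtt_ms": 25.0},
    {"region": "Western Europe", "rtt_ms": 31.0},
    {"region": "Southeast Asia", "rtt_ms": 240.0},
    {"region": "Southeast Asia", "rtt_ms": 260.0},
]

by_region = defaultdict(list)
for m in measurements:
    by_region[m["region"]].append(m["rtt_ms"])

# Median RTT per region; the maximum marks a weak point of the network
# and hence a candidate location for a new node.
rtt_map = {region: median(v) for region, v in by_region.items()}
worst = max(rtt_map, key=rtt_map.get)
print(worst, rtt_map[worst])  # Southeast Asia 250.0
```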
To answer the second question, we'll describe a method based on analysing route information, which can be used to select the best providers in the region.
In simplified form, the method works as follows:
1. We gather route information from a BGP collector (e.g., RIPE, Route Views, PCH).
2. Using this data, we identify the major players in the market, the regional leaders, employing various metrics (a brief overview of these metrics and the algorithms behind them is provided).
3. We build a rating of the most promising providers to connect to and select candidates from it.
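The steps above can be sketched with toy data (the AS paths and metric are my own illustration, not the talk's exact algorithm); one simple metric is how often an AS appears as the direct upstream of destination networks in the region:

```python
from collections import Counter

# Toy AS paths toward prefixes in the target region, as a BGP
# collector would export them. The next-to-last AS in a path is
# the destination network's direct upstream.
as_paths = [
    [3356, 1299, 64512],
    [174, 1299, 64513],
    [3356, 6939, 64514],
    [1299, 64515],
    [2914, 1299, 64516],
]

# Count how often each AS directly serves a regional destination.
upstreams = Counter(path[-2] for path in as_paths if len(path) >= 2)

# Rank candidates: providers serving the most regional networks first.
rating = upstreams.most_common()
print(rating)  # [(1299, 4), (6939, 1)]
```

Real rankings combine several such metrics (customer cone size, path lengths, peering density), but the shape of the computation is the same.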
This systematic approach swiftly identifies optimal locations for new node installation, enabling effective network development.
The talk was accepted to the conference program
Alexander Gilevich
EPAM
Need to continuously ingest data from numerous disparate and non-overlapping data sources and then merge them together into one huge knowledge graph to deliver insights to your end users?
Pretty cool, huh? And what about multi-tenancy, mirroring access policies, and data provenance? Perhaps incremental loading of data? Or monitoring the current state of ingestion in a highly decoupled, distributed, microservices-based environment?
In my talk I will tell you our story: it all started with the simple idea of building connectors, and we ended up building fully configurable and massively scalable data ingestion pipelines that deliver disparate data pieces to a single data lake for later decomposition and digestion in a multi-tenant environment. All this while allowing customers and business analysts to create and configure their own ingestion pipelines in a friendly way with a bespoke pipeline designer, with each pipeline building block being a separate decoupled microservice (think Airflow, AWS Step Functions, Azure Data Factory, and Azure Logic Apps). Furthermore, we'll touch on aspects such as choreography vs. orchestration, incremental loading strategies, ingestion of access control policies (ABAC, RBAC, ACLs), parallel data processing, and how frameworks can help with cross-cutting concerns, and we'll even briefly talk about the benefits of knowledge graphs.
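As a rough illustration of the orchestration idea (a toy sketch of mine; in the real system each building block is a separate microservice): a pipeline is a configurable sequence of steps applied to a payload, and the designer just edits that sequence.

```python
# Each "building block" is modeled as a plain function here; in a
# real deployment each would be a decoupled service behind a queue.
def fetch(doc):
    doc["raw"] = "a,b,c"          # pretend we pulled data from a source
    return doc

def parse(doc):
    doc["items"] = doc["raw"].split(",")
    return doc

def load(doc):
    doc["loaded"] = len(doc["items"])  # pretend we wrote to the lake
    return doc

def run_pipeline(steps, doc):
    """Central orchestrator: invokes each configured step in order."""
    for step in steps:
        doc = step(doc)
    return doc

result = run_pipeline([fetch, parse, load], {})
print(result["loaded"])  # 3
```

In a choreography-based variant there is no central `run_pipeline`; each step would instead publish an event that triggers the next one.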
The talk was accepted to the conference program
Ruslan Shakhaev
Yandex Delivery
About 4 years ago, when we started developing Yandex Delivery, we used all the main patterns for building stable and reliable applications:
- canary release
- retries and timeouts
- rate limiting
- circuit breaker
- feature toggling
Even if one of our datacenters is unavailable, our users will not notice anything. We can enable/disable and configure our features in production in real time, and much more.
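One of the patterns listed above, the circuit breaker, can be sketched as follows (a generic textbook sketch, not Yandex Delivery's actual implementation):

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open,
    calls fail fast instead of hammering the broken dependency."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, func):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # success closes the circuit again
        return result

breaker = CircuitBreaker(threshold=2)
for _ in range(2):
    try:
        breaker.call(lambda: 1 / 0)  # the dependency keeps failing
    except ZeroDivisionError:
        pass
print(breaker.open)  # True: further calls now fail fast
```

Production implementations also add a half-open state that periodically probes whether the dependency has recovered.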
But all this was not enough to keep the system from experiencing occasional downtime.
I'll tell you about the non-obvious problems we encountered and the lessons we learned from various production incidents.
Main sections:
- architectural solutions that lead to problems (inter-service interaction, entity processing, etc.)
- problems when developing an external API
- specifics of working with mobile clients
- problems with PostgreSQL and what we did wrong
The talk was accepted to the conference program
Alexander Zaitsev
Altinity
ClickHouse is an ultra-fast analytic database. Object Storage is cheap. Can they work together? Let's learn!
ClickHouse is an ultra-fast database originally designed for local storage. Since 2020, a lot of effort has gone into making it efficient with object storage such as S3, which is essential for big clusters operated in clouds. In this talk I will explain the ClickHouse storage model and how object storage support is implemented. Finally, we will look at performance results and discuss further improvements.
The talk was accepted to the conference program
Andrew Aksyonoff
Avito && Sphinx
I will dissect the internal binary JSON representations in several DBs (Mongo, Postgres, YDB, my own Sphinx, maybe more) and rant about how they are so not great for querying.
My rant will also include a partial benchmark (of course) and a limited way out for the databases: a few techniques I have tried and will be implementing in Sphinx, so that our BSON sucks on par or less. Spoiler alert: BSONs suck and nothing works really well for them, everything you thought is a lie (including the cake), hash tables suck, binary searches suck (even the clever ones and of course the naive ones), AVX2 sucks, and maybe AVX512 sucks too (maybe I'll have time to try that). As for the database users? Weeell, at least you will know how much your specific database sucks, why, and what the competition can offer.
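One flavor of the "binary searches suck" claim can be illustrated with a toy example of mine (not the talk's benchmark): for the handful of keys a typical small JSON object has, a naive linear scan over the sorted key array returns the same answers as a binary search, and in native code it is often faster because it is branch-predictable and cache-friendly.

```python
from bisect import bisect_left

# A small binary-JSON-like object: sorted key array plus values.
keys = ["id", "name", "price", "tags"]
values = [42, "widget", 9.99, ["a", "b"]]

def get_linear(key):
    # Naive scan: for a handful of keys this is hard to beat.
    for i, k in enumerate(keys):
        if k == key:
            return values[i]
    return None

def get_bisect(key):
    # "Clever" binary search over the sorted key array.
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

# Both lookups agree on hits and misses.
assert all(get_linear(k) == get_bisect(k) for k in keys + ["missing"])
print(get_linear("price"))  # 9.99
```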
The talk was accepted to the conference program
Dmitrii Khodakov
Avito
Context and task setting: a feed of personal recommendations on the main page. How do you launch recommendations in production when you have 150 million items and 100 million users? I will share my experience and tell you about the pitfalls.
A quick overview of the arsenal of models: the classic ML approach.
A quick overview of metrics, starting with product metrics.
The basis of everything: fast experiments and analytics on actual data.
Where to start? Classical matrix factorization and its launch pattern.
What problems we encountered at this stage.
A little more advanced: switching to real-time user features and history. An alternative approach with simpler models.
Advanced models: let's add neural networks; the strength is in diversity.
Mixing models: a great blender.
How does it work in production? We replaced Go with Python; what happened to time to market?
And again about the experiment cycle: I'll tell you about product metrics.
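The classical matrix factorization mentioned in the outline can be sketched in pure Python (a toy SGD factorizer on a tiny ratings matrix; real systems use dedicated libraries at vastly larger scale):

```python
import random

# Tiny user x item rating matrix; 0 means "not rated".
R = [
    [5, 3, 0],
    [4, 0, 1],
    [1, 1, 5],
]
n_users, n_items, rank = 3, 3, 2

random.seed(0)
P = [[random.random() for _ in range(rank)] for _ in range(n_users)]
Q = [[random.random() for _ in range(rank)] for _ in range(n_items)]

def predict(u, i):
    # Predicted rating is the dot product of user and item factors.
    return sum(P[u][f] * Q[i][f] for f in range(rank))

def sse():
    # Squared error over the observed (non-zero) ratings only.
    return sum((R[u][i] - predict(u, i)) ** 2
               for u in range(n_users)
               for i in range(n_items) if R[u][i])

before = sse()
lr = 0.02
for _ in range(500):  # plain SGD over the observed ratings
    for u in range(n_users):
        for i in range(n_items):
            if not R[u][i]:
                continue
            err = R[u][i] - predict(u, i)
            for f in range(rank):
                P[u][f] += lr * err * Q[i][f]
                Q[i][f] += lr * err * P[u][f]
after = sse()

print(after < before)  # True: the factors now fit the observed ratings
# predict(u, i) on unrated cells then serves as a recommendation score.
```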
The talk was accepted to the conference program
Dmitry Tsepelev
UULA
Almost everyone has monitoring. In an ideal world it is a reliable tool that detects symptoms before they become serious problems. Often an APM on a free plan with out-of-the-box reports is used as the monitoring tool. As a result, something is measured, some alerts are sent into the chat, no one responds to them, and one day a major incident happens.
In the talk we will:
- define monitoring antipatterns;
- pick the most critical metrics and ways to see insights in charts;
- represent the system in the terminology of queueing theory;
- figure out how to choose lower-level metrics and how to use them to find problems;
- discuss why alerts are helpful, and when they are not needed.
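The queueing-theory view can be made concrete with Little's law, L = λW (my illustration, not necessarily the talk's example): any two of concurrency, arrival rate, and time in system determine the third, which is a handy sanity check for dashboards.

```python
# Little's law: L = lam * W
# L   - average number of requests in the system (concurrency)
# lam - average arrival rate (requests per second)
# W   - average time a request spends in the system (seconds)

lam = 200.0  # RPS, read off the request-rate chart
w = 0.050    # 50 ms average latency, read off the latency chart

l = lam * w  # expected number of in-flight requests
print(l)     # 10.0: on average ~10 workers are busy

# If the worker pool has only 8 slots, requests must queue, and the
# same law applied to the queue alone predicts the waiting time.
```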
The talk was accepted to the conference program
Oleg Voznesensky
Gazprombank
The purpose of this talk is to help DevOps engineers understand the GitOps pattern and make an informed decision about whether to use GitOps. I will also discuss the most frequent problems and ways to solve them.
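The core of the GitOps pattern can be sketched as a reconciliation loop (a schematic sketch of mine, not a real operator): the desired state lives in Git, and an agent diffs it against the actual cluster state and converges the cluster toward it.

```python
# Desired state as it would be read from the Git repository.
desired = {"web": {"replicas": 3}, "worker": {"replicas": 2}}

# Actual state as reported by the cluster.
actual = {
    "web": {"replicas": 3},
    "worker": {"replicas": 1},
    "old-job": {"replicas": 1},  # no longer present in Git
}

def reconcile(desired, actual):
    """Return the operations that converge actual toward desired."""
    ops = []
    for name, spec in desired.items():
        if actual.get(name) != spec:
            ops.append(("apply", name, spec))     # create or update
    for name in actual:
        if name not in desired:
            ops.append(("delete", name))          # prune drift
    return ops

plan = reconcile(desired, actual)
print(plan)  # [('apply', 'worker', {'replicas': 2}), ('delete', 'old-job')]
```

Running this loop continuously, with Git as the only write path, is what distinguishes GitOps from ad-hoc `kubectl apply` deployments.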
The talk was accepted to the conference program
The largest professional conference for developers of high-load systems
Participation options
Offline
The price is soaring: the closer the conference, the more it costs.
The current price of a ticket is 280000 EUR.