  • Architectures and scalability (5)


    Vladislav Shpilevoy

    VirtualMinds

    First Aid Kit for C/C++ server performance

    Enhancing server performance typically entails one or more of the following objectives: reducing latency, increasing the number of requests per second (RPS), and minimizing CPU and memory usage. These goals can be pursued through architectural changes, such as eliminating network hops, distributing data across multiple servers, upgrading to more powerful hardware, and so forth. This talk is not about that.

    I categorize the primary sources of code performance degradation into three groups:

    - Thread contention: for instance, overly hot mutexes, overly strict memory ordering in lock-free operations, and false sharing (a toy illustration follows this list).
    - Heap utilization: performance is often lost to frequent allocation and deallocation of large objects, and to not having intrusive containers at hand.
    - Network IO: socket reads and writes are expensive because they are system calls. They can also block a thread for a long time, which invites hacks like adding tens or hundreds more threads; such measures intensify contention and CPU and memory usage while neglecting the underlying issue.
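
    The recipes themselves are in C/C++, but the first bullet is easy to illustrate in any language. Here is a toy Python sketch of one contention fix: sharding hot shared state per thread instead of guarding it with a single lock. All names and numbers are illustrative, not from the talk.

      import threading

      THREADS, ITERS = 8, 100_000

      # Anti-pattern: one hot lock that every thread fights over.
      hot_lock = threading.Lock()
      hot_counter = 0

      def hot_worker():
          global hot_counter
          for _ in range(ITERS):
              with hot_lock:
                  hot_counter += 1

      # Mitigation: give each thread its own shard and merge at the end,
      # so threads almost never touch shared state.
      shards = [0] * THREADS

      def sharded_worker(idx):
          local = 0
          for _ in range(ITERS):
              local += 1
          shards[idx] = local

      def run(target, args_list):
          threads = [threading.Thread(target=target, args=a) for a in args_list]
          for t in threads: t.start()
          for t in threads: t.join()

      run(hot_worker, [()] * THREADS)
      run(sharded_worker, [(i,) for i in range(THREADS)])
      assert hot_counter == sum(shards) == THREADS * ITERS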

    I present a series of concise, straightforward low-level recipes for gaining performance via code optimizations. While often requiring just a handful of changes, these proposals can improve performance N-fold.

    The suggestions target the bottlenecks above, which are caused by certain typical mistakes. The proposed optimizations might render architectural changes unnecessary, or even allow simplifying the setup once the existing servers start coping with the load effortlessly. As a side effect, the changes can make the code cleaner and reveal further bottlenecks to investigate.

    The talk was accepted to the conference program


    Aleksey Uchakin

    EdgeCenter

    The CDN journey: There and Back Again

    In general, a CDN is a very simple thing: you just need a bunch of servers, several fat links to the best ISPs, and nginx. Is that enough?

    And how do you choose the best option for your project from several candidates?

    Abstract
    * what issues you can fix with a CDN;
    * questions you have to ask before onboarding;
    * black magic of routing: what is the real nearest node (a toy probe is sketched after this list);
    * how to rule the world with BGP and DNS.
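
    To make the "real nearest node" point concrete before the talk does: the nearest node is the one with the lowest latency, not the smallest geographic distance. A minimal sketch that estimates RTT as TCP connect time; the edge hostnames are hypothetical placeholders.

      import socket
      import time

      # Hypothetical CDN edge nodes; replace with real candidates.
      CANDIDATES = ["edge-fra.example.net", "edge-ams.example.net", "edge-waw.example.net"]

      def connect_rtt(host, port=443, timeout=2.0):
          """Rough RTT estimate: time to complete a TCP handshake."""
          start = time.perf_counter()
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return time.perf_counter() - start
          except OSError:
              return float("inf")

      best = min(CANDIDATES, key=connect_rtt)
      print("lowest-latency node:", best)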

    The talk was accepted to the conference program


    Ivan Potapov

    Qrator Labs, Product Manager

    One of the ways to expand the network for heavy loads

    You need to scale the network to handle increasing traffic and ensure the required service quality.

    There are two main methods for balancing surging user traffic: GeoDNS and BGP Anycast (the talk gives a brief description of both technologies).
    We'll examine how large companies tackle this task.
    Then we'll move on to how you can solve these problems yourself, using open tools and route information data.

    Following this, we'll discuss an approach for expanding the BGP Anycast network.

    For a global network expansion, two key questions arise:
    - Where (in which country) should a new node be installed?
    - Which local provider should it connect to for optimal service quality?

    To answer the first question, we'll utilize the RIPE Atlas public toolkit to create an RTT map, highlighting regions with maximum network delays. This reveals the weak points in our current network.
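
    A minimal sketch of that aggregation step, assuming ping results have already been downloaded from the RIPE Atlas API into a local JSON file and a probe-to-country mapping has been built from probe metadata. The field names follow the Atlas ping result format, but treat the details as assumptions.

      import json
      from collections import defaultdict
      from statistics import median

      # Assumed inputs: Atlas ping results (with "avg" RTT in ms and "prb_id")
      # and a probe-id -> country mapping, e.g. {"12345": "DE", ...}.
      with open("atlas_ping_results.json") as f:
          results = json.load(f)
      with open("probe_countries.json") as f:
          probe_country = json.load(f)

      rtts = defaultdict(list)
      for r in results:
          avg = r.get("avg", -1)
          cc = probe_country.get(str(r.get("prb_id")))
          if avg and avg > 0 and cc:
              rtts[cc].append(avg)

      # Regions with the worst median RTT are candidates for a new node.
      worst = sorted(rtts, key=lambda cc: median(rtts[cc]), reverse=True)
      for cc in worst[:10]:
          print(cc, round(median(rtts[cc]), 1), "ms")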

    To answer the second question, we'll describe a method based on analysing route information, which can be used to select the best providers in the region.

    Simplistically, the method can be described as follows (a toy version is sketched after the list):
    1. We gather route information from a public BGP collector (e.g., RIPE RIS, Route Views, PCH).
    2. Using this data and various metrics, we identify the major players in the market, the regional leaders (a brief overview of these metrics and the algorithms behind them is provided).
    3. We create a rating of the most promising providers for connection and select candidates from this rating.
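
    A toy version of steps 1-3, assuming AS paths have already been extracted from collector dumps into plain strings. The single "reach" metric here is a crude stand-in for the richer metrics the talk covers (customer cone, peering breadth, and so on).

      from collections import defaultdict

      # Assumed input: AS paths for prefixes originated in the target region.
      as_paths = [
          "3356 1299 64500",
          "1299 64501",
          "3356 174 64502",
      ]

      # Crude metric: how many distinct regional origins each transit AS
      # appears in front of.
      reach = defaultdict(set)
      for path in as_paths:
          hops = path.split()
          origin, transits = hops[-1], set(hops[:-1])
          for asn in transits:
              reach[asn].add(origin)

      ranking = sorted(reach.items(), key=lambda kv: len(kv[1]), reverse=True)
      for asn, origins in ranking[:5]:
          print(f"AS{asn} reaches {len(origins)} regional origins")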

    This systematic approach swiftly identifies optimal locations for new node installation, enabling effective network development.

    The talk was accepted to the conference program


    Alexander Gilevich

    EPAM

    Let’s talk Architecture: Limits of Configuration-driven Ingestion Pipelines

    Need to continuously ingest data from numerous disparate, non-overlapping data sources and then merge it all into one huge knowledge graph to deliver insights to your end users?

    Pretty cool, huh? And what about multi-tenancy, mirroring access policies and data provenance? Perhaps, incremental loading of data? Or monitoring the current state of ingestion in a highly-decoupled distributed microservices-based environment?

    In my talk I will tell you our story: it all started with a simple idea of building connectors, and we ended up building fully configurable, massively scalable data ingestion pipelines that deliver disparate data pieces to a single data lake for later decomposition and digestion in a multi-tenant environment. All this while allowing customers and business analysts to create and configure their own ingestion pipelines in a friendly way with a bespoke pipeline designer, each pipeline building block being a separate decoupled microservice (think Airflow, AWS Step Functions, Azure Data Factory and Azure Logic Apps). Furthermore, we'll touch on such aspects as choreography vs. orchestration, incremental loading strategies, ingestion of access control policies (ABAC, RBAC, ACLs), parallel data processing, and how frameworks can help implement cross-cutting concerns, and we'll even briefly talk about the benefits of knowledge graphs.
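
    A minimal sketch of the configuration-driven idea: pipeline blocks are selected and parameterized from data rather than code. In the talk's architecture each block is a separate microservice; here they are plain functions, and all step names are hypothetical.

      # Each block is a named step with its own config, so analysts compose
      # pipelines from configuration, not code.
      def fetch(cfg, payload):
          return {"records": [f"row-{i}" for i in range(cfg["limit"])]}

      def transform(cfg, payload):
          return {"records": [r.upper() for r in payload["records"]]}

      def load(cfg, payload):
          print(f"loading {len(payload['records'])} records into {cfg['target']}")
          return payload

      STEPS = {"fetch": fetch, "transform": transform, "load": load}

      pipeline_config = [
          {"type": "fetch", "limit": 3},
          {"type": "transform"},
          {"type": "load", "target": "data-lake"},
      ]

      payload = {}
      for step_cfg in pipeline_config:
          payload = STEPS[step_cfg["type"]](step_cfg, payload)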

    The talk was accepted to the conference program


    Ruslan Shakhaev

    Yandex Delivery

    What we learned from production incidents

    About 4 years ago, when we started developing Yandex Delivery, we used all the main patterns for building stable and reliable applications:

    - canary release
    - retries and timeouts
    - rate limiting
    - circuit breaker
    - feature toggling

    Even if one of our datacenters is unavailable, our users will not notice anything. We can enable/disable and configure our features in production in real time, and much more.
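
    As an illustration of one of the listed patterns, here is a toy circuit breaker; it is a generic sketch, not the implementation used in Yandex Delivery.

      import time

      class CircuitBreaker:
          """Toy circuit breaker: after `threshold` consecutive failures the
          circuit opens and calls fail fast until `cooldown` seconds pass."""

          def __init__(self, threshold=5, cooldown=30.0):
              self.threshold = threshold
              self.cooldown = cooldown
              self.failures = 0
              self.opened_at = None

          def call(self, fn, *args, **kwargs):
              if self.opened_at is not None:
                  if time.monotonic() - self.opened_at < self.cooldown:
                      raise RuntimeError("circuit open, failing fast")
                  self.opened_at = None  # half-open: let one call through
              try:
                  result = fn(*args, **kwargs)
              except Exception:
                  self.failures += 1
                  if self.failures >= self.threshold:
                      self.opened_at = time.monotonic()
                  raise
              self.failures = 0
              return result

    Wrapped around an RPC to a dependency, this keeps a dying downstream service from tying up every worker thread with timeouts.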

    But all this was not enough to keep the system from occasionally experiencing downtime.

    I'll tell you about the non-obvious problems we encountered and the lessons we learned from various production incidents.

    Main sections:
    - architectural solutions that lead to problems (inter-service interaction, entity processing, etc.)
    - problems when developing an external API
    - specifics of working with mobile clients
    - problems with PostgreSQL and what we did wrong

    The talk was accepted to the conference program

  • Databases and storage systems (2)


    Alexander Zaitsev

    Altinity

    Object Storage in ClickHouse

    ClickHouse is an ultra-fast analytic database. Object Storage is cheap. Can they work together? Let's learn!

    ClickHouse is an ultra-fast database originally designed for local storage. Since 2020, a lot of effort has gone into making it efficient with object storage such as S3, which is essential for big clusters operated in clouds. In this talk I will explain the ClickHouse storage model and how object storage support is implemented. Finally, we will look at performance results and discuss further improvements.
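
    As a taste of the topic, here is a sketch that reads S3 data through ClickHouse's s3() table function over the standard HTTP interface on port 8123; the bucket URL and data format are placeholders.

      import urllib.request

      # Query ClickHouse over its HTTP interface, reading data directly
      # from object storage via the s3() table function.
      query = """
      SELECT count()
      FROM s3('https://my-bucket.s3.amazonaws.com/events/*.parquet', 'Parquet')
      """

      req = urllib.request.Request(
          "http://localhost:8123/", data=query.encode("utf-8")
      )
      with urllib.request.urlopen(req) as resp:
          print(resp.read().decode())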

    The talk was accepted to the conference program


    Andrew Aksyonoff

    Avito && Sphinx

    All BSONs suck

    I will dissect the internal binary JSON representations of several DBs (Mongo, Postgres, YDB, my own Sphinx, maybe more), and rant about how they are so not great for querying.

    My rant will also include a partial benchmark (of course), and a limited way out for the databases: a few techniques I have tried and will be implementing in Sphinx, so that our BSON sucks on par or less. Spoiler alert: BSONs suck and nothing works really well for them, everything you thought is a lie (including the cake), hash tables suck, binary searches suck (even the clever ones and of course the naive ones), AVX2 sucks, maybe AVX512 sucks too (maybe I'll have the time to try that). As for the database users? Weeell, at least you will know how much your specific database sucks, why so, and what the competition can offer.
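
    To feel the querying cost yourself before the talk's much lower-level benchmark: even a "binary" document has to be decoded to read a single field. A toy sketch, assuming the bson module that ships with pymongo:

      import json
      import time

      import bson  # assumption: the bson module bundled with pymongo

      doc = {f"field_{i}": i for i in range(100)}
      blob = bson.encode(doc)
      text = json.dumps(doc)

      N = 10_000
      # Point read from BSON: the whole document is decoded anyway.
      start = time.perf_counter()
      for _ in range(N):
          bson.decode(blob)["field_99"]
      print("bson decode:", time.perf_counter() - start)

      # Same point read from plain-text JSON, for scale.
      start = time.perf_counter()
      for _ in range(N):
          json.loads(text)["field_99"]
      print("json parse :", time.perf_counter() - start)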

    The talk was accepted to the conference program

  • BigData and Machine Learning (1)


    Dmitrii Khodakov

    Avito

    How we built personal recommendations at the world's largest classifieds site

    Context and task: a feed of personal recommendations on the main page. How do you launch recommendations in production with 150 million items and 100 million users? I will share my experience and the pitfalls we hit.

    - A quick overview of the model arsenal: the classic ML approach.
    - A quick overview of metrics, starting with product metrics.
    - The basis of everything: fast experiments and analytics on actual data.
    - Where to start? Classical matrix factorization and its launch pattern (a toy version is sketched after this list), and the problems we encountered at this stage.
    - A little more advanced: switching to real-time user features and history; an alternative approach with simpler models.
    - Advanced models: adding neural networks, because the strength is in diversity.
    - Mixing models: a great blender.
    - How does it work in production? We replaced Go with Python; what happened to time to market?
    - And again about the experiment cycle, with a word on product metrics.
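
    A toy version of the classical matrix factorization step from the outline, trained with plain SGD on synthetic interactions. At Avito's scale this would be a distributed job; everything here is illustrative.

      import numpy as np

      rng = np.random.default_rng(0)
      n_users, n_items, k = 100, 50, 8
      # Sparse toy interactions: (user, item, strength).
      events = [(rng.integers(n_users), rng.integers(n_items), 1.0)
                for _ in range(1000)]

      P = rng.normal(0, 0.1, (n_users, k))  # user factors
      Q = rng.normal(0, 0.1, (n_items, k))  # item factors
      lr, reg = 0.05, 0.01

      for epoch in range(10):
          for u, i, r in events:
              pu, qi = P[u].copy(), Q[i].copy()
              err = r - pu @ qi
              P[u] += lr * (err * qi - reg * pu)
              Q[i] += lr * (err * pu - reg * qi)

      # Recommend: top-scoring items for a user.
      scores = P[0] @ Q.T
      print("top items for user 0:", np.argsort(-scores)[:5])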

    The talk was accepted to the conference program

  • DevOps and Maintenance (2)


    Dmitry Tsepelev

    UULA

    Backend monitoring from scratch

    Almost everyone has monitoring. In an ideal world it is a reliable tool that detects symptoms before they become serious problems. Often an APM on a free plan with out-of-the-box reports is used as the monitoring tool. As a result, something is measured, some alerts are sent to a chat, no one responds to them, and one day a major incident happens.

    In the talk we will:

    - define monitoring antipatterns;

    - pick the most critical metrics and ways to see insights in charts;

    - represent the system in the terminology of queueing theory (a toy calculation follows this list);

    - figure out how to choose lower-level metrics and how to use them to find problems;

    - discuss why alerts are helpful, and when they are not needed.
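
    As a taste of the queueing-theory view, here is a toy M/M/1 calculation relating arrival rate, service rate, latency and in-flight requests via Little's law; the numbers are made up.

      # M/M/1 approximation of a backend: arrival rate lam vs service rate mu.
      lam = 80.0   # observed requests per second
      mu = 100.0   # max throughput of one worker

      rho = lam / mu                # utilization
      wait = 1.0 / (mu - lam)       # mean time in system (seconds)
      queue_len = lam * wait        # Little's law: L = lambda * W

      print(f"utilization {rho:.0%}, mean latency {wait*1000:.0f} ms, "
            f"in-flight requests {queue_len:.1f}")

    Note how latency explodes as lam approaches mu: at 80% utilization the mean time in system is already five times the bare service time.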

    The talk was accepted to the conference program


    Oleg Voznesensky

    Gazprombank

    Demystifying GitOps. How to upgrade your CIOps to GitOps in a minimalistic way

    The purpose of this talk is to help DevOps engineers understand the GitOps pattern and decide whether to adopt GitOps. I will also discuss the most frequent problems and ways to solve them.
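
    For readers new to the pattern, a minimal sketch of the GitOps idea: an agent pulls the desired state from git and continuously reconciles the cluster against it, instead of CI pushing changes (CIOps). The repo URL and manifest path are placeholders; real agents like Argo CD or Flux add drift detection and health checks.

      import subprocess
      import time

      REPO = "https://example.com/infra.git"   # placeholder
      CLONE = "/tmp/infra"

      def sync_once():
          # Pull the desired state from git...
          subprocess.run(["git", "-C", CLONE, "pull", "--ff-only"], check=True)
          # ...and reconcile the cluster against it.
          subprocess.run(["kubectl", "apply", "-k", f"{CLONE}/manifests"], check=True)

      subprocess.run(["git", "clone", REPO, CLONE], check=True)
      while True:
          sync_once()
          time.sleep(60)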

    The talk was accepted to the conference program