Stable and scalable Triton Inference Server in production

Abstract

Manual content moderation requires a great deal of time and resources, which is why we are introducing ML models to solve this problem.

However, under high load and with strict fault-tolerance requirements, it is important to choose the right solution for serving ML models. For us, that tool turned out to be NVIDIA's Triton Inference Server.

Triton Inference Server is powerful software that can serve several models at once and allocate computing resources efficiently. However, when high fault tolerance and maximum automation are required, the features of vanilla Triton are not enough.
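
As an illustration, here is a minimal sketch of talking to a single Triton server that hosts several models through its HTTP API with the `tritonclient` Python package. The server address, model names, tensor names, shapes, and datatypes are placeholders for the example, not details of our deployment.

```python
# Minimal sketch: one Triton server, several models behind the same endpoint.
# Assumes a server on localhost:8000 hosting hypothetical models such as
# "toxic_text_classifier" with the input/output names used below.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# List everything currently available in the model repository.
for model in client.get_model_repository_index():
    print(model["name"], model.get("state"))

# Run inference against one of the hosted models.
text_input = httpclient.InferInput("TEXT", [1, 128], "INT64")
text_input.set_data_from_numpy(np.zeros((1, 128), dtype=np.int64))

result = client.infer(
    model_name="toxic_text_classifier",
    inputs=[text_input],
    outputs=[httpclient.InferRequestedOutput("SCORES")],
)
print(result.as_numpy("SCORES"))
```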

To meet the requirements that arise when running ML models in production, we developed a number of solutions that improve stability and fault tolerance.
Main topics to be covered:
* Ensuring scalability
* Additional health-monitoring tools (see the sketch after this list)
* Full control and automation of model updates
* The ability to create individual instances for different models, for efficient resource utilization and fault tolerance
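
Below is a minimal sketch of the kind of tooling the monitoring and model-update points involve, again via the `tritonclient` HTTP API: it polls Triton's liveness and readiness checks and reloads a model through the model repository API. It assumes the server was started with `--model-control-mode=explicit`; the model name and address are hypothetical.

```python
# Sketch of health checks plus an automated model update, assuming Triton
# runs with --model-control-mode=explicit on localhost:8000.
import tritonclient.http as httpclient

MODEL_NAME = "toxic_text_classifier"  # hypothetical model name

client = httpclient.InferenceServerClient(url="localhost:8000")

def server_healthy() -> bool:
    # Liveness: the process is up; readiness: it can actually serve requests.
    return client.is_server_live() and client.is_server_ready()

def rollout_new_version() -> None:
    # With explicit model control, (re)loading picks up a new version
    # dropped into the model repository by the deployment pipeline.
    client.load_model(MODEL_NAME)
    if not client.is_model_ready(MODEL_NAME):
        # Unload if the new version fails to become ready.
        client.unload_model(MODEL_NAME)
        raise RuntimeError(f"{MODEL_NAME} failed readiness check after reload")

if server_healthy():
    rollout_new_version()
```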

In effect, we have attempted to build Triton as a Service, making model integration easy and improving the stability of the system as a whole.