Machine learning in the audio domain: when a neural network is overkill, or the limits of lightweight models

Roman Smirnov


15 December, 13:30, «03 Hall. Queen Erato»


Machine learning engineers and data scientists typically reach for neural networks when the task involves media data: text, images, or sound/voice. There are many powerful pretrained architectures for voice processing, e.g. Wav2Vec2 or Whisper. However, such models are huge: they require expensive computational resources or take too long to process data. I am going to describe several audio processing tasks, from classification and regression on audio sequences to diarization and speech recognition, with a focus on the first two: experiments with both poor and rich datasets, solving these tasks using a lightweight gradient boosting on decision trees model and a pretrained Wav2Vec2 neural network (the current SotA in many voice processing tasks). My main goal is to discuss where the limits of gradient boosting algorithms lie in the audio domain.
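As a minimal sketch of the lightweight approach the talk contrasts with Wav2Vec2: extract compact hand-crafted spectral features per clip and feed them to a gradient boosting classifier. This is an illustrative toy on synthetic tones, not the speaker's actual pipeline; the feature scheme (coarse log-energy bands) and all parameters are assumptions.

```python
# Sketch: classifying short audio clips with gradient boosting on
# hand-crafted spectral features instead of a large neural network.
# Synthetic data and feature choices are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def spectral_features(signal):
    """Compact per-clip features: log mean magnitude in 8 coarse bands."""
    spectrum = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spectrum, 8)
    return np.log1p([band.mean() for band in bands])

rng = np.random.default_rng(0)

def make_clip(freq, sr=16000, dur=0.5):
    """Synthetic stand-in clip: a noisy sine tone at the given frequency."""
    t = np.arange(int(sr * dur)) / sr
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.standard_normal(t.size)

# Two toy classes: low-pitch vs high-pitch tones.
X = np.array([spectral_features(make_clip(f)) for f in [200] * 50 + [2000] * 50])
y = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```

On such trivially separable data the boosted trees do fine with a handful of features; the talk's question is how far this kind of pipeline stretches before a pretrained network becomes necessary.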

The talk was accepted to the conference program