Staged Training Strategy and Multi-Activation for Audio Tagging with Noisy and Sparse Multi-Label Data

Abstract

Audio tagging aims to predict whether certain acoustic events occur in an audio clip. Because manually labeled data with high confidence is difficult and costly to obtain, researchers have begun to focus on audio tagging with a small set of manually labeled data and a larger set of noisily labeled data. In addition, audio tagging is a sparse multi-label classification task, in which only a small number of acoustic events occur in any given audio clip. In this paper, we propose a staged training strategy to deal with the noisy labels, and adopt a sigmoid-sparsemax multi-activation structure to deal with the sparse multi-label classification. This paper improves and extends our previous work submitted to Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge. We evaluate our methods on the same task and achieve state-of-the-art performance, with a label-weighted label-ranking average precision (lwlrap) score of 0.7591 on the official evaluation dataset.
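
To illustrate why sparsemax suits sparse multi-label outputs, the following is a minimal NumPy sketch of the standard sparsemax activation (a Euclidean projection of the logits onto the probability simplex) next to an element-wise sigmoid. The abstract does not specify how the two activations are combined in the multi-activation structure, so this sketch only shows the activations themselves; the function names and example logits are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    """Element-wise sigmoid: each class is scored independently."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


def sparsemax(z):
    """Standard sparsemax for a 1-D logit vector.

    Projects z onto the probability simplex, so the output sums to 1
    and low-scoring classes receive exactly zero.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # logits in descending order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    # Classes in the support satisfy 1 + k * z_(k) > sum of the top-k logits.
    support = 1 + k * z_sorted > cumsum
    k_max = k[support][-1]                 # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold subtracted from logits
    return np.maximum(z - tau, 0.0)


# Hypothetical logits for three event classes.
logits = np.array([0.5, 0.3, -2.0])
sparse_probs = sparsemax(logits)   # third class is driven to exactly zero
dense_probs = sigmoid(logits)      # every class keeps a nonzero score
```

Unlike sigmoid, which gives every class a strictly positive score, sparsemax assigns exactly zero to classes below the threshold, which matches the setting where only a few events occur per clip.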

Publication
The 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)