Distilling a speech and music encoder with task arithmetic
Journal
Interspeech 2025
Series/Report No.
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Start Page
3858
End Page
3862
ISSN
2308-457X
Date Issued
2025-08-17
Author(s)
Ritter-Gutierrez, Fabian
Lin, Yi-Cheng
Wei, Jui-Chiang
Wong, Jeremy H.M.
Chng, Eng Siong
Chen, Nancy F.
Abstract
Despite the progress in self-supervised learning (SSL) for speech and music, existing models treat these domains separately, limiting their capacity for unified audio understanding. A unified model is desirable for applications that require general representations, e.g., audio large language models. Nonetheless, directly training a general model for both speech and music is computationally expensive. Knowledge distillation from an ensemble of teachers may be a natural solution, but we posit that decoupling the distillation of the speech and music SSL models allows for more flexibility. Thus, we propose to learn distilled task vectors and then linearly interpolate them to form a unified speech+music model. This strategy enables flexible domain emphasis through adjustable weights and is also simpler to train. Experiments on speech and music benchmarks demonstrate that our method yields superior overall performance compared to ensemble distillation.
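The task-arithmetic step described in the abstract can be illustrated concretely: each task vector is the element-wise difference between a distilled student checkpoint and the shared base initialization, and the unified model adds a weighted combination of these vectors back onto the base. Below is a minimal sketch of this idea; the file names, function names, and weight values are hypothetical illustrations, not artifacts from the paper.

```python
import torch

def task_vector(base_state, distilled_state):
    """Task vector: element-wise difference between a distilled
    student checkpoint and the shared base initialization."""
    return {k: distilled_state[k] - base_state[k] for k in base_state}

def merge(base_state, tv_speech, tv_music, w_speech=0.5, w_music=0.5):
    """Add a linear combination of the two task vectors back onto
    the base weights to form the unified speech+music model."""
    return {
        k: base_state[k] + w_speech * tv_speech[k] + w_music * tv_music[k]
        for k in base_state
    }

# Hypothetical usage: three state dicts sharing one architecture.
base = torch.load("student_init.pt")               # shared initialization
speech = torch.load("student_speech_distilled.pt")  # speech-distilled student
music = torch.load("student_music_distilled.pt")    # music-distilled student

unified = merge(base,
                task_vector(base, speech),
                task_vector(base, music),
                w_speech=0.6, w_music=0.4)  # adjustable domain emphasis
```

Adjusting w_speech and w_music is what gives the flexible domain emphasis mentioned in the abstract: the two distillations are decoupled, so the trade-off can be tuned after training without re-running either distillation.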
Subjects
knowledge distillation
self-supervised models
Publisher
ISCA
Type
conference paper
