Title: A systematic evaluation of the bag-of-frames representation for music information retrieval
Authors: Su, Li; Yeh, Chin Chia Michael; Liu, Jen Yu; Wang, Ju Chiang; Yang, Yi-Hsuan
Type: journal article
Date issued: 2014-01-01
Record date: 2023-10-20
ISSN: 1520-9210
DOI: 10.1109/TMM.2014.2311016
Scopus ID: 2-s2.0-84904754003
Handle: https://scholars.lib.ntu.edu.tw/handle/123456789/636376
Scopus URL: https://api.elsevier.com/content/abstract/scopus_id/84904754003
Keywords: Bag-of-frames model | music information retrieval | sparse coding | unsupervised feature learning
SDGs: SDG4

Abstract: There has been increasing attention to learning feature representations from the complex, high-dimensional audio data used in various music information retrieval (MIR) problems. Unsupervised feature learning techniques, such as sparse coding and deep belief networks, have been used to represent music information as a term-document structure comprising elementary audio codewords. Despite the widespread use of this bag-of-frames (BoF) model, few attempts have been made to systematically compare different component settings. Moreover, whether techniques developed in the text retrieval community are applicable to audio codewords is poorly understood. To further our understanding of the BoF model, we present in this paper a comprehensive evaluation that compares a large number of BoF variants on three different MIR tasks, considering different choices of low-level feature representation, codebook construction, codeword assignment, segment-level and song-level feature pooling, tf-idf term weighting, power normalization, and dimension reduction. Our evaluations lead to the following findings: 1) modeling music information by two levels of abstraction improves the result for difficult tasks such as predominant instrument recognition, 2) tf-idf weighting and power normalization improve system performance in general, 3) topic modeling methods such as latent Dirichlet allocation do not work for audio codewords. © 1999-2012 IEEE.
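
Illustrative note: the abstract names several stages of a bag-of-frames pipeline (codeword assignment, pooling, tf-idf weighting, power normalization) without defining them. The following is a minimal Python sketch of one possible reading of those stages, not the authors' implementation. Everything specific in it is an assumption made for illustration: hard nearest-codeword assignment stands in for the assignment methods the paper compares, pooling is a simple sum over frames, idf is taken as log(N / df), the exponent alpha = 0.5 is an arbitrary choice, and the toy codebook and frames are invented.

```python
import math

def nearest_codeword(frame, codebook):
    """Index of the codeword closest to the frame (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))

def bag_of_frames(frames, codebook):
    """Sum-pool hard assignments into one codeword-count histogram per song."""
    counts = [0.0] * len(codebook)
    for frame in frames:
        counts[nearest_codeword(frame, codebook)] += 1.0
    return counts

def tfidf(histograms):
    """Weight normalized counts by idf = log(N / df) (assumed idf variant)."""
    n_songs = len(histograms)
    n_words = len(histograms[0])
    idf = []
    for k in range(n_words):
        df = sum(1 for h in histograms if h[k] > 0)  # songs that use codeword k
        idf.append(math.log(n_songs / df) if df else 0.0)
    weighted = []
    for h in histograms:
        total = sum(h) or 1.0  # avoid division by zero for an empty song
        weighted.append([(c / total) * w for c, w in zip(h, idf)])
    return weighted

def power_normalize(vec, alpha=0.5):
    """Signed power normalization: sign(v) * |v| ** alpha (alpha is assumed)."""
    return [math.copysign(abs(v) ** alpha, v) for v in vec]

# Toy data (hypothetical): a 3-word codebook and two "songs" of 2-D frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
song_a = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [0.0, 0.2]]
song_b = [[3.9, 4.1], [1.0, 0.9], [4.2, 3.8]]

hist_a = bag_of_frames(song_a, codebook)
hist_b = bag_of_frames(song_b, codebook)
weighted = tfidf([hist_a, hist_b])
features = [power_normalize(w) for w in weighted]
```

In this toy case the middle codeword occurs in both songs, so its idf is log(2 / 2) = 0 and it drops out of both feature vectors, which is the text-retrieval intuition that tf-idf carries over to audio codewords.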