P2-07: Dual Attention-Based Multi-Scale Feature Fusion Approach for Dynamic Music Emotion Recognition
Liyue Zhang (Xi'an Jiaotong University)*, Xinyu Yang (Xi'an Jiaotong University), Yichi Zhang (Xi'an Jiaotong University), Jing Luo (Xi'an Jiaotong University)
Subjects (starting with primary): MIR tasks -> automatic classification; Musical features and properties -> musical affect, emotion and mood; MIR fundamentals and methodology -> music signal processing
Presented Virtually: 4-minute short-format presentation
Music Emotion Recognition (MER) refers to automatically extracting emotional information from music and predicting its perceived emotions, with applications in social and psychological domains. This paper proposes a Dual Attention-based Multi-scale Feature Fusion (DAMFF) method and a newly developed dataset named MER1101 for Dynamic Music Emotion Recognition (DMER). Specifically, multi-scale features are first extracted from the log Mel-spectrogram by multiple parallel convolutional blocks. A Dual Attention Feature Fusion (DAFF) module then fuses the multi-scale context and captures emotion-critical features in both the spatial and channel dimensions. Finally, a BiLSTM-based sequence learning model predicts the dynamic music emotions. To enrich existing music emotion datasets, we developed MER1101, a high-quality dataset with a balanced emotional distribution that spans over ten genres and at least four languages and contains more than a thousand song snippets. We demonstrate the effectiveness of the proposed DAMFF approach on both the newly developed MER1101 dataset and the established DEAM2015 dataset. Compared with other models, ours achieves a higher Concordance Correlation Coefficient (CCC), with strong predictive power for arousal and comparable results for valence.
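The abstract fixes the pipeline's overall shape (parallel multi-scale convolutions over a log Mel-spectrogram, dual spatial/channel attention fusion, BiLSTM sequence modeling) but not its internals. The following minimal PyTorch sketch illustrates one plausible reading of that shape; the kernel sizes, channel widths, the SE/CBAM-style attention gates, the frequency pooling, and all class and function names (DualAttentionFusion, DAMFFSketch, ccc) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DualAttentionFusion(nn.Module):
    """Channel attention (SE-style) followed by spatial attention (CBAM-style).

    Assumption: the paper's DAFF internals are not given in the abstract, so
    this gating scheme is one plausible way to attend over both the channel
    and spatial dimensions as described.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Channel gate: global average pool -> bottleneck MLP -> sigmoid.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial gate: channel-wise mean/max maps -> 7x7 conv -> sigmoid.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * self.spatial_gate(pooled)


class DAMFFSketch(nn.Module):
    """Parallel multi-scale conv blocks -> dual-attention fusion -> BiLSTM."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        # Parallel conv blocks with different kernel sizes give the
        # multi-scale view of the log Mel-spectrogram.
        self.branches = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Conv2d(1, 32, kernel_size=k, padding=k // 2),
                    nn.BatchNorm2d(32),
                    nn.ReLU(inplace=True),
                )
                for k in (3, 5, 7)
            ]
        )
        self.fuse = DualAttentionFusion(3 * 32)
        self.proj = nn.Conv2d(3 * 32, 64, kernel_size=1)
        self.freq_pool = nn.AdaptiveAvgPool2d((1, None))  # collapse the mel axis
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # per-frame (valence, arousal)

    def forward(self, logmel: torch.Tensor) -> torch.Tensor:
        # logmel: (batch, 1, n_mels, frames)
        feats = torch.cat([branch(logmel) for branch in self.branches], dim=1)
        feats = self.proj(self.fuse(feats))
        seq = self.freq_pool(feats).squeeze(2).transpose(1, 2)  # (batch, frames, 64)
        out, _ = self.bilstm(seq)
        return self.head(out)  # (batch, frames, 2)


def ccc(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Concordance Correlation Coefficient, the evaluation metric named above."""
    pm, tm = pred.mean(), target.mean()
    pv, tv = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pm) * (target - tm)).mean()
    return 2 * cov / (pv + tv + (pm - tm) ** 2)
```

A model of this shape would be trained frame by frame against dynamic valence/arousal annotations, with 1 - ccc as a natural loss term; again, this sketches the pipeline described in the abstract, not the authors' code.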