P4-08: Transformer-Based Beat Tracking With Low-Resolution Encoder and High-Resolution Decoder

Tian Cheng (National Institute of Advanced Industrial Science and Technology (AIST))*, Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST))

Subjects (starting with primary): Musical features and properties -> rhythm, beat, tempo

Presented In Person: 4-minute short-format presentation

Abstract:

In this paper, we address the beat tracking task, which is to predict the beat times of an input audio signal. Because the input sequences are long, it remains challenging to model the global structure efficiently and to deal with the data imbalance between beats and non-beats. To meet these challenges, we propose a novel Transformer-based model consisting of a low-resolution encoder and a high-resolution decoder. The encoder, with low temporal resolution, is suited to capturing global features from more balanced data. The decoder, with high temporal resolution, is designed to predict beat times at a desired resolution. In the decoder, the global structure is taken into account through cross attention between the global features and high-dimensional features. There are two key modifications in the proposed model: (1) adding 1D convolutional layers in the encoder and (2) replacing the positional embedding with the upsampled encoder features in the decoder. In the experiments, we achieved state-of-the-art performance and showed that the decoder produced more precise and stable results.
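
The claim that a lower temporal resolution gives more balanced data can be illustrated with a small sketch. This is not the authors' code: the frame rate (100 frames per second), the tempo (120 BPM, so one beat every 50 frames), the pooling factor (10), and the use of max pooling to downsample the targets are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def beat_targets(num_frames, beat_period):
    """Frame-level binary targets: 1 on each beat frame, 0 elsewhere."""
    return [1 if i % beat_period == 0 else 0 for i in range(num_frames)]


def max_pool(targets, factor):
    """Downsample targets: a coarse frame is a beat if any fine frame in it is."""
    return [max(targets[i:i + factor]) for i in range(0, len(targets), factor)]


def positive_ratio(targets):
    """Fraction of frames labelled as beats."""
    return sum(targets) / len(targets)


# Assumed setup: 30 s of audio at 100 fps, one beat every 50 frames (120 BPM).
high_res = beat_targets(num_frames=3000, beat_period=50)

# Assumed pooling factor of 10, giving a 10 fps low-resolution sequence.
low_res = max_pool(high_res, factor=10)

print(f"high-resolution beat ratio: {positive_ratio(high_res):.2f}")
print(f"low-resolution beat ratio:  {positive_ratio(low_res):.2f}")
```

Under these assumptions, the same beats occupy a larger share of the shorter sequence, which is one way to read the abstract's remark that the low-resolution encoder sees more balanced data, while the high-resolution decoder is still needed to place each beat precisely.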
