P3-10: Sounds Out of Pläce? Score Independent Detection of Conspicuous Mistakes in Piano Performances
Alia Morsi (Universitat Pompeu Fabra)*, Kana Tatsumi (Nagoya Institute of Technology), Akira Maezawa (Yamaha Corporation), Takuya Fujishima (Yamaha Corporation), Xavier Serra (Universitat Pompeu Fabra)
Subjects (starting with primary): Evaluation, datasets, and reproducibility -> novel datasets and use cases; MIR tasks -> automatic classification; Applications -> music training and education; Musical features and properties -> expression and performative aspects of music; Knowledge-driven approaches to MIR -> machine learning/artificial intelligence for music; Evaluation, datasets, and reproducibility -> annotation protocols
Presented In Person: 4-minute short-format presentation
In piano performance, some mistakes stand out to listeners, whereas others go unnoticed. Prior research concluded that the salience of a mistake depends on factors including its contextual appropriateness and the listener’s familiarity with what is being performed. A conspicuous error is a region where something is obviously wrong with the performance, which a listener can detect regardless of how well they know the piece. Accordingly, this paper attempts to build a score-independent conspicuous-error detector for the standard piano repertoire of beginner to intermediate students. We gather three qualitatively different sets of piano-performance MIDI data: (1) 103 sight-reading sessions by beginner and intermediate adult pianists with formal music training, (2) 245 performances by presumably late-beginner to early-advanced pianists on a digital piano, and (3) 50 etude performances by an advanced pianist. The data were annotated to mark the regions considered to contain conspicuous mistakes. We then use a Temporal Convolutional Network (TCN) to detect the sites of such mistakes from the piano roll. To overcome data scarcity, we investigate two pre-training methods: (1) synthetic data with procedurally generated mistakes, and (2) training part of the model as a piano-roll auto-encoder. Experimental evaluation shows that the TCN achieves an F-measure of 0.78 without pre-training on the sight-reading data, while the proposed pre-training steps improve the F-measure on the performance and etude data, approaching the agreement between human raters on conspicuous-error labels. Importantly, we report the lessons learned from this pilot study and what should be addressed to continue this research direction.
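The first pre-training method relies on corrupting clean performances with procedurally generated mistakes and using the corrupted regions as labels. The abstract does not specify the corruption procedure, so the following is only a minimal sketch of the idea: a hypothetical `inject_mistakes` helper that randomly replaces or doubles notes with semitone slips and records each corrupted span as an error region.

```python
import random

def inject_mistakes(notes, rate=0.1, seed=0):
    """Corrupt a clean performance to synthesize training data.

    `notes` is a list of (onset_frame, midi_pitch, duration_frames) tuples.
    Returns (corrupted_notes, error_regions), where each error region is an
    (start_frame, end_frame) span covering a corrupted note.
    NOTE: this is an illustrative sketch, not the authors' exact procedure.
    """
    rng = random.Random(seed)
    corrupted, regions = [], []
    for onset, pitch, dur in notes:
        if rng.random() < rate:
            if rng.random() < 0.5:
                # "wrong note": a semitone slip, a common and audible error
                corrupted.append((onset, pitch + rng.choice([-1, 1]), dur))
            else:
                # "extra note": the intended note plus a neighboring one
                corrupted.append((onset, pitch, dur))
                corrupted.append((onset, pitch + rng.choice([-1, 1]), dur))
            regions.append((onset, onset + dur))  # frame-level error label
        else:
            corrupted.append((onset, pitch, dur))
    return corrupted, regions
```

The error regions can then be rasterized onto the same frame grid as the piano roll, giving the frame-wise targets a TCN would be trained against before fine-tuning on the human-annotated data.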