Authors:
(1) Juan F. Montesinos, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain {[email protected]};
(2) Olga Slizovskaia, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain {[email protected]};
(3) Gloria Haro, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain {[email protected]}.
The University of Rochester Multi-Modal Music Performance (URMP) dataset [1] contains 44 multi-instrument video recordings of classical music pieces. Each instrument in a piece was recorded separately, both on video and with high-quality audio from a stand-alone microphone, in order to obtain ground-truth individual tracks. Although the musicians played separately, they were coordinated through a conducting video with a pianist that set a common timing for the different players. After synchronization, the audio of each individual video was replaced by the high-quality microphone recording, and the separate recordings were assembled into a mixture: the individual high-quality audio tracks were summed to create the audio mixture, and the visual content was composited into a single video with a common background in which all players are arranged at the same level from left to right. For each piece, the dataset provides the musical score in MIDI format, the high-quality individual instrument audio recordings and the video of the assembled piece. The instruments present in the dataset, shown in Figure 1, are common instruments in chamber orchestras. Despite these desirable characteristics, URMP is a small dataset and thus not suitable for training deep learning architectures.
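To make the assembly procedure concrete, the following minimal sketch sums time-aligned per-instrument stems into a ground-truth audio mixture, in the spirit of the URMP assembly described above. The file names and the use of the soundfile package are illustrative assumptions, not part of the dataset's tooling.

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library; any WAV reader would do

# Hypothetical per-instrument stems for one piece (mono, already synchronized).
stem_paths = ["violin_stem.wav", "cello_stem.wav"]

stems = []
sample_rate = None
for path in stem_paths:
    audio, sr = sf.read(path)  # each stem comes from its own stand-alone microphone
    assert sample_rate in (None, sr), "all stems are expected to share one sample rate"
    sample_rate = sr
    stems.append(audio)

# Trim to the shortest stem and sum the aligned tracks to form the audio mixture.
length = min(len(s) for s in stems)
mixture = np.sum([s[:length] for s in stems], axis=0)
sf.write("mixture.wav", mixture, sample_rate)
```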
Two other datasets of audio-visual recordings of musical instrument performances have been presented recently: MUSIC [23] and MusicES [31]. MUSIC consists of 536 recordings of solos and 149 videos of duets across 11 categories: accordion, acoustic guitar, cello, clarinet, erhu, flute, saxophone, trumpet, tuba, violin and xylophone. This dataset was gathered by querying YouTube. MusicES [31] extends MUSIC to roughly three times its original size, with approximately 1475 recordings, but spread across 9 categories instead: accordion, guitar, cello, flute, saxophone, trumpet, tuba, violin and xylophone. MUSIC and Solos share 7 categories: violin, cello, flute, clarinet, saxophone, trumpet and tuba. MusicES and Solos share 6 categories (the previous ones except clarinet). Solos and MusicES are complementary: the intersection between the two is only about 5%, which means both datasets can be combined into a larger one.
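As an illustration of how the two collections can be merged, the sketch below compares their clip inventories by YouTube identifier and pools the complementary clips; the JSON file names are hypothetical placeholders for whatever index files the datasets provide.

```python
import json

# Hypothetical index files: each one lists the YouTube video IDs of a dataset's clips.
with open("solos_ids.json") as f:
    solos_ids = set(json.load(f))
with open("musices_ids.json") as f:
    musices_ids = set(json.load(f))

# Clips present in both collections (reported to be around 5% of the total).
overlap = solos_ids & musices_ids
union = solos_ids | musices_ids
print(f"shared clips: {len(overlap)} ({100 * len(overlap) / len(union):.1f}% of the union)")

# The remaining, complementary clips can simply be pooled into a combined dataset.
combined_ids = union
```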
Several examples in the literature show the utility of audio-visual datasets. The Sound of Pixels [23] performs audio source separation by generating audio spectral components which are then selected with visual features computed from the video stream in order to obtain the separated sources. This idea was further extended in [20] to separate the different sounds present in the mixture in a recursive way: at each stage, the system separates the most salient source from the ones remaining in the mixture. The Sound of Motions [19] uses dense trajectories computed from optical flow to condition the audio source separation, and is even able to separate same-instrument mixtures. Visual conditioning is also used in [18] to separate different instruments; during training, a classification loss on the separated sounds enforces object consistency, while a co-separation loss forces the estimated individual sounds to reproduce the original mixtures once reassembled. In [17], the authors developed an energy-based method that minimizes a Non-negative Matrix Factorization (NMF) term whose activation matrix is forced to be aligned with a matrix containing per-source motion information; this motion matrix contains the average magnitude of the velocities of the clustered motion trajectories within each player's bounding box.
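Schematically, objectives of this kind combine a spectrogram reconstruction term with a motion-alignment penalty. The notation below is assumed for illustration (V: mixture spectrogram, W: spectral basis, H: activations, M: per-source motion matrix, D: a divergence, \Phi: an alignment penalty, \lambda: a trade-off weight) and is a sketch of the general form rather than the exact formulation of [17].

```latex
\min_{W \ge 0,\; H \ge 0} \;
  \underbrace{D\!\left(V \,\middle\|\, W H\right)}_{\text{NMF reconstruction}}
  \;+\; \lambda\,
  \underbrace{\Phi\!\left(H, M\right)}_{\text{motion alignment}}
```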
Recent works show the rising use of skeletons in audio-visual tasks. In Audio to Body Dynamics [29], the authors show that it is possible to predict skeletons that reproduce the movements of musicians playing instruments such as the piano or the violin. Skeletons have also proven useful for establishing audio-visual correspondences, such as between body or finger motion and note onsets or pitch fluctuations, in chamber music performances [21]. A recent work [32] tackles the source separation problem in a way similar to The Sound of Motions [19], but replacing the dense trajectories with skeleton information.