Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments

Ethan Manilow, Prem Seetharaman, and Bryan Pardo
Interactive Audio Lab
Northwestern University

Abstract

We present a single deep learning architecture that can simultaneously separate an audio recording of a musical mixture into constituent single-instrument recordings and transcribe those instruments into a human-readable format, learning a shared musical representation for both tasks. This novel architecture, which we call Cerberus, builds on the Chimera network for source separation by adding a third "head" for transcription. By training each head with a different loss, a single network jointly learns to separate and transcribe mixtures of up to 5 instruments in our experiments. We show that the two tasks are highly complementary: learned jointly, they yield Cerberus networks that are better at both separation and transcription and that generalize better to unseen mixtures.

See the arXiv submission

Submitted to ICASSP 2020

Architecture Overview

Figure: The Cerberus architecture.

Figure: Cerberus vs. other architectures.
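
As the figures suggest, Cerberus attaches three parallel heads to a shared recurrent body: a deep clustering (DC) embedding head, a mask inference (MI) head, and a transcription (TR) head. Below is a minimal PyTorch sketch of that three-headed layout; the layer sizes, activations, and names are our illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class CerberusSketch(nn.Module):
    # Shared BLSTM body feeding three heads: deep clustering (DC),
    # mask inference (MI), and transcription (TR). Sizes are assumptions.
    def __init__(self, n_freq=512, n_src=2, n_pitch=88,
                 hidden=300, n_layers=4, emb_dim=20):
        super().__init__()
        self.n_freq, self.n_src = n_freq, n_src
        self.n_pitch, self.emb_dim = n_pitch, emb_dim
        self.body = nn.LSTM(n_freq, hidden, n_layers,
                            batch_first=True, bidirectional=True)
        self.dc_head = nn.Linear(2 * hidden, n_freq * emb_dim)
        self.mi_head = nn.Linear(2 * hidden, n_freq * n_src)
        self.tr_head = nn.Linear(2 * hidden, n_pitch * n_src)

    def forward(self, mix_mag):  # mix_mag: (batch, time, n_freq)
        h, _ = self.body(mix_mag)
        b, t, _ = h.shape
        # DC head: one unit-norm embedding per time-frequency bin.
        emb = torch.tanh(self.dc_head(h)).view(b, t, self.n_freq, self.emb_dim)
        emb = emb / (emb.norm(dim=-1, keepdim=True) + 1e-8)
        # MI head: one soft mask per source, applied to the mixture magnitudes.
        masks = torch.sigmoid(self.mi_head(h)).view(b, t, self.n_freq, self.n_src)
        # TR head: one piano-roll activation map per source.
        rolls = torch.sigmoid(self.tr_head(h)).view(b, t, self.n_pitch, self.n_src)
        return emb, masks, rolls

Separated audio is then obtained by multiplying each mask with the mixture spectrogram and inverting, while the piano-roll activations are thresholded into notes.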

Setup

The Cerberus loss is given by:

$\mathcal{L}_{\text{Cerberus}} = \alpha \mathcal{L}_{\text{DC}} + \beta \mathcal{L}_{\text{MI}} + \gamma \mathcal{L}_{\text{TR}}$

where $\mathcal{L}_{\text{DC}}$ is the deep clustering loss, $\mathcal{L}_{\text{MI}}$ is the mask inference loss, and $\mathcal{L}_{\text{TR}}$ is the transcription loss. In this set of experiments, we vary the weights $\alpha$, $\beta$, and $\gamma$ to produce different networks. Setting any of $\alpha$, $\beta$, or $\gamma$ to $0$ effectively changes the architecture of the network, since the corresponding head receives no training signal. Full results are shown in the table below.
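
As a concrete illustration, the weighted sum might be computed as below, using the outputs of the architecture sketch above. The individual losses shown (affinity-based deep clustering loss, L1 mask inference loss, binary cross-entropy transcription loss) are common choices and may differ from the paper's exact implementations.

import torch
import torch.nn.functional as F

def deep_clustering_loss(emb, assign):
    # || V V^T - Y Y^T ||_F^2, expanded so the full (bins x bins) affinity
    # matrices are never materialized. emb: (batch, bins, emb_dim) unit-norm
    # embeddings; assign: (batch, bins, n_src) one-hot ideal assignments.
    vtv = torch.einsum("bne,bnf->bef", emb, emb)
    vty = torch.einsum("bne,bns->bes", emb, assign)
    yty = torch.einsum("bns,bnt->bst", assign, assign)
    return (vtv.pow(2).sum() - 2 * vty.pow(2).sum()
            + yty.pow(2).sum()) / emb.shape[0]

def cerberus_loss(emb, masks, rolls, batch, alpha=0.1, beta=0.1, gamma=0.8):
    b = emb.shape[0]
    # L_DC over all time-frequency bins.
    dc = deep_clustering_loss(emb.reshape(b, -1, emb.shape[-1]),
                              batch["assignments"].reshape(b, -1, masks.shape[-1]))
    # L_MI: L1 between masked mixture and clean source magnitudes.
    mi = F.l1_loss(masks * batch["mix_mag"].unsqueeze(-1), batch["source_mags"])
    # L_TR: binary cross-entropy against ground-truth piano rolls.
    tr = F.binary_cross_entropy(rolls, batch["piano_rolls"])
    return alpha * dc + beta * mi + gamma * tr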

Training Set:   Slakh2100 train partition
Validation Set: Slakh2100 validation partition
Test Set:       Slakh2100 test partition

Audio Examples

Cerberus Output

Mixture | Separation Output: Piano, Guitar | Transcription Output (re-synthesized): Piano, Guitar, Remixed
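
The "(re-synthesized)" transcription audio is produced by converting the transcription head's output into notes and rendering those notes as audio. A minimal sketch of one way to do this with pretty_midi is below; the frame rate, binarization threshold, and General MIDI program are assumptions for illustration.

import numpy as np
import pretty_midi

def roll_to_midi(roll, fs=16, threshold=0.5, program=0):
    # roll: (time, 128) network activations; fs: frames per second.
    active = roll > threshold
    pm = pretty_midi.PrettyMIDI()
    inst = pretty_midi.Instrument(program=program)
    for pitch in range(active.shape[1]):
        col = active[:, pitch].astype(int)
        onsets = np.flatnonzero(np.diff(col) == 1) + 1
        offsets = np.flatnonzero(np.diff(col) == -1) + 1
        if col[0]:                       # note already sounding at t = 0
            onsets = np.insert(onsets, 0, 0)
        if col[-1]:                      # note still sounding at the end
            offsets = np.append(offsets, len(col))
        for on, off in zip(onsets, offsets):
            inst.notes.append(pretty_midi.Note(
                velocity=100, pitch=pitch, start=on / fs, end=off / fs))
    pm.instruments.append(inst)
    return pm

# Example: render an estimated piano roll with simple additive synthesis.
# audio = roll_to_midi(piano_roll).synthesize(fs=16000)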

Evaluation Measures

Model Name         | Loss Weights   | Separation      | Frame Level     | Onsets          | Note on/off
                   | DC   MI   TR   | SDR  SIR  SAR   | P    R    F1    | P    R    F1    | P    R    F1
Deep Clustering    | 1.0  –    –    | 8.5  16.5  9.8  | –    –    –     | –    –    –     | –    –    –
Mask Inference     | –    1.0  –    | 10.0 16.9 11.9  | –    –    –     | –    –    –     | –    –    –
Transcription Only | –    –    1.0  | –    –     –    | 0.90 0.83 0.85  | 0.77 0.70 0.71  | 0.48 0.43 0.44
Chimera            | 0.5  0.5  –    | 9.8  16.5 11.6  | –    –    –     | –    –    –     | –    –    –
DC + TR            | 0.2  –    0.8  | 9.3  18.1 10.4  | 0.91 0.81 0.84  | 0.79 0.68 0.71  | 0.48 0.41 0.43
MI + TR            | –    0.2  0.8  | 9.8  16.7 11.6  | 0.91 0.84 0.86  | 0.81 0.71 0.74  | 0.51 0.46 0.47
Cerberus           | 0.1  0.1  0.8  | 10.0 16.9 11.8  | 0.91 0.83 0.85  | 0.81 0.71 0.73  | 0.51 0.45 0.47

SDR, SIR, and SAR are reported in dB; P, R, and F1 denote precision, recall, and F1 score. Dashes mark outputs a given model does not produce.
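
The transcription columns above follow the standard mir_eval definitions: frame-level multi-pitch scores, onset-only note matching (50 ms tolerance), and note matching that also requires offset agreement. Assuming those implementations (the paper's exact evaluation code may differ), a sketch with toy note lists:

import numpy as np
import mir_eval

# Toy reference and estimated notes: (n, 2) [onset, offset] intervals in
# seconds and (n,) pitches in Hz.
ref_intervals = np.array([[0.0, 0.5], [0.5, 1.0]])
ref_pitches = np.array([440.0, 220.0])
est_intervals = np.array([[0.02, 0.48], [0.55, 1.10]])
est_pitches = np.array([440.0, 220.0])

# "Onsets": pitch must agree and the onset must fall within 50 ms of a
# reference onset; offsets are ignored (offset_ratio=None).
on_p, on_r, on_f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=None)

# "Note on/off": the offset must also match, within 20% of the reference
# note's duration (the mir_eval default).
no_p, no_r, no_f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=0.2)

# "Frame Level": active pitches compared frame by frame.
times = np.arange(0.0, 1.0, 0.01)
ref_freqs = [np.array([440.0]) if t < 0.5 else np.array([220.0]) for t in times]
est_freqs = [np.array([440.0]) if t < 0.5 else np.array([220.0]) for t in times]
frame_scores = mir_eval.multipitch.evaluate(times, ref_freqs, times, est_freqs)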

Three, Four, and Five Instrument Mixtures

Setup

In this set of experiments, we train Cerberus networks to separate and transcribe mixtures of 3, 4, and 5 instrument classes. Each network has its own training, validation, and test sets, which contain mixtures of only the noted instruments.

Training Set:   Slakh2100 train partition
Validation Set: Slakh2100 validation partition
Test Set:       Slakh2100 test partition

Audio Examples

Piano, Guitar, Bass

Mixture | Separation Output: Piano, Guitar, Bass | Transcription Output (re-synthesized): Piano, Guitar, Bass, Remixed

Piano, Guitar, Bass, Drums

Mixture | Separation Output: Piano, Guitar, Bass, Drums | Transcription Output (re-synthesized): Piano, Guitar, Bass, Drums, Remixed

Piano, Guitar, Bass, Drums, Strings

Mixture | Separation Output: Piano, Guitar, Bass, Drums, Strings | Transcription Output (re-synthesized): Piano, Guitar, Bass, Drums, Strings, Remixed

Evaluation Measures

Piano, Guitar, Bass

Instrument | SDR  SIR  SAR  | Frame Level     | Onsets          | Note on/off
           | (dB) (dB) (dB) | P    R    F1    | P    R    F1    | P    R    F1
Piano      | 7.6  13.5  9.7 | 0.91 0.83 0.85  | 0.77 0.73 0.73  | 0.44 0.42 0.42
Guitar     | 6.9  12.5  9.8 | 0.91 0.78 0.82  | 0.75 0.59 0.63  | 0.46 0.35 0.38
Bass       | 10.1 16.9 11.9 | 0.96 0.93 0.94  | 0.94 0.89 0.91  | 0.85 0.80 0.82

Piano, Guitar, Bass, Drums

Instrument | SDR  SIR  SAR  | Frame Level     | Onsets          | Note on/off
           | (dB) (dB) (dB) | P    R    F1    | P    R    F1    | P    R    F1
Piano      | 6.1  11.5  8.3 | 0.89 0.81 0.83  | 0.73 0.70 0.68  | 0.38 0.36 0.36
Guitar     | 5.8  10.8  8.7 | 0.89 0.76 0.79  | 0.72 0.55 0.58  | 0.42 0.32 0.34
Bass       | 7.7  13.0 10.0 | 0.96 0.92 0.94  | 0.92 0.88 0.89  | 0.82 0.78 0.79
Drums      | 11.3 19.2 12.2 | 0.64 0.71 0.63  | 0.61 0.76 0.63  | –    –    –

Piano, Guitar, Bass, Drums, Strings

Instrument | SDR  SIR  SAR  | Frame Level     | Onsets          | Note on/off
           | (dB) (dB) (dB) | P    R    F1    | P    R    F1    | P    R    F1
Piano      | 3.4   6.7  7.3 | 0.86 0.75 0.78  | 0.69 0.62 0.61  | 0.31 0.28 0.28
Guitar     | 3.1   6.5  7.0 | 0.84 0.70 0.73  | 0.64 0.45 0.49  | 0.29 0.20 0.22
Bass       | 6.4  11.4  8.6 | 0.95 0.91 0.93  | 0.91 0.84 0.87  | 0.77 0.72 0.74
Drums      | 10.6 18.7 11.5 | 0.63 0.71 0.63  | 0.62 0.75 0.64  | –    –    –
Strings    | 4.1   8.6  8.5 | 0.91 0.83 0.85  | 0.62 0.53 0.53  | 0.39 0.35 0.35

Cerberus on Real Recordings

In this section we test Cerberus models (trained on synthesized data) on real recordings found on YouTube.
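
The clips below are downsampled to 16 kHz before being fed to the models. Here is a minimal preprocessing sketch with librosa; the file name and STFT parameters are placeholders we assume for illustration.

import numpy as np
import librosa

# Load audio extracted from a YouTube clip and resample to 16 kHz.
audio, sr = librosa.load("real_recording.wav", sr=16000, mono=True)

# Magnitude spectrogram frames, shaped (time, n_freq), for a recurrent model.
mix_mag = np.abs(librosa.stft(audio, n_fft=512, hop_length=128)).T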

from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/87cnbLi0xBw?rel=0&showinfo=0&start=125&end=155", width="560", height="315")

Cerberus Piano/Guitar model tested on the above video from 2:05-2:35. (Audio below is downsampled to 16 kHz.)

Mixture | Separation Output: Piano, Guitar | Transcription Output (re-synthesized): Piano, Guitar, Remixed
IFrame(src="https://www.youtube.com/embed/RBx-Ue28KUE?rel=0&showinfo=0&start=45&end=60", width="560", height="315")

Cerberus Piano/Guitar model tested on the above video from 0:45-1:00. (Audio below is downsampled to 16 kHz.)

Mixture | Separation Output: Piano, "Guitar" | Transcription Output (re-synthesized): Piano, "Guitar", Remixed