# Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments

Ethan Manilow, Prem Seetharaman, and Bryan Pardo
Interactive Audio Lab
Northwestern University

## Abstract

We present a single deep learning architecture that can simultaneously separate an audio recording of a musical mixture into constituent single-instrument recordings and transcribe those instruments into a human-readable format, learning a shared musical representation for both tasks. This novel architecture, which we call Cerberus, builds on the Chimera network for source separation by adding a third "head" for transcription. By training each head with a different loss, we jointly learn to separate and transcribe mixtures of up to 5 instruments with a single network. We show that the two tasks are highly complementary and that, when learned jointly, they yield Cerberus networks that are better at both separation and transcription and generalize better to unseen mixtures.

## Cerberus vs. Other Architectures

### Setup

The Cerberus loss is given by:

$\mathcal{L}_{\text{Cerberus}} = \alpha \mathcal{L}_{\text{DC}} + \beta \mathcal{L}_{\text{MI}} + \gamma \mathcal{L}_{\text{TR}}$

where $\mathcal{L}_{\text{DC}}$ is the Deep Clustering loss, $\mathcal{L}_{\text{MI}}$ is the Mask Inference loss, and $\mathcal{L}_{\text{TR}}$ is the Transcription loss. In this set of experiments, we examine different settings of the weights $\alpha$, $\beta$, and $\gamma$. Setting $\alpha$, $\beta$, or $\gamma$ to $0$ effectively changes the architecture of the network. Full results are shown in the table below.
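As a concrete illustration, the weighted combination above is just a few lines of Python. This is a sketch only: the loss values and weights below are placeholders, not outputs of an actual network.

```python
def cerberus_loss(loss_dc, loss_mi, loss_tr, alpha, beta, gamma):
    """Weighted sum of the Deep Clustering, Mask Inference,
    and Transcription head losses."""
    return alpha * loss_dc + beta * loss_mi + gamma * loss_tr

# The Cerberus configuration uses (alpha, beta, gamma) = (0.1, 0.1, 0.8).
# Zeroing a weight removes that head's gradient signal entirely;
# e.g. (0, 1, 0) recovers a pure Mask Inference network.
total = cerberus_loss(2.0, 1.5, 0.5, alpha=0.1, beta=0.1, gamma=0.8)
```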

| Split | Data |
|---|---|
| Training set | Slakh2100 train partition |
| Validation set | Slakh2100 validation partition |
| Test set | Slakh2100 test partition |

### Audio Examples

#### Cerberus Output

*(Audio players: the input mixture; separated piano and guitar; and re-synthesized transcriptions of piano, guitar, and their remix.)*

#### Evaluation Measures

DC/MI/TR columns give the loss weights $\alpha$, $\beta$, and $\gamma$; P/R/F1 are precision, recall, and F1 for the frame-level, onset, and note on/off transcription metrics.

| Model Name | DC | MI | TR | SDR | SIR | SAR | Frame P | Frame R | Frame F1 | Onset P | Onset R | Onset F1 | Note on/off P | Note on/off R | Note on/off F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deep Clustering | 1.0 | | | 8.5 | 16.5 | 9.8 | | | | | | | | | |
| Mask Inference | | 1.0 | | 10.0 | 16.9 | 11.9 | | | | | | | | | |
| Transcription Only | | | 1.0 | | | | 0.90 | 0.83 | 0.85 | 0.77 | 0.70 | 0.71 | 0.48 | 0.43 | 0.44 |
| Chimera | 0.5 | 0.5 | | 9.8 | 16.5 | 11.6 | | | | | | | | | |
| DC + TR | 0.2 | | 0.8 | 9.3 | 18.1 | 10.4 | 0.91 | 0.81 | 0.84 | 0.79 | 0.68 | 0.71 | 0.48 | 0.41 | 0.43 |
| MI + TR | | 0.2 | 0.8 | 9.8 | 16.7 | 11.6 | 0.91 | 0.84 | 0.86 | 0.81 | 0.71 | 0.74 | 0.51 | 0.46 | 0.47 |
| Cerberus | 0.1 | 0.1 | 0.8 | 10.0 | 16.9 | 11.8 | 0.91 | 0.83 | 0.85 | 0.81 | 0.71 | 0.73 | 0.51 | 0.45 | 0.47 |
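For reference, the frame-level precision, recall, and F1 above compare estimated and ground-truth binary piano rolls frame by frame. The sketch below is our own illustration of the computation, not the paper's exact evaluation code (which follows standard mir_eval-style conventions):

```python
import numpy as np

def frame_metrics(est, ref):
    """Frame-level precision/recall/F1 for binary piano rolls
    shaped (pitches, frames); entries are 1 where a note is active."""
    est, ref = est.astype(bool), ref.astype(bool)
    tp = int(np.sum(est & ref))           # cells active in both rolls
    precision = tp / max(int(est.sum()), 1)
    recall = tp / max(int(ref.sum()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```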

## Three, Four, and Five Instrument Mixtures

### Setup

In this set of experiments, we train Cerberus networks to separate and transcribe mixtures of 3, 4, and 5 instrument classes. Each network has its own training, validation, and test sets, each containing mixtures of only the noted instruments.

| Split | Data |
|---|---|
| Training set | Slakh2100 train partition |
| Validation set | Slakh2100 validation partition |
| Test set | Slakh2100 test partition |

### Audio Examples

#### Piano, Guitar, Bass

*(Audio players: the input mixture; separated piano, guitar, and bass; and re-synthesized transcriptions of each instrument plus their remix.)*

#### Piano, Guitar, Bass, Drums

*(Audio players: the input mixture; separated piano, guitar, bass, and drums; and re-synthesized transcriptions of each instrument plus their remix.)*

#### Piano, Guitar, Bass, Drums, Strings

*(Audio players: the input mixture; separated piano, guitar, bass, drums, and strings; and re-synthesized transcriptions of each instrument plus their remix.)*

### Evaluation Measures

#### Piano, Guitar, Bass

| Instrument | SDR | SIR | SAR | Frame P | Frame R | Frame F1 | Onset P | Onset R | Onset F1 | Note on/off P | Note on/off R | Note on/off F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Piano | 7.6 | 13.5 | 9.7 | 0.91 | 0.83 | 0.85 | 0.77 | 0.73 | 0.73 | 0.44 | 0.42 | 0.42 |
| Guitar | 6.9 | 12.5 | 9.8 | 0.91 | 0.78 | 0.82 | 0.75 | 0.59 | 0.63 | 0.46 | 0.35 | 0.38 |
| Bass | 10.1 | 16.9 | 11.9 | 0.96 | 0.93 | 0.94 | 0.94 | 0.89 | 0.91 | 0.85 | 0.80 | 0.82 |
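SDR, SIR, and SAR are the standard BSS Eval source separation metrics, which decompose the estimate into target, interference, and artifact components. As a rough intuition only, a plain signal-to-distortion ratio (ignoring BSS Eval's allowed distortions of the reference) can be sketched as:

```python
import numpy as np

def simple_sdr(est, ref, eps=1e-10):
    """Plain signal-to-distortion ratio in dB: energy of the
    reference over energy of the residual (ref - est).
    BSS Eval's SDR additionally permits scaling/filtering of the
    reference before the residual is computed, so values differ."""
    residual = ref - est
    return 10 * np.log10(np.sum(ref ** 2) / (np.sum(residual ** 2) + eps))
```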

#### Piano, Guitar, Bass, Drums

| Instrument | SDR | SIR | SAR | Frame P | Frame R | Frame F1 | Onset P | Onset R | Onset F1 | Note on/off P | Note on/off R | Note on/off F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Piano | 6.1 | 11.5 | 8.3 | 0.89 | 0.81 | 0.83 | 0.73 | 0.70 | 0.68 | 0.38 | 0.36 | 0.36 |
| Guitar | 5.8 | 10.8 | 8.7 | 0.89 | 0.76 | 0.79 | 0.72 | 0.55 | 0.58 | 0.42 | 0.32 | 0.34 |
| Bass | 7.7 | 13.0 | 10.0 | 0.96 | 0.92 | 0.94 | 0.92 | 0.88 | 0.89 | 0.82 | 0.78 | 0.79 |
| Drums | 11.3 | 19.2 | 12.2 | 0.64 | 0.71 | 0.63 | 0.61 | 0.76 | 0.63 | | | |

#### Piano, Guitar, Bass, Drums, Strings

| Instrument | SDR | SIR | SAR | Frame P | Frame R | Frame F1 | Onset P | Onset R | Onset F1 | Note on/off P | Note on/off R | Note on/off F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Piano | 3.4 | 6.7 | 7.3 | 0.86 | 0.75 | 0.78 | 0.69 | 0.62 | 0.61 | 0.31 | 0.28 | 0.28 |
| Guitar | 3.1 | 6.5 | 7.0 | 0.84 | 0.70 | 0.73 | 0.64 | 0.45 | 0.49 | 0.29 | 0.20 | 0.22 |
| Bass | 6.4 | 11.4 | 8.6 | 0.95 | 0.91 | 0.93 | 0.91 | 0.84 | 0.87 | 0.77 | 0.72 | 0.74 |
| Drums | 10.6 | 18.7 | 11.5 | 0.63 | 0.71 | 0.63 | 0.62 | 0.75 | 0.64 | | | |
| Strings | 4.1 | 8.6 | 8.5 | 0.91 | 0.83 | 0.85 | 0.62 | 0.53 | 0.53 | 0.39 | 0.35 | 0.35 |

## Cerberus on Real Recordings

In this section we test Cerberus models (trained on synthesized data) on real recordings found on YouTube.

In [12]:
from IPython.display import IFrame

Out[12]:

*(embedded YouTube video)*
Cerberus Piano/Guitar model tested on the above video from 2:05 to 2:35. (Audio below is downsampled to 16 kHz.)

*(Audio players: the input mixture; separated piano and guitar; and re-synthesized transcriptions of piano, guitar, and their remix.)*
In [11]:
IFrame(src="https://www.youtube.com/embed/RBx-Ue28KUE?rel=0&showinfo=0&start=45&end=60", width="560", height="315")

Out[11]:

Cerberus Piano/Guitar model tested on the above video from 0:45 to 1:00. (Audio below is downsampled to 16 kHz.)

*(Audio players: the input mixture; separated piano and "guitar"; and re-synthesized transcriptions of piano, "guitar", and their remix.)*