Multi-resolution Common Fate Transform

Fatemeh Pishdadian and Bryan Pardo

Overview

This website is a companion to the article "Multi-resolution Common Fate Transform", which introduces the Multi-resolution Common Fate Transform (MCFT), an audio representation suited to mixtures of multiple audio signals that overlap in both time and frequency. Here you can find additional experimental results, audio examples, and source code. For details on the experimental setups, the data sets used, and definitions of formulae, please see the paper.

Examples: Audio and Spectrograms


In this section, we present examples of our experimental results. Each example includes audio for the mixture, the original sources, and the estimated sources. We recommend listening to the audio examples with headphones.

Time-frequency plots for the original and estimated sources are also provided. To make visual comparison across representations easier, all results are displayed in the STFT domain. Note, however, that the MCFT (and, of course, the CQT) uses the CQT as its time-frequency analysis stage. For readability, the frequency range in all time-frequency plots is limited to 0 to 4 kHz.
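As a rough illustration of the display range, the sketch below lists the STFT bins whose center frequencies fall between 0 and 4 kHz. The sampling rate and FFT size are assumed values chosen for illustration; this page does not state the settings used for the plots.

```python
def stft_bin_frequencies(sample_rate, n_fft):
    """Center frequency (Hz) of each non-negative-frequency STFT bin."""
    return [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]


def bins_in_range(freqs, f_min=0.0, f_max=4000.0):
    """Indices of the bins whose center frequency lies in [f_min, f_max]."""
    return [k for k, f in enumerate(freqs) if f_min <= f <= f_max]


# Assumed analysis settings (hypothetical, not taken from the paper).
SAMPLE_RATE = 16000
N_FFT = 1024

freqs = stft_bin_frequencies(SAMPLE_RATE, N_FFT)
shown = bins_in_range(freqs)
print(f"{len(shown)} of {len(freqs)} bins are displayed, "
      f"up to {freqs[shown[-1]]:.1f} Hz")
```

Only the rows of the spectrogram indexed by `shown` would be drawn; the remaining bins are still part of the analysis and are simply left out of the picture.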

Two-source Mixtures

Mixture = C2-bassoon-vibrato + C2-bassoon-minor trill

Source 1: C2-bassoon-vibrato

Audio: original source 1; estimated source 1 for each representation (STFT, CQT, CFT-best-sep, MCFT) at masking thresholds of 0 dB, 15 dB, and 30 dB.

Time-frequency plots: original source 1; estimated source 1 at masking thresholds of 0 dB, 15 dB, and 30 dB.

Source 2: C2-bassoon-minor trill

Audio: original source 2; estimated source 2 for each representation (STFT, CQT, CFT-best-sep, MCFT) at masking thresholds of 0 dB, 15 dB, and 30 dB.

Time-frequency plots: original source 2; estimated source 2 at masking thresholds of 0 dB, 15 dB, and 30 dB.

Mixture = C3-viola-minor trill + C3-viola-major trill

Source 1: C3-viola-minor trill

Audio: original source 1; estimated source 1 for each representation (STFT, CQT, CFT-best-sep, MCFT) at masking thresholds of 0 dB, 15 dB, and 30 dB.

Time-frequency plots: original source 1; estimated source 1 at masking thresholds of 0 dB, 15 dB, and 30 dB.

Source 2: C3-viola-major trill

Audio: original source 2; estimated source 2 for each representation (STFT, CQT, CFT-best-sep, MCFT) at masking thresholds of 0 dB, 15 dB, and 30 dB.

Time-frequency plots: original source 2; estimated source 2 at masking thresholds of 0 dB, 15 dB, and 30 dB.

Mixture = C4-oboe-vibrato + C4-viola-minor trill

Source 1: C4-oboe-vibrato

Audio: original source 1; estimated source 1 for each representation (STFT, CQT, CFT-best-sep, MCFT) at masking thresholds of 0 dB, 15 dB, and 30 dB.

Time-frequency plots: original source 1; estimated source 1 at masking thresholds of 0 dB, 15 dB, and 30 dB.

Source 2: C4-viola-minor trill

Audio: original source 2; estimated source 2 for each representation (STFT, CQT, CFT-best-sep, MCFT) at masking thresholds of 0 dB, 15 dB, and 30 dB.

Time-frequency plots: original source 2; estimated source 2 at masking thresholds of 0 dB, 15 dB, and 30 dB.

Three-source Mixture

Mixture = C6-cello-vibrato + C6-piccolo trumpet-minor trill + C6-piccolo trumpet-major trill

Source 1: C6-cello-vibrato

Audio: original source 1; estimated source 1 for each representation (STFT, CQT, CFT-best-sep, MCFT) at a masking threshold of 15 dB.

Time-frequency plots: original source 1; estimated source 1 at a masking threshold of 15 dB.

Source 2: C6-piccolo trumpet-minor trill

Audio: original source 2; estimated source 2 for each representation (STFT, CQT, CFT-best-sep, MCFT) at a masking threshold of 15 dB.

Time-frequency plots: original source 2; estimated source 2 at a masking threshold of 15 dB.

Source 3: C6-piccolo trumpet-major trill

Audio: original source 3; estimated source 3 for each representation (STFT, CQT, CFT-best-sep, MCFT) at a masking threshold of 15 dB.

Time-frequency plots: original source 3; estimated source 3 at a masking threshold of 15 dB.

Four-source Mixture

Mixture = C7-piano + C7-violin-vibrato + C7-violin-minor trill + C7-violin-major trill

Source 1: C7-piano

Audio: original source 1; estimated source 1 for each representation (STFT, CQT, CFT-best-sep, MCFT) at a masking threshold of 10 dB.

Time-frequency plots: original source 1; estimated source 1 at a masking threshold of 10 dB.

Source 2: C7-violin-vibrato

Audio: original source 2; estimated source 2 for each representation (STFT, CQT, CFT-best-sep, MCFT) at a masking threshold of 10 dB.

Time-frequency plots: original source 2; estimated source 2 at a masking threshold of 10 dB.

Source 3: C7-violin-minor trill

Audio: original source 3; estimated source 3 for each representation (STFT, CQT, CFT-best-sep, MCFT) at a masking threshold of 10 dB.

Time-frequency plots: original source 3; estimated source 3 at a masking threshold of 10 dB.

Source 4: C7-violin-major trill

Audio: original source 4; estimated source 4 for each representation (STFT, CQT, CFT-best-sep, MCFT) at a masking threshold of 10 dB.

Time-frequency plots: original source 4; estimated source 4 at a masking threshold of 10 dB.

Experimental Results


In this section, we present detailed results of our separability and clusterability experiments. Box plots show the distribution of separability and clusterability values. Each box spans the range from the first to the third quartile, and the middle notch marks the median.
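The three numbers that define each box can be computed as in the minimal sketch below. The input values are made up for illustration, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption; plotting tools differ slightly in how they interpolate quartiles.

```python
import statistics


def box_plot_summary(values):
    """Return (first quartile, median, third quartile) of a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


# Made-up values, for illustration only (not results from the paper).
example_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

q1, median, q3 = box_plot_summary(example_values)
print(f"box from {q1} to {q3}, notch at {median}")
```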

Tables contain the results of statistical significance tests. The null hypothesis in every test is that the median of the results for the MCFT is less than or equal to the median for the other representation, or equivalently, that the MCFT does not provide any improvement. In the separability tests, n = number of mixtures × number of masking thresholds = 126 × 7 = 882; in the clusterability tests, n = number of mixtures × number of masking thresholds × number of similarity kernel widths = 126 × 7 × 10 = 8820. In all tables, median diff = median(MCFT) − median(other representation), so positive values indicate improved performance for the MCFT.
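The sample sizes, the median difference, and a one-sided rank sum test can be sketched as follows. This is an illustration under stated assumptions, not the code used for the tables: the test uses the normal approximation without tie correction, the function names are hypothetical, and the two input lists are made-up numbers. In practice a library routine such as SciPy's `scipy.stats.ranksums` would normally be used instead.

```python
import math
import statistics

# Sample sizes as stated in the text.
N_SEPARABILITY = 126 * 7        # mixtures x masking thresholds
N_CLUSTERABILITY = 126 * 7 * 10  # ... x similarity kernel widths


def average_ranks(values):
    """Rank values from 1 to N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test_greater(x, y):
    """One-sided Wilcoxon rank sum test (normal approximation).

    A small p-value is evidence against the null hypothesis that
    x does not tend to take larger values than y. Returns (z, p).
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p


# Made-up scores, for illustration only (not results from the paper).
mcft_scores = [3.1, 4.2, 5.0, 6.3, 7.4]
other_scores = [1.0, 2.2, 2.9, 3.5, 4.0]

median_diff = statistics.median(mcft_scores) - statistics.median(other_scores)
z, p = rank_sum_test_greater(mcft_scores, other_scores)
print(f"median diff = {median_diff:+.2f}, z = {z:.3f}, p = {p:.4f}")
```

A positive `median_diff` together with a small `p` corresponds to the table entries where the MCFT shows a statistically significant improvement.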

Separability Results

Wilcoxon rank sum test results (n = 882).
Mixture, statistic      STFT       CQT        CFT-best-sep   CFT-best-clus
2src median diff (dB)   +2.74      +2.31      +1.57          +1.84
2src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001
3src median diff (dB)   +3.42      +3.05      +2.09          +2.47
3src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001
4src median diff (dB)   +4.22      +3.64      +2.89          +3.14
4src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001
5src median diff (dB)   +5.10      +3.64      +3.55          +3.94
5src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001

Wilcoxon rank sum test results (n = 882).
Mixture, statistic      STFT       CQT        CFT-best-sep   CFT-best-clus
2src median diff (dB)   +0.66      +0.62      −0.09          −0.13
2src p-value            ≤ 0.01     > 0.05     > 0.05         > 0.05
3src median diff (dB)   +1.64      +1.08      +0.51          +0.58
3src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001
4src median diff (dB)   +2.25      +1.59      +1.05          +1.35
4src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001
5src median diff (dB)   +2.80      +1.90      +1.60          +1.80
5src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001

Wilcoxon rank sum test results (n = 882).
Mixture, statistic      STFT       CQT        CFT-best-sep   CFT-best-clus
2src median diff (dB)   +2.93      +2.66      +1.79          +2.14
2src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001
3src median diff (dB)   +3.53      +3.10      +2.26          +2.64
3src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001
4src median diff (dB)   +4.30      +3.70      +2.96          +3.11
4src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001
5src median diff (dB)   +5.09      +3.77      +3.71          +4.09
5src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       ≤ 0.0001

Clusterability Results

Wilcoxon rank sum test results (n = 8820).
Mixture, statistic      STFT       CQT        CFT-best-sep   CFT-best-clus
2src median diff        +0.088     +0.100     +0.130         +0.003
2src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       > 0.05
3src median diff        +0.041     +0.040     +0.13          −0.031
3src p-value            ≤ 0.0001   ≤ 0.0001   ≤ 0.0001       > 0.05
4src median diff        +0.012     +0.008     +0.098         −0.057
4src p-value            > 0.05     ≤ 0.05     ≤ 0.0001       > 0.05
5src median diff        −0.009     −0.017     +0.098         −0.077
5src p-value            > 0.05     > 0.05     ≤ 0.0001       > 0.05