Simultaneous Separation and Segmentation in Layered Music

Prem Seetharaman and Bryan Pardo, ISMIR 2016

This notebook demonstrates the separation/segmentation algorithm from the following paper:

http://music.cs.northwestern.edu/publications/seetharaman_pardo_ismir16.pdf

Seetharaman, Prem, and Bryan Pardo. "Simultaneous separation and segmentation in layered music." Proc. of the 17th International Society for Music Information Retrieval Conference (ISMIR). New York City, NY, USA, 2016.

Background

Source separation

In [30]:
import librosa
from functions import *
from separate import *
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
%reload_ext autoreload
%autoreload 2
audio_path = 'demo_songs/myth.mp3'
audio_signal, sr, audio_stft = load_file(audio_path)

display_spectrogram_and_audio(audio_stft, audio_signal, sr, 'Myth - Beach House')

The goal of source separation is to separate a mixture (such as the one above) into its constituent parts. What counts as a part is defined by the listener's perception of the mixture. Here's an example of source separation using one common technique - harmonic/percussive source separation [1].

In [17]:
audio_hpss = librosa.effects.hpss(audio_signal)
titles = ['Harmonic part of mixture', 'Percussive part of mixture']
for a,t in zip(audio_hpss, titles):
    a_stft = librosa.stft(a)
    display_spectrogram(a_stft, t, 'time', 'log')

disp_audio(audio_hpss, titles)

Source separation algorithms rely on different auditory cues. Harmonic/percussive source separation, as above, relies on the observation that percussive sources are short in time but broad in frequency, while harmonic sources are long in time and narrow in frequency. Separating the vertical lines (percussive) from the horizontal lines (harmonic) in the spectrogram representation above yields harmonic/percussive source separation. Other source separation approaches rely on other auditory cues.
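The vertical-versus-horizontal cue can be sketched with median filtering on a toy magnitude spectrogram (a minimal sketch of the idea behind `librosa.effects.hpss`, not its actual implementation): filtering along time keeps horizontal (harmonic) structure, filtering along frequency keeps vertical (percussive) structure.

```python
import numpy as np
import scipy.ndimage

# Toy magnitude spectrogram: one horizontal line (harmonic-like)
# plus one vertical line (percussive-like).
S = np.zeros((64, 64))
S[20, :] = 1.0   # harmonic: narrow in frequency, long in time
S[:, 40] = 1.0   # percussive: broad in frequency, short in time

# Median-filter along time to enhance horizontal structure, and along
# frequency to enhance vertical structure.
H = scipy.ndimage.median_filter(S, size=(1, 17))  # harmonic estimate
P = scipy.ndimage.median_filter(S, size=(17, 1))  # percussive estimate

# Assign each time-frequency bin to whichever estimate dominates.
harmonic_mask = H >= P
percussive_mask = P > H
```

The horizontal line survives only in `H` and the vertical line only in `P`, which is exactly the separation heard in the harmonic/percussive audio above.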

Non-negative matrix factorization

Nonnegative Matrix Factorization (NMF) is a popular source separation method. It factorizes the magnitude spectrogram into a dictionary of spectral templates and their activations over time. Source separation with NMF requires something external to the NMF algorithm to group the spectral templates by source; once grouped, the templates can be used to separate the associated sources from the mixture.
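The factorization itself can be sketched with scikit-learn's NMF (an assumption for illustration - `find_template` below may use a different NMF variant): a nonnegative magnitude spectrogram V is approximated as V ≈ WH, where the columns of W are spectral templates and the rows of H are their activations in time.

```python
import numpy as np
from sklearn.decomposition import NMF

# Build a toy magnitude "spectrogram" from two known spectral templates.
rng = np.random.default_rng(0)
true_templates = np.zeros((100, 2))
true_templates[10:15, 0] = 1.0   # narrowband template
true_templates[:, 1] = 0.2       # broadband template
true_activations = rng.random((2, 50))
V = true_templates @ true_activations

# Factorize V ~= W H with W, H >= 0.
model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(V)   # spectral templates, shape (100, 2)
H = model.components_        # activations, shape (2, 50)

reconstruction = W @ H       # approximates V
```

The number of components is a free parameter; here it matches the two templates the toy data was built from.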

Let's look at the part of the above mixture where the "shk" sound is isolated:

In [31]:
start = 0 #in seconds
stop = .5
boundaries = [int(x) for x in librosa.samples_to_frames([start*sr, stop*sr])]

display_spectrogram_and_audio(audio_stft[:, boundaries[0]:boundaries[1]], 
                              audio_signal[int(sr*start):int(sr*stop)], 
                              sr, 
                              'Mixture spectrogram')

The percussive noise burst can be seen in the spectrogram above. It's that wide-band energy between 232ms and 358ms. Let's build a source model of this percussive noise burst using NMF:

In [33]:
templates, activations = find_template(audio_stft, sr, 1, 2, boundaries[0], boundaries[1])
display_spectrogram(templates, 'Spectral templates', None, 'log', components = True)
display_spectrogram(activations, 'Activations', 'time', None, activations = True)
display_spectrogram(templates.dot(activations), 'Reconstruction', 'time', 'log')
template, residual = extract_template(templates, audio_stft[:, boundaries[0]:boundaries[1]])
display_spectrogram_and_audio(template, librosa.istft(template), sr, 'Reconstructed from mixture')

The learned templates make sense: broadband noise of varying density. The noise bursts activate at the same times as in the original spectrogram. Multiplying the templates by the activations reconstructs the original spectrogram. Using the reconstruction as a mask on the original spectrogram (so we can keep the phase) gives us something we can invert back to the time domain and listen to.
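The masking step can be sketched as follows (a hypothetical helper for illustration - an assumption about what `extract_template` does internally): build a soft mask from the nonnegative reconstruction, then apply it to the complex mixture STFT so the mixture phase is kept.

```python
import numpy as np

def soft_mask_separate(source_magnitude, mixture_stft, eps=1e-8):
    """Apply a soft mask built from a nonnegative source estimate to a
    complex mixture STFT, keeping the mixture phase."""
    mixture_magnitude = np.abs(mixture_stft)
    # Mask in [0, 1]: fraction of each bin's energy assigned to the source.
    mask = np.minimum(source_magnitude / (mixture_magnitude + eps), 1.0)
    source = mask * mixture_stft           # complex: mixture phase preserved
    residual = (1.0 - mask) * mixture_stft # everything left over
    return source, residual
```

By construction `source + residual` equals the original mixture STFT, so both pieces can be inverted with `librosa.istft` and summed back to the mixture.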

We see that running NMF on a part of the mixture lets us easily reconstruct that one part of the mixture. What happens if we try to reconstruct the entire mixture using just the spectral templates learned from a single part?

In [34]:
template, residual = extract_template(templates, audio_stft)
display_spectrogram_and_audio(template, librosa.istft(template), sr, 'Separating the learned source from the mixture')

It separated out just the noise burst from the entire mixture! What's the rest of it sound like?

In [35]:
display_spectrogram_and_audio(residual, librosa.istft(residual), sr, 'Everything leftover (residual)')

NMF lets us use parts of the mixture to build an understanding of the entire mixture.

Key idea: the composer will often signal to the listener how to hear the mixture by introducing compositional elements (sources or groups of sources) sequentially.

Let's look at and listen to some examples.

In [37]:
load_and_display('demo_songs/cherry.wav', 'Cherry - Ratatat')
load_and_display('demo_songs/sanfrancisco.wav', 'San Francisco - Foxygen')
load_and_display('demo_songs/shosty8.wav', 'Symphony No. 8, Mvt II - Dmitri Shostakovich')
load_and_display('demo_songs/onemoretime.wav', 'One More Time - Daft Punk')