This course studies how a computational system can organize sound into perceptually meaningful elements. Problems in this field include source separation (splitting an audio mixture into its individual sounds), source identification (labeling a sound with its source), and streaming (grouping the sounds that belong to a single event or explanation). This is an advanced graduate course covering current research in the field.
Week | Date | Topic
---|---|---
1 | Oct 1 | Basics of audition and digital signal processing |
2 | Oct 8 | Basics of deep learning (convolutional nets, LSTMs) |
3 | Oct 15 | Auditory representations (Patterson, Shamma, Stoter, Pishdadian) |
4 | Oct 22 | Source Separation Algorithms |
5 | Oct 29 | Source Separation Algorithms |
6 | Nov 5 | Generating Audio |
7 | Nov 12 | Speech Recognition |
8 | Nov 19 | **No class: Prof. Pardo is at the DCASE workshop**
9 | Nov 26 | Audio Scene Labeling |
10 | Dec 3 | Audio Scene Labeling |
11 | Dec 10 | Attention Models |
Every two weeks, you will submit a set of four single-page overviews of papers/chapters you have read. Each overview is worth 1 point.
Twice during the term, you will be the lead discussant for a paper in class. This means you haven't just read the paper: you've read the related work, really understand the paper, and can give a brief presentation of it (including slides) and then lead a discussion about it.
Each week (even weeks when you're not presenting), you're expected to show up, having read the papers, and be ready to discuss their ideas. Every week you attend and substantially contribute to the discussion, you earn 2 points; if you don't show up, or don't say anything that week, you don't get the 2 points.
The final assignment in the class is to write the literature review and introduction for a research paper, formatted in the ACM style.
Bregman’s Auditory Scene Analysis, Chapter 1
Brown and Wang’s Computational Auditory Scene Analysis, Chapter 1
Chapter 4 from Machine Learning
Convolutional Networks for Images, Speech, and Time-Series
stretch goal: Chapter 9 of Deep Learning: Convolutional Networks
stretch goal: Long Short-Term Memory
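Since weeks 1–2 cover the signal-processing and deep-learning background, here is a minimal sketch of the magnitude spectrogram, the front-end representation most of the papers below assume. This is a generic illustration (the function name and parameters are our own, assuming NumPy), not code from any of the readings.

```python
import numpy as np

def stft_magnitude(x, win_len=1024, hop=256):
    """Magnitude spectrogram: Hann-windowed frames -> FFT -> absolute value."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of the real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq bins, frames)
```

With these defaults, one second of 16 kHz audio yields a 513 × 59 matrix (513 frequency bins, 59 frames).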
Prem: Summary statistics in auditory perception
Fatemeh: Multiresolution spectrotemporal analysis of complex sounds
Fatemeh: Multi-resolution Common Fate Transform
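The Shamma-style multiresolution models and the Common Fate Transform above both build on 2-D modulation analysis of the spectrogram. As a rough intuition only (a toy sketch, not the papers' actual multiresolution pipelines), a 2-D Fourier transform of a magnitude spectrogram already exposes this structure:

```python
import numpy as np

def spectrogram_2dft(S):
    """2-D FFT of a magnitude spectrogram; the axes become spectral 'scale'
    and temporal 'rate' modulations, with low modulations shifted to center."""
    return np.fft.fftshift(np.fft.fft2(S))
```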
Prof Pardo: Recovering sound sources from embedded repetition
Madhav: REPET
Prem: 2DFT
Conway: Algorithms for Non-negative Matrix Factorization (see the NMF sketch below)
Willl (3 L’s is cool): Learning mid-level auditory codes from natural sound statistics
Simultaneous Separation and Segmentation in Layered Music
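For Conway's NMF reading: the core of Lee and Seung's method is a pair of multiplicative updates. Below is a minimal sketch for the Frobenius-norm objective ||V − WH||²_F (our own toy implementation, assuming NumPy); applied to a non-negative magnitude spectrogram V, the columns of W act as spectral templates and the rows of H as their activations over time.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-10):
    """Lee & Seung multiplicative updates for V ~ W @ H with V, W, H >= 0."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        # each update is guaranteed not to increase ||V - WH||_F^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```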
Nathan: Deep clustering: Discriminative embeddings for segmentation and separation
Brian: Deep Attractor Network for Single-Microphone Speaker Separation
Alternative Objective Functions for Deep Clustering
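The deep clustering papers above train embeddings V so that time-frequency bins belonging to the same source cluster together; the objective is ||VVᵀ − YYᵀ||²_F for one-hot source labels Y. Here is a sketch of that loss using the standard low-rank expansion that avoids forming the huge outer products (our own NumPy rendering, not the authors' code):

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """||V V^T - Y Y^T||_F^2 for embeddings V (TF bins x D) and one-hot
    source labels Y (TF bins x C), computed via the small Gram matrices."""
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)
```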
Vyas: WaveNet: A Generative Model for Raw Audio
Mike: Deep Voice: Real-time Neural Text-to-Speech
Max: Parallel Wavenet
Brian: Deep Cross-Modal Audio-Visual Generation
SampleRNN: Bengio’s take on generating speech
Tacotron 2: Google's 2018 speech generation system
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Lyrebird commercializes deep learning speech generation
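The shared mechanical idea behind WaveNet and its descendants above is the dilated causal convolution: each output sample sees only past samples, and stacking exponentially growing dilations gives a large receptive field cheaply. A minimal one-filter sketch (our own illustration, assuming NumPy; the real models add gating, residual connections, and many channels):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_i w[i] * x[t - i*dilation], with zeros for t < 0 (causal)."""
    pad = dilation * (len(w) - 1)
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[i] * xp[t + pad - i * dilation]
                         for i in range(len(w)))
                     for t in range(len(x))])
```

With filter length 2 and dilations 1, 2, 4, …, 512, one such stack has a receptive field of 1024 samples.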
Vyas: Highway Long Short-Term Memory RNNs for Distant Speech Recognition
Mike: Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
The Rabiner HMM tutorial (useful for CTC understanding)
Intuitively understanding Connectionist Temporal Classification (see the collapse-map sketch after this list)
Towards End-to-End Speech Recognition with Recurrent Neural Networks (the precursor to the convolutional speech recognition paper above)
Highway Networks (useful for understanding Highway LSTMs)
Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models
Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition
Analysis of I-vector Length Normalization in Speaker Recognition Systems
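For the CTC material above (the Rabiner tutorial and the "Intuitively understanding CTC" post): CTC scores all frame-level label paths that collapse to the target transcription, where collapsing merges adjacent repeated labels and then removes blanks. The collapse map itself is tiny (a sketch in our own notation, with 0 as the blank symbol):

```python
def ctc_collapse(path, blank=0):
    """CTC's many-to-one map B: merge adjacent repeats, then drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# A blank between repeats is what lets CTC emit doubled labels:
assert ctc_collapse([0, 1, 1, 0, 1, 2, 2]) == [1, 1, 2]
```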
Sean: SoundNet: Learning Sound Representations from Unlabeled Video
BongJun: A Human-in-the-Loop System for Sound Event Detection and Annotation
Nathan: Unsupervised Learning of Semantic Audio Representations
BongJun: Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection (see the post-processing sketch after this list)
Will: CNN Architectures for Large-Scale Audio Classification
(VGG Net) Very Deep Convolutional Networks for Large-Scale Image Recognition
(Alex Net) ImageNet Classification with Deep Convolutional Neural Networks
(Inception) Rethinking the Inception Architecture for Computer Vision
Conway: (ResNet) Deep Residual Learning for Image Recognition
Madhav: Siamese Style Convolutional Neural Networks for Sound Search by Vocal Imitation
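A recurring detail in the sound event detection papers above (e.g., the CRNN paper): the network emits per-frame, per-class probabilities, and a post-processing step turns them into event intervals. A minimal thresholding sketch (our own illustration; published systems typically add median filtering and per-class thresholds):

```python
def framewise_to_events(probs, threshold=0.5, hop_seconds=0.02):
    """Convert one class's per-frame probabilities into (onset, offset)
    pairs in seconds, opening an event when the probability crosses the
    threshold and closing it when it falls back below."""
    events, onset = [], None
    for t, p in enumerate(probs):
        if p >= threshold and onset is None:
            onset = t
        elif p < threshold and onset is not None:
            events.append((onset * hop_seconds, t * hop_seconds))
            onset = None
    if onset is not None:  # event still active at the end of the clip
        events.append((onset * hop_seconds, len(probs) * hop_seconds))
    return events
```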
Sean: Describing Videos by Exploiting Temporal Structure
Fatemeh: Modeling attention-driven plasticity in auditory cortical receptive fields
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
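All three attention readings share one mechanism: score each input element against the model's current state, softmax the scores, and take the weighted sum as a context vector. A generic dot-product soft-attention sketch (our own simplification, assuming NumPy; the papers use learned scoring functions and convolutional features):

```python
import numpy as np

def soft_attention(query, keys, values):
    """Weight each value by the softmax similarity of its key to the query.
    keys: (T, D), query: (D,), values: (T, Dv) -> context vector (Dv,)."""
    scores = keys @ query                 # one score per input element
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the T elements
    return weights @ values               # attention-weighted context
```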