COMPUTER AUDITION (AKA Computational Auditory Scene Analysis), Northwestern University EECS 495-??? fall 2018
Location: Technological Institute L168
Day/Time: Mondays, 3:00pm - 5:50pm
Instructor: Bryan Pardo
Course Description
This course studies how a computational system can organize sound into perceptually meaningful elements. Problems in this field include source separation (splitting audio mixtures into individual sounds), source identification (labeling a source sound), and streaming (determining which sounds belong to a single explanation/event). This is an advanced graduate course covering current research in the field.
Course Calendar
Week | Date | Topic |
---|---|---|
1 | Oct 1 | Basics of audition and digital signal processing |
2 | Oct 8 | Basics of deep learning (convolutional nets, LSTMs) |
3 | Oct 15 | Auditory representations (Patterson, Shamma, Stoter, Pishdadian) |
4 | Oct 22 | Source Separation Algorithms |
5 | Oct 29 | Source Separation Algorithms |
6 | Nov 5 | Generating Audio |
7 | Nov 12 | Speech Recognition |
8 | Nov 19 | ** NO CLASS: PROF. PARDO IS AT DCASE WORKSHOP ** |
9 | Nov 26 | Audio Scene Labeling |
10 | Dec 3 | Audio Scene Labeling |
11 | Dec 10 | Attention Models |
Course assignments
Written Paper reviews: 20 points
Every 2 weeks, you will submit a set of 4 single-page overviews of papers/chapters you read. Each review will be worth 1 point.
Class Paper Presentations: 20 points
Twice during the term, you will be the lead person discussing a paper in class. This means you haven’t just read the paper: you’ve read related work, you really understand it, and you can give a brief presentation of the paper (including slides) and then lead a discussion about it.
Class participation: 20 points
Each week (even weeks when you’re not presenting), you’re expected to show up, have read the papers, and be ready to discuss the ideas. Every week you show up and substantially contribute to the discussion, you get 2 points. If you don’t show up, or don’t say anything that week, you don’t get the 2 points.
Written Literature review and intro: 40 points
The final assignment in the class is to write a literature review and introduction for a research paper, formatted in the ACM style.
Course Reading
Week 1: Oct 1
Bregman’s Auditory Scene Analysis, Chapter 1
Brown and Wang’s Computational Auditory Scene Analysis, Chapter 1
Week 2: Oct 8
Chapter 4 from Machine Learning
Convolutional Networks for Images, Speech, and Time-Series
stretch goal: Chapter 9 of Deep Learning: Convolutional Networks
stretch goal: Long Short-Term Memory
Week 3: Oct 15 Auditory Representations
Prem: Summary statistics in auditory perception
Fatemeh: Multiresolution spectrotemporal analysis of complex sounds
Fatemeh: Multi-resolution Common Fate Transform
Week 4: Oct 22 More Auditory Representations & Source Separation
Prof Pardo: Recovering sound sources from embedded repetition
Madhav: REPET
Prem: 2DFT
Week 5: Oct 29 Source Separation Algorithms
Non-negative matrix factorization
Conway: Algorithms for Non-negative Matrix Factorization
Willl (3 L’s is cool): Learning mid-level auditory codes from natural sound statistics
Simultaneous Separation and Segmentation in Layered Music
Deep embedding space source separation
Nathan: Deep clustering: Discriminative embeddings for segmentation and separation
Brian: Deep Attractor Network for Single-Microphone Speaker Separation
additional materials
Alternative Objective Functions for Deep Clustering
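To get a concrete feel for the NMF readings above, here is a minimal sketch of Lee and Seung's multiplicative updates for the Euclidean objective, applied to a toy "spectrogram" built from two spectral templates. This is an illustrative toy assuming NumPy, not the implementation from any of the listed papers; in practice you'd run NMF on the magnitude STFT of a real mixture.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Factor a nonnegative matrix V ~= W @ H using Lee & Seung's
    multiplicative updates for the Euclidean (Frobenius) objective."""
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, rank)) + eps   # spectral basis vectors (freq x rank)
    H = rng.random((rank, m)) + eps   # activations over time (rank x time)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H

# Toy "spectrogram": a nonnegative mixture of two spectral templates
rng = np.random.default_rng(1)
basis = rng.random((64, 2))   # two spectral shapes (64 frequency bins)
acts = rng.random((2, 100))   # their activations over 100 time frames
V = basis @ acts              # exactly rank-2, so NMF can fit it well

W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)  # relative error
```

Because the updates are multiplicative, W and H stay nonnegative throughout, which is what lets the learned columns of W be read as spectral templates of individual sources.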
Week 6: Nov 5 Generating audio
Wavenet and descendants
Vyas: WaveNet: A Generative Model for Raw Audio
Mike: Deep Voice: Real-time Neural Text-to-Speech
Max: Parallel Wavenet
Also of interest
Brian: Deep Cross-Modal Audio-Visual Generation
Wavenet alternatives
SampleRNN: Bengio’s take on generating speech
Tacotron 2 is the 2018 speech generation system from Google
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Additional materials
Lyrebird commercializes deep learning speech generation
Week 7: Nov 12 Speech Recognition: traditional
Vyas: Highway Long Short-Term Memory RNNs for Distant Speech Recognition
Mike: Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
Additional materials
The Rabiner HMM tutorial (useful for CTC understanding)
Intuitively understanding Connectionist Temporal Classification
Towards End-to-End Speech Recognition with Recurrent Neural Networks (the precursor to the recognition with convolutional neural nets paper)
Highway Networks (useful for understanding Highway LSTMs)
Week 8: Nov 19 ** NO CLASS BUT GO AHEAD AND READ MORE ON SPEECH, IF YOU’RE INTERESTED **
Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models
Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition
Analysis of I-vector Length Normalization in Speaker Recognition Systems
Week 9: Nov 26 General Audio Scene Labeling
Sean: SoundNet: Learning Sound Representations from Unlabeled Video
BongJun: A Human-in-the-Loop System for Sound Event Detection and Annotation
Nathan: Unsupervised Learning of Semantic Audio Representations
Week 10: Dec 3 More Scene labeling + Interactive Sound Search
BongJun: Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Comparing architectures for scene labeling
Will: CNN Architectures for Large-Scale Audio Classification
Additional materials
(VGG Net) Very Deep Convolutional Networks for Large-Scale Image Recognition
(Alex Net) ImageNet Classification with Deep Convolutional Neural Networks
(Inception) Rethinking the Inception Architecture for Computer Vision
Conway: (ResNet) Deep Residual Learning for Image Recognition
Query by example audio search systems
Madhav: Siamese Style Convolutional Neural Networks for Sound Search by Vocal Imitation
Week 11: Dec 10 Attention Models
Sean: Describing Videos by Exploiting Temporal Structure
Fatemeh: Modeling attention-driven plasticity in auditory cortical receptive fields
Additional Materials
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention