This course studies how a computational system can organize sound into perceptually meaningful elements. Problems in this field include source separation (splitting an audio mixture into its individual sounds), source identification (labeling a sound with its source), and streaming (grouping the sounds that belong to a single event or explanation). This is an advanced graduate course covering current research in the field.
Week | Date | Topic
---|---|---
1 | Oct 1 | Basics of audition and digital signal processing |
2 | Oct 8 | Basics of deep learning (convolutional nets, LSTMs) |
3 | Oct 15 | Auditory representations (Patterson, Shamma, Stoter, Pishdadian) |
4 | Oct 22 | Source Separation Algorithms |
5 | Oct 29 | Source Separation Algorithms |
6 | Nov 5 | Generating Audio |
7 | Nov 12 | Speech Recognition |
8 | Nov 19 | **No class: Prof. Pardo is at the DCASE workshop**
9 | Nov 26 | Audio Scene Labeling |
10 | Dec 3 | Audio Scene Labeling |
11 | Dec 10 | Attention Models |
Every two weeks, you will submit a set of four single-page overviews of papers/chapters you have read. Each overview is worth 1 point.
Twice during the term, you will be the lead discussant for a paper in class. This means you haven't just read the paper: you've read the related work, really understand the paper, and can give a brief presentation of it (including slides) and then lead a discussion about it.
Each week (even weeks when you're not presenting), you're expected to show up, having read the papers, and be ready to discuss their ideas. Every week you attend and substantially contribute to the discussion, you earn 2 points; if you don't show up, or don't say anything that week, you don't get the 2 points.
The final assignment in the class is to write the literature review and introduction for a research paper, formatted in the ACM style.
Bregman’s Auditory Scene Analysis, Chapter 1
Brown and Wang’s Computational Auditory Scene Analysis, Chapter 1
Chapter 4 from Machine Learning
Convolutional Networks for Images, Speech, and Time-Series
stretch goal: Chapter 9 of Deep Learning: Convolutional Networks
stretch goal: Long Short-Term Memory
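Since weeks 1–2 cover the signal-processing and deep-learning background, here is a minimal sketch of the magnitude spectrogram, the front-end representation most of the papers below assume. This is a generic illustration (the function name and parameters are our own, assuming NumPy), not code from any of the readings.

```python
import numpy as np

def stft_magnitude(x, win_len=1024, hop=256):
    """Magnitude spectrogram: Hann-windowed frames -> FFT -> absolute value."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of the real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq bins, frames)
```

With these defaults, one second of 16 kHz audio yields a 513 × 59 matrix (513 frequency bins, 59 frames).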
Prem: Summary statistics in auditory perception
Fatemeh: Multiresolution spectrotemporal analysis of complex sounds
Fatemeh: Multi-resolution Common Fate Transform
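The Shamma-style multiresolution models and the Common Fate Transform above both build on 2-D modulation analysis of the spectrogram. As a rough intuition only (a toy sketch, not the papers' actual multiresolution pipelines), a 2-D Fourier transform of a magnitude spectrogram already exposes this structure:

```python
import numpy as np

def spectrogram_2dft(S):
    """2-D FFT of a magnitude spectrogram; the axes become spectral 'scale'
    and temporal 'rate' modulations, with low modulations shifted to center."""
    return np.fft.fftshift(np.fft.fft2(S))
```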
Prof Pardo: Recovering sound sources from embedded repetition
Madhav: REPET
Prem: 2DFT
Conway: Algorithms for Non-negative Matrix Factorization (see the NMF sketch below)
Willl (3 L’s is cool): Learning mid-level auditory codes from natural sound statistics
Simultaneous Separation and Segmentation in Layered Music
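For Conway's NMF reading: the core of Lee and Seung's method is a pair of multiplicative updates. Below is a minimal sketch for the Frobenius-norm objective ||V − WH||²_F (our own toy implementation, assuming NumPy); applied to a non-negative magnitude spectrogram V, the columns of W act as spectral templates and the rows of H as their activations over time.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-10):
    """Lee & Seung multiplicative updates for V ~ W @ H with V, W, H >= 0."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        # each update is guaranteed not to increase ||V - WH||_F^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```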
Nathan: Deep clustering: Discriminative embeddings for segmentation and separation
Brian: Deep Attractor Network for Single-Microphone Speaker Separation
Alternative Objective Functions for Deep Clustering
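The deep clustering papers above train embeddings V so that time-frequency bins belonging to the same source cluster together; the objective is ||VVᵀ − YYᵀ||²_F for one-hot source labels Y. Here is a sketch of that loss using the standard low-rank expansion that avoids forming the huge outer products (our own NumPy rendering, not the authors' code):

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """||V V^T - Y Y^T||_F^2 for embeddings V (TF bins x D) and one-hot
    source labels Y (TF bins x C), computed via the small Gram matrices."""
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)
```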
Vyas: WaveNet: A Generative Model for Raw Audio
Mike: Deep Voice: Real-time Neural Text-to-Speech
Max: Parallel Wavenet
Brian: Deep Cross-Modal Audio-Visual Generation
SampleRNN: Bengio’s take on generating speech
Tacotron 2: Google's 2018 speech generation system
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Lyrebird commercializes deep learning speech generation
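The shared mechanical idea behind WaveNet and its descendants above is the dilated causal convolution: each output sample sees only past samples, and stacking exponentially growing dilations gives a large receptive field cheaply. A minimal one-filter sketch (our own illustration, assuming NumPy; the real models add gating, residual connections, and many channels):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_i w[i] * x[t - i*dilation], with zeros for t < 0 (causal)."""
    pad = dilation * (len(w) - 1)
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[i] * xp[t + pad - i * dilation]
                         for i in range(len(w)))
                     for t in range(len(x))])
```

With filter length 2 and dilations 1, 2, 4, …, 512, one such stack has a receptive field of 1024 samples.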
Vyas: Highway Long Short-Term Memory RNNs for Distant Speech Recognition
Mike: Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
The Rabiner HMM tutorial (useful for CTC understanding)
Intuitively understanding Connectionist Temporal Classification (see the collapse-map sketch after this list)
Towards End-to-End Speech Recognition with Recurrent Neural Networks (the precursor to the convolutional speech recognition paper above)
Highway Networks (useful for understanding Highway LSTMs)
Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models
Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition
Analysis of I-vector Length Normalization in Speaker Recognition Systems
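For the CTC material above (the Rabiner tutorial and the "Intuitively understanding CTC" post): CTC scores all frame-level label paths that collapse to the target transcription, where collapsing merges adjacent repeated labels and then removes blanks. The collapse map itself is tiny (a sketch in our own notation, with 0 as the blank symbol):

```python
def ctc_collapse(path, blank=0):
    """CTC's many-to-one map B: merge adjacent repeats, then drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# A blank between repeats is what lets CTC emit doubled labels:
assert ctc_collapse([0, 1, 1, 0, 1, 2, 2]) == [1, 1, 2]
```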
Sean: SoundNet: Learning Sound Representations from Unlabeled Video
BongJun: A Human-in-the-Loop System for Sound Event Detection and Annotation
Nathan: Unsupervised Learning of Semantic Audio Representations
BongJun: Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection (see the post-processing sketch after this list)
Will: CNN Architectures for Large-Scale Audio Classification
(VGG Net) Very Deep Convolutional Networks for Large-Scale Image Recognition
(Alex Net) ImageNet Classification with Deep Convolutional Neural Networks
(Inception) Rethinking the Inception Architecture for Computer Vision
Conway: (ResNet) Deep Residual Learning for Image Recognition
Madhav: Siamese Style Convolutional Neural Networks for Sound Search by Vocal Imitation
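A recurring detail in the sound event detection papers above (e.g., the CRNN paper): the network emits per-frame, per-class probabilities, and a post-processing step turns them into event intervals. A minimal thresholding sketch (our own illustration; published systems typically add median filtering and per-class thresholds):

```python
def framewise_to_events(probs, threshold=0.5, hop_seconds=0.02):
    """Convert one class's per-frame probabilities into (onset, offset)
    pairs in seconds, opening an event when the probability crosses the
    threshold and closing it when it falls back below."""
    events, onset = [], None
    for t, p in enumerate(probs):
        if p >= threshold and onset is None:
            onset = t
        elif p < threshold and onset is not None:
            events.append((onset * hop_seconds, t * hop_seconds))
            onset = None
    if onset is not None:  # event still active at the end of the clip
        events.append((onset * hop_seconds, len(probs) * hop_seconds))
    return events
```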
Sean: Describing Videos by Exploiting Temporal Structure
Fatemeh: Modeling attention-driven plasticity in auditory cortical receptive fields
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
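All three attention readings share one mechanism: score each input element against the model's current state, softmax the scores, and take the weighted sum as a context vector. A generic dot-product soft-attention sketch (our own simplification, assuming NumPy; the papers use learned scoring functions and convolutional features):

```python
import numpy as np

def soft_attention(query, keys, values):
    """Weight each value by the softmax similarity of its key to the query.
    keys: (T, D), query: (D,), values: (T, Dv) -> context vector (Dv,)."""
    scores = keys @ query                 # one score per input element
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the T elements
    return weights @ values               # attention-weighted context
```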