DEEP LEARNING: Northwestern University CS 396/496 FALL 20
Location
ONLINE ON ZOOM (meeting info available on Canvas)
Class Day/Time
Tuesday/Thursday, 4:20pm - 5:40pm Central Time
Office Hours
Mondays 4:00pm - 6:00pm Central Time
Instructor
Course Description
We will study deep learning architectures: perceptrons, multi-layer perceptrons, convolutional networks, recurrent neural networks (LSTMs, GRUs), attention networks, autoencoders, and variational autoencoders (VAEs). Students will read original research papers that describe the algorithms and how they have been applied to fields like computer vision, machine translation, automatic speech recognition, and audio event recognition. They will complete a final project that demonstrates what they have learned.
Course Policies
Questions outside of class
Please use CampusWire for class-related questions.
Grading Policy
You will be graded on a 100 point scale (e.g. 93 to 100 = A, 90-92 = A-, 87-89 = B+, 83-86 = B, 80-82 = B-, and so on).
Homework and reading assignments are solo assignments and must be original work.
Final projects are group assignments and all members of a group will share a grade for all parts of the assignment.
Submitting assignments
Assignments must be submitted on the due date by the time specified on Canvas. If you are worried you can’t finish on time, upload a safety submission an hour early with what you have. I will grade the most recent item submitted before the deadline. Late submissions will not be graded.
Class participation for up to 10 points of extra credit.
Students can earn up to 10 points (a full letter grade) of extra credit with class participation.
Participation during lecture: There are 20 lectures this term. You will be asked to select 2 lectures for which you will be on-call. In your on-call lectures, I will feel free to call on you and will expect that you’ve done the relevant reading prior to lecture and will be able to engage in meaningful interaction on the lecture topic. Each on-call day will be worth up to 3 points.
CampusWire reputation: We will track student CampusWire reputation scores. Those in the top 25% earn 4 points, top 50% earn 3 points, top 75% earn 2 points, and bottom 25% earn 1 point.
No additional extra credit beyond the 10% for class participation will be provided. No requests for extra-extra credit will be considered.
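As a rough illustration of how the participation rules above add up, here is a minimal Python sketch. The function names and the `rank_fraction` input (a student's position in the CampusWire reputation ranking, with 0.0 as the highest) are hypothetical, and treating the quartile boundaries as inclusive is an assumption; this is not the official grading code.

```python
def campuswire_points(rank_fraction: float) -> int:
    """Points for CampusWire reputation.

    rank_fraction is a hypothetical input: 0.0 means the highest
    reputation in the class, 1.0 the lowest. Inclusive quartile
    boundaries are an assumption, not stated in the policy.
    """
    if not 0.0 <= rank_fraction <= 1.0:
        raise ValueError("rank_fraction must be between 0 and 1")
    if rank_fraction <= 0.25:
        return 4  # top 25%
    if rank_fraction <= 0.50:
        return 3  # top 50%
    if rank_fraction <= 0.75:
        return 2  # top 75%
    return 1      # bottom 25%


def participation_extra_credit(on_call_scores, rank_fraction: float) -> int:
    """Total extra credit: two on-call lectures (up to 3 points each)
    plus CampusWire points, for a maximum of 3 + 3 + 4 = 10."""
    if len(on_call_scores) != 2:
        raise ValueError("expected scores for exactly 2 on-call lectures")
    if any(not 0 <= score <= 3 for score in on_call_scores):
        raise ValueError("each on-call score must be between 0 and 3")
    return sum(on_call_scores) + campuswire_points(rank_fraction)


# Example: on-call scores of 3 and 2, reputation in the second quartile.
print(participation_extra_credit([3, 2], 0.40))  # 3 + 2 + 3 = 8
```

The two components are capped independently, so the 10-point maximum is only reached with full marks on both on-call days and a top-quartile reputation score.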
Course Calendar
| Week | Day and Date | Topic (tentative) | Deliverable | Points |
|------|--------------|-------------------|-------------|--------|
| 1 | Thu Sep 17 | Course basics | | |
| 2 | Tue Sep 22 | The perceptron | Readings | 9 |
| 2 | Thu Sep 24 | Basic PyTorch, Combining perceptrons | | |
| 3 | Tue Sep 29 | Basics of optimization | Readings | 9 |
| 3 | Thu Oct 1 | Backpropagation of loss through networks | | |
| 4 | Tue Oct 6 | Using TensorBoard and Lightning | Homework 1 | 10 |
| 4 | Thu Oct 8 | Convolutional Filters & Pooling | | |
| 5 | Tue Oct 13 | Convolutional Filters & Pooling | Readings | 9 |
| 5 | Thu Oct 15 | Regularization, more loss functions | | |
| 6 | Tue Oct 20 | Adversarial Attacks | Homework 2 | 10 |
| 6 | Thu Oct 22 | The final project | | |
| 7 | Tue Oct 27 | Recurrent networks, LSTM & GRUs | | |
| 7 | Thu Oct 29 | Recurrent networks, LSTM & GRUs | Project proposal | 5 |
| 8 | Tue Nov 3 | Example recurrent net: source separation | | |
| 8 | Thu Nov 5 | Language models | Project Readings | 9 |
| 9 | Tue Nov 10 | Attention networks | | |
| 9 | Thu Nov 12 | Transformers | Detailed Project Plan | 5 |
| 10 | Tue Nov 17 | BERT and GPT | | |
| 10 | Thu Nov 19 | Deep reinforcement learning | Project Meeting | 5 |
| 11 | Tue Nov 24 | Deep reinforcement learning | | |
| 11 | Thu Nov 26 | THANKSGIVING | | |
| 12 | Tue Dec 1 | Individual project meetings | Readings | 9 |
| 12 | Thu Dec 3 | Individual project meetings | | |
| 13 | Tue Dec 8 | no class, finals week | | |
| 13 | Thu Dec 10 | FINAL PROJECTS DUE | Final website & paper | 20 |
Links
Helpful Programming Packages
Anaconda is the most popular Python distribution for machine learning.
PyTorch is Facebook’s popular deep learning package. My lab uses this.
Lightning makes PyTorch easier to use.
TensorBoard is what my lab uses to visualize how experiments are going.
TensorFlow is Google’s most popular Python DNN package.
Keras is a nice programming API that works with TensorFlow.
JAX is an alpha package from Google that allows differentiation of NumPy code and also provides an optimizing compiler for working on tensor processing units.
Trax is Google Brain’s DNN package. It focuses on transformers and is implemented on top of JAX.
MXNet is Apache’s open source DL package.
Helpful Books on Deep Learning
Deep Learning is THE book on Deep Learning. One of the authors won the Turing Award for his work on deep learning.
Dive Into Deep Learning provides example code and instruction for how to write DL models in Pytorch, Tensorflow and MXNet.
Computing Resources
Google’s Colab offers free GPU time and a nice environment for running Jupyter notebook-style projects. For $10 per month, you also get priority access to GPUs and TPUs.
Amazon’s SageMaker offers hundreds of free hours for newbies.
The CS Department Wilkinson Lab just got 22 new machines that each have a graphics card suitable for deep learning, and should be remote-accessible and running Linux with all the Python packages needed for deep learning.
Some Datasets
Kaggle has many useful datasets.
Zenodo also has many useful datasets.
The Imagenet Data: An image database organized according to the WordNet hierarchy in which each node of the hierarchy is depicted by hundreds and thousands of images. Very widely used.
The CIFAR Datasets: The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. Very widely used.
The LibriSpeech Data Set: A corpus of approximately 1000 hours of 16kHz read English speech.
The WikiText Data Set: A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
Lecture Slides and Notebooks
Lectures
gradient descent and backpropagation of error
Jupyter Notebooks
Course Reading
The History

The Organization of Behavior: Hebb’s 1949 book that provides a general framework for relating behavior to synaptic organization through the dynamics of neural networks.

The Perceptron: This is the 1st neural networks paper, published in 1958. The algorithm won’t be obvious, but the thinking is interesting and the conclusions are worth reading.

The Perceptron: A Perceiving and Recognizing Automaton: This is an earlier paper by Rosenblatt that is, perhaps, even more historical than the 1958 paper and, I think, a bit easier for an engineer to follow.
The basics (1st reading topic)

* Chapter 4 of Machine Learning: This is Tom Mitchell’s book. Historical overview + explanation of backprop of error. It’s a good starting point for actually understanding deep nets. START HERE. IT’S WORTH 2 READINGS.

Chapter 6 of Deep Learning: Modern intro on deep nets. To me, this is harder to follow than Chapter 4 of Machine Learning, though. Certainly, it’s longer.
Optimization (2nd reading topic)

This reading is NOT worth points, but if you don’t know what a gradient, Jacobian or Hessian is, you should read this before you read Chapter 4 of the Deep Learning book.

Chapter 4 of the Deep Learning Book: This covers basics of gradient-based optimization. Start here.

Chapter 8 of the Deep Learning Book: This covers optimization. This should come 2nd.

Why Momentum Really Works: Reading this will help you understand the popular Adam optimizer better.

On the Difficulty of Training Recurrent Neural Networks: A 2013 paper that explains vanishing and exploding gradients.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift: This is the most common approach to normalization.

AutoClip: Adaptive Gradient Clipping for Source Separation Networks is a recent paper out of Pardo’s lab that helps deal with unruly gradients. There’s also a video for this one.
Convolutional Networks (3rd reading topic)

Generalization and Network Design Strategies: The original 1989 paper where LeCun describes Convolutional networks. Start here.
Regularization and overfitting (4th reading topic)

Chapter 7 of the Deep Learning Book: Covers regularization.

Dropout: A Simple Way to Prevent Neural Networks from Overfitting: Explains a widely used regularizer.

Understanding deep learning requires rethinking generalization: Considers the question “why aren’t deep nets overfitting even more than they seem to be?”

The Implicit Bias of Gradient Descent on Separable Data: A study of bias that comes from the algorithm rather than the dataset.
Experimental Design
Visualizing and understanding network representations

Visualizing and Understanding Convolutional Networks: How do you see what the net is thinking? Here’s one way.

Local Interpretable Model-Agnostic Explanations (LIME): An Introduction: A technique to explain the predictions of any machine learning classifier.
Popular Architectures for Convolutional Networks
If you already understand what convolutional networks are, then here are some popular architectures you can find out about.

Deep Residual Learning for Image Recognition: The 2016 paper that introduces the popular ResNet architecture that can get 100 layers deep.

Very Deep Convolutional Networks for Large-Scale Image Recognition: The 2015 paper introducing the popular VGG architecture.

Going Deeper with Convolutions: The 2015 paper describing the Inception network architecture.
Recurrent Networks

Chapter 10 of Deep Learning: A decent starting point

The Recurrent Neural Networks Tutorial: Another good starting point

* Extensions of recurrent neural network language model: This covers the RNN language model discussed in class.

Long Short-Term Memory: The original 1997 paper introducing the LSTM.

Understanding LSTMs: A simple (maybe too simple?) walkthrough of LSTMs

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling: Compares a simplified LSTM (the GRU) to the original LSTM and also simple RNN units.
Attention networks (read these before looking at Transformers)

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) ** This is a good starting point on attention models. **

Sequence to Sequence Learning with Neural Networks: This is the paper that the link above was trying to explain.

* Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation: This paper introduces encoder-decoder networks for translation. Attention models were first built on this framework. Covered in class.

* Neural Machine Translation by Jointly Learning to Align and Translate: This paper introduces additive attention to an encoder-decoder. Covered in class.

* Effective Approaches to Attention-based Neural Machine Translation: Introduced multiplicative attention. Covered in class.

Massive Exploration of Neural Machine Translation Architectures: A 2017 paper that settles the question of which architecture is best for translation... except that the Transformer model came out that same year and upended everything. Still, a good overview of the pre-Transformer state of the art.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention: Attention started with text, but is now applied to images. Here’s an example.

Listen, Attend and Spell: Attention is also applied to speech, as per this example.

A Tutorial in TensorFlow: This walks you through how to use TensorFlow 1.X to build a neural machine translation network with attention.
Transformer networks (Don’t read until you understand attention models)

The Illustrated Transformer: A good walkthrough that helps a lot with understanding transformers ** I’d start with this one to learn about transformers.**

The Annotated Transformer: An annotated walkthrough of the “Attention is All You Need” paper, complete with detailed python implementation of a transformer.

Attention is All You Need: The paper that introduced transformers, which are a popular and more complicated kind of attention network.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: A widely used language model based on Transformer encoder blocks.

The Illustrated GPT-2: A good overview of GPT-2 and its relation to Transformer decoder blocks.
Adversarial examples and Generative Adversarial Networks (GANs)

Explaining and Harnessing Adversarial Examples: This paper got the ball rolling by pointing out how to make images that look good but are consistently misclassified by trained deep nets.

2016 Tutorial on Generative Adversarial Networks by one of the creators of the GAN.

Generative Adversarial Nets: The paper that introduced GANs

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images: This paper shows just how screwy you can make an image and still have it misclassified by a “well trained, highly accurate” image recognition deep net.
Reinforcement Learning

Reinforcement Learning: An Introduction, Chapters 3 and 6: This gives you the basics of what reinforcement learning (RL) is about.

Playing Atari with Deep Reinforcement Learning: A key paper that showed how reinforcement learning can be used with deep nets.

Mastering the game of Go with deep neural networks and tree search: A famous paper that showed how RL + Deepnets = the best Go player in existence at the time.

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play: This is the AlphaZero paper. AlphaZero is the best Go player... and a great chess player.
Autoencoders
 Chapter 14 of the Deep Learning book: Autoencoders
Variational Auto Encoders (VAEs)

Variational Autoencoders (VAEs) for Dummies: This is a step by step tutorial for making a VAE using Keras.

Variational Inference, a Review for Statisticians: This explains the math behind variational inference and why one would use variational inference instead of Gibbs sampling.

Tutorial on Variational Autoencoders: This is a walkthrough of the math of VAEs.