


Class Day/Time

Wednesdays 4-7pm Central time

Office Hours

By appointment


Bryan Pardo

Course Description

Deep learning is a branch of machine learning based on algorithms that try to model high-level abstract representations of data by using multiple processing layers with complex structures. One of the most exciting areas of research in deep learning is that of generative models. Today’s generative models create text documents, write songs, make paintings and videos, and generate speech. This course is dedicated to understanding the inner workings of the technologies that underlie these advances. Students will learn about key methodologies, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Transformer-based language models. This is an advanced course that presumes a good working understanding of traditional supervised neural network technology and techniques (e.g. convolutional networks, LSTMs, loss functions, regularization, gradient descent).

Registration is by instructor permission only. This course is designed for doctoral students. Appropriately prepared BS and MS students may also be admitted, once doctoral student demand has been met.

Course Calendar


| Week | Day and Date | Topic (tentative) | Deliverable | Points |
|------|--------------|-------------------|-------------|--------|
| 1 | Wed Mar 31 | Course basics & getting to know folks | | |
| 2 | Wed Apr 7 | Transformers | | |
| 3 | Wed Apr 14 | Transformers | | |
| 4 | Wed Apr 21 | Transformers (& other autoregressive models) | 10 readings | |
| 5 | Wed Apr 28 | GANs | Initial proposal | |
| 6 | Wed May 5 | GANs | Project plan | |
| 7 | Wed May 12 | GANs | | |
| 8 | Wed May 19 | VAEs | | |
| 9 | Wed May 26 | VAEs | 10 readings | |
| 10 | Wed Jun 2 | VAEs | | |
| 11 | Wed Jun 9 | Final Projects | Project presentation and website | |

Course assignments

Reading: 40 points

You will submit 20 reviews of readings from the course website. Each will be a single-page reaction to something you read from the links provided below. Each review will be worth 2 points. Reviews are due on the schedule shown in the course calendar.

Class Paper Presentation: 20 points

Once during the term, you will be the lead person discussing the reading in class. This means you haven’t just read the paper: you’ve read related work, you really understand the paper, and you can give a 30-minute presentation of it (including slides) and then lead a discussion about it.

Class participation: 20 points

Each week (even weeks when you’re not presenting) you are expected to show up, have read the papers and be able to discuss ideas. Every week you show up and substantially contribute to the discussion, you get 2 points. Not speaking up garners 0 points, even if you’re there.

Project in generative modeling: 20 points

You will make, modify, and/or analyze some work, project or subdomain in generative modeling. This may mean modifying MusicVAE or making a story generator on top of GPT-3. It may mean downloading an existing thing and experimenting with it or it may mean building a new thing. Duplicating a paper’s results is always a great project. It could be a deep-dive literature review on a subtopic (a good first step towards writing a paper)… or something else, subject to approval of the proposal. The point breakdown for the project is as follows. There will be a maximum of 10 projects in the class. Students are encouraged to pair up.

Lecture Slides and Notebooks



Course Reading



  1. The Deep Learning Book’s Chapters on Probability and Linear Algebra: Read these before the Easy Introduction to KL Divergence.

These will help with Transformers

  1. Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention): This is a good starting-point blog post on attention, which is what Transformers are built on.

  2. Sequence to Sequence Learning with Neural Networks: This is the paper that the link above was trying to explain.

  3. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation: This introduces encoder-decoder networks for translation. Attention models were first built on this framework.

  4. Neural Machine Translation by Jointly Learning to Align and Translate: This paper introduces additive attention to an encoder-decoder. Covered in class.

  5. Effective Approaches to Attention-based Neural Machine Translation: Introduced multiplicative attention, which is what Transformers use.
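If the difference between additive and multiplicative attention feels abstract, a tiny NumPy sketch may help. This is only an illustration, not code from any of the papers above: the dimensions, the random weights, and the names `W_q`, `W_k`, and `v` are all made up for the example, and the `sqrt(d)` scaling in the multiplicative version is the variant used in Transformers.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_scores(query, keys, W_q, W_k, v):
    # Additive attention: score_i = v . tanh(W_q q + W_k k_i)
    return np.tanh(query @ W_q + keys @ W_k) @ v

def multiplicative_scores(query, keys):
    # Multiplicative attention: score_i = (q . k_i) / sqrt(d)
    return keys @ query / np.sqrt(query.shape[0])

# Hypothetical sizes and random parameters, chosen only for illustration.
rng = np.random.default_rng(0)
d, h, n = 4, 8, 5                      # key size, hidden size, sequence length
query = rng.normal(size=d)
keys = rng.normal(size=(n, d))
values = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, h))
W_k = rng.normal(size=(d, h))
v = rng.normal(size=h)

add_weights = softmax(additive_scores(query, keys, W_q, W_k, v))
mul_weights = softmax(multiplicative_scores(query, keys))

# The attention output is a weighted average of the values.
context = mul_weights @ values
```

Either scoring function yields one weight per input position. The multiplicative form needs no extra parameters beyond the projections that produce the queries and keys, and it reduces to a matrix multiply when there are many queries.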

These will help with VAEs

  1. An Easy Introduction to Kullback-Leibler (KL) Divergence: Read this before reading about ELBO.

  2. Jensen’s Inequality (an example with code): Read this before reading about ELBO.

  3. A walkthrough of Evidence Lower Bound (ELBO): This is what you optimize when you do variational inference in a VAE.

  4. Categorical Reparameterization with Gumbel-Softmax: This is a way of allowing categorical latent variables in your model while keeping sampling differentiable, so you can still backpropagate gradients through them.
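To make the Gumbel-Softmax idea concrete, here is a minimal NumPy sketch of drawing one relaxed categorical sample. It assumes the standard recipe (add Gumbel noise to the logits, divide by a temperature, apply a softmax); the function name, the example probabilities, and the temperatures are illustrative choices, not values from the paper.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    # Gumbel(0, 1) noise via the inverse-CDF trick: g = -log(-log(u)).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Perturb the logits, scale by the temperature, then apply a stable softmax.
    y = (logits + gumbel) / temperature
    y = y - y.max()
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.3])      # hypothetical categorical distribution
logits = np.log(probs)

soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.01, rng=rng)
```

Each sample is a point on the probability simplex. Lower temperatures push samples toward one-hot vectors, and the index of the largest entry is distributed according to `probs`. In a real model the softmax would be computed inside an autodiff framework so gradients can flow back to the logits.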


  1. Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models


Basic Transformer Networks (Reading for the 1st week)

  1. The Illustrated Transformer: A good initial walkthrough that helps a lot with understanding transformers. **I’d start with this one to learn about transformers.**

  2. The Annotated Transformer: An annotated walkthrough of the “Attention is All You Need” paper, complete with a detailed Python implementation of a transformer.

  3. Attention is All You Need: The paper that introduced transformers, which are a popular and more complicated kind of attention network.



Advanced Transformers (Next two weeks)

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: A widely-used language model based on Transformer encoder blocks.

  2. The Illustrated GPT-2: A good overview of GPT-2 and its relation to Transformer decoder blocks.

  3. GPT-3: Language Models are Few-Shot Learners

  4. Image GPT

  5. DALL-E: Creating images from text

  6. Learning Transferable Visual Models From Natural Language Supervision: This describes how DALL-E selects which of the many images it generates should be shown to the user.

  7. Perceiver: General Perception with Iterative Attention: A model that builds upon Transformers and scales to many more inputs. Not exactly about generation.

Societal Costs of Transformers & Language Models

  1. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?: This is the paper that Timnit Gebru and Margaret Mitchell got fired for publishing.

  2. Alignment of Language Agents: This is DeepMind’s critique of their own approach.


  1. Pixel Recurrent Neural Networks: A highly influential autoregressive model for image generation

  2. WaveNet: A Generative Model for Raw Audio: A highly influential autoregressive model for audio generation


Creating adversarial examples

  1. Explaining and Harnessing Adversarial Examples: This paper got the ball rolling by pointing out how to make images that look good but are consistently misclassified by trained deepnets.

  2. Julia Evans’ walkthrough of generating adversarial examples
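A toy sketch of the fast gradient sign idea from “Explaining and Harnessing Adversarial Examples” may make these readings easier to follow. The assumptions are loud ones: the “network” here is just a logistic regression with random weights standing in for a trained deepnet, the input is a random vector rather than an image, and `epsilon` is an arbitrary perturbation budget.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_input_grad(x, y, w, b):
    # Binary cross-entropy of a logistic-regression model,
    # plus the gradient of that loss with respect to the input x.
    p = sigmoid(x @ w + b)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w
    return loss, grad_x

def fgsm(x, y, w, b, epsilon):
    # Move every input dimension by epsilon in the direction that raises the loss.
    _, grad_x = loss_and_input_grad(x, y, w, b)
    return x + epsilon * np.sign(grad_x)

# Hypothetical model and input, for illustration only.
rng = np.random.default_rng(0)
w = rng.normal(size=16)
b = 0.0
x = rng.normal(size=16)
y = 1.0

x_adv = fgsm(x, y, w, b, epsilon=0.1)
clean_loss, _ = loss_and_input_grad(x, y, w, b)
adv_loss, _ = loss_and_input_grad(x_adv, y, w, b)
```

No input dimension changes by more than `epsilon`, yet the loss on the perturbed input is higher than on the clean one. With a real image classifier the same step is taken using the gradient from backpropagation.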

Creating GANs

  1. Generative Adversarial Nets: The paper that introduced GANs

  2. 2016 Tutorial on Generative Adversarial Networks by one of the creators of the GAN. This one’s long, but good.

  3. DCGAN: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks: This is an end-to-end model. Many papers build on this.

Advanced GANs

  1. Progressive Growing of GANs for Improved Quality, Stability, and Variation: This is used in StyleGAN and was state-of-the-art in 2018.

  2. StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks: As of 2019, this was the current state-of-the-art for GAN-based image generation.

  3. StyleGAN2-ADA: Training Generative Adversarial Networks with Limited Data: As of 2020, this was the current state-of-the-art for GAN-based image generation.

  4. Learning Universal Adversarial Perturbations with Generative Models: Using a GAN to make adversarial attacks.


  1. A starter blog on AutoEncoders and VAEs: Probably a good place to start.

  2. The Deep Learning Book’s Chapter on Autoencoders

  3. From neural PCA to deep unsupervised learning: This paper introduces Ladder networks, which will come back when we get to VAEs.

Basic Variational Autoencoders (VAEs)

  1. Tutorial on Variational Autoencoders: This is a walk-through of the math of VAEs.

  2. Variational Inference: A Review for Statisticians: This explains the math behind variational inference and why you would use variational inference instead of Gibbs sampling.


  1. VQ-VAE: Neural Discrete Representation Learning

  2. Conditional VAE: Learning Structured Output Representation using Deep Conditional Generative Models: Making a controllable VAE through conditioning

  3. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework: This is about making disentangled representations: making the VAE’s latent variables meaningful to us.

  4. Isolating Sources of Disentanglement in VAEs: More on disentangled representations in VAEs

  5. Ladder VAEs: Hierarchical VAEs

  6. Adversarial Auto-Encoders: You can guess what this is.

  7. A Wizard’s Guide to Adversarial Autoencoders: This is a multi-part tutorial that will be helpful for understanding AAEs.

  8. Wasserstein Auto-Encoders: These use a different regularizer than the one used by the VAE and generalize Adversarial Auto-Encoders.

VAE Applications

  1. MusicVAE: Learning Latent Representations of Music to Generate Interactive Musical Palettes: Making controllable music composition with VAEs

  2. Jukebox: A Neural Net that Generates Music: This one generates music with a combination of VQ-VAE autoencoders and autoregressive Transformers.

  3. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations: Using a WAE to generate new drugs



Generative language models may really just be stochastic parrots

  1. Extracting Training Data from Large Language Models: Did GPT-2 memorize a Harry Potter book? Read this and find out.

  2. Quantifying Memorization Across Neural Language Models

Normalizing Flows

  1. Variational Inference with Normalizing Flows: A differentiable method to take a simple distribution and make it arbitrarily complex. Useful for modeling distributions in deep nets. Can be added to VAEs.
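As a rough illustration of what one step of a normalizing flow looks like, here is a NumPy sketch of a single planar flow layer, the kind of transformation used in “Variational Inference with Normalizing Flows”. The parameter values `u`, `w`, `b` and the point `z` are arbitrary numbers picked for the example; in practice they would be learned, and many layers would be stacked.

```python
import numpy as np

def planar_flow(z, u, w, b):
    # f(z) = z + u * tanh(w . z + b)
    a = np.tanh(w @ z + b)
    f = z + u * a
    # psi(z) = tanh'(w . z + b) * w, and log|det df/dz| = log|1 + u . psi(z)|
    psi = (1.0 - a ** 2) * w
    log_det = np.log(np.abs(1.0 + u @ psi))
    return f, log_det

def standard_normal_logpdf(z):
    return -0.5 * (z @ z) - 0.5 * z.shape[0] * np.log(2.0 * np.pi)

# Hypothetical parameters; w . u >= -1 keeps this layer invertible.
u = np.array([0.5, 0.3])
w = np.array([1.0, -0.5])
b = 0.1

z = np.array([0.2, -0.4])              # a point drawn from the simple base density
z_new, log_det = planar_flow(z, u, w, b)

# Change of variables: log q1(f(z)) = log q0(z) - log|det df/dz|
log_q1 = standard_normal_logpdf(z) - log_det
```

The transformed density is obtained from the base density purely by subtracting the log-determinant term, which is why each layer is designed so that this term is cheap to compute.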

FiLM Layers

  1. FiLM: Visual Reasoning with a General Conditioning Layer: Affine transformation of input layers that proves helpful in many contexts. Here’s the TL;DR version. I’d start with the TL;DR.
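The affine transformation in a FiLM layer is simple enough to sketch in a few lines of NumPy. This is an illustration under assumptions: the single linear map `W` standing in for the network that produces the scale and shift, the tensor sizes, and the random inputs are all invented for the example.

```python
import numpy as np

def film(features, gamma, beta):
    # features: (channels, height, width); gamma and beta: (channels,)
    # Each channel gets its own scale and shift, shared across spatial positions.
    return gamma[:, None, None] * features + beta[:, None, None]

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
channels, cond_dim = 3, 5
features = rng.normal(size=(channels, 4, 4))   # e.g. a convolutional feature map
condition = rng.normal(size=cond_dim)          # e.g. an embedding of a question

# A stand-in "FiLM generator": one linear map from the condition to (gamma, beta).
W = rng.normal(size=(cond_dim, 2 * channels))
gamma, beta = np.split(condition @ W, 2)

modulated = film(features, gamma, beta)
```

The point to notice is that `gamma` and `beta` depend on the conditioning input rather than being fixed parameters, so the same feature map is modulated differently for different conditions.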

Structured State Space Models

  1. Efficiently Modeling Long Sequences with Structured State Spaces

  2. The Annotated S4: This is a guided walkthrough (with code) of a structured state space model.

  3. It’s Raw! Audio Generation with State-Space Models

Diffusion Models and Score Models

  1. Generative Modeling by Estimating Gradients of the Data Distribution

  2. What are Diffusion Models?


  1. DALL-E 2: As of 2022, this is SOTA for language-conditioned generative image models.

  2. Learning Transferable Visual Models From Natural Language Supervision: The CLIP representation