MON WED FRI 1pm Central time
Deep learning is a branch of machine learning based on algorithms that try to model high-level abstract representations of data by using multiple processing layers with complex structures. One of the most exciting areas of research in deep learning is that of generative models. Today’s generative models create text documents, write songs, make paintings and videos, and generate speech. This course is dedicated to understanding the inner workings of the technologies that underlie these advances. Students will learn about key methodologies, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Transformer-based language models. This is an advanced course that presumes a good working understanding of traditional supervised neural network technology and techniques (e.g. convolutional networks, LSTMs, loss functions, regularization, gradient descent).
Registration is by instructor permission only. This course is designed for doctoral students. Appropriately prepared BS and MS students may also be admitted, once doctoral student demand has been met.
|Week||Day and Date||Topic||Presenter||Commentators|
|1||Wed Sept 21||Course overview||Pardo|
|1||Fri Sept 23||Autoregressive language models||Pardo|
|2||Mon Sept 26||Attention||Pardo|
|2||Wed Sept 28||Embeddings: The Illustrated Word2Vec||Pardo|
|2||Fri Sept 30||Transformers: The Illustrated Transformer||Pardo|
|3||Mon Oct 3||Positional Encoding||Pardo|
|3||Wed Oct 5||Music Transformer||Julia B.||Keshav & Clarissa|
|3||Fri Oct 7||BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding||Mowafak A.||Preetham & Keshav|
|4||Mon Oct 10||GPT-3: Language Models are Few-Shot Learners||Yujia & Ruth||Isaiah & Julia|
|4||Wed Oct 12||Quantifying Memorization Across Neural Language Models||Isaiah J.||Aleksandr & Julia|
|4||Fri Oct 14||Image GPT||Soroush S.||Aneryben, Cameron & Amil|
|5||Mon Oct 17||Reformer: The Efficient Transformer||Tonmoay D.||Shubhanshi & Caden & TC|
|5||Wed Oct 19||Final projects & basics of stats/probability||Pardo|
|5||Fri Oct 21||Variational Inference, a Review for Statisticians||Pardo|
|6||Mon Oct 24||Tutorial on Variational Autoencoders||Pardo||Preetham & Srik & Mowafak|
|6||Wed Oct 26||VQ-VAE: Neural Discrete Representation Learning||Aneryben P.||TC & Clarissa|
|6||Fri Oct 28||Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework||Srik G.||James|
|7||Mon Oct 31||DALL-E: Zero-Shot Text-to-Image Generation||Liqian M.||Jipeng & Mowafak|
|7||Wed Nov 2||Jukebox: A Neural Net that Generates Music||Preetham P.||Aleksandr & James|
|7||Fri Nov 4||Generative Adversarial Networks||Pardo|
|8||Mon Nov 7||DCGAN: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks||Clarissa C.||Aneryben, Conner & Omar|
|8||Wed Nov 9||PROGRESSIVE GROWING OF GANS FOR IMPROVED QUALITY, STABILITY, AND VARIATION||TC L||Yujia & Ruth & Clarissa|
|8||Fri Nov 11||StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks||Conner & Omar||Shubhanshi & Caden|
|9||Mon Nov 14||Cross-Modal Contrastive Learning for Text-to-Image Generation||James||Amil & Cameron|
|9||Wed Nov 16||An Introduction to Diffusion Models||Shubhanshi & Caden||Srik, Conner & Omar|
|9||Fri Nov 18||Learning Transferable Visual Models From Natural Language Supervision||Cameron & Amil||Yujia & Ruth|
|10||Mon Nov 21||Guidance: a cheat code for diffusion models||Pardo||Liqian & Isaiah|
|10||Wed Nov 23||THANKSGIVING: No class|
|10||Fri Nov 25||THANKSGIVING: No class|
|11||Mon Nov 28||GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models||Jipeng S.||Liqian & Tonmoay|
|11||Wed Nov 30||Google’s Imagen||Aleksandr S.||Jipeng & Soroush|
|11||Fri Dec 2||DiffWave: A Versatile Diffusion Model for Audio Synthesis||Keshav B.||Tonmoay & Soroush|
|12||Thu Dec 8||Final project presentations 9AM-11AM||EVERYONE!||EVERYONE!|
You will submit 20 reviews of readings from the course website. Each will be a single-page reaction to something you read from the links provided below.
Once during the course of the term, you will be the lead person discussing the reading in class. This will mean you haven’t just read the paper, but you’ve read related work, really understand it and can give a 30-minute presentation of the paper (including slides) and then lead a discussion about it.
For two presentations OTHER than your own, you’ll be expected to be 100% on top of the material and to act as the counterpoint to the presenter. I’ll expect you to make good points and display clear knowledge of the material.
You will make, modify, and/or analyze some work, project or subdomain in generative modeling. This may mean modifying MusicVAE or making a story generator on top of GPT-3. It may mean downloading an existing thing and experimenting with it or it may mean building a new thing. Duplicating a paper’s results is always a great project. It could be a deep-dive literature review on a subtopic (a good first step towards writing a paper)… or something else, subject to approval of the proposal. The point breakdown for the project is as follows. There will be a maximum of 10 projects in the class. Students are encouraged to pair up.
This will be a group project. You will be in groups of 2 or 3. There will be no single projects.
Pixel Recurrent Neural Networks: A highly influential autoregressive model for image generation
WaveNet: A Generative Model for Raw Audio: A highly influential autoregressive model for audio generation
The Illustrated Word2Vec: Transformers for text take word embeddings as input. So what’s a word embedding? This is a walkthrough of one of the most famous embeddings.
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention): This is a good starting-point blog on attention models, which is what Transformers are built on.
Sequence to Sequence Learning with Neural Networks: This is the paper that the link above was trying to explain.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation: This introduces encoder-decoder networks for translation. Attention models were first built on this framework.
Neural Machine Translation by Jointly Learning to Align and Translate: This paper introduces additive attention to an encoder-decoder.
Effective Approaches to Attention-based Neural Machine Translation: This paper introduces multiplicative attention, which is what Transformers use.
Deep Residual Learning for Image Recognition: This introduces the idea of “residual layers”, which are layers that are skippable. This idea is used in Transformers.
The Illustrated Transformer: A good initial walkthrough that helps a lot with understanding transformers. **I’d start with this one to learn about transformers.**
The Annotated Transformer: An annotated walkthrough of the “Attention is All You Need” paper, complete with a detailed Python implementation of a transformer.
Attention is All You Need: The paper that introduced transformers, which are a popular and more complicated kind of attention network.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: A widely-used language model based on Transformer encoder blocks.
The Illustrated GPT-2: A good overview of GPT-2 and its relation to Transformer decoder blocks.
Image GPT: Using a Transformer to make images. This isn’t DALL-E, even though it’s by OpenAI.
Learning Transferable Visual Models From Natural Language Supervision: This describes how DALL-E selects which of the many images it generates should be shown to the user.
Perceiver: General Perception with Iterative Attention: A model that builds upon Transformers and scales to many more inputs. Not exactly about generation.
Self-attention with relative position representations: This is what got relative positional encoding started.
Reformer: The Efficient Transformer: This uses locality-sensitive hashing to make attention much more efficient, reducing its cost from O(n^2) to O(n log n). This is a better paper to read than the “Transformers are RNNs” paper (below), in that it is much clearer with its math and ideas.
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention: This paper talks about how to take the O(n^2) cost of attention and make it O(n).
Relative Positional Encoding for Transformers with Linear Complexity: Positional encoding is messed up by the linear attention approach from “Transformers are RNNs”. This paper addresses that problem.
Zero-Shot Text-to-Image Generation: This is the original version of DALL-E, which generates images conditioned on text captions. It is based on the Transformer architecture.
Music Transformer: Applying Transformers to music composition.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?: This is the paper that Timnit Gebru and Margaret Mitchell got fired for publishing.
Alignment of Language Agents: This is DeepMind’s critique of their own approach.
Extracting Training Data from Large Language Models: Did GPT-2 memorize a Harry Potter book? Read this and find out.
Quantifying Memorization Across Neural Language Models: Systematic experiments on how model size, prompt length, and frequency of an example in the training set impact our ability to extract memorized content.
Generative Adversarial Nets: The paper that introduced GANs
2016 Tutorial on Generative Adversarial Networks by one of the creators of the GAN. This one’s long, but good.
DCGAN: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks: This is an end-to-end model. Many papers build on this.
PROGRESSIVE GROWING OF GANS FOR IMPROVED QUALITY, STABILITY, AND VARIATION: This is used in StyleGAN and was state-of-the-art in 2018.
StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks: As of 2019, this was the current state-of-the-art for GAN-based image generation.
StyleGAN2-ADA: Training Generative Adversarial Networks with Limited Data: As of 2020, this was the current state-of-the-art for GAN-based image generation.
Cross-Modal Contrastive Learning for Text-to-Image Generation: As of 2021, this was the best GAN for text-conditioned image generation. Note its use of contrastive loss. You’ll see that again in CLIP.
Adversarial Audio Synthesis: Introduces WaveGAN, which applies GANs to unsupervised synthesis of raw-waveform audio.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis: Doing speech synthesis with GANs.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis: Even better speech synthesis with GANs.
The Deep Learning Book’s Chapters on Probability and Linear Algebra. Read these before the Easy Intro to KL Divergence.
An Easy Introduction to Kullback-Leibler (KL) Divergence. Read this before reading about ELBO.
Jensen’s Inequality (an example with code). Read this before reading about ELBO.
Evidence Lower Bound, Clearly Explained: A video walking through Evidence Lower Bound.
A walkthrough of Evidence Lower Bound (ELBO): Lecture notes from David Blei, one of the inventors of ELBO. ELBO is what you optimize when you do variational inference in a VAE.
Categorical Reparameterization with Gumbel-Softmax: This is a way of allowing categorical latent variables in your model while still being able to backpropagate gradients through them. This is used in Vector-Quantized VAEs.
Probabilistic Graphical Models: Lecture notes from the class taught at Stanford.
A starter blog on AutoEncoders and VAEs: Probably a good place to start.
From neural PCA to deep unsupervised learning: This paper introduces Ladder networks, which will come back when we get to VAEs.
Tutorial on Variational Autoencoders: This is a walkthrough of the math of VAEs. I think you should maybe start with this one.
Variational Inference, a Review for Statisticians: This explains the math behind variational inference. One of the authors is an inventor of variational inference.
An introduction to variational autoencoders: This is by the inventors of the VAE.
Tutorial: Deriving the Standard Variational Autoencoder (VAE) Loss Function: This is the only paper I’ve found that walks you through all the details to derive the actual loss function.
Conditional VAE: Learning Structured Output Representation using Deep Conditional Generative Models: Making a controllable VAE through conditioning
Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework: This is about making disentangled representations: making the VAE’s latent variables meaningful to us.
Isolating Sources of Disentanglement in VAEs: More on disentangled representations in VAEs
Ladder VAEs: Hierarchical VAEs
Adversarial Auto-Encoders: You can guess what this is.
A Wizard’s Guide to Adversarial Autoencoders: This is a multi-part tutorial that will be helpful for understanding AAEs.
From Autoencoder to Beta-VAE: Lilian Weng’s overview of most kinds of autoencoders
MusicVAE: Learning Latent Representations of Music to Generate Interactive Musical Palettes: Making controllable music composition with VAEs
Jukebox: A Neural Net that Generates Music... with a combination of VQ-VAEs and autoregressive Transformers
Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations: Using a WAE to generate new drugs
Deep unsupervised learning using nonequilibrium thermodynamics: The 2015 paper where diffusion models were introduced.
Denoising Diffusion Probabilistic Models: This was a break-out paper from 2020 that got people excited about diffusion models.
Generative Modeling by Estimating Gradients of the Data Distribution: This is a blog that explains how score-based models are also basically diffusion models.
An Introduction to Diffusion Models: A nice tutorial blog that has PyTorch code.
Guidance: a cheat code for diffusion models: If you want to understand DALL-E-2 and Imagen, you need to understand this.
DiffWave: A Versatile Diffusion Model for Audio Synthesis: A diffusion-based neural vocoder from 2021.
Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise: Do we need to add noise at each step or would any transform do?
High Fidelity Image Generation Using Diffusion Models: A Google Blog that gives the chain of development that led to Imagen.
Google’s Imagen: This is the Pepsi to DALL-E-2’s Coke.
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models: This is the diffusion model used in DALL-E-2
Learning Transferable Visual Models From Natural Language Supervision: The CLIP representation. This is used in DALL-E-2.
Hierarchical Text-Conditional Image Generation with CLIP Latents: The DALL-E-2 paper.
The Annotated S4: This is a guided walkthrough (with code) of a structured state space model.
Competition-level code generation with AlphaCode: This beats half of all human entrants into a coding competition.
ChatGPT: Already perhaps the most famous chatbot and most famous language model and it has been out about 2 weeks as of this writing.
Riffusion: Repurposes Stable Diffusion to generate spectrograms. Cool open-source project. They should have published this, too.
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models: Just what it sounds like.
MusicLM: Generating Music From Text: A model generating music audio from text descriptions such as “a calming violin melody backed by a distorted guitar riff”.