DEEP GENERATIVE MODELS CS 496 FALL 2024

Location

Tech L168

Class Day/Time

Wed 2pm - 5pm

Instructor

Bryan Pardo Office hours by appointment

Course Description

WARNING: This course is about reading papers. A lot of papers. If you don’t want to read and discuss papers, this is not the course for you.

DESCRIPTION: One of the most exciting areas of research in deep learning is that of generative models. This course is dedicated to understanding the inner workings of the technologies that underlie these advances. Students will primarily learn about how Transformers, Variational Autoencoders, Reinforcement Learning and Diffusion Models are used to create text documents, write code, create music, images, speech, and more. We’ll also look at legal and ethical implications of these models. This is an advanced course that presumes a good working understanding of traditional supervised neural network technology and techniques (e.g. convolutional networks, LSTMs, loss functions, regularization, gradient descent).

PREREQUISITES: Course admission is by permission of instructor. The best Northwestern course to prepare for this class is CS 449 Deep Learning, although being a doctoral student in CS will likely get you in, even if you haven’t taken CS 449.

Course Calendar

Day and Date	Topic	Presenter	Commentators
Wed Sep 25	Basics of Transformers: Attention, Embeddings	Pardo	-
	Basics of Transformers: Positional Encoding, Autoregression	Pardo	-
Wed Oct 2	Basics of Transformers, continued.	Pardo	-
	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	Pardo	-
	At the end of class, watch this video Should melodies be copyrightable?	Pardo	-
Wed Oct 9	Quantifying Memorization Across Neural Language Models	Andrew Shen	Gourav, Yoshii
	Allocating Ownership Rights in Computer Generated Works	Aidan Fitzsimons	Cesar
	Class discussion about creativity/ownership	-	-
Wed Oct 16	The Curious Case of Neural Text Degeneration	Vispi Karkaria	Yiqi, Cesar
	A Watermark for Large Language Models	Ryan Chu	Keyi, Aidan
	Basics of Deep Reinforcement Learning + Deep Reinforcement Learning: Pong from Pixels	Pardo	-
Wed Oct 23	Deep reinforcement learning from human preferences	Yiqi Lyu	Aidan, Zihan
	Training language models to follow instructions with human feedback	Keyi Wang	Andrew, Monica
	Basics of Variational Auto Encoders	Pardo	-
Wed Oct 30	VQ-VAE: Neural Discrete Representation Learning	Cesar Ades	Phillip, Jiaqi
	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale	Zihan Guo	Jiaqi
	Zero-Shot Text-to-Image Generation	Monica Dou	Nathan, Zihan
Wed Nov 6	LORA: Low-rank Adaptation of Large Language Models	Gourav Kumbhojkar	Andrew, Keyi
	MaskGIT: Masked Generative Image Transformer	Pardo	Nathan, Yoshii
	Basics of Diffusion/Score models	Pardo	-
Wed Nov 13	No class: read some papers!
Wed Nov 20	Vampnet: Music Generation via Masked Acoustic Token Modeling	Pardo	Vispi, Patrick
	High-Resolution Image Synthesis with Latent Diffusion Models	Jiaqi Guo	Ryan
Wed Nov 27	Learning Transferable Visual Models From Natural Language Supervision	Patrick Koller	Gourav, Vispi
	GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models	Yoshii Ma	Phillip, Monica
Wed Dec 4	Hierarchical Text-Conditional Image Generation with CLIP Latents	Abhishek Srivastava	Patrick, Yiqi
	Extracting Training Data from Diffusion Models	Nathan Pruyne	Abhishek, Ryan
	Scalable Diffusion Models with Transformers	Philipp Srivastava	Abhishek
Fri Dec 13	Final Exam 9am - 11am: 5-minute one-on-one quiz on a randomly-selected paper you wrote a review of

Course assignments

Reading: 40 points

You will submit 20 one-page reviews of readings from the course website. 15 of these must be papers (not lecture slides, actual papers) scheduled for presentation in the course calendar. 5 of these can be chosen from the full set of readings for the course. Note: even though this class is about generative models, use of a generative model is not allowed in writing these up. The point is to communicate your own thoughts on the page.

Class Paper Presentation: 40 points

Once during the course of the term, you (and your partner) will be the lead in discussing the reading in class. This will mean you haven’t just read the paper, but you’ve read related work, really understand it, and can give a 30-minute (maximum) presentation of the paper and then lead a discussion about it. Note, you are not allowed to go over 30 minutes in your presentation. Keeping to the time limit is part of the grade.

10 points: 1-on-1 meeting prior to the presentation, to go over slides and talking points
10 points: Initial slides at time of 1-on-1 meeting
15 points: Presenting topic (30 minutes) with updated slides & leading discussion (15 minutes)
5 points: Final submitted slides, updated in response to feedback from presentation

Class Participation: 10 points

For 2 papers that are presented in class (not your own), you are expected to be in class, on time, and really up on the material. I will feel free to call on you repeatedly and expect you to be engaged and give informed, thoughtful answers. Don’t expect full points for this if you give brief or uninformed answers. You will be able to sign up for these papers at the start of the term.

Final Exam: 10 points

Your final exam is this: I will select a paper that you did a review of and you’ll spend 5 minutes talking about it, answering my questions as you go. If I believe you really read and understood the paper, full points.

Lecture Slides and Notebooks

Lectures

Course Reading

INFLUENTIAL AUTOREGRESSIVE MODELS

Pixel Recurrent Neural Networks: A highly influential autoregressive model for image generation
WaveNet: A Generative Model for Raw Audio: A highly influential autoregressive model for audio generation

TRANSFORMERS & LANGUAGE MODELS

Architectural elements that lead up to Transformers

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) This is a good starting point blog on attention models, which is what Transformers are built on.
Sequence to Sequence Learning with Neural Networks: This is the paper that the link above was trying to explain.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation: This introduces encoder-decoder networks for translation. Attention models were first built on this framework.
Neural Machine Translation by Jointly Learning to Align and Translate: This paper introduces additive attention to an encoder-decoder.
Effective Approaches to Attention-based Neural Machine Translation: This paper introduces multiplicative attention, which is what Transformers use.
Deep Residual Learning for Image Recognition: This introduces the idea of “residual layers”, which are layers that are skippable. This idea is used in Transformers.
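
To make the additive vs. multiplicative distinction in the list above concrete, here is a minimal sketch in plain Python. The function names, the toy vectors, and the identity projection matrices are all made up for illustration; the division by sqrt(d) is the scaling used in Transformers and is not part of the original multiplicative formulation.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def multiplicative_scores(query, keys):
    # Dot-product score q . k, scaled by sqrt(d) as in Transformers.
    d = len(query)
    return [dot(query, k) / math.sqrt(d) for k in keys]

def additive_scores(query, keys, W_q, W_k, v):
    # Additive score: v . tanh(W_q q + W_k k), a one-hidden-layer network.
    q_proj = matvec(W_q, query)
    scores = []
    for k in keys:
        k_proj = matvec(W_k, k)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(dot(v, hidden))
    return scores

def attend(weights, values):
    # Weighted average of the value vectors.
    dim = len(values[0])
    return [sum(w * val[i] for w, val in zip(weights, values)) for i in range(dim)]

# Toy inputs (hypothetical): one query, two keys, two values.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

mult_weights = softmax(multiplicative_scores(query, keys))
add_weights = softmax(additive_scores(query, keys, identity, identity, [1.0, 1.0]))
context = attend(mult_weights, values)
```

Both variants end in the same softmax-and-average step; they differ only in how the query-key score is computed. The multiplicative form needs no extra parameters and reduces to matrix multiplication, which is one reason it scales well.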

Embeddings

The Illustrated Word2Vec: Transformers for text take word embeddings as input. So what’s a word embedding? This is a walk through word embeddings, at a high level, with no math.
Efficient Estimation of Word Representations in Vector Space: This is the Word2Vec paper.
GloVe: Global Vectors for Word Representation: The paper that describes the GloVe embedding, which is an improvement on Word2Vec, and has downloadable embeddings to try. There is math here.
Using the Output Embedding to Improve Language Models: In transformers, they actually learn their embeddings at the same time as everything else and tie the input embedding to the output embedding. This paper explains why.

The Transformer Architecture

The Illustrated Transformer: A good initial walkthrough that helps a lot with understanding transformers. I’d start with this one to learn about transformers.
The Annotated Transformer: An annotated walk-through of the “Attention is All You Need” paper, complete with detailed Python implementation of a transformer. If you actually want to understand transformer implementation, you should read this in depth…and play with the code.
Attention is All You Need: The paper that introduced transformers, which are a popular and more complicated kind of attention network.

About positional encoding

Self-Attention with Relative Position Representations: The most frequently used alternative to absolute positional encoding
Rotary Positional Encoding: Claims to combine the benefits of both absolute and relative positional encoding
What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding: Why not learn your positional encoding? What happens if you do that?

BERT and GPT, two foundational architectures

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: A widely-used language model based on Transformer encoder blocks.
The Illustrated GPT-2: A good overview of GPT-2 and its relation to Transformer decoder blocks.
GPT-3: Language Models are Few-Shot Learners: This explores the range of things that you can do with a GPT model.

Why sampling strategies matter

The Curious Case of Neural Text Degeneration: When you sample from the output of a language model, it matters a LOT just how you sample. Read this to understand why.
A Watermark for Large Language Models: Can you use a sampling strategy to watermark generated text?
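
As a rough illustration of why the sampling rule matters, here is a minimal sketch of nucleus (top-p) sampling, the strategy proposed in The Curious Case of Neural Text Degeneration. The helper names and the toy distribution are hypothetical; a real decoder would apply this to the model’s next-token probabilities at every step.

```python
import random

def top_p_filter(probs, p=0.9):
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            # Smallest prefix whose total mass reaches p: the "nucleus".
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_top_p(probs, p=0.9, rng=random):
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution over a 5-token vocabulary (hypothetical).
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
nucleus = top_p_filter(probs, p=0.9)
token = sample_top_p(probs, p=0.9, rng=random.Random(0))
```

Setting p close to 0 approaches greedy decoding, and p = 1 is plain sampling from the full distribution, so the low-probability tail is never cut off.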

Symbolic Music Making with Transformers

Music Transformer: Applying Transformers to music composition.
Anticipatory Music Transformer: A Controllable Infilling Model for Music

Making Images with Transformers

Image GPT: Using a Transformer to make images. This isn’t DALL-E, even though it’s by OpenAI.
Zero-Shot Text-to-Image Generation: This is the original version of DALL-E, which generates images conditioned on text captions. It is based on Transformer architecture.
MaskGIT: Masked Generative Image Transformer: This lets you generate in PARALLEL, not auto-regressively
MaskBit: Embedding-free Image Generation via Bit Tokens: A 2024-SOTA image generator based on MaskGIT

Making Transformers Efficient by Making Attention Efficient

Reformer: The Efficient Transformer: This uses locality-sensitive hashing to make attention much more efficient, reducing its cost from O(n^2) to O(n log n).
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: This is a VERY popular paper and lots of people use this approach

Adding Memory/Scratch Space to a Transformer

How to tokenize audio to allow language modeling of sound

WAV2VEC: UNSUPERVISED PRE-TRAINING FOR SPEECH RECOGNITION: This describes a way to build a dictionary of audio tokens that is used in MusicLM
Wav2vec 2.0: Learning the structure of speech from raw audio: The 2nd iteration of wav2vec
SoundStream: An End-to-End Neural Audio Codec: Perhaps the top (as of 2023) audio codec. It is used in multiple audio language models to tokenize the audio for the language model.
High-Fidelity Audio Compression with Improved RVQGAN: This is SOTA for audio encoders, as of July 2024

Using language models on audio tokens

W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training: Masked inference language model method to learn audio tokens. This is what is actually used in MusicLM
AudioLM: A Language Modeling Approach to Audio Generation: A language model for generating speech continuation
MusicLM: Generating Music From Text: A model generating music audio from text descriptions such as “a calming violin melody backed by a distorted guitar riff”.
Vampnet: Music Generation via Masked Acoustic Token Modeling: Generating non-autoregressively, similar to MaskGIT

Fine tuning of language models

LORA: Low-rank Adaptation of Large Language Models: Microsoft’s approach to fast, efficient retraining for downstream tasks.
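
The low-rank idea in the paper’s title can be sketched in a few lines: the pretrained weight matrix W stays frozen and only a rank-r update B·A is trained. The matrix sizes and values below are made up for illustration. Initializing B to zeros and scaling the update by alpha/r follow the paper as I understand it; check the details against the paper.

```python
def matmul(A, B):
    # Plain-Python matrix product of two lists-of-rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_weight(W, A, B, alpha, r):
    # Effective weight: W + (alpha / r) * B @ A
    # W is d_out x d_in (frozen), B is d_out x r, A is r x d_in (both trainable).
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy shapes (hypothetical): d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[0.1, 0.2, 0.3]]
B_init = [[0.0], [0.0]]      # zero init: the adapted model starts identical to W
B_trained = [[1.0], [2.0]]   # pretend values after fine-tuning

W_start = lora_weight(W, A, B_init, alpha=2.0, r=1)
W_adapted = lora_weight(W, A, B_trained, alpha=2.0, r=1)

full_params = len(W) * len(W[0])              # d_out * d_in
lora_params = 1 * (len(W) + len(W[0]))        # r * (d_out + d_in)
```

At these toy sizes the savings are negligible, but for a large square weight matrix the trainable parameter count drops from d*d to 2*r*d.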

Reinforcement learning for Model Alignment

Reinforcement Learning: An Introduction: This is an entire book, but it is the one I learned RL from.
Policy Gradient Methods: Tutorial and New Frontiers: This is a video lecture that explains reinforcement learning policy gradient methods. This is the underlying tech used for training ChatGPT. Yes, this video is worth a “reading” credit. Yes, I started it 37 minutes into the lecture on purpose. You don’t have to watch the first half of the lecture.
Andrej Karpathy’s blog on Deep Reinforcement Learning: When combined with the video tutorial above, you’ll more-or-less understand policy gradient methods for deep reinforcement learning
Proximal Policy Optimization Algorithms: The paper that (mostly) explains the RL approach used in InstructGPT (the precursor to ChatGPT)
This blog on RL from human feedback: read the paper linked at the start of the blog. It teaches how to learn a reward function from human feedback, so you can do RL.
This blog on Aligning language models to follow instructions: explains how ChatGPT is fine-tuned to answer prompts by combining proximal policy optimization and RL from human feedback (the two previous items on this list).
ChatGPT: Already perhaps the most famous chatbot and most famous language model and it has been out about 2 weeks as of this writing.

Moving beyond RL for Model Alignment

Direct Preference Optimization: Your Language Model is Secretly a Reward Model: DPO directly optimizes for the policy best satisfying human preferences with a simple classification objective. No RL required!
KTO: Model Alignment as Prospect Theoretic Optimization: Directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences. Maybe better than DPO?
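
The “simple classification objective” in DPO can be sketched for a single preference pair as below. The inputs are summed log-probabilities of a chosen and a rejected response under the policy being trained and under a frozen reference model. The function name, argument names, and numbers are hypothetical, and beta = 0.1 is just an illustrative setting.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)) == softplus(-logit), computed in a stable way.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))

# Policy identical to the reference: the loss is log(2), i.e. a coin flip.
loss_same = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Policy has shifted probability toward the chosen response: lower loss.
loss_better = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

This is binary logistic regression on the difference of log-ratio margins, which is why no reward model or RL loop is needed.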

Multimodal Language Modeling

GPT-4 with vision (GPT-4V): enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available
LLaVA: Large Language and Vision Assistant: A large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding
LLark: A Multimodal Foundation Model for Music: Mix Jukebox and Llama 2 and you get this.

Making Language Models Safe

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

GENERATIVE ADVERSARIAL NETWORKS (GANS)

Creating adversarial examples

Explaining and Harnessing Adversarial Examples: This paper got the ball rolling by pointing out how to make images that look good but are consistently misclassified by trained deepnets.

Creating GANs

Generative Adversarial Nets: The paper that introduced GANs
2016 Tutorial on Generative Adversarial Networks by one of the creators of the GAN. This one’s long, but good.
DCGAN: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks: This is an end-to-end model. Many papers build on this.

Advanced GANS

PROGRESSIVE GROWING OF GANS FOR IMPROVED QUALITY, STABILITY, AND VARIATION: This is used in StyleGAN and was state-of-the-art in 2018.
StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks: As of 2019, this was the current state-of-the-art for GAN-based image generation.
StyleGAN2-ADA: Training Generative Adversarial Networks with Limited Data: As of 2020, this was the current state-of-the-art for GAN-based image generation.
Cross-Modal Contrastive Learning for Text-to-Image Generation: As of 2021, this was the best GAN for text-conditioned image generation. Note its use of contrastive loss. You’ll see that again in CLIP.
Adversarial Audio Synthesis: Introduces WaveGAN, which applies GANs to unsupervised synthesis of raw-waveform audio.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis: Doing speech synthesis with GANs.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis: Even better speech synthesis with GANs.

Variational Auto Encoders (VAEs)

Background needed for Variational Autoencoders

The Deep Learning Book’s Chapters on Probability and Linear Algebra. Read these before the Easy Intro to KL Divergence
An Easy Introduction to Kullback-Leibler (KL) Divergence. Read this before reading about ELBO
Jensen’s Inequality (an example with code). Read this before reading about ELBO
Evidence Lower Bound, Clearly Explained: A video walking through Evidence Lower Bound.
A walkthrough of Evidence Lower Bound (ELBO): Lecture notes from David Blei, one of the inventors of ELBO. ELBO is what you optimize when you do variational inference in a VAE.
Categorical Reparameterization with Gumbel-Softmax: This is a way of allowing categorical latent variables in your model so you can run a differentiable gradient descent algorithm through them. This is used in Vector-Quantized VAEs.
Probabilistic Graphical Models: Lecture notes from the class taught at Stanford.
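
The reading order above (KL divergence and Jensen’s inequality before ELBO) makes sense once you see how the pieces fit together. Here is a compact version of the standard derivation. The notation is my own assumption, not taken from any one reading: q_phi(z|x) is the encoder (approximate posterior), p_theta(x|z) the decoder, and p(z) the prior.

```latex
\begin{align}
\log p_\theta(x)
  &= \log \int p_\theta(x, z)\, dz
   = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \\
  &\ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
   \quad \text{(Jensen's inequality, since $\log$ is concave)} \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
   - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
\end{align}
```

The last line is the ELBO: a reconstruction term minus a KL term that keeps the encoder’s distribution close to the prior.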

Autoencoders

A starter blog on AutoEncoders and VAEs: Probably a good place to start.
The Deep Learning Book’s Chapter on Autoencoders
From neural PCA to deep unsupervised learning: This paper introduces Ladder networks, which will come back when we get to VAEs

BASIC Variational Auto Encoders (VAEs)

Tutorial on Variational Autoencoders: This is a walk-through of the math of VAEs. I think you should maybe start with this one.
Variational Inference, a Review for Statisticians: This explains the math behind variational inference. One of the authors is an inventor of variational inference.
An introduction to variational autoencoders: This is by the inventors of the VAE.
Tutorial: Deriving the Standard Variational Autoencoder (VAE) Loss Function: This is the only paper I’ve found that walks you through all the details to derive the actual loss function.

ADVANCED VAEs

VQ-VAE: Neural Discrete Representation Learning
Conditional VAE: Learning Structured Output Representation using Deep Conditional Generative Models: Making a controllable VAE through conditioning
Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework: This is about making disentangled representations: making the VAE’s latent variables meaningful to us.
Isolating Sources of Disentanglement in VAEs: More on disentangled representations in VAEs
Ladder VAEs: Hierarchical VAEs
Adversarial Auto-Encoders: You can guess what this is.
A Wizard’s Guide to Adversarial Autoencoders: This is a multi-part tutorial that will be helpful for understanding AAEs.
From Autoencoder to Beta-VAE: Lilian Weng’s overview of most kinds of autoencoders

VAE Applications

MUSIC VAE: Learning Latent Representations of Music to Generate Interactive Musical Palettes: Making controllable music composition with VAEs
Jukebox: A Neural Net that Generates Music… with a combination of VQ-VAEs and autoregressive Transformers
Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations: Using a WAE to generate new drugs

Diffusion and Score Models

Deep unsupervised learning using nonequilibrium thermodynamics: The 2015 paper where diffusion models were introduced.
Denoising Diffusion Probabilistic Models: This was a break-out paper from 2020 that got people excited about diffusion models.
The Annotated Diffusion Model: This provides a step-by-step walkthrough of the Denoising Diffusion Probabilistic Models, but with pytorch code
DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS: A 2021 paper that made a splash in audio-synthesis
Generative Modeling by Estimating Gradients of the Data Distribution: This is a blog that explains how score-based models are also basically diffusion models.
What are Diffusion Models?
An Introduction to Diffusion Models: A nice tutorial blog that has Pytorch code.

Advanced Diffusion and Score Models

DiffWave: A Versatile Diffusion Model for Audio Synthesis: A neural vocoder done with diffusion from 2021
Universal Speech Enhancement With Score-based Diffusion
Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise: Do we need to add noise at each step or would any transform do?

Latent Diffusion (as seen in Stability AI’s image generator)

High-Resolution Image Synthesis with Latent Diffusion Models: How to use an autoencoder to put images into a latent space so you can do diffusion with fewer resources
The Illustrated Stable Diffusion: Stability AI made a great diffusion model. Here’s a high-level overview of how it works.

Combining Diffusion and Transformers

Scalable Diffusion Models with Transformers: A 2023 paper that uses a Transformer architecture to do diffusion
Long-form Music Generation with Latent Diffusion: This is a 2024 paper that describes how to use DiT models to do music audio generation
Stable Audio Open: A July 2024 paper that describes the architecture of Stability AI’s open-source audio generation model

Making diffusion faster

Progressive Distillation for fast sampling of Diffusion Models: This 2022 ICLR paper introduces the v-prediction objective, which is a weighted sum of x0 and eps (the two learning objectives typically used in diffusion)
Consistency Trajectory Models: For single-step diffusion model sampling, CTM offers diverse sampling options and balances computational budget with sample fidelity effectively.
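
To see what “a weighted sum of x0 and eps” means for v-prediction, here is a scalar sketch. It assumes a variance-preserving schedule with alpha^2 + sigma^2 = 1, parameterized by an angle phi. The function names and toy numbers are hypothetical.

```python
import math

def schedule(phi):
    # Variance-preserving schedule: alpha^2 + sigma^2 = 1 for any angle phi.
    return math.cos(phi), math.sin(phi)

def noisy_sample(x0, eps, phi):
    # Forward process: z = alpha * x0 + sigma * eps
    alpha, sigma = schedule(phi)
    return alpha * x0 + sigma * eps

def v_target(x0, eps, phi):
    # v-prediction target: v = alpha * eps - sigma * x0
    alpha, sigma = schedule(phi)
    return alpha * eps - sigma * x0

def recover_x0(z, v, phi):
    # Given z and v, the clean sample is x0 = alpha * z - sigma * v
    alpha, sigma = schedule(phi)
    return alpha * z - sigma * v

def recover_eps(z, v, phi):
    # ...and the noise is eps = sigma * z + alpha * v
    alpha, sigma = schedule(phi)
    return sigma * z + alpha * v

# Toy values (hypothetical): one data point, one noise draw, one timestep.
x0, eps, phi = 0.8, -0.3, 0.6
z = noisy_sample(x0, eps, phi)
v = v_target(x0, eps, phi)
```

Because both x0 and eps can be recovered from (z, v), a network trained to predict v carries the same information as one trained on either of the usual targets.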

Steerable diffusion (text conditioning)

Diffusion Models Beat GANs on Image Synthesis: This paper describes many technical details used in the GLIDE paper…and therefore in DALL-E-2. It introduces classifier guidance
Classifier-free Diffusion Guidance: The 2021 paper that defined how we use text prompting w/o a classifier to guide diffusion.
Guidance: a cheat code for diffusion models: if you want to understand DALL-E-2 and Imagen, you need to understand this. It’s an easier-to-follow explanation of the Classifier-free Diffusion Guidance idea.
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models: This is the paper that lays the groundwork for DALL-E-2.
Learning Transferable Visual Models From Natural Language Supervision: The CLIP representation. This is used in DALL-E-2.
Hierarchical Text-Conditional Image Generation with CLIP Latents: The DALL-E-2 paper.
InstructPix2Pix: Learning to Follow Image Editing Instructions
DITTO: Diffusion Inference-Time T-Optimization for Music Generation
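
The core of classifier-free guidance fits in one line: run the denoiser with and without the text condition, then extrapolate from the unconditional prediction toward the conditional one. This sketch uses the “guidance scale” convention, where a scale of 1 means no extra guidance. The paper writes the same idea with a weight w, and I am assuming scale = 1 + w. The function name and vectors are made up for illustration.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    # Extrapolate from the unconditional prediction toward the conditional one.
    # scale = 0 -> unconditional, scale = 1 -> conditional, scale > 1 -> amplified.
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions (hypothetical) from one denoiser call each.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]

unguided = guided_noise(eps_uncond, eps_cond, 0.0)
plain_conditional = guided_noise(eps_uncond, eps_cond, 1.0)
amplified = guided_noise(eps_uncond, eps_cond, 3.0)
```

Only the coordinates where the condition changes the prediction get pushed, which is why larger scales trade sample diversity for closer adherence to the prompt.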

Ethics and Societal Effects of Generative Modeling

Pam Samuelson’s AI Meets Copyright. This is a video lecture on generative AI and copyright law from one of the top copyright scholars in the USA.
Allocating Ownership Rights in Computer Generated Works
Sociotechnical Safety Evaluation of Generative AI Systems: A big overview paper on how to ensure your generative AI is going to minimize possible harms.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?: This is the paper that Timnit Gebru and Margaret Mitchell got fired from Google’s Ethical AI team for publishing.
Alignment of Language Agents: This is DeepMind’s critique of their own approach.
OpenAI’s analysis of GPT-4 potential harms: Worth a serious read
The Ethical Implications of Generative Audio Models: A Systematic Literature Review: By Northwestern’s own Julia Barnett!
Foregrounding Artist Opinions: A Survey Study on Transparency, Ownership, and Fairness in AI Generative Art: What do artists think about generative AI?

How bad is the memorization problem in generative models?

Quantifying Memorization Across Neural Language Models: Systematic experiments on how model size, prompt length, and frequency of an example in the training set impact our ability to extract memorized content.
Extracting Training Data from Large Language Models: Did GPT-2 memorize a Harry Potter book? Read this and find out.
Extracting Training Data from Diffusion Models: Exactly what it sounds like.
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models: Just what it sounds like.

TOPICS NOT COVERED IN CLASS (BUT THAT ARE WORTH LEARNING ABOUT)

Normalizing Flows

Variational Inference with Normalizing Flows: A differentiable method to take a simple distribution and make it arbitrarily complex. Useful for modeling distributions in deep nets. Can be added to VAEs.

FILM Layers

FiLM: Visual Reasoning with a General Conditioning Layer: Affine transformation of input layers that proves helpful in many contexts. Here’s the TL;DR version. I’d start with the TL;DR.
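
The affine transformation in FiLM is small enough to sketch directly: a conditioning input produces a per-feature scale (gamma) and shift (beta), which are applied to a layer’s activations. The linear generator, its weights, and the toy vectors below are hypothetical stand-ins for whatever conditioning network a real model would use.

```python
def linear(W, x):
    # Plain-Python matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def film_generator(condition, W_gamma, W_beta):
    # Hypothetical conditioning network: one linear map each for gamma and beta.
    return linear(W_gamma, condition), linear(W_beta, condition)

def film(features, gamma, beta):
    # Feature-wise linear modulation: scale and shift each feature independently.
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

# Toy example: a 2-d condition modulating a 3-feature activation vector.
condition = [1.0, 0.0]
W_gamma = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W_beta = [[0.5, 0.0], [0.0, 0.0], [-1.0, 0.0]]
features = [1.0, 2.0, 3.0]

gamma, beta = film_generator(condition, W_gamma, W_beta)
modulated = film(features, gamma, beta)
```

A gamma of 0 switches a feature off entirely and a gamma of 1 with a beta of 0 leaves it untouched, so the condition can gate features as well as rescale them.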

Structured State Space Models

Efficiently Modeling Long Sequences with Structured State Spaces
The Annotated S4: This is a guided walk through (with code) of a structured state space model.
It’s Raw! Audio Generation with State-Space Models
Mamba: Linear-Time Sequence Modeling with Selective State Spaces: Adds input-dependent “attention” (selectivity) to SSMs and claims to beat Transformers at their own game.