
Bio
Ying Nian Wu is currently a professor in the Department of Statistics at UCLA. He received his A.M. and Ph.D. degrees in statistics from Harvard University in 1994 and 1996, respectively. He was an assistant professor in the Department of Statistics at the University of Michigan from 1997 to 1999. He joined UCLA in 1999 and has been a full professor since 2006. Wu's research areas include generative modeling, representation learning, computer vision, computational neuroscience, and bioinformatics.
Title
A Tutorial on Generative Models
Abstract
In this tutorial, I will review recent generative models, including GANs, VAEs, flow-based models, energy-based models, and diffusion/score-based models. I will go over the key equations and algorithms for learning these models, and I will also explain the connections between them.
Replay (requires VPN access and a viewing password)
Slides
A Note from Dr. Zhu
Dr. Ying Nian Wu, affectionately known in the community as "Lao Wu", has collaborated with Dean Song-Chun Zhu for nearly 30 years. He graduated from Harvard University under the great statistician Don Rubin (the one who co-developed the EM algorithm and gave the first statistically rigorous account of causality; have you ever seen a theorist with 300k citations?). Dr. Wu has been studying generative algorithms since the 1990s and has co-authored many influential generative vision papers with Dean Zhu, including one of Dean Zhu's representative works, FRAME (the first paper to give a clear account of texture), and the Marr Prize nominated Active Basis (the paper that drew me to Dean Zhu's lab back then). Frankly speaking, the EBMs that are so hot these days are things Dr. Wu already worked through years ago. Lao Wu remains active on the front line of research, publishing first-author papers in top statistics journals (interested readers may read "A tale of three probabilistic families: discriminative, descriptive and generative models"). Dr. Wu's machine learning course is very famous at UCLA; watching Lao Wu derive formulas is a pleasure, and he can explain the relations between the various models exceptionally clearly (you absolutely cannot find this on YouTube!). I hope you can feel the beauty of mathematics in this 2-hour tutorial.
Notes: Generative Modeling Explained
Organized by Yu-Zhe Shi; see original post on GitHub.
This tutorial on generative modeling is part of the Statistical Machine Learning Tutorials by Ying Nian Wu at UCLA Statistics. The tutorial goes over the key equations and algorithms for learning recent generative models, including energy-based models, diffusion/score-based models, autoregressive/flow-based models, VAEs, and GANs, and explains the connections between these models. In contrast to most current tutorials on generative modeling, which take the perspective of machine learning, this tutorial provides a more basic and natural perspective from statistics. Starting from the very basics of probability, the tutorial is extremely learner-friendly.
Highlights & Significance
The tutorial connects different families of generative models from multiple perspectives—the original formulations, the essence of the sampling processes, and the variational formulations.
Sampling from a highly multi-modal distribution is extremely hard. The Diffusion model factorizes the problem of sampling from such a distribution into thousands of small incremental steps, making the sampling much more tractable. The VAE follows an elegant formulation that tries to sample the data distribution in a single trial; however, the estimated aggregated posterior may mismatch the prior. The GAN also suffers from single-trial sampling, but uses a discriminator to guide the generation.
Dr. Wu introduces an intuitive analogy to golf for understanding the relations between the generative models. In this perspective, the model expressivity, the sampling process, and the number of modes in the data are analogous to the number of balls, the number of strokes, and the number of holes, respectively—more balls mean the possibility of covering more holes, and more strokes mean sending a ball to a hole with more patience. The reader may employ this analogy to understand the pros and cons of different generative models. Also see the relations between generative models from the perspective of the Diffusion model.
A visualization of the golf analogy
Prerequisite: Probability Density
As long as you can count, you understand everything about probability density.
Consider a cloud of $n$ example data points in the 2-D space. The probability density tells you how the points are distributed. As the number of data points becomes extremely large ($n \to \infty$), the cloud of points itself can be identified with the underlying density.
A visual demonstration of probability density
To analyze the continuous density, we can discretize the space into small cells of size $\Delta x \times \Delta y$ and count the points in each cell.
Understand Basic Concepts by Counting
Consider the number of data points falling in a cell (the shaded area); we have:
$$p(x, y)\,\Delta x\,\Delta y \approx \frac{\#\{i : (x_i, y_i) \in \text{cell}\}}{n}.$$
Joint Density
On this basis, we have the joint density as:
$$p(x, y) = \lim_{\Delta x, \Delta y \to 0}\ \lim_{n \to \infty}\ \frac{\#\{i : x_i \in (x, x+\Delta x],\ y_i \in (y, y+\Delta y]\}}{n\,\Delta x\,\Delta y}.$$
Given this most important concept, we can work on the three elementary operations—marginalization, conditioning, and factorization.
Marginalization
Calculating $p(x) = \int p(x, y)\,dy$: in counting terms, we pool all the points in the vertical strip at $x$, regardless of their $y$ values.
Conditioning
Calculating $p(y \mid x) = \frac{p(x, y)}{p(x)}$: in counting terms, we restrict attention to the points in the strip at $x$ and compute the fraction that also fall in the cell at $y$.
Factorization
On the basis of conditioning and marginalization, we have the factorization operation:
$$p(x, y) = p(x)\,p(y \mid x).$$
Expectation
The expectation of a function $h$ is the average of $h$ over the points:
$$\mathbb{E}_{p}[h(x)] = \int h(x)\,p(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n} h(x_i).$$
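To make the counting picture concrete, the following minimal numpy sketch (the toy data and all names are illustrative, not from the tutorial) estimates the joint density, a marginal, and a conditional by counting points in cells, and computes an expectation as a simple average:

```python
import numpy as np

# Illustrative 2-D data: n points from a correlated Gaussian.
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)

# Discretize the plane into cells of size dx * dy and count.
bins = np.linspace(-4, 4, 41)
dx = dy = bins[1] - bins[0]
counts, _, _ = np.histogram2d(x, y, bins=[bins, bins])

p_xy = counts / (n * dx * dy)                 # joint density: fraction per unit area
p_x = p_xy.sum(axis=1) * dy                   # marginalization: integrate over y
p_y_given_x = p_xy / (p_x[:, None] + 1e-12)   # conditioning: renormalize each strip

# Expectation of h(x, y) = x * y as a simple average over the points.
print(np.mean(x * y))  # close to Cov(x, y) = 0.8 for this toy data
```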
The Core Problem: Density Estimation
The gold standard for density estimation is Maximum Likelihood Estimation.
The real world is not counting points in a 2-D space. In fact, most data lives in a high-dimensional space, and the number of examples $n$ is far too small to fill that space, so the counting picture breaks down: almost every cell is empty.
Given this problem, what can we do? The intuitive way is to estimate a function that captures the properties of the probability density. We have to parametrize the probability density and try to learn the underlying parameters from finite examples, hoping that the density learned from finite examples generalizes to unseen ones.
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is the basic method for estimating a density function.
Given finite examples $x_1, \dots, x_n$ drawn independently from the data distribution $p_{\text{data}}$, we posit a model density $p_\theta(x)$ with parameters $\theta$.
Now we come to the core of the MLE—defining the log-likelihood function:
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i).$$
Our objective is to maximize the log-likelihood function, that is, $\hat{\theta} = \arg\max_\theta L(\theta)$.
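As a minimal worked example (the Gaussian model and all names here are illustrative assumptions), the following numpy sketch maximizes the log-likelihood of a univariate Gaussian by plain gradient ascent; the result approaches the closed-form MLE, i.e., the sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)  # examples from p_data

def log_likelihood(mu, log_sigma, x):
    """Average log-density of a Gaussian model at the examples."""
    sigma2 = np.exp(2 * log_sigma)
    return np.mean(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Gradient ascent on the log-likelihood (gradients written out by hand).
mu, log_sigma, lr = 0.0, 0.0, 0.1
for _ in range(500):
    sigma2 = np.exp(2 * log_sigma)
    grad_mu = np.mean(data - mu) / sigma2
    grad_ls = np.mean((data - mu) ** 2 / sigma2 - 1.0)
    mu += lr * grad_mu
    log_sigma += lr * grad_ls

print(mu, np.exp(log_sigma))  # approaches the sample mean and std
```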
Another Perspective on MLE: Kullback-Leibler Divergence
Kullback-Leibler Divergence (KL-Divergence) measures the difference between two distributions $p$ and $q$:
$$\mathrm{KL}(p \,\|\, q) = \mathbb{E}_{p}\left[\log \frac{p(x)}{q(x)}\right] = \int p(x) \log \frac{p(x)}{q(x)}\,dx \ge 0.$$
Trivially, if we are trying to maximize the log-likelihood, then as $n \to \infty$ this is equivalent to minimizing $\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)$, since
$$\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta) = \mathbb{E}_{p_{\text{data}}}[\log p_{\text{data}}(x)] - \mathbb{E}_{p_{\text{data}}}[\log p_\theta(x)],$$
where the first term does not depend on $\theta$ and the second term is the population version of the log-likelihood.
Since we are calculating the expectation over $p_{\text{data}}$, the model is penalized heavily wherever the data has mass but the model does not. MLE therefore exhibits the mode covering behavior: $p_\theta$ tends to cover all the modes of $p_{\text{data}}$, even at the cost of spreading mass over regions with few data points.
The mode covering behavior
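A small numeric illustration of this asymmetry (a hypothetical grid-based computation, not from the tutorial): for a two-mode data density $p$, the two directions of KL-Divergence prefer very different single-Gaussian fits $q$:

```python
import numpy as np

# Grid-based KL between a two-mode data density p and single-Gaussian models q.
xs = np.linspace(-8, 8, 2001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(xs, -3, 0.5) + 0.5 * gauss(xs, 3, 0.5)  # two modes

def kl(a, b):
    return np.sum(a * np.log((a + 1e-300) / (b + 1e-300))) * dx

# Mode covering: minimizing KL(p || q) prefers a wide q straddling both modes.
# Mode chasing: minimizing KL(q || p) prefers a narrow q sitting on one mode.
q_wide = gauss(xs, 0, 3.0)
q_one_mode = gauss(xs, 3, 0.5)
print(kl(p, q_wide), kl(p, q_one_mode))  # KL(p||q): the wide q wins
print(kl(q_wide, p), kl(q_one_mode, p))  # KL(q||p): the one-mode q wins
```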
Energy-based Model
Energy-based Model (EBM) is the most basic generative model.
Formulating EBM
The target density is parametrized in the energy-based form:
$$p_\theta(x) = \frac{1}{Z(\theta)}\exp\left(f_\theta(x)\right), \quad Z(\theta) = \int \exp\left(f_\theta(x)\right)dx,$$
where $f_\theta$ (e.g., a neural network) is the negative energy and $Z(\theta)$ is the intractable normalizing constant.
Learning EBM
We can calculate the derivative of the log normalizing constant over $\theta$:
$$\frac{\partial}{\partial \theta} \log Z(\theta) = \frac{1}{Z(\theta)} \int \frac{\partial}{\partial \theta} \exp\left(f_\theta(x)\right) dx = \mathbb{E}_{p_\theta}\left[\frac{\partial}{\partial \theta} f_\theta(x)\right],$$
where we get an important property: the derivative of $\log Z(\theta)$ is the model expectation of the derivative of $f_\theta$.
Bringing the EBM formulation into the MLE formulation, we have:
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} f_\theta(x_i) - \log Z(\theta),$$
and the derivative of the log-likelihood is
$$L'(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial \theta} f_\theta(x_i) - \mathbb{E}_{p_\theta}\left[\frac{\partial}{\partial \theta} f_\theta(x)\right].$$
However, computing the model expectation is extremely hard. We have to use Monte-Carlo sampling to draw examples from the estimated density. The gradient then matches the average of the observed examples with the average of the synthesized examples. The main problem of learning an EBM is that sampling from the model density is highly non-trivial.
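A minimal sketch of this learning rule in PyTorch, assuming a `sample_from_model` routine that draws synthesized examples from the current model (e.g., by the Langevin Dynamics of the next section); the network and all names are illustrative:

```python
import torch

# Toy energy network f_theta on 2-D inputs; shapes/names are illustrative.
f = torch.nn.Sequential(
    torch.nn.Linear(2, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
opt = torch.optim.Adam(f.parameters(), lr=1e-4)

def ebm_step(x_data, sample_from_model):
    """One MLE gradient step: L'(theta) = E_data[df] - E_model[df]."""
    x_model = sample_from_model(f, n=x_data.shape[0]).detach()  # synthesized examples
    # Minimizing this loss ascends the log-likelihood: push f up on data,
    # push f down on the model's own samples.
    loss = f(x_model).mean() - f(x_data).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```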
Contrastive Divergence
Following the KL-Divergence perspective, we can also interpret the Monte-Carlo sampling process for EBM in a similar way: consider the model at the current parameter $\theta_t$, and let $\tilde{p}_{\theta_t}$ denote the distribution of the synthesized examples obtained by running a finite number of MCMC steps from the current model. The update then follows the gradient of the contrastive divergence
$$\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta) - \mathrm{KL}(\tilde{p}_{\theta_t} \,\|\, p_\theta),$$
and the second KL term cancels the intractable $\log Z(\theta)$, leaving only the difference between the two averages of $\frac{\partial}{\partial\theta} f_\theta$.
A visual demonstration of contrastive divergence
Another Interpretation: Self-Adversarial Training
If we treat the current model $p_{\theta_t}$ as a generator of negative (synthesized) examples and $f_\theta$ as a critic, learning an EBM becomes self-adversarial training: the model pushes $f_\theta$ up on the observed examples and down on its own synthesized examples, i.e., it criticizes its own samples. This resembles Wasserstein GAN, except that the generator and the discriminator are one and the same model.
The mode chasing behavior in W-GAN
Another Interpretation: Noise Contrastive Estimation
Noise Contrastive Estimation (NCE) introduces a reference distribution $q(x)$ (e.g., a Gaussian) that is easy to sample from and has a tractable density.
To learn the model, a more practical way is to view the problem from an adversarial perspective. If we draw examples from both $p_{\text{data}}$ and $q$ and train a classifier to tell them apart, the optimal classifier's log-odds recovers $\log p_\theta(x) - \log q(x)$; thus the EBM can be estimated by logistic regression, with the normalizing constant absorbed as a learnable bias, and no MCMC sampling is needed.
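A hedged PyTorch sketch of the NCE objective; `q_sampler` and `q_log_prob` are hypothetical stand-ins for the reference distribution's sampler and log-density:

```python
import torch

def nce_loss(f, x_data, q_sampler, q_log_prob):
    """Logistic regression between data examples and reference samples.

    The classifier's log-odds is log p_theta(x) - log q(x), taken here as
    f(x) - log q(x); the unknown normalizer is absorbed into f as a constant.
    """
    x_noise = q_sampler(x_data.shape[0])  # draw reference examples from q
    logit_data = f(x_data).squeeze(-1) - q_log_prob(x_data)
    logit_noise = f(x_noise).squeeze(-1) - q_log_prob(x_noise)
    # Data labeled 1, noise labeled 0: binary cross-entropy on the logits.
    return (torch.nn.functional.softplus(-logit_data).mean()
            + torch.nn.functional.softplus(logit_noise).mean())
```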
Sampling Process for Learning EBM
Small steps get us to faraway places. 不積跬步,無以至千里。
As aforementioned, the core problem of generative modeling is estimating the model density. In this section, we start by reviewing the commonly used sampling method, Langevin Dynamics, a special case of Markov chain Monte Carlo (MCMC) sampling.
Langevin Dynamics
We cannot sample from the model density all at once. Hence, we use Langevin Dynamics to sample in steps along the time axis. Here we denote the target distribution by $\pi(x)$, with score $\nabla_x \log \pi(x)$.
We discretize the time axis into steps of size $\Delta t$, and each update is:
$$x_{t+\Delta t} = x_t + \frac{\Delta t}{2}\,\nabla_x \log \pi(x_t) + \sqrt{\Delta t}\,e_t, \quad e_t \sim \mathcal{N}(0, I).$$
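A minimal numpy sketch of Langevin Dynamics on a toy two-mode target (the mixture and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_pi(x):
    """Score of a toy target: a two-mode Gaussian mixture in 2-D."""
    mus = np.array([[-2.0, 0.0], [2.0, 0.0]])
    d = x[None, :] - mus                     # offsets to each mode
    w = np.exp(-0.5 * (d ** 2).sum(axis=1))  # unnormalized responsibilities
    w = w / w.sum()
    return -(w[:, None] * d).sum(axis=0)     # mixture score, unit variances

def langevin(x0, dt=0.01, steps=5000):
    x = x0.copy()
    for _ in range(steps):
        x = x + 0.5 * dt * grad_log_pi(x) + np.sqrt(dt) * rng.normal(size=x.shape)
    return x

print(langevin(np.zeros(2)))  # an (approximate) draw from the mixture
```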
Stochastic Differential Equation
Consider the time step becoming very small, i.e., $\Delta t \to 0$; Langevin Dynamics becomes the stochastic differential equation
$$dx_t = \frac{1}{2}\,\nabla_x \log \pi(x_t)\,dt + dB_t,$$
where $B_t$ denotes Brownian motion.
A more sophisticated version of MCMC sampling is Hamiltonian Monte-Carlo (HMC), which adds a momentum variable to smooth the trajectory.
Understanding Langevin Dynamics: Equilibrium Sampling
An interesting observation in Langevin Dynamics is: once $x_t$ is distributed according to the target $\pi$, it remains distributed as $\pi$ after every subsequent update—$\pi$ is the stationary (equilibrium) distribution of the dynamics.
How does this come about? Let us look back into the updating equation of Langevin Dynamics, into the two terms—the gradient ascent term $\frac{\Delta t}{2}\nabla_x \log \pi(x_t)$, which squeezes the points toward the modes, and the random perturbation term $\sqrt{\Delta t}\,e_t$, which diffuses the points away from the modes. At equilibrium the two effects exactly cancel.
Explaining Langevin Dynamics with equilibrium sampling: (1) gradient ascent as squeezing; (2) random perturbation as diffusion
To analyze the phenomenon mathematically, we may look into the Taylor expansion of a testing function $h(x_{t+\Delta t})$ around $x_t$, and check that $\mathbb{E}[h(x_{t+\Delta t})] = \mathbb{E}[h(x_t)]$ for all $h$ when $x_t \sim \pi$.
The derivation for the first-order Taylor expansion in gradient ascent is as follows:
This derivation shows that the remainder of the Taylor expansion of gradient ascent is negative.
The derivation for the second-order Taylor expansion in diffusion is as follows:
This derivation shows that the remainder of the Taylor expansion of diffusion is positive.
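The cancellation can also be checked numerically. In the sketch below (an illustrative check, not from the tutorial), the target is $\pi = \mathcal{N}(0, 1)$, so the score is $-x$; initializing at equilibrium and running Langevin updates leaves $\mathbb{E}[h(x_t)]$ for $h(x) = x^2$ essentially unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target pi = N(0, 1), so grad log pi(x) = -x, and h(x) = x^2 has E[h] = 1.
x = rng.normal(size=100_000)  # initialize *at* equilibrium
dt = 0.01
for _ in range(1000):
    x = x - 0.5 * dt * x + np.sqrt(dt) * rng.normal(size=x.shape)
print(np.mean(x ** 2))  # stays close to 1 (up to O(dt) discretization bias)
```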
On the basis of equilibrium sampling, we can now introduce score-based/diffusion models.
Tempering & Annealing
Though it comes with the merit of equilibrium sampling, Langevin Dynamics suffers from very slow convergence, especially when the model density has many localized modes (multi-modality). To address this problem, we introduce a temperature parameter $T$ and define $\pi_T(x) \propto \exp\left(f(x)/T\right)$: at high temperature the density is flattened and easy to traverse, and annealing then lowers $T$ gradually toward the target $T = 1$, so the samples track the smoothed density down into the right modes.
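A sketch of the resulting annealed sampler, reusing the illustrative `grad_log_pi` from the Langevin sketch above; the temperature ladder is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def annealed_langevin(grad_log_pi, x0, temps=(10.0, 5.0, 2.0, 1.0),
                      dt=0.01, steps_per_temp=1000):
    """Sample pi_T ~ exp(f/T): the score simply scales as grad_log_pi / T."""
    x = x0.copy()
    for T in temps:  # high T flattens the modes; anneal down to the target T = 1
        for _ in range(steps_per_temp):
            x = (x + 0.5 * dt * grad_log_pi(x) / T
                 + np.sqrt(dt) * rng.normal(size=x.shape))
    return x
```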
Diffusion/Score-based Models
Imagine you are playing golf. You can see exactly where the hole is, but you want to use a thousand strokes to send the ball to the hole. You do not want to shoot it in a single stroke, because the chance you hit the hole is very small. Rather, you see where the hole is, and you move toward it in small steps.
Unlike the EBM, which directly targets the log-density, the Diffusion model essentially learns a sampling process: it decomposes sampling from the density into a large number of very small incremental steps.
Forward: From Clean Image to Noise
The forward process of the Diffusion model gradually adds noise to a clean image until it becomes a Gaussian, using non-equilibrium sampling.
Let $x_0 \sim p_{\text{data}}$ be a clean image. At each forward step we add a small fresh Gaussian noise:
$$x_{t+\Delta t} = x_t + \sqrt{\Delta t}\,e_t, \quad e_t \sim \mathcal{N}(0, I),$$
so that $x_t = x_0 + \mathcal{N}(0, t I)$; for large $t$, the image is drowned in noise and $x_t$ is approximately Gaussian.
We can also look into the Taylor expansion of the testing function $h(x_{t+\Delta t})$ as before. Since the forward process has only the diffusion term and no gradient ascent term to balance it, $\mathbb{E}[h(x_t)]$ keeps changing over time: this is non-equilibrium sampling, where the distribution of $x_t$ evolves instead of staying fixed.
A visual demonstration of the forward time in non-equilibrium sampling
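A minimal numpy sketch of the forward process (illustrative; real implementations use a designed noise schedule rather than a constant one):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x0, dt=1e-3, steps=1000):
    """Non-equilibrium forward process: diffusion only, no gradient term."""
    x = x0.copy()
    for _ in range(steps):
        x = x + np.sqrt(dt) * rng.normal(size=x.shape)
    return x  # approximately x0 + N(0, steps * dt * I): the image drowned in noise

# Equivalently in one shot: x_t = x0 + sqrt(t) * e, with t = steps * dt.
```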
Reverse: From Noise Back to Image
After changing the clean image into white noise, we now try to walk back.
We only need to reverse the deterministic part of each step: the forward diffusion from noise level $t$ to $t + \Delta t$ changes the density $p_t$ in exactly the same way as the deterministic shift
$$x_{t+\Delta t} = x_t - \frac{\Delta t}{2}\,\nabla_x \log p_t(x_t),$$
so walking back amounts to undoing this shift. The key quantity is therefore the score $\nabla_x \log p_t(x)$ at every noise level.
Ordinary & Stochastic Differential Equation
Similar to Langevin Dynamics, we have two variants of the reverse update when the time steps become very small (i.e., $\Delta t \to 0$).
If we only consider the deterministic step, we have the Ordinary Differential Equation (ODE):
$$x_{t-\Delta t} = x_t + \frac{\Delta t}{2}\,\nabla_x \log p_t(x_t).$$
A visual demonstration of the reverse time in non-equilibrium sampling (ODE)
If we consider the random step, we have the Stochastic Differential Equation (SDE):
$$x_{t-\Delta t} = x_t + \Delta t\,\nabla_x \log p_t(x_t) + \sqrt{\Delta t}\,e_t, \quad e_t \sim \mathcal{N}(0, I),$$
where the doubled score coefficient compensates for the fresh noise injected at each reverse step (a deterministic half-step plus an equilibrium Langevin half-step).
A visual demonstration of the reverse time in non-equilibrium sampling (SDE)
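Both reverse updates fit in a few lines, given a score function. In the sketch below, `score(x, t)` is an assumed stand-in for $\nabla_x \log p_t(x)$ (e.g., the network learned in the next section); the step sizes match the constant-noise forward process used above:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_sample(score, x_T, T=1.0, dt=1e-3, stochastic=True):
    """Walk from noise x_T back to an image using s(x, t) ~ grad log p_t(x)."""
    x, t = x_T.copy(), T
    while t > dt:
        if stochastic:  # SDE: full score step plus fresh noise
            x = x + dt * score(x, t) + np.sqrt(dt) * rng.normal(size=x.shape)
        else:           # ODE: half score step, fully deterministic
            x = x + 0.5 * dt * score(x, t)
        t -= dt
    return x
```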
Vincent Identity
Here we go into the core problem of learning a Diffusion model: how do we estimate the score $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$ of the noisy density, given only clean examples?
From the clean image $x \sim p_{\text{data}}$ we form the noisy version $\tilde{x} = x + \sigma e$ with $e \sim \mathcal{N}(0, I)$, so that the noisy density is $p_\sigma(\tilde{x}) = \int p_{\text{data}}(x)\,p(\tilde{x} \mid x)\,dx$. The Vincent identity states:
$$\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) = \mathbb{E}\left[\nabla_{\tilde{x}} \log p(\tilde{x} \mid x) \,\middle|\, \tilde{x}\right] = \mathbb{E}\left[\frac{x - \tilde{x}}{\sigma^2} \,\middle|\, \tilde{x}\right] = -\frac{1}{\sigma}\,\mathbb{E}\left[e \mid \tilde{x}\right].$$
From line 1 to line 2 we integrate over $x$: writing $\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) = \frac{\int p_{\text{data}}(x)\,\nabla_{\tilde{x}}\, p(\tilde{x} \mid x)\,dx}{p_\sigma(\tilde{x})}$ and recognizing $p(x \mid \tilde{x}) = p_{\text{data}}(x)\,p(\tilde{x} \mid x) / p_\sigma(\tilde{x})$ turns the integral into a conditional expectation.
Score Matching/Denoising Auto-Encoder
On the basis of the Vincent Identity, we have the denoising score matching objective: train a network to predict the conditional expectation by minimizing
$$\mathbb{E}\left[\left\| s_\theta(\tilde{x}) - \frac{x - \tilde{x}}{\sigma^2} \right\|^2\right], \quad \text{equivalently} \quad \mathbb{E}\left[\left\| e_\theta(\tilde{x}) - e \right\|^2\right],$$
whose minimizer is exactly the score (respectively, the posterior mean of the noise). This is a denoising auto-encoder: the network sees the noisy image and recovers the clean one.
U-Net: encoding the noisy version of the image to decode the clean version of the image
Under this implementation, we take relatively big steps to estimate the noise: the network $e_\theta(\tilde{x}_t, t)$ is conditioned on the noise level $t$ and trained to predict the total noise accumulated up to that level, via $\mathbb{E}\left[\|e - e_\theta(\tilde{x}_t, t)\|^2\right]$, rather than the small per-step increments.
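A hedged PyTorch sketch of one training step in the noise-prediction form; `noise_net` and its `(x, t)` signature are hypothetical stand-ins for the U-Net, and the linear schedule $\sigma(t) = t$ is a simplifying assumption:

```python
import torch

def dsm_step(noise_net, x0, opt):
    """One denoising score-matching step (epsilon-prediction form)."""
    n = x0.shape[0]
    t = torch.rand(n)                           # random noise level per example
    sigma = t.view(n, *([1] * (x0.dim() - 1)))  # broadcastable noise scale
    e = torch.randn_like(x0)
    x_noisy = x0 + sigma * e                    # corrupt the clean image
    loss = ((noise_net(x_noisy, t) - e) ** 2).mean()  # predict the added noise
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# By the Vincent identity, the score then follows as
# s(x, t) = -noise_net(x, t) / sigma(t).
```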
Another Formulation: Variational
We can alternatively reformulate the score-based method in a variational way. The forward process from $x_t$ to $x_{t+\Delta t}$ is a simple Gaussian $q(x_{t+\Delta t} \mid x_t)$; conditioned on the clean image $x_0$, the reverse conditional $q(x_t \mid x_{t+\Delta t}, x_0)$ is also a Gaussian available in closed form.
The derivation starts from applying the Bayes rule to obtain
$$q(x_t \mid x_{t+\Delta t}, x_0) = \frac{q(x_{t+\Delta t} \mid x_t)\,q(x_t \mid x_0)}{q(x_{t+\Delta t} \mid x_0)},$$
where every factor on the right-hand side is Gaussian, so the posterior is Gaussian as well.
Hence, this variational formulation transforms the extremely hard conditional distribution estimation into a very simple Gaussian distribution.
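As a worked instance (assuming the constant-noise forward process used above, so that $q(x_{t+\Delta t} \mid x_t) = \mathcal{N}(x_t, \Delta t\,I)$ and $q(x_t \mid x_0) = \mathcal{N}(x_0, t\,I)$; actual noise schedules differ), completing the square gives the posterior explicitly:
$$q(x_t \mid x_{t+\Delta t}, x_0) \propto \exp\left(-\frac{\|x_{t+\Delta t} - x_t\|^2}{2\Delta t} - \frac{\|x_t - x_0\|^2}{2t}\right) = \mathcal{N}\left(\frac{t\,x_{t+\Delta t} + \Delta t\,x_0}{t + \Delta t},\ \frac{t\,\Delta t}{t + \Delta t}\,I\right),$$
a Gaussian whose mean interpolates between the noisier image and the clean image.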
Recall our gold standard, MLE—as we have obtained the conditional distribution, we can naturally formulate the variational objective via KL-Divergence, maximizing the evidence lower bound (ELBO) on the log-likelihood:
$$\log p_\theta(x_0) \ge \mathbb{E}_{q(x_{\Delta t : T} \mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{\Delta t : T} \mid x_0)}\right].$$
In the reverse process, we can execute noise reduction by decomposing the KL-Divergence into per-step terms:
$$\sum_{t} \mathrm{KL}\left(q(x_t \mid x_{t+\Delta t}, x_0)\,\|\,p_\theta(x_t \mid x_{t+\Delta t})\right),$$
where each term is a KL between two Gaussians and thus has a closed form.
Relations with Other Generative Models
The Diffusion model can be viewed as an auto-regressive model in the time domain: it reverses time, going from white noise $x_T$ back to the image $x_0$, generating each $x_t$ conditioned on $x_{t+\Delta t}$—auto-regression over noise levels rather than over pixels.
The Diffusion model can be viewed as a flow-based model. A flow-based model starts from white noise $z$ and transforms it to the image through a sequence of invertible mappings learned by maximum likelihood via the change-of-variables formula; the reverse ODE of the Diffusion model is likewise a deterministic, invertible flow from noise to image, except that the transformation is fixed by the learned score rather than learned directly.
The Diffusion model can be viewed as a refined version of the Variational Auto-Encoder (VAE). A VAE starts from white noise $z$ and tries to generate the image in a single step through a decoder, with an encoder approximating the intractable posterior $q(z \mid x)$.
A VAE estimates the posterior by learning an inference network, and the aggregated posterior may mismatch the prior; the Diffusion model sidesteps this problem because its "encoder"—the forward noising process—is fixed rather than learned, and its per-step posterior is a simple closed-form Gaussian.
A visualization of the relation between different generative models from the perspective of the Diffusion model
The following animations show the three pairs of counterparts that need distinction.
Mode covering vs. mode chasing
Contrastive divergence vs. EM algorithm
Equilibrium sampling vs. non-equilibrium sampling
Bibliography
Divergence Triangle for Joint Training of Generator Model, Energy-based Model, and Inference Model - CVPR'19, 2019. [All Versions].
Deep Unsupervised Learning using Nonequilibrium Thermodynamics - ICML'15, 2015. [All Versions].
Denoising Diffusion Probabilistic Models - NeurIPS'20, 2020. [All Versions].
Score-Based Generative Modeling through Stochastic Differential Equations - ICLR'21, 2021. [All Versions].
Author and Citation Info
This tutorial entry was composed by Yu-Zhe Shi under the supervision of Dr. Ying Nian Wu.
Cite the Entry
@InCollection{shi2022generative,
author = {Shi, Yu-Zhe and Wu, Ying Nian},
title = {{Generative Modeling Explained}},
booktitle = {Statistical Machine Learning Tutorials},
howpublished = {\url{https://github.com/YuzheSHI/generative-modeling-explained}},
year = {2022},
edition = {{S}ummer 2022},
publisher = {Department of Statistics, UCLA}
}
Acknowledgement
The authors thank Dr. Yixin Zhu for helpful suggestions, Zhangqian Bi for debugging the markdown maths renderer, and Ms. Zhen Chen for helping design the animations.