22.08.06 Tutorial: Ying Nian Wu

Bio

Ying Nian Wu is currently a professor in the Department of Statistics, UCLA. He received his A.M. degree and Ph.D. degree in statistics from Harvard University in 1994 and 1996 respectively. He was an assistant professor in the Department of Statistics, University of Michigan from 1997 to 1999. He joined UCLA in 1999. He has been a full professor since 2006. Wu’s research areas include generative modeling, representation learning, computer vision, computational neuroscience, and bioinformatics.

Title

A Tutorial on Generative Models

Abstract

In this tutorial, I will review recent generative models, including GAN, VAE, flow-based models, energy-based models, and diffusion/score-based models. I will go over key equations and algorithms for learning these models, and I will also explain the connections between these models.

Replay (requires a VPN and a viewing password)

Slides

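A Note from Dr. Zhu

Professor Ying Nian Wu, known in the community as "Lao Wu", has collaborated with Dean Song-Chun Zhu for nearly 30 years. He graduated from Harvard University, where he studied under the great statistician Don Rubin (the one who invented the EM algorithm and was the first to make clear, in statistical terms, what causality is; have you ever seen a theorist with 300k citations?). Professor Wu has worked on generative algorithms since the 1990s and has co-authored with Dean Zhu many influential papers on generative vision, including FRAME, one of Dean Zhu's signature works (the first paper to explain texture clearly), and Active Basis, which received a Marr Prize nomination (the paper that drew me to Dean Zhu's lab back then). To be fair, the EBMs that are so popular these days are what Professor Wu was already done playing with years ago. Lao Wu remains active on the front line of research to this day and publishes first-author papers in top statistics journals (interested readers can look up "A tale of three probabilistic families: discriminative, descriptive and generative models"). Professor Wu's machine learning course is very well known at UCLA. Watching Lao Wu derive equations is a pleasure, and he can explain the relations between the various models very clearly (you definitely will not find this on YouTube!). I hope everyone can feel the charm of mathematics in this 2-hour tutorial.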

Photos

Poster

Notes: Generative Modeling Explained

Organized by Yu-Zhe Shi; see original post on GitHub.

This tutorial on generative modeling is part of the Statistical Machine Learning Tutorial by Ying Nian Wu at UCLA Statistics. The tutorial goes over the key equations and algorithms for learning recent generative models, including energy-based models, diffusion/score-based models, autoregressive/flow-based models, VAEs, and GANs, and explains the connections between these models. In contrast to most current tutorials on generative modeling, which take the perspective of machine learning, this tutorial is unique in providing a more basic and natural perspective from statistics. Starting with very basic probability background, the tutorial is extremely learner-friendly.

Highlights & Significance

The tutorial connects different families of generative models from multiple perspectives—original formulation, the essence of sampling process, and variational formulation.

Sampling from a highly multi-modal distribution is extremely hard. The Diffusion model factorizes the problem of sampling from such a distribution into a thousand small incremental steps, making the sampling much more tractable. VAE follows an elegant formulation that tries to sample the data distribution in a single trial; however, the estimated aggregated posterior may mismatch the prior. GAN also suffers from the single-trial limitation, but uses a discriminator to guide the generation.

Dr. Wu introduces a golf analogy for understanding the relations between the generative models. In this perspective, the model expressivity, the sampling process, and the data modality are analogous to the number of balls, the number of strokes, and the number of holes, respectively: more balls make it possible to cover more holes, and more strokes mean sending a ball to a hole with more patience. Readers may employ this analogy to understand the pros and cons of different generative models. Also see the relation between generative models from the perspective of the Diffusion model.

golf
A visualization of the golf analogy

Prerequisite: Probability Density

As long as you can count, you understand everything about probability density.

Consider a cloud of $n$ example data points in 2-D space. The probability density tells you how the points are distributed. As the number of data points becomes extremely large ($n \to \infty$), we have an almost continuous cloud representing the data.

probdens
A visual demonstration of probabilistic density

To analyze the continuous density, we can discretize the space into cells with $\Delta x \to 0$ and $\Delta y \to 0$.

Understand Basic Concepts by Counting

Consider the number of data points in a cell (the shaded area): $n(x,y) = \text{Count}\big((x, x+\Delta x) \times (y, y+\Delta y)\big)$. Then $n(x)$ is the number of points in the vertical bin $(x, x+\Delta x)$, $n(x) = \sum_y n(x,y)$, and similarly $n(y)$ is the number of points in the horizontal bin $(y, y+\Delta y)$, $n(y) = \sum_x n(x,y)$.

Joint Density

On this basis, the joint density is $p(x,y) = \frac{n(x,y)/n}{\Delta x \, \Delta y}$, where the count is normalized by the total number of points and by the area of the cell. Visually, $p(x,y)\,\Delta x\,\Delta y$ is the proportion of all points that fall in the cell $(x, x+\Delta x) \times (y, y+\Delta y)$. This is in line with the general notion of density, such as population density or energy density.

Given the most important concept, we can work on the three elementary operations—marginalization, conditioning, and factorization.

Marginalization

Calculating $p(x)$ given $p(x,y)$ is marginalization. This projects the two-dimensional view onto the one-dimensional x-axis. Visually, $p(x)\,\Delta x$ is the proportion of all points that fall in the bin $(x, x+\Delta x)$. Hence, we have: $p(x) = \frac{n(x)/n}{\Delta x} = \int p(x,y)\,dy$.

Conditioning

Calculating $p(y|x)$ given $p(x,y)$ is conditioning. This means that we focus only on the points in a particular bin $(x, x+\Delta x)$, ignoring all the other bins. Visually, $p(y|x)\,\Delta y$ is the proportion of the points in the cell $(x, x+\Delta x) \times (y, y+\Delta y)$ relative to only the points in the bin $(x, x+\Delta x)$. Hence, we have: $p(y|x) = \frac{n(x,y)/n(x)}{\Delta y} = \frac{p(x,y)}{p(x)}$.

Factorization

On the basis of conditioning and marginalization, we have the factorization operation: $p(x,y) = p(x)\,p(y|x)$. This equation shows that the joint distribution is central to working with probability densities: given the joint distribution, we can carry out all of the other operations.
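The counting view can be tried out directly. The following is a minimal sketch, assuming a correlated 2-D Gaussian cloud as stand-in data and a 40-by-40 grid of cells (both are illustrative choices, not part of the tutorial). It builds the joint, marginal, and conditional densities from counts and checks that marginalization and factorization hold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Assumption: a correlated 2-D Gaussian cloud stands in for the data points.
points = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)

edges = np.linspace(-4.0, 4.0, 41)          # 40 bins per axis
dx = dy = edges[1] - edges[0]

# counts[i, j] = n(x, y): number of points in cell (x_i, x_i+dx) x (y_j, y_j+dy)
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[edges, edges])

p_xy = counts / n / (dx * dy)               # joint density p(x, y)
n_x = counts.sum(axis=1)                    # n(x) = sum_y n(x, y)
p_x = n_x / n / dx                          # marginal density p(x)

occupied = n_x > 0                          # conditioning needs a non-empty bin
p_y_given_x = counts[occupied] / n_x[occupied, None] / dy   # p(y | x)

marginal_from_joint = p_xy.sum(axis=1) * dy                 # "integral" of p(x, y) dy
factorized = p_x[occupied, None] * p_y_given_x              # p(x) p(y | x)

print(np.allclose(marginal_from_joint, p_x))
print(np.allclose(factorized, p_xy[occupied]))
```

Each row of `p_y_given_x`, multiplied by `dy`, sums to one, which is the counting version of saying that a conditional density is itself a density.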

Expectation

The expectation $\mathbb{E}$ measures the long-run average under the corresponding distribution. Visually, $\mathbb{E}_{p(x,y)}$ is the average over all the $n$ points, $\mathbb{E}_{p(x)}$ is the average over all points projected onto the x-axis, and $\mathbb{E}_{p(y|x)}$ is the average over the points in the vertical bin $(x, x+\Delta x)$.

The Core Problem: Density Estimation

The gold standard for density estimation is Maximum Likelihood Estimation.

Real-world problems are not about counting points in a 2-D space. In fact, most data lives in a high-dimensional space, and the number of examples $n$ is always finite. Thus, we have no access to the ideal continuous picture with enough points to count in each cell. The space is usually a high-dimensional cube (a.k.a. hypercube), and the points tend to concentrate near its surface. This problem is generally known as the curse of dimensionality.

Given the problem, what could we do? The intuitive way is to estimate a function to capture the properties of such probabilistic density. We have to parametrize the probabilistic density and try to learn the underlying parameters from finite examples. We hope the learned density of the finite examples can generalize to infinite examples.

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is the basic idea to estimate a density function.

We are given $n$ examples $x_1, x_2, \dots, x_n \sim p_{data}(x)$, $x \in \mathbb{R}^D$, where each $x$ is a $D$-dimensional vector ($D$ can be very large) and the examples are independent and identically distributed. We design a model $p_\theta(x)$ to parametrize the density, where $\theta$ denotes all the parameters of the model, e.g., the weights of a neural network. In many cases, $p_\theta(x)$ is implicit and can only be obtained by the marginalization operation.

Now we come to the core of MLE, defining the log-likelihood function: $L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i)$. The logic behind taking the log of the density is that the density is always positive, and the log function maps it onto the whole real line. As the function is essentially an average over all the data points, we can write it in expectation form:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i) = \mathbb{E}_{p_{data}}[\log p_\theta(x)].$$

Our objective is to maximize the log-likelihood function, i.e., to find the $\theta$ that assigns the maximum probability density to the examples: $\hat{\theta}_{MLE} = \arg\max_\theta L(\theta)$.
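As a concrete instance, here is a minimal sketch of MLE for a one-dimensional Gaussian model, assuming toy data drawn from $\mathcal{N}(2, 1.5^2)$. The helper name `avg_log_likelihood` and the grid search are illustrative choices. For this family the maximizer of $L(\theta)$ is known in closed form (sample mean and standard deviation), and the brute-force grid only serves to confirm that no nearby parameter scores higher.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5_000)   # assumed toy data distribution

def avg_log_likelihood(mu, sigma, x):
    """L(theta) = (1/n) sum_i log p_theta(x_i) for a Gaussian p_theta."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# Closed-form maximizer of L(theta) within the Gaussian family.
mu_hat, sigma_hat = x.mean(), x.std()
best = avg_log_likelihood(mu_hat, sigma_hat, x)

# Brute-force check on a grid of (mu, sigma) values around the estimate.
mus = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 41)
sigmas = np.linspace(0.5 * sigma_hat, 1.5 * sigma_hat, 41)
grid = np.array([[avg_log_likelihood(m, s, x) for s in sigmas] for m in mus])

print(mu_hat, sigma_hat)
print(best >= grid.max() - 1e-9)
```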

Another Perspective on MLE: Kullback-Leibler Divergence

Kullback-Leibler Divergence (KL-Divergence) measures the difference between two distributions $p$ and $q$, and is formulated as: $D_{KL}(p \| q) = \mathbb{E}_p\left[\log \frac{p(x)}{q(x)}\right]$. The KL-Divergence view of MLE treats the two distributions as the ground-truth distribution of the data and the distribution estimated by the model, respectively. Hence, we have:

$$D_{KL}(p_{data} \| p_\theta) = \mathbb{E}_{p_{data}}[\log p_{data}(x)] - \mathbb{E}_{p_{data}}[\log p_\theta(x)] = -\text{Entropy}(p_{data}) - L(\theta).$$

Hence, maximizing $L(\theta)$ is the same as minimizing the KL-Divergence between $p_{data}$ and $p_\theta$. This KL-Divergence view provides some insight into density estimation: as $\theta$ varies, the model traces out a manifold (the so-called information geometry picture), and the data density is a point that may not belong to that manifold. MLE projects the data density onto the model manifold, which gives the best approximation to the data density within the model family.

Since we are calculating the expectation over pdata, we cannot miss any mode of the data, otherwise we get a rather big KL-Divergence. This is the mode covering behavior of generative models. This raises a core problem of generative modeling—if your model is not expressive enough, you end up with either diffused or dispersed densities. In the remainder of this tutorial, we are going over different models trying to resolve the problem.

modecovering
The mode covering behavior
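The mode covering behavior can be reproduced on a toy problem. The sketch below assumes bimodal data from an equal mixture of $\mathcal{N}(-3, 0.5^2)$ and $\mathcal{N}(3, 0.5^2)$ and a model family that is too weak (a single Gaussian); these choices are illustrative. The MLE fit spreads out to cover both modes, while a Gaussian locked onto one mode receives a far lower average log-likelihood because it ignores half of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Assumed bimodal data: equal mixture of N(-3, 0.5^2) and N(+3, 0.5^2).
modes = rng.choice([-3.0, 3.0], size=n)
x = modes + 0.5 * rng.normal(size=n)

def avg_log_likelihood(mu, sigma, x):
    """Average log-density of the data under a single Gaussian N(mu, sigma^2)."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# MLE within the single-Gaussian family: a wide density covering both modes.
mu_cover, sigma_cover = x.mean(), x.std()
ll_cover = avg_log_likelihood(mu_cover, sigma_cover, x)

# A Gaussian that sits on one mode only.
ll_one_mode = avg_log_likelihood(3.0, 0.5, x)

print(mu_cover, sigma_cover)
print(ll_cover, ll_one_mode)
```

The fitted standard deviation comes out close to 3, much wider than either mode, which is the "diffused" density that an insufficiently expressive model ends up with.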

Energy-based Model

Energy-based Model (EBM) is the most basic generative model.

Formulating EBM

The target density $p_\theta(x)$ can be obtained as: $\log p_\theta(x) = f_\theta(x) + \text{const}$, where $f_\theta(x)$ is a bottom-up neural network and const is a normalizing term. Exponentiating both sides, we have: $p_\theta(x) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$, where $f_\theta(x)$ is the negative energy and $Z(\theta) = \int \exp(f_\theta(x))\,dx$ is the partition function that normalizes $p_\theta(x)$.

Learning EBM

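We can calculate the derivative of $\log Z(\theta)$ with respect to $\theta$ using the chain rule, obtaining:

$$\nabla_\theta \log Z(\theta) = \frac{1}{Z(\theta)} \int \exp(f_\theta(x)) \nabla_\theta f_\theta(x)\,dx = \mathbb{E}_{p_\theta}[\nabla_\theta f_\theta(x)],$$

where we get an important property that $\nabla_\theta \log Z(\theta) = \mathbb{E}_{p_\theta}[\nabla_\theta f_\theta(x)]$.

Bringing the EBM formulation into the MLE formulation, we have:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i) = \frac{1}{n}\sum_{i=1}^{n} f_\theta(x_i) - \log Z(\theta),$$

and the derivative of $L(\theta)$ is:

$$\nabla_\theta L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta f_\theta(x_i) - \mathbb{E}_{p_\theta}[\nabla_\theta f_\theta(x)] = \mathbb{E}_{p_{data}}[\nabla_\theta f_\theta(x)] - \mathbb{E}_{p_\theta}[\nabla_\theta f_\theta(x)].$$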

However, computing the expectation under $p_\theta$ is extremely hard. We have to use Monte-Carlo Sampling to draw examples from the estimated density. The goal is to match the average of observed examples and the average of synthesized examples. The main difficulty in learning an EBM is that sampling from the model density is far from trivial.
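In one dimension the partition function can be computed by brute force, which makes it possible to check these formulas numerically. The sketch below assumes a hypothetical two-parameter energy $f_\theta(x) = \theta_1 x + \theta_2 x^2$ and replaces the integral with a sum over a fine grid; none of these choices come from the tutorial. It compares a finite-difference gradient of $\log Z(\theta)$ with the model expectation, and checks that the log-likelihood gradient (data average minus model average) vanishes at the moment-matching solution.

```python
import numpy as np

grid = np.linspace(-10.0, 10.0, 4001)       # quadrature grid standing in for the integral
dx = grid[1] - grid[0]

def features(x):
    """grad_theta f_theta(x) for the assumed f_theta(x) = theta[0]*x + theta[1]*x**2."""
    return np.stack([x, x**2], axis=-1)

def log_Z(theta):
    return np.log(np.sum(np.exp(features(grid) @ theta)) * dx)

def model_expectation(theta):
    """E_{p_theta}[grad_theta f_theta(x)] computed on the grid."""
    w = np.exp(features(grid) @ theta - log_Z(theta)) * dx   # p_theta(x) dx
    return w @ features(grid)

theta = np.array([0.3, -0.7])
eps = 1e-5
# Central finite differences of log Z, one coordinate of theta at a time.
fd = np.array([(log_Z(theta + eps * e) - log_Z(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])

rng = np.random.default_rng(0)
data = rng.normal(1.0, 0.5, size=2_000)     # assumed toy data

def grad_L(theta):
    """Log-likelihood gradient: average over data minus average over the model."""
    return features(data).mean(axis=0) - model_expectation(theta)

# For this family the MLE is the Gaussian whose moments match the data.
m, v = data.mean(), data.var()
theta_mle = np.array([m / v, -1.0 / (2.0 * v)])

print(fd, model_expectation(theta))
print(grad_L(theta_mle))
```

In higher dimensions the grid sum is not available, which is exactly why the model expectation has to be approximated by sampling.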

Contrastive Divergence

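Following the KL-Divergence perspective, we can also interpret the Monte-Carlo Sampling process for EBM in a similar way. Consider the model at step $t$ with parameters $\theta_t$; the Contrastive Divergence (CD) is:

$$C(\theta) = D_{KL}(p_{data}(x) \| p_\theta(x)) - D_{KL}(p_{\theta_t}(x) \| p_\theta(x)) = \mathbb{E}_{p_{data}}[\log p_{data}(x)] - \mathbb{E}_{p_{data}}[\log p_\theta(x)] - \mathbb{E}_{p_{\theta_t}}[\log p_{\theta_t}(x)] + \mathbb{E}_{p_{\theta_t}}[\log p_\theta(x)],$$

where the $\log Z(\theta)$ terms in $\mathbb{E}_{p_{data}}[\log p_\theta(x)]$ and $\mathbb{E}_{p_{\theta_t}}[\log p_\theta(x)]$ cancel each other. This gives the important property that $-\nabla_\theta L(\theta_t) = \nabla_\theta C(\theta_t)$, which makes the computation much more tractable.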

contrastive divergence
A visual demonstration of contrastive divergence
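To see why the $\log Z(\theta)$ terms cancel, it may help to write out the gradient of $C(\theta)$ using $\log p_\theta(x) = f_\theta(x) - \log Z(\theta)$. The following is a sketch under the EBM definitions above, with $\nabla_\theta$ denoting the gradient with respect to $\theta$ and $\theta_t$ held fixed:

$$\begin{aligned}
\nabla_\theta C(\theta) &= -\mathbb{E}_{p_{data}}[\nabla_\theta \log p_\theta(x)] + \mathbb{E}_{p_{\theta_t}}[\nabla_\theta \log p_\theta(x)] \\
&= -\mathbb{E}_{p_{data}}[\nabla_\theta f_\theta(x)] + \nabla_\theta \log Z(\theta) + \mathbb{E}_{p_{\theta_t}}[\nabla_\theta f_\theta(x)] - \nabla_\theta \log Z(\theta) \\
&= -\left(\mathbb{E}_{p_{data}}[\nabla_\theta f_\theta(x)] - \mathbb{E}_{p_{\theta_t}}[\nabla_\theta f_\theta(x)]\right).
\end{aligned}$$

Evaluated at $\theta = \theta_t$, the bracket is the log-likelihood gradient of the EBM, so descending on $C$ at the current parameters moves in the same direction as ascending on $L$, with the model expectation taken over samples from the current model.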

Another Interpretation: Self-Adversarial Training

If we treat the current model $p_{\theta_t}$ as an actor generating synthesized examples, and the learned model $p_\theta$ as a critic, then the current model is criticized so that it gets closer to the real data density and moves away from the current model density. This gives an adversarial interpretation of CD, and it reflects the idea of W-GAN.

modechasing
The mode chasing behavior in W-GAN

Another Interpretation: Noise Contrastive Estimation

Noise Contrastive Estimation (NCE) introduces a reference distribution $q(x)$ into the formulation of $p_\theta(x)$: $p_\theta(x) = \frac{1}{Z(\theta)} \exp(f_\theta(x))\,q(x)$, where $q(x)$ serves as a more informative density, e.g., white noise $\mathcal{N}(0, I)$, than the uniform density in the original formulation of EBM.

To learn the model, a more practical way is to view the problem from an adversarial perspective. Suppose we draw $n$ true examples from the model, $x_1, \dots, x_n \sim p_\theta$, labelled with $y_i = 1$, and $n$ fake examples from the noise, $\tilde{x}_1, \dots, \tilde{x}_n \sim q$, labelled with $y_i = 0$. Then the posterior probability that a given $x$ is a true example is $p(y=1|x) = \frac{\frac{1}{2} p_\theta(x)}{\frac{1}{2} p_\theta(x) + \frac{1}{2} q(x)}$. Applying the Bayes rule, we obtain $\log \frac{p(y=1|x)}{p(y=0|x)} = f_\theta(x) + \text{const}$, where the left-hand side defines a discriminator and $f_\theta(x)$ is the logit score. Hence, we are essentially learning a binary classifier, which is much easier than MLE because we do not need to deal with the model density.

Sampling Process for Learning EBM

Small steps get us to faraway places. 不積跬步,無以至千里。

As mentioned above, the core problem of generative modeling is estimating the model density. In this section, we start by reviewing a commonly used sampling method, Langevin Dynamics, a special case of Markov Chain Monte Carlo (MCMC) sampling.

Langevin Dynamics

We cannot sample from the model density all at once. Hence, we use Langevin Dynamics to sample in steps along the time axis. Here we denote the target distribution $p_\theta(x)$ as $\pi(x)$. Starting from a noise distribution, we hope that after a number of sampling steps the distribution of $x_t$ reaches $\pi(x)$ as its stationary distribution, or equilibrium distribution.

We discretize the time axis so that each $\Delta t$ is one iteration step. The single-step update is: $x_{t+\Delta t} = x_t + \frac{\Delta t}{2} \nabla_x \log \pi(x_t) + e_t \sqrt{\Delta t}$, where $\nabla_x \log \pi(x)$ is the gradient of the log of the target density, called the score, which drives a gradient ascent step. The term $e_t$ is a random variable drawn at each step with $\mathbb{E}[e_t] = 0$ and $\text{Var}[e_t] = I$. The scalar $\sqrt{\Delta t}$ scales the perturbation. Langevin Dynamics is a general process for sampling from an arbitrary density.
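The update rule translates almost line by line into code. Below is a minimal sketch, assuming a one-dimensional Gaussian target $\pi = \mathcal{N}(2, 1)$ whose score is known in closed form, and many particles run in parallel from a broad noise initialization. The function name `langevin_sample` and all numeric settings are illustrative assumptions.

```python
import numpy as np

def langevin_sample(score, x0, dt, n_steps, rng):
    """Iterate x <- x + (dt/2) * score(x) + sqrt(dt) * e with e ~ N(0, I)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        e = rng.standard_normal(x.shape)
        x = x + 0.5 * dt * score(x) + np.sqrt(dt) * e
    return x

mu, sigma = 2.0, 1.0

def score(x):
    """grad_x log pi(x) for the assumed target pi = N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x0 = 3.0 * rng.standard_normal(5_000)       # particles start from broad noise
samples = langevin_sample(score, x0, dt=0.01, n_steps=2_000, rng=rng)

print(samples.mean(), samples.std())
```

With a small step size the particle cloud settles near the target mean and standard deviation. A finite $\Delta t$ leaves a small bias in the stationary variance, which shrinks as $\Delta t \to 0$.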

Stochastic Differential Equation

When the time step becomes very small, i.e., $\Delta t \to 0$, the update equation becomes a Stochastic Differential Equation (SDE): $dx_t = \frac{1}{2} \nabla_x \log \pi(x_t)\,dt + dB_t$, where $dB_t$ denotes the limit of $e_t \sqrt{\Delta t}$.

A more sophisticated version of MCMC Sampling is Hamiltonian Monte-Carlo (HMC), which adds a momentum to smooth the trajectory.

Understanding Langevin Dynamics: Equilibrium Sampling

An interesting observation about Langevin Dynamics is that once $p_t(x)$ has converged to $\pi(x)$, Langevin Dynamics will maintain the marginal distribution $\pi(x)$ in the following steps, keeping it at equilibrium.

How does this come about? Let us look back at the two terms of the Langevin update: the gradient ascent term $\frac{\Delta t}{2} \nabla_x \log \pi(x)$ and the perturbation term $e_t \sqrt{\Delta t}$. Imagine that we are working with many particles instead of sampling only one point at a time. In the gradient ascent step, the particles move along the gradient and are squeezed toward areas with more particles, which makes the density peaks sharper. In the perturbation step, we add variance and the particles diffuse, which makes the density smoother. The two terms counteract each other: the gradient ascent term sharpens the density at local modes, while the perturbation term disperses the local modes and smooths the density.

langevin
Explaining Langevin Dynamics with equilibrium sampling: (1) gradient ascent as squeezing; (2) random perturbation as diffusion

To analyze the phenomenon mathematically, we may look into the Taylor expansion of the expectation of a test function, $\mathbb{E}[h(x_{t+\Delta t})]$. Expanding with respect to the gradient term $\frac{\Delta t}{2} \nabla_x \log \pi(x)$ leads to a first-order Taylor remainder, and expanding with respect to the perturbation term $e_t \sqrt{\Delta t}$ leads to a second-order Taylor remainder. Since the two terms have opposite signs (see the detailed derivations below), they cancel each other. This is identified as the Fokker-Planck effect.

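The derivation for the first-order Taylor expansion in gradient ascent is as follows:

$$\begin{aligned}
\mathbb{E}[h(x_{t+\Delta t})] &= \mathbb{E}\left[h\left(x_t + \tfrac{\Delta t}{2} \nabla_x \log \pi(x_t)\right)\right] \\
&= \mathbb{E}\left[h(x_t) + h'(x_t) \tfrac{\Delta t}{2} \nabla_x \log \pi(x_t)\right] + o(\Delta t) \\
&= \mathbb{E}[h(x_t)] + \int h'(x_t) \left(\tfrac{\Delta t}{2} \nabla_x \log \pi(x_t)\right) \pi(x_t)\,dx_t \\
&= \mathbb{E}[h(x_t)] + \tfrac{\Delta t}{2} \int h'(x_t)\,\pi'(x_t)\,dx_t \\
&= \mathbb{E}[h(x_t)] + \tfrac{\Delta t}{2} \left[h'(x_t)\pi(x_t)\Big|_{-\infty}^{\infty} - \int h''(x_t)\,\pi(x_t)\,dx_t\right] \\
&= \mathbb{E}[h(x_t)] - \tfrac{\Delta t}{2}\,\mathbb{E}[h''(x_t)].
\end{aligned}$$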

This derivation shows that the remainder of the Taylor expansion of gradient ascent is negative.

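The derivation for the second-order Taylor expansion in diffusion is as follows:

$$\begin{aligned}
\mathbb{E}[h(x_{t+\Delta t})] &= \mathbb{E}\left[h\left(x_t + \sqrt{\Delta t}\,e_t\right)\right] \\
&= \mathbb{E}\left[h(x_t) + h'(x_t)\sqrt{\Delta t}\,e_t + \tfrac{1}{2} h''(x_t)\left(\sqrt{\Delta t}\,e_t\right)^2\right] + o(\Delta t) \\
&= \mathbb{E}[h(x_t)] + \tfrac{\Delta t}{2}\,\mathbb{E}[h''(x_t)\,e_t^2] \\
&= \mathbb{E}[h(x_t)] + \tfrac{\Delta t}{2}\,\mathbb{E}\left[\mathbb{E}[h''(x_t)\,e_t^2 \mid x_t]\right] \\
&= \mathbb{E}[h(x_t)] + \tfrac{\Delta t}{2}\,\mathbb{E}[h''(x_t)].
\end{aligned}$$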

This derivation shows that the remainder of the Taylor expansion of diffusion is positive.
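The cancellation can be observed numerically. This is a minimal sketch, assuming particles already distributed as $\pi = \mathcal{N}(0, 1)$ and the test function $h(x) = x^2$, for which $h''(x) = 2$; both are illustrative choices. The gradient step alone should change $\mathbb{E}[h]$ by about $-\Delta t$, the noise step alone by about $+\Delta t$, and the full Langevin step should leave it almost unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dt = 400_000, 0.05
x = rng.standard_normal(n)      # particles already at equilibrium, pi = N(0, 1)
e = rng.standard_normal(n)      # perturbation e_t
score = -x                      # grad_x log pi(x) for the standard Gaussian

def h(z):
    """Assumed test function with h''(z) = 2."""
    return z**2

base = h(x).mean()
after_gradient = h(x + 0.5 * dt * score).mean()                      # gradient ascent only
after_noise = h(x + np.sqrt(dt) * e).mean()                          # perturbation only
after_both = h(x + 0.5 * dt * score + np.sqrt(dt) * e).mean()        # full Langevin step

print(after_gradient - base)    # close to -dt
print(after_noise - base)       # close to +dt
print(after_both - base)        # close to 0
```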

On the basis of equilibrium sampling, we now introduce score-based/diffusion models.

Tempering & Annealing

Despite the merit of equilibrium sampling, Langevin Dynamics suffers from very slow convergence, especially when the model density has many localized modes (i.e., it is highly multi-modal). To address this problem, we introduce a temperature parameter $\beta \in [0, 1]$ into the EBM formula: $\pi_\beta(x) = \frac{1}{Z_\beta} \exp(\beta f(x))\,q(x)$, where $q(x)$ is $\mathcal{N}(0, I)$. As $\beta$ increases from 0 to 1, the target moves from a simple Gaussian to a highly multi-modal density. This process is called Simulated Annealing. A principled implementation runs parallel chains that draw samples at temperatures from $\beta = 0$ to $\beta = 1$ simultaneously, with samples exchanged among the chains (a scheme commonly known as parallel tempering).

Diffusion/Score-based Models

Imagine you are playing golf. You can see exactly where the hole $x_0$ is, but you want to use a thousand strokes to get the ball to the hole. You do not want to do it in a single stroke, because the chance of hitting the hole is very small. Rather, you see where the hole is and move toward it in small steps.

Unlike the EBM, where we directly target the log-density, the Diffusion model essentially learns a sampling process. It decomposes sampling from the density into a large number of very small incremental steps.

Forward: From Clean Image to Noise

The forward process of Diffusion model is gradually adding noise to a clean image until it becomes a Gaussian, using non-equilibrium sampling.

Let $x_0$ denote the clean image and $x_t$ denote the image at noise level $t$. Going from $x_t$ to $x_{t+\Delta t}$, we have: $x_{t+\Delta t} = x_t + \delta e_t \sqrt{\Delta t} - \frac{\delta^2 \Delta t}{2} \nabla_x \log p_t(x_t)$. Here $\delta e_t \sqrt{\Delta t}$ is the random step that adds perturbation, and $-\frac{\delta^2 \Delta t}{2} \nabla_x \log p_t(x_t)$ is the deterministic step of gradient descent. Recall the Fokker-Planck effect introduced for Langevin Dynamics. The only difference lies in the deterministic step: in contrast to the gradient ascent in the Langevin update, the forward update of the Diffusion model applies gradient descent. Consequently, its effect is the opposite: in the gradient descent step, the particles are pushed away from areas with more particles, which flattens the density peaks. Hence, both the deterministic step and the random step disperse the density.

We can also look into the Taylor expansion of the expectation of the test function, $\mathbb{E}[h(x_{t+\Delta t})]$. Expanding with respect to $-\frac{\delta^2 \Delta t}{2} \nabla_x \log p_t(x)$ leads to a first-order Taylor remainder and expanding with respect to $\delta e_t \sqrt{\Delta t}$ leads to a second-order Taylor remainder. Since the two terms have the same sign, instead of cancelling each other, they reinforce each other. This is the non-equilibrium sampling process.

nonequilibrium forward
A visual demonstration of the forward time in non-equilibrium sampling
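A one-dimensional Gaussian makes the dispersion explicit. As a sketch, assume $p_t = \mathcal{N}(0, s_t^2)$, so that $\nabla_x \log p_t(x) = -x / s_t^2$ (the symbol $s_t$ is introduced here only for illustration), and look at each of the two steps on its own:

$$\begin{aligned}
\text{random step:} \quad & x_t + \delta e_t \sqrt{\Delta t} & \Rightarrow \quad & \text{Var} = s_t^2 + \delta^2 \Delta t, \\
\text{deterministic step:} \quad & x_t - \tfrac{\delta^2 \Delta t}{2}\left(-\tfrac{x_t}{s_t^2}\right) = x_t\left(1 + \tfrac{\delta^2 \Delta t}{2 s_t^2}\right) & \Rightarrow \quad & \text{Var} = s_t^2\left(1 + \tfrac{\delta^2 \Delta t}{2 s_t^2}\right)^2 = s_t^2 + \delta^2 \Delta t + O(\Delta t^2).
\end{aligned}$$

Each step widens the Gaussian by $\delta^2 \Delta t$ to first order in $\Delta t$. With gradient ascent instead, the factor would be $1 - \frac{\Delta t}{2 s_t^2}$, which narrows the density and is what balances the noise in Langevin Dynamics.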

Reverse: From Noise Back to Image

After changing the clean image into white noise, now we are trying to walk back.

We only need to reverse the deterministic step. Going from $x_{t+\Delta t}$ back to $x_t$, we have: $\tilde{x}_t = x_{t+\Delta t} + \frac{\delta^2 \Delta t}{2} \nabla_x \log p_{t+\Delta t}(x_{t+\Delta t})$.

Ordinary & Stochastic Differential Equation

Similar to Langevin Dynamics, we have two variants of the reverse update when the time step becomes very small (i.e., $\Delta t \to 0$).

If we only consider the deterministic step, we have an Ordinary Differential Equation (ODE): $dx_t = -\frac{\delta^2}{2} \nabla_x \log p_t(x_t)\,dt$, where $p_t(x_t)$ is the image density at noise level $t$ and, in the reverse direction, $dt$ is a negative time increment.

nonequilibrium reverse ODE
A visual demonstration of the reverse time in non-equilibrium sampling (ODE)

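If we also consider the random step, we have a Stochastic Differential Equation (SDE): $dx_t = -\delta^2 \nabla_x \log p_t(x_t)\,dt + \delta\,d\tilde{B}_t$, where $d\tilde{B}_t$ denotes the limit of $\tilde{e}_t \sqrt{\Delta t}$ and $dt$ is again a negative time increment. The SDE formulation can be interpreted as going two reverse steps from $p_{t+\Delta t}$ to $p_{t-\Delta t}$, then going one forward step from $p_{t-\Delta t}$ to $p_t$. This has the merit that the SDE is more applicable when we cannot estimate $p_t$ accurately.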

nonequilibrium reverse SDE
A visual demonstration of the reverse time in non-equilibrium sampling (SDE)
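Both reverse updates can be tried on a toy density where the score is known exactly, so no network is needed. The sketch below assumes a two-mode data density $p_0 = \frac{1}{2}\mathcal{N}(-2, 0.3^2) + \frac{1}{2}\mathcal{N}(2, 0.3^2)$ and a forward process that only adds noise, so that $p_t$ is the same mixture with variance $0.3^2 + \delta^2 t$. The function names, the Euler discretization, and all numeric settings are illustrative assumptions rather than the tutorial's implementation.

```python
import numpy as np

means = np.array([-2.0, 2.0])               # modes of the assumed data density p_0
s0, delta, T, dt = 0.3, 1.0, 9.0, 0.01

def score(x, t):
    """Exact grad_x log p_t(x) for p_t = 0.5 N(-2, v) + 0.5 N(2, v), v = s0^2 + delta^2 t."""
    v = s0**2 + delta**2 * t
    logits = -(x[:, None] - means) ** 2 / (2 * v)
    logits -= logits.max(axis=1, keepdims=True)          # stabilize the softmax
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)              # p(mode | x)
    return (resp * (means - x[:, None])).sum(axis=1) / v

def reverse(x, stochastic, rng):
    """Walk from noise level T back to 0 using the score at the current level."""
    n_steps = int(round(T / dt))
    for k in range(n_steps, 0, -1):
        t = k * dt
        if stochastic:   # SDE-style step: full drift plus fresh noise
            noise = rng.standard_normal(x.shape)
            x = x + delta**2 * dt * score(x, t) + delta * np.sqrt(dt) * noise
        else:            # ODE-style step: half drift, no noise
            x = x + 0.5 * delta**2 * dt * score(x, t)
    return x

rng = np.random.default_rng(0)
n = 4_000
# Samples from p_T: a data point plus the accumulated forward noise.
x_T = rng.choice(means, size=n) + np.sqrt(s0**2 + delta**2 * T) * rng.standard_normal(n)

x_ode = reverse(x_T.copy(), stochastic=False, rng=rng)
x_sde = reverse(x_T.copy(), stochastic=True, rng=rng)

print(np.abs(x_ode).mean(), np.abs(x_ode).std())
print(np.abs(x_sde).mean(), np.abs(x_sde).std())
```

In both cases the particles end up concentrated around the two modes at $\pm 2$. In practice the exact score is unavailable and is replaced by a learned estimate.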

Vincent Identity

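Here we come to the core problem of learning a Diffusion model: how do we estimate the score $\nabla_x \log p_t(x_t)$? The Vincent Identity provides a very powerful tool: $\nabla_x \log p_t(x_t) = -\mathbb{E}_{p(x_0|x_t)}[(x_t - x_0)/\sigma_t^2]$.

From the clean image $x_0 \sim p_0$, the Diffusion model gradually adds noise in the forward process to reach $x_t \sim p_t$, so that $x_t = x_0 + \mathcal{N}(0, \sigma_t^2 I)$, where $\sigma_t^2$ is the noise variance accumulated over the $t$ levels. Then we derive the Vincent Identity:

$$\begin{aligned}
\nabla_x \log p_t(x_t) &= \frac{1}{p_t(x_t)} \nabla_x p_t(x_t) \\
&= \frac{1}{p_t(x_t)} \int \nabla_{x_t} p(x_0, x_t)\,dx_0 \\
&= \frac{1}{p_t(x_t)} \int \left(\nabla_{x_t} \log p(x_0, x_t)\right) p(x_0, x_t)\,dx_0 \\
&= \int \left(\nabla_{x_t} \log p(x_0, x_t)\right) \frac{p(x_0, x_t)}{p_t(x_t)}\,dx_0 \\
&= \int \left(\nabla_{x_t} \log p(x_0, x_t)\right) p(x_0|x_t)\,dx_0 \\
&= \mathbb{E}_{p(x_0|x_t)}\left[\nabla_{x_t} \log\left(p(x_0)\,p(x_t|x_0)\right)\right] \\
&= \mathbb{E}_{p(x_0|x_t)}\left[\nabla_{x_t} \left(\log p(x_0) + \log p(x_t|x_0)\right)\right] \\
&= \mathbb{E}_{p(x_0|x_t)}\left[\nabla_{x_t} \log p(x_t|x_0)\right] \\
&= -\mathbb{E}_{p(x_0|x_t)}\left[\nabla_{x_t} \|x_t - x_0\|^2 / 2\sigma_t^2\right] \\
&= -\mathbb{E}_{p(x_0|x_t)}\left[(x_t - x_0)/\sigma_t^2\right].
\end{aligned}$$

From line 1 to line 2 we write the marginal $p_t(x_t)$ as the integral of the joint distribution $p(x_0, x_t)$ over $x_0$. In line 3 we employ a common trick for differentiating a density inside an integral, $\nabla p = p\,\nabla \log p$, which brings the log into the derivative. From line 3 to line 5 we move $\frac{1}{p_t(x_t)}$ into the integral and obtain the conditional density. In line 6 we rewrite the integral as an expectation. Line 7 shows the merit of introducing the log: the factorized joint distribution becomes a simple sum whose terms can be differentiated separately, and the term $\log p(x_0)$ does not depend on $x_t$, so it vanishes.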
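The identity can be verified numerically in a case where the posterior is available in closed form. The sketch below assumes a toy prior with two equally likely "clean images", $x_0 \in \{-1, +1\}$, and a single noise level $\sigma = 0.8$; the helper names are illustrative. It compares a finite-difference derivative of $\log p_t$ with the right-hand side of the identity.

```python
import numpy as np

sigma = 0.8
x0_values = np.array([-1.0, 1.0])           # assumed two-point prior over clean images
prior = np.array([0.5, 0.5])

def gauss(x, mean, sigma):
    return np.exp(-(x - mean) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def p_t(x):
    """Marginal density of x_t = x_0 + N(0, sigma^2)."""
    return sum(w * gauss(x, m, sigma) for w, m in zip(prior, x0_values))

def posterior_mean(x):
    """E[x_0 | x_t = x] computed with Bayes' rule."""
    joint = np.stack([w * gauss(x, m, sigma) for w, m in zip(prior, x0_values)])
    post = joint / joint.sum(axis=0)        # p(x_0 | x_t)
    return (post * x0_values[:, None]).sum(axis=0)

xt = np.linspace(-3.0, 3.0, 61)
eps = 1e-5
# Left-hand side: score by central finite differences of log p_t.
score_fd = (np.log(p_t(xt + eps)) - np.log(p_t(xt - eps))) / (2 * eps)
# Right-hand side: -E[(x_t - x_0) / sigma^2 | x_t].
score_identity = -(xt - posterior_mean(xt)) / sigma**2

print(np.max(np.abs(score_fd - score_identity)))
```

For this prior the posterior mean equals $\tanh(x_t / \sigma^2)$, so the score pulls $x_t$ toward whichever clean image is more plausible.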

Score Matching/Denoising Auto-Encoder

On the basis of the Vincent Identity, we have: $\mathbb{E}_{p(x_0|x_t)}[x_0] = x_t + \sigma_t^2 \nabla_x \log p_t(x_t)$, where $x_0$ is the clean image and $\sigma_t^2$ is the accumulated noise variance. Hence, we can estimate the score in a regression fashion, where the objective is to predict $x_0$ given $x_t$: $\min_\theta \|x_0 - (x_t + \sigma_t^2 s_\theta(x_t, t))\|^2$. This gives us a Denoising Auto-Encoder. We can parametrize it with a U-Net, which encodes the noisy version of the image and decodes back to the clean version, with the encoder and decoder linked by skip connections. We can learn a single U-Net for all noise levels by taking both the noisy image $x_t$ and the level $t$ as inputs. $t$ can be embedded as an expressive vector built from $\sin(\omega t)$ and $\cos(\omega t)$ terms, which is similar to the positional encoding in the Transformer model.

unet
U-Net: encoding the noisy version of the image to decode the clean version of the image

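Under this implementation, we take relatively big steps to estimate the noise. Substituting $x_t = x_0 + \varepsilon$ into the objective gives $\|\varepsilon + \sigma_t^2 s_\theta(x_0 + \varepsilon, t)\|^2$, where $\varepsilon \sim \mathcal{N}(0, \sigma_t^2 I)$ is the accumulated noise. We then take relatively small steps to denoise: $\tilde{x}_{t-\Delta t} = x_t + \frac{\sigma^2 \Delta t}{2} s_\theta(x_t, t)$.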
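The regression view can be illustrated without a neural network. This is a minimal sketch, assuming clean data $x_0 \sim \mathcal{N}(0, 1)$, a single noise level $\sigma_t = 0.5$, and a hypothetical one-parameter score model $s_\theta(x_t) = \theta x_t$ standing in for the U-Net. In this setting $p_t = \mathcal{N}(0, 1 + \sigma_t^2)$, so the true score is $-x_t / (1 + \sigma_t^2)$ and the regression should recover $\theta \approx -1/(1 + \sigma_t^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma_t = 200_000, 0.5
x0 = rng.standard_normal(n)                     # assumed clean data, x_0 ~ N(0, 1)
xt = x0 + sigma_t * rng.standard_normal(n)      # noisy version, x_t = x_0 + N(0, sigma_t^2)

# Hypothetical score model s_theta(x_t) = theta * x_t.
# Minimizing mean (x0 - (xt + sigma_t^2 * theta * xt))^2 over theta is ordinary
# least squares, which has a closed-form solution.
theta_hat = np.sum((x0 - xt) * xt) / (sigma_t**2 * np.sum(xt**2))

# True score coefficient for p_t = N(0, 1 + sigma_t^2).
theta_true = -1.0 / (1.0 + sigma_t**2)

# Denoised estimate of E[x_0 | x_t] using the learned score.
denoised = xt + sigma_t**2 * theta_hat * xt

print(theta_hat, theta_true)
print(np.mean((denoised - x0) ** 2), np.mean((xt - x0) ** 2))
```

The denoised estimate has a lower mean squared error than the noisy input, which is the sense in which the learned score acts as a denoiser.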

Another Formulation: Variational

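We can alternatively recast the score-based method in a variational form. The forward process from $x_{t-\Delta t}$ to $x_t$ is quite similar to that of score-based methods: $x_t = x_{t-\Delta t} + \mathcal{N}(0, \sigma^2 \Delta t I)$. But in the reverse process for estimating $x_0$, variational methods focus on the conditional distribution, in contrast to score-based methods, which focus on marginal distributions:

$$p(x_{t-\Delta t}|x_t) \propto p(x_{t-\Delta t})\,q(x_t|x_{t-\Delta t}), \qquad \log p(x_{t-\Delta t}|x_t) = \log p(x_{t-\Delta t}) - \frac{1}{2\sigma^2 \Delta t}\|x_{t-\Delta t} - x_t\|^2 + \text{const}.$$

The derivation starts from applying the Bayes rule to obtain $p(x_{t-\Delta t}|x_t)$. As $q(x_t|x_{t-\Delta t})$ is Gaussian noise, we can approximate the conditional density by a Gaussian density when $\Delta t$ is very small, i.e., $\Delta t \to 0$. Applying a first-order Taylor expansion of $\log p(x_{t-\Delta t})$ around $x_t$, we have:

$$\begin{aligned}
\log p(x_{t-\Delta t}|x_t) &\approx \log p(x_t) + \nabla_x \log p(x_t)^\top (x_{t-\Delta t} - x_t) - \frac{1}{2\sigma^2 \Delta t}\|x_{t-\Delta t} - x_t\|^2 + \text{const} \\
&= -\frac{1}{2\sigma^2 \Delta t}\left\|x_{t-\Delta t} - \left(x_t + \sigma^2 \Delta t\,\nabla_x \log p(x_t)\right)\right\|^2 + \text{const},
\end{aligned}$$

so that $p(x_{t-\Delta t}|x_t) \approx \mathcal{N}\left(x_t + \sigma^2 \Delta t\,\nabla_x \log p(x_t),\ \sigma^2 \Delta t I\right)$.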

Hence, this variational formulation transforms the extremely hard conditional distribution estimation to a very simple Gaussian distribution.

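Recall our gold standard, MLE. As we have obtained the conditional distribution, we can naturally formulate the variational form as a KL-Divergence: $D_{KL}\left(p(x_0) \prod_t q(x_t|x_{t-\Delta t}) \,\Big\|\, \prod_t p_\theta(x_{t-\Delta t}|x_t)\right)$, where the left distribution is the complete-data distribution of the entire forward process (specifically, $p(x_0)$ is the distribution of the observed data and $\prod_t q(x_t|x_{t-\Delta t})$ is the joint distribution of the latent variables), and the right distribution is the model density.

In the reverse process, we can carry out noise reduction by decomposing the KL-Divergence into $\sum_t D_{KL}\left(q(x_{t-\Delta t}|x_t, x_0) \,\|\, p_\theta(x_{t-\Delta t}|x_t)\right)$. Both distributions are Gaussian: $q(x_{t-\Delta t}|x_t, x_0) \sim \mathcal{N}(\mu(x_t, x_0), \sigma^2 \Delta t I)$ to leading order in $\Delta t$, and $p_\theta(x_{t-\Delta t}|x_t) \sim \mathcal{N}(M_\theta(x_t), \sigma^2 \Delta t I)$, where $M_\theta$ is a U-Net model with parameters $\theta$. We can therefore write the KL-Divergence in closed form, up to a constant factor, as $\min_\theta \|M_\theta(x_t) - \mu(x_t, x_0)\|^2 / (\sigma^2 \Delta t)$, where the objective can be reparametrized in terms of noise estimation as $\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2$.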
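The closed form relies on a standard fact about Gaussians. As a sketch, assume two Gaussians that share the covariance $s^2 I$, where $s^2$ plays the role of $\sigma^2 \Delta t$ and $\mu_1$, $\mu_2$ are illustrative symbols for the two means:

$$D_{KL}\left(\mathcal{N}(\mu_1, s^2 I) \,\|\, \mathcal{N}(\mu_2, s^2 I)\right) = \mathbb{E}_{x \sim \mathcal{N}(\mu_1, s^2 I)}\left[\frac{\|x - \mu_2\|^2 - \|x - \mu_1\|^2}{2 s^2}\right] = \frac{\|\mu_1 - \mu_2\|^2}{2 s^2}.$$

With $\mu_1 = \mu(x_t, x_0)$ and $\mu_2 = M_\theta(x_t)$, each term of the sum reduces to a squared distance between means, so learning the reverse step amounts to a regression problem for $M_\theta$.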

Relations with Other Generative Models

The Diffusion model can be viewed as an Auto-regressive model in the time domain that runs in reverse time, going from white noise $x_t$ to the clean image $x_0$.

The Diffusion model can be viewed as a Flow-based model. A Flow-based model starts from white noise $z \sim \mathcal{N}(0, I_D)$ ($D$ is the dimension of the data) and uses a sequence of transformations to generate $x = g_1(g_2(\cdots g_t(z)))$. Each transformation $g_i$ has to be of a very restricted form and invertible. Hence, the Diffusion model can be viewed as a more free-form Flow-based model without the restrictions on form and invertibility.

The Diffusion model can be viewed as a refined version of the Variational Auto-Encoder (VAE). VAE starts from white noise $z \sim \mathcal{N}(0, I_d)$, $d < D$, and generates $x = g(z) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_D)$. The KL-Divergence for learning VAE by MLE is:

$$D_{KL}\left(p_{data}(x)\,q_\phi(z|x) \,\|\, p(z)\,p_\theta(x|z)\right) = D_{KL}\left(p_{data}(x) \,\|\, p_\theta(x)\right) + \mathbb{E}_{p_{data}(x)}\left[D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right)\right].$$

VAE estimates $x_0$ in one shot. In terms of the golf analogy, in contrast to the Diffusion model, which reaches the target in a thousand strokes, VAE tries to send the golf ball into the hole using only one stroke. Hence, this can be very inaccurate.

relation
A visualization of the relation between different generative models from the perspective of Diffusion model

The following animations show three pairs of counterparts that need to be distinguished.

covering vs chasing 1 covering vs chasing 2
Mode covering vs. mode chasing

cd vs em 1 cd vs em 2
Contrastive divergence vs. EM algorithm

eq vs non 1 eq vs non 2 eq vs non 3
Equilibrium sampling vs. non-equilibrium sampling

Author and Citation Info

This tutorial entry is composed by Yu-Zhe Shi under the supervision of Dr. Ying Nian Wu.

Dr. Ying Nian Wu is currently a professor in the Department of Statistics, UCLA. He received his A.M. degree and Ph.D. degree in statistics from Harvard University in 1994 and 1996 respectively. He was an assistant professor in the Department of Statistics, University of Michigan from 1997 to 1999. He joined UCLA in 1999. He has been a full professor since 2006. Wu’s research areas include generative modeling, representation learning, computer vision, computational neuroscience, and bioinformatics.

Cite the Entry

@InCollection{shi2022generative,
	author       =	{Shi, Yu-Zhe and Wu, Ying Nian},
	title        =	{{Generative Modeling Explained}},
	booktitle    =	{Statistical Machine Learning Tutorials},
	howpublished =	{\url{https://github.com/YuzheSHI/generative-modeling-explained}},
	year         =	{2022},
	edition      =	{{S}ummer 2022},
	publisher    =	{Department of Statistics, UCLA}
}

Acknowledgement

The authors thank Dr. Yixin Zhu for helpful suggestions, Zhangqian Bi for debugging the markdown maths renderer, and Ms. Zhen Chen for helping design the animations.
