Problem Setup

The Variational Lower Bound is also known as the Evidence Lower Bound (ELBO) or VLB. It is quite useful because it gives us a lower bound on the evidence of a model containing hidden variables, and we can maximize that bound in order to maximize the log probability itself. Assume that $X$ denotes the observations (data) and $Z$ denotes the hidden/latent variables, which are unobservable. In general, we can also think of $Z$ as a parameter; the relationship between $Z$ and $X$ can then be pictured as a simple graphical model in which the latent $Z$ generates the observed $X$.

Meanwhile, by Bayes' theorem and the definition of conditional probability, we can write

$$p(Z | X) = \frac{p(X | Z) p(Z)}{p(X)} = \frac{p(X | Z) p(Z)}{\int_{Z} p(X, Z)}$$
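To make this setup concrete, here is a minimal numerical sketch (the prior and likelihood values below are made up purely for illustration) of a two-state latent variable $Z$ and a binary observation $X$, showing both the marginalization in the denominator and the resulting posterior:

```python
import numpy as np

# Prior p(Z) over two hypothetical latent states and likelihood p(X | Z)
# over two possible outcomes (numbers chosen only for illustration).
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.8, 0.2],   # p(X | Z=0)
                        [0.3, 0.7]])  # p(X | Z=1)

x = 0                                         # the observed outcome
p_x = np.sum(p_x_given_z[:, x] * p_z)         # evidence p(X=x) = sum_Z p(X=x | Z) p(Z)
p_z_given_x = p_x_given_z[:, x] * p_z / p_x   # Bayes: posterior p(Z | X=x)

print(p_x)          # 0.5
print(p_z_given_x)  # [0.64 0.36]
```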

Jensen’s Inequality

It states that for a convex transformation $f$, the value of $f$ at a weighted average of two points, $f(w \cdot x + (1 - w) \cdot y)$, is less than or equal to the same weighted average of the transformed points, $w \cdot f(x) + (1 - w) \cdot f(y)$.

Formally, for any two points $x_1, x_2$ in the domain of $f$ and any $t \in [0, 1]$, Jensen's inequality can be written as

$$f(t x_1 + (1 - t) x_2) \leq t f(x_1) + (1 - t) f(x_2)$$

In probability theory, for a random variable $X$ and a convex function $\varphi$, we can state the inequality as

$$\varphi(E[X]) \leq E[\varphi(X)]$$
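As a quick numerical sanity check (not part of the argument), the probabilistic form of the inequality can be verified with a convex function such as $\exp$; the distribution and sample size below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)  # draws of a random variable X

phi = np.exp                    # exp is convex
lhs = phi(samples.mean())       # phi(E[X]), estimated via the sample mean
rhs = phi(samples).mean()       # E[phi(X)], estimated by averaging phi over the samples

print(lhs <= rhs)               # True: phi(E[X]) <= E[phi(X)]
```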

Proof

By the above statement, we can derive

$$ \log p(X) = \log \left( \int_Z p(X, Z) \right) $$

$$ = \log \int_Z p(X, Z) \frac{q(Z)}{q(Z)} \tag{2} $$

$$ = \log \int_Z q(Z) \frac{p(X, Z)}{q(Z)} $$

$$ = \log \left( E_q\left[\frac{p(X, Z)}{q(Z)}\right] \right) $$

$$ \geq E_q\left[\log \frac{p(X, Z)}{q(Z)}\right] \tag{4} $$

$$ = E_q[\log p(X, Z) - \log q(Z)] $$

$$ = E_q[\log p(X, Z)] - E_q[\log q(Z)] $$

$$ = E_q[\log p(X, Z)] + H[Z] \tag{5} $$

Here $q(Z)$ in Eq. (2) is an approximation of the true posterior distribution $p(Z|X)$, which we introduce because the actual distribution $p(Z|X)$ of the hidden state $Z$ is unknown. To derive the lower bound, we apply Jensen's inequality in Eq. (4); note that because $\log$ is concave, the direction of the inequality is the reverse of the convex case stated above.

The right-hand side of Eq. (5) is the ELBO.

Denoting the ELBO by $L$, we have

$$ L = E_q[\log p(X, Z)] + H[Z] $$

So far, we have seen what the ELBO is. How tight the bound is depends on how well $q(Z)$ approximates $p(Z|X)$: the better the approximation, the tighter the lower bound. To quantify the quality of the approximation, we need to do a bit more work.
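To see the bound in action, here is a small sketch that reuses the discrete toy model from the first snippet (illustrative numbers only) and evaluates $L = E_q[\log p(X, Z)] + H[Z]$ for different choices of $q(Z)$:

```python
import numpy as np

# Same toy model as before (illustrative numbers).
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.8, 0.2],
                        [0.3, 0.7]])
x = 0

def elbo(q_z):
    """L = E_q[log p(X=x, Z)] - E_q[log q(Z)]."""
    log_joint = np.log(p_x_given_z[:, x] * p_z)       # log p(X=x, Z=z) for each z
    return np.sum(q_z * log_joint) - np.sum(q_z * np.log(q_z))

log_p_x = np.log(np.sum(p_x_given_z[:, x] * p_z))     # exact log evidence

q_rough = np.array([0.5, 0.5])                        # a crude guess for p(Z | X=x)
q_exact = p_x_given_z[:, x] * p_z / np.exp(log_p_x)   # the true posterior

print(elbo(q_rough) <= log_p_x)            # True: the bound holds for this (and any) q
print(np.isclose(elbo(q_exact), log_p_x))  # True: the bound is tight when q(Z) = p(Z | X)
```

For any choice of $q(Z)$ the ELBO stays below $\log p(X)$, and it coincides with $\log p(X)$ when $q(Z)$ equals the true posterior; the next section quantifies this gap with the KL divergence.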

Derive With KL-Divergence

The KL divergence is a common metric for the discrepancy between two distributions, and $q(Z)$ is an approximation of $p(Z|X)$, so it is natural to consider $KL[q(Z) || p(Z|X)]$, which is always $\geq 0$. We can further derive:

$$ KL[q(Z) || p(Z|X)] = \int_Z q(Z) \log \frac{q(Z)}{p(Z|X)} $$

$$ = - \int_Z q(Z) \log \frac{p(Z|X)}{q(Z)} $$

$$ = - \int_Z q(Z) \log \frac{p(Z, X)}{q(Z) p(X)} $$

$$ = - \int_Z q(Z) \left( \log \frac{p(Z, X)}{q(Z)} - \log p(X) \right) $$

$$ = - \int_Z q(Z) \log \frac{p(Z, X)}{q(Z)} + \int_Z q(Z) \log p(X) $$

$$ = - E_q\left[\log \frac{p(X, Z)}{q(Z)}\right] + \log p(X) \int_Z q(Z) $$

$$ = -L + \log p(X) $$

The last step uses the fact that $\int_Z q(Z) = 1$ and that the first term is exactly $-L$. Rearranging the equation, we get

$$ L = \log p(X) - KL[q(Z) || p(Z|X)] $$

Again, $L$ is the ELBO.
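This identity can also be checked numerically on the same toy model as before (a self-contained sketch with the same illustrative numbers):

```python
import numpy as np

# Same toy model as in the earlier snippets (illustrative numbers).
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.8, 0.2],
                        [0.3, 0.7]])
x = 0
q = np.array([0.5, 0.5])                               # some approximation q(Z)

log_p_x = np.log(np.sum(p_x_given_z[:, x] * p_z))      # log p(X=x)
posterior = p_x_given_z[:, x] * p_z / np.exp(log_p_x)  # p(Z | X=x)

log_joint = np.log(p_x_given_z[:, x] * p_z)            # log p(X=x, Z)
elbo = np.sum(q * log_joint) - np.sum(q * np.log(q))   # L = E_q[log p(X, Z)] + H[q]
kl = np.sum(q * np.log(q / posterior))                 # KL[q(Z) || p(Z | X=x)]

print(np.isclose(elbo, log_p_x - kl))                  # True: L = log p(X) - KL
```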

The Application of ELBO

We can maximize $\log p(X)$ through the ELBO. With the KKT conditions, we can rewrite the optimization of the log probability of $X$,

$$ \max \left( \log p(X) \right) $$

as

$$ \max \left( \log p(X) - \beta \, KL[q(Z) || p(Z|X)] \right) $$

Thus, we can optimize a model containing hidden variables by working with the joint $p(X, Z)$ and the approximation $q(Z)$ of the hidden variable, even though $p(X)$ and $p(Z|X)$ themselves are typically intractable. It is a very useful trick in model-based RL.
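As a minimal sketch of this optimization view (again with the illustrative toy numbers from above, and a brute-force sweep in place of the gradient-based optimization a real model would use): since $\log p(X)$ does not depend on $q$, maximizing the ELBO over $q(Z)$ is the same as minimizing $KL[q(Z) || p(Z|X)]$.

```python
import numpy as np

# Same toy model as above (illustrative numbers).
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.8, 0.2],
                        [0.3, 0.7]])
x = 0
log_joint = np.log(p_x_given_z[:, x] * p_z)            # log p(X=x, Z)
log_p_x = np.log(np.sum(p_x_given_z[:, x] * p_z))      # log p(X=x), fixed w.r.t. q

def elbo(q_z):
    return np.sum(q_z * log_joint) - np.sum(q_z * np.log(q_z))

# Sweep a one-parameter family q(Z) = (t, 1 - t) and keep the best ELBO.
ts = np.linspace(0.01, 0.99, 99)
elbos = np.array([elbo(np.array([t, 1.0 - t])) for t in ts])

best_t = ts[int(np.argmax(elbos))]
print(best_t)                    # ~0.64, i.e. the true posterior p(Z=0 | X=x)
print(elbos.max(), log_p_x)      # the maximized ELBO approaches log p(X)
```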

For more detail, please refer to this handout and Wikipedia; they give a great explanation. If you read Chinese, you can also check out this blog, whose author has written a series of posts on the topic.