
Information Theory and the Variational Lower Bound (ELBO): Why VAEs Maximise It

by Gus

Variational Autoencoders (VAEs) are a practical way to learn a probabilistic generative model when the exact data likelihood is hard to compute. Instead of trying to maximise the true log-likelihood directly, VAEs maximise a proxy objective called the Evidence Lower BOund (ELBO). This matters because ELBO training turns “intractable probability” into an optimisation problem you can solve with standard gradient-based methods, while still staying grounded in information theory. If you are exploring modern generative modelling as part of gen AI training in Hyderabad, understanding ELBO is one of the fastest ways to move from “I can run a notebook” to “I know what the model is really optimising”.

Why the true likelihood is difficult in latent-variable models

A VAE assumes each observation $x$ is generated from a latent variable $z$. The generative story is:

  • Sample a latent code $z \sim p(z)$ (often a standard normal).
  • Generate data $x \sim p_\theta(x \mid z)$ using a neural network decoder.

The likelihood of a single datapoint is:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$

That integral is the problem. For flexible neural decoders, it has no closed form and is expensive to approximate naively. Maximising $\log p_\theta(x)$ directly is therefore not straightforward.
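To make the difficulty concrete, here is a toy Monte Carlo estimator in one dimension. The `decoder_mean` function is a made-up stand-in for a neural decoder, purely for illustration; averaging $p(x \mid z)$ over prior samples works in 1-D but becomes hopelessly high-variance as the latent dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_mean(z):
    # Toy nonlinear "decoder": a stand-in for a neural network (assumption).
    return np.tanh(2.0 * z)

def log_normal_pdf(x, mean, std):
    # Log-density of a univariate Gaussian.
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def naive_mc_likelihood(x, n_samples):
    # p(x) = E_{z ~ p(z)}[ p(x | z) ], estimated by averaging over prior samples.
    z = rng.standard_normal(n_samples)
    return np.mean(np.exp(log_normal_pdf(x, decoder_mean(z), 1.0)))

print(naive_mc_likelihood(0.5, 100))      # noisy estimate
print(naive_mc_likelihood(0.5, 100_000))  # more stable, but costly
```

With many latent dimensions, almost all prior samples land where $p(x \mid z)$ is tiny, so the estimator needs astronomically many samples; this is the gap the ELBO fills.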

ELBO: the objective that makes VAEs trainable

The key idea is to introduce an approximate posterior distribution $q_\phi(z \mid x)$, implemented by the encoder network. This distribution approximates the true posterior $p_\theta(z \mid x)$, which is also intractable. Using variational inference, we can derive a lower bound on the log-likelihood:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

The right-hand side is the ELBO. Training a VAE means maximising this ELBO over the dataset.
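One way to see where the bound comes from (a standard derivation, sketched here for completeness): multiply and divide by $q_\phi(z \mid x)$ inside the likelihood integral, then apply Jensen's inequality to the concave logarithm:

```latex
\log p_\theta(x)
  = \log \int q_\phi(z \mid x)\,\frac{p_\theta(x \mid z)\,p(z)}{q_\phi(z \mid x)}\,dz
  \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x \mid z)\,p(z)}{q_\phi(z \mid x)}\right]
  = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]
    - \mathrm{KL}\big(q_\phi(z \mid x)\,\big\|\,p(z)\big)
```

The gap in the inequality is exactly $\mathrm{KL}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x))$, so the bound is tight when the encoder matches the true posterior.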

What the two ELBO terms mean

1) Reconstruction term (expected log-likelihood)

The expectation $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ encourages the decoder to reconstruct the input accurately from samples of $z$. In practice, this becomes a familiar reconstruction loss (for example, mean squared error for Gaussian likelihoods, or cross-entropy for Bernoulli likelihoods), but with a probabilistic interpretation.

2) Regularisation term (KL divergence)

$\mathrm{KL}(q_\phi(z \mid x)\,\|\,p(z))$ penalises the encoder when its posterior deviates too far from the prior. This keeps the latent space “well-behaved” so that sampling $z \sim p(z)$ at generation time yields meaningful outputs.
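When $q_\phi(z \mid x)$ is a diagonal Gaussian and $p(z)$ is a standard normal, this KL term has a well-known closed form, $\tfrac{1}{2}\sum_i \big(\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2\big)$, which is what most implementations compute. A small sketch:

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form.
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

print(kl_to_standard_normal(np.zeros(3), np.ones(3)))  # 0.0: q equals the prior
print(kl_to_standard_normal(np.array([1.0, -1.0]), np.array([0.5, 0.5])))
```

The penalty is zero exactly when the encoder's posterior matches the prior, and grows as the means drift from zero or the variances shrink away from one.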

Together, these terms create a controlled trade-off: fit the data well, but do not let the latent codes become arbitrary.

The information-theoretic interpretation: rate vs distortion

ELBO is not just a trick; it has a clean information-theory view. The reconstruction term corresponds to distortion (how much information about $x$ is lost when passing through $z$), while the KL term corresponds to rate (how many “nats” or “bits” are required to encode $x$ into $z$ relative to the prior).

Maximising ELBO is therefore similar to optimising a rate–distortion objective: achieve good reconstructions without spending too much latent capacity. This is why VAEs often learn smooth, continuous latent spaces and can interpolate between samples sensibly.

For learners doing gen AI training in Hyderabad, this rate–distortion framing is useful because it explains common VAE behaviours without hand-waving: blurry outputs can come from likelihood choices and conservative rate; posterior collapse can happen when the model finds it can reconstruct without using $z$, making the KL term shrink toward zero.

How ELBO is optimised in practice

To train with backpropagation, VAEs use the reparameterisation trick. Instead of sampling $z$ directly from $q_\phi(z \mid x)$, the encoder outputs parameters (mean $\mu$ and standard deviation $\sigma$), and we sample:

$$z = \mu(x) + \sigma(x)\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

This isolates the randomness in $\epsilon$, allowing gradients to flow through $\mu$ and $\sigma$. A typical training step:

  1. Encode $x$ to $\mu(x), \sigma(x)$.
  2. Sample $z$ via reparameterisation.
  3. Decode to get $p_\theta(x \mid z)$ and compute the reconstruction loss.
  4. Compute the KL divergence between $q_\phi(z \mid x)$ and $p(z)$.
  5. Maximise ELBO (or minimise negative ELBO).
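The five steps above can be sketched end to end in NumPy. The `encode` and `decode` functions here are arbitrary linear stand-ins (assumptions for illustration, not a trained model); a real VAE would use neural networks and minimise this quantity with gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Stand-in encoder (assumption): fixed linear mean, constant std.
    mu = 0.5 * x
    sigma = np.full_like(x, 0.8)
    return mu, sigma

def decode(z):
    # Stand-in decoder (assumption): mean of a unit-variance Gaussian likelihood.
    return 2.0 * z

def negative_elbo(x):
    mu, sigma = encode(x)                         # step 1: encode
    eps = rng.standard_normal(mu.shape)
    z = mu + sigma * eps                          # step 2: reparameterise
    x_hat = decode(z)                             # step 3: decode
    recon = 0.5 * np.sum((x - x_hat) ** 2)        # Gaussian -log p(x|z), up to a constant
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 1.0
                      - np.log(sigma**2))         # step 4: closed-form KL
    return recon + kl                             # step 5: minimise -ELBO

x = np.array([1.0, -0.5, 2.0])
print(negative_elbo(x))  # single-sample estimate of the negative ELBO
```

In a real implementation, `negative_elbo` would be differentiable with respect to the encoder and decoder weights, and one optimiser step would follow its gradient.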

Many variants tweak the balance, such as $\beta$-VAE, which multiplies the KL term by a coefficient $\beta$ to encourage disentanglement or prevent overfitting.
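As a sketch, the only change $\beta$-VAE makes to the loss is a scalar weight on the KL term (the numeric values here are arbitrary):

```python
def beta_vae_loss(recon_loss, kl, beta=4.0):
    # beta > 1 puts extra pressure on the KL term (encouraging disentanglement);
    # beta = 1 recovers the standard negative ELBO.
    return recon_loss + beta * kl

print(beta_vae_loss(10.0, 2.0, beta=1.0))  # 12.0: standard negative ELBO
print(beta_vae_loss(10.0, 2.0, beta=4.0))  # 18.0: KL weighted more heavily
```

In rate–distortion terms, raising $\beta$ buys a lower rate at the price of higher distortion.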

Conclusion

ELBO is the central objective that makes VAEs work: it is a mathematically grounded lower bound on the true log-likelihood, built from an expected reconstruction term and a KL regulariser. From an information-theory lens, ELBO is a rate–distortion trade-off that explains why VAEs learn structured latent spaces and why training pathologies can occur. Once you see ELBO as “maximise data fit while controlling information flow through the latent code,” VAEs become far less mysterious—and far easier to debug and improve during gen AI training in Hyderabad.
