Bayesian deep learning: uncertainty, inductive bias, and modern theory

Oct 10, 2025

I recently came across Andrew Gordon Wilson’s discussion on demystifying deep learning and his work on Bayesian deep learning. His perspective sheds light on why deep learning works as well as it does, what its real limitations are, and how probabilistic methods can help us build more reliable and interpretable models. Below is a summary of his key ideas for future reference.

Wilson begins by arguing that deep learning is indeed unusual and at times mysterious, but not for the reasons people commonly assume. Much of what appears inexplicable—such as the ability of massive models to generalize on small datasets—can actually be understood through longstanding ideas in inductive bias, generalization theory, and Bayesian inference. He emphasizes that deep learning’s power does not come from magic but from a combination of extreme expressiveness and strong simplicity biases that emerge as models grow larger. Wilson argues that more expressive models, when paired with soft biases—such as compression, Bayesian marginalization, or architectural tendencies—actually generalize better than smaller, more rigid ones. This leads to the counterintuitive conclusion that increasing model size often reduces, rather than increases, overfitting. Phenomena such as double descent illustrate this dramatically: once models exceed the interpolation threshold and continue to grow, their generalization improves even though they perfectly fit the training data, implying that scale brings with it an internal preference for flatter, more compressible solutions.

Andrew notes that real-world data is biased toward low Kolmogorov complexity, meaning the world is structured, not random. Gaussian kernels offer explicit, analytic examples of such universal structure. For instance, the RBF kernel encodes assumptions of smoothness and locality that hold across physics, engineering, biology, and perception. Spectral mixture kernels encode periodicity and long-range structure, again reflecting universal patterns found across natural systems. Deep kernel learning generalizes this: a deep network learns a feature map that is then fed into a Gaussian process, giving a kernel that captures complex patterns while remaining smooth and compressible. Bayesian deep learning connects to this theme because it values models that represent honest beliefs about structure in the world, and GP kernels are explicit representations of such beliefs. The open question is whether we can discover a small set of kernel-like inductive biases that truly span real-world variation, enabling models to generalize systematically across domains.

The Gaussian kernel is one of the most important objects in machine learning — not because it is the only sensible prior over functions, but because it crystallizes, in an analytically clean form, what good inductive bias looks like in function space. Andrew’s worldview is easier to understand once you see why it is so central. The Gaussian (or radial basis function, RBF) kernel induces a smooth, infinitely differentiable feature space that enables highly flexible nonlinear modeling. Its mathematical form enforces locality: points close in the input space receive high similarity, while the similarity of distant points decays exponentially. This makes the Gaussian kernel effective for capturing structure in data with complex, curved decision boundaries. It also satisfies the conditions of Mercer’s theorem, so it corresponds to an inner product in a (potentially infinite-dimensional) reproducing kernel Hilbert space (RKHS). This guarantee allows algorithms such as Support Vector Machines (SVMs) and kernel ridge regression to apply the kernel trick, operating implicitly in a high- or infinite-dimensional feature space without an explicit mapping.
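
As a concrete reference, here is a minimal NumPy sketch of the RBF kernel described above; the function and variable names are mine, not from any particular library.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, amplitude=1.0):
    """Squared-exponential (RBF) kernel matrix between two sets of points.

    Nearby inputs receive covariance close to amplitude**2, while the
    covariance of distant inputs decays exponentially in squared distance.
    """
    # Pairwise squared Euclidean distances, shape (len(X1), len(X2)).
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return amplitude**2 * np.exp(-0.5 * sq_dists / lengthscale**2)

# Example: covariance matrix for five 1-D inputs.
X = np.linspace(-2, 2, 5).reshape(-1, 1)
print(np.round(rbf_kernel(X, X), 3))  # large on the diagonal, decaying off it
```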

The Bayesian predictive distribution for a new input \(x_*\) is given by

\[p(y \mid x_*, \mathbf{y}, \mathbf{X}) = \int p(y \mid x_*, \mathbf{w}) \, p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) \, d\mathbf{w}.\]

This expression can be understood by noting that each possible setting of the weights \(\mathbf{w}\) corresponds to a different model \(f(x, \mathbf{w})\). The Bayesian predictive distribution therefore performs a Bayesian model average: it averages over an uncountably infinite collection of models, weighting each one by its posterior probability under the observed data \((\mathbf{X}, \mathbf{y})\). This formulation naturally represents epistemic uncertainty, that is, uncertainty about which function \(f(x, \mathbf{w})\) actually explains the data. Rather than committing to a single set of parameters, the Bayesian approach recognizes that many weight configurations may be compatible with the observations and integrates over all of them.

Standard neural network training can be interpreted as using an extremely crude approximation to the true posterior. In particular, classical training corresponds to replacing the full posterior distribution with a point mass at the maximum a posteriori (MAP) estimate, \(q(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = \delta(\mathbf{w} - \mathbf{w}_{\mathrm{MAP}}),\) thereby assuming that a single weight vector is the only plausible explanation of the data. This approximation discards epistemic uncertainty entirely and lacks the automatic complexity control inherent in the full Bayesian treatment.

Bayesian marginalization is fundamentally different from optimization because it does not attempt to find a single best model or parameter setting. Instead, it explicitly acknowledges that multiple models may plausibly explain the observed data, and therefore it averages over all of them, weighting each according to its posterior probability. The Bayesian predictive distribution integrates over the entire space of hypotheses, rather than collapsing onto a single point estimate.

Bayesian prediction involves averaging over all possible parameter settings, weighted by their posterior probabilities. Formally, for a new input \(x_*\) and dataset \(D = (\mathbf{X}, \mathbf{y})\), the predictive distribution is:

\[p(y \mid x_*, D) = \int p(y \mid x_*, \mathbf{w})\, p(\mathbf{w} \mid D)\, d\mathbf{w}\]

For most models of interest—especially Bayesian neural networks—this integral is intractable. The posterior \(p(\mathbf{w} \mid D)\) is high-dimensional, complex, and typically multimodal. Therefore, practical Bayesian deep learning usually approximates this average using Monte Carlo sampling. We draw \(J\) samples \(\mathbf{w}_j\) from an approximate posterior \(q(\mathbf{w} \mid D)\) and compute the following empirical average:

\[p(y \mid x_*, D) \approx \frac{1}{J} \sum_{j=1}^J p(y \mid x_*, \mathbf{w}_j)\]

Each \(\mathbf{w}_j\) is sampled from the approximate posterior \(q(\mathbf{w} \mid D)\). In this way, Bayesian prediction becomes the problem of generating good samples from a distribution that approximates the true posterior over parameters.
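
To make this concrete, here is a small illustrative sketch (not from Wilson’s work): a toy logistic model whose approximate posterior \(q(\mathbf{w} \mid D)\) is a diagonal Gaussian, as variational inference or a Laplace approximation might produce, with the predictive probability formed by averaging over posterior samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_prob(x_star, w):
    """Likelihood p(y = 1 | x_*, w) for a toy logistic model with weights w."""
    return 1.0 / (1.0 + np.exp(-x_star @ w))

# Hypothetical approximate posterior q(w | D): a diagonal Gaussian whose
# mean and scales would come from variational inference or a Laplace fit.
posterior_mean = np.array([1.2, -0.7])
posterior_std = np.array([0.3, 0.5])

x_star = np.array([1.0, 2.0])
J = 1000

# Bayesian model average: draw w_j ~ q(w | D) and average the predictions.
samples = posterior_mean + posterior_std * rng.standard_normal((J, 2))
p_bma = np.mean([predict_prob(x_star, w) for w in samples])

# Compare with the plug-in prediction at a single point estimate.
p_point = predict_prob(x_star, posterior_mean)
print(f"model average: {p_bma:.3f}  vs  point estimate: {p_point:.3f}")
```

For a nonlinear model, averaging the predictive probabilities is not the same as predicting with a single averaged weight vector, which is exactly the gap between marginalization and a point estimate.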

There are two main approaches for approximate Bayesian inference in deep learning:

  1. Deterministic (e.g., variational) methods: These approximate the posterior with a simple, tractable distribution—such as a Gaussian—denoted \(q(\mathbf{w} \mid D, \theta)\), where \(\theta\) controls the shape of the approximation. The goal is to choose \(\theta\) so that \(q\) is as close as possible to the true posterior. Variational inference is the prototypical example. Standard neural network training is an extreme case, where the approximate posterior is just a point mass (delta function) at the MAP estimate \(\mathbf{w}_{\mathrm{MAP}}\).

  2. Sampling-based (MCMC) methods: These construct a Markov chain whose stationary distribution is the true posterior \(p(\mathbf{w} \mid D)\). Popular MCMC algorithms include Metropolis–Hastings, Hamiltonian Monte Carlo, stochastic gradient Langevin dynamics (SGLD), and stochastic gradient Hamiltonian Monte Carlo (SGHMC). While they are asymptotically exact, these methods are often too computationally expensive for large neural networks.

Both approaches seek to generate samples that reflect the variability in the posterior over parameters, but trade off computational cost and statistical fidelity in different ways.
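
As an illustration of the sampling-based route, here is a minimal SGLD-style sketch for a toy one-dimensional posterior. In real SGLD the gradient of the log posterior would be a minibatch estimate scaled up to the full dataset; the toy target and all names here are my own.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgld_sample(grad_log_post, w0, step_size=1e-2, n_steps=5000, burn_in=500):
    """Langevin dynamics: gradient ascent on the log posterior plus Gaussian
    noise whose variance matches the step size, so (for small steps) the
    iterates approximately sample from the posterior."""
    w = np.array(w0, dtype=float)
    samples = []
    for t in range(n_steps):
        noise = rng.normal(0.0, np.sqrt(step_size), size=w.shape)
        w = w + 0.5 * step_size * grad_log_post(w) + noise
        if t >= burn_in:
            samples.append(w.copy())
    return np.array(samples)

# Toy target: posterior N(2, 0.5^2), so grad log p(w | D) = -(w - 2) / 0.25.
draws = sgld_sample(lambda w: -(w - 2.0) / 0.25, w0=np.zeros(1))
print(draws.mean(), draws.std())  # should be close to 2 and 0.5
```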

Bayesian inference provides a principled way to compare models by evaluating their model evidence (also known as the marginal likelihood). Unlike classical approaches, which focus only on how well a model fits the observed data, the model evidence measures how well a model is expected to predict any possible dataset, automatically incorporating a preference for simplicity through the prior over functions—a formal expression of Occam’s razor. Models that are too simple cannot explain the data well, while overly complex models dilute their probability mass over too many possibilities. The best-performing models under this metric are those that most closely match the complexity of the data.

Formally, for a candidate model \(M_i\), the marginal likelihood (model evidence) is:

\[p(\mathbf{y} \mid M_i) = \int p(\mathbf{y} \mid f, M_i) \; p(f \mid M_i) \, df\]

Here, \(p(f \mid M_i)\) is the prior over functions, and \(p(\mathbf{y} \mid f, M_i)\) is the likelihood of the observed data given the function.

Bayes’ rule then gives the posterior probability over models:

\[p(M_i \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid M_i) \, p(M_i)}{p(\mathbf{y})}\]

where \(p(M_i)\) is the prior probability of model \(M_i\), and \(p(\mathbf{y})\) ensures normalization over all candidate models. The “best” model in the Bayesian framework is the one that assigns the highest probability to the observed data before seeing it, according to its own prior assumptions and capacity to generalize.
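
As an illustration of this Occam effect (my own toy example, not one of Wilson’s), consider Bayesian polynomial regression: with a Gaussian prior over the weights and Gaussian noise, the marginal over \(\mathbf{y}\) is itself Gaussian, so the evidence integral has a closed form and different polynomial degrees can be compared directly.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Toy data: a quadratic trend plus a little noise.
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 1.5 * x**2 + 0.1 * rng.standard_normal(x.shape)

def log_evidence(degree, prior_var=1.0, noise_var=0.01):
    """log p(y | M_degree) for Bayesian polynomial regression.

    With w ~ N(0, prior_var * I) and Gaussian observation noise, y is
    marginally Gaussian with covariance prior_var * Phi Phi^T + noise_var * I.
    """
    Phi = np.vander(x, degree + 1, increasing=True)   # polynomial features
    cov = prior_var * Phi @ Phi.T + noise_var * np.eye(len(x))
    return multivariate_normal(mean=np.zeros(len(x)), cov=cov).logpdf(y)

for degree in range(6):
    print(degree, round(log_evidence(degree), 1))
# Degrees 0 and 1 cannot explain the data, so their evidence is low; around
# the true degree the evidence peaks, and extra capacity does not keep helping.
```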


A Gaussian process with an RBF kernel can be derived from a simple linear model using local Gaussian basis functions. Consider the following function representation as a finite sum of basis functions:

\[f(x) = \sum_{i=1}^J w_i\, \phi_i(x),\]

where each weight is drawn from a Gaussian prior,

\[w_i \sim \mathcal{N}\left(0, \frac{\sigma^2}{J}\right),\]

and each basis function is a Gaussian “bump” centered at some location \(c_i\):

\[\phi_i(x) = \exp\left(-\frac{(x-c_i)^2}{2\ell^2}\right).\]

This setup defines a prior over functions, because randomness in the weights induces randomness in \(f(x)\). The covariance (kernel) of this prior is the expectation \(\mathbb{E}[f(x)\, f(x')]\). Because the weights are independent with zero mean,

\[k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^J \phi_i(x) \phi_i(x').\]

This finite sum already resembles a feature map: the kernel is a sum of overlaps between local Gaussian bumps centered at different locations. The key step is to let the number of basis functions \(J\) grow, making the centers \(c_i\) ever more densely spaced along the real line. For example, set \(c_1 = -\log J\), \(c_J = \log J\), and spacing \(\Delta c = 2\log J / J\). As \(J \to \infty\), the sum becomes a Riemann sum and then an integral (absorbing the constant relating \(1/J\) to the spacing \(\Delta c\) into \(\sigma^2\)):

\[k(x, x') = \lim_{J\to\infty} \frac{\sigma^2}{J} \sum_{i=1}^J \phi_i(x) \phi_i(x') = \sigma^2 \int_{c_0}^{c_\infty} \phi_c(x)\, \phi_c(x')\, dc.\]

Extending the limits to all real numbers (\(c_0 = -\infty,\, c_\infty = \infty\)), we get:

\[k(x, x') = \sigma^2 \int_{-\infty}^{\infty} \exp\left(-\frac{(x-c)^2}{2\ell^2}\right) \exp\left(-\frac{(x'-c)^2}{2\ell^2}\right) dc.\]

This integral can be evaluated in closed form. The product of two Gaussians in \(c\) is another Gaussian in \(c\), and integrating over all \(c\) gives:

\[k(x, x') = \sqrt{\pi} \ell \sigma^2 \exp\left(-\frac{(x-x')^2}{4\ell^2}\right).\]

Up to constant factors and a rescaling of the lengthscale from \(\ell\) to \(\sqrt{2}\,\ell\), this is precisely the RBF (squared exponential) kernel:

\[k_{\mathrm{RBF}}(x, x') = a^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right).\]

Thus, the RBF kernel emerges as the limit of simple basis-function models with infinitely many Gaussian features scattered across the real line. This derivation shows that the inductive bias of the RBF kernel—smoothness, locality, infinite differentiability—comes directly from the shape and density of the Gaussian basis functions. Each bump enforces local smoothness; their infinite density enforces smoothness everywhere.
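
A quick numerical check of this limit, using the construction above (centers on a grid over \([-\log J, \log J]\)); the code is only a sanity check, not part of the original derivation:

```python
import numpy as np

def finite_basis_kernel(x, xp, J, ell=1.0, sigma2=1.0):
    """k(x, x') for f(x) = sum_i w_i * phi_i(x), with w_i ~ N(0, sigma2 / J)
    and Gaussian bumps phi_i centred on a grid of J points."""
    centers = np.linspace(-np.log(J), np.log(J), J)
    phi_x = np.exp(-(x - centers) ** 2 / (2 * ell**2))
    phi_xp = np.exp(-(xp - centers) ** 2 / (2 * ell**2))
    return (sigma2 / J) * np.sum(phi_x * phi_xp)

# As J grows, the kernel's shape as a function of (x - x') approaches the
# RBF form exp(-(x - x')^2 / (4 ell^2)), up to an overall constant.
for J in (10, 100, 10_000):
    ratio = finite_basis_kernel(0.0, 1.0, J) / finite_basis_kernel(0.0, 0.0, J)
    print(J, round(ratio, 4), "target:", round(np.exp(-1.0 / 4.0), 4))
```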

This construction illustrates Andrew’s broader point: function-space priors are more interpretable and meaningful than priors placed directly on parameters. In weight space, the prior seems opaque, but in function space, the kernel directly expresses which functions you expect before seeing data. The RBF kernel makes these beliefs explicit, and this derivation demonstrates why it is such a powerful and widely useful inductive bias.


A classic result, due originally to Radford Neal (1996), is that a neural network with a very large number of hidden units converges to a Gaussian process (GP). The core idea is to take a single-hidden-layer neural network of the form

\[f(x) = b + \sum_{i=1}^J v_i\, h(x; u_i),\]

where \(h(x; u_i)\) is any bounded activation function, \(u_i\) are the input-to-hidden weights, and \(v_i\) are the hidden-to-output weights. The parameters \(b\) and \(v_i\) are assumed to be drawn independently from zero-mean Gaussian distributions with variances \(\sigma_b^2\) and \(\sigma_v^2 / J\), respectively. The scaling of variances by \(1/J\) is essential: it ensures that as the number of hidden units \(J\) grows, the contribution of each unit becomes small enough that the sum remains well-behaved. The input-to-hidden weights \(u_i\) are also taken to be i.i.d. random vectors drawn from some fixed distribution. Together, these assumptions define a probability distribution over functions \(f(x)\), because randomness in the weights induces randomness in the output.

Once all parameters are treated as random variables, we can compute the mean and covariance of this function prior. The mean is zero:

\[\mathbb{E}_w[f(x)] = 0,\]

because all the weights have zero mean. The covariance between outputs at two inputs \(x\) and \(x'\) is

\[\operatorname{cov}(f(x), f(x')) = \sigma_b^2 + \frac{\sigma_v^2}{J} \sum_{i=1}^J \mathbb{E}_{u_i}[h(x; u_i)\, h(x'; u_i)].\]

Because the \(u_i\) are identically distributed, each term in the sum has the same expectation, and so the covariance simplifies to:

\[\operatorname{cov}(f(x), f(x')) = \sigma_b^2 + \sigma_v^2\, \mathbb{E}_{u}[h(x; u)\, h(x'; u)].\]

This is the neural network kernel: a function that expresses how strongly the network correlates outputs at different inputs before any training occurs. Crucially, this expression does not depend on \(J\); it is determined only by the activation function and the prior over weights.

The final—and most powerful—step is to note that \(f(x)\) is a sum of \(J\) independent random terms, each contributing a small amount. As \(J\) becomes large, the central limit theorem implies that the distribution of \((f(x_1), \ldots, f(x_N))\) over any finite collection of inputs converges to a multivariate Gaussian. The mean is zero, and the covariance matrix is given by the kernel above. Therefore, in the infinite-width limit, a neural network becomes exactly a Gaussian process with kernel

\[k_{\mathrm{NN}}(x, x') = \sigma_b^2 + \sigma_v^2\, \mathbb{E}_{u}[h(x; u)\, h(x'; u)].\]

This result is foundational because it connects neural networks with Gaussian processes and reveals the function-space prior that an infinitely wide network implicitly encodes. It shows that the inductive bias of the network — what kinds of functions it considers likely before training — is fully captured by a kernel determined by the activation function and the weight distribution. Andrew uses this to argue that thinking in function space, via kernels, is often more interpretable and principled than reasoning in parameter space, because the kernel directly describes the smoothness, correlations, and general properties of functions the network prefers. In this way, infinite neural networks serve as a bridge between deep learning and classical Bayesian nonparametrics, and the neural network kernel characterizes this connection with mathematical clarity.
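
The covariance identity is easy to check numerically. The sketch below uses a simplified hidden unit \(h(x; u) = \tanh(u x)\) with a scalar input weight and no hidden-layer bias (an assumption made here for brevity): the empirical \(\mathbb{E}[f(x) f(x')]\) over many random networks matches the kernel for every width, up to Monte Carlo error, while Gaussianity of the joint distribution is what requires \(J\) to be large.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_b, sigma_v, sigma_u = 1.0, 1.0, 2.0
x, xp = 0.3, -0.8                        # two scalar inputs

def prior_draws(J, n_nets=5000):
    """Sample f(x), f(x') from random networks f(x) = b + sum_i v_i tanh(u_i x),
    with v_i ~ N(0, sigma_v^2 / J), b ~ N(0, sigma_b^2), u_i ~ N(0, sigma_u^2)."""
    u = rng.normal(0.0, sigma_u, size=(n_nets, J))
    v = rng.normal(0.0, sigma_v / np.sqrt(J), size=(n_nets, J))
    b = rng.normal(0.0, sigma_b, size=n_nets)
    fx = b + np.sum(v * np.tanh(u * x), axis=1)
    fxp = b + np.sum(v * np.tanh(u * xp), axis=1)
    return fx, fxp

# Kernel predicted by the argument above:
# k(x, x') = sigma_b^2 + sigma_v^2 * E_u[tanh(u x) tanh(u x')].
u = rng.normal(0.0, sigma_u, size=200_000)
k_theory = sigma_b**2 + sigma_v**2 * np.mean(np.tanh(u * x) * np.tanh(u * xp))

for J in (1, 10, 1000):
    fx, fxp = prior_draws(J)
    print(J, round(np.mean(fx * fxp), 2), "vs", round(k_theory, 2))
```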

Conceptually, deep kernel learning (DKL) allows us to retain everything that is attractive about Gaussian processes — such as automatic Occam’s razor, closed-form uncertainty, and the ability to adapt model complexity to the amount of data — while capturing the compositional inductive biases that make deep networks successful on high-dimensional inputs like images or audio. Instead of choosing a kernel by hand (like the RBF kernel), DKL lets the network learn a sophisticated kernel implicitly through its hidden layers. In the limit, the final feature layer can be seen as a highly expressive, nonlinear, learned embedding space in which a GP performs smooth, uncertainty-aware function inference. Unlike classical deep learning, training is not simply optimizing a loss; instead, the marginal likelihood balances fit and complexity, removing the need for heuristic regularization and enabling principled hyperparameter selection.

In this way, DKL embodies Andrew’s broader philosophy that we should combine the representational benefits of deep learning with the Bayesian, function-space advantages of Gaussian processes. The neural network provides a flexible transformation that discovers useful structure in the input, and the Gaussian process placed on top ensures principled uncertainty quantification, calibrated predictions, and automatic complexity control. The resulting model remains nonparametric — it can adapt its effective capacity as the dataset grows — while learning deep, high-level representations that fixed kernels could never capture. This makes deep kernel learning a bridge between two historically separate approaches, using the strengths of each to mitigate the weaknesses of the other.
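
Here is a minimal sketch of the deep kernel learning structure in plain PyTorch rather than a dedicated GP library: a small network produces features, an RBF kernel is applied to those features, and the network weights and kernel hyperparameters are trained jointly by maximizing the exact GP marginal likelihood. It is a toy illustration under my own choices of architecture and data, not Wilson’s implementation.

```python
import math
import torch

torch.manual_seed(0)

# Toy 1-D regression data.
X = torch.linspace(-3, 3, 60).unsqueeze(-1)
y = torch.sin(2 * X).squeeze() + 0.1 * torch.randn(60)

# Deep kernel: a network maps inputs to features; an RBF kernel with
# learnable lengthscale and amplitude is applied to those features.
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2)
)
log_ell = torch.zeros(1, requires_grad=True)      # log lengthscale
log_amp = torch.zeros(1, requires_grad=True)      # log signal amplitude
log_noise = torch.tensor([-2.0], requires_grad=True)

def neg_log_marginal_likelihood():
    Z = net(X)                                    # learned feature map
    d2 = (Z.unsqueeze(1) - Z.unsqueeze(0)).pow(2).sum(-1)
    K = (2 * log_amp).exp() * torch.exp(-0.5 * d2 / (2 * log_ell).exp())
    K = K + (2 * log_noise).exp() * torch.eye(len(y))
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(-1), L)
    fit = 0.5 * (y.unsqueeze(-1) * alpha).sum()       # (1/2) y^T K^{-1} y
    complexity = torch.log(torch.diagonal(L)).sum()   # (1/2) log |K|
    return fit + complexity + 0.5 * len(y) * math.log(2 * math.pi)

# Network weights and kernel hyperparameters are trained jointly by
# maximizing the marginal likelihood (minimizing its negative).
params = list(net.parameters()) + [log_ell, log_amp, log_noise]
opt = torch.optim.Adam(params, lr=0.01)
for step in range(500):
    opt.zero_grad()
    loss = neg_log_marginal_likelihood()
    loss.backward()
    opt.step()
```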


The Spectral Mixture (SM) kernel is rooted in one of the foundational results in kernel theory: Bochner’s theorem. This theorem states that any stationary kernel \(k(\tau)\), where \(\tau = x - x'\), can be represented as the Fourier transform of a non-negative spectral density \(S(s)\). Formally, Bochner’s theorem gives:

\[k(\tau) = \int_{\mathbb{R}^P} S(s)\; e^{2\pi i\, s^\top \tau} \;ds.\]

This means designing stationary kernels is equivalent to designing spectral densities. Rather than proposing kernel functions directly, Andrew suggests modeling expressive, interpretable spectral densities \(S(s)\) to encode meaningful patterns.

A flexible and expressive choice is to represent \(S(s)\) as a mixture of Gaussians. In one dimension, a symmetric Gaussian mixture component has the form:

\[S(s) = \frac{1}{2}\left[ \mathcal{N}(s; \mu, \sigma^2) + \mathcal{N}(-s; \mu, \sigma^2) \right]\]

where the symmetry ensures the resulting kernel is real-valued. Evaluating the Fourier transform of this spectral density gives the kernel:

\[k(\tau) = \exp\left(-2 \pi^2 \tau^2 \sigma^2 \right)\, \cos(2\pi \tau \mu)\]

which is a product of a cosine (encoding periodicity) and a Gaussian envelope (encoding local smoothness). Even a single Gaussian in the spectral domain yields a kernel that can represent oscillations, trends, and smooth variation—capabilities standard kernels like the RBF do not combine.

To generalize to higher dimensions, Andrew models the spectral density as a mixture of \(Q\) Gaussians, each with a diagonal covariance matrix \(M_q = \mathrm{diag}(v_q^{(1)}, \ldots, v_q^{(P)})\). This leads to the full spectral mixture kernel:

\[k(\tau) = \sum_{q=1}^Q w_q\, \cos \left(2\pi\, \tau^\top \mu_q \right) \prod_{p=1}^P \exp\left(-2\pi^2 \tau_p^2\, v_q^{(p)}\right)\]

where each component is defined by a weight \(w_q\), mean frequency \(\mu_q\), and lengthscale parameters \(v_q^{(p)}\). With enough components, this kernel can approximate any stationary kernel arbitrarily well. Importantly, each mixture component corresponds to a distinct “spectral mode” in the data, with clear interpretations in terms of frequency, smoothness, and variability. As a result, SM kernels provide powerful tools for pattern discovery, enabling automatic learning of periodicities, long-range dependencies, or multiple frequencies from raw time series or spatial data.
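
In one dimension (\(P = 1\)) the expression above reduces to a few lines of NumPy; the component values below are made up purely for illustration.

```python
import numpy as np

def spectral_mixture_kernel_1d(tau, weights, means, variances):
    """1-D spectral mixture kernel at lags tau, with one entry per component
    q in weights (w_q), means (mu_q), and variances (v_q)."""
    tau = np.asarray(tau, dtype=float)[..., None]    # broadcast over components
    envelope = np.exp(-2.0 * np.pi**2 * tau**2 * variances)
    return np.sum(weights * envelope * np.cos(2.0 * np.pi * tau * means), axis=-1)

# Two spectral modes: a slow, long-range oscillation and a faster, rougher one.
taus = np.linspace(0.0, 10.0, 200)
k = spectral_mixture_kernel_1d(
    taus,
    weights=np.array([1.0, 0.3]),
    means=np.array([0.25, 1.0]),         # mean frequencies mu_q
    variances=np.array([0.001, 0.01]),   # bandwidths v_q of the spectral peaks
)
```

Plotting k against taus shows decaying oscillations whose periods match the chosen frequencies, exactly the kind of structure a pure RBF kernel cannot express.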

In the function-space view, the GP model for extrapolation uses the SM kernel in a Gaussian process prior:

\[f(x) \sim \mathcal{GP}(0,\, k_{\mathrm{SM}}(x, x' \mid \theta))\]

allowing the GP to express complex patterns through learned hyperparameters \(\theta\). Learning proceeds by maximizing the marginal likelihood:

\[\log p(y \mid \theta, X) = -\underbrace{\frac{1}{2} y^\top (K_\theta + \sigma^2 I)^{-1} y}_{\text{model fit}} -\underbrace{\frac{1}{2} \log |K_\theta + \sigma^2 I|}_{\text{complexity penalty}} -\frac{N}{2} \log (2\pi)\]

which naturally balances model fit and complexity. The first term encourages the model to explain the data well, while the log-determinant penalizes unnecessary complexity. Because SM kernels are expressive yet have a manageable number of parameters, the marginal likelihood efficiently selects the right number and type of frequencies in the data. This enables the GP to achieve structured extrapolation that is far beyond what is possible with standard kernels.
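
The fit/complexity split is easy to compute once the kernel matrix is formed; the short sketch below uses a plain RBF kernel matrix as a stand-in for \(K_\theta\).

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, noise_var):
    """Return (log evidence, fit term, complexity term) for a zero-mean GP."""
    N = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit = -0.5 * y @ alpha                    # -(1/2) y^T (K + s^2 I)^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))  # -(1/2) log |K + s^2 I|
    return fit + complexity - 0.5 * N * np.log(2 * np.pi), fit, complexity

# Example with an RBF kernel matrix standing in for K_theta.
x = np.linspace(0, 5, 40)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
y = np.sin(x) + 0.1 * np.random.default_rng(4).standard_normal(40)
log_Z, fit, complexity = gp_log_marginal_likelihood(K, y, noise_var=0.01)
print(round(log_Z, 1), round(fit, 1), round(complexity, 1))
```

Maximizing this quantity over the kernel hyperparameters \(\theta\), for example with gradient-based optimization, is how the SM kernel’s frequencies and bandwidths are learned from data.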

What makes SM kernels so central to Andrew’s approach is that they define an interpretable, compositional prior directly in function space. Instead of relying on parameter-space priors or rigid kernels like the RBF, the SM kernel learns the data’s underlying generative structure—its rhythms, trends, and patterns—through a principled Bayesian lens. This is inductive bias at its best: flexible enough to capture real-world phenomena, yet structured enough to avoid overfitting via the marginal likelihood.

Open questions

Although modern deep networks exhibit a strong tendency toward simple, compressible solutions — even as they grow to billions of parameters — we still do not understand why. Geometric intuitions about flat minima and high-dimensional volume provide partial explanations, but Andrew stresses that they are incomplete. Both SGD and full-batch gradient descent can produce similar simplicity effects, and even random “guess-and-check” methods can land in good solutions, suggesting the phenomenon is deeper than optimization dynamics alone. Understanding the true mechanism linking scale → simplicity → generalization is arguably the most important unanswered question in deep learning theory, because it underlies phenomena like double descent, benign overfitting, mode connectivity, and the empirical success of enormous overparameterized models.

Current models rely on scale as a crude way to induce desirable inductive biases — but Andrew views this as an inelegant, computationally wasteful solution. The open question is whether we can design explicit mechanisms or objective functions that favor compressible, structure-aware hypotheses without needing billions of parameters. Doing so may require new forms of regularization, explicit approximations to Bayesian marginalization, architectures that incorporate universal priors, or entirely new ways of controlling the geometry of the loss landscape. Success here would democratize strong generalization, enabling small or medium-sized models to perform like today’s largest systems. In Andrew’s view, “moving beyond scale” while preserving its beneficial biases is one of the most profound and unresolved technical challenges.

Andrew emphasizes that epistemic uncertainty is essential for safe decision-making, generalization, and scientific reasoning — yet no scalable method exists for modeling it in the context of modern LLMs and giant transformers. Classical Bayesian approaches do not scale, variational methods often fail, and deep ensembles only approximate posterior averaging crudely. As models grow larger, epistemic uncertainty becomes more important, not less, because overparameterization increases the number of plausible explanations consistent with the data. The question is: how can we bring principled, tractable Bayesian reasoning — or something functionally equivalent to it — into the foundation model era? This remains a major unsolved problem and one Andrew sees as foundational to future progress.
