Bayesian deep learning: uncertainty, inductive bias, and modern theory
Oct 10, 2025
Modern neural networks generalize in ways the classical bias-variance story cannot explain. Huge models trained on relatively small datasets. Overparameterized systems that fit training data perfectly without collapsing into overfitting. Performance that improves with scale, sometimes dramatically. These behaviors are unusual, but not inexplicable. The cleanest frame I have found for them comes from Andrew Gordon Wilson’s work on Bayesian deep learning, which recasts older ideas — inductive bias, generalization, compression, Bayesian inference — for the era of overparameterization.
The central shift is this: expressiveness alone is not the problem. More expressive models can generalize better when paired with the right kinds of soft bias. Those biases might come from architecture, optimization, compression, Bayesian marginalization, or a tendency toward flatter and simpler solutions. Larger models are not just bigger memorization machines. As they scale, they often develop stronger preferences for simple, compressible explanations of the data.
This helps explain phenomena like double descent. Once a model passes the interpolation threshold and can perfectly fit the training set, further increases in scale can actually improve generalization. That sounds counterintuitive under the classical bias-variance story, but it makes sense if scale brings with it a preference for smoother, flatter, more compressible functions.
Real-world data is not arbitrary. The world has structure. It tends to be highly compressible: patterns in physics, biology, engineering, perception, and language are not random noise. Good models succeed because they exploit this structure.
Gaussian processes and kernels make this idea especially explicit. The RBF, or Gaussian, kernel is a clean example of a useful inductive bias in function space. It encodes smoothness and locality: nearby inputs should have similar outputs, and similarity should decay as distance increases. That assumption appears again and again across natural and engineered systems.
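To make the smoothness and locality assumption concrete, here is a minimal NumPy sketch of the RBF kernel. The function name `rbf_kernel` and the `lengthscale` and `variance` parameters are my own illustrative choices, not taken from any particular library.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    x1 = np.atleast_2d(x1)
    x2 = np.atleast_2d(x2)
    # Pairwise squared Euclidean distances, shape (len(x1), len(x2)).
    sq_dists = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

# Similarity between x = 0 and points at increasing distance from it.
x0 = np.array([[0.0]])
others = np.array([[0.0], [0.5], [1.0], [3.0]])
similarities = rbf_kernel(x0, others)[0]

# Gram matrix on the same points; it should be symmetric and positive semi-definite.
K = rbf_kernel(others, others)
```

The entries of `similarities` start at the kernel variance for identical inputs and shrink monotonically with distance, which is the locality assumption in numerical form. The lengthscale controls how quickly that decay happens.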
Other kernels, such as spectral mixture kernels, encode additional structure like periodicity and long-range dependencies. These are not arbitrary mathematical tricks. They are ways of expressing beliefs about the kinds of patterns that tend to appear in the world.
Deep kernel learning extends this idea by using a neural network to learn a feature representation, then applying a Gaussian process on top of that representation. The result is a model that can capture complex patterns while retaining the smoothness, uncertainty awareness, and probabilistic interpretation of kernel methods.
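The structure of a deep kernel can be sketched in a few lines: an RBF kernel applied to network features rather than raw inputs. This is a toy under loud assumptions. The feature extractor is a tiny MLP with fixed random weights (`W1`, `b1`, `W2` are hypothetical), whereas in deep kernel learning proper the network weights are learned, typically jointly with the kernel hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature extractor: a two-layer MLP with fixed random weights.
# In a real deep kernel these weights would be trained, not sampled.
W1 = rng.normal(size=(2, 16))
b1 = rng.normal(size=16)
W2 = rng.normal(size=(16, 3))

def features(x):
    """Map inputs of shape (N, 2) to learned-style features of shape (N, 3)."""
    return np.tanh(x @ W1 + b1) @ W2

def rbf(a, b, lengthscale=1.0):
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def deep_kernel(x1, x2):
    """k(x, x') = k_RBF(g(x), g(x')), where g is the feature extractor."""
    return rbf(features(x1), features(x2))

X = rng.normal(size=(5, 2))
K = deep_kernel(X, X)
```

Because the base kernel is positive semi-definite and the features are a deterministic function of the inputs, the composed kernel is still a valid kernel, so everything that works for a standard Gaussian process carries over.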
This is where Bayesian deep learning enters naturally. Bayesian methods force us to represent uncertainty and encode assumptions explicitly. They ask not just “what function fits the data?” but “what family of functions should we believe in, given the structure of the world and the evidence we have seen?”
The Gaussian kernel is important not because it is the only reasonable prior over functions, but because it shows, in a mathematically clean way, what good inductive bias looks like. Its role in machine learning comes from its ability to define a space of smooth, infinitely differentiable functions while still allowing flexible nonlinear modeling. As a positive-definite kernel, it corresponds to an inner product in a reproducing kernel Hilbert space, which enables methods like support vector machines and kernel ridge regression to operate implicitly in very high- or even infinite-dimensional spaces through the kernel trick.
The broader question is whether we can identify a small set of powerful, kernel-like inductive biases that capture the recurring structure of the real world. If we can, then models may be able to generalize more systematically across domains — not just by scaling parameters, but by building in better assumptions about the kinds of patterns reality tends to contain.
The Bayesian predictive distribution for a new input \(x_*\) is given by
\[p(y \mid x_*, \mathbf{y}, \mathbf{X}) = \int p(y \mid x_*, \mathbf{w}) \, p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) \, d\mathbf{w}.\]
The key idea is simple: every possible setting of the weights \(\mathbf{w}\) defines a different model \(f(x,\mathbf{w})\). Bayesian prediction does not choose one of these models and discard the rest. Instead, it averages over all plausible models, weighting each by how likely it is under the posterior after observing the data.
This is Bayesian model averaging. Rather than pretending that there is one uniquely correct set of parameters, the Bayesian view recognizes that many different functions may explain the same observations. The predictive distribution integrates over that uncertainty. This is especially important because it captures epistemic uncertainty: uncertainty about which function, model, or explanation is actually responsible for the data.
Standard neural network training can be seen as a very crude approximation to this Bayesian ideal. Classical training effectively replaces the full posterior distribution with a point mass at the maximum a posteriori estimate:
\[q(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = \delta(\mathbf{w} - \mathbf{w}_{\mathrm{MAP}}).\]
In other words, ordinary training acts as if a single weight vector is the only plausible explanation of the data. This throws away epistemic uncertainty. It also loses one of the most important features of Bayesian inference: automatic complexity control through marginalization.
Bayesian marginalization is fundamentally different from optimization. Optimization asks: what is the best parameter setting? Marginalization asks: what predictions do we get after averaging over all plausible parameter settings? That distinction matters. A Bayesian model does not collapse uncertainty into a single solution. It preserves uncertainty and carries it forward into prediction.
Writing the dataset as \(D = (\mathbf{X}, \mathbf{y})\), the same predictive distribution can be written more compactly as:
\[p(y \mid x_*, D) = \int p(y \mid x_*, \mathbf{w})\,p(\mathbf{w} \mid D)\,d\mathbf{w}.\]
For neural networks, this integral is almost never tractable. The posterior over weights is high-dimensional, irregular, and often multimodal. Practical Bayesian deep learning therefore approximates the integral. A common approach is Monte Carlo estimation: draw samples \(\mathbf{w}_j\) from an approximate posterior \(q(\mathbf{w} \mid D)\), then average the predictions:
\[p(y \mid x_*, D) \approx \frac{1}{J} \sum_{j=1}^{J} p(y \mid x_*, \mathbf{w}_j).\]
Each sampled weight setting represents a different plausible model. The final prediction is an average across these models. So the practical challenge becomes: how do we get good samples from something close to the true posterior?
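The Monte Carlo average is easiest to check on a model where the posterior is known exactly. The sketch below swaps the neural network for conjugate Bayesian linear regression, which is an assumption made purely so that posterior samples can be drawn exactly and the result compared against the closed form. The data, `noise_var`, and `prior_var` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = -1 + 2x + noise, with a bias feature in the first column.
N, noise_var, prior_var = 20, 0.25, 10.0
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])
y = X @ np.array([-1.0, 2.0]) + rng.normal(scale=np.sqrt(noise_var), size=N)

# Exact posterior p(w | D) = N(m, S) under the prior w ~ N(0, prior_var * I).
S = np.linalg.inv(X.T @ X / noise_var + np.eye(2) / prior_var)
m = S @ X.T @ y / noise_var

# Monte Carlo model average: sample weights, then average the predictions.
J = 5000
W = rng.multivariate_normal(m, S, size=J)      # shape (J, 2)
x_star = np.array([1.0, 3.0])                  # x = 3 lies outside the training range
means_j = W @ x_star                           # predictive mean of each sampled model
mc_mean = means_j.mean()
mc_var = noise_var + means_j.var()             # law of total variance for the mixture

# Closed-form predictive moments, for comparison.
exact_mean = m @ x_star
exact_var = noise_var + x_star @ S @ x_star

# A point estimate keeps only the noise term and drops the epistemic part.
plug_in_var = noise_var
```

The gap between `mc_var` and `plug_in_var` is the epistemic uncertainty that a single weight vector cannot express, and it grows as `x_star` moves away from the training inputs.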
There are two broad approaches.
The first is deterministic approximate inference, especially variational inference. Here, we choose a simpler distribution \(q(\mathbf{w} \mid D, \theta)\), often something tractable like a Gaussian, and tune its parameters \(\theta\) so that it approximates the true posterior, typically by minimizing the KL divergence from \(q\) to the posterior (equivalently, maximizing the evidence lower bound). From this perspective, standard neural network training is the most extreme version of variational inference: the approximate posterior collapses to a delta function at \(\mathbf{w}_{\mathrm{MAP}}\).
The second is sampling-based inference, usually through Markov chain Monte Carlo. Methods such as Metropolis-Hastings, Hamiltonian Monte Carlo, stochastic gradient Langevin dynamics, and stochastic gradient Hamiltonian Monte Carlo try to construct samples whose distribution matches the true posterior. These methods are more faithful to the Bayesian objective, at least asymptotically, but they are often too expensive for modern deep networks.
Both approaches are attempts to approximate the same thing: a posterior distribution over plausible explanations. They differ mainly in the tradeoff between computational practicality and statistical fidelity.
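As a small illustration of the sampling route, here is a random-walk Metropolis sampler (the simplest Metropolis-Hastings variant) on a one-dimensional toy posterior. The model, a Gaussian mean with a standard normal prior, is an assumption chosen because its exact posterior is known. The step size and chain length are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=30)    # toy observations

def log_post(w):
    """Unnormalized log posterior: prior w ~ N(0, 1), likelihood y_i ~ N(w, 1)."""
    return -0.5 * w**2 - 0.5 * np.sum((y - w) ** 2)

def metropolis(log_p, w0, n_steps, step, rng):
    samples = np.empty(n_steps)
    w, lp = w0, log_p(w0)
    for t in range(n_steps):
        proposal = w + step * rng.normal()     # symmetric random-walk proposal
        lp_prop = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(w)).
        if np.log(rng.uniform()) < lp_prop - lp:
            w, lp = proposal, lp_prop
        samples[t] = w                         # a rejected step repeats the current state
    return samples

chain = metropolis(log_post, w0=0.0, n_steps=20000, step=0.5, rng=rng)
samples = chain[2000:]                         # discard burn-in

# Exact posterior for this conjugate model: N(sum(y) / (n + 1), 1 / (n + 1)).
exact_mean = y.sum() / (len(y) + 1)
exact_var = 1.0 / (len(y) + 1)
```

The sampler only ever needs the unnormalized posterior, which is why this family of methods applies even when the evidence is intractable. The cost is that every step requires a full pass over the data, which is part of why such methods struggle at deep-learning scale.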
Bayesian inference also gives a principled way to compare models through model evidence, or marginal likelihood. Classical training usually focuses on fitting the observed data. Bayesian model comparison asks a deeper question: how probable was this dataset under the model before the model saw it?
For a candidate model \(M_i\), the evidence is:
\[p(\mathbf{y} \mid M_i) = \int p(\mathbf{y} \mid f, M_i)\,p(f \mid M_i)\,df.\]
Here, \(p(f \mid M_i)\) is the prior over functions, and \(p(\mathbf{y} \mid f, M_i)\) is the likelihood of the observed data given a particular function.
This quantity automatically encodes Occam’s razor. A model that is too simple will not place enough probability on the observed data. But a model that is too flexible can also be penalized, because it spreads its probability mass across too many possible datasets. The best model is not necessarily the one with the most capacity. It is the one whose assumptions place high prior probability on the kind of structure that actually appears in the data.
Bayes’ rule then gives the posterior probability of each model:
\[p(M_i \mid \mathbf{y}) = \frac{ p(\mathbf{y} \mid M_i)\,p(M_i) }{ p(\mathbf{y}) }.\]
This is the deeper Bayesian view of generalization. A good model is not merely one that fits the data after the fact. It is one that made the data likely in advance, because its prior assumptions were well aligned with the structure of the world.
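The Occam effect can be seen in a deliberately tiny example where the evidence has a closed form. Assume each candidate model says the data are noisy observations of a single unknown constant \(f\), with \(f \sim \mathcal{N}(0, \alpha)\), and that the models differ only in the prior variance \(\alpha\). The observations, the noise level, and the three values of \(\alpha\) are all made up for illustration.

```python
import numpy as np

y = np.array([0.9, 1.1, 1.0])     # toy observations
noise_var = 0.1

def log_evidence(y, prior_var, noise_var):
    """log p(y | M) when f ~ N(0, prior_var) and y_i = f + Gaussian noise.

    Marginalizing f gives y ~ N(0, prior_var * 1 1^T + noise_var * I).
    """
    n = len(y)
    cov = prior_var * np.ones((n, n)) + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))

prior_vars = [0.01, 1.0, 100.0]   # too rigid, well matched, too diffuse
log_ev = np.array([log_evidence(y, v, noise_var) for v in prior_vars])

# Posterior over models via Bayes' rule, assuming equal prior probability p(M_i).
posterior = np.exp(log_ev - log_ev.max())
posterior /= posterior.sum()
best = prior_vars[int(np.argmax(log_ev))]
```

The rigid model loses because it insists \(f\) is near zero and the data disagree. The diffuse model fits just as well as the middle one but pays for spreading its prior mass over values of \(f\) that never occur. The middle model has the highest evidence in this example, although how sharply it wins depends on the specific numbers chosen.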
The Spectral Mixture (SM) kernel, due to Wilson and Adams (2013), is grounded in one of the central results in kernel theory: Bochner’s theorem. It says that a continuous stationary function \(k(\tau)\), where \(\tau = x - x'\), is a valid positive-definite kernel if and only if it is the Fourier transform of a non-negative finite measure. When that measure has a density \(S(s)\), called the spectral density, the kernel can be written as:
\[k(\tau) = \int_{\mathbb{R}^P} S(s)\,e^{2\pi i\,s^\top \tau}\,ds.\]
The implication is important: designing stationary kernels is equivalent to designing spectral densities. Instead of directly inventing kernel functions in input space, the SM construction models expressive and interpretable spectral densities \(S(s)\) that encode meaningful structure in the data.
A natural way to do this is to represent \(S(s)\) as a mixture of Gaussians. In one dimension, a symmetric Gaussian mixture component can be written as:
\[S(s) = \frac{1}{2} \left[ \mathcal{N}(s;\mu,\sigma^2) + \mathcal{N}(-s;\mu,\sigma^2) \right].\]
The symmetry ensures that the resulting kernel is real-valued. Taking the Fourier transform of this spectral density gives:
\[k(\tau) = \exp\left(-2\pi^2\tau^2\sigma^2\right) \cos(2\pi\tau\mu).\]
This kernel has a clean interpretation. The cosine term captures periodic structure, while the Gaussian envelope captures local smoothness. Even a single Gaussian component in the spectral domain gives a kernel that can represent oscillation, smooth variation, and localized structure, a combination that standard kernels like the RBF do not capture as naturally.
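The closed form can be checked numerically by integrating the symmetrized spectral density against \(e^{2\pi i s \tau}\) on a grid. The values of `mu` and `sigma`, the grid limits, and the grid resolution below are arbitrary illustrative choices. A plain Riemann sum is used for the integral, which is accurate here because the integrand is smooth and decays rapidly.

```python
import numpy as np

mu, sigma = 1.5, 0.3      # spectral mean and standard deviation (illustrative)

def gauss(s, mean, std):
    return np.exp(-0.5 * ((s - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def spectral_density(s):
    """S(s) = 0.5 * [N(s; mu, sigma^2) + N(-s; mu, sigma^2)]."""
    return 0.5 * (gauss(s, mu, sigma) + gauss(-s, mu, sigma))

def sm_kernel_1d(tau):
    """Closed form: exp(-2 pi^2 tau^2 sigma^2) * cos(2 pi tau mu)."""
    return np.exp(-2 * np.pi**2 * tau**2 * sigma**2) * np.cos(2 * np.pi * tau * mu)

# Numerical Fourier transform of S on a fine frequency grid.
s = np.linspace(-8.0, 8.0, 20001)
ds = s[1] - s[0]
taus = np.linspace(-2.0, 2.0, 41)
numeric = np.array(
    [np.sum(spectral_density(s) * np.exp(2j * np.pi * s * t)) * ds for t in taus]
)

max_err = np.max(np.abs(numeric.real - sm_kernel_1d(taus)))
max_imag = np.max(np.abs(numeric.imag))
```

The imaginary part vanishes up to floating-point error because \(S\) is symmetric, which is the numerical counterpart of the statement that symmetry makes the kernel real-valued.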
In higher dimensions, the spectral density is built up as a separable product of one-dimensional symmetric mixtures, with each input axis carrying its own frequency \(\mu_q^{(p)}\) and scale \(v_q^{(p)}\). Symmetrizing each component dimension independently and taking the Fourier transform yields the full Spectral Mixture kernel:
\[k(\tau) = \sum_{q=1}^{Q} w_q \prod_{p=1}^{P} \exp\left( -2\pi^2\tau_p^2v_q^{(p)} \right) \cos\left(2\pi\,\tau_p\mu_q^{(p)}\right).\]
Each component has a weight \(w_q\) and, for every input dimension \(p\), a mean frequency \(\mu_q^{(p)}\) and scale parameter \(v_q^{(p)}\). With enough components, the SM kernel can approximate any stationary kernel arbitrarily well. More importantly, each component has an interpretable role: it corresponds to a distinct spectral mode in the data, capturing a particular frequency, smoothness pattern, or source of variation.
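A direct NumPy transcription of this formula might look like the following sketch. The function name `sm_kernel`, the array layout, and the example hyperparameters are my own assumptions. In practice the weights, means, and variances would be learned rather than set by hand.

```python
import numpy as np

def sm_kernel(X1, X2, weights, means, variances):
    """Spectral mixture kernel in the product form given above.

    X1: (N1, P), X2: (N2, P); weights: (Q,); means, variances: (Q, P).
    """
    tau = X1[:, None, :] - X2[None, :, :]       # pairwise differences, (N1, N2, P)
    tau = tau[:, :, None, :]                    # add a component axis, (N1, N2, 1, P)
    envelope = np.exp(-2 * np.pi**2 * tau**2 * variances)   # (N1, N2, Q, P)
    oscillation = np.cos(2 * np.pi * tau * means)           # (N1, N2, Q, P)
    per_component = np.prod(envelope * oscillation, axis=-1)  # product over p, (N1, N2, Q)
    return per_component @ weights              # weighted sum over q, (N1, N2)

# Illustrative hyperparameters: Q = 2 components, P = 2 input dimensions.
weights = np.array([1.0, 0.5])
means = np.array([[0.5, 0.0], [2.0, 1.0]])
variances = np.array([[0.1, 0.2], [0.05, 0.3]])

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
K = sm_kernel(X, X, weights, means, variances)
```

At \(\tau = 0\) every envelope and cosine equals one, so the diagonal of `K` is the sum of the weights, which gives a quick sanity check on any implementation.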
This makes SM kernels powerful tools for pattern discovery. They can automatically learn periodicities, long-range dependencies, multiple frequencies, and other structured patterns directly from time series or spatial data.
In the function-space view, the Gaussian process places a prior over functions using the SM kernel:
\[f(x) \sim \mathcal{GP} \left( 0, k_{\mathrm{SM}}(x,x' \mid \theta) \right).\]
The hyperparameters \(\theta\) determine the learned spectral structure. Learning proceeds by maximizing the marginal likelihood:
\[\log p(\mathbf{y} \mid \theta, \mathbf{X}) = -\underbrace{ \frac{1}{2} \mathbf{y}^\top (K_\theta + \sigma^2 I)^{-1} \mathbf{y} }_{\text{model fit}} -\underbrace{ \frac{1}{2} \log |K_\theta + \sigma^2 I| }_{\text{complexity penalty}} -\frac{N}{2}\log(2\pi).\]
Here \(K_\theta\) is the \(N \times N\) kernel matrix evaluated on the training inputs, \(N\) is the number of training points, and \(\sigma^2\) is the observation-noise variance (not the spectral variance used in the one-dimensional mixture component). This objective naturally balances fit and complexity. The first term rewards the model for explaining the observed data. The log-determinant term penalizes unnecessary complexity. Because SM kernels are expressive but still parameterized in a controlled way, the marginal likelihood can select the relevant frequencies and patterns without simply overfitting.
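For reference, the three terms of the log marginal likelihood can be computed stably with a Cholesky factorization. This is a minimal sketch: the function name is hypothetical, and for brevity the example uses a unit-lengthscale RBF kernel as a stand-in for \(K_\theta\) rather than a full SM kernel. The toy data are made up.

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, noise_var):
    """Return (total, fit term, complexity term) of log p(y | theta, X)."""
    N = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(N))    # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma^2 I)^{-1} y
    fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))             # equals -0.5 * log|K + sigma^2 I|
    const = -0.5 * N * np.log(2 * np.pi)
    return fit + complexity + const, fit, complexity

# Toy one-dimensional regression problem.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 15)
noise_var = 0.1
y = np.sin(x) + rng.normal(scale=np.sqrt(noise_var), size=x.size)

# Stand-in kernel matrix: RBF with unit lengthscale.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

total, fit, complexity = gp_log_marginal_likelihood(K, y, noise_var)
```

Hyperparameter learning then amounts to maximizing `total` with respect to the kernel parameters and the noise variance, usually with a gradient-based optimizer. Returning the fit and complexity terms separately makes it easy to see which one is driving a change in the objective.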
This is inductive bias in a particularly clean form. The model is flexible enough to capture real phenomena, but structured enough for Bayesian marginal likelihood to control complexity. Instead of relying on rigid kernels like the RBF, or vague priors over parameters, the SM kernel learns the structure of the data itself: its rhythms, trends, periodicities, and long-range dependencies.
Open questions
A major unresolved question is why modern deep networks tend to find simple, compressible solutions even when they contain billions of parameters. The empirics are clear; the mechanism is not. Geometric intuitions about flat minima and high-dimensional volume explain part of the story, but they are not sufficient.
One reason the mechanism is hard to pin down is that the simplicity bias does not seem to come only from stochastic optimization. Both SGD and full-batch gradient descent can produce similar effects. Even guess-and-check procedures, which sample parameters at random until they fit the training data, can sometimes land on solutions that generalize well. This suggests that the phenomenon is deeper than optimization dynamics alone.
Understanding the link between scale, simplicity, and generalization may be one of the most important open problems in deep learning theory. It sits underneath double descent, benign overfitting, mode connectivity, and the surprising success of massively overparameterized models.
Another open question is whether we can move beyond scale as the main way to induce useful inductive bias. Today, large models often work because scale indirectly pushes them toward better, simpler, more compressible solutions. But this is a crude and computationally expensive strategy.
The deeper goal is to design explicit mechanisms that favor structured, compressible hypotheses without requiring billions of parameters. This might involve new regularizers, better approximations to Bayesian marginalization, architectures with stronger universal priors, or new ways of shaping the geometry of the loss landscape.
If we could do this, strong generalization would become much more accessible. Small and medium-sized models could inherit some of the benefits that currently seem to require enormous scale.
A final open problem is epistemic uncertainty. Uncertainty about the model itself is essential for safe decision-making, scientific reasoning, and out-of-distribution generalization. But we still do not have a scalable way to represent it in modern LLMs and giant transformers.
Classical Bayesian methods do not scale well enough. Variational methods often give poor approximations. Deep ensembles are useful, but they are still a crude proxy for true posterior averaging. As models get larger, the problem becomes more important, not less, because there are more plausible explanations consistent with the data.
The central question is whether we can bring principled Bayesian reasoning, or something functionally equivalent to it, into the foundation model era. This is not a side issue. It is one of the key problems that must be solved if we want models that are not only powerful, but reliable, interpretable, and honest about what they do not know.