Deep Learning Training Dynamics are not Fisher Geodesics
Nov 16, 2025
A cool bit of math that falls apart empirically
[NOTE] This post exists mostly because the idea was pretty cool (imo) and, once the empirics failed, I wanted somewhere to keep it. Most of this post was transcribed from my original LaTeX proofs by GPT, and as such may contain minor inaccuracies or syntax errors.
Introduction
Different optimizers like SGD and Adam often produce similar loss curves and generalization behavior, even though they use different update rules. SGD does Euclidean gradient descent, Adam uses adaptive diagonal preconditioning, and natural gradient uses the Fisher information matrix. So why do they end up with similar training dynamics?
Over the past couple of months I worked on a theoretical framework proposing that training paths are approximately geodesics in Fisher information space1, and that different optimizers' metrics conformally align at large width, causing their paths to coincide. The theory has three main components: (i) all preconditioned gradient descent methods are equivalent to Riemannian gradient flow under different metrics, (ii) stochastic gradient noise concentrates paths near geodesics via large deviations, and (iii) in overparameterized networks, different optimizers' metrics align with the Fisher metric at rate \(O(p^{-1/2})\).
This would explain why neural scaling laws work, why hyperparameter transfer schemes like \(\mu\)P succeed, and why training is robust across optimizer choices. However, this theory doesn't hold empirically (due to problems with misspecified models and short time limits).
Preliminaries
Before diving into the theory, we need some background concepts from differential geometry, information geometry, and stochastic analysis. This section provides the necessary mathematical foundation.
Information Geometry and Statistical Manifolds
Information geometry studies probability distributions as geometric objects. A parametric family of probability distributions \(\{p_\theta(y|x)\}_{\theta \in \Theta}\) forms a statistical manifold, where each point \(\theta \in \Theta\) represents a different distribution. This lets us measure distances between distributions, define geodesics (the "straight lines" of the manifold), and understand how optimization moves through the space of models.
The space of probability distributions has geometric structure that doesn't depend on parameterization. The Fisher information metric captures this structure and provides a way to measure distances and angles.
Riemannian Manifolds and Metrics
A Riemannian manifold \((\Theta, g)\) consists of a smooth manifold \(\Theta\) equipped with a metric tensor \(g\): a smoothly varying inner product \(g(\theta): T_\theta\Theta \times T_\theta\Theta \to \mathbb{R}\) on each tangent space. The tangent space \(T_\theta\Theta\) at a point \(\theta\) consists of all possible directions in which we can move from \(\theta\); in the context of neural networks, these are directions of parameter change.
In coordinates, if \(\theta \in \mathbb{R}^p\), the metric is a symmetric positive definite matrix \(g(\theta) \in \mathbb{R}^{p \times p}\), defining the inner product \(\langle u, v \rangle_{g(\theta)} = u^\top g(\theta) v\) and inducing the norm \(\norm{v}_{g(\theta)} = \sqrt{v^\top g(\theta) v}\) for tangent vectors \(u, v \in T_\theta\Theta \cong \mathbb{R}^p\). The metric tensor tells us how to measure lengths and angles in the tangent space, which varies smoothly as we move through the manifold.
Different metrics give different notions of distance. The Euclidean metric \(g = I\) treats all parameter directions equally, while the Fisher metric weights directions by how much they affect the model's predictions.
Geodesics
A geodesic is a curve \(\gamma: [0,T] \to \Theta\) that locally minimizes distance. In Euclidean space, geodesics are straight lines. On a curved manifold, geodesics are curves with zero acceleration in the Riemannian sense.
Geodesics satisfy the geodesic equation \(\frac{\mathrm{D}}{\mathrm{d}t}\dot{\gamma}(t) = 0\), where \(\frac{\mathrm{D}}{\mathrm{d}t}\) is the Levi-Civita covariant derivative2 of \(g\). This means the velocity vector \(\dot{\gamma}\) is parallel-transported along the curve: it doesn't rotate relative to the local geometry.
A minimizing geodesic between two points is the shortest path connecting them; in a normal convex neighborhood, such geodesics are unique. The length of a curve \(\gamma: [0,T] \to \Theta\) is given by \(\int_0^T \norm{\dot{\gamma}(t)}_{g(\gamma(t))} \mathrm{d}t\), and geodesics minimize this length among all curves connecting their endpoints.
Fisher Information Metric
For a parametric family of probability distributions \(\{p_\theta(y|x)\}_{\theta \in \Theta}\) with log-likelihood \(\ell(\theta; x, y) = \log p_\theta(y|x)\) and score function \(s(\theta; x, y) = \nabla_\theta \ell(\theta; x, y)\), the Fisher information matrix is \begin{equation} \mathcal{I}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim p_\theta(\cdot|x)}\left[s(\theta; x, y) s(\theta; x, y)^\top\right], \end{equation} where inputs \(x\) are drawn from the data distribution \(\mathcal{D}\) and labels \(y\) from the model. The score function measures how sensitive the log-likelihood is to parameter changes, and the Fisher information captures the expected squared sensitivity.
The Fisher information defines a Riemannian metric on the statistical manifold \(\Theta\), known as the Fisher-Rao metric. This metric is uniquely characterized (up to scaling) by its invariance under sufficient statistics and reparameterizations. The Fisher metric has several important properties:
- It measures distances between probability distributions in a way that's invariant to how we parameterize the model
- It weights parameter directions by their impact on the model's predictions
- It provides a natural geometry for understanding optimization in the space of models
- In the context of neural networks, it captures how sensitive the network's output is to changes in each parameter
Under regularity conditions, the Fisher information also equals the negative expected Hessian: \(\mathcal{I}(\theta) = -\mathbb{E}\left[\nabla_\theta^2 \ell(\theta; x, y)\right]\), connecting it to the curvature of the log-likelihood surface.
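To make the definition concrete, here's a quick NumPy sanity check (a toy \(K\)-class categorical model with no inputs, added purely for illustration): averaging score outer products over samples from the model recovers \(\mathrm{diag}(\pi) - \pi\pi^\top\), which for this model is also the negative expected Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
theta = rng.normal(size=K)            # logits of a toy K-class categorical model (no inputs)
pi = np.exp(theta - theta.max())
pi /= pi.sum()                        # model probabilities p_theta(y) = softmax(theta)_y

# Monte Carlo Fisher: average of score outer products, with score s(y) = e_y - pi
n = 200_000
ys = rng.choice(K, size=n, p=pi)      # labels sampled from the model itself
scores = np.eye(K)[ys] - pi
fisher_mc = scores.T @ scores / n

# Closed form for this model: diag(pi) - pi pi^T, which also equals -E[Hessian of log p]
fisher_exact = np.diag(pi) - np.outer(pi, pi)
print(np.max(np.abs(fisher_mc - fisher_exact)))   # small, shrinking like n^{-1/2}
```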
Natural Gradient Descent
Gradient descent in the Fisher metric is called natural gradient descent. The natural gradient is defined as \(\nabla^{\mathcal{I}} L(\theta) = \mathcal{I}(\theta)^{-1} \nabla L(\theta)\), where \(\nabla L\) is the standard Euclidean gradient. Natural gradient descent follows the update rule: \begin{equation} \theta_{k+1} = \theta_k - \eta \mathcal{I}(\theta_k)^{-1} \nabla L(\theta_k). \end{equation}
Natural gradient descent has several appealing properties: it's invariant to reparameterizations of the model, it accounts for the geometry of the parameter space, and it can converge faster than standard gradient descent in certain settings. However, computing and inverting the full Fisher information matrix is computationally expensive for large neural networks, which is why practical optimizers use approximations.
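As a minimal sketch of what a natural gradient step looks like when the Fisher is available in closed form (a toy one-parameter Bernoulli model of my own, not anything from the original write-up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy Bernoulli model p_theta(y = 1) = sigmoid(theta), fit to coin flips with true rate 0.8.
rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.8).astype(float)

def grad_nll(theta):
    # derivative of the average negative log-likelihood
    return sigmoid(theta) - y.mean()

def fisher(theta):
    # scalar Fisher information of the Bernoulli model in the logit parameterization
    p = sigmoid(theta)
    return p * (1.0 - p)

theta, eta = 0.0, 0.5
for _ in range(100):
    theta -= eta * grad_nll(theta) / fisher(theta)   # natural gradient step: I(theta)^{-1} grad L

print(sigmoid(theta), y.mean())       # the fitted rate matches the empirical rate
```

Here "inverting the Fisher" is just dividing by \(\sigma(\theta)(1-\sigma(\theta))\); for real networks that inversion is exactly the part that becomes expensive.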
Conformal Equivalence
Two metrics \(g\) and \(\tilde{g}\) on \(\Theta\) are conformally equivalent if \(\tilde{g}(\theta) = c(\theta) g(\theta)\) for some positive scalar field \(c: \Theta \to \mathbb{R}_{>0}\). Conformally equivalent metrics have the same geodesics (as unparameterized curves), differing only in how they are traversed in time.
Specifically, if \(\gamma(t)\) is a geodesic of \(g\), then \(\gamma(\tau(t))\) is a geodesic of \(\tilde{g}\) where \(\tau\) satisfies \(\mathrm d\tau/\mathrm dt = c(\gamma(t))\). This time reparameterization property is central to our analysis: if two optimizers' metrics are conformally equivalent, they follow the same paths through parameter space, just at different speeds.
Conformal equivalence is weaker than metric equality but stronger than just having similar geodesics. It means the metrics differ only by a scalar factor at each point, preserving angles and the shape of geodesics while allowing different traversal speeds.
Large Deviations Theory and Stochastic Processes
Large deviations theory3 studies the asymptotic behavior of rare events. For our purposes, we use it to show that stochastic processes concentrate on certain "most probable" paths. The Onsager-Machlup functional provides a rate function that characterizes the probability of paths deviating from the most probable one.
When we add noise to gradient descent (as in mini-batch SGD), the resulting stochastic process can be analyzed using large deviations theory. The theory tells us that, as the noise strength goes to zero, the process concentrates on paths that minimize a certain "action" functional. For diffusions on Riemannian manifolds, these most probable paths are often geodesics.
This suggests that noisy optimization might follow geodesic paths in the limit of small noise. However, this breaks down for long time horizons and misspecified models.
The Theoretical Framework
The theory rests on three pillars:
- Optimizers as Riemannian gradient flows: All preconditioned gradient descent methods are equivalent to Riemannian gradient flow under different metric tensors.
- Stochastic dynamics concentrate near geodesics: Mini-batch gradient noise is Fisher-covariant, and stochastic bridge processes concentrate on geodesics via large deviations.
- Metric alignment in overparameterized networks: In the NTK regime, different optimizers' metrics conformally align with the Fisher metric at rate \(O(p^{-1/2})\).
Let's develop each of these in detail.
Optimizers as Riemannian Gradient Flows
The first part of the theory shows that preconditioned gradient descent methods are all instances of Riemannian gradient descent under different metrics. This lets us compare optimizers by comparing their metrics.
Setup
Consider a loss function \(L: \Theta \subset \mathbb{R}^p \to \mathbb{R}\) and a preconditioner \(P: \Theta \to \mathbb{R}^{p \times p}\) that is symmetric positive definite at each point. The preconditioner \(P\) modifies the gradient before applying the update. Define the metric tensor \(G(\theta) = P(\theta)^{-1}\). This inverse relationship is key: the preconditioner and the metric are dual to each other.
The Main Result
Consider the preconditioned update \(\theta_{k+1} = \theta_k - \eta P(\theta_k)\nabla L(\theta_k)\). Then:
- This update performs Riemannian gradient descent with respect to the metric \(G\).
- The continuous-time limit as \(\eta \to 0\) is the Riemannian gradient flow \(\dot{\theta}(t) = -P(\theta(t))\nabla L(\theta(t)) = -\nabla^G L(\theta(t))\).
- If \(P(\theta) = c(\theta) g(\theta)^{-1}\) for scalar \(c(\theta) > 0\) and metric \(g\), then trajectories coincide with \(g\)-gradient flow up to time reparameterization \(\mathrm{d}\tau = c(\theta(t))\mathrm{d}t\).
Proof
The key insight is that preconditioned gradient descent minimizes a local first-order approximation subject to a trust region constraint in the \(G\)-metric.
Part 1: One-Step Minimizer
The update can be written as \(\theta^+ = \theta + \Delta^{\ast}\), where \(\Delta^{\ast}(\theta) = \arg \min_{\Delta \in \mathbb{R}^p} \left\{\langle \nabla L(\theta), \Delta\rangle + \frac{1}{2\eta} \norm{\Delta}^2_{G(\theta)}\right\}\).
The objective is strictly convex since \(G(\theta) \succ 0\). Taking the gradient with respect to \(\Delta\): \begin{equation} \nabla_\Delta \left(\langle\nabla L(\theta), \Delta\rangle + \frac{1}{2\eta} \Delta^\top G(\theta) \Delta\right) = \nabla L(\theta) + \frac{1}{\eta}G(\theta)\Delta. \end{equation}
Setting to zero: \(G(\theta)\Delta^{\ast} = -\eta\nabla L(\theta)\). Multiplying by \(G(\theta)^{-1} = P(\theta)\) gives \(\Delta^{\ast} = -\eta P(\theta)\nabla L(\theta)\). Uniqueness follows from strict convexity.
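Here's a quick numerical check of the closed form (a toy sketch with a random SPD metric \(G\) and random gradient, nothing specific to neural networks): by strict convexity, \(\Delta^{\ast} = -\eta P\nabla L\) should beat any perturbation of itself.

```python
import numpy as np

rng = np.random.default_rng(2)
p, eta = 4, 0.1
grad = rng.normal(size=p)                  # stand-in for the gradient of L at theta
A = rng.normal(size=(p, p))
G = A @ A.T + p * np.eye(p)                # a random symmetric positive definite metric G(theta)
P = np.linalg.inv(G)                       # the corresponding preconditioner P = G^{-1}

def objective(delta):
    # linearized loss plus the trust-region penalty in the G-metric
    return grad @ delta + delta @ G @ delta / (2 * eta)

delta_star = -eta * P @ grad               # claimed closed-form minimizer

# Strict convexity: the closed form should beat any random perturbation of itself.
trials = [objective(delta_star + 0.1 * rng.normal(size=p)) for _ in range(10_000)]
assert objective(delta_star) <= min(trials)
print(objective(delta_star), min(trials))
```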
Part 2: Continuous-Time Limit
Define the piecewise-linear interpolation \(\theta_{\eta}(t)\) with \(\theta_\eta(k\eta) = \theta_k\) and slope \(\frac{\theta_{k+1} - \theta_k}{\eta} = -P(\theta_k) \nabla L(\theta_k)\) on \([k\eta, (k+1)\eta)\). Under local Lipschitz assumptions on \(P\) and \(\nabla L\), the ODE method for deterministic numerical schemes implies \(\theta_\eta \to \theta\) uniformly on compact time intervals, where \(\theta\) solves \(\dot{\theta} = -P(\theta) \nabla L(\theta)\).
To identify this as Riemannian gradient flow, recall the Riemannian gradient \(\nabla^{G}L\) is the unique vector field satisfying \begin{equation} \langle \nabla^G L(\theta), v \rangle_{G(\theta)} = \mathrm{d}L(\theta)[v] = \langle\nabla L(\theta), v\rangle \quad \text{for all } v. \end{equation}
Thus \(G(\theta) \nabla^{G} L(\theta) = \nabla L(\theta)\), so \(\nabla^{G} L(\theta) = G(\theta)^{-1}\nabla L(\theta) = P(\theta) \nabla L(\theta)\). Therefore the ODE is \(\dot{\theta} = - \nabla^{G} L(\theta)\), which is steepest descent under \(G\).
Part 3: Conformal Equivalence
If \(P = cg^{-1}\) with scalar \(c(\theta) > 0\), then \(\dot{\theta} = -c(\theta)g^{-1}\nabla L(\theta)\). Let \(\tau\) satisfy \(\mathrm{d}\tau = c(\theta(t))\mathrm{d}t\). Then \(\frac{\mathrm{d}\theta}{\mathrm{d}\tau} = -g^{-1}\nabla L(\theta)\), so the curves (images in \(\Theta\)) match those of the \(g\)-gradient flow; only their speeds differ.
Examples
- SGD: \(P = I\) gives \(G = I\), so SGD performs Euclidean gradient flow.
- Adam: \(P = \mathrm{diag}(v_t)^{-1/2}\) (ignoring bias correction and the \(\epsilon\) term), where \(v_t\) tracks coordinate-wise second moments of the gradient, so Adam performs gradient flow under a diagonal adaptive metric.
- Natural Gradient: \(P = \mathcal{I}(\theta)^{-1}\) gives \(G = \mathcal{I}(\theta)\), the Fisher metric.
If two optimizers' metrics are approximately conformal, \(P_1(\theta) \approx c(\theta) P_2(\theta)\), their training trajectories will approximately coincide as curves in parameter space, differing only in how fast they're traversed.
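To see this on the level of flows, here's a toy 2-D check of my own (quadratic loss, a fixed diagonal preconditioner, and a made-up conformal factor \(c(\theta) = 1 + \norm{\theta}^2\)): Euler-discretizing \(\dot{\theta} = -P\nabla L\) and \(\dot{\theta} = -c(\theta)P\nabla L\) from the same initialization produces trajectories that trace out (numerically) the same curve at different speeds.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])      # quadratic loss L(theta) = 0.5 * theta^T A theta
P = np.array([[1.0, 0.0], [0.0, 0.25]])     # a fixed SPD preconditioner
grad = lambda th: A @ th
c = lambda th: 1.0 + th @ th                # a made-up positive scalar field (conformal factor)

def flow(scale, steps=200_000, dt=1e-4):
    """Euler discretization of d(theta)/dt = -scale(theta) * P * grad L(theta)."""
    th, path = np.array([2.0, -1.5]), []
    for _ in range(steps):
        path.append(th.copy())
        th = th - dt * scale(th) * (P @ grad(th))
    return np.array(path)

path_plain = flow(lambda th: 1.0)           # d(theta)/dt = -P grad L
path_scaled = flow(c)                       # d(theta)/dt = -c(theta) P grad L

# Every (subsampled) point of the rescaled trajectory lies close to the plain trajectory:
# the two flows trace the same curve, just traversed at different speeds.
query = path_scaled[::1000]
dists = np.sqrt(((query[:, None, :] - path_plain[None, ::10, :]) ** 2).sum(-1)).min(axis=1)
print(dists.max())                          # small compared to the path's overall scale (~2.5)
```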
Stochastic Dynamics and Geodesic Concentration
The second part asks: why would noisy optimization follow smooth geometric paths? Large deviations theory shows that stochastic processes concentrate on "most probable" paths, and for diffusions on Riemannian manifolds, these are geodesics.
Mini-Batch Gradient Noise is Fisher-Covariant
Gradient noise from mini-batches has a covariance structure that matches the Fisher information matrix. This follows from the structure of likelihood-based learning, not an assumption.
Proof
Write \(Z_i = (X_i, Y_i)\) for the i.i.d. examples in a mini-batch of size \(b\), \(\hat{g}_b(\theta) = -(1/b)\sum_{i=1}^b s(\theta; Z_i)\) for the mini-batch gradient of the negative log-likelihood loss, and \(\xi_b(\theta) = \hat{g}_b(\theta) - \nabla L(\theta)\) for the gradient noise. Since the \(Z_i\) are i.i.d., \begin{equation} \operatorname{Cov}\left(\hat{g}_b(\theta)\right) = \frac{1}{b^2}\sum_{i=1}^b \operatorname{Cov}\left(s(\theta; Z_i)\right) = \frac{1}{b} \operatorname{Cov}_{\mathsf{P}}\left(s(\theta; Z)\right). \end{equation}
Because \(\nabla L(\theta) = \mathbb{E}_{\mathsf{P}}[-s(\theta; Z)]\) is deterministic, we have \(\operatorname{Cov}(\xi_b) = \operatorname{Cov}(\hat{g}_b)\).
If the model is well-specified at \(\theta\), then the score has mean zero: \begin{equation} \mathbb{E}_{\mathsf{P}}\left[s(\theta; Z)\right] = \mathbb{E}_{X}\mathbb{E}_{Y \sim p_\theta(\cdot \mid X)}\left[\nabla_\theta \log p_\theta(Y \mid X)\right] = 0, \end{equation} where the inner expectation vanishes by the standard score identity. Hence the covariance equals the second moment: \begin{equation} \operatorname{Cov}_{\mathsf{P}}\left(s(\theta; Z)\right) = \mathbb{E}\left[s(\theta; Z)s(\theta; Z)^\top\right] = \mathcal{I}(\theta). \end{equation}
Therefore, \(\operatorname{Cov}(\xi_b(\theta)) = \mathcal{I}(\theta)/b\).
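Here's a small numerical check of this result under its own assumptions (a toy well-specified logistic model; the setup and names like `minibatch_grad` are mine): the covariance of mini-batch gradients, estimated across many fresh batches drawn from the model, matches \(\mathcal{I}(\theta^\ast)/b\).

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, b, n_batches = 3, 32, 100_000
theta_star = np.array([1.0, -0.5, 0.25])       # the model is well-specified at theta_star

def minibatch_grad(theta):
    # fresh mini-batch with labels drawn from the model itself (well-specified case)
    x = rng.normal(size=(b, d))
    y = (rng.random(b) < sigmoid(x @ theta)).astype(float)
    return ((sigmoid(x @ theta) - y)[:, None] * x).mean(axis=0)   # grad of the average NLL

grads = np.array([minibatch_grad(theta_star) for _ in range(n_batches)])
noise_cov = np.cov(grads, rowvar=False)

# Monte Carlo estimate of the Fisher I(theta*) = E_x[ sigma(1 - sigma) x x^T ]
x = rng.normal(size=(500_000, d))
w = sigmoid(x @ theta_star) * (1.0 - sigmoid(x @ theta_star))
fisher = (x * w[:, None]).T @ x / len(x)

print(np.max(np.abs(noise_cov - fisher / b)))  # small relative to the entries of I(theta*)/b
```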
Critical Assumption: This result requires the model to be well-specified. In practice, neural networks are almost always misspecified—they approximate complex real-world distributions but don't match them exactly. When misspecified, \(\mathbb{E}[s(\theta; Z)] \neq 0\), and the covariance is not exactly the Fisher information.
MAP Bridge Paths are Near-Geodesic
We model the continuous-time limit of stochastic gradient descent as the Stratonovich SDE4 \begin{equation} \mathrm{d}\theta_t = b(\theta_t)\mathrm{d}t + \sqrt{2\varepsilon} \circ \mathrm{d}B^g_t, \end{equation} where \(B^g_t\) is Brownian motion in the metric \(g\), \(b(\theta_t)\) is the drift (the deterministic gradient term), and \(\varepsilon = \eta/(2m)\) relates the learning rate \(\eta\) and mini-batch size \(m\) (written \(m\) here to avoid clashing with the drift \(b\)) to the diffusion strength.
Proof Sketch
The proof uses Freidlin-Wentzell large deviations theory5 for diffusions on manifolds. The Onsager-Machlup (OM) functional6 for paths is \begin{equation} \mathcal{I}_T[\theta] = \frac{1}{4}\int_0^T \|\dot{\theta}_t - b(\theta_t)\|^2_{g(\theta_t)} \mathrm{d}t. \end{equation}
For the driftless case, minimizers satisfy the Euler-Lagrange equation, which is precisely the geodesic equation \(\frac{\mathrm{D}}{\mathrm{d}t}\dot{\gamma}(t) = 0\).
For the case with drift, the proof uses \(\Gamma\)-convergence7 as \(T \to 0\). After the time rescaling \(s = t/T\) with \(\vartheta(s) = \theta(Ts)\), and up to an overall factor of \(T\) that does not affect minimizers, the functional becomes \begin{equation} \hat{\mathcal{I}}_T^b[\vartheta] = \frac{1}{4} \int_0^1 \norm{\vartheta^\prime}_g^2 \mathrm{d}s - \frac{T}{2} \int_0^1 \langle \vartheta^\prime, b(\vartheta) \rangle_g \mathrm{d}s + \frac{T^2}{4} \int_0^1 \norm{b(\vartheta)}^2_g \mathrm ds. \end{equation}
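Spelling out the substitution (a step the transcription skips): since \(\dot{\theta}(t) = \vartheta'(s)/T\) and \(\mathrm{d}t = T\,\mathrm{d}s\), \begin{equation} \mathcal{I}_T[\theta] = \frac{1}{4}\int_0^1 \left\|\tfrac{1}{T}\vartheta'(s) - b(\vartheta(s))\right\|^2_{g(\vartheta(s))} T\,\mathrm{d}s = \frac{1}{4T}\int_0^1 \norm{\vartheta'}^2_g\,\mathrm{d}s - \frac{1}{2}\int_0^1 \langle \vartheta', b(\vartheta)\rangle_g\,\mathrm{d}s + \frac{T}{4}\int_0^1 \norm{b(\vartheta)}^2_g\,\mathrm{d}s, \end{equation} and multiplying through by \(T\) gives exactly \(\hat{\mathcal{I}}_T^b\).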
As \(T \to 0\), this \(\Gamma\)-converges to the pure geodesic energy \(\frac{1}{4}\int_0^1 \norm{\vartheta^\prime(s)}^2_{g(\vartheta(s))} \mathrm ds\), showing that the geodesic term dominates in the short-time limit.
The \(O(T)\) bound follows from second-variation analysis: in a normal convex neighborhood, the index form is strictly positive, giving a quadratic lower bound on the energy difference from the geodesic. The drift terms contribute only linear corrections, leading to the \(O(T)\) path closeness.
Critical Assumption: The near-geodesic bound above only holds as \(T \to 0\). The \(O(T)\) bound means the deviation grows linearly with time. For long training runs (many epochs), this bound doesn't control the deviation: the paths can drift arbitrarily far from geodesics.
Metric Alignment in Overparameterized Networks
The third part explains why different optimizers produce similar paths: their metrics conformally align with the Fisher metric at large width. In overparameterized networks, the Fisher information matrix becomes approximately isotropic, and many optimizers' metrics also become isotropic. Conformally equivalent metrics share the same geodesics.
Concretely, compare the metrics of two common optimizers:
- SGD with \(G_{\mathrm{opt}} = I\)
- Adam with \(G_{\mathrm{opt}} = \mathrm{diag}(v_t)^{1/2}\), where \(v_t\) tracks coordinate-wise gradient second moments
Proof Sketch
At random initialization with NTK scaling8, the Fisher matrix is approximately isotropic: \(\mathcal{I}(\theta_0) = \alpha I_p + R\) where \(\norm{R}_{\mathrm{op}} = O_{\mathbb{P}}(p^{-1/2})\). This follows from coordinate independence of gradients at initialization (tensor program limit9) and matrix concentration for sums of rank-one outer products.
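As a cartoon of the matrix-concentration ingredient only (not of the NTK claim itself), here's a sketch under the strong idealization that per-example scores are i.i.d. isotropic Gaussians with covariance \(\alpha I_p\); averaging on the order of \(p^2\) rank-one outer products then leaves an operator-norm residual of order \(p^{-1/2}\).

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 2.0

for p in (16, 32, 64, 128):
    n = p * p                                  # number of rank-one terms being averaged
    s = rng.normal(scale=np.sqrt(alpha), size=(n, p))   # idealized i.i.d. scores, Cov = alpha * I_p
    fisher_hat = s.T @ s / n                   # empirical average of outer products s_i s_i^T
    resid = np.linalg.norm(fisher_hat - alpha * np.eye(p), ord=2)   # operator-norm residual
    print(p, resid, resid * np.sqrt(p))        # residual shrinks ~ p^{-1/2}; last column ~ constant
```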
In the NTK regime, the Jacobian (hence Fisher) remains frozen near its initialization value. For SGD, \(G = I\) is already isotropic. For Adam, coordinate-wise gradient variances \(v_i\) concentrate around their common value \(\alpha\) with uniform deviation \(O_{\mathbb{P}}(p^{-1/2})\), giving \(\mathrm{diag}(v) \approx \alpha I + O(p^{-1/2})\).
By conformal equivalence, since \(G_{\mathrm{SGD}}\), \(G_{\mathrm{Adam}}\), and \(\mathcal{I}\) are all conformally aligned to within \(O(p^{-1/2})\), their geodesics converge as \(p \to \infty\).
Critical Assumption: This proof only establishes alignment in the NTK regime where training is approximately linear and the Fisher metric remains constant. In the feature-learning regime where representations evolve, the metric changes significantly, and alignment may not hold.
Why the Theory Fails Empirically
Empirically, training paths are not near-geodesic in Fisher information space. The theory fails because of assumptions that don't hold in practice.
The Short-Time Assumption
The MAP-bridge result above only proves an \(O(T)\) deviation as \(T \to 0\). For neural network training, we have:
- Training happens over long time horizons (many epochs, potentially thousands of steps)
- The \(O(T)\) bound means errors accumulate linearly with time
- Over a full training run, paths can drift arbitrarily far from geodesics
The \(\Gamma\)-convergence argument only controls behavior in the short-time limit. There's no mechanism to prevent long-term drift away from geodesic paths. In fact, the drift term \(b(\theta_t)\) in the SDE can systematically push paths away from geodesics over time.
The Well-Specified Model Assumption
The Fisher-covariant noise result requires the model to be well-specified: \(Y \vert X \sim p_\theta(\cdot \vert X)\) under the data distribution. In practice:
- Neural networks are almost always misspecified—they approximate but don't match the true data distribution
- When misspecified, \(\mathbb{E}[s(\theta; Z)] \neq 0\) (the score has non-zero mean)
- The gradient-noise covariance is then \(\operatorname{Cov}(\xi_b) = \frac{1}{b}\operatorname{Cov}_{\mathsf{P}}(s) = \frac{1}{b}\left(\mathbb{E}_{\mathsf{P}}[s s^\top] - \mathbb{E}_{\mathsf{P}}[s]\mathbb{E}_{\mathsf{P}}[s]^\top\right)\), with expectations under the data distribution rather than the model, so it no longer reduces to \(\frac{1}{b}\mathcal{I}(\theta)\)
- The nonzero mean score and the mismatch between data and model expectations break the Fisher-covariant structure
This misspecification error accumulates over training, causing paths to deviate from the Fisher geometry.
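Continuing the toy logistic example from the noise-covariance check earlier (again my own illustration, not from the original proofs), here's how misspecification shows up numerically: when labels come from a rule the model can't represent, the mean score no longer vanishes and the mini-batch gradient covariance visibly departs from \(\mathcal{I}(\theta)/b\).

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, b = 3, 32
theta = np.array([1.0, -0.5, 0.25])            # an arbitrary parameter value

def sample_batch(n):
    # Misspecified data: labels follow a rule the logistic model cannot represent.
    x = rng.normal(size=(n, d))
    y = (rng.random(n) < 0.5 + 0.4 * np.tanh(x[:, 0] * x[:, 1])).astype(float)
    return x, y

# Mean of the per-example scores under the data distribution: no longer ~ 0.
x, y = sample_batch(500_000)
scores = (y - sigmoid(x @ theta))[:, None] * x
print("mean score:", scores.mean(axis=0))

# Mini-batch gradient-noise covariance vs. the model Fisher divided by the batch size.
grads = np.array([((sigmoid(xb @ theta) - yb)[:, None] * xb).mean(axis=0)
                  for xb, yb in (sample_batch(b) for _ in range(100_000))])
w = sigmoid(x @ theta) * (1.0 - sigmoid(x @ theta))
fisher = (x * w[:, None]).T @ x / len(x)       # I(theta) = E_x[ sigma(1 - sigma) x x^T ]
print(np.max(np.abs(np.cov(grads, rowvar=False) - fisher / b)))
# deviation is now comparable to the entries of I(theta)/b themselves
```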
The NTK Regime Assumption
The metric alignment result only holds in the NTK/lazy training regime. However:
- Modern neural networks operate in the feature-learning regime where representations evolve significantly
- The Fisher metric changes substantially during training, not just near initialization
- Metric alignment at initialization doesn't guarantee alignment throughout training
Even if metrics align initially, they can diverge as the network learns features and the Fisher information evolves.
Conclusion
This whole project is more or less a lesson in empirically verifying predictions before diving into developing the theory, but in any case, it was still an interesting idea.
References
- Information Geometry and Its Applications - Amari (2016)
- Neural Learning in Structured Parameter Spaces - Amari (1996)
- Riemannian Geometry - do Carmo (1992)
- Large Deviations Techniques and Applications - Dembo & Zeitouni (2010)
- Random Perturbations of Dynamical Systems - Freidlin & Wentzell (2012)
- Brownian bridges on Riemannian manifolds - Hsu (1990)
- Neural Tangent Kernel - Jacot, Gabriel & Hongler (2020)
- On Lazy Training in Differentiable Programming - Chizat, Oyallon & Bach (2020)
- Wide Neural Networks of Any Depth Evolve as Linear Models - Lee et al. (2020)
- Tensor Programs V - Yang et al. (2022)
- Universal Statistics of Fisher Information in Deep Neural Networks - Karakida, Akaho & Amari (2019)
- Stochastic Gradient Descent as Approximate Bayesian Inference - Mandt, Hoffman & Blei (2017)
- Stochastic Modified Equations - Li, Tai & E (2017)
- Adam: A Method for Stochastic Optimization - Kingma & Ba (2017)
- The Fisher information matrix measures the amount of information that an observable random variable carries about an unknown parameter. In machine learning, it quantifies how sensitive the model's predictions are to parameter changes. ↩
- The covariant derivative generalizes the notion of directional derivative to curved spaces. The Levi-Civita connection is the unique torsion-free, metric-compatible connection on a Riemannian manifold. ↩
- Large deviations theory provides asymptotic estimates for probabilities of rare events. It's particularly useful for understanding the concentration behavior of stochastic processes. ↩
- The Stratonovich interpretation of stochastic differential equations is often preferred in physics and geometry because it obeys the chain rule, making it coordinate-independent. It's denoted by the \(\circ\) symbol. ↩
- Freidlin-Wentzell theory extends large deviations principles to stochastic differential equations, characterizing the most probable paths of diffusions. ↩
- The Onsager-Machlup functional gives the probability density of paths for a diffusion process. Minimizing it yields the most probable path (MAP path). ↩
- \(\Gamma\)-convergence is a notion of convergence for functionals that ensures minimizers converge. It's particularly useful for studying asymptotic behavior as parameters go to zero. ↩
- The Neural Tangent Kernel (NTK) regime, also called "lazy training," occurs when networks are so wide that they remain close to initialization during training, making the dynamics approximately linear. ↩
- Tensor programs provide a mathematical framework for analyzing infinite-width neural networks, showing that gradients and activations converge to Gaussian processes in the limit. ↩