Gain, Geometry, and Rich vs Lazy Dynamics
(Math Notes for Phase 2 / Phase 3)
1. What “geometry” means in our theory
In this project, geometry does not mean Euclidean distance in input space.
It means the relative structure of sensitivity induced by the network.
Formally, geometry lives in:
- ratios of singular values of the Jacobian
- directional imbalance (anisotropy)
- spatial variation of sensitivity over the data manifold
Geometry is about shape, not magnitude.
2. Jacobian as the local structure of the network
Let the network representation at depth $\ell$ be $h_\ell(x)$.
The input–representation Jacobian is
$$J_\ell(x) \;=\; \frac{\partial h_\ell(x)}{\partial x}.$$
For a small perturbation $\delta x$,
$$h_\ell(x + \delta x) \;\approx\; h_\ell(x) + J_\ell(x)\,\delta x.$$
The Jacobian fully characterizes the network’s local behavior.
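As a concrete (purely illustrative) sketch of this local linearization, here is a minimal NumPy example for a toy two-layer tanh network; the architecture, sizes, and weights are assumptions for illustration, not the project's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer tanh network: h(x) = tanh(W2 @ tanh(W1 @ x))
d_in, d_hid, d_out = 4, 8, 6
W1 = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)
W2 = rng.standard_normal((d_out, d_hid)) / np.sqrt(d_hid)

def representation(x):
    return np.tanh(W2 @ np.tanh(W1 @ x))

def jacobian(x):
    """Analytic input-representation Jacobian J(x) = dh/dx via the chain rule."""
    a1 = np.tanh(W1 @ x)            # first-layer activation
    a2 = np.tanh(W2 @ a1)           # second-layer activation
    D1 = np.diag(1.0 - a1**2)       # tanh derivative at layer 1
    D2 = np.diag(1.0 - a2**2)       # tanh derivative at layer 2
    return D2 @ W2 @ D1 @ W1        # shape (d_out, d_in)

x = rng.standard_normal(d_in)
J = jacobian(x)

# First-order check: h(x + dx) ~ h(x) + J(x) dx
dx = 1e-5 * rng.standard_normal(d_in)
lhs = representation(x + dx) - representation(x)
rhs = J @ dx
print(np.max(np.abs(lhs - rhs)))    # tiny residual (~1e-10), confirming the local linearization
```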
3. Singular values = directional stretching
Take the singular value decomposition:
$$J_\ell(x) \;=\; U\,\Sigma\,V^\top, \qquad \Sigma = \mathrm{diag}(\sigma_1 \ge \sigma_2 \ge \dots).$$
- Each right singular vector $v_i$ is an orthogonal input direction
- Each singular value $\sigma_i$ is how much that direction is stretched
This defines the local geometry induced by the network.
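A short sketch of the same idea, with a stand-in random matrix playing the role of $J_\ell(x)$ (purely illustrative), checking that each right singular vector $v_i$ is stretched by exactly $\sigma_i$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a local Jacobian J(x); in practice this comes from the network.
J = rng.standard_normal((6, 4))

U, S, Vt = np.linalg.svd(J, full_matrices=False)

for i, (sigma, v) in enumerate(zip(S, Vt)):
    stretch = np.linalg.norm(J @ v)          # how much direction v_i is stretched
    print(f"direction {i}: sigma={sigma:.3f}, ||J v||={stretch:.3f}")

# The right singular vectors form an orthonormal basis of input directions.
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))
```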
4. Isotropy vs anisotropy
Isotropic Jacobian
- All directions treated equally
- Local geometry is spherical
- No preferred features
- Kernel-like / lazy behavior
Anisotropic Jacobian
- Certain directions dominate
- Geometry is elongated / folded
- Selective, data-aligned feature learning
- Rich behavior
5. Anisotropy scalar
We summarize spectral shape (not scale) using an anisotropy scalar; one natural choice, consistent with the singular-value ratios above, is
$$a(x) \;=\; \frac{\sigma_{\max}\big(J_\ell(x)\big)}{\tfrac{1}{k}\sum_{i=1}^{k}\sigma_i\big(J_\ell(x)\big)}.$$
Dataset-averaged anisotropy:
$$\bar a \;=\; \mathbb{E}_{x\sim\mathcal{D}}\big[a(x)\big].$$
Interpretation:
- $a(x) \approx 1$: no dominant directions (flat geometry)
- $a(x) \gg 1$: strong directional dominance (structured geometry)
Anisotropy measures geometric engagement, not task correctness.
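A minimal sketch of the dataset-averaged anisotropy, assuming the ratio-based definition above ($\sigma_{\max}$ over the mean singular value); the toy one-layer network, the dataset, and the exact definition used in the project code are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

d_in, d_hid = 4, 16
W1 = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)

def jacobian(x):
    """Jacobian of a one-layer tanh representation h(x) = tanh(W1 @ x)."""
    a = np.tanh(W1 @ x)
    return np.diag(1.0 - a**2) @ W1

def anisotropy(J):
    """Assumed definition: largest singular value over the mean singular value.
    Equals 1 for a perfectly flat spectrum, grows with directional dominance."""
    s = np.linalg.svd(J, compute_uv=False)
    return s.max() / s.mean()

X = rng.standard_normal((256, d_in))                  # toy dataset
a_bar = np.mean([anisotropy(jacobian(x)) for x in X])
print(f"dataset-averaged anisotropy: {a_bar:.3f}")
```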
6. FTLE: accumulated sensitivity
Finite-Time Lyapunov Exponent (FTLE):
$$\lambda_L(x) \;=\; \frac{1}{L}\,\log\,\sigma_{\max}\big(J_L(x)\big).$$
FTLE measures accumulated stretching across depth or iterations.
- Flat FTLE field → geometry-free dynamics
- Structured FTLE field → ridges, valleys, spatial variation
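A sketch of how an FTLE field could be computed for a toy depth-$L$ tanh network, assuming the $\frac{1}{L}\log\sigma_{\max}$ definition above; depth, width, and initialization here are illustrative, not the project's settings:

```python
import numpy as np

rng = np.random.default_rng(3)

d, L = 8, 12                                           # toy width and depth
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]

def ftle(x, gain=1.0):
    """FTLE proxy: (1/L) * log sigma_max(J_L(x)) for a depth-L tanh network.
    The end-to-end Jacobian is accumulated layer by layer via the chain rule."""
    J = np.eye(d)
    h = x
    for W in Ws:
        h = np.tanh(gain * (W @ h))
        J = np.diag(1.0 - h**2) @ (gain * W) @ J       # chain rule across one layer
    sigma_max = np.linalg.svd(J, compute_uv=False)[0]
    return np.log(sigma_max) / L

X = rng.standard_normal((200, d))
field = np.array([ftle(x) for x in X])
print("FTLE field: mean %.3f, std %.3f" % (field.mean(), field.std()))
```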
7. $\mathrm{Var}_x[\lambda_L(x)]$: geometry visibility, not geometry itself
We quantify FTLE-field heterogeneity by its variance over the dataset:
$$\mathrm{Var}_{x\sim\mathcal{D}}\big[\lambda_L(x)\big].$$
Key distinction:
- Anisotropy → local, directional structure
- $\mathrm{Var}_x[\lambda_L(x)]$ → global, spatial variation of accumulated sensitivity
They are related but not equivalent: strong local anisotropy typically produces a structured FTLE field,
but not vice versa.
8. What the gain parameter is
Each layer’s weights are scaled by a gain $g$:
$$W_\ell \;\to\; g\,W_\ell.$$
Thus each layer's Jacobian scales (to leading order) as:
$$J_\ell(x) \;\to\; g\,J_\ell(x).$$
Across depth $L$, the end-to-end Jacobian therefore scales roughly as $g^L\,J_L(x)$, so every singular value is multiplied by the same factor.
Gain rescales magnitude, not geometry.
9. What gain does not change
Because gain multiplies all singular values equally:
- anisotropy ratios are unchanged
- preferred directions are unchanged
- geometry is unchanged
Gain cannot create anisotropy.
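A minimal numerical check of these invariances on a stand-in Jacobian (pure linear algebra; in a nonlinear network the per-layer rescaling is only approximate, as noted above):

```python
import numpy as np

rng = np.random.default_rng(4)

J = rng.standard_normal((10, 10))                      # stand-in local Jacobian
g = 3.0                                                # gain factor

s = np.linalg.svd(J, compute_uv=False)
s_gain = np.linalg.svd(g * J, compute_uv=False)

print(np.allclose(s_gain, g * s))                      # every sigma_i is multiplied by g
print(np.allclose(s_gain / s_gain.mean(), s / s.mean()))  # spectral shape (ratios) unchanged

# Preferred directions are unchanged too: right singular vectors agree up to sign.
V = np.linalg.svd(J)[2]
V_gain = np.linalg.svd(g * J)[2]
print(np.allclose(np.abs(V_gain @ V.T), np.eye(10), atol=1e-8))
```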
10. Visual intuition (precise)
Lazy regime (isotropic)
Before gain:
- FTLE field ≈ flat plateau
After gain:
- FTLE field ≈ taller flat plateau
Shape unchanged.
Rich regime (anisotropic)
Before gain:
- FTLE field has ridges and valleys
After gain:
- ridges get taller
- valleys get deeper
- locations unchanged
Shape preserved, contrast amplified.
11. Why gain does NOT define Rich vs Lazy
Rich vs Lazy is determined by scaling laws:
- parameterization
- width
- depth
Gain only controls where we sit inside a regime:
- contractive
- critical
- expansive
Gain reveals geometry; it does not decide its existence.
12. Important caveat: learning can be real even if anisotropy stays low
A network may be learning faithfully while remaining isotropic if:
- the data distribution is intrinsically isotropic
- signal is high-rank and evenly distributed
- task is already linear
In such cases, a flat, isotropic geometry is correct behavior, not laziness.
13. Final conceptual hierarchy
- Anisotropy: does geometry exist?
- FTLE structure: where does geometry appear?
- Gain: how visible is the geometry?
14. One-line summary
Gain lifts or lowers the sensitivity landscape; anisotropy shapes it.
This separation is the backbone of the theory.
(Addendum) 15. $\mathrm{Var}_x[G(x)]$: variance of stretch (not log-stretch)
In the code, besides the FTLE-field variance $\mathrm{Var}_x[\lambda_L(x)]$, we also track a second heterogeneity measure: the variance of the exponentiated stretch, $\mathrm{Var}_x[G(x)]$, defined below.
15.1 From FTLE to a Jacobian-norm proxy
Recall FTLE (finite-time log-growth rate):
$$\lambda_L(x) \;=\; \frac{1}{L}\,\log\,\sigma_{\max}\big(J_L(x)\big).$$
In your depth-wise setting, a common approximation (and the one implicitly used in the code) is
$$\lambda_L(x) \;\approx\; \frac{1}{L}\,\log\big\|J_L(x)\big\|,$$
so that
$$\big\|J_L(x)\big\| \;\approx\; e^{\,L\,\lambda_L(x)}.$$
In the script you do exactly this conversion (sketched below):
- compute $\lambda_L(x)$,
- exponentiate to get a proxy for $\|J_L(x)\|$,
- then take the variance over the dataset.
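A sketch of that conversion pipeline on synthetic FTLE values (the variable names, depth, and synthetic field are assumptions for illustration, not the project script itself):

```python
import numpy as np

rng = np.random.default_rng(5)
L = 12                                           # network depth

# Stand-in FTLE field lambda_L(x) over a dataset; in practice this comes from
# the Jacobian computation, here it is synthetic for illustration.
ftle = rng.normal(loc=0.05, scale=0.02, size=1000)

var_ftle = ftle.var()                            # Var_x[lambda_L(x)]  (log domain)
G = np.exp(L * ftle)                             # Jacobian-norm proxy G(x) = e^{L * lambda}
var_G = G.var()                                  # Var_x[G(x)]         (linear domain)

print(f"Var[FTLE] = {var_ftle:.5f}")
print(f"Var[G]    = {var_G:.5f}")
```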
15.2 Definition of $G(x)$
Define the Jacobian-norm proxy:
$$G(x) \;:=\; e^{\,L\,\lambda_L(x)} \;\approx\; \big\|J_L(x)\big\|.$$
Then the two dataset-level quantities we compare are $\mathrm{Var}_x[\lambda_L(x)]$ and $\mathrm{Var}_x[G(x)]$.
So:
- $\mathrm{Var}_x[\lambda_L(x)]$ = variance of the log-stretch rate across the dataset
- $\mathrm{Var}_x[G(x)]$ = variance of the stretch magnitude across the dataset
15.3 Why $\mathrm{Var}_x[G(x)]$ is different from $\mathrm{Var}_x[\lambda_L(x)]$
Because exponentiation amplifies tails:
- small differences in $\lambda_L(x)$ can create huge differences in $G(x) = e^{L\,\lambda_L(x)}$,
- especially when $L$ is large.
Heuristically, if $\lambda_L(x)$ has spread $\delta$ across the dataset, then $G(x)$ has a multiplicative spread of roughly $e^{L\delta}$.
So:
- $\mathrm{Var}_x[\lambda_L(x)]$ is the stable geometry-heterogeneity indicator (log domain)
- $\mathrm{Var}_x[G(x)]$ is the sensitive tail indicator (linear domain)
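As a concrete illustration of the heuristic above: a log-domain spread of $\delta = 0.05$ corresponds to a multiplicative linear-domain spread of $e^{L\delta} = e^{2} \approx 7.4$ at depth $L = 40$, and $e^{5} \approx 148$ at $L = 100$; the same tiny FTLE heterogeneity looks enormous once exponentiated.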
15.4 Interpretation in our theory
- In a lazy/isotropic regime:
  - $\lambda_L(x)$ is nearly constant
  - $G(x)$ is nearly constant
- In a rich/structured regime:
  - $\lambda_L(x)$ varies across space
  - tails get amplified
  - $\mathrm{Var}_x[G(x)]$ can become very large (especially at large $L$)
15.5 Relationship to gain
Since gain rescales Jacobians roughly as $\|J_L(x)\| \to g^L\,\|J_L(x)\|$ (equivalently, $\lambda_L(x) \to \lambda_L(x) + \log g$):
- This tends to shift $G(x)$ multiplicatively:
$$G(x) \;\to\; g^{L}\,G(x),$$
so
$$\mathrm{Var}_x\big[G(x)\big] \;\to\; g^{2L}\,\mathrm{Var}_x\big[G(x)\big],$$
while $\mathrm{Var}_x[\lambda_L(x)]$ is unchanged by the additive shift.
- But gain still does not create geometry: if the FTLE field is flat, both $\mathrm{Var}_x[\lambda_L(x)]$ and $\mathrm{Var}_x[G(x)]$ remain near zero (just rescaled).
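A minimal numerical check of this rescaling under the idealized additive-shift model above (synthetic FTLE values; $g$, $L$, and the field are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
L, g = 12, 1.5

ftle = rng.normal(0.05, 0.02, size=1000)         # stand-in FTLE field
ftle_g = ftle + np.log(g)                        # gain shifts the log domain additively

G, G_g = np.exp(L * ftle), np.exp(L * ftle_g)

print(np.isclose(ftle_g.var(), ftle.var()))              # Var[FTLE] unchanged by gain
print(np.isclose(G_g.var(), g**(2 * L) * G.var()))        # Var[G] rescaled by g^(2L)
```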
15.6 Quick summary
- $\mathrm{Var}_x[\lambda_L(x)]$: heterogeneity of log-stretch (geometry visibility, stable)
- $\mathrm{Var}_x[G(x)]$: heterogeneity of stretch (tail-sensitive, can explode with depth/gain)
The correct identity is:
$$G(x) \;=\; e^{\,L\,\lambda_L(x)}, \qquad \mathrm{Var}_x\big[G(x)\big] \;=\; \mathrm{Var}_x\big[e^{\,L\,\lambda_L(x)}\big].$$
So what’s happening is:
- FTLE compresses growth into a log-average
- $G(x)$ undoes that compression by exponentiating
Conceptually: FTLE lives in the log domain, while $G(x)$ (and its variance) lives in the linear domain.