Part I — Definitions & invariances (locked-in foundations)

I.1 FTLE field and heterogeneity metric

We define the FTLE (finite-time Lyapunov exponent) field, a scalar over input space:

$$x \;\mapsto\; \lambda_T(x) = \frac{1}{T}\,\log \sigma_{\max}\!\big(J_T(x)\big),$$

where $J_T(x)$ is the Jacobian of the time-$T$ map at $x$.

We measure spatial heterogeneity of this field using:

$$G_\lambda := \mathrm{Var}_{x\sim\mu}\big[\lambda_T(x)\big].$$

Here $\mu$ is the evaluation measure; it must be stated explicitly in each experiment.

This choice matters because global variance can be dominated by regions with larger area weight (e.g., far outside the boundary).

Two technical clarifications that matter later:

  1. What is $x \sim \mathcal{D}$?
    In practice, you compute $\lambda_T$ on a grid in $\mathbb{R}^2$, not on i.i.d. samples from the training distribution $\mathcal{D}$. That is fine, but you must be explicit: the grid's extent and resolution define the evaluation measure $\mu$.

  2. Sign of the log: $\log_{10} G_\lambda$ can be negative, because $0 < G_\lambda < 1$ is common.
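The grid-based recipe above can be sketched as follows. The toy map `flow_T`, the grid extent, and the finite-difference step are illustrative assumptions, not the experiments' actual setup:

```python
import numpy as np

T = 1.0

def flow_T(x):
    # Hypothetical nonlinear map R^2 -> R^2 standing in for the trained model.
    return np.array([np.tanh(2.0 * x[0] + x[1]), np.sin(x[0] - x[1])])

def ftle(x, eps=1e-5):
    # Central finite differences give the Jacobian column by column,
    # then lambda_T(x) = (1/T) log sigma_max(J).
    J = np.column_stack([
        (flow_T(x + eps * e) - flow_T(x - eps * e)) / (2 * eps)
        for e in np.eye(2)
    ])
    return np.log(np.linalg.svd(J, compute_uv=False)[0]) / T

# A uniform grid over [-2, 2]^2: this grid *is* the evaluation measure mu.
xs = np.linspace(-2.0, 2.0, 50)
grid = np.array([ftle(np.array([a, b])) for a in xs for b in xs])
G_lambda = grid.var()  # G_lambda := Var_{x~mu}[lambda_T(x)]
print(G_lambda)
```

Note that `G_lambda` here is the variance under the uniform grid measure; reweighting or restricting the grid changes $\mu$ and hence $G_\lambda$.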

I.2 Kernel / representation rotation metrics (KA / RA)

Let K be the empirical NTK matrix and Z be the representation Gram matrix. Define Frobenius cosine similarities:

$$\mathrm{KA} = \frac{\langle K_{\mathrm{init}},\, K_{\mathrm{final}}\rangle_F}{\|K_{\mathrm{init}}\|_F\,\|K_{\mathrm{final}}\|_F}, \qquad \mathrm{RA} = \frac{\langle Z_{\mathrm{init}},\, Z_{\mathrm{final}}\rangle_F}{\|Z_{\mathrm{init}}\|_F\,\|Z_{\mathrm{final}}\|_F}.$$

Interpretation: KA/RA quantify how much the kernel/representation geometry rotated during training. Values near 1 mean the final matrix is nearly a scaled copy of the initial one (little rotation); smaller values mean substantial rotation.

Range: for PSD Gram/kernel matrices, $\mathrm{KA}, \mathrm{RA} \in [0, 1]$.
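A minimal sketch of the KA/RA computation. The feature matrices `Phi0`/`Phi1` are random placeholders for actual initial/final representations; only `frob_cos` mirrors the definition above:

```python
import numpy as np

def frob_cos(A, B):
    # Frobenius cosine similarity: <A, B>_F / (||A||_F ||B||_F).
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

rng = np.random.default_rng(0)
# Hypothetical representations (n samples x d features); Z = Phi Phi^T is PSD.
Phi0 = rng.normal(size=(32, 8))
Phi1 = rng.normal(size=(32, 8))
Z_init, Z_final = Phi0 @ Phi0.T, Phi1 @ Phi1.T

RA = frob_cos(Z_init, Z_final)  # KA is computed identically, with NTK matrices in place of Z
print(RA)
```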

I.3 Scale invariance (magnitude vs geometry)

KA/RA are cosine similarities, hence invariant to global scaling:

$$\frac{\langle aA,\, bB\rangle_F}{\|aA\|_F\,\|bB\|_F} = \frac{\langle A, B\rangle_F}{\|A\|_F\,\|B\|_F} \quad \text{for } a, b > 0.$$

So KA and RA compare geometry only: any purely multiplicative change of the kernel or representation (a magnitude change) is invisible to them.
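This invariance is easy to verify numerically; the PSD matrices and scale factors here are arbitrary:

```python
import numpy as np

def frob_cos(A, B):
    # Frobenius cosine similarity: <A, B>_F / (||A||_F ||B||_F).
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

rng = np.random.default_rng(1)
A = rng.normal(size=(16, 16)); A = A @ A.T  # PSD
B = rng.normal(size=(16, 16)); B = B @ B.T  # PSD

# Global rescalings a*A, b*B with a, b > 0 leave the cosine unchanged.
base = frob_cos(A, B)
scaled = frob_cos(3.7 * A, 0.01 * B)
print(abs(base - scaled))
```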

I.4 Gain: post-hoc scaling vs training dynamics

Key distinction:

  1. Post-hoc scaling (rescale a trained network without retraining):
    If scaling multiplies the relevant Jacobian by a constant factor α uniformly in x, then:
$$\lambda_T(x) \;\mapsto\; \lambda_T(x) + \frac{1}{T}\log\alpha,$$

a shift that is constant in $x$, hence:

    $\mathrm{Var}_x[\lambda_T(x)]$ is unchanged $\;\Rightarrow\;$ $G_\lambda$ is unchanged.
  2. Gain as a training hyperparameter:
    Gain can change optimization trajectory (effective gradient scale, feature evolution), so Kfinal and Zfinal can rotate differently. Therefore gain can affect KA/RA and also affect Gλ indirectly through training-changed geometry.

The post-hoc result requires the scaling to produce a uniform multiplicative factor in the Jacobian across $x$. For a plain feedforward composition with per-layer scalar scaling, the Jacobian scales by the product of those scalars, so the factor is indeed uniform. But if the “gain” interacts with normalization layers, activation saturation, or any $x$-dependent rescaling, the shift need not be exactly constant.
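A numeric sanity check of the post-hoc claim, using synthetic stand-ins for $\sigma_{\max}(J(x))$ rather than a trained network (`sigma_max` below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
T, alpha = 1.0, 5.0

# Stand-in samples of sigma_max(J(x)) over some evaluation measure mu.
sigma_max = np.exp(rng.normal(size=1000))

lam = np.log(sigma_max) / T                 # lambda_T(x)
lam_scaled = np.log(alpha * sigma_max) / T  # after uniform Jacobian scaling by alpha

# The shift is the constant (1/T) log(alpha), so the variance is unchanged.
print(np.allclose(lam_scaled - lam, np.log(alpha) / T))
print(np.isclose(lam.var(), lam_scaled.var()))
```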

I.5 What is “locked in” after Part I

  - The FTLE field $x \mapsto \lambda_T(x)$ and its heterogeneity metric $G_\lambda := \mathrm{Var}_{x\sim\mu}[\lambda_T(x)]$, with the evaluation measure $\mu$ stated explicitly per experiment.
  - KA and RA as Frobenius cosine similarities between initial and final NTK / representation Gram matrices, with range $[0,1]$ for PSD matrices.
  - Scale invariance: KA/RA ignore global magnitude; post-hoc uniform gain shifts $\lambda_T$ by a constant and leaves $G_\lambda$ unchanged.
  - Gain used as a training hyperparameter can still move KA/RA and $G_\lambda$ through changed training dynamics.

Next reading: Part II — Dataset audits (facts first, interpretation second)