Part I — Definitions & invariances (locked-in foundations)

I.1 FTLE field and heterogeneity metric

We define the FTLE (finite-time Lyapunov exponent) field, a scalar over input space:

$$x \;\mapsto\; \lambda_T(x) = \frac{1}{T}\,\log \sigma_{\max}\!\big(J_T(x)\big),$$

where $J_T(x)$ is the Jacobian of the time-$T$ map at $x$.

We measure spatial heterogeneity of this field using:

$$G_\lambda := \mathrm{Var}_{x\sim\mu}\big[\lambda_T(x)\big].$$

Here $\mu$ is the evaluation measure; it must be stated explicitly in each experiment.

This choice matters because global variance can be dominated by regions with larger area weight (e.g., far outside the boundary).

Two technical clarifications that matter later:

  1. What is $x \sim \mathcal{D}$?
    In practice, you compute $\lambda_T$ on a grid in $\mathbb{R}^2$, not on i.i.d. samples from the training distribution $\mathcal{D}$. That is fine, but you must be explicit: the grid's extent and resolution define the evaluation measure $\mu$.

  2. Sign of the log: $\log_{10} G_\lambda$ can be negative, because $0 < G_\lambda < 1$ is common.
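The grid-based recipe above can be sketched as follows. The toy map `flow_T`, the grid extent, and the finite-difference step are illustrative assumptions, not the experiments' actual setup:

```python
import numpy as np

T = 1.0

def flow_T(x):
    # Hypothetical nonlinear map R^2 -> R^2 standing in for the trained model.
    return np.array([np.tanh(2.0 * x[0] + x[1]), np.sin(x[0] - x[1])])

def ftle(x, eps=1e-5):
    # Central finite differences give the Jacobian column by column,
    # then lambda_T(x) = (1/T) log sigma_max(J).
    J = np.column_stack([
        (flow_T(x + eps * e) - flow_T(x - eps * e)) / (2 * eps)
        for e in np.eye(2)
    ])
    return np.log(np.linalg.svd(J, compute_uv=False)[0]) / T

# A uniform grid over [-2, 2]^2: this grid *is* the evaluation measure mu.
xs = np.linspace(-2.0, 2.0, 50)
grid = np.array([ftle(np.array([a, b])) for a in xs for b in xs])
G_lambda = grid.var()  # G_lambda := Var_{x~mu}[lambda_T(x)]
print(G_lambda)
```

Note that `G_lambda` here is the variance under the uniform grid measure; reweighting or restricting the grid changes $\mu$ and hence $G_\lambda$.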

I.2 Kernel / representation rotation metrics (KA / RA)

Let K be the empirical NTK matrix and Z be the representation Gram matrix. Define Frobenius cosine similarities:

$$\mathrm{KA} = \frac{\langle K_{\mathrm{init}},\, K_{\mathrm{final}}\rangle_F}{\|K_{\mathrm{init}}\|_F\,\|K_{\mathrm{final}}\|_F}, \qquad \mathrm{RA} = \frac{\langle Z_{\mathrm{init}},\, Z_{\mathrm{final}}\rangle_F}{\|Z_{\mathrm{init}}\|_F\,\|Z_{\mathrm{final}}\|_F}.$$

Interpretation: KA/RA quantify how much the kernel/representation geometry rotated during training. Values near 1 mean the final matrix is nearly a scaled copy of the initial one (little rotation); smaller values mean substantial rotation.

Range: for PSD Gram/kernel matrices, $\mathrm{KA}, \mathrm{RA} \in [0, 1]$.
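A minimal sketch of the KA/RA computation. The feature matrices `Phi0`/`Phi1` are random placeholders for actual initial/final representations; only `frob_cos` mirrors the definition above:

```python
import numpy as np

def frob_cos(A, B):
    # Frobenius cosine similarity: <A, B>_F / (||A||_F ||B||_F).
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

rng = np.random.default_rng(0)
# Hypothetical representations (n samples x d features); Z = Phi Phi^T is PSD.
Phi0 = rng.normal(size=(32, 8))
Phi1 = rng.normal(size=(32, 8))
Z_init, Z_final = Phi0 @ Phi0.T, Phi1 @ Phi1.T

RA = frob_cos(Z_init, Z_final)  # KA is computed identically, with NTK matrices in place of Z
print(RA)
```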

I.3 Scale invariance (magnitude vs geometry)

KA/RA are cosine similarities, hence invariant to global scaling:

$$\frac{\langle aA,\, bB\rangle_F}{\|aA\|_F\,\|bB\|_F} = \frac{\langle A, B\rangle_F}{\|A\|_F\,\|B\|_F} \quad \text{for } a, b > 0.$$

So KA and RA compare geometry only: any purely multiplicative change of the kernel or representation (a magnitude change) is invisible to them.
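This invariance is easy to verify numerically; the PSD matrices and scale factors here are arbitrary:

```python
import numpy as np

def frob_cos(A, B):
    # Frobenius cosine similarity: <A, B>_F / (||A||_F ||B||_F).
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

rng = np.random.default_rng(1)
A = rng.normal(size=(16, 16)); A = A @ A.T  # PSD
B = rng.normal(size=(16, 16)); B = B @ B.T  # PSD

# Global rescalings a*A, b*B with a, b > 0 leave the cosine unchanged.
base = frob_cos(A, B)
scaled = frob_cos(3.7 * A, 0.01 * B)
print(abs(base - scaled))
```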

I.4 Gain: post-hoc scaling vs training dynamics

Key distinction:

  1. Post-hoc scaling (rescale a trained network without retraining):
    If scaling multiplies the relevant Jacobian by a constant factor α uniformly in x, then:
$$\lambda_T(x) \;\mapsto\; \lambda_T(x) + \frac{1}{T}\log\alpha,$$

a shift that is constant in $x$, hence:

    $\mathrm{Var}_x[\lambda_T(x)]$ is unchanged $\;\Rightarrow\;$ $G_\lambda$ is unchanged.
  2. Gain as a training hyperparameter:
    Gain can change optimization trajectory (effective gradient scale, feature evolution), so Kfinal and Zfinal can rotate differently. Therefore gain can affect KA/RA and also affect Gλ indirectly through training-changed geometry.

The post-hoc result requires the scaling to produce a uniform multiplicative factor in the Jacobian across $x$. For a plain feedforward composition with per-layer scalar scaling, the Jacobian scales by the product of those scalars, so the factor is indeed uniform. But if the “gain” interacts with normalization layers, activation saturation, or any $x$-dependent rescaling, the shift need not be exactly constant.
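A numeric sanity check of the post-hoc claim, using synthetic stand-ins for $\sigma_{\max}(J(x))$ rather than a trained network (`sigma_max` below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
T, alpha = 1.0, 5.0

# Stand-in samples of sigma_max(J(x)) over some evaluation measure mu.
sigma_max = np.exp(rng.normal(size=1000))

lam = np.log(sigma_max) / T                 # lambda_T(x)
lam_scaled = np.log(alpha * sigma_max) / T  # after uniform Jacobian scaling by alpha

# The shift is the constant (1/T) log(alpha), so the variance is unchanged.
print(np.allclose(lam_scaled - lam, np.log(alpha) / T))
print(np.isclose(lam.var(), lam_scaled.var()))
```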

I.5 What is “locked in” after Part I

  - The FTLE field $x \mapsto \lambda_T(x)$ and its heterogeneity metric $G_\lambda := \mathrm{Var}_{x\sim\mu}[\lambda_T(x)]$, with the evaluation measure $\mu$ stated explicitly per experiment.
  - KA and RA as Frobenius cosine similarities between initial and final NTK / representation Gram matrices, with range $[0,1]$ for PSD matrices.
  - Scale invariance: KA/RA ignore global magnitude; post-hoc uniform gain shifts $\lambda_T$ by a constant and leaves $G_\lambda$ unchanged.
  - Gain used as a training hyperparameter can still move KA/RA and $G_\lambda$ through changed training dynamics.

Next reading: Part II — Dataset audits (facts first, interpretation second)