The Problem: Why FP8 Destroys Standard LoRA
NVIDIA's Hopper and Blackwell architectures support native FP8 computation: 2–3× faster training at a fraction of the memory cost. The catch is that FP8 has a dramatically narrower representable range than BF16. The smallest non-zero positive value in FP8 (E4M3 format) is 2^(−4) = 0.0625.
Standard LoRA uses a scaling factor of γ = 0.05. Under FP8, gradient updates scaled by 0.05 fall below the 0.0625 representable floor and underflow to zero. The adapter weights freeze and the model stops learning. The result is a partially-trained model with a 68.2% quality loss that superficially appears functional.
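As a minimal sketch of this failure mode, assuming the 2^(−4) floor stated above as a simplified flush-to-zero quantization model (real E4M3 rounding is more involved; `quantize_fp8` is a hypothetical helper, not a hardware kernel):

```python
FP8_MIN = 2 ** -4  # 0.0625, the representable floor assumed above

def quantize_fp8(x: float) -> float:
    """Toy flush-to-zero model: magnitudes below FP8_MIN become exactly 0."""
    return 0.0 if abs(x) < FP8_MIN else x

unit_update = 1.0                        # a unit-scale gradient update
print(quantize_fp8(0.05 * unit_update))  # gamma = 0.05: underflows to 0.0
print(quantize_fp8(1.0 * unit_update))   # gamma = 1.0: survives as 1.0
```

With γ = 0.05 every unit-scale update quantizes to zero, which is exactly the frozen-adapter behavior described above.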
Standard LoRA BF16 val loss = 1.9861 → FP8 val loss = 3.3412. Quality degradation: 68.2%. Unusable.
The Koščák Gamma Theorem
The theorem: for numerically stable training in b-bit floating point, the scaling factor must satisfy γ ≥ 2^(−(b−4)). For FP8 (b=8), γ_min = 0.0625. KSS-LoRA uses γ = 1.0 — well above this threshold — ensuring all gradient updates remain representable.
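The bound can be checked numerically; a small sketch (the function name `gamma_min` is illustrative, not part of any library):

```python
def gamma_min(bits: int) -> float:
    """Lower bound on the LoRA scaling factor per the theorem: 2^(-(b-4))."""
    return 2.0 ** -(bits - 4)

print(gamma_min(8))   # FP8: 0.0625, so gamma = 0.05 violates it, gamma = 1.0 clears it
print(gamma_min(16))  # BF16: ~0.000244, which is why gamma = 0.05 is safe in BF16
```

Note that for BF16 (b = 16) the floor is about 0.000244, so the conventional γ = 0.05 only becomes unstable once training moves to 8-bit formats.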
| Method | BF16 Loss | FP8 Loss | Degradation |
|---|---|---|---|
| Standard LoRA (γ=0.05) | 1.9861 | 3.3412 | 68.2% |
| KSS-LoRA (γ=1.0) | 1.5051 | 1.5831 | 5.2% |
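The degradation column follows directly from the two loss columns; a quick check (relative increase in validation loss, in percent):

```python
def degradation_pct(bf16_loss: float, fp8_loss: float) -> float:
    """Relative validation-loss increase from BF16 to FP8, in percent."""
    return 100.0 * (fp8_loss - bf16_loss) / bf16_loss

print(round(degradation_pct(1.9861, 3.3412), 1))  # standard LoRA: 68.2
print(round(degradation_pct(1.5051, 1.5831), 1))  # KSS-LoRA: 5.2
```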
B300 / Blackwell FP4
Applying the theorem to FP4 (b = 4) gives γ_min = 2^(−(4−4)) = 1.0. KSS-LoRA's default γ = 1.0 satisfies this constraint exactly, with zero margin. The method is mathematically ready for B300 FP4 training; experimental validation is ongoing.
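Under the same bound, the FP4 case is tight; a self-contained check (the constant name is illustrative):

```python
B = 4  # FP4 bit width
gamma_floor = 2.0 ** -(B - 4)  # theorem: gamma >= 2^(-(b-4)), which is 1.0 for FP4

KSS_DEFAULT_GAMMA = 1.0  # KSS-LoRA default, as stated above
print(gamma_floor)                          # 1.0
print(KSS_DEFAULT_GAMMA >= gamma_floor)     # True: satisfied exactly, no headroom
```

Because the constraint is met with no headroom, any γ below 1.0 would be expected to underflow in FP4 by the same argument as the FP8 case.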