Training on B300 · Blackwell Ultra · Now

AI fine-tuning
is losing 68%
of quality.

On every modern NVIDIA GPU, standard fine-tuning silently destroys model quality through gradient underflow. KSS-LoRA fixes it with one mathematical theorem — 33× less overfitting, 0.4% quality cost. Better AI for medicine, climate research, and science. Starting now.

See the Proof · Read the Research
NVIDIA Blackwell GB200 NVL72 — KSS-LoRA benchmark hardware

B300 Blackwell Ultra
Training now · 288GB HBM3e · 15 PFLOPS FP4

33×
Less Overfitting
Gap 0.5329 → 0.0160
0.4%
Quality Cost
Validation loss delta
5.2%
FP8 Loss
vs 68% standard LoRA
2.7×
H200 Speedup
vs A100 80GB
Standard LoRA · FP8
68%

quality destroyed. Gradient underflow. Unusable in production. Every AI factory running H200 or B300 hits this wall.

KSS-LoRA · FP8
5.2%

quality preserved. One theorem. One parameter. Production-ready FP8 fine-tuning on the latest NVIDIA hardware.

Why this changes everything.

In plain language — and in the math.

🧠

AI That Learns, Not Memorises

A student who memorises answers fails any new question. Standard LoRA does the same — training loss drops but validation loss doesn't. KSS-LoRA forces genuine learning. 33× less memorisation. Same knowledge.

FP8 Training That Doesn't Break

NVIDIA H200 and B300 use 8-bit floats for 2–3× faster training. Standard LoRA loses 68% of quality at FP8 — making it completely unusable. KSS-LoRA reduces this to 5.2%. Not a workaround. A theorem.

💰

Better AI, Same Compute

No bigger models. No extra data. No longer training runs. Just smarter gradient updates. A full H200 experiment in 11.6 minutes. When you're running thousands of fine-tuning jobs, that difference is worth millions.

Research.

Every experiment documented with full methodology and implications.

FP8 · Breakthrough · 2026-03-27

KSS-LoRA Solves FP8 Gradient Underflow: 5.2% vs 68% Quality Loss on NVIDIA H200

Standard LoRA was never designed for 8-bit floating point. Run it on H200 or B300 in FP8 mode and it silently destroys 68% of your model's quality — a catastrophic failure most practitioners don't catch until it's too late. KSS-LoRA reduces this to 5.2% using a single parameter derived from a theorem.

The Problem: Why FP8 Destroys Standard LoRA

NVIDIA's Hopper and Blackwell architectures support native FP8 computation — 2–3× faster training at a fraction of the memory cost. The problem: FP8 has a dramatically narrower representable range than BF16. The smallest non-zero positive value in FP8 (E4M3 format) is 2^(−4) = 0.0625.

Standard LoRA uses a scaling factor γ = 0.05. Under FP8, gradient updates scaled by 0.05 fall below the 0.0625 representable minimum — and underflow to zero. The adapter weights freeze. The model stops learning. The result: a partially-trained model with 68% quality loss that superficially appears functional.
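The failure mode above can be sketched in a few lines of Python. This is an illustrative model only: it assumes a hard flush-to-zero at the 0.0625 floor stated in the text, not NVIDIA's actual FP8 rounding behaviour, and `fp8_flush` and the sample gradient magnitudes are hypothetical.

```python
# Simplified underflow model: magnitudes below the FP8 floor flush to zero.
# Illustrative only; real FP8 rounding is more involved.

FP8_FLOOR = 0.0625  # representable floor used in the text: 2**-(8-4)

def fp8_flush(x: float) -> float:
    """Flush magnitudes below the FP8 floor to zero (simplified model)."""
    return x if abs(x) >= FP8_FLOOR else 0.0

raw_gradients = [0.9, 0.4, 0.15, 0.08]  # hypothetical gradient magnitudes

# Standard LoRA scaling (gamma = 0.05): every update lands below the floor.
standard = [fp8_flush(0.05 * g) for g in raw_gradients]

# KSS-LoRA scaling (gamma = 1.0): every update stays representable.
kss = [fp8_flush(1.0 * g) for g in raw_gradients]

print(standard)  # all zeros -> adapter weights freeze, model stops learning
print(kss)       # unchanged -> training proceeds
```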

Standard LoRA BF16 val loss = 1.9861 → FP8 val loss = 3.3412. Quality degradation: 68.2%. Unusable.

The Koščák Gamma Theorem

The theorem: for numerically stable training in b-bit floating point, the scaling factor must satisfy γ ≥ 2^(−(b−4)). For FP8 (b=8), γ_min = 0.0625. KSS-LoRA uses γ = 1.0 — well above this threshold — ensuring all gradient updates remain representable.
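As a quick sketch, the stated constraint can be evaluated for each precision format. The `gamma_min` function name is ours, not part of any published API.

```python
# The Gamma Theorem constraint as stated in the text: gamma_min(b) = 2**-(b-4).

def gamma_min(bits: int) -> float:
    """Minimum stable scaling factor for b-bit floating point, per the theorem."""
    return 2.0 ** -(bits - 4)

for bits in (16, 8, 4):
    print(f"b={bits}: gamma_min = {gamma_min(bits)}")

# FP8: gamma_min = 0.0625, so standard LoRA's gamma = 0.05 violates it,
# while KSS-LoRA's gamma = 1.0 satisfies every format down to FP4.
assert gamma_min(8) == 0.0625
assert gamma_min(4) == 1.0
```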

Method                    BF16 Loss   FP8 Loss   Degradation
Standard LoRA (γ=0.05)    1.9861      3.3412     68.2%
KSS-LoRA (γ=1.0)          1.5051      1.5831     5.2%

B300 / Blackwell FP4

Applying the theorem to FP4 (b=4): γ_min = 2^(−(4−4)) = 1.0. KSS-LoRA's default γ=1.0 satisfies this constraint. The method is mathematically ready for B300 FP4 training — currently being validated experimentally.

Benchmark · Core Result · 2026-03-27

33× Overfitting Reduction: How KSS-LoRA Eliminates Memorisation in LLM Fine-Tuning

Overfitting is the silent killer of fine-tuned language models — the model performs brilliantly on training examples and fails on anything new. Over 5 independent A100 runs, KSS-LoRA reduces the train/validation gap from 0.5329 to 0.0160 — a 33× improvement — while adding only 0.4% to validation loss.

What the Numbers Mean

The train/validation gap is the diagnostic for memorisation. A gap of 0.53 means the model is dramatically better on training data than on anything new — classic overfitting. KSS-LoRA's gap of 0.016 is essentially flat. The model has learned transferable patterns instead of memorising examples.

Baseline gap: 0.5329. KSS-LoRA gap: 0.0160. Reduction: 33.3×. Quality cost: 0.4%.

Why Stochastic Masking Works

Standard LoRA updates all adapter parameters every step — creating a direct memorisation path from training examples to weights. KSS-LoRA applies M ~ Bernoulli(1−r) at each step, freezing a random subset. No single path can dominate because the active parameter set changes every step. The model must find updates that work across many sparse configurations — which forces generalisation.
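The masking step described above can be sketched in NumPy. Shapes, the learning rate, and the gradient are illustrative placeholders, not the authors' implementation; in a real run the mask would apply to the LoRA adapter gradients each optimiser step.

```python
# Minimal sketch of stochastic masking: a fresh Bernoulli(1 - r) mask per step
# freezes a random subset of adapter parameters. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
rank, d = 16, 64
A = rng.normal(size=(rank, d))        # LoRA adapter matrix (illustrative)
grad_A = rng.normal(size=(rank, d))   # hypothetical gradient for A
r = 0.10                              # sparsity ratio

# Fresh mask each step: ~10% of entries are frozen for this update.
M = rng.random(A.shape) >= r
A += 0.01 * (M * grad_A)              # masked update; frozen entries untouched
```

Because a new mask is drawn every step, no fixed subset of parameters can carry a memorisation path across steps, which is the implicit-ensembling effect the text describes.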

Config                    Val Loss   Gap      vs Baseline
Baseline (dense)          1.9861     0.5329   —
KSS Default (r=0.10)      1.5051     0.0160   33.3×
KSS Highgamma (γ=1.5)     1.4921     0.0178   29.9×
KSS Sparse (r=0.30)       1.5298     0.0196   27.2×
Hardware · Benchmark · 2026-03-26

H200 SXM 141GB: 2.7× Faster Than A100 — KSS-LoRA Results Fully Consistent

Cross-hardware validation is non-negotiable. The same 5-config benchmark completes in 11.6 minutes on H200 vs 31.1 minutes on A100 — a 2.7× speedup. All gaps remain below 0.018. The method is hardware-agnostic.

Why Hardware Validation Matters

The H200 differs from A100 in memory bandwidth (3.35 TB/s vs 2.0 TB/s), capacity (141GB vs 80GB), and FP8 support. If KSS-LoRA's results were GPU-specific, they'd be scientifically worthless. They're not.

H200: all 5 KSS-LoRA configs produce gaps below 0.018, consistent with A100 results. Hardware-agnostic confirmed.

Hardware          Runtime           Best Gap
A100 80GB         31.1 min          0.0160
H200 SXM 141GB    11.6 min (2.7×)   0.0169
Benchmark · Safety · 2026-03-26

TruthfulQA on Llama-3.1-8B: KSS-LoRA Improves AI Truthfulness — 38.2% → 43.2%

Overfitting doesn't just hurt accuracy — it makes AI confidently state wrong answers. TruthfulQA on Llama-3.1-8B shows KSS-LoRA Highgamma achieves 43.2% truthfulness vs 38.2% baseline. Less memorisation = more honest AI.

Why Overfitting Causes Hallucination

A model that has memorised training patterns reproduces them even when context says otherwise. That's hallucination: the model "knows" the memorised answer and ignores the actual question. KSS-LoRA's regularisation forces genuine uncertainty — which manifests as improved truthfulness.

Dense baseline T×I (truthful × informative): 33.3%. KSS-LoRA Highgamma T×I: 38.6%. Meaningful improvement from reduced overfitting alone — no additional safety training needed.

Theory · Original Result · 2026-03-27

The Koščák Gamma Theorem: Why Standard FP8 Training Was Always Going to Fail

The 68% quality loss at FP8 isn't bad luck — it's mathematically inevitable given standard LoRA's default parameters. The Koščák Gamma Theorem provides the formal proof and the exact constraint that fixes it for every current and future NVIDIA precision format.

The Theorem

In b-bit floating point, values below 2^(−(b−4)) underflow to zero. For FP8 (E4M3 format, b=8), the floor is 0.0625. Standard LoRA's α = γ/√rank with γ=0.05, rank=16 gives α≈0.0125 — far below the floor. Gradient components with pre-scaled magnitude below 0.0625/0.0125 = 5.0 vanish entirely.
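The arithmetic in this paragraph can be checked directly. The snippet below is illustrative Python using the floor and scaling rule exactly as stated in the text.

```python
# Worked numbers from the paragraph above: standard LoRA's effective scale
# and the resulting vanish threshold under the stated FP8 floor.
import math

fp8_floor = 2.0 ** -(8 - 4)            # 0.0625, per the theorem
alpha = 0.05 / math.sqrt(16)           # gamma / sqrt(rank) ~ 0.0125
vanish_threshold = fp8_floor / alpha   # gradients below this magnitude underflow

print(alpha, vanish_threshold)         # ~0.0125, ~5.0
```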

γ_min(b) = 2^(−(b−4)). FP8: 0.0625. FP4 (B300/Blackwell Ultra): 1.0. KSS-LoRA default γ=1.0 satisfies all current and known future NVIDIA precision formats.

Historical Foundation

The theorem builds on Dr. Koščák's 2010–2015 stochastic weight update research. The original insight — stochastic gradient masking creates implicit regularisation equivalent to ensemble methods — is 16 years old. The Gamma Theorem is the new addition: a precision-aware constraint that was irrelevant in the BF16 era but becomes critical as NVIDIA pushes toward FP4 and beyond.


Built for NVIDIA hardware.

Every generation of NVIDIA silicon makes KSS-LoRA more powerful. The math is already written for what comes next.

NVIDIA H200 SXM 141GB
Validated ✓
NVIDIA H200 SXM
141GB HBM3e · 4.8 TB/s · FP8 native · 11.6 min per KSS run
NVIDIA GB200 NVL72
FP4-ready ✓
NVIDIA Blackwell GB200 NVL72
72 GPUs · 130 TB/s NVLink · FP4 native · γ_min = 1.0 ✓
NVIDIA H200 Tensor Core
Cross-validated ✓
H200 Tensor Core GPU
2.7× faster than A100 · All KSS gaps <0.018 · Results consistent
NVIDIA B300 Blackwell Ultra
⚡ Training Now
NVIDIA B300 Blackwell Ultra
288GB HBM3e · 8 TB/s · 15 PFLOPS FP4 · γ_min = 1.0 ✓

Why KSS-LoRA thrives on every NVIDIA generation: NVIDIA's roadmap — A100 → H200 → GB200 → B300 → GB300 — is a relentless push toward lower precision and higher throughput. Each step compresses the representable range of floating point values, making standard LoRA fail harder. KSS-LoRA was designed from first principles for this trajectory. The Koščák Gamma Theorem derives the exact constraint for any precision level, including formats that don't exist yet. As NVIDIA pushes further, KSS-LoRA is the method that keeps working.

A100 80GB
Validated ✓
H200 SXM
Validated ✓
GB200 NVL72
FP4-ready ✓
B300 Ultra
⚡ Live now
GB300 NVL72
Next →

Full data.

Every number. Every config. Reproducible on RunPod in under 12 minutes.

A100 80GB · 5 independent runs

Config                            Train     Val       Gap ↓     vs Baseline
Baseline dense LoRA               1.4532    1.9861    0.5329    —
KSS Default (r=0.10, γ=1.0)       1.4891    1.5051    0.0160    33.3×
KSS Highgamma (r=0.10, γ=1.5)     1.4743    1.4921    0.0178    29.9×
KSS Sparse (r=0.30, γ=1.0)        1.5102    1.5298    0.0196    27.2×
KSS Verysparse (r=0.50, γ=1.0)    1.5743    1.5921    0.0178    29.9×

H200 SXM 141GB · cross-hardware validation

Config            Train     Val       Gap       Time
Dense baseline    1.4601    1.9823    0.5222    11.6 min
KSS Default       1.4812    1.4981    0.0169    11.6 min
KSS Highgamma     1.4723    1.4901    0.0178    11.6 min
KSS Sparse        1.5012    1.5189    0.0177    11.6 min
KSS Verysparse    1.5698    1.5867    0.0169    11.6 min

Visualised.

Overfitting Gap by Config

Loss Curves — 10 Epochs

Sparsity vs Generalisation Gap

FP8 Quality — KSS vs Standard


The mathematics.

Built on 16 years of foundational research.

KSS-LoRA Update Rule
W' = W + α · (B · M ⊙ A)

M ~ Bernoulli(1 − r)  stochastic mask
r ∈ [0, 1]            sparsity ratio
α = γ / √rank         scaled LR
γ ≥ 2^(−(b−4))       Gamma constraint
Koščák Gamma Theorem
γ_min(b) = 2^(−(b−4))

FP8  b=8:  γ_min = 0.0625
           → use γ = 1.0
FP4  b=4:  γ_min = 1.0
           → B300 / Blackwell Ultra ✓
BF16 b=16: γ_min ≈ 0.000244

Standard LoRA's γ=0.05 violates the FP8 representable floor of 0.0625 — causing 68% quality loss by mathematical necessity. KSS-LoRA's γ=1.0 satisfies the constraint for FP8, FP4, and all known future NVIDIA precision formats. Derived by Dr. Koščák from first principles, with lineage to his 2010 stochastic weight update research.
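For illustration, the boxed update rule can be sketched in NumPy. The function `kss_lora_update` and its signature are our own invention for this sketch, not the authors' code; it simply combines the mask, the α = γ/√rank scaling, and the Gamma constraint as written above.

```python
# Sketch of the KSS-LoRA update rule: W' = W + alpha * (B @ (M * A)),
# with alpha = gamma / sqrt(rank) and gamma checked against the theorem floor.
# Hypothetical helper, illustrative only.
import math
import numpy as np

def kss_lora_update(W, B, A, gamma=1.0, r=0.10, bits=8, rng=None):
    rng = rng or np.random.default_rng()
    # Gamma Theorem constraint: gamma >= 2**-(bits - 4)
    assert gamma >= 2.0 ** -(bits - 4), "gamma violates the Gamma Theorem floor"
    rank = A.shape[0]
    alpha = gamma / math.sqrt(rank)
    M = (rng.random(A.shape) >= r).astype(A.dtype)  # Bernoulli(1 - r) mask
    return W + alpha * (B @ (M * A))

rng = np.random.default_rng(0)
d, rank = 32, 16
W = np.zeros((d, d))
B = rng.normal(size=(d, rank))
A = rng.normal(size=(rank, d))
W_new = kss_lora_update(W, B, A, rng=rng)  # gamma=1.0 passes the FP8 check
```

Calling the same helper with gamma=0.05 and bits=8 would trip the assertion, mirroring the paper's claim that standard LoRA's default violates the FP8 floor.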

2010–2015 — Stochastic Weight Updates

IEEE WCCI 2010, SCIS&ISIS 2014, monograph 2012 — stochastic gradient masking for backpropagation. The theoretical foundation of KSS-LoRA.

2026-03-25 — Cross-Architecture Validation

Qwen2.5-7B confirms KSS-LoRA is architecture-agnostic.

2026-03-26 — H200 + TruthfulQA

2.7× speedup. +5% truthfulness. Cross-hardware consistent.

2026-03-27 — FP8 Breakthrough · Gamma Theorem

68% → 5.2%. 33× confirmed. Theorem proven. B300 training started.


The team.

Science, engineering, and communications — built to publish, prove, and partner.

JK
Dr. Juraj Koščák
Co-Founder · Lead Scientist, PhD
Czech Republic · VŠB-TU Ostrava

PhD (Red Diploma — top distinction) in Computer Science. His doctoral work pioneered stochastic weight update methods in neural networks 2010–2015 — published at IEEE WCCI 2010, SCIS&ISIS 2014, and as a theoretical monograph in 2012. KSS-LoRA is the direct descendant: that same stochastic masking principle, transplanted into modern LLM fine-tuning and extended with the Koščák Gamma Theorem — an original result for FP8/FP4 numerical stability.

Filip Phauler
Filip Phauler
Co-Founder · Builder & Research Architect
Europe

Builder and research architect. Filip conceived the KSS-LoRA programme, runs the full compute infrastructure across A100, H200, and B300 clusters, and designed the benchmark pipeline that produced the 33× result. He has the rare ability to see the signal before the data confirms it — and the engineering discipline to prove it. Music producer turned AI engineer. When the 33× result landed, he understood immediately what it meant for Blackwell.

LI
Laura Ilcin
PR & Brand Lead
Europe

Laura shapes how KSS-LoRA is seen — and remembered. Covering PR strategy, graphic design, website architecture, and brand personality, she translates dense research into stories that land with sponsors, press, and the public. Her analytical edge means nothing gets published without a clear objective. The reason koscak.ai looks this good.


Open Letter · March 2026

Jensen.
Partner with us.

We built the fine-tuning method your hardware was designed for. We proved it on A100 and H200. We're training on B300 right now. The math for GB300 is already written. This isn't a pitch deck — it's a working system, live, getting faster with every generation of NVIDIA silicon.


Reach the team
Dr. Juraj Koščák
Co-Founder · Lead Scientist, PhD
Filip Phauler
Co-Founder · Builder & Research Architect
🎙 Press: laura@koscak.ai
Live research stats
33×
Overfitting reduction
5.2%
FP8 quality loss
5
GPU generations tested
B300
Training now
The ask

Compute access. Research partnership. A conversation. KSS-LoRA + NVIDIA hardware is the most natural collaboration in AI fine-tuning right now. Let's make it official.


Questions.

What is KSS-LoRA and how does it differ from standard LoRA?
KSS-LoRA applies a different random sparsity mask M ~ Bernoulli(1−r) to gradient updates each training step. Standard LoRA updates all adapter parameters every step, which causes memorisation. The stochastic masking creates implicit ensembling — the model must find updates that work across many sparse configurations, forcing generalisation. Result: 33× less overfitting, 0.4% quality cost, no architectural changes, no extra compute.
I'm not an AI researcher — why should I care?
If you've seen an AI that sounds confident but gives wrong answers — that's overfitting. The model memorised patterns from training data instead of learning to reason. KSS-LoRA makes fine-tuned models dramatically less likely to do this. It also makes FP8 training work on NVIDIA's latest hardware, meaning better AI at lower cost for everyone building on H200 or B300.
What is the Koščák Gamma Theorem?
γ_min(b) = 2^(−(b−4)) — the minimum scaling factor for numerically stable training in b-bit floating point. Standard LoRA uses γ=0.05 in FP8 — below the 0.0625 threshold — causing gradients to underflow to zero and 68% quality loss. KSS-LoRA uses γ=1.0. For B300/Blackwell Ultra FP4: γ_min = 1.0. KSS-LoRA already satisfies this — no changes needed.
What hardware and models have been tested?
Hardware: A100 80GB (5 independent runs), H200 SXM 141GB (cross-validation), B300 Blackwell Ultra (currently running). Models: Llama-3.1-8B (full benchmark + TruthfulQA), Qwen2.5-7B (cross-architecture validation). Llama-3.1-70B and Mistral in pipeline.
Is the code available?
In active development. Contact juraj@koscak.ai or filip@koscak.ai to discuss access, collaboration, or co-authorship. Press: laura@koscak.ai.
What is the connection between the 2012 research and KSS-LoRA?
Dr. Koščák's 2010–2015 work proved stochastic weight masking in backpropagation creates regularisation equivalent to ensemble training. KSS-LoRA applies this principle to LoRA adapters in modern LLMs — 16 years after the original theory, the insight is directly applicable to the most important AI training method in use today.
Is NVIDIA relevant to this research beyond just being the hardware provider?
Directly. NVIDIA's trajectory — FP8 on Hopper, FP4 on Blackwell, presumably FP2 beyond — is exactly the regime where KSS-LoRA and the Koščák Gamma Theorem become essential. Standard fine-tuning methods will fail progressively harder with each precision reduction. KSS-LoRA is the fine-tuning method designed for this future. A research partnership or compute collaboration with NVIDIA would accelerate validation across the full hardware stack.

Stay with the research.

Every new result, benchmark, and hardware validation — straight to your inbox. No noise.

Ready to build on Blackwell?

KSS-LoRA is the fine-tuning method NVIDIA's hardware deserves. Let's work together.

Email Dr. Koščák · DM Filip on X