NVIDIA’s Nemotron 3 Ultra Cuts Size and Boosts Speed with NVFP4

NVIDIA has introduced Nemotron 3 Ultra, a 550-billion-parameter AI model built to be smaller on disk and faster at inference than comparable models. It pairs a hybrid architecture with native NVFP4 quantization.
The model’s NVFP4 checkpoint shrinks the original BF16 checkpoint from 1,121 GB to 352.3 GB, a reduction of about 3.2x. Accuracy holds up: the NVFP4 checkpoint matches the BF16 model’s accuracy on nearly every benchmark.
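The size figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes the quoted sizes are decimal gigabytes and that the parameter count is exactly 550 billion; neither detail is confirmed by the sources.

```python
# Figures quoted in the article; units assumed to be decimal GB (10**9 bytes).
bf16_gb = 1121.0
nvfp4_gb = 352.3
params = 550e9  # assumed exact; the article only says "550 billion"

# How many times smaller the NVFP4 checkpoint is than the BF16 one.
ratio = bf16_gb / nvfp4_gb

# Average storage cost per parameter in the NVFP4 checkpoint.
bits_per_param = nvfp4_gb * 1e9 * 8 / params

print(f"compression ratio: {ratio:.2f}x")                      # about 3.18x
print(f"effective bits per parameter: {bits_per_param:.2f}")   # about 5.12
```

The average comes out above 4 bits per parameter. A plausible reason, not stated in the article, is that the checkpoint also stores scale factors and may keep some tensors at higher precision.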
Speed is where Nemotron 3 Ultra really shines. It delivers up to 5.9x higher inference throughput than the GLM-5.1 754B FP4 model, with the largest gains on decode-heavy workloads. For heavy-duty AI tasks, that is a substantial jump rather than an incremental tweak.
One clever trick: a single checkpoint runs on both NVIDIA’s Hopper and Blackwell hardware. Hopper lacks native FP4 tensor cores, so there the model runs in W4A16 mode (4-bit weights, 16-bit activations). On Blackwell, it runs natively in W4A4 (4-bit weights and 4-bit activations). Teams deploying across both GPU generations only have to manage one set of weights.
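The per-architecture choice described above can be pictured as a small dispatch on the GPU's compute capability. This is an illustrative sketch only: `pick_fp4_mode` is a hypothetical helper, not part of any NVIDIA library, and it assumes Hopper reports compute capability 9.x while Blackwell-class parts report 10.x or higher.

```python
def pick_fp4_mode(compute_capability):
    """Hypothetical helper: map a (major, minor) compute capability to a mode.

    Assumption: major == 9 means Hopper (no native FP4 tensor cores),
    major >= 10 means a Blackwell-class GPU with native FP4 support.
    """
    major, _minor = compute_capability
    if major >= 10:
        return "W4A4"   # 4-bit weights, 4-bit activations
    if major == 9:
        return "W4A16"  # 4-bit weights, 16-bit activations
    raise ValueError(f"unsupported compute capability: {compute_capability}")


print(pick_fp4_mode((9, 0)))   # Hopper-style capability
print(pick_fp4_mode((10, 0)))  # Blackwell-style capability
```

In a real deployment the inference runtime would make this decision itself; the sketch only shows why a single set of 4-bit weights can serve both paths.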
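However, getting the quantization right takes work. An FP4 value can represent only eight non-negative magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, and 6), so the choice of scale factor matters a great deal and tuning the checkpoint takes several iterations. Max scaling, or absmax, maps the largest absolute value onto the top of the FP4 range. It is simple, but a single outlier stretches the scale and pushes the remaining values into too few levels, losing information. Alternative calibration methods such as MSE-based scaling and GPTQ help recover that accuracy.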
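To make the absmax versus MSE trade-off concrete, here is a minimal pure-Python sketch. It is not NVIDIA Model Optimizer code. It assumes a 16-value block, keeps the scale as an ordinary float (real NVFP4 stores scales in a compact format), and uses a hypothetical grid search over smaller scales as the MSE method.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def quantize(block, scale):
    """Round each value to the nearest signed FP4 grid point, then rescale."""
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def mse(block, scale):
    """Mean squared error between a block and its quantized version."""
    q = quantize(block, scale)
    return sum((a - b) ** 2 for a, b in zip(block, q)) / len(block)


def absmax_scale(block):
    """Scale so the largest magnitude lands exactly on FP4_MAX."""
    return max(abs(x) for x in block) / FP4_MAX


def mse_scale(block, steps=50):
    """Search scales at or below the absmax scale; keep the lowest-error one."""
    base = absmax_scale(block)
    candidates = [base] + [base * (0.3 + 0.7 * i / steps) for i in range(steps)]
    return min(candidates, key=lambda s: mse(block, s))


random.seed(0)
# 15 ordinary values plus one outlier that dominates the absmax scale.
block = [random.gauss(0.0, 1.0) for _ in range(15)] + [12.0]

s_abs = absmax_scale(block)
s_mse = mse_scale(block)
print(f"absmax scale {s_abs:.3f}, error {mse(block, s_abs):.4f}")
print(f"MSE scale    {s_mse:.3f}, error {mse(block, s_mse):.4f}")
```

A smaller scale clips the outlier but gives the ordinary values finer resolution. Because the search includes the absmax scale as a candidate, its error can never be worse than absmax on the same block.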
On the training side, NVIDIA’s NeMo AutoModel automates hyperparameter tuning and fine-tuning for large transformer models. It uses Bayesian search and a proxy model to cut fine-tuning time by roughly 40%. Reported epoch times drop from 1.8 hours to 1.05 hours for BERT-large, and from 2.3 hours to 1.4 hours for GPT-2 medium.
NeMo AutoModel runs efficiently on a single 8xH100 node for 30-billion-parameter models like Qwen3-30B-A3B and Nemotron 3 Nano 30B A3B. It reduces peak GPU memory usage from 68.2 GiB to 48.1 GiB for Qwen3, and from 62.1 GiB to 42.5 GiB for Nemotron Nano. The result: faster training with less hardware strain.
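The reported before-and-after numbers can be turned into percentage reductions with a short calculation. The sketch below uses only figures quoted in the article; `pct_reduction` is a helper defined here for illustration.

```python
def pct_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# (before, after) pairs as quoted in the article.
figures = {
    "BERT-large epoch time (h)": (1.8, 1.05),
    "GPT-2 medium epoch time (h)": (2.3, 1.4),
    "Qwen3-30B-A3B peak memory (GiB)": (68.2, 48.1),
    "Nemotron 3 Nano 30B A3B peak memory (GiB)": (62.1, 42.5),
}

for name, (before, after) in figures.items():
    print(f"{name}: {before} -> {after} ({pct_reduction(before, after):.1f}% lower)")
```

The two epoch-time reductions come out at about 41.7% and 39.1%, and the two peak-memory reductions at about 29.5% and 31.6%.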
For teams running multiple models, these savings add up. NeMo AutoModel can reportedly trim GPU costs by nearly $10,000 per month across ten models. It also integrates with Hugging Face Transformers through a simple import swap, enabled with use_automodel=True.
NeMo AutoModel supports a variety of architectures, including BERT, GPT, RoBERTa, ALBERT, T5, and ViT. It wraps the Hugging Face Trainer with a cloud API for hyperparameter optimization, streamlining workflows for large-scale model tuning.
In short, NVIDIA’s latest releases shrink model size, raise throughput, and cut fine-tuning overhead. Nemotron 3 Ultra’s NVFP4 checkpoint targets inference efficiency, while NeMo AutoModel’s automation targets training time and memory.
Based on
- Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer — developer.nvidia.com
- NVIDIA’s NeMo AutoModel cuts MoE fine-tuning cost with one import swap — AI Insiders — aiinsiders.net
- How to Build Autonomous Research Agents with NVIDIA Nemotron 3 Ultra: A Comprehensive Guide – Frank’s World of Data Science & AI — franksworld.com
- AI Model Quantization Explained: How FP8, NF4 & INT8 Work in 2026 — techtofuture.com
- NVIDIA NeMo AutoModel cuts fine-tuning time by 40% – AI Herald — artificialintelligenceherald.com




