# SuperGemma4-26B DFlash Draft (pilot / PoC)

This is a proof-of-concept DFlash block-diffusion drafter for `AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4` (the NVFP4-quantized SuperGemma4 26B Abliterated Multimodal model); the drafter was trained against its bf16 twin.

## What is this?

A small draft model used for speculative decoding: instead of the 26B target generating one token per step, the drafter proposes multiple tokens in parallel via block diffusion, and the target verifies them in a single forward pass. Once the drafter is accurate enough, this can yield 2-3× generation throughput.
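For intuition, here is a minimal sketch of the draft-and-verify loop (greedy acceptance; this is not the DFlash implementation, and `drafter.propose` plus a logits-returning `target` are stand-ins):

```python
import torch

def speculative_step(target, drafter, ctx: torch.Tensor, k: int = 5) -> torch.Tensor:
    """One decode step: propose k tokens, verify them in a single target pass.

    Assumes `target(seq)` returns logits of shape [1, len(seq), vocab] and
    `drafter.propose` stands in for the block-diffusion proposal.
    """
    draft = drafter.propose(ctx, num_tokens=k)          # [k] proposed token ids
    seq = torch.cat([ctx, draft])
    logits = target(seq.unsqueeze(0))                   # one verification pass
    preds = logits[0, -k - 1:].argmax(-1)               # target's next-token picks
    ok = (preds[:k] == draft).int().cumprod(0).bool()   # longest matching prefix
    n_acc = int(ok.sum())
    # Even with zero matches we still emit one token: the target's own pick.
    return torch.cat([ctx, draft[:n_acc], preds[n_acc:n_acc + 1]])
```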

This release is a pilot (5K samples, 1 epoch, ~28 min on 1× RTX PRO 6000 Blackwell). Top-1 match rate with the target is only ~5.8%, too low to actually speed up generation. The artifact exists to validate the full training + export + deployment pipeline end-to-end. See Roadmap below.
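For scale: under greedy acceptance with per-token match rate α and draft length k, the standard estimate for tokens emitted per verification pass is (1 − α^(k+1)) / (1 − α). A quick back-of-the-envelope check (assuming i.i.d. matches):

```python
# Expected tokens per verification pass under greedy speculative decoding,
# assuming independent per-token matches with rate alpha and draft length k.
def expected_tokens(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens(0.058, 5))  # ~1.06 -> barely better than plain decoding
print(expected_tokens(0.70, 5))   # ~2.94 -> where real speedups start
```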

## Architecture

| Field | Value |
|-------|-------|
| Type | DFlash block-diffusion drafter (Qwen3-style) |
| Layers | 5 |
| Hidden size | 2816 (matches target) |
| Heads | 22 attention / 22 KV |
| Head dim | 128 (vLLM-compatible) |
| Intermediate size | 9728 |
| Vocab size | 262144 |
| Max position | 262144 |
| Block size | 8 |
| Target layer anchors | [1, 8, 14, 20, 27] |
| Parameters | ~570M |
| Dtype | bfloat16 |
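The anchor list presumably means the drafter conditions on target hidden states tapped from those layers. A minimal sketch of gathering them with `transformers` (illustrative only, not the DFlash code path; assumes a standard causal-LM loading path and the usual `hidden_states` convention where index 0 is the embedding output):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ANCHOR_LAYERS = [1, 8, 14, 20, 27]  # from the table above

repo = "AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4"
tok = AutoTokenizer.from_pretrained(repo)
target = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

ids = tok("Hello there", return_tensors="pt")
with torch.no_grad():
    out = target(**ids, output_hidden_states=True)

# hidden_states[i] is the output of layer i (index 0 = embeddings), so the
# anchors yield five [batch, seq, 2816] tensors for the drafter to consume.
anchors = [out.hidden_states[i] for i in ANCHOR_LAYERS]
```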

## Training

| Field | Value |
|-------|-------|
| Target | SuperGemma4 26B (bf16, transformers 5.5.4 layout) |
| Data | HuggingFaceH4/ultrachat_200k `train_sft`, first 5000 conversations |
| Seq length | 2048 |
| Optimizer | AdamW (fused) |
| LR | 6e-4 (linear decay, 50 warmup steps) |
| Epochs | 1 |
| Steps | 5000 (bs=1, grad_accum=1) |
| Precision | bf16 mixed |
| Hardware | 1× NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB) |
| Wall time | 27:48 |
| Loss (final step) | ~5.9 (noisy; distillation + self-logit terms) |
| Train acc @ step 5000 | 5.79% (`parallel_0_step_0` top-1) |

### Training infra notes

Built with NVIDIA/TensorRT-Model-Optimizer#1211 (DFlash training mode, merged in modelopt 0.43.0rc2.dev). Two local patches were required:

1. An `EagleTrainerWithAccLog._save` override that saves the drafter alone (1.2 GB) instead of the full 50 GB frozen-target-plus-drafter tree; the stock HF Trainer save would OOM the filesystem. A sketch of this override follows the list.
2. `modeling_gemma4.py` line 2027: bypass the "`mm_token_type_ids` required when training" check for text-only training.
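A minimal sketch of what patch (1) looks like; the `drafter.` key prefix and class shape are illustrative assumptions, not the actual modelopt code:

```python
import os
from safetensors.torch import save_file
from transformers import Trainer

class DrafterOnlySaveTrainer(Trainer):
    """Sketch of the _save override: persist only the drafter's weights."""

    def _save(self, output_dir=None, state_dict=None):
        output_dir = output_dir or self.args.output_dir
        os.makedirs(output_dir, exist_ok=True)
        state_dict = state_dict or self.model.state_dict()
        # Keep drafter tensors only; the frozen 26B target never hits disk.
        # The "drafter." prefix is an assumed naming convention.
        drafter_sd = {k: v for k, v in state_dict.items() if k.startswith("drafter.")}
        save_file(drafter_sd, os.path.join(output_dir, "model.safetensors"))
```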

## Deployment with vLLM

```bash
vllm serve AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4 \
    --speculative-config '{
        "method": "draft_model",
        "model": "AEON-7/supergemma4-26b-dflash-pilot",
        "num_speculative_tokens": 5
    }' \
    --quantization modelopt \
    --max-model-len 65536
```

⚠️ With only ~5.8% top-1, expect a negative speedup (verification overhead swamps any gain). This pilot is for plumbing validation, not performance.
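Once the server is up, requests go through vLLM's OpenAI-compatible API as usual; speculative decoding is transparent to the client. A quick smoke test (default port 8000 assumed):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```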

## Roadmap

This pilot proves the training stack. To make it actually fast:

| Stage | Data | Epochs | Expected top-1 | ETA |
|-------|------|--------|----------------|-----|
| Pilot (this) | 5K | 1 | 5.8% | 28 min × 1 GPU |
| Small | 50K | 3 | ~25% | ~15 hr × 1 GPU |
| Medium | 500K | 3 | ~55% | ~6 days × 1 GPU (or 18 hr × 8 GPUs) |
| Production | 2M | 5-10 | 70-80% | ~1 week × 8 GPUs |

Moving from UltraChat alone to a production-quality, domain-general data mix (ShareGPT, UltraChat, Magpie, LMSYS Chat, code) is also key.

## Files

- `model.safetensors` — 1.2 GB, 58 tensors (the 5-layer drafter)
- `config.json` — Qwen3-style `DFlashDraftModel` config
- `tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja` — copied from the target
- `trainer_state.json` — per-step loss/accuracy history
- `training_args.bin` — exact training config for reproducibility
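The training curve can be inspected straight from `trainer_state.json` (standard HF Trainer format: a `log_history` list of per-step dicts; exact keys depend on what the trainer logged):

```python
import json

with open("trainer_state.json") as f:
    state = json.load(f)

# Print the last few logged steps; .get() guards against missing keys.
for entry in state["log_history"][-5:]:
    print(entry.get("step"), entry.get("loss"))
```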

## License

Apache 2.0. This is a derived work of the SuperGemma4 26B target.


## ☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

- ₿ Bitcoin (BTC): `bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4`
- Ξ Ethereum (ETH): `0x1512667F6D61454ad531d2E45C0a5d1fd82D0500`
- ◎ Solana (SOL): `DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t`
- ⓜ Monero (XMR): `836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd`

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
