🥖 Baguette
A Distributed Inference Engine for Paris MoE Diffusion Models
Fast, efficient inference for the 5-billion-parameter Paris Mixture-of-Experts text-to-image model
⚡ Quick Start
# Clone the repo
git clone https://huggingface.co/nbagel/baguette
cd baguette
# Install dependencies
pip install uv && uv pip install torch torchvision safetensors transformers diffusers accelerate tqdm
# Generate images
python generate.py --prompt "a cute cat" --num_samples 4
Output: output_bf16.png with 4 generated images.
🎨 Generation Examples
# Basic generation (4 images, top-2 routing, 30 steps)
python generate.py --prompt "sunset over mountains" --num_samples 4
# See expert routing visualization
python generate.py --prompt "abstract art" --visualize
# Faster generation
python generate.py --prompt "a happy dog" --num_steps 20
# Lower memory usage (offload experts to CPU)
python generate.py --prompt "portrait of a scientist" --offload 4
# INT8 quantized (smaller weights)
python generate.py --prompt "enchanted forest" --precision int8
🔮 Expert Routing Visualization
Baguette includes real-time visualization of the MoE router's expert selection. Use --visualize to see which experts are activated:
╭──────────────────────────────────────────────────╮
│ ⚡ EXPERT USAGE DISTRIBUTION │
├──────────────────────────────────────────────────┤
│ → E4 │████████████████████████████│ 40.6% │
│ E2 │██████████████████████████▎ │ 36.7% │
│ E6 │██████████▌ │ 14.8% │
│ E1 │███▊ │ 5.5% │
│ E5 │█▋ │ 2.3% │
│ E0 │ │ 0.0% │
│ E3 │ │ 0.0% │
│ E7 │ │ 0.0% │
├──────────────────────────────────────────────────┤
│ Active: 5/8 experts Calls: 128 │
╰──────────────────────────────────────────────────╯
╭──────────────────────────────────────────────────╮
│ 📈 ROUTING TIMELINE │
├──────────────────────────────────────────────────┤
│ Step 0 1 2 3 4 5 6 7 8 9 10 11 12 13 │
│ ─────────────────────────────────────────────── │
│ E0 · · · · · · · · · · · · · · │
│ E1 · · · · · · · · · · · · · · │
│ E2 · · · · · ● ● ● ● ● ● ● ● ● │
│ E3 · · · · · · · · · · · · · · │
│ E4 · · ● ● ● · · · · · · · · · │
│ E5 · · · · · · · · · · · · · · │
│ E6 ● ● · · · · · · · · · · · · │
│ E7 · · · · · · · · · · · · · · │
├──────────────────────────────────────────────────┤
│ Routing changes: 2/13 steps (15%) │
╰──────────────────────────────────────────────────╯
The router dynamically selects different experts based on the noise level at each diffusion timestep. Early steps (high noise) often use different experts than later steps (low noise).
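As a rough illustration of top-K routing, the sketch below picks the K highest-scoring experts from a set of router scores and blends their velocity predictions. It is a minimal pure-Python sketch, not Baguette's actual code: the function names, the example logits, and the choice to renormalize softmax weights over the selected experts are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are softmax probabilities renormalized over the selected
    experts so they sum to 1 (an assumed convention, not confirmed by the repo).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def combine_velocities(routes, expert_velocities):
    """Weighted sum of the selected experts' velocity predictions."""
    dim = len(next(iter(expert_velocities.values())))
    out = [0.0] * dim
    for idx, weight in routes:
        for j, v in enumerate(expert_velocities[idx]):
            out[j] += weight * v
    return out

# Hypothetical router scores for the 8 experts at one timestep.
logits = [0.1, 0.3, 2.0, -1.0, 2.5, 0.0, 1.0, -0.5]
routes = top_k_route(logits, k=2)

# Toy 3-element "velocities" standing in for real expert outputs.
velocities = {i: [float(i)] * 3 for i in range(8)}
blended = combine_velocities(routes, velocities)

print(routes)   # experts 4 and 2, with weights summing to 1
print(blended)
```

With --topk 1 the same logic reduces to picking the single best expert, which matches the one-expert-per-step pattern in the timeline above.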
📋 Command Reference
| Flag | Default | Description |
|---|---|---|
| --prompt | "a cute cat" | Text description of the image to generate |
| --num_samples | 16 | Number of images to generate |
| --num_steps | 30 | Diffusion sampling steps (15-50) |
| --cfg_scale | 7.5 | Classifier-free guidance scale (5-12) |
| --precision | bf16 | Weight precision: bf16 or int8 |
| --topk | 2 | Number of experts per sample (1-8) |
| --offload | 0 | Experts to offload to CPU RAM (0-7) |
| --visualize | false | Show expert routing statistics |
| --output | auto | Custom output filename |
| --seed | 999 | Random seed for reproducibility |
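For readers scripting around the CLI, the flags and defaults above could be declared with argparse roughly as follows. This is a hypothetical reconstruction from the table, not the contents of generate.py. The `build_parser` and `resolve_output` names are invented here, and the automatic `output_<precision>.png` naming is only inferred from the Quick Start output filename.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flag table; not the real generate.py."""
    p = argparse.ArgumentParser(description="Paris MoE image generation (sketch)")
    p.add_argument("--prompt", type=str, default="a cute cat",
                   help="Text description of the image to generate")
    p.add_argument("--num_samples", type=int, default=16,
                   help="Number of images to generate")
    p.add_argument("--num_steps", type=int, default=30,
                   help="Diffusion sampling steps (15-50)")
    p.add_argument("--cfg_scale", type=float, default=7.5,
                   help="Classifier-free guidance scale (5-12)")
    p.add_argument("--precision", choices=["bf16", "int8"], default="bf16",
                   help="Weight precision")
    p.add_argument("--topk", type=int, default=2, choices=range(1, 9),
                   help="Number of experts per sample (1-8)")
    p.add_argument("--offload", type=int, default=0, choices=range(0, 8),
                   help="Experts to offload to CPU RAM (0-7)")
    p.add_argument("--visualize", action="store_true",
                   help="Show expert routing statistics")
    p.add_argument("--output", type=str, default=None,
                   help="Custom output filename (None means automatic)")
    p.add_argument("--seed", type=int, default=999,
                   help="Random seed for reproducibility")
    return p

def resolve_output(args):
    """Assumed auto-naming rule, inferred from 'output_bf16.png' in Quick Start."""
    return args.output or f"output_{args.precision}.png"

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--prompt", "a happy dog", "--num_steps", "20"])
print(args.prompt, args.num_steps, args.num_samples, resolve_output(args))
```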
🏗️ Model Architecture
┌─────────────────────────────────────────────────────────────────┐
│ PARIS MoE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Input: Text Prompt ──→ CLIP ViT-L/14 ──→ Text Embeddings │
│ │
│ Noise: z ~ N(0,1) ──→ 32×32×4 Latent │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ DiT-B/2 ROUTER │ │
│ │ (12 layers, 768 dim, 129M params) │ │
│ │ │ │ │
│ │ Selects Top-K Experts per Step │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Expert 0 │ │ Expert 1 │ ··· │ Expert 7 │ │
│ │ DiT-XL/2 │ │ DiT-XL/2 │ │ DiT-XL/2 │ │
│ │ 606M │ │ 606M │ │ 606M │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ ▼ │
│ Weighted Velocity Prediction │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SD-VAE DECODER │ │
│ │ Latent ──→ 256×256 RGB │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
├─────────────────────────────────────────────────────────────────┤
│ Total: ~5 Billion Parameters │ 8 Specialized Experts │
└─────────────────────────────────────────────────────────────────┘
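The "~5 Billion Parameters" total can be checked from the per-component counts in the diagram. The short calculation below uses only those figures (8 experts at 606M each plus a 129M router). The CLIP text encoder and the VAE are left out because their sizes are not given here, and the 2-bytes-per-parameter figure is simply the width of a BF16 value.

```python
NUM_EXPERTS = 8
EXPERT_PARAMS = 606e6   # DiT-XL/2 expert, from the diagram
ROUTER_PARAMS = 129e6   # DiT-B/2 router, from the diagram

# Total stored parameters (text encoder and VAE not included).
total_params = NUM_EXPERTS * EXPERT_PARAMS + ROUTER_PARAMS

# Parameters actually evaluated per denoising step with the default top-2 routing.
active_params_top2 = ROUTER_PARAMS + 2 * EXPERT_PARAMS

# BF16 stores each parameter in 2 bytes.
bf16_gib = total_params * 2 / 2**30

print(f"total:  {total_params / 1e9:.2f}B parameters")      # total:  4.98B parameters
print(f"active: {active_params_top2 / 1e9:.2f}B per step")  # active: 1.34B per step
print(f"BF16 weights: {bf16_gib:.1f} GiB")                  # BF16 weights: 9.3 GiB
```

The BF16 estimate lines up with the 9.3 GB weight size listed under Available Weights, which suggests that figure is in binary gigabytes.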
💾 Available Weights
| Format | Size | Quality | Speed | Use Case |
|---|---|---|---|---|
| BF16 | 9.3 GB | ⭐⭐⭐⭐⭐ | Fastest | Production, best quality |
| INT8 | 4.8 GB | ⭐⭐⭐⭐ | Fast | Memory-constrained GPUs |
🖥️ Memory Requirements
| Configuration | GPU VRAM | Speed | Notes |
|---|---|---|---|
| BF16, no offload | ~25 GB | ~3 img/s | Best performance |
| BF16, offload 4 | ~14 GB | ~1 img/s | RTX 4090 / A6000 |
| BF16, offload 6 | ~8 GB | ~0.5 img/s | RTX 3080/4080 |
| INT8, no offload | ~12 GB | ~2 img/s | Good balance |
| INT8, offload 4 | ~8 GB | ~0.5 img/s | Consumer GPUs |
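To pick a configuration for a given GPU, the table above can be treated as a small lookup. The helper below is an illustrative sketch: `pick_config` is a made-up name, the numbers are copied from the table, and "fastest configuration that fits" is just one reasonable selection rule. On ties it keeps the earlier entry, so BF16 is preferred over INT8 at equal speed.

```python
# (precision, experts_offloaded, approx_vram_gb, approx_images_per_second),
# copied from the Memory Requirements table.
CONFIGS = [
    ("bf16", 0, 25, 3.0),
    ("bf16", 4, 14, 1.0),
    ("bf16", 6, 8, 0.5),
    ("int8", 0, 12, 2.0),
    ("int8", 4, 8, 0.5),
]

def pick_config(vram_gb):
    """Return the fastest listed configuration that fits in vram_gb, or None."""
    fitting = [c for c in CONFIGS if c[2] <= vram_gb]
    if not fitting:
        return None
    # max() returns the first maximal item, so earlier rows win ties.
    return max(fitting, key=lambda c: c[3])

def to_flags(config):
    """Render a configuration as generate.py flags."""
    precision, offload, _, _ = config
    return f"--precision {precision} --offload {offload}"

for vram in (32, 24, 8, 4):
    cfg = pick_config(vram)
    print(vram, "GB ->", to_flags(cfg) if cfg else "no listed configuration fits")
```

Note that this rule picks INT8 without offload for a 24 GB card because it is listed as faster. If image quality matters more than throughput, BF16 with --offload 4 is the alternative the table suggests for that class of GPU.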
🔧 Utilities
Benchmarking
python benchmark.py --quick # Fast benchmark
python benchmark.py --output results.md # Full benchmark, save results
Weight Conversion
# Convert PyTorch checkpoints to BF16 SafeTensors
python quantize.py --input /path/to/weights --output ./weights/bf16 --format bf16
# Convert BF16 to INT8
python quantize.py --input ./weights/bf16 --output ./weights/int8 --format int8
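The scheme quantize.py uses is not described here. As general background on what INT8 weight quantization involves, the sketch below shows symmetric per-tensor quantization on a plain list of floats. The function names and the per-tensor scaling are assumptions chosen for illustration, not a description of the actual tool.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

print(codes)
print(restored)
```

Each weight then takes one byte instead of two, which is consistent with the INT8 weights being roughly half the size of the BF16 weights. The rounding error per weight is at most half the scale.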
🚀 Future: Distributed Inference with Tailscale + Erlang
Baguette is being developed as a fully distributed inference engine that can run across multiple machines connected over a Tailscale VPN and orchestrated by an Erlang/OTP supervisor.
🌐 Architecture Vision
┌─────────────────────────────────────────────────────────────────────────┐
│ BAGUETTE DISTRIBUTED NETWORK │
│ (Up to 8 Nodes) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ Tailscale VPN Mesh ┌─────────────┐ │
│ │ Node 1 │◄────────────────────────────►│ Node 2 │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │ Router │ │ │ │ Router │ │ │
│ │ │ VAE │ │ │ │ VAE │ │ │
│ │ │Expert 0 │ │ │ │Expert 1 │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ │ ┌──────────────────┐ │ │
│ └────────►│ Erlang/OTP │◄─────────────┘ │
│ │ Coordinator │ │
│ ┌────────►│ │◄─────────────┐ │
│ │ │ • Load Balance │ │ │
│ │ │ • Fault Tolerant│ │ │
│ │ │ • Auto-Healing │ │ │
│ │ └──────────────────┘ │ │
│ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ Node 3 │◄────────────────────────────►│ Node 4 │ │
│ │ ┌─────────┐ │ ... │ ┌─────────┐ │ │
│ │ │ Router │ │ │ │ Router │ │ │
│ │ │ VAE │ │ (up to 8 nodes) │ │ VAE │ │ │
│ │ │Expert 2 │ │ │ │Expert 3 │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
🎯 Key Features (Planned)
| Feature | Description |
|---|---|
| Self-Organizing Network | Nodes automatically discover peers and negotiate roles |
| Adaptive Load Balancing | Routes requests based on real-time latency and compute availability |
| Auto-Benchmarking | Each node benchmarks GPU/CPU speed, VRAM, RAM, and network throughput |
| Fault Tolerance | Erlang supervisors restart failed nodes, redistribute load |
| 1 Expert Per Node | Each node loads only 1 expert (~2.7 GB VRAM) plus router & VAE |
| Latency-Aware Routing | Prioritizes low-latency nodes for time-sensitive steps |
| Zero Configuration | Just join the Tailscale network and run; peers are discovered automatically |
📊 Node Self-Benchmarking
When a node joins the network, it automatically benchmarks:
┌────────────────────────────────────────┐
│ NODE CAPABILITY REPORT │
├────────────────────────────────────────┤
│ GPU: NVIDIA RTX 4090 │
│ VRAM: 24 GB │
│ GPU Compute: 847 TFLOPS (FP16) │
│ ──────────────────────────────────── │
│ CPU: AMD Ryzen 9 7950X │
│ RAM: 64 GB │
│ CPU Compute: 2.1 TFLOPS │
│ ──────────────────────────────────── │
│ Network Latency to Peers: │
│ → Node 2: 12ms │
│ → Node 3: 8ms │
│ → Node 4: 45ms │
│ Network Bandwidth: 940 Mbps │
│ ──────────────────────────────────── │
│ Assigned Expert: E0 │
│ Status: READY │
└────────────────────────────────────────┘
🔄 Distributed Inference Flow
- Request arrives at any node
- Router runs locally → selects top-K experts needed
- Coordinator dispatches expert calls to appropriate nodes
- Nodes compute in parallel → return velocity predictions
- Results aggregated → Euler step applied
- VAE decodes locally → image returned to requester
This enables running the full 5B-parameter model across consumer hardware: each machine needs only ~4 GB of VRAM to hold one expert plus the router and VAE.
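The flow above can be simulated in a single process to make the control flow concrete. In this sketch a thread pool stands in for the remote nodes, and the router and experts are toy functions. All names, the fixed 0.6/0.4 routing weights, and the direction of the Euler update are assumptions for illustration; the planned Erlang/OTP coordinator and Tailscale transport are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor

def run_router(latent, t):
    """Toy router: returns two (expert_id, weight) pairs with weights summing to 1."""
    first = int(t * 8) % 8
    second = (first + 1) % 8
    return [(first, 0.6), (second, 0.4)]

def call_expert_node(expert_id, latent, t):
    """Stand-in for a remote expert call; returns a toy velocity prediction."""
    return [-x * (expert_id + 1) * 0.1 for x in latent]

def denoise_step(latent, t, dt, pool):
    # Router runs locally and selects the experts needed for this step.
    routes = run_router(latent, t)
    # Expert calls are dispatched in parallel (threads here, nodes in the real design).
    futures = {eid: pool.submit(call_expert_node, eid, latent, t) for eid, _ in routes}
    # Aggregate the returned velocity predictions using the router weights.
    velocity = [0.0] * len(latent)
    for eid, weight in routes:
        for j, v in enumerate(futures[eid].result()):
            velocity[j] += weight * v
    # Euler step (sign and time direction are assumed conventions).
    return [x + dt * v for x, v in zip(latent, velocity)]

def generate(latent, num_steps=30):
    dt = 1.0 / num_steps
    with ThreadPoolExecutor(max_workers=8) as pool:
        for step in range(num_steps):
            latent = denoise_step(latent, step * dt, dt, pool)
    # In the real pipeline the VAE would decode this latent into an image.
    return latent

start = [1.0, -2.0, 0.5]
result = generate(start, num_steps=30)
print(result)
```

Because only the selected experts are called at each step, a node that holds an unselected expert does no work for that step, which is what makes load balancing across nodes relevant.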
📁 Repository Structure
baguette/
├── generate.py # 🎨 Main generation script
├── benchmark.py # 📊 Performance benchmarking
├── quantize.py # 🔧 Weight format conversion
├── requirements.txt # 📦 Python dependencies
├── README.md # 📖 This file
├── src/ # 🧠 Model architecture code
│ ├── models.py # DiT expert & router definitions
│ ├── vae_utils.py # VAE encoding/decoding
│ ├── config.py # Configuration dataclass
│ └── schedules.py # Noise schedules
└── weights/ # 💾 Model weights
├── bf16/ # BFloat16 SafeTensors (9.3 GB)
│ ├── expert_0.safetensors ... expert_7.safetensors
│ ├── router.safetensors
│ └── config.pt
└── int8/ # INT8 Quantized (4.8 GB)
├── expert_0.safetensors ... expert_7.safetensors
└── router.safetensors
🔗 Links
- Original Model: bageldotcom/paris
- This Repository: nbagel/baguette
📜 License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
See LICENSE for details.
Made with 🥖 by the Baguette Team
Distributed inference for everyone