🥖 Baguette

A Distributed Inference Engine for Paris MoE Diffusion Models

License: AGPL v3 | Python 3.10+ | HuggingFace

Fast, efficient inference for the 5-billion-parameter Paris Mixture-of-Experts text-to-image model.


⚡ Quick Start

# Clone the repo
git clone https://huggingface.co/nbagel/baguette
cd baguette

# Install dependencies
pip install uv && uv pip install torch torchvision safetensors transformers diffusers accelerate tqdm

# Generate images
python generate.py --prompt "a cute cat" --num_samples 4

Output: output_bf16.png with 4 generated images.


🎨 Generation Examples

# Basic generation (4 images, top-2 routing, 30 steps)
python generate.py --prompt "sunset over mountains" --num_samples 4

# See expert routing visualization
python generate.py --prompt "abstract art" --visualize

# Faster generation (20 steps instead of the default 30)
python generate.py --prompt "a happy dog" --num_steps 20

# Lower memory usage (offload 4 of the 8 experts to CPU RAM)
python generate.py --prompt "portrait of a scientist" --offload 4

# INT8 quantized (smaller weights)
python generate.py --prompt "enchanted forest" --precision int8

🔮 Expert Routing Visualization

Baguette includes real-time visualization of the MoE router's expert selection. Use --visualize to see which experts are activated:

╭──────────────────────────────────────────────────╮
│           ⚡ EXPERT USAGE DISTRIBUTION            │
├──────────────────────────────────────────────────┤
│ → E4  │████████████████████████████│ 40.6% │
│   E2  │██████████████████████████▎ │ 36.7% │
│   E6  │██████████▌                 │ 14.8% │
│   E1  │███▊                        │  5.5% │
│   E5  │█▋                          │  2.3% │
│   E0  │                            │  0.0% │
│   E3  │                            │  0.0% │
│   E7  │                            │  0.0% │
├──────────────────────────────────────────────────┤
│  Active: 5/8 experts   Calls: 128               │
╰──────────────────────────────────────────────────╯

╭──────────────────────────────────────────────────╮
│            📈 ROUTING TIMELINE                   │
├──────────────────────────────────────────────────┤
│ Step  0  1  2  3  4  5  6  7  8  9 10 11 12 13  │
│ ───────────────────────────────────────────────  │
│  E0   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E1   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E2   ·  ·  ·  ·  ·  ●  ●  ●  ●  ●  ●  ●  ●  ●  │
│  E3   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E4   ·  ·  ●  ●  ●  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E5   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E6   ●  ●  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E7   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
├──────────────────────────────────────────────────┤
│  Routing changes:   2/13 steps (15%)            │
╰──────────────────────────────────────────────────╯

The router dynamically selects different experts based on the noise level at each diffusion timestep. Early steps (high noise) often use different experts than later steps (low noise).
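
The "Routing changes" figure in the timeline can be reproduced from a per-step log of selected experts. The sketch below is illustrative only: it assumes the timeline shows the top-1 expert per step, and `routing_stats` is a hypothetical helper, not part of Baguette. It does not try to reproduce the usage panel's percentages, which count all 128 expert calls.

```python
from collections import Counter

def routing_stats(selected):
    """Summarize a per-step log of top-1 expert indices (hypothetical helper)."""
    usage = Counter(selected)
    total = len(selected)
    # Share of steps in which each expert was the top-1 choice.
    shares = {e: 100.0 * n / total for e, n in usage.items()}
    # A "routing change" is a step whose expert differs from the previous step.
    transitions = len(selected) - 1
    changes = sum(1 for a, b in zip(selected, selected[1:]) if a != b)
    return shares, changes, transitions

# Top-1 expert per step, read off the routing timeline above (steps 0-13).
timeline = [6, 6, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2]
shares, changes, transitions = routing_stats(timeline)
pct = 100 * changes / transitions
print(f"Routing changes: {changes}/{transitions} steps ({pct:.0f}%)")
# prints: Routing changes: 2/13 steps (15%)
```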


📋 Command Reference

Flag | Default | Description
--- | --- | ---
--prompt | "a cute cat" | Text description of the image to generate
--num_samples | 16 | Number of images to generate
--num_steps | 30 | Diffusion sampling steps (15-50)
--cfg_scale | 7.5 | Classifier-free guidance scale (5-12)
--precision | bf16 | Weight precision: bf16 or int8
--topk | 2 | Number of experts per sample (1-8)
--offload | 0 | Experts to offload to CPU RAM (0-7)
--visualize | false | Show expert routing statistics
--output | auto | Custom output filename
--seed | 999 | Random seed for reproducibility
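
For readers who want to script around these flags, here is a minimal argparse sketch that mirrors the table. It is an assumption about how such a parser could look, not a copy of generate.py; in particular, treating the listed ranges for --topk and --offload as hard limits and --visualize as a boolean switch are guesses.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(description="Sketch of generate.py's documented flags")
    p.add_argument("--prompt", type=str, default="a cute cat")
    p.add_argument("--num_samples", type=int, default=16)
    p.add_argument("--num_steps", type=int, default=30)
    p.add_argument("--cfg_scale", type=float, default=7.5)
    p.add_argument("--precision", choices=["bf16", "int8"], default="bf16")
    p.add_argument("--topk", type=int, choices=range(1, 9), default=2)     # 1-8
    p.add_argument("--offload", type=int, choices=range(0, 8), default=0)  # 0-7
    p.add_argument("--visualize", action="store_true")                     # default: false
    p.add_argument("--output", type=str, default="auto")
    p.add_argument("--seed", type=int, default=999)
    return p

# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--prompt", "a happy dog", "--num_steps", "20"])
print(args.prompt, args.num_steps, args.topk, args.precision)
# prints: a happy dog 20 2 bf16
```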

🏗️ Model Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     PARIS MoE ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Input: Text Prompt ──→ CLIP ViT-L/14 ──→ Text Embeddings     │
│                                                                 │
│   Noise: z ~ N(0,1) ──→ 32×32×4 Latent                         │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │                  DiT-B/2 ROUTER                         │  │
│   │            (12 layers, 768 dim, 129M params)            │  │
│   │                         │                               │  │
│   │            Selects Top-K Experts per Step               │  │
│   └─────────────────────────────────────────────────────────┘  │
│                              │                                  │
│          ┌───────────────────┼───────────────────┐             │
│          ▼                   ▼                   ▼             │
│   ┌────────────┐      ┌────────────┐      ┌────────────┐       │
│   │  Expert 0  │      │  Expert 1  │ ···  │  Expert 7  │       │
│   │  DiT-XL/2  │      │  DiT-XL/2  │      │  DiT-XL/2  │       │
│   │   606M     │      │   606M     │      │   606M     │       │
│   └────────────┘      └────────────┘      └────────────┘       │
│          │                   │                   │             │
│          └───────────────────┼───────────────────┘             │
│                              ▼                                  │
│                   Weighted Velocity Prediction                  │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │                 SD-VAE DECODER                          │  │
│   │              Latent ──→ 256×256 RGB                     │  │
│   └─────────────────────────────────────────────────────────┘  │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│  Total: ~5 Billion Parameters  │  8 Specialized Experts        │
└─────────────────────────────────────────────────────────────────┘
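
The "Weighted Velocity Prediction" stage in the diagram can be sketched in plain Python. This assumes the router emits one score per expert and that the top-K scores are renormalized with a softmax before the expert outputs are summed; the source does not specify the weighting scheme, and `topk_weights` and `combine` are hypothetical names.

```python
import math

def topk_weights(router_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected experts only (assumed renormalization).
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: v / z for i, v in exps.items()}

def combine(velocities, weights):
    """Weighted sum of the selected experts' velocity predictions."""
    dim = len(next(iter(velocities.values())))
    return [sum(weights[i] * velocities[i][d] for i in weights) for d in range(dim)]

logits = [0.1, 0.3, 2.0, -1.0, 2.0, 0.0, 0.5, -0.5]  # one score per expert (made up)
weights = topk_weights(logits, k=2)
# Stand-in expert outputs; a real expert would be a DiT-XL/2 forward pass.
velocities = {i: [float(i), 1.0] for i in weights}
print(weights)                        # {2: 0.5, 4: 0.5}
print(combine(velocities, weights))   # [3.0, 1.0]
```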

💾 Available Weights

Format | Size | Quality | Speed | Use Case
--- | --- | --- | --- | ---
BF16 | 9.3 GB | ⭐⭐⭐⭐⭐ | Fastest | Production, best quality
INT8 | 4.8 GB | ⭐⭐⭐⭐ | Fast | Memory-constrained GPUs
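
As a rough sanity check, the sizes above can be estimated from the parameter counts in the architecture diagram. This back-of-the-envelope sketch assumes the weight files contain only the eight experts and the router, and it ignores any quantization metadata, so the INT8 estimate comes out slightly below the listed 4.8 GB.

```python
EXPERT_PARAMS = 606e6   # per expert, from the architecture diagram
ROUTER_PARAMS = 129e6   # DiT-B/2 router
NUM_EXPERTS = 8

total_params = NUM_EXPERTS * EXPERT_PARAMS + ROUTER_PARAMS

def size_gib(params, bytes_per_param):
    """Raw weight size in GiB (2**30 bytes), excluding file overhead."""
    return params * bytes_per_param / 2**30

print(f"total parameters: {total_params / 1e9:.2f} B")          # 4.98 B
print(f"bf16 estimate:    {size_gib(total_params, 2):.1f} GiB")  # 9.3 GiB
print(f"int8 estimate:    {size_gib(total_params, 1):.1f} GiB")  # 4.6 GiB
```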

🖥️ Memory Requirements

Configuration | GPU VRAM | Speed | Notes
--- | --- | --- | ---
BF16, no offload | ~25 GB | ~3 img/s | Best performance
BF16, offload 4 | ~14 GB | ~1 img/s | RTX 4090 / A6000
BF16, offload 6 | ~8 GB | ~0.5 img/s | RTX 3080/4080
INT8, no offload | ~12 GB | ~2 img/s | Good balance
INT8, offload 4 | ~8 GB | ~0.5 img/s | Consumer GPUs
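
One way to use this table is to pick the fastest listed configuration that fits a given amount of VRAM. The helper below is a hypothetical sketch that simply encodes the approximate figures above; it ranks by speed only and ignores the quality difference between BF16 and INT8.

```python
# (name, approx. VRAM in GB, approx. images/s), copied from the table above.
CONFIGS = [
    ("BF16, no offload", 25, 3.0),
    ("BF16, offload 4", 14, 1.0),
    ("BF16, offload 6", 8, 0.5),
    ("INT8, no offload", 12, 2.0),
    ("INT8, offload 4", 8, 0.5),
]

def pick_config(vram_gb):
    """Return the fastest listed configuration that fits, or None.

    Ties are broken by table order, because max() keeps the first maximum.
    """
    fitting = [c for c in CONFIGS if c[1] <= vram_gb]
    return max(fitting, key=lambda c: c[2])[0] if fitting else None

for vram in (6, 8, 16, 48):
    print(vram, "GB ->", pick_config(vram))
```

Under these numbers a 16 GB card lands on "INT8, no offload"; a user who prefers BF16 quality over speed would choose "BF16, offload 4" instead.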

🔧 Utilities

Benchmarking

python benchmark.py --quick                    # Fast benchmark
python benchmark.py --output results.md        # Full benchmark, save results

Weight Conversion

# Convert PyTorch checkpoints to BF16 SafeTensors
python quantize.py --input /path/to/weights --output ./weights/bf16 --format bf16

# Convert BF16 to INT8
python quantize.py --input ./weights/bf16 --output ./weights/int8 --format int8
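
The exact scheme used by quantize.py is not documented here. As a point of reference only, the sketch below shows symmetric per-tensor absmax quantization, one common way to map floating-point weights to INT8; the function names are hypothetical and the real script may use a different method (per-channel scales, for example).

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: list of floats -> (list of ints, scale)."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.0, 0.64]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print(q)  # [50, -127, 0, 64]
# Each restored value is within half a quantization step of the original.
print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2 + 1e-12)
```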

🚀 Future: Distributed Inference with Tailscale + Erlang

Baguette is being developed as a fully distributed inference engine that can run across multiple machines connected via Tailscale VPN, orchestrated by an Erlang/OTP supervisor.

🌐 Architecture Vision

┌─────────────────────────────────────────────────────────────────────────┐
│                    BAGUETTE DISTRIBUTED NETWORK                         │
│                         (Up to 8 Nodes)                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────┐      Tailscale VPN Mesh      ┌─────────────┐         │
│   │   Node 1    │◄────────────────────────────►│   Node 2    │         │
│   │ ┌─────────┐ │                              │ ┌─────────┐ │         │
│   │ │ Router  │ │                              │ │ Router  │ │         │
│   │ │   VAE   │ │                              │ │   VAE   │ │         │
│   │ │Expert 0 │ │                              │ │Expert 1 │ │         │
│   │ └─────────┘ │                              │ └─────────┘ │         │
│   └──────┬──────┘                              └──────┬──────┘         │
│          │                                            │                 │
│          │         ┌──────────────────┐              │                 │
│          └────────►│  Erlang/OTP      │◄─────────────┘                 │
│                    │  Coordinator     │                                 │
│          ┌────────►│                  │◄─────────────┐                 │
│          │         │  • Load Balance  │              │                 │
│          │         │  • Fault Tolerant│              │                 │
│          │         │  • Auto-Healing  │              │                 │
│          │         └──────────────────┘              │                 │
│          │                                            │                 │
│   ┌──────┴──────┐                              ┌──────┴──────┐         │
│   │   Node 3    │◄────────────────────────────►│   Node 4    │         │
│   │ ┌─────────┐ │           ...                │ ┌─────────┐ │         │
│   │ │ Router  │ │                              │ │ Router  │ │         │
│   │ │   VAE   │ │        (up to 8 nodes)       │ │   VAE   │ │         │
│   │ │Expert 2 │ │                              │ │Expert 3 │ │         │
│   │ └─────────┘ │                              │ └─────────┘ │         │
│   └─────────────┘                              └─────────────┘         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

🎯 Key Features (Planned)

Feature | Description
--- | ---
Self-Organizing Network | Nodes automatically discover peers and negotiate roles
Adaptive Load Balancing | Routes requests based on real-time latency and compute availability
Auto-Benchmarking | Each node benchmarks GPU/CPU speed, VRAM, RAM, and network throughput
Fault Tolerance | Erlang supervisors restart failed nodes, redistribute load
1 Expert Per Node | Each node loads only 1 expert (~2.7 GB VRAM) plus router & VAE
Latency-Aware Routing | Prioritizes low-latency nodes for time-sensitive steps
Zero Configuration | Just join the Tailscale network and run: automatic peer discovery

📊 Node Self-Benchmarking

When a node joins the network, it will automatically benchmark its hardware and network links and produce a report along these lines (example values):

┌────────────────────────────────────────┐
│         NODE CAPABILITY REPORT         │
├────────────────────────────────────────┤
│  GPU: NVIDIA RTX 4090                  │
│  VRAM: 24 GB                           │
│  GPU Compute: 847 TFLOPS (FP16)        │
│  ────────────────────────────────────  │
│  CPU: AMD Ryzen 9 7950X                │
│  RAM: 64 GB                            │
│  CPU Compute: 2.1 TFLOPS               │
│  ────────────────────────────────────  │
│  Network Latency to Peers:             │
│    → Node 2: 12ms                      │
│    → Node 3: 8ms                       │
│    → Node 4: 45ms                      │
│  Network Bandwidth: 940 Mbps           │
│  ────────────────────────────────────  │
│  Assigned Expert: E0                   │
│  Status: READY                         │
└────────────────────────────────────────┘
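
Since this feature is still planned, the report format is not fixed. A minimal sketch of how such a report might be represented in Python follows; the `NodeReport` class and its field names are assumptions made for illustration, populated with the example values from the report above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class NodeReport:
    """Hypothetical container for a node's self-benchmark results."""
    name: str
    gpu: str
    vram_gb: float
    ram_gb: float
    bandwidth_mbps: float
    peer_latency_ms: Dict[str, float] = field(default_factory=dict)
    assigned_expert: Optional[int] = None
    status: str = "READY"

    def nearest_peer(self) -> Optional[str]:
        """Name of the peer with the lowest measured latency, if any."""
        if not self.peer_latency_ms:
            return None
        return min(self.peer_latency_ms, key=self.peer_latency_ms.get)

report = NodeReport(
    name="Node 1", gpu="NVIDIA RTX 4090", vram_gb=24, ram_gb=64,
    bandwidth_mbps=940,
    peer_latency_ms={"Node 2": 12, "Node 3": 8, "Node 4": 45},
    assigned_expert=0,
)
print(report.nearest_peer())  # Node 3
```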

🔄 Distributed Inference Flow

  1. Request arrives at any node
  2. Router runs locally → selects top-K experts needed
  3. Coordinator dispatches expert calls to appropriate nodes
  4. Nodes compute in parallel → return velocity predictions
  5. Results aggregated → Euler step applied
  6. VAE decodes locally → image returned to requester

This would let the full 5B-parameter model run across consumer hardware: each machine needs only ~4 GB of VRAM to host one expert alongside the router and VAE.
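
Steps 3 to 5 of the flow above can be simulated on one machine. In this sketch a thread pool stands in for remote nodes, `expert_call` is a made-up placeholder for a DiT forward pass, the router output is hard-coded, and the sign and step size of the Euler update are assumptions; none of it reflects the planned Erlang/OTP coordinator.

```python
from concurrent.futures import ThreadPoolExecutor

def expert_call(expert_id, latent, t):
    """Stand-in for a remote expert; returns a fake velocity for the latent."""
    return [-(expert_id * 0.1) * x for x in latent]

def denoise_step(latent, t, dt, weights, pool):
    # Steps 3-4: dispatch the selected experts in parallel, collect velocities.
    futures = {e: pool.submit(expert_call, e, latent, t) for e in weights}
    velocities = {e: f.result() for e, f in futures.items()}
    # Step 5: aggregate with the router weights, then apply one Euler step.
    v = [sum(weights[e] * velocities[e][i] for e in weights) for i in range(len(latent))]
    return [x + dt * vi for x, vi in zip(latent, v)]

latent = [1.0, -2.0]
weights = {2: 0.5, 4: 0.5}   # step 2 (router output) is hard-coded here
num_steps = 4
with ThreadPoolExecutor(max_workers=8) as pool:
    for step in range(num_steps):
        t = step / num_steps
        latent = denoise_step(latent, t, 1.0 / num_steps, weights, pool)
print(latent)
```

With these placeholder experts each step scales the latent by 0.925, so the loop only demonstrates the dispatch-aggregate-update pattern, not real denoising.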


📁 Repository Structure

baguette/
├── generate.py          # 🎨 Main generation script
├── benchmark.py         # 📊 Performance benchmarking
├── quantize.py          # 🔧 Weight format conversion
├── requirements.txt     # 📦 Python dependencies
├── README.md            # 📖 This file
├── src/                 # 🧠 Model architecture code
│   ├── models.py        # DiT expert & router definitions
│   ├── vae_utils.py     # VAE encoding/decoding
│   ├── config.py        # Configuration dataclass
│   └── schedules.py     # Noise schedules
└── weights/             # 💾 Model weights
    ├── bf16/            # BFloat16 SafeTensors (9.3 GB)
    │   ├── expert_0.safetensors ... expert_7.safetensors
    │   ├── router.safetensors
    │   └── config.pt
    └── int8/            # INT8 Quantized (4.8 GB)
        ├── expert_0.safetensors ... expert_7.safetensors
        └── router.safetensors


📜 License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

See LICENSE for details.


Made with 🥖 by the Baguette Team

Distributed inference for everyone
