linnet-497M

A 497M-parameter Mixture-of-Experts base language model with 8 experts, 2 active experts per token, and 157M active parameters. Trained from scratch with rudyon/pipeline on the HuggingFaceFW/fineweb-edu and mlfoundations/dclm-baseline-1.0 datasets.

Training was done on a single H100 GPU rented on Prime Intellect for about $17.

training status

⚠️ This model is undertrained. Chinchilla-optimal training would require ~19000 steps on ~10B tokens; this checkpoint was saved at step ~5000 (~26% of optimal) due to compute budget constraints. The loss curve was still descending when training stopped.

Metric                Value
Steps completed       5281 / 18965
Tokens seen           ~2.9B / 10B
Final val bpb         ~1.21
HellaSwag (0-shot)    ~38% (random = 25%)
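The ~10B-token budget follows from the Chinchilla rule of thumb of roughly 20 training tokens per parameter; a quick back-of-envelope check (the step count assumes the 524,288-token batch size listed under training):

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 497e6
optimal_tokens = 20 * params              # ~9.9e9, i.e. the ~10B tokens above

tokens_per_step = 524_288                 # batch size in tokens (see training section)
optimal_steps = optimal_tokens / tokens_per_step
print(f"{optimal_tokens:.2e} tokens, ~{optimal_steps:.0f} steps")  # ~19k steps
```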

architecture

The model is a 12-layer causal transformer with the following architecture:

Component             Implementation
Positional encoding   RoPE (base=50000)
Attention             GQA + QK Norm + FlashAttention
FFN                   SwiGLU (8/3 x n_embd hidden dim)
Normalization         RMSNorm
Sequence mixing       Causal depthwise Conv1d (kernel=3)
Sparsity              MoE (8 experts, top-2)
Optimizer             Muon + AdamW
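To make the sparsity row concrete, here is a minimal PyTorch sketch of top-2 routing over SwiGLU experts. This illustrates the general pattern only, not the repository's implementation; all class names, dimensions, and the per-expert loop are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert FFN: SwiGLU with an 8/3 x n_embd hidden dim, as in the table."""
    def __init__(self, n_embd, hidden):
        super().__init__()
        self.w_gate = nn.Linear(n_embd, hidden, bias=False)
        self.w_up = nn.Linear(n_embd, hidden, bias=False)
        self.w_down = nn.Linear(hidden, n_embd, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class Top2MoE(nn.Module):
    """Illustrative MoE FFN: 8 experts, 2 active per token (names are hypothetical)."""
    def __init__(self, n_embd=768, n_experts=8, top_k=2):
        super().__init__()
        hidden = int(8 * n_embd / 3)
        self.router = nn.Linear(n_embd, n_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLUExpert(n_embd, hidden) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, n_embd)
        logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e           # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the 2 selected experts run per token, which is how 497M total parameters yield ~157M active parameters per forward pass.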

training

  • Datasets: HuggingFaceFW/fineweb-edu (~700k docs) + mlfoundations/dclm-baseline-1.0 (~250k docs)
  • Tokenizer: Custom ByteLevelBPE (vocab size: 32768)
  • Batch size: 524,288 tokens
  • Sequence length: 1024
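The batch and sequence settings above determine the per-step shape; simple arithmetic (not repository code):

```python
tokens_per_step = 524_288   # batch size in tokens
seq_len = 1024              # sequence length
sequences_per_step = tokens_per_step // seq_len
print(sequences_per_step)   # 512 sequences of 1024 tokens per optimizer step
```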

usage

Download model.py from the repository alongside the weights, then:

import torch
from tokenizers import Tokenizer
from model import LLM, LLMConfig

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = Tokenizer.from_pretrained("rudyon/linnet-497M")
model = LLM(LLMConfig(depth=12, vocab_size=32768))
state_dict = torch.load("pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.to(device)  # map_location loads tensors to device, but the module itself must still be moved
model.eval()
print(model.generate("Hello!", enc=tokenizer))