# linnet-497M

A 497M-parameter Mixture-of-Experts base language model with 8 experts, 2 active per token (157M active parameters). Trained from scratch with rudyon/pipeline on the HuggingFaceFW/fineweb-edu and mlfoundations/dclm-baseline-1.0 datasets.
Training was done on a single H100 GPU rented on Prime Intellect for about $17.
## training status

⚠️ This model is undertrained. Chinchilla-optimal training would require ~19,000 steps over ~10B tokens; this checkpoint was saved at step ~5,000 (~26% of optimal) due to compute budget constraints. The validation loss was still descending when training stopped.
| Metric | Value |
|---|---|
| Steps completed | 5281 / 18965 |
| Tokens seen | ~2.9B / 10B |
| Final validation loss (bits per byte) | ~1.21 |
| HellaSwag (0-shot) | ~38% (random = 25%) |
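The ~19,000-step figure is consistent with the common Chinchilla heuristic of roughly 20 training tokens per parameter. A quick sanity check (the 20-tokens-per-parameter rule is an assumption here, not something reported by the training run):

```python
# Chinchilla heuristic: ~20 training tokens per parameter (assumed rule of thumb)
params = 497e6
tokens_per_step = 524_288              # batch size from the training section below
optimal_tokens = 20 * params           # ≈ 9.94B tokens
optimal_steps = optimal_tokens / tokens_per_step
print(round(optimal_steps))            # ≈ 18959, close to the 18965 total steps above
```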
## architecture
The model is a 12-layer causal transformer with the following architecture:
| Component | Implementation |
|---|---|
| Positional encoding | RoPE (base=50000) |
| Attention | GQA + QK Norm + FlashAttention |
| FFN | SwiGLU (8/3 x n_embd hidden dim) |
| Normalization | RMSNorm |
| Sequence mixing | Causal depthwise Conv1d (kernel=3) |
| Sparsity | MoE (8 experts, top-2) |
| Optimizer | Muon + AdamW |
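The actual MoE layer lives in the repository's model.py and is not reproduced here; as a minimal sketch of the pattern named in the table (top-2 routing over 8 SwiGLU experts), it might look like the following. Class names, the bias-free linears, and the renormalized softmax over selected experts are illustrative assumptions, not the model's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # SwiGLU expert FFN with hidden dim ~ (8/3) * n_embd, as in the table above
    def __init__(self, n_embd):
        super().__init__()
        hidden = int(8 * n_embd / 3)
        self.w_gate = nn.Linear(n_embd, hidden, bias=False)
        self.w_up = nn.Linear(n_embd, hidden, bias=False)
        self.w_down = nn.Linear(hidden, n_embd, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class TopKMoE(nn.Module):
    # 8 experts, top-2 routing: each token's output is the probability-weighted
    # sum of the outputs of its two selected experts
    def __init__(self, n_embd, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(n_embd, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLU(n_embd) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                     # x: (batch, seq, n_embd)
        logits = self.router(x)               # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e       # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

The per-expert loop is the readable but slow formulation; production implementations typically gather tokens per expert into dense batches instead.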
## training
- Datasets: HuggingFaceFW/fineweb-edu (~700k docs) + mlfoundations/dclm-baseline-1.0 (~250k docs)
- Tokenizer: Custom ByteLevelBPE (vocab size: 32768)
- Batch size: 524,288 tokens
- Sequence length: 1024
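At 524,288 tokens per batch and a sequence length of 1024, each optimizer step sees 512 sequences, which on a single H100 implies gradient accumulation. The arithmetic follows from the numbers above; the micro-batch size is an assumption for illustration, not a reported value:

```python
tokens_per_step = 524_288
seq_len = 1024
seqs_per_step = tokens_per_step // seq_len  # 512 sequences per optimizer step
micro_batch = 32                            # assumed per-device micro-batch size
accum_steps = seqs_per_step // micro_batch  # 16 forward/backward passes per step
print(seqs_per_step, accum_steps)           # 512 16
```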
## usage
Download model.py from the repository alongside the weights, then:

```python
import torch
from tokenizers import Tokenizer
from model import LLM, LLMConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = Tokenizer.from_pretrained("rudyon/linnet-497M")
model = LLM(LLMConfig(depth=12, vocab_size=32768))

# Load the checkpoint weights and move the model to the selected device
state_dict = torch.load("pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.to(device)
model.eval()

print(model.generate("Hello!", enc=tokenizer))
```