linnet-497M

A 497M-parameter Mixture-of-Experts base language model with 8 experts, 2 active experts per token, and 157M active parameters. Trained from scratch with rudyon/pipeline on the HuggingFaceFW/fineweb-edu and mlfoundations/dclm-baseline-1.0 datasets.

Training was done on a single H100 GPU rented on Prime Intellect for about $17.

training status

⚠️ This model is undertrained. Chinchilla-optimal training would require ~19000 steps on ~10B tokens; this checkpoint was saved at step ~5000 (~26% of optimal) due to compute budget constraints. The loss curve was still descending when training stopped.

Metric                Value
Steps completed       5281 / 18965
Tokens seen           ~2.9B / 10B
Final val bpb         ~1.21
HellaSwag (0-shot)    ~38% (random = 25%)
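The ~10B-token budget follows from the Chinchilla rule of thumb of roughly 20 training tokens per parameter; a quick back-of-envelope check (the step count assumes the 524,288-token batch size listed under training):

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 497e6
optimal_tokens = 20 * params              # ~9.9e9, i.e. the ~10B tokens above

tokens_per_step = 524_288                 # batch size in tokens (see training section)
optimal_steps = optimal_tokens / tokens_per_step
print(f"{optimal_tokens:.2e} tokens, ~{optimal_steps:.0f} steps")  # ~19k steps
```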

architecture

The model is a 12-layer causal transformer with the following architecture:

Component             Implementation
Positional encoding   RoPE (base=50000)
Attention             GQA + QK Norm + FlashAttention
FFN                   SwiGLU (8/3 x n_embd hidden dim)
Normalization         RMSNorm
Sequence mixing       Causal depthwise Conv1d (kernel=3)
Sparsity              MoE (8 experts, top-2)
Optimizer             Muon + AdamW
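To make the sparsity row concrete, here is a minimal PyTorch sketch of top-2 routing over SwiGLU experts. This illustrates the general pattern only, not the repository's implementation; all class names, dimensions, and the per-expert loop are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert FFN: SwiGLU with an 8/3 x n_embd hidden dim, as in the table."""
    def __init__(self, n_embd, hidden):
        super().__init__()
        self.w_gate = nn.Linear(n_embd, hidden, bias=False)
        self.w_up = nn.Linear(n_embd, hidden, bias=False)
        self.w_down = nn.Linear(hidden, n_embd, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class Top2MoE(nn.Module):
    """Illustrative MoE FFN: 8 experts, 2 active per token (names are hypothetical)."""
    def __init__(self, n_embd=768, n_experts=8, top_k=2):
        super().__init__()
        hidden = int(8 * n_embd / 3)
        self.router = nn.Linear(n_embd, n_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLUExpert(n_embd, hidden) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, n_embd)
        logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e           # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the 2 selected experts run per token, which is how 497M total parameters yield ~157M active parameters per forward pass.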

training

  • Datasets: HuggingFaceFW/fineweb-edu (~700k docs) + mlfoundations/dclm-baseline-1.0 (~250k docs)
  • Tokenizer: Custom ByteLevelBPE (vocab size: 32768)
  • Batch size: 524,288 tokens
  • Sequence length: 1024
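The batch and sequence settings above determine the per-step shape; simple arithmetic (not repository code):

```python
tokens_per_step = 524_288   # batch size in tokens
seq_len = 1024              # sequence length
sequences_per_step = tokens_per_step // seq_len
print(sequences_per_step)   # 512 sequences of 1024 tokens per optimizer step
```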

usage

Download model.py from the repository alongside the weights, then:

import torch
from tokenizers import Tokenizer
from model import LLM, LLMConfig

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = Tokenizer.from_pretrained("rudyon/linnet-497M")
model = LLM(LLMConfig(depth=12, vocab_size=32768))
state_dict = torch.load("pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.to(device)  # map_location loads tensors to device, but the module itself must still be moved
model.eval()
print(model.generate("Hello!", enc=tokenizer))