124M GPT with Symbolic Reasoning Distillation

A 124M-parameter GPT-2-style model trained from scratch on FineWeb-Edu, with knowledge distillation from SmolLM-135M-Instruct as the teacher.

| Component       | Value                            |
|-----------------|----------------------------------|
| Parameters      | ~124M                            |
| Layers          | 12                               |
| Attention heads | 12                               |
| Embedding dim   | 768                              |
| Context length  | 512 tokens                       |
| Loss            | 0.5 CE + 0.5 KL (sketched below) |
| Precision       | F32 (safetensors)                |
| Hardware        | 1x A100                          |
| Training time   | ~75 min                          |
| Training tokens | 327,680,000                      |
| Best loss       | 326.0111                         |
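
A minimal sketch of the training objective, assuming standard logit distillation in PyTorch. The card only specifies the equal 0.5/0.5 weighting of cross-entropy and KL; the temperature, reduction, and function names here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      alpha=0.5, temperature=1.0):
    """0.5 * CE + 0.5 * KL, as listed in the table above.

    temperature=1.0 is an assumption; the card does not state one.
    """
    vocab = student_logits.size(-1)
    s = student_logits.view(-1, vocab)   # (batch * seq, vocab)
    t = teacher_logits.view(-1, vocab)

    # Hard-label cross-entropy against the FineWeb-Edu next-token targets.
    ce = F.cross_entropy(s, targets.view(-1))

    # KL(teacher || student) over softened distributions, averaged per token.
    kl = F.kl_div(
        F.log_softmax(s / temperature, dim=-1),
        F.log_softmax(t / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature**2

    return alpha * ce + (1.0 - alpha) * kl

# Toy shapes: batch 2, sequence 8, vocab 100.
s = torch.randn(2, 8, 100)
t = torch.randn(2, 8, 100)
y = torch.randint(0, 100, (2, 8))
print(distillation_loss(s, t, y))
```

With `batchmean` over the flattened (batch * seq, vocab) view, the KL term is a per-token average, so it sits on the same scale as the cross-entropy term and the 0.5/0.5 blend is meaningful.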
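A minimal sketch of loading and sampling the model, assuming the `farpluto/zubenelgenubi-124m` checkpoint loads through the standard `transformers` API; the prompt is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id taken from this card.
repo_id = "farpluto/zubenelgenubi-124m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Greedy generation within the 512-token context window.
inputs = tokenizer("The distributive property states that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```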