Autoresearch-V2-Base: Arabic-First Language Model (540M)
A 540M parameter language model trained from scratch with an Arabic-first 3-phase curriculum on 25B tokens. V2 of the autoresearch series, built as an educational and research project to explore training LLMs for Arabic from the ground up.
Model Details
| Property | Value |
|---|---|
| Parameters | 540M total (235M scaling) |
| Architecture | GPT-2 style transformer (nanochat) |
| Layers | 16 |
| Hidden dim | 1024 |
| Heads | 8 (head_dim=128) |
| KV heads | 8 |
| Sequence length | 2048 |
| Vocab size | 32,768 (custom BPE, proportional Arabic/English/Code/Math) |
| Training framework | nanochat |
| Precision | FP8 (tensorwise scaling) with BF16 compute |
| Hardware | 8× NVIDIA H200 141GB SXM (Australia) |
| Flash Attention | FA3 (Hopper) |
What's New in V2
| | V1 | V2 |
|---|---|---|
| Training tokens | 6.55B | 25B |
| BPB | 1.155 | 0.724 |
| Curriculum | 2-phase (narrow → broad) | 3-phase (broad → focused → full) |
| Tokenizer | Arabic-majority BPE | Proportional BPE (ar 35%, en 30%, code 15%, math 15%, classical 5%) |
| Data | 80 files, Arabic-heavy | 434 parquets, balanced multilingual |
| Training time | 58.5 min | 181 min |
| Cost | ~$18 | ~$56 |
| English fertility | Poor (17 tokens/sentence) | Good (5 tokens/sentence) |
Training Results
| Metric | Value |
|---|---|
| Final BPB | 0.724 |
| Total training steps | 47,674 |
| Total tokens | ~25B |
| Training time | 181 minutes |
| Training cost | ~$56 |
| Hardware | 8× H200 SXM |
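As a sanity check, the reported token count and wall-clock time imply roughly the following throughput. This is back-of-envelope arithmetic derived from the table above, not a measured figure:

```python
# Rough throughput implied by the reported training numbers
# (illustrative arithmetic only, not from training logs).
tokens = 25e9       # ~25B tokens
minutes = 181       # reported training time
gpus = 8            # 8x H200

tok_per_sec = tokens / (minutes * 60)
print(f"{tok_per_sec / 1e6:.1f}M tok/s total, "
      f"{tok_per_sec / gpus / 1e3:.0f}K tok/s per GPU")
# -> roughly 2.3M tok/s total, ~288K tok/s per GPU
```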
Benchmark Results (Pre-SFT Baseline)
Evaluated using lm-evaluation-harness (EleutherAI) at step 47,674. The harness scores each answer choice by computing the model's log-probability of that text as a natural continuation: no generation or prompting, just measuring which completion the model finds most likely.
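The scoring scheme described above can be sketched in a few lines. This is an illustrative stand-in, not the actual lm-evaluation-harness code; `pick_choice`, `toy_logprob`, and the whitespace "tokenizer" are hypothetical toys used only to make the example runnable:

```python
import math

def pick_choice(token_logprob, context, choices):
    """Score each choice by the total log-probability of its tokens as a
    continuation of the context, and return the argmax (harness-style)."""
    def score(choice):
        tokens = choice.split()  # toy whitespace "tokenizer"
        return sum(token_logprob(context, t) for t in tokens)
    return max(choices, key=score)

# Toy stand-in for a language model: tokens seen in a tiny "corpus"
# get higher probability (add-one smoothing keeps log() finite).
CORPUS = "the cat sat on the mat".split()
def toy_logprob(context, token):
    count = CORPUS.count(token) + 1
    return math.log(count / (len(CORPUS) + 1))

best = pick_choice(toy_logprob, "the cat sat on the", ["mat", "quantum"])
print(best)  # "mat" scores higher because it appears in the corpus
```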
English
| Benchmark | acc | acc_norm | Random | Δ |
|---|---|---|---|---|
| HellaSwag | 31.8% | 35.2% | 25% | +10.2% |
| ARC Easy | 32.2% | 31.9% | 25% | +6.9% |
| ARC Challenge | 21.1% | 24.5% | 25% | -0.5% |
| PIQA | 58.1% | 55.6% | 50% | +5.6% |
| BoolQ | 61.4% | – | 50% | +11.4% |
| Winogrande | 49.7% | – | 50% | -0.3% |
| LAMBADA | 19.2% | – | ~0% | +19.2% |
Arabic (Machine-Translated Benchmarks)
| Benchmark | acc | acc_norm | Random | Δ |
|---|---|---|---|---|
| AR HellaSwag | 25.5% | 26.1% | 25% | +1.1% |
| AR ARC Easy | 25.1% | 25.5% | 25% | +0.5% |
| AR PIQA | 50.8% | 51.0% | 50% | +1.0% |
| AR BoolQ | 59.5% | 62.1% | 50% | +12.1% |
Arabic-Native Benchmarks
| Benchmark | acc | Random | Δ |
|---|---|---|---|
| ACVA (Arabic Cultural Values) | 55.9% | 25% | +30.9% |
| ArabicMMLU | 31.3% | 25% | +6.3% |
| Arabic Exams | 25.9% | 25% | +0.9% |
| Belebele (Reading Comprehension) | 23.8% | 25% | -1.2% |
ACVA top categories: Ancient Egypt 73.7%, Arab Empire 70.2%, Lebanon 80.0%, Oman 80.0%, Palestine 71.8%, Entertainment 74.6%, Islamic Law 59.0%
ArabicMMLU top subjects: Islamic Studies (primary) 43.2%, Computer Science (primary) 50.0%, Driving Test 42.9%
Key Findings
- ACVA 55.9% is the headline result, 30 points above random: the model has genuine Arabic cultural knowledge from the Shamela, OpenITI, Quran, and Hadith training data
- BoolQ: 61.4% English vs 62.1% Arabic. Arabic beats English, evidence that the Arabic-first training works
- HellaSwag (+10%) and PIQA (+6%) show a commonsense-reasoning signal
- LAMBADA (19.2%) is non-trivial for a 540M model on an exact-match task
- Machine-translated benchmarks undercount Arabic ability: native Arabic benchmarks (ACVA) reveal a much stronger signal than translated English tasks
- Arabic Exams and Belebele sit at random chance: multi-step reasoning is still beyond this scale
- The model is strongest on cultural/historical knowledge and weakest on formal academic reasoning
- These are pre-SFT numbers; SFT should significantly improve structured tasks
Training
3-Phase Curriculum (ADR-040)
| Phase | Steps | Tokens | Focus |
|---|---|---|---|
| Phase 1: Broad Foundation | 0–7,151 (15%) | ~3.75B | English + Code + Math + Arabic web |
| Phase 2: Arabic Immersion | 7,151–19,069 (25%) | ~6.25B | Quran + Hadith + Shamela + OpenITI + OSCAR Arabic |
| Phase 3: Full Mix | 19,069–47,674 (60%) | ~15B | Everything shuffled together |
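Dispatching data by training step can be sketched as below. The phase boundaries come from the table above; the function and phase names are hypothetical, not the actual nanochat training config:

```python
# Step boundaries from the 3-phase curriculum table (ADR-040).
# Phase names are illustrative labels, not real config keys.
PHASES = [
    (7_151,  "broad_foundation"),   # Phase 1 ends at step 7,151
    (19_069, "arabic_immersion"),   # Phase 2 ends at step 19,069
    (47_674, "full_mix"),           # Phase 3 runs to the final step
]

def phase_for_step(step):
    """Return the curriculum phase active at a given training step."""
    for end, name in PHASES:
        if step < end:
            return name
    return PHASES[-1][1]  # past the last boundary: stay in the final mix

print(phase_for_step(0))       # broad_foundation
print(phase_for_step(10_000))  # arabic_immersion
print(phase_for_step(30_000))  # full_mix
```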
Data Sources
~25B tokens across 434 pre-tokenized parquet files:
| Category | Description |
|---|---|
| Arabic Web | OSCAR Arabic, CC-100 Arabic, Arabic Wikipedia |
| Arabic Classical | OpenITI, Shamela Library, Quran, Hadith collections |
| English | FineWeb-Edu (ClimbMix) |
| Code | The Stack v2 (Python + JavaScript) |
| Math | OpenWebMath |
Tokenizer (V2)
Custom 32K BPE tokenizer trained with proportional category sampling:
| Property | Value |
|---|---|
| Algorithm | Pure BPE (no morphological pre-segmentation) |
| Vocab size | 32,768 |
| Training data | 2B chars, 705K docs, proportional sampling |
| Sampling ratio | arabic:35%, english:30%, code:15%, math:15%, classical:5% |
| Arabic fertility | 4.67–7.80 chars/token |
| English fertility | ~5 tokens/sentence (3.5× improvement over V1) |
| Layout | 0–31999 BPE, 32000–32511 reserved (512 slots), 32756–32767 special tokens |
Special tokens: <|think|>, <|/think|>, <|tool_call|>, <|tool_result|>, <|bos|>, <|eos|>, <|pad|>, <|sep|>, <|system|>, <|user|>, <|assistant|>, <|eot|>
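Fertility figures like the ones above can be measured with a short helper. This is a hypothetical sketch; `toy_encode` stands in for the real 32K BPE tokenizer so the example runs stand-alone:

```python
def fertility_chars_per_token(text, encode):
    """Characters per token: higher means the tokenizer compresses this
    text better (fewer tokens needed per character)."""
    tokens = encode(text)
    return len(text) / max(len(tokens), 1)

# Toy whitespace "encoder" as a stand-in for the trained BPE tokenizer.
toy_encode = str.split

print(round(fertility_chars_per_token("in the name of god", toy_encode), 2))
# -> 3.6 (18 characters / 5 tokens)
```

With the real tokenizer, the same function applied to Arabic text would produce the 4.67–7.80 chars/token range reported in the table.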
Model Health (Spectral Analysis)
Post-training analysis of model weights shows:
- Healthy weight distributions: smooth Gaussian curves across all layers, no dead neurons
- Full-rank layers: SVD spectrum shows effective rank 47/64; all layers use their full capacity
- Language separation in embeddings: PCA reveals that Arabic and English tokens form distinct clusters in embedding space
- No redundant layers: the cosine-similarity matrix shows all 16 blocks are distinct
- ~0% sparsity: all parameters are actively used
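The effective-rank check can be reproduced in spirit with a small SVD. This is an illustrative sketch, not the actual analysis script; the function name and threshold are assumptions:

```python
import numpy as np

def effective_rank(W, thresh=1e-6):
    """Count singular values above thresh * largest singular value --
    a rough proxy for how much of a weight matrix's capacity is in use."""
    s = np.linalg.svd(W, compute_uv=False)
    return int((s > thresh * s[0]).sum())

rng = np.random.default_rng(0)
full = rng.standard_normal((64, 64))  # random Gaussian: near full rank
low = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 64))  # rank 2

print(effective_rank(full), effective_rank(low))  # 64 2
```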
Usage
```python
import pickle
import sys

import torch

sys.path.insert(0, "nanochat")  # clone nanochat first
from nanochat.gpt import GPT, GPTConfig

# Load tokenizer
with open("tokenizer.pkl", "rb") as f:
    enc = pickle.load(f)

# Load model
checkpoint = torch.load("model_047674.pt", map_location="cpu", weights_only=True)
config = GPTConfig(
    vocab_size=32768, n_layer=16, n_head=8, n_kv_head=8,
    n_embd=1024, sequence_len=2048, window_pattern="L",
)
model = GPT(config)
model.load_state_dict(checkpoint, strict=False)
model.eval()

# Generate from an Arabic prompt: the basmala,
# "In the name of God, the Most Gracious, the Most Merciful"
bos = enc.encode_single_token("<|bos|>")
tokens = [bos] + enc.encode("بسم الله الرحمن الرحيم")
output = enc.decode(list(model.generate(tokens, max_tokens=200, temperature=0.8, top_k=50)))
print(output)
```
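The `temperature` and `top_k` knobs passed to `generate` can be illustrated with a minimal self-contained sampler. This is not the actual nanochat implementation, just a sketch of the standard top-k sampling technique those parameters control:

```python
import math
import random

def sample_top_k(logits, k=50, temperature=0.8, rng=random.Random(0)):
    """Minimal top-k sampling: keep the k highest logits, softmax them at
    the given temperature, and draw one token index from the result."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax over the kept logits (shifted for stability).
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 0.5]
print(sample_top_k(logits, k=2))  # only indices 1 and 3 are candidates
```

Lower temperature sharpens the distribution toward the argmax; smaller `k` cuts off the long tail of unlikely tokens.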
Files
| File | Description |
|---|---|
| `model_047674.pt` | Final checkpoint (step 47,674), 1.5 GB |
| `model_030000.pt` | End of Phase 2 checkpoint |
| `model_019069.pt` | End of Phase 1→2 transition |
| `model_007151.pt` | End of Phase 1 |
| `model_*.pt` | Checkpoints every 2K steps |
| `meta_*.json` | Training metadata for each checkpoint |
Tokenizer files are in the V1 repo (same architecture, different tokenizer) or can be found in the source repo.
Limitations
- Base model only: generates text continuations, not conversations. SFT is required for chat.
- 540M scale: educational/research scale; reasoning benchmarks (ARC Challenge, Winogrande) are at random chance.
- Arabic reasoning gap: the model learned the Arabic language (good BPB, strong BoolQ), but Arabic reasoning on structured tasks is weak at this scale.
- Non-commercial: CC BY-NC 4.0 license; research and educational use only.
Training Journey
This model is part of the autoresearch project, a learning-in-public effort documenting every decision and failure:
- V1: 6.55B tokens, 2-phase curriculum, BPB 1.155, $18 → AENSaid/autoresearch-540m
- V2: 25B tokens, 3-phase curriculum, BPB 0.724, $56 → this repo
- Total project cost: ~$96 across all attempts
- 50 Architecture Decision Records and 100 Lessons Learned in the source repo
On What Weights Are
The model starts as an empty room (Locke's tabula rasa). Training is 25B tokens of people and objects passing through. After training, the data is gone, but the weights carry the effect of everything that passed through. Like scuff marks on the floor after the furniture is removed. The room "remembers" without containing any of the original objects.
The weights aren't knowledge. They're the residue of 25B tokens having passed through 540M parameters.
Next Steps
- SFT: Chat, Thinking/Reasoning, Tool calling
- Post-SFT benchmarking
- Arabic-native benchmarks (ACVA, Arabic Exams)
Citation
```bibtex
@misc{autoresearch-v2-2026,
  title={Autoresearch-V2-Base: Arabic-First Language Model},
  author={AENSaid},
  year={2026},
  url={https://huggingface.co/AENSaid/autoresearch-v2-base}
}
```
Acknowledgments
- nanochat by Andrej Karpathy: training framework
- OpenITI: classical Arabic scholarly texts
- Built with Claude Code (Anthropic)