Autoresearch-V2-Base: Arabic-First Language Model (540M)
A 540M parameter language model trained from scratch with an Arabic-first 3-phase curriculum on 25B tokens. V2 of the autoresearch series, built as an educational and research project to explore training LLMs for Arabic from the ground up.
Model Details
| Property | Value |
|---|---|
| Parameters | 540M total (235M scaling) |
| Architecture | GPT-2 style transformer (nanochat) |
| Layers | 16 |
| Hidden dim | 1024 |
| Heads | 8 (head_dim=128) |
| KV heads | 8 |
| Sequence length | 2048 |
| Vocab size | 32,768 (custom BPE, proportional Arabic/English/Code/Math) |
| Training framework | nanochat |
| Precision | FP8 (tensorwise scaling) with BF16 compute |
| Hardware | 8× NVIDIA H200 141GB SXM (Australia) |
| Flash Attention | FA3 (Hopper) |
What's New in V2
| | V1 | V2 |
|---|---|---|
| Training tokens | 6.55B | 25B |
| BPB | 1.155 | 0.724 |
| Curriculum | 2-phase (narrow → broad) | 3-phase (broad → focused → full) |
| Tokenizer | Arabic-majority BPE | Proportional BPE (ar 35%, en 30%, code 15%, math 15%, classical 5%) |
| Data | 80 files, Arabic-heavy | 434 parquets, balanced multilingual |
| Training time | 58.5 min | 181 min |
| Cost | ~$18 | ~$56 |
| English fertility | Poor (17 tokens/sentence) | Good (5 tokens/sentence) |
Training Results
| Metric | Value |
|---|---|
| Final BPB | 0.724 |
| Total training steps | 47,674 |
| Total tokens | ~25B |
| Training time | 181 minutes |
| Training cost | ~$56 |
| Hardware | 8× H200 SXM |
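As a sanity check, the reported token count and wall-clock time imply roughly the following throughput. This is back-of-envelope arithmetic derived from the table above, not a measured figure:

```python
# Rough throughput implied by the reported training numbers
# (illustrative arithmetic only, not from training logs).
tokens = 25e9       # ~25B tokens
minutes = 181       # reported training time
gpus = 8            # 8x H200

tok_per_sec = tokens / (minutes * 60)
print(f"{tok_per_sec / 1e6:.1f}M tok/s total, "
      f"{tok_per_sec / gpus / 1e3:.0f}K tok/s per GPU")
# -> roughly 2.3M tok/s total, ~288K tok/s per GPU
```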
Benchmark Results (Pre-SFT Baseline)
Evaluated using lm-evaluation-harness (EleutherAI) at step 47,674. The harness scores each answer choice by computing the model's log-probability of that text as a natural continuation: no generation or prompting, just measuring which completion the model finds most likely.
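The scoring scheme described above can be sketched in a few lines. This is an illustrative stand-in, not the actual lm-evaluation-harness code; `pick_choice`, `toy_logprob`, and the whitespace "tokenizer" are hypothetical toys used only to make the example runnable:

```python
import math

def pick_choice(token_logprob, context, choices):
    """Score each choice by the total log-probability of its tokens as a
    continuation of the context, and return the argmax (harness-style)."""
    def score(choice):
        tokens = choice.split()  # toy whitespace "tokenizer"
        return sum(token_logprob(context, t) for t in tokens)
    return max(choices, key=score)

# Toy stand-in for a language model: tokens seen in a tiny "corpus"
# get higher probability (add-one smoothing keeps log() finite).
CORPUS = "the cat sat on the mat".split()
def toy_logprob(context, token):
    count = CORPUS.count(token) + 1
    return math.log(count / (len(CORPUS) + 1))

best = pick_choice(toy_logprob, "the cat sat on the", ["mat", "quantum"])
print(best)  # "mat" scores higher because it appears in the corpus
```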
English
| Benchmark | acc | acc_norm | Random | Δ |
|---|---|---|---|---|
| HellaSwag | 31.8% | 35.2% | 25% | +10.2% |
| ARC Easy | 32.2% | 31.9% | 25% | +6.9% |
| ARC Challenge | 21.1% | 24.5% | 25% | -0.5% |
| PIQA | 58.1% | 55.6% | 50% | +5.6% |
| BoolQ | 61.4% | – | 50% | +11.4% |
| Winogrande | 49.7% | – | 50% | -0.3% |
| LAMBADA | 19.2% | – | ~0% | +19.2% |
Arabic (Machine-Translated Benchmarks)
| Benchmark | acc | acc_norm | Random | Δ |
|---|---|---|---|---|
| AR HellaSwag | 25.5% | 26.1% | 25% | +1.1% |
| AR ARC Easy | 25.1% | 25.5% | 25% | +0.5% |
| AR PIQA | 50.8% | 51.0% | 50% | +1.0% |
| AR BoolQ | 59.5% | 62.1% | 50% | +12.1% |
Arabic-Native Benchmarks
| Benchmark | acc | Random | Δ |
|---|---|---|---|
| ACVA (Arabic Cultural Values) | 55.9% | 25% | +30.9% |
| ArabicMMLU | 31.3% | 25% | +6.3% |
| Arabic Exams | 25.9% | 25% | +0.9% |
| Belebele (Reading Comprehension) | 23.8% | 25% | -1.2% |
ACVA top categories: Ancient Egypt 73.7%, Arab Empire 70.2%, Lebanon 80.0%, Oman 80.0%, Palestine 71.8%, Entertainment 74.6%, Islamic Law 59.0%
ArabicMMLU top subjects: Islamic Studies (primary) 43.2%, Computer Science (primary) 50.0%, Driving Test 42.9%
Key Findings
- ACVA 55.9% is the headline result, 30 points above random: the model has genuine Arabic cultural knowledge from the Shamela, OpenITI, Quran, and Hadith training data
- BoolQ: 61.4% English vs 62.1% Arabic. Arabic beats English, evidence that the Arabic-first training works
- HellaSwag (+10%) and PIQA (+6%) show a commonsense-reasoning signal
- LAMBADA (19.2%) is non-trivial for a 540M model on an exact-match task
- Machine-translated benchmarks undercount Arabic ability: native Arabic benchmarks (ACVA) reveal a much stronger signal than translated English tasks
- Arabic Exams and Belebele sit at random chance: multi-step reasoning is still beyond this scale
- The model is strongest on cultural/historical knowledge and weakest on formal academic reasoning
- These are pre-SFT numbers; SFT should significantly improve structured tasks
Training
3-Phase Curriculum (ADR-040)
| Phase | Steps | Tokens | Focus |
|---|---|---|---|
| Phase 1: Broad Foundation | 0–7,151 (15%) | ~3.75B | English + Code + Math + Arabic web |
| Phase 2: Arabic Immersion | 7,151–19,069 (25%) | ~6.25B | Quran + Hadith + Shamela + OpenITI + OSCAR Arabic |
| Phase 3: Full Mix | 19,069–47,674 (60%) | ~15B | Everything shuffled together |
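Dispatching data by training step can be sketched as below. The phase boundaries come from the table above; the function and phase names are hypothetical, not the actual nanochat training config:

```python
# Step boundaries from the 3-phase curriculum table (ADR-040).
# Phase names are illustrative labels, not real config keys.
PHASES = [
    (7_151,  "broad_foundation"),   # Phase 1 ends at step 7,151
    (19_069, "arabic_immersion"),   # Phase 2 ends at step 19,069
    (47_674, "full_mix"),           # Phase 3 runs to the final step
]

def phase_for_step(step):
    """Return the curriculum phase active at a given training step."""
    for end, name in PHASES:
        if step < end:
            return name
    return PHASES[-1][1]  # past the last boundary: stay in the final mix

print(phase_for_step(0))       # broad_foundation
print(phase_for_step(10_000))  # arabic_immersion
print(phase_for_step(30_000))  # full_mix
```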
Data Sources
~25B tokens across 434 pre-tokenized parquet files:
| Category | Description |
|---|---|
| Arabic Web | OSCAR Arabic, CC-100 Arabic, Arabic Wikipedia |
| Arabic Classical | OpenITI, Shamela Library, Quran, Hadith collections |
| English | FineWeb-Edu (ClimbMix) |
| Code | The Stack v2 (Python + JavaScript) |
| Math | OpenWebMath |
Tokenizer (V2)
Custom 32K BPE tokenizer trained with proportional category sampling:
| Property | Value |
|---|---|
| Algorithm | Pure BPE (no morphological pre-segmentation) |
| Vocab size | 32,768 |
| Training data | 2B chars, 705K docs, proportional sampling |
| Sampling ratio | arabic:35%, english:30%, code:15%, math:15%, classical:5% |
| Arabic fertility | 4.67–7.80 chars/token |
| English fertility | ~5 tokens/sentence (3.5× improvement over V1) |
| Layout | 0–31999 BPE, 32000–32511 reserved (512 slots), 32756–32767 special tokens |
Special tokens: <|think|>, <|/think|>, <|tool_call|>, <|tool_result|>, <|bos|>, <|eos|>, <|pad|>, <|sep|>, <|system|>, <|user|>, <|assistant|>, <|eot|>
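Fertility figures like the ones above can be measured with a short helper. This is a hypothetical sketch; `toy_encode` stands in for the real 32K BPE tokenizer so the example runs stand-alone:

```python
def fertility_chars_per_token(text, encode):
    """Characters per token: higher means the tokenizer compresses this
    text better (fewer tokens needed per character)."""
    tokens = encode(text)
    return len(text) / max(len(tokens), 1)

# Toy whitespace "encoder" as a stand-in for the trained BPE tokenizer.
toy_encode = str.split

print(round(fertility_chars_per_token("in the name of god", toy_encode), 2))
# -> 3.6 (18 characters / 5 tokens)
```

With the real tokenizer, the same function applied to Arabic text would produce the 4.67–7.80 chars/token range reported in the table.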
Model Health (Spectral Analysis)
Post-training analysis of model weights shows:
- Healthy weight distributions: smooth Gaussian curves across all layers, no dead neurons
- Full-rank layers: SVD spectrum shows effective rank 47/64; all layers use their full capacity
- Language separation in embeddings: PCA reveals that Arabic and English tokens form distinct clusters in embedding space
- No redundant layers: the cosine-similarity matrix shows all 16 blocks are distinct
- ~0% sparsity: all parameters are actively used
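The effective-rank check can be reproduced in spirit with a small SVD. This is an illustrative sketch, not the actual analysis script; the function name and threshold are assumptions:

```python
import numpy as np

def effective_rank(W, thresh=1e-6):
    """Count singular values above thresh * largest singular value --
    a rough proxy for how much of a weight matrix's capacity is in use."""
    s = np.linalg.svd(W, compute_uv=False)
    return int((s > thresh * s[0]).sum())

rng = np.random.default_rng(0)
full = rng.standard_normal((64, 64))  # random Gaussian: near full rank
low = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 64))  # rank 2

print(effective_rank(full), effective_rank(low))  # 64 2
```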
Usage
```python
import pickle
import sys

import torch

sys.path.insert(0, "nanochat")  # clone nanochat first
from nanochat.gpt import GPT, GPTConfig

# Load tokenizer
with open("tokenizer.pkl", "rb") as f:
    enc = pickle.load(f)

# Load model
checkpoint = torch.load("model_047674.pt", map_location="cpu", weights_only=True)
config = GPTConfig(
    vocab_size=32768, n_layer=16, n_head=8, n_kv_head=8,
    n_embd=1024, sequence_len=2048, window_pattern="L",
)
model = GPT(config)
model.load_state_dict(checkpoint, strict=False)
model.eval()

# Generate from an Arabic prompt: the basmala,
# "In the name of God, the Most Gracious, the Most Merciful"
bos = enc.encode_single_token("<|bos|>")
tokens = [bos] + enc.encode("بسم الله الرحمن الرحيم")
output = enc.decode(list(model.generate(tokens, max_tokens=200, temperature=0.8, top_k=50)))
print(output)
```
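The `temperature` and `top_k` knobs passed to `generate` can be illustrated with a minimal self-contained sampler. This is not the actual nanochat implementation, just a sketch of the standard top-k sampling technique those parameters control:

```python
import math
import random

def sample_top_k(logits, k=50, temperature=0.8, rng=random.Random(0)):
    """Minimal top-k sampling: keep the k highest logits, softmax them at
    the given temperature, and draw one token index from the result."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax over the kept logits (shifted for stability).
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 0.5]
print(sample_top_k(logits, k=2))  # only indices 1 and 3 are candidates
```

Lower temperature sharpens the distribution toward the argmax; smaller `k` cuts off the long tail of unlikely tokens.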
Files
| File | Description |
|---|---|
| `model_047674.pt` | Final checkpoint (step 47,674), 1.5 GB |
| `model_030000.pt` | End of Phase 2 checkpoint |
| `model_019069.pt` | End of Phase 1→2 transition |
| `model_007151.pt` | End of Phase 1 |
| `model_*.pt` | Checkpoints every 2K steps |
| `meta_*.json` | Training metadata for each checkpoint |
Tokenizer files are in the V1 repo (same architecture, different tokenizer) or can be found in the source repo.
Limitations
- Base model only: generates text continuations, not conversations. SFT is required for chat.
- 540M scale: educational/research scale; reasoning benchmarks (ARC Challenge, Winogrande) are at random chance.
- Arabic reasoning gap: the model learned the Arabic language (good BPB, strong BoolQ), but Arabic reasoning on structured tasks is weak at this scale.
- Non-commercial: CC BY-NC 4.0 license; research and educational use only.
Training Journey
This model is part of the autoresearch project, a learning-in-public effort documenting every decision and failure:
- V1: 6.55B tokens, 2-phase curriculum, BPB 1.155, $18 → AENSaid/autoresearch-540m
- V2: 25B tokens, 3-phase curriculum, BPB 0.724, $56 → this repo
- Total project cost: ~$96 across all attempts
- 50 Architecture Decision Records and 100 Lessons Learned in the source repo
On What Weights Are
The model starts as an empty room (Locke's tabula rasa). Training is 25B tokens of people and objects passing through. After training, the data is gone, but the weights carry the effect of everything that passed through. Like scuff marks on the floor after the furniture is removed. The room "remembers" without containing any of the original objects.
The weights aren't knowledge. They're the residue of 25B tokens having passed through 540M parameters.
Next Steps
- SFT: Chat, Thinking/Reasoning, Tool calling
- Post-SFT benchmarking
- Arabic-native benchmarks (ACVA, Arabic Exams)
Citation
```bibtex
@misc{autoresearch-v2-2026,
  title={Autoresearch-V2-Base: Arabic-First Language Model},
  author={AENSaid},
  year={2026},
  url={https://huggingface.co/AENSaid/autoresearch-v2-base}
}
```
Acknowledgments
- nanochat by Andrej Karpathy: training framework
- OpenITI: classical Arabic scholarly texts
- Built with Claude Code (Anthropic)