Autoresearch-V2-Base: Arabic-First Language Model (540M)

A 540M parameter language model trained from scratch with an Arabic-first 3-phase curriculum on 25B tokens. V2 of the autoresearch series, built as an educational and research project to explore training LLMs for Arabic from the ground up.

Model Details

| Property | Value |
|---|---|
| Parameters | 540M total (235M scaling) |
| Architecture | GPT-2 style transformer (nanochat) |
| Layers | 16 |
| Hidden dim | 1024 |
| Heads | 8 (head_dim=128) |
| KV heads | 8 |
| Sequence length | 2048 |
| Vocab size | 32,768 (custom BPE, proportional Arabic/English/Code/Math) |
| Training framework | nanochat |
| Precision | FP8 (tensorwise scaling) with BF16 compute |
| Hardware | 8× NVIDIA H200 141GB SXM (Australia) |
| Flash Attention | FA3 (Hopper) |

What's New in V2

| | V1 | V2 |
|---|---|---|
| Training tokens | 6.55B | 25B |
| BPB | 1.155 | 0.724 |
| Curriculum | 2-phase (narrow→broad) | 3-phase (broad→focused→full) |
| Tokenizer | Arabic-majority BPE | Proportional BPE (ar:35% en:30% code:15% math:15% classical:5%) |
| Data | 80 files, Arabic-heavy | 434 parquets, balanced multilingual |
| Training time | 58.5 min | 181 min |
| Cost | ~$18 | ~$56 |
| English fertility | Poor (17 tokens/sentence) | Good (5 tokens/sentence) |

Training Results

| Metric | Value |
|---|---|
| Final BPB | 0.724 |
| Total training steps | 47,674 |
| Total tokens | ~25B |
| Training time | 181 minutes |
| Training cost | ~$56 |
| Hardware | 8× H200 SXM |
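For reference, bits-per-byte (BPB) normalizes cross-entropy loss by the byte length of the underlying text, which makes losses comparable across tokenizers with different compression. A minimal sketch of the conversion (the token and byte counts below are hypothetical, not from this run):

```python
import math

def bits_per_byte(mean_ce_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) to bits per byte."""
    bits_per_token = mean_ce_loss_nats / math.log(2)  # nats -> bits
    return bits_per_token * total_tokens / total_bytes

# Hypothetical numbers: 2.0 nats/token loss, ~4 bytes of text per token.
print(round(bits_per_byte(2.0, 1_000, 4_000), 4))  # 0.7213
```

Because BPB is per byte rather than per token, a tokenizer change (like the V1→V2 switch) cannot game the metric by emitting fewer, longer tokens.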

Benchmark Results (Pre-SFT Baseline)

Evaluated using lm-evaluation-harness (EleutherAI) at step 47,674. The harness scores each answer choice by computing the model's log-probability of that text as a natural continuation of the question: no free-form generation involved, just measuring which completion the model finds most likely.
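The scoring loop can be sketched as follows: each answer choice gets the summed log-probability of its tokens, and the highest-scoring choice counts as the model's answer (`acc_norm` additionally normalizes this score by the choice's byte length). The per-token values below are made up for illustration, not real model output:

```python
import math

def score_choice(token_logprobs):
    """A choice's score is the summed log-probability of its tokens
    as a continuation of the question (no generation involved)."""
    return sum(token_logprobs)

def pick_answer(choices):
    """choices: {choice_text: [per-token logprobs under the model]}.
    Returns the choice the model finds most likely."""
    return max(choices, key=lambda c: score_choice(choices[c]))

# Toy per-token logprobs (hypothetical):
choices = {
    "the sun rises":  [math.log(0.4), math.log(0.5), math.log(0.6)],
    "the sun sleeps": [math.log(0.4), math.log(0.5), math.log(0.01)],
}
print(pick_answer(choices))  # "the sun rises"
```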

English

| Benchmark | acc | acc_norm | Random | Δ |
|---|---|---|---|---|
| HellaSwag | 31.8% | 35.2% | 25% | +10.2% |
| ARC Easy | 32.2% | 31.9% | 25% | +6.9% |
| ARC Challenge | 21.1% | 24.5% | 25% | -0.5% |
| PIQA | 58.1% | 55.6% | 50% | +5.6% |
| BoolQ | 61.4% | n/a | 50% | +11.4% |
| Winogrande | 49.7% | n/a | 50% | -0.3% |
| LAMBADA | 19.2% | n/a | ~0% | +19.2% |

Arabic (Machine-Translated Benchmarks)

| Benchmark | acc | acc_norm | Random | Δ |
|---|---|---|---|---|
| AR HellaSwag | 25.5% | 26.1% | 25% | +1.1% |
| AR ARC Easy | 25.1% | 25.5% | 25% | +0.5% |
| AR PIQA | 50.8% | 51.0% | 50% | +1.0% |
| AR BoolQ | 59.5% | 62.1% | 50% | +12.1% |

Arabic-Native Benchmarks

| Benchmark | acc | Random | Δ |
|---|---|---|---|
| ACVA (Arabic Cultural Values) | 55.9% | 25% | +30.9% |
| ArabicMMLU | 31.3% | 25% | +6.3% |
| Arabic Exams | 25.9% | 25% | +0.9% |
| Belebele (Reading Comprehension) | 23.8% | 25% | -1.2% |

ACVA top categories: Ancient Egypt 73.7%, Arab Empire 70.2%, Lebanon 80.0%, Oman 80.0%, Palestine 71.8%, Entertainment 74.6%, Islamic Law 59.0%

ArabicMMLU top subjects: Islamic Studies (primary) 43.2%, Computer Science (primary) 50.0%, Driving Test 42.9%

Key Findings

  • ACVA 55.9% is the headline result, 30 points above random. The model has genuine Arabic cultural knowledge from the Shamela, OpenITI, Quran, and Hadith training data
  • BoolQ: 61.4% English, 62.1% Arabic. Arabic beats English, evidence that the Arabic-first curriculum works
  • HellaSwag (+10%) and PIQA (+6%) show a commonsense reasoning signal
  • LAMBADA (19.2%) is non-trivial for 540M on an exact-match task
  • MT benchmarks undercount Arabic ability: native Arabic benchmarks (ACVA) reveal a much stronger signal than translated English tasks
  • Arabic Exams and Belebele sit at random chance; multi-step reasoning is still beyond this scale
  • The model is strongest on cultural/historical knowledge, weakest on formal academic reasoning
  • These are pre-SFT numbers; SFT should significantly improve structured tasks

Training

3-Phase Curriculum (ADR-040)

| Phase | Steps | Tokens | Focus |
|---|---|---|---|
| Phase 1: Broad Foundation | 0–7,151 (15%) | ~3.75B | English + Code + Math + Arabic web |
| Phase 2: Arabic Immersion | 7,151–19,069 (25%) | ~6.25B | Quran + Hadith + Shamela + OpenITI + OSCAR Arabic |
| Phase 3: Full Mix | 19,069–47,674 (60%) | ~15B | Everything shuffled together |
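The schedule above amounts to a simple step-to-phase mapping. A sketch, with boundary steps taken from the table (the function and phase names are illustrative, not from the training code):

```python
def phase_for_step(step: int) -> str:
    """Map a training step to its curriculum phase (boundaries from the table)."""
    if step < 7_151:
        return "phase1_broad_foundation"   # 15% of steps
    if step < 19_069:
        return "phase2_arabic_immersion"   # 25% of steps
    if step < 47_674:
        return "phase3_full_mix"           # 60% of steps
    raise ValueError("step beyond the end of the training run")

print(phase_for_step(0), phase_for_step(10_000), phase_for_step(40_000))
```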

Data Sources

~25B tokens across 434 pre-tokenized parquet files:

| Category | Description |
|---|---|
| Arabic Web | OSCAR Arabic, CC-100 Arabic, Arabic Wikipedia |
| Arabic Classical | OpenITI, Shamela Library, Quran, Hadith collections |
| English | FineWeb-Edu (ClimbMix) |
| Code | The Stack v2 (Python + JavaScript) |
| Math | OpenWebMath |

Tokenizer (V2)

Custom 32K BPE tokenizer trained with proportional category sampling:

| Property | Value |
|---|---|
| Algorithm | Pure BPE (no morphological pre-segmentation) |
| Vocab size | 32,768 |
| Training data | 2B chars, 705K docs, proportional sampling |
| Sampling ratio | arabic:35%, english:30%, code:15%, math:15%, classical:5% |
| Arabic fertility | 4.67–7.80 chars/token |
| English fertility | ~5 tokens/sentence (3.5× improvement over V1) |
| Layout | 0–31999 BPE, 32000–32511 reserved (512 slots), 32756–32767 special tokens |

Special tokens: <|think|>, <|/think|>, <|tool_call|>, <|tool_result|>, <|bos|>, <|eos|>, <|pad|>, <|sep|>, <|system|>, <|user|>, <|assistant|>, <|eot|>
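Fertility as reported for Arabic above is characters per token, i.e. how much text each token covers on average (higher means better compression). A sketch of the measurement; the whitespace "tokenizer" is a stand-in for illustration, not the model's BPE:

```python
def chars_per_token(text: str, encode) -> float:
    """Fertility as characters per token. `encode` is any callable
    returning a list of tokens (e.g. a tokenizer's .encode)."""
    tokens = encode(text)
    return len(text) / len(tokens)

# Stand-in "tokenizer": whitespace split (illustrative only).
toy_encode = lambda s: s.split()
print(round(chars_per_token("the sun rises in the east", toy_encode), 2))  # 4.17
```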

Model Health (Spectral Analysis)

Post-training analysis of model weights shows:

  • Healthy weight distributions: smooth Gaussian curves across all layers, no dead neurons
  • Full-rank layers: the SVD spectrum shows effective rank 47/64, and all layers use their full capacity
  • Language separation in embeddings: PCA reveals that Arabic and English tokens form distinct clusters in embedding space
  • No redundant layers: the cosine similarity matrix shows all 16 blocks are distinct
  • ~0% sparsity: all parameters are actively used
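Effective rank can be estimated from a weight matrix's singular value spectrum. A sketch under one common definition (smallest k whose top singular values capture a fixed fraction of spectral energy); the exact criterion used in the analysis above may differ:

```python
import numpy as np

def effective_rank(W: np.ndarray, energy: float = 0.99) -> int:
    """Smallest k such that the top-k singular values hold `energy`
    of the total squared spectrum."""
    s = np.linalg.svd(W, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)   # cumulative energy fraction
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
full = rng.standard_normal((64, 64))                                 # well-spread spectrum
low = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))    # rank-4 by construction
print(effective_rank(full), effective_rank(low))
```

A healthy layer behaves like `full` (energy spread across many directions); a collapsed layer looks like `low`, with a few singular values dominating.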

Usage

```python
import pickle, torch, sys

sys.path.insert(0, "nanochat")  # clone nanochat first
from nanochat.gpt import GPT, GPTConfig

# Load tokenizer
with open("tokenizer.pkl", "rb") as f:
    enc = pickle.load(f)

# Load model weights (on CPU; move to GPU afterwards if available)
checkpoint = torch.load("model_047674.pt", map_location="cpu", weights_only=True)
config = GPTConfig(
    vocab_size=32768, n_layer=16, n_head=8, n_kv_head=8,
    n_embd=1024, sequence_len=2048, window_pattern="L"
)
model = GPT(config)
model.load_state_dict(checkpoint, strict=False)
model.eval()

# Generate a continuation of the Basmala
# ("In the name of God, the Most Gracious, the Most Merciful")
bos = enc.encode_single_token("<|bos|>")
tokens = [bos] + enc.encode("بسم الله الرحمن الرحيم")
with torch.inference_mode():
    output = enc.decode(list(model.generate(tokens, max_tokens=200, temperature=0.8, top_k=50)))
print(output)
```

Files

| File | Description |
|---|---|
| model_047674.pt | Final checkpoint (step 47,674), 1.5 GB |
| model_030000.pt | Mid-Phase 3 checkpoint |
| model_019069.pt | End of Phase 2 (Phase 2→3 boundary) |
| model_007151.pt | End of Phase 1 |
| model_*.pt | Checkpoints every 2K steps |
| meta_*.json | Training metadata for each checkpoint |

Tokenizer files are in the V1 repo (same architecture, different tokenizer) or can be found in the source repo.

Limitations

  • Base model only: generates text continuations, not conversations. SFT required for chat.
  • 540M scale: educational/research scale; reasoning benchmarks (ARC Challenge, Winogrande) are at random chance.
  • Arabic reasoning gap: the model learned the Arabic language (good BPB, strong BoolQ), but Arabic reasoning on structured tasks is weak at this scale.
  • Non-commercial: CC BY-NC 4.0 license. Research and educational use only.

Training Journey

This model is part of the autoresearch project, a learning-in-public effort documenting every decision and failure:

  • V1: 6.55B tokens, 2-phase curriculum, BPB 1.155, $18 → AENSaid/autoresearch-540m
  • V2: 25B tokens, 3-phase curriculum, BPB 0.724, $56 → this repo
  • Total project cost: ~$96 across all attempts
  • 50 Architecture Decision Records and 100 Lessons Learned in the source repo

On What Weights Are

The model starts as an empty room (Locke's tabula rasa). Training is 25B tokens of people and objects passing through. After training, the data is gone โ€” but the weights carry the effect of everything that passed through. Like scuff marks on the floor after furniture is removed. The room "remembers" without containing any of the original objects.

The weights aren't knowledge. They're the residue of 25B tokens having passed through 540M parameters.

Next Steps

  • SFT: Chat, Thinking/Reasoning, Tool calling
  • Post-SFT benchmarking
  • Arabic-native benchmarks (ACVA, Arabic Exams)

Citation

@misc{autoresearch-v2-2026,
  title={Autoresearch-V2-Base: Arabic-First Language Model},
  author={AENSaid},
  year={2026},
  url={https://huggingface.co/AENSaid/autoresearch-v2-base}
}

Acknowledgments

  • nanochat by Andrej Karpathy: training framework
  • OpenITI: classical Arabic scholarly texts
  • Built with Claude Code (Anthropic)