# Arkadiko V4: 3-Variant Architecture Ablation (undertrained)
Status: Research artifact. Do not use for production.
These three 6-hour ablation checkpoints compare architectural choices for integrating the frozen LASER2 BiLSTM encoder into a dense transformer decoder. All three were trained on identical data with seed 42 on a single RTX PRO 4000 Blackwell (24GB). The run was designed as a discriminator experiment, not a production training run; the models are severely undertrained.
Part of the Arkadiko research project (Arabic-first LLM, CC BY-NC 4.0, non-commercial research only).
## Ablation results
| Variant | Description | Params | Tokens | Final val_loss | Throughput |
|---|---|---|---|---|---|
| A | Per-layer cross-attention to frozen LASER2 | 708M | 280M | 2.7341 | ~13K tok/s |
| B | Pure decoder, no LASER2 | 592M | 482M | 2.7563 | ~22K tok/s |
| C | Input-only LASER2 injection (mean-pooled) | 593M | 398M | 2.8181 | ~18K tok/s |
All variants trained for 6 hours of wall-clock time on the same GPU, same data, same seed. Variant B processed 72% more tokens than A in the same wall-clock time because LASER2 cross-attention is expensive (roughly a 40% throughput tax).
Per-language breakdown at final eval:
| Variant | ar | en | fr | es | ru | zh | tr | code | math | classical |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 2.52 | 2.95 | 3.43 | 3.92 | 4.35 | 5.29 | 3.72 | 1.65 | 2.87 | 2.54 |
| B | 2.55 | 2.94 | 3.34 | 3.85 | 4.21 | 5.00 | 3.74 | 1.61 | 2.90 | 2.66 |
| C | 2.69 | 3.01 | 3.49 | 3.87 | 4.46 | 5.78 | 3.88 | 1.73 | 2.81 | 2.79 |
Headline findings:
- Variant A (per-layer cross-attention) wins on per-token efficiency: it reaches a lower val loss with 42% fewer tokens than B.
- Variant B (no LASER2) is ~1.7× faster per wall-clock hour and matches A on total quality at the same compute budget.
- Variant C (input-only) consistently underperforms A and B. Input-only injection with a simple linear projection does not transfer multilingual signal efficiently. (BLIP-2-style richer Q-Formers may behave differently; not tested.)
- Pre-training val loss is a poor discriminator at this scale. The real test is SFT transfer, which is run separately.
## Architecture
All three variants share the same decoder backbone:
- 28 layers, 1024 hidden, 16 query heads, 2 KV heads (GQA 8:1)
- head_dim=64, SwiGLU FFN (hidden=5504, ≈5.4× multiplier)
- Max seq_len=2048, RoPE theta=10,000
- Tied input/output embeddings
- Vocab: 50,004 (LASER2 fairseq SPM, +4 offset for bos/pad/eos/unk)
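For reference, the backbone hyperparameters above can be collected into a small config sketch. The field names here are illustrative only and do not reflect the real `V4Config` schema:

```python
from dataclasses import dataclass

@dataclass
class DecoderConfig:
    # Illustrative field names only -- not the real V4Config schema.
    n_layers: int = 28
    hidden: int = 1024
    n_q_heads: int = 16
    n_kv_heads: int = 2       # GQA: 16 query heads share 2 KV heads (8:1)
    ffn_hidden: int = 5504    # SwiGLU inner width, ~5.4x the hidden size
    max_seq_len: int = 2048
    rope_theta: float = 10_000.0
    vocab_size: int = 50_004  # LASER2 SPM 50K + 4 fairseq specials

    @property
    def head_dim(self) -> int:
        return self.hidden // self.n_q_heads      # 1024 // 16 = 64

    @property
    def gqa_ratio(self) -> int:
        return self.n_q_heads // self.n_kv_heads  # 16 // 2 = 8

cfg = DecoderConfig()
assert cfg.head_dim == 64 and cfg.gqa_ratio == 8
```

The derived properties make the consistency checks explicit: head_dim and the GQA ratio follow from the head counts rather than being stored separately.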
Variant differences:
- A adds a `CrossAttention` module to every block that attends from decoder queries to the LASER2 per-token hidden states (K/V). The output projection is zero-initialized so each block starts as a no-op.
- B omits cross-attention entirely: a standard decoder-only model.
- C adds a single `nn.Linear(1024 → 1024)` that takes the mean-pooled LASER2 sentence embedding and adds it to the input token embeddings (broadcast across all positions). No per-layer injection.
The LASER2 encoder is frozen in variants A and C, and bypassed entirely in variant B.
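A minimal sketch of the variant-A mechanism, using standard multi-head attention in place of the real GQA implementation (module names and layout here are hypothetical):

```python
import torch
import torch.nn as nn

class ZeroInitCrossAttention(nn.Module):
    """Hypothetical sketch of variant A's per-block cross-attention.

    Decoder hidden states are the queries; frozen LASER2 per-token
    states supply keys/values. The output projection starts at zero,
    so the residual branch is an exact identity at initialization.
    """

    def __init__(self, dim: int = 1024, n_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.out_proj.weight)  # block starts as a no-op

    def forward(self, x: torch.Tensor, laser_states: torch.Tensor) -> torch.Tensor:
        # x: [batch, tgt_len, dim], laser_states: [batch, src_len, dim]
        attn_out, _ = self.attn(x, laser_states, laser_states)
        return x + self.out_proj(attn_out)    # identity at init
```

Because `out_proj` starts at zero, adding the module cannot perturb the decoder's signal at step 0; the cross-attention pathway only contributes once training moves the projection away from zero.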
## Data mix (identical across all variants)
| Category | Percentage | Source |
|---|---|---|
| Arabic | 30% | wikimedia/wikipedia ar |
| English | 30% | wikimedia/wikipedia en |
| Code | 15% | codeparrot/codeparrot-clean |
| Math | 10% | open-web-math |
| Classical Arabic | 5% | wikimedia/wikipedia ar |
| French | 2% | wikimedia/wikipedia fr |
| Spanish | 2% | wikimedia/wikipedia es |
| Russian | 2% | wikimedia/wikipedia ru |
| Chinese | 2% | wikimedia/wikipedia zh |
| Turkish | 2% | wikimedia/wikipedia tr |
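A category sampler matching the mix above can be sketched with the stdlib's `random.choices`; the actual data pipeline is not published here, so this is purely illustrative:

```python
import random

# Percent weights from the table above.
MIX = {
    "ar": 30, "en": 30, "code": 15, "math": 10, "classical_ar": 5,
    "fr": 2, "es": 2, "ru": 2, "zh": 2, "tr": 2,
}
assert sum(MIX.values()) == 100

def sample_category(rng: random.Random) -> str:
    """Draw one data category according to the mixture weights."""
    cats = list(MIX)
    return rng.choices(cats, weights=[MIX[c] for c in cats], k=1)[0]

rng = random.Random(42)  # the ablation's seed, used here only for repeatability
draws = [sample_category(rng) for _ in range(1000)]
```

With per-example category draws like this, the realized mix converges to the table's percentages over a long run while individual batches remain mixed.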
Tokenized with LASER2 SPM 50K (vocab offset +4 for fairseq dict compatibility).
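The +4 id offset can be illustrated as follows; the exact special-token ids are an assumption based on the usual fairseq dictionary order, not a confirmed detail of this repo:

```python
# Fairseq prepends four special symbols before the SentencePiece
# vocabulary, so every raw SPM id must shift by 4 to index the
# model's 50,004-entry embedding table. The id assignments below
# follow fairseq's default order (assumed, not verified here).
SPECIALS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
OFFSET = len(SPECIALS)  # = 4

def spm_to_model_ids(spm_ids):
    """Shift raw SentencePiece ids into the offset model vocab."""
    return [i + OFFSET for i in spm_ids]

def wrap_sequence(spm_ids):
    """Surround a shifted sequence with bos/eos."""
    return [SPECIALS["<s>"]] + spm_to_model_ids(spm_ids) + [SPECIALS["</s>"]]

print(wrap_sequence([17, 250, 49999]))  # -> [0, 21, 254, 50003, 2]
```

Note that the highest raw SPM id (49,999) maps to 50,003, which is exactly the last slot of the 50,004-entry vocabulary.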
## Files
```
variant_{a,b,c}/
  model.pt      state dict (torch.compile prefix stripped)
  config.json   V4Config serialized as JSON
  metrics.json  final val_loss and per-language breakdown
  train.log     full training log (loss per step, per-eval breakdowns)
  samples/      140 generation samples per eval (5 domains × 4 × 7 langs), for 11 evals
  code/         minimal source files needed to load and run the models
```
## Loading
```python
import json

import torch

from arkadiko.llm.config import V4Config
from arkadiko.llm.model import V4Decoder

with open("variant_a/config.json") as f:
    cfg_dict = json.load(f)

config = V4Config(**cfg_dict)
model = V4Decoder(config).to("cuda").bfloat16()

state = torch.load("variant_a/model.pt", map_location="cuda")
model.load_state_dict(state)
model.requires_grad_(False)
model.eval()  # inference mode: disable dropout etc.
```
For variants A and C you also need the LASER2 encoder; see `code/laser_encoder.py`.
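As a usage sketch, a greedy decoding loop, assuming (unverified; check the files under `code/`) that the decoder's forward pass maps token ids of shape [batch, seq] to logits of shape [batch, seq, vocab]:

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids: torch.Tensor, max_new: int = 32,
                    eos_id: int = 2) -> torch.Tensor:
    """Greedy decoding sketch. `model` is assumed to map token ids
    [batch, seq] to logits [batch, seq, vocab]; eos_id=2 follows the
    fairseq convention and is an assumption, not a confirmed detail."""
    for _ in range(max_new):
        logits = model(input_ids)                     # [B, T, V]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=1)
        if (next_id == eos_id).all():                 # stop once every row hit EOS
            break
    return input_ids
```

This recomputes the full forward pass each step (no KV cache), which is acceptable for spot-checking undertrained research checkpoints.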
## Known limitations
- Undertrained: 280-482M tokens is far below the Chinchilla-optimal ~20 tokens per parameter for 592-708M-parameter models.
- No SFT: These are raw pre-training checkpoints.
- Russian and Chinese are marginal: 2% of the data each. Val loss reflects this.
- LASER2 SPM is not optimal for this 7-language mix: it was trained to cover roughly 200 languages.
- Not for production: Research artifact only.
## License
CC BY-NC 4.0: non-commercial research and educational use only.
See LICENSE for the full text.
## Citation
If you use these checkpoints in research, please cite:
```bibtex
@misc{arkadiko_v4_ablation_2026,
  author       = {Essam, Ahmed},
  title        = {Arkadiko V4: 3-Variant LASER2 Cross-Attention Ablation},
  year         = {2026},
  howpublished = {HuggingFace Hub},
  url          = {https://huggingface.co/AENSaid/arkadiko-v4-ablation},
  note         = {Non-commercial research artifact}
}
```
Related decisions logged in the Arkadiko project's ADR.md (ADR-174 through ADR-181).