Arkadiko V4: 3-Variant Architecture Ablation (undertrained)

Status: Research artifact. Do not use for production.

These three 6-hour ablation checkpoints compare architectural choices for integrating the frozen LASER2 BiLSTM encoder into a dense transformer decoder. All three were trained on identical data with seed 42 on a single RTX PRO 4000 Blackwell (24GB). The run was designed as a discriminator experiment, not a production training run; the models are severely undertrained.

Part of the Arkadiko research project (Arabic-first LLM, CC BY-NC 4.0, non-commercial research only).

Ablation results

| Variant | Description | Params | Tokens | Final val_loss | Throughput |
|---|---|---|---|---|---|
| A | Per-layer cross-attention to frozen LASER2 | 708M | 280M | 2.7341 | ~13K tok/s |
| B | Pure decoder, no LASER2 | 592M | 482M | 2.7563 | ~22K tok/s |
| C | Input-only LASER2 injection (mean-pooled) | 593M | 398M | 2.8181 | ~18K tok/s |

All variants trained for 6 hours of wall-clock time on the same GPU, same data, and same seed. Variant B processed 72% more tokens than A in the same wall-clock time because LASER2 cross-attention is expensive (a ~40% throughput tax).
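
The throughput and token claims above can be sanity-checked from the ablation table with a quick back-of-envelope calculation (the 6-hour budget and token counts are from this card; variable names are illustrative):

```python
hours = 6
tokens = {"A": 280e6, "B": 482e6, "C": 398e6}

# Throughput implied by tokens / wall-clock time.
implied_tok_s = {k: v / (hours * 3600) for k, v in tokens.items()}
# A: ~13.0K tok/s, B: ~22.3K tok/s, C: ~18.4K tok/s -- consistent with the table.

extra_tokens_b = tokens["B"] / tokens["A"] - 1                 # ~0.72, the "72% more tokens"
throughput_tax = 1 - implied_tok_s["A"] / implied_tok_s["B"]   # ~0.42, the "~40% tax"
speedup_b = implied_tok_s["B"] / implied_tok_s["A"]            # ~1.72, the "~1.7x faster"
```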

Per-language val loss at final eval:

| Variant | ar | en | fr | es | ru | zh | tr | code | math | classical |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 2.52 | 2.95 | 3.43 | 3.92 | 4.35 | 5.29 | 3.72 | 1.65 | 2.87 | 2.54 |
| B | 2.55 | 2.94 | 3.34 | 3.85 | 4.21 | 5.00 | 3.74 | 1.61 | 2.90 | 2.66 |
| C | 2.69 | 3.01 | 3.49 | 3.87 | 4.46 | 5.78 | 3.88 | 1.73 | 2.81 | 2.79 |

Headline findings:

  1. Variant A (per-layer cross-attention) wins on per-token efficiency: it reaches a lower val loss with 42% fewer tokens than B.
  2. Variant B (no LASER2) is ~1.7× faster per wall-clock hour and matches A on total quality at the same compute budget.
  3. Variant C (input-only) consistently underperforms A and B. Input-only injection with a simple linear projection does not transfer multilingual signal efficiently. (Richer BLIP-2-style Q-Formers may behave differently; not tested.)
  4. Pre-training val loss is a poor discriminator at this scale. The real test is SFT transfer, which is run separately.

Architecture

All three variants share the same decoder backbone:

  • 28 layers, 1024 hidden, 16 query heads, 2 KV heads (GQA 8:1)
  • head_dim=64, SwiGLU FFN (hidden=5504, ~5.4× multiplier)
  • Max seq_len=2048, RoPE theta=10,000
  • Tied input/output embeddings
  • Vocab: 50,004 (LASER2 fairseq SPM, +4 offset for bos/pad/eos/unk)
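
The shared backbone above can be summarized in a small config sketch (a hypothetical dataclass; field names are illustrative and not necessarily the actual V4Config fields):

```python
from dataclasses import dataclass

@dataclass
class BackboneConfig:
    n_layers: int = 28
    d_model: int = 1024
    n_heads: int = 16          # query heads
    n_kv_heads: int = 2        # GQA: 16 / 2 = 8 query heads per KV head
    ffn_hidden: int = 5504     # SwiGLU, ~5.4x d_model
    max_seq_len: int = 2048
    rope_theta: float = 10_000.0
    vocab_size: int = 50_004   # LASER2 SPM 50K + 4 special tokens
    tie_embeddings: bool = True

cfg = BackboneConfig()
head_dim = cfg.d_model // cfg.n_heads       # 64, as listed above
gqa_ratio = cfg.n_heads // cfg.n_kv_heads   # 8, the GQA 8:1 ratio
```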

Variant differences:

  • A adds a CrossAttention module to every block that attends from decoder Q to LASER2 per-token hidden states (K/V). Zero-init on output projection so the block starts as a no-op.
  • B omits cross-attention entirely. Standard decoder-only.
  • C adds a single nn.Linear(1024 → 1024) that takes the mean-pooled LASER2 sentence embedding and adds it to the input token embeddings (broadcast across all positions). No per-layer injection.

The LASER2 encoder is kept frozen in variants A and C; variant B bypasses it entirely.
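
The zero-init trick in variant A can be illustrated with a minimal cross-attention block (a sketch, not the actual V4Decoder code; single-head for brevity, whereas the real module uses the GQA dimensions above). Because the output projection starts at zero, the block initially returns its residual input unchanged:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroInitCrossAttention(nn.Module):
    """Decoder states attend to frozen encoder states (K/V)."""
    def __init__(self, d_model: int, d_enc: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_enc, d_model, bias=False)
        self.v = nn.Linear(d_enc, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        nn.init.zeros_(self.out.weight)  # zero-init: block starts as a no-op

    def forward(self, x, enc):           # x: (B, T, d_model), enc: (B, S, d_enc)
        attn = F.scaled_dot_product_attention(self.q(x), self.k(enc), self.v(enc))
        return x + self.out(attn)        # residual; zero out-proj => returns x at init

x = torch.randn(2, 8, 1024)              # decoder hidden states
enc = torch.randn(2, 5, 1024)            # frozen encoder hidden states
block = ZeroInitCrossAttention(1024, 1024)
y = block(x, enc)                        # identical to x until training moves out.weight
```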

Data mix (identical across all variants)

| Category | Percentage | Source |
|---|---|---|
| Arabic | 30% | wikimedia/wikipedia ar |
| English | 30% | wikimedia/wikipedia en |
| Code | 15% | codeparrot/codeparrot-clean |
| Math | 10% | open-web-math |
| Classical Arabic | 5% | wikimedia/wikipedia ar |
| French | 2% | wikimedia/wikipedia fr |
| Spanish | 2% | wikimedia/wikipedia es |
| Russian | 2% | wikimedia/wikipedia ru |
| Chinese | 2% | wikimedia/wikipedia zh |
| Turkish | 2% | wikimedia/wikipedia tr |

Tokenized with LASER2 SPM 50K (vocab offset +4 for fairseq dict compatibility).
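
The +4 offset mirrors fairseq's dictionary layout, which reserves the first indices for special tokens before the SPM pieces. A sketch under that assumption, following the bos/pad/eos/unk order listed above (helper names are illustrative; the actual tokenizer code may differ):

```python
# fairseq-style dictionary: special tokens occupy ids 0-3, SPM pieces follow.
SPECIALS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
OFFSET = len(SPECIALS)  # the +4 offset

def spm_to_model_id(spm_id: int) -> int:
    """Shift a raw SentencePiece id into the model's vocab space."""
    return spm_id + OFFSET

def model_to_spm_id(model_id: int) -> int:
    """Inverse mapping; ids below OFFSET are special tokens, not SPM pieces."""
    assert model_id >= OFFSET, "special token, no SPM piece"
    return model_id - OFFSET

# 50,000 SPM pieces + 4 specials = the 50,004 vocab size listed above.
vocab_size = 50_000 + OFFSET
```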

Files

variant_{a,b,c}/
  model.pt        - state dict (torch.compile prefix stripped)
  config.json     - V4Config serialized as JSON
  metrics.json    - final val_loss and per-language breakdown
  train.log       - full training log (loss per step, per-eval breakdowns)
  samples/        - generation samples: 140 per eval (5 domains × 4 samples × 7 languages), across 11 evals
code/             - minimal source files needed to load and run the models

Loading

```python
import json
import torch
from arkadiko.llm.config import V4Config
from arkadiko.llm.model import V4Decoder

# Rebuild the config from its serialized JSON.
with open("variant_a/config.json") as f:
    config = V4Config(**json.load(f))

# Instantiate in bfloat16 on GPU, then load the checkpoint weights.
model = V4Decoder(config).to("cuda").bfloat16()
state = torch.load("variant_a/model.pt", map_location="cuda", weights_only=True)
model.load_state_dict(state)
model.requires_grad_(False)
model.eval()
```

For variants A and C, you also need a LASER2 encoder. See code/laser_encoder.py.

Known limitations

  • Undertrained: 280-482M tokens is far below Chinchilla-optimal (roughly 20 tokens per parameter, i.e. ~12-14B tokens) for 592-708M params.
  • No SFT: These are raw pre-training checkpoints.
  • Russian and Chinese are marginal: 2% of the data each. Val loss reflects this.
  • LASER2 SPM is not optimal for 7 languages: it was trained on 200.
  • Not for production: Research artifact only.

License

CC BY-NC 4.0. Non-commercial research and educational use only. See LICENSE for the full text.

Citation

If you use these checkpoints in research, please cite:

@misc{arkadiko_v4_ablation_2026,
  author = {Essam, Ahmed},
  title  = {Arkadiko V4: 3-Variant LASER2 Cross-Attention Ablation},
  year   = {2026},
  howpublished = {HuggingFace Hub},
  url    = {https://huggingface.co/AENSaid/arkadiko-v4-ablation},
  note   = {Non-commercial research artifact}
}

Related decisions logged in the Arkadiko project's ADR.md (ADR-174 through ADR-181).
