Arkadiko V4: 3-Variant Architecture Ablation (undertrained)

Status: Research artifact. Do not use for production.

These three 6-hour ablation checkpoints compare architectural choices for integrating the frozen LASER2 BiLSTM encoder into a dense transformer decoder. All three were trained on identical data with seed 42 on a single RTX PRO 4000 Blackwell (24GB). The run was designed as a discriminator experiment, not a production training run; the models are severely undertrained.

Part of the Arkadiko research project (Arabic-first LLM, CC BY-NC 4.0, non-commercial research only).

Ablation results

| Variant | Description | Params | Tokens | Final val_loss | Throughput |
|---|---|---|---|---|---|
| A | Per-layer cross-attention to frozen LASER2 | 708M | 280M | 2.7341 | ~13K tok/s |
| B | Pure decoder, no LASER2 | 592M | 482M | 2.7563 | ~22K tok/s |
| C | Input-only LASER2 injection (mean-pooled) | 593M | 398M | 2.8181 | ~18K tok/s |

All variants trained for 6 hours of wall-clock time on the same GPU, same data, and same seed. Variant B processed 72% more tokens than A in the same wall-clock time because LASER2 cross-attention is expensive (a ~40% throughput tax).
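
The throughput and token claims above can be sanity-checked from the ablation table with a quick back-of-envelope calculation (the 6-hour budget and token counts are from this card; variable names are illustrative):

```python
hours = 6
tokens = {"A": 280e6, "B": 482e6, "C": 398e6}

# Throughput implied by tokens / wall-clock time.
implied_tok_s = {k: v / (hours * 3600) for k, v in tokens.items()}
# A: ~13.0K tok/s, B: ~22.3K tok/s, C: ~18.4K tok/s -- consistent with the table.

extra_tokens_b = tokens["B"] / tokens["A"] - 1                 # ~0.72, the "72% more tokens"
throughput_tax = 1 - implied_tok_s["A"] / implied_tok_s["B"]   # ~0.42, the "~40% tax"
speedup_b = implied_tok_s["B"] / implied_tok_s["A"]            # ~1.72, the "~1.7x faster"
```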

Per-language val loss at final eval:

| Variant | ar | en | fr | es | ru | zh | tr | code | math | classical |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 2.52 | 2.95 | 3.43 | 3.92 | 4.35 | 5.29 | 3.72 | 1.65 | 2.87 | 2.54 |
| B | 2.55 | 2.94 | 3.34 | 3.85 | 4.21 | 5.00 | 3.74 | 1.61 | 2.90 | 2.66 |
| C | 2.69 | 3.01 | 3.49 | 3.87 | 4.46 | 5.78 | 3.88 | 1.73 | 2.81 | 2.79 |

Headline findings:

  1. Variant A (per-layer cross-attention) wins on per-token efficiency: it reaches a lower val loss with 42% fewer tokens than B.
  2. Variant B (no LASER2) is ~1.7× faster per wall-clock hour and matches A on total quality at the same compute budget.
  3. Variant C (input-only) consistently underperforms A and B. Input-only injection with a simple linear projection does not transfer multilingual signal efficiently. (Richer BLIP-2-style Q-Formers may behave differently; not tested.)
  4. Pre-training val loss is a poor discriminator at this scale. The real test is SFT transfer, which is run separately.

Architecture

All three variants share the same decoder backbone:

  • 28 layers, 1024 hidden, 16 query heads, 2 KV heads (GQA 8:1)
  • head_dim=64, SwiGLU FFN (hidden=5504, ~5.4× multiplier)
  • Max seq_len=2048, RoPE theta=10,000
  • Tied input/output embeddings
  • Vocab: 50,004 (LASER2 fairseq SPM, +4 offset for bos/pad/eos/unk)
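
The shared backbone above can be summarized in a small config sketch (a hypothetical dataclass; field names are illustrative and not necessarily the actual V4Config fields):

```python
from dataclasses import dataclass

@dataclass
class BackboneConfig:
    n_layers: int = 28
    d_model: int = 1024
    n_heads: int = 16          # query heads
    n_kv_heads: int = 2        # GQA: 16 / 2 = 8 query heads per KV head
    ffn_hidden: int = 5504     # SwiGLU, ~5.4x d_model
    max_seq_len: int = 2048
    rope_theta: float = 10_000.0
    vocab_size: int = 50_004   # LASER2 SPM 50K + 4 special tokens
    tie_embeddings: bool = True

cfg = BackboneConfig()
head_dim = cfg.d_model // cfg.n_heads       # 64, as listed above
gqa_ratio = cfg.n_heads // cfg.n_kv_heads   # 8, the GQA 8:1 ratio
```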

Variant differences:

  • A adds a CrossAttention module to every block that attends from decoder Q to LASER2 per-token hidden states (K/V). Zero-init on output projection so the block starts as a no-op.
  • B omits cross-attention entirely. Standard decoder-only.
  • C adds a single nn.Linear(1024 → 1024) that takes the mean-pooled LASER2 sentence embedding and adds it to the input token embeddings (broadcast across all positions). No per-layer injection.

The LASER2 encoder is kept frozen in variants A and C; variant B bypasses it entirely.
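
The zero-init trick in variant A can be illustrated with a minimal cross-attention block (a sketch, not the actual V4Decoder code; single-head for brevity, whereas the real module uses the GQA dimensions above). Because the output projection starts at zero, the block initially returns its residual input unchanged:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroInitCrossAttention(nn.Module):
    """Decoder states attend to frozen encoder states (K/V)."""
    def __init__(self, d_model: int, d_enc: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_enc, d_model, bias=False)
        self.v = nn.Linear(d_enc, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        nn.init.zeros_(self.out.weight)  # zero-init: block starts as a no-op

    def forward(self, x, enc):           # x: (B, T, d_model), enc: (B, S, d_enc)
        attn = F.scaled_dot_product_attention(self.q(x), self.k(enc), self.v(enc))
        return x + self.out(attn)        # residual; zero out-proj => returns x at init

x = torch.randn(2, 8, 1024)              # decoder hidden states
enc = torch.randn(2, 5, 1024)            # frozen encoder hidden states
block = ZeroInitCrossAttention(1024, 1024)
y = block(x, enc)                        # identical to x until training moves out.weight
```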

Data mix (identical across all variants)

| Category | Percentage | Source |
|---|---|---|
| Arabic | 30% | wikimedia/wikipedia ar |
| English | 30% | wikimedia/wikipedia en |
| Code | 15% | codeparrot/codeparrot-clean |
| Math | 10% | open-web-math |
| Classical Arabic | 5% | wikimedia/wikipedia ar |
| French | 2% | wikimedia/wikipedia fr |
| Spanish | 2% | wikimedia/wikipedia es |
| Russian | 2% | wikimedia/wikipedia ru |
| Chinese | 2% | wikimedia/wikipedia zh |
| Turkish | 2% | wikimedia/wikipedia tr |

Tokenized with LASER2 SPM 50K (vocab offset +4 for fairseq dict compatibility).
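
The +4 offset mirrors fairseq's dictionary layout, which reserves the first indices for special tokens before the SPM pieces. A sketch under that assumption, following the bos/pad/eos/unk order listed above (helper names are illustrative; the actual tokenizer code may differ):

```python
# fairseq-style dictionary: special tokens occupy ids 0-3, SPM pieces follow.
SPECIALS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
OFFSET = len(SPECIALS)  # the +4 offset

def spm_to_model_id(spm_id: int) -> int:
    """Shift a raw SentencePiece id into the model's vocab space."""
    return spm_id + OFFSET

def model_to_spm_id(model_id: int) -> int:
    """Inverse mapping; ids below OFFSET are special tokens, not SPM pieces."""
    assert model_id >= OFFSET, "special token, no SPM piece"
    return model_id - OFFSET

# 50,000 SPM pieces + 4 specials = the 50,004 vocab size listed above.
vocab_size = 50_000 + OFFSET
```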

Files

variant_{a,b,c}/
  model.pt        - state dict (torch.compile prefix stripped)
  config.json     - V4Config serialized as JSON
  metrics.json    - final val_loss and per-language breakdown
  train.log       - full training log (loss per step, per-eval breakdowns)
  samples/        - generation samples: 140 per eval (5 domains × 4 samples × 7 languages), across 11 evals
code/             - minimal source files needed to load and run the models

Loading

```python
import json
import torch
from arkadiko.llm.config import V4Config
from arkadiko.llm.model import V4Decoder

# Rebuild the config from its serialized JSON.
with open("variant_a/config.json") as f:
    config = V4Config(**json.load(f))

# Instantiate in bfloat16 on GPU, then load the checkpoint weights.
model = V4Decoder(config).to("cuda").bfloat16()
state = torch.load("variant_a/model.pt", map_location="cuda", weights_only=True)
model.load_state_dict(state)
model.requires_grad_(False)
model.eval()
```

For variants A and C, you also need a LASER2 encoder. See code/laser_encoder.py.

Known limitations

  • Undertrained: 280-482M tokens is far below Chinchilla-optimal (roughly 20 tokens per parameter, i.e. ~12-14B tokens) for 592-708M params.
  • No SFT: These are raw pre-training checkpoints.
  • Russian and Chinese are marginal: 2% of the data each. Val loss reflects this.
  • LASER2 SPM is not optimal for 7 languages: it was trained on 200.
  • Not for production: Research artifact only.

License

CC BY-NC 4.0. Non-commercial research and educational use only. See LICENSE for the full text.

Citation

If you use these checkpoints in research, please cite:

@misc{arkadiko_v4_ablation_2026,
  author = {Essam, Ahmed},
  title  = {Arkadiko V4: 3-Variant LASER2 Cross-Attention Ablation},
  year   = {2026},
  howpublished = {HuggingFace Hub},
  url    = {https://huggingface.co/AENSaid/arkadiko-v4-ablation},
  note   = {Non-commercial research artifact}
}

Related decisions logged in the Arkadiko project's ADR.md (ADR-174 through ADR-181).
