---
license: apache-2.0
language:
- en
- ko
- zh
- ja
- es
- fr
tags:
- causal-lm
- poe
- product-of-experts
- per-stage-head
- local-learning
- chinchilla
- nanochat
- pretraining
- early-exit
- speculative-decoding
- asymmetric-stages
library_name: transformers
pipeline_tag: text-generation
---
# Cognica-PoE-v1.0-3B-base

A 3.02B-parameter causal language model pretrained from scratch with Product of Experts (PoE) per-stage-head local learning. The model has 4 PoE stages with asymmetric layer counts (16, 6, 5, 5): stage 0 (16 layers, ~50% of the trunk) acts as a high-capacity general-LM backbone, while stages 1-3 (6+5+5 deeper layers) refine specialty knowledge. Each PoE stage has its own additive `lm_head` that composes with the shared base `lm_head`:

`logits_k = lm_head(x_k) + lm_head_stages[k](x_k)` for `k` in `0..3`

Inference aggregates per-stage log-softmax distributions (Bayesian PoE, uniform mean / `alpha=0.0`).

This is a mid-training research release. The final training plan is 83,923 steps (~66B tokens, Chinchilla ratio ~22). Multiple checkpoints are released as branches named `step-XXXXX` (see "Checkpoints" below); `main` tracks the latest.
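The aggregation rule can be sketched in a few lines. The following is an illustrative NumPy reconstruction of the description above (uniform mean of per-stage log-softmax distributions, renormalized), not the model's actual inference code; shapes and names are hypothetical:

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def poe_aggregate(stage_logits):
    """Uniform-mean (alpha=0.0) Bayesian PoE over K per-stage distributions.

    stage_logits: list of K arrays, each (vocab,), i.e. the card's
    logits_k = lm_head(x_k) + lm_head_stages[k](x_k).
    """
    stage_logprobs = np.stack([log_softmax(l) for l in stage_logits])  # (K, vocab)
    mean_logprobs = stage_logprobs.mean(axis=0)  # geometric mean in probability space
    return log_softmax(mean_logprobs)            # renormalize to a proper distribution

rng = np.random.default_rng(0)
logits = [rng.normal(size=8) for _ in range(4)]  # K=4 stages, toy vocab of 8
agg = poe_aggregate(logits)
print(np.exp(agg).sum())  # ~1.0
```

The uniform mean of log-probabilities is a product of experts up to normalization; the final `log_softmax` performs that renormalization.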
## TL;DR

- 3.02B params: 2.08B transformer trunk + 0.54B value-embeds + 0.34B lm_head_stages + 0.07B wte
- Architecture: depth=32, n_embd=2048, n_head=16, n_kv_head=8 (GQA 2:1), head_dim=128, intermediate_size=12800, max_seq_len=2048
- PoE: K=4 stages, asymmetric: `poe_stage_layers=(16, 6, 5, 5)`, boundaries at layers `[15, 21, 26, 31]`, `poe_mode=flat`, `poe_alpha=0.0` (uniform stage mean)
- Per-stage heads: 4 independent additive `lm_head_stages` composing with the shared `lm_head`
- Training: DistMuonAdamW (ZeRO-2), total_batch=786,432 tokens/step, ~66B target tokens, Chinchilla ratio ~22, FA2, bf16 compute / fp32 weights, `case_aug_prob=0.15`
- Dataset: frontier_v1 mix (63B tokens), 11 sources covering English / multilingual / code / math / books / chat
- Tokenizer: 32,768 BPE vocab, BOS-prepend protocol (see "Inference" below)
- Standard HF `AutoModelForCausalLM` + `AutoTokenizer` with `trust_remote_code=True`
- WAND p99 bounds are now per-checkpoint, stored in `config.json` (auto-calibrated; the class constant is a fallback only)
## Architecture details

| Field | Value |
|---|---|
| `num_hidden_layers` | 32 |
| `hidden_size` | 2048 |
| `intermediate_size` | 12800 |
| `num_attention_heads` | 16 |
| `num_key_value_heads` | 8 (GQA 2:1) |
| `head_dim` | 128 |
| `max_position_embeddings` | 2048 |
| `vocab_size` | 32768 |
| `window_pattern` | SSSL (3 short + 1 long sliding-window per 4 layers; final layer always full) |
| `rope_theta` | 100,000 |
| `hidden_act` | relu_squared |
| `rms_norm_eps` | 1e-6 |
| `tie_word_embeddings` | False |
| `poe_mode` | flat |
| `poe_alpha` | 0.0 |
| `poe_stage_layers` | [16, 6, 5, 5] |
| `per_stage_head` | True |
| `poe_head_count` | 4 |
| `poe_wand_p99_bounds_per_stage_head` | per-checkpoint (auto-calibrated; see "WAND bounds" below) |
### Stage layout
| Stage | Layer range (0-indexed) | Layers | Approx. trunk-compute share |
|---|---|---|---|
| 0 | [0, 15] | 16 | 50% |
| 1 | [16, 21] | 6 | ~19% (cumulative 69%) |
| 2 | [22, 26] | 5 | ~16% (cumulative 84%) |
| 3 | [27, 31] | 5 | ~16% (cumulative 100%) |
Stage 0 is intentionally deep enough to function as a standalone capable LM. The asymmetric layout (50% / 19% / 16% / 16%) is itself a research variable: see the "Diversity vs layout" note below.
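The share column follows directly from `poe_stage_layers` over the 32-layer trunk; a quick check of the rounded percentages:

```python
# Recompute the approximate trunk-compute shares from the stage layer counts.
stage_layers = (16, 6, 5, 5)  # poe_stage_layers from the card
total = sum(stage_layers)     # 32 trunk layers
cum = 0
for k, n in enumerate(stage_layers):
    cum += n
    print(f"stage {k}: {n / total:.1%} (cumulative {cum / total:.1%})")
# stage 0 is exactly 50%; stages 1-3 round to ~19% / ~16% / ~16%
```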
## Training

| Field | Value |
|---|---|
| Optimizer | DistMuonAdamW (ZeRO-2; reduce_scatter zero-padded for vocab=32768 % world_size != 0) |
| `total_batch_size` | 786,432 tokens/step |
| `num_iterations` | 83,923 (target) |
| Target tokens | ~65.99B (Chinchilla ratio ~21.85) |
| `matrix_lr` | 0.015 |
| `embedding_lr` | 0.3 |
| `unembedding_lr` | 0.008 |
| `weight_decay` | 0.28 |
| `warmup_steps` | 1,000 |
| `warmdown_ratio` | 0.65 |
| `case_aug_prob` | 0.15 (80% lower / 20% upper at sample time) |
| Compute | 3-node A100 80GB (12 GPUs), DDP over TCP cross-zone (us-central1-c) |
| Compute dtype | bf16 |
| Weight dtype | fp32 |
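The step count, batch size, token target, and Chinchilla ratio quoted above are mutually consistent; a quick arithmetic check:

```python
# Verify the training-plan numbers quoted in the table.
tokens_per_step = 786_432
steps = 83_923
params = 3.02e9

total_tokens = tokens_per_step * steps
print(total_tokens / 1e9)     # ~65.9997 (billion tokens)
print(total_tokens / params)  # ~21.85 (Chinchilla tokens-per-parameter ratio)
```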
## Dataset (frontier_v1 mix, 63.07B tokens, 848 sharded parquets)
| Source | Share |
|---|---|
| FineWeb-Edu | 33.5% |
| DCLM-Baseline | 24.1% |
| Stack v2 (codeparrot/github-code-clean mirror) | 15.7% |
| Wikipedia | 5.2% |
| CulturaX (ko, zh, ja, es, fr) | 5.2% |
| ProofPile-2 | 4.2% |
| OpenWebMath | 4.2% |
| Gutenberg (PG-19 separate) | 4.2% |
| PG-19 | 2.1% |
| UltraChat | 1.0% |
| OpenHermes-2.5 | 0.6% |
## Checkpoints

Each checkpoint is a separate branch named `step-XXXXX`. The `main` branch tracks the latest released checkpoint (currently `step-83923`, the final checkpoint; training complete).
| Branch | Step | Training % | Val BPB (training-eval, 40M tokens, 12 ranks) |
|---|---|---|---|
| `step-2000` | 2,000 | 2.4% | 0.987 |
| `step-4000` | 4,000 | 4.8% | 0.955 |
| `step-6000` | 6,000 | 7.2% | 0.949 |
| `step-8000` | 8,000 | 9.5% | 0.944 |
| `step-10000` | 10,000 | 11.9% | 0.936 |
| `step-12000` | 12,000 | 14.3% | 0.932 |
| `step-14000` | 14,000 | 16.7% | 0.932 |
| `step-16000` | 16,000 | 19.1% | 0.928 |
| `step-18000` | 18,000 | 21.4% | 0.922 |
| `step-20000` | 20,000 | 23.8% | 0.923 |
| `step-22000` | 22,000 | 26.2% | 0.923 |
| `step-24000` | 24,000 | 28.6% | 0.923 |
| `step-26000` | 26,000 | 31.0% | 0.922 |
| `step-28000` | 28,000 | 33.4% | 0.919 |
| `step-30000` | 30,000 | 35.8% | 0.914 |
| `step-32000` | 32,000 | 38.1% | 0.903 |
| `step-34000` | 34,000 | 40.5% | 0.905 |
| `step-36000` | 36,000 | 42.9% | 0.896 |
| `step-38000` | 38,000 | 45.3% | 0.896 (training-log s37500=0.894 was lower) |
| `step-40000` | 40,000 | 47.7% | 0.890 |
| `step-42000` | 42,000 | 50.0% | 0.885 |
| `step-44000` | 44,000 | 52.4% | 0.883 |
| `step-46000` | 46,000 | 54.8% | 0.879 |
| `step-48000` | 48,000 | 57.2% | 0.875 |
| `step-50000` | 50,000 | 59.6% | 0.866 |
| `step-52000` | 52,000 | 62.0% | 0.859 (skipped analysis; HF auto-publish only) |
| `step-54000` | 54,000 | 64.4% | 0.855 |
| `step-56000` | 56,000 | 66.7% | 0.849 |
| `step-58000` | 58,000 | 69.1% | 0.847 |
| `step-60000` | 60,000 | 71.5% | 0.840 |
| `step-62000` | 62,000 | 73.9% | 0.832 |
| `step-64000` | 64,000 | 76.3% | 0.830 |
| `step-66000` | 66,000 | 78.6% | 0.824 |
| `step-68000` | 68,000 | 81.0% | 0.815 |
| `step-70000` | 70,000 | 83.4% | 0.807 |
| `step-72000` | 72,000 | 85.8% | 0.803 |
| `step-74000` | 74,000 | 88.2% | 0.798 |
| `step-76000` | 76,000 | 90.6% | 0.794 |
| `step-78000` | 78,000 | 92.9% | 0.790 |
| `step-80000` | 80,000 | 95.3% | 0.784 |
| `step-82000` | 82,000 | 97.7% | 0.778 |
| `step-83923` | 83,923 | 100.0% (final) | 0.773 |
| `main` | latest | - | tracks `step-83923` |
Training-log val BPB new-minimum trajectory: s24500=0.9216 → s26500=0.9205 → s27000=0.9170 → s27500=0.9152 → s29000=0.9150 → s30000=0.9139 → s30500=0.9058 → s32000=0.9029 → s33500=0.9025 → s35000=0.9019 → s35500=0.8957 → s37500=0.8936 → s40000=0.8904 → s41500=0.8856 → s42000=0.8849 → s44000=0.8827 → s44500=0.8777 → s46500=0.8735 → s47000=0.8720 → s48500=0.8686 → s49500=0.8677 → s50000=0.8655 → s50500=0.8645 → s51000=0.8604 → s51500=0.8525 → s54500=0.8542. The warmdown phase began at step 29373; the LR multiplier (lrm) is 1.00 at the start, 0.85 by step 38000, 0.74 by step 44000, 0.65 by step 50000, and 0.59 by step 54000.
Load a specific checkpoint via:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base",
    revision="step-83923",  # branch name
    trust_remote_code=True,
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base",
    revision="step-83923",
    trust_remote_code=True,
)
```
## WAND bounds (per-checkpoint, calibrated)

Each branch's `config.json` carries `poe_wand_p99_bounds_per_stage_head`, calibrated on a 131,072-token val slice using the tight margin-shrinkage metric `range(delta) = max(delta) - min(delta)` (constant-shift invariant). `model.generate_wand(...)` reads this field automatically; the class constant `POE_WAND_P99_BOUNDS_PER_STAGE_HEAD = (3.2557, 1.5259, 1.1327)` is now a fallback only.
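The calibration metric itself is simple; here is an illustrative sketch of the shift-invariant range statistic described above (function name and example values are hypothetical, not the release's actual calibration code):

```python
import numpy as np

def delta_range(delta):
    # range(delta) = max(delta) - min(delta): adding any constant c to every
    # element shifts max and min equally, so the range is unchanged.
    delta = np.asarray(delta, dtype=np.float64)
    return float(delta.max() - delta.min())

d = np.array([0.5, 1.25, -0.25, 0.75])  # dyadic values, exact in binary floats
print(delta_range(d))        # 1.5
print(delta_range(d + 5.0))  # 1.5, invariant to the constant shift
```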
| step | bound 0→1 | bound 1→2 | bound 2→3 |
|---|---|---|---|
| 2,000 | 3.7031 | 1.6121 | 0.9499 |
| 4,000 | 3.8367 | 1.7457 | 1.0991 |
| 6,000 | 3.6368 | 1.6811 | 1.0779 |
| 8,000 | 3.7747 | 1.7518 | 1.1965 |
| 10,000 | 3.6264 | 1.6389 | 1.1198 |
| 12,000 | 3.4802 | 1.6259 | 1.1765 |
| 14,000 | 3.2557 | 1.5259 | 1.1327 |
| 16,000 | 3.2375 | 1.5871 | 1.2400 |
| 18,000 | 3.0877 | 1.4975 | 1.1504 |
| 20,000 | 3.3391 | 1.6146 | 1.2223 |
| 22,000 | 3.2850 | 1.5351 | 1.1668 |
| 24,000 | 3.0965 | 1.5135 | 1.2253 |
| 26,000 | 3.2014 | 1.5787 | 1.1850 |
| 28,000 | 3.3545 | 1.6309 | 1.2206 |
| 30,000 | 3.2619 | 1.5749 | 1.1668 |
| 32,000 | 3.1206 | 1.5611 | 1.1859 |
| 34,000 | 3.3211 | 1.6436 | 1.1958 |
| 36,000 | 3.1297 | 1.5429 | 1.1388 |
| 38,000 | 3.5419 | 1.7612 | 1.2951 |
| 40,000 | 3.2932 | 1.6490 | 1.2101 |
| 42,000 | 3.1828 | 1.6904 | 1.2802 |
| 44,000 | 3.5738 | 1.8313 | 1.3495 |
| 46,000 | 3.3461 | 1.7629 | 1.2783 |
| 48,000 | 3.4684 | 1.7783 | 1.3153 |
| 50,000 | 3.3907 | 1.7382 | 1.2635 |
| 54,000 | 3.6491 | 1.8719 | 1.4201 |
| 56,000 | 3.5046 | 1.8724 | 1.3725 |
| 58,000 | 3.8759 | 2.0701 | 1.5092 |
| 60,000 | 3.4080 | 1.8007 | 1.3147 |
| 62,000 | 3.4241 | 1.8028 | 1.3601 |
| 64,000 | 3.3929 | 1.7722 | 1.2997 |
| 66,000 | 3.3416 | 1.7172 | 1.2378 |
| 68,000 | 3.8046 | 2.0047 | 1.4619 |
| 70,000 | 3.4113 | 1.7839 | 1.3367 |
| 72,000 | 3.4601 | 1.8653 | 1.3684 |
| 74,000 | 3.7531 | 2.0426 | 1.4564 |
| 76,000 | 3.7031 | 1.9777 | 1.4586 |
| 78,000 | 3.9050 | 1.9837 | 1.4425 |
| 80,000 | 3.8490 | 1.9599 | 1.4154 |
| 82,000 | 3.8955 | 2.0010 | 1.4470 |
| 83,923 (final) | 3.9429 | 2.0193 | 1.4479 |
The 0→1 bound decreased over s2k → s18k (from its 3.84 peak at s4000 to 3.09 at s18000). Subsequent windows produced repeated widening/narrowing cycles: over s20k → s28k all three bounds rose +3-5%, descended through s30k → s36k, widened sharply at s38k (+13-14%), narrowed at s40k (-7%), split at s42k, widened uniformly at s44k (+12% to +5%), reverted at s46k (-5%), made mild moves through s48k-s56k, widened uniformly at s58k (+10%; bound 1→2 = 2.0701 set a trajectory-wide single-bound high), narrowed substantially at s60k (-12-13%; the s58k widening fully reverts), held essentially flat at s62k (+0.5% / +0.1% / +3.5%), narrowed mildly through s64k-s66k (-0.9% to -4.8%), widened uniformly at s68k (+13.85% / +16.74% / +18.10%), narrowed substantially at s70k (-10.34% / -11.01% / -8.56%), mildly re-widened at s72k (+1.43% / +4.56% / +2.37%), widened moderately at s74k (+8.47% / +9.50% / +6.43%), mildly narrowed at s76k (-1.33% / -3.18% / +0.15%), and split at s78k (+5.45% bound 0→1 / +0.30% bound 1→2 / -1.10% bound 2→3). The trajectory is non-monotonic on every measurement window.
## Inference

### Standard HF generate (BOS prepend REQUIRED for base ckpts)

This is a base (pretrained) model. The training protocol always prepends `<|bos|>` to the prompt before tokenization. Failing to prepend BOS produces incoherent output:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base",
    trust_remote_code=True,
)

prompt = "The capital of France is"
input_ids = [tokenizer.bos_token_id] + tokenizer.encode(prompt, add_special_tokens=False)
input_ids = torch.tensor([input_ids], device=device)
out = model.generate(
    input_ids=input_ids,
    max_new_tokens=32,
    do_sample=False,  # greedy; set True + temperature for sampling
)
print(tokenizer.decode(out[0].tolist()))
```
KV cache is enabled by default. `CognicaKVCache` subclasses `transformers.Cache` so HF `generate()` preserves it across decode steps without auto-replacing it with `DynamicCache`. The cache is preallocated to `max_position_embeddings` and lives on the device of the input tensor.
Implementation details:

- Numerical: SDPA's prefill (`is_causal=True`, full sequence) and decode (`Tq == 1`, masked) kernels are mathematically equivalent but accumulate bf16 rounding errors in different orders. To prevent that drift from compounding across decode steps and producing different greedy tokens at low-margin branching points, the SDPA call casts `q`/`k`/`v` to fp32, runs the kernel, then casts back to bf16. The K/V cache itself stays in bf16 (memory unchanged). On a fixed greedy prompt this gives bit-identical agreement between `use_cache=True` and `use_cache=False` for at least 200 generated tokens.
- Throughput: in single-batch (B=1) interactive use, per-decode Python and dispatch overhead dominates the per-step compute savings from the cache. Measured speedup is +3-6 percent (`use_cache=True` vs `use_cache=False`) over 50-500 token runs. To realize the cache's full benefit, batch the decode (B >= 4) or use a fused KV-cache kernel (FA2's `flash_attn_with_kvcache`, FlashInfer).
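The numerical point above (low-precision accumulation is order-dependent; accumulating in higher precision and rounding once removes the discrepancy) can be demonstrated in miniature. This is an illustrative NumPy example with hand-picked fp16 values, not the actual SDPA path:

```python
import numpy as np

a, b, c = np.float16(1.0), np.float16(2048.0), np.float16(-2048.0)

# Half-precision arithmetic rounds after every op, so the result depends on
# operation order: 1 + 2048 = 2049 rounds to 2048 in fp16 (tie to even),
# while 1 - 2048 = -2047 is exactly representable.
left = (a + b) + c   # -> 0.0
right = (a + c) + b  # -> 1.0

# Accumulating in fp32 and rounding once at the end is order-independent here,
# mirroring the "cast to fp32, run kernel, cast back" fix described above.
left32 = np.float16(np.float32(a) + np.float32(b) + np.float32(c))
right32 = np.float16(np.float32(a) + np.float32(c) + np.float32(b))

print(left, right, left32, right32)  # 0.0 1.0 1.0 1.0
```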
### PoE-specific inference (s83923 final measurements, 8-shard val slice, 1.05M tokens)

s83923 is the final ckpt (lrm at s83923 ≈ 0.05; warmdown complete). Same val slice across s8000..s83923, single A100 80GB, bug-fixed code:
| Inference mode | n_layer used | Training-objective BPB | Renormed PoE BPB (α=0) |
|---|---|---|---|
| Full PoE (α=0, all 4 stages aggregated) | 32 | 0.724738 | 0.724738 |
| Single stage 0 alone | 16 | 0.727740 | 0.727740 |
| Single stage 1 alone | 22 | 0.725798 | 0.725798 |
| Single stage 2 alone | 27 | 0.725063 | 0.725063 |
| Single stage 3 alone | 32 | 0.724273 | 0.724273 |
| Prefix K'=1 (== single s0) | 16 | 0.727740 | - |
| Prefix K'=2 | 22 | 0.726103 | - |
| Prefix K'=3 | 27 | 0.725363 | - |
| Self-speculative decoding (stage 0 drafts, full verifies) | mixed | (no quality loss by construction) | - |
| Metric | s83923 (final) |
|---|---|
| Speculative decoding speedup (m=4, 7 prompts × ~60 tokens) | 1.61x end-to-end |
| Speculative acceptance α | 0.9852 (s82000=0.9377, +0.0475; 2nd highest in trajectory after s28000=0.9882) |
| Routing probe at cap=0.020 (regression-rate-bounded) | 85.05% routed (s82000=86.08%, -1.03pp), projected speedup 1.715x |
| Per-stage target accuracy (full K=4) | 0.4860 (s82000=0.4835, +0.0025; trajectory peak) |
| Per-stage best (s3) accuracy | 0.4862 (trajectory peak) |
| PoE - single-s3 BPB crossover gap | +0.000465 (s82000=+0.000441, s80000=+0.000445) |
Versus s82000, the local-slice training-objective BPB dropped -0.005308 (full K=4) and -0.005332 (single s3). Sub-0.725 is crossed for the first time across the full table.

Cumulative warmdown-phase totals: full K=4 BPB s30000 → s83923 = -0.133244 (a 15.5% relative reduction). Per-stage full accuracy s32000 → s83923 = +0.0580 (5.80 percentage points).
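Under the standard speculative-decoding analysis (i.i.d. per-token acceptance), the acceptance rate α and draft length m determine how many tokens one draft-verify cycle yields. The sketch below is that idealized model; it ignores drafting and verification compute, which is one reason the measured 1.61x end-to-end speedup is lower than the raw token multiplier:

```python
# Idealized expected tokens per draft-verify cycle: sum_{i=0..m} alpha^i,
# i.e. (1 - alpha^(m+1)) / (1 - alpha) for alpha < 1 (accepted draft tokens
# plus the one token emitted by the verification step).
def expected_tokens_per_cycle(alpha, m):
    if alpha >= 1.0:
        return float(m + 1)
    return (1 - alpha ** (m + 1)) / (1 - alpha)

print(expected_tokens_per_cycle(0.9852, 4))  # ~4.85 of a possible 5 tokens
```

With the card's measured α=0.9852 and m=4, nearly every cycle accepts the whole stage-0 draft; the gap down to 1.61x end-to-end reflects the real cost of running both the draft and the full verify passes.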
Per-stage target accuracy across the analyzed warmdown ckpts (s52000 skipped from analysis):
| step | s0 | s1 | s2 | s3 | full |
|---|---|---|---|---|---|
| s32000 | 0.4258 | 0.4279 | 0.4281 | 0.4280 | 0.4280 |
| s34000 | 0.4266 | 0.4279 | 0.4281 | 0.4281 | 0.4280 |
| s36000 | 0.4303 | 0.4311 | 0.4310 | 0.4310 | 0.4311 |
| s38000 | 0.4336 | 0.4342 | 0.4345 | 0.4347 | 0.4347 |
| s40000 | 0.4380 | 0.4394 | 0.4392 | 0.4393 | 0.4394 |
| s42000 | 0.4388 | 0.4403 | 0.4406 | 0.4407 | 0.4407 |
| s44000 | 0.4395 | 0.4400 | 0.4403 | 0.4405 | 0.4403 |
| s46000 | 0.4397 | 0.4407 | 0.4410 | 0.4414 | 0.4410 |
| s48000 | 0.4419 | 0.4427 | 0.4427 | 0.4430 | 0.4429 |
| s50000 | 0.4456 | 0.4466 | 0.4471 | 0.4471 | 0.4470 |
| s54000 | 0.4500 | 0.4512 | 0.4515 | 0.4521 | 0.4515 |
| s56000 | 0.4531 | 0.4542 | 0.4547 | 0.4549 | 0.4546 |
| s58000 | 0.4535 | 0.4547 | 0.4549 | 0.4550 | 0.4550 |
| s60000 | 0.4563 | 0.4570 | 0.4573 | 0.4578 | 0.4573 |
| s62000 | 0.4590 | 0.4599 | 0.4603 | 0.4608 | 0.4602 |
| s64000 | 0.4626 | 0.4634 | 0.4639 | 0.4642 | 0.4639 |
| s66000 | 0.4643 | 0.4652 | 0.4652 | 0.4655 | 0.4651 |
| s68000 | 0.4674 | 0.4685 | 0.4687 | 0.4688 | 0.4687 |
| s70000 | 0.4697 | 0.4706 | 0.4708 | 0.4711 | 0.4708 |
| s72000 | 0.4711 | 0.4720 | 0.4722 | 0.4726 | 0.4721 |
| s74000 | 0.4731 | 0.4741 | 0.4745 | 0.4748 | 0.4745 |
| s76000 | 0.4763 | 0.4775 | 0.4777 | 0.4782 | 0.4776 |
| s78000 | 0.4786 | 0.4798 | 0.4799 | 0.4804 | 0.4801 |
| s80000 | 0.4812 | 0.4820 | 0.4824 | 0.4829 | 0.4824 |
| s82000 | 0.4824 | 0.4830 | 0.4835 | 0.4837 | 0.4835 |
| s83923 | 0.4844 | 0.4856 | 0.4859 | 0.4862 | 0.4860 |
Cumulative s32 → s82 full acc gain: +0.0555. The 0.05 cumulative milestone was crossed at s78k; the 0.48 boundary was first crossed at s80k (s2/s3/full = 0.4824 / 0.4829 / 0.4824); s82k adds a further mild +0.0011.
Sample-level outputs at s78000 (greedy, temp=0.0, 60 tokens):

- Capital of France: "Paris. It is the largest city in France and the capital of the country. Paris is the seat of the government, the seat of the French Academy, and the seat of the European Union. It is also the seat of the United Nations. Paris is the second largest city in the European Union." (Paris ✓; "largest city in France" ✓; "seat of government" ✓; "seat of French Academy" ✓, the Académie française is in Paris; "seat of European Union" wrong; "seat of UN" wrong; mixed factual quality)
- Gold symbol: "Au. Gold is a soft, malleable, ductile, highly unreactive, precious, yellow, ductile, malleable, ..." (Au ✓; "highly unreactive" ✓, correcting the s72 "highly reactive" error; precious + yellow + soft + malleable + ductile correct; output then degenerates into a ductile/yellow/malleable repetition loop)
- Friday → tomorrow: "Saturday. If you are a Christian, then you know that the Bible says that God created the world on the sixth day of Creation Week. If you are a Muslim, then you know that the Quran says ... If you are a Jew ..." (Saturday is incorrect; correct is Sunday; religious tangent about Creation Week)
- Opposite of hot: "cold. The opposite of cold is hot. ..." (binary loop)
- Planets list: "the bodies that orbit the Sun. The planets are the only bodies in the solar system that have atmospheres. The planets are named after the Roman gods of the Greek pantheon. The planets are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune." (FIRST TIME the response produces a complete and correct modern 8-planet list across the entire trajectory: Mercury through Neptune, no Pluto, in correct order; "named after the Roman gods of the Greek pantheon" ✗; richest planets output by a wide margin)
- Color: "red. I love the color red. ..." (red + repetition; first non-blue since s70)
- 5x + 3 = 13: "x is equal to 1.5. ..." (1.5 wrong; correct is 2; closer than s76's "1/3"; cleaner format, no MC, no equation-echo loop)
Sample-level outputs at s76000:
- Capital France: "Paris. ... largest city in France / 3rd largest in Europe / 2nd most populous / 2nd most visited after London"
- Gold: Au + atomic number 79 ✓ + comprehensive properties + use list (jewelry/coins/electronics/dentistry/medicine)
- Friday: "Saturday" + Matrix simulation drift
- Planets: "objects that orbit the Sun" generic only
- Algebra: "x = 1/3" single fractional answer (wrong)
The s32000 → s78000 pattern across 24 analyzed warmdown checkpoints: per-stage accuracy increased across 22 of 23 2k-step windows (s44 alone broke; s58 was near-flat). Local-slice training-objective BPB descended non-monotonically (s34/s38/s44 produced positive deltas; the rest negative). The s78 planets prompt produced the first complete and correct modern 8-planet list across the entire trajectory (Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune in correct order; no Pluto). The s76 gold prompt was the first correct atomic number 79 in a comprehensive Au response; s78 corrects the s72 "highly reactive" error to "highly unreactive" ✓. Routing crossed the 90% boundary at s76 (90.76% peak at cap=0.020) and pulled back to 83.79% at s78. Cumulative full-stack acc gain s32 → s78 = +0.0521 (0.05 milestone crossed at s78).
## Trajectory and findings (s8000 → s80000)

This is a research release; we publish per-checkpoint experiment data so that the trajectory of PoE behavior is externally auditable. The 8-shard local-val BPB and per-checkpoint WAND bounds are first-class artifacts of each branch.
### BPB trajectory
| step | training-obj. full K=4 BPB (8 shards) | training-log val BPB (12 ranks) | comment |
|---|---|---|---|
| s8,000 | 0.886647 | 0.943905 | early plateau exiting |
| s12,000 | 0.879752 | 0.931519 | mid-training, oscillation begins |
| s14,000 | 0.872835 | 0.931956 | first local low on 8-shard slice |
| s16,000 | 0.878514 | 0.927683 | regression on 8-shard slice (recovery on training-log) |
| s18,000 | 0.877102 | 0.922033 | training-log prior minimum |
| s20,000 | 0.875179 | 0.922640 | 8-shard recovery in progress |
| s22,000 | 0.874345 | 0.923003 | gap to s14k baseline now +0.0015; both slices in agreement |
| s24,000 | 0.866000 | 0.923228 | largest 2k-step drop in trajectory (-0.0083); crossed below prior s14k floor |
| s24,500 | (not run) | 0.921583 | training-log new min |
| s26,000 | 0.863433 | 0.922316 | local slice min through this point; per-stage acc -0.0018; routing 68.26% |
| s26,500 | (not run) | 0.920499 | training-log new min |
| s27,000 | (not run) | 0.917032 | training-log new min |
| s27,500 | (not run) | 0.915166 | training-log new min (5-step streak) |
| s28,000 | 0.858138 | 0.918785 | local slice min through this point; per-stage acc +0.0035; spec α 0.9882; crossover gap +0.000073 |
| s29,000 | (not run) | 0.915006 | training-log new min |
| s30,000 | 0.857982 | 0.913886 | both slices min through this point; first warmdown ckpt (lrm ≈ 0.988); routing 68.39% |
| s30,500 | (not run) | 0.905847 | training-log new min; -0.008 single-step jump |
| s31,000 | (not run) | 0.906107 | small bounce in [0.902, 0.907] band |
| s31,500 | (not run) | 0.906766 | continued |
| s32,000 | 0.847991 | 0.902863 | -0.0100 local slice drop vs s30000; per-stage acc +0.0044 uniform; routing 74.33% (+5.94%); 4 prompts (Friday chain, modern planets, single-integer algebra, antonym graph) produced new output forms vs s30000 |
| s32,500 | (not run) | 0.907459 | reversal at upper band edge |
| s33,000 | (not run) | 0.906124 | oscillation in [0.902, 0.907] band; lrm ≈ 0.94 |
| s33,500 | (not run) | 0.902453 | training-log new min (9th); lrm ≈ 0.93 |
| s34,000 | 0.853621 | 0.904607 | local slice +0.0056 vs s32000; routing -13.11%; algebra prompt produced "A. 2 / B. 3 / C. 4" multiple-choice format |
| s35,000 | (not run) | 0.901891 | training-log new min (10th) |
| s35,500 | (not run) | 0.895684 | training-log new min (11th); first sub-0.9; -0.0062 vs s35000 |
| s36,000 | 0.841936 | 0.895738 | -0.0117 local slice drop vs s34000; per-stage acc +0.003 uniform; routing 67.63%; WAND bounds -5~-6% vs s34000 |
| s37,500 | (not run) | 0.893594 | training-log new min (12th); first sub-0.895 |
| s38,000 | 0.842198 | 0.896005 | local slice +0.000262 vs s36000 (first +Δ this warmdown); per-stage acc +0.0036 uniform; spec α 0.9483 (+0.0189); routing 72.06% (+4.4%); WAND bounds +13~14%; crossover gap +0.000476 → +0.000067 |
| s39,500 | (not run) | 0.891012 | training-log new min (13th) |
| s40,000 | 0.831678 | 0.890430 | local slice -0.010520 vs s38000 (full K=4); training-log -0.005575 (sub-0.89 first); per-stage acc +0.0047 uniform; spec α 0.9822 (+0.0339); routing 73.19% (+1.13%); WAND bounds -7%/-6.4%/-6.6%; crossover gap +0.000346; algebra prompt produced "x is equal to 2" (correct) |
| s41,500 | (not run) | 0.885587 | training-log new min (14th) |
| s42,000 | 0.826550 | 0.884868 | local slice -0.005128 vs s40000 (full K=4); training-log -0.005562 (sub-0.885 first); per-stage acc +0.0013 uniform; spec α 0.9403 (-0.0419); routing 68.90% (-4.29%); WAND bounds split (-3.4% / +2.5% / +5.8%); crossover gap +0.000109 |
| s44,000 | 0.828415 | 0.882684 | local slice +0.001865 vs s42000 (slice/log disagree on direction at s44); training-log -0.002184; per-stage acc s0 +0.0007 / s1-s3 -0.0002~-0.0003; spec α 0.9294 (-0.0109); routing cap=0.020 77.57% (+8.67%, surpasses s32000 prior peak 74.33%); WAND bounds widened uniformly (+12.3% / +8.3% / +5.4%); crossover gap +0.000038 |
| s44,500 | (not run) | 0.877723 | training-log new min; -0.0050 single-step drop |
| s46,000 | 0.827073 | 0.878585 | local slice -0.001342 vs s44000 (reverts s44 +Δ); training-log -0.004099; per-stage acc +0.0002~+0.0009 (monotonic increase resumes; s32→s46 cumulative full acc gain +0.0130); spec α 0.9294 (unchanged); routing cap=0.020 70.97% (-6.60%, reverts s44 jump); WAND bounds reverted (-6.4% / -3.7% / -5.3%); crossover gap +0.000158 |
| s46,500 | (not run) | 0.873460 | training-log new min |
| s47,000 | (not run) | 0.871953 | training-log new min |
| s48,000 | 0.821333 | 0.874640 | local slice -0.005740 vs s46000; training-log -0.003945; per-stage acc +0.0016~+0.0022; spec α 0.9320 (+0.0026); routing cap=0.020 79.18% (+8.21%, NEW trajectory peak); WAND bounds mildly widened (+3.7% / +0.9% / +2.9%); crossover gap +0.000303 |
| s48,500 | (not run) | 0.868639 | training-log new min |
| s49,500 | (not run) | 0.867675 | training-log new min |
| s50,000 | 0.811991 | 0.865501 | local slice -0.009342 vs s48000; training-log -0.009139; per-stage acc +0.0037~+0.0044 (largest single-step acc gain through s50); spec α 0.9852 (+0.0532, second-highest in trajectory after s28000=0.9882); routing cap=0.020 77.33% (-1.85% vs s48 peak); WAND bounds mildly narrowed (-2.2% / -2.3% / -3.9%); crossover gap +0.000107 |
| s51,500 | (not run) | 0.852479 | training-log new min; -0.0079 single-step drop (largest 500-step descent in trajectory) |
| s52,000 | (not run) | 0.858810 | epoch 1 ended around this region; pq_idx wrapped to 0 entering epoch 2 |
| s54,000 | 0.801717 | 0.855220 | first ckpt analyzed in epoch 2; local slice -0.010274 vs s50000 (cumulative s48→s54 full K=4: -0.019616); training-log -0.010281; per-stage acc +0.0044~+0.0050 (LARGEST single-step acc gain in trajectory; 0.45 boundary first crossed at s3=0.4521); spec α 0.9566; routing cap=0.020 73.57%; WAND bounds widened uniformly (+7.6% / +7.7% / +12.4%); crossover gap +0.000023 (LOWEST in entire trajectory) |
| s55,500 | (not run) | 0.851502 | training-log new min |
| s56,000 | 0.795069 | 0.849220 | local slice -0.006648 vs s54000 (sub-0.80 first crossed on full K=4); training-log -0.005978 (0.85 boundary first crossed); per-stage acc +0.0028~+0.0032 (0.455 boundary first crossed at s3=0.4549); spec α 0.9430; routing cap=0.020 79.77% (+6.20%, NEW trajectory peak); WAND bounds mildly narrowed (-4.0% / +0.0% / -3.4%); crossover gap +0.000359 |
| s58,000 | 0.792391 | 0.844465 | local slice -0.002678 vs s56000 (smallest single-step descent since s44→s46); per-stage acc +0.0001~+0.0005 (smallest gain since s32→s34, essentially flat); spec α 0.8784 (-0.0646, lowest since s20000=0.9086); WAND bounds widened uniformly (+10.6% / +10.6% / +10.0%; bound 1→2 = 2.0701 trajectory single-bound high); crossover gap +0.000194; multiple sample-level regressions co-occur |
| s60,000 | 0.785370 | 0.840422 | local slice -0.007021 vs s58000 (descent rate recovers); training-log -0.004043; per-stage acc +0.0023~+0.0028 (resumes growth); spec α 0.9402 (+0.0618, recovers from s58 low); routing cap=0.020 80.92% (+4.16%, NEW trajectory peak surpassing s56=79.77%); WAND bounds substantially narrowed (-12.1% / -13.0% / -12.9%, s58 widening fully reverts); crossover gap +0.000553 |
| s60,500 | (not run) | 0.834882 | training-log new min; -0.0055 single-step drop |
| s61,000 | (not run) | 0.833518 | training-log new min |
| s62,000 | 0.779053 | 0.832344 | local slice -0.006317 vs s60000 (sub-0.78 first crossed); training-log -0.008078; per-stage acc +0.0027~+0.0030 (0.46 boundary first crossed); spec α 0.9795 (third-highest in trajectory); routing cap=0.020 77.43%; WAND bounds essentially flat; crossover gap +0.000471 |
| s63,000 | (not run) | 0.829964 | training-log new min (sub-0.83 first) |
| s64,000 | 0.772334 | 0.825504 | local slice -0.006719 vs s62000; per-stage acc +0.0034~+0.0037 (s32→s64 cumulative full acc gain +0.0359); spec α 0.9483; routing cap=0.020 80.51%; WAND bounds mildly narrowed |
| s64,500 | (not run) | 0.823171 | training-log new min (sub-0.824) |
| s66,000 | 0.769515 | (s65500=0.823685; s66000 not yet observed at probe time) | local slice -0.002819 vs s64000 (smaller magnitude; descent rate decelerating); per-stage acc +0.0012~+0.0018 (s32→s66 cumulative full acc gain +0.0371); spec α 0.9738 (+0.0255, recovers toward s62 third-place); routing cap=0.020 83.21% (+2.70%, NEW trajectory peak surpassing prior s60=80.92%); WAND bounds mildly narrowed (-1.5% / -3.1% / -4.8%); crossover gap +0.000553 |
| s68,000 | 0.761794 | 0.814863 | local slice -0.007721 vs s66000 (largest single-window descent of the warmdown phase); training-log -0.009; per-stage acc +0.0031~+0.0036 (s32→s68 cumulative full acc gain +0.0407); spec α 0.9162 (-0.0576; sits between s60=0.9402 and the s58 trajectory low 0.8784); routing cap=0.020 82.07% (-1.14pp; below s66 peak); WAND bounds widened uniformly (+13.85% / +16.74% / +18.10%); crossover gap +0.000326 |
| s70,000 | 0.757416 | 0.806862 | local slice -0.004378 vs s68000 (moderate descent); training-log -0.008; per-stage acc +0.0021~+0.0023 (0.47 boundary first crossed; s32→s70 cumulative full acc gain +0.0428); spec α 0.9139 (-0.0023, basically flat-low; the s68 drop did not recover); routing cap=0.020 83.45% (+1.38pp, NEW trajectory peak surpassing prior s66=83.21%); WAND bounds narrowed uniformly (-10.34% / -11.01% / -8.56%, fully reverting the s68 widening); crossover gap +0.000268 |
| s72,000 | 0.755075 | 0.802531 | local slice -0.002341 vs s70000 (mild descent); training-log -0.004; per-stage acc +0.0013~+0.0015 (s32→s72 cumulative full acc gain +0.0441); spec α 0.9708 (+0.0569, sharp recovery from the s68/s70 low pair; the s58 single-window recovery pattern repeats with a two-window delay across s68→s72); routing cap=0.020 84.31% (+0.86pp, NEW trajectory peak surpassing prior s70=83.45%); WAND bounds mildly re-widened (+1.43% / +4.56% / +2.37%, well below the s68 widened regime); crossover gap +0.000306 |
| s74,000 | 0.749713 | 0.797773 | local slice -0.005362 vs s72000 (moderate descent; sub-0.75 first crossed); training-log -0.005; per-stage acc +0.0020~+0.0024 (s32→s74 cumulative full acc gain +0.0465); spec α 0.9484 (-0.0224, pulls back from the s72 recovery but stays well above the s68/s70 low pair); routing cap=0.020 86.52% (+2.21pp, NEW trajectory peak surpassing prior s72=84.31%); WAND bounds widened moderately (+8.47% / +9.50% / +6.43%, second-largest single window since s58); crossover gap +0.000204 (lowest since the s50-s54 era); algebra prompt produced the first "Explanation: 5x = 13 - 3" algebraic-step structure since s62 |
| s76,000 | 0.741818 | 0.793753 | local slice -0.007895 vs s74000 (largest single-window descent since s66→s68 -0.0077; sub-0.745 first crossed); training-log -0.004; per-stage acc +0.0031~+0.0034 (s32→s76 cumulative full acc gain +0.0496); spec α 0.9568 (+0.0084, mild recovery); routing cap=0.020 90.76% (+4.24pp, NEW trajectory peak; 90% boundary first crossed); WAND bounds mildly narrowed (-1.33% / -3.18% / +0.15%); crossover gap +0.000403; gold prompt produced the first atomic # 79 ✓ embedded in a comprehensive Au + properties + use-list response across the trajectory; France prompt produced the best multi-fact response across the trajectory with no internal contradictions |
| s78,000 | 0.738087 | 0.789612 | local slice -0.003731 vs s76000 (mild descent; sub-0.74 first crossed); training-log -0.004; per-stage acc +0.0022~+0.0025 (s32→s78 cumulative full acc gain +0.0521; 0.05 cumulative milestone first crossed); spec α 0.9190 (-0.0378, pulls back from the s76 mild recovery); routing cap=0.020 83.79% (-6.97pp, major pullback from the s76 trajectory peak); WAND bounds split (+5.45% bound 0→1 / +0.30% bound 1→2 / -1.10% bound 2→3); crossover gap +0.000365; planets prompt produced the first complete and correct modern 8-planet list across the entire trajectory (Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune; no Pluto; "named after Roman gods of Greek pantheon" ✗); gold prompt corrected the s72 "highly reactive" error to "highly unreactive" ✓ |
| s80,000 | 0.732713 | 0.783510 | local slice -0.005374 vs s78000 (moderate descent; sub-0.735 / sub-0.733 first crossed); training-log -0.006; per-stage acc +0.0022~+0.0026 (s32→s80 cumulative full acc gain +0.0544; 0.48 boundary first crossed); spec α 0.9483 (+0.0293, recovers from the s78 pullback); routing cap=0.020 87.66% (+3.87pp, recovers but below the s76 peak); WAND bounds mildly narrowed (-1.43% / -1.20% / -1.88%); crossover gap +0.000445; France prompt produced the richest factual output across the trajectory (Paris + largest + north + Seine + culture/art/fashion + museums/parks/monuments, no errors); gold prompt added transition-metals classification ✓ for the first time across the trajectory; planets and algebra prompts regressed (s78 8-planet breakthrough not retained; algebra back to the "x is 5" pattern from the s64 era) |
Local-slice training-objective BPB across the warmdown ckpts: 0.857982 (s30k) → 0.847991 (s32k) → 0.853621 (s34k) → 0.841936 (s36k) → 0.842198 (s38k) → 0.831678 (s40k) → 0.826550 (s42k) → 0.828415 (s44k) → 0.827073 (s46k) → 0.821333 (s48k) → 0.811991 (s50k) → 0.801717 (s54k) → 0.795069 (s56k) → 0.792391 (s58k) → 0.785370 (s60k) → 0.779053 (s62k) → 0.772334 (s64k) → 0.769515 (s66k) → 0.761794 (s68k) → 0.757416 (s70k) → 0.755075 (s72k) → 0.749713 (s74k) → 0.741818 (s76k) → 0.738087 (s78k) → 0.732713 (s80k) → 0.730046 (s82k) → 0.724738 (s83923, final). Non-monotonic (s32→s34 +Δ, s38 +Δ, s44 +Δ). Through s83923 the cumulative drop from s30000 is -0.133244 (15.5% relative reduction).
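The cumulative and relative reduction quoted above can be recomputed from the series endpoints; a minimal sanity-check sketch using only the BPB values listed:

```python
# Sanity-check the cumulative BPB numbers quoted above (series endpoints only).
bpb_s30000 = 0.857982   # first warmdown checkpoint in the series
bpb_s83923 = 0.724738   # final checkpoint

cumulative_drop = bpb_s30000 - bpb_s83923
relative_reduction = cumulative_drop / bpb_s30000

print(f"drop = {cumulative_drop:.6f}")         # drop = 0.133244
print(f"relative = {relative_reduction:.1%}")  # relative = 15.5%
```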
Sample-level concept oscillation under near-monotone BPB improvement
Greedy continuations on a fixed 7-prompt probe set track concept-level retention separately from BPB:
| capability | s78000 | s76000 | s74000 | s72000 | s70000 | s68000 | s66000 | s64000 | s62000 | s60000 | s58000 | s56000 | s54000 | s50000 | s48000 | s46000 | s44000 | s42000 | s40000 | s38000 | s36000 | s34000 | s32000 | s30000 | s28000 | s26000 | s24000 | s22000 | s20000 | s18000 | s16000 | s14000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gold symbol → Au | ✓ Au + soft/malleable/ductile + "highly unreactive" ✓ + precious + yellow (then property repetition) | ✓ Au + atomic # 79 ✓ + soft/yellow/lustrous/malleable/ductile + uses (jewelry/coins/electronics/dentistry/medicine) | ✓ Au + soft/yellow/malleable/ductile + "easily cut with knife" ✗ | ✓ Au + soft/yellow/malleable/ductile + "highly reactive" wrong | ✓ Au + clean sentence repetition | ✓ Au + self-referential definition loop | ✓ Au + correct properties (soft/malleable/ductile/conductor) | ✓ Au + Wikipedia fact list (atomic # "19" wrong) | ✓ Au + 5 properties + Latin "aurum" | ✓ Au + "A and U" decomposition | ✗ "A" only | ✓ Au + sentence repetition (4th stable) | ✓ Au + sentence repetition (3rd stable) | ✓ Au + sentence repetition (stable) | ✓ Au + sentence repetition (no swap) | ✓ Au + 79-reference + jewelry | ✗ "79" only | ✗ "A" only + soft/malleable | ✓ Au + soft/malleable/jewelry | ✓ Au + "Au" loop | Au + yellow + soft/malleable | Au + ✗ "abundant" | ✓ + soft/malleable | ✓ but "abundant" wrong | ✓ + industries | ✓ + properties | ✓ | ✓ | ✓ | ✗ "24" | ✓ | ✓ |
| gold atomic number → 79 | (avoided in s78 sample) | ✓ "79" embedded in comprehensive Au response | (avoided) | (avoided) | (avoided) | (avoided) | (avoided) | ✗ "19" (wrong; potassium's number) | (avoided) | (avoided) | (avoided; truncated) | (avoided) | (avoided) | (avoided) | (avoided) | (s46 produced 79 within Au response) | (s44 produced 79 alone) | (avoided) | (avoided) | (avoided) | (avoided) | (avoided) | (avoided) | (avoided) | ✓ "79" | (avoided) | ✗ "24" | - | - | - | - | - |
| Friday→tomorrow → Sunday | ✗ "Saturday" + religious Creation Week drift | ✗ "Saturday" + Matrix simulation drift | ✗ "Saturday" + bizarre "rest of the week" meta-language | ✗ "Monday" + initial Mon→Fri loop then clean +1-day chain Fri→Sat→Sun→Mon→Tue | ✗ "Saturday" + alternating yesterday→today/tomorrow chain (+1/+2 confusion) | ✗ "Saturday" + alternating-frame all→Saturday | ✗ "Monday" + Internet topic drift | ✗ "Friday" + circular self-ref | ✗ "Saturday" + first logically correct +1-day chain | ✗ "Saturday" + chain logic broken | ✗ "Saturday" + alternating-framing | ✗ "Saturday" + clean self-repetition | ✗ "Monday" + mixed-framing | ✗ "Saturday" + correct +1-day chain | ✗ "Wednesday" + correct +1-day chain | ✗ "Saturday" + weekend continuation | ✗ "Tuesday" + bizarre temporal | ✗ "Saturday" + "100 years old" | ✗ "Saturday" + reverse-chain | ✓ "Sunday" + Sunday-school drift | ✗ narrative drift | ✗ + narrative drift | ✗ first ans, +1-day chain | ✗ infinite loop | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Full planet list (Mercury…) | ✓ Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune (modern 8, no Pluto, correct order) + named after Roman gods of Greek pantheon ✗ | ✗ "objects orbit the Sun" generic | ✗ "Sun is the star at the center" | ✗ "objects orbit Sun + Sun is center of solar system" | ✗ "only objects with solid surface" (factually wrong) | ✗ "objects orbit Sun + classified terrestrial/gas giants" generic | ✗ "objects that orbit the sun" generic (regression) | ✗ terrestrial/gas-giants split + first names: Mercury/Venus/Earth/Mars/Moon (Moon wrong) | ✗ "Jupiter largest at farthest" | ✗ "near/far from Sun" structure | ✗ "all in same orbit" | ✗ inner/outer/rocky/gas-giants taxonomy (no names) | ✗ "named after gods/goddesses" | ✗ "most common objects in universe" | ✗ "closest to sun, most massive" | ✗ "named after Greek god of sky" | ✗ "most diverse in universe" | ✗ "most massive bodies" | ✗ "orbit the sun" generic | ✗ "state of flux" | ✗ Sun/Moon included; Pluto/Venus/Mercury absent | ✗ Pluto re-added + ice/water | ✓ modern 8 (no Pluto) | ✗ Sun+Moon+Pluto+Belt | ✗ full 9 + Charon | ✗ Earth dropped | full 9 | 9+Charon+belt | inner 4 | full 9 | partial | partial |
| Math 5x + 3 = 13 → x = 2 | "x is equal to 1.5" single fractional answer (wrong; closer to 2 than s76 "1/3") | "x is equal to 1/3" single fractional answer (wrong) | 5-option MC w/ "Explanation: 5x = 13 - 3"; first algebraic-step structure since s62 (truncated) | 4-option MC duplicate "A.5 B.3 C.5 D.3" (correct value 2 absent) | "two solutions x=1 and x=13" (both wrong; 13 as one solution) | 5-option MC fractional choices A=1/3..E=1/4 (correct value 2 absent; D and E duplicate "2/3") | 5-option MC "A.1 B.2 C.3 D.4 E.5" (correct B=2 enumerated, not selected) | "x is 5. The answer is 5." (coefficient confusion) | "5x = 13-3 / 5x = 8 / x = 8/5 / x = 1" (first algebraic-step) | "A.3/B.4/C.5/D.6 / The answer is C." MC | "A.1.5/.../D.4.5 / The answer is B. 2.5" MC | "method of substitution" instruction | "A.3/B.4/.../H.10" 8-option MC | "x is 3" | "x is equal to 13/5" (treats 5x=13) | "x is 3" | "5 times as big as 3" + echo | "5x+3=13" echo | "x is equal to 2" (correct) | "x = 1" | "x is:" truncated | "A. 2 / B. 3 / C. 4" choices | "3" single integer | "multiple of 13" | "5x+3" circular | MC D=75 | "a square" | "13 times bigger" | "5","3" | "prime" | "3.5" | "factor 13" |
| Capital of France → Paris | ✓ Paris + largest in France ✓ + seat of govt ✓ + seat of French Academy ✓ (+ seat of EU/UN wrong, 2nd largest in EU debatable) | ✓ Paris + largest in France ✓ + 3rd in Europe + 2nd populous + 2nd visited after London (best multi-fact) | ✓ Paris + cascading wrong superlatives (1st/2nd/3rd contradicting) | ✓ Paris + north + Île-de-France + seat of govt + largest city + 10th in world | ✓ Paris + "largest city / most populous / Paris region / north" multi-fact | ✓ Paris + sentence repetition only (Île-de-France/Seine lost) | ✓ Paris + Île-de-France ✓ + Seine ✓ + north (richest factual output yet) | ✓ Paris + "capital of EU" loop | ✓ Paris + factual world-capital fragments | ✓ Paris + "capital of the world" loop | ✓ Paris + Hauts-de-Seine | ✓ Paris + Seine-et-Marne | ✓ Paris + sentence | ✓ Paris + "French Empire" | ✓ Paris + sentence | ✓ Paris + degenerate "Paris, Paris" loop | ✓ Paris + "south / 2nd largest" | ✓ Paris + "Europe/world largest" | ✓ Paris + "world capital" | ✓ Paris + "EU / largest city" | ✓ Paris + spurious extras | ✗ "French Republic" | ✓ + "most important city" | ✓ + UK/US loop | ✓ "Paris" | ✗ "south of France" | ✗ "2nd largest world" | - | - | - | - | - |
| Favorite color | red + "I love the color red" repetition | blue + "I love the color blue" repetition | blue + "I love the color blue" repetition | blue + "I love the color blue" repetition | red + "looks/feels/makes me feel" multi-clause | blue + "feel/think" two-clause alternation | blue + "I love the way" multi-clause | blue + multi-sense description | blue + "I love the way" loop | blue + clothing/household nouns | blue + clothing nouns | red + "I love the way" loop | red + "I love the way" loop | blue + "blue-eyed monster" | blue + "sky/water/clouds" | blue + "sky/ocean" | purple + "beautiful and mysterious" | blue + "I love blue" loop | red + "movie" loop | red + "I love red" loop | blue + "calming/soothing" | blue + "blue friends" | blue (positive) | red (dark) | blue | black | red | - | - | - | - | - |
| Antonym graph (hot→) | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | hot/cold/warm/cool multi-hop chain | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/heat binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot binary loop | cold/hot/dry/wet/windy chain | cold/hot binary | cold/cold loop | cold/warm/dry/moist/wet chain | cold→hot loop | - | - | - | - | - | - | - | - |
Specific factual tokens swing in and out of top-1 between checkpoints even as token-averaged BPB improves. This is the long-tail-vs-frequent-token tradeoff: BPB is dominated by the bulk of frequent-token predictions, where a small calibration sharpening can hide rare-token rank shifts.
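A toy illustration of this effect (with hypothetical numbers, not measurements of this model): because average loss is dominated by frequent tokens, a small improvement there can outweigh a large regression on a rare token whose top-1 prediction flips:

```python
# Toy illustration (hypothetical numbers, NOT measurements of this model):
# mean loss is dominated by frequent tokens, so it can improve even while a
# rare token's loss regresses enough to flip its top-1 prediction.
freq_w, rare_w = 0.999, 0.001           # frequency shares of the two token classes

loss_freq_a, loss_freq_b = 2.00, 1.95   # frequent tokens sharpen slightly
loss_rare_a, loss_rare_b = 4.00, 9.00   # rare token regresses badly (rank shift)

mean_a = freq_w * loss_freq_a + rare_w * loss_rare_a   # checkpoint A
mean_b = freq_w * loss_freq_b + rare_w * loss_rare_b   # checkpoint B

print(mean_b < mean_a)   # True: the average still improves at B
```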
Sample-level output changes through s30000: s24000 produced wrong "atomic number 24" (gold prompt); s26000 produced "south of France" (capital prompt) and dropped Earth from the planet list; s30000 added Sun/Moon/Kuiper Belt to the planet list and looped on the Friday prompt.
At s32000 the four prompts produced new output forms versus prior checkpoints:
- Calendar: "Saturday" first-answer (still incorrect) followed by a +1-day chain continuation ("Saturday β Sunday β Monday β ...").
- Planets: 8-planet list (Mercury through Neptune), no Pluto, no Sun/Moon.
- Algebra 5x + 3 = 13: "3" single integer. Truth is 2.
- Antonym: "hot/cold/warm/dry/moist/wet" multi-token continuation.
At s40000 the algebra prompt produced "x is equal to 2", the first checkpoint to produce the correct answer. Subsequent checkpoints produced an equation echo (s42), "5 times as big as 3" (s44), "x is 3" (s46/s50), "x is equal to 13/5" (s48), and an 8-option multiple-choice format A-H without 2 in the choices (s54). The gold-symbol prompt evolved Au+properties (s40) → "A" (s42) → "79" (s44) → "Au+79-ref+properties" (s46) → stable "Au + sentence repetition" (s48 / s50 / s54). The Friday prompt at s48/s50/s54 produces an incorrect first answer (Wednesday/Saturday/Monday), but the continuation produces a +1-day chain across 7 days at s48/s50 and a mixed-framing chain at s54. Color choice across s38-s54: red/red/blue/purple/blue/blue/red.
The dataloader is sequential (pq_idx advances monotonically through 848 shards); s44000 has seen pq_idx ≈ 719. The same prompt set will be re-run at s50000, s83923.
Speculative-decoding acceptance trend
| step | E2 acceptance α | end-to-end speedup |
|---|---|---|
| s8000 | 0.9539 | 1.54x |
| s12000 | 0.9539 | 1.54x |
| s14000 | 0.9652 | 1.55x |
| s16000 | 0.9853 | 1.59x |
| s18000 | 0.9375 | 1.52x |
| s20000 | 0.9086 | 1.48x |
| s22000 | 0.9511 | 1.54x |
| s24000 | 0.9824 | 1.58x |
| s26000 | 0.9737 | 1.59x |
| s28000 | 0.9882 | 1.60x |
| s30000 | 0.9824 | 1.58x |
| s32000 | 0.9348 | 1.52x |
| s34000 | 0.9320 | 1.51x |
| s36000 | 0.9294 | 1.51x |
| s38000 | 0.9483 | 1.53x |
| s40000 | 0.9822 | 1.57x |
| s42000 | 0.9403 | 1.54x |
| s44000 | 0.9294 | 1.52x |
| s46000 | 0.9294 | 1.52x |
| s48000 | 0.9320 | 1.51x |
| s50000 | 0.9852 | 1.59x |
| s54000 | 0.9566 | 1.56x |
| s56000 | 0.9430 | 1.54x |
| s58000 | 0.8784 | 1.45x |
| s60000 | 0.9402 | 1.52x |
| s62000 | 0.9795 | 1.58x |
| s64000 | 0.9483 | 1.53x |
| s66000 | 0.9738 | 1.58x |
| s68000 | 0.9162 | 1.50x |
| s70000 | 0.9139 | 1.50x |
| s72000 | 0.9708 | 1.58x |
| s74000 | 0.9484 | 1.54x |
| s76000 | 0.9568 | 1.56x |
| s78000 | 0.9190 | 1.51x |
| s80000 | 0.9483 | 1.54x |
| s82000 | 0.9377 | 1.53x |
| s83923 | 0.9852 | 1.61x |
Drafter acceptance is non-monotone across the trajectory: it declined s16k → s20k, rose through s22k → s28k (peak 0.9882 at s28k), drifted s32-s36 (0.93 range), rose through s38k-s50k with intermittent dips, dropped to 0.8784 at s58k (lowest since s20k), recovered through s60-s62, and oscillated through s64-s78, with the s68/s70 low pair (0.9162 / 0.9139) standing out as a localized regime change followed by partial recovery (s72=0.9708, s74=0.9484, s76=0.9568, s78=0.9190). End-to-end speedup has stayed within 1.45-1.61x across all 37 measured checkpoints.
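Under the standard speculative-decoding analysis, a k-token draft whose tokens are each accepted independently with probability α yields (1 − α^(k+1)) / (1 − α) expected tokens per verification step. A rough sketch of how α maps to accepted-token count for the `k_draft=4` setting used here (the i.i.d.-acceptance assumption is a simplification, and actual end-to-end speedup also depends on drafter/verifier cost, which this sketch ignores):

```python
def expected_accepted(alpha: float, k_draft: int) -> float:
    """Expected tokens produced per verification step when each of k_draft
    drafted tokens is accepted independently with probability alpha
    (standard speculative-sampling analysis; ignores drafter/verifier cost)."""
    if alpha == 1.0:
        return float(k_draft + 1)  # all drafts accepted + one bonus token
    return (1 - alpha ** (k_draft + 1)) / (1 - alpha)

# Roughly the observed acceptance range in the table above.
for alpha in (0.88, 0.94, 0.985):
    print(alpha, round(expected_accepted(alpha, k_draft=4), 2))
```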
Confidence-aware routing trend
| step | routed @ cap=0.020 | projected speedup |
|---|---|---|
| s8000 | 59.94% | 1.416x |
| s12000 | 63.70% | 1.454x |
| s14000 | 63.37% | 1.450x |
| s16000 | 63.07% | 1.447x |
| s18000 | 66.04% | 1.478x |
| s20000 | 67.05% | 1.489x |
| s22000 | 61.21% | 1.428x |
| s24000 | 67.35% | 1.493x |
| s26000 | 68.26% | 1.503x |
| s28000 | 62.45% | 1.441x |
| s30000 | 68.39% | 1.504x |
| s32000 | 74.33% | 1.573x |
| s34000 | 61.22% | 1.429x |
| s36000 | 67.63% | 1.496x |
| s38000 | 72.06% | 1.546x |
| s40000 | 73.19% | 1.559x |
| s42000 | 68.90% | 1.510x |
| s44000 | 77.57% | 1.613x |
| s46000 | 70.97% | 1.533x |
| s48000 | 79.18% | 1.634x |
| s50000 | 77.33% | 1.610x |
| s54000 | 73.57% | 1.564x |
| s56000 | 79.77% | 1.642x |
| s58000 | 76.76% | 1.603x |
| s60000 | 80.92% | 1.657x |
| s62000 | 77.43% | 1.611x |
| s64000 | 80.51% | 1.652x |
| s66000 | 83.21% | 1.688x |
| s68000 | 82.07% | 1.673x |
| s70000 | 83.45% | 1.692x |
| s72000 | 84.31% | 1.704x |
| s74000 | 86.52% | 1.736x |
| s76000 | 90.76% | 1.801x |
| s78000 | 83.79% | 1.697x |
| s80000 | 87.66% | 1.753x |
| s82000 | 86.08% | 1.729x |
| s83923 | 85.05% | 1.715x |
Position-level top-1 routing fraction (cap=0.020) and speculative acceptance α track different slices of the trunk's confidence distribution: routing reads the margin at boundary positions; spec acceptance reads step-by-step alignment between stage 0 and the full stack. Through s40000 they moved in different directions in some windows and in the same direction in others. The late-trajectory routing fraction progressed s40k=73% → s60k=81% → s76k=91% (peak) → s78k=84%: stage 0 alone suffices for 84-91% of positions within a 2% accuracy-regression budget across the late warmdown, with the s76 peak followed by an s78 pullback. Speculative acceptance α has been more volatile (0.88-0.99 range) but remains in a regime where a 4-token speculative draft delivers a consistent 1.45-1.61x end-to-end speedup.
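The routing rule can be sketched as a per-position decision: emit the stage-0 prediction when its top-1 confidence margin clears a threshold, otherwise fall through to the full stack. A minimal illustration on plain probability vectors (the margin statistic and threshold here are illustrative assumptions, not the release's exact criterion):

```python
def route_position(stage0_probs, margin_threshold=0.25):
    """Return True if stage 0 alone should answer this position.
    Margin = p(top1) - p(top2); a large margin means the shallow tier is
    already confident. The threshold value is illustrative only."""
    top2 = sorted(stage0_probs, reverse=True)[:2]
    return (top2[0] - top2[1]) >= margin_threshold

confident = [0.85, 0.05, 0.05, 0.05]   # stage 0 is sure -> route shallow
uncertain = [0.40, 0.35, 0.15, 0.10]   # small margin -> run the full stack

print(route_position(confident))   # True
print(route_position(uncertain))   # False
```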
Stage diversity probe: early vs late trajectory
Early trajectory: s14000 head decomposition
Inference-time analysis of lm_head_stages[k].weight at s14000 (results essentially unchanged at s20000):
- SVD top-1 alignment: the dominant left singular vectors of stages s1, s2, s3 are mutually identical (cosine ≈ 1.000); stage s0 is anti-aligned (cosine ≈ -0.98). The 4 stages collapse into a 2-cluster structure {s0} vs {s1, s2, s3}.
- Gram-Schmidt orthogonalization: 77.2% of s1, 91.8% of s2, 92.0% of s3 weight projects onto the span of earlier stages. Only ~38% of total per-stage parameter budget carries unique information.
- Single-stage perturbation symmetry: turning OFF any single stage (β_k = 0) costs a uniform +0.0025-0.0030 BPB regardless of k; the stages are operationally interchangeable.
- β scaling sweep: the trained β = 1 inference rule is BPB-optimal but factual-recall-suboptimal. β = 2 recovers ~2× the gold-as-Au probability at a +0.05 BPB cost; β = 0 (dropping the stage delta entirely) costs +0.10 BPB.
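The two probes above (dominant-singular-vector alignment and sequential Gram-Schmidt residual mass) can be reproduced on any stack of head weight matrices. A minimal NumPy sketch of the measurement itself, run here on random matrices rather than the released heads; the flattened-matrix projection basis is one plausible operationalization, since the card does not specify the exact one used:

```python
import numpy as np

def top_singular_cosines(heads):
    """Pairwise cosines between the dominant left singular vectors of each head."""
    vecs = [np.linalg.svd(W, full_matrices=False)[0][:, 0] for W in heads]
    return [[float(u @ v) for v in vecs] for u in vecs]

def unique_fraction(heads):
    """Fraction of each head's Frobenius norm NOT explained by the span of the
    earlier heads (sequential Gram-Schmidt on flattened weight matrices).
    Illustrative operationalization of the probe, not the release's exact code."""
    basis, fracs = [], []
    for W in heads:
        w = W.ravel().astype(np.float64)
        r = w.copy()
        for b in basis:
            r -= (r @ b) * b          # remove the component along earlier heads
        fracs.append(float(np.linalg.norm(r) / np.linalg.norm(w)))
        if np.linalg.norm(r) > 1e-12:
            basis.append(r / np.linalg.norm(r))
    return fracs

rng = np.random.default_rng(0)
heads = [rng.standard_normal((8, 16)) for _ in range(4)]
print(top_singular_cosines(heads)[0][0])  # self-cosine ~ 1.0
print(unique_fraction(heads)[0])          # first head is always fully unique (1.0)
```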
Late trajectory: s76000 head decomposition
Re-running the same probes at s76000 (90.6% trained):
- SVD top-1 alignment: the cluster structure shifted from {s0} vs {s1, s2, s3} (s14k) to the depth-tier {s0, s1} vs {s2, s3} (s76k). Pairwise dominant-singular-vector cosines: s0-s1 = +0.977 (aligned), s2-s3 = +0.997 (aligned), {s0,s1}-{s2,s3} = -0.97 to -0.99 (anti-aligned). Stage 1 has migrated from the s1/s2/s3 cluster (early) into alignment with s0 (late). The boundary now corresponds to trunk depth: shallow tier (s0 at depth 16, s1 at depth 22) vs deep tier (s2 at depth 27, s3 at depth 32).
- Gram-Schmidt orthogonalization: unique residual norms grew from s14k {s1=22.8%, s2=8.2%, s3=8.0%} to s76k {s1=21.1%, s2=11.8%, s3=11.3%}. The total unique parameter budget increased from ~38% (s14k) to ~44% (s76k). Stages s2 and s3 each gained ~3 percentage points of unique content; stage s1 lost ~2pp.
- Top-singular-vector token list: s0 and s1 both load on suffix-like tokens ('TION', 'ATE', 'EAR', 'IAL', 'BER'); s2 and s3 load on shorter morpheme fragments ('UN', 'IT', 'PER', 'TH', 'EV', 'AL'). The shallow tier emphasizes longer suffix completions; the deep tier emphasizes finer morphemic refinement.
Reading: at s14000 the stages-as-experts story was degenerate: only stage 0 carried a distinct signal, and stages 1-3 were mutually redundant. By s76000 the structure has reorganized into a depth-tier specialization: shallow stages {s0, s1} cluster together and deep stages {s2, s3} cluster together, with non-trivial unique content in each later head (s2 / s3 each ~11% unique vs ~8% earlier). This is consistent with the late-trajectory routing improvement (the cap=0.020 fraction routed to stage 0 went from 73% at s40k to 91% at s76k): the shallow tier becomes confident enough to handle most positions, while the deep tier specializes on the residual ~9-15% where extra refinement is needed. The PoE-vs-single-s3 crossover gap remains small (+0.0002 to +0.0005), meaning the geometric-mean aggregation gives a measurable but modest improvement over the deepest single stage at every point in the trajectory. See cognica/Cognica-PoE-v1.0-1.3B-base (4 symmetric stages of 6 layers, shared lm_head only) for the diversity-vs-layout disambiguation.
Diversity vs layout
The asymmetric (16, 6, 5, 5) layout itself is a hypothesis on the input-variable side: stage 0's 50% trunk share gives stages 1-3 only shallow depth (5-6 layers each) on top of an already-refined representation, which structurally biases them toward refining stage 0's output rather than producing independent evidence. Whether the absence of diversity is caused by this layout or by the PoE training signal itself can be cleanly separated by comparing against the 1.3B symmetric (4×6, shared-head) release. Results of that comparison will be added when measured.
Advanced PoE inference helpers
All four PoE-specific inference modes are exposed directly on CognicaPoEForCausalLM. They re-forward the full prefix each decode step (no KV cache); wall-clock speedups come from reduced trunk depth.
```python
import torch

# 1. Single-stage prediction (uses head k at boundary k only).
logits = model.forward_stage(input_ids, stage=3)  # (B, T, V) float32

# 2. PoE-aggregated log-probabilities over the first K' stages.
log_p = model.forward_aggregated(input_ids, max_stages=2)  # log-softmax, shape (B, T, V)

# 3. Generation with prefix pruning (K' <= K stages, asymmetric trunk depth).
out = model.generate_prefix(input_ids, max_stages=1, max_new_tokens=64)
# K'=1 on (16,6,5,5) -> 16 trunk layers (~2.2x decode speedup)

# 4. Single-stage generation.
out = model.generate_stage(input_ids, stage=0, max_new_tokens=64)

# 5. WAND adaptive depth (Jeong 2026 Section 5.3). p99 bounds are now read
# from config.json (`poe_wand_p99_bounds_per_stage_head`); the class
# constant is fallback only. Override per call via `p99_bounds=...`.
out, stages_used = model.generate_wand(
    input_ids, max_new_tokens=64, safety=1.0,
    return_stages_used=True,
)

# 6. Self-speculative decoding (zero-extra-training accelerator).
out, accept_rate = model.generate_speculative(
    input_ids, max_new_tokens=64,
    draft_stage=0, k_draft=4, return_acceptance=True,
)

# 7. Parallel stage composition (Jeong 2026 Section 6.5.5).
out = model.generate_parallel_composition(
    input_ids, stages=(2, 3), stage_weights=(1.0, 1.0), max_new_tokens=64,
)
```
Implementation notes for this release (`per_stage_head=True`):
- `forward_stage(stage=k)` returns logits using `lm_head(x_k) + lm_head_stages[k](x_k)` at boundary `k`. Each stage head was trained additively on top of the shared `lm_head`.
- The `generate_speculative` verifier uses the full PoE aggregate over all K stages. Greedy match by construction guarantees output identity with `model.generate(...)`.
- `generate_wand` runs in cumulative-PoE log-prob space; the p99 bound must be expressed in that same scale (config.json carries this per-checkpoint).
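The additive-head composition and the uniform-mean (alpha=0.0) aggregation described in this card reduce to a few tensor operations. A minimal NumPy sketch with toy shapes and random weights, not the released model's code:

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
T, D, V, K = 4, 8, 16, 4                  # toy: seq len, hidden, vocab, stages

W_base = 0.1 * rng.standard_normal((V, D))                    # shared lm_head
W_stage = [0.1 * rng.standard_normal((V, D)) for _ in range(K)]
x = [rng.standard_normal((T, D)) for _ in range(K)]           # boundary activations x_k

# Additive composition: logits_k = lm_head(x_k) + lm_head_stages[k](x_k)
logits = [x[k] @ (W_base + W_stage[k]).T for k in range(K)]

# Bayesian PoE, uniform mean (alpha=0.0): average per-stage log-softmax, then
# renormalize -- log-softmax of the mean is the geometric mean of distributions.
stage_logp = np.stack([log_softmax(l) for l in logits])       # (K, T, V)
poe_logp = log_softmax(stage_logp.mean(axis=0))               # (T, V)

print(poe_logp.shape)   # (4, 16)
```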
Limitations
- Final release (s83923 / 100.00% complete; training finished 2026-05-07 12:31 KST): all 27 published checkpoints (s2000, s4000, ..., s82000, s83923) remain available as separate branches for trajectory analysis. The `main` branch tracks the final ckpt s83923.
- Calendar prompt ("yesterday → tomorrow"): first-answer outputs have been "Sunday" (s14000, s38), narrative drifts (s16-s30), "Saturday" (s32, s40-s42, s46, s50, s56-s62, s68, s70, s74, s76, s78), "Tuesday" (s44), "Wednesday" (s48), "Monday" (s54, s66, s72), "Friday" (s64). At s62k the chain continuation was the first to be logically correct; at s72k the chain stabilized into a clean +1-day chain. From s74k onward the response drifts into topic tangents (rest-of-the-week meta-language at s74, Matrix at s76, Creation Week at s78); the calendar prompt remains a persistently unsolved factual probe.
- Math prompt 5x+3=13 (correct: x=2): outputs include "factor 13" / "a square" / multiple-choice formats / circular / "multiple of 13" / "3" / "1" / "x is equal to 2" (s40, the only correct answer so far) / "5x+3=13" echo / "5 times as big as 3" / "x is 3" (s46/s50) / "x is equal to 13/5" (s48) / 8-option MC A-H (s54) / "method of substitution" instruction (s56) / 4-option MC self-asserted "B. 2.5" (s58) / 4-option MC "C" (s60) / first algebraic-step structure with arithmetic error (s62) / "x is 5" coefficient confusion (s64) / 5-option MC including B=2 enumerated but not selected (s66) / 5-option MC with fractional choices A=1/3..E=1/4 (s68) / "two solutions x=1 and x=13" (s70) / 4-option MC duplicate "A.5 B.3 C.5 D.3" (s72) / 5-option MC negative integers w/ "Explanation: 5x = 13 - 3" (s74) / "x is equal to 1/3" single fractional answer (s76) / "x is equal to 1.5" single fractional answer (s78; closer to the correct value 2 than s76's 1/3, but still wrong).
- Planets prompt: at s56k the inner/outer/rocky/gas-giants taxonomy first appeared; at s60k near/far structural language; at s62k ordering-by-distance with "Jupiter largest". At s64k the response listed actual planet names for the first time, but with a "terrestrial / gas giants" categorical split where terrestrial = "Mercury, Venus, Earth, Mars, and the Moon" (Moon wrongly included). At s66-s76 the response oscillated between generic "objects orbit the Sun" framings and incorrect categorical claims. At s78k the response produced the first complete and correct modern 8-planet list across the entire trajectory: "The planets are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune" + "named after the Roman gods of the Greek pantheon" ✗.
- s58000 reorganization signals (transient): spec α dropped sharply (-0.0646), WAND p99 widened uniformly (+10%), per-stage acc gain decelerated, gold/planets prompts regressed. At s60k-s64k these signals reverted: spec α rose to 0.9402, 0.9795, then 0.9483; WAND narrowed and held; per-stage acc resumed +0.002~0.004 growth; gold/planets prompts produced richer / structured outputs.
- s68000 signal mismatch (fully recovered by s72000): at s68 the largest local-slice BPB descent of the warmdown phase (-0.0077 full K=4) co-occurred with the largest spec α drop since s56→s58 (-0.0576) and uniform WAND widening (+14~18%). At s70 WAND fully reverted, routing set a new peak 83.45%, BPB descent continued, per-stage acc crossed 0.47, but spec α stayed flat at 0.9139. At s72 spec α recovered sharply (+0.0569 → 0.9708) and routing set another peak 84.31%. The s58→s60 single-window recovery pattern played out across s68→s72 with a two-window delay.
- s74000 → s78000 progression: at s74 BPB descended moderately, spec α pulled back, WAND widened moderately, and the samples regressed on France and Antonym. At s76 BPB descent resumed strongly (-0.0079), per-stage acc gained +0.003, routing crossed the 90% boundary (90.76% peak), the gold prompt produced the first correct atomic # 79 ✓ in a comprehensive Au response, and the France prompt produced its best multi-fact response. At s78 BPB continued (-0.0037; sub-0.74 first crossed), per-stage acc crossed the 0.05 cumulative milestone (s32→s78 = +0.0521), routing pulled back to 83.79%, and the planets prompt produced the first complete and correct modern 8-planet list across the entire trajectory (Mercury through Neptune, no Pluto). The gold prompt corrected the s72 "highly reactive" error to "highly unreactive" ✓.
- Stage diversity at the (16, 6, 5, 5) asymmetric layout: the PoE-vs-single-s3 crossover gap stays in [+0.000067, +0.000553] across all measured checkpoints; the PoE renormalized aggregate is close to the single-best-stage value at every point. The early-trajectory finding ("stages-as-experts degenerate at s14000") is partially superseded by the late-trajectory measurement: at s76000 the head SVD shows a depth-tier cluster structure {s0, s1} vs {s2, s3}, and the unique parameter budget grew from ~38% to ~44%. See the "Stage diversity probe" section above for the early-vs-late comparison.
- The model is a base (pretrained) checkpoint; chat / SFT fine-tuning is not included in this release.
License
Apache 2.0. See LICENSE and NOTICE.
Citation
If you use this release, please cite the companion paper for the PoE per-stage-head methodology:
```bibtex
@misc{jeong2026poe,
  author    = {Jeong, Jaepil},
  title     = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
  year      = {2026},
  doi       = {10.5281/zenodo.19547653},
  publisher = {Zenodo},
}
```
A 3B-specific paper is in preparation.
Related models
- `cognica/Cognica-PoE-v1.0-1.3B-base`: 1.3B PoE per-stage release with shared `lm_head` (no per-stage additive heads), 4 symmetric stages of 6 layers, ClimbMix dataset.
- `cognica/Cognica-BP-v1.0-1.3B-base`: backprop baseline, same compute / dataset / tokenizer as the 1.3B PoE.