---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- poe
- per-stage-head
- continual-learning
- continual-pretraining
- cyclic-lr
---
# Cognica-PoE-v1.0-3B-base-continual-learning
**Continual pretraining** of a 3B PoE per-stage-head model using a **cyclic re-warmup schedule** to extract additional capacity from the same data distribution after initial training has fully annealed.
This release studies whether a model whose first training pass has reached its scheduled LR floor (`lrm ≈ 0.05`) can still meaningfully improve when given a fresh half-peak warmup → warmdown cycle on the same data, without changing architecture, tokenizer, or data mix.
The model is published as a **trajectory of step branches**, not just a final ckpt: every saved checkpoint becomes a separate `step-XXXXX` branch so the continual-learning curve itself is externally auditable.
## What "continual" means here
Most published "base" models are released after a single warmup → constant → warmdown LR cycle. At the end of that cycle the LR is near zero and gradient updates produce only marginal change; the model is conventionally considered "done."
This release tests a different setting: take a fully-annealed checkpoint, **re-arm the optimizer with a new LR cycle at half the original peak**, and continue training on the same data. We label this **continual pretraining (cyclic LR)** to distinguish it from:
| pattern | what changes vs prior phase |
|---|---|
| **continual pretraining (cyclic LR)** ← *this release* | LR re-warmed; data + tokenizer + architecture unchanged |
| domain-adaptive pretraining | new domain data added |
| multilingual continual pretraining | tokenizer extended; multilingual data mixed in |
| continual instruction tuning | SFT data; chat-format objective |
The hypothesis being probed: **does a half-peak second cycle on identical data produce real, measurable gain, or does the model plateau?**
## Methodology β€” B2 cyclic schedule
Initialization: a fully-trained 3B PoE per-stage-head model (66B tokens consumed, full warmup → warmdown cycle complete, `lrm` annealed to ~0.05 of original peak).
**LR schedule for the continual phase** (anchored at the resume step):
```
warmup   (rel 0 .. 1000)     : lrm rises 0 → 0.5 (linear)
peak     (rel 1000 .. 22861) : lrm = 0.5 (constant; half of original peak)
warmdown (rel 22861 .. 50800): lrm decays 0.5 → 0.0 (linear; warmdown_ratio = 0.55)
```
| param | continual phase value | original phase (for reference) |
|---|---|---|
| token budget | +40 B | 66 B |
| total steps in phase | 50,800 | 83,923 |
| warmup steps | 1,000 | 1,000 |
| peak `lrm` | **0.5** | 1.0 |
| warmdown ratio | 0.55 | 0.65 |
| final `lrm` | 0.0 | 0.05 |
| effective `matrix_lr` peak | 0.0075 | 0.015 |
| effective `embedding_lr` peak | 0.15 | 0.30 |
| effective `unembedding_lr` peak | 0.004 | 0.008 |
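The piecewise schedule above can be written down directly. The sketch below is illustrative only (the function name and structure are ours, not the released training code); the step constants come from the schedule and table above, and the effective LRs are the base values scaled by the multiplier.

```python
def lrm(rel_it: int,
        warmup: int = 1_000,
        warmdown_start: int = 22_861,
        total: int = 50_800,
        peak: float = 0.5) -> float:
    """LR multiplier vs. relative step within the continual phase."""
    if rel_it < warmup:
        # linear warmup: 0 -> peak
        return peak * rel_it / warmup
    if rel_it < warmdown_start:
        # constant plateau at half the original peak
        return peak
    # linear warmdown: peak -> 0.0 over the last 55% of the phase
    frac = (rel_it - warmdown_start) / (total - warmdown_start)
    return peak * (1.0 - frac)

# Effective LRs scale with the multiplier, e.g. matrix_lr peak = 0.015 * 0.5
```

Spot checks against the branch table: `lrm(24_077)` comes out ≈ 0.478 and `lrm(26_077)` ≈ 0.442, matching the lrm values quoted for `step-108000` and `step-110000`.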
Everything except the LR schedule is carried over unchanged from the prior phase:
- **Optimizer state**: restored from the prior phase's last save (DistMuonAdamW, ZeRO-2 sharded across 12 ranks).
- **Tokenizer**: 32,768 vocab, rustbpe.
- **Architecture**: depth=32, n_embd=2048, K=4 PoE per-stage with asymmetric stage_layers=(16,6,5,5), GQA 2:1, intermediate=12800, max_seq_len=2048.
- **Data mix** (`frontier_v1`): FineWeb-Edu 33.5% + DCLM-Baseline 24% + Stack-v2 16% + Wikipedia 5% + CulturaX 5% + ProofPile-2 4% + OpenWebMath 4% + Gutenberg 4% + PG-19 2% + UltraChat 1% + OpenHermes-2.5 0.6%.
## Why a published trajectory
The point of this release is the **continual-learning curve**, not any single endpoint. We publish every save (`step-XXXXX` branches) so the actual question ("does a re-warmed cycle keep improving the model?") can be answered by reading off the trajectory rather than trusting our headline numbers.
Each `step-XXXXX` branch carries its own per-checkpoint `poe_wand_p99_bounds_per_stage_head` calibration in `config.json` so PoE-specific inference (WAND adaptive depth, self-speculative decoding) works correctly at any branch.
## Branches
| Branch | Step | Phase position | Notes |
|---|---:|---|---|
| `step-83923` | 83,923 | continual phase rel_it = 0 | seed: identical to the fully-annealed prior-phase final ckpt |
| `step-84000` | 84,000 | rel_it = 77 (warmup early, lrm ≈ 0.04) | first save after re-warmup begins; functionally indistinguishable from seed |
| `step-86000` | 86,000 | rel_it = 2,077 (peak +1,077, lrm = 0.5) | first post-warmup ckpt; LR-shock signature (BPB +0.097 / acc -4.4pp / α -4.5pp) |
| `step-88000` | 88,000 | rel_it = 4,077 (peak +3,077) | plateau approach; BPB slope decelerated 10×; spec α recovered to 0.9795 |
| `step-90000` | 90,000 | rel_it = 6,077 (peak +5,077) | first coordinated descent: BPB -0.005, spec α 0.9853 matched Run A baseline |
| `step-92000` | 92,000 | rel_it = 8,077 (peak +7,077) | decoupled metrics: BPB re-bounce to 0.832 plateau, **spec α 0.9970 new high (above Run A 0.9852)**, WAND bounds narrowest of trajectory |
| `step-94000` | 94,000 | rel_it = 10,077 (peak +9,077) | 2nd BPB descent (-0.007 → 0.825, lowest peak-phase value); routing cap=0.020 **77.69% new continual peak**; WAND bounds **all -13% vs s092 / 16-19% below Run A** |
| `step-96000` | 96,000 | rel_it = 12,077 (peak +11,077) | **s094 was an outlier**: BPB reverts +0.002 (back to 0.827 median); routing cap=0.020 -7.18pp REVERT to 70.5%; WAND bounds all +12% widened. Peak phase confirmed as oscillation regime, no monotonic trends. |
| `step-98000` | 98,000 | rel_it = 14,077 (peak +13,077) | 7-ckpt oscillation regime confirmed: BPB 0.830 within 0.822-0.832 envelope; **spec α 0.9242 new low since LR shock**; routing cap=0.020 75.30%; WAND bounds mildly widened. |
| `step-100000` | 100,000 | rel_it = 16,077 (peak +15,077) | **descent begins**: BPB -0.005 → 0.8243 (below 0.828 plateau midpoint); per-stage acc +0.003 recovery; WAND bounds -7%. |
| `step-102000` | 102,000 | rel_it = 18,077 (peak +17,077) | **NEW TRAJECTORY LOW**: BPB **0.8167** (first below s086 starting point 0.8222, by -0.0055); spec α 0.9795 recovered; per-stage acc trajectory high (0.4456); 3 consecutive descents s098 → s100 → s102 (0.830 → 0.824 → 0.817). Plateau exit confirmed. |
| `step-104000` | 104,000 | rel_it = 20,077 (peak +19,077) | **BOUNCE BACK**: BPB 0.8239 (+0.007 vs s102; s102 was a local low, not the start of a monotonic descent); acc -0.004; routing 71%. Envelope updated to 0.817-0.832; oscillation regime continues. |
| `step-106000` | 106,000 | rel_it = 22,077 (peak +21,077; **last peak ckpt**) | **NEW TRAJECTORY LOWS**: BPB **0.8135** (-0.010 vs s104; below s086 entry by -0.009); per-stage acc **NEW HIGH 0.4467 s3 / 0.4466 full** (gap to Run A reduced to -0.039); progressively lower local lows (s094 0.8251 → s102 0.8167 → s106 0.8135). |
| `step-108000` | 108,000 | rel_it = 24,077 (warmdown +1,217; lrm ≈ 0.478) | **FIRST WARMDOWN CKPT, NEW TRAJECTORY LOWS** in BPB (**0.8079**, -0.006 vs s106; -0.014 below s086 entry) and acc (s3 **0.4485** / full **0.4482**, gap to Run A reduced to -0.038); **crossover gap NEGATIVE for the first time (-0.000049)**: uniform-mean PoE now below single s3. Warmdown descent 1.5× faster than peak-phase descent. |
| `step-110000` | 110,000 | rel_it = 26,077 (warmdown +3,216; lrm ≈ 0.44) | **SECOND WARMDOWN CKPT, CONTINUED DESCENT**: BPB **0.8023** (-0.006 vs s108; first below 0.81; gap to Run A reduced to +0.078), acc 4th consecutive new high (s3 **0.4517** / full **0.4513**, gap to Run A -0.034). Spec α drops -0.022 to 0.9375 (drafter-target mismatch). **Crossover gap REVERTED to +0.000412** (s108's negative gap was a single-ckpt event). WAND bounds tightening across all 3 stages (-1% to -4%). |
| `main` | latest | tracks newest published step | currently `step-110000` |
Future saves: every 2,000 steps (`step-112000`, `step-114000`, ..., `step-134000`) plus a final `step-134723`. Warmdown began around `step-106800` (rel_it ≈ 22,861); roughly 24,700 steps of warmdown remain after `step-110000`, with lrm decaying 0.44 → 0.0.
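To walk the published trajectory programmatically, the step branches can be enumerated with `huggingface_hub`'s `list_repo_refs`. A sketch (only the repo id comes from this card; the helper name is ours):

```python
def step_branches(branch_names):
    """Return (step, branch_name) pairs sorted by step, skipping 'main' etc."""
    steps = []
    for name in branch_names:
        prefix, _, suffix = name.partition("-")
        if prefix == "step" and suffix.isdigit():
            steps.append((int(suffix), name))
    return sorted(steps)

if __name__ == "__main__":
    from huggingface_hub import list_repo_refs  # network call
    refs = list_repo_refs("cognica/Cognica-PoE-v1.0-3B-base-continual-learning")
    for step, name in step_branches(ref.name for ref in refs.branches):
        print(step, name)
```

Any branch name from the list can then be passed as `revision=` to `from_pretrained` exactly as in the inference example below.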
## Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
"cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
revision="main", # or any "step-XXXXX" branch
trust_remote_code=True,
dtype=torch.bfloat16,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(
"cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
revision="main",
trust_remote_code=True,
)
# Base ckpts REQUIRE prepending <|bos|> before user text:
prompt = "The capital of France is"
input_ids = [tokenizer.bos_token_id] + tokenizer.encode(prompt, add_special_tokens=False)
input_ids = torch.tensor([input_ids], device=device)
out = model.generate(input_ids=input_ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0].tolist()))
```
PoE-specific inference helpers (single-stage forward, prefix pruning, WAND adaptive depth, self-speculative decoding) are exposed on `CognicaPoEForCausalLM`. Each `step-XXXXX` branch carries its own calibrated `poe_wand_p99_bounds_per_stage_head` in `config.json`; `model.generate_wand(...)` reads it automatically.
For the architectural details, full inference recipe, and the prior-phase trajectory analysis, see the prior-phase release: [`cognica/Cognica-PoE-v1.0-3B-base`](https://huggingface.co/cognica/Cognica-PoE-v1.0-3B-base).
## Trajectory measurements
The continual phase entry point (`step-83923`) is identical to the prior-phase endpoint, so it serves as both the resume anchor and the baseline against which every continual-phase ckpt is measured. A meaningful continual-learning result requires `step-NNNNN` measurements to **diverge** from the seed across multiple metrics, not just match it.
### Per-checkpoint val BPB
Same 8-shard local val slice across all measured ckpts (1.05 M tokens, `--split-tokens 1048576`); single A100 80GB; FULL K=4 PoE aggregation.
| Branch | Step | Phase rel_it | Training-log val BPB (12 ranks, 40M tokens) | Local-slice full K=4 BPB (8 shards, 1.05M tokens) |
|---|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0 | 0.772893 | 0.724738 |
| `step-84000` | 84,000 | 77 | 0.772671 | 0.724587 |
| `step-86000` | 86,000 | 2,077 | 0.875501 | 0.822195 |
| `step-88000` | 88,000 | 4,077 | 0.882645 | 0.832245 |
| `step-90000` | 90,000 | 6,077 | 0.885028 | 0.827274 |
| `step-92000` | 92,000 | 8,077 | 0.886520 | 0.832023 |
| `step-94000` | 94,000 | 10,077 | 0.883594 | 0.825098 |
| `step-96000` | 96,000 | 12,077 | 0.878945 | 0.827363 |
| `step-98000` | 98,000 | 14,077 | 0.879666 | 0.829764 |
| `step-100000` | 100,000 | 16,077 | 0.876705 | 0.824306 |
| `step-102000` | 102,000 | 18,077 | 0.874228 | 0.816739 |
| `step-104000` | 104,000 | 20,077 | 0.872387 | 0.823924 |
| `step-106000` | 106,000 | 22,077 | 0.871799 | 0.813492 |
| `step-108000` | 108,000 | 24,077 | 0.863411 | 0.807908 |
| `step-110000` | 110,000 | 26,077 | 0.855854 | 0.802296 |
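For reference, bits-per-byte of the kind tabulated above is conventionally derived from summed cross-entropy over the slice; a minimal sketch (the released eval code may differ in details):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Summed next-token cross-entropy (in nats) over a text slice, divided by
    the slice's UTF-8 byte count, converted nats -> bits. Normalizing by bytes
    rather than tokens makes the number comparable across tokenizers."""
    return total_nll_nats / (n_bytes * math.log(2))
```

For example, a slice whose summed loss equals `n_bytes * ln 2` nats scores exactly 1.0 BPB.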
### Per-stage BPB
Per-checkpoint training-objective BPB at each PoE stage boundary, on the same 8-shard local val slice.
| Branch | Step | full K=4 | single s0 | single s1 | single s2 | single s3 | prefix K'=1 | prefix K'=2 | prefix K'=3 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0.724738 | 0.727740 | 0.725798 | 0.725063 | 0.724273 | 0.727740 | 0.726103 | 0.725363 |
| `step-84000` | 84,000 | 0.724587 | 0.727583 | 0.725641 | 0.724863 | 0.724058 | 0.727583 | 0.725975 | 0.725224 |
| `step-86000` | 86,000 | 0.822195 | 0.825607 | 0.823068 | 0.822524 | 0.821900 | 0.825607 | 0.823529 | 0.822770 |
| `step-88000` | 88,000 | 0.832245 | 0.835634 | 0.833265 | 0.832631 | 0.832050 | 0.835634 | 0.833652 | 0.832856 |
| `step-90000` | 90,000 | 0.827274 | 0.830453 | 0.828213 | 0.827658 | 0.827063 | 0.830453 | 0.828545 | 0.827827 |
| `step-92000` | 92,000 | 0.832023 | 0.835828 | 0.832909 | 0.832110 | 0.831363 | 0.835828 | 0.833640 | 0.832726 |
| `step-94000` | 94,000 | 0.825098 | 0.828611 | 0.825863 | 0.825143 | 0.824426 | 0.828611 | 0.826573 | 0.825742 |
| `step-96000` | 96,000 | 0.827363 | 0.830721 | 0.828278 | 0.827653 | 0.826947 | 0.830721 | 0.828749 | 0.827980 |
| `step-98000` | 98,000 | 0.829764 | 0.832973 | 0.830729 | 0.830105 | 0.829450 | 0.832973 | 0.831115 | 0.830359 |
| `step-100000` | 100,000 | 0.824306 | 0.827779 | 0.825251 | 0.824536 | 0.823820 | 0.827779 | 0.825770 | 0.824953 |
| `step-102000` | 102,000 | 0.816739 | 0.819992 | 0.817654 | 0.816944 | 0.816293 | 0.819992 | 0.818104 | 0.817342 |
| `step-104000` | 104,000 | 0.823924 | 0.827367 | 0.824963 | 0.824277 | 0.823621 | 0.827367 | 0.825345 | 0.824551 |
| `step-106000` | 106,000 | 0.813492 | 0.816808 | 0.814571 | 0.813932 | 0.813319 | 0.816808 | 0.814859 | 0.814088 |
| `step-108000` | 108,000 | 0.807908 | 0.811307 | 0.808930 | 0.808412 | 0.807957 | 0.811307 | 0.809247 | 0.808479 |
| `step-110000` | 110,000 | 0.802296 | 0.805671 | 0.803292 | 0.802526 | 0.801884 | 0.805671 | 0.803734 | 0.802916 |
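A minimal sketch of how the `prefix K'` columns can be read, assuming the uniform-mean (α=0) aggregate described under the α-sweep section; the function name and shapes are illustrative:

```python
import math

def prefix_poe_logprobs(stage_logprobs, k_prime):
    """Aggregate the first k' stage heads by uniformly averaging their
    log-probabilities per vocab entry, then renormalizing (log-softmax)."""
    vocab = range(len(stage_logprobs[0]))
    mean = [sum(s[v] for s in stage_logprobs[:k_prime]) / k_prime for v in vocab]
    log_z = math.log(sum(math.exp(m) for m in mean))
    return [m - log_z for m in mean]
```

With `k_prime=1` this reduces to stage 0 alone, consistent with the identical `single s0` and `prefix K'=1` columns above.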
### Per-stage standalone target accuracy
Top-1 accuracy of each stage's standalone prediction vs ground-truth target token, on the same 8-shard val slice.
| Branch | Step | s0 | s1 | s2 | s3 | full |
|---|---:|---:|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0.4844 | 0.4856 | 0.4859 | 0.4862 | 0.4860 |
| `step-84000` | 84,000 | 0.4843 | 0.4857 | 0.4859 | 0.4861 | 0.4859 |
| `step-86000` | 86,000 | 0.4405 | 0.4419 | 0.4422 | 0.4424 | 0.4420 |
| `step-88000` | 88,000 | 0.4375 | 0.4382 | 0.4388 | 0.4390 | 0.4386 |
| `step-90000` | 90,000 | 0.4392 | 0.4403 | 0.4407 | 0.4407 | 0.4405 |
| `step-92000` | 92,000 | 0.4364 | 0.4380 | 0.4385 | 0.4391 | 0.4384 |
| `step-94000` | 94,000 | 0.4403 | 0.4417 | 0.4422 | 0.4421 | 0.4420 |
| `step-96000` | 96,000 | 0.4391 | 0.4400 | 0.4405 | 0.4409 | 0.4404 |
| `step-98000` | 98,000 | 0.4377 | 0.4386 | 0.4388 | 0.4393 | 0.4389 |
| `step-100000` | 100,000 | 0.4406 | 0.4416 | 0.4422 | 0.4427 | 0.4420 |
| `step-102000` | 102,000 | 0.4438 | 0.4452 | 0.4452 | 0.4456 | 0.4452 |
| `step-104000` | 104,000 | 0.4398 | 0.4409 | 0.4414 | 0.4419 | 0.4415 |
| `step-106000` | 106,000 | 0.4452 | 0.4461 | 0.4464 | 0.4467 | 0.4466 |
| `step-108000` | 108,000 | 0.4464 | 0.4479 | 0.4483 | 0.4485 | 0.4482 |
| `step-110000` | 110,000 | 0.4498 | 0.4510 | 0.4514 | 0.4517 | 0.4513 |
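The accuracy metric here is plain top-1 agreement with the ground-truth next token. As a sketch (names are ours):

```python
def top1_accuracy(stage_logits, targets):
    """Fraction of positions where the stage's argmax token id equals the
    ground-truth target token id."""
    hits = 0
    for logits, target in zip(stage_logits, targets):
        pred = max(range(len(logits)), key=logits.__getitem__)
        hits += int(pred == target)
    return hits / len(targets)
```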
### Self-speculative decoding (m=4 stage-0 draft, full K=4 verify)
| Branch | Step | acceptance Ξ± | mean accepted / 4 | end-to-end speedup |
|---|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0.9852 | 3.83 | 1.61x |
| `step-84000` | 84,000 | 0.9736 | 3.77 | 1.57x |
| `step-86000` | 86,000 | 0.9402 | 3.67 | 1.53x |
| `step-88000` | 88,000 | 0.9795 | 3.88 | 1.58x |
| `step-90000` | 90,000 | 0.9853 | 3.88 | 1.58x |
| `step-92000` | 92,000 | 0.9970 | 3.94 | 1.60x |
| `step-94000` | 94,000 | 0.9623 | 3.77 | 1.56x |
| `step-96000` | 96,000 | 0.9853 | 3.94 | 1.59x |
| `step-98000` | 98,000 | 0.9242 | 3.62 | 1.51x |
| `step-100000` | 100,000 | 0.9320 | 3.62 | 1.51x |
| `step-102000` | 102,000 | 0.9795 | 3.88 | 1.59x |
| `step-104000` | 104,000 | 0.9708 | 3.83 | 1.57x |
| `step-106000` | 106,000 | 0.9766 | 3.88 | 1.58x |
| `step-108000` | 108,000 | 0.9593 | 3.67 | 1.55x |
| `step-110000` | 110,000 | 0.9375 | 3.67 | 1.53x |
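A back-of-envelope cost model relates the two right-hand columns, under the assumption (ours, from stage_layers=(16,6,5,5)) that a stage-0 draft forward costs 16/32 of a full K=4 forward and overheads are ignored:

```python
def projected_speedup(mean_accepted: float, m: int = 4,
                      draft_cost: float = 16 / 32) -> float:
    """Tokens emitted per cycle (accepted drafts + 1 token from the verify
    pass) divided by compute per cycle, measured in full K=4 forwards:
    m drafted tokens at stage-0 cost plus one verification forward."""
    return (mean_accepted + 1.0) / (m * draft_cost + 1.0)
```

For the seed row this gives (3.83 + 1) / 3 ≈ 1.61, matching the measured 1.61x; later rows land within a few percent of the end-to-end column, the residual presumably being kernel and sampling overheads.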
### Confidence-aware routing (target_regression cap = 0.020)
| Branch | Step | fraction routed to stage 0 | projected speedup | base s0=full agreement |
|---|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 85.05% | 1.715x | 0.9746 |
| `step-84000` | 84,000 | 87.86% | 1.756x | 0.9756 |
| `step-86000` | 86,000 | 71.05% | 1.534x | 0.9685 |
| `step-88000` | 88,000 | 70.43% | 1.523x | 0.9651 |
| `step-90000` | 90,000 | 71.93% | 1.541x | 0.9687 |
| `step-92000` | 92,000 | 74.62% | 1.585x | 0.9712 |
| `step-94000` | 94,000 | 77.69% | 1.643x | 0.9721 |
| `step-96000` | 96,000 | 70.51% | 1.526x | 0.9683 |
| `step-98000` | 98,000 | 75.30% | 1.604x | 0.9713 |
| `step-100000` | 100,000 | 72.51% | 1.557x | 0.9681 |
| `step-102000` | 102,000 | 76.31% | 1.595x | 0.9703 |
| `step-104000` | 104,000 | 71.03% | 1.539x | 0.9683 |
| `step-106000` | 106,000 | 72.79% | 1.567x | 0.9690 |
| `step-108000` | 108,000 | 71.77% | 1.546x | 0.9671 |
| `step-110000` | 110,000 | 72.36% | 1.553x | 0.9697 |
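The projected-speedup column is roughly consistent with a simple cost model in which routed positions stop after stage 0 (≈16/32 of the layers under our reading of stage_layers) and the rest pay the full forward; a hedged sketch, not the released projection code:

```python
def routed_speedup(frac_stage0: float, stage0_cost: float = 16 / 32) -> float:
    """Reciprocal of mean per-token compute when frac_stage0 of positions are
    answered by stage 0 alone and the remainder run the full K=4 stack."""
    return 1.0 / (frac_stage0 * stage0_cost + (1.0 - frac_stage0))
```

Seed row: `routed_speedup(0.8505)` ≈ 1.74 vs the reported 1.715x, i.e. the table's projection is slightly more conservative than this naive model.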
### WAND p99 bounds (cumulative-PoE delta range, constant-shift invariant)
Calibrated on a 131,072-token val slice using `range(delta) = max(delta) − min(delta)`. Each branch's `config.json` carries its own bounds in `poe_wand_p99_bounds_per_stage_head`.
| Branch | Step | bound 0 β†’ 1 | bound 1 β†’ 2 | bound 2 β†’ 3 |
|---|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 3.9429 | 2.0193 | 1.4479 |
| `step-84000` | 84,000 | 3.9043 | 2.0036 | 1.4499 |
| `step-86000` | 86,000 | 3.9832 | 2.0023 | 1.4030 |
| `step-88000` | 88,000 | 3.7784 | 1.9637 | 1.4530 |
| `step-90000` | 90,000 | 4.0349 | 1.9545 | 1.4144 |
| `step-92000` | 92,000 | 3.7762 | 1.8652 | 1.3454 |
| `step-94000` | 94,000 | 3.2967 | 1.6240 | 1.1702 |
| `step-96000` | 96,000 | 3.6387 | 1.8206 | 1.3109 |
| `step-98000` | 98,000 | 3.7776 | 1.8890 | 1.3254 |
| `step-100000` | 100,000 | 3.5100 | 1.7840 | 1.3035 |
| `step-102000` | 102,000 | 3.6973 | 1.8219 | 1.3299 |
| `step-104000` | 104,000 | 3.8235 | 1.8883 | 1.3597 |
| `step-106000` | 106,000 | 3.8243 | 1.9192 | 1.3995 |
| `step-108000` | 108,000 | 3.9158 | 1.9621 | 1.4752 |
| `step-110000` | 110,000 | 3.7770 | 1.9504 | 1.4117 |
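A sketch of the calibration described above; `deltas` would be per-position vocab vectors of (stage k+1 minus stage k) cumulative-PoE logits, and the exact percentile convention in the released code may differ:

```python
def wand_p99_bound(deltas, q: float = 0.99):
    """For each calibration position take range(delta) = max - min over the
    vocab (invariant to adding a constant to the whole vector), then return
    the q-th percentile of those ranges (nearest-rank, no interpolation)."""
    ranges = sorted(max(d) - min(d) for d in deltas)
    idx = min(len(ranges) - 1, int(q * len(ranges)))
    return ranges[idx]
```

The constant-shift invariance matters because logits are only defined up to an additive constant per position.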
### Bayesian PoE α-sweep (renormed BPB at α=0)
`α=0` is the geometric-mean PoE aggregate (i.e. the uniform mean of log-probabilities); higher α values approach pure-sum PoE.
| Branch | Step | α=0 (geom-mean) | α=0.25 | α=0.5 (Bayesian √K) | α=0.75 | α=1.0 (pure sum / Log-OP) | crossover gap (α=0 vs single s3) |
|---|---:|---:|---:|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0.724738 | 0.791984 | 0.980340 | 1.298269 | 1.778352 | +0.000465 |
| `step-84000` | 84,000 | 0.724587 | 0.791340 | 0.979207 | 1.296562 | 1.775902 | +0.000529 |
| `step-86000` | 86,000 | 0.822195 | 0.899830 | 1.114882 | 1.477344 | 2.024439 | +0.000295 |
| `step-88000` | 88,000 | 0.832245 | 0.915820 | 1.139195 | 1.513019 | 2.075771 | +0.000195 |
| `step-90000` | 90,000 | 0.827274 | 0.901033 | 1.113493 | 1.473665 | 2.018179 | +0.000211 |
| `step-92000` | 92,000 | 0.832023 | 0.904967 | 1.118086 | 1.479839 | 2.026806 | +0.000661 |
| `step-94000` | 94,000 | 0.825098 | 0.910879 | 1.136512 | 1.512243 | 2.076723 | +0.000672 |
| `step-96000` | 96,000 | 0.827363 | 0.901554 | 1.114122 | 1.474327 | 2.018869 | +0.000416 |
| `step-98000` | 98,000 | 0.829764 | 0.913434 | 1.136861 | 1.510522 | 2.072837 | +0.000314 |
| `step-100000` | 100,000 | 0.824306 | 0.910511 | 1.135647 | 1.510581 | 2.074043 | +0.000487 |
| `step-102000` | 102,000 | 0.816739 | 0.888515 | 1.097389 | 1.451925 | 1.988114 | +0.000447 |
| `step-104000` | 104,000 | 0.823924 | 0.899743 | 1.114410 | 1.476900 | 2.024068 | +0.000303 |
| `step-106000` | 106,000 | 0.813492 | 0.897529 | 1.117572 | 1.484721 | 2.037039 | +0.000173 |
| `step-108000` | 108,000 | 0.807908 | 0.883745 | 1.094811 | 1.450652 | 1.987702 | -0.000049 |
| `step-110000` | 110,000 | 0.802296 | 0.873767 | 1.079777 | 1.428933 | 1.956756 | +0.000412 |
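The sweep's endpoints (uniform mean at α=0, pure sum at α=1) admit a simple linear parameterization of the per-stage weight; a sketch under that assumption (the released sweep may parameterize the interpolation differently):

```python
import math

def alpha_poe_logprobs(stage_logprobs, alpha):
    """Weight the summed stage log-probs by w = (1 - alpha)/K + alpha, so the
    aggregate is the uniform mean at alpha=0 (geometric-mean PoE) and the
    plain sum at alpha=1 (pure-sum PoE / Log-OP); renormalize (log-softmax)."""
    k = len(stage_logprobs)
    w = (1.0 - alpha) / k + alpha
    combined = [w * sum(s[v] for s in stage_logprobs)
                for v in range(len(stage_logprobs[0]))]
    log_z = math.log(sum(math.exp(c) for c in combined))
    return [c - log_z for c in combined]
```

Higher α sharpens the renormalized product, which is consistent with the renormed BPB rising steeply toward α=1 in the table.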
### Sample-mode probe (FULL K=4, greedy, 60 tokens, 7-prompt fixed set)
Concept-level retention probe across a fixed 7-prompt set: capital of France, gold chemical symbol, Friday→tomorrow, opposite of hot, planets list, favorite color, `5x+3=13` algebra. Greedy continuations track factual recall trajectory in parallel with BPB.
| Branch | Step | France | Gold (Au + atomic # 79) | Planets list | Algebra `5x+3=13` (correct: 2) | Note |
|---|---:|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited | Au ✓ + 79 ✓ + transition metals ✓ + properties (richest) | "eight major planets" + complete 8-list ✓ + Roman gods ✓ | echoes question without producing value (regression) | richest factual France & Gold across prior phase |
| `step-84000` | 84,000 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited (≈ seed) | Au ✓ + properties (no atomic #, no transition metals) | terrestrial 4 + gas-giant heading (truncated) | echoes question without producing value | small variations only; 77 steps at near-zero LR ≈ noise |
| `step-86000` | 86,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show LR-shock +0.097 BPB / -4.4pp acc / -4.5pp α |
| `step-88000` | 88,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show plateau approach: BPB slope decelerated 10×, spec α RECOVERED to 0.9795 (-0.006 from Run A) |
| `step-90000` | 90,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **FIRST PEAK DESCENT**: full K=4 BPB **-0.005**, spec α **MATCHED Run A 0.9853**, per-stage acc **+0.002 RECOVERY**, routing +1.5pp |
| `step-92000` | 92,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **DECOUPLED METRICS**: BPB re-bounced +0.005 (oscillation around 0.832 plateau), but spec α **0.9970 NEW HIGH, SURPASSED Run A baseline by +0.012**, routing cap=0.020 **74.62% trajectory peak**, WAND bounds 1→2 / 2→3 narrowest of trajectory |
| `step-94000` | 94,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **2nd BPB DESCENT** (-0.007 vs s092, lowest peak-phase value 0.825), spec α 0.9623 (pulled back from s092 high), per-stage acc **+0.003 RECOVERY**, routing cap=0.020 **77.69% NEW continual peak**, WAND bounds **ALL -13% vs s092 / 16-19% below Run A** (NEW LOWS) |
| `step-96000` | 96,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **s094 was an outlier**: BPB +0.002 reverts toward median (0.827), routing cap=0.020 **-7.18pp REVERT to 70.5%** (s094 outlier high invalidated), WAND bounds **ALL +12% widened**, reverting toward s092 levels. **No metric in the 6-ckpt peak phase shows a monotonic trend**; all oscillate within stable envelopes. |
| `step-98000` | 98,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows confirm **7-ckpt oscillation regime**: BPB 0.830 within 0.822-0.832 envelope, **spec α 0.9242 NEW LOW since LR shock** (envelope expanded to 0.92-0.997), routing cap=0.020 75.30% within 70-78% range, WAND bounds mildly widened. Cyclic LR maintains stable oscillation around the BPB ≈ 0.828 plateau without monotonic descent. |
| `step-100000` | 100,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **DESCENT begins**: BPB -0.005 vs s098 (0.8243, below the 0.828 plateau midpoint); per-stage acc **+0.003 recovery** uniform; WAND bounds narrowed -7%. |
| `step-102000` | 102,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **NEW TRAJECTORY LOW**: BPB **0.8167 (-0.0076 vs s100)**, first below the s086 starting point (0.8222); spec α **0.9795 recovered toward Run A 0.9852**; per-stage acc **trajectory high 0.4452 (s3=0.4456)**; routing 76.31%. **Plateau exit confirmed across 3 consecutive ckpts (s098 → s100 → s102: 0.830 → 0.824 → 0.817).** Continual-learning gain beyond noise. |
| `step-104000` | 104,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **BOUNCE BACK +0.007 vs s102**: BPB 0.8239 back to s100-level (s102's 0.8167 was a local low, not start of monotonic descent); per-stage acc regressed -0.004; routing 71% within envelope. 10-ckpt envelope updated to 0.817-0.832 (s102 expanded floor) but oscillation regime continues. |
| `step-106000` | 106,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **NEW TRAJECTORY LOWS** in both BPB and acc: BPB **0.8135** (-0.010 vs s104; below s086 entry 0.8222 by -0.009); spec α 0.9766 (-0.009 from Run A); **per-stage acc NEW HIGH 0.4466 full / 0.4467 s3** (gap to Run A reduced to -0.039); crossover gap narrowest of trajectory (+0.000173). Progressively lower local lows: s094 0.8251 → s102 0.8167 → s106 0.8135. **Last peak ckpt before warmdown** (rel_it 22,861, ~800 steps away). |
| `step-108000` | 108,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | **FIRST WARMDOWN CKPT** (rel_it 24,077, ~1,217 steps into warmdown, lrm ≈ 0.478); numeric rows show **NEW TRAJECTORY LOWS** in BPB (**0.8079**, -0.006 vs s106; -0.014 below s086 entry) and acc (s3 **0.4485** / full **0.4482**, gap to Run A reduced to -0.038); **crossover gap NEGATIVE for the first time (-0.000049)**: stage aggregation now produces lower BPB than the single best stage. Warmdown descent 1.5× faster than the peak-phase rate. |
| `step-110000` | 110,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | **SECOND WARMDOWN CKPT** (rel_it 26,077, ~3,216 steps into warmdown, lrm ≈ 0.44); numeric rows show **continued NEW LOWS in BPB and acc**: BPB **0.8023** (-0.006 vs s108; first ckpt below 0.81; gap to Run A reduced to +0.078), per-stage acc **NEW HIGH** s3 **0.4517** / full **0.4513** (4th consecutive new high; gap to Run A reduced to -0.034). Spec α **DROPS -0.022 to 0.9375** (drafter-target mismatch despite WAND bounds tightening). **Crossover gap REVERTED to +0.000412**: s108's negative gap was a single-ckpt event; the stage-aggregation gain is not yet stable. WAND bounds **tightening across all 3 stages** (-1% to -4%). |
### Reading the table
The seed (`step-83923`) row is the **frozen reference**. Subsequent rows answer:
1. **BPB descent**: does `full K=4 BPB` continue dropping past the seed, or plateau?
2. **Per-stage refinement**: do single-stage BPBs descend, indicating each head genuinely tightens?
3. **Routing margin**: does the cap=0.020 routing fraction grow (more positions cleanly handled by stage 0 alone)?
4. **Spec α dynamics**: does stage-0-vs-full agreement strengthen as continual training progresses?
5. **WAND p99 evolution**: does the cumulative-PoE delta range shrink (head distributions converge) or widen (head specialization tightens)?
6. **Sample-mode**: do specific factual probes (planets list, atomic number, algebra answer) become reliably correct, or oscillate?
A real continual-learning win requires multiple metrics to diverge from the seed in a coherent direction. A null result would have all rows ≈ seed, meaning the prior-phase warmdown had already extracted the available capacity from this data.
The seed → `step-84000` delta is **inside expected noise** (77 steps at near-zero warmup LR cannot move the model meaningfully).
`step-86000` is the **first post-warmup ckpt** (rel_it = 2,077; the 1,000-step warmup ended at rel_it = 1,000, so 1,077 steps into the lrm = 0.5 peak phase). It shows a clear **LR-shock signature**: full K=4 BPB +0.097, training-log val BPB +0.103, per-stage acc -4.4pp uniform, spec α -4.5pp, routing fraction (cap = 0.020) -14pp. The crossover gap (α=0 vs single s3) **narrowed** to +0.000295 from +0.000465: the relative aggregation structure is preserved despite the absolute regression.
`step-88000` (rel_it = 4,077; ~3,077 steps into peak) shows the peak-phase plateau approaching: full K=4 BPB +0.010 vs `step-86000` (slope **decelerated 10×** vs the s086 single jump), spec α **recovered to 0.9795** (only -0.006 from Run A's 0.9852, the first metric to fully bounce back), per-stage acc drift slowed to -0.003, routing fraction essentially flat at 70.4%, crossover gap continued narrowing to +0.000195.
`step-90000` (rel_it = 6,077; ~5,077 steps into peak) is the **first ckpt to show coordinated descent**: full K=4 BPB **-0.005 vs step-88000** (first negative delta since the LR shock), spec α **0.9853, matching Run A's 0.9852** (+0.0001), per-stage acc **+0.002 recovery** across all stages, routing fraction **+1.5pp**. The peak-phase plateau lasted ~2-3k steps (s086 → s088) before descent began. Head re-alignment (spec α recovery at s088) **preceded loss-landscape descent** by ~2k steps, validating the "preparation phase" interpretation.
`step-92000` (rel_it = 8,077) shows **decoupled metric trajectories**: full K=4 BPB **re-bounced** +0.005 (back to the s088 plateau level; the s090 descent was an oscillation, not monotonic), but spec α **reached 0.9970, a new trajectory high surpassing the Run A endpoint by +0.012**, routing cap=0.020 **74.62% (trajectory peak in the continual phase)**, and WAND bounds 1→2 / 2→3 are now **the narrowest of the entire trajectory** (below Run A levels).
`step-94000` (rel_it = 10,077) showed BPB at 0.825 (lowest peak-phase value) and routing cap=0.020 at 77.69% with WAND bounds all -13% vs `step-92000`. Initially we read this as a 4-point uptrend in routing/WAND structure metrics; the next checkpoint invalidated that reading.
`step-96000` (rel_it = 12,077) reverts toward the 6-ckpt envelope median: full K=4 BPB **+0.002 vs s094 (now 0.827)**, routing cap=0.020 **-7.18pp drop to 70.51% (back at the s086-s088 level)**, WAND bounds **all +12% widened** vs s094 (back near s092 levels). The s094 measurement was an oscillation outlier, not the start of a monotonic trend.
**6-checkpoint analysis through s096** (s086 → s096): no metric showed a monotonic trend; local-slice BPB oscillated within `0.822-0.832`. The peak phase appeared to be a stable oscillation regime around BPB ≈ 0.828. That picture changes from `step-100000` onward.
**Plateau exit at step-100000 / step-102000**: starting at `step-100000` (rel_it 16,077) the trajectory breaks out of the 0.828 envelope. `step-100000` shows full K=4 BPB **0.8243** (-0.005 vs `step-98000`), with per-stage acc **+0.003 recovery** uniform and WAND p99 bounds narrowed -7%; it is the first ckpt where descent is corroborated by acc and WAND together. `step-102000` (rel_it 18,077) deepens the descent to **0.8167, a new trajectory low, below the `step-86000` starting point of 0.8222 by -0.0055**, with spec α **recovered to 0.9795** (-0.006 from the Run A endpoint), per-stage acc at a **trajectory high of 0.4452** (s3=0.4456), and routing cap=0.020 at 76.31%. Three consecutive checkpoints (`step-98000` → `step-100000` → `step-102000`) form a monotonic descent: 0.830 → 0.824 → 0.817. Training-log val BPB confirms across 5 evaluations: s98000=0.880 → s100000=0.877 → s101000=0.875 → s101500=0.875 → s102000=0.874.
This is the first phase of the run where the cyclic-LR continuation produces gains beyond noise. The model has now spent ~17,000 steps at constant lrm = 0.5 and is finally consolidating into a lower-loss region.
`step-104000` (rel_it = 20,077) then **bounces back** to BPB 0.8239 (+0.007 vs `step-102000`), with per-stage acc regressing -0.004 and routing dropping to 71%. The s102 low was therefore a local low rather than the start of a sustained descent, analogous to `step-94000`'s earlier outlier. The 10-checkpoint envelope is now 0.817-0.832 (floor expanded by -0.005 vs the s086-s098 range of 0.822-0.832), but the oscillation regime persists.
The cyclic-LR continuation through 20,000 peak-phase steps has produced a measurable floor expansion (-0.005 below the LR-shock plateau) but has not transitioned into monotonic descent at constant lrm = 0.5.
`step-106000` (rel_it = 22,077; final peak-phase checkpoint) reaches **NEW trajectory lows simultaneously in BPB and acc**: full K=4 BPB **0.8135** (-0.010 vs s104, -0.003 below previous low s102, **-0.009 below s086 entry**), per-stage acc **NEW HIGH 0.4467 s3 / 0.4466 full** (gap to Run A reduced to -0.039 β€” smallest since LR shock), spec Ξ± 0.9766 close to Run A baseline, crossover gap +0.000173 (narrowest of trajectory). Both metrics moving together (rather than BPB low coinciding with acc decay, or vice versa) indicates **structural descent rather than measurement oscillation**.
The 11-checkpoint floor follows a **progressively-lower local-low** pattern: s094 = 0.8251 → s102 = 0.8167 → s106 = 0.8135, with intermediate bounces back to ~0.825. Each successive local low is below the previous. The model is descending in an oscillating fashion with declining floors rather than a smooth monotonic curve.
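The "progressively-lower local-low" claim can be checked mechanically. A small sketch with an illustrative BPB-like series (not the exact per-checkpoint values, which are only partially reported above): extract the strict local minima, then verify each one sits below the last.

```python
def local_lows(series):
    """Return the strict local minima of a 1-D series (interior points only)."""
    return [v for prev, v, nxt in zip(series, series[1:], series[2:])
            if v < prev and v < nxt]

# illustrative checkpoint series with three progressively lower local lows
series = [0.832, 0.825, 0.831, 0.830, 0.824, 0.817, 0.824, 0.814, 0.820]
lows = local_lows(series)
print(lows)                                        # [0.825, 0.817, 0.814]
print(all(a > b for a, b in zip(lows, lows[1:])))  # True: each low below the last
```

A monotone sequence of local minima under an oscillating envelope is exactly the "declining floors" shape described above.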
Warmdown begins around `step-106800` (rel_it ≈ 22,861) — only ~800 steps away. **`step-108000` is the first warmdown checkpoint**.
`step-108000` (rel_it = 24,077; ~1,217 steps into warmdown, lrm ≈ 0.478) confirms warmdown is producing accelerated descent: full K=4 BPB **0.8079** (-0.006 vs `step-106000`'s peak floor of 0.8135 over just 2,000 steps, a **1.5× faster rate than peak-phase descent**), per-stage acc at a trajectory high (s3 **0.4485** / full **0.4482**, gap to Run A reduced to -0.038), and the **crossover gap goes NEGATIVE for the first time (-0.000049)** — uniform-mean PoE aggregation now produces a BPB *lower* than the single best stage (s3), meaning stage aggregation is finally adding measurable value rather than being absorbed by the s3 head alone.
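The crossover gap compares the full K=4 aggregate against the single best stage. A minimal numpy sketch of uniform-mean PoE aggregation in log space — averaging per-stage log-probabilities and renormalizing, the uniform-weight product of experts. The stage distributions here are random placeholders, not model outputs:

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
K, V = 4, 8                                        # stages, toy vocab size
stage_logp = log_softmax(rng.normal(size=(K, V)))  # placeholder per-stage log-probs

# uniform-mean PoE: average the K stage log-probs, then renormalize
poe_logp = log_softmax(stage_logp.mean(axis=0))
assert np.allclose(np.exp(poe_logp).sum(), 1.0)    # valid distribution
```

A negative crossover gap means evaluating BPB under `poe_logp` beats evaluating it under the best single stage's `stage_logp[k]`.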
BPB is now **0.014 below the s086 LR-shock entry point** and **0.083 above the Run A endpoint**. Approximately 12,000 steps of warmdown remain, with lrm decaying from 0.48 to 0.0. If the current warmdown descent rate is sustained, the endpoint BPB will land in the 0.75-0.76 range; if it accelerates as in Run A's late warmdown, approaching 0.7247 becomes plausible.
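For orientation, the lrm values quoted through this phase are consistent with a warmup → constant → linear-warmdown multiplier schedule at half the original peak. A sketch of such a schedule; the segment lengths and the linear decay shape are illustrative assumptions, not the run's actual config:

```python
def lr_multiplier(it, warmup=2_000, constant_end=20_000, total=32_000, peak=0.5):
    """Piecewise lrm: linear warmup -> constant at peak -> linear warmdown to 0."""
    if it < warmup:
        return peak * it / warmup
    if it < constant_end:
        return peak
    return peak * max(total - it, 0) / (total - constant_end)

for it in (1_000, 10_000, 26_000, 32_000):
    print(it, round(lr_multiplier(it), 3))  # 0.25, 0.5, 0.25, 0.0
```

The actual optimizer multiplies the base LR by this factor each step, which is why checkpoints deep into warmdown report lrm values like 0.478 and 0.44 rather than the 0.5 constant.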
`step-110000` (rel_it = 26,077; ~3,216 steps into warmdown, lrm ≈ 0.44) extends the warmdown descent at the same constant rate: full K=4 BPB **0.8023** (-0.006 vs `step-108000`, first ckpt with BPB below 0.81, gap to Run A reduced to +0.078) and per-stage acc at a **NEW HIGH for the 4th consecutive checkpoint** (s3 **0.4517** / full **0.4513**, gap to Run A reduced to -0.034). BPB descent is now **monotonic across 3 consecutive checkpoints** (s106 → s108 → s110: 0.8135 → 0.8079 → 0.8023) at a constant -0.006/2,000-step rate. Acc is monotonic across 4 consecutive checkpoints (s104 → s106 → s108 → s110: 0.4419 → 0.4467 → 0.4485 → 0.4517).
Two concerning signals at `step-110000`: (1) **spec α drops -0.022 to 0.9375**, the largest single-checkpoint α drop since the LR shock at s086; (2) the **crossover gap reverts to positive (+0.000412)** after going negative at s108. Both signals indicate that the s108 stage-aggregation gain was a single-checkpoint event, not a stable transition — the drafter (single s3) and target (full K=4) distributions are still oscillating relative to each other despite WAND p99 bounds tightening across all 3 stages (-1% to -4%). The bounds tightening (head outputs converging to narrower distributions) coinciding with the α drop (drafter-target mismatch) suggests warmdown-phase head adaptation rates differ across stages, producing a temporary divergence even as the overall loss landscape descends.
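The drafter/target framing above matches standard speculative sampling, where the per-token acceptance probability is Σ_x min(p_target(x), q_draft(x)) — a quantity that falls exactly when the two distributions diverge, as the α drop here suggests. A toy sketch under that assumption (random placeholder distributions, not the model's heads):

```python
import numpy as np

def acceptance_alpha(target_p, draft_q):
    """Per-token acceptance probability of standard speculative sampling."""
    return np.minimum(target_p, draft_q).sum(axis=-1)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8))   # placeholder target (full K=4) distribution
q = rng.dirichlet(np.ones(8))   # placeholder drafter (single s3) distribution

assert abs(acceptance_alpha(p, p) - 1.0) < 1e-9  # identical dists: accept everything
assert acceptance_alpha(p, q) < 1.0              # any divergence lowers alpha
```

Under this reading, α = 0.9375 says the s3 drafter still matches the full aggregate on ~94% of draft tokens, but the match degraded sharply in one checkpoint interval.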
Endpoint projection at the constant -0.006/2,000-step rate: 10,000 steps remaining gives -0.030 total → s134 BPB ≈ 0.77, leaving the Run A gap at +0.045. Reaching Run A's 0.7247 would require warmdown acceleration of ~50% in the final 5,000 steps — possible (Run A's own warmdown showed late acceleration) but not certain on the current trajectory.
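The projection is plain linear extrapolation; a sketch reproducing the arithmetic with the rate and step counts stated above:

```python
def project_bpb(current, rate_per_2k_steps, steps_remaining):
    """Linearly extrapolate BPB at a constant per-2,000-step descent rate."""
    return current + rate_per_2k_steps * steps_remaining / 2_000

endpoint = project_bpb(0.8023, -0.006, 10_000)
print(round(endpoint, 3))           # 0.772, i.e. ~0.77
print(round(endpoint - 0.7247, 3))  # remaining gap to the Run A endpoint
```

The projection assumes the current rate holds for all remaining warmdown steps, which the text itself flags as uncertain given Run A's late-warmdown acceleration.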
## What this release is not
- **Not a multilingual extension.** Tokenizer and data are unchanged; CJK / non-ES Romance language behavior is identical to the prior phase (substantial gaps remain).
- **Not an instruction-tuned / chat model.** Both phases use base pretraining objectives; chat templates are not exposed.
- **Not a quality bump claim.** The hypothesis is being tested in public — endpoint quality is reported as data, not as a marketing claim. Use the prior-phase release as the canonical 3B base unless trajectory evidence here recommends otherwise.
## License
Apache 2.0.
## Citation
```
@misc{jeong2026poe,
  author    = {Jeong, Jaepil},
  title     = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
  year      = {2026},
  doi       = {10.5281/zenodo.19547653},
  publisher = {Zenodo},
}
```
A 3B-specific paper covering the full prior-phase trajectory, this continual-pretraining trajectory, and a planned multilingual reorganization variant is in preparation.
## Related releases
- [`cognica/Cognica-PoE-v1.0-3B-base`](https://huggingface.co/cognica/Cognica-PoE-v1.0-3B-base) — Prior phase: 3B PoE per-stage, 66 B tokens, single warmup→warmdown cycle, `frontier_v1` mix
- [`cognica/Cognica-PoE-v1.0-1.3B-base`](https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-base) — 1.3B PoE per-stage release (different scale)
- [`cognica/Cognica-BP-v1.0-1.3B-base`](https://huggingface.co/cognica/Cognica-BP-v1.0-1.3B-base) — 1.3B Backprop baseline (PoE control)