---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- poe
- per-stage-head
- continual-learning
- continual-pretraining
- cyclic-lr
---
# Cognica-PoE-v1.0-3B-base-continual-learning
**Continual pretraining** of a 3B PoE per-stage-head model using a **cyclic re-warmup schedule** to extract additional capacity from the same data distribution after initial training has fully annealed.
This release studies whether a model whose first training pass has reached its scheduled LR floor (`lrm ≈ 0.05`) can still meaningfully improve when given a fresh half-peak warmup → warmdown cycle on the same data, without changing architecture, tokenizer, or data mix.
The model is published as a **trajectory of step branches**, not just a final ckpt: every saved checkpoint becomes a separate `step-XXXXX` branch so the continual-learning curve itself is externally auditable.
## What "continual" means here
Most published "base" models are released after a single warmup → constant → warmdown LR cycle. At the end of that cycle the LR is near zero and gradient updates produce only marginal change; the model is conventionally considered "done."
This release tests a different setting: take a fully-annealed checkpoint, **re-arm the optimizer with a new LR cycle at half the original peak**, and continue training on the same data. We label this **continual pretraining (cyclic LR)** to distinguish it from:
| pattern | what changes vs prior phase |
|---|---|
| **continual pretraining (cyclic LR)** ← *this release* | LR re-warmed; data + tokenizer + architecture unchanged |
| domain-adaptive pretraining | new domain data added |
| multilingual continual pretraining | tokenizer extended; multilingual data mixed in |
| continual instruction tuning | SFT data; chat-format objective |
The hypothesis being probed: **does a half-peak second cycle on identical data produce real, measurable gain, or does the model plateau?**
## Methodology β€” B2 cyclic schedule
Initialization: a fully-trained 3B PoE per-stage-head model (66B tokens consumed, full warmup → warmdown cycle complete, `lrm` annealed to ~0.05 of original peak).
**LR schedule for the continual phase** (anchored at the resume step):
```
warmup   (rel 0 .. 1000)     : lrm rises 0 → 0.5 (linear)
peak     (rel 1000 .. 22861) : lrm = 0.5 (constant; half of original peak)
warmdown (rel 22861 .. 50800): lrm decays 0.5 → 0.0 (linear; warmdown_ratio = 0.55)
```
| param | continual phase value | original phase (for reference) |
|---|---|---|
| token budget | +40 B | 66 B |
| total steps in phase | 50,800 | 83,923 |
| warmup steps | 1,000 | 1,000 |
| peak `lrm` | **0.5** | 1.0 |
| warmdown ratio | 0.55 | 0.65 |
| final `lrm` | 0.0 | 0.05 |
| effective `matrix_lr` peak | 0.0075 | 0.015 |
| effective `embedding_lr` peak | 0.15 | 0.30 |
| effective `unembedding_lr` peak | 0.004 | 0.008 |
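The piecewise schedule above can be written down directly. The sketch below is illustrative only (the function name and structure are ours, not the released training code); the step constants come from the schedule and table above, and the effective LRs are the base values scaled by the multiplier.

```python
def lrm(rel_it: int,
        warmup: int = 1_000,
        warmdown_start: int = 22_861,
        total: int = 50_800,
        peak: float = 0.5) -> float:
    """LR multiplier vs. relative step within the continual phase."""
    if rel_it < warmup:
        # linear warmup: 0 -> peak
        return peak * rel_it / warmup
    if rel_it < warmdown_start:
        # constant plateau at half the original peak
        return peak
    # linear warmdown: peak -> 0.0 over the last 55% of the phase
    frac = (rel_it - warmdown_start) / (total - warmdown_start)
    return peak * (1.0 - frac)

# Effective LRs scale with the multiplier, e.g. matrix_lr peak = 0.015 * 0.5
```

Spot checks against the branch table: `lrm(24_077)` comes out ≈ 0.478 and `lrm(26_077)` ≈ 0.442, matching the lrm values quoted for `step-108000` and `step-110000`.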
Everything except the LR schedule is carried over unchanged from the prior phase:
- **Optimizer state**: restored from the prior phase's last save (DistMuonAdamW, ZeRO-2 sharded across 12 ranks).
- **Tokenizer**: 32,768 vocab, rustbpe.
- **Architecture**: depth=32, n_embd=2048, K=4 PoE per-stage with asymmetric stage_layers=(16,6,5,5), GQA 2:1, intermediate=12800, max_seq_len=2048.
- **Data mix** (`frontier_v1`): FineWeb-Edu 33.5% + DCLM-Baseline 24% + Stack-v2 16% + Wikipedia 5% + CulturaX 5% + ProofPile-2 4% + OpenWebMath 4% + Gutenberg 4% + PG-19 2% + UltraChat 1% + OpenHermes-2.5 0.6%.
## Why a published trajectory
The point of this release is the **continual-learning curve**, not any single endpoint. We publish every save (`step-XXXXX` branches) so the actual question ("does a re-warmed cycle keep improving the model?") can be answered by reading off the trajectory rather than trusting our headline numbers.
Each `step-XXXXX` branch carries its own per-checkpoint `poe_wand_p99_bounds_per_stage_head` calibration in `config.json` so PoE-specific inference (WAND adaptive depth, self-speculative decoding) works correctly at any branch.
## Branches
| Branch | Step | Phase position | Notes |
|---|---:|---|---|
| `step-83923` | 83,923 | continual phase rel_it = 0 | seed: identical to the fully-annealed prior-phase final ckpt |
| `step-84000` | 84,000 | rel_it = 77 (warmup early, lrm ≈ 0.04) | first save after re-warmup begins; functionally indistinguishable from seed |
| `step-86000` | 86,000 | rel_it = 2,077 (peak +1,077, lrm = 0.5) | first post-warmup ckpt; LR-shock signature (BPB +0.097 / acc -4.4pp / α -4.5pp) |
| `step-88000` | 88,000 | rel_it = 4,077 (peak +3,077) | plateau approach; BPB slope decelerated 10×; spec α recovered to 0.9795 |
| `step-90000` | 90,000 | rel_it = 6,077 (peak +5,077) | first coordinated descent: BPB -0.005, spec α 0.9853 matched Run A baseline |
| `step-92000` | 92,000 | rel_it = 8,077 (peak +7,077) | decoupled metrics: BPB re-bounce to 0.832 plateau, **spec α 0.9970 new high (above Run A 0.9852)**, WAND bounds narrowest of trajectory |
| `step-94000` | 94,000 | rel_it = 10,077 (peak +9,077) | 2nd BPB descent (-0.007 → 0.825, lowest peak-phase value); routing cap=0.020 **77.69% new continual peak**; WAND bounds **all -13% vs s092 / 16-19% below Run A** |
| `step-96000` | 96,000 | rel_it = 12,077 (peak +11,077) | **s094 was an outlier**: BPB reverts +0.002 (back to 0.827 median); routing cap=0.020 -7.18pp REVERT to 70.5%; WAND bounds all +12% widened. Peak phase confirmed as oscillation regime, no monotonic trends. |
| `step-98000` | 98,000 | rel_it = 14,077 (peak +13,077) | 7-ckpt oscillation regime confirmed: BPB 0.830 within 0.822-0.832 envelope; **spec α 0.9242 new low since LR shock**; routing cap=0.020 75.30%; WAND bounds mildly widened. |
| `step-100000` | 100,000 | rel_it = 16,077 (peak +15,077) | **descent begins**: BPB -0.005 → 0.8243 (below 0.828 plateau midpoint); per-stage acc +0.003 recovery; WAND bounds -7%. |
| `step-102000` | 102,000 | rel_it = 18,077 (peak +17,077) | **NEW TRAJECTORY LOW**: BPB **0.8167** (first below s086 starting point 0.8222, by -0.0055); spec α 0.9795 recovered; per-stage acc trajectory high (0.4456); 3 consecutive descents s098 → s100 → s102 (0.830 → 0.824 → 0.817). Plateau exit confirmed. |
| `step-104000` | 104,000 | rel_it = 20,077 (peak +19,077) | **BOUNCE BACK**: BPB 0.8239 (+0.007 vs s102; s102 was a local low, not the start of a monotonic descent); acc -0.004; routing 71%. Envelope updated to 0.817-0.832; oscillation regime continues. |
| `step-106000` | 106,000 | rel_it = 22,077 (peak +21,077; **last peak ckpt**) | **NEW TRAJECTORY LOWS**: BPB **0.8135** (-0.010 vs s104; below s086 entry by -0.009); per-stage acc **NEW HIGH 0.4467 s3 / 0.4466 full** (gap to Run A reduced to -0.039); progressively lower local lows (s094 0.8251 → s102 0.8167 → s106 0.8135). |
| `step-108000` | 108,000 | rel_it = 24,077 (warmdown +1,217; lrm ≈ 0.478) | **FIRST WARMDOWN CKPT, NEW TRAJECTORY LOWS** in BPB (**0.8079**, -0.006 vs s106; -0.014 below s086 entry) and acc (s3 **0.4485** / full **0.4482**, gap to Run A reduced to -0.038); **crossover gap NEGATIVE for the first time (-0.000049)**: uniform-mean PoE now below single s3. Warmdown descent 1.5× faster than peak-phase descent. |
| `step-110000` | 110,000 | rel_it = 26,077 (warmdown +3,216; lrm ≈ 0.44) | **SECOND WARMDOWN CKPT, CONTINUED DESCENT**: BPB **0.8023** (-0.006 vs s108; first below 0.81; gap to Run A reduced to +0.078), acc 4th consecutive new high (s3 **0.4517** / full **0.4513**, gap to Run A -0.034). Spec α drops -0.022 to 0.9375 (drafter-target mismatch). **Crossover gap REVERTED to +0.000412** (s108's negative gap was a single-ckpt event). WAND bounds tightening across all 3 stages (-1% to -4%). |
| `main` | latest | tracks newest published step | currently `step-110000` |
Future saves: every 2,000 steps (`step-112000`, `step-114000`, ..., `step-134000`) plus a final `step-134723`. Warmdown began around `step-106800` (rel_it ≈ 22,861); roughly 24,700 steps of warmdown remain after `step-110000`, with lrm decaying 0.44 → 0.0.
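To walk the published trajectory programmatically, the step branches can be enumerated with `huggingface_hub`'s `list_repo_refs`. A sketch (only the repo id comes from this card; the helper name is ours):

```python
def step_branches(branch_names):
    """Return (step, branch_name) pairs sorted by step, skipping 'main' etc."""
    steps = []
    for name in branch_names:
        prefix, _, suffix = name.partition("-")
        if prefix == "step" and suffix.isdigit():
            steps.append((int(suffix), name))
    return sorted(steps)

if __name__ == "__main__":
    from huggingface_hub import list_repo_refs  # network call
    refs = list_repo_refs("cognica/Cognica-PoE-v1.0-3B-base-continual-learning")
    for step, name in step_branches(ref.name for ref in refs.branches):
        print(step, name)
```

Any branch name from the list can then be passed as `revision=` to `from_pretrained` exactly as in the inference example below.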
## Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
"cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
revision="main", # or any "step-XXXXX" branch
trust_remote_code=True,
dtype=torch.bfloat16,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(
"cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
revision="main",
trust_remote_code=True,
)
# Base ckpts REQUIRE prepending <|bos|> before user text:
prompt = "The capital of France is"
input_ids = [tokenizer.bos_token_id] + tokenizer.encode(prompt, add_special_tokens=False)
input_ids = torch.tensor([input_ids], device=device)
out = model.generate(input_ids=input_ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0].tolist()))
```
PoE-specific inference helpers (single-stage forward, prefix pruning, WAND adaptive depth, self-speculative decoding) are exposed on `CognicaPoEForCausalLM`. Each `step-XXXXX` branch carries its own calibrated `poe_wand_p99_bounds_per_stage_head` in `config.json`; `model.generate_wand(...)` reads it automatically.
For the architectural details, full inference recipe, and the prior-phase trajectory analysis, see the prior-phase release: [`cognica/Cognica-PoE-v1.0-3B-base`](https://huggingface.co/cognica/Cognica-PoE-v1.0-3B-base).
## Trajectory measurements
The continual phase entry point (`step-83923`) is identical to the prior-phase endpoint, so it serves as both the resume anchor and the baseline against which every continual-phase ckpt is measured. A meaningful continual-learning result requires `step-NNNNN` measurements to **diverge** from the seed across multiple metrics, not just match it.
### Per-checkpoint val BPB
Same 8-shard local val slice across all measured ckpts (1.05 M tokens, `--split-tokens 1048576`); single A100 80GB; FULL K=4 PoE aggregation.
| Branch | Step | Phase rel_it | Training-log val BPB (12 ranks, 40M tokens) | Local-slice full K=4 BPB (8 shards, 1.05M tokens) |
|---|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0 | 0.772893 | 0.724738 |
| `step-84000` | 84,000 | 77 | 0.772671 | 0.724587 |
| `step-86000` | 86,000 | 2,077 | 0.875501 | 0.822195 |
| `step-88000` | 88,000 | 4,077 | 0.882645 | 0.832245 |
| `step-90000` | 90,000 | 6,077 | 0.885028 | 0.827274 |
| `step-92000` | 92,000 | 8,077 | 0.886520 | 0.832023 |
| `step-94000` | 94,000 | 10,077 | 0.883594 | 0.825098 |
| `step-96000` | 96,000 | 12,077 | 0.878945 | 0.827363 |
| `step-98000` | 98,000 | 14,077 | 0.879666 | 0.829764 |
| `step-100000` | 100,000 | 16,077 | 0.876705 | 0.824306 |
| `step-102000` | 102,000 | 18,077 | 0.874228 | 0.816739 |
| `step-104000` | 104,000 | 20,077 | 0.872387 | 0.823924 |
| `step-106000` | 106,000 | 22,077 | 0.871799 | 0.813492 |
| `step-108000` | 108,000 | 24,077 | 0.863411 | 0.807908 |
| `step-110000` | 110,000 | 26,077 | 0.855854 | 0.802296 |
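For reference, bits-per-byte of the kind tabulated above is conventionally derived from summed cross-entropy over the slice; a minimal sketch (the released eval code may differ in details):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Summed next-token cross-entropy (in nats) over a text slice, divided by
    the slice's UTF-8 byte count, converted nats -> bits. Normalizing by bytes
    rather than tokens makes the number comparable across tokenizers."""
    return total_nll_nats / (n_bytes * math.log(2))
```

For example, a slice whose summed loss equals `n_bytes * ln 2` nats scores exactly 1.0 BPB.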
### Per-stage BPB
Per-checkpoint training-objective BPB at each PoE stage boundary, on the same 8-shard local val slice.
| Branch | Step | full K=4 | single s0 | single s1 | single s2 | single s3 | prefix K'=1 | prefix K'=2 | prefix K'=3 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0.724738 | 0.727740 | 0.725798 | 0.725063 | 0.724273 | 0.727740 | 0.726103 | 0.725363 |
| `step-84000` | 84,000 | 0.724587 | 0.727583 | 0.725641 | 0.724863 | 0.724058 | 0.727583 | 0.725975 | 0.725224 |
| `step-86000` | 86,000 | 0.822195 | 0.825607 | 0.823068 | 0.822524 | 0.821900 | 0.825607 | 0.823529 | 0.822770 |
| `step-88000` | 88,000 | 0.832245 | 0.835634 | 0.833265 | 0.832631 | 0.832050 | 0.835634 | 0.833652 | 0.832856 |
| `step-90000` | 90,000 | 0.827274 | 0.830453 | 0.828213 | 0.827658 | 0.827063 | 0.830453 | 0.828545 | 0.827827 |
| `step-92000` | 92,000 | 0.832023 | 0.835828 | 0.832909 | 0.832110 | 0.831363 | 0.835828 | 0.833640 | 0.832726 |
| `step-94000` | 94,000 | 0.825098 | 0.828611 | 0.825863 | 0.825143 | 0.824426 | 0.828611 | 0.826573 | 0.825742 |
| `step-96000` | 96,000 | 0.827363 | 0.830721 | 0.828278 | 0.827653 | 0.826947 | 0.830721 | 0.828749 | 0.827980 |
| `step-98000` | 98,000 | 0.829764 | 0.832973 | 0.830729 | 0.830105 | 0.829450 | 0.832973 | 0.831115 | 0.830359 |
| `step-100000` | 100,000 | 0.824306 | 0.827779 | 0.825251 | 0.824536 | 0.823820 | 0.827779 | 0.825770 | 0.824953 |
| `step-102000` | 102,000 | 0.816739 | 0.819992 | 0.817654 | 0.816944 | 0.816293 | 0.819992 | 0.818104 | 0.817342 |
| `step-104000` | 104,000 | 0.823924 | 0.827367 | 0.824963 | 0.824277 | 0.823621 | 0.827367 | 0.825345 | 0.824551 |
| `step-106000` | 106,000 | 0.813492 | 0.816808 | 0.814571 | 0.813932 | 0.813319 | 0.816808 | 0.814859 | 0.814088 |
| `step-108000` | 108,000 | 0.807908 | 0.811307 | 0.808930 | 0.808412 | 0.807957 | 0.811307 | 0.809247 | 0.808479 |
| `step-110000` | 110,000 | 0.802296 | 0.805671 | 0.803292 | 0.802526 | 0.801884 | 0.805671 | 0.803734 | 0.802916 |
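A minimal sketch of how the `prefix K'` columns can be read, assuming the uniform-mean (α=0) aggregate described under the α-sweep section; the function name and shapes are illustrative:

```python
import math

def prefix_poe_logprobs(stage_logprobs, k_prime):
    """Aggregate the first k' stage heads by uniformly averaging their
    log-probabilities per vocab entry, then renormalizing (log-softmax)."""
    vocab = range(len(stage_logprobs[0]))
    mean = [sum(s[v] for s in stage_logprobs[:k_prime]) / k_prime for v in vocab]
    log_z = math.log(sum(math.exp(m) for m in mean))
    return [m - log_z for m in mean]
```

With `k_prime=1` this reduces to stage 0 alone, consistent with the identical `single s0` and `prefix K'=1` columns above.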
### Per-stage standalone target accuracy
Top-1 accuracy of each stage's standalone prediction vs ground-truth target token, on the same 8-shard val slice.
| Branch | Step | s0 | s1 | s2 | s3 | full |
|---|---:|---:|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0.4844 | 0.4856 | 0.4859 | 0.4862 | 0.4860 |
| `step-84000` | 84,000 | 0.4843 | 0.4857 | 0.4859 | 0.4861 | 0.4859 |
| `step-86000` | 86,000 | 0.4405 | 0.4419 | 0.4422 | 0.4424 | 0.4420 |
| `step-88000` | 88,000 | 0.4375 | 0.4382 | 0.4388 | 0.4390 | 0.4386 |
| `step-90000` | 90,000 | 0.4392 | 0.4403 | 0.4407 | 0.4407 | 0.4405 |
| `step-92000` | 92,000 | 0.4364 | 0.4380 | 0.4385 | 0.4391 | 0.4384 |
| `step-94000` | 94,000 | 0.4403 | 0.4417 | 0.4422 | 0.4421 | 0.4420 |
| `step-96000` | 96,000 | 0.4391 | 0.4400 | 0.4405 | 0.4409 | 0.4404 |
| `step-98000` | 98,000 | 0.4377 | 0.4386 | 0.4388 | 0.4393 | 0.4389 |
| `step-100000` | 100,000 | 0.4406 | 0.4416 | 0.4422 | 0.4427 | 0.4420 |
| `step-102000` | 102,000 | 0.4438 | 0.4452 | 0.4452 | 0.4456 | 0.4452 |
| `step-104000` | 104,000 | 0.4398 | 0.4409 | 0.4414 | 0.4419 | 0.4415 |
| `step-106000` | 106,000 | 0.4452 | 0.4461 | 0.4464 | 0.4467 | 0.4466 |
| `step-108000` | 108,000 | 0.4464 | 0.4479 | 0.4483 | 0.4485 | 0.4482 |
| `step-110000` | 110,000 | 0.4498 | 0.4510 | 0.4514 | 0.4517 | 0.4513 |
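The accuracy metric here is plain top-1 agreement with the ground-truth next token. As a sketch (names are ours):

```python
def top1_accuracy(stage_logits, targets):
    """Fraction of positions where the stage's argmax token id equals the
    ground-truth target token id."""
    hits = 0
    for logits, target in zip(stage_logits, targets):
        pred = max(range(len(logits)), key=logits.__getitem__)
        hits += int(pred == target)
    return hits / len(targets)
```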
### Self-speculative decoding (m=4 stage-0 draft, full K=4 verify)
| Branch | Step | acceptance Ξ± | mean accepted / 4 | end-to-end speedup |
|---|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0.9852 | 3.83 | 1.61x |
| `step-84000` | 84,000 | 0.9736 | 3.77 | 1.57x |
| `step-86000` | 86,000 | 0.9402 | 3.67 | 1.53x |
| `step-88000` | 88,000 | 0.9795 | 3.88 | 1.58x |
| `step-90000` | 90,000 | 0.9853 | 3.88 | 1.58x |
| `step-92000` | 92,000 | 0.9970 | 3.94 | 1.60x |
| `step-94000` | 94,000 | 0.9623 | 3.77 | 1.56x |
| `step-96000` | 96,000 | 0.9853 | 3.94 | 1.59x |
| `step-98000` | 98,000 | 0.9242 | 3.62 | 1.51x |
| `step-100000` | 100,000 | 0.9320 | 3.62 | 1.51x |
| `step-102000` | 102,000 | 0.9795 | 3.88 | 1.59x |
| `step-104000` | 104,000 | 0.9708 | 3.83 | 1.57x |
| `step-106000` | 106,000 | 0.9766 | 3.88 | 1.58x |
| `step-108000` | 108,000 | 0.9593 | 3.67 | 1.55x |
| `step-110000` | 110,000 | 0.9375 | 3.67 | 1.53x |
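A back-of-envelope cost model relates the two right-hand columns, under the assumption (ours, from stage_layers=(16,6,5,5)) that a stage-0 draft forward costs 16/32 of a full K=4 forward and overheads are ignored:

```python
def projected_speedup(mean_accepted: float, m: int = 4,
                      draft_cost: float = 16 / 32) -> float:
    """Tokens emitted per cycle (accepted drafts + 1 token from the verify
    pass) divided by compute per cycle, measured in full K=4 forwards:
    m drafted tokens at stage-0 cost plus one verification forward."""
    return (mean_accepted + 1.0) / (m * draft_cost + 1.0)
```

For the seed row this gives (3.83 + 1) / 3 ≈ 1.61, matching the measured 1.61x; later rows land within a few percent of the end-to-end column, the residual presumably being kernel and sampling overheads.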
### Confidence-aware routing (target_regression cap = 0.020)
| Branch | Step | fraction routed to stage 0 | projected speedup | base s0=full agreement |
|---|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 85.05% | 1.715x | 0.9746 |
| `step-84000` | 84,000 | 87.86% | 1.756x | 0.9756 |
| `step-86000` | 86,000 | 71.05% | 1.534x | 0.9685 |
| `step-88000` | 88,000 | 70.43% | 1.523x | 0.9651 |
| `step-90000` | 90,000 | 71.93% | 1.541x | 0.9687 |
| `step-92000` | 92,000 | 74.62% | 1.585x | 0.9712 |
| `step-94000` | 94,000 | 77.69% | 1.643x | 0.9721 |
| `step-96000` | 96,000 | 70.51% | 1.526x | 0.9683 |
| `step-98000` | 98,000 | 75.30% | 1.604x | 0.9713 |
| `step-100000` | 100,000 | 72.51% | 1.557x | 0.9681 |
| `step-102000` | 102,000 | 76.31% | 1.595x | 0.9703 |
| `step-104000` | 104,000 | 71.03% | 1.539x | 0.9683 |
| `step-106000` | 106,000 | 72.79% | 1.567x | 0.9690 |
| `step-108000` | 108,000 | 71.77% | 1.546x | 0.9671 |
| `step-110000` | 110,000 | 72.36% | 1.553x | 0.9697 |
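The projected-speedup column is roughly consistent with a simple cost model in which routed positions stop after stage 0 (≈16/32 of the layers under our reading of stage_layers) and the rest pay the full forward; a hedged sketch, not the released projection code:

```python
def routed_speedup(frac_stage0: float, stage0_cost: float = 16 / 32) -> float:
    """Reciprocal of mean per-token compute when frac_stage0 of positions are
    answered by stage 0 alone and the remainder run the full K=4 stack."""
    return 1.0 / (frac_stage0 * stage0_cost + (1.0 - frac_stage0))
```

Seed row: `routed_speedup(0.8505)` ≈ 1.74 vs the reported 1.715x, i.e. the table's projection is slightly more conservative than this naive model.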
### WAND p99 bounds (cumulative-PoE delta range, constant-shift invariant)
Calibrated on a 131,072-token val slice using `range(delta) = max(delta) − min(delta)`. Each branch's `config.json` carries its own bounds in `poe_wand_p99_bounds_per_stage_head`.
| Branch | Step | bound 0 β†’ 1 | bound 1 β†’ 2 | bound 2 β†’ 3 |
|---|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 3.9429 | 2.0193 | 1.4479 |
| `step-84000` | 84,000 | 3.9043 | 2.0036 | 1.4499 |
| `step-86000` | 86,000 | 3.9832 | 2.0023 | 1.4030 |
| `step-88000` | 88,000 | 3.7784 | 1.9637 | 1.4530 |
| `step-90000` | 90,000 | 4.0349 | 1.9545 | 1.4144 |
| `step-92000` | 92,000 | 3.7762 | 1.8652 | 1.3454 |
| `step-94000` | 94,000 | 3.2967 | 1.6240 | 1.1702 |
| `step-96000` | 96,000 | 3.6387 | 1.8206 | 1.3109 |
| `step-98000` | 98,000 | 3.7776 | 1.8890 | 1.3254 |
| `step-100000` | 100,000 | 3.5100 | 1.7840 | 1.3035 |
| `step-102000` | 102,000 | 3.6973 | 1.8219 | 1.3299 |
| `step-104000` | 104,000 | 3.8235 | 1.8883 | 1.3597 |
| `step-106000` | 106,000 | 3.8243 | 1.9192 | 1.3995 |
| `step-108000` | 108,000 | 3.9158 | 1.9621 | 1.4752 |
| `step-110000` | 110,000 | 3.7770 | 1.9504 | 1.4117 |
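A sketch of the calibration described above; `deltas` would be per-position vocab vectors of (stage k+1 minus stage k) cumulative-PoE logits, and the exact percentile convention in the released code may differ:

```python
def wand_p99_bound(deltas, q: float = 0.99):
    """For each calibration position take range(delta) = max - min over the
    vocab (invariant to adding a constant to the whole vector), then return
    the q-th percentile of those ranges (nearest-rank, no interpolation)."""
    ranges = sorted(max(d) - min(d) for d in deltas)
    idx = min(len(ranges) - 1, int(q * len(ranges)))
    return ranges[idx]
```

The constant-shift invariance matters because logits are only defined up to an additive constant per position.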
### Bayesian PoE α-sweep (renormed BPB at α=0)
`α=0` is the geometric-mean PoE aggregate (i.e. the uniform mean of log-probabilities); higher α values approach pure-sum PoE.
| Branch | Step | α=0 (geom-mean) | α=0.25 | α=0.5 (Bayesian √K) | α=0.75 | α=1.0 (pure sum / Log-OP) | crossover gap (α=0 vs single s3) |
|---|---:|---:|---:|---:|---:|---:|---:|
| `step-83923` (seed) | 83,923 | 0.724738 | 0.791984 | 0.980340 | 1.298269 | 1.778352 | +0.000465 |
| `step-84000` | 84,000 | 0.724587 | 0.791340 | 0.979207 | 1.296562 | 1.775902 | +0.000529 |
| `step-86000` | 86,000 | 0.822195 | 0.899830 | 1.114882 | 1.477344 | 2.024439 | +0.000295 |
| `step-88000` | 88,000 | 0.832245 | 0.915820 | 1.139195 | 1.513019 | 2.075771 | +0.000195 |
| `step-90000` | 90,000 | 0.827274 | 0.901033 | 1.113493 | 1.473665 | 2.018179 | +0.000211 |
| `step-92000` | 92,000 | 0.832023 | 0.904967 | 1.118086 | 1.479839 | 2.026806 | +0.000661 |
| `step-94000` | 94,000 | 0.825098 | 0.910879 | 1.136512 | 1.512243 | 2.076723 | +0.000672 |
| `step-96000` | 96,000 | 0.827363 | 0.901554 | 1.114122 | 1.474327 | 2.018869 | +0.000416 |
| `step-98000` | 98,000 | 0.829764 | 0.913434 | 1.136861 | 1.510522 | 2.072837 | +0.000314 |
| `step-100000` | 100,000 | 0.824306 | 0.910511 | 1.135647 | 1.510581 | 2.074043 | +0.000487 |
| `step-102000` | 102,000 | 0.816739 | 0.888515 | 1.097389 | 1.451925 | 1.988114 | +0.000447 |
| `step-104000` | 104,000 | 0.823924 | 0.899743 | 1.114410 | 1.476900 | 2.024068 | +0.000303 |
| `step-106000` | 106,000 | 0.813492 | 0.897529 | 1.117572 | 1.484721 | 2.037039 | +0.000173 |
| `step-108000` | 108,000 | 0.807908 | 0.883745 | 1.094811 | 1.450652 | 1.987702 | -0.000049 |
| `step-110000` | 110,000 | 0.802296 | 0.873767 | 1.079777 | 1.428933 | 1.956756 | +0.000412 |
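The sweep's endpoints (uniform mean at α=0, pure sum at α=1) admit a simple linear parameterization of the per-stage weight; a sketch under that assumption (the released sweep may parameterize the interpolation differently):

```python
import math

def alpha_poe_logprobs(stage_logprobs, alpha):
    """Weight the summed stage log-probs by w = (1 - alpha)/K + alpha, so the
    aggregate is the uniform mean at alpha=0 (geometric-mean PoE) and the
    plain sum at alpha=1 (pure-sum PoE / Log-OP); renormalize (log-softmax)."""
    k = len(stage_logprobs)
    w = (1.0 - alpha) / k + alpha
    combined = [w * sum(s[v] for s in stage_logprobs)
                for v in range(len(stage_logprobs[0]))]
    log_z = math.log(sum(math.exp(c) for c in combined))
    return [c - log_z for c in combined]
```

Higher α sharpens the renormalized product, which is consistent with the renormed BPB rising steeply toward α=1 in the table.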
### Sample-mode probe (FULL K=4, greedy, 60 tokens, 7-prompt fixed set)
Concept-level retention probe across a fixed 7-prompt set: capital of France, gold chemical symbol, Friday→tomorrow, opposite of hot, planets list, favorite color, `5x+3=13` algebra. Greedy continuations track factual recall trajectory in parallel with BPB.
| Branch | Step | France | Gold (Au + atomic # 79) | Planets list | Algebra `5x+3=13` (correct: 2) | Note |
|---|---:|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited | Au ✓ + 79 ✓ + transition metals ✓ + properties (richest) | "eight major planets" + complete 8-list ✓ + Roman gods ✓ | echoes question without producing value (regression) | richest factual France & Gold across prior phase |
| `step-84000` | 84,000 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited (≈ seed) | Au ✓ + properties (no atomic #, no transition metals) | terrestrial 4 + gas-giant heading (truncated) | echoes question without producing value | small variations only; 77 steps at near-zero LR ≈ noise |
| `step-86000` | 86,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show LR-shock +0.097 BPB / -4.4pp acc / -4.5pp α |
| `step-88000` | 88,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show plateau approach: BPB slope decelerated 10×, spec α RECOVERED to 0.9795 (-0.006 from Run A) |
| `step-90000` | 90,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **FIRST PEAK DESCENT**: full K=4 BPB **-0.005**, spec α **MATCHED Run A 0.9853**, per-stage acc **+0.002 RECOVERY**, routing +1.5pp |
| `step-92000` | 92,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **DECOUPLED METRICS**: BPB re-bounced +0.005 (oscillation around 0.832 plateau), but spec α **0.9970 NEW HIGH, SURPASSED Run A baseline by +0.012**, routing cap=0.020 **74.62% trajectory peak**, WAND bounds 1→2 / 2→3 narrowest of trajectory |
| `step-94000` | 94,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **2nd BPB DESCENT** (-0.007 vs s092, lowest peak-phase value 0.825), spec α 0.9623 (pulled back from s092 high), per-stage acc **+0.003 RECOVERY**, routing cap=0.020 **77.69% NEW continual peak**, WAND bounds **ALL -13% vs s092 / 16-19% below Run A** (NEW LOWS) |
| `step-96000` | 96,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **s094 was an outlier**: BPB +0.002 reverts toward median (0.827), routing cap=0.020 **-7.18pp REVERT to 70.5%** (s094 outlier high invalidated), WAND bounds **ALL +12% widened**, reverting toward s092 levels. **No metric in the 6-ckpt peak phase shows a monotonic trend**; all oscillate within stable envelopes. |
| `step-98000` | 98,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows confirm **7-ckpt oscillation regime**: BPB 0.830 within 0.822-0.832 envelope, **spec α 0.9242 NEW LOW since LR shock** (envelope expanded to 0.92-0.997), routing cap=0.020 75.30% within 70-78% range, WAND bounds mildly widened. Cyclic LR maintains stable oscillation around the BPB ≈ 0.828 plateau without monotonic descent. |
| `step-100000` | 100,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **DESCENT begins**: BPB -0.005 vs s098 (0.8243, below the 0.828 plateau midpoint); per-stage acc **+0.003 recovery** uniform; WAND bounds narrowed -7%. |
| `step-102000` | 102,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **NEW TRAJECTORY LOW**: BPB **0.8167 (-0.0076 vs s100)**, first below the s086 starting point (0.8222); spec α **0.9795 recovered toward Run A 0.9852**; per-stage acc **trajectory high 0.4452 (s3=0.4456)**; routing 76.31%. **Plateau exit confirmed across 3 consecutive ckpts (s098 → s100 → s102: 0.830 → 0.824 → 0.817).** Continual-learning gain beyond noise. |
| `step-104000` | 104,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **BOUNCE BACK +0.007 vs s102**: BPB 0.8239 back to s100-level (s102's 0.8167 was a local low, not start of monotonic descent); per-stage acc regressed -0.004; routing 71% within envelope. 10-ckpt envelope updated to 0.817-0.832 (s102 expanded floor) but oscillation regime continues. |
| `step-106000` | 106,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show **NEW TRAJECTORY LOWS** in both BPB and acc: BPB **0.8135** (-0.010 vs s104; below s086 entry 0.8222 by -0.009); spec α 0.9766 (-0.009 from Run A); **per-stage acc NEW HIGH 0.4466 full / 0.4467 s3** (gap to Run A reduced to -0.039); crossover gap narrowest of trajectory (+0.000173). Progressively lower local lows: s094 0.8251 → s102 0.8167 → s106 0.8135. **Last peak ckpt before warmdown** (rel_it 22,861, ~800 steps away). |
| `step-108000` | 108,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | **FIRST WARMDOWN CKPT** (rel_it 24,077, ~1,217 steps into warmdown, lrm ≈ 0.478); numeric rows show **NEW TRAJECTORY LOWS** in BPB (**0.8079**, -0.006 vs s106; -0.014 below s086 entry) and acc (s3 **0.4485** / full **0.4482**, gap to Run A reduced to -0.038); **crossover gap NEGATIVE for the first time (-0.000049)**: stage aggregation now produces lower BPB than the single best stage. Warmdown descent 1.5× faster than the peak-phase rate. |
| `step-110000` | 110,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | **SECOND WARMDOWN CKPT** (rel_it 26,077, ~3,216 steps into warmdown, lrm ≈ 0.44); numeric rows show **continued NEW LOWS in BPB and acc**: BPB **0.8023** (-0.006 vs s108; first ckpt below 0.81; gap to Run A reduced to +0.078), per-stage acc **NEW HIGH** s3 **0.4517** / full **0.4513** (4th consecutive new high; gap to Run A reduced to -0.034). Spec α **DROPS -0.022 to 0.9375** (drafter-target mismatch despite WAND bounds tightening). **Crossover gap REVERTED to +0.000412**: s108's negative gap was a single-ckpt event; the stage-aggregation gain is not yet stable. WAND bounds **tightening across all 3 stages** (-1% to -4%). |
### Reading the table
The seed (`step-83923`) row is the **frozen reference**. Subsequent rows answer:
1. **BPB descent**: does `full K=4 BPB` continue dropping past the seed, or plateau?
2. **Per-stage refinement**: do single-stage BPBs descend, indicating each head genuinely tightens?
3. **Routing margin**: does the cap=0.020 routing fraction grow (more positions cleanly handled by stage 0 alone)?
4. **Spec α dynamics**: does stage-0-vs-full agreement strengthen as continual training progresses?
5. **WAND p99 evolution**: does the cumulative-PoE delta range shrink (head distributions converge) or widen (head specialization tightens)?
6. **Sample-mode**: do specific factual probes (planets list, atomic number, algebra answer) become reliably correct, or oscillate?
A real continual-learning win requires multiple metrics to diverge from the seed in a coherent direction. A null result would have all rows ≈ seed, meaning the prior-phase warmdown had already extracted the available capacity from this data.
The seed → `step-84000` delta is **inside expected noise** (77 steps at near-zero warmup LR cannot move the model meaningfully).
`step-86000` is the **first post-warmup ckpt** (rel_it = 2,077; the 1,000-step warmup ended at rel_it = 1,000, so 1,077 steps into the lrm = 0.5 peak phase). It shows a clear **LR-shock signature**: full K=4 BPB +0.097, training-log val BPB +0.103, per-stage acc -4.4pp uniform, spec α -4.5pp, routing fraction (cap = 0.020) -14pp. The crossover gap (α=0 vs single s3) **narrowed** to +0.000295 from +0.000465: the relative aggregation structure is preserved despite the absolute regression.
`step-88000` (rel_it = 4,077; ~3,077 steps into peak) shows the peak-phase plateau approaching: full K=4 BPB +0.010 vs `step-86000` (slope **decelerated 10×** vs the s086 single jump), spec α **recovered to 0.9795** (only -0.006 from Run A's 0.9852, the first metric to fully bounce back), per-stage acc drift slowed to -0.003, routing fraction essentially flat at 70.4%, crossover gap continued narrowing to +0.000195.
`step-90000` (rel_it = 6,077; ~5,077 steps into peak) is the **first ckpt to show coordinated descent**: full K=4 BPB **-0.005 vs step-88000** (first negative delta since the LR shock), spec α **0.9853, matching Run A's 0.9852** (+0.0001), per-stage acc **+0.002 recovery** across all stages, routing fraction **+1.5pp**. The peak-phase plateau lasted ~2-3k steps (s086 → s088) before descent began. Head re-alignment (spec α recovery at s088) **preceded loss-landscape descent** by ~2k steps, validating the "preparation phase" interpretation.
`step-92000` (rel_it = 8,077) shows **decoupled metric trajectories**: full K=4 BPB **re-bounced** +0.005 (back to the s088 plateau level; the s090 descent was an oscillation, not monotonic), but spec α **reached 0.9970, a new trajectory high surpassing the Run A endpoint by +0.012**, routing cap=0.020 **74.62% (trajectory peak in the continual phase)**, and WAND bounds 1→2 / 2→3 are now **the narrowest of the entire trajectory** (below Run A levels).
`step-94000` (rel_it = 10,077) showed BPB at 0.825 (lowest peak-phase value) and routing cap=0.020 at 77.69% with WAND bounds all -13% vs `step-92000`. Initially we read this as a 4-point uptrend in routing/WAND structure metrics; the next checkpoint invalidated that reading.
`step-96000` (rel_it = 12,077) reverts toward the 6-ckpt envelope median: full K=4 BPB **+0.002 vs s094 (now 0.827)**, routing cap=0.020 **-7.18pp drop to 70.51% (back at the s086-s088 level)**, WAND bounds **all +12% widened** vs s094 (back near s092 levels). The s094 measurement was an oscillation outlier, not the start of a monotonic trend.
**6-checkpoint analysis through s096** (s086 → s096): no metric showed a monotonic trend; local-slice BPB oscillated within `0.822-0.832`. The peak phase appeared to be a stable oscillation regime around BPB ≈ 0.828. That picture changes from `step-100000` onward.
**Plateau exit at step-100000 / step-102000**: starting at `step-100000` (rel_it 16,077) the trajectory breaks out of the 0.828 envelope. `step-100000` shows full K=4 BPB **0.8243** (-0.005 vs `step-98000`), with per-stage acc **+0.003 recovery** uniform and WAND p99 bounds narrowed -7%; it is the first ckpt where descent is corroborated by acc and WAND together. `step-102000` (rel_it 18,077) deepens the descent to **0.8167, a new trajectory low, below the `step-86000` starting point of 0.8222 by -0.0055**, with spec α **recovered to 0.9795** (-0.006 from the Run A endpoint), per-stage acc at a **trajectory high of 0.4452** (s3=0.4456), and routing cap=0.020 at 76.31%. Three consecutive checkpoints (`step-98000` → `step-100000` → `step-102000`) form a monotonic descent: 0.830 → 0.824 → 0.817. Training-log val BPB confirms across 5 evaluations: s98000=0.880 → s100000=0.877 → s101000=0.875 → s101500=0.875 → s102000=0.874.
This is the first phase of the run where the cyclic-LR continuation produces gains beyond noise. The model has now spent ~17,000 steps at constant lrm = 0.5 and is finally consolidating into a lower-loss region.
`step-104000` (rel_it = 20,077) then **bounces back** to BPB 0.8239 (+0.007 vs `step-102000`), with per-stage acc regressing -0.004 and routing dropping to 71%. The s102 low was therefore a local low rather than the start of a sustained descent, analogous to `step-94000`'s earlier outlier. The 10-checkpoint envelope is now 0.817-0.832 (floor expanded by -0.005 vs the s086-s098 range of 0.822-0.832), but the oscillation regime persists.
The cyclic-LR continuation through 20,000 peak-phase steps has produced a measurable floor expansion (-0.005 below the LR-shock plateau) but has not transitioned into monotonic descent at constant lrm = 0.5.
`step-106000` (rel_it = 22,077; final peak-phase checkpoint) reaches **NEW trajectory lows simultaneously in BPB and acc**: full K=4 BPB **0.8135** (-0.010 vs s104, -0.003 below previous low s102, **-0.009 below s086 entry**), per-stage acc **NEW HIGH 0.4467 s3 / 0.4466 full** (gap to Run A reduced to -0.039 β€” smallest since LR shock), spec Ξ± 0.9766 close to Run A baseline, crossover gap +0.000173 (narrowest of trajectory). Both metrics moving together (rather than BPB low coinciding with acc decay, or vice versa) indicates **structural descent rather than measurement oscillation**.
The 11-checkpoint floor follows a **progressively-lower local-low** pattern: s094 = 0.8251 → s102 = 0.8167 → s106 = 0.8135, with intermediate bounces back to ~0.825. Each successive local low is below the previous. The model is descending in an oscillating fashion with declining floors rather than a smooth monotonic curve.
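The "progressively-lower local-low" claim can be checked mechanically. A small sketch with an illustrative BPB-like series (not the exact per-checkpoint values, which are only partially reported above): extract the strict local minima, then verify each one sits below the last.

```python
def local_lows(series):
    """Return the strict local minima of a 1-D series (interior points only)."""
    return [v for prev, v, nxt in zip(series, series[1:], series[2:])
            if v < prev and v < nxt]

# illustrative checkpoint series with three progressively lower local lows
series = [0.832, 0.825, 0.831, 0.830, 0.824, 0.817, 0.824, 0.814, 0.820]
lows = local_lows(series)
print(lows)                                        # [0.825, 0.817, 0.814]
print(all(a > b for a, b in zip(lows, lows[1:])))  # True: each low below the last
```

A monotone sequence of local minima under an oscillating envelope is exactly the "declining floors" shape described above.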
Warmdown begins around `step-106800` (rel_it ≈ 22,861) — only ~800 steps away. **`step-108000` is the first warmdown checkpoint**.
`step-108000` (rel_it = 24,077; ~1,217 steps into warmdown, lrm ≈ 0.478) confirms warmdown is producing accelerated descent: full K=4 BPB **0.8079** (-0.006 vs `step-106000`'s peak floor of 0.8135 over just 2,000 steps, a **1.5× faster rate than peak-phase descent**), per-stage acc at a trajectory high (s3 **0.4485** / full **0.4482**, gap to Run A reduced to -0.038), and the **crossover gap goes NEGATIVE for the first time (-0.000049)** — uniform-mean PoE aggregation now produces a BPB *lower* than the single best stage (s3), meaning stage aggregation is finally adding measurable value rather than being absorbed by the s3 head alone.
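The crossover gap compares the full K=4 aggregate against the single best stage. A minimal numpy sketch of uniform-mean PoE aggregation in log space — averaging per-stage log-probabilities and renormalizing, the uniform-weight product of experts. The stage distributions here are random placeholders, not model outputs:

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
K, V = 4, 8                                        # stages, toy vocab size
stage_logp = log_softmax(rng.normal(size=(K, V)))  # placeholder per-stage log-probs

# uniform-mean PoE: average the K stage log-probs, then renormalize
poe_logp = log_softmax(stage_logp.mean(axis=0))
assert np.allclose(np.exp(poe_logp).sum(), 1.0)    # valid distribution
```

A negative crossover gap means evaluating BPB under `poe_logp` beats evaluating it under the best single stage's `stage_logp[k]`.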
BPB is now **0.014 below the s086 LR-shock entry point** and **0.083 above the Run A endpoint**. Approximately 12,000 steps of warmdown remain, with lrm decaying from 0.48 to 0.0. If the current warmdown descent rate is sustained, the endpoint BPB will land in the 0.75-0.76 range; if it accelerates as in Run A's late warmdown, approaching 0.7247 becomes plausible.
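For orientation, the lrm values quoted through this phase are consistent with a warmup → constant → linear-warmdown multiplier schedule at half the original peak. A sketch of such a schedule; the segment lengths and the linear decay shape are illustrative assumptions, not the run's actual config:

```python
def lr_multiplier(it, warmup=2_000, constant_end=20_000, total=32_000, peak=0.5):
    """Piecewise lrm: linear warmup -> constant at peak -> linear warmdown to 0."""
    if it < warmup:
        return peak * it / warmup
    if it < constant_end:
        return peak
    return peak * max(total - it, 0) / (total - constant_end)

for it in (1_000, 10_000, 26_000, 32_000):
    print(it, round(lr_multiplier(it), 3))  # 0.25, 0.5, 0.25, 0.0
```

The actual optimizer multiplies the base LR by this factor each step, which is why checkpoints deep into warmdown report lrm values like 0.478 and 0.44 rather than the 0.5 constant.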
`step-110000` (rel_it = 26,077; ~3,216 steps into warmdown, lrm ≈ 0.44) extends the warmdown descent at the same constant rate: full K=4 BPB **0.8023** (-0.006 vs `step-108000`, first ckpt with BPB below 0.81, gap to Run A reduced to +0.078) and per-stage acc at a **NEW HIGH for the 4th consecutive checkpoint** (s3 **0.4517** / full **0.4513**, gap to Run A reduced to -0.034). BPB descent is now **monotonic across 3 consecutive checkpoints** (s106 → s108 → s110: 0.8135 → 0.8079 → 0.8023) at a constant -0.006/2,000-step rate. Acc is monotonic across 4 consecutive checkpoints (s104 → s106 → s108 → s110: 0.4419 → 0.4467 → 0.4485 → 0.4517).
Two concerning signals at `step-110000`: (1) **spec α drops -0.022 to 0.9375**, the largest single-checkpoint α drop since the LR shock at s086; (2) the **crossover gap reverts to positive (+0.000412)** after going negative at s108. Both signals indicate that the s108 stage-aggregation gain was a single-checkpoint event, not a stable transition — the drafter (single s3) and target (full K=4) distributions are still oscillating relative to each other despite WAND p99 bounds tightening across all 3 stages (-1% to -4%). The bounds tightening (head outputs converging to narrower distributions) coinciding with the α drop (drafter-target mismatch) suggests warmdown-phase head adaptation rates differ across stages, producing a temporary divergence even as the overall loss landscape descends.
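The drafter/target framing above matches standard speculative sampling, where the per-token acceptance probability is Σ_x min(p_target(x), q_draft(x)) — a quantity that falls exactly when the two distributions diverge, as the α drop here suggests. A toy sketch under that assumption (random placeholder distributions, not the model's heads):

```python
import numpy as np

def acceptance_alpha(target_p, draft_q):
    """Per-token acceptance probability of standard speculative sampling."""
    return np.minimum(target_p, draft_q).sum(axis=-1)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8))   # placeholder target (full K=4) distribution
q = rng.dirichlet(np.ones(8))   # placeholder drafter (single s3) distribution

assert abs(acceptance_alpha(p, p) - 1.0) < 1e-9  # identical dists: accept everything
assert acceptance_alpha(p, q) < 1.0              # any divergence lowers alpha
```

Under this reading, α = 0.9375 says the s3 drafter still matches the full aggregate on ~94% of draft tokens, but the match degraded sharply in one checkpoint interval.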
Endpoint projection at the constant -0.006/2,000-step rate: 10,000 steps remaining gives -0.030 total → s134 BPB ≈ 0.77, leaving the Run A gap at +0.045. Reaching Run A's 0.7247 would require warmdown acceleration of ~50% in the final 5,000 steps — possible (Run A's own warmdown showed late acceleration) but not certain on the current trajectory.
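The projection is plain linear extrapolation; a sketch reproducing the arithmetic with the rate and step counts stated above:

```python
def project_bpb(current, rate_per_2k_steps, steps_remaining):
    """Linearly extrapolate BPB at a constant per-2,000-step descent rate."""
    return current + rate_per_2k_steps * steps_remaining / 2_000

endpoint = project_bpb(0.8023, -0.006, 10_000)
print(round(endpoint, 3))           # 0.772, i.e. ~0.77
print(round(endpoint - 0.7247, 3))  # remaining gap to the Run A endpoint
```

The projection assumes the current rate holds for all remaining warmdown steps, which the text itself flags as uncertain given Run A's late-warmdown acceleration.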
## What this release is not
- **Not a multilingual extension.** Tokenizer and data are unchanged; CJK / non-ES Romance language behavior is identical to the prior phase (substantial gaps remain).
- **Not an instruction-tuned / chat model.** Both phases use base pretraining objectives; chat templates are not exposed.
- **Not a quality bump claim.** The hypothesis is being tested in public — endpoint quality is reported as data, not as a marketing claim. Use the prior-phase release as the canonical 3B base unless trajectory evidence here recommends otherwise.
## License
Apache 2.0.
## Citation
```
@misc{jeong2026poe,
  author    = {Jeong, Jaepil},
  title     = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
  year      = {2026},
  doi       = {10.5281/zenodo.19547653},
  publisher = {Zenodo},
}
```
A 3B-specific paper covering the full prior-phase trajectory, this continual-pretraining trajectory, and a planned multilingual reorganization variant is in preparation.
## Related releases
- [`cognica/Cognica-PoE-v1.0-3B-base`](https://huggingface.co/cognica/Cognica-PoE-v1.0-3B-base) — Prior phase: 3B PoE per-stage, 66 B tokens, single warmup→warmdown cycle, `frontier_v1` mix
- [`cognica/Cognica-PoE-v1.0-1.3B-base`](https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-base) — 1.3B PoE per-stage release (different scale)
- [`cognica/Cognica-BP-v1.0-1.3B-base`](https://huggingface.co/cognica/Cognica-BP-v1.0-1.3B-base) — 1.3B Backprop baseline (PoE control)