# Cognica-PoE-v1.0-3B-base-continual-learning
Continual pretraining of a 3B PoE per-stage-head model using a cyclic re-warmup schedule to extract additional capacity from the same data distribution after initial training has fully annealed.
This release studies whether a model whose first training pass has reached its scheduled LR floor (lrm ≈ 0.05) can still meaningfully improve when given a fresh half-peak warmup → warmdown cycle on the same data — without changing architecture, tokenizer, or data mix.
The model is published as a trajectory of step branches, not just a final ckpt: every saved checkpoint becomes a separate step-XXXXX branch so the continual-learning curve itself is externally auditable.
What "continual" means here
Most published "base" models are released after a single warmup → constant → warmdown LR cycle. At the end of that cycle the LR is near zero and gradient updates produce only marginal change — the model is conventionally considered "done."
This release tests a different setting: take a fully-annealed checkpoint, re-arm the optimizer with a new LR cycle at half the original peak, and continue training on the same data. We label this continual pretraining (cyclic LR) to distinguish it from:
| pattern | what changes vs prior phase |
|---|---|
| continual pretraining (cyclic LR) ← this release | LR re-warmed; data + tokenizer + architecture unchanged |
| domain-adaptive pretraining | new domain data added |
| multilingual continual pretraining | tokenizer extended; multilingual data mixed in |
| continual instruction tuning | SFT data; chat-format objective |
The hypothesis being probed: does a half-peak second cycle on identical data produce real, measurable gain, or does the model plateau?
## Methodology — B2 cyclic schedule
Initialization: a fully-trained 3B PoE per-stage-head model (66B tokens consumed, full warmup → warmdown cycle complete, lrm annealed to ~0.05 of original peak).
LR schedule for the continual phase (anchored at the resume step):
```
warmup   (rel     0 ..  1000): lrm rises 0 → 0.5    (linear)
peak     (rel  1000 .. 22861): lrm = 0.5            (constant; half of original peak)
warmdown (rel 22861 .. 50800): lrm decays 0.5 → 0.0 (linear; warmdown_ratio = 0.55)
```
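For concreteness, here is a minimal sketch of this schedule as a function of the relative step. The breakpoints and warmdown ratio come from the table below; the helper name and exact rounding are illustrative, not the training code:

```python
def continual_lrm(rel_it: int,
                  warmup: int = 1_000,
                  total: int = 50_800,
                  warmdown_ratio: float = 0.55,
                  peak: float = 0.5) -> float:
    """LR multiplier at continual-phase step rel_it (steps since resume)."""
    warmdown_start = round(total * (1 - warmdown_ratio))  # ≈ 22,860 (card quotes 22,861)
    if rel_it < warmup:
        return peak * rel_it / warmup                     # linear re-warmup 0 → 0.5
    if rel_it < warmdown_start:
        return peak                                       # constant half-peak plateau
    return peak * (total - rel_it) / (total - warmdown_start)  # linear decay → 0.0

# Effective per-group LRs are the original peaks scaled by lrm, e.g. at the plateau:
# matrix_lr 0.015 * 0.5 = 0.0075, embedding_lr 0.30 * 0.5 = 0.15
```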
| param | continual phase value | original phase (for reference) |
|---|---|---|
| token budget | +40 B | 66 B |
| total steps in phase | 50,800 | 83,923 |
| warmup steps | 1,000 | 1,000 |
| peak lrm | 0.5 | 1.0 |
| warmdown ratio | 0.55 | 0.65 |
| final lrm | 0.0 | 0.05 |
| effective matrix_lr peak | 0.0075 | 0.015 |
| effective embedding_lr peak | 0.15 | 0.30 |
| effective unembedding_lr peak | 0.004 | 0.008 |
Optimizer state is restored from the prior phase's last save (DistMuonAdamW, ZeRO-2 sharded across 12 ranks). Everything else is unchanged from the prior phase:

- Tokenizer: 32,768 vocab, rustbpe
- Architecture: depth=32, n_embd=2048, K=4 PoE per-stage with asymmetric stage_layers=(16,6,5,5), GQA 2:1, intermediate=12800, max_seq_len=2048
- Data mix: `frontier_v1` (FineWeb-Edu 33.5% + DCLM-Baseline 24% + Stack-v2 16% + Wikipedia 5% + CulturaX 5% + ProofPile-2 4% + OpenWebMath 4% + Gutenberg 4% + PG-19 2% + UltraChat 1% + OpenHermes-2.5 0.6%)
## Why a published trajectory
The point of this release is the continual-learning curve, not any single endpoint. We publish every save (step-XXXXX branches) so the actual question — "does a re-warmed cycle keep improving the model?" — can be answered by reading off the trajectory rather than trusting our headline numbers.
Each `step-XXXXX` branch carries its own per-checkpoint `poe_wand_p99_bounds_per_stage_head` calibration in `config.json`, so PoE-specific inference (WAND adaptive depth, self-speculative decoding) works correctly at any branch.
## Branches

| Branch | Step | Phase position | Notes |
|---|---|---|---|
| `step-83923` | 83,923 | continual phase rel_it = 0 | seed: identical to the fully-annealed prior-phase final ckpt |
| `step-84000` | 84,000 | rel_it = 77 (warmup early, lrm ≈ 0.04) | first save after re-warmup begins; functionally indistinguishable from seed |
| `step-86000` | 86,000 | rel_it = 2,077 (peak +1,077, lrm = 0.5) | first post-warmup ckpt; LR-shock signature (BPB +0.097 / acc -4.4pp / α -4.5pp) |
| `step-88000` | 88,000 | rel_it = 4,077 (peak +3,077) | plateau approach; BPB slope decelerated 10×; spec α recovered to 0.9795 |
| `step-90000` | 90,000 | rel_it = 6,077 (peak +5,077) | first coordinated descent: BPB -0.005, spec α 0.9853 matched Run A baseline |
| `step-92000` | 92,000 | rel_it = 8,077 (peak +7,077) | decoupled metrics: BPB re-bounce to the 0.832 plateau, spec α 0.9970 new high (above Run A 0.9852), WAND bounds narrowest of trajectory |
| `step-94000` | 94,000 | rel_it = 10,077 (peak +9,077) | 2nd BPB descent (-0.007 → 0.825, lowest peak-phase value); routing cap=0.020 77.69%, new continual peak; WAND bounds all -13% vs s092 / 16-19% below Run A |
| `step-96000` | 96,000 | rel_it = 12,077 (peak +11,077) | s094 was an outlier: BPB reverts +0.002 (back to the 0.827 median); routing cap=0.020 reverts -7.18pp to 70.5%; WAND bounds all widened +12%. Peak phase confirmed as an oscillation regime, no monotonic trends. |
| `main` | latest | tracks newest published step | currently `step-96000` |
Future saves: every 2000 steps (step-98000, step-100000, ..., step-134000) plus a final step-134723. Warmdown begins around step-106800 (rel_it ≈ 22,861).
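One way to enumerate the published trajectory programmatically, using the standard `huggingface_hub` API (only the repo id comes from this card):

```python
from huggingface_hub import list_repo_refs

refs = list_repo_refs("cognica/Cognica-PoE-v1.0-3B-base-continual-learning")
step_branches = sorted(b.name for b in refs.branches if b.name.startswith("step-"))
print(step_branches)  # one branch per published checkpoint
```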
## Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
    revision="main",  # or any "step-XXXXX" branch
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to(device).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
    revision="main",
    trust_remote_code=True,
)

# Base ckpts REQUIRE prepending <|bos|> before user text:
prompt = "The capital of France is"
input_ids = [tokenizer.bos_token_id] + tokenizer.encode(prompt, add_special_tokens=False)
input_ids = torch.tensor([input_ids], device=device)

out = model.generate(input_ids=input_ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0].tolist()))
```
PoE-specific inference helpers (single-stage forward, prefix pruning, WAND adaptive depth, self-speculative decoding) are exposed on `CognicaPoEForCausalLM`. Each `step-XXXXX` branch carries its own calibrated `poe_wand_p99_bounds_per_stage_head` in `config.json`; `model.generate_wand(...)` reads it automatically.
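A hedged usage sketch: `generate_wand` is the helper named above, but the argument names here are assumed to mirror `generate` and are not a documented signature.

```python
# WAND adaptive-depth decoding; per-branch p99 bounds are read from config.json.
# Argument names are assumptions mirroring generate(); see the prior-phase release
# for the authoritative recipe.
out = model.generate_wand(input_ids=input_ids, max_new_tokens=32)
print(tokenizer.decode(out[0].tolist()))
```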
For the architectural details, full inference recipe, and the prior-phase trajectory analysis, see the prior-phase release: cognica/Cognica-PoE-v1.0-3B-base.
## Trajectory measurements
The continual phase entry point (step-83923) is identical to the prior-phase endpoint, so it serves as both the resume anchor and the baseline against which every continual-phase ckpt is measured. A meaningful continual-learning result requires step-NNNNN measurements to diverge from the seed across multiple metrics — not just match it.
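Bits-per-byte (BPB) normalizes summed token cross-entropy by UTF-8 byte count, making values comparable across tokenizers. A minimal sketch of the conversion, for readers reproducing the tables (the evaluation harness itself is not published here):

```python
import math

def bits_per_byte(sum_loss_nats: float, n_bytes: int) -> float:
    """Summed cross-entropy over a val slice (in nats) → bits per UTF-8 byte."""
    return sum_loss_nats / (math.log(2) * n_bytes)
```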
### Per-checkpoint val BPB

Same 8-shard local val slice across all measured ckpts (1.05M tokens, `--split-tokens 1048576`); single A100 80GB; full K=4 PoE aggregation.

| Branch | Step | Phase rel_it | Training-log val BPB (12 ranks, 40M tokens) | Local-slice full K=4 BPB (8 shards, 1.05M tokens) |
|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0 | 0.772893 | 0.724738 |
| `step-84000` | 84,000 | 77 | 0.772671 | 0.724587 |
| `step-86000` | 86,000 | 2,077 | 0.875501 | 0.822195 |
| `step-88000` | 88,000 | 4,077 | 0.882645 | 0.832245 |
| `step-90000` | 90,000 | 6,077 | 0.885028 | 0.827274 |
| `step-92000` | 92,000 | 8,077 | 0.886520 | 0.832023 |
| `step-94000` | 94,000 | 10,077 | 0.883594 | 0.825098 |
| `step-96000` | 96,000 | 12,077 | 0.878945 | 0.827363 |
### Per-stage BPB

Per-checkpoint training-objective BPB at each PoE stage boundary, on the same 8-shard local val slice.

| Branch | Step | full K=4 | single s0 | single s1 | single s2 | single s3 | prefix K'=1 | prefix K'=2 | prefix K'=3 |
|---|---|---|---|---|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0.724738 | 0.727740 | 0.725798 | 0.725063 | 0.724273 | 0.727740 | 0.726103 | 0.725363 |
| `step-84000` | 84,000 | 0.724587 | 0.727583 | 0.725641 | 0.724863 | 0.724058 | 0.727583 | 0.725975 | 0.725224 |
| `step-86000` | 86,000 | 0.822195 | 0.825607 | 0.823068 | 0.822524 | 0.821900 | 0.825607 | 0.823529 | 0.822770 |
| `step-88000` | 88,000 | 0.832245 | 0.835634 | 0.833265 | 0.832631 | 0.832050 | 0.835634 | 0.833652 | 0.832856 |
| `step-90000` | 90,000 | 0.827274 | 0.830453 | 0.828213 | 0.827658 | 0.827063 | 0.830453 | 0.828545 | 0.827827 |
| `step-92000` | 92,000 | 0.832023 | 0.835828 | 0.832909 | 0.832110 | 0.831363 | 0.835828 | 0.833640 | 0.832726 |
| `step-94000` | 94,000 | 0.825098 | 0.828611 | 0.825863 | 0.825143 | 0.824426 | 0.828611 | 0.826573 | 0.825742 |
| `step-96000` | 96,000 | 0.827363 | 0.830721 | 0.828278 | 0.827653 | 0.826947 | 0.830721 | 0.828749 | 0.827980 |
### Per-stage standalone target accuracy

Top-1 accuracy of each stage's standalone prediction vs the ground-truth target token, on the same 8-shard val slice.

| Branch | Step | s0 | s1 | s2 | s3 | full |
|---|---|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0.4844 | 0.4856 | 0.4859 | 0.4862 | 0.4860 |
| `step-84000` | 84,000 | 0.4843 | 0.4857 | 0.4859 | 0.4861 | 0.4859 |
| `step-86000` | 86,000 | 0.4405 | 0.4419 | 0.4422 | 0.4424 | 0.4420 |
| `step-88000` | 88,000 | 0.4375 | 0.4382 | 0.4388 | 0.4390 | 0.4386 |
| `step-90000` | 90,000 | 0.4392 | 0.4403 | 0.4407 | 0.4407 | 0.4405 |
| `step-92000` | 92,000 | 0.4364 | 0.4380 | 0.4385 | 0.4391 | 0.4384 |
| `step-94000` | 94,000 | 0.4403 | 0.4417 | 0.4422 | 0.4421 | 0.4420 |
| `step-96000` | 96,000 | 0.4391 | 0.4400 | 0.4405 | 0.4409 | 0.4404 |
### Self-speculative decoding (m=4 stage-0 draft, full K=4 verify)

| Branch | Step | acceptance α | mean accepted / 4 | end-to-end speedup |
|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0.9852 | 3.83 | 1.61x |
| `step-84000` | 84,000 | 0.9736 | 3.77 | 1.57x |
| `step-86000` | 86,000 | 0.9402 | 3.67 | 1.53x |
| `step-88000` | 88,000 | 0.9795 | 3.88 | 1.58x |
| `step-90000` | 90,000 | 0.9853 | 3.88 | 1.58x |
| `step-92000` | 92,000 | 0.9970 | 3.94 | 1.60x |
| `step-94000` | 94,000 | 0.9623 | 3.77 | 1.56x |
| `step-96000` | 96,000 | 0.9853 | 3.94 | 1.59x |
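As a sanity check on the α column: under the simplifying assumption that each of the m=4 draft tokens is accepted independently with probability α, the expected accepted count is the geometric sum over draft positions, which lands close to the measured values (the table itself reports empirical counts):

```python
def expected_accepted(alpha: float, m: int = 4) -> float:
    """Expected accepted draft tokens per verify step, assuming i.i.d. acceptance."""
    return sum(alpha ** k for k in range(1, m + 1))

print(round(expected_accepted(0.9852), 2))  # 3.85 vs measured 3.83 at the seed
```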
### Confidence-aware routing (target_regression cap = 0.020)

| Branch | Step | fraction routed to stage 0 | projected speedup | s0-vs-full agreement (base) |
|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 85.05% | 1.715x | 0.9746 |
| `step-84000` | 84,000 | 87.86% | 1.756x | 0.9756 |
| `step-86000` | 86,000 | 71.05% | 1.534x | 0.9685 |
| `step-88000` | 88,000 | 70.43% | 1.523x | 0.9651 |
| `step-90000` | 90,000 | 71.93% | 1.541x | 0.9687 |
| `step-92000` | 92,000 | 74.62% | 1.585x | 0.9712 |
| `step-94000` | 94,000 | 77.69% | 1.643x | 0.9721 |
| `step-96000` | 96,000 | 70.51% | 1.526x | 0.9683 |
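The projected-speedup column is roughly reproduced by a layer-count cost model: stage 0 runs 16 of 32 layers, so a routed position costs about half a full forward. This is an assumed model for illustration (it ignores per-stage head overhead, which is why it slightly overshoots the reported numbers):

```python
def projected_speedup(frac_s0: float, s0_cost: float = 16 / 32) -> float:
    """Throughput gain if frac_s0 of positions stop at stage 0 (layer-count cost model)."""
    return 1.0 / (frac_s0 * s0_cost + (1.0 - frac_s0))

print(round(projected_speedup(0.8505), 3))  # 1.740 vs reported 1.715 at the seed
```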
### WAND p99 bounds (cumulative-PoE delta range, constant-shift invariant)

Calibrated on a 131,072-token val slice using range(delta) = max(delta) − min(delta). Each branch's `config.json` carries its own bounds in `poe_wand_p99_bounds_per_stage_head`.

| Branch | Step | bound 0 → 1 | bound 1 → 2 | bound 2 → 3 |
|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 3.9429 | 2.0193 | 1.4479 |
| `step-84000` | 84,000 | 3.9043 | 2.0036 | 1.4499 |
| `step-86000` | 86,000 | 3.9832 | 2.0023 | 1.4030 |
| `step-88000` | 88,000 | 3.7784 | 1.9637 | 1.4530 |
| `step-90000` | 90,000 | 4.0349 | 1.9545 | 1.4144 |
| `step-92000` | 92,000 | 3.7762 | 1.8652 | 1.3454 |
| `step-94000` | 94,000 | 3.2967 | 1.6240 | 1.1702 |
| `step-96000` | 96,000 | 3.6387 | 1.8206 | 1.3109 |
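A sketch of how such a bound can be computed from the description above (per-position range of the cumulative-PoE logit delta, p99 over positions). Tensor shapes and names are assumptions for illustration, not the release's calibration code:

```python
import torch

def wand_p99_bound(cum_logits_k: torch.Tensor, cum_logits_k1: torch.Tensor) -> float:
    """p99 over positions of range(delta), where delta is the stage k+1 minus stage k
    cumulative-PoE logits; max-minus-min cancels any constant per-position shift."""
    delta = cum_logits_k1 - cum_logits_k                       # [positions, vocab]
    rng = delta.max(dim=-1).values - delta.min(dim=-1).values  # [positions]
    return torch.quantile(rng, 0.99).item()
```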
### Bayesian PoE α-sweep (renormed BPB at α=0)

α=0 is the geometric-mean PoE aggregate (i.e. the uniform mean of log-probabilities); higher α values approach pure-sum PoE.

| Branch | Step | α=0 (geom-mean) | α=0.25 | α=0.5 (Bayesian √K) | α=0.75 | α=1.0 (pure sum / Log-OP) | crossover gap (α=0 vs single s3) |
|---|---|---|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0.724738 | 0.791984 | 0.980340 | 1.298269 | 1.778352 | +0.000465 |
| `step-84000` | 84,000 | 0.724587 | 0.791340 | 0.979207 | 1.296562 | 1.775902 | +0.000529 |
| `step-86000` | 86,000 | 0.822195 | 0.899830 | 1.114882 | 1.477344 | 2.024439 | +0.000295 |
| `step-88000` | 88,000 | 0.832245 | 0.915820 | 1.139195 | 1.513019 | 2.075771 | +0.000195 |
| `step-90000` | 90,000 | 0.827274 | 0.901033 | 1.113493 | 1.473665 | 2.018179 | +0.000211 |
| `step-92000` | 92,000 | 0.832023 | 0.904967 | 1.118086 | 1.479839 | 2.026806 | +0.000661 |
| `step-94000` | 94,000 | 0.825098 | 0.910879 | 1.136512 | 1.512243 | 2.076723 | +0.000672 |
| `step-96000` | 96,000 | 0.827363 | 0.901554 | 1.114122 | 1.474327 | 2.018869 | +0.000416 |
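One consistent reading of the α labels (α=0 is the mean of per-stage log-probs, α=0.5 scales the sum by 1/√K, α=1 is the raw sum) is a K^(α−1) scaling of the summed log-probabilities before renormalization. This is an interpretive sketch of that reading, not the release's aggregation code:

```python
import torch

def poe_alpha_aggregate(stage_logprobs: torch.Tensor, alpha: float) -> torch.Tensor:
    """stage_logprobs: [K, positions, vocab] per-stage log-probs.
    α=0 → geometric mean; α=0.5 → √K Bayesian; α=1 → pure-sum Log-OP."""
    k = stage_logprobs.shape[0]
    scaled = stage_logprobs.sum(dim=0) * (k ** (alpha - 1.0))
    return torch.log_softmax(scaled, dim=-1)  # renormalize to a distribution
```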
### Sample-mode probe (full K=4, greedy, 60 tokens, 7-prompt fixed set)

Concept-level retention probe across a fixed 7-prompt set: capital of France, gold chemical symbol, Friday→tomorrow, opposite of hot, planets list, favorite color, 5x+3=13 algebra. Greedy continuations track the factual-recall trajectory in parallel with BPB.

| Branch | Step | France | Gold (Au + atomic # 79) | Planets list | Algebra 5x+3=13 (correct: 2) | Note |
|---|---|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited | Au ✓ + 79 ✓ + transition metals ✓ + properties (richest) | "eight major planets" + complete 8-list ✓ + Roman gods ✓ | echoes question without producing value (regression) | richest factual France & Gold across prior phase |
| `step-84000` | 84,000 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited (≈ seed) | Au ✓ + properties (no atomic #, no transition metals) | terrestrial 4 + gas-giant heading (truncated) | echoes question without producing value | small variations only — 77 steps at near-zero LR ≈ noise |
| `step-86000` | 86,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show LR-shock +0.097 BPB / -4.4pp acc / -4.5pp α |
| `step-88000` | 88,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show plateau approach: BPB slope decelerated 10x, spec α RECOVERED to 0.9795 (-0.006 from Run A) |
| `step-90000` | 90,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show FIRST PEAK DESCENT: full K=4 BPB -0.005, spec α MATCHED Run A 0.9853, per-stage acc +0.002 RECOVERY, routing +1.5pp |
| `step-92000` | 92,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show DECOUPLED METRICS: BPB re-bounced +0.005 (oscillation around the 0.832 plateau), but spec α 0.9970 NEW HIGH surpassed the Run A baseline by +0.012, routing cap=0.020 74.62% trajectory peak, WAND bounds 1→2 / 2→3 narrowest of trajectory |
| `step-94000` | 94,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show 2nd BPB DESCENT (-0.007 vs s092, lowest peak-phase value 0.825), spec α 0.9623 (pulled back from the s092 high), per-stage acc +0.003 RECOVERY, routing cap=0.020 77.69% NEW continual peak, WAND bounds ALL -13% vs s092 / 16-19% below Run A (NEW LOWS) |
| `step-96000` | 96,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show s094 was an outlier — BPB +0.002 reverts toward the median (0.827), routing cap=0.020 REVERTS -7.18pp to 70.5% (the s094 outlier high invalidated), WAND bounds ALL widened +12%, reverting toward s092 levels. No metric in the 6-ckpt peak phase shows a monotonic trend — all oscillate within stable envelopes. |
### Reading the table
The seed (step-83923) row is the frozen reference. Subsequent rows answer:
- BPB descent — does `full K=4 BPB` continue dropping past the seed, or plateau?
- Per-stage refinement — do single-stage BPBs descend, indicating each head genuinely tightens?
- Routing margin — does the cap=0.020 routing fraction grow (more positions cleanly handled by stage 0 alone)?
- Spec α dynamics — does stage-0-vs-full agreement strengthen as continual training progresses?
- WAND p99 evolution — does the cumulative-PoE delta range shrink (head distributions converge) or widen (head specialization tightens)?
- Sample-mode — do specific factual probes (planets list, atomic number, algebra answer) become reliably correct, or oscillate?
A real continual-learning win requires multiple metrics to diverge from the seed in a coherent direction. A null result would have all rows ≈ seed — meaning the prior-phase warmdown had already extracted available capacity from this data.
The seed → step-84000 delta is inside expected noise (77 steps at near-zero warmup LR cannot move the model meaningfully).
step-86000 is the first post-warmup ckpt (rel_it = 2,077; the 1,000-step warmup ended at rel_it = 1,000, so this is 1,077 steps into the lrm = 0.5 peak phase). It shows a clear LR-shock signature: full K=4 BPB +0.097, training-log val BPB +0.103, per-stage acc -4.4pp uniformly, spec α -4.5pp, routing fraction (cap = 0.020) -14pp. The crossover gap (α=0 vs single s3) narrowed from +0.000465 to +0.000295 — the relative aggregation structure is preserved despite the absolute regression.
step-88000 (rel_it = 4,077; ~3,077 steps into peak) shows the peak-phase plateau approaching: full K=4 BPB +0.010 vs step-86000 (slope decelerated 10× vs the s086000 single jump), spec α recovered to 0.9795 (only -0.006 from Run A's 0.9852 — the first metric to fully bounce back), per-stage acc drift slowed to -0.003, routing fraction essentially flat at 70.4%, crossover gap continued narrowing to +0.000195.
step-90000 (rel_it = 6,077; ~5,077 steps into peak) is the first ckpt to show coordinated descent — full K=4 BPB -0.005 vs step-88000 (first negative delta since the LR shock), spec α 0.9853 matching Run A's 0.9852 (+0.0001), per-stage acc +0.002 recovery across all stages, routing fraction +1.5pp. The peak-phase plateau lasted ~2-3k steps (s086 → s088) before descent began. Head re-alignment (spec α recovery at s088) preceded loss-landscape descent by ~2k steps, validating the "preparation phase" interpretation.
step-92000 (rel_it = 8,077) shows decoupled metric trajectories: full K=4 BPB re-bounced +0.005 (back to s088000 plateau level — the s090000 descent was an oscillation, not monotonic), but spec α reached 0.9970 — a new trajectory high surpassing Run A endpoint by +0.012, routing cap=0.020 74.62% (trajectory peak in continual phase), and WAND bounds 1→2 / 2→3 are now the narrowest of the entire trajectory (below Run A levels).
step-94000 (rel_it = 10,077) showed BPB at 0.825 (lowest peak-phase value) and routing cap=0.020 at 77.69% with WAND bounds all -13% vs step-92000. Initially we read this as a 4-point uptrend in routing/WAND structure metrics; the next checkpoint invalidated that reading.
step-96000 (rel_it = 12,077) reverts toward the 6-ckpt envelope median: full K=4 BPB +0.002 vs s094000 (now 0.827), routing cap=0.020 -7.18pp drop to 70.51% (back at s086-s088 level), WAND bounds all +12% widened vs s094000 (back near s092000 levels). The s094000 measurement was an oscillation outlier, not the start of a monotonic trend.
Revised 6-checkpoint analysis (s086 → s096): no metric shows a monotonic trend across the peak phase. Local-slice BPB oscillates within 0.822-0.832 (range 0.010), spec α within 0.94-0.997 (range 0.057), per-stage acc within 0.4390-0.4424 (range 0.003), routing cap=0.020 within 70.4-77.7% (range 7.3pp), and WAND p99 1→2 within 1.62-2.00 (range 0.38). Peak phase at lrm=0.5 is therefore best characterised as a stable oscillation regime around an effective plateau at BPB ≈ 0.828, not as a phase of monotonic structural improvement. The cyclic LR is keeping the model in this regime; whether the upcoming warmdown (begins around step-106800, rel_it ≈ 22,861) drives BPB below the seed baseline of 0.7247 remains the decisive open test.
## What this release is not
- Not a multilingual extension. Tokenizer and data are unchanged; CJK / non-ES Romance language behavior is identical to the prior phase (substantial gaps remain).
- Not an instruction-tuned / chat model. Both phases use base pretraining objectives; chat templates are not exposed.
- Not a quality bump claim. The hypothesis is being tested in public — endpoint quality is reported as data, not as a marketing claim. Use the prior-phase release as the canonical 3B base unless trajectory evidence here recommends otherwise.
## License
Apache 2.0.
## Citation

```bibtex
@misc{jeong2026poe,
  author    = {Jeong, Jaepil},
  title     = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
  year      = {2026},
  doi       = {10.5281/zenodo.19547653},
  publisher = {Zenodo},
}
```
A 3B-specific paper covering the full prior-phase trajectory, this continual-pretraining trajectory, and a planned multilingual reorganization variant is in preparation.
## Related releases

- `cognica/Cognica-PoE-v1.0-3B-base` — Prior phase: 3B PoE per-stage, 66B tokens, single warmup→warmdown cycle, `frontier_v1` mix
- `cognica/Cognica-PoE-v1.0-1.3B-base` — 1.3B PoE per-stage release (different scale)
- `cognica/Cognica-BP-v1.0-1.3B-base` — 1.3B Backprop baseline (PoE control)