
Cognica-PoE-v1.0-3B-base-continual-learning

Continual pretraining of a 3B PoE per-stage-head model using a cyclic re-warmup schedule to extract additional capacity from the same data distribution after initial training has fully annealed.

This release studies whether a model whose first training pass has reached its scheduled LR floor (lrm ≈ 0.05) can still meaningfully improve when given a fresh half-peak warmup → warmdown cycle on the same data — without changing architecture, tokenizer, or data mix.

The model is published as a trajectory of step branches, not just a final ckpt: every saved checkpoint becomes a separate step-XXXXX branch so the continual-learning curve itself is externally auditable.

What "continual" means here

Most published "base" models are released after a single warmup → constant → warmdown LR cycle. At the end of that cycle the LR is near zero and gradient updates produce only marginal change — the model is conventionally considered "done."

This release tests a different setting: take a fully-annealed checkpoint, re-arm the optimizer with a new LR cycle at half the original peak, and continue training on the same data. We label this continual pretraining (cyclic LR) to distinguish it from:

| pattern | what changes vs prior phase |
| --- | --- |
| continual pretraining (cyclic LR) — this release | LR re-warmed; data + tokenizer + architecture unchanged |
| domain-adaptive pretraining | new domain data added |
| multilingual continual pretraining | tokenizer extended; multilingual data mixed in |
| continual instruction tuning | SFT data; chat-format objective |

The hypothesis being probed: does a half-peak second cycle on identical data produce real, measurable gain, or does the model plateau?

Methodology — B2 cyclic schedule

Initialization: a fully-trained 3B PoE per-stage-head model (66B tokens consumed, full warmup → warmdown cycle complete, lrm annealed to ~0.05 of original peak).

LR schedule for the continual phase (anchored at the resume step):

```
warmup    (rel     0 ..  1000) : lrm rises 0   → 0.5  (linear)
peak      (rel  1000 .. 22861) : lrm = 0.5            (constant; half of original peak)
warmdown  (rel 22861 .. 50800) : lrm decays 0.5 → 0.0 (linear; warmdown_ratio = 0.55)
```

| param | continual phase value | original phase (for reference) |
| --- | --- | --- |
| token budget | +40 B | 66 B |
| total steps in phase | 50,800 | 83,923 |
| warmup steps | 1,000 | 1,000 |
| peak lrm | 0.5 | 1.0 |
| warmdown ratio | 0.55 | 0.65 |
| final lrm | 0.0 | 0.05 |
| effective matrix_lr peak | 0.0075 | 0.015 |
| effective embedding_lr peak | 0.15 | 0.30 |
| effective unembedding_lr peak | 0.004 | 0.008 |
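
For reference, a minimal Python sketch of the continual-phase LR multiplier as a function of rel_it, with constants taken from the schedule above (the training code's exact boundary handling is assumed, not copied):

```python
WARMUP_STEPS   = 1_000
TOTAL_STEPS    = 50_800
WARMDOWN_START = 22_861   # ≈ TOTAL_STEPS * (1 - warmdown_ratio 0.55)
PEAK_LRM       = 0.5      # half of the original phase's peak

def lrm(rel_it: int) -> float:
    """LR multiplier at a step counted relative to the resume point (rel_it)."""
    if rel_it < WARMUP_STEPS:                      # linear warmup 0 -> 0.5
        return PEAK_LRM * rel_it / WARMUP_STEPS
    if rel_it < WARMDOWN_START:                    # constant half-peak plateau
        return PEAK_LRM
    # linear warmdown 0.5 -> 0.0 over the last 55% of the phase
    frac = (rel_it - WARMDOWN_START) / (TOTAL_STEPS - WARMDOWN_START)
    return PEAK_LRM * max(0.0, 1.0 - frac)

# e.g. step-86000 sits at rel_it = 2_077, i.e. lrm(2_077) == 0.5 (peak phase)
```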

Optimizer state is restored from the prior phase's last save (DistMuonAdamW ZeRO-2 sharded across 12 ranks). Tokenizer (32,768 vocab, rustbpe), architecture (depth=32, n_embd=2048, K=4 PoE per-stage with asymmetric stage_layers=(16,6,5,5), GQA 2:1, intermediate=12800, max_seq_len=2048), and data mix (frontier_v1: FineWeb-Edu 33.5% + DCLM-Baseline 24% + Stack-v2 16% + Wikipedia 5% + CulturaX 5% + ProofPile-2 4% + OpenWebMath 4% + Gutenberg 4% + PG-19 2% + UltraChat 1% + OpenHermes-2.5 0.6%) are all unchanged from the prior phase.
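
The frontier_v1 weights restated as a dict, purely as a readability aid (the listed components sum to 0.991; the small remainder is not broken out in this card):

```python
FRONTIER_V1_MIX = {
    "FineWeb-Edu":    0.335,
    "DCLM-Baseline":  0.240,
    "Stack-v2":       0.160,
    "Wikipedia":      0.050,
    "CulturaX":       0.050,
    "ProofPile-2":    0.040,
    "OpenWebMath":    0.040,
    "Gutenberg":      0.040,
    "PG-19":          0.020,
    "UltraChat":      0.010,
    "OpenHermes-2.5": 0.006,
}
assert abs(sum(FRONTIER_V1_MIX.values()) - 0.991) < 1e-9
```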

Why a published trajectory

The point of this release is the continual-learning curve, not any single endpoint. We publish every save (step-XXXXX branches) so the actual question — "does a re-warmed cycle keep improving the model?" — can be answered by reading off the trajectory rather than trusting our headline numbers.

Each step-XXXXX branch carries its own per-checkpoint poe_wand_p99_bounds_per_stage_head calibration in config.json so PoE-specific inference (WAND adaptive depth, self-speculative decoding) works correctly at any branch.
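
Because every checkpoint is a branch, the trajectory can be audited without downloading any weights. A minimal sketch using the huggingface_hub client (standard API, nothing specific to this repository) that lists the step-XXXXX branches and reads each branch's WAND bounds out of config.json:

```python
import json

from huggingface_hub import hf_hub_download, list_repo_refs

REPO = "cognica/Cognica-PoE-v1.0-3B-base-continual-learning"

refs = list_repo_refs(REPO)
step_branches = sorted(
    (b.name for b in refs.branches if b.name.startswith("step-")),
    key=lambda name: int(name.split("-")[1]),
)

for branch in step_branches:
    cfg_path = hf_hub_download(REPO, "config.json", revision=branch)
    with open(cfg_path) as f:
        cfg = json.load(f)
    # per-checkpoint WAND calibration, recalibrated for every branch
    print(branch, cfg.get("poe_wand_p99_bounds_per_stage_head"))
```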

Branches

| Branch | Step | Phase position | Notes |
| --- | --- | --- | --- |
| step-83923 | 83,923 | continual phase rel_it = 0 | seed: identical to the fully-annealed prior-phase final ckpt |
| step-84000 | 84,000 | rel_it = 77 (early warmup, lrm ≈ 0.04) | first save after re-warmup begins; functionally indistinguishable from the seed |
| step-86000 | 86,000 | rel_it = 2,077 (peak +1,077, lrm = 0.5) | first post-warmup ckpt; LR-shock signature (BPB +0.097 / acc -4.4pp / α -4.5pp) |
| step-88000 | 88,000 | rel_it = 4,077 (peak +3,077) | plateau approach; BPB slope decelerated 10×; spec α recovered to 0.9795 |
| step-90000 | 90,000 | rel_it = 6,077 (peak +5,077) | first coordinated descent: BPB -0.005, spec α 0.9853 matched the Run A baseline |
| step-92000 | 92,000 | rel_it = 8,077 (peak +7,077) | decoupled metrics: BPB re-bounced to the 0.832 plateau, spec α 0.9970 new high (above Run A 0.9852), WAND bounds narrowest of the trajectory |
| step-94000 | 94,000 | rel_it = 10,077 (peak +9,077) | second BPB descent (-0.007 → 0.825, lowest peak-phase value); routing cap=0.020 at 77.69%, a new continual peak; WAND bounds all -13% vs s092 and 16-19% below Run A |
| step-96000 | 96,000 | rel_it = 12,077 (peak +11,077) | s094 was an outlier: BPB reverts +0.002 (back to the 0.827 median); routing cap=0.020 reverts -7.18pp to 70.5%; WAND bounds all widen +12%. Peak phase confirmed as an oscillation regime; no monotonic trends |
| main | latest | tracks the newest published step | currently step-96000 |

Future saves: every 2000 steps (step-98000, step-100000, ..., step-134000) plus a final step-134723. Warmdown begins around step-106800 (rel_it ≈ 22,861).

Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
    revision="main",                # or any "step-XXXXX" branch
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
    revision="main",
    trust_remote_code=True,
)

# Base ckpts REQUIRE prepending <|bos|> before user text:
prompt = "The capital of France is"
input_ids = [tokenizer.bos_token_id] + tokenizer.encode(prompt, add_special_tokens=False)
input_ids = torch.tensor([input_ids], device=device)
out = model.generate(input_ids=input_ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0].tolist()))
```

PoE-specific inference helpers (single-stage forward, prefix pruning, WAND adaptive depth, self-speculative decoding) are exposed on CognicaPoEForCausalLM. Each step-XXXXX branch carries its own calibrated poe_wand_p99_bounds_per_stage_head in config.json; model.generate_wand(...) reads it automatically.

For the architectural details, full inference recipe, and the prior-phase trajectory analysis, see the prior-phase release: cognica/Cognica-PoE-v1.0-3B-base.

Trajectory measurements

The continual phase entry point (step-83923) is identical to the prior-phase endpoint, so it serves as both the resume anchor and the baseline against which every continual-phase ckpt is measured. A meaningful continual-learning result requires step-NNNNN measurements to diverge from the seed across multiple metrics — not just match it.

Per-checkpoint val BPB

Same 8-shard local val slice across all measured ckpts (1.05 M tokens, --split-tokens 1048576); single A100 80GB; FULL K=4 PoE aggregation.
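
For orientation, BPB is the summed token negative log-likelihood converted from nats to bits and normalized by the UTF-8 byte count of the evaluated text. A minimal sketch of that conversion (illustrative only; this is not the evaluation harness behind the numbers below, and the example figures are invented):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Summed token NLL (nats) over a text slice -> bits per byte of that slice."""
    return total_nll_nats / math.log(2) / total_bytes

# invented example: 1.21e6 nats of summed NLL over 2.4e6 UTF-8 bytes
print(bits_per_byte(1.21e6, 2_400_000))   # ≈ 0.727
```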

| Branch | Step | Phase rel_it | Training-log val BPB (12 ranks, 40M tokens) | Local-slice full K=4 BPB (8 shards, 1.05M tokens) |
| --- | --- | --- | --- | --- |
| step-83923 (seed) | 83,923 | 0 | 0.772893 | 0.724738 |
| step-84000 | 84,000 | 77 | 0.772671 | 0.724587 |
| step-86000 | 86,000 | 2,077 | 0.875501 | 0.822195 |
| step-88000 | 88,000 | 4,077 | 0.882645 | 0.832245 |
| step-90000 | 90,000 | 6,077 | 0.885028 | 0.827274 |
| step-92000 | 92,000 | 8,077 | 0.886520 | 0.832023 |
| step-94000 | 94,000 | 10,077 | 0.883594 | 0.825098 |
| step-96000 | 96,000 | 12,077 | 0.878945 | 0.827363 |

Per-stage BPB

Per-checkpoint training-objective BPB at each PoE stage boundary, on the same 8-shard local val slice.
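
The single s0-s3, prefix K', and full K=4 columns below differ only in which heads contribute to the aggregate. Assuming the renormalized geometric-mean aggregation implied by the α-sweep section (α=0 reproduces the full K=4 column exactly), a minimal sketch of prefix aggregation over the first K' heads:

```python
import torch
import torch.nn.functional as F

def aggregate_prefix(stage_logprobs: list[torch.Tensor], k_prime: int) -> torch.Tensor:
    """Renormalized geometric-mean PoE over the first k_prime stage heads.

    stage_logprobs: K tensors of shape (seq, vocab) holding per-stage log-probs.
    k_prime=1 reduces to single s0; k_prime=K gives the full aggregate.
    """
    stacked = torch.stack(stage_logprobs[:k_prime])   # (k_prime, seq, vocab)
    mean_logp = stacked.mean(dim=0)                   # uniform mean of log-probs
    return F.log_softmax(mean_logp, dim=-1)           # renormalize
```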

| Branch | Step | full K=4 | single s0 | single s1 | single s2 | single s3 | prefix K'=1 | prefix K'=2 | prefix K'=3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| step-83923 (seed) | 83,923 | 0.724738 | 0.727740 | 0.725798 | 0.725063 | 0.724273 | 0.727740 | 0.726103 | 0.725363 |
| step-84000 | 84,000 | 0.724587 | 0.727583 | 0.725641 | 0.724863 | 0.724058 | 0.727583 | 0.725975 | 0.725224 |
| step-86000 | 86,000 | 0.822195 | 0.825607 | 0.823068 | 0.822524 | 0.821900 | 0.825607 | 0.823529 | 0.822770 |
| step-88000 | 88,000 | 0.832245 | 0.835634 | 0.833265 | 0.832631 | 0.832050 | 0.835634 | 0.833652 | 0.832856 |
| step-90000 | 90,000 | 0.827274 | 0.830453 | 0.828213 | 0.827658 | 0.827063 | 0.830453 | 0.828545 | 0.827827 |
| step-92000 | 92,000 | 0.832023 | 0.835828 | 0.832909 | 0.832110 | 0.831363 | 0.835828 | 0.833640 | 0.832726 |
| step-94000 | 94,000 | 0.825098 | 0.828611 | 0.825863 | 0.825143 | 0.824426 | 0.828611 | 0.826573 | 0.825742 |
| step-96000 | 96,000 | 0.827363 | 0.830721 | 0.828278 | 0.827653 | 0.826947 | 0.830721 | 0.828749 | 0.827980 |

Per-stage standalone target accuracy

Top-1 accuracy of each stage's standalone prediction vs ground-truth target token, on the same 8-shard val slice.

| Branch | Step | s0 | s1 | s2 | s3 | full |
| --- | --- | --- | --- | --- | --- | --- |
| step-83923 (seed) | 83,923 | 0.4844 | 0.4856 | 0.4859 | 0.4862 | 0.4860 |
| step-84000 | 84,000 | 0.4843 | 0.4857 | 0.4859 | 0.4861 | 0.4859 |
| step-86000 | 86,000 | 0.4405 | 0.4419 | 0.4422 | 0.4424 | 0.4420 |
| step-88000 | 88,000 | 0.4375 | 0.4382 | 0.4388 | 0.4390 | 0.4386 |
| step-90000 | 90,000 | 0.4392 | 0.4403 | 0.4407 | 0.4407 | 0.4405 |
| step-92000 | 92,000 | 0.4364 | 0.4380 | 0.4385 | 0.4391 | 0.4384 |
| step-94000 | 94,000 | 0.4403 | 0.4417 | 0.4422 | 0.4421 | 0.4420 |
| step-96000 | 96,000 | 0.4391 | 0.4400 | 0.4405 | 0.4409 | 0.4404 |

Self-speculative decoding (m=4 stage-0 draft, full K=4 verify)

| Branch | Step | acceptance α | mean accepted / 4 | end-to-end speedup |
| --- | --- | --- | --- | --- |
| step-83923 (seed) | 83,923 | 0.9852 | 3.83 | 1.61x |
| step-84000 | 84,000 | 0.9736 | 3.77 | 1.57x |
| step-86000 | 86,000 | 0.9402 | 3.67 | 1.53x |
| step-88000 | 88,000 | 0.9795 | 3.88 | 1.58x |
| step-90000 | 90,000 | 0.9853 | 3.88 | 1.58x |
| step-92000 | 92,000 | 0.9970 | 3.94 | 1.60x |
| step-94000 | 94,000 | 0.9623 | 3.77 | 1.56x |
| step-96000 | 96,000 | 0.9853 | 3.94 | 1.59x |
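
A rough cost model ties the three columns together. Assuming a stage-0 draft forward costs about 16/32 of a full K=4 forward (stage 0 spans 16 of the 32 layers) and that each verify cycle emits the accepted draft tokens plus one token from the verifier, the measured speedups are approximately reproduced:

```python
DRAFT_COST = 16 / 32   # assumed relative cost of a stage-0 forward vs full K=4
M = 4                  # draft length

def spec_speedup(mean_accepted: float) -> float:
    tokens_per_cycle = mean_accepted + 1.0    # accepted drafts + one verifier token
    cost_per_cycle = M * DRAFT_COST + 1.0     # m draft passes + 1 full verify pass
    return tokens_per_cycle / cost_per_cycle

# seed (step-83923): spec_speedup(3.83) ≈ 1.61, matching the measured 1.61x;
# other checkpoints land within roughly 2-4% of the measured column.
```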

Confidence-aware routing (target_regression cap = 0.020)

| Branch | Step | fraction routed to stage 0 | projected speedup | base s0=full agreement |
| --- | --- | --- | --- | --- |
| step-83923 (seed) | 83,923 | 85.05% | 1.715x | 0.9746 |
| step-84000 | 84,000 | 87.86% | 1.756x | 0.9756 |
| step-86000 | 86,000 | 71.05% | 1.534x | 0.9685 |
| step-88000 | 88,000 | 70.43% | 1.523x | 0.9651 |
| step-90000 | 90,000 | 71.93% | 1.541x | 0.9687 |
| step-92000 | 92,000 | 74.62% | 1.585x | 0.9712 |
| step-94000 | 94,000 | 77.69% | 1.643x | 0.9721 |
| step-96000 | 96,000 | 70.51% | 1.526x | 0.9683 |
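
The projected-speedup column is consistent with a simple cost model in which a stage-0-only forward costs about half of a full K=4 forward. The sketch below is an assumption, not the release's own cost model, but it reproduces the column to within roughly 2%:

```python
S0_COST = 16 / 32   # assumed cost of a stage-0-only forward relative to full K=4

def routing_speedup(frac_stage0: float) -> float:
    avg_cost = frac_stage0 * S0_COST + (1.0 - frac_stage0) * 1.0
    return 1.0 / avg_cost

print(routing_speedup(0.8505))   # ≈ 1.74 (published 1.715x at the seed)
print(routing_speedup(0.7051))   # ≈ 1.54 (published 1.526x at step-96000)
```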

WAND p99 bounds (cumulative-PoE delta range, constant-shift invariant)

Calibrated on a 131,072-token val slice using range(delta) = max(delta) − min(delta). Each branch's config.json carries its own bounds in poe_wand_p99_bounds_per_stage_head.

| Branch | Step | bound 0 → 1 | bound 1 → 2 | bound 2 → 3 |
| --- | --- | --- | --- | --- |
| step-83923 (seed) | 83,923 | 3.9429 | 2.0193 | 1.4479 |
| step-84000 | 84,000 | 3.9043 | 2.0036 | 1.4499 |
| step-86000 | 86,000 | 3.9832 | 2.0023 | 1.4030 |
| step-88000 | 88,000 | 3.7784 | 1.9637 | 1.4530 |
| step-90000 | 90,000 | 4.0349 | 1.9545 | 1.4144 |
| step-92000 | 92,000 | 3.7762 | 1.8652 | 1.3454 |
| step-94000 | 94,000 | 3.2967 | 1.6240 | 1.1702 |
| step-96000 | 96,000 | 3.6387 | 1.8206 | 1.3109 |
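
A sketch of how such bounds could be recomputed from the definition above, assuming access to the cumulative-PoE logits after each stage on a calibration slice (this is not the repository's calibration script):

```python
import torch

def wand_p99_bounds(cum_logits_per_stage: list[torch.Tensor]) -> list[float]:
    """Per-transition WAND bounds: p99 of range(delta) over a calibration slice.

    cum_logits_per_stage: K tensors of shape (num_tokens, vocab) holding the
    cumulative-PoE logits after stages 0..K-1. For each transition k -> k+1 the
    per-token statistic is max(delta) - min(delta) over the vocab, which is
    invariant to constant shifts of the logits.
    """
    bounds = []
    for prev, nxt in zip(cum_logits_per_stage[:-1], cum_logits_per_stage[1:]):
        delta = nxt - prev                                   # (num_tokens, vocab)
        per_token_range = delta.max(dim=-1).values - delta.min(dim=-1).values
        bounds.append(torch.quantile(per_token_range, 0.99).item())
    return bounds
```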

Bayesian PoE α-sweep (renormed BPB at α=0)

α=0 is the geometric-mean PoE aggregate (i.e. uniform-mean of log-probabilities); higher α values approach pure-sum PoE.

| Branch | Step | α=0 (geom-mean) | α=0.25 | α=0.5 (Bayesian √K) | α=0.75 | α=1.0 (pure sum / Log-OP) | crossover gap (α=0 vs single s3) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| step-83923 (seed) | 83,923 | 0.724738 | 0.791984 | 0.980340 | 1.298269 | 1.778352 | +0.000465 |
| step-84000 | 84,000 | 0.724587 | 0.791340 | 0.979207 | 1.296562 | 1.775902 | +0.000529 |
| step-86000 | 86,000 | 0.822195 | 0.899830 | 1.114882 | 1.477344 | 2.024439 | +0.000295 |
| step-88000 | 88,000 | 0.832245 | 0.915820 | 1.139195 | 1.513019 | 2.075771 | +0.000195 |
| step-90000 | 90,000 | 0.827274 | 0.901033 | 1.113493 | 1.473665 | 2.018179 | +0.000211 |
| step-92000 | 92,000 | 0.832023 | 0.904967 | 1.118086 | 1.479839 | 2.026806 | +0.000661 |
| step-94000 | 94,000 | 0.825098 | 0.910879 | 1.136512 | 1.512243 | 2.076723 | +0.000672 |
| step-96000 | 96,000 | 0.827363 | 0.901554 | 1.114122 | 1.474327 | 2.018869 | +0.000416 |
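
One parameterization consistent with the labelled endpoints (α=0 uniform mean, α=0.5 scaling by 1/√K, α=1 pure sum) is to scale the summed per-stage log-probabilities by K^(α−1) before renormalizing. A sketch under that assumption (the release's own sweep may interpolate differently):

```python
import torch
import torch.nn.functional as F

def poe_alpha_aggregate(stage_logprobs: list[torch.Tensor], alpha: float) -> torch.Tensor:
    """α-interpolated, renormalized PoE aggregate of K per-stage log-prob heads."""
    k = len(stage_logprobs)
    scale = k ** (alpha - 1.0)   # 1/K at α=0, 1/sqrt(K) at α=0.5, 1 at α=1
    summed = torch.stack(stage_logprobs).sum(dim=0)
    return F.log_softmax(scale * summed, dim=-1)
```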

Sample-mode probe (FULL K=4, greedy, 60 tokens, 7-prompt fixed set)

Concept-level retention probe across a fixed 7-prompt set: capital of France, gold chemical symbol, Friday→tomorrow, opposite of hot, planets list, favorite color, 5x+3=13 algebra. Greedy continuations track factual recall trajectory in parallel with BPB.

| Branch | Step | France | Gold (Au + atomic # 79) | Planets list | Algebra 5x+3=13 (correct: x = 2) | Note |
| --- | --- | --- | --- | --- | --- | --- |
| step-83923 (seed) | 83,923 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited | Au ✓ + 79 ✓ + transition metals ✓ + properties (richest) | "eight major planets" + complete 8-list ✓ + Roman gods ✓ | echoes question without producing value (regression) | richest factual France & Gold output across the prior phase |
| step-84000 | 84,000 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited (≈ seed) | Au ✓ + properties (no atomic #, no transition metals) | terrestrial 4 + gas-giant heading (truncated) | echoes question without producing value | small variations only — 77 steps at near-zero LR ≈ noise |
| step-86000 | 86,000 | (pending) | (pending) | (pending) | (pending) | sample-mode probe deferred; numeric rows show the LR shock: BPB +0.097 / acc -4.4pp / α -4.5pp |
| step-88000 | 88,000 | (pending) | (pending) | (pending) | (pending) | sample-mode probe deferred; numeric rows show the plateau approach: BPB slope decelerated 10×, spec α recovered to 0.9795 (-0.006 from Run A) |
| step-90000 | 90,000 | (pending) | (pending) | (pending) | (pending) | sample-mode probe deferred; numeric rows show the first peak descent: full K=4 BPB -0.005, spec α 0.9853 matched Run A's 0.9852, per-stage acc +0.002 recovery, routing +1.5pp |
| step-92000 | 92,000 | (pending) | (pending) | (pending) | (pending) | sample-mode probe deferred; numeric rows show decoupled metrics: BPB re-bounced +0.005 (oscillation around the 0.832 plateau), but spec α 0.9970 is a new high surpassing the Run A baseline by +0.012, routing cap=0.020 at 74.62% is the trajectory peak, and the WAND 1→2 / 2→3 bounds are the narrowest of the trajectory |
| step-94000 | 94,000 | (pending) | (pending) | (pending) | (pending) | sample-mode probe deferred; numeric rows show the second BPB descent (-0.007 vs s092, lowest peak-phase value 0.825), spec α 0.9623 (pulled back from the s092 high), per-stage acc +0.003 recovery, routing cap=0.020 at 77.69% a new continual peak, WAND bounds all -13% vs s092 and 16-19% below Run A (new lows) |
| step-96000 | 96,000 | (pending) | (pending) | (pending) | (pending) | sample-mode probe deferred; numeric rows show s094 was an outlier: BPB +0.002 reverts toward the median (0.827), routing cap=0.020 reverts -7.18pp to 70.5% (the s094 high invalidated), WAND bounds all +12% wider, back near s092 levels. No metric in the 6-ckpt peak phase shows a monotonic trend — all oscillate within stable envelopes |

Reading the table

The seed (step-83923) row is the frozen reference. Subsequent rows answer:

  1. BPB descent — does full K=4 BPB continue dropping past the seed, or plateau?
  2. Per-stage refinement — do single-stage BPBs descend, indicating each head genuinely tightens?
  3. Routing margin — does the cap=0.020 routing fraction grow (more positions cleanly handled by stage 0 alone)?
  4. Spec α dynamics — does stage-0-vs-full agreement strengthen as continual training progresses?
  5. WAND p99 evolution — does the cumulative-PoE delta range shrink (head distributions converge) or widen (heads specialize further)?
  6. Sample-mode — do specific factual probes (planets list, atomic number, algebra answer) become reliably correct, or oscillate?

A real continual-learning win requires multiple metrics to diverge from the seed in a coherent direction. A null result would have all rows ≈ seed — meaning the prior-phase warmdown had already extracted available capacity from this data.
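
As a concrete illustration of that check, the snippet below compares the seed row against step-96000 using values copied from the tables in this card:

```python
# seed (step-83923) vs step-96000, values taken from the tables above
seed = {"bpb_full_k4": 0.724738, "acc_full": 0.4860,
        "spec_alpha": 0.9852, "routed_s0_frac": 0.8505}
s096 = {"bpb_full_k4": 0.827363, "acc_full": 0.4404,
        "spec_alpha": 0.9853, "routed_s0_frac": 0.7051}

for key in seed:
    print(f"{key:>16}: {s096[key] - seed[key]:+.4f}")

# BPB and accuracy are still worse than the seed while spec α is flat, so the
# deltas are not yet coherently favorable; hence the oscillation-regime reading below.
```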

The seed → step-84000 delta is inside expected noise (77 steps at warmup-near-zero LR cannot move the model meaningfully).

step-86000 is the first post-warmup ckpt (rel_it = 2,077; the 1,000-step warmup ended at rel_it = 1,000, so this is 1,077 steps into the lrm = 0.5 peak phase). It shows a clear LR-shock signature: full K=4 BPB +0.097, training-log val BPB +0.103, per-stage acc -4.4pp uniformly, spec α -4.5pp, routing fraction (cap = 0.020) -14pp. The crossover gap (α=0 vs single s3) narrowed from +0.000465 to +0.000295 — the relative aggregation structure is preserved despite the absolute regression.

step-88000 (rel_it = 4,077; ~3,077 steps into the peak) shows the peak-phase plateau approaching: full K=4 BPB +0.010 vs step-86000 (the slope decelerated 10× relative to the single jump at step-86000), spec α recovered to 0.9795 (only -0.006 from Run A's 0.9852 — the first metric to fully bounce back), per-stage acc drift slowed to -0.003, routing fraction essentially flat at 70.4%, and the crossover gap continued narrowing to +0.000195.

step-90000 (rel_it = 6,077; ~5,077 steps into the peak) is the first ckpt to show coordinated descent — full K=4 BPB -0.005 vs step-88000 (the first negative delta since the LR shock), spec α 0.9853 matching Run A's 0.9852 (+0.0001), per-stage acc +0.002 recovery across all stages, and routing fraction +1.5pp. The peak-phase plateau lasted ~2-3k steps (s086 → s088) before descent began. Head re-alignment (the spec α recovery at s088) preceded loss-landscape descent by ~2k steps, validating the "preparation phase" interpretation.

step-92000 (rel_it = 8,077) shows decoupled metric trajectories: full K=4 BPB re-bounced +0.005 (back to s088000 plateau level — the s090000 descent was an oscillation, not monotonic), but spec α reached 0.9970 — a new trajectory high surpassing Run A endpoint by +0.012, routing cap=0.020 74.62% (trajectory peak in continual phase), and WAND bounds 1→2 / 2→3 are now the narrowest of the entire trajectory (below Run A levels).

step-94000 (rel_it = 10,077) showed BPB at 0.825 (lowest peak-phase value) and routing cap=0.020 at 77.69% with WAND bounds all -13% vs step-92000. Initially we read this as a 4-point uptrend in routing/WAND structure metrics; the next checkpoint invalidated that reading.

step-96000 (rel_it = 12,077) reverts toward the 6-ckpt envelope median: full K=4 BPB +0.002 vs s094000 (now 0.827), routing cap=0.020 -7.18pp drop to 70.51% (back at s086-s088 level), WAND bounds all +12% widened vs s094000 (back near s092000 levels). The s094000 measurement was an oscillation outlier, not the start of a monotonic trend.

Revised 6-checkpoint analysis (s086 → s096): no metric shows a monotonic trend across the peak phase. Local-slice BPB oscillates within 0.822-0.832 (range 0.010), spec α within 0.94-0.997 (range 0.057), per-stage acc within 0.4390-0.4424 (range 0.003), routing cap=0.020 within 70.4-77.7% (range 7.3pp), and WAND p99 1→2 within 1.62-2.00 (range 0.38). Peak phase at lrm=0.5 is therefore best characterised as a stable oscillation regime around an effective plateau at BPB ≈ 0.828, not as a phase of monotonic structural improvement. The cyclic LR is keeping the model in this regime; whether the upcoming warmdown (begins around step-106800, rel_it ≈ 22,861) drives BPB below the seed baseline of 0.7247 remains the decisive open test.

What this release is not

  • Not a multilingual extension. Tokenizer and data are unchanged; CJK / non-ES Romance language behavior is identical to the prior phase (substantial gaps remain).
  • Not an instruction-tuned / chat model. Both phases use base pretraining objectives; chat templates are not exposed.
  • Not a quality bump claim. The hypothesis is being tested in public — endpoint quality is reported as data, not as a marketing claim. Use the prior-phase release as the canonical 3B base unless trajectory evidence here recommends otherwise.

License

Apache 2.0.

Citation

```bibtex
@misc{jeong2026poe,
  author    = {Jeong, Jaepil},
  title     = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
  year      = {2026},
  doi       = {10.5281/zenodo.19547653},
  publisher = {Zenodo},
}
```

A 3B-specific paper covering the full prior-phase trajectory, this continual-pretraining trajectory, and a planned multilingual reorganization variant is in preparation.

Related releases

  • cognica/Cognica-PoE-v1.0-3B-base (prior-phase release: architecture details, full inference recipe, and the prior-phase trajectory analysis)