# Cognica-PoE-v1.0-3B-base-continual-learning
Continual pretraining of a 3B PoE per-stage-head model using a cyclic re-warmup schedule to extract additional capacity from the same data distribution after initial training has fully annealed.
This release studies whether a model whose first training pass has reached its scheduled LR floor (lrm ≈ 0.05) can still meaningfully improve when given a fresh half-peak warmup → warmdown cycle on the same data — without changing architecture, tokenizer, or data mix.
The model is published as a trajectory of step branches, not just a final ckpt: every saved checkpoint becomes a separate step-XXXXX branch so the continual-learning curve itself is externally auditable.
What "continual" means here
Most published "base" models are released after a single warmup → constant → warmdown LR cycle. At the end of that cycle the LR is near zero and gradient updates produce only marginal change — the model is conventionally considered "done."
This release tests a different setting: take a fully-annealed checkpoint, re-arm the optimizer with a new LR cycle at half the original peak, and continue training on the same data. We label this continual pretraining (cyclic LR) to distinguish it from:
| pattern | what changes vs prior phase |
|---|---|
| continual pretraining (cyclic LR) ← this release | LR re-warmed; data + tokenizer + architecture unchanged |
| domain-adaptive pretraining | new domain data added |
| multilingual continual pretraining | tokenizer extended; multilingual data mixed in |
| continual instruction tuning | SFT data; chat-format objective |
The hypothesis being probed: does a half-peak second cycle on identical data produce real, measurable gain, or does the model plateau?
## Methodology — B2 cyclic schedule
Initialization: a fully-trained 3B PoE per-stage-head model (66B tokens consumed, full warmup → warmdown cycle complete, lrm annealed to ~0.05 of original peak).
LR schedule for the continual phase (anchored at the resume step):
```
warmup   (rel     0 ..  1000): lrm rises 0 → 0.5    (linear)
peak     (rel  1000 .. 22861): lrm = 0.5            (constant; half of original peak)
warmdown (rel 22861 .. 50800): lrm decays 0.5 → 0.0 (linear; warmdown_ratio = 0.55)
```
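For concreteness, here is a minimal sketch of this schedule as a function of the relative step. The breakpoints and warmdown ratio come from the table below; the helper name and exact rounding are illustrative, not the training code:

```python
def continual_lrm(rel_it: int,
                  warmup: int = 1_000,
                  total: int = 50_800,
                  warmdown_ratio: float = 0.55,
                  peak: float = 0.5) -> float:
    """LR multiplier at continual-phase step rel_it (steps since resume)."""
    warmdown_start = round(total * (1 - warmdown_ratio))  # ≈ 22,860 (card quotes 22,861)
    if rel_it < warmup:
        return peak * rel_it / warmup                     # linear re-warmup 0 → 0.5
    if rel_it < warmdown_start:
        return peak                                       # constant half-peak plateau
    return peak * (total - rel_it) / (total - warmdown_start)  # linear decay → 0.0

# Effective per-group LRs are the original peaks scaled by lrm, e.g. at the plateau:
# matrix_lr 0.015 * 0.5 = 0.0075, embedding_lr 0.30 * 0.5 = 0.15
```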
| param | continual phase value | original phase (for reference) |
|---|---|---|
| token budget | +40 B | 66 B |
| total steps in phase | 50,800 | 83,923 |
| warmup steps | 1,000 | 1,000 |
| peak lrm | 0.5 | 1.0 |
| warmdown ratio | 0.55 | 0.65 |
| final lrm | 0.0 | 0.05 |
| effective matrix_lr peak | 0.0075 | 0.015 |
| effective embedding_lr peak | 0.15 | 0.30 |
| effective unembedding_lr peak | 0.004 | 0.008 |
Optimizer state is restored from the prior phase's last save (DistMuonAdamW, ZeRO-2 sharded across 12 ranks). Everything else is unchanged from the prior phase:

- Tokenizer: 32,768 vocab, rustbpe
- Architecture: depth=32, n_embd=2048, K=4 PoE per-stage with asymmetric stage_layers=(16,6,5,5), GQA 2:1, intermediate=12800, max_seq_len=2048
- Data mix: `frontier_v1` (FineWeb-Edu 33.5% + DCLM-Baseline 24% + Stack-v2 16% + Wikipedia 5% + CulturaX 5% + ProofPile-2 4% + OpenWebMath 4% + Gutenberg 4% + PG-19 2% + UltraChat 1% + OpenHermes-2.5 0.6%)
## Why a published trajectory
The point of this release is the continual-learning curve, not any single endpoint. We publish every save (step-XXXXX branches) so the actual question — "does a re-warmed cycle keep improving the model?" — can be answered by reading off the trajectory rather than trusting our headline numbers.
Each `step-XXXXX` branch carries its own per-checkpoint `poe_wand_p99_bounds_per_stage_head` calibration in `config.json`, so PoE-specific inference (WAND adaptive depth, self-speculative decoding) works correctly at any branch.
## Branches

| Branch | Step | Phase position | Notes |
|---|---|---|---|
| `step-83923` | 83,923 | continual phase rel_it = 0 | seed: identical to the fully-annealed prior-phase final ckpt |
| `step-84000` | 84,000 | rel_it = 77 (warmup early, lrm ≈ 0.04) | first save after re-warmup begins; functionally indistinguishable from seed |
| `step-86000` | 86,000 | rel_it = 2,077 (peak +1,077, lrm = 0.5) | first post-warmup ckpt; LR-shock signature (BPB +0.097 / acc -4.4pp / α -4.5pp) |
| `step-88000` | 88,000 | rel_it = 4,077 (peak +3,077) | plateau approach; BPB slope decelerated 10×; spec α recovered to 0.9795 |
| `step-90000` | 90,000 | rel_it = 6,077 (peak +5,077) | first coordinated descent: BPB -0.005, spec α 0.9853 matched Run A baseline |
| `step-92000` | 92,000 | rel_it = 8,077 (peak +7,077) | decoupled metrics: BPB re-bounce to the 0.832 plateau, spec α 0.9970 new high (above Run A 0.9852), WAND bounds narrowest of trajectory |
| `step-94000` | 94,000 | rel_it = 10,077 (peak +9,077) | 2nd BPB descent (-0.007 → 0.825, lowest peak-phase value); routing cap=0.020 77.69%, new continual peak; WAND bounds all -13% vs s092 / 16-19% below Run A |
| `step-96000` | 96,000 | rel_it = 12,077 (peak +11,077) | s094 was an outlier: BPB reverts +0.002 (back to the 0.827 median); routing cap=0.020 reverts -7.18pp to 70.5%; WAND bounds all widened +12%. Peak phase confirmed as an oscillation regime, no monotonic trends. |
| `main` | latest | tracks newest published step | currently `step-96000` |
Future saves: every 2000 steps (step-98000, step-100000, ..., step-134000) plus a final step-134723. Warmdown begins around step-106800 (rel_it ≈ 22,861).
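One way to enumerate the published trajectory programmatically, using the standard `huggingface_hub` API (only the repo id comes from this card):

```python
from huggingface_hub import list_repo_refs

refs = list_repo_refs("cognica/Cognica-PoE-v1.0-3B-base-continual-learning")
step_branches = sorted(b.name for b in refs.branches if b.name.startswith("step-"))
print(step_branches)  # one branch per published checkpoint
```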
## Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
    revision="main",  # or any "step-XXXXX" branch
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to(device).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "cognica/Cognica-PoE-v1.0-3B-base-continual-learning",
    revision="main",
    trust_remote_code=True,
)

# Base ckpts REQUIRE prepending <|bos|> before user text:
prompt = "The capital of France is"
input_ids = [tokenizer.bos_token_id] + tokenizer.encode(prompt, add_special_tokens=False)
input_ids = torch.tensor([input_ids], device=device)

out = model.generate(input_ids=input_ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0].tolist()))
```
PoE-specific inference helpers (single-stage forward, prefix pruning, WAND adaptive depth, self-speculative decoding) are exposed on `CognicaPoEForCausalLM`. Each `step-XXXXX` branch carries its own calibrated `poe_wand_p99_bounds_per_stage_head` in `config.json`; `model.generate_wand(...)` reads it automatically.
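A hedged usage sketch: `generate_wand` is the helper named above, but the argument names here are assumed to mirror `generate` and are not a documented signature.

```python
# WAND adaptive-depth decoding; per-branch p99 bounds are read from config.json.
# Argument names are assumptions mirroring generate(); see the prior-phase release
# for the authoritative recipe.
out = model.generate_wand(input_ids=input_ids, max_new_tokens=32)
print(tokenizer.decode(out[0].tolist()))
```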
For the architectural details, full inference recipe, and the prior-phase trajectory analysis, see the prior-phase release: cognica/Cognica-PoE-v1.0-3B-base.
## Trajectory measurements
The continual phase entry point (step-83923) is identical to the prior-phase endpoint, so it serves as both the resume anchor and the baseline against which every continual-phase ckpt is measured. A meaningful continual-learning result requires step-NNNNN measurements to diverge from the seed across multiple metrics — not just match it.
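Bits-per-byte (BPB) normalizes summed token cross-entropy by UTF-8 byte count, making values comparable across tokenizers. A minimal sketch of the conversion, for readers reproducing the tables (the evaluation harness itself is not published here):

```python
import math

def bits_per_byte(sum_loss_nats: float, n_bytes: int) -> float:
    """Summed cross-entropy over a val slice (in nats) → bits per UTF-8 byte."""
    return sum_loss_nats / (math.log(2) * n_bytes)
```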
### Per-checkpoint val BPB

Same 8-shard local val slice across all measured ckpts (1.05M tokens, `--split-tokens 1048576`); single A100 80GB; full K=4 PoE aggregation.

| Branch | Step | Phase rel_it | Training-log val BPB (12 ranks, 40M tokens) | Local-slice full K=4 BPB (8 shards, 1.05M tokens) |
|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0 | 0.772893 | 0.724738 |
| `step-84000` | 84,000 | 77 | 0.772671 | 0.724587 |
| `step-86000` | 86,000 | 2,077 | 0.875501 | 0.822195 |
| `step-88000` | 88,000 | 4,077 | 0.882645 | 0.832245 |
| `step-90000` | 90,000 | 6,077 | 0.885028 | 0.827274 |
| `step-92000` | 92,000 | 8,077 | 0.886520 | 0.832023 |
| `step-94000` | 94,000 | 10,077 | 0.883594 | 0.825098 |
| `step-96000` | 96,000 | 12,077 | 0.878945 | 0.827363 |
### Per-stage BPB

Per-checkpoint training-objective BPB at each PoE stage boundary, on the same 8-shard local val slice.

| Branch | Step | full K=4 | single s0 | single s1 | single s2 | single s3 | prefix K'=1 | prefix K'=2 | prefix K'=3 |
|---|---|---|---|---|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0.724738 | 0.727740 | 0.725798 | 0.725063 | 0.724273 | 0.727740 | 0.726103 | 0.725363 |
| `step-84000` | 84,000 | 0.724587 | 0.727583 | 0.725641 | 0.724863 | 0.724058 | 0.727583 | 0.725975 | 0.725224 |
| `step-86000` | 86,000 | 0.822195 | 0.825607 | 0.823068 | 0.822524 | 0.821900 | 0.825607 | 0.823529 | 0.822770 |
| `step-88000` | 88,000 | 0.832245 | 0.835634 | 0.833265 | 0.832631 | 0.832050 | 0.835634 | 0.833652 | 0.832856 |
| `step-90000` | 90,000 | 0.827274 | 0.830453 | 0.828213 | 0.827658 | 0.827063 | 0.830453 | 0.828545 | 0.827827 |
| `step-92000` | 92,000 | 0.832023 | 0.835828 | 0.832909 | 0.832110 | 0.831363 | 0.835828 | 0.833640 | 0.832726 |
| `step-94000` | 94,000 | 0.825098 | 0.828611 | 0.825863 | 0.825143 | 0.824426 | 0.828611 | 0.826573 | 0.825742 |
| `step-96000` | 96,000 | 0.827363 | 0.830721 | 0.828278 | 0.827653 | 0.826947 | 0.830721 | 0.828749 | 0.827980 |
### Per-stage standalone target accuracy

Top-1 accuracy of each stage's standalone prediction vs the ground-truth target token, on the same 8-shard val slice.

| Branch | Step | s0 | s1 | s2 | s3 | full |
|---|---|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0.4844 | 0.4856 | 0.4859 | 0.4862 | 0.4860 |
| `step-84000` | 84,000 | 0.4843 | 0.4857 | 0.4859 | 0.4861 | 0.4859 |
| `step-86000` | 86,000 | 0.4405 | 0.4419 | 0.4422 | 0.4424 | 0.4420 |
| `step-88000` | 88,000 | 0.4375 | 0.4382 | 0.4388 | 0.4390 | 0.4386 |
| `step-90000` | 90,000 | 0.4392 | 0.4403 | 0.4407 | 0.4407 | 0.4405 |
| `step-92000` | 92,000 | 0.4364 | 0.4380 | 0.4385 | 0.4391 | 0.4384 |
| `step-94000` | 94,000 | 0.4403 | 0.4417 | 0.4422 | 0.4421 | 0.4420 |
| `step-96000` | 96,000 | 0.4391 | 0.4400 | 0.4405 | 0.4409 | 0.4404 |
### Self-speculative decoding (m=4 stage-0 draft, full K=4 verify)

| Branch | Step | acceptance α | mean accepted / 4 | end-to-end speedup |
|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0.9852 | 3.83 | 1.61x |
| `step-84000` | 84,000 | 0.9736 | 3.77 | 1.57x |
| `step-86000` | 86,000 | 0.9402 | 3.67 | 1.53x |
| `step-88000` | 88,000 | 0.9795 | 3.88 | 1.58x |
| `step-90000` | 90,000 | 0.9853 | 3.88 | 1.58x |
| `step-92000` | 92,000 | 0.9970 | 3.94 | 1.60x |
| `step-94000` | 94,000 | 0.9623 | 3.77 | 1.56x |
| `step-96000` | 96,000 | 0.9853 | 3.94 | 1.59x |
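As a sanity check on the α column: under the simplifying assumption that each of the m=4 draft tokens is accepted independently with probability α, the expected accepted count is the geometric sum over draft positions, which lands close to the measured values (the table itself reports empirical counts):

```python
def expected_accepted(alpha: float, m: int = 4) -> float:
    """Expected accepted draft tokens per verify step, assuming i.i.d. acceptance."""
    return sum(alpha ** k for k in range(1, m + 1))

print(round(expected_accepted(0.9852), 2))  # 3.85 vs measured 3.83 at the seed
```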
### Confidence-aware routing (target_regression cap = 0.020)

| Branch | Step | fraction routed to stage 0 | projected speedup | s0-vs-full agreement (base) |
|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 85.05% | 1.715x | 0.9746 |
| `step-84000` | 84,000 | 87.86% | 1.756x | 0.9756 |
| `step-86000` | 86,000 | 71.05% | 1.534x | 0.9685 |
| `step-88000` | 88,000 | 70.43% | 1.523x | 0.9651 |
| `step-90000` | 90,000 | 71.93% | 1.541x | 0.9687 |
| `step-92000` | 92,000 | 74.62% | 1.585x | 0.9712 |
| `step-94000` | 94,000 | 77.69% | 1.643x | 0.9721 |
| `step-96000` | 96,000 | 70.51% | 1.526x | 0.9683 |
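The projected-speedup column is roughly reproduced by a layer-count cost model: stage 0 runs 16 of 32 layers, so a routed position costs about half a full forward. This is an assumed model for illustration (it ignores per-stage head overhead, which is why it slightly overshoots the reported numbers):

```python
def projected_speedup(frac_s0: float, s0_cost: float = 16 / 32) -> float:
    """Throughput gain if frac_s0 of positions stop at stage 0 (layer-count cost model)."""
    return 1.0 / (frac_s0 * s0_cost + (1.0 - frac_s0))

print(round(projected_speedup(0.8505), 3))  # 1.740 vs reported 1.715 at the seed
```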
### WAND p99 bounds (cumulative-PoE delta range, constant-shift invariant)

Calibrated on a 131,072-token val slice using range(delta) = max(delta) − min(delta). Each branch's `config.json` carries its own bounds in `poe_wand_p99_bounds_per_stage_head`.

| Branch | Step | bound 0 → 1 | bound 1 → 2 | bound 2 → 3 |
|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 3.9429 | 2.0193 | 1.4479 |
| `step-84000` | 84,000 | 3.9043 | 2.0036 | 1.4499 |
| `step-86000` | 86,000 | 3.9832 | 2.0023 | 1.4030 |
| `step-88000` | 88,000 | 3.7784 | 1.9637 | 1.4530 |
| `step-90000` | 90,000 | 4.0349 | 1.9545 | 1.4144 |
| `step-92000` | 92,000 | 3.7762 | 1.8652 | 1.3454 |
| `step-94000` | 94,000 | 3.2967 | 1.6240 | 1.1702 |
| `step-96000` | 96,000 | 3.6387 | 1.8206 | 1.3109 |
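A sketch of how such a bound can be computed from the description above (per-position range of the cumulative-PoE logit delta, p99 over positions). Tensor shapes and names are assumptions for illustration, not the release's calibration code:

```python
import torch

def wand_p99_bound(cum_logits_k: torch.Tensor, cum_logits_k1: torch.Tensor) -> float:
    """p99 over positions of range(delta), where delta is the stage k+1 minus stage k
    cumulative-PoE logits; max-minus-min cancels any constant per-position shift."""
    delta = cum_logits_k1 - cum_logits_k                       # [positions, vocab]
    rng = delta.max(dim=-1).values - delta.min(dim=-1).values  # [positions]
    return torch.quantile(rng, 0.99).item()
```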
### Bayesian PoE α-sweep (renormed BPB at α=0)

α=0 is the geometric-mean PoE aggregate (i.e. the uniform mean of log-probabilities); higher α values approach pure-sum PoE.

| Branch | Step | α=0 (geom-mean) | α=0.25 | α=0.5 (Bayesian √K) | α=0.75 | α=1.0 (pure sum / Log-OP) | crossover gap (α=0 vs single s3) |
|---|---|---|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | 0.724738 | 0.791984 | 0.980340 | 1.298269 | 1.778352 | +0.000465 |
| `step-84000` | 84,000 | 0.724587 | 0.791340 | 0.979207 | 1.296562 | 1.775902 | +0.000529 |
| `step-86000` | 86,000 | 0.822195 | 0.899830 | 1.114882 | 1.477344 | 2.024439 | +0.000295 |
| `step-88000` | 88,000 | 0.832245 | 0.915820 | 1.139195 | 1.513019 | 2.075771 | +0.000195 |
| `step-90000` | 90,000 | 0.827274 | 0.901033 | 1.113493 | 1.473665 | 2.018179 | +0.000211 |
| `step-92000` | 92,000 | 0.832023 | 0.904967 | 1.118086 | 1.479839 | 2.026806 | +0.000661 |
| `step-94000` | 94,000 | 0.825098 | 0.910879 | 1.136512 | 1.512243 | 2.076723 | +0.000672 |
| `step-96000` | 96,000 | 0.827363 | 0.901554 | 1.114122 | 1.474327 | 2.018869 | +0.000416 |
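One consistent reading of the α labels (α=0 is the mean of per-stage log-probs, α=0.5 scales the sum by 1/√K, α=1 is the raw sum) is a K^(α−1) scaling of the summed log-probabilities before renormalization. This is an interpretive sketch of that reading, not the release's aggregation code:

```python
import torch

def poe_alpha_aggregate(stage_logprobs: torch.Tensor, alpha: float) -> torch.Tensor:
    """stage_logprobs: [K, positions, vocab] per-stage log-probs.
    α=0 → geometric mean; α=0.5 → √K Bayesian; α=1 → pure-sum Log-OP."""
    k = stage_logprobs.shape[0]
    scaled = stage_logprobs.sum(dim=0) * (k ** (alpha - 1.0))
    return torch.log_softmax(scaled, dim=-1)  # renormalize to a distribution
```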
### Sample-mode probe (full K=4, greedy, 60 tokens, 7-prompt fixed set)

Concept-level retention probe across a fixed 7-prompt set: capital of France, gold chemical symbol, Friday→tomorrow, opposite of hot, planets list, favorite color, 5x+3=13 algebra. Greedy continuations track the factual-recall trajectory in parallel with BPB.

| Branch | Step | France | Gold (Au + atomic # 79) | Planets list | Algebra 5x+3=13 (correct: 2) | Note |
|---|---|---|---|---|---|---|
| `step-83923` (seed) | 83,923 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited | Au ✓ + 79 ✓ + transition metals ✓ + properties (richest) | "eight major planets" + complete 8-list ✓ + Roman gods ✓ | echoes question without producing value (regression) | richest factual France & Gold across prior phase |
| `step-84000` | 84,000 | Paris ✓ + ÎdF ✓ + 6th in EU + most visited (≈ seed) | Au ✓ + properties (no atomic #, no transition metals) | terrestrial 4 + gas-giant heading (truncated) | echoes question without producing value | small variations only — 77 steps at near-zero LR ≈ noise |
| `step-86000` | 86,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show LR-shock +0.097 BPB / -4.4pp acc / -4.5pp α |
| `step-88000` | 88,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show plateau approach: BPB slope decelerated 10x, spec α RECOVERED to 0.9795 (-0.006 from Run A) |
| `step-90000` | 90,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show FIRST PEAK DESCENT: full K=4 BPB -0.005, spec α MATCHED Run A 0.9853, per-stage acc +0.002 RECOVERY, routing +1.5pp |
| `step-92000` | 92,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show DECOUPLED METRICS: BPB re-bounced +0.005 (oscillation around the 0.832 plateau), but spec α 0.9970 NEW HIGH surpassed the Run A baseline by +0.012, routing cap=0.020 74.62% trajectory peak, WAND bounds 1→2 / 2→3 narrowest of trajectory |
| `step-94000` | 94,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show 2nd BPB DESCENT (-0.007 vs s092, lowest peak-phase value 0.825), spec α 0.9623 (pulled back from the s092 high), per-stage acc +0.003 RECOVERY, routing cap=0.020 77.69% NEW continual peak, WAND bounds ALL -13% vs s092 / 16-19% below Run A (NEW LOWS) |
| `step-96000` | 96,000 | (sample-mode probe pending) | (pending) | (pending) | (pending) | sample-mode deferred; numeric rows show s094 was an outlier — BPB +0.002 reverts toward the median (0.827), routing cap=0.020 REVERTS -7.18pp to 70.5% (the s094 outlier high invalidated), WAND bounds ALL widened +12%, reverting toward s092 levels. No metric in the 6-ckpt peak phase shows a monotonic trend — all oscillate within stable envelopes. |
### Reading the table
The seed (step-83923) row is the frozen reference. Subsequent rows answer:
- BPB descent — does `full K=4 BPB` continue dropping past the seed, or plateau?
- Per-stage refinement — do single-stage BPBs descend, indicating each head genuinely tightens?
- Routing margin — does the cap=0.020 routing fraction grow (more positions cleanly handled by stage 0 alone)?
- Spec α dynamics — does stage-0-vs-full agreement strengthen as continual training progresses?
- WAND p99 evolution — does the cumulative-PoE delta range shrink (head distributions converge) or widen (head specialization tightens)?
- Sample-mode — do specific factual probes (planets list, atomic number, algebra answer) become reliably correct, or oscillate?
A real continual-learning win requires multiple metrics to diverge from the seed in a coherent direction. A null result would have all rows ≈ seed — meaning the prior-phase warmdown had already extracted available capacity from this data.
The seed → step-84000 delta is inside expected noise (77 steps at near-zero warmup LR cannot move the model meaningfully).
step-86000 is the first post-warmup ckpt (rel_it = 2,077; the 1,000-step warmup ended at rel_it = 1,000, so this is 1,077 steps into the lrm = 0.5 peak phase). It shows a clear LR-shock signature: full K=4 BPB +0.097, training-log val BPB +0.103, per-stage acc -4.4pp uniformly, spec α -4.5pp, routing fraction (cap = 0.020) -14pp. The crossover gap (α=0 vs single s3) narrowed from +0.000465 to +0.000295 — the relative aggregation structure is preserved despite the absolute regression.
step-88000 (rel_it = 4,077; ~3,077 steps into peak) shows the peak-phase plateau approaching: full K=4 BPB +0.010 vs step-86000 (slope decelerated 10× vs the s086000 single jump), spec α recovered to 0.9795 (only -0.006 from Run A's 0.9852 — the first metric to fully bounce back), per-stage acc drift slowed to -0.003, routing fraction essentially flat at 70.4%, crossover gap continued narrowing to +0.000195.
step-90000 (rel_it = 6,077; ~5,077 steps into peak) is the first ckpt to show coordinated descent — full K=4 BPB -0.005 vs step-88000 (first negative delta since the LR shock), spec α 0.9853 matching Run A's 0.9852 (+0.0001), per-stage acc +0.002 recovery across all stages, routing fraction +1.5pp. The peak-phase plateau lasted ~2-3k steps (s086 → s088) before descent began. Head re-alignment (spec α recovery at s088) preceded loss-landscape descent by ~2k steps, validating the "preparation phase" interpretation.
step-92000 (rel_it = 8,077) shows decoupled metric trajectories: full K=4 BPB re-bounced +0.005 (back to s088000 plateau level — the s090000 descent was an oscillation, not monotonic), but spec α reached 0.9970 — a new trajectory high surpassing Run A endpoint by +0.012, routing cap=0.020 74.62% (trajectory peak in continual phase), and WAND bounds 1→2 / 2→3 are now the narrowest of the entire trajectory (below Run A levels).
step-94000 (rel_it = 10,077) showed BPB at 0.825 (lowest peak-phase value) and routing cap=0.020 at 77.69% with WAND bounds all -13% vs step-92000. Initially we read this as a 4-point uptrend in routing/WAND structure metrics; the next checkpoint invalidated that reading.
step-96000 (rel_it = 12,077) reverts toward the 6-ckpt envelope median: full K=4 BPB +0.002 vs s094000 (now 0.827), routing cap=0.020 -7.18pp drop to 70.51% (back at s086-s088 level), WAND bounds all +12% widened vs s094000 (back near s092000 levels). The s094000 measurement was an oscillation outlier, not the start of a monotonic trend.
Revised 6-checkpoint analysis (s086 → s096): no metric shows a monotonic trend across the peak phase. Local-slice BPB oscillates within 0.822-0.832 (range 0.010), spec α within 0.94-0.997 (range 0.057), per-stage acc within 0.4390-0.4424 (range 0.003), routing cap=0.020 within 70.4-77.7% (range 7.3pp), and WAND p99 1→2 within 1.62-2.00 (range 0.38). Peak phase at lrm=0.5 is therefore best characterised as a stable oscillation regime around an effective plateau at BPB ≈ 0.828, not as a phase of monotonic structural improvement. The cyclic LR is keeping the model in this regime; whether the upcoming warmdown (begins around step-106800, rel_it ≈ 22,861) drives BPB below the seed baseline of 0.7247 remains the decisive open test.
## What this release is not
- Not a multilingual extension. Tokenizer and data are unchanged; CJK / non-ES Romance language behavior is identical to the prior phase (substantial gaps remain).
- Not an instruction-tuned / chat model. Both phases use base pretraining objectives; chat templates are not exposed.
- Not a quality bump claim. The hypothesis is being tested in public — endpoint quality is reported as data, not as a marketing claim. Use the prior-phase release as the canonical 3B base unless trajectory evidence here recommends otherwise.
## License
Apache 2.0.
## Citation

```bibtex
@misc{jeong2026poe,
  author    = {Jeong, Jaepil},
  title     = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
  year      = {2026},
  doi       = {10.5281/zenodo.19547653},
  publisher = {Zenodo},
}
```
A 3B-specific paper covering the full prior-phase trajectory, this continual-pretraining trajectory, and a planned multilingual reorganization variant is in preparation.
## Related releases

- `cognica/Cognica-PoE-v1.0-3B-base` — Prior phase: 3B PoE per-stage, 66B tokens, single warmup→warmdown cycle, `frontier_v1` mix
- `cognica/Cognica-PoE-v1.0-1.3B-base` — 1.3B PoE per-stage release (different scale)
- `cognica/Cognica-BP-v1.0-1.3B-base` — 1.3B Backprop baseline (PoE control)