counterfact-plantain
A per-head attention probe of FLUX.2 Klein 4B testing whether the base model represents the counterfactual modal frame as a separable axis over identical actual outcomes.
Thesis
after-plantain established that ~1% of Klein's heads represent post-event states as a categorical concept. counterfact-plantain extends the question upstream from "did this event happen" to "is this description framed factually or counterfactually." The factual A condition and the counterfactual B condition describe the same actual outcome; the distinguishing variable is purely the modal frame ("would not have, had ..."). If a per-head signal exceeds the empirical null on this stimulus set, image-generation pretraining encodes counterfactual structure as a separable axis, which is the load-bearing primitive of any genuine world model.
Method
Twenty-five paired prompts. The A condition is purely descriptive ("the ball rolled left across the tilted table"). The B condition adds an explicit counterfactual conditional with the same actual content ("the ball rolled left across the tilted table; it would not have, without the tilt"). Pairs span physical, thermodynamic, biological, and mechanical causation. Within-pair length is matched. The "as expected"/"contrary to expectations" framing is deliberately avoided to prevent confounds with vocabulary-frequency priors.
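The paired design can be sketched as data. The two prompts below are the examples quoted above; the field names and the structural checks are illustrative assumptions, not the actual dataset schema:

```python
# Hypothetical stimulus-pair format. The prompts are the card's own examples;
# the "domain" field and the checks are assumptions about the design.
pairs = [
    {
        "domain": "physical",
        "factual": "the ball rolled left across the tilted table",
        "counterfactual": (
            "the ball rolled left across the tilted table; "
            "it would not have, without the tilt"
        ),
    },
]

for p in pairs:
    # B must restate A's actual content verbatim ...
    assert p["counterfactual"].startswith(p["factual"])
    # ... and the appended clause must carry the modal frame, not new outcome content
    clause = p["counterfactual"][len(p["factual"]):]
    assert "would not have" in clause
print("stimulus checks passed")
```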
Per-head capture identical to the rest of the plantain probe family: forward pre-hook on every transformer block's attention output projection, per-head RMS magnitude, one inference step at guidance_scale=1.0, fixed seed. Across the 25 pairs, per-head paired t-statistics are computed on (factual − counterfactual) magnitudes. Empirical null is 1,000 sign-flip permutations.
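The per-head statistics reduce to a short computation. A minimal sketch on synthetic magnitudes: the real capture attaches a PyTorch forward pre-hook to each block's attention output projection, which is omitted here; the captured RMS values are simulated, while the array shapes (25 pairs) and the |t| > 3 threshold follow the card:

```python
import numpy as np

def paired_t(diffs):
    """Per-head paired t-statistics for (factual - counterfactual) magnitudes.

    diffs: array of shape (n_pairs, n_heads)."""
    n = diffs.shape[0]
    return diffs.mean(axis=0) / (diffs.std(axis=0, ddof=1) / np.sqrt(n))

def signflip_null_count_p99(diffs, thresh=3.0, n_perm=1000, seed=0):
    """Empirical null: randomly flip each pair's sign, recount heads with
    |t| > thresh, and return the 99th percentile of that count."""
    rng = np.random.default_rng(seed)
    n = diffs.shape[0]
    counts = []
    for _ in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=(n, 1))
        counts.append(int((np.abs(paired_t(diffs * flips)) > thresh).sum()))
    return float(np.percentile(counts, 99))

# Synthetic stand-in for captured per-head RMS differences:
# 25 pairs x 200 heads, with an injected effect in the first 40 heads.
rng = np.random.default_rng(1)
diffs = rng.normal(0.0, 1.0, size=(25, 200))
diffs[:, :40] += 1.0

t_obs = paired_t(diffs)
n_sig = int((np.abs(t_obs) > 3.0).sum())
null_p99 = signflip_null_count_p99(diffs)
print(f"heads with |t|>3: {n_sig}; sign-flip null p99: {null_p99}")
```

Sign-flipping is the natural permutation for a paired design: under the null of no framing effect, each pair's (factual − counterfactual) difference is symmetric around zero.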
Rigor add-ons: per-head Cohen's d effect size; split-half consistency via 100 random 50/50 stimulus splits.
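The rigor add-ons can be sketched the same way, on simulated per-head difference magnitudes (shapes and the 100-split protocol follow the card; the data itself is synthetic):

```python
import numpy as np

def cohens_d(diffs):
    # Paired Cohen's d per head: mean difference over SD of differences.
    return diffs.mean(axis=0) / diffs.std(axis=0, ddof=1)

def split_half_consistency(diffs, n_splits=100, seed=0):
    """Median Pearson r between per-head t maps computed on random
    disjoint 50/50 splits of the stimulus pairs."""
    rng = np.random.default_rng(seed)
    n = diffs.shape[0]

    def t(x):
        return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(x.shape[0]))

    rs = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        half_a, half_b = diffs[perm[: n // 2]], diffs[perm[n // 2 :]]
        rs.append(np.corrcoef(t(half_a), t(half_b))[0, 1])
    return float(np.median(rs))

# Synthetic stand-in: 25 pairs x 200 heads, injected effect in the first 40 heads.
rng = np.random.default_rng(1)
diffs = rng.normal(0.0, 1.0, size=(25, 200))
diffs[:, :40] += 1.0

d = cohens_d(diffs)
r = split_half_consistency(diffs)
print(f"median |d| over effect heads: {np.median(np.abs(d[:40])):.2f}; "
      f"split-half r: {r:.2f}")
```

Because the two halves contain disjoint stimuli, their noise is independent, so a high median r indicates a stable per-head axis rather than an artifact of particular pairs.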
Results
| Metric | Value | Significance |
|---|---|---|
| Heads with \|t\| > 3 | 3,469 (21.3%) | 5.9× empirical null p99 |
| Heads with \|t\| > 5 | 835 (5.1%) | 167× empirical null p99 |
| Heads with \|d\| > 0.8 (large) | 1,718 (10.5%) | – |
| Split-half r (median) | 0.639 | [0.61, 0.65] IQR |
| Max \|t\| | 13.63 | – |
Top blocks by max |t|:
- single[19]: max|t|=13.63, 539/768 heads at |t|>3, median |d|=0.92
- single[0]: max|t|=11.74, 401/768 heads at |t|>3, median |d|=0.63
- joint[0]: max|t|=11.27, 137/192 heads at |t|>3, median |d|=0.90
- single[8]: max|t|=11.01, 239/768 heads at |t|>3, median |d|=0.47
- single[13]: max|t|=10.97, 173/768 heads at |t|>3, median |d|=0.34
Interpretation. The axis is real and stable across split halves (r=0.64). Localization is bookend-shaped: the strongest signal sits in single[0] (input-adjacent) and single[19] (output-adjacent), suggesting the counterfactual frame is detected early during text conditioning and re-engaged late during the diffusion-output projection. The deep single[19] block alone has 539 of 768 heads at |t|>3 with median Cohen's d near 0.9, indicating the counterfactual-vs-factual distinction is a load-bearing partition for that block's representation. Image-generation pretraining contains a counterfactual primitive that is structurally separable from the underlying factual content.
Status
Probe complete. No LoRA training; this is a base-model interpretability finding.
Limitations
The counterfactual condition contains an additional clause ("it would not have, had ...") that the factual condition does not. Although within-pair length is matched, the residual signal could partly reflect "presence of secondary clause" rather than counterfactual structure specifically. A follow-up that contrasts counterfactual conditionals against factual conditionals of matched grammatical complexity (e.g., chained "because" clauses) would tighten the claim.
Twenty-five pairs is small; the empirical null is a 1,000-permutation baseline.
The probe is correlational. Heads with high |t| are sensitive to the counterfactual framing in input; whether they participate causally in counterfactual-conditioned generation is a follow-up.
License
Apache 2.0, matching the base FLUX.2 Klein 4B license.
References
- Gabeur, V., Long, S., Peng, S., et al. Image Generators are Generalist Vision Learners. arXiv:2604.20329 (2026).
- Black Forest Labs. FLUX.2 Klein. https://bfl.ai/models/flux-2-klein (2025).