counterfact-plantain

A per-head attention probe of FLUX.2 Klein 4B that tests whether the base model represents the counterfactual modal frame as a separable axis when the actual outcome is held identical.

Thesis

after-plantain established that ~1% of Klein's heads represent post-event states as a categorical concept. counterfact-plantain extends the question upstream from "did this event happen" to "is this description framed factually or counterfactually." The factual A condition and the counterfactual B condition describe the same actual outcome; the distinguishing variable is purely the modal frame ("would not have, had ..."). If a per-head signal exceeds the empirical null on this stimulus set, image-generation pretraining encodes counterfactual structure as a separable axis, which is the load-bearing primitive of any genuine world model.

Method

Twenty-five paired prompts. The A condition is purely descriptive ("the ball rolled left across the tilted table"). The B condition adds an explicit counterfactual conditional with the same actual content ("the ball rolled left across the tilted table; it would not have, without the tilt"). Pairs span physical, thermodynamic, biological, and mechanical causation. Within-pair length is matched. The "as expected"/"contrary to expectations" framing is deliberately avoided to prevent confounds with vocabulary-frequency priors.

Per-head capture is identical to the rest of the plantain probe family: a forward pre-hook on every transformer block's attention output projection, per-head RMS magnitude, one inference step at guidance_scale=1.0, fixed seed. Across the 25 pairs, per-head paired t-statistics are computed on (factual − counterfactual) magnitudes. The empirical null is built from 1,000 sign-flip permutations.
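The statistics in this paragraph can be sketched in plain Python as follows. This is a minimal illustration on synthetic numbers, not the probe's actual code. The helper names (`per_head_rms`, `paired_t`, `sign_flip_null_counts`) are hypothetical, and so is the activation layout with heads contiguous along the last axis. Two further assumptions: one sign vector is shared across all heads in each permutation, and the null is summarized as the 99th percentile of the per-permutation count of heads above threshold.

```python
import math
import random
import statistics


def per_head_rms(rows, num_heads):
    """RMS magnitude per head for one activation shaped [tokens][num_heads * head_dim].

    Assumes heads are laid out contiguously along the last axis.
    """
    head_dim = len(rows[0]) // num_heads
    out = []
    for h in range(num_heads):
        squares = [x * x for row in rows for x in row[h * head_dim:(h + 1) * head_dim]]
        out.append(math.sqrt(sum(squares) / len(squares)))
    return out


def paired_t(diffs):
    """One-sample t-statistic on paired differences (factual minus counterfactual)."""
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        # Degenerate case: every pair has the same difference.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))


def sign_flip_null_counts(diffs_by_head, threshold, n_perm, rng):
    """Count heads with |t| > threshold under random per-pair sign flips.

    One sign vector is shared by all heads within a permutation (an assumption
    of this sketch), which preserves cross-head correlation in the null.
    """
    n_pairs = len(diffs_by_head[0])
    counts = []
    for _ in range(n_perm):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n_pairs)]
        counts.append(sum(
            abs(paired_t([s * d for s, d in zip(signs, diffs)])) > threshold
            for diffs in diffs_by_head
        ))
    return counts


# Synthetic stand-in for captured magnitudes: 25 pairs, 8 heads, only head 0 responds.
rng = random.Random(0)
n_pairs, n_heads = 25, 8
diffs_by_head = [
    [rng.gauss(1.0 if h == 0 else 0.0, 0.5) for _ in range(n_pairs)]
    for h in range(n_heads)
]

t_obs = [paired_t(d) for d in diffs_by_head]
observed_count = sum(abs(t) > 3.0 for t in t_obs)
null_counts = sign_flip_null_counts(diffs_by_head, 3.0, n_perm=200, rng=rng)
null_p99 = statistics.quantiles(null_counts, n=100)[98]  # 99th percentile cut point
print(f"heads with |t| > 3: {observed_count}; null p99 count: {null_p99:.2f}")
```

A real run would vectorize this with NumPy or PyTorch over all heads and use the full 1,000 permutations; the toy uses 200 only to keep the example fast.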

Rigor add-ons: per-head Cohen's d effect size; split-half consistency via 100 random 50/50 stimulus splits.
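A minimal sketch of both add-ons, on synthetic data. The source does not say which Cohen's d variant or which per-head statistic is correlated across halves, so this assumes the paired-samples form (mean difference over the standard deviation of differences) and correlates per-head t-statistics between halves. All function names are hypothetical.

```python
import math
import random
import statistics


def paired_t(diffs):
    """One-sample t-statistic on paired differences."""
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def cohens_d_paired(diffs):
    """Paired-samples effect size: mean difference over SD of the differences."""
    return statistics.fmean(diffs) / statistics.stdev(diffs)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_half_rs(diffs_by_head, n_splits, rng):
    """Correlate per-head t-statistics between random halves of the stimulus pairs."""
    n_pairs = len(diffs_by_head[0])
    half = n_pairs // 2  # 25 pairs split as 12 and 13
    rs = []
    for _ in range(n_splits):
        order = list(range(n_pairs))
        rng.shuffle(order)
        first, second = order[:half], order[half:]
        t_first = [paired_t([d[i] for i in first]) for d in diffs_by_head]
        t_second = [paired_t([d[i] for i in second]) for d in diffs_by_head]
        rs.append(pearson_r(t_first, t_second))
    return rs


# Synthetic data: 40 heads, each with its own true effect, 25 stimulus pairs.
rng = random.Random(0)
n_pairs, n_heads = 25, 40
effects = [rng.gauss(0.0, 0.5) for _ in range(n_heads)]
diffs_by_head = [[rng.gauss(mu, 0.5) for _ in range(n_pairs)] for mu in effects]

d_values = [cohens_d_paired(d) for d in diffs_by_head]
rs = split_half_rs(diffs_by_head, n_splits=100, rng=rng)
q1, median_r, q3 = statistics.quantiles(rs, n=4)
print(f"split-half r: median {median_r:.3f}, IQR [{q1:.3f}, {q3:.3f}]")
```

Under this paired definition, t equals d times the square root of the number of pairs, so with 25 pairs a threshold of |d| > 0.8 corresponds to |t| > 4.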

Results

Metric                          Value            Significance
Heads with |t| > 3              3,469 (21.3%)    5.9× empirical null p99
Heads with |t| > 5              835 (5.1%)       167× empirical null p99
Heads with |d| > 0.8 (large)    1,718 (10.5%)    n/a
Split-half r (median)           0.639            IQR [0.61, 0.65]
Max |t|                         13.63            n/a

Top blocks by max |t|:

  • single[19]: max|t|=13.63, 539/768 heads at |t|>3, median |d|=0.92
  • single[0]: max|t|=11.74, 401/768 heads at |t|>3, median |d|=0.63
  • joint[0]: max|t|=11.27, 137/192 heads at |t|>3, median |d|=0.90
  • single[8]: max|t|=11.01, 239/768 heads at |t|>3, median |d|=0.47
  • single[13]: max|t|=10.97, 173/768 heads at |t|>3, median |d|=0.34
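The per-block figures in this list (max |t|, count of heads at |t|>3, median |d|) are simple aggregates over per-head statistics. A minimal sketch of that aggregation, assuming the per-head results are held as (t, d) tuples keyed by block name; the function name and the toy numbers are hypothetical and are not taken from the probe.

```python
import statistics


def summarize_blocks(stats_by_block, t_threshold=3.0):
    """Aggregate per-head (t, d) tuples into one summary row per block."""
    rows = []
    for name, heads in stats_by_block.items():
        abs_t = [abs(t) for t, _ in heads]
        abs_d = [abs(d) for _, d in heads]
        rows.append({
            "block": name,
            "max_abs_t": max(abs_t),
            "n_above": sum(v > t_threshold for v in abs_t),
            "n_heads": len(heads),
            "median_abs_d": statistics.median(abs_d),
        })
    # Rank blocks by their single strongest head, as in the list above.
    return sorted(rows, key=lambda r: r["max_abs_t"], reverse=True)


# Toy input: made-up values for two blocks.
toy = {
    "single[0]": [(4.0, 0.8), (-1.0, -0.2), (3.5, 0.7)],
    "joint[0]": [(-6.0, -1.2), (0.5, 0.1)],
}
summary = summarize_blocks(toy)
for row in summary:
    print(
        f"{row['block']}: max|t|={row['max_abs_t']:.2f}, "
        f"{row['n_above']}/{row['n_heads']} heads at |t|>3, "
        f"median |d|={row['median_abs_d']:.2f}"
    )
```

Ranking by max |t| highlights blocks with at least one very strong head; ranking by the fraction of heads above threshold would instead favor blocks where the signal is widespread.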

Interpretation. The axis is real and stable across split halves (r=0.64). Localization is bookended: the strongest signal sits in single[0] (input-adjacent) and single[19] (output-adjacent), suggesting the counterfactual frame is detected early during text-conditioning and re-engaged late during the diffusion-output projection. The deep single[19] block alone has 539 of 768 heads at |t|>3 with median Cohen's d near 0.9, indicating the counterfactual-vs-factual distinction is a load-bearing partition for that block's representation. Image-generation pretraining contains a counterfactual primitive that is structurally separable from the underlying factual content.

Status

Probe complete. No LoRA training; this is a base-model interpretability finding.

Limitations

The counterfactual condition contains an additional clause ("it would not have, had ...") that the factual condition does not. Although within-pair length is matched, the residual signal could partly reflect "presence of secondary clause" rather than counterfactual structure specifically. A follow-up that contrasts counterfactual conditionals against factual conditionals of matched grammatical complexity (e.g., chained "because" clauses) would tighten the claim.

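Twenty-five pairs is a small stimulus set, and the empirical null rests on 1,000 sign-flip permutations, so the null's 99th percentile is itself an estimate and per-head permutation p-values cannot resolve below roughly 1/1,001.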

The probe is correlational. Heads with high |t| are sensitive to the counterfactual framing in input; whether they participate causally in counterfactual-conditioned generation is a follow-up.

License

Apache 2.0, matching base FLUX.2 Klein 4B.
