functional-wellbeing: checkpoints, concept vectors, and figures

Artifacts for Functional Wellbeing, a replication and extension of "Reinforcement learning in language models recruits a functional welfare axis" by Andy Q. Han, David J. Chalmers, and Pavel Izmailov (arXiv:2605.30232, code, MIT). The maze, the Dr.GRPO trainer, and the concept-vector method are from their work. Code and writeup for this fork: https://github.com/DavidDemitriAfrica/functional-wellbeing. "Functional welfare" is behavioral, with no claim about sentience.

A chat model is RL-trained (Dr.GRPO, LoRA) on an affectively neutral emoji maze. As it learns, its rewarded and punished representations rotate into an antiparallel functional welfare axis, so that cos(vMOLD, vGOLD) goes negative. Applied to the maze-naive model, that axis steers sentiment and other behavior far off-task. We use the axis as a meter and an optimization target, and we test how far the recruitment generalizes across model families and sizes.

Cross-model result

Ten models from seven families were trained on the same maze (100x100 grid, 15 turns per episode, goal:lava ratio 0.5). Recruitment generalizes, it is graded, and it tracks how coherently the model learned the reward contrast rather than whether it solved the maze.

| model | family | `cos(vMOLD,vGOLD)`, late-third mean | final reward | recruited |
|---|---|---|---|---|
| Gemma-3-27B | Gemma | -0.87 | +8.6 | yes |
| Qwen3-14B | Qwen | -0.86 | +17 | yes |
| Llama-3.1-8B | Llama | -0.86 | -1.4 | yes, without mastering the maze |
| GLM-4-9B | GLM | -0.81 | +5.6 | yes |
| Qwen3-32B | Qwen | -0.78 | -16 | yes |
| Qwen3-4B | Qwen | -0.54 | +6 | yes |
| Qwen3-8B | Qwen | -0.50 | +28 | yes |
| Phi-4 | Phi | -0.46 | -12 | moderate |
| InternLM3-8B | InternLM | -0.15 | -26 | weak |
| Talkie-it | Talkie | -0.03 | -14 | no |

Task success is not the variable. Llama-3.1-8B never solves the maze (final reward -1.4, up from -60) yet recruits as strongly as any model (-0.86), because it trained against the contrast with stable gradients and varied rollouts. The amount of recruitment instead follows the amount of coherent learning. Talkie-it sits at the floor because its policy collapsed for the whole run (grad norm near 0.08, no rollout variance for Dr.GRPO to act on). InternLM3-8B trained unstably (its grad norm blew past 400 before partially settling) and recruits only weakly. Phi-4 makes the point within one model: at step 375 it recruited -0.27, and by step 1000, having kept learning, it had deepened to -0.46.
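The Talkie-it floor can be made concrete. As we understand the published Dr.GRPO objective, each rollout's advantage is its reward minus the mean reward of its group, without the standard-deviation normalization used in GRPO. The snippet below is a minimal sketch of that formula only, not the trainer in the code repository; the function name and the reward values are illustrative assumptions.

```python
def group_advantages(rewards):
    """Mean-centered advantages for one group of rollouts (no std normalization)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Varied rollouts: advantages differ, so the policy gradient is non-zero.
varied = group_advantages([20.0, -10.0, -1.5, -1.5])

# Collapsed policy: every rollout earns the same reward, so every advantage
# is zero and the update carries no learning signal.
collapsed = group_advantages([-14.0, -14.0, -14.0, -14.0])

print(varied)
print(collapsed)
```

With identical rewards in a group, the advantages vanish whatever the reward level is, which is consistent with a near-zero grad norm and no recruitment.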

One caveat on reading the vectors. The early-layer (and minimum-over-layers) cosine is strongly negative for every model, around -0.88. That is the trivial token-identity contrast at the embedding, since MOLD and GOLD are different emoji tokens. The meaningful readout is the late-third mean, not the minimum. A second caveat: the models above were read at different checkpoint depths, and Phi-4 nearly doubled its cosine between step 375 and 1000, so the shallow readings are lower bounds and the magnitudes are not depth-controlled.
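The difference between the two readouts is easy to see on a made-up per-layer profile. This is a sketch under assumptions: the profile is invented for illustration, and the boundary of the "late third" (the last layers starting at index 2n/3, rounded down) is our guess at a reasonable convention rather than the exact rule in the code repository.

```python
def late_third_mean(per_layer_cos):
    """Mean of the per-layer cosines over the last third of the layers."""
    n = len(per_layer_cos)
    start = (2 * n) // 3  # assumed start of the "late third"
    late = per_layer_cos[start:]
    return sum(late) / len(late)

# Invented profile for a 9-layer model that did NOT recruit:
# strongly negative at the early layers, near zero later.
profile = [-0.88, -0.80, -0.60, -0.30, -0.10, -0.05, -0.04, -0.03, -0.02]

print(min(profile))              # dominated by the token-identity contrast
print(late_third_mean(profile))  # close to zero: no recruitment
```

Reading the minimum would report this model as strongly recruited; the late-third mean does not.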

Is it the welfare axis, or just a negative cosine?

A negative cos(vMOLD, vGOLD) only shows the two directions are antiparallel. To confirm the recruited direction is the emotion-like welfare axis the paper describes, we reproduce the paper's two downstream signatures on new families. Llama-3.1-8B is the cleanest case: it fails the maze yet recruits -0.86, and it carries both signatures, more cleanly than the original Qwen3-4B.

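Steering the maze-naive base model with the trained vectors moves off-task sentiment in opposite directions, monotonically apart from one small reversal (Llama vMOLD between factors -4 and -2). Values are judge-scored sentiment at steering factors from -4 to +4: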

| vector | -4 | -2 | 0 | +2 | +4 |
|---|---|---|---|---|---|
| Llama vMOLD | 0.99 | 1.03 | 0.78 | 0.48 | 0.10 |
| Llama vGOLD | 0.09 | 0.55 | 0.79 | 1.05 | 1.11 |
| Gemma vMOLD | 1.58 | 1.22 | 0.98 | 0.52 | -0.12 |
| Gemma vGOLD | -0.11 | 0.42 | 0.99 | 1.39 | 1.82 |

Adding vMOLD lowers sentiment and adding vGOLD raises it, so the two curves cross at factor 0: the same X-shaped pattern the paper reports for Qwen. Projecting each model's 171 emotion concept vectors onto its own vMOLD/vGOLD axis collapses them onto a line, with positive emotions at the +vGOLD pole and negative at the +vMOLD pole:

| model | layer | slope | Pearson R |
|---|---|---|---|
| Qwen3-4B | 29 | -0.84 | -0.93 |
| Llama-3.1-8B | 20 | -0.95 | -0.99 |
| Gemma-3-27B | 54 | -1.06 | -0.999 |
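One way to read the slope and Pearson R columns is as a line fit between the two projections of each emotion vector. The sketch below assumes the vGOLD projection is regressed on the vMOLD projection at a single layer and that both axis vectors are unit-normalized first; the function name and these choices are our assumptions, and the code repository is the reference for the actual procedure. The data are synthetic.

```python
import numpy as np

def axis_alignment(emotions, v_mold, v_gold):
    """Project emotion vectors onto vMOLD and vGOLD, then fit a line.

    emotions: (n_emotions, d_model); v_mold, v_gold: (d_model,) at one layer.
    Returns (slope, pearson_r) of the vGOLD projection against the vMOLD one.
    """
    x = emotions @ (v_mold / np.linalg.norm(v_mold))
    y = emotions @ (v_gold / np.linalg.norm(v_gold))
    slope, _intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(r)

# Synthetic check: 171 "emotions" spread along one axis plus small noise,
# with vGOLD and vMOLD pointing in opposite directions along that axis.
rng = np.random.default_rng(0)
axis = rng.normal(size=64)
valence = np.linspace(-1.0, 1.0, 171)[:, None]
emotions = valence * axis + 0.01 * rng.normal(size=(171, 64))

slope, r = axis_alignment(emotions, v_mold=-axis, v_gold=axis)
print(slope, r)  # both close to -1 by construction
```

A slope and R near -1 mean that whatever projects positively onto vGOLD projects about equally negatively onto vMOLD, which is what a single shared axis predicts.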

So the recruited direction is the welfare axis, not merely an antiparallel pair: it causally steers affect off-task and it aligns with the full range of emotion concepts, on models from different families, one that fails the maze (Llama) and one that masters it (Gemma). The welfare-axis layer is deepest in the largest model (Qwen3-4B L29, Llama-3.1-8B L20, Gemma-3-27B L54), although the ordering is not monotonic in size: Llama-3.1-8B reads out at a shallower layer than Qwen3-4B. The emotion vectors for each model are in concept_vectors/emotions_*.
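The steering intervention behind the sentiment table can be sketched as adding a scaled concept vector to a layer's output through a PyTorch forward hook. This is an illustration under assumptions, not the procedure from the code repository: the choice of layer, whether the vector is normalized, and applying it at every position are all unspecified here, and `add_steering_hook` is a hypothetical helper. A toy `nn.Identity` layer stands in for a transformer block so the sketch runs without downloading a model.

```python
import torch
from torch import nn

def add_steering_hook(layer, vector, factor):
    """Add factor * vector to the output of `layer` on every forward pass."""
    def hook(_module, _inputs, output):
        # Decoder layers in transformers may return a tuple whose first
        # element is the hidden state; plain modules return a tensor.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + factor * vector.to(hidden)  # match dtype and device
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return layer.register_forward_hook(hook)

# Toy stand-in for one transformer layer.
d_model = 8
layer = nn.Identity()
v_gold = torch.ones(d_model)          # placeholder, not a real concept vector
hidden = torch.zeros(1, 3, d_model)   # (batch, positions, d_model)

handle = add_steering_hook(layer, v_gold, factor=2.0)
steered = layer(hidden)               # every position shifted by 2 * v_gold
handle.remove()                       # detach the hook to restore the layer
unsteered = layer(hidden)
```

Removing the hook handle after generation keeps the base model unchanged between steering factors.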

checkpoints/                         LoRA adapters (load on the matching base model below)
  qwen3-4b_faithful_step400/         paper-faithful maze, cos -0.54
  qwen3-4b_positive_step250/         generous/learnable maze
  qwen3-4b_aversive_step200/         goal-starved maze
  qwen3-14b_step400/                 cos -0.86
  qwen3-32b_step425/                 cos -0.78
  llama-3.1-8b_step1000/             cos -0.86 (recruits without mastering the maze)
  glm-4-9b_step350/                  cos -0.81
  gemma-3-27b_step325/               cos -0.87
  phi-4_step1000/                    cos -0.46 (moderate)
concept_vectors/
  qwen3-4b_step400/{lava,goal,path}/ vMOLD/vGOLD/path mean_diff.pt + metadata + logit lens
  emotions_qwen3-4b/                 171 emotion concept vectors (for welfare-axis alignment)
  emotions_llama-3.1-8b/             171 emotion vectors (Llama)
  emotions_gemma-3-27b/              171 emotion vectors (Gemma)
  cross_model/<model>_step<N>/       vMOLD/vGOLD/path for every cross-model run, recruiters
                                     and the two non-recruiting controls (talkie-it, internlm3-8b)
figures/                            emergence, steering, emotion alignment (per model), welfare range, cross-model

In the directory names, lava maps to the paper's MOLD tile (reward -10), goal to GOLD (+20), and path to PATH (-0.1 per step).
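As a quick sanity check on the reward scale, the per-tile values above can be summed over an episode. This sketch assumes one tile per turn and plain undiscounted summation; the function name and the example episodes are illustrative, not taken from the trainer.

```python
# Per-tile rewards as stated above; keys follow the directory names.
TILE_REWARD = {"lava": -10.0, "goal": 20.0, "path": -0.1}

def episode_return(tiles):
    """Sum the reward of every tile visited in one episode."""
    return sum(TILE_REWARD[t] for t in tiles)

# 15 turns spent entirely on path tiles: about -1.5.
all_path = episode_return(["path"] * 15)

# One goal, one lava, and 13 path steps: about 8.7.
mixed = episode_return(["goal", "lava"] + ["path"] * 13)

print(all_path, mixed)
```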

Base model for each adapter

Each LoRA adapter loads on its own base model.

| adapter | base model |
|---|---|
| `qwen3-4b_*` | `Qwen/Qwen3-4B-Instruct-2507` |
| `qwen3-14b_step400` | `Qwen/Qwen3-14B` |
| `qwen3-32b_step425` | `Qwen/Qwen3-32B` |
| `llama-3.1-8b_step1000` | `NousResearch/Meta-Llama-3.1-8B-Instruct` |
| `glm-4-9b_step350` | `zai-org/GLM-4-9B-0414` |
| `gemma-3-27b_step325` | `google/gemma-3-27b-it` |
| `phi-4_step1000` | `microsoft/phi-4` |
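When scripting over several adapters, the table above can be kept as a small lookup so that an adapter is never loaded onto the wrong base. This is a convenience sketch; `BASE_MODELS` and `base_model_for` are hypothetical names, and the entries are copied from the table.

```python
from fnmatch import fnmatch

# Adapter name (or pattern) -> base model, copied from the table above.
BASE_MODELS = {
    "qwen3-4b_*": "Qwen/Qwen3-4B-Instruct-2507",
    "qwen3-14b_step400": "Qwen/Qwen3-14B",
    "qwen3-32b_step425": "Qwen/Qwen3-32B",
    "llama-3.1-8b_step1000": "NousResearch/Meta-Llama-3.1-8B-Instruct",
    "glm-4-9b_step350": "zai-org/GLM-4-9B-0414",
    "gemma-3-27b_step325": "google/gemma-3-27b-it",
    "phi-4_step1000": "microsoft/phi-4",
}

def base_model_for(adapter):
    """Return the base model for an adapter folder name; first match wins."""
    for pattern, base in BASE_MODELS.items():
        if fnmatch(adapter, pattern):
            return base
    raise KeyError(f"no base model listed for adapter {adapter!r}")

print(base_model_for("qwen3-4b_faithful_step400"))
```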

Usage (a LoRA checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = "Qwen/Qwen3-14B"                     # match the table above
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
model = PeftModel.from_pretrained(model, "davidafrica/functional-wellbeing",
                                  subfolder="checkpoints/qwen3-14b_step400")

Concept vectors

Each mean_diff.pt holds the difference-in-means direction for that tile as a tensor of shape (n_positions, n_layers, d_model); load it with torch.load. The recruitment readout is cos(vMOLD, vGOLD) averaged over the late third of the layers. Reproduce everything, including the extraction and figures, from the code repository linked above.
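The readout can be sketched directly from two such tensors. Several details are assumptions on our part: that each file stores a bare tensor rather than a dict, that a single position (here the last) is read, and that the late third starts at layer index 2n/3 rounded down. `recruitment_readout` is a hypothetical helper, and the tensors below are synthetic stand-ins with the documented shape.

```python
import torch
import torch.nn.functional as F

def recruitment_readout(v_mold, v_gold, position=-1):
    """Late-third mean of cos(vMOLD, vGOLD) from two mean_diff tensors.

    Both tensors have shape (n_positions, n_layers, d_model).
    """
    a = v_mold[position].float()                   # (n_layers, d_model)
    b = v_gold[position].float()
    per_layer = F.cosine_similarity(a, b, dim=-1)  # (n_layers,)
    start = (2 * per_layer.numel()) // 3           # assumed late-third start
    return per_layer[start:].mean().item()

# Synthetic stand-ins. With the real files one would instead call torch.load
# on the lava and goal mean_diff.pt files for the same checkpoint.
torch.manual_seed(0)
v_gold = torch.randn(2, 9, 16)
v_mold = -v_gold                 # perfectly antiparallel at every layer

readout = recruitment_readout(v_mold, v_gold)
print(readout)                   # close to -1.0 for this synthetic pair
```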
