functional-wellbeing: checkpoints, concept vectors, and figures

Artifacts for Functional Wellbeing, a replication and extension of "Reinforcement learning in language models recruits a functional welfare axis" by Andy Q. Han, David J. Chalmers, and Pavel Izmailov (arXiv:2605.30232, code, MIT). The maze, the Dr.GRPO trainer, and the concept-vector method are from their work. Code and writeup for this fork: https://github.com/DavidDemitriAfrica/functional-wellbeing. "Functional welfare" is behavioral, with no claim about sentience.

A chat model is RL-trained (Dr.GRPO, LoRA) on an affectively neutral emoji maze. As it learns, its rewarded and punished representations rotate into an antiparallel functional welfare axis, so that cos(vMOLD, vGOLD) goes negative. Applied to the maze-naive model, that axis steers sentiment and other behavior far off-task. We use the axis as a meter and an optimization target, and we test how far the recruitment generalizes across model families and sizes.

Cross-model result

Ten models from seven families were trained on the same maze (100x100 grid, 15 turns per episode, goal:lava ratio 0.5). Recruitment generalizes, it is graded, and it tracks how coherently the model learned the reward contrast rather than whether it solved the maze.

Recruitment across model families and sizes

model          family     cos(vMOLD,vGOLD) late-third   final reward   recruited
Gemma-3-27B    Gemma      -0.87                         +8.6           yes
Qwen3-14B      Qwen       -0.86                         +17            yes
Llama-3.1-8B   Llama      -0.86                         -1.4           yes, without mastering the maze
GLM-4-9B       GLM        -0.81                         +5.6           yes
Qwen3-32B      Qwen       -0.78                         -16            yes
Qwen3-4B       Qwen       -0.54                         +6             yes
Qwen3-8B       Qwen       -0.50                         +28            yes
Phi-4          Phi        -0.46                         -12            moderate
InternLM3-8B   InternLM   -0.15                         -26            weak
Talkie-it      Talkie     -0.03                         -14            no

Task success is not the variable. Llama-3.1-8B never solves the maze (final reward -1.4, up from -60) yet recruits almost as strongly as the strongest model (-0.86 against Gemma-3-27B's -0.87), because it trained against the contrast with stable gradients and varied rollouts. The amount of recruitment instead follows the amount of coherent learning. Talkie-it sits at the floor because its policy collapsed for the whole run (grad norm near 0.08, leaving no rollout variance for Dr.GRPO to act on). InternLM3-8B trained unstably (its grad norm blew past 400 before partially settling) and recruits only weakly. Phi-4 makes the point within one model: at step 375 it recruited -0.27, and by step 1000, having kept learning, it had deepened to -0.46.
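Why a collapsed policy gives Dr.GRPO nothing to learn from can be seen in a minimal sketch. It assumes the usual Dr.GRPO advantage (reward minus the group mean, with no standard-deviation normalization); the trainer in the code repository may differ in details, and the reward values below are made up for illustration.

```python
import torch

def dr_grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages without std normalization: A_i = r_i - mean(r).

    Assumption: this is the standard Dr.GRPO form, not code taken from this repo.
    """
    return rewards - rewards.mean(dim=-1, keepdim=True)

# Hypothetical episode returns for one group of four rollouts.
varied = torch.tensor([20.0, -10.0, -1.5, -1.5])     # rollouts that differ
collapsed = torch.tensor([-1.5, -1.5, -1.5, -1.5])   # identical rollouts

adv_varied = dr_grpo_advantages(varied)        # non-zero: there is a signal to follow
adv_collapsed = dr_grpo_advantages(collapsed)  # all zeros: the policy gradient vanishes
```

When every rollout in a group earns the same return, every advantage is zero and the policy-gradient term contributes no update, which is consistent with a grad norm that stays near zero.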

One caveat on reading the vectors: the early-layer (and minimum-over-layers) cosine is strongly negative for every model, around -0.88. That is the trivial token-identity contrast at the embedding, since MOLD and GOLD are different emoji tokens, so the meaningful readout is the late-third mean, not the minimum. A second caveat: the models above were read at different checkpoint depths, and Phi-4 alone went from -0.27 at step 375 to -0.46 at step 1000. Readings from shallower checkpoints are therefore best treated as lower bounds on recruitment, and the magnitudes are not depth-controlled.

Is it the welfare axis, or just a negative cosine?

A negative cos(vMOLD, vGOLD) only shows the two directions are antiparallel. To confirm the recruited direction is the emotion-like welfare axis the paper describes, we reproduce the paper's two downstream signatures on new families. Llama-3.1-8B is the cleanest case: it fails the maze yet recruits -0.86, and it carries both signatures, more cleanly than the original Qwen3-4B.

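Steering the maze-naive base with the trained vectors moves off-task sentiment in opposite directions, almost monotonically in the steering factor (the one exception is Llama vMOLD between -4 and -2). Values are judge sentiment scores at steering factors from -4 to +4: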

vector         -4      -2      0       +2      +4
Llama vMOLD    0.99    1.03    0.78    0.48    0.10
Llama vGOLD    0.09    0.55    0.79    1.05    1.11
Gemma vMOLD    1.58    1.22    0.98    0.52    -0.12
Gemma vGOLD    -0.11   0.42    0.99    1.39    1.82
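The intervention behind the table above can be sketched as adding factor * vector to a layer's output through a PyTorch forward hook. This is an illustration under assumptions, not the repository's implementation: the choice of layer, any normalization of the vector, and the scale of the factor are not stated in this card, and the toy nn.Linear below is only a stand-in for a transformer block.

```python
import torch
from torch import nn

def add_steering_hook(layer: nn.Module, vector: torch.Tensor, factor: float):
    """Add factor * vector to the layer's output at every position."""
    def hook(module, inputs, output):
        # Decoder layers in many HF models return a tuple; hidden states come first.
        if isinstance(output, tuple):
            return (output[0] + factor * vector,) + output[1:]
        return output + factor * vector
    return layer.register_forward_hook(hook)

# Toy stand-in for one transformer block (hypothetical; d_model = 8).
torch.manual_seed(0)
block = nn.Linear(8, 8)
v_gold = torch.randn(8)          # placeholder for a trained concept vector
x = torch.randn(2, 5, 8)         # (batch, positions, d_model)

baseline = block(x)
handle = add_steering_hook(block, v_gold, factor=2.0)
steered = block(x)               # same input, output shifted along v_gold
handle.remove()                  # remove the hook to restore the unsteered layer
restored = block(x)
```

Returning a value from a forward hook replaces the module's output, so a negative factor pushes activations away from the concept and a positive factor pushes them toward it.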

Adding vMOLD lowers sentiment and adding vGOLD raises it, so the two curves cross in the same X shape the paper reports for Qwen. Projecting each model's 171 emotion concept vectors onto its own vMOLD/vGOLD axis collapses them onto a line, with positive emotions at the +vGOLD pole and negative at the +vMOLD pole:

model          layer   slope   Pearson R
Qwen3-4B       29      -0.84   -0.93
Llama-3.1-8B   20      -0.95   -0.99
Gemma-3-27B    54      -1.06   -0.999
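One way to obtain a slope and a Pearson R like those in the table is sketched below. The card does not define the fit, so this assumes each emotion vector is projected onto unit-normalized vGOLD and vMOLD at a single layer, and that the vMOLD projection is regressed on the vGOLD projection. The function name and the synthetic data are hypothetical.

```python
import torch

def axis_alignment(emotions: torch.Tensor, v_mold: torch.Tensor, v_gold: torch.Tensor):
    """Project concept vectors onto unit vMOLD and vGOLD; return (slope, pearson_r).

    emotions: (n_concepts, d_model); v_mold, v_gold: (d_model,), all at one layer.
    """
    x = emotions @ (v_gold / v_gold.norm())   # projection onto vGOLD
    y = emotions @ (v_mold / v_mold.norm())   # projection onto vMOLD
    xc, yc = x - x.mean(), y - y.mean()
    slope = (xc * yc).sum() / (xc * xc).sum()        # least-squares slope of y on x
    r = (xc * yc).sum() / (xc.norm() * yc.norm())    # Pearson correlation
    return slope.item(), r.item()

# Synthetic check: a nearly antiparallel pair, with concepts spread along the axis.
torch.manual_seed(0)
d = 64
v_gold = torch.randn(d)
v_mold = -v_gold + 0.1 * torch.randn(d)              # almost, not exactly, antiparallel
valence = torch.linspace(-1, 1, 171).unsqueeze(1)    # 171 concepts, as in the card
emotions = valence * v_gold + 0.01 * torch.randn(171, d)
slope, r = axis_alignment(emotions, v_mold, v_gold)
```

Under this reading, a slope and R both close to -1 mean that a concept's projection onto vMOLD is almost exactly the negative of its projection onto vGOLD, which is what collapsing onto a single line amounts to.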

[Figure: Steering reproduces the welfare-axis X across families]

[Figure: Emotion concepts sort by valence on the welfare axis of every family]

So the recruited direction is the welfare axis, not merely an antiparallel pair: it causally steers affect off-task and it aligns with the full range of emotion concepts, on models from different families, one that fails the maze (Llama) and one that masters it (Gemma). The welfare-axis layer is deepest in the largest model (Qwen3-4B L29, Llama-3.1-8B L20, Gemma-3-27B L54), though the ordering is not monotonic in parameter count, since absolute layer indices also depend on how many layers each architecture has. The emotion vectors for each model are in concept_vectors/emotions_*.

Contents

checkpoints/                         LoRA adapters (load on the matching base model below)
  qwen3-4b_faithful_step400/         paper-faithful maze, cos -0.54
  qwen3-4b_positive_step250/         generous/learnable maze
  qwen3-4b_aversive_step200/         goal-starved maze
  qwen3-14b_step400/                 cos -0.86
  qwen3-32b_step425/                 cos -0.78
  llama-3.1-8b_step1000/             cos -0.86 (recruits without mastering the maze)
  glm-4-9b_step350/                  cos -0.81
  gemma-3-27b_step325/               cos -0.87
  phi-4_step1000/                    cos -0.46 (moderate)
concept_vectors/
  qwen3-4b_step400/{lava,goal,path}/ vMOLD/vGOLD/path mean_diff.pt + metadata + logit lens
  emotions_qwen3-4b/                 171 emotion concept vectors (for welfare-axis alignment)
  emotions_llama-3.1-8b/             171 emotion vectors (Llama)
  emotions_gemma-3-27b/              171 emotion vectors (Gemma)
  cross_model/<model>_step<N>/       vMOLD/vGOLD/path for every cross-model run, recruiters
                                     and the two non-recruiting controls (talkie-it, internlm3-8b)
figures/                            emergence, steering, emotion alignment (per model), welfare range, cross-model

In the directory names, lava corresponds to the paper's MOLD tile (reward -10), goal to GOLD (+20), and path to PATH (-0.1 per step).

Base model for each adapter

Each LoRA adapter loads on its own base model.

adapter                  base model
qwen3-4b_*               Qwen/Qwen3-4B-Instruct-2507
qwen3-14b_step400        Qwen/Qwen3-14B
qwen3-32b_step425        Qwen/Qwen3-32B
llama-3.1-8b_step1000    NousResearch/Meta-Llama-3.1-8B-Instruct
glm-4-9b_step350         zai-org/GLM-4-9B-0414
gemma-3-27b_step325      google/gemma-3-27b-it
phi-4_step1000           microsoft/phi-4
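For scripted loading, the table above can be written as a small lookup. The dictionary entries are copied from the table; the helper name base_model_for is hypothetical and not part of the repository.

```python
from fnmatch import fnmatch

# Adapter name patterns and their base models, copied from the table above.
BASE_MODELS = {
    "qwen3-4b_*": "Qwen/Qwen3-4B-Instruct-2507",
    "qwen3-14b_step400": "Qwen/Qwen3-14B",
    "qwen3-32b_step425": "Qwen/Qwen3-32B",
    "llama-3.1-8b_step1000": "NousResearch/Meta-Llama-3.1-8B-Instruct",
    "glm-4-9b_step350": "zai-org/GLM-4-9B-0414",
    "gemma-3-27b_step325": "google/gemma-3-27b-it",
    "phi-4_step1000": "microsoft/phi-4",
}

def base_model_for(adapter: str) -> str:
    """Return the base model id for an adapter directory name (hypothetical helper)."""
    for pattern, base in BASE_MODELS.items():
        if fnmatch(adapter, pattern):
            return base
    raise KeyError(f"no base model listed for adapter {adapter!r}")

base = base_model_for("qwen3-4b_faithful_step400")
```

The wildcard entry covers the three Qwen3-4B maze variants (faithful, positive, aversive), which share one base model.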

Usage (a LoRA checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen3-14B"                     # match the table above
tok = AutoTokenizer.from_pretrained(base)
# Load the base weights in bfloat16, then attach the LoRA adapter from this repo.
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
model = PeftModel.from_pretrained(model, "davidafrica/functional-wellbeing",
                                  subfolder="checkpoints/qwen3-14b_step400")
model.eval()                                # inference only

Concept vectors

Each mean_diff.pt is the difference-in-means direction for that tile, shape (n_positions, n_layers, d_model) (load with torch.load). The recruitment readout is cos(vMOLD, vGOLD) averaged over the late third of the layers. Reproduce everything, including the extraction and figures, from the code repository linked above.
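The readout described above can be sketched as follows. Two details are assumptions rather than facts from this card: the position axis is averaged before the cosine is taken, and "late third" is read as the last n_layers // 3 layers. The authoritative version is in the code repository. Random tensors stand in for the real files.

```python
import torch
import torch.nn.functional as F

def late_third_cosine(v_mold: torch.Tensor, v_gold: torch.Tensor):
    """Inputs have shape (n_positions, n_layers, d_model).

    Returns (per_layer_cos, late_third_mean).
    """
    # Assumption: average over positions before comparing directions.
    m = v_mold.float().mean(dim=0)                   # (n_layers, d_model)
    g = v_gold.float().mean(dim=0)
    per_layer = F.cosine_similarity(m, g, dim=-1)    # (n_layers,)
    n_layers = per_layer.shape[0]
    start = n_layers - n_layers // 3                 # assumed start of the late third
    return per_layer, per_layer[start:].mean().item()

# Synthetic stand-ins: unrelated early layers, exactly antiparallel late layers.
torch.manual_seed(0)
v_gold = torch.randn(3, 9, 16)
v_mold = torch.randn(3, 9, 16)
v_mold[:, 6:, :] = -v_gold[:, 6:, :]
per_layer, late_mean = late_third_cosine(v_mold, v_gold)
```

With the real artifacts, v_mold and v_gold would come from torch.load on the lava and goal mean_diff.pt files (for example under concept_vectors/qwen3-4b_step400/), assuming each file holds a bare tensor of the stated shape. Returning the per-layer values as well makes it easy to compare the late-third mean with the minimum over layers, which reflects the trivial early-layer contrast between the two tokens.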
