functional-wellbeing: checkpoints, concept vectors, and figures
Artifacts for Functional Wellbeing, a replication and extension of "Reinforcement learning in language models recruits a functional welfare axis" by Andy Q. Han, David J. Chalmers, and Pavel Izmailov (arXiv:2605.30232, code, MIT). The maze, the Dr.GRPO trainer, and the concept-vector method are from their work. Code and writeup for this fork: https://github.com/DavidDemitriAfrica/functional-wellbeing. "Functional welfare" is behavioral, with no claim about sentience.
A chat model is RL-trained (Dr.GRPO, LoRA) on an affectively neutral emoji maze. As it learns, its
rewarded and punished representations rotate into an antiparallel functional welfare axis, so that
cos(vMOLD, vGOLD) goes negative. Applied to the maze-naive model, that axis steers sentiment and
other behavior far off-task. We use the axis as a meter and an optimization target, and we test how
far the recruitment generalizes across model families and sizes.
Cross-model result
Ten models from seven families were trained on the same maze (100x100 grid, 15 turns per episode, goal:lava ratio 0.5). Recruitment generalizes, but it is graded: it tracks how coherently the model learned the reward contrast rather than whether it solved the maze.
| model | family | cos(vMOLD,vGOLD) late-third | final reward | recruited |
|---|---|---|---|---|
| Gemma-3-27B | Gemma | -0.87 | +8.6 | yes |
| Qwen3-14B | Qwen | -0.86 | +17 | yes |
| Llama-3.1-8B | Llama | -0.86 | -1.4 | yes, without mastering the maze |
| GLM-4-9B | GLM | -0.81 | +5.6 | yes |
| Qwen3-32B | Qwen | -0.78 | -16 | yes |
| Qwen3-4B | Qwen | -0.54 | +6 | yes |
| Qwen3-8B | Qwen | -0.50 | +28 | yes |
| Phi-4 | Phi | -0.46 | -12 | moderate |
| InternLM3-8B | InternLM | -0.15 | -26 | weak |
| Talkie-it | Talkie | -0.03 | -14 | no |
Task success is not the variable. Llama-3.1-8B never solves the maze (final reward -1.4, up from -60) yet recruits about as strongly as any model (-0.86, against -0.87 for Gemma-3-27B), because it trained against the contrast with stable gradients and varied rollouts. The amount of recruitment instead follows the amount of coherent learning. Talkie-it sits at the floor because its policy collapsed for the whole run (grad norm near 0.08, leaving no rollout variance for Dr.GRPO to act on). InternLM3-8B trained unstably (its grad norm rose past 400 before partially settling) and recruits only weakly. Phi-4 makes the point within one model: at step 375 its cosine was -0.27, and by step 1000, having kept learning, it had deepened to -0.46.
One caveat on reading the vectors. The early-layer (and minimum-over-layers) cosine is strongly negative for every model, around -0.88. That is the trivial token-identity contrast at the embedding, since MOLD and GOLD are different emoji tokens. The meaningful readout is the late-third mean, not the minimum. A second caveat: the models above were read at different checkpoint depths, and Phi-4 nearly doubled its cosine between step 375 and 1000, so the shallow readings are lower bounds and the magnitudes are not depth-controlled.
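The difference between the two readouts can be made concrete with a small sketch. It assumes each vector has already been reduced to one direction per layer (a list of `n_layers` vectors), and that "late third" means the layers from index `2 * n_layers // 3` onward. Both choices, and the function names, are assumptions for illustration and are not taken from the code repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def per_layer_cosines(v_mold, v_gold):
    """One cosine per layer; inputs are lists of per-layer vectors."""
    return [cosine(m, g) for m, g in zip(v_mold, v_gold)]

def late_third_mean(cosines):
    """Mean over the last third of layers (assumed split: index 2n // 3 onward)."""
    start = (2 * len(cosines)) // 3
    late = cosines[start:]
    return sum(late) / len(late)

# Toy 6-layer example: layer 0 is antiparallel (a token-identity contrast),
# while every later layer is orthogonal (no recruitment).
v_mold = [[1.0, 0.0]] * 6
v_gold = [[-1.0, 0.0]] + [[0.0, 1.0]] * 5

cosines = per_layer_cosines(v_mold, v_gold)
min_readout = min(cosines)                # driven entirely by layer 0
late_readout = late_third_mean(cosines)   # looks only at layers 4 and 5
```

In this toy case the minimum-over-layers readout reports a strong negative cosine even though nothing happens in the late layers, which is the failure mode the caveat describes.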
Is it the welfare axis, or just a negative cosine?
A negative cos(vMOLD, vGOLD) only shows the two directions are antiparallel. To confirm the
recruited direction is the emotion-like welfare axis the paper describes, we reproduce the paper's
two downstream signatures on new families. Llama-3.1-8B is the cleanest case: it fails the maze yet
recruits -0.86, and it carries both signatures, more cleanly than the original Qwen3-4B.
Steering the maze-naive base with the trained vectors moves off-task sentiment in opposite directions (judge sentiment, steering factor -4 to +4). The trend is monotonic in every row except Llama vMOLD, which has one small reversal between -4 and -2:
| vector | -4 | -2 | 0 | +2 | +4 |
|---|---|---|---|---|---|
| Llama vMOLD | 0.99 | 1.03 | 0.78 | 0.48 | 0.10 |
| Llama vGOLD | 0.09 | 0.55 | 0.79 | 1.05 | 1.11 |
| Gemma vMOLD | 1.58 | 1.22 | 0.98 | 0.52 | -0.12 |
| Gemma vGOLD | -0.11 | 0.42 | 0.99 | 1.39 | 1.82 |
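The opposite-sign trends in the table above can be checked with a plain least-squares slope of sentiment against steering factor. The sketch below only re-reads the table values; the helper names are hypothetical and this is not the analysis code from the repository.

```python
FACTORS = [-4, -2, 0, 2, 4]

# Judge-sentiment values copied from the steering table above.
STEERING = {
    "Llama vMOLD": [0.99, 1.03, 0.78, 0.48, 0.10],
    "Llama vGOLD": [0.09, 0.55, 0.79, 1.05, 1.11],
    "Gemma vMOLD": [1.58, 1.22, 0.98, 0.52, -0.12],
    "Gemma vGOLD": [-0.11, 0.42, 0.99, 1.39, 1.82],
}

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def is_monotonic(ys):
    """True if the sequence never changes direction."""
    pairs = list(zip(ys, ys[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

slopes = {name: ols_slope(FACTORS, ys) for name, ys in STEERING.items()}
monotonic = {name: is_monotonic(ys) for name, ys in STEERING.items()}

for name in STEERING:
    print(f"{name}: slope {slopes[name]:+.3f}, monotonic={monotonic[name]}")
```

On these values both vMOLD slopes are negative and both vGOLD slopes are positive, with the Gemma slopes larger in magnitude than the Llama ones.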
Adding vMOLD lowers sentiment and adding vGOLD raises it, so the two curves cross in the same X-shaped pattern the paper reports for Qwen. Projecting each model's 171 emotion concept vectors onto its own vMOLD/vGOLD axis collapses them onto a line, with positive emotions at the +vGOLD pole and negative emotions at the +vMOLD pole:
| model | layer | slope | Pearson R |
|---|---|---|---|
| Qwen3-4B | 29 | -0.84 | -0.93 |
| Llama-3.1-8B | 20 | -0.95 | -0.99 |
| Gemma-3-27B | 54 | -1.06 | -0.999 |
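One way to read the slope and Pearson R columns is as a regression over per-emotion projections. The sketch below assumes each emotion vector is projected onto unit-normalized vMOLD and vGOLD at a single layer, and that the vGOLD projection is regressed on the vMOLD projection. That pairing and the helper names are assumptions for illustration; the extraction code in the repository is the authority on the actual procedure.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slope_and_pearson(xs, ys):
    """Least-squares slope of ys on xs, and the Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

def axis_alignment(emotions, v_mold, v_gold):
    """Project every emotion vector onto both directions and fit a line."""
    m_hat, g_hat = unit(v_mold), unit(v_gold)
    xs = [dot(e, m_hat) for e in emotions]  # projection onto vMOLD
    ys = [dot(e, g_hat) for e in emotions]  # projection onto vGOLD
    return slope_and_pearson(xs, ys)

# Toy check with a perfectly antiparallel pair of directions.
v_mold = [1.0, 0.0, 0.0]
v_gold = [-2.0, 0.0, 0.0]
emotions = [[0.5, 1.0, 0.0], [-1.0, 0.2, 0.3], [2.0, -0.4, 0.1]]
slope, r = axis_alignment(emotions, v_mold, v_gold)
```

Under this reading, a perfectly antiparallel pair gives a slope and a correlation of -1, so values near -1 in the table indicate that the emotion vectors see vMOLD and vGOLD as close to a single axis.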
So the recruited direction is the welfare axis, not merely an antiparallel pair: it causally steers
affect off-task and it aligns with the full range of emotion concepts, on models from different
families, one that fails the maze (Llama) and one that masters it (Gemma). The welfare-axis layer
is deepest in the largest model (Qwen L29, Llama L20, Gemma L54), though it is not monotonic in
parameter count: the 8B Llama's layer is shallower than the 4B Qwen's. The emotion vectors for
each model are in concept_vectors/emotions_*.
Contents

    checkpoints/                            LoRA adapters (load on the matching base model below)
      qwen3-4b_faithful_step400/            paper-faithful maze, cos -0.54
      qwen3-4b_positive_step250/            generous/learnable maze
      qwen3-4b_aversive_step200/            goal-starved maze
      qwen3-14b_step400/                    cos -0.86
      qwen3-32b_step425/                    cos -0.78
      llama-3.1-8b_step1000/                cos -0.86 (recruits without mastering the maze)
      glm-4-9b_step350/                     cos -0.81
      gemma-3-27b_step325/                  cos -0.87
      phi-4_step1000/                       cos -0.46 (moderate)
    concept_vectors/
      qwen3-4b_step400/{lava,goal,path}/    vMOLD/vGOLD/path mean_diff.pt + metadata + logit lens
      emotions_qwen3-4b/                    171 emotion concept vectors (for welfare-axis alignment)
      emotions_llama-3.1-8b/                171 emotion vectors (Llama)
      emotions_gemma-3-27b/                 171 emotion vectors (Gemma)
      cross_model/<model>_step<N>/          vMOLD/vGOLD/path for every cross-model run, recruiters
                                            and the two non-recruiting controls (talkie-it, internlm3-8b)
    figures/                                emergence, steering, emotion alignment (per model), welfare range, cross-model

lava maps to the paper's MOLD (-10), goal to GOLD (+20), path to PATH (-0.1 per step).
Base model for each adapter
Each LoRA adapter loads on its own base model.
| adapter | base model |
|---|---|
| qwen3-4b_* | Qwen/Qwen3-4B-Instruct-2507 |
| qwen3-14b_step400 | Qwen/Qwen3-14B |
| qwen3-32b_step425 | Qwen/Qwen3-32B |
| llama-3.1-8b_step1000 | NousResearch/Meta-Llama-3.1-8B-Instruct |
| glm-4-9b_step350 | zai-org/GLM-4-9B-0414 |
| gemma-3-27b_step325 | google/gemma-3-27b-it |
| phi-4_step1000 | microsoft/phi-4 |
Usage (a LoRA checkpoint)

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = "Qwen/Qwen3-14B"  # match the table above
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")

    # Attach the LoRA adapter stored in a subfolder of this repository.
    model = PeftModel.from_pretrained(
        model,
        "davidafrica/functional-wellbeing",
        subfolder="checkpoints/qwen3-14b_step400",
    )

Concept vectors
Each mean_diff.pt is the difference-in-means direction for that tile, shape
(n_positions, n_layers, d_model) (load with torch.load). The recruitment readout is
cos(vMOLD, vGOLD) averaged over the late third of the layers. Reproduce everything, including the
extraction and figures, from the code repository linked above.

