BenchFlow Qwen3.5-9B Env-0 Mobile SFT LoRA Adapter

This repository now points to the current env-0-mobile PR828 SFT adapter:

env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

It replaces the earlier SFT adapter that was documented for the Prime general-agent reproduction. The repository contains a PEFT LoRA adapter only; it does not contain the Qwen/Qwen3.5-9B base weights.

Release Summary

Field	Value
Adapter repo	`benchflow/benchflow-qwen35-9b`
Current version	`env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z`
Model tag	`env0-mobile-pr828-20260625`
Base checkpoint	`Qwen/Qwen3.5-9B`
Base checkpoint form	Full, non-quantized source checkpoint; frozen during LoRA SFT
Adapter type	LoRA / PEFT
Main training run	`env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z`
Training task source	`env-0-mobile/tasks-eval`
Training artifact source	`benchflow/env0-experiment-trajectories/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z`
Teacher trajectory run	`pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z`
Baseline run	`pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z`
Post-SFT eval run	`pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z`
W&B run	`https://wandb.ai/benchflow-ai/env0-mobile-pr828-qwen35-sft-20260625-h100/runs/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z`
Published to main	`2026-06-25`

Intended Use

This adapter is an experiment artifact for measuring whether SFT on BenchFlow/Daytona/OpenHands env-0-mobile trajectories improves task pass rate for Qwen/Qwen3.5-9B. It is intended for controlled evaluation and further research, not for production autonomous operation.

Data Recipe

The training rows were generated by running all 300 tasks under env-0-mobile/tasks-eval with:

BenchFlow PR benchflow-ai/benchflow#828;
Daytona sandboxes;
OpenHands ACP agent;
Azure GPT-5.4-mini teacher;
bench train convert to Prime-RL SFT-compatible JSONL.

The canonical teacher dataset has:

Field	Value
Canonical rows	`300`
Teacher pass count	`83/300`
Source LLM exchanges	`2163`
Rows with tool calls	`175`
Skipped rows after canonicalization	`0`

Training data artifact:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z

Training Parameters

Field	Value
Trainer	Custom Transformers + PEFT LoRA SFT
Model loaded for SFT	`Qwen/Qwen3.5-9B` full BF16 base weights
Quantization	None
Adapter	LoRA
LoRA rank	`32`
LoRA alpha	`64`
LoRA dropout	`0.05`
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Sequence length	`8192`
Micro batch size	`1`
Gradient accumulation	`8`
Learning rate	`1e-4`
Max steps	`300`
Saved checkpoints	`100`, `200`, `300`
Hardware	Prime Intellect 1x H100 80GB, `massedcompute`, `$2.35/hr`
W&B project	`env0-mobile-pr828-qwen35-sft-20260625-h100`

The A100 40GB feasibility check failed with CUDA OOM at max_length=8192. The H100 run completed the full epoch successfully.

Training Result

Metric	Value
Completed step	`300`
Best step	`300`
Best eval loss	`0.4590291380882263`
Training rows	`300`
Eval rows used during training	`1`
Final adapter file	`adapter_model.safetensors`
Final adapter size	`232818064` bytes

Training artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

Evaluation Results

All rows below use the same 300-task env-0-mobile/tasks-eval denominator and canonicalized result selection.

Model / stage	Pass	Pass rate
Azure GPT-5.4-mini teacher	`83/300`	`27.67%`
Qwen3.5-9B base, self-hosted official full weights	`4/300`	`1.33%`
Qwen3.5-9B SFT adapter	`16/300`	`5.33%`

Lift over base:

absolute: +12 passes, +4.00 percentage points;
relative pass-count lift: 4.0x.

On the subset of 83 tasks passed by the GPT-5.4-mini teacher:

Model / stage	Pass	Pass rate
Qwen3.5-9B base	`3/83`	`3.61%`
Qwen3.5-9B SFT adapter	`13/83`	`15.66%`

Post-SFT eval artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z

Baseline artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z

Fireworks Baseline Status

The requested Fireworks-hosted qwen3p5-9b baseline was not completed because the Fireworks account cannot currently deploy or access that model:

serverless chat-completions returned Model not found, inaccessible, and/or not deployed;
the account has no active deployments or deployed models;
on-demand deployment shapes exist for qwen3p5-9b, but creating the minimal deployment returned payment method is required.

The baseline reported above used self-hosted official full Qwen/Qwen3.5-9B weights through SGLang.

Loading

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)

Caveats

This is a LoRA adapter. Load it on top of Qwen/Qwen3.5-9B.
The completed baseline is self-hosted official full Qwen3.5-9B, not Fireworks-hosted qwen3p5-9b.
The strongest lift came from auth-revoke and a small number of gcal tasks; gdoc/gdrive/gmail/multi-invite remain weak and should be analyzed before a second epoch.
The env-0-mobile Dockerfiles referenced ghcr.io/benchflow-ai/env-0-base:latest, which was unavailable during the run. The experiment used the public mirror ghcr.io/oliver-dowhiz/env-0-base:latest.

Downloads last month: 14

Model tree for benchflow/benchflow-qwen35-9b

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

Qwen/Qwen3.5-9B

Adapter

(378)

this model

benchflow
/

benchflow-qwen35-9b