BenchFlow Qwen3.5-9B Env-0 Mobile SFT LoRA Adapter

This repository now points to the current env-0-mobile PR828 SFT adapter:

env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

It replaces the earlier SFT adapter that was documented for the Prime general-agent reproduction. The repository contains a PEFT LoRA adapter only; it does not contain the Qwen/Qwen3.5-9B base weights.

Release Summary

Field                     Value
Adapter repo              benchflow/benchflow-qwen35-9b
Current version           env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Model tag                 env0-mobile-pr828-20260625
Base checkpoint           Qwen/Qwen3.5-9B
Base checkpoint form      Full, non-quantized source checkpoint; frozen during LoRA SFT
Adapter type              LoRA / PEFT
Main training run         env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Training task source      env-0-mobile/tasks-eval
Training artifact source  benchflow/env0-experiment-trajectories/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Teacher trajectory run    pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z
Baseline run              pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z
Post-SFT eval run         pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z
W&B run                   https://wandb.ai/benchflow-ai/env0-mobile-pr828-qwen35-sft-20260625-h100/runs/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Published to main         2026-06-25

Intended Use

This adapter is an experiment artifact for measuring whether SFT on BenchFlow/Daytona/OpenHands env-0-mobile trajectories improves task pass rate for Qwen/Qwen3.5-9B. It is intended for controlled evaluation and further research, not for production autonomous operation.

Data Recipe

The training rows were generated by running all 300 tasks under env-0-mobile/tasks-eval with:

  • BenchFlow PR benchflow-ai/benchflow#828;
  • Daytona sandboxes;
  • OpenHands ACP agent;
  • Azure GPT-5.4-mini teacher;
  • bench train convert, which produced Prime-RL SFT-compatible JSONL.

The canonical teacher dataset has:

Field                                 Value
Canonical rows                        300
Teacher pass count                    83/300
Source LLM exchanges                  2163
Rows with tool calls                  175
Skipped rows after canonicalization   0
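The counts in the table above imply a few derived figures that are useful when judging the training set. The following is a minimal sketch that recomputes them; the variable names are illustrative and not part of any BenchFlow tooling.

```python
# Counts copied from the teacher dataset table above.
canonical_rows = 300
teacher_passes = 83
llm_exchanges = 2163
rows_with_tool_calls = 175

# Derived ratios (plain arithmetic on the reported counts).
teacher_pass_rate = teacher_passes / canonical_rows
exchanges_per_row = llm_exchanges / canonical_rows
tool_call_share = rows_with_tool_calls / canonical_rows

print(f"teacher pass rate:     {teacher_pass_rate:.2%}")   # 27.67%
print(f"LLM exchanges per row: {exchanges_per_row:.2f}")   # 7.21
print(f"rows with tool calls:  {tool_call_share:.2%}")     # 58.33%
```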

Training data artifact:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z

Training Parameters

Field                  Value
Trainer                Custom Transformers + PEFT LoRA SFT
Model loaded for SFT   Qwen/Qwen3.5-9B full BF16 base weights
Quantization           None
Adapter                LoRA
LoRA rank              32
LoRA alpha             64
LoRA dropout           0.05
Target modules         q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Sequence length        8192
Micro batch size       1
Gradient accumulation  8
Learning rate          1e-4
Max steps              300
Saved checkpoints      100, 200, 300
Hardware               Prime Intellect 1x H100 80GB, massedcompute, $2.35/hr
W&B project            env0-mobile-pr828-qwen35-sft-20260625-h100

The A100 40GB feasibility check failed with CUDA OOM at max_length=8192. The H100 run completed the full epoch successfully.
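For readers who want to set up a comparable run, the LoRA hyperparameters listed above map directly onto PEFT's LoraConfig. The sketch below is not the custom trainer used for this run, which is not included in this card. The task_type value, the helper names, and the single-device default are assumptions made for illustration.

```python
# Hyperparameters copied from the Training Parameters table.
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the table
}

MICRO_BATCH_SIZE = 1
GRADIENT_ACCUMULATION = 8


def lora_scaling(kwargs):
    """Default LoRA scaling factor (alpha / r) applied to the adapter update."""
    return kwargs["lora_alpha"] / kwargs["r"]


def effective_batch_size(micro_batch, grad_accum, num_devices=1):
    """Sequences contributing to one optimizer update."""
    return micro_batch * grad_accum * num_devices


def build_lora_config():
    """Build a peft.LoraConfig from the table values (requires peft)."""
    from peft import LoraConfig  # imported lazily so the arithmetic runs without peft
    return LoraConfig(**LORA_KWARGS)


print(lora_scaling(LORA_KWARGS))                                      # 2.0
print(effective_batch_size(MICRO_BATCH_SIZE, GRADIENT_ACCUMULATION))  # 8
```

With rank 32 and alpha 64 the adapter update is scaled by 2.0, and a micro batch of 1 with 8 accumulation steps on one GPU gives 8 sequences per optimizer update.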

Training Result

Metric                          Value
Completed step                  300
Best step                       300
Best eval loss                  0.4590291380882263
Training rows                   300
Eval rows used during training  1
Final adapter file              adapter_model.safetensors
Final adapter size              232818064 bytes

Training artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

Evaluation Results

All rows below use the same 300-task env-0-mobile/tasks-eval denominator and canonicalized result selection.

Model / stage                                       Pass    Pass rate
Azure GPT-5.4-mini teacher                          83/300  27.67%
Qwen3.5-9B base, self-hosted official full weights  4/300   1.33%
Qwen3.5-9B SFT adapter                              16/300  5.33%

Lift over base:

  • absolute: +12 passes, +4.00 percentage points;
  • relative pass-count lift: 4.0x.

On the subset of 83 tasks passed by the GPT-5.4-mini teacher:

Model / stage           Pass   Pass rate
Qwen3.5-9B base         3/83   3.61%
Qwen3.5-9B SFT adapter  13/83  15.66%
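The pass rates and lift figures above can be recomputed from the raw pass counts. The following is a small sketch of that arithmetic; the dictionary keys and function names are illustrative only and do not come from the BenchFlow evaluation code.

```python
# (passed, total) pairs copied from the evaluation tables above.
FULL_SET = {
    "teacher": (83, 300),
    "base": (4, 300),
    "sft": (16, 300),
}
TEACHER_PASSED_SUBSET = {
    "base": (3, 83),
    "sft": (13, 83),
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


def lift(base: tuple[int, int], sft: tuple[int, int]) -> dict[str, float]:
    """Absolute and relative lift of the SFT adapter over the base model."""
    (base_passed, base_total), (sft_passed, sft_total) = base, sft
    if base_total != sft_total:
        raise ValueError("lift is only meaningful on a shared denominator")
    return {
        "extra_passes": sft_passed - base_passed,
        "points": pass_rate(sft_passed, sft_total) - pass_rate(base_passed, base_total),
        "ratio": sft_passed / base_passed,
    }


for name, results in (("full set", FULL_SET), ("teacher-passed subset", TEACHER_PASSED_SUBSET)):
    for stage, (passed, total) in results.items():
        print(f"{name:22s} {stage:8s} {passed}/{total} = {pass_rate(passed, total):.2f}%")
    gain = lift(results["base"], results["sft"])
    print(f"{name:22s} lift: +{gain['extra_passes']} passes, "
          f"+{gain['points']:.2f} points, {gain['ratio']:.2f}x")
```

On the full 300-task set this reproduces the reported +12 passes, +4.00 percentage points, and 4.0x pass-count ratio.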

Post-SFT eval artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z

Baseline artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z

Fireworks Baseline Status

The requested Fireworks-hosted qwen3p5-9b baseline was not completed because the Fireworks account cannot currently deploy or access that model:

  • serverless chat-completions requests returned "Model not found, inaccessible, and/or not deployed";
  • the account has no active deployments or deployed models;
  • on-demand deployment shapes exist for qwen3p5-9b, but creating the minimal deployment returned "payment method is required".

The baseline reported above used self-hosted official full Qwen/Qwen3.5-9B weights through SGLang.

Loading

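import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in BF16; this repository does not ship the base weights.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Attach the LoRA adapter from this repository on top of the base model.
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
# The tokenizer comes from the base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)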

Caveats

  • This is a LoRA adapter. Load it on top of Qwen/Qwen3.5-9B.
  • The completed baseline is self-hosted official full Qwen3.5-9B, not Fireworks-hosted qwen3p5-9b.
  • The strongest lift came from auth-revoke and a small number of gcal tasks; gdoc/gdrive/gmail/multi-invite remain weak and should be analyzed before a second epoch.
  • The env-0-mobile Dockerfiles referenced ghcr.io/benchflow-ai/env-0-base:latest, which was unavailable during the run. The experiment used the public mirror ghcr.io/oliver-dowhiz/env-0-base:latest.