Instructions to use benchflow/benchflow-qwen35-9b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use benchflow/benchflow-qwen35-9b with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B") model = PeftModel.from_pretrained(base_model, "benchflow/benchflow-qwen35-9b") - Notebooks
- Google Colab
- Kaggle
BenchFlow Qwen3.5-9B Env-0 Mobile SFT LoRA Adapter
This repository now points to the current env-0-mobile PR828 SFT adapter:
env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
It replaces the earlier SFT adapter that was documented for the Prime
general-agent reproduction. The repository contains a PEFT LoRA adapter only;
it does not contain the Qwen/Qwen3.5-9B base weights.
Release Summary
| Field | Value |
|---|---|
| Adapter repo | benchflow/benchflow-qwen35-9b |
| Current version | env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z |
| Model tag | env0-mobile-pr828-20260625 |
| Base checkpoint | Qwen/Qwen3.5-9B |
| Base checkpoint form | Full, non-quantized source checkpoint; frozen during LoRA SFT |
| Adapter type | LoRA / PEFT |
| Main training run | env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z |
| Training task source | env-0-mobile/tasks-eval |
| Training artifact source | benchflow/env0-experiment-trajectories/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z |
| Teacher trajectory run | pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z |
| Baseline run | pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z |
| Post-SFT eval run | pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z |
| W&B run | https://wandb.ai/benchflow-ai/env0-mobile-pr828-qwen35-sft-20260625-h100/runs/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z |
| Published to main | 2026-06-25 |
Intended Use
This adapter is an experiment artifact for measuring whether SFT on
BenchFlow/Daytona/OpenHands env-0-mobile trajectories improves task pass rate
for Qwen/Qwen3.5-9B. It is intended for controlled evaluation and further
research, not for production autonomous operation.
Data Recipe
The training rows were generated by running all 300 tasks under
env-0-mobile/tasks-eval with:
- BenchFlow PR
benchflow-ai/benchflow#828; - Daytona sandboxes;
- OpenHands ACP agent;
- Azure GPT-5.4-mini teacher;
bench train convertto Prime-RL SFT-compatible JSONL.
The canonical teacher dataset has:
| Field | Value |
|---|---|
| Canonical rows | 300 |
| Teacher pass count | 83/300 |
| Source LLM exchanges | 2163 |
| Rows with tool calls | 175 |
| Skipped rows after canonicalization | 0 |
Training data artifact:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z
Training Parameters
| Field | Value |
|---|---|
| Trainer | Custom Transformers + PEFT LoRA SFT |
| Model loaded for SFT | Qwen/Qwen3.5-9B full BF16 base weights |
| Quantization | None |
| Adapter | LoRA |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Sequence length | 8192 |
| Micro batch size | 1 |
| Gradient accumulation | 8 |
| Learning rate | 1e-4 |
| Max steps | 300 |
| Saved checkpoints | 100, 200, 300 |
| Hardware | Prime Intellect 1x H100 80GB, massedcompute, $2.35/hr |
| W&B project | env0-mobile-pr828-qwen35-sft-20260625-h100 |
The A100 40GB feasibility check failed with CUDA OOM at max_length=8192.
The H100 run completed the full epoch successfully.
Training Result
| Metric | Value |
|---|---|
| Completed step | 300 |
| Best step | 300 |
| Best eval loss | 0.4590291380882263 |
| Training rows | 300 |
| Eval rows used during training | 1 |
| Final adapter file | adapter_model.safetensors |
| Final adapter size | 232818064 bytes |
Training artifacts:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Evaluation Results
All rows below use the same 300-task env-0-mobile/tasks-eval denominator and
canonicalized result selection.
| Model / stage | Pass | Pass rate |
|---|---|---|
| Azure GPT-5.4-mini teacher | 83/300 |
27.67% |
| Qwen3.5-9B base, self-hosted official full weights | 4/300 |
1.33% |
| Qwen3.5-9B SFT adapter | 16/300 |
5.33% |
Lift over base:
- absolute:
+12passes,+4.00percentage points; - relative pass-count lift:
4.0x.
On the subset of 83 tasks passed by the GPT-5.4-mini teacher:
| Model / stage | Pass | Pass rate |
|---|---|---|
| Qwen3.5-9B base | 3/83 |
3.61% |
| Qwen3.5-9B SFT adapter | 13/83 |
15.66% |
Post-SFT eval artifacts:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z
Baseline artifacts:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z
Fireworks Baseline Status
The requested Fireworks-hosted qwen3p5-9b baseline was not completed because
the Fireworks account cannot currently deploy or access that model:
- serverless chat-completions returned
Model not found, inaccessible, and/or not deployed; - the account has no active deployments or deployed models;
- on-demand deployment shapes exist for
qwen3p5-9b, but creating the minimal deployment returnedpayment method is required.
The baseline reported above used self-hosted official full Qwen/Qwen3.5-9B
weights through SGLang.
Loading
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3.5-9B",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)
Caveats
- This is a LoRA adapter. Load it on top of
Qwen/Qwen3.5-9B. - The completed baseline is self-hosted official full Qwen3.5-9B, not
Fireworks-hosted
qwen3p5-9b. - The strongest lift came from auth-revoke and a small number of gcal tasks; gdoc/gdrive/gmail/multi-invite remain weak and should be analyzed before a second epoch.
- The env-0-mobile Dockerfiles referenced
ghcr.io/benchflow-ai/env-0-base:latest, which was unavailable during the run. The experiment used the public mirrorghcr.io/oliver-dowhiz/env-0-base:latest.
- Downloads last month
- 14