Qwen3-VL-2B-Instruct-Action

Qwen/Qwen3-VL-2B-Instruct extended with 2048 FAST action tokens so it can be used as the VLM backbone for the QwenFast framework in starVLA (autoregressive VLA via π₀-FAST-style discrete action tokens).

The base model weights are unchanged; only the input/output embedding tables are resized and the new rows are randomly initialised. The tokenizer, processor, and config are saved alongside so the directory loads as a drop-in replacement for the base model.

What was added

| Item | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-2B-Instruct |
| New tokens | 2048 FAST action tokens (added as special tokens) |
| Tokenizer size before (base vocab / incl. special tokens) | 151643 / 151669 |
| Embedding size (before → after) | 151936 → 153984 |
| Action token id range | [151936, 153983] |
| Init strategy for new rows | normal (μ=0, σ=0.02) |
| Source token list | fast_tokens.txt |
| Saved dtype | bfloat16 |

The mapping {token: id} is stored in added_custom_token_id_map.json.
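Because the new ids form one contiguous block (see the table above), converting between a FAST token index and this checkpoint's vocabulary id is a fixed offset. A minimal sketch, assuming the tokens were appended in order — the saved JSON map remains the authoritative source:

```python
# Action tokens occupy a contiguous id block (constants from the table above).
ACTION_ID_START = 151936
NUM_ACTION_TOKENS = 2048

def fast_index_to_vlm_id(i: int) -> int:
    """Map a FAST token index (0..2047) to this checkpoint's vocabulary id."""
    assert 0 <= i < NUM_ACTION_TOKENS
    return ACTION_ID_START + i

def vlm_id_to_fast_index(tid: int) -> int:
    """Inverse mapping, valid only for ids inside the action range."""
    assert ACTION_ID_START <= tid < ACTION_ID_START + NUM_ACTION_TOKENS
    return tid - ACTION_ID_START

print(fast_index_to_vlm_id(0), fast_index_to_vlm_id(2047))
# 151936 153983
```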

How it was produced

Built with the helper at starVLA/model/modules/vlm/tools/add_qwen_special_tokens/add_special_tokens_to_qwen.py, equivalent to:

python starVLA/model/modules/vlm/tools/add_qwen_special_tokens/add_special_tokens_to_qwen.py \
  --model-id Qwen/Qwen3-VL-2B-Instruct \
  --tokens-file starVLA/model/modules/vlm/tools/add_qwen_special_tokens/fast_tokens.txt \
  --save-dir ./results/Qwen3-VL-2B-Instruct-Action \
  --init-strategy normal
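The numbers in the table above are consistent with this command: the base checkpoint's embedding (151936 rows) is already padded past its 151669-entry tokenizer, and the 2048 new rows are appended after that padding, which is why the action ids start at 151936 rather than at the tokenizer length. The bookkeeping in sketch form:

```python
# Size bookkeeping for the embedding resize (all constants from the table above).
base_tokenizer_entries = 151669   # 151643 base vocab + 26 existing special tokens
embedding_rows_before = 151936    # base checkpoint pads the embedding past the vocab
num_action_tokens = 2048

tokenizer_entries_after = base_tokenizer_entries + num_action_tokens
embedding_rows_after = embedding_rows_before + num_action_tokens
first_action_id = embedding_rows_before        # new rows land after the padding
last_action_id = embedding_rows_after - 1

print(tokenizer_entries_after, embedding_rows_after, first_action_id, last_action_id)
# 153717 153984 151936 153983
```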

Use in starVLA

Set framework.qwenvl.base_vlm to this repo in your training YAML. Example: examples/SimplerEnv/train_files/config_2b_fast.yaml:

framework:
  name: QwenFast
  qwenvl:
    base_vlm: LinhanWang/Qwen3-VL-2B-Instruct-Action
    attn_implementation: sdpa
  action_model:
    action_model_type: FAST
    action_dim: 7
    future_action_window_size: 15
    past_action_window_size: 0

Standalone load

import torch
from transformers import AutoProcessor, AutoTokenizer, Qwen3VLForConditionalGeneration

repo = "LinhanWang/Qwen3-VL-2B-Instruct-Action"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
proc = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    repo, dtype=torch.bfloat16, attn_implementation="sdpa", device_map="cuda"
)
print(len(tok), model.get_input_embeddings().weight.shape[0])  # 153717 153984 (tokenizer entries vs. padded embedding rows)

The newly added embedding rows carry no learned signal; fine-tune the model (e.g. with QwenFast on a LeRobot dataset) before relying on the action tokens.
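After fine-tuning, decoding actions amounts to picking the ids that fall in the action range out of a generated sequence and handing the corresponding FAST indices to the action detokenizer. A minimal sketch of that filtering step, using the id range from the table above (the actual QwenFast decode path may differ):

```python
# Action-token id range from the table above.
ACTION_ID_START = 151936
ACTION_ID_END = 153983  # inclusive (2048 tokens)

def extract_fast_indices(generated_ids):
    """Keep only action-token ids and map them back to FAST indices 0..2047."""
    return [t - ACTION_ID_START for t in generated_ids
            if ACTION_ID_START <= t <= ACTION_ID_END]

# Hypothetical generation mixing ordinary text ids with action ids:
print(extract_fast_indices([1234, 151936, 152000, 151645, 153983]))
# [0, 64, 2047]
```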

License

Apache-2.0, inherited from Qwen/Qwen3-VL-2B-Instruct.
