SmolVLA SO101 PickOrange

Fine-tuned SmolVLA policy for the SO101 robot arm performing an orange-picking task in LeIsaac (Isaac Sim).

Task

Pick three oranges from the table and place them on the plate, then return the arm to its rest state. The policy is evaluated in the LeIsaac-SO101-PickOrange-v0 Isaac Sim environment.

Architecture

SmolVLA is a Vision-Language-Action model that combines a frozen vision encoder with a language model backbone and a lightweight action expert head.

Inference pipeline

2 camera images (480x640)
  -> resize to 512x512 with padding
  -> patchify into 16x16 patches (1024 tokens per image, 2048 total)
  -> 12-layer ViT vision encoder (bf16)
  -> project to LM hidden space
  -> 16-layer SmolLM2 backbone + 16-layer Expert (interleaved cross-attention)
  -> predict one chunk of 50 actions, each a 6D joint-position target
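
The token counts in the pipeline follow from the image and patch sizes. The sketch below reproduces that arithmetic. It assumes "resize with padding" means an aspect-preserving resize followed by padding the shorter side; the function names are illustrative and are not part of LeRobot.

```python
def resize_with_padding_shape(height, width, target=512):
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving resize.

    Assumption: the image is scaled so its longer side equals `target`,
    then the shorter side is padded up to `target`.
    """
    scale = target / max(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    return new_h, new_w, target - new_h, target - new_w


def tokens_per_image(resolution=512, patch=16):
    """Number of ViT patch tokens for a square input."""
    per_side = resolution // patch
    return per_side * per_side


# A 480x640 camera frame becomes 384x512 and needs 128 rows of padding.
print(resize_with_padding_shape(480, 640))
# 32 x 32 patches per image, two cameras.
print(tokens_per_image(), 2 * tokens_per_image())
```

With the values from the pipeline this gives 1024 tokens per image and 2048 for the two cameras.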

Vision Encoder (ViT)

Property	Value
Architecture	Vision Transformer (SigLIP-derived)
Hidden size	768
Layers	12
Attention heads	12
Patch size	16x16
Input resolution	512x512
Tokens per image	1024 (32x32 patches)
Precision	bfloat16
Status (training)	Frozen

Text/LM Backbone (SmolLM2)

Property	Value
Architecture	SmolLM2 (Llama-based)
Hidden size	960
Layers	16 (truncated from 32)
Attention heads	15
Intermediate size	2560
Vocab size	49,280

Action Expert Head

Property	Value
Layers	16 (matches truncated VLM)
Hidden size	720 (0.75x VLM hidden)
Attention mode	Cross-attention (interleaved with VLM layers)
Output	1 chunk of 50 actions x 6D
Trainable params	100M
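
As a quick consistency check, the expert width and the backbone's per-head dimension can be derived from the table values. This is plain arithmetic on the numbers above; the variable names are illustrative and are not LeRobot config keys.

```python
vlm_hidden = 960            # LM backbone hidden size
vlm_heads = 15              # LM backbone attention heads
expert_width_ratio = 0.75   # expert hidden size relative to the VLM

head_dim = vlm_hidden // vlm_heads
expert_hidden = int(vlm_hidden * expert_width_ratio)

chunk_len, action_dim = 50, 6
values_per_chunk = chunk_len * action_dim  # scalars predicted per chunk

print(head_dim, expert_hidden, values_per_chunk)
```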

Full Model Summary

Component	Params	Trainable	Precision
Vision encoder (ViT)	~86M	Frozen	bf16
LM backbone (SmolLM2)	~264M	Frozen	bf16
Action expert head	~100M	Yes	bf16
Total	450M	100M	bf16
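
The totals in the summary can be reproduced from the per-component counts. The sketch below uses the approximate figures from the table (in millions), so the results are approximate as well.

```python
# Approximate parameter counts in millions, taken from the table above.
params_m = {"vision_encoder": 86, "lm_backbone": 264, "action_expert": 100}
trainable_components = {"action_expert"}

total_m = sum(params_m.values())
trainable_m = sum(
    count for name, count in params_m.items() if name in trainable_components
)
trainable_fraction = trainable_m / total_m

print(f"total ~{total_m}M, trainable ~{trainable_m}M ({trainable_fraction:.1%})")
```

Roughly a fifth of the weights receive gradient updates during fine-tuning.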

Branches

Branch	Training	Batch size	Final loss
`main`	multi-rank, 30k steps	56	0.019
`single-rank`	single-rank, 30k steps	64	0.008
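
Both branches run for the same number of steps but at different batch sizes, so they see a different amount of data. A rough estimate follows, assuming each step consumes one batch of individual frames and using the 36,293 frames reported in the Training table. The helper name is illustrative.

```python
def approx_passes(steps, batch_size, num_frames):
    """Approximate number of passes over the dataset."""
    return steps * batch_size / num_frames


NUM_FRAMES = 36_293
STEPS = 30_000

passes = {
    branch: approx_passes(STEPS, batch_size, NUM_FRAMES)
    for branch, batch_size in {"main": 56, "single-rank": 64}.items()
}
for branch, value in passes.items():
    print(f"{branch}: ~{value:.1f} passes over the data")
```

Under that assumption the `main` branch makes about 46 passes over the data and `single-rank` about 53.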

Training

Parameter	Value
Dataset	LightwheelAI/leisaac-pick-orange (sim-collected)
Episodes	60
Frames	36,293
Steps	30,000
Batch size	56 effective (main) / 64 (single-rank)
Learning rate	1e-4 (peak)
Optimizer	AdamW (betas=0.9/0.95, eps=1e-8, weight_decay=1e-10)
Scheduler	Cosine decay, 1000 warmup steps, decay to 2.5e-6
Grad clip norm	10
VLM layers	16 (truncated from 32)
Vision encoder	Frozen
Training mode	Expert-only (train_expert_only=true)
Framework	LeRobot v0.4.5
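
The learning-rate schedule in the table can be sketched as a function of the step. This is a minimal illustration, not LeRobot's scheduler code. It assumes a linear warmup from zero and a cosine decay that reaches the final value at step 30,000; the actual warmup shape and decay horizon may differ.

```python
import math


def lr_at(step, peak=1e-4, final=2.5e-6, warmup=1_000, total=30_000):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup:
        # Assumed: linear ramp from 0 to the peak learning rate.
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return final + 0.5 * (peak - final) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 15_500, 30_000):
    print(step, f"{lr_at(step):.3e}")
```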

Usage

Serve the policy

python -m lerobot.scripts.serve \
  --policy.type=smolvla \
  --policy.pretrained_path=edge-inference/smolvla-so101-pick-orange \
  --policy.vlm_model_name=HuggingFaceTB/SmolVLM2-500M-Video-Instruct \
  --port=8080

Evaluate in Isaac Sim (requires LeIsaac + Isaac Sim)

python scripts/evaluation/policy_inference.py \
  --task=LeIsaac-SO101-PickOrange-v0 \
  --policy_type=lerobot-smolvla \
  --policy_host=localhost \
  --policy_port=8080 \
  --policy_checkpoint_path=edge-inference/smolvla-so101-pick-orange \
  --policy_action_horizon=50 \
  --policy_language_instruction="Pick up the orange and place it on the plate" \
  --eval_rounds=10 \
  --device=cuda \
  --enable_cameras
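
An action horizon of 50 means the policy returns 50 actions per query. One common way to consume such a chunk is to replay it one action per environment step and query the policy again once the chunk is exhausted. The sketch below illustrates that pattern with a stand-in policy. `ChunkedActionRunner` and `dummy_policy` are hypothetical names, and this is not LeIsaac's client code.

```python
from collections import deque


class ChunkedActionRunner:
    """Hypothetical helper: replays a predicted chunk one action per step."""

    def __init__(self, predict_chunk, horizon=50):
        self.predict_chunk = predict_chunk  # callable: observation -> list of actions
        self.horizon = horizon
        self.queue = deque()
        self.num_queries = 0

    def next_action(self, observation):
        # Query the policy only when the previous chunk is used up.
        if not self.queue:
            chunk = self.predict_chunk(observation)
            self.queue.extend(chunk[: self.horizon])
            self.num_queries += 1
        return self.queue.popleft()


def dummy_policy(observation):
    # Stand-in for the served policy: 50 actions of 6 joint targets each.
    return [[0.0] * 6 for _ in range(50)]


runner = ChunkedActionRunner(dummy_policy, horizon=50)
for _ in range(120):
    action = runner.next_action(observation=None)

print(runner.num_queries)
```

With this pattern, 120 environment steps need three policy queries.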

Use the single-rank branch (lower final training loss: 0.008 vs 0.019)

from huggingface_hub import snapshot_download
snapshot_download(
    "edge-inference/smolvla-so101-pick-orange",
    revision="single-rank",
    local_dir="./checkpoint-single-rank"
)
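
After downloading, it can be useful to confirm that the checkpoint directory contains the files listed under Files. The sketch below is one way to do that with the standard library. The glob patterns are an interpretation of that list, and `missing_files` is an illustrative helper, not part of `huggingface_hub` or LeRobot.

```python
from pathlib import Path

# Patterns interpreted from the Files section of this card.
EXPECTED_PATTERNS = [
    "model.safetensors",
    "config.json",
    "train_config.json",
    "policy_preprocessor*.json",
    "policy_postprocessor*.json",
]


def missing_files(checkpoint_dir):
    """Return the expected patterns that match nothing in `checkpoint_dir`."""
    root = Path(checkpoint_dir)
    return [pattern for pattern in EXPECTED_PATTERNS if not any(root.glob(pattern))]
```

Calling `missing_files("./checkpoint-single-rank")` after the download should return an empty list if every expected file is present.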

Dataset

The training data was collected via teleoperation inside the LeIsaac simulation (Isaac Sim), the same simulator used for evaluation, so there is no sim-to-real visual domain gap between training and evaluation.

Files

model.safetensors -- Model weights (1.2 GB)
config.json -- Policy architecture config
train_config.json -- Full training configuration (reproducible)
policy_preprocessor*.json/safetensors -- Input normalization (state mean/std)
policy_postprocessor*.json/safetensors -- Output denormalization (action mean/std)
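
The preprocessor and postprocessor files hold mean/std statistics. The sketch below shows how such statistics are typically applied: inputs are standardized before inference and outputs are mapped back to joint units afterwards. The statistics are made-up placeholders, not values from this checkpoint, and the functions are illustrative, not LeRobot's processor API.

```python
def normalize(values, mean, std, eps=1e-8):
    """Standardize each dimension: (x - mean) / (std + eps)."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]


def denormalize(values, mean, std, eps=1e-8):
    """Invert `normalize`: x * (std + eps) + mean."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]


# Placeholder statistics for a 6D joint vector (illustrative only).
mean = [0.0, 10.0, -5.0, 2.0, 0.5, 20.0]
std = [1.0, 4.0, 2.0, 0.5, 0.25, 8.0]

state = [1.0, 14.0, -7.0, 2.5, 0.5, 4.0]
normalized = normalize(state, mean, std)
restored = denormalize(normalized, mean, std)
print(normalized)
print(restored)
```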

Base model

lerobot/smolvla_base