How to use edge-inference/smolvla-so101-pick-orange with LeRobot:
# See https://github.com/huggingface/lerobot?tab=readme-ov-file#installation for more details
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e .[smolvla]
# Launch finetuning on your dataset
python lerobot/scripts/train.py \
  --policy.path=edge-inference/smolvla-so101-pick-orange \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true
# Run the policy using the record function.
# Replace the port, robot id, cameras, and dataset repo id (the dataset name on the HF Hub) with your own values,
# and use the same task description you used in your dataset recording.
python -m lerobot.record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=my_blue_follower_arm \
  --robot.cameras="{ front: {type: opencv, index_or_path: 8, width: 640, height: 480, fps: 30}}" \
  --dataset.single_task="Grasp a lego block and put it in the bin." \
  --dataset.repo_id=HF_USER/dataset_name \
  --dataset.episode_time_s=50 \
  --dataset.num_episodes=10 \
  --policy.path=edge-inference/smolvla-so101-pick-orange

Fine-tuned SmolVLA policy for the SO101 robot arm performing an orange-picking task in LeIsaac (Isaac Sim).
Task: pick three oranges from the table and place them on the plate, then return the arm to its rest state. The policy is evaluated in the LeIsaac-SO101-PickOrange-v0 Isaac Sim environment.
SmolVLA is a Vision-Language-Action model that combines a frozen vision encoder with a language model backbone and a lightweight action expert head.
2 camera images (480x640)
-> resize to 512x512 with padding
-> patchify into 16x16 patches (1024 tokens per image, 2048 total)
-> 12-layer ViT vision encoder (bf16)
-> project to LM hidden space
-> 16-layer SmolLM2 backbone + 16-layer Expert (interleaved cross-attention)
-> decode 50 action tokens -> 6D joint positions
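
The image geometry in the pipeline above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the resize preserves aspect ratio before padding; the function names are hypothetical and not part of LeRobot, and the card does not say on which side the padding is applied, so only the sizes are computed.

```python
def resize_with_padding_shape(height, width, target=512):
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving resize
    followed by padding up to a target x target square (assumed behaviour)."""
    scale = min(target / height, target / width)
    new_h, new_w = round(height * scale), round(width * scale)
    return new_h, new_w, target - new_h, target - new_w


def tokens_per_image(resolution=512, patch=16):
    """Number of patch tokens for a square image split into patch x patch tiles."""
    per_side = resolution // patch
    return per_side * per_side


# A 480x640 camera frame scales by 0.8 and needs 128 rows of padding.
print(resize_with_padding_shape(480, 640))  # (384, 512, 128, 0)
print(tokens_per_image())                   # 1024
print(2 * tokens_per_image())               # 2048 tokens for two cameras
```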
| Vision encoder | Value |
|---|---|
| Architecture | Vision Transformer (SigLIP-derived) |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Patch size | 16x16 |
| Input resolution | 512x512 |
| Tokens per image | 1024 (32x32 patches) |
| Precision | bfloat16 |
| Status (training) | Frozen |
| LM backbone | Value |
|---|---|
| Architecture | SmolLM2 (Llama-based) |
| Hidden size | 960 |
| Layers | 16 (truncated from 32) |
| Attention heads | 15 |
| Intermediate size | 2560 |
| Vocab size | 49,280 |
| Action expert | Value |
|---|---|
| Layers | 16 (matches truncated VLM) |
| Hidden size | 720 (0.75x VLM hidden) |
| Attention mode | Cross-attention (interleaved with VLM layers) |
| Output | Action chunk of 50 steps x 6D |
| Trainable params | 100M |
| Component | Params | Trainable | Precision |
|---|---|---|---|
| Vision encoder (ViT) | ~86M | Frozen | bf16 |
| LM backbone (SmolLM2) | ~264M | Frozen | bf16 |
| Action expert head | ~100M | Yes | bf16 |
| Total | 450M | 100M | bf16 |
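
As a quick consistency check on the tables above, the sketch below recomputes the totals and the width relationships from the quoted figures. The dictionary keys and variable names are illustrative only; the parameter counts are the approximate values from the table, in millions.

```python
# Approximate parameter counts in millions, copied from the component table.
params_m = {
    "vision_encoder": 86,
    "lm_backbone": 264,
    "action_expert": 100,
}
trainable = {"action_expert"}  # expert-only training; everything else is frozen

total_m = sum(params_m.values())
trainable_m = sum(v for k, v in params_m.items() if k in trainable)
trainable_fraction = trainable_m / total_m

# Width relationships quoted in the LM backbone and action expert tables.
vlm_hidden, vlm_heads = 960, 15
expert_hidden = int(0.75 * vlm_hidden)
head_dim = vlm_hidden // vlm_heads

print(total_m, trainable_m, round(trainable_fraction, 3))  # 450 100 0.222
print(expert_hidden, head_dim)                             # 720 64
```

Roughly 22% of the weights receive gradients, which is why the trainable column lists only the action expert.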
| Branch | Training | Batch size | Final loss |
|---|---|---|---|
| main | multi-rank, 30k steps | 56 | 0.019 |
| single-rank | single-rank, 30k steps | 64 | 0.008 |
| Parameter | Value |
|---|---|
| Dataset | LightwheelAI/leisaac-pick-orange (sim-collected) |
| Episodes | 60 |
| Frames | 36,293 |
| Steps | 30,000 |
| Batch size | 56 effective (main) / 64 (single-rank) |
| Learning rate | 1e-4 (cosine decay with 1k warmup) |
| Optimizer | AdamW (betas=0.9/0.95, eps=1e-8, weight_decay=1e-10) |
| Scheduler | Cosine decay, 1000 warmup steps, decay to 2.5e-6 |
| Grad clip norm | 10 |
| VLM layers | 16 (truncated from 32) |
| Vision encoder | Frozen |
| Training mode | Expert-only (train_expert_only=true) |
| Framework | LeRobot v0.4.5 |
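
To make the schedule in the table concrete, here is a generic sketch of linear warmup followed by cosine decay, using the peak rate, final rate, warmup length, and step count quoted above. It is an assumption about the shape of the curve, not LeRobot's scheduler implementation, which may differ in its warmup formula. The `approx_epochs` helper is likewise hypothetical and simply divides samples seen by the frame count.

```python
import math

PEAK_LR = 1e-4
FINAL_LR = 2.5e-6
WARMUP_STEPS = 1_000
TOTAL_STEPS = 30_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to FINAL_LR (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


def approx_epochs(steps, batch_size, frames):
    """Passes over the dataset, counting one frame as one training sample."""
    return steps * batch_size / frames


for step in (0, 999, 1_000, 15_500, 30_000):
    print(step, f"{learning_rate(step):.2e}")

print(round(approx_epochs(30_000, 56, 36_293), 1))  # main branch
print(round(approx_epochs(30_000, 64, 36_293), 1))  # single-rank branch
```

Under these assumptions both branches make on the order of 50 passes over the 36,293 frames.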
# Start the policy server
python -m lerobot.scripts.serve \
  --policy.type=smolvla \
  --policy.pretrained_path=edge-inference/smolvla-so101-pick-orange \
  --policy.vlm_model_name=HuggingFaceTB/SmolVLM2-500M-Video-Instruct \
  --port=8080

# Run the evaluation against the server on the same host and port
python scripts/evaluation/policy_inference.py \
  --task=LeIsaac-SO101-PickOrange-v0 \
  --policy_type=lerobot-smolvla \
  --policy_host=localhost \
  --policy_port=8080 \
  --policy_checkpoint_path=edge-inference/smolvla-so101-pick-orange \
  --policy_action_horizon=50 \
  --policy_language_instruction="Pick up the orange and place it on the plate" \
  --eval_rounds=10 \
  --device=cuda \
  --enable_cameras
from huggingface_hub import snapshot_download

# Download the checkpoint from the single-rank branch into a local directory
snapshot_download(
    "edge-inference/smolvla-so101-pick-orange",
    revision="single-rank",
    local_dir="./checkpoint-single-rank",
)
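
After a download it can be useful to confirm that the files this card lists are actually present. The helper below is a hypothetical sketch, not part of `huggingface_hub` or LeRobot: it globs a local directory for the file names given in this card and reports which patterns matched nothing.

```python
from pathlib import Path

# Glob patterns taken from the file list in this model card.
EXPECTED_PATTERNS = [
    "model.safetensors",
    "config.json",
    "train_config.json",
    "policy_preprocessor*.json",
    "policy_postprocessor*.json",
]


def missing_checkpoint_files(local_dir):
    """Return the patterns that match no file directly inside local_dir."""
    root = Path(local_dir)
    if not root.is_dir():
        return list(EXPECTED_PATTERNS)
    return [p for p in EXPECTED_PATTERNS if not any(root.glob(p))]


# Directory name matches the local_dir used in the download call above.
missing = missing_checkpoint_files("./checkpoint-single-rank")
print("missing:", missing or "none")
```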
The training data was collected via teleoperation inside the LeIsaac simulation (Isaac Sim), so the training and evaluation environments share the same simulator and there is no visual domain gap between them.
- model.safetensors -- Model weights (1.2 GB)
- config.json -- Policy architecture config
- train_config.json -- Full training configuration (reproducible)
- policy_preprocessor*.json/safetensors -- Input normalization (state mean/std)
- policy_postprocessor*.json/safetensors -- Output denormalization (action mean/std)

Base model: lerobot/smolvla_base
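
The preprocessor and postprocessor files store mean/std statistics, so the transforms they describe can be sketched as a standard z-score and its inverse. This is an illustrative sketch only: the statistics below are made-up placeholders, the `eps` term is an assumption, and the real LeRobot processors may handle details differently.

```python
def normalize(values, mean, std, eps=1e-8):
    """Map raw values to zero-mean, unit-variance space, element-wise."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]


def denormalize(values, mean, std, eps=1e-8):
    """Invert normalize(): map model outputs back to raw units."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]


# Hypothetical statistics for a 6D joint vector; the real ones ship in the
# policy_preprocessor / policy_postprocessor files.
mean = [0.0, 10.0, -5.0, 20.0, 0.0, 15.0]
std = [30.0, 25.0, 20.0, 15.0, 40.0, 10.0]
state = [12.0, -3.0, 8.5, 41.0, -20.0, 30.0]

normalized = normalize(state, mean, std)
roundtrip = denormalize(normalized, mean, std)

# Denormalizing a normalized vector recovers the original values.
assert all(abs(a - b) < 1e-9 for a, b in zip(state, roundtrip))
print([round(x, 3) for x in normalized])
```

Input states pass through the first transform before reaching the policy, and predicted actions pass through the second before being sent to the arm.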