---
license: apache-2.0
tags:
- robotics
- humanoid
- vision-language-action
- vlam
- diffusion-transformer
- pose-estimation
datasets:
- maxsegan/movement-strict-164
- maxsegan/movement-287
language:
- en
---
# MIMIC: Motion Imitation from Massive Internet Clips
A 4.0B-parameter vision-language-action model for full-body humanoid control,
trained entirely from internet-scale human video.
## Model Details
- **Architecture**: Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
- **Parameters**: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- **Action space**: 22-DoF joint angles at 10 Hz
- **Action horizon**: 16 steps (1.6 s)
- **Training data**: [movement-strict-164](https://huggingface.co/datasets/maxsegan/movement-strict-164) (164,390 clips, ~1.9M samples), produced by re-tracking and VLM-judging the raw output of our two-stage Kinetics-700 processing pipeline.
- **Best validation loss**: 0.1097
- **Training compute**: 4 x RTX Pro Blackwell, ~5.9 days (~566 GPU-hours)
- **Checkpoint step**: 29,060
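
As a quick consistency check on the figures above, the short sketch below recomputes the parameter total, the action-horizon duration, and the GPU-hour budget. All inputs are copied from the list; the variable names are illustrative only.

```python
# Figures copied from the Model Details list; names are illustrative.
components_b = {
    "truncated LLM": 2.2,
    "vision encoder": 0.415,
    "DiT action head": 1.28,
    "LoRA": 0.132,
}
total_b = sum(components_b.values())    # about 4.03, reported as ~4.0B

control_hz, horizon_steps = 10, 16
horizon_s = horizon_steps / control_hz  # seconds of actions per predicted chunk

gpus, days = 4, 5.9
gpu_hours = gpus * days * 24            # about 566

print(f"{total_b:.2f}B params, {horizon_s:.1f} s horizon, {gpu_hours:.0f} GPU-hours")
```
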
## Held-out evaluation
500 clips sampled from the held-out validation split. The metric is joint-angle RMSE in degrees over future timesteps only ($t{=}1..15$, i.e. 1.5 s of prediction). Predictions are re-initialized from the ground-truth state every $S$ timesteps.
| Re-init interval $S$ (steps) | MIMIC (this model), RMSE (°) | Static baseline, RMSE (°) | Linear baseline, RMSE (°) |
|---:|---:|---:|---:|
| 3 | 23.8 | 22.1 | 39.4 |
| 8 | 32.7 | 32.3 | 95.0 |
| 16 | **39.8** | 41.0 | 188.4 |
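
The card does not define the two baselines or the exact re-initialization convention, so the following is only a sketch under stated assumptions: the static baseline holds the ground-truth pose from the most recent re-init step, the linear baseline extrapolates that pose at constant velocity, and re-init steps fall at multiples of $S$. The function names and the `prev` argument are hypothetical. It illustrates why both baselines degrade as $S$ grows, and why extrapolation error can blow up on non-linear motion.

```python
import math

def rmse_deg(pred, target):
    """RMSE over all timesteps and joints; inputs are lists of per-step joint-angle lists."""
    sq = [(p - g) ** 2 for pf, gf in zip(pred, target) for p, g in zip(pf, gf)]
    return math.sqrt(sum(sq) / len(sq))

def rolling_baselines(gt, prev, S):
    """Assumed baseline definitions (not taken from the card).

    gt[t] is the ground-truth pose at t = 0..T-1 and prev is the pose one step before t = 0.
    Returns (static_pred, linear_pred) for the future steps t = 1..T-1.
    """
    static_pred, linear_pred = [], []
    for t in range(1, len(gt)):
        a = S * ((t - 1) // S)                 # most recent re-init step strictly before t
        before = gt[a - 1] if a > 0 else prev  # pose one step before the re-init step
        vel = [x - y for x, y in zip(gt[a], before)]
        static_pred.append(list(gt[a]))        # hold the re-init pose
        linear_pred.append([x + (t - a) * v for x, v in zip(gt[a], vel)])
    return static_pred, linear_pred

# Toy single-joint trajectory moving at a constant 10 degrees per step.
toy_gt = [[0.0], [10.0], [20.0], [30.0]]
toy_prev = [-10.0]
for S in (1, 3):
    static_pred, linear_pred = rolling_baselines(toy_gt, toy_prev, S)
    print(S, rmse_deg(static_pred, toy_gt[1:]), rmse_deg(linear_pred, toy_gt[1:]))
```

On this toy constant-velocity trajectory the linear baseline is exact and the static error grows with $S$. Real motion is not constant-velocity, which is consistent with the linear baseline's error growing fastest in the table.
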
On the high-motion top-quartile subset at $S{=}16$, MIMIC reaches **57.5°** versus 60.7° for the static baseline, a 5.3% reduction in error. Compared to a 325K-clip ablation model trained on the unfiltered intermediate corpus, MIMIC lowers all-clip RMSE from 43.7° to 39.8° (-9%) and high-motion RMSE from 63.9° to 57.5° (-10%). It also widens the model-versus-static gap from 0.1° to 1.2° on all clips (12x larger) and from 1.7° to 3.2° on the high-motion subset.
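
The relative reductions and gaps quoted above follow directly from the RMSE values. A minimal sketch of the arithmetic, using only numbers from this section (the helper name is illustrative):

```python
def relative_reduction(reference, value):
    """Percent by which `value` is lower than `reference`."""
    return 100.0 * (reference - value) / reference

# Model vs static baseline on the high-motion subset at S = 16.
print(round(relative_reduction(60.7, 57.5), 1))
# MIMIC vs the 325K-clip ablation: all clips, then high-motion.
print(round(relative_reduction(43.7, 39.8), 1))
print(round(relative_reduction(63.9, 57.5), 1))
# Model-vs-static gaps in degrees: all clips, then high-motion.
print(round(41.0 - 39.8, 1), round(60.7 - 57.5, 1))
```
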
**Performance is bimodal across activity type.** Cyclical or repeated motions (pull-ups, squats, jumping rope) are predicted much more accurately than long compositional sequences (cooking, multi-phase sports actions, dance routines). We read this as a data-coverage gap rather than a model-capacity ceiling: ingesting datasets with denser coverage of multi-step activities would likely close it.
A second model trained on [movement-287](https://huggingface.co/datasets/maxsegan/movement-287) (286,890 clips, includes lower-motion classes) is also available and reaches the same long-horizon RMSE with sharper short-horizon predictions (step-1 median 2.5° vs 2.9°).
## Usage
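```python
import torch
import yaml
from training.vla_model import VLAModel, VLAConfig

# Build the model from the config file.
with open("config.yaml") as f:
    config = yaml.safe_load(f)
model = VLAModel(VLAConfig(**config["model_config"]))

# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
state_dict = ckpt.get("model_state_dict", ckpt)
# Strip the "module." prefix added when a model is saved from a DataParallel/DDP wrapper.
state_dict = {k.removeprefix("module."): v for k, v in state_dict.items()}
# strict=False tolerates key mismatches, so print them rather than ignoring them.
result = model.load_state_dict(state_dict, strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)
model.eval().cuda()
```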
See the [GitHub repo](https://github.com/maxsegan/movement) for full inference and training code.
## Training
Training uses a flow matching loss on movement-strict-164. The vision encoder is frozen throughout, the Qwen3-VL-4B backbone is adapted with LoRA (rank 128), and the DiT action head is trained from scratch. The training set is first re-tracked through a pipeline that combines multi-frame YOLO, a Qwen oracle, and a sticky IoU tracker, and is then judged by a 235B VLM. Only clips that pass both the deterministic continuity checks and the VLM verdict on tracking consistency and motion-label match are kept.
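
For readers unfamiliar with the objective, here is a minimal single-sample sketch of a flow matching loss on a 16 x 22 action chunk. The card does not state the path or time convention, so this assumes the common linear path from Gaussian noise at $t{=}0$ to the data at $t{=}1$, with the constant target velocity `actions - noise`. `velocity_fn` is a hypothetical stand-in for the DiT action head, which in the real model is also conditioned on vision-language features.

```python
import random

HORIZON, DOF = 16, 22  # action chunk shape from the model card

def flow_matching_loss(velocity_fn, actions, rng):
    """Single-sample flow matching loss on an assumed linear noise-to-data path."""
    t = rng.random()  # t ~ U[0, 1)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in actions]
    # Point on the straight line from noise (t = 0) to the action chunk (t = 1).
    x_t = [[(1 - t) * n + t * a for n, a in zip(nr, ar)]
           for nr, ar in zip(noise, actions)]
    # The velocity of that path is constant in t: data minus noise.
    target = [[a - n for n, a in zip(nr, ar)] for nr, ar in zip(noise, actions)]
    pred = velocity_fn(x_t, t)
    sq = [(p - v) ** 2 for pr, vr in zip(pred, target) for p, v in zip(pr, vr)]
    return sum(sq) / len(sq)  # mean squared error over steps and joints

def zero_velocity(x_t, t):
    """Hypothetical untrained predictor that always outputs zero velocity."""
    return [[0.0] * len(row) for row in x_t]

toy_actions = [[0.1 * j for j in range(DOF)] for _ in range(HORIZON)]
print(flow_matching_loss(zero_velocity, toy_actions, random.Random(0)))
```

At inference time, a model trained this way would generate an action chunk by integrating the predicted velocity from noise at $t{=}0$ to $t{=}1$; the solver and step count used by MIMIC are not specified here.
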
## Citation
Paper forthcoming.
## License
Apache 2.0