---
license: apache-2.0
tags:
  - robotics
  - humanoid
  - vision-language-action
  - vlam
  - diffusion-transformer
  - pose-estimation
datasets:
  - maxsegan/movement-strict-164
  - maxsegan/movement-287
language:
  - en
---

# MIMIC: Motion Imitation from Massive Internet Clips

A 4.0B-parameter vision-language-action model for full-body humanoid control,
trained entirely from internet-scale human video.

## Model Details

- **Architecture**: Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
- **Parameters**: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- **Action space**: 22-DoF joint angles at 10 Hz
- **Action horizon**: 16 steps (1.6 s)
- **Training data**: [movement-strict-164](https://huggingface.co/datasets/maxsegan/movement-strict-164) (164,390 clips, ~1.9M samples), produced by re-tracking and VLM-judging the raw output of our two-stage Kinetics-700 processing pipeline.
- **Best validation loss**: 0.1097
- **Training compute**: 4 x RTX Pro Blackwell, ~5.9 days (~566 GPU-hours)
- **Checkpoint step**: 29,060
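
The headline figures in the list above can be cross-checked with a few lines of arithmetic. This sketch uses only numbers stated in this card; the variable names are illustrative.

```python
# Component sizes in parameters, as listed under Model Details.
components = {
    "truncated_llm": 2.2e9,
    "vision_encoder": 415e6,
    "dit_action_head": 1.28e9,
    "lora": 132e6,
}
total_params = sum(components.values())

# Action horizon: 16 steps at a 10 Hz control rate.
control_rate_hz = 10
horizon_steps = 16
horizon_seconds = horizon_steps / control_rate_hz

# Training compute: 4 GPUs for ~5.9 days.
gpu_hours = 4 * 5.9 * 24

print(f"total parameters: {total_params / 1e9:.2f}B")
print(f"horizon: {horizon_seconds:.1f} s")
print(f"GPU-hours: {gpu_hours:.0f}")
```

The components sum to about 4.03B parameters, the horizon comes out at 1.6 s, and the compute at roughly 566 GPU-hours, matching the rounded figures above.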

## Held-out evaluation

Evaluated on 500 clips sampled from the held-out validation split. We report joint-angle RMSE in degrees (lower is better) over future timesteps only ($t{=}1..15$, i.e. 1.5 s of prediction at 10 Hz), with rolling re-initialization from ground-truth state every $S$ timesteps.

| Step $S$ | MIMIC (this model) | Static baseline | Linear baseline |
|---:|---:|---:|---:|
| 3  | 23.8 | 22.1 | 39.4 |
| 8  | 32.7 | 32.3 | 95.0 |
| 16 | **39.8** | 41.0 | 188.4 |

The model beats the static baseline only at the longest re-init interval, $S{=}16$; at $S{=}3$ and $S{=}8$ the static baseline is slightly ahead. On the high-motion top-quartile subset at $S{=}16$, the model reaches **57.5°** versus 60.7° for the static baseline, a 5.3% reduction in error. Compared with a 325K-clip ablation model trained on the unfiltered intermediate corpus, MIMIC lowers all-clip RMSE from 43.7° to 39.8° (-9%) and high-motion RMSE from 63.9° to 57.5° (-10%). It also widens the model-versus-static gap from 0.1° to 1.2° on all clips (12x larger) and from 1.7° to 3.2° on the high-motion subset.
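
To make the comparison concrete, here is a minimal NumPy sketch of the metric and the two baselines. This card does not define the baselines, so the sketch assumes that "static" holds the last observed pose and "linear" extrapolates the last observed per-step velocity. The function names, the `(T, 22)` array layout, and the omission of rolling re-init and angle wrap-around are simplifications for illustration, not the evaluation code behind the table.

```python
import numpy as np

def rmse_deg(pred, target):
    """Root-mean-square error over all timesteps and joints, in degrees."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def static_baseline(history, horizon):
    """Assumed definition: repeat the last observed pose for every future step."""
    return np.repeat(history[-1:], horizon, axis=0)

def linear_baseline(history, horizon):
    """Assumed definition: extrapolate the last observed per-step velocity."""
    velocity = history[-1] - history[-2]
    steps = np.arange(1, horizon + 1)[:, None]
    return history[-1] + steps * velocity

# Synthetic example: 22 joints swinging sinusoidally, sampled at 10 Hz.
t = np.arange(20) / 10.0
phases = np.linspace(0.0, np.pi, 22)
angles = 30.0 * np.sin(2.0 * np.pi * 0.5 * t[:, None] + phases[None, :])

history, future = angles[:5], angles[5:20]  # 15 future steps = 1.5 s
static_rmse = rmse_deg(static_baseline(history, 15), future)
linear_rmse = rmse_deg(linear_baseline(history, 15), future)
print(f"static: {static_rmse:.1f} deg, linear: {linear_rmse:.1f} deg")
```

Under these assumed definitions, a constant-velocity extrapolation keeps moving after the true motion has slowed or reversed, which is one plausible reason the linear column in the table grows so much faster with $S$ than the other two.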

**Performance is bimodal across activity type.** Cyclical or repeated motions (pull-ups, squats, jumping rope) are predicted much more accurately than long compositional sequences (cooking, multi-phase sports actions, dance routines). We read this as a data-coverage gap rather than a model-capacity ceiling: ingesting datasets with denser coverage of multi-step activities would likely close it.

A second model trained on [movement-287](https://huggingface.co/datasets/maxsegan/movement-287) (286,890 clips, includes lower-motion classes) is also available and reaches the same long-horizon RMSE with sharper short-horizon predictions (step-1 median 2.5° vs 2.9°).

## Usage

```python
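import torch
import yaml

from training.vla_model import VLAModel, VLAConfig

# Build the model from the config that accompanies the checkpoint.
with open("config.yaml") as f:
    config = yaml.safe_load(f)
model = VLAModel(VLAConfig(**config["model_config"]))

# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
state_dict = ckpt.get("model_state_dict", ckpt)
# Strip the "module." prefix added by DataParallel/DistributedDataParallel wrappers.
state_dict = {k.removeprefix("module."): v for k, v in state_dict.items()}
# strict=False tolerates missing or unexpected keys instead of raising.
model.load_state_dict(state_dict, strict=False)
model.eval().cuda()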
```

See the [GitHub repo](https://github.com/maxsegan/movement) for full inference and training code.

## Training

The model is trained with a flow matching loss on movement-strict-164. The vision encoder is frozen throughout, the Qwen3-VL-4B backbone is adapted with LoRA (rank 128), and the DiT action head is trained from scratch. The training set is first re-tracked through a pipeline of multi-frame YOLO, a Qwen oracle, and a sticky IoU tracker, then judged by a 235B VLM. Only clips that pass both the deterministic continuity checks and the VLM verdict on tracking consistency and motion-label match are kept.
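
As a rough illustration of the objective, the sketch below computes a flow matching loss for one batch of action chunks in NumPy. This card does not state which parameterization is used, so the sketch assumes the common linear-interpolation (rectified-flow style) path between Gaussian noise and the target actions, with a velocity-regression target. `velocity_fn` is a hypothetical stand-in for the DiT action head, and conditioning on vision-language features is omitted.

```python
import numpy as np

def flow_matching_loss(velocity_fn, actions, rng):
    """Mean squared error between predicted and target velocity.

    actions: array of shape (batch, horizon, dof), e.g. (B, 16, 22).
    velocity_fn: hypothetical callable (x_t, t) -> predicted velocity,
                 standing in for the DiT action head.
    """
    batch = actions.shape[0]
    noise = rng.standard_normal(actions.shape)  # x_0 ~ N(0, I)
    t = rng.uniform(size=(batch, 1, 1))         # one time value per sample
    x_t = (1.0 - t) * noise + t * actions       # point on the straight path
    target_velocity = actions - noise           # d x_t / d t along that path
    predicted = velocity_fn(x_t, t)
    return float(np.mean((predicted - target_velocity) ** 2))

rng = np.random.default_rng(0)
actions = rng.standard_normal((8, 16, 22))      # synthetic action chunks

# A trivial predictor that always outputs zero velocity, for demonstration only.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = flow_matching_loss(zero_model, actions, rng)
print(f"loss with zero predictor: {loss:.3f}")
```

In an actual training loop the same computation would be written with differentiable tensors so that gradients flow into the action head and the LoRA weights while the frozen vision encoder is left untouched.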

## Citation

Paper forthcoming.

## License

Apache 2.0