Inference speed

#8
by tintwotin - opened

Running Find on a 2-minute 1920x832 video takes 459.15 s on an RTX 4090. Can anything be done to speed it up, such as downscaling the video beforehand? Or is a turbo version planned?

Nemo Station org

459s for a 2-min 1920×832 clip is on the slow end but expected at that resolution. Two things you can try:

  1. Pre-downscale the video. 1920×832 is roughly 8× over the model's per-frame pixel budget (we cap at ~200K pixels via smart_resize internally). The internal resize handles it, but at decode cost. Downscaling to ~640×270 before sending to the model cuts the visual-encoder time substantially without hurting accuracy for grounding-style queries.
  2. Quantise the weights. On a 4090, AWQ-quantised weights + bf16 KV-cache typically give 3-4× throughput vs vanilla bf16. We haven't shipped a quantized checkpoint ourselves yet, but you can do this in a half-hour with llm-compressor or AutoAWQ. If you do, we'd be curious what mIoU you get on TimeLens-Bench to compare against our bf16 numbers.
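
For the pre-downscale step, here is a rough sketch of how you might pick a target size that fits under the ~200K pixel cap and build an ffmpeg command for it. The helper names, the file names, and the rounding to even dimensions are assumptions for illustration (many encoders want even width and height); the model's own smart_resize may round differently.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels=200_000, multiple=2):
    """Shrink (width, height) so that w * h <= max_pixels.

    Aspect ratio is approximately preserved. Both sides are rounded DOWN
    to a multiple of `multiple` (an assumption, not the model's own rule).
    """
    if width * height <= max_pixels:
        scale = 1.0  # already inside the budget, only apply the rounding
    else:
        scale = math.sqrt(max_pixels / (width * height))
    new_w = max(multiple, math.floor(width * scale / multiple) * multiple)
    new_h = max(multiple, math.floor(height * scale / multiple) * multiple)
    return new_w, new_h


def build_ffmpeg_cmd(src, dst, width, height):
    """Return an ffmpeg argument list that rescales src to width x height."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"scale={width}:{height}",  # ffmpeg's scale video filter
        "-c:a", "copy",                    # leave the audio stream untouched
        dst,
    ]


if __name__ == "__main__":
    w, h = fit_to_pixel_budget(1920, 832)
    # "input.mp4" and "small.mp4" are placeholder file names.
    print(" ".join(build_ffmpeg_cmd("input.mp4", "small.mp4", w, h)))
```

Pass the resulting list to `subprocess.run(cmd, check=True)` to actually transcode. The ~640×270 suggested above already fits the budget, so the helper leaves it unchanged.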

No "turbo" variant planned: the model is already 2B params, so the realistic speedup path is inference-side, not architectural.

I tried downscaling and it didn't help. I would like to add it to my Pallaidium AI add-on for Blender, but currently it is simply too slow for me. Will check in later to see if something has improved. Thank you.
