Papers
arxiv:2604.02296

VOID: Video Object and Interaction Deletion

Published on Apr 2 · Submitted by Ta-Ying Cheng on Apr 3

Abstract

VOID is a video object removal framework that uses vision-language models and video diffusion models to generate physically plausible scenes by leveraging causal and counterfactual reasoning.

AI-generated summary

Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
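The two-stage inference procedure described above can be sketched in code. Note this is a minimal illustrative sketch, not the authors' implementation: every name here (`Region`, `identify_affected_regions`, `remove_object_counterfactually`, the dict-based frame format) is a hypothetical stand-in, and the VLM and diffusion steps are replaced by stubs that only tag which regions would need to be regenerated.

```python
# Hypothetical sketch of the VOID inference pipeline from the abstract:
# (1) a vision-language model flags spatio-temporal regions affected by
# the removed object; (2) those regions guide a video diffusion model
# that resynthesizes physically consistent outcomes. Both stages are
# stubbed out here; all names are illustrative, not the real API.

from dataclasses import dataclass


@dataclass
class Region:
    """A spatio-temporal region (frame index + bounding box) judged to
    be physically affected by the removed object."""
    frame: int
    box: tuple  # (x0, y0, x1, y1)


def identify_affected_regions(video, removed_object):
    """Stand-in for the VLM step: return regions whose dynamics depend
    on the removed object (e.g. objects it would have collided with)."""
    regions = []
    for t, frame in enumerate(video):
        for obj in frame["objects"]:
            if removed_object in obj.get("interacts_with", []):
                regions.append(Region(frame=t, box=obj["box"]))
    return regions


def remove_object_counterfactually(video, removed_object):
    """Stand-in for the full pipeline: drop the object from every frame
    and tag the affected regions the diffusion model would regenerate
    (the actual generative step is omitted)."""
    affected = identify_affected_regions(video, removed_object)
    edited = []
    for t, frame in enumerate(video):
        kept = [o for o in frame["objects"] if o["name"] != removed_object]
        needs_resynthesis = [r.box for r in affected if r.frame == t]
        edited.append({"objects": kept, "resynthesize": needs_resynthesis})
    return edited
```

The key design point the abstract emphasizes is that the mask is not limited to the removed object's own pixels: regions that merely *interacted* with it (collisions, contact forces) are also handed to the generative model, which is what lets the output stay physically consistent.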

Community

Paper submitter

We present VOID, a video object removal framework designed to perform physically plausible inpainting in these complex scenarios.
Check out the demo here: https://huggingface.co/spaces/sam-motamed/VOID.


Wrote up a summary of this one at https://arxivexplained.com/p/void-video-object-and-interaction-deletion. The whole idea of cleanly deleting objects and their interactions from video without leaving artifacts is tricky; the way they handle the temporal consistency side is what makes it work.


That is interesting. However, the link you provided isn't working; you might want to take a quick look at it.


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 3

Collections including this paper 3