FantasyVLN
The model weights of FantasyVLN.
FantasyVLN is a unified multimodal Chain-of-Thought (CoT) reasoning framework that enables efficient and precise navigation based on natural language instructions and visual observations. FantasyVLN combines the benefits of textual, visual, and multimodal CoT reasoning by constructing a unified representation space across these reasoning modes. To enable efficient reasoning, we align these CoT reasoning modes with non-CoT reasoning during training, while using only non-CoT reasoning at test time. Notably, we perform visual CoT in the latent space of a VAR model, where only low-scale latent representations are predicted. Compared to traditional pixel-level visual CoT methods, our approach significantly improves both training and inference efficiency.
See the offical code for detail: https://fantasy-amap.github.io/fantasy-vln
- Downloads last month
- 21