metadata
license: cc-by-4.0
datasets:
- RussRobin/SpatialQA
language:
- en
tags:
- Embodied AI
- MLLM
- VLM
- Spatial Understanding
- Phi-2
pipeline_tag: visual-question-answering
SpatialBot is a VLM with spatial understanding and reasoning abilities: it precisely understands depth maps and uses them to perform high-level tasks.
In this HF repo, we provide the LoRA checkpoints of SpatialBot-3B, which is based on Phi-2 and SigLIP. It performs well on general VLM tasks and on spatial understanding benchmarks such as SpatialBench.
To use these LoRA weights, you will also need to download the pretrained checkpoint.
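As a rough illustration of the download step, the sketch below fetches the files of this repo with `snapshot_download` from `huggingface_hub`. The helper names (`default_local_dir`, `fetch_repo`) and the `checkpoints/` directory layout are assumptions made for this example, not part of SpatialBot. The pretrained checkpoint is not named in this card, so the sketch does not download it; see the GitHub repo for its location and for how the LoRA weights are merged or loaded.

```python
from pathlib import Path

# Repo IDs taken from this model card.
LORA_REPO_ID = "RussRobin/SpatialBot-3B-LoRA"
BENCH_REPO_ID = "RussRobin/SpatialBench"


def default_local_dir(repo_id, root="checkpoints"):
    """Map 'owner/name' to '<root>/name'. Hypothetical layout, adjust as needed."""
    return Path(root) / repo_id.split("/")[-1]


def fetch_repo(repo_id, repo_type="model", root="checkpoints"):
    """Download every file of a Hub repo and return the local directory path.

    Imported lazily so this module can be loaded without huggingface_hub
    installed (pip install huggingface_hub).
    """
    from huggingface_hub import snapshot_download

    target = default_local_dir(repo_id, root)
    return snapshot_download(
        repo_id=repo_id,
        repo_type=repo_type,  # "model" for checkpoints, "dataset" for benchmarks
        local_dir=str(target),
    )
```

Calling `fetch_repo(LORA_REPO_ID)` would place the LoRA files under `checkpoints/SpatialBot-3B-LoRA`; the same helper with `repo_type="dataset"` could fetch `BENCH_REPO_ID`. Both calls need network access.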
Paper:
https://arxiv.org/abs/2406.13642
GitHub repo:
https://github.com/BAAI-DCAI/SpatialBot
SpatialBench, the benchmark:
https://huggingface.co/datasets/RussRobin/SpatialBench