# SmolVLM-256M-Instruct-GGUF

GGUF conversion of HuggingFaceTB/SmolVLM-256M-Instruct for use with llama.cpp and Ollama.
## Files

| File | Description | Size |
|---|---|---|
| SmolVLM-256M-Instruct-Q4_K_M.gguf | Main model (Q4_K_M quantized) | 119.3 MB |
| SmolVLM-256M-Instruct-f16.gguf | Main model (F16 full precision) | 312.6 MB |
| mmproj-SmolVLM-256M-Instruct-f16.gguf | Vision projector (F16 full precision) | 181.2 MB |
## Usage

### With llama.cpp

```bash
# Basic inference
./llama-mtmd-cli -m SmolVLM-256M-Instruct-f16.gguf \
    --mmproj mmproj-SmolVLM-256M-Instruct-f16.gguf \
    --image screenshot.png -p "What do you see?"

# With the quantized main model (same F16 projector)
./llama-mtmd-cli -m SmolVLM-256M-Instruct-Q4_K_M.gguf \
    --mmproj mmproj-SmolVLM-256M-Instruct-f16.gguf \
    --image screenshot.png -p "Click the Submit button"
```
### With Ollama

```
# Modelfile
FROM ./SmolVLM-256M-Instruct-Q4_K_M.gguf
PROJECTOR ./mmproj-SmolVLM-256M-Instruct-f16.gguf
PARAMETER num_ctx 4096
PARAMETER temperature 0.1
SYSTEM "You are a GUI grounding assistant. Output click coordinates as JSON."
```

```bash
# Create the model
ollama create smolvlm_256m_instruct -f Modelfile

# Run
ollama run smolvlm_256m_instruct --image screenshot.png "Click the Submit button"
```
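The created model can also be called programmatically. A minimal sketch using only the standard library, assuming a local `ollama serve` instance on the default port and the payload shape of Ollama's `/api/generate` endpoint (the model name matches the `ollama create` step above):

```python
import base64
import json
import urllib.request

def build_request(prompt: str, image_bytes: bytes) -> dict:
    """Build an Ollama /api/generate payload with a base64-encoded image."""
    return {
        "model": "smolvlm_256m_instruct",
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def generate(prompt: str, image_path: str) -> str:
    """Send one non-streaming request and return the model's text reply."""
    with open(image_path, "rb") as f:
        payload = build_request(prompt, f.read())
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

For example, `generate("Click the Submit button", "screenshot.png")` mirrors the `ollama run` invocation above.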
## For GUI Grounding (Claude Computer Use Format)

The model can be prompted to output coordinates in Claude's computer-use format:

```json
{"action": "left_click", "coordinate": [847, 523]}
```
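A reply in this shape can be consumed with a few lines of standard-library Python; the `raw` string below is a hypothetical model reply, not guaranteed output:

```python
import json

# Hypothetical model reply in Claude's computer-use format
raw = '{"action": "left_click", "coordinate": [847, 523]}'

action = json.loads(raw)
x, y = action["coordinate"]
print(f"{action['action']} at ({x}, {y})")  # → left_click at (847, 523)
```

A low temperature (as in the Modelfile above) helps keep the output parseable as strict JSON.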
## Model Details

- Base Model: HuggingFaceTB/SmolVLM-256M-Instruct
- Architecture: Idefics3 (SmolVLM)
- License: Apache 2.0
## Conversion

Converted using llama.cpp's `convert_hf_to_gguf.py`:

```bash
# Main model
python convert_hf_to_gguf.py HuggingFaceTB/SmolVLM-256M-Instruct \
    --outfile SmolVLM-256M-Instruct-f16.gguf --outtype f16

# Vision projector
python convert_hf_to_gguf.py HuggingFaceTB/SmolVLM-256M-Instruct \
    --mmproj --outfile mmproj-SmolVLM-256M-Instruct-f16.gguf --outtype f16

# Quantize the main model
./llama-quantize SmolVLM-256M-Instruct-f16.gguf SmolVLM-256M-Instruct-Q4_K_M.gguf Q4_K_M
```
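As a quick sanity check on a converted file: every GGUF file starts with the 4-byte magic `GGUF`, followed by a little-endian 32-bit format version. A minimal sketch that verifies this header:

```python
import struct

def gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise if the magic bytes are wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path}: not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version
```

For example, `gguf_version("SmolVLM-256M-Instruct-Q4_K_M.gguf")` should succeed on the quantized output above.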
## Acknowledgments

- Original model by Hugging Face (HuggingFaceTB)
- GGUF conversion tools from llama.cpp