qwen3.5-2b-synthgaze
qwen3.5-2b-synthgaze is a gaze heatmap model built from Qwen/Qwen3.5-2B-Base.
The Qwen language stack is not used; this repo contains the fine-tuned Qwen3.5 vision tower plus a small convolutional heatmap head.
This is a derivative model: the base Qwen weights were modified for gaze heatmap prediction, with image input and 96x54 heatmap output instead of language-token generation.
It was trained on approximately 27k pairs of gaze points and screenshots. Inputs are expected as a 768x432 screenshot canvas. The model outputs a 96x54 (width x height) gaze probability heatmap and a normalized (x, y) gaze point obtained by soft-argmax.
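To make the soft-argmax step concrete, here is a minimal, dependency-free sketch. It assumes the point is the softmax-weighted average of grid coordinates scaled to [0, 1]; the function name `soft_argmax` and the `temperature` parameter are illustrative and are not taken from this repo's code.

```python
import math


def soft_argmax(logits, temperature=1.0):
    """Return the expected (x, y) in [0, 1] under a softmax over a 2D grid.

    `logits` is a list of rows (height x width). Coordinates are assumed to
    run from 0.0 at the first row/column to 1.0 at the last (an assumption).
    """
    height, width = len(logits), len(logits[0])
    flat = [value / temperature for row in logits for value in row]
    peak = max(flat)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in flat]
    total = sum(weights)

    x = y = 0.0
    for index, weight in enumerate(weights):
        row, col = divmod(index, width)
        prob = weight / total
        x += prob * (col / (width - 1))
        y += prob * (row / (height - 1))
    return x, y


# 54 rows x 96 columns with one strong peak at row 27, column 72.
grid = [[0.0] * 96 for _ in range(54)]
grid[27][72] = 20.0
x_norm, y_norm = soft_argmax(grid)
```

With a sharp peak the result sits close to the peak cell. A flatter heatmap pulls the point toward the probability-weighted center, which is why a soft-argmax point can differ from the brightest cell.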
Validation snapshot on the held-out hash split: 115.4 px mean error on 768x432 frames, or 13.09% of image diagonal.
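The percentage can be checked from the pixel error and the frame diagonal. The short calculation below uses only the rounded figures quoted above, so it lands at roughly 13.1%, which agrees with the reported 13.09% to within the rounding of the mean error.

```python
import math

width, height = 768, 432
mean_error_px = 115.4  # rounded value reported above

# Diagonal of the input canvas in pixels.
diagonal_px = math.hypot(width, height)

# Mean error expressed as a percentage of that diagonal.
error_pct = 100.0 * mean_error_px / diagonal_px

print(f"diagonal: {diagonal_px:.2f} px")
print(f"mean error: {error_pct:.2f}% of diagonal")
```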
Example Output
The image below shows a 768x432 model input with the predicted gaze heatmap overlaid.
Use with Transformers
This requires a recent transformers build with Qwen3.5 vision support.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

repo = "ncky/qwen3.5-2b-synthgaze"
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

# The model expects a 768x432 (16:9) RGB canvas.
image = Image.open("screenshot.png").convert("RGB").resize((768, 432))
inputs = processor(images=[image], return_tensors="pt")
inputs = {key: value.to("cuda") for key, value in inputs.items()}

with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model(**inputs)

heatmap = output.heatmaps[0]  # torch.Size([54, 96])
x_norm, y_norm = output.coords_norm[0].tolist()

# Convert normalized coordinates to pixels on the 768x432 canvas.
x_px = x_norm * (image.width - 1)
y_px = y_norm * (image.height - 1)
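If the brightest heatmap cell is wanted instead of the soft-argmax point, the 54x96 grid can be mapped back to canvas pixels. The sketch below assumes the cells tile the 768x432 canvas uniformly, so each cell covers 8x8 pixels. The helper name `peak_to_pixels` is hypothetical, and it takes the heatmap as nested Python lists.

```python
def peak_to_pixels(heatmap, canvas_size=(768, 432)):
    """Return the canvas-pixel center (x, y) of the highest-valued cell.

    `heatmap` is a list of rows (height x width). Uniform tiling of the
    canvas by heatmap cells is an assumption, not confirmed by the repo.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    best_row, best_col = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    cell_w = canvas_size[0] / cols  # 768 / 96 = 8 px
    cell_h = canvas_size[1] / rows  # 432 / 54 = 8 px
    return (best_col + 0.5) * cell_w, (best_row + 0.5) * cell_h


# Synthetic 54x96 heatmap with its peak at row 10, column 20.
demo = [[0.0] * 96 for _ in range(54)]
demo[10][20] = 1.0
peak_x, peak_y = peak_to_pixels(demo)
```

For a tensor like the model's heatmap output, something like `heatmap.float().cpu().tolist()` would produce the nested lists this helper expects.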
For non-16:9 screenshots, letterbox into a 768x432 black canvas before preprocessing and reproject the heatmap back to the original image if needed.
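One way to do the letterboxing and the reverse mapping is sketched below. The helper names are hypothetical, centering the image on the canvas is an assumption (the training pipeline may pad differently), and Pillow is only needed by `letterbox_image`.

```python
def letterbox_geometry(src_w, src_h, canvas_w=768, canvas_h=432):
    """Return (scale, pad_x, pad_y, new_w, new_h) for an aspect-preserving fit."""
    scale = min(canvas_w / src_w, canvas_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_x = (canvas_w - new_w) // 2  # left/right black bars
    pad_y = (canvas_h - new_h) // 2  # top/bottom black bars
    return scale, pad_x, pad_y, new_w, new_h


def letterbox_image(image, canvas_w=768, canvas_h=432):
    """Paste a resized copy of a PIL image onto a centered black canvas."""
    from PIL import Image  # imported here so the geometry helpers work without Pillow

    _, pad_x, pad_y, new_w, new_h = letterbox_geometry(
        image.width, image.height, canvas_w, canvas_h
    )
    canvas = Image.new("RGB", (canvas_w, canvas_h), (0, 0, 0))
    canvas.paste(image.convert("RGB").resize((new_w, new_h)), (pad_x, pad_y))
    return canvas


def canvas_to_source(x_canvas, y_canvas, src_w, src_h, canvas_w=768, canvas_h=432):
    """Map a canvas-pixel point back to original-image pixels, clamped to bounds."""
    scale, pad_x, pad_y, _, _ = letterbox_geometry(src_w, src_h, canvas_w, canvas_h)
    x = (x_canvas - pad_x) / scale
    y = (y_canvas - pad_y) / scale
    return min(max(x, 0.0), src_w - 1), min(max(y, 0.0), src_h - 1)


# A 1600x1200 (4:3) screenshot gets 96 px black bars on the left and right.
geometry = letterbox_geometry(1600, 1200)
center_x, center_y = canvas_to_source(384, 216, 1600, 1200)
```

The same offsets can be used for the heatmap: under the uniform-tiling assumption, dividing `pad_x` and `pad_y` by 8 gives the padded heatmap columns and rows to crop before resizing to the original image.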
Local Browser Demo
The included server.py serves a simple browser capture page that uses getDisplayMedia() and the local model:
python server.py
Then open http://127.0.0.1:8765/capture in a browser.
Choose the target monitor in the browser screen-share dialog. The page sends frames to the local server and draws the returned heatmap over the live captured screen.