qwen3.5-2b-synthgaze

qwen3.5-2b-synthgaze is a gaze heatmap model derived from Qwen/Qwen3.5-2B-Base. The Qwen language stack is not used: this repo contains the fine-tuned Qwen3.5 vision tower plus a small convolutional heatmap head. This is a derivative model. The base Qwen weights were fine-tuned for gaze heatmap prediction, so the model takes an image as input and outputs a 96x54 heatmap instead of generating language tokens.

It was trained on approximately 27k pairs of screenshots and gaze points. Inputs are expected as a 768x432 screenshot canvas. The model outputs a 96x54 (width x height) gaze probability heatmap and a normalized (x, y) gaze point computed by soft-argmax over that heatmap.
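For readers unfamiliar with soft-argmax, the idea can be sketched in a few lines of NumPy: apply a softmax over all heatmap cells, then take the probability-weighted average of the cell coordinates. This is a minimal illustration, not the model's actual head. The function name, the temperature parameter, and the choice to normalize coordinates as index / (size - 1) are assumptions.

```python
import numpy as np


def soft_argmax(logits: np.ndarray, temperature: float = 1.0):
    """Return (probs, (x_norm, y_norm)) for a 2D logit map of shape (H, W).

    Hypothetical helper: the normalization convention (index / (size - 1))
    is an assumption, not taken from the model code.
    """
    h, w = logits.shape
    scaled = logits.astype(np.float64) / temperature
    scaled -= scaled.max()            # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()              # softmax over all H*W cells

    xs = np.linspace(0.0, 1.0, w)     # normalized column coordinates
    ys = np.linspace(0.0, 1.0, h)     # normalized row coordinates
    x_norm = float((probs.sum(axis=0) * xs).sum())  # expected x under probs
    y_norm = float((probs.sum(axis=1) * ys).sum())  # expected y under probs
    return probs, (x_norm, y_norm)


# A sharply peaked 54x96 map: the expected point lands on the peak cell.
logits = np.full((54, 96), -20.0)
logits[27, 48] = 20.0
probs, (x, y) = soft_argmax(logits)
```

Unlike a hard argmax, the result is differentiable and can fall between cells, which is why a 96x54 grid can still yield sub-cell coordinates.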

Validation snapshot on the held-out hash split: 115.4 px mean error on 768x432 frames, or 13.09% of image diagonal.
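The relationship between the pixel error and the diagonal percentage can be checked with a little arithmetic. The sketch below assumes the metric is the mean Euclidean distance in pixels and that the diagonal is sqrt(768^2 + 432^2), roughly 881.2 px. The function name and signature are illustrative and not part of this repo.

```python
import math


def mean_pixel_error(pred_norm, true_norm, width=768, height=432):
    """Mean Euclidean error in pixels and as a percentage of the image diagonal.

    pred_norm / true_norm: sequences of normalized (x, y) pairs.
    Assumption: normalized coordinates are scaled by (size - 1), as in the
    usage example in this README.
    """
    errors = [
        math.hypot((px - tx) * (width - 1), (py - ty) * (height - 1))
        for (px, py), (tx, ty) in zip(pred_norm, true_norm)
    ]
    mean_px = sum(errors) / len(errors)
    diagonal = math.hypot(width, height)
    return mean_px, 100.0 * mean_px / diagonal


# Converting the reported mean error into a share of the diagonal.
diagonal = math.hypot(768, 432)
percent_of_diagonal = 100.0 * 115.4 / diagonal
```

With the rounded 115.4 px figure this comes out at roughly 13.1%, so the reported 13.09% was presumably computed from the unrounded mean error.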

Example Output

The image below shows a 768x432 model input with the predicted gaze heatmap overlaid.

Example gaze heatmap overlay
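An overlay like this can be produced by upscaling the heatmap to the image size and using it as a blend mask. The following is a rough sketch using NumPy and Pillow, not the code that generated the example image. The function name, the red tint, and the alpha value are assumptions.

```python
import numpy as np
from PIL import Image


def overlay_heatmap(image, heatmap, alpha=0.6):
    """Blend a 2D heatmap (H, W) over an image as a red tint.

    Hypothetical helper: the colour and alpha are illustrative choices.
    """
    heat = np.asarray(heatmap, dtype=np.float32)
    peak = float(heat.max())
    if peak > 0:
        heat = heat / peak                      # scale so the peak is 1.0
    heat = np.clip(heat, 0.0, 1.0)

    # 8-bit mask, upscaled from the heatmap grid to the image size.
    mask = Image.fromarray((heat * 255 * alpha).astype(np.uint8))
    mask = mask.resize(image.size, Image.BILINEAR)

    red = Image.new("RGB", image.size, (255, 0, 0))
    # composite() takes `red` where the mask is bright, `image` where it is dark.
    return Image.composite(red, image.convert("RGB"), mask)


# Synthetic example: black 768x432 frame with a single hot cell.
frame = Image.new("RGB", (768, 432), (0, 0, 0))
demo_heatmap = np.zeros((54, 96), dtype=np.float32)
demo_heatmap[27, 48] = 1.0
overlay = overlay_heatmap(frame, demo_heatmap)
```

A torch heatmap would first need to be converted, for example with `heatmap.float().cpu().numpy()`.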

Use with Transformers

This requires a recent transformers build with Qwen3.5 vision support.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

repo = "ncky/qwen3.5-2b-synthgaze"

processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

image = Image.open("screenshot.png").convert("RGB").resize((768, 432))
inputs = processor(images=[image], return_tensors="pt")
inputs = {key: value.to("cuda") for key, value in inputs.items()}

with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model(**inputs)

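heatmap = output.heatmaps[0]      # gaze probability heatmap, torch.Size([54, 96]) = (height, width)
x_norm, y_norm = output.coords_norm[0].tolist()  # soft-argmax point in normalized coordinates
# Scale the normalized point to pixel coordinates on the 768x432 input canvas.
x_px = x_norm * (image.width - 1)
y_px = y_norm * (image.height - 1)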

For non-16:9 screenshots, letterbox into a 768x432 black canvas before preprocessing and reproject the heatmap back to the original image if needed.
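The letterboxing step can be sketched as follows: scale the screenshot to fit inside 768x432 while preserving aspect ratio, centre it on a black canvas, and keep the scale and offsets so predictions can be mapped back. This is a minimal sketch under those assumptions. The helper names are hypothetical and not part of this repo, and centring the image on the canvas is an assumed convention.

```python
def letterbox_geometry(src_w, src_h, dst_w=768, dst_h=432):
    """Return (scale, new_w, new_h, off_x, off_y) for a centred letterbox."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = max(1, round(src_w * scale))
    new_h = max(1, round(src_h * scale))
    off_x = (dst_w - new_w) // 2      # left padding in canvas pixels
    off_y = (dst_h - new_h) // 2      # top padding in canvas pixels
    return scale, new_w, new_h, off_x, off_y


def letterbox(image, dst_w=768, dst_h=432):
    """Paste a resized copy of `image` onto a black dst_w x dst_h canvas."""
    from PIL import Image  # imported here so the geometry helpers stay dependency-free

    scale, new_w, new_h, off_x, off_y = letterbox_geometry(
        image.width, image.height, dst_w, dst_h
    )
    canvas = Image.new("RGB", (dst_w, dst_h), (0, 0, 0))
    canvas.paste(image.convert("RGB").resize((new_w, new_h)), (off_x, off_y))
    return canvas, (scale, off_x, off_y)


def canvas_to_source(x_px, y_px, scale, off_x, off_y):
    """Map a point on the canvas back to original-image pixel coordinates."""
    return (x_px - off_x) / scale, (y_px - off_y) / scale


# A 4:3 screenshot (1600x1200) gets black bars on the left and right.
scale, new_w, new_h, off_x, off_y = letterbox_geometry(1600, 1200)
src_x, src_y = canvas_to_source(384, 216, scale, off_x, off_y)
```

Points that map outside the original image bounds fall in the padding and can be clamped or discarded, depending on the application.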

Local Browser Demo

The included server.py serves a simple browser capture page that uses getDisplayMedia() and the local model:

python server.py
http://127.0.0.1:8765/capture

Choose the target monitor in the browser screen-share dialog. The page sends frames to the local server and draws the returned heatmap over the live captured screen.
