qwen3.5-2b-synthgaze

qwen3.5-2b-synthgaze is a gaze heatmap model built from Qwen/Qwen3.5-2B-Base. The Qwen language stack is not used; this repo contains the fine-tuned Qwen3.5 vision tower plus a small convolutional heatmap head. It is a derivative model: the base Qwen weights were fine-tuned for gaze heatmap prediction, so it takes an image as input and produces a 96x54 heatmap instead of generating language tokens.

It was trained on approximately 27k pairs of screenshots and gaze points. Inputs are expected as a 768x432 screenshot canvas. The model outputs a 96x54 (width x height) gaze probability heatmap and a normalized (x, y) gaze point computed by soft-argmax over that heatmap.
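As a rough illustration of the soft-argmax step, the sketch below takes a 2D heatmap and returns its probability-weighted mean position as a normalized (x, y) point. The function name `soft_argmax` and the normalization by `width - 1` and `height - 1` are assumptions chosen to match the pixel conversion in the usage example; the model's own head may normalize differently.

```python
def soft_argmax(heatmap):
    """Return the probability-weighted mean (x, y) of a 2D heatmap, in [0, 1].

    `heatmap` is a list of rows (height x width) of non-negative weights.
    Assumption: coordinates are normalized by (width - 1) and (height - 1),
    so 0.0 is the first cell and 1.0 is the last cell along each axis.
    """
    height = len(heatmap)
    width = len(heatmap[0])
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must contain positive mass")

    # Expected column index and expected row index under the heatmap.
    x_mean = sum(col * v for row in heatmap for col, v in enumerate(row)) / total
    y_mean = sum(r * sum(row) for r, row in enumerate(heatmap)) / total

    return x_mean / (width - 1), y_mean / (height - 1)


# Toy 4x5 heatmap with all mass in row 2, column 3.
toy = [[0.0] * 5 for _ in range(4)]
toy[2][3] = 1.0
x_norm, y_norm = soft_argmax(toy)
print(x_norm, y_norm)
```

Unlike a hard argmax, the weighted mean can fall between cells, so the predicted point is not limited to the 8-pixel grid implied by a 96x54 heatmap over a 768x432 canvas.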

Validation snapshot on the held-out hash split: 115.4 px mean error on 768x432 frames, or 13.09% of the image diagonal.
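The two figures above are related by the frame diagonal. A minimal sketch of that conversion follows; the helper name `error_as_percent_of_diagonal` is hypothetical. With the rounded 115.4 px figure the result comes out at roughly 13.1%, which agrees with the reported 13.09% to within rounding of the pixel value.

```python
import math


def error_as_percent_of_diagonal(error_px, width, height):
    """Express a pixel error as a percentage of the frame diagonal."""
    diagonal = math.hypot(width, height)  # sqrt(width**2 + height**2)
    return 100.0 * error_px / diagonal


pct = error_as_percent_of_diagonal(115.4, 768, 432)
print(f"{pct:.2f}% of the diagonal")
```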

Example Output

[Figure: a 768x432 model input with the predicted gaze heatmap overlaid.]

Use with Transformers

This requires a recent transformers build with Qwen3.5 vision support.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

repo = "ncky/qwen3.5-2b-synthgaze"

# Load the image processor and the model (the heatmap head ships as remote code).
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

# Resize the screenshot to the 768x432 canvas the model expects.
image = Image.open("screenshot.png").convert("RGB").resize((768, 432))
inputs = processor(images=[image], return_tensors="pt")
inputs = {key: value.to("cuda") for key, value in inputs.items()}

# Run inference without gradients, in bfloat16.
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model(**inputs)

heatmap = output.heatmaps[0]      # torch.Size([54, 96])
# Normalized soft-argmax point, converted to pixel coordinates on the canvas.
x_norm, y_norm = output.coords_norm[0].tolist()
x_px = x_norm * (image.width - 1)
y_px = y_norm * (image.height - 1)

For non-16:9 screenshots, letterbox into a 768x432 black canvas before preprocessing and reproject the heatmap back to the original image if needed.
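The letterbox geometry and the mapping back to the original image could look like the sketch below. The helper names `letterbox_geometry` and `canvas_to_source` are hypothetical, and centering the resized image on the canvas is an assumption; if training used a different placement (for example top-left), the offsets should follow that instead.

```python
def letterbox_geometry(src_w, src_h, canvas_w=768, canvas_h=432):
    """Compute how a source image fits inside the canvas with its aspect ratio kept.

    Returns (scale, new_w, new_h, pad_x, pad_y), where pad_x and pad_y are the
    left and top black borders. Assumption: the image is centered on the canvas.
    """
    scale = min(canvas_w / src_w, canvas_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_x = (canvas_w - new_w) // 2
    pad_y = (canvas_h - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def canvas_to_source(x_canvas, y_canvas, scale, pad_x, pad_y):
    """Map a point in canvas pixels back to pixel coordinates in the source image."""
    return (x_canvas - pad_x) / scale, (y_canvas - pad_y) / scale


# Example: a 4:3 screenshot (1024x768) placed on the 768x432 canvas.
scale, new_w, new_h, pad_x, pad_y = letterbox_geometry(1024, 768)
x_src, y_src = canvas_to_source(384, 216, scale, pad_x, pad_y)
print(new_w, new_h, pad_x, pad_y, x_src, y_src)
```

With Pillow, the canvas itself can be built with `Image.new("RGB", (768, 432))`, which is black by default, followed by `canvas.paste(image.resize((new_w, new_h)), (pad_x, pad_y))`. Predicted points that land in the black borders map outside the source image, so callers may want to clamp them.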

Local Browser Demo

The included server.py serves a simple browser capture page that uses getDisplayMedia() and the local model:

python server.py

Then open http://127.0.0.1:8765/capture in a browser.

Choose the target monitor in the browser screen-share dialog. The page sends frames to the local server and draws the returned heatmap over the live captured screen.
