Papers
arxiv:2601.19281

Gazeify Then Voiceify: Physical Object Referencing Through Gaze and Voice Interaction with Displayless Smart Glasses

Published on Jan 27
Authors:
,
,
,
,
,
,
,
,

Abstract

A multimodal system enables object selection through gaze and voice commands on displayless smart glasses by combining object segmentation, detection, and vision-language models for semantic object descriptions and error correction.

Smart glasses enhance interactions with the environment by using head-mounted cameras to observe the user's viewpoint, but lack the visual feedback used for common interactions. We introduce Gazeify then Voiceify, a multimodal approach allowing object selection via gaze and voice using displayless smart glasses. Users can select a physical object with their gaze, and the system generates a digital mask and a voice description of the object's semantics. Users can further correct errors through free-form conversation. To demonstrate our approach, we develop an interactive system by integrating advanced object segmentation and detection with a vision-language model. User studies reveal that participants achieve correct gaze selection in 53% of the task trials and use voice disambiguation to correct 58% of the remaining errors. Participants also rated the system as likable, useful, and easy to use.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2601.19281
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.19281 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.19281 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.