Abstract
PhyCritic is a multimodal critic model for physical AI tasks, trained with a two-stage RLVR pipeline that enhances perception and reasoning capabilities.
With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained on general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage reinforcement learning with verifiable rewards (RLVR) pipeline: a physical-skill warmup stage that strengthens physically oriented perception and reasoning, followed by self-referential critic finetuning, in which the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
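The self-referential judging step described above can be pictured as a two-pass prompt flow: the critic first drafts its own answer to the question, then compares the candidate responses against that draft. The sketch below is illustrative only; the `generate` callable and the prompt wording are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of a self-referential critic judging flow (illustrative; not the
# authors' code). `generate` stands in for any multimodal-LM text generation call.
from typing import Callable


def self_referential_judge(
    generate: Callable[[str], str],  # hypothetical LMM generation interface
    question: str,
    candidate_a: str,
    candidate_b: str,
) -> str:
    """Return 'A' or 'B' after first drafting the critic's own reference answer."""
    # Pass 1: the critic answers the question itself; this draft serves as an
    # internal reference that grounds the subsequent comparison.
    own_prediction = generate(
        f"Question: {question}\n"
        "Answer the question yourself, reasoning about the physical scene step by step."
    )

    # Pass 2: judge the two candidates against the internal reference.
    verdict = generate(
        f"Question: {question}\n"
        f"Your own reference answer: {own_prediction}\n"
        f"Candidate A: {candidate_a}\n"
        f"Candidate B: {candidate_b}\n"
        "Using your reference answer as a guide, decide which candidate is more "
        "physically correct. Reply with a single letter: A or B."
    )

    # Naive parse of the verdict; a real pipeline would enforce a stricter output format.
    return "A" if verdict.strip().upper().endswith("A") else "B"
```

The same two-pass structure could equally emit numerical scores or explanatory justifications instead of a pairwise preference, matching the output formats the abstract mentions.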
Community
A multimodal critic model that unifies physical judging and reasoning.
The following papers were recommended by the Semantic Scholar API
- CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation (2026)
- Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains (2025)
- Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis (2026)
- SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models (2026)
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice (2026)
- SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning (2025)
- Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding (2026)