Rajkumar rawal PRO
rajkumarrawal
AI & ML interests
AI & Blockchain & Robotics
Recent Activity
reacted to their post about 1 hour ago
I submitted the paper "AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts" by @Keyu Li @Junhao shi Dequan Wang @Yang Xiao @Mohan Jiang @Jie Sun @Yunze Wu @Shijie Xia @Xiaojie Cai @Tianze Xu @Weiye Si @Wenjie Li @Pengfei Liu from Shanghai Jiao Tong University, The Hong Kong Polytechnic University, and @SII-GAIR to Daily Papers on Hugging Face.
Potentially another direction for benchmarking the frontiers of autonomous agents in 2026.
AgencyBench presents a comprehensive benchmark for evaluating autonomous agents across real-world scenarios, enabling automated evaluation through user simulation and sandbox environments while revealing performance gaps between closed-source and open-source models.
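To give a rough feel for what "automated evaluation through user simulation and sandbox environments" involves, here is a minimal sketch of such an evaluation loop. This is my own illustration, not the paper's harness: every class, method, and field below (SimulatedUser, agent.step, sandbox.execute, the task dict) is an assumed, hypothetical interface.

```python
# Minimal sketch (assumed, not from the paper) of an automated agent evaluation
# loop: a simulated user supplies the task and follow-ups, the agent's tool
# calls run inside a sandbox, and a scorer checks the resulting trace.

class SimulatedUser:
    """Replays a task brief and answers the agent's follow-up messages."""
    def __init__(self, task):
        self.task = task
        self.turns = 0

    def respond(self, agent_message):
        self.turns += 1
        # A real simulator would be an LLM conditioned on the task spec;
        # here we just hand over the brief once and then defer.
        return self.task["brief"] if self.turns == 1 else "Proceed as you see fit."

def run_episode(agent, task, sandbox, max_tool_calls=90):
    """Drive one episode in a sandbox and return the trace for scoring."""
    user = SimulatedUser(task)
    trace = []
    message = user.respond(None)
    for _ in range(max_tool_calls):
        action = agent.step(message)            # agent decides: tool call or final answer
        if action["type"] == "final_answer":
            trace.append(action)
            break
        result = sandbox.execute(action)        # tool call runs in an isolated environment
        trace.append({"action": action, "result": result})
        message = user.respond(result)
    return trace

def score(trace, task):
    """Check task-specific success criteria against the episode trace."""
    return all(criterion(trace) for criterion in task["checks"])
```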
Some of the key observations are:
-- Long-horizon tasks remain challenging:
Even frontier models struggle with sustained reasoning over real-world tasks that require 1M tokens and 90 tool calls, indicating limits in long-context autonomy.
-- Proprietary models outperform open-source models:
Closed-source models achieve a higher average score (48.4%) than open-source counterparts (32.1%), revealing a persistent performance gap on complex agentic tasks.
-- Feedback-driven self-correction varies widely:
Models like GPT 5.2 and Claude show strong gains from iterative feedback, while others (e.g., DeepSeek V3.2) exhibit minimal or no improvement after feedback (a rough sketch of such a feedback loop follows this list).
-- Efficiency trade-offs are significant:
High-performing models often consume far more tokens and time; some models (e.g., Grok 4.1 Fast) are more token-efficient despite lower absolute scores.
-- Agentic scaffolds strongly influence performance:
Models tend to perform best within their native or optimized ecosystems, highlighting that agent performance depends on tight coupling between the model and its scaffold, not the model alone.
...and many more.
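To make the self-correction point above concrete, here is a minimal sketch of the kind of feedback-driven refinement loop being probed. It is an assumed illustration, not the benchmark's code; agent.attempt, agent.revise, and verifier.check are hypothetical interfaces.

```python
# Hypothetical sketch of feedback-driven self-correction: the agent gets
# verifier/user feedback on a failed attempt and is asked to revise.
# Interfaces here are illustrative, not from the paper.

def solve_with_feedback(agent, task, verifier, max_rounds=3):
    """Iteratively refine an answer using external feedback between rounds."""
    attempt = agent.attempt(task)
    for round_idx in range(max_rounds):
        ok, feedback = verifier.check(task, attempt)   # e.g. failing tests, user complaint
        if ok:
            return attempt, round_idx + 1              # number of verification rounds used
        # Models differ sharply here: some improve with each round of feedback,
        # others return essentially the same attempt (the gap noted above).
        attempt = agent.revise(task, attempt, feedback)
    return attempt, max_rounds
```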
https://huggingface.co/papers/2601.11044
replied to their post about 1 hour ago
posted an update about 1 hour ago