Rajkumar Rawal PRO

rajkumarrawal

AI & ML interests

AI & Blockchain & Robotics

Recent Activity

reacted to their post with 👍 about 1 hour ago
I submitted the paper "AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts" by @Keyu Li, @Junhao shi, Dequan Wang, @Yang Xiao, @Mohan Jiang, @Jie Sun, @Yunze Wu, @Shijie Xia, @Xiaojie Cai, @Tianze Xu, @Weiye Si, @Wenjie Li, and @Pengfei Liu from Shanghai Jiao Tong University, The Hong Kong Polytechnic University, and @SII-GAIR to Daily Papers on Hugging Face.

Potentially another direction for benchmarking the frontiers of autonomous agents in 2026: AgencyBench presents a comprehensive benchmark for evaluating autonomous agents across real-world scenarios, enabling automated evaluation through user simulation and sandbox environments while revealing performance gaps between closed-source and open-source models.

Some of the key observations:

- Long-horizon tasks remain challenging: even frontier models struggle with sustained reasoning over real-world tasks that require 1M tokens and 90 tool calls, indicating limits in long-context autonomy.
- Proprietary models outperform open-source models: closed-source models achieve a higher average score (48.4%) than open-source counterparts (32.1%), revealing a persistent performance gap on complex agentic tasks.
- Feedback-driven self-correction varies widely: models like GPT 5.2 and Claude show strong gains from iterative feedback, while others (e.g. DeepSeek V3.2) exhibit minimal or no improvement after feedback.
- Efficiency trade-offs are significant: high-performing models often consume far more tokens and time; some models (e.g. Grok 4.1 Fast) are more token-efficient despite lower absolute scores.
- Agentic scaffolds strongly influence performance: models tend to perform best within their native or optimized ecosystems, highlighting that agent performance depends on tight coupling between the model and its scaffold, not on the model alone.

...and many more: https://huggingface.co/papers/2601.11044
replied to their post about 1 hour ago
posted an update about 1 hour ago

Organizations

MLX Community · ONNX Community · Hugging Face Discord Community · LiteRT Community (FKA TFLite) · LeRobot Worldwide Hackathon · Hugging Face MCP Course · Tech Parivartan · Agents-MCP-Hackathon · Robotics Course · MCP-1st-Birthday · nanochat students · AI Parivartan Research Lab