This is a very important and timely initiative. It’s easy to get lost in the sea of leaderboards, each with its own format and reporting style. The Inspect AI log format brings much‑needed standardization, and having Hugging Face host evaluation logs is a real game changer. One reason many valuable benchmarks fade away is that original contributors often lack the resources to continuously maintain leaderboards. The Community Evals initiative has tremendous potential to address this gap, and I truly appreciate the effort behind it.
We’re hoping to include our planning benchmark, ACPBench, as part of this ecosystem: it’s fully compatible with Inspect AI, and the evaluation scripts are available on our GitHub.