🔥 GRM2: the small one that surpasses the big ones.
What if a 3B-parameter model could beat a 32B-parameter model on every benchmark? We prove that it can.
GRM2 is a 3B-parameter model based on the Llama architecture, trained for long reasoning and high performance on complex tasks. It is the first 3B-parameter model to outperform Qwen3-32B on ALL benchmarks and to outperform o3-mini on almost all of them. It is also the first 3B-parameter model to generate over 1,000 lines of code and to score 39.0 on xBench-DeepSearch-2510.
🤗 Model: OrionLLM/GRM2-3b
Over the past year, we've seen a shift in LLM Post-Training. Previously, Supervised Fine-Tuning was the most important part: making models imitate curated Question-Answer pairs.
Now we also have Reinforcement Learning with Verifiable Rewards. With techniques like GRPO, models can learn through trial and error in dynamic environments. They can climb to new heights without relying on expensively prepared data.
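To make the "group" idea in GRPO concrete, here is a minimal sketch of the group-relative advantage step: several completions are sampled for the same prompt, each gets a verifiable reward, and every reward is normalized against its own group. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; real trainers differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt.

    Illustrative sketch only: `eps` and the population standard deviation
    are assumptions, not the settings of any particular trainer.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions that beat the group average get a positive advantage,
    # those below it get a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

If every completion in a group gets the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the difficulty of an environment matters.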
But what actually are these environments in practice? And how do you build them effectively?
Fascinated by these concepts, I spent time exploring this space through experiments, post-training Small Language Models. I've packaged everything I learned into this short course.
What you'll learn
🔹 Agents, Environments, and LLMs: how to map Reinforcement Learning concepts to the LLM domain
🔹 How to use Verifiers (open-source library by Prime Intellect) to build RL environments as software artifacts
🔹 Common patterns: how to build single-turn, multi-turn, and tool-use environments
🔹 Hands-on: turn a small language model (LFM2-2.6B by LiquidAI) into a Tic Tac Toe master
🔸 Build the game Environment
🔸 Use it to generate synthetic data for SFT warm-up
🔸 Group-based Reinforcement Learning
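As a rough idea of what a multi-turn game environment with a verifiable reward can look like, here is a minimal plain-Python sketch. It does not use the Verifiers API; the class name `TicTacToeEnv`, the random opponent, the text action format (a cell index from 0 to 8), and the reward values are all assumptions chosen for illustration.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

class TicTacToeEnv:
    """Hypothetical environment: the model plays X, a random opponent plays O."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.board = [" "] * 9

    def reset(self):
        self.board = [" "] * 9
        return self.render()

    def render(self):
        # Text observation that would be placed in the model's prompt.
        return "\n".join("|".join(self.board[i:i + 3]) for i in (0, 3, 6))

    def legal_moves(self):
        return [i for i, cell in enumerate(self.board) if cell == " "]

    def step(self, action_text):
        """Apply the model's move, then the opponent's. Returns (obs, reward, done)."""
        try:
            move = int(action_text.strip())
        except ValueError:
            return self.render(), -1.0, True   # unparseable output (assumed penalty)
        if move not in self.legal_moves():
            return self.render(), -1.0, True   # illegal move (assumed penalty)
        self.board[move] = "X"
        if winner(self.board) == "X":
            return self.render(), 1.0, True    # win
        if not self.legal_moves():
            return self.render(), 0.0, True    # draw
        self.board[self.rng.choice(self.legal_moves())] = "O"
        if winner(self.board) == "O":
            return self.render(), -1.0, True   # loss
        return self.render(), 0.0, False       # game continues

env = TicTacToeEnv()
observation = env.reset()
observation, reward, done = env.step("4")      # the model takes the center cell
```

The reward is verifiable because it is computed from the game rules alone, with no human labels or judge model. The same loop could also be run with a scripted player to collect game transcripts as warm-up data.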
If you're interested in building "little worlds" where LLMs can learn, this course is for you.