Do thinking traces make Language Models learn better? Curious what others think.
**Scenario** You take an instruction-following LM. You want to train it with a GRPO-style RL algorithm on a task like Tic Tac Toe. Rewards are outcome-based, applied only at the end of each episode: win/loss/draw, format adherence...
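To make the scenario concrete, here is a minimal sketch of what such an outcome-based reward could look like; the `<move>` tag format, the function name, and the reward values are illustrative assumptions, not taken from any specific implementation.

```python
import re

# Illustrative outcome-based reward for one Tic Tac Toe episode.
# The <move> tag format and the reward values are assumptions for this sketch.
def episode_reward(completion: str, game_result: str) -> float:
    """Reward assigned once, at the end of the episode."""
    reward = 0.0

    # Format adherence: the final move must appear inside <move>...</move> tags.
    if re.search(r"<move>\s*[1-9]\s*</move>", completion):
        reward += 0.5

    # Outcome: win / draw / loss, as decided by the game engine (not shown here).
    reward += {"win": 1.0, "draw": 0.25, "loss": -1.0}.get(game_result, 0.0)
    return reward
```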
During training, the model could just output answers, but a common choice is to make it also output thinking traces.
**The question** Does forcing the model to produce thinking traces during training actually improve learning?
I'd like to hear your thoughts. Share ideas and links to relevant papers and resources.
From what I've understood so far, the answer seems to be **Yes**.
1️⃣ If you force the model to think during training, it becomes a model that thinks at inference time. It naturally allocates more budget (tokens) to a problem, which tends to improve performance.
2️⃣ While the model's "reasoning" already exists in its activation space, using explicit thinking traces as a scratchpad allows training to steer and shape that reasoning.
3️⃣ As the model produces more traces during training, the RL algorithm can progressively give higher rewards to the reasoning patterns that lead to better outcomes.
It's known that Language Models memorize training data that can be extracted via prompting.
In this paper, the authors investigate this aspect:
- using open-weight models, where the user can fully customize prompting, including special tokens;
- focusing on fully open models like Olmo, where the complete training data is available.
How do they extract data?
During post-training (like SFT), new tokens such as <|user|> are introduced.
The authors hypothesize that prompting the model with these tokens can make it output its alignment data (remember Magpie?).
For example, for SFT, their extraction prompt is <|endoftext|><|user|>.
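As a rough illustration, this is what that kind of extraction could look like with transformers; the model id, prompt formatting, and sampling settings here are my assumptions, not the paper's exact setup.

```python
# Sketch: sample from a post-trained model using only its special-token prefix.
# Model id and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B-SFT"  # assumption: any open SFT-tuned model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The extraction prompt is just the special tokens that open a conversation.
prompt = "<|endoftext|><|user|>\n"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# Sample many continuations; each may resemble an example seen during post-training.
out = model.generate(**inputs, do_sample=True, temperature=1.0, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=False))
```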
Evaluating memorization
The authors compare each sampled example with the original training data using vector search based on embedding similarity.
They find that many outputs are semantically very similar to the original data, even if the exact words differ.
Traditional string-matching algorithms underestimate memorization by 10x.
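A minimal sketch of the comparison idea, assuming a sentence-transformers embedder and a similarity threshold of my own choosing (the paper's exact model and threshold may differ).

```python
# Embedding similarity flags near-duplicates that exact string matching misses.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence embedder

generated = "What are the main causes of the French Revolution? Explain briefly."
original = "Briefly explain the main causes of the French Revolution."

sim = util.cos_sim(embedder.encode(generated), embedder.encode(original)).item()
is_memorized = sim > 0.9  # illustrative threshold
print(f"cosine similarity: {sim:.2f} -> flagged as memorized: {is_memorized}")
```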
What about RL?
Surprisingly, the same technique works to extract data from Reinforcement Learning (PPO/GRPO) phases.
This is counter-intuitive because the RL objective is not designed to increase sequence likelihoods (unlike SFT).
Practical limitation: in this case, extraction relies on using the initial part of the training prompt, which is not generally public.
Are the extracted data effective for post-training?
In both SFT and RL, the extracted data can be used to fine-tune models to performance similar to that of the originals.
The authors suggest that model distillation, where a stronger model is used to drive the training of a weaker one, may be a form of indirect training on the original dataset.
RL environments help LLMs practice, reason, and improve. I explored the Environments Hub and wrote a walkthrough showing how to train and evaluate models using these open environments.
DeepSeek-R1 made clear that Reinforcement Learning can be used to incentivize reasoning in LLMs. In GRPO, the model generates multiple answers and learns from rewards to prefer the better ones.
2️⃣ **What environments are** In classic RL, the environment is the world where the Agent lives, interacts, and gets rewards to learn.
We can also think of them as software packages containing data, a harness, and scoring rules, for the model to learn from and be evaluated on.
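As a mental model, such a package might expose something like the interface below; this is a generic sketch of my own, not the Verifiers API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str     # what the model sees (the data)
    reference: str  # ground truth, used only for scoring

class ToyEnv:
    """Generic sketch: data + harness + scoring rule bundled together."""

    def __init__(self):
        self.dataset = [Task("What is the capital of France?", "paris")]

    def rollout(self, task: Task, generate) -> float:
        completion = generate(task.prompt)            # harness: run the model
        return self.score(completion, task.reference)

    def score(self, completion: str, reference: str) -> float:
        # scoring rule: exact match on the normalized answer
        return 1.0 if completion.strip().lower() == reference else 0.0
```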
Nowadays, the Agent is not just the LLM. It can use tools, from a weather API to a terminal.
This makes environments for training and evaluation more complex and critical.
Big labs are advancing, but open models and the community still face a fragmented ecosystem. We risk becoming users of systems built with tools we can't access or fully understand.
4️⃣ **Environments Hub** That's why I was excited when Prime Intellect released the Environments Hub.
It's a place where people share RL environments: tasks you can use to train LLMs with RL (GRPO-style) or evaluate Agents. Plus, the Verifiers library (@willcb) standardizes the creation of RL environments and evaluations. They can help to keep science and experimentation open.
I explored the Hub and wrote a hands-on walkthrough:
- RL + LLMs basics
- Environments Hub navigation
- Evaluating models/Agents
- GRPO training of a tiny model on an alphabetical sort task (see the reward sketch below)
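Here is a minimal sketch of the kind of reward used for the alphabetical sort task; the `<answer>` tag and the partial-credit scheme are assumptions for illustration, not the walkthrough's exact code.

```python
# Sketch of an outcome reward for the alphabetical-sort task (GRPO-style).
# The tag format and partial-credit scheme are illustrative assumptions.
import re

def sort_reward(prompt_words: list[str], completion: str) -> float:
    """Score the final answer inside <answer>...</answer> against the sorted list."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not match:
        return 0.0  # no parsable answer, no reward

    predicted = [w.strip().lower() for w in match.group(1).split(",")]
    expected = sorted(w.lower() for w in prompt_words)

    if predicted == expected:
        return 1.0
    # partial credit: fraction of positions already correct
    hits = sum(p == e for p, e in zip(predicted, expected))
    return 0.5 * hits / len(expected)

print(sort_reward(["pear", "apple", "kiwi"], "<answer>apple, kiwi, pear</answer>"))
```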
In the video, the Agent:
- Goes to Hugging Face Spaces
- Finds black-forest-labs/FLUX.1-schnell
- Expands a short prompt ("my holiday on Lake Como") into a detailed image generation prompt
- Waits for the image
- Returns the image URL
## What else can it do?

Great for information gathering and summarization:
- Compare news websites and create a table of shared stories with links
- Find content creator social profiles from YouTube videos
- Find a product's price range on Amazon
- Gather public transportation travel options
## How is it built?

- Haystack: Agent execution logic
- Google Gemini 2.5 Flash: good and fast LLM with a generous free tier
- Playwright MCP server: browser automation tools (navigate, click, type, wait...)
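A rough sketch of the wiring, under some assumptions: the Agent and MCPToolset import paths and parameters follow Haystack's agent and MCP integration as I understand them, and the chat generator is swapped to OpenAIChatGenerator for brevity (the actual setup uses Gemini 2.5 Flash via the Google integration). Check the walkthrough for the exact code.

```python
# Sketch only: import paths and parameter names are assumptions, see the note above.
from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack_integrations.tools.mcp import MCPToolset, StdioServerInfo  # assumed path

# Browser automation tools exposed by the Playwright MCP server
browser_tools = MCPToolset(
    server_info=StdioServerInfo(command="npx", args=["@playwright/mcp@latest", "--headless"])
)

agent = Agent(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),  # the post uses Gemini 2.5 Flash
    tools=browser_tools,
    system_prompt="You are a web agent. Use the browser tools to complete the task.",
)
agent.warm_up()

result = agent.run(messages=[ChatMessage.from_user(
    "Go to Hugging Face Spaces, find black-forest-labs/FLUX.1-schnell "
    "and generate an image of my holiday on Lake Como."
)])
print(result["messages"][-1].text)
```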
Even without vision capabilities, this setup can get quite far.
## Next steps

- Try a local open model
- Move from notebook to real deployment
- Incorporate vision
And you? Have you built something similar? What's in your stack?
The latest release of the Haystack OSS LLM framework adds a long-requested feature: image support!
Notebooks below
This isn't just about passing images to an LLM. We built several features to enable practical multimodal use cases.
What's new?
- Support for multiple LLM providers: OpenAI, Amazon Bedrock, Google Gemini, Mistral, NVIDIA, OpenRouter, Ollama and more (support for Hugging Face API coming)
- Prompt template language to handle structured inputs, including images
- PDF and image converters
- Image embedders using CLIP-like models
- LLM-based extractor to pull text from images
- Components to build multimodal RAG pipelines and Agents
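For instance, passing an image to a chat LLM might look roughly like this; I'm assuming the ImageContent dataclass and the content_parts argument from this release, so double-check the names against the release notes.

```python
# Sketch: text + image in a single user message (names assumed from the release).
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage, ImageContent

image = ImageContent.from_file_path("invoice.png")  # hypothetical local image
message = ChatMessage.from_user(
    content_parts=["What is the total amount on this invoice?", image]
)

llm = OpenAIChatGenerator(model="gpt-4o-mini")
print(llm.run(messages=[message])["replies"][0].text)
```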
I had the chance to lead this effort with @sjrhuschlee (great collab).
How do you ensure your AI application is safe from harmful or inappropriate user inputs?
This is a core requirement for real-world AI deployments. Luckily, several open Language Models are built specifically for safety moderation.
I've been exploring them and put together a hands-on tutorial using the Haystack framework to build your own AI guardrails.
In the notebook, you'll learn how to use and customize:
- Meta Llama Guard (via Hugging Face API)
- IBM Granite Guardian (via Ollama), which can also evaluate RAG-specific risk dimensions
- Google ShieldGemma (via Ollama)
- the NVIDIA NemoGuard model family, including a model for topic control
You'll also see how to integrate content moderation into a RAG pipeline.
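The general pattern is a safety check that gates the pipeline. Here is a generic sketch using Llama Guard through the huggingface_hub InferenceClient; the model id and the "safe"/"unsafe" parsing are assumptions based on Llama Guard's documented output, and this is not the tutorial's exact Haystack pipeline.

```python
# Generic safety gate in front of a RAG pipeline (not the tutorial's exact code).
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Llama-Guard-3-8B")  # assumed model id

def is_safe(user_input: str) -> bool:
    out = client.chat_completion(
        messages=[{"role": "user", "content": user_input}], max_tokens=20
    )
    # Llama Guard answers "safe" or "unsafe" followed by the violated categories.
    return out.choices[0].message.content.strip().lower().startswith("safe")

query = "Summarize our refund policy for a customer."
if is_safe(query):
    ...  # run the RAG pipeline
else:
    print("Sorry, I can't help with that request.")
```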
Free up space on the Hub with super_squash_history
As you may know, Hugging Face Hub has storage limits on private repos (100 GB for free users, 1 TB for PROs).
This weekend I did some cleanup of my private repos and went from 1.58 TB down to 1 GB.
Besides deleting old, unused models, the main tool I used was a lesser-known command: super_squash_history.
When you train a model, you often push multiple checkpoints to the Hub. Each checkpoint = a commit. A 2.6B model in BF16 is ~5 GB. So 10 checkpoints = 50 GB. That adds up fast.
While full commit history can be useful for rollbacks, it's often unnecessary for older experiments where only the final model matters.
In these cases, you can use super_squash_history: it reduces your entire repo history to a single commit.
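With huggingface_hub it takes a couple of lines; the repo name below is hypothetical, and keep in mind the squash is irreversible.

```python
# Squash the full commit history of a repo into a single commit (irreversible!).
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already authenticated (e.g. `huggingface-cli login`)
api.super_squash_history(
    repo_id="your-username/your-model",  # hypothetical repo
    repo_type="model",
)
```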
I am fascinated by models learning from prompts and rewards, with no example answers needed, unlike in Supervised Fine-Tuning.
After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...
I wanted a different challenge, like **teaching a model to create a schedule from a list of events and priorities**.
Choosing an original problem forced me to:
- Think about the problem setting
- Generate data
- Choose the right base model
- Design reward functions (and experience reward hacking)
- Run multiple rounds of training, hoping that my model would learn something
I am happy to release two new language models for the Italian Language!
Gemma 2 9B Neogenesis ITA (anakin87/gemma-2-9b-neogenesis-ita)
Building on the impressive work by VAGO Solutions, I applied Direct Preference Optimization with a mix of Italian and English data. Using Spectrum, I trained 20% of model layers.
Evaluated on the Open ITA LLM leaderboard (mii-llm/open_ita_llm_leaderboard), this model achieves strong performance. To beat it on this benchmark, you'd need a 27B model.
Gemma 2 2B Neogenesis ITA (anakin87/gemma-2-2b-neogenesis-ita)
This smaller variant is fine-tuned from the original Gemma 2 2B it by Google. Through a combination of Supervised Fine-Tuning and Direct Preference Optimization, I trained 25% of the layers using Spectrum.
Compared to the original model, it shows improved Italian proficiency, good for its small size.
Hey, it has been a while... I was busy participating in a Gemma competition!
Here's the idea: Gemma open models have a large vocabulary size (256K), so improving them for a specific language or cultural context should be pretty affordable - no need for continued pre-training.
In this notebook, I show how I improve the performance of Gemma 2 2B on Italian via Post-Training. I believe this method is adaptable to other languages and model sizes.
Key steps:
- Choose reference metrics
- Data curation for Instruction Fine Tuning: identify existing datasets + generate synthetic data
- Efficient Instruction Fine Tuning with Spectrum
- Data curation for Preference Tuning: identify existing datasets + generate synthetic data
- Efficient Direct Preference Optimization with Spectrum
- Evaluation
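On the Spectrum steps: the idea is to compute a signal-to-noise ratio per module and train only the most informative fraction. A minimal sketch of the freezing part, where the selected module names are hypothetical stand-ins for what Spectrum's analysis would return.

```python
# Freeze everything, then unfreeze only the modules selected by Spectrum's SNR analysis.
# The module names below are hypothetical placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

selected_modules = [
    "model.layers.10.mlp.down_proj",
    "model.layers.18.self_attn.v_proj",
]

for name, param in model.named_parameters():
    param.requires_grad = any(name.startswith(m) for m in selected_modules)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```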
I'm also planning a Gemma Giveaway (on LinkedIn: https://www.linkedin.com/in/stefano-fiorucci) in the next few days, sharing techniques, datasets, and models I used for my project... so stay tuned!
Some time ago OpenAI published Swarm: an educational framework for building multi-agent systems.
Their approach focuses on two main concepts:
- Routines: each agent follows specific instructions and uses tools to execute them.
- Handoffs: agents can transfer control to one another using tool/function calling.
When I first read these ideas, I thought: simple but powerful! And they pair well with the recent unified tool support in Haystack.
So, I decided to re-implement these concepts using Haystack, and in just a few lines of code, I had a working prototype.
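Conceptually, the handoff boils down to a tool call whose result names the next agent. A bare-bones sketch of that idea (the names are mine, not the actual Haystack or Swarm code):

```python
# Bare-bones handoff: a tool call returns the name of the agent to switch to.
# Agent definitions and names are illustrative, not the actual implementation.
def transfer_to_sales_agent() -> str:
    """Tool the triage agent can call to hand off the conversation."""
    return "sales_agent"

agents = {
    "triage_agent": {"instructions": "Route the user to the right department.",
                     "tools": [transfer_to_sales_agent]},
    "sales_agent": {"instructions": "Help the user buy ACME products.",
                    "tools": []},
}

current = "triage_agent"
# Inside the chat loop: when the LLM emits a tool call, execute it;
# if the tool returns an agent name, that agent takes over the conversation.
tool_result = transfer_to_sales_agent()  # pretend the LLM called this tool
if tool_result in agents:
    current = tool_result
print(f"Active agent: {current}")
```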
Bonus feature: this implementation isn't tied to a single model provider - different agents can be powered by different models!
I replicated the ACME customer service example from the original article, with 3 Agents:
- Triage Agent: Llama 3.2 running on Ollama
- Sales Agent: Anthropic Claude 3.5 Sonnet
- Issues and Repairs Agent: OpenAI GPT-4o mini
Want to see the full implementation and give it a try? Check out the blog post and notebook!
Magpie is a recent technique for creating synthetic instruction datasets.
It's based on a simple but ingenious idea: if you prompt an instruction-tuned model with only a pre-query template, you can make it generate a plausible user query/instruction.
Here's an example:
- model: Llama-3-8B-Instruct
- pre-query template: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
- generated user instruction: "What are some of the responsibilities of a commercial pilot?"
You can then feed this instruction back into the same model to get the assistant response.
By repeating this process, it's possible to generate large synthetic datasets with relatively little effort.
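A compact sketch of that loop with transformers; the sampling settings are assumptions, and in practice you'd batch the generation and filter low-quality pairs.

```python
# Magpie-style generation loop (sampling settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def generate(prompt: str) -> str:
    inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=1.0, max_new_tokens=256)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# 1. The pre-query template alone makes the model invent a plausible user instruction.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
instruction = generate(pre_query).strip()

# 2. Feed the instruction back to the same model to get the assistant response.
full_prompt = f"{pre_query}{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = generate(full_prompt).strip()
print({"instruction": instruction, "response": response})
```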
The authors demonstrate that using these datasets for Supervised Fine Tuning (SFT) can yield strong performance, even competitive with the original instruct model.
Most Language Models are primarily trained on English texts, so they tend to produce data in English.
How can we overcome this?
Earlier approaches were complex or costly.
Then @mrm8488 found a simple solution: add the target language to the pre-query template. For Spanish, the template becomes "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:".
This method works for Spanish and German!
Unfortunately, it does not work well for other languages (Italian, Dutch, ...).