---
license: mit
datasets:
- openai/gsm8k
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
---

![GRPO Training Overview](GRPO.png)

# SmolLM2-135M Fine-Tuned with GRPO on GSM8K (First 1500 Samples)

## 📌 Model Summary
This is a **SmolLM2-135M** model fine-tuned using the **Guided Reward Policy Optimization (GRPO)** technique on a subset of the **GSM8K** dataset (only the first 1500 samples due to time and memory constraints). The training was conducted on a **DPU P-100 accelerator with 21GB VRAM**.

## 📊 Training Details

### **🛠 Training Configuration**
- **Base Model:** [`HuggingFaceTB/SmolLM2-135M-Instruct`](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct)
- **Fine-Tuning Technique:** GRPO (Guided Reward Policy Optimization)
- **Dataset:** GSM8K (first 1500 samples)
- **GPU Used:** NVIDIA Tesla **P100** (21GB VRAM)
- **Precision:** `float16`
- **Optimizer:** `adamw_torch_fused`
- **Batch Size:** `8`
- **Gradient Accumulation Steps:** `2`
- **Max Prompt Length:** `128`
- **Max Completion Length:** `100`
- **Epochs:** `1`
- **Learning Rate:** `5e-6`
- **LR Scheduler:** `cosine`
- **Weight Decay:** `0.2`
- **Logging Steps:** `1`
- **FP16 Enabled:** ✅

### **🏆 Reward Functions Used**
The model was optimized using the following reward functions:
1. **`xmlcount_reward_func`**
2. **`soft_format_reward_func`**
3. **`strict_format_reward_func`**
4. **`int_reward_func`**
5. **`correctness_reward_func`**

## 📝 Dataset Details
The model was trained on a subset of the **GSM8K** dataset. The dataset was processed as follows:
- The **first 1500 samples** were selected to reduce training time.
- Each training sample consisted of a **question (prompt)** and a **ground truth answer** extracted using:
  ```python
  def extract_hash_answer(text: str) -> str | None:
      if "####" not in text:
          return None
      return text.split("####")[1].strip()
  ```
- The dataset was loaded and formatted using:
  ```python
  def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
      data = load_dataset('openai/gsm8k', 'main')[split]
      data = data.shuffle(seed=42).select(range(num_samples))  # Selecting 1500 samples
      data = data.map(lambda x: {
          'prompt': [
              {'role': 'system', 'content': SYSTEM_PROMPT},
              {'role': 'user', 'content': x['question']}
          ],
          'answer': extract_hash_answer(x['answer'])
      })
      return data
  ```

## ⚡ Performance & Limitations
- The model was **fine-tuned on limited data** (1500 samples instead of the full dataset).
- Due to **hardware constraints (P100, 21GB VRAM)**, some **training optimizations** were made to improve efficiency.
- The model is expected to perform well on **mathematical reasoning tasks** but may have **limited generalization** due to the small training set.

## 🔧 How to Use
You can use this model with **Hugging Face Transformers** as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/SmolLM2-135M-GRPO"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_length=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## 🚀 Acknowledgements
- **Hugging Face Team** for **SmolLM2-135M**
- **OpenAI GSM8K dataset**
- **GRPO fine-tuning technique** for reward-based optimization

## 📌 Future Work
- **Increase dataset size** for better generalization.
- **Optimize training on larger GPUs** (e.g., A100, H100).
- **Experiment with different reward functions** to improve accuracy.