--- license: mit datasets: - openai/gsm8k language: - en base_model: - HuggingFaceTB/SmolLM2-135M-Instruct --- ![GRPO Training Overview](GRPO.png) # SmolLM2-135M Fine-Tuned with GRPO on GSM8K (First 1500 Samples) ## 📌 Model Summary This is a **SmolLM2-135M** model fine-tuned using the **Guided Reward Policy Optimization (GRPO)** technique on a subset of the **GSM8K** dataset (only the first 1500 samples due to time and memory constraints). The training was conducted on a **DPU P-100 accelerator with 21GB VRAM**. ## 📊 Training Details ### **🛠 Training Configuration** - **Base Model:** [`HuggingFaceTB/SmolLM2-135M-Instruct`](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct) - **Fine-Tuning Technique:** GRPO (Guided Reward Policy Optimization) - **Dataset:** GSM8K (first 1500 samples) - **GPU Used:** NVIDIA Tesla **P100** (21GB VRAM) - **Precision:** `float16` - **Optimizer:** `adamw_torch_fused` - **Batch Size:** `8` - **Gradient Accumulation Steps:** `2` - **Max Prompt Length:** `128` - **Max Completion Length:** `100` - **Epochs:** `1` - **Learning Rate:** `5e-6` - **LR Scheduler:** `cosine` - **Weight Decay:** `0.2` - **Logging Steps:** `1` - **FP16 Enabled:** ✅ ### **🏆 Reward Functions Used** The model was optimized using the following reward functions: 1. **`xmlcount_reward_func`** 2. **`soft_format_reward_func`** 3. **`strict_format_reward_func`** 4. **`int_reward_func`** 5. **`correctness_reward_func`** ## 📝 Dataset Details The model was trained on a subset of the **GSM8K** dataset. The dataset was processed as follows: - The **first 1500 samples** were selected to reduce training time. - Each training sample consisted of a **question (prompt)** and a **ground truth answer** extracted using: ```python def extract_hash_answer(text: str) -> str | None: if "####" not in text: return None return text.split("####")[1].strip() ``` - The dataset was loaded and formatted using: ```python def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset: data = load_dataset('openai/gsm8k', 'main')[split] data = data.shuffle(seed=42).select(range(num_samples)) # Selecting 1500 samples data = data.map(lambda x: { 'prompt': [ {'role': 'system', 'content': SYSTEM_PROMPT}, {'role': 'user', 'content': x['question']} ], 'answer': extract_hash_answer(x['answer']) }) return data ``` ## ⚡ Performance & Limitations - The model was **fine-tuned on limited data** (1500 samples instead of the full dataset). - Due to **hardware constraints (P100, 21GB VRAM)**, some **training optimizations** were made to improve efficiency. - The model is expected to perform well on **mathematical reasoning tasks** but may have **limited generalization** due to the small training set. ## 🔧 How to Use You can use this model with **Hugging Face Transformers** as follows: ```python from transformers import AutoModelForCausalLM, AutoTokenizer model_name = "your-username/SmolLM2-135M-GRPO" # Load model and tokenizer model = AutoModelForCausalLM.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name) # Generate output prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?" inputs = tokenizer(prompt, return_tensors="pt") output = model.generate(**inputs, max_length=100) print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` ## 🚀 Acknowledgements - **Hugging Face Team** for **SmolLM2-135M** - **OpenAI GSM8K dataset** - **GRPO fine-tuning technique** for reward-based optimization ## 📌 Future Work - **Increase dataset size** for better generalization. - **Optimize training on larger GPUs** (e.g., A100, H100). - **Experiment with different reward functions** to improve accuracy.