openai/gsm8k
Benchmark β’ Updated β’ 17.6k β’ 969k β’ 1.34k
This is a SmolLM2-135M model fine-tuned using the Guided Reward Policy Optimization (GRPO) technique on a subset of the GSM8K dataset (only the first 1500 samples due to time and memory constraints). The training was conducted on a DPU P-100 accelerator with 21GB VRAM.
HuggingFaceTB/SmolLM2-135M-Instructfloat16adamw_torch_fused8212810015e-6cosine0.21The model was optimized using the following reward functions:
xmlcount_reward_funcsoft_format_reward_funcstrict_format_reward_funcint_reward_funccorrectness_reward_funcThe model was trained on a subset of the GSM8K dataset. The dataset was processed as follows:
def extract_hash_answer(text: str) -> str | None:
if "####" not in text:
return None
return text.split("####")[1].strip()
def get_gsm8k_questions(split="train", num_samples=1500) -> Dataset:
data = load_dataset('openai/gsm8k', 'main')[split]
data = data.shuffle(seed=42).select(range(num_samples)) # Selecting 1500 samples
data = data.map(lambda x: {
'prompt': [
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']}
],
'answer': extract_hash_answer(x['answer'])
})
return data
You can use this model with Hugging Face Transformers as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "your-username/SmolLM2-135M-GRPO"
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate output
prompt = "If a train travels at 60 mph for 2.5 hours, how far does it travel?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_length=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Base model
HuggingFaceTB/SmolLM2-135M