Qwen-1.5B-LFGRPO-OPTIM
This repository hosts the LoRA adapter weights for Qwen-1.5B-LFGRPO-OPTIM, a low-compute, alignment-optimized reasoning model. The model is trained using Layer-Frozen Group Relative Policy Optimization (LF-GRPO), a novel alignment paradigm designed to mitigate the "alignment tax" in small language models.
- Developed by: Kriday Dave (Alethia Research Group)
- Model type: Causal Language Model with PEFT/LoRA Adapter
- Base Model: Qwen/Qwen2.5-1.5B-Instruct (quantized in 4-bit)
- Language(s): English
- License: Apache-2.0
- Repository: Alethia-Research GitHub
- Paper: Making Small Models Reason on a Colab Budget: Layer-Frozen Group Relative Policy Optimization
Model Description
Traditional reinforcement learning alignment (like standard GRPO) backpropagates formatting and correctness gradients across all layers of a language model. In smaller models (1.5B to 3B parameters), this triggers Central Engine Disruption: the destructive corruption of core mathematical and logical representations in the early and middle layers ($L0$--$L23$).
LF-GRPO solves this by strictly freezing the model's central logic core ($L0$--$L23$) and confining parameter updates to the late-layer behavioral periphery ($L24$--$L27$). This allows the model to learn complex reasoning layout boundaries (such as step-by-step <think> tag monologues) without corrupting its underlying arithmetic capability.
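The layer split described above can be illustrated with a small name-based filter. This is a minimal sketch, not the training code used for this adapter: the helper names `is_trainable` and `apply_freeze` are hypothetical, and it assumes parameter names contain a `.layers.<index>.` segment, as they do in Hugging Face Qwen2 checkpoints. Applied to a PEFT-wrapped model it would also unfreeze base weights in layers 24-27, so treat it as an illustration of the boundary only.

```python
import re

# Boundary taken from the text: blocks 0-23 frozen, blocks 24-27 trainable.
FIRST_TRAINABLE_LAYER = 24

# Matches the decoder block index in names like "model.layers.24.mlp.up_proj.weight".
_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def is_trainable(param_name: str, first_trainable: int = FIRST_TRAINABLE_LAYER) -> bool:
    """Return True only for parameters inside a decoder block at or above the boundary."""
    match = _LAYER_RE.search(param_name)
    if match is None:
        # Embeddings, final norm, and lm_head carry no block index;
        # this sketch assumes they stay frozen.
        return False
    return int(match.group(1)) >= first_trainable


def apply_freeze(named_parameters, first_trainable: int = FIRST_TRAINABLE_LAYER) -> int:
    """Set requires_grad on each (name, param) pair and return the number left trainable."""
    trainable = 0
    for name, param in named_parameters:
        param.requires_grad = is_trainable(name, first_trainable)
        trainable += int(param.requires_grad)
    return trainable
```

With a plain `transformers` model this would be called as `apply_freeze(model.named_parameters())`; anything below block 24 then receives no gradient updates.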
Functional Behaviors:
- Structured Thinking: The model breaks word problems down into a numbered sequence of logical steps.
- Conciseness Penalization: Through step-decay relative rewards, the model maintains a short, high-density reasoning path, preventing verbosity drift.
- Intact Core Arithmetic: Avoids the standard post-alignment reasoning decay, preserving raw calculation precision.
How to Get Started with the Model
You can load this adapter on top of the base Qwen2.5-1.5B-Instruct model using peft and transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
# Load the LF-GRPO adapter
model = PeftModel.from_pretrained(base_model, "kridaydave/Qwen-1.5B-LFGRPO-OPTIM")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model.eval()

# Prompt format (Zero-Shot CoT with system guidance)
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The Assistant must think step-by-step "
    "inside <think>...</think> tags to solve the mathematical problem, and then provide "
    "the final numeric answer outside the tags."
)
prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\nJanet has 16 eggs. She eats 3 for breakfast and bakes muffins with 4. She sells the rest for $2 each. How much does she make?<|im_end|>\n<|im_start|>assistant\n<think>\n"

# Move inputs to the device chosen by device_map instead of hard-coding "cuda"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False)
# Decode only the newly generated tokens (skip the prompt)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:]))
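Because the prompt asks for reasoning inside `<think>...</think>` and a numeric answer outside the tags, the decoded text usually needs a small post-processing step. The helper below is a hypothetical sketch (the name `split_reasoning_and_answer` and the number regex are assumptions, not part of this repository) that separates the monologue from the final number.

```python
import re

# Matches integers or decimals, optionally negative, with optional thousands commas.
_NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def split_reasoning_and_answer(completion: str):
    """Split a decoded completion into (reasoning_text, final_number_or_None)."""
    reasoning, closing_tag, tail = completion.partition("</think>")
    if not closing_tag:
        # The model never closed the tag: no answer region to parse.
        return completion.strip(), None
    # Drop the ChatML end-of-turn token before searching for numbers.
    tail = tail.replace("<|im_end|>", "")
    numbers = _NUMBER_RE.findall(tail)
    if not numbers:
        return reasoning.strip(), None
    # Take the last number after the tag as the final answer.
    return reasoning.strip(), float(numbers[-1].replace(",", ""))


# Hand-written illustration of the expected layout (not actual model output).
example = "16 - 3 - 4 = 9 eggs left.\n9 * 2 = 18\n</think>\nJanet makes $18.<|im_end|>"
reasoning, answer = split_reasoning_and_answer(example)
print(answer)  # 18.0
```

For the egg problem in the prompt, the arithmetic is 16 - 3 - 4 = 9 eggs sold at $2 each, so a correct completion should yield 18.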
Training Details
Training Data
The model was trained on a 1,000-sample subset of the OpenAI GSM8K dataset, optimized specifically for step-by-step math logic.
Training Procedure
- Regime: Two-stage optimization. Stage 1 (steps 0-100) focuses on format-priming and monologue tag alignment. Stage 2 (steps 101-300) optimizes for final math correctness and conciseness.
- Group Relative Search: Group size ($N=4$) is used to compute advantages relative to the group mean and standard deviation, bypassing the memory-heavy critic model.
- Autograd Periphery Insulation: Hard gradient masking applied at layer 24. 100% of parameters in layers 0-23 were kept frozen.
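The group-relative step above replaces a learned critic with statistics computed over the completions sampled for one prompt. A minimal sketch of that normalization follows, assuming a group of N=4 scalar rewards; the function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions of this sketch, and real implementations may differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two correct (reward 1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Completions that beat the group average receive a positive advantage and the rest a negative one; when all four rewards are identical, every advantage is zero and the group contributes no learning signal.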
Training Hyperparameters
- LoRA Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- LoRA Rank / Alpha: 32 / 32
- Targeted Layers: [24, 25, 26, 27]
- Trainable Parameters: 5,275,648 (0.34% of base model)
- Optimizer: paged_adamw_8bit (with CUDA page offloading)
- Learning Rate: 1.5e-5
- Batch Configuration: Batch=1, Accumulation=4 (effective batch size = 4)
- Sequence Limits: Prompt=512, Completion=384
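The trainable-parameter figure can be cross-checked by hand: a LoRA adapter of rank r on a linear layer with `in` inputs and `out` outputs adds r × (in + out) weights. The sketch below assumes the publicly listed Qwen2.5-1.5B dimensions (hidden size 1536, MLP size 8960, 12 attention heads, 2 key/value heads); these values are not stated in this card, so verify them against the base model's `config.json`.

```python
# Assumed Qwen2.5-1.5B dimensions (check against the base model's config.json).
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_HEADS = 12
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS      # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM    # 256 (grouped-query attention)

# Values taken from the hyperparameter list above.
RANK = 32
TARGET_LAYERS = [24, 25, 26, 27]

# (in_features, out_features) for each targeted projection.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank): rank * (in + out) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * len(TARGET_LAYERS)
print(per_layer, total)  # 1318912 5275648
```

Under these assumptions the total matches the 5,275,648 reported above. In PEFT, restricting adapters to specific blocks is typically done through the `layers_to_transform` argument of `LoraConfig`; whether this adapter was configured that way is not stated here.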
Evaluation Results
Evaluated on the OpenAI GSM8K test split (held-out prompts) under a zero-shot ChatML reasoning format:
- Qwen2.5-1.5B-Instruct (Baseline): ~42.0% - 50.0%
- Standard GRPO (Full-Layer LoRA): ~42.0% (degraded due to alignment tax / engine disruption)
- LF-GRPO (This Work - Step 100): ~50.0%
- LF-GRPO (This Work - Step 200/300): ~58.0% - 65.0% OOD accuracy (highly structured, concise CoT)
Environmental Impact
- Hardware Type: 1 x Tesla T4 GPU (16GB VRAM)
- Hours used: ~2.0 hours
- Cloud Provider: Google Colab
- Compute Region: us-central1
Technical Specifications
Model Architecture
The underlying architecture is based on Qwen2.5 (RoPE embeddings, SwiGLU gating, and RMSNorm layers) using a 28-layer parameter layout.
Software
- TRL (Transformer Reinforcement Learning)
- Unsloth (Fast language model training & Triton kernels)
- vLLM (Fast CUDA graph decoders for advantage rollouts)