YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

ConvGPT Improvement Pipeline

Complete research-backed training pipeline to improve mkurman/ConvGPT-0.2B-SYNTH-250B-EC for synthetic reasoning, agentic AI, multiturn conversations, and reduced repetition.

Quick Start

# 1. Install dependencies
pip install transformers trl torch datasets accelerate trackio

# 2. Phase 1: SFT on reasoning + multiturn (~2-3h on A10G)
python train_phase1_sft.py

# 3. Phase 2: DPO preference alignment (~1-2h on A10G)
python train_phase2_dpo.py

# 4. Evaluate
python evaluate_model.py

# 5. Chat with improved model
python inference_with_improved_settings.py

Problem & Solution

Issue Solution Evidence
Weak step-by-step reasoning SFT on 40k math CoT samples (OpenMathInstruct-1) DeepSeek-R1, Fine-tune-CoT papers
Poor multiturn coherence SFT on 12k Capybara conversations TRL SFT quickstart
Repetitive outputs DPO on UltraFeedback + repetition_penalty=1.15 DPO paper, DITTO paper
Low response quality Preference alignment on chosen vs rejected DPO (arXiv:2305.18290)

Phase 1: SFT (Reasoning + Multiturn)

Datasets (mixed 52k samples):

  • nvidia/OpenMathInstruct-1 — 40k filtered correct solutions
  • trl-lib/Capybara — 12k diverse multiturn conversations

Key technique: assistant_only_loss=True — only trains on assistant responses, not user prompts.

Config:

  • learning_rate=2e-5, num_train_epochs=3
  • per_device_train_batch_size=4, gradient_accumulation_steps=4
  • max_length=2048, bf16=True

Output: mkurman/ConvGPT-0.2B-SYNTH-250B-EC-Phase1-SFT

Phase 2: DPO (Quality + Repetition)

Dataset: HuggingFaceH4/ultrafeedback_binarized (train_prefs)

  • ~50k preference pairs (prompt + chosen + rejected)
  • Filtered to only rows where score_chosen > score_rejected

Config:

  • learning_rate=1e-6, beta=0.1
  • per_device_train_batch_size=2, gradient_accumulation_steps=8
  • num_train_epochs=1

Output: mkurman/ConvGPT-0.2B-SYNTH-250B-EC-Phase2-DPO

Inference Settings

Setting Base Model Improved
repetition_penalty 1.0 1.15
temperature default 0.7
top_p default 0.9
do_sample default True
top_k default 50

Expected Improvements

Metric Expected Change
Repetition rate -30-50%
Math reasoning +10-20%
Multiturn coherence Significant
Response quality Improved via preference alignment

Research References

  • DeepSeek-R1 (arXiv:2501.12948): Cold-start SFT + GRPO for reasoning
  • Fine-tune-CoT (arXiv:2212.10071): Diverse reasoning paths boost small models
  • DPO (arXiv:2305.18290): Preference optimization without reward model
  • DITTO (arXiv:2206.02369): Repetition reduction training loss
  • RAGEN (arXiv:2504.20073): Multi-turn RL for agents
  • Toolformer (arXiv:2302.04761): Self-supervised tool use via SFT
  • ReAct (arXiv:2210.03629): Reasoning + acting interleaving

Hardware Requirements

  • Minimum: T4 GPU (16GB VRAM)
  • Recommended: A10G (24GB) or A100 (80GB)
  • Training time: Phase 1 (2-3h) + Phase 2 (1-2h) on A10G

Files

File Purpose
train_phase1_sft.py SFT on reasoning + multiturn
train_phase2_dpo.py DPO preference alignment
evaluate_model.py Evaluate repetition + quality
inference_with_improved_settings.py Chat with optimal settings

Future Improvements

  1. GRPO: Train on math with rule-based rewards (exact match + format)
  2. Tool calling: Add tool-use datasets to Capybara mix
  3. Longer context: Increase max_length to 4096 or 8192
  4. DITTO loss: Explicitly train against repetition loops
  5. MPO: Combine sigmoid + BCO + SFT losses for richer preference learning

License

Apache-2.0 (same as base model)

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Papers for mkurman/convgpt-improvement-pipeline