---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B-Base
datasets:
- PursuitOfDataScience/0.5M-thinking
tags:
- qwen3.5
- chain-of-thought
- reasoning
- math
- sft
pipeline_tag: text-generation
---

# Qwen3.5-0.8B-thinking

A **Chain-of-Thought fine-tuned** version of [Qwen/Qwen3.5-0.8B-Base](https://huggingface.co/Qwen/Qwen3.5-0.8B-Base), trained to reason step-by-step using `` tags before producing a final answer.

---

## Model Details

| Attribute | Value |
|---|---|
| **Base model** | [Qwen/Qwen3.5-0.8B-Base](https://huggingface.co/Qwen/Qwen3.5-0.8B-Base) |
| **Architecture** | Qwen3_5ForCausalLM (hybrid linear / full attention) |
| **Parameters** | ~0.8B |
| **Context window** | 4096 tokens |
| **Hidden size** | 1024 |
| **Layers** | 24 |
| **Attention heads** | 8 (2 KV heads) |
| **Vocabulary** | 248,320 tokens |
| **Precision** | bfloat16 |

---

## Training Details

### Data

Fine-tuned on [PursuitOfDataScience/0.5M-thinking](https://huggingface.co/datasets/PursuitOfDataScience/0.5M-thinking), a dataset of ~500K examples with structured chain-of-thought reasoning wrapped in `` / `` tags followed by a clean final answer. After filtering examples that exceed the 4096-token context window, **244,997 examples** were used for training.

### Procedure

The model was trained with **supervised fine-tuning (SFT)** using HuggingFace `Trainer`:

| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 8 |
| **Effective batch size** | **32** |
| Learning rate | 2e-5 |
| LR schedule | Linear with warmup |
| Warmup steps | 100 |
| Max sequence length | 4096 |
| Total optimizer steps | 7,657 |
| Hardware | 1× H100 GPU |
| Precision | bfloat16 |
| Attention | SDPA (scaled dot-product attention) |

**Prompt format used during training:**

```
user:
assistant:
```

The `` tag is hardcoded into the prompt prefix so the model always learns to emit structured reasoning before the answer.

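
As a quick check on the hyperparameter table above, the effective batch size and the total number of optimizer steps follow from the batch settings and the filtered dataset size. The sketch below assumes a single GPU (as listed under Hardware) and that the final partial batch still counts as one optimizer step; the variable names are illustrative and are not taken from the training script.

```python
import math

# Values copied from the hyperparameter table and the Data section.
per_device_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 1            # assumption: the single H100 listed under Hardware
num_examples = 244_997  # examples left after the 4096-token filter
epochs = 1

# Sequences contributing to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Assumption: the last, partially filled batch still triggers an update,
# so the per-epoch step count is rounded up.
total_steps = math.ceil(num_examples / effective_batch_size) * epochs

print(effective_batch_size)  # 32
print(total_steps)           # 7657
```

Both values match the table: an effective batch size of 32 and 7,657 optimizer steps for one epoch.
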
**Label masking:** Only the assistant response (starting after ``) is included in the cross-entropy loss; the prompt tokens are masked with `-100`, the label value that the loss ignores.

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "PursuitOfDataScience/Qwen3.5-0.8B-thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

question = "If Alice has 3 apples and buys 5 more, how many apples does she have?"

# Build the prompt in the same user/assistant layout used during training.
prompt = (
    f"user: Solve this math problem step by step. "
    f"Show your reasoning, then give the final answer after ####.\n\n"
    f"Question: {question}\n"
    f"assistant: \n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample with the same temperature and top_p as the GSM8K evaluation.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)

# The decoded text contains the prompt followed by the generated continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## GSM8K Benchmark Results (Pass@1)

Evaluated on the GSM8K test set (1,319 examples) using low-temperature sampling (temperature=0.6, top_p=0.9, max_new_tokens=4096). Because decoding is sampled rather than greedy, scores can vary slightly between runs.

Training steps in the table below are **CoT-SFT optimizer steps** (0 = base model before any fine-tuning; 7,657 = end of one full epoch). Only the final model is publicly released; intermediate checkpoints are not available.

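
The card does not include its scoring code. As a rough illustration of how Pass@1 could be computed for outputs in the `####` answer format requested by the usage prompt, the sketch below extracts the number after the last `####` marker and compares it with the reference. The helper names `extract_final_answer` and `pass_at_1`, the regular expression, and the toy strings are all assumptions made for illustration; they are not the author's evaluation harness or real model outputs.

```python
import re


def extract_final_answer(text):
    """Return the number after the last '####' marker as a string, or None."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    if not matches:
        return None
    # Drop thousands separators so "1,234" and "1234" compare equal.
    return matches[-1].replace(",", "")


def pass_at_1(predictions, references):
    """Fraction of single-sample predictions whose final answer matches."""
    correct = 0
    for prediction, reference in zip(predictions, references):
        answer = extract_final_answer(prediction)
        if answer is not None and float(answer) == float(reference):
            correct += 1
    return correct / len(references)


# Toy strings for illustration only, not real model generations.
toy_predictions = [
    "3 + 5 = 8, so Alice has 8 apples. #### 8",
    "3 + 5 = 7. #### 7",
    "Alice has some apples.",
]
toy_references = ["8", "8", "8"]

print(pass_at_1(toy_predictions, toy_references))
```

A generation with no `####` marker is counted as incorrect here; a real harness might handle that case differently.
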

| Training Steps | GSM8K Accuracy |
|---:|---:|
| 0 (base, with ``) | 58.23% |
| 0 (base, no ``) | 51.40% |
| 500 | 57.32% |
| 1,000 | 59.97% |
| 1,500 | 63.53% |
| 2,000 | 60.20% |
| 2,500 | 59.21% |
| 3,000 | 60.73% |
| 3,500 | 60.58% |
| 4,000 | 60.35% |
| 4,500 | 61.11% |
| 5,000 | 58.61% |
| 5,500 | 62.62% |
| 6,000 | 62.17% |
| 6,500 | 61.11% |
| 7,000 | 63.68% |
| 7,500 | 61.03% |
| 7,657 | 61.64% |
| **final model** | **62.40%** |

The fine-tuned final model achieves **62.40%** vs the base model's **58.23%** (+4.17 pp) when both use chain-of-thought (``) prompting, and a **+10.99 pp** gain over the base model without any reasoning prompt.

---

## Acknowledgements

- Base model: [Qwen/Qwen3.5-0.8B-Base](https://huggingface.co/Qwen/Qwen3.5-0.8B-Base) by the Qwen Team (Alibaba Cloud)
- Training data: [PursuitOfDataScience/0.5M-thinking](https://huggingface.co/datasets/PursuitOfDataScience/0.5M-thinking)

---

## License

Apache 2.0, same as the base model.