---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B-Base
datasets:
- PursuitOfDataScience/0.5M-thinking
tags:
- qwen3.5
- chain-of-thought
- reasoning
- math
- sft
pipeline_tag: text-generation
---

# Qwen3.5-0.8B-thinking

A **Chain-of-Thought fine-tuned** version of [Qwen/Qwen3.5-0.8B-Base](https://huggingface.co/Qwen/Qwen3.5-0.8B-Base),
trained to reason step-by-step using `<think>` tags before producing a final answer.

---

## Model Details

| Attribute | Value |
|---|---|
| **Base model** | [Qwen/Qwen3.5-0.8B-Base](https://huggingface.co/Qwen/Qwen3.5-0.8B-Base) |
| **Architecture** | Qwen3_5ForCausalLM (hybrid linear / full attention) |
| **Parameters** | ~0.8B |
| **Context window** | 4096 tokens |
| **Hidden size** | 1024 |
| **Layers** | 24 |
| **Attention heads** | 8 (2 KV heads) |
| **Vocabulary** | 248,320 tokens |
| **Precision** | bfloat16 |

---

## Training Details

### Data
Fine-tuned on [PursuitOfDataScience/0.5M-thinking](https://huggingface.co/datasets/PursuitOfDataScience/0.5M-thinking),
a dataset of ~500K examples with structured chain-of-thought reasoning wrapped in `<think>` / `</think>` tags
followed by a clean final answer.

After filtering out examples longer than the 4096-token context window, **244,997 examples** remained and were used for training.
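The length filter described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing script: `filter_by_length` and `whitespace_count` are hypothetical names, and the whitespace counter is only a stand-in for the real tokenizer's token count.

```python
from typing import Callable, Iterable, List

MAX_TOKENS = 4096  # context window stated in this card


def filter_by_length(
    examples: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Keep only examples whose token count fits in the context window."""
    return [ex for ex in examples if count_tokens(ex) <= max_tokens]


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


short_example = "user: 2 + 2?\nassistant: <think>\n2 + 2 = 4\n</think>\n4"
long_example = "word " * 5000  # 5000 "tokens" under the stand-in counter
kept = filter_by_length([short_example, long_example], whitespace_count)
print(len(kept))
```

In practice the counter would call the model's tokenizer on the fully formatted prompt plus response, since that is the sequence the 4096-token limit applies to.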

### Procedure

The model was trained with **supervised fine-tuning (SFT)** using the Hugging Face `Trainer`:

| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 8 |
| **Effective batch size** | **32** |
| Learning rate | 2e-5 |
| LR schedule | Linear with warmup |
| Warmup steps | 100 |
| Max sequence length | 4096 |
| Total optimizer steps | 7,657 |
| Hardware | 1× H100 GPU |
| Precision | bfloat16 |
| Attention | SDPA (scaled dot-product attention) |
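As a sanity check, the effective batch size and the total step count in the table follow from the other values. The sketch below assumes one optimizer step per effective batch, with the final partial batch counted as a step; the exact rounding can depend on the trainer version.

```python
import math

num_examples = 244_997      # training examples after length filtering
per_device_batch = 4
grad_accum_steps = 8
num_gpus = 1
epochs = 1

# Sequences contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum_steps * num_gpus

# One optimizer step per effective batch; the last partial batch is rounded up.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, total_steps)
```

With these inputs the calculation gives an effective batch of 32 and 7,657 steps, matching the table.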

**Prompt format used during training:**

```
user: <question>
assistant: <think>
<step-by-step reasoning>
</think>
<final answer>
```

The `<think>` tag is hardcoded into the prompt prefix, so every training response
begins inside a reasoning block and the model learns to reason before answering.
The same prefix should be supplied at inference time.

**Label masking:** Only the assistant response (starting after `<think>`) is
included in the cross-entropy loss. The prompt tokens are masked with `-100`,
the label value that PyTorch's cross-entropy loss ignores by default.
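The masking scheme can be illustrated with plain token-id lists. This is a minimal sketch under stated assumptions, not the training code: `build_labels` is a hypothetical helper and the token ids are made up for illustration.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def build_labels(
    prompt_ids: List[int], response_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and response; mask the prompt part in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Hypothetical ids standing in for "user: ...\nassistant: <think>\n"
prompt_ids = [11, 12, 13, 14]
# Hypothetical ids standing in for "<reasoning>\n</think>\n<final answer>"
response_ids = [21, 22, 23]

input_ids, labels = build_labels(prompt_ids, response_ids)
print(input_ids)
print(labels)
```

Labels stay aligned one-to-one with `input_ids` here; Hugging Face causal language models shift them by one position internally when computing the loss.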

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "PursuitOfDataScience/Qwen3.5-0.8B-thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

question = "If Alice has 3 apples and buys 5 more, how many apples does she have?"

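# The prompt mirrors the training format and ends with "<think>" so the model
# continues directly with its step-by-step reasoning.
prompt = (
    "user: Solve this math problem step by step. "
    "Show your reasoning, then give the final answer after ####.\n\n"
    f"Question: {question}\n"
    "assistant: <think>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sampled decoding (do_sample=True), so repeated runs can differ.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)
# outputs[0] holds the prompt tokens followed by the newly generated tokens.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))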
```
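To separate the reasoning from the final answer, the generated text can be split at the closing `</think>` tag. The helpers below are a hypothetical sketch, assuming the completion follows the training format and that the answer is written after `####` as the prompt requests. They expect only the generated continuation, not the prompt, because the prompt itself contains `####`.

```python
import re
from typing import Optional, Tuple


def split_completion(completion: str) -> Tuple[str, str]:
    """Split generated text into (reasoning, answer) at the closing think tag."""
    reasoning, sep, answer = completion.partition("</think>")
    if not sep:
        # No closing tag (e.g. generation was cut off): no answer section.
        return completion.strip(), ""
    return reasoning.strip(), answer.strip()


def extract_final_number(answer: str) -> Optional[str]:
    """Return the number following '####', with thousands separators removed."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", answer)
    return match.group(1).replace(",", "") if match else None


# Illustrative completion text (not actual model output).
completion = "3 + 5 = 8\n</think>\nAlice has 8 apples.\n#### 8"
reasoning, answer = split_completion(completion)
print(reasoning)
print(extract_final_number(answer))
```

One way to obtain the continuation alone is to decode `outputs[0][inputs["input_ids"].shape[1]:]` instead of the full `outputs[0]`.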

---

## GSM8K Benchmark Results (Pass@1)

Evaluated on the GSM8K test set (1,319 examples) with a single sampled completion
per question (temperature=0.6, top_p=0.9, max_new_tokens=4096). Decoding is
stochastic rather than greedy.

The **Training Steps** column below gives the number of **CoT-SFT optimizer steps**
(0 = base model before any fine-tuning; 7,657 = end of one full epoch). Only the
final model is publicly released; intermediate checkpoints are not available.

| Training Steps | GSM8K Accuracy |
|---:|---:|
| 0 (base, with `<think>`) | 58.23% |
| 0 (base, no `<think>`)   | 51.40% |
| 500   | 57.32% |
| 1,000 | 59.97% |
| 1,500 | 63.53% |
| 2,000 | 60.20% |
| 2,500 | 59.21% |
| 3,000 | 60.73% |
| 3,500 | 60.58% |
| 4,000 | 60.35% |
| 4,500 | 61.11% |
| 5,000 | 58.61% |
| 5,500 | 62.62% |
| 6,000 | 62.17% |
| 6,500 | 61.11% |
| 7,000 | 63.68% |
| 7,500 | 61.03% |
| 7,657 | 61.64% |
| **final model** | **62.40%** |

The fine-tuned final model achieves **62.40%** vs the base model's **58.23%**
(+4.17 pp) when both use chain-of-thought (`<think>`) prompting, and a
**+10.99 pp** gain over the base model without any reasoning prompt.
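The gains quoted above can be reproduced from the table. The sketch below assumes each reported percentage corresponds to a whole number of correct answers out of 1,319; under that assumption the second gain comes out as 10.99 pp rather than the 11.00 pp obtained by subtracting the rounded percentages. It also estimates the binomial standard error of a single accuracy measurement, assuming independent test examples, as a rough guide to how much of the checkpoint-to-checkpoint variation could be noise.

```python
import math

N = 1319  # GSM8K test examples

# Accuracies as reported in the table (percent).
reported = {"final": 62.40, "base_think": 58.23, "base_plain": 51.40}

# Assumption: recover integer correct counts from the rounded percentages.
correct = {name: round(pct * N / 100) for name, pct in reported.items()}


def to_pct(count: int) -> float:
    """Convert a count of test examples to percentage points."""
    return 100 * count / N


gain_vs_think = to_pct(correct["final"] - correct["base_think"])
gain_vs_plain = to_pct(correct["final"] - correct["base_plain"])

# Binomial standard error of the final accuracy, in percentage points.
p = correct["final"] / N
std_err_pp = 100 * math.sqrt(p * (1 - p) / N)

print(correct)
print(round(gain_vs_think, 2), round(gain_vs_plain, 2), round(std_err_pp, 2))
```

A standard error of roughly 1.3 pp suggests that differences of a point or two between neighbouring checkpoints are comparable to sampling noise, while the gap to the base model without `<think>` is much larger than that.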

---

## Acknowledgements

- Base model: [Qwen/Qwen3.5-0.8B-Base](https://huggingface.co/Qwen/Qwen3.5-0.8B-Base) by the Qwen Team (Alibaba Cloud)
- Training data: [PursuitOfDataScience/0.5M-thinking](https://huggingface.co/datasets/PursuitOfDataScience/0.5M-thinking)

---

## License

Apache 2.0, the same license as the base model.