test
README_TEST.md +77 -0
README_TEST.md
ADDED
@@ -0,0 +1,77 @@
# NovaMind-256M: Training a Conversational LLM on 3.3B Tokens with Modal

I built and trained **NovaMind-256M**, a decoder-only conversational language model with ~252M parameters. This project covers the entire pipeline, from custom architecture design and data preparation to training on H100 GPUs using [Modal](https://modal.com).

## Why I built this

I wanted to understand how LLMs actually work under the hood, not just call an API. So I read a bunch of papers, picked the best ideas from the top models, and combined them into something I could actually train myself without spending thousands of dollars on cloud compute.

The result is a model with roughly a quarter of a billion parameters. Small by industry standards, but big enough to get real results.

---



## 🏗️ The Architecture

I designed NovaMind-256M with modern efficiency in mind, combining established LLM techniques with a few experimental twists:

- **Grouped Query Attention (GQA):** 16 query heads share 4 KV heads (4:1 ratio). This makes the KV cache 4x smaller than it would be with one KV head per query head, which matters during long chats.
- **SwiGLU FFN:** I used SwiGLU instead of a standard GELU feed-forward block for better parameter efficiency.
- **Pre-RMSNorm:** RMSNorm is applied before each sub-layer, which helps keep training stable across all 24 layers.
- **HiRoPE (Hierarchical Rotary Position Embedding):** This is one of the unique parts of my model. I adapted HiRoPE from code-specific research to work for conversations. I split the head dimensions into two streams:
  - **Local stream (base=10k):** Handles fine-grained context within a single turn.
  - **Global stream (base=500k):** Tracks coarse context across the entire dialogue history.
- **Tag-Aware Loss Curriculum:** Another unique experiment. I don't treat every token equally. I use a dynamic loss-weighting scheme for `<think>`, `<assistant>`, and `<human>` tags, and the weights shift as the model progresses through different training phases.
- **Weight Tying:** I tied the embedding and LM head weights, saving about 33M parameters.
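
To make the two-stream split concrete, here is a minimal pure-Python sketch of a dual-base rotary embedding. It is not the code from `model.py`: the even split of the head dimensions, the head size of 64, the use of the same token position for both streams, and all function names are assumptions made for illustration. Only the two bases (10k and 500k) come from the list above.

```python
import math

def rope_angles(position, dims, base):
    """Rotation angles for one rotary stream with `dims` (even) dimensions."""
    # Standard RoPE: pair i rotates by position * base^(-2i / dims).
    return [position * base ** (-2 * i / dims) for i in range(dims // 2)]

def rotate(vec, angles):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dual_base_rope(vec, position, local_dims,
                   local_base=10_000.0, global_base=500_000.0):
    """Low-base rotation on the first `local_dims` dims, high-base on the rest."""
    local, global_ = vec[:local_dims], vec[local_dims:]
    return (rotate(local, rope_angles(position, len(local), local_base))
            + rotate(global_, rope_angles(position, len(global_), global_base)))

HEAD_DIM = 64  # assumed head size, not stated in the README
q = [1.0] * HEAD_DIM
q_rot = dual_base_rope(q, position=128, local_dims=HEAD_DIM // 2)
```

Because the global stream uses a much larger base, its slowest dimensions rotate far more slowly than the local ones, so they keep telling positions apart over much longer distances.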

## 🚀 Training Journey

I trained the model on a total of **3.368 billion tokens** using Modal's H100 GPU infrastructure. I split the training into two phases: one to build knowledge and one to shape personality.
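
The tag-aware loss weights mentioned in the architecture list are described as shifting across training phases. A minimal sketch of what a per-phase, per-tag weighted loss could look like follows. Everything specific here is hypothetical: the phase names, the weight values, the assumption that each token inherits the weight of the tag it sits under, and the function name. The real scheme lives in `train.py` and may differ.

```python
# Hypothetical weights per phase and tag; the README does not give real values.
PHASE_WEIGHTS = {
    "pretrain": {"think": 1.0, "assistant": 1.0, "human": 1.0},
    "sft": {"think": 0.5, "assistant": 1.0, "human": 0.1},
}

def weighted_nll(token_logprobs, token_tags, phase):
    """Weighted mean negative log-likelihood over one sequence.

    token_logprobs: log-probability the model gave to each target token.
    token_tags: the tag ("think", "assistant", "human") each token falls under.
    """
    weights = [PHASE_WEIGHTS[phase][tag] for tag in token_tags]
    total = sum(w * -lp for w, lp in zip(weights, token_logprobs))
    # Normalise by the weight mass so the loss scale is comparable across phases.
    return total / sum(weights)

logprobs = [-1.0, -2.0, -3.0]
tags = ["human", "think", "assistant"]
pretrain_loss = weighted_nll(logprobs, tags, "pretrain")
sft_loss = weighted_nll(logprobs, tags, "sft")
```

With uniform weights this reduces to the ordinary mean loss. With the illustrative SFT weights, assistant tokens dominate the average and human tokens barely contribute.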

### Phase 1: Knowledge Foundation (Pretraining)

- **The Data:** I used a heavy mix of **Wikipedia EN** for factual depth and **TinyStories** to help a smaller model like this develop better reasoning and narrative coherence.
- **Cost:** A total of **$22.80** (this includes all the CPU-based data preparation and the H100 pretraining run).



### Phase 2: Conversational Polish (SFT)

- **The Data:** A curated instruction set including **Alpaca, OASST1, Dolly, DailyDialog**, and my own custom **Identity Seeds** to anchor the NovaMind persona.
- **Cost:** Only **$1.21** on a Modal H100.



> [!TIP]
> Modal gives $30 of free credit every month. Since the entire training run only cost me about $24.01 ($22.80 + $1.21), I basically trained a model on 3.3B tokens for free.

## 💬 Chatting with NovaMind

I wrote a rich terminal interface to interact with the final SFT checkpoint. It supports multi-turn history (keeping the last 6 turns), auto-detects whether to run on Apple Silicon (MPS) or an NVIDIA GPU (CUDA), and handles stop tokens so the model doesn't keep generating runaway text past the end of its reply.
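
A minimal sketch of the history window and stop handling described above, in plain Python. This is not the code from `chat.py`: the prompt layout, the choice of `<human>` as the stop string, the definition of a turn as one human/assistant pair, and the helper names are assumptions. Only the 6-turn limit and the tag names come from this README.

```python
from collections import deque

MAX_TURNS = 6                # from the README: keep the last 6 turns
STOP_STRINGS = ("<human>",)  # assumed: stop when the model starts a new human turn

def build_prompt(history, user_message):
    """Render the kept turns plus the new message, ending with an open assistant tag."""
    parts = [f"<human>{h}<assistant>{a}" for h, a in history]
    parts.append(f"<human>{user_message}<assistant>")
    return "".join(parts)

def truncate_at_stop(text, stop_strings=STOP_STRINGS):
    """Cut generated text at the earliest stop string, if one appears."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()

# A deque with maxlen drops the oldest turn automatically once it is full.
history = deque(maxlen=MAX_TURNS)
for i in range(8):
    history.append((f"question {i}", f"answer {i}"))

prompt = build_prompt(history, "What did I ask first?")
reply = truncate_at_stop("Hi there!<human>And then")
```

Bounding the history keeps the prompt length predictable, and cutting at the stop string discards anything the model writes after it starts inventing the next human turn.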

```bash
# How I start the chat
python chat.py
```



## 📁 What's Inside

- `model.py`: My core architecture (`NovaMind256M` & `NovaMindConfig`).
- `train.py`: The heart of the training loop.
- `modal_novamind.py`: My Modal deployment config for remote GPU execution.
- `prepare_data_*.py`: How I tokenized the 3.3B tokens on CPU.
- `chat.py`: The interactive REPL I use for testing.
- `plots/`: Where I keep my training visualizations.

## 🛠️ How to use it

Make sure you have `novamind_sft_final.pt` in the `checkpoints/` folder.

### Run the REPL

```bash
python chat.py --temp 0.7 --max_tokens 400
```
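
For readers unfamiliar with the `--temp` flag, here is a minimal sketch of temperature sampling, assuming the flag is the usual softmax temperature. How `chat.py` actually samples is not shown in this README, so the function names and the greedy fallback for non-positive temperatures are illustrative assumptions.

```python
import math
import random

def temperature_probs(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=0.7, rng=random):
    """Pick a token index; a temperature <= 0 falls back to greedy argmax."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
sharp = temperature_probs(logits, 0.7)  # below 1.0: favours the top token more
flat = temperature_probs(logits, 1.5)   # above 1.0: spreads probability out
```

Lower temperatures concentrate probability on the most likely token, which gives more predictable replies. Higher temperatures flatten the distribution and make the output more varied.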