MrEngineer committed on
Commit
bc5ed70
·
verified ·
1 Parent(s): 5640ef1
Files changed (1)
  1. README_TEST.md +77 -0
README_TEST.md ADDED
@@ -0,0 +1,77 @@
+ # NovaMind-256M: Training a Conversational LLM on 3.3B Tokens with Modal
+
+ I've built and trained **NovaMind-256M**, a decoder-only conversational language model with ~252M parameters. This project covers the entire pipeline: from custom architecture design and data preparation to large-scale training on H100 GPUs using [Modal](https://modal.com).
+
+ ## Why I built this
+
+ I wanted to understand how LLMs actually work under the hood, not just call an API. So I read a bunch of papers, picked the best ideas from the top models, and combined them into something I could actually train myself without spending thousands of dollars on cloud compute.
+
+ The result is a model with roughly a quarter of a billion parameters. Small by industry standards, but big enough to get real results.
+
+ ---
+
+ ![My Model Architecture](image/Architecture.svg)
+
+ ## 🏗️ The Architecture
+
+ I designed NovaMind-256M with modern efficiency in mind, combining established LLM techniques with a few unique twists:
+
+ - **Grouped Query Attention (GQA):** 16 query heads share 4 KV heads (4:1 ratio), so the KV cache is a quarter of the size that full multi-head attention with 16 KV heads would need, which matters most during long chats.
+ - **SwiGLU FFN:** I used a SwiGLU feed-forward block instead of a standard GELU MLP for better parameter efficiency.
+ - **Pre-RMSNorm:** RMSNorm is applied before each attention and FFN sub-layer, which helps keep training stable across all 24 layers.
+ - **HiRoPE (Hierarchical Rotary Position Embedding):** This is one of the unique parts of my model. I adapted HiRoPE from code-specific research to work for conversations. I split the head dimensions into:
+   - **Local stream (base=10k):** Handles fine-grained context within a single turn.
+   - **Global stream (base=500k):** Tracks coarse context across the entire dialogue history.
+ - **Tag-Aware Loss Curriculum:** Another unique experiment. I don't treat every token equally. I use a dynamic weighting scheme for `<think>`, `<assistant>`, and `<human>` tags that shifts as the model progresses through different training phases.
+ - **Weight Tying:** I tied the embedding and LM head weights, saving about 33M parameters.
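To make the two-stream HiRoPE idea concrete, here is a minimal sketch of how rotary frequencies for a split head could be computed. It is an illustration, not the code in `model.py`: the head dimension of 64, the even 50/50 split between streams, the use of the same token position for both streams, and the helper names `rope_inv_freq`, `hirope_inv_freq`, and `rotate_pair` are all assumptions. Only the two bases (10k and 500k) come from the description above.

```python
import math
from typing import List, Tuple


def rope_inv_freq(dim: int, base: float) -> List[float]:
    """Standard RoPE inverse frequencies: one value per channel pair."""
    if dim % 2 != 0:
        raise ValueError("RoPE rotates channel pairs, so dim must be even")
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def hirope_inv_freq(
    head_dim: int = 64,             # assumed head size, not stated in the README
    local_base: float = 10_000.0,   # from the README: local stream
    global_base: float = 500_000.0  # from the README: global stream
) -> Tuple[List[float], List[float]]:
    """Assumed layout: first half of the head is local, second half is global."""
    half = head_dim // 2
    return rope_inv_freq(half, local_base), rope_inv_freq(half, global_base)


def rotate_pair(x0: float, x1: float, position: int, inv_freq: float) -> Tuple[float, float]:
    """Rotate one (x0, x1) channel pair by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


local_freqs, global_freqs = hirope_inv_freq()

# Wavelength (in tokens) of the slowest-rotating pair in each stream.
local_wavelength = 2 * math.pi / local_freqs[-1]
global_wavelength = 2 * math.pi / global_freqs[-1]
print(f"slowest local wavelength:  {local_wavelength:,.0f} tokens")
print(f"slowest global wavelength: {global_wavelength:,.0f} tokens")
```

Under these assumed settings, the slowest global channel rotates far more slowly than the slowest local one, which is the property a long-range "dialogue history" stream would need.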
+
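As a rough check on the KV-cache saving from GQA, the sketch below counts cached bytes for a single sequence. The 24 layers, 16 query heads, and 4 KV heads come from the architecture list; the head dimension of 64, the 2-byte (fp16/bf16) cache entries, and the 2,048-token context are assumptions chosen for illustration.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 24,       # from the README
    n_kv_heads: int = 4,      # from the README (GQA)
    head_dim: int = 64,       # assumption
    bytes_per_elem: int = 2,  # assumption: fp16/bf16 cache
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per KV head, per layer, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


gqa = kv_cache_bytes(seq_len=2048, n_kv_heads=4)   # 4 shared KV heads
mha = kv_cache_bytes(seq_len=2048, n_kv_heads=16)  # one KV head per query head
ratio = mha / gqa

print(f"GQA: {gqa / 2**20:.1f} MiB, MHA: {mha / 2**20:.1f} MiB, ratio: {ratio:.0f}x")
# prints "GQA: 48.0 MiB, MHA: 192.0 MiB, ratio: 4x"
```

The 4x ratio holds regardless of the assumed head size or context length, because only the number of KV heads differs between the two cases.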
+ ## 🚀 Training Journey
+
+ I trained the model on a total of **3.368 billion tokens** using Modal's H100 GPU infrastructure. I split the training into two distinct phases: the first builds knowledge, the second shapes personality.
+
+ ### Phase 1: Knowledge Foundation (Pretraining)
+
+ - **The Data:** I used a heavy mix of **Wikipedia EN** for factual depth and **TinyStories** to help a smaller model like this develop better reasoning and narrative coherence.
+ - **Cost:** Total of **$22.80** (this includes all the CPU-based data preparation and the H100 pretraining run).
+
+ ![Phase 1 Training Loss](plots/phase1b_loss.png)
+
+ ### Phase 2: Conversational Polish (SFT)
+
+ - **The Data:** A curated instruction set including **Alpaca, OASST1, Dolly, DailyDialog**, and my own custom **Identity Seeds** to anchor the NovaMind persona.
+ - **Cost:** Only **$1.21** on a Modal H100.
+
+ ![Phase 2 SFT Loss](plots/phase2_sft_loss.png)
+
+ > [!TIP]
+ > Modal gives $30 of free credit every month. Since the entire training cost me $24.01 ($22.80 + $1.21), I basically trained a model on 3.3B tokens for free.
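The cost figures can be checked with a few lines of arithmetic. This sketch uses only numbers quoted in this README ($22.80, $1.21, the $30 monthly credit, and 3.368B tokens); the variable names are illustrative, and the per-billion-token figure blends CPU data preparation with GPU time because Phase 1 is reported as a single total.

```python
from decimal import Decimal

phase1_cost = Decimal("22.80")     # data prep + H100 pretraining
phase2_cost = Decimal("1.21")      # SFT on an H100
monthly_credit = Decimal("30.00")  # Modal free credit quoted above
tokens_billion = Decimal("3.368")  # total training tokens, in billions

total_cost = phase1_cost + phase2_cost
credit_left = monthly_credit - total_cost
cost_per_billion = (total_cost / tokens_billion).quantize(Decimal("0.01"))

print(f"total: ${total_cost}, credit left: ${credit_left}, "
      f"per billion tokens: ${cost_per_billion}")
# prints "total: $24.01, credit left: $5.99, per billion tokens: $7.13"
```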
+
+ ## 💬 Chatting with NovaMind
+
+ I wrote a rich terminal interface to interact with the final SFT checkpoint. It supports multi-turn history (keeping the last 6 turns), auto-detects whether you're on a Mac (MPS) or a CUDA GPU, and handles stop tokens so the model stops at the end of its reply instead of generating runaway text.
+
+ ```bash
+ # How I start the chat
+ python chat.py
+ ```
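The behaviours described above (a six-turn history window, device auto-detection, and stop-token handling) could be sketched as follows. This is not the contents of `chat.py`: the function names, the `<human>`/`<assistant>` prompt layout, and the choice of `<human>` as a stop string are assumptions, and the device check falls back to CPU when PyTorch is not installed.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

MAX_TURNS = 6  # the README keeps the last 6 turns


def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def build_prompt(history: Iterable[Tuple[str, str]], user_msg: str) -> str:
    """Concatenate past turns and the new message (assumed tag layout)."""
    parts: List[str] = [f"<human>{h}<assistant>{a}" for h, a in history]
    parts.append(f"<human>{user_msg}<assistant>")
    return "".join(parts)


def truncate_at_stop(text: str, stop_strings: Iterable[str]) -> str:
    """Cut generated text at the earliest stop string, if any appears."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


# A deque with maxlen drops the oldest turn automatically.
history: Deque[Tuple[str, str]] = deque(maxlen=MAX_TURNS)
for i in range(8):
    history.append((f"question {i}", f"answer {i}"))

reply = truncate_at_stop("Sure, here you go.<human>And then?", ["<human>"])
```

Using `deque(maxlen=...)` keeps the trimming logic out of the chat loop, and truncating on a stop string is one simple way to prevent the model from writing the user's next turn itself.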
+
+ ![My Chat Interface](image/chat.png)
+
+ ## 📁 What's Inside
+
+ - `model.py`: My core architecture (`NovaMind256M` & `NovaMindConfig`).
+ - `train.py`: The heart of the training loop.
+ - `modal_novamind.py`: My Modal deployment config for remote GPU execution.
+ - `prepare_data_*.py`: How I tokenized the 3.3B-token corpus on CPU.
+ - `chat.py`: The interactive REPL I use for testing.
+ - `plots/`: Where I keep my training visualizations.
+
+ ## 🛠️ How to use it
+
+ Make sure you have `novamind_sft_final.pt` in the `checkpoints/` folder.
+
+ ### Run the REPL
+
+ ```bash
+ python chat.py --temp 0.7 --max_tokens 400
+ ```
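For readers unfamiliar with the two flags: `--temp` conventionally rescales the logits before sampling, and `--max_tokens` caps how many tokens are generated. The sketch below shows temperature-scaled sampling in plain Python. It assumes `chat.py` follows this common convention, and `softmax_with_temperature` and `sample_token` are illustrative names rather than functions from the repository.

```python
import math
import random
from typing import List, Optional


def softmax_with_temperature(logits: List[float], temperature: float = 0.7) -> List[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(
    logits: List[float],
    temperature: float = 0.7,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token index from the temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # toy scores for a 3-token vocabulary
probs_at_0_7 = softmax_with_temperature(logits, 0.7)
probs_at_1_0 = softmax_with_temperature(logits, 1.0)
token = sample_token(logits, temperature=0.7, rng=random.Random(0))
```

At a temperature of 0.7 the top-scoring token receives more probability mass than at 1.0, so replies are more focused while still keeping some variety.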