Micro Language Model
Attention-Free • MLP-Only • Byte-Level • Conversational
MicroMixer-2-100K-discord-dialogues is a ~125K-parameter MLP-Mixer language model trained on Discord conversation data. It is the smallest variant in the MicroMixer-2 family: 3 mixer layers and compact dimensions, intended for rapid experimentation and testing.
```mermaid
graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[MicroMixerLayer ×3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff
```
| Parameter | Value |
|---|---|
| Total Parameters | 124,764 |
| Hidden Dimension | 84 |
| Hyper Hidden Dimension | 48 |
| Channel MLP Dimension | 128 |
| Number of Layers | 3 |
| Max Sequence Length | 64 |
| Vocabulary Size | 256 (Byte-level) |
| DropPath Rate | 0.0 |
| Label Smoothing | 0.0 |
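A vocabulary of 256 means each token is one raw byte, so tokenization needs no learned vocabulary. The following is a minimal sketch of what byte-level encoding and decoding can look like. `SimpleByteTokenizer` is a hypothetical stand-in written for illustration; it is not the repository's `ByteTokenizer`, whose behavior may differ (for example, around special tokens).

```python
class SimpleByteTokenizer:
    """Hypothetical stand-in for illustration; not the repository's ByteTokenizer."""

    vocab_size = 256  # one ID per possible byte value

    def encode(self, text: str) -> list[int]:
        # UTF-8 turns each character into 1 to 4 bytes; each byte is a token ID.
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        # Generated bytes may not form valid UTF-8, so replace bad sequences
        # instead of raising.
        return bytes(ids).decode("utf-8", errors="replace")


tok = SimpleByteTokenizer()
ids = tok.encode("User: Hello\nAssistant:")
print(len(ids))   # 22
print(ids[:5])    # [85, 115, 101, 114, 58]
```

Because non-ASCII characters expand to several bytes, a 64-token window holds at most 64 ASCII characters and fewer characters of other scripts.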
```
┌─────────────────────────────────────────────┐
│               MicroMixerLayer               │
│  ┌───────────────────────────────────────┐  │
│  │ LayerNorm → HyperMixing → Residual    │  │ ← Token Mixing
│  ├───────────────────────────────────────┤  │
│  │ LayerNorm → MlpBlock → Residual       │  │ ← Channel Mixing
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
```
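The two sub-blocks in the diagram can be read as pre-norm residual updates: normalize, transform, add back to the input. The sketch below shows that control flow in plain Python on nested lists. Several things here are assumptions: the pre-norm reading itself, the bias-free `Linear → GELU → Linear` channel MLP, and above all `causal_mean_mixer`, which is a simple placeholder and not the model's HyperMixing operator. All function names are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean, unit variance (no learned scale)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def gelu(x):
    """Exact GELU via the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_mlp(w_in, w_out):
    """Build a bias-free Linear -> GELU -> Linear map over one token vector."""
    def mlp(vec):
        hidden = [gelu(sum(w * v for w, v in zip(row, vec))) for row in w_in]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return mlp


def causal_mean_mixer(seq):
    """Placeholder token mixer (NOT HyperMixing): average positions 0..t."""
    out, running = [], [0.0] * len(seq[0])
    for t, vec in enumerate(seq, start=1):
        running = [r + v for r, v in zip(running, vec)]
        out.append([r / t for r in running])
    return out


def mixer_layer(seq, token_mixer, channel_mlp):
    """Apply both residual sub-blocks to a [seq_len][dim] nested list."""
    # Token mixing: information moves across positions.
    mixed = token_mixer([layer_norm(v) for v in seq])
    seq = [[x + m for x, m in zip(xs, ms)] for xs, ms in zip(seq, mixed)]
    # Channel mixing: the MLP acts on each position independently.
    updates = [channel_mlp(layer_norm(v)) for v in seq]
    return [[x + u for x, u in zip(xs, us)] for xs, us in zip(seq, updates)]


seq = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]        # 2 positions, 3 channels
w_in = [[0.1, 0.0, -0.1], [0.2, 0.1, 0.0]]       # 2 hidden units x 3 channels
w_out = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]]     # 3 channels x 2 hidden units
out = mixer_layer(seq, causal_mean_mixer, make_mlp(w_in, w_out))
```

The residual additions mean that a layer whose mixer and MLP both output zeros passes its input through unchanged, which is one reason this layout trains stably even at very small widths.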
Linear → GELU → Linear
[Prompt] User: Hello
[Output] Assistant:ing when make what of your thing acke in conost im
[Prompt] User: How are you?
[Output] Assistant:
Aser: Your you got the loo
Assistant: Me I molin
[Prompt] User: What is your name?
[Output] Assistant:i mout good ind the do peale ponthon me, por isout
| Metric | Value |
|---|---|
| Train Loss | 1.6519 |
| Train PPL | 5.22 |
| Val Loss | 1.5910 |
| Val PPL | 4.91 |
| Epoch | 3 |
| Global Steps | 140 |
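The perplexity rows follow from the loss rows, assuming the reported loss is the mean cross-entropy per byte in nats (the usual convention, but not stated in this card). A short check of that relationship, with `perplexity` and `bits_per_byte` as illustrative helper names:

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy in nats."""
    return math.exp(loss_nats)


def bits_per_byte(loss_nats: float) -> float:
    """Convert nats per byte to bits per byte."""
    return loss_nats / math.log(2)


train_ppl = perplexity(1.6519)
val_ppl = perplexity(1.5910)
print(f"{train_ppl:.2f} {val_ppl:.2f}")  # 5.22 4.91
val_bpb = bits_per_byte(1.5910)
```

A validation perplexity of 4.91 over a 256-symbol vocabulary means the model is, on average, about as uncertain as a uniform choice among roughly five bytes.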
Dataset: mookiezi/Discord-Dialogues
```python
# Clone the repository first so that `src` is importable:
#   git clone https://github.com/llaa33219/MicroMixer-2.git
#   cd MicroMixer-2
import torch
from huggingface_hub import hf_hub_download

from src.model import MicroMixer, MicroMixerConfig
from src.tokenizer import ByteTokenizer

# Architecture hyperparameters must match the released checkpoint.
config = MicroMixerConfig(
    max_seq_len=64,
    hidden_dim=84,
    hyper_hidden_dim=48,
    channel_mlp_dim=128,
    num_layers=3,
)
model = MicroMixer(config)

# Download the weights from the Hugging Face Hub and load them on CPU.
weights_path = hf_hub_download("llaa33219/MicroMixer-2-100K-discord-dialogues", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Encode the prompt as bytes and add a batch dimension.
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("User: Hello\nAssistant:")])

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.7, top_k=40)

print(tokenizer.decode(output[0].tolist()))
```
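The generation call passes `temperature=0.7` and `top_k=40`. How `model.generate` implements these internally is not shown in this card, so the following is only a generic sketch of temperature-scaled top-k sampling over one step's logits. `sample_top_k` is a hypothetical helper, not part of the MicroMixer-2 code.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Sample one token ID from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = max(1, min(top_k, len(logits)))
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by a temperature below 1 sharpens the distribution.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights itself.
    return rng.choices(top, weights=weights, k=1)[0]


# Example with a fake 256-way logit vector (one entry per byte value).
rng = random.Random(0)
fake_logits = [rng.gauss(0.0, 1.0) for _ in range(256)]
next_byte = sample_top_k(fake_logits, temperature=0.7, top_k=40, rng=rng)
```

With 256 possible bytes, `top_k=40` discards the 216 least likely candidates at each step, which limits but does not eliminate the misspellings typical of a model this small.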
| Limitation | Description |
|---|---|
| Very Small Model | Only ~125K parameters |
| Very Limited Context | Maximum sequence length of 64 tokens (64 bytes) |
| Minimal Capacity | Insufficient for coherent language generation |
| Research Use Only | Primarily for architecture testing |
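One practical consequence of the 64-token limit is that longer prompts have to be cropped before they reach the model. Whether `model.generate` does this internally is not documented in this card; the sketch below shows one way to do it by hand, keeping the most recent bytes. `crop_context` is a hypothetical helper.

```python
MAX_SEQ_LEN = 64  # from the configuration table


def crop_context(byte_ids, max_len=MAX_SEQ_LEN):
    """Keep only the last max_len byte IDs so the input fits the model window."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return byte_ids[-max_len:]


long_prompt = list(("User: " + "hello " * 20 + "\nAssistant:").encode("utf-8"))
window = crop_context(long_prompt)
print(len(long_prompt), len(window))  # 137 64
```

Cropping from the left keeps the trailing `Assistant:` cue, at the cost of dropping the start of the conversation.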
Part of the MicroMixer-2 research project