MicroMixer-2-1M-discord-dialogues

Micro Language Model
Attention-Free • MLP-Only • Byte-Level • Conversational

📋 Overview

MicroMixer-2-1M-discord-dialogues is a ~1M parameter MLP-Mixer language model trained on Discord conversation data. V4 introduces DropPath regularization, label smoothing, and padding-aware loss for better training stability. It generates conversational text in a User/Assistant format without using any attention mechanisms.

🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[MicroMixerLayer ×5]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

Parameter	Value
Total Parameters	`1,016,204`
Hidden Dimension	`168`
Hyper Hidden Dimension	`84`
Channel MLP Dimension	`448`
Number of Layers	`5`
Max Sequence Length	`4096`
Vocabulary Size	`256` (Byte-level)
DropPath Rate	`0.1`
Label Smoothing	`0.1`

Core Components

┌─────────────────────────────────────────────┐
│           MicroMixerLayer                    │
│  ┌─────────────────────────────────────┐    │
│  │  LayerNorm → HyperMixing → Residual │    │ ← Token Mixing
│  ├─────────────────────────────────────┤    │
│  │  LayerNorm → MlpBlock → Residual    │    │ ← Channel Mixing
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

1️⃣ RoPE (Rotary Position Embedding)

Encodes positions via rotation transformations
Enables length extrapolation beyond training sequences

2️⃣ HyperMixing (Token Mixing)

Compresses past context via cumulative average pooling
Hypernetwork generates adaptive weights
O(S) complexity token mixing without attention

3️⃣ MlpBlock (Channel Mixing)

Non-linear transformation of feature dimensions
Structure: Linear → GELU → Linear

4️⃣ V4 Innovations

DropPath: Stochastic depth regularization (random residual skipping)
Label Smoothing: Prevents overconfident predictions
Padding-Aware Loss: Ignores padding tokens in cross-entropy
Increased Depth: 5 layers (vs 3 in V3)

🎯 Generation Examples

User: Hello, how are you?
Assistant: Its not a fun of but I can't really find me be on a heread long collecting...

User: What is your favorite color?
Assistant: Not even worth to get voice the time to up them the stuff inel that was i fever...

User: Tell me a joke.
Assistant: I can always sure when your start to and in the mainse of I had say again...

📊 Training Results

Epoch	Train Loss	Train PPL	Val Loss	Val PPL
1	2.89	18.04	2.68	14.65
2	2.57	13.05	2.63	13.88
3	2.54	12.68	2.62	13.73

📊 Training Data

Dataset: Discord-Dialogues

7.3M Discord conversation samples
Converted from ChatML to User/Assistant format
Multi-turn conversational data
~239M total tokens

🔧 Usage

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixer, MicroMixerConfig
from src.tokenizer import ByteTokenizer

# Clone the repository first:
# git clone https://github.com/llaa33219/MicroMixer-2.git
# cd MicroMixer-2

config = MicroMixerConfig(
    max_seq_len=4096,
    hidden_dim=168,
    hyper_hidden_dim=84,
    channel_mlp_dim=448,
    num_layers=5,
)

model = MicroMixer(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-2-1M-discord-dialogues", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("User: Hello
Assistant:")])

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, temperature=0.7, top_k=40)

print(tokenizer.decode(output[0].tolist()))

⚠️ Limitations

Limitation	Description
Small Model Size	Only ~1M parameters
Grammar Issues	Generated text has grammatical errors
Repetitive Patterns	Tends to repeat learned phrases
Limited Knowledge	Trained only on Discord conversations

_{Part of the MicroMixer-2 research project}

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

llaa33219
/

MicroMixer-2-1M-discord-dialogues