MicroMixer-2-100K-discord-dialogues

Micro Language Model
Attention-Free • MLP-Only • Byte-Level • Conversational

📋 Overview

MicroMixer-2-100K-discord-dialogues is a ~125K parameter MLP-Mixer language model trained on Discord conversation data. This is the smallest model variant in the MicroMixer-2 family, designed for rapid experimentation and testing with 3 mixer layers and compact dimensions.

🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[MicroMixerLayer ×3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

Parameter	Value
Total Parameters	`124,764`
Hidden Dimension	`84`
Hyper Hidden Dimension	`48`
Channel MLP Dimension	`128`
Number of Layers	`3`
Max Sequence Length	`64`
Vocabulary Size	`256` (Byte-level)
DropPath Rate	`0.0`
Label Smoothing	`0.0`

Core Components

┌─────────────────────────────────────────────┐
│           MicroMixerLayer                    │
│  ┌─────────────────────────────────────┐    │
│  │  LayerNorm → HyperMixing → Residual │    │ ← Token Mixing
│  ├─────────────────────────────────────┤    │
│  │  LayerNorm → MlpBlock → Residual    │    │ ← Channel Mixing
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

1️⃣ RoPE (Rotary Position Embedding)

Encodes positions via rotation transformations
Enables length extrapolation beyond training sequences

2️⃣ HyperMixing (Token Mixing)

Compresses past context via cumulative average pooling
Hypernetwork generates adaptive weights
O(S) complexity token mixing without attention

3️⃣ MlpBlock (Channel Mixing)

Non-linear transformation of feature dimensions
Structure: Linear → GELU → Linear

🎯 Generation Examples

[Prompt] User: Hello
[Output] Assistant:ing when make what of your thing acke in conost im

[Prompt] User: How are you?
[Output] Assistant:
Aser: Your you got the loo
Assistant: Me I molin

[Prompt] User: What is your name?
[Output] Assistant:i mout good ind the do peale ponthon me, por isout

📊 Training Results

Metric	Value
Train Loss	1.6519
Train PPL	5.22
Val Loss	1.5910
Val PPL	4.91
Epoch	3
Global Steps	140

📊 Training Data

Dataset: Discord-Dialogues

Discord conversation samples
Converted from ChatML to User/Assistant format
Multi-turn conversational data
Sequence length: 64 tokens

🔧 Usage

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixer, MicroMixerConfig
from src.tokenizer import ByteTokenizer

# Clone the repository first:
# git clone https://github.com/llaa33219/MicroMixer-2.git
# cd MicroMixer-2

config = MicroMixerConfig(
    max_seq_len=64,
    hidden_dim=84,
    hyper_hidden_dim=48,
    channel_mlp_dim=128,
    num_layers=3,
)

model = MicroMixer(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-2-100K-discord-dialogues", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("User: Hello
Assistant:")])

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.7, top_k=40)

print(tokenizer.decode(output[0].tolist()))

⚠️ Limitations

Limitation	Description
Very Small Model	Only ~125K parameters
Very Limited Context	Max 64 tokens sequence length
Minimal Capacity	Insufficient for coherent language generation
Research Use Only	Primarily for architecture testing

_{Part of the MicroMixer-2 research project}

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

llaa33219
/

MicroMixer-2-100K-discord-dialogues