MicroMixer-2 Logo

MicroMixer-2-1M-discord-dialogues

Parameters Architecture Dataset

Micro Language Model
Attention-Free β€’ MLP-Only β€’ Byte-Level β€’ Conversational

GitHub


πŸ“‹ Overview

MicroMixer-2-1M-discord-dialogues is a ~1M parameter MLP-Mixer language model trained on Discord conversation data. V4 introduces DropPath regularization, label smoothing, and padding-aware loss for better training stability. It generates conversational text in a User/Assistant format without using any attention mechanisms.


πŸ—οΈ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[MicroMixerLayer Γ—5]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

Parameter Value
Total Parameters1,016,204
Hidden Dimension168
Hyper Hidden Dimension84
Channel MLP Dimension448
Number of Layers5
Max Sequence Length4096
Vocabulary Size256 (Byte-level)
DropPath Rate0.1
Label Smoothing0.1

Core Components

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           MicroMixerLayer                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  LayerNorm β†’ HyperMixing β†’ Residual β”‚    β”‚ ← Token Mixing
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚
β”‚  β”‚  LayerNorm β†’ MlpBlock β†’ Residual    β”‚    β”‚ ← Channel Mixing
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1️⃣ RoPE (Rotary Position Embedding)

  • Encodes positions via rotation transformations
  • Enables length extrapolation beyond training sequences

2️⃣ HyperMixing (Token Mixing)

  • Compresses past context via cumulative average pooling
  • Hypernetwork generates adaptive weights
  • O(S) complexity token mixing without attention

3️⃣ MlpBlock (Channel Mixing)

  • Non-linear transformation of feature dimensions
  • Structure: Linear β†’ GELU β†’ Linear

4️⃣ V4 Innovations

  • DropPath: Stochastic depth regularization (random residual skipping)
  • Label Smoothing: Prevents overconfident predictions
  • Padding-Aware Loss: Ignores padding tokens in cross-entropy
  • Increased Depth: 5 layers (vs 3 in V3)

🎯 Generation Examples

User: Hello, how are you?
Assistant: Its not a fun of but I can't really find me be on a heread long collecting...

User: What is your favorite color?
Assistant: Not even worth to get voice the time to up them the stuff inel that was i fever...

User: Tell me a joke.
Assistant: I can always sure when your start to and in the mainse of I had say again...

πŸ“Š Training Results

Epoch Train Loss Train PPL Val Loss Val PPL
1 2.89 18.04 2.68 14.65
2 2.57 13.05 2.63 13.88
3 2.54 12.68 2.62 13.73

πŸ“Š Training Data

Dataset: Discord-Dialogues

  • 7.3M Discord conversation samples
  • Converted from ChatML to User/Assistant format
  • Multi-turn conversational data
  • ~239M total tokens

πŸ”§ Usage

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixer, MicroMixerConfig
from src.tokenizer import ByteTokenizer

# Clone the repository first:
# git clone https://github.com/llaa33219/MicroMixer-2.git
# cd MicroMixer-2

config = MicroMixerConfig(
    max_seq_len=4096,
    hidden_dim=168,
    hyper_hidden_dim=84,
    channel_mlp_dim=448,
    num_layers=5,
)

model = MicroMixer(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-2-1M-discord-dialogues", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("User: Hello
Assistant:")])

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, temperature=0.7, top_k=40)

print(tokenizer.decode(output[0].tolist()))

⚠️ Limitations

Limitation Description
Small Model Size Only ~1M parameters
Grammar Issues Generated text has grammatical errors
Repetitive Patterns Tends to repeat learned phrases
Limited Knowledge Trained only on Discord conversations

GitHub

Part of the MicroMixer-2 research project

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Dataset used to train llaa33219/MicroMixer-2-1M-discord-dialogues

Collection including llaa33219/MicroMixer-2-1M-discord-dialogues