Instructions for using MiniMaxAI/MiniMax-M2.1 with libraries and local apps.

Libraries

How to use MiniMaxAI/MiniMax-M2.1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MiniMaxAI/MiniMax-M2.1", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M2.1", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("MiniMaxAI/MiniMax-M2.1", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use MiniMaxAI/MiniMax-M2.1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MiniMaxAI/MiniMax-M2.1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M2.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
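The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the base URL assumes the server started above is listening on localhost port 8000.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server; the URL is an assumption):
# request = build_chat_request(
#     "http://localhost:8000", "MiniMaxAI/MiniMax-M2.1", "What is the capital of France?"
# )
# print(send_chat_request(request))
```

Keeping request construction separate from sending makes it easy to point the same code at a different host or port, such as the SGLang server on port 30000.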

Use Docker

docker model run hf.co/MiniMaxAI/MiniMax-M2.1

SGLang

How to use MiniMaxAI/MiniMax-M2.1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MiniMaxAI/MiniMax-M2.1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M2.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MiniMaxAI/MiniMax-M2.1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M2.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
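The servers above describe their API as OpenAI-compatible, so the reply text should sit under `choices[0].message.content` in the returned JSON. A minimal sketch of pulling it out follows. `extract_reply` is a hypothetical helper, and the sample response is hand-written for illustration, not captured from a real server.

```python
def extract_reply(response):
    """Return the assistant text from a chat completions response dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    message = choices[0].get("message") or {}
    # content can be missing or null in some responses; fall back to ""
    return message.get("content") or ""


# Hand-written example in the OpenAI chat completions shape (illustrative only)
sample_response = {
    "model": "MiniMaxAI/MiniMax-M2.1",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ],
}

print(extract_reply(sample_response))
```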

Docker Model Runner
How to use MiniMaxAI/MiniMax-M2.1 with Docker Model Runner:
```
docker model run hf.co/MiniMaxAI/MiniMax-M2.1
```

Source file: docs/mlx_deploy_guide.md in MiniMax-M2.1 (commit 927ea2b, "add mlx (#8)", by rogeryoungh)


MLX deployment guide

Run, serve, and fine-tune MiniMax-M2.1 locally on a Mac with the MLX framework. This guide covers installation, command-line generation, and use from Python.

Requirements

Apple Silicon Mac (M3 Ultra or later)

At least 256 GB of unified memory (RAM)
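To see roughly why the memory requirement is this high, weight storage can be estimated as parameters times bits per weight divided by eight. The sketch below is back-of-the-envelope arithmetic only. The parameter count of 230 billion is an assumption not stated in this guide (check the model card), and the estimate ignores quantization scales, the KV cache, and runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: about 230 billion total parameters; confirm against the model card
ASSUMED_PARAMS = 230e9

for bits in (4, 6, 8, 16):
    size = weight_memory_gb(ASSUMED_PARAMS, bits)
    print(f"{bits}-bit weights: about {size:.1f} GB")
```

Under that assumed parameter count, 4-bit weights alone come to about 115 GB and 8-bit weights to about 230 GB, before any memory for the context or the operating system.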

Installation

Install the mlx-lm package via pip:

pip install -U mlx-lm

CLI

Generate text directly from the terminal:

mlx_lm.generate \
  --model mlx-community/MiniMax-M2.1-4bit \
  --prompt "How tall is Mount Everest?"

Add --max-tokens 256 to cap the response length, or --temp 0.7 to raise the sampling temperature for more varied output.

Python Script Example

Use mlx-lm in your own Python scripts:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load the quantized model
model, tokenizer = load("mlx-community/MiniMax-M2.1-4bit")

prompt = "Hello, how are you?"

# Apply chat template if available (recommended for chat models)
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

# Recent mlx-lm releases take sampling settings through a sampler object
# rather than a temp= keyword on generate()
sampler = make_sampler(temp=0.7)

# Generate response
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=256,
    sampler=sampler,
    verbose=False,  # set to True to stream the text and print timing stats
)

print(response)
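The script above handles a single prompt. For a multi-turn conversation, one option is to keep the message history in a list and re-apply the chat template on every turn. The sketch below is one way to structure that. `chat_turn` and `make_mlx_reply_fn` are hypothetical helpers, and the generation call is passed in as a function so the history logic can be exercised without loading the model.

```python
def chat_turn(history, user_text, reply_fn):
    """Append a user message, obtain a reply, and record it in the history."""
    history.append({"role": "user", "content": user_text})
    reply = reply_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def make_mlx_reply_fn(model, tokenizer, max_tokens=256):
    """Wrap mlx_lm.generate so it answers a full message history."""
    # Imported lazily: requires mlx-lm, which runs on Apple Silicon
    from mlx_lm import generate

    def reply_fn(messages):
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)

    return reply_fn


# Usage, assuming model and tokenizer were loaded as in the script above:
# reply_fn = make_mlx_reply_fn(model, tokenizer)
# history = []
# print(chat_turn(history, "Hello, how are you?", reply_fn))
# print(chat_turn(history, "Summarize that in five words.", reply_fn))
```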

Tips

Model variants: See the MLX community collection on Hugging Face for MiniMax-M2.1 in 4bit, 6bit, 8bit, or bfloat16 versions.
Fine-tuning: Use mlx_lm.lora for parameter-efficient fine-tuning (PEFT).
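As a starting point for fine-tuning, the sketch below writes a small chat dataset to disk. It assumes the layout described in the mlx-lm LoRA documentation: a data directory containing `train.jsonl` and `valid.jsonl`, with one `{"messages": [...]}` object per line. Treat that layout as an assumption and confirm it with `mlx_lm.lora --help` for your installed version. `write_chat_dataset` is a hypothetical helper.

```python
import json
from pathlib import Path


def write_chat_dataset(examples, out_dir, valid_fraction=0.1):
    """Write (user, assistant) pairs as chat-format JSONL train/valid splits."""
    if len(examples) < 2:
        raise ValueError("need at least two examples to form both splits")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Hold out at least one example for validation
    n_valid = max(1, int(len(examples) * valid_fraction))
    splits = {
        "valid.jsonl": examples[:n_valid],
        "train.jsonl": examples[n_valid:],
    }

    for name, rows in splits.items():
        with open(out / name, "w", encoding="utf-8") as f:
            for user_text, assistant_text in rows:
                record = {
                    "messages": [
                        {"role": "user", "content": user_text},
                        {"role": "assistant", "content": assistant_text},
                    ]
                }
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

    return {name: len(rows) for name, rows in splits.items()}
```

With the files in place, a training run might look like `mlx_lm.lora --model mlx-community/MiniMax-M2.1-4bit --train --data ./lora_data`, where `./lora_data` is a placeholder for the output directory.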

Resources

GitHub: https://github.com/ml-explore/mlx-lm
Models: https://huggingface.co/mlx-community