Instructions to use IEITYuan/Yuan2-M32-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use IEITYuan/Yuan2-M32-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="IEITYuan/Yuan2-M32-gguf",
	filename="Yuan2-moe_0526-000.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use IEITYuan/Yuan2-M32-gguf with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf IEITYuan/Yuan2-M32-gguf
# Run inference directly in the terminal:
llama-cli -hf IEITYuan/Yuan2-M32-gguf

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf IEITYuan/Yuan2-M32-gguf
# Run inference directly in the terminal:
llama-cli -hf IEITYuan/Yuan2-M32-gguf

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf IEITYuan/Yuan2-M32-gguf
# Run inference directly in the terminal:
./llama-cli -hf IEITYuan/Yuan2-M32-gguf

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf IEITYuan/Yuan2-M32-gguf
# Run inference directly in the terminal:
./build/bin/llama-cli -hf IEITYuan/Yuan2-M32-gguf

Use Docker

docker model run hf.co/IEITYuan/Yuan2-M32-gguf

LM Studio
Jan
Ollama
How to use IEITYuan/Yuan2-M32-gguf with Ollama:
```
ollama run hf.co/IEITYuan/Yuan2-M32-gguf
```

Unsloth Studio new

How to use IEITYuan/Yuan2-M32-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for IEITYuan/Yuan2-M32-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for IEITYuan/Yuan2-M32-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for IEITYuan/Yuan2-M32-gguf to start chatting

Docker Model Runner
How to use IEITYuan/Yuan2-M32-gguf with Docker Model Runner:
```
docker model run hf.co/IEITYuan/Yuan2-M32-gguf
```

Lemonade

How to use IEITYuan/Yuan2-M32-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull IEITYuan/Yuan2-M32-gguf

Run and chat with the model

lemonade run user.Yuan2-M32-gguf-{{QUANT_TAG}}

List all available models

lemonade list

Yuan2.0-M32: Mixture of Experts with Attention Router

🌎 GitHub • 🤗 Hugging Face • 💬 WeChat• 📎 Yuan2.0-M32 Paper

🚀🚀 llama.cpp for Yuan2.0 LLMs

1. Introduction

Yuan2.0-M32 is a Mixture-of-Experts (MoE) language model with 32 experts, of which 2 are active. A new router network, Attention Router, is proposed and has been adopted for more efficient expert selection, boosting accuracy by 3.8% over models using a classical router network. Yuan 2.0-M32 is trained from scratch with 2000B tokens, and its training computation is only 9.25% of that required by a dense model of the same parameter scale. Demonstrating competitive capabilities in coding, math, and various specialized fields, Yuan2.0-M32 operates with only 3.7B active parameters out of a total 40B, and a forward computation of 7.4 GFLOPS per token, which is just 1/19th of Llama3-70B's requirement. Yuan 2.0-M32 has surpassed Llama3-70B on the MATH and ARC-Challenge benchmarks, achieving accuracies of 55.9% and 95.8%, respectively. The basic information of the Yuan2.0-M32 model is as follows:

Total Parameters ： 40B
Experts： 32
Active Experts： 2
Active Parameters： 3.7B
Training Tokens： 2000B tokens
Sequence Length： 16K

The technical report for the Yuan2.0-M32 model has been released, and you can find more detailed technical information and evaluation results by referring to the paper.

2. Model Downloads

Model	Sequence Length	Type	Download
Yuan2.0-M32	16K	Megatron	HuggingFace
Yuan2.0-M32-HF	16K	HuggingFace	HuggingFace
Yuan2.0-M32-GGUF	16K	GGUF	HuggingFace
Yuan2.0-M32-GGUF-INT4	16K	GGUF	HuggingFace

3. Evaluation

3.1 Benchmarks 🏆

We conducted a thorough evaluation of the Yuan2.0-M32 model across a range of benchmarks, including HumanEval, GSM8K, MMLU, Math, and ARC-Challenge. These benchmarks are designed to test the model's proficiency in key areas such as natural language understanding, knowledge acquisition, mathematical computation and reasoning, and code generation. The Yuan2.0-M32 has shown a consistent and significant advantage over other models like Llama3-8B and Mistral-8×7B, excelling in all evaluated tasks. Remarkably, its overall performance is on par with the more substantial Llama3-70B model.The detailed evaluation results are outlined in the subsequent table.

Model	HumanEval	GSM8K	MMLU	Math	ARC-C*
Llama3-70B	81.7%	93%	80.3	50.4%	93.3%
Llama3-8B	62.2%	79.6%	68.4%	30%	78.6%
Phi-3-medium	62.2%	91.0%	78.0%	-	91.6%
Phi-3-small	61%	89.6%	75.7%	-	90.7%
Phi-3-mini	58.5%	82.5%	68.8%	-	84.9%
Mistral-8*22B	45.1%	78.6%	77.8%	41,8%	91.3%
Mistral-8*7B	40.2%	58.4%	70.86%	28.4%	85.9%
Yuan2.0-M32	74.4%	92.7%	72.2%	55.9%	95.8%

* ARC-C: AI2 Reasoning Challenge (ARC) benchmark contains more complex parts that need further reasoning.

3.2 Computational Utilization for Model

Model	Params (B)	Active Params (B)	GFLOPs/token (Inference)	GFLOPS/token (Fine-tune)	Mean Accuracy	Average Accuracy/GFLOPSs per token (Inference)
Llama3-70B	70	70	140	420	79.25	0.57
Llama3-8B	8	8	16	48	64.15	4.00
Mistral-8*22B	141	39	78	234	72.38	0.93
Mistral-8*7B	47	12.9	25.8	77.3	60.83	2.36
Yuan2.0-M32	40	3.7	7.4	22.2	79.15	10.69

4. Quick Start

4.1 Environment Config

We strongly recommend using the latest release of docker images of Yuan2.0-M32.You can launch an instance of the Yuan 2.0 container with the following Docker commands:

docker pull yuanmodel/yuan2.0:m32
docker run --gpus all --privileged --ulimit stack=68719476736 --shm-size=1000G -itd -v /path/to/yuan_2.0:/workspace/yuan_2.0 -v /path/to/dataset:/workspace/dataset -v /path/to/checkpoints:/workspace/checkpoints --name your_name yuanmodel/yuan2.0:m32
docker exec -it your_name bash

4.2 Data Preprocess

We have provided the data preprocess script. See documentation here.

4.3 Model Pretrain

We've provided several scripts for pretraining in the example. The details can be seen from documentation here.

4.4 Inference Service

For a detailed deployment plan, please refer to vllm.

For more information, please refer to GitHub repository.

5. Statement of Agreement

The use of the source code in this repository requires compliance with the open source license agreement Apache 2.0. The Yuan2.0 model supports commercial use and does not require authorization. Please understand and comply with the 《Yuan2.0 Model License Agreement》. Do not use the open source model and code, as well as derivatives generated from open source projects, for any purposes that may cause harm to the country and society, or for any services that have not undergone security assessment and filing. Although we have taken measures to ensure the compliance and accuracy of the data during training, the model has a huge number of parameters and is affected by probability and randomness factors. We cannot guarantee the accuracy of the output content, and the model is easily misled by input instructions. This project does not assume any data security, public opinion risks, or any model misleading, abusing, spreading caused by open-source models and code Risks and responsibilities arising from improper utilization You will be solely responsible for the risks and consequences arising from the use, copying, distribution, and modification of the model in this open source project

6. Contact Us

If you have any questions, please raise an issue or contact us at air_service@ieisystem.com

Downloads last month: 119

GGUF

Model size

40B params

Architecture

yuan2_moe

Hardware compatibility

We're not able to determine the quantization variants.

View all variants

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for IEITYuan/Yuan2-M32-gguf

Yuan 2.0-M32: Mixture of Experts with Attention Router

Paper • 2405.17976 • Published May 28, 2024 • 21