QuantFactory/magnum-v4-12b-GGUF
This is a quantized version of anthracite-org/magnum-v4-12b, created using llama.cpp.
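As a rough illustration of how a GGUF quant from this repository might be loaded, here is a minimal sketch using llama-cpp-python. The helper names (`quant_filename`, `load_model`, `ask`) are hypothetical, the `n_ctx` value is an arbitrary choice, and the filename pattern is an assumption extrapolated from the Q2_K file (`magnum-v4-12b.Q2_K.gguf`); check the repository's file list before relying on it for other quantization levels.

```python
REPO_ID = "QuantFactory/magnum-v4-12b-GGUF"


def quant_filename(quant: str) -> str:
    # Assumed naming pattern, based on the Q2_K file in the repository.
    return f"magnum-v4-12b.{quant}.gguf"


def load_model(quant: str = "Q2_K", n_ctx: int = 4096):
    # Imported lazily so the helpers above work without llama-cpp-python installed.
    from llama_cpp import Llama

    # Downloads the file from the Hugging Face Hub on first use.
    return Llama.from_pretrained(
        repo_id=REPO_ID,
        filename=quant_filename(quant),
        n_ctx=n_ctx,  # context window for this session (illustrative value)
    )


def ask(llm, question: str) -> str:
    # Sends a single user turn and returns the assistant's text.
    reply = llm.create_chat_completion(
        messages=[{"role": "user", "content": question}]
    )
    return reply["choices"][0]["message"]["content"]
```

Calling `ask(load_model(), "Who are you?")` would download the quantized file and run one chat turn locally.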
Original Model Card
This is a series of models designed to replicate the prose quality of the Claude 3 models, specifically Sonnet and Opus.
This model is fine-tuned on top of mistralai/Mistral-Nemo-Instruct-2407.
Prompting
A typical input would look like this:
<s>[INST] SYSTEM MESSAGE
USER MESSAGE[/INST] ASSISTANT MESSAGE</s>[INST] USER MESSAGE[/INST]
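The layout above can be assembled programmatically. The following is a minimal sketch, not the authors' tooling: the function name `build_prompt` and the `(user, assistant)` turn representation are hypothetical. Many runtimes add the `<s>` token themselves, so the `add_bos` flag is there to avoid doubling it.

```python
from typing import List, Optional, Tuple


def build_prompt(
    system: str,
    turns: List[Tuple[str, Optional[str]]],
    add_bos: bool = True,
) -> str:
    """Render (user, assistant) turns in the format shown above.

    Use None as the assistant part of the last turn to leave the
    prompt open for the model to complete.
    """
    parts = ["<s>"] if add_bos else []
    for index, (user, assistant) in enumerate(turns):
        # The system message shares the first [INST] block with the first user message.
        content = f"{system}\n{user}" if index == 0 and system else user
        parts.append(f"[INST] {content}[/INST]")
        if assistant is not None:
            parts.append(f" {assistant}</s>")
    return "".join(parts)


prompt = build_prompt(
    "SYSTEM MESSAGE",
    [("USER MESSAGE", "ASSISTANT MESSAGE"), ("USER MESSAGE", None)],
)
print(prompt)
```

When a chat-completion API applies the model's built-in chat template, manual formatting like this is usually unnecessary; it is mainly useful for raw completion endpoints.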
SillyTavern templates
Below are Instruct and Context templates for use within SillyTavern.
Context template: the default SillyTavern template works fine.
Instruct template: the default SillyTavern template works fine.
Axolotl config
base_model: mistralai/Mistral-Nemo-Instruct-2407
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
hub_model_id: anthracite-org/magnum-v4-12b-r2
hub_strategy: "all_checkpoints"
push_dataset_to_hub:
hf_use_auth_token: true
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
- path: anthracite-org/c2_logs_32k_llama3_qwen2_v1.2_no_system
  type: custommistralv3tekken
- path: anthracite-org/kalo-opus-instruct-22k-no-refusal-no-system
  type: custommistralv3tekken
- path: anthracite-org/kalo-opus-instruct-3k-filtered-no-system
  type: custommistralv3tekken
- path: anthracite-org/nopm_claude_writing_fixed
  type: custommistralv3tekken
- path: anthracite-org/kalo_opus_misc_240827_no_system
  type: custommistralv3tekken
- path: anthracite-org/kalo_misc_part2_no_system
  type: custommistralv3tekken
#chat_template: chatml
shuffle_merged_datasets: true
#default_system_message: "You are an assistant that responds to the user."
dataset_prepared_path: /workspace/data/magnum-12b-data
val_set_size: 0.0
output_dir: /workspace/data/12b-fft-out
sequence_len: 32768
sample_packing: true
pad_to_sequence_len: true
adapter:
lora_model_dir:
lora_r:
lora_alpha:
lora_dropout:
lora_target_linear:
lora_fan_in_fan_out:
wandb_project: 12b-magnum-fft
wandb_entity:
wandb_watch:
wandb_name: v4-r2-attempt-01
wandb_log_model:
gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00001
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 40
evals_per_epoch:
eval_table_size:
eval_max_new_tokens:
saves_per_epoch: 2
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  pad_token: <pad>
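To make the `lr_scheduler: cosine`, `warmup_steps: 40`, and `learning_rate: 0.00001` settings more concrete, here is a sketch of a common linear-warmup-then-cosine-decay schedule, the same shape as `get_cosine_schedule_with_warmup` in Hugging Face Transformers. Whether the training run followed exactly this curve is an assumption, and `TOTAL_STEPS` is a hypothetical placeholder because the config does not state the total number of optimizer steps.

```python
import math

PEAK_LR = 0.00001    # learning_rate from the config
WARMUP_STEPS = 40    # warmup_steps from the config
TOTAL_STEPS = 1000   # hypothetical; the real step count is not given


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 20, 40, 520, 1000):
    print(step, lr_at(step))
```

Under these assumptions the rate climbs linearly to its peak at step 40, falls to half the peak midway through the remaining steps, and reaches zero at the final step.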
Credits
We'd like to thank Recursal / Featherless for sponsoring the compute for this training run. Featherless has hosted our Magnum models since the first 72B, has given thousands of people access to our models, and has helped us grow.
We would also like to thank all members of Anthracite who made this finetune possible.
Datasets
- anthracite-org/c2_logs_32k_llama3_qwen2_v1.2_no_system
- anthracite-org/kalo-opus-instruct-22k-no-refusal-no-system
- anthracite-org/kalo-opus-instruct-3k-filtered-no-system
- anthracite-org/nopm_claude_writing_fixed
- anthracite-org/kalo_opus_misc_240827_no_system
- anthracite-org/kalo_misc_part2_no_system
Training
The model was trained for 2 epochs. We used 8x H100 GPUs, graciously provided by Recursal AI / Featherless AI, for the full-parameter fine-tuning.
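Combining the GPU count with the batch settings in the Axolotl config gives a rough effective batch size. The arithmetic below is a sketch that assumes plain data parallelism across all 8 GPUs (consistent with the DeepSpeed ZeRO-2 config, though not stated explicitly) and that every packed sequence is filled to `sequence_len`, so the token figure is an upper bound.

```python
# Values taken from the Axolotl config and the training description.
micro_batch_size = 1
gradient_accumulation_steps = 2
sequence_len = 32768
num_gpus = 8  # assumes every GPU holds one data-parallel replica

# Sequences contributing to each optimizer step.
sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus

# Upper bound on tokens per optimizer step with packing and padding to sequence_len.
max_tokens_per_step = sequences_per_step * sequence_len

print(sequences_per_step)   # 16
print(max_tokens_per_step)  # 524288
```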
Safety
...
