Instructions to use katanemo/Arch-Router-1.5B.gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use katanemo/Arch-Router-1.5B.gguf with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="katanemo/Arch-Router-1.5B.gguf")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```
```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("katanemo/Arch-Router-1.5B.gguf", dtype="auto")
```
- llama-cpp-python
How to use katanemo/Arch-Router-1.5B.gguf with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="katanemo/Arch-Router-1.5B.gguf",
    filename="Arch-Router-1.5B-Q2_K.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use katanemo/Arch-Router-1.5B.gguf with llama.cpp:
Install from brew
```bash
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
Install from WinGet (Windows)
```bash
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
Use pre-built binary
```bash
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
Build from source code
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
Use Docker
```bash
docker model run hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
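Once `llama-server` is up, any OpenAI-compatible client can call it. A minimal sketch, assuming the `openai` Python package and the server's default port 8080 (adjust `base_url` if you started the server with a different `--port`):
```python
# Minimal sketch: call the local llama-server via its OpenAI-compatible API.
# Assumes `pip install openai` and llama-server on its default port 8080.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="katanemo/Arch-Router-1.5B.gguf:Q4_K_M",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```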
- LM Studio
- Jan
- vLLM
How to use katanemo/Arch-Router-1.5B.gguf with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "katanemo/Arch-Router-1.5B.gguf"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "katanemo/Arch-Router-1.5B.gguf",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```
Use Docker
```bash
docker model run hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
- SGLang
How to use katanemo/Arch-Router-1.5B.gguf with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "katanemo/Arch-Router-1.5B.gguf" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "katanemo/Arch-Router-1.5B.gguf",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```
Use Docker images
```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "katanemo/Arch-Router-1.5B.gguf" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "katanemo/Arch-Router-1.5B.gguf",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```
- Ollama
How to use katanemo/Arch-Router-1.5B.gguf with Ollama:
```bash
ollama run hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
- Unsloth Studio
How to use katanemo/Arch-Router-1.5B.gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for katanemo/Arch-Router-1.5B.gguf to start chatting
```
Install Unsloth Studio (Windows)
```bash
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for katanemo/Arch-Router-1.5B.gguf to start chatting
```
Using HuggingFace Spaces for Unsloth
```bash
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for katanemo/Arch-Router-1.5B.gguf to start chatting
```
- Pi
How to use katanemo/Arch-Router-1.5B.gguf with Pi:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add the following to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "katanemo/Arch-Router-1.5B.gguf:Q4_K_M" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use katanemo/Arch-Router-1.5B.gguf with Hermes Agent:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
Run Hermes
```bash
hermes
```
- Docker Model Runner
How to use katanemo/Arch-Router-1.5B.gguf with Docker Model Runner:
```bash
docker model run hf.co/katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
- Lemonade
How to use katanemo/Arch-Router-1.5B.gguf with Lemonade:
Pull the model
```bash
# Download Lemonade from https://lemonade-server.ai/
lemonade pull katanemo/Arch-Router-1.5B.gguf:Q4_K_M
```
Run and chat with the model
```bash
lemonade run user.Arch-Router-1.5B.gguf-Q4_K_M
```
List all available models
```bash
lemonade list
```
---
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
language:
- en
library_name: transformers
license: other
license_name: katanemo-research
license_link: https://huggingface.co/katanemo/Arch-Router-1.5B.gguf/blob/main/LICENSE
pipeline_tag: text-generation
---
# katanemo/Arch-Router-1.5B

## Overview
With the rapid proliferation of large language models (LLMs) -- each optimized for different strengths, styles, or latency/cost profiles -- routing has become an essential technique for operationalizing the use of different models. However, existing LLM routing approaches are limited in two key ways: they evaluate performance using benchmarks that often fail to capture human preferences driven by subjective evaluation criteria, and they typically select from a limited pool of models.

We introduce a preference-aligned routing framework that guides model selection by matching queries to user-defined domains (e.g., travel) or action types (e.g., image editing) -- offering a practical mechanism to encode preferences in routing decisions. Specifically, we introduce Arch-Router, a compact 1.5B model that learns to map queries to domain-action preferences for model routing decisions. Experiments on conversational datasets demonstrate that our approach achieves state-of-the-art (SOTA) results in matching queries with human preferences, outperforming top proprietary models.

This model is described in the paper https://arxiv.org/abs/2506.16655 and powers [Arch](https://github.com/katanemo/arch), the open-source AI-native proxy for agents, enabling preference-based routing in a seamless way. The project page can be found at https://archgw.com/.
### How It Works
To support effective routing, Arch-Router introduces two key concepts:
- **Domain** – the high-level thematic category or subject matter of a request (e.g., legal, healthcare, programming).
- **Action** – the specific type of operation the user wants performed (e.g., summarization, code generation, booking an appointment, translation).

Both domain and action configs are associated with preferred models or model variants. At inference time, Arch-Router analyzes the incoming prompt to infer its domain and action using semantic similarity, task indicators, and contextual cues. It then applies the user-defined routing preferences to select the model best suited to handle the request, as illustrated in the sketch below.
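For intuition, here is a minimal sketch of what such a preference mapping could look like. The route names echo this card's examples, but the model identifiers and the `select_model` helper are purely hypothetical and not part of Arch or this card:
```python
# Hypothetical preference mapping: each domain/action route points at a
# preferred model. Model names below are placeholders, not recommendations.
ROUTE_PREFERENCES = {
    "code_generation": "some-provider/code-model",  # action-level route
    "bug_fixing": "some-provider/code-model",       # action-level route
    "legal": "some-provider/long-context-model",    # domain-level route
    "other": "some-provider/general-model",         # fallback route
}

def select_model(route_decision: dict) -> str:
    """Map the router's {"route": ...} JSON output to a preferred model."""
    route = route_decision.get("route", "other")
    return ROUTE_PREFERENCES.get(route, ROUTE_PREFERENCES["other"])

# Example: the router returned {"route": "bug_fixing"}
print(select_model({"route": "bug_fixing"}))  # -> some-provider/code-model
```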
### Key Features
- **Structured Preference Routing**: Aligns prompt requests with model strengths using explicit domain–action mappings.
- **Transparent and Controllable**: Makes routing decisions transparent and configurable, empowering users to customize system behavior.
- **Flexible and Adaptive**: Supports evolving user needs, model updates, and new domains/actions without retraining the router.
- **Production-Ready Performance**: Optimized for low-latency, high-throughput applications in multi-model environments.
# Requirements
The code for Arch-Router-1.5B is available in the Hugging Face `transformers` library, and we advise you to install the latest version:
```bash
pip install "transformers>=4.37.0"
```
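If in doubt, you can verify that the installed version meets this requirement (a standard check, nothing model-specific):
```python
import transformers

# The card asks for transformers >= 4.37.0
print(transformers.__version__)
```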
# How to use
We use the following example to illustrate how to use our model to perform routing tasks. Please note that our model works best with our provided prompt format.
### Quickstart
````python
import json
from typing import Any, Dict, List

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "katanemo/Arch-Router-1.5B"
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Please use our provided prompt, verbatim, for best performance
TASK_INSTRUCTION = """
You are a helpful assistant designed to find the best suited route.
You are provided with route description within <routes></routes> XML tags:
<routes>
{routes}
</routes>

<conversation>
{conversation}
</conversation>
"""

FORMAT_PROMPT = """
Your task is to decide which route is best suit with user intent on the conversation in <conversation></conversation> XML tags. Follow the instruction:
1. If the latest intent from user is irrelevant or user intent is full filled, response with other route {"route": "other"}.
2. You must analyze the route descriptions and find the best match route for user latest intent.
3. You only response the name of the route that best matches the user's request, use the exact name in the <routes></routes>.

Based on your analysis, provide your response in the following JSON formats if you decide to match any route:
{"route": "route_name"}
"""

# Define the route config
route_config = [
    {
        "name": "code_generation",
        "description": "Generating new code snippets, functions, or boilerplate based on user prompts or requirements",
    },
    {
        "name": "bug_fixing",
        "description": "Identifying and fixing errors or bugs in the provided code across different programming languages",
    },
    {
        "name": "performance_optimization",
        "description": "Suggesting improvements to make code more efficient, readable, or scalable",
    },
    {
        "name": "api_help",
        "description": "Assisting with understanding or integrating external APIs and libraries",
    },
    {
        "name": "programming",
        "description": "Answering general programming questions, theory, or best practices",
    },
]

# Helper function to create the system prompt for our model
def format_prompt(
    route_config: List[Dict[str, Any]], conversation: List[Dict[str, Any]]
):
    return (
        TASK_INSTRUCTION.format(
            routes=json.dumps(route_config), conversation=json.dumps(conversation)
        )
        + FORMAT_PROMPT
    )

# Define the conversation
conversation = [
    {
        "role": "user",
        "content": "fix this module 'torch.utils._pytree' has no attribute 'register_pytree_node'. did you mean: '_register_pytree_node'?",
    }
]

route_prompt = format_prompt(route_config, conversation)
messages = [
    {"role": "user", "content": route_prompt},
]

# 1. Tokenize the chat prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# 2. Generate
generated_ids = model.generate(
    input_ids=input_ids,  # or just positional: model.generate(input_ids, ...)
    max_new_tokens=32768,
)

# 3. Strip the prompt from each sequence
prompt_length = input_ids.shape[1]  # same length for every row here
generated_only = [
    output_ids[prompt_length:]  # slice off the prompt tokens
    for output_ids in generated_ids
]

# 4. Decode if you want text
response = tokenizer.batch_decode(generated_only, skip_special_tokens=True)[0]
print(response)
````
Then you should see the following output string in JSON format:
````json
{"route": "bug_fixing"}
````
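To act on this output, parse the JSON and look the route up in your own preference table (such as the sketch under How It Works above). A minimal, self-contained parse; falling back to `"other"` on malformed output is a defensive choice of this sketch, not behavior defined by the card:
```python
import json

response = '{"route": "bug_fixing"}'  # the router output from the quickstart

# Parse the router's decision; fall back to "other" if the output
# is not valid JSON.
try:
    route = json.loads(response).get("route", "other")
except json.JSONDecodeError:
    route = "other"

print(route)  # -> bug_fixing
```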
To better understand how to create the route descriptions, please take a look at our [Katanemo API](https://docs.archgw.com/guides/llm_router.html).

# License
The Katanemo Arch-Router model is distributed under the [Katanemo license](https://huggingface.co/katanemo/Arch-Router-1.5B.gguf/blob/main/LICENSE).