Llama 3.1 8B Instruct — GGUF + MXQ for llama-cli-mblt

This repository provides Llama 3.1 8B Instruct compiled and optimized for Mobilint NPU hardware, packaged for use with llama-cli-mblt.

Branches

| Branch | Contents                 | Description                                |
|--------|--------------------------|--------------------------------------------|
| main   | Body model only          | Standard autoregressive decoding           |
| eagle3 | Body + FC + Draft models | EAGLE3 speculative decoding (~2-4x faster) |
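Hugging Face model repositories are plain git repositories, so a branch can also be fetched directly with git; a minimal sketch, assuming git-lfs is installed to pull the large .bin and .mxq files:

# Fetch the eagle3 branch directly with git (requires git-lfs)
git lfs install
git clone -b eagle3 https://huggingface.co/mobilint/Llama-3.1-8B-Instruct-GGUF models/llama-8b-eagle3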

Files

main branch

| File                                   | Size   | Description                         |
|----------------------------------------|--------|-------------------------------------|
| llama-3.1-8b-instruct-vocab.gguf       | 7.5 MB | Tokenizer (vocab-only GGUF)         |
| target_emb.bin                         | 2.0 GB | Body embedding weights (float32)    |
| single_Body_Llama-3.1-8B-Instruct.mxq  | 3.7 GB | Body model for NPU (W4V8 quantized) |
| config.json                            | —      | Model configuration                 |

eagle3 branch (adds)

| File                                 | Size   | Description                        |
|--------------------------------------|--------|------------------------------------|
| single_Fc_Llama-3.1-8B-Instruct.mxq  | 49 MB  | FC dimension-converter model       |
| Draft_Llama-3.1-8B-Instruct.mxq      | 181 MB | EAGLE3 draft model                 |
| draft_emb.bin                        | 2.0 GB | Draft embedding weights            |
| d2t.bin                              | 250 KB | Draft-to-target vocabulary mapping |
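Before running the eagle3 commands in the Quick Start, it can help to confirm the full file set is present; a minimal sanity check, assuming the models/llama-8b-eagle3 download directory used below:

# Check that every file the eagle3 invocation needs exists
for f in llama-3.1-8b-instruct-vocab.gguf target_emb.bin \
         single_Body_Llama-3.1-8B-Instruct.mxq \
         single_Fc_Llama-3.1-8B-Instruct.mxq \
         Draft_Llama-3.1-8B-Instruct.mxq draft_emb.bin d2t.bin; do
    [ -f "models/llama-8b-eagle3/$f" ] || echo "missing: $f"
done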

Quick Start

Install

# Build llama-cli-mblt
cd llama.cpp
cmake -B build \
    -DLLAMA_MOBILINT=ON \
    -DLLAMA_MOBILINT_RUNTIME_DIR=/path/to/qbruntime \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli-mblt -j$(nproc)
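
A successful build places the binary under build/bin; a quick check (the path follows from the cmake invocation above):

# Confirm the binary was produced and is executable
test -x build/bin/llama-cli-mblt && echo "llama-cli-mblt built"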

Simple decoding (main branch)

# Download model files
huggingface-cli download mobilint/Llama-3.1-8B-Instruct-GGUF --local-dir models/llama-8b

# Run
./build/bin/llama-cli-mblt \
    --gguf  models/llama-8b/llama-3.1-8b-instruct-vocab.gguf \
    --embd  models/llama-8b/target_emb.bin \
    --mxq   models/llama-8b/single_Body_Llama-3.1-8B-Instruct.mxq \
    --core-mode global4 --chat \
    -p "What is the meaning of life?" -n 256

EAGLE3 speculative decoding (eagle3 branch)

# Download with eagle3 branch
huggingface-cli download mobilint/Llama-3.1-8B-Instruct-GGUF --revision eagle3 --local-dir models/llama-8b-eagle3

# Run with ~2-4x speedup
./build/bin/llama-cli-mblt \
    --gguf  models/llama-8b-eagle3/llama-3.1-8b-instruct-vocab.gguf \
    --embd  models/llama-8b-eagle3/target_emb.bin \
    --mxq   models/llama-8b-eagle3/single_Body_Llama-3.1-8B-Instruct.mxq \
    --mxq-fc    models/llama-8b-eagle3/single_Fc_Llama-3.1-8B-Instruct.mxq \
    --mxq-draft models/llama-8b-eagle3/Draft_Llama-3.1-8B-Instruct.mxq \
    --embd-draft models/llama-8b-eagle3/draft_emb.bin \
    --d2t   models/llama-8b-eagle3/d2t.bin \
    --core-mode global4 --n-draft 2 --tree-depth 6 --total-tokens 23 \
    --chat --temp 0.0 -p "Explain quantum computing" -n 200

# Interactive chat
./build/bin/llama-cli-mblt \
    --gguf  models/llama-8b-eagle3/llama-3.1-8b-instruct-vocab.gguf \
    --embd  models/llama-8b-eagle3/target_emb.bin \
    --mxq   models/llama-8b-eagle3/single_Body_Llama-3.1-8B-Instruct.mxq \
    --mxq-fc    models/llama-8b-eagle3/single_Fc_Llama-3.1-8B-Instruct.mxq \
    --mxq-draft models/llama-8b-eagle3/Draft_Llama-3.1-8B-Instruct.mxq \
    --embd-draft models/llama-8b-eagle3/draft_emb.bin \
    --d2t   models/llama-8b-eagle3/d2t.bin \
    --core-mode global4 --n-draft 2 --tree-depth 6 --total-tokens 23 \
    -i -n 256
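
Since the generation and interactive examples share the same model flags, a small wrapper script (a sketch; hypothetical filename run-eagle3.sh, all flags taken from the commands above) avoids repeating them:

#!/bin/sh
# run-eagle3.sh: run llama-cli-mblt with the eagle3 model set preloaded,
# forwarding any extra arguments (e.g. --chat, -i, -p, -n)
DIR=models/llama-8b-eagle3
exec ./build/bin/llama-cli-mblt \
    --gguf "$DIR/llama-3.1-8b-instruct-vocab.gguf" \
    --embd "$DIR/target_emb.bin" \
    --mxq "$DIR/single_Body_Llama-3.1-8B-Instruct.mxq" \
    --mxq-fc "$DIR/single_Fc_Llama-3.1-8B-Instruct.mxq" \
    --mxq-draft "$DIR/Draft_Llama-3.1-8B-Instruct.mxq" \
    --embd-draft "$DIR/draft_emb.bin" \
    --d2t "$DIR/d2t.bin" \
    --core-mode global4 --n-draft 2 --tree-depth 6 --total-tokens 23 \
    "$@"

With this, the two examples above become ./run-eagle3.sh --chat --temp 0.0 -p "Explain quantum computing" -n 200 and ./run-eagle3.sh -i -n 256.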

Performance

Tested on Mobilint Aries NPU with global4 core mode:

| Mode   | Prefill  | Decode  | Tokens/Step |
|--------|----------|---------|-------------|
| Simple | ~330 t/s | ~10 t/s | 1.0         |
| EAGLE3 | ~330 t/s | ~23 t/s | ~4.2        |

Tokens/Step is the average number of tokens accepted per decode step. EAGLE3 accepts ~4.2 tokens per step, but each step is more expensive than a plain decode step (draft generation plus verification), so the end-to-end decode speedup works out to about 2.3x (23/10 t/s), toward the lower end of the ~2-4x range quoted above.
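
These figures can be reproduced roughly by timing a fixed greedy generation and dividing tokens by wall-clock time; a sketch using the hypothetical run-eagle3.sh wrapper from above (if llama-cli-mblt prints its own per-phase timings, prefer those):

# Rough throughput check: time a fixed 200-token greedy run
time ./run-eagle3.sh --temp 0.0 -p "Explain quantum computing" -n 200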

About

This model is compiled and optimized for Mobilint NPU hardware and is intended for use with llama-cli-mblt from the mobilint example in llama.cpp.
