fix: set `clean_up_tokenization_spaces` to `false`
tokenizer_config.json has "clean_up_tokenization_spaces": true, which causes tokenizer.decode() to silently corrupt text. This affects every Llama 3.x model on the Hub and every fine-tune or downstream model that inherits their tokenizer config. Both Llama 2 and Llama 4 ship with false.
The fix is a one-line change: "clean_up_tokenization_spaces": true → "clean_up_tokenization_spaces": false.
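For a local copy of the tokenizer files, the same change can be applied with a few lines of Python. This is a minimal sketch, not part of the PR itself: the helper name `disable_cleanup` is hypothetical, and it assumes the file is plain JSON that can be rewritten with two-space indentation.

```python
import json
from pathlib import Path

KEY = "clean_up_tokenization_spaces"


def disable_cleanup(config_path):
    """Set clean_up_tokenization_spaces to false in one tokenizer_config.json.

    Returns True if the file was rewritten, False if the value was already false.
    A missing key is treated as "needs fixing" and is written explicitly.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get(KEY) is False:
        return False
    config[KEY] = False
    # Assumption: two-space indentation is acceptable for the rewritten file.
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return True
```

Key order is preserved, but whitespace may differ from the original file, so the resulting diff can be larger than one line.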
What clean_up_tokenization_spaces does
When true, tokenizer.decode() strips the space before certain punctuation marks and English contraction suffixes. Specifically, it applies these string replacements to the decoded text:
text = (
    text.replace(" .", ".").replace(" ?", "?").replace(" !", "!")
    .replace(" ,", ",").replace(" ' ", "'")
    .replace(" n't", "n't").replace(" 'm", "'m")
    .replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")
)
This was designed for BERT-era WordPiece tokenizers (2019) where decoding produced artifacts like "Hello , world .". Llama 3's BPE tokenizer encodes spaces as part of tokens and does not produce these artifacts. The cleanup is actively destructive — it strips legitimate spaces from the decoded text.
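The effect can be seen without loading any tokenizer. The sketch below re-implements the replacements listed above in plain Python (the names `CLEANUP_RULES` and `clean_up` are illustrative, not the library's API) and applies them to a WordPiece-style string and to a code-like string.

```python
# Pattern/replacement pairs copied from the replacement chain listed above,
# applied in the same order.
CLEANUP_RULES = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","), (" ' ", "'"),
    (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
]


def clean_up(text):
    """Apply each replacement in turn, as the chained .replace() calls do."""
    for old, new in CLEANUP_RULES:
        text = text.replace(old, new)
    return text


wordpiece_style = "Hello , world ."     # artifact the cleanup was designed for
code_like = "x != y and a.b == c"       # text where the space is meaningful

print(repr(clean_up(wordpiece_style)))  # 'Hello, world.'
print(repr(clean_up(code_like)))        # 'x!= y and a.b == c'
```

The first string is repaired as intended; the second loses a space that was present in the original text.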
Minimal reproduction
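from transformers import AutoTokenizer

# Loads the tokenizer config from the Hub, where clean_up_tokenization_spaces is true
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
text = "x != y and a.b == c"
ids = tokenizer.encode(text, add_special_tokens=False)
# Default decode: the cleanup from the config is applied
decoded = tokenizer.decode(ids)
print(repr(decoded))
# Same ids, with the cleanup disabled for this call only
decoded_fixed = tokenizer.decode(ids, clean_up_tokenization_spaces=False)
print(repr(decoded_fixed))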
Output:
'x!= y and a.b == c' ← space before != silently dropped
'x != y and a.b == c' ← correct
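To check more than one string, the same encode/decode comparison can be wrapped in a small round-trip harness. This is a sketch under stated assumptions: `find_roundtrip_failures` is a hypothetical helper, and the toy encode/decode pair below is a stand-in that only imitates the dropped space before "!" so the example runs without downloading a model.

```python
def find_roundtrip_failures(encode, decode, samples):
    """Return (original, decoded) pairs where decode(encode(s)) differs from s."""
    failures = []
    for sample in samples:
        decoded = decode(encode(sample))
        if decoded != sample:
            failures.append((sample, decoded))
    return failures


# Stand-in tokenizer (hypothetical, for illustration only): UTF-8 bytes as ids,
# with a decode step that drops the space before "!" like the cleanup does.
def toy_encode(text):
    return list(text.encode("utf-8"))


def toy_decode(ids):
    return bytes(ids).decode("utf-8").replace(" !", "!")


samples = ["x != y and a.b == c", "plain text", "a.b == c"]
for original, decoded in find_roundtrip_failures(toy_encode, toy_decode, samples):
    print(repr(original), "->", repr(decoded))
```

With a real tokenizer, the two callables would be `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `tokenizer.decode`, as in the reproduction above.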
Impact
- The bug is specific to HuggingFace's tokenizer implementation. No other tokenizer implementation has this behavior, including Meta's own tiktoken-based tokenizer, vLLM, SGLang, and OLMo.
- Every model that uses a Llama 3 tokenizer from HuggingFace has been, and still is, decoding text incorrectly. This covers not just the official meta-llama repos, but all fine-tunes and derivatives that inherited the tokenizer config.
How Llama 3 got clean_up_tokenization_spaces=True
This was never an intentional choice by Meta:
- Llama 2 explicitly set it to False in LlamaTokenizer.__init__
- Llama 3 switched to PreTrainedTokenizerFast via a new Llama3Converter (PR #30334). The converter didn't pass clean_up_tokenization_spaces, so it inherited the HuggingFace transformers library default of True
- The uploaded tokenizer_config.json files on the Hub baked in True
- PR #33778 (Llama 3.2 support, Oct 2024) then hardcoded True in the conversion script for backward compatibility, without discussion of whether the value was correct
- The library default was changed to False in Sep 2024 (PR #31938), but the Llama 3 configs already had True frozen
This has been flagged multiple times:
- Discussion 44 on Meta-Llama-3-70B-Instruct (May 2024)
- transformers issue #35175
- transformers issue #31187
- transformers issue #32575
@ArthurZucker acknowledged in #35175: "It should be set to False!"
Both Llama 2 and Llama 4 ship with false, confirming this is recognized as a bug.
Affected models
All 21 Llama 3.x text model repos on the Hub have "clean_up_tokenization_spaces": true:
- Llama 3.0: Meta-Llama-3-8B, -8B-Instruct, -70B, -70B-Instruct
- Llama 3.1: Llama-3.1-8B, -8B-Instruct, -70B, -70B-Instruct, -405B, -405B-FP8, -405B-Instruct, -405B-Instruct-FP8
- Llama 3.2: Llama-3.2-1B, -1B-Instruct, -3B, -3B-Instruct
- Llama 3.3: Llama-3.3-70B-Instruct
Companion PRs have been opened on each of these repos. Downstream models (fine-tunes and derivatives) that inherited the tokenizer config are not covered by these PRs and will need to be fixed independently.
(removed due to broken links - see below)
Companion PRs
The same one-line fix has been opened on all 24 meta-llama repos that have clean_up_tokenization_spaces=true in their tokenizer_config.json. Tested across every version of transformers from 4.40.0 (first Llama 3 support, April 2024) through 5.3.0 (latest, March 2026) — all produce incorrect decoded text.
Llama 3.0:
Llama 3.1:
- Llama-3.1-8B
- Llama-3.1-8B-Instruct — this PR
- Llama-3.1-70B
- Llama-3.1-70B-Instruct
- Llama-3.1-405B
- Llama-3.1-405B-FP8
- Llama-3.1-405B-Instruct
- Llama-3.1-405B-Instruct-FP8
Llama 3.2:
Llama 3.3:
Llama Guard:
Prompt Guard:
The remaining 46 meta-llama repos either have false already (Llama 4, Llama-Guard-4) or don't have their own tokenizer_config.json (CodeLlama, Llama 2, quantized/vision/Original-format variants). Downstream models (fine-tunes and derivatives) that inherited the tokenizer config are not covered by these PRs and will need to be fixed independently.
High-download descendant PRs
Surveyed the top Llama 3 derivative models by download count on the Hub. Opened fix PRs on the 13 highest-download non-meta-llama models that ship their own tokenizer_config.json with clean_up_tokenization_spaces=true. Together with the 24 official meta-llama PRs above, these cover ~90% of total downloads across all affected models found.
RedHatAI (quantizations):
- Llama-3.2-1B-Instruct-FP8-dynamic — 1.56M downloads
- Llama-3.2-1B-Instruct-FP8 — 836K
- Meta-Llama-3.1-8B-Instruct-FP8 — 531K
- Meta-Llama-3.1-8B-FP8 — 226K
AWQ quantizations:
unsloth (mirrors/quantizations):
- Meta-Llama-3.1-8B-Instruct — 381K
- Llama-3.1-8B-Instruct — 229K
Other:
- fixie-ai/ultravox-v0_5-llama-3_2-1b — 767K
- IlyaGusev/saiga_llama3_8b — 397K
- NousResearch/Hermes-3-Llama-3.1-8B — 382K
- nvidia/Llama-3.1-Nemotron-Nano-8B-v1 — 294K
- llamafactory/tiny-random-Llama-3 — 900K (test model)
Total PRs filed: 37 (24 official meta-llama + 13 high-download descendants). There are ~170 more affected models on the Hub with lower download counts not covered here.
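For models not covered by these PRs, a local check is straightforward. The following is a minimal sketch, assuming the tokenizer files live somewhere under a directory you control (a cloned repo or a download cache); the function name `find_affected_configs` is hypothetical.

```python
import json
from pathlib import Path


def find_affected_configs(root):
    """Return tokenizer_config.json paths under root that set the flag to true."""
    affected = []
    for path in sorted(Path(root).rglob("tokenizer_config.json")):
        try:
            config = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed file: skip rather than guess
        if isinstance(config, dict) and config.get("clean_up_tokenization_spaces") is True:
            affected.append(path)
    return affected
```

Calling `find_affected_configs(".")` from the directory that holds your model checkouts lists the files that still need the one-line change. Configs that omit the key are not reported, since their behavior depends on the installed library default.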