fix: set `clean_up_tokenization_spaces` to `false`

#356

by maxsloef - opened Mar 21

base: refs/heads/main ← from: refs/pr/356


maxsloef, Mar 21 • edited Mar 21

tokenizer_config.json has "clean_up_tokenization_spaces": true, which causes tokenizer.decode() to silently corrupt text. This affects every Llama 3.x model on the Hub and every fine-tune or downstream model that inherits their tokenizer config. Both Llama 2 and Llama 4 ship with false.

The fix is a one-line change: "clean_up_tokenization_spaces": true → "clean_up_tokenization_spaces": false.

What `clean_up_tokenization_spaces` does

When true, tokenizer.decode() strips spaces before punctuation marks during decoding. Specifically, it applies these string replacements to the decoded text:

text = (
    text.replace(" .", ".").replace(" ?", "?").replace(" !", "!")
    .replace(" ,", ",").replace(" ' ", "'")
    .replace(" n't", "n't").replace(" 'm", "'m")
    .replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")
)

This was designed for BERT-era WordPiece tokenizers (2019) where decoding produced artifacts like "Hello , world .". Llama 3's BPE tokenizer encodes spaces as part of tokens and does not produce these artifacts. The cleanup is actively destructive — it strips legitimate spaces from the decoded text.
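The effect of the replacement chain can be seen without loading any model. The sketch below is a standalone copy of the replacements listed above; the function name `clean_up` is made up for illustration and is not the transformers API. It shows the cleanup repairing WordPiece-style output while damaging ordinary text.

```python
def clean_up(text: str) -> str:
    """Apply the punctuation-spacing replacements listed above (illustrative copy)."""
    return (
        text.replace(" .", ".").replace(" ?", "?").replace(" !", "!")
        .replace(" ,", ",").replace(" ' ", "'")
        .replace(" n't", "n't").replace(" 'm", "'m")
        .replace(" 's", "'s").replace(" 've", "'ve").replace(" 're", "'re")
    )


# WordPiece-style artifact: the cleanup repairs the spacing.
print(repr(clean_up("Hello , world .")))      # 'Hello, world.'

# Ordinary text with meaningful spaces: the cleanup corrupts it.
print(repr(clean_up("x != y and a.b == c")))  # 'x!= y and a.b == c'
print(repr(clean_up("ls -a .git")))           # 'ls -a.git'
```

Code, shell commands, and file paths are the most exposed inputs, because a space before `.`, `!`, `?`, or `,` is often significant there.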

Minimal reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

text = "x != y and a.b == c"
ids = tokenizer.encode(text, add_special_tokens=False)

decoded = tokenizer.decode(ids)
print(repr(decoded))

decoded_fixed = tokenizer.decode(ids, clean_up_tokenization_spaces=False)
print(repr(decoded_fixed))

Output:

'x!= y and a.b == c'    ← space before != silently dropped
'x != y and a.b == c'   ← correct
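To check more than one string, the same comparison can be wrapped in a small round-trip helper. This is a minimal sketch: `find_roundtrip_mismatches` and the toy encoder/decoder are hypothetical names invented here, and the toy decoder applies only one cleanup rule to mimic the bug so that the example runs without downloading a model.

```python
from typing import Callable, Iterable, List, Tuple


def find_roundtrip_mismatches(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    samples: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (original, decoded) pairs where decode(encode(text)) != text."""
    mismatches = []
    for text in samples:
        decoded = decode(encode(text))
        if decoded != text:
            mismatches.append((text, decoded))
    return mismatches


# Toy stand-ins (assumptions, not a real tokenizer): one id per character,
# and a decoder that applies a single cleanup rule to mimic the bug.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_decode_lossy(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids).replace(" !", "!")


samples = ["x != y", "a.b == c"]
print(find_roundtrip_mismatches(toy_encode, toy_decode_lossy, samples))
# [('x != y', 'x!= y')]
```

With the real tokenizer, the callables would be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `tokenizer.decode`, as in the reproduction above. This assumes the sample strings otherwise round-trip exactly.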

Impact

The bug is specific to HuggingFace's tokenizer implementation. Other implementations, including Meta's own tiktoken-based tokenizer, vLLM, SGLang, and OLMo, do not have this behavior.
Every model that uses a Llama 3 tokenizer from HuggingFace has been, and still is, decoding text incorrectly. This covers not just the official meta-llama repos but all fine-tunes and derivatives that inherited the tokenizer config.

How Llama 3 got `clean_up_tokenization_spaces=True`

This was never an intentional choice by Meta:

Llama 2 explicitly set it to False in LlamaTokenizer.__init__
Llama 3 switched to PreTrainedTokenizerFast via a new Llama3Converter (PR #30334). The converter didn't pass clean_up_tokenization_spaces, so it inherited the HuggingFace transformers library default of True
The uploaded tokenizer_config.json files on the Hub baked in True
PR #33778 (Llama 3.2 support, Oct 2024) then hardcoded True in the conversion script for backward compatibility — without discussion of whether the value was correct
The library default was changed to False in Sep 2024 (PR #31938), but the Llama 3 configs already had True frozen

This has been flagged multiple times:

@ArthurZucker acknowledged in #35175: "It should be set to False!"

Both Llama 2 and Llama 4 ship with false, confirming this is recognized as a bug.

Affected models

All 21 Llama 3.x text model repos on the Hub have "clean_up_tokenization_spaces": true:

Llama 3.0: Meta-Llama-3-8B, -8B-Instruct, -70B, -70B-Instruct
Llama 3.1: Llama-3.1-8B, -8B-Instruct, -70B, -70B-Instruct, -405B, -405B-FP8, -405B-Instruct, -405B-Instruct-FP8
Llama 3.2: Llama-3.2-1B, -1B-Instruct, -3B, -3B-Instruct
Llama 3.3: Llama-3.3-70B-Instruct

Companion PRs have been opened on each of these repos. Downstream models (fine-tunes and derivatives) that inherited the tokenizer config are not covered by these PRs and will need to be fixed independently.
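For a downstream model whose files are available locally, the same one-line change can be applied to its tokenizer_config.json with a short script. This is a sketch under stated assumptions: `disable_cleanup` is a hypothetical helper name, the file is assumed to be a JSON object, and rewriting it with `json.dumps` may change its whitespace and key formatting.

```python
import json
from pathlib import Path


def disable_cleanup(config_path: str) -> bool:
    """Set clean_up_tokenization_spaces to false; return True if the file changed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("clean_up_tokenization_spaces") is not True:
        return False  # already false or absent; leave the file untouched
    config["clean_up_tokenization_spaces"] = False
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return True
```

A call such as `disable_cleanup("my-finetune/tokenizer_config.json")` (the path is a placeholder) returns `True` the first time and `False` on later runs, so it is safe to repeat.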

fix: set `clean_up_tokenization_spaces` to `false` 371866bb

maxsloef, Mar 21 • edited Mar 21

(removed due to broken links - see below)


maxsloef, Mar 21

Companion PRs

The same one-line fix has been opened on all 24 meta-llama repos that have clean_up_tokenization_spaces=true in their tokenizer_config.json. Tested across every version of transformers from 4.40.0 (first Llama 3 support, April 2024) through 5.3.0 (latest, March 2026) — all produce incorrect decoded text.

Llama 3.0:

Llama 3.1:

Llama 3.2:

Llama 3.3:

Llama-3.3-70B-Instruct

Llama Guard:

Prompt Guard:

The remaining 46 meta-llama repos either have false already (Llama 4, Llama-Guard-4) or don't have their own tokenizer_config.json (CodeLlama, Llama 2, quantized/vision/Original-format variants). Downstream models (fine-tunes and derivatives) that inherited the tokenizer config are not covered by these PRs and will need to be fixed independently.

maxsloef, Mar 21

High-download descendant PRs

Surveyed the top Llama 3 derivative models by download count on the Hub. Opened fix PRs on the 13 highest-download non-meta-llama models that ship their own tokenizer_config.json with clean_up_tokenization_spaces=true. Together with the 24 official meta-llama PRs above, these cover ~90% of total downloads across all affected models found.

RedHatAI (quantizations):

Llama-3.2-1B-Instruct-FP8-dynamic — 1.56M downloads
Llama-3.2-1B-Instruct-FP8 — 836K
Meta-Llama-3.1-8B-Instruct-FP8 — 531K
Meta-Llama-3.1-8B-FP8 — 226K

AWQ quantizations:

casperhansen/llama-3.3-70b-instruct-awq — 792K
kosbu/Llama-3.3-70B-Instruct-AWQ — 489K

unsloth (mirrors/quantizations):

Meta-Llama-3.1-8B-Instruct — 381K
Llama-3.1-8B-Instruct — 229K

Other:

fixie-ai/ultravox-v0_5-llama-3_2-1b — 767K
IlyaGusev/saiga_llama3_8b — 397K
NousResearch/Hermes-3-Llama-3.1-8B — 382K
nvidia/Llama-3.1-Nemotron-Nano-8B-v1 — 294K
llamafactory/tiny-random-Llama-3 — 900K (test model)

Total PRs filed: 37 (24 official meta-llama + 13 high-download descendants). There are ~170 more affected models on the Hub with lower download counts not covered here.
