Time to go Big/Sota!!!

#13

by Daemontatox - opened Jul 9, 2025

Discussion

Daemontatox

Jul 9, 2025

Time for Hugging Face to go big and compete with other SOTA models. The quality and performance of the smol family have been outstanding; time to try to compete with the big models.

david-thrower

Jul 9, 2025

I partially agree. I just hope someone is willing to pay the costs for that. I say only partially because I think a paradigm shift to a novel architecture is around the corner (e.g. not a multi-head attention transformer, but something that scales in linear or substantially sub-quadratic time, ...), and it may be better to put the big chips on the table when we get there.

ZiggyS

Jul 9, 2025

I think part of the point of being small is so that those who don't have a lot of resources can still learn from them. Go big, and it cuts many out.

david-thrower

Jul 9, 2025

@ZiggyS I can definitely relate to resource constraints being a bottleneck.

Nonetheless, it would be great, though perhaps overly optimistic, to have a source for full-scale SOTA models that are truly open source and platform independent. I know the EU is working on a project to that effect, but it would be good to also have one commercial / non-government-affiliated source that is fully transparent and free of international conflicts of interest.

As I mentioned, though, it would be a big risk to develop one now, because everything is moving so fast that whatever you do could be fundamentally obsolete in a week. This is especially true since it is inevitable that someone, somewhere, will publish more effective solutions (in terms of actual published models and code, not just theoretical papers) to problems like these any day now:

- A more robust capability for a model to continuously re-train / fine-tune itself at inference time, so that it replicates its user's behavior with granularity and learns from past user corrections and clarifications, especially one that can update itself on a user-specific basis with controlled multi-tenancy.
- A generative model that scales with linear or substantially sub-quadratic time complexity. A proof of concept in text classification that scales in linear / O(n) to O(n log(n)) time with increasing sequence length already exists.
- A generative model architecture with a truly unlimited context window and no performance degradation as the context grows.
- ...
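
To put the scaling point above in rough numbers, here is a minimal sketch. It is an illustration under simplifying assumptions, not a benchmark of any model: it only counts abstract "operations" as a function of sequence length n, treating full self-attention as n * n token-pair interactions and ignoring constants, hidden size, and hardware effects. The function names are made up for this example.

```python
import math

def quadratic_ops(n: int) -> int:
    """Token-pair interactions when every token attends to every other token."""
    return n * n

def n_log_n_ops(n: int) -> float:
    """Operation count for a hypothetical O(n log n) architecture."""
    return n * math.log2(n)

def linear_ops(n: int) -> int:
    """Operation count for a hypothetical O(n) architecture."""
    return n

# Compare how the three growth rates diverge as the sequence gets longer.
for n in (1_024, 8_192, 65_536):
    print(
        f"n={n:>6}  "
        f"O(n^2)={quadratic_ops(n):.2e}  "
        f"O(n log n)={n_log_n_ops(n):.2e}  "
        f"O(n)={linear_ops(n):.2e}"
    )
```

Under these assumptions, going from 1,024 to 65,536 tokens multiplies the quadratic count by 4,096 but the linear count by only 64, which is the gap a sub-quadratic architecture would be trying to close.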

Kwissbeats

Jul 14, 2025

Honestly, it makes no sense to me at all; everything I like about the smol models has to do with their size/consistency.
Of course it will be interesting to see what the team can do; they seem to be very talented.
But it does not necessarily correlate, in my opinion, and the big-model competition is heated enough.

Daemontatox changed discussion status to closed Jul 14, 2025
