Time to go Big/Sota!!!
Time for Hugging Face to go big and compete with other SOTA models. The quality and performance of the smol family have been outstanding, so it's time to try to compete with the big models.
I partially agree. I just hope someone steps up to pay the costs for that. I only partially agree because I think a paradigm shift to a novel architecture is around the corner (e.g. not a multi-head attention transformer, but something that scales in linear or substantially sub-quadratic time), and it may be better to put the big chips on the table when we get there.
I think part of the point of being small is so that those who don't have a lot of resources can still learn from them. Go big, and it cuts many out.
@ZiggyS I can definitely relate to resource constraints being a bottleneck.
Nonetheless, it would be great, though perhaps overly optimistic, to have a source for full-scale SOTA models that are truly open source and platform independent. I know the EU is working on a project to that effect, but it would be good to also have one commercial / non-government-affiliated source that is fully transparent and free of international conflicts of interest.
As I mentioned though, it would be a big risk to develop one now, because everything is moving so fast that whatever you do could be fundamentally obsolete in a week. Especially as it is inevitable that someone somewhere will publish more effective solutions (in terms of actual published models and code, not just theoretical papers) to problems like these any day now:
- A more robust capability for a model to continuously re-train / fine-tune itself at inference time, so that it replicates its user's behavior with granularity and learns from past user corrections and clarifications, especially one that can update itself on a user-specific basis with controlled multi-tenancy
- A generative model that scales in linear or substantially sub-quadratic time. A proof of concept in text classification that scales in linear / O(n) to O(n log(n)) time with increased sequence length already exists.
- A generative model architecture having a truly unlimited context window without performance degradation as the context grows
- ...
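To put rough numbers on the scaling point in the list above, here is a minimal sketch that compares idealized operation counts as the sequence length grows. The function names are hypothetical and the counts are simplified (one unit of work per token pair for full self-attention, constant factors ignored); they are not measurements of SmolLM3 or any other model.

```python
import math


def quadratic_cost(seq_len: int) -> int:
    """Idealized full self-attention: every token attends to every token."""
    return seq_len * seq_len


def nlogn_cost(seq_len: int) -> int:
    """Idealized O(n log n) mixing (assumed cost model, for comparison only)."""
    return int(seq_len * math.log2(seq_len))


def linear_cost(seq_len: int) -> int:
    """Idealized O(n) mixing: a constant amount of work per token."""
    return seq_len


def growth_when_doubling(cost_fn, seq_len: int) -> float:
    """How much the cost multiplies when the sequence length doubles."""
    return cost_fn(2 * seq_len) / cost_fn(seq_len)


if __name__ == "__main__":
    for n in (1_024, 8_192, 65_536):
        print(
            f"n={n:>6}: "
            f"quadratic={quadratic_cost(n):>13,} "
            f"n log n={nlogn_cost(n):>10,} "
            f"linear={linear_cost(n):>7,}"
        )
```

Under these assumptions, doubling the context quadruples the quadratic count but only roughly doubles the other two, which is why sub-quadratic architectures matter most for long contexts.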
Honestly, it makes no sense to me at all. Everything I like about the smol models has to do with their size/consistency.
Of course, it will be interesting to see what the team can do; they seem to be very talented.
But the two do not necessarily correlate, in my opinion, and the big model competition is heated enough.