CHEETAH-350M-Merged-FP16
CHEETAH-350M-Merged-FP16 is a merged instruction-tuned model based on LiquidAI/LFM2-350M.
A LoRA adapter was fine-tuned on HuggingFaceTB/smol-smoltalk and then merged into the base model's weights, producing a standalone FP16 Transformers model.
🐆 Fast, small, cheap, and instruction-following focused.
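To make "merged" concrete, here is a minimal sketch of folding a LoRA update into a base weight matrix under the standard LoRA formulation, W' = W + (alpha / r) · B·A. The toy matrices and the `matmul` / `merge_lora` helpers are hypothetical and are not the script used to build this release. The toy adapter has rank 1 for readability, and the alpha/r ratio simply reuses the card's values (alpha = 32, r = 16).

```python
# Toy illustration of merging a LoRA update into a base weight matrix.
# All matrices are made up; only alpha=32 and r=16 come from the model card.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, b, a, alpha, r):
    """Return W + (alpha / r) * (B @ A), the standard LoRA merge."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (2, 2)
B = [[0.5], [0.25]]           # LoRA "up" matrix, shape (2, 1)
A = [[1.0, 2.0]]              # LoRA "down" matrix, shape (1, 2)

merged = merge_lora(W, B, A, alpha=32, r=16)  # scale = 32 / 16 = 2.0

# The merged matrix gives the same output as base path + scaled adapter path.
x = [[3.0], [4.0]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[bo[0] + 2.0 * ao[0]] for bo, ao in zip(base_out, adapter_out)]

print(merged)
print(matmul(merged, x) == unmerged)
```

After merging, inference needs only the single matrix, which is why a merged checkpoint loads like any other Transformers model.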
Model Details
| Field | Value |
|---|---|
| Model family | CHEETAH |
| Model name | CHEETAH-350M-Merged-FP16 |
| Base model | LiquidAI/LFM2-350M |
| Training dataset | HuggingFaceTB/smol-smoltalk |
| Fine-tuning type | LoRA SFT |
| Final format | Merged FP16 Transformers model |
| Training platform | Modal |
| GPU | NVIDIA L4 |
| Selected checkpoint | Step 750 |
| License | lfm1.0 |
Training Summary
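The model was trained as a LoRA adapter. Training was stopped at step 750 of a planned 1,000, right after that checkpoint was saved. The eval loss and perplexity below come from step 700, the last evaluation before the selected checkpoint.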
| Metric | Value |
|---|---|
| Selected step | 750 |
| Last evaluated step | 700 |
| Eval loss | 1.3082 |
| Eval perplexity | 3.70 |
| Tokens seen at checkpoint | 9,711,906 |
| Training time | 32.8 minutes |
| Speed near end | ~5,000 tok/s |
| GPU | NVIDIA L4 |
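The derived figures in this table can be cross-checked from the raw ones. The sketch below assumes the eval loss is a mean per-token cross-entropy in nats (the usual convention, though the card does not say so), in which case perplexity is its exponential. It also computes the average throughput over the whole run from the token count and training time.

```python
import math

# Values copied from the Training Summary table.
eval_loss = 1.3082
tokens_seen = 9_711_906
training_minutes = 32.8

# Perplexity is exp(mean cross-entropy), assuming the loss is in nats.
perplexity = math.exp(eval_loss)

# Average throughput over the full run, in tokens per second.
avg_tok_per_s = tokens_seen / (training_minutes * 60)

print(f"perplexity ~ {perplexity:.2f}")
print(f"average throughput ~ {avg_tok_per_s:.0f} tok/s")
```

The perplexity matches the reported 3.70, and the run-wide average lands just under the ~5,000 tok/s observed near the end.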
Final Training Log
[2026-05-30 19:03:52] step=700/1000 loss=17.2614 lr=4.36e-05 tokens_seen=9,070,297 tok/s=5116.6 elapsed_min=30.6
[2026-05-30 19:03:53] eval_loss=1.3082 eval_ppl=3.70
[2026-05-30 19:04:18] step=710/1000 loss=16.9876 lr=4.10e-05 tokens_seen=9,195,110 tok/s=4728.6 elapsed_min=31.1
[2026-05-30 19:04:44] step=720/1000 loss=16.4713 lr=3.84e-05 tokens_seen=9,324,489 tok/s=5017.9 elapsed_min=31.5
[2026-05-30 19:05:10] step=730/1000 loss=16.7246 lr=3.59e-05 tokens_seen=9,457,294 tok/s=5178.9 elapsed_min=31.9
[2026-05-30 19:05:36] step=740/1000 loss=16.3293 lr=3.34e-05 tokens_seen=9,580,098 tok/s=4697.8 elapsed_min=32.4
[2026-05-30 19:06:02] step=750/1000 loss=16.4356 lr=3.10e-05 tokens_seen=9,711,906 tok/s=5018.9 elapsed_min=32.8
[2026-05-30 19:06:06] Saved checkpoint: /outputs/CHEETAH-350M-LoRA-L4/checkpoints/step-750
Note: the training loss shown in the log is distorted by how it was logged under gradient accumulation, so it is not comparable to the eval loss. Use the evaluation loss and perplexity to judge the selected checkpoint.
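One plausible reading of the logged values, offered as an assumption since the card does not state the exact cause: if the logger summed the loss over the 16 accumulated micro-batches instead of averaging it, dividing by the accumulation factor recovers a per-micro-batch value on the same scale as the eval loss. The sketch parses one log line from above and applies that division.

```python
import re

GRAD_ACCUM_STEPS = 16  # "Gradient accumulation" value from the model card

log_line = (
    "[2026-05-30 19:06:02] step=750/1000 loss=16.4356 lr=3.10e-05 "
    "tokens_seen=9,711,906 tok/s=5018.9 elapsed_min=32.8"
)

# Pull the step number and the logged loss out of the line.
match = re.search(r"step=(\d+)/\d+ loss=([\d.]+)", log_line)
step = int(match.group(1))
logged_loss = float(match.group(2))

# ASSUMPTION: the logged loss is a sum over micro-batches, not a mean.
per_micro_batch_loss = logged_loss / GRAD_ACCUM_STEPS

print(f"step {step}: logged={logged_loss}, rescaled~{per_micro_batch_loss:.4f}")
```

Under that assumption the rescaled training loss is close to 1.03, the same order of magnitude as the eval loss of 1.3082. It remains an estimate, and the eval metrics stay the reference.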
Intended Use
This model is intended for:
- Lightweight instruction following
- Small assistant experiments
- Fast local or cloud inference
- Educational fine-tuning experiments
- CHEETAH model family development
Not Intended For
This model is not intended for:
- High-stakes medical, legal, or financial advice
- Safety-critical automation
- Private-data processing without review
- Production deployment without evaluation
Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Tralalabs/CHEETAH-350M-Merged-FP16"

# Load the tokenizer and the merged FP16 weights.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": "You are CHEETAH, a fast, clear, helpful assistant.",
    },
    {
        "role": "user",
        "content": "Explain why cheetahs are fast in 3 short bullets.",
    },
]

# Render the conversation with the model's chat template as plain text.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a short completion without tracking gradients.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=160,
        do_sample=True,
        temperature=0.35,
        top_p=0.9,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decodes the full sequence, so the prompt is printed along with the reply.
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
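The Usage snippet above decodes `output[0]` in full, and for a decoder-only model the generated sequence starts with the prompt tokens, so the printed text includes the system and user turns. If only the reply is wanted, the prompt tokens can be sliced off before decoding. The helper name `new_tokens_only` is hypothetical, and the sketch uses plain lists of made-up token ids so it runs without downloading the model.

```python
def new_tokens_only(sequence, prompt_length):
    """Drop the first `prompt_length` token ids from a generated sequence."""
    return sequence[prompt_length:]

# Toy token ids standing in for real tokenizer output.
prompt_ids = [101, 7, 42]
generated = prompt_ids + [9, 8, 2]  # generate() returns prompt + new tokens

completion_ids = new_tokens_only(generated, len(prompt_ids))
print(completion_ids)  # [9, 8, 2]
```

With the variables from the Usage snippet, the equivalent call would be `tokenizer.decode(new_tokens_only(output[0], inputs["input_ids"].shape[-1]), skip_special_tokens=True)`, since tensor slicing works the same way as list slicing here.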
Training Data
The model was fine-tuned on:
HuggingFaceTB/smol-smoltalk
This dataset is a subset of SmolTalk adapted for models smaller than 1B parameters.
Training Configuration
| Setting | Value |
|---|---|
| Base model | LiquidAI/LFM2-350M |
| Dataset | HuggingFaceTB/smol-smoltalk |
| Rows | 16,000 |
| Max sequence length | 2048 |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Learning rate | 2e-4 |
| Gradient accumulation | 16 |
| Selected checkpoint | Step 750 |
| Final tokens seen | 9,711,906 |
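The settings above also give a rough picture of the token budget per step. The sketch below records a few table values in a small config object (the `TrainConfig` class and its property names are hypothetical) and derives the average tokens per logged step. The per-micro-batch figure additionally assumes that each logged step is one optimizer step made of 16 micro-batches, which the card does not confirm.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the Training Configuration table.
    max_seq_len: int = 2048
    grad_accum_steps: int = 16
    selected_step: int = 750
    tokens_seen: int = 9_711_906

    @property
    def avg_tokens_per_step(self) -> float:
        """Average number of tokens consumed per logged training step."""
        return self.tokens_seen / self.selected_step

    @property
    def avg_tokens_per_micro_batch(self) -> float:
        """ASSUMPTION: one logged step = grad_accum_steps micro-batches."""
        return self.avg_tokens_per_step / self.grad_accum_steps

cfg = TrainConfig()
print(f"tokens per step ~ {cfg.avg_tokens_per_step:.0f}")
print(f"tokens per micro-batch ~ {cfg.avg_tokens_per_micro_batch:.0f}")
```

Under that assumption, the average micro-batch holds well under the 2048-token maximum sequence length, which is consistent with a dataset of mostly short conversations.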
Limitations
CHEETAH-350M-Merged-FP16 is a small 350M-class model. It may:
- Hallucinate facts
- Struggle with long reasoning chains
- Give weak answers on niche knowledge
- Misread complex instructions
- Need careful prompting for best results
For factual or current information, verify outputs with trusted sources.
License
This model is released under lfm1.0, matching the license of the base model LiquidAI/LFM2-350M.
The training dataset HuggingFaceTB/smol-smoltalk is licensed under Apache-2.0.
Citation
Base model:
LiquidAI/LFM2-350M
Dataset:
HuggingFaceTB/smol-smoltalk
Model Family
This model belongs to the CHEETAH family:
CHEETAH-[SIZE]-LoRA
CHEETAH-[SIZE]-Merged
This release:
CHEETAH-350M-Merged-FP16
Example Output
Prompt:
system
You are CHEETAH, a fast, clear, helpful assistant.
user
Explain why cheetahs are fast in 3 short bullets.
Model output:
assistant
1. Cheetahs have a unique body structure that allows them to run at incredible speeds.
2. Their long legs and lightweight build enable them to accelerate quickly.
3. Cheetahs have a specialized tail that acts as a counterbalance during high-speed runs.