The analysis and assistantfinal keywords appear in the Hugging Face gpt-oss-120b output. Is this intended?
from transformers import pipeline
import torch
import os
model_id = "openai/gpt-oss-120b"
# Use shared cache directory as per README.md guidelines
cache_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "llm_weights")
os.makedirs(cache_dir, exist_ok=True)
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map={"": 0},
    model_kwargs={"cache_dir": cache_dir},
)
messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]
try:
    outputs = pipe(
        messages,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=pipe.tokenizer.eos_token_id,
    )
    print(outputs[0]["generated_text"])
except Exception as e:
    print(f"Generation failed: {e}")
    print("Try reducing max_new_tokens or using CPU device")
==============
[{'role': 'user', 'content': 'Explain quantum mechanics clearly and concisely.'}, {'role': 'assistant', 'content': "analysisWe need to explain quantum mechanics clearly and concisely. Provide a brief overview: core principles, wavefunction, superposition, uncertainty, measurement, entanglement, quantization, Schrödinger equation, etc. Use analogies, avoid heavy math but maybe simple equations. Keep concise but thorough. Probably bullet points. Let's produce final answer.assistantfinalQuantum Mechanics in a nutshell\n\n| Core idea | What it means | Everyday analogy |\n|----------|----------------|-------------------|\n| 1. Quantized energy | Particles (electrons, photons, etc.) can only have certain discrete energy values, not a continuous range. | Like a staircase: you can stand on a step (allowed energy) but not in between. |\n| 2. Wave‑particle duality | Every quantum object has both particle‑like and wave‑like aspects. Its state is described by a wavefunction\u202fψ(x,t). | A water wave that also carries a tiny boat—its height tells you the probability of finding the boat. |\n| 3. Superposition | A system can exist in a combination of several possible states simultaneously until measured. | A spinning coin that is both “heads” and “"}]
Hello. I'm encountering the same issue. Have you resolved it yet? Thanks.
In the meantime, I wrote some regex splitting logic with Claude. I wish the analysis and the final answer were returned separately.
import re
from typing import Dict
def split_response(response: str) -> Dict[str, str]:
    """Split the response into analysis and final content"""
    # Find the analysis part (everything before "assistantfinal")
    analysis_match = re.search(r'^(.*?)assistantfinal', response, flags=re.DOTALL)
    if analysis_match:
        analysis_content = analysis_match.group(1).strip()
        # Remove "analysis" prefix if present
        analysis_content = re.sub(r'^analysis\s*', '', analysis_content)
        # Find the final content (everything after "assistantfinal")
        final_content = re.sub(r'^.*?assistantfinal\s*', '', response, flags=re.DOTALL).strip()
    else:
        # If no "assistantfinal" found, check for just "analysis" prefix
        if response.startswith('analysis'):
            analysis_content = response.strip()
            final_content = ""
        else:
            # No clear separation, treat entire response as final content
            analysis_content = ""
            final_content = response.strip()
    return {
        "analysis_content": analysis_content,
        "final_content": final_content
    }
def test_splitting():
    # Test case 1: Response from the actual results
    test_response1 = """analysisWe need to explain quantum entanglement. Provide a clear explanation, possibly with analogies, mention key concepts: superposition, nonlocal correlations, Bell's theorem, measurements, etc. Should be accessible.assistantfinalQuantum entanglement is a striking feature of quantum mechanics in which two (or more) particles become linked so that the state of one instantly influences the state of the other, no matter how far apart they are."""
    result1 = split_response(test_response1)
    print("Test 1:")
    print(f"Analysis: {result1['analysis_content'][:100]}...")
    print(f"Final: {result1['final_content'][:100]}...")
    print()
    # Test case 2: Math problem response
    # (raw string so the LaTeX backslashes are not treated as escape sequences)
    test_response2 = r"""analysisWe need to solve system: x + y = 10, x - y = 4. Solve: add: 2x = 14 => x = 7. Then y = 3. Then x*y = 21. Provide answer.assistantfinalFrom the two equations
\[
\begin{cases}
x + y = 10 \\
x - y = 4
\end{cases}
\]
add them to eliminate \(y\):
\[
(x+y) + (x-y) = 10 + 4 \;\Longrightarrow\; 2x = 14 \;\Longrightarrow\; x = 7.
\]"""
    result2 = split_response(test_response2)
    print("Test 2:")
    print(f"Analysis: {result2['analysis_content']}")
    print(f"Final: {result2['final_content'][:100]}...")
    print()
    # Test case 3: No analysis prefix
    test_response3 = """**Black holes** are fascinating cosmic objects where gravity is so strong that nothing can escape."""
    result3 = split_response(test_response3)
    print("Test 3:")
    print(f"Analysis: '{result3['analysis_content']}'")
    print(f"Final: {result3['final_content'][:100]}...")

if __name__ == "__main__":
    test_splitting()
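A possible alternative, if you can get the raw text with special tokens kept (for example by decoding the generated ids yourself with skip_special_tokens=False): split on the channel markers instead of on the fused words. This is only a sketch. The marker names <|channel|>, <|message|>, <|end|>, <|return|> and <|start|> are my assumption about the chat format and should be checked against the tokenizer; split_by_channel is a hypothetical helper, and tool-call headers are not handled.

```python
import re
from typing import Dict

# Assumed layout: <|channel|>NAME<|message|>TEXT, closed by <|end|> or <|return|>.
# The end-of-string alternative keeps truncated generations usable.
CHANNEL_RE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>|<\|start\|>|$)",
    flags=re.DOTALL,
)


def split_by_channel(raw: str) -> Dict[str, str]:
    """Map each channel name to its text; repeated channels are concatenated."""
    parts: Dict[str, str] = {}
    for name, text in CHANNEL_RE.findall(raw):
        parts[name] = (parts.get(name, "") + text).strip()
    return parts


if __name__ == "__main__":
    # Made-up sample whose final message is cut off (no closing marker)
    sample = (
        "<|channel|>analysis<|message|>Think.\nMore.<|end|>"
        "<|start|>assistant<|channel|>final<|message|>Answer."
    )
    print(split_by_channel(sample))
```

Compared with matching the literal word "assistantfinal", this should not misfire when the answer itself happens to contain that word, but it only works if the markers survive decoding.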
Thank you so much for sharing this method!
What is assistantfinal?
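As far as I understand it, this is a side effect of the harmony response format that gpt-oss uses: the model first writes its reasoning on an analysis channel and then the answer on a final channel, each wrapped in special tokens. When the special tokens are dropped during decoding, the leftover words run together. The sketch below only simulates that on a made-up string; the exact marker names are an assumption on my part and are worth checking against the tokenizer.

```python
import re

# Made-up completion in the assumed harmony layout (marker names are assumptions).
raw = (
    "<|channel|>analysis<|message|>Plan the answer.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Here is the answer.<|return|>"
)


def strip_special_tokens(text: str) -> str:
    """Delete every <|...|> marker, roughly what skip_special_tokens=True does."""
    return re.sub(r"<\|[^|]*\|>", "", text)


flattened = strip_special_tokens(raw)
print(flattened)
# prints: analysisPlan the answer.assistantfinalHere is the answer.
```

So "analysis" is the channel name of the reasoning part, and "assistantfinal" is the role name "assistant" followed directly by the channel name "final" of the answer.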