assistantfinal, analysis keyword is contained in the huggingface gpt-oss-120 output. Is this intended?

#130

by ml345 - opened Aug 26, 2025

Discussion

ml345

Aug 26, 2025

from transformers import pipeline
import torch
import os

model_id = "openai/gpt-oss-120b"

# Use shared cache directory as per README.md guidelines
cache_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "llm_weights")
os.makedirs(cache_dir, exist_ok=True)

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map={"": 0},  # place the whole model on GPU 0
    model_kwargs={"cache_dir": cache_dir},
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

try:
    outputs = pipe(
        messages,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=pipe.tokenizer.eos_token_id,
    )
    print(outputs[0]["generated_text"])
except Exception as e:
    print(f"Generation failed: {e}")
    print("Try reducing max_new_tokens or using CPU device")

==============
[{'role': 'user', 'content': 'Explain quantum mechanics clearly and concisely.'}, {'role': 'assistant', 'content': "analysisWe need to explain quantum mechanics clearly and concisely. Provide a brief overview: core principles, wavefunction, superposition, uncertainty, measurement, entanglement, quantization, Schrödinger equation, etc. Use analogies, avoid heavy math but maybe simple equations. Keep concise but thorough. Probably bullet points. Let's produce final answer.assistantfinalQuantum Mechanics in a nutshell\n\n| Core idea | What it means | Everyday analogy |\n|----------|----------------|-------------------|\n| 1. Quantized energy | Particles (electrons, photons, etc.) can only have certain discrete energy values, not a continuous range. | Like a staircase: you can stand on a step (allowed energy) but not in between. |\n| 2. Wave‑particle duality | Every quantum object has both particle‑like and wave‑like aspects. Its state is described by a wavefunction\u202fψ(x,t). | A water wave that also carries a tiny boat—its height tells you the probability of finding the boat. |\n| 3. Superposition | A system can exist in a combination of several possible states simultaneously until measured. | A spinning coin that is both “heads” and “"}]
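The printed value is the whole chat history: a list of message dicts whose last entry is the model's reply. A minimal sketch for pulling out only that reply, assuming the output shape shown above (`last_assistant_content` is a hypothetical helper name, not part of Transformers, and `example_outputs` is a shortened hand-written stand-in for the real result):

```python
from typing import Any, Dict, List


def last_assistant_content(outputs: List[Dict[str, Any]]) -> str:
    """Return the text of the last assistant message in a chat pipeline result."""
    messages = outputs[0]["generated_text"]
    # Walk backwards so the most recent assistant turn wins.
    for message in reversed(messages):
        if message.get("role") == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in pipeline output")


# Hand-written stand-in for the pipeline result printed above (shortened).
example_outputs = [{
    "generated_text": [
        {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
        {"role": "assistant",
         "content": "analysisWe need to explain.assistantfinalQuantum Mechanics in a nutshell"},
    ]
}]

print(last_assistant_content(example_outputs))
```

The returned string still contains the `analysis` and `assistantfinal` markers; it only drops the echoed user turn.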

ml345 changed discussion title from assistantfinal, analysis keyword is contained in the huggingface gpt-oss-120 output. Is this okay? to assistantfinal, analysis keyword is contained in the huggingface gpt-oss-120 output. Is this intended? Aug 26, 2025

Carrieischoc

Sep 14, 2025

Hello. I'm encountering the same issue. Have you resolved it yet? Thanks.

ml345

Sep 15, 2025

edited Sep 15, 2025

In the meantime, I wrote some regex-based splitting logic with Claude. I wish the analysis and the final answer were returned separately.

import re
from typing import Dict


def split_response(response: str) -> Dict[str, str]:
    """Split the response into analysis and final content."""
    # Find the analysis part (everything before "assistantfinal")
    analysis_match = re.search(r'^(.*?)assistantfinal', response, flags=re.DOTALL)

    if analysis_match:
        analysis_content = analysis_match.group(1).strip()
        # Remove "analysis" prefix if present
        analysis_content = re.sub(r'^analysis\s*', '', analysis_content)

        # Find the final content (everything after "assistantfinal")
        final_content = re.sub(r'^.*?assistantfinal\s*', '', response, flags=re.DOTALL).strip()
    else:
        # If no "assistantfinal" found, check for just "analysis" prefix
        if response.startswith('analysis'):
            analysis_content = response.strip()
            final_content = ""
        else:
            # No clear separation, treat entire response as final content
            analysis_content = ""
            final_content = response.strip()

    return {
        "analysis_content": analysis_content,
        "final_content": final_content
    }

def test_splitting():
    # Test case 1: Response from the actual results
    test_response1 = """analysisWe need to explain quantum entanglement. Provide a clear explanation, possibly with analogies, mention key concepts: superposition, nonlocal correlations, Bell's theorem, measurements, etc. Should be accessible.assistantfinalQuantum entanglement is a striking feature of quantum mechanics in which two (or more) particles become linked so that the state of one instantly influences the state of the other, no matter how far apart they are."""

    result1 = split_response(test_response1)
    print("Test 1:")
    print(f"Analysis: {result1['analysis_content'][:100]}...")
    print(f"Final: {result1['final_content'][:100]}...")
    print()

    # Test case 2: Math problem response
    # (raw string, so the LaTeX backslashes are not read as Python escapes)
    test_response2 = r"""analysisWe need to solve system: x + y = 10, x - y = 4. Solve: add: 2x = 14 => x = 7. Then y = 3. Then x*y = 21. Provide answer.assistantfinalFrom the two equations

\[
\begin{cases}
x + y = 10 \\
x - y = 4
\end{cases}
\]

add them to eliminate $y$ :

\[
(x+y) + (x-y) = 10 + 4 \;\Longrightarrow\; 2x = 14 \;\Longrightarrow\; x = 7.
\]"""

    result2 = split_response(test_response2)
    print("Test 2:")
    print(f"Analysis: {result2['analysis_content']}")
    print(f"Final: {result2['final_content'][:100]}...")
    print()

    # Test case 3: No analysis prefix
    test_response3 = """**Black holes** are fascinating cosmic objects where gravity is so strong that nothing can escape."""

    result3 = split_response(test_response3)
    print("Test 3:")
    print(f"Analysis: '{result3['analysis_content']}'")
    print(f"Final: {result3['final_content'][:100]}...")


if __name__ == "__main__":
    test_splitting()
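Matching the literal string "assistantfinal" can misfire if that word ever appears inside the answer itself. One alternative is to decode with the special tokens kept (for example `tokenizer.decode(ids, skip_special_tokens=False)`) and split on the channel markers instead. The sketch below assumes the raw text follows a harmony-style layout, `<|channel|>NAME<|message|>TEXT` terminated by `<|end|>`, `<|return|>`, or `<|call|>`; the marker strings are an assumption and should be checked against the model's chat template, and `split_channels` is a hypothetical helper name.

```python
import re
from typing import Dict

# Assumed harmony-style layout: <|channel|>NAME<|message|>TEXT, where TEXT runs
# until <|end|>, <|return|>, <|call|>, or the end of the string (truncated output).
_CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<name>[^<]*)<\|message\|>(?P<text>.*?)"
    r"(?=<\|end\|>|<\|return\|>|<\|call\|>|$)",
    flags=re.DOTALL,
)


def split_channels(raw: str) -> Dict[str, str]:
    """Map each channel name (e.g. 'analysis', 'final') to its message text."""
    channels: Dict[str, str] = {}
    for match in _CHANNEL_RE.finditer(raw):
        name = match.group("name").strip()
        text = match.group("text").strip()
        # Join repeated messages on the same channel with a newline.
        channels[name] = (channels.get(name, "") + "\n" + text).strip()
    return channels


# Hand-written example of the assumed raw layout.
raw = (
    "<|channel|>analysis<|message|>We need to answer briefly."
    "<|end|><|start|>assistant<|channel|>final<|message|>Paris.<|return|>"
)
parts = split_channels(raw)
print(parts["analysis"])
print(parts["final"])
```

If generation stops at `max_new_tokens` before the final channel starts, the result simply has no `final` key, which makes truncated answers easy to detect.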

Carrieischoc

Sep 15, 2025

Thank you so much for bringing this method!

Gerald001

Feb 25

What is assistantfinal?
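A likely explanation, sketched under the assumption that the model emits a harmony-style layout with `<|...|>` special tokens around role and channel names: when decoding drops those special tokens, the plain words between them ("analysis", "assistant", "final") are left behind and run together. The `raw` string below is a hand-written illustration, not captured model output, and `strip_markers` is a hypothetical helper.

```python
import re

# Assumed raw completion in a harmony-style layout (hand-written example).
raw = (
    "<|channel|>analysis<|message|>We need to answer briefly."
    "<|end|><|start|>assistant<|channel|>final<|message|>Paris.<|return|>"
)


def strip_markers(text: str) -> str:
    """Delete every <|...|> marker, mimicking a decode that skips special tokens."""
    return re.sub(r"<\|[^|]*\|>", "", text)


print(strip_markers(raw))
# analysisWe need to answer briefly.assistantfinalParis.
```

Under that assumption, "assistantfinal" is not a keyword of its own: it is the role name "assistant" followed directly by the channel name "final", marking where the reasoning ends and the user-facing answer begins.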
