how to solve this error

#64

by vinayakarsh - opened Mar 1, 2025

Discussion

vinayakarsh

Mar 1, 2025

Unsupported: call_method UserDefinedObjectVariable(Params4bit) t [] {}

from user code:
File "/usr/local/lib/python3.11/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 887, in forward
outputs = self.model(
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 667, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 321, in forward
hidden_states, self_attn_weights = self.self_attn(
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/nn/modules.py", line 484, in forward
return bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state).to(inp_dtype)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

I'm using bitsandbytes quantization:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
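The post does not show how the config is passed to the model, so the following is only a minimal sketch of a typical setup that would reach the `forward` call in the traceback. The function name `generate_with_4bit`, the prompt handling, and `device_map="auto"` (which needs the `accelerate` package) are assumptions for illustration, not the poster's actual code.

```python
def generate_with_4bit(prompt, model_id="google/gemma-2-2b-it", max_new_tokens=40):
    """Load the model in 4-bit and generate a reply.

    Hypothetical helper: needs a CUDA GPU plus torch, transformers,
    accelerate and bitsandbytes installed, and access to the model repo.
    """
    # Imported lazily so that defining this function does not require
    # the heavy dependencies to be installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Same settings as in the post above.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",  # assumption: let accelerate place the weights
    )

    # Build the chat-formatted input and move it to the model's device.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    # The error in this thread is raised during this generate() call.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_with_4bit("Who are you?")` on an affected setup would be expected to hit the same code path as the traceback, though that depends on the installed library versions.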

GopiUppari

Google org Mar 3, 2025

Hi @vinayakarsh ,

You are getting this error because TorchDynamo is trying to optimize the computation graph, but bitsandbytes 4-bit quantized layers are not fully supported by TorchDynamo. To resolve it, please disable TorchDynamo using the following code.

  import torch._dynamo
  torch._dynamo.config.suppress_errors = True
  torch._dynamo.disable()

The code was successfully executed in Google Colab with a T4 GPU runtime. You can check the details in the provided gist file, where I have also listed the library versions used.

Thank you.

vinayakarsh

Mar 7, 2025

Thanks for the help... I tried disabling TorchDynamo using the code above, and it returns the error below:
"""
Unsupported: call_method UserDefinedObjectVariable(Params4bit) t [] {}

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
"""

I had set these using the below code:
"""
import os
os.environ["TORCH_LOGS"] = "+dynamo"
os.environ["TORCHDYNAMO_VERBOSE"] = "1"
"""

It still returns the same error...
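A note on the two variables: the error message suggests them only to get more detailed Dynamo logs, so setting them is not expected to make the error go away. Below is a minimal sketch of turning them on, assuming (as far as I understand PyTorch's behavior) that they are read around the time `torch` is imported and so should be set before that import. The helper name `enable_dynamo_diagnostics` is hypothetical.

```python
import os
import sys


def enable_dynamo_diagnostics(environ=os.environ):
    """Set the two diagnostic variables named in the error message.

    Returns True if torch had not been imported yet when they were set,
    False if torch was already loaded (the settings may then be ignored).
    """
    set_before_import = "torch" not in sys.modules
    environ["TORCH_LOGS"] = "+dynamo"       # verbose Dynamo logging
    environ["TORCHDYNAMO_VERBOSE"] = "1"    # fuller error reports
    return set_before_import


if not enable_dynamo_diagnostics():
    # Assumption: a restart is the simplest way to make sure the
    # variables are in place before torch is first imported.
    print("torch was already imported; restart the runtime and set these first")
```

In a notebook this would go in the very first cell, before any `import torch` or `import transformers`.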

zddydy

Mar 8, 2025 • edited Mar 8, 2025

I'm facing the same problem... So how can it be solved?

zddydy

Mar 8, 2025

apt install gcc
Then add "export CC=/usr/bin/gcc" to your .bashrc.

GopiUppari

Google org Mar 10, 2025

Could you please share more details about your hardware environment? Sharing the code you are using would also be helpful. This information will help us understand the issue and assist you effectively.

Thank you.

vinayakarsh

Mar 10, 2025 • edited Mar 10, 2025

I'm using a T4 GPU on Google Colab...

GopiUppari

Google org Mar 11, 2025

Hi @vinayakarsh ,

I successfully ran the code in the environment you mentioned (a T4 GPU on Google Colab). For more details, please refer to this gist file.

Thank you.

Sowmiya01

Mar 14, 2025

I am also facing the same issue. With 4-bit quantization, during inference on a GPU-based machine, I get this error:

2025-03-14 22:59:12,521 - ERROR - Error generating response: call_method UserDefinedObjectVariable(Params4bit) t [] {}

from user code:
File "/home/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/home/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 887, in forward
outputs = self.model(
File "/home/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 667, in forward
layer_outputs = decoder_layer(
File "/home/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 321, in forward
hidden_states, self_attn_weights = self.self_attn(
File "/home/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
File "/home/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 484, in forward
return bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state).to(inp_dtype)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

HarryMayne

Mar 20, 2025

FYI this seems to be a bug with transformers 4.49.0. Downgrading to 4.48.0 works for me

nmcco

Mar 21, 2025

Confirming that downgrading to 4.48.0 fixes this.
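For anyone checking whether this applies to their setup, here is a minimal sketch that compares the installed transformers version against the two versions mentioned in this thread. It only encodes what was reported here (4.49.0 failing, 4.48.0 working) and makes no claim about other versions; the function name `check_transformers_version` is hypothetical.

```python
from importlib import metadata

# Taken from this thread only: 4.49.0 was reported as failing,
# 4.48.0 was reported as working.
REPORTED_FAILING = "4.49.0"
REPORTED_WORKING = "4.48.0"


def check_transformers_version(installed=None):
    """Return a short advisory for the given (or installed) version."""
    if installed is None:
        try:
            installed = metadata.version("transformers")
        except metadata.PackageNotFoundError:
            return "transformers is not installed"
    if installed == REPORTED_FAILING:
        return (
            f"transformers {installed} is the version reported as failing; "
            f"try: pip install transformers=={REPORTED_WORKING}"
        )
    return f"transformers {installed} is not the version reported as failing in this thread"


print(check_transformers_version())
```

After changing the version in a notebook, the runtime usually needs a restart so the newly installed package is the one that gets imported.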

vinayakarsh

Mar 23, 2025

thank you... it was really helpful...
