How do I solve this error?
Unsupported: call_method UserDefinedObjectVariable(Params4bit) t [] {}
from user code:
File "/usr/local/lib/python3.11/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 887, in forward
outputs = self.model(
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 667, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 321, in forward
hidden_states, self_attn_weights = self.self_attn(
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/nn/modules.py", line 484, in forward
return bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state).to(inp_dtype)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
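I'm using bitsandbytes quantisation:
import torch
from transformers import BitsAndBytesConfig

# Load weights in 4-bit and run the matmuls in float16
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)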
Hi @vinayakarsh,
You are getting this error because TorchDynamo is trying to optimize the computation graph, but bitsandbytes 4-bit quantized layers are not fully supported by TorchDynamo. To resolve it, please disable TorchDynamo using the following code.
import torch._dynamo
torch._dynamo.config.suppress_errors = True
torch._dynamo.disable()
The code was successfully executed in Google Colab with a T4 GPU runtime. You can check the details in the provided gist file, where I have also listed the library versions used.
Thank you.
Thanks for the help. I tried disabling TorchDynamo using the code above, but it still returns the error below:
"""
Unsupported: call_method UserDefinedObjectVariable(Params4bit) t [] {}
from user code:
File "/usr/local/lib/python3.11/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 887, in forward
outputs = self.model(
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 667, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 321, in forward
hidden_states, self_attn_weights = self.self_attn(
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma2/modeling_gemma2.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/nn/modules.py", line 484, in forward
return bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state).to(inp_dtype)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
"""
I also set these variables using the code below:
"""
import os
os.environ["TORCH_LOGS"] = "+dynamo"
os.environ["TORCHDYNAMO_VERBOSE"] = "1"
"""
It still returns the same error.
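A note for anyone following along: the two variables named in the error message only make Dynamo log more detail. They are not expected to make the error go away. As far as I know they are read when torch is imported, so they need to be set before `import torch` runs. A minimal sketch, where the helper name `enable_dynamo_debug_logging` is made up for illustration:

```python
import os


def enable_dynamo_debug_logging(env=os.environ):
    """Set the two logging variables named in the error message.

    Assumption: call this before `import torch`, since the values
    appear to be read at import time. This only increases log
    verbosity; it does not disable or fix anything.
    """
    env["TORCH_LOGS"] = "+dynamo"
    env["TORCHDYNAMO_VERBOSE"] = "1"
    return {key: env[key] for key in ("TORCH_LOGS", "TORCHDYNAMO_VERBOSE")}


settings = enable_dynamo_debug_logging()
print(settings)
# import torch  # import only after the variables are set
```

In a notebook, that means running this cell first, restarting the runtime if torch was already imported.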
I'm facing the same problem. How can it be solved?
Install gcc: apt install gcc
Then add "export CC=/usr/bin/gcc" to your .bashrc.
Could you please share more details about your hardware environment, along with the code you are using? This information will help us understand the issue and assist you effectively.
Thank you.
Hi @vinayakarsh,
I successfully executed the code in the environment you mentioned (T4 GPU) in Google Colab. For more details, please refer to this gist file.
Thank you.
I am also facing the same issue. With 4-bit quantization, during inference on a GPU-based machine, I get this error:
2025-03-14 22:59:12,521 - ERROR - Error generating response: call_method UserDefinedObjectVariable(Params4bit) t [] {}
from user code:
File "/home/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/home/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 887, in forward
outputs = self.model(
File "/home/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 667, in forward
layer_outputs = decoder_layer(
File "/home/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 321, in forward
hidden_states, self_attn_weights = self.self_attn(
File "/home/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
File "/home/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 484, in forward
return bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state).to(inp_dtype)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
FYI, this seems to be a bug in transformers 4.49.0. Downgrading to 4.48.0 works for me.
Confirming that downgrading to 4.48.0 fixes this.
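To check quickly whether an environment has the version reported above, something like the following may help. It is only a sketch: the function name and messages are made up for illustration, and it compares against the single version mentioned in this thread (4.49.0) without claiming anything about other releases.

```python
from importlib import metadata

REPORTED_FAILING = "4.49.0"  # version reported as failing in this thread
REPORTED_WORKING = "4.48.0"  # version reported as working in this thread


def check_transformers_version(installed=None):
    """Return a short hint based on the installed transformers version.

    Pass `installed` explicitly to test the logic without the package.
    """
    if installed is None:
        try:
            installed = metadata.version("transformers")
        except metadata.PackageNotFoundError:
            return "transformers is not installed"
    if installed == REPORTED_FAILING:
        return (
            f"transformers {installed} matches the version reported as failing; "
            f"try: pip install transformers=={REPORTED_WORKING}"
        )
    return f"transformers {installed} is not the version reported as failing here"


print(check_transformers_version())
```

After downgrading, restart the Python process or notebook runtime so the new version is the one that gets imported.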
Thank you, that was really helpful.

