Instructions to use ricepaper/vi-gemma2-2b-ChatQA-RAG-v1 with libraries and local apps.

Libraries

How to use ricepaper/vi-gemma2-2b-ChatQA-RAG-v1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ricepaper/vi-gemma2-2b-ChatQA-RAG-v1")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ricepaper/vi-gemma2-2b-ChatQA-RAG-v1")
model = AutoModelForCausalLM.from_pretrained("ricepaper/vi-gemma2-2b-ChatQA-RAG-v1")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ricepaper/vi-gemma2-2b-ChatQA-RAG-v1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ricepaper/vi-gemma2-2b-ChatQA-RAG-v1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ricepaper/vi-gemma2-2b-ChatQA-RAG-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
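
The same request can also be built from Python. The following is a minimal sketch using only the standard library. It assumes the vLLM server started above is listening on localhost:8000; the helper names `build_chat_request` and `send` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "ricepaper/vi-gemma2-2b-ChatQA-RAG-v1"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request("What is the capital of France?")
print(request.full_url)
# With a server running, the call would be: print(send(request))
```

Building the request is kept separate from sending it, so the payload can be inspected without a running server.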

Use Docker

docker model run hf.co/ricepaper/vi-gemma2-2b-ChatQA-RAG-v1

SGLang

How to use ricepaper/vi-gemma2-2b-ChatQA-RAG-v1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ricepaper/vi-gemma2-2b-ChatQA-RAG-v1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ricepaper/vi-gemma2-2b-ChatQA-RAG-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ricepaper/vi-gemma2-2b-ChatQA-RAG-v1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ricepaper/vi-gemma2-2b-ChatQA-RAG-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
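
The vLLM and SGLang servers above reply in the OpenAI chat-completions format, where the generated text is found at `choices[0].message.content`. The sketch below shows one way to pull it out. The helper name `extract_reply` is hypothetical, and the sample body is a hand-written placeholder trimmed to the relevant fields, not real model output.

```python
import json


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-compatible chat completion body."""
    data = json.loads(response_body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written placeholder response, not real model output.
sample = json.dumps({
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "<model reply>"}}
    ]
})
print(extract_reply(sample))
```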

Unsloth Studio

How to use ricepaper/vi-gemma2-2b-ChatQA-RAG-v1 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ricepaper/vi-gemma2-2b-ChatQA-RAG-v1 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ricepaper/vi-gemma2-2b-ChatQA-RAG-v1 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ricepaper/vi-gemma2-2b-ChatQA-RAG-v1 to start chatting

Load model with FastModel

# pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="ricepaper/vi-gemma2-2b-ChatQA-RAG-v1",
    max_seq_length=2048,
)

Docker Model Runner
How to use ricepaper/vi-gemma2-2b-ChatQA-RAG-v1 with Docker Model Runner:
```
docker model run hf.co/ricepaper/vi-gemma2-2b-ChatQA-RAG-v1
```

Model Card: vi-gemma2-2b-ChatQA-RAG-v1

(English below)

Tiếng Việt (Vietnamese)

Model description:

vi-gemma2-2b-ChatQA-RAG is a large language model fine-tuned from the base model google/gemma-2-2b-it using LoRA. It was trained on a Vietnamese dataset with the goal of improving Vietnamese language processing and performance on open information retrieval tasks (Retrieval Augmented Generation, RAG).

The fine-tuning focuses on the RAG task and follows the NVIDIA Chat-QA method.

Usage:

Below we share some code snippets for getting started quickly with the model. First make sure to run pip install -U transformers, then copy the snippet from the section relevant to your use case.

We recommend torch.bfloat16 as the default.

# pip install transformers torch accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and model from the saved checkpoint
tokenizer = AutoTokenizer.from_pretrained("hiieu/vi-gemma2-2b-ChatQA-RAG-v1")
model = AutoModelForCausalLM.from_pretrained(
    "hiieu/vi-gemma2-2b-ChatQA-RAG-v1",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# device_map="auto" already places the model on a GPU when one is available,
# so no explicit model.to("cuda") call is needed.

messages = [
    {"role": "user", "content": "Hãy cho tôi biết một số tính chất của STRs được dùng để làm gì?"}
]
document = """Context: Short Tandem Repeats (STRs) là các trình tự DNA lặp lại ngắn (2- 6 nucleotides) xuất hiện phổ biến trong hệ gen của con người. Các trình tự này có tính đa hình rất cao trong tự nhiên, điều này khiến các STRs trở thành những markers di truyền rất quan trọng trong nghiên cứu bản đồ gen người và chuẩn đoán bệnh lý di truyền cũng như xác định danh tính trong lĩnh vực pháp y.
Các STRs trở nên phổ biến tại các phòng xét nghiệm pháp y bởi vì việc nhân bản và phân tích STRs chỉ cần lượng DNA rất thấp ngay cả khi ở dạng bị phân hủy việc đinh danh vẫn có thể được thực hiện thành công. Hơn nữa việc phát hiện và đánh giá sự nhiễm DNA mẫu trong các mẫu vật có thể được giải quyết nhanh với kết quả phân tích STRs. Ở Hoa Kỳ hiện nay, từ bộ 13 markers nay đã tăng lên 20 markers chính đang được sử dụng để tạo ra một cơ sở dữ liệu DNA trên toàn đất nước được gọi là The FBI Combined DNA Index System (Expaned CODIS).
CODIS và các cơ sử dữ liệu DNA tương tự đang được sử dụng thực sự thành công trong việc liên kết các hồ sơ DNA từ các tội phạm và các bằng chứng hiện trường vụ án. Kết quả định danh STRs cũng được sử dụng để hỗ trợ hàng trăm nghìn trường hợp xét nghiệm huyết thống cha con mỗi năm'
"""

def get_formatted_input(messages, context):
    system = "System: Đây là một cuộc trò chuyện giữa người dùng và trợ lý trí tuệ nhân tạo. Trợ lý cung cấp câu trả lời hữu ích, chi tiết và lịch sự cho các câu hỏi của người dùng dựa trên ngữ cảnh được cung cấp. Trợ lý cũng nên chỉ ra khi câu trả lời không thể tìm thấy trong ngữ cảnh."
    conversation = '\n\n'.join(["User: " + item["content"] if item["role"] == "user" else "Assistant: " + item["content"] for item in messages]) 
    formatted_input = system + "\n\n" + context + "\n\n" + conversation + "\n\n### Assistant:"
    
    return formatted_input

# Prepare the input
formatted_input = get_formatted_input(messages, document)

# Tokenize the input text into input ids
input_ids = tokenizer(formatted_input, return_tensors="pt").to(model.device)

# Generate text with the model
outputs = model.generate(
    **input_ids,
    max_new_tokens=512,
    do_sample=True,   # Enable sampling: the next token is drawn at random from the model's probability distribution over tokens.
    temperature=0.1,  # Low temperature to limit randomness
)
# Decode and print the result
print(tokenizer.decode(outputs[0]).rsplit("### Assistant:")[-1])
>>> STRs là các trình tự DNA lặp lại ngắn (2-6 nucleotides) xuất hiện phổ biến trong hệ gen của con người. Chúng có tính đa hình cao và được sử dụng trong nghiên cứu bản đồ gen người và chuẩn đoán bệnh lý di truyền.<eos>
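
The prompt layout used above (system line, context, conversation turns, then the `### Assistant:` marker) can be checked without loading the model. The sketch below rebuilds that layout for a multi-turn conversation and strips the trailing `<eos>` token from a decoded output. The helper names `format_chatqa_prompt` and `extract_answer` and the placeholder turns are hypothetical. The card does not state how well the model handles multi-turn input, so treat this as an illustration of the format only.

```python
SYSTEM_PROMPT = (
    "System: Đây là một cuộc trò chuyện giữa người dùng và trợ lý trí tuệ nhân tạo. "
    "Trợ lý cung cấp câu trả lời hữu ích, chi tiết và lịch sự cho các câu hỏi của "
    "người dùng dựa trên ngữ cảnh được cung cấp. Trợ lý cũng nên chỉ ra khi câu trả "
    "lời không thể tìm thấy trong ngữ cảnh."
)
ANSWER_MARKER = "### Assistant:"


def format_chatqa_prompt(messages, context, system=SYSTEM_PROMPT):
    """Join system line, context, and turns in the same layout as the snippet above."""
    turns = []
    for item in messages:
        speaker = "User: " if item["role"] == "user" else "Assistant: "
        turns.append(speaker + item["content"])
    conversation = "\n\n".join(turns)
    return system + "\n\n" + context + "\n\n" + conversation + "\n\n" + ANSWER_MARKER


def extract_answer(decoded, eos_token="<eos>"):
    """Keep only the text after the last answer marker and drop the EOS token."""
    answer = decoded.rsplit(ANSWER_MARKER, 1)[-1]
    return answer.replace(eos_token, "").strip()


# Placeholder conversation and context, for illustration only.
history = [
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "follow-up question"},
]
context = "Context: placeholder document text"

prompt = format_chatqa_prompt(history, context)
print(prompt)
print(extract_answer(prompt + " placeholder answer<eos>"))
```

Splitting on the last `### Assistant:` marker keeps the extraction correct even if the context or earlier turns happen to contain the same marker.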