Loading issue in HF Spaces
I am not able to run bloom-7b1 on the A10 large, even though the same setup works with models like falcon-7b. I do not understand why, as this model does not seem much larger than what the A10 large can handle (about 15 GB of VRAM). The model initializes, but inference takes a very long time.
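For what it's worth, the size figure above comes from a back-of-the-envelope estimate along these lines. This is only a rough sketch: `PARAM_COUNT` is an assumption read off the model name (about 7.1 billion parameters), and it counts the weights alone, not activations or any cache.

```python
# Rough weight-memory estimate for a ~7.1B-parameter model.
# PARAM_COUNT is an assumption taken from the model name, not a measured value.
PARAM_COUNT = 7.1e9

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(param_count, dtype):
    """Return the memory needed for the weights alone, in GiB."""
    return param_count * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(PARAM_COUNT, dtype):.1f} GiB")
```

Under these assumptions the weights come to roughly 13 GiB in float16 and about twice that in float32, so the precision the model is loaded in matters a lot for whether it fits.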
Any ideas on what I may be doing wrong here?
import gradio as gr
import os
import torch
#-- sanity check on hardware
print(f"Is CUDA available: {torch.cuda.is_available()}")
#-- True
print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
#-- Nvidia something something
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
#-- possible models
flan = "google/flan-t5-xxl"
falcon_7b = "tiiuae/falcon-7b"
falcon_7b_instruct = "tiiuae/falcon-7b-instruct"
bloom_7b = "bigscience/bloom-7b1"
bloom_7b_instruct = "bigscience/bloomz-7b1-mt"
bloom_560m = "bigscience/bloom-560m"
#-- set args for retrieved model
args = {"temperature":0.0001, "max_length":250}
#-- specify model
llm=HuggingFaceHub(repo_id=bloom_7b, model_kwargs=args)
#-- sanity check
print('LLM loaded!')
#-- variable for input + eventual prompts
template='{question}'
prompt = PromptTemplate(template=template, input_variables=["question"])
#-- init langchain
chain = LLMChain(llm=llm, prompt=prompt)
#-- sanity check
print(chain.run('What is the Sally-Anne test?'))
#-- Run the chain only specifying the input variable.
def answer(question):
    return chain.run(question)
#-- init app
demo = gr.Interface(fn=answer, inputs='text',outputs='text',examples=[['Hey how are you']])
demo.launch()
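To put a number on "very long", a small timing wrapper could go around the chain call. This is a minimal sketch: `timed` and `fake_answer` are hypothetical helpers that are not part of the app above, and `fake_answer` only stands in for `chain.run` so the snippet runs without a model.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-in for chain.run, so the sketch runs on its own.
def fake_answer(question):
    time.sleep(0.05)
    return f"echo: {question}"


result, seconds = timed(fake_answer, "What is the Sally-Anne test?")
print(f"{seconds:.2f}s -> {result}")
```

In the app itself, the same wrapper would be called as `timed(chain.run, question)`, which makes it easy to compare the latency of bloom-7b1 against falcon-7b on the same prompt.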