ValueError: the following model_kwargs are not used by model
I am running the code with the transformers commit that was recommended in the LLaMA model repos:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c
pip install .
However, I get an error when trying to run:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
tokenizer = AutoTokenizer.from_pretrained("falcon-40b-sft-mix-1226", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("falcon-40b-sft-mix-1226", device_map="sequential", offload_folder="offload", load_in_8bit=True, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|endoftext|><|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=25, do_sample=True, temperature=0.9, streamer=streamer)
The generate call returns this error:
dev_1/lib/python3.10/site-packages/transformers/generation/utils.py:1250: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
Traceback (most recent call last)
<stdin>:1 in <module>

dev_1/lib/python3.10/site-packages/torch/utils/_contextlib.py:115 in decorate_context

  112     @functools.wraps(func)
  113     def decorate_context(*args, **kwargs):
  114         with ctx_factory():
❱ 115             return func(*args, **kwargs)
  116
  117     return decorate_context
  118

dev_1/lib/python3.10/site-packages/transformers/generation/utils.py:1262 in generate

  1259         generation_config = copy.deepcopy(generation_config)
  1260         model_kwargs = generation_config.update(**kwargs)  # All unused kw
  1261         generation_config.validate()
❱ 1262         self._validate_model_kwargs(model_kwargs.copy())
  1263
  1264         # 2. Set generation parameters if not already defined
  1265         logits_processor = logits_processor if logits_processor is not Non

dev_1/lib/python3.10/site-packages/transformers/generation/utils.py:1135 in _validate_model_kwargs

  1132                 unused_model_args.append(key)
  1133
  1134         if unused_model_args:
❱ 1135             raise ValueError(
  1136                 f"The following `model_kwargs` are not used by the model:
  1137                 " generate arguments will also show up in this list)"
  1138             )

ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)
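For context, here is a rough sketch of the kind of check that raises this error: generate compares the keyword arguments it receives against what the model accepts and rejects anything left over. This is a simplified approximation, not the actual transformers code. The names unused_kwargs and fake_forward are hypothetical, and the signature of fake_forward is an assumption standing in for the Falcon model's forward method.

```python
import inspect


def unused_kwargs(forward_fn, kwargs):
    """Return the keys in kwargs that forward_fn would not accept."""
    params = inspect.signature(forward_fn).parameters
    # A **kwargs catch-all in the signature accepts any key.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [key for key in kwargs if key not in params]


# Hypothetical stand-in for the model's forward signature (an assumption).
def fake_forward(input_ids, attention_mask=None, past_key_values=None):
    return None


# Stand-in for the tokenizer output; real values would be tensors.
tokenizer_output = {
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
}

print(unused_kwargs(fake_forward, tokenizer_output))  # ['token_type_ids']
```

Printing the keys of the real tokenizer output (list(inputs.keys())) is a quick way to see which extra argument is being passed along.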
Lol, ok the quick fix:
Open transformers/generation/utils.py and comment out the if statement on lines 1134-1138, like so:
# if unused_model_args:
# raise ValueError(
# f"The following `model_kwargs` are not used by the model: {unused_model_args} (note: typos in the"
# " generate arguments will also show up in this list)"
# )
If it looks stupid, but it works...
New output (as expected for 25 max_new_tokens):
dev_1/lib/python3.10/site-packages/transformers/generation/utils.py:1250: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Ducks have a waterproof coating on their feathers, which allows them to swim and preen themselves in water.
Their web
Update: Hugging Face indicated they will not fix this in the transformers repo, so this workaround is now a mandatory step.
Simpler solution: in the generate call, replace **inputs with input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']
The extra arg (token_type_ids) is returned by the tokenizer.
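To keep the ** unpacking and still avoid the error, a minimal sketch of the same idea is to filter the tokenizer output before passing it on. The helper name drop_unused_inputs is hypothetical, and the plain dict below only stands in for the real tokenizer output so the snippet runs without loading the model.

```python
def drop_unused_inputs(inputs, unused=("token_type_ids",)):
    """Return a plain dict of the tokenizer output without the keys generate() rejects."""
    return {key: value for key, value in inputs.items() if key not in unused}


# Stand-in for tokenizer(message, return_tensors="pt"); real values would be tensors.
inputs = {
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
}

clean = drop_unused_inputs(inputs)
print(sorted(clean))  # ['attention_mask', 'input_ids']

# With the model, tokenizer and streamer from the original post, the call would become:
# tokens = model.generate(**drop_unused_inputs(inputs), max_new_tokens=25,
#                         do_sample=True, temperature=0.9, streamer=streamer)
```

Another option that should have the same effect is asking the tokenizer not to produce the key in the first place, with tokenizer(message, return_tensors="pt", return_token_type_ids=False). Neither approach requires editing the installed transformers package.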
yooo, that's way easier, thanks :D