ValueError: the following model_kwargs are not used by model

by daryl149 - opened Jun 4, 2023

Discussion

daryl149

Jun 4, 2023

I am running the code with the transformers repo that was recommended in the llama model repos:

git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c
pip install .

However, I get an error when trying to run:

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("falcon-40b-sft-mix-1226", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "falcon-40b-sft-mix-1226",
    device_map="sequential",
    offload_folder="offload",
    load_in_8bit=True,
    trust_remote_code=True,
)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|endoftext|><|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)

tokens = model.generate(**inputs, max_new_tokens=25, do_sample=True, temperature=0.9, streamer=streamer)

returns this error:

dev_1/lib/python3.10/site-packages/transformers/generation/utils.py:1250: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
╭──────────────────────── Traceback (most recent call last) ────────────────────────╮
│ <stdin>:1 in <module>                                                             │
│                                                                                   │
│ dev_1/lib/python3.10/site-packages/torch/utils/_contextlib.py:115 in decorate_context │
│                                                                                   │
│   112 │   @functools.wraps(func)                                                  │
│   113 │   def decorate_context(*args, **kwargs):                                  │
│   114 │   │   with ctx_factory():                                                 │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                                    │
│   116 │                                                                           │
│   117 │   return decorate_context                                                 │
│   118                                                                             │
│                                                                                   │
│ dev_1/lib/python3.10/site-packages/transformers/generation/utils.py:1262 in generate │
│                                                                                   │
│   1259 │   │   generation_config = copy.deepcopy(generation_config)               │
│   1260 │   │   model_kwargs = generation_config.update(**kwargs)  # All unused kw │
│   1261 │   │   generation_config.validate()                                       │
│ ❱ 1262 │   │   self._validate_model_kwargs(model_kwargs.copy())                   │
│   1263 │   │                                                                      │
│   1264 │   │   # 2. Set generation parameters if not already defined              │
│   1265 │   │   logits_processor = logits_processor if logits_processor is not Non │
│                                                                                   │
│ dev_1/lib/python3.10/site-packages/transformers/generation/utils.py:1135 in _validate_model_kwargs │
│                                                                                   │
│   1132 │   │   │   │   unused_model_args.append(key)                              │
│   1133 │   │                                                                      │
│   1134 │   │   if unused_model_args:                                              │
│ ❱ 1135 │   │   │   raise ValueError(                                              │
│   1136 │   │   │   │   f"The following `model_kwargs` are not used by the model:  │
│   1137 │   │   │   │   " generate arguments will also show up in this list)"      │
│   1138 │   │   │   )                                                              │
╰───────────────────────────────────────────────────────────────────────────────────╯
ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)
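
For context, the check that raises here can be approximated in a few lines. This is only a simplified sketch of the idea, not the actual transformers implementation: `validate_model_kwargs` and `forward` are hypothetical names, and the real check inspects more than one method signature. It shows why a `token_type_ids` entry gets rejected when the model has no parameter of that name.

```python
import inspect


def validate_model_kwargs(forward_fn, model_kwargs):
    """Simplified stand-in for the check: reject kwargs that forward_fn cannot accept."""
    params = inspect.signature(forward_fn).parameters
    # A **kwargs catch-all would accept any name, so there is nothing to reject.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return
    unused = [k for k, v in model_kwargs.items() if v is not None and k not in params]
    if unused:
        raise ValueError(
            f"The following `model_kwargs` are not used by the model: {unused}"
        )


# Hypothetical forward signature without a token_type_ids parameter.
def forward(input_ids, attention_mask=None, past_key_values=None):
    return input_ids


try:
    validate_model_kwargs(forward, {"attention_mask": [[1, 1]], "token_type_ids": [[0, 0]]})
except ValueError as err:
    print(err)
```

Under these assumptions, any key the tokenizer returns that the model does not declare ends up in the `unused` list, which matches the message above.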

daryl149

Jun 4, 2023

edited Jun 5, 2023

Lol, ok, the quick fix:
open transformers/generation/utils.py and comment out the if statement on lines 1134-1138, like so:

        # if unused_model_args:
        #     raise ValueError(
        #         f"The following `model_kwargs` are not used by the model: {unused_model_args} (note: typos in the"
        #         " generate arguments will also show up in this list)"
        #     )

If it looks stupid, but it works...

New output (as expected for 25 max_new_tokens):

dev_1/lib/python3.10/site-packages/transformers/generation/utils.py:1250: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Ducks have a waterproof coating on their feathers, which allows them to swim and preen themselves in water.
Their web

Update: Hugging Face indicated they will not fix this in the transformers repo, so this workaround is now a mandatory step.

steremma

Jun 27, 2023

edited Jun 27, 2023

Simpler solution: replace **inputs with input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'] in the generate call.
The extra argument (token_type_ids) is returned by the tokenizer.
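
A minimal sketch of that idea, assuming the tokenizer output behaves like a dict. The helper name `generation_inputs` is made up for illustration, and plain lists stand in for the real tensors so the snippet runs without loading the model.

```python
def generation_inputs(encoded, keep=("input_ids", "attention_mask")):
    """Return only the tokenizer outputs that should be forwarded to generate()."""
    return {name: encoded[name] for name in keep if name in encoded}


# Stand-in for the tokenizer output from the original post (lists instead of tensors).
encoded = {
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
}
filtered = generation_inputs(encoded)
print(sorted(filtered))  # ['attention_mask', 'input_ids']

# With the real objects from the original post, the call would look like:
# tokens = model.generate(**generation_inputs(inputs), max_new_tokens=25,
#                         do_sample=True, temperature=0.9, streamer=streamer)
```

Passing return_token_type_ids=False in the tokenizer call should have the same effect, since the key is then never produced, though that variant was not tried in this thread.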

daryl149

Jul 5, 2023

edited Jul 5, 2023

> Simpler solution: replace **inputs with input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'] in the generate call.
> The extra argument (token_type_ids) is returned by the tokenizer.

yooo, that's way easier, thanks :D

daryl149 changed discussion status to closed Jul 5, 2023
