Instructions to use LnL-AI/dbrx-base-converted-v2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use LnL-AI/dbrx-base-converted-v2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LnL-AI/dbrx-base-converted-v2", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LnL-AI/dbrx-base-converted-v2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("LnL-AI/dbrx-base-converted-v2", trust_remote_code=True)

Local Apps

vLLM

How to use LnL-AI/dbrx-base-converted-v2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LnL-AI/dbrx-base-converted-v2"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LnL-AI/dbrx-base-converted-v2",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/LnL-AI/dbrx-base-converted-v2

SGLang

How to use LnL-AI/dbrx-base-converted-v2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "LnL-AI/dbrx-base-converted-v2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LnL-AI/dbrx-base-converted-v2",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "LnL-AI/dbrx-base-converted-v2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LnL-AI/dbrx-base-converted-v2",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use LnL-AI/dbrx-base-converted-v2 with Docker Model Runner:
```
docker model run hf.co/LnL-AI/dbrx-base-converted-v2
```

Ready for Testing...

by Qubitium - opened Mar 30, 2024

Discussion

Qubitium

LnL AI org Mar 30, 2024

edited Mar 30, 2024

@fahadh4ilyas @winglian

Both the converted and converted-v2 versions can train in bfloat16 with ~760GB of VRAM. The original base cannot train, as VRAM usage explodes.

fahadh4ilyas

Mar 30, 2024

What happens if flash attention is enabled?

Qubitium

LnL AI org Mar 30, 2024

@fahadh4ilyas

Flash Attention 2 barfs when applying padding

Qubitium

LnL AI org Mar 30, 2024

@fahadh4ilyas the padding issue may be caused by my custom training dataset code. I am going to double check.

Qubitium

LnL AI org Mar 30, 2024

@fahadh4ilyas Confirmed: there is nothing wrong with fa2. It was my custom training code that was breaking it. Removed the note about fa2 compat.

fahadh4ilyas

Mar 30, 2024

@Qubitium what hardware are you using for training? Is 8×A100 enough?

Qubitium

LnL AI org Mar 30, 2024

edited Mar 30, 2024

@fahadh4ilyas I am currently testing training on it with:

fa2 enabled
trl/sft_trainer
batch 1
max seq len 2048
adam 8bit optimizer
bfloat16

I am using 767.44GB of VRAM right now. @winglian's bf16 test shows he is using only 8x80GB (640GB), so I am not sure what magic he is doing to use so much less VRAM. Though I am using trl and he is using axolotl.

I have a slight hunch that databricks purposely made the model just big enough to be out of normal 8xA100/H100 range.
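To make the "just out of range" point concrete, here is a rough back-of-the-envelope estimate. It is only a sketch under stated assumptions: the ~132B total parameter count is taken from the public DBRX announcement rather than from this thread, activations and framework overhead are ignored, and the 8-bit optimizer is assumed to keep two 1-byte moments per parameter (its quantization constants are not counted). The function name `training_state_gb` is made up for illustration.

```python
# Rough memory estimate for full fine-tuning: weights + gradients + optimizer state.
# Assumption: ~132e9 total parameters (public DBRX figure, not from this thread).
# Activations, buffers, and fragmentation are deliberately ignored.

GB = 1e9  # decimal gigabytes


def training_state_gb(num_params, weight_bytes=2, grad_bytes=2, optim_bytes=2):
    """Return GB needed for weights, gradients, and optimizer state.

    Defaults model bf16 weights (2 bytes), bf16 gradients (2 bytes), and an
    8-bit Adam-style optimizer with two 1-byte moments per parameter (2 bytes).
    """
    bytes_per_param = weight_bytes + grad_bytes + optim_bytes
    return num_params * bytes_per_param / GB


DBRX_PARAMS = 132e9   # assumed parameter count
node_capacity = 8 * 80  # one 8x80GB node, in GB

bf16_adam8bit = training_state_gb(DBRX_PARAMS)
bf16_adam32bit = training_state_gb(DBRX_PARAMS, optim_bytes=8)  # two fp32 moments

print(f"bf16 + 8-bit Adam : {bf16_adam8bit:.0f} GB")
print(f"bf16 + 32-bit Adam: {bf16_adam32bit:.0f} GB")
print(f"8x80GB node       : {node_capacity} GB")
print(f"fits on one node? : {bf16_adam8bit <= node_capacity}")
```

Under these assumptions the static training state alone lands in the same ballpark as the usage reported above and already exceeds a single 8x80GB node, before any activations are counted.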

GaLore may be an option to get it back into the normal 8xA100 range, but our early tests show GaLore is a lot slower, so there may be a trade-off, or we are not testing GaLore correctly. I haven't personally validated the GaLore tests.

fahadh4ilyas

Mar 30, 2024

@Qubitium what kind of parallelization engine do you use across multiple GPUs? DeepSpeed, or something else? And did you do a full finetune or LoRA?

Qubitium

LnL AI org Mar 30, 2024

@fahadh4ilyas No parallelization at all at the moment. Just a dumb accelerate/trl integration where the model layers are spread across multiple GPUs but only 1 GPU is participating in training at any given moment, so it is extremely inefficient. This is our first attempt to train on something that requires more than 1 or 2 GPUs for full-finetuning, so we have not tested out DeepSpeed yet (it should help).

We only do full-finetuning and not lora/qlora at the moment.
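As a toy illustration of why spreading layers across GPUs without any pipelining is so inefficient, the sketch below simulates one sequential forward and backward pass. The layer count (40), the equal cost per layer, and the helper names `assign_layers` and `gpu_utilization` are all hypothetical choices for the example, not measurements from this setup.

```python
# Toy simulation of naive layer sharding: layers are split into contiguous
# chunks across GPUs, and a single batch walks through them one layer at a
# time, so only the GPU that owns the current layer is doing work.
# Assumptions: hypothetical 40 equal-cost layers, 8 GPUs, no overlap at all.


def assign_layers(num_layers, num_gpus):
    """Return a list mapping each layer index to the GPU that holds it."""
    base, extra = divmod(num_layers, num_gpus)
    owners = []
    for gpu in range(num_gpus):
        chunk = base + (1 if gpu < extra else 0)
        owners.extend([gpu] * chunk)
    return owners


def gpu_utilization(num_layers, num_gpus):
    """Fraction of time steps each GPU is busy in one forward + backward pass."""
    owners = assign_layers(num_layers, num_gpus)
    schedule = owners + owners[::-1]  # forward pass, then backward in reverse
    busy = [0] * num_gpus
    for gpu in schedule:
        busy[gpu] += 1
    return [count / len(schedule) for count in busy]


util = gpu_utilization(num_layers=40, num_gpus=8)
for gpu, fraction in enumerate(util):
    print(f"GPU {gpu}: busy {fraction:.1%} of the time")
```

In this simplified model every GPU sits idle for most of each step, which is the gap that a real parallelization engine is meant to close.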

Qubitium

LnL AI org Apr 3, 2024

edited Apr 3, 2024

@fahadh4ilyas With the optimizer set to paged_adam_8bit, memory usage went down to ~670GB on our setup. However, we still reverted back to adam_8bit, because the paged_adam_8bit memory pattern was triggering a CUDA/Nvidia issue where a UVM process (started by nvidia), which controls GPU/CPU memory sharing, gets started. This caused training speed to slow down 3x. UVM appears to be unified memory sharing for GPU/CPU that is designed to reduce OOM. Not sure how to disable this on Linux, and paged_adam_8bit triggers it 100% of the time in our setup.
