Instructions to use togethercomputer/LLaMA-2-7B-32K with libraries, inference providers, notebooks, and local apps.

Libraries

How to use togethercomputer/LLaMA-2-7B-32K with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="togethercomputer/LLaMA-2-7B-32K")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K")

Local Apps

vLLM

How to use togethercomputer/LLaMA-2-7B-32K with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "togethercomputer/LLaMA-2-7B-32K"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/LLaMA-2-7B-32K",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
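
The same request can also be sent from Python. Below is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`. The helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL = "togethercomputer/LLaMA-2-7B-32K"


def build_completion_request(prompt, max_tokens=512, temperature=0.5,
                             base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    # Long prompts can take a while, so the timeout is generous (an assumption).
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` would return the generated continuation as a string.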

Use Docker

docker model run hf.co/togethercomputer/LLaMA-2-7B-32K

SGLang

How to use togethercomputer/LLaMA-2-7B-32K with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "togethercomputer/LLaMA-2-7B-32K" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/LLaMA-2-7B-32K",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "togethercomputer/LLaMA-2-7B-32K" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/LLaMA-2-7B-32K",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use togethercomputer/LLaMA-2-7B-32K with Docker Model Runner:
```
docker model run hf.co/togethercomputer/LLaMA-2-7B-32K
```

Model on your API Playground

by 1littlecoder - opened Jul 29, 2023

Discussion

1littlecoder

Jul 29, 2023

Hey Team, I tried to play with this model on your API playground, but I found it hard to get working, especially with the context length, both for input and for output.

Any guide on that?

zhangce

Together org Jul 29, 2023

@1littlecoder hmm, if you have any feedback please let us know and we will keep improving! What are the challenges you are facing? Thanks!

Sc0urge

Aug 25, 2023

I have the same issues. With 12000 tokens as input, the answer in the playground would load for a few seconds and then just stop without an error. When using the API, I would get a timeout error after a while.

mauriceweber

Aug 31, 2023

Hi @Sc0urge, thanks for your feedback! Does the timeout issue persist? And what is the number of tokens that works without any timeout errors for you?

Sc0urge

Sep 2, 2023

> Hi @Sc0urge, thanks for your feedback! Does the timeout issue persist? And what is the number of tokens that works without any timeout errors for you?

Even when just giving "Hello" as the prompt it crashes, though this time with "An unknown error has occurred with inference" (in the playground). Normal LLaMA works, though.

mauriceweber

Sep 4, 2023

I couldn't observe this problem. Can you let me know more details and the generation parameters you are using? Thanks!

Sc0urge

Sep 4, 2023

> I couldn't observe this problem. Can you let me know more details and the generation parameters you are using? Thanks!

For the long text I set the max output to 32k; for just the hello I left everything on default. Sometimes it throws an error, sometimes it just shows the 3 dots, which disappear after a second.
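
One thing worth checking for the long-text case is the token budget. In a decoder-only model the prompt and the generated tokens share one context window, so a 12000-token prompt combined with a 32k output limit asks for more than a 32K window can hold. Whether the playground clamps or rejects such a request is not stated in this thread. Here is a minimal sketch of the arithmetic, assuming a window of 32768 tokens (inferred from the "32K" in the model name, not confirmed here) and using a hypothetical helper name:

```python
CONTEXT_WINDOW = 32 * 1024  # assumed 32768 tokens; check the model config


def max_new_tokens_for(prompt_tokens, requested, context_window=CONTEXT_WINDOW):
    """Clamp the requested output length so prompt + output fit in the window."""
    if prompt_tokens >= context_window:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested, context_window - prompt_tokens)


print(max_new_tokens_for(12000, 32768))  # 20768
print(max_new_tokens_for(5, 512))        # 512
```

Under these assumptions, a 12000-token prompt leaves room for at most 20768 generated tokens.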

mauriceweber

Sep 6, 2023 • edited Sep 6, 2023

I see. I think for the hello example the issue might be that the default top_p=0.7 is too high (this is the threshold below which all less likely tokens are filtered out). So what likely happens is that after hello, the distribution for the next token is very flat and all tokens have probability < 0.7 (intuitively, many tokens can follow hello and make sense). I would suggest lowering this threshold if your prompt is very short.

The other error most likely doesn't have anything to do with the hello prompt (I could not reproduce the error lately with the hello prompt). Are you still observing this error?
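
For reference, here is a minimal sketch of top_p (nucleus) filtering in its commonly used form: tokens are sorted by probability, and the smallest set whose cumulative probability reaches top_p is kept. This illustrates the general technique only. The thread does not show the playground's implementation, which may differ. The function name `top_p_filter` and the toy distributions are illustrative.

```python
def top_p_filter(probs, top_p):
    """Return the indices kept by nucleus (top-p) filtering.

    Tokens are visited from most to least likely and kept until their
    cumulative probability reaches top_p.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


peaked = [0.85, 0.05, 0.04, 0.03, 0.03]  # one clearly dominant next token
flat = [0.125] * 8                        # many equally plausible next tokens

print(len(top_p_filter(peaked, 0.7)))  # 1
print(len(top_p_filter(flat, 0.7)))    # 6
```

In this form, a flatter distribution keeps more candidate tokens for the same top_p, and lowering top_p narrows the candidate set.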
