Model on your API Playground
Hey Team, I tried this model on your API playground, but I found it hard to get working, especially with the context length, both when inputting and when outputting.
Is there a guide for that?
@1littlecoder hmm, if you have any feedback please let us know and we will keep improving! What are the challenges you are facing? Thanks!
I have the same issue. With 12,000 tokens as input, the answer in the playground loads for a few seconds and then just stops without an error. When using the API, I get a timeout error after a while.
Hi @Sc0urge, thanks for your feedback! Does the timeout issue persist? And what number of tokens works for you without any timeout errors?
Even when I give just "Hello" as the prompt, it crashes, this time with "An unknown error has occurred with inference" (in the playground). Normal LLaMA works, though.
I couldn't reproduce this problem. Can you share more details and the generation parameters you are using? Thanks!
For the long text I set the max output to 32k; for the "Hello" prompt I left everything on default. Sometimes it throws an error, and sometimes it just shows the three dots, which disappear after a second.
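A possible contributing factor, offered only as a sketch: if the prompt and the generated tokens share a single context window, a 12,000-token prompt leaves less than 32k tokens for output. The window size of 32,768 below is an assumption read off the "32K" in the model name, and `max_new_tokens_budget` is a hypothetical helper, not part of any API mentioned in this thread. Whether the playground enforces such a limit is not confirmed here.

```python
# Assumed context window, inferred from the "32K" in the model name.
CONTEXT_WINDOW = 32_768


def max_new_tokens_budget(prompt_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Return how many tokens remain for generation after the prompt.

    Hypothetical helper: assumes prompt and output share one context window.
    """
    return max(context_window - prompt_tokens, 0)


# With a 12,000-token prompt, the remaining budget is well below 32k.
print(max_new_tokens_budget(12_000))  # 20768
```

Under that assumption, asking for a 32k maximum output on top of a 12,000-token prompt would exceed the window, so a smaller maximum output may be worth trying.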
I see. For the hello example, I think the issue might be that the default top_p=0.7 is too high (this is the threshold below which all less likely tokens are filtered out). What likely happens is that after "hello" the distribution for the next token is very flat and all tokens have probability < 0.7 (intuitively, many tokens can follow "hello" and make sense). I would suggest lowering this threshold if your prompt is very short.
The other error most likely doesn't have anything to do with the hello prompt (I have not been able to reproduce it lately with that prompt). Are you still observing this error?
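On the top_p point, it may help to see how the setting is usually defined. In nucleus (top-p) sampling as commonly described, top_p is a cutoff on cumulative probability: tokens are sorted from most to least likely and kept until their probabilities sum to at least top_p, so at least one token always survives. The playground's actual implementation is not described in this thread, so the sketch below illustrates only the usual definition; `top_p_filter` and the toy distributions are made up for illustration.

```python
def top_p_filter(probs, top_p):
    """Return indices of tokens kept by the usual nucleus (top-p) rule.

    Tokens are taken in order of decreasing probability until their
    cumulative probability reaches top_p.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


flat = [0.25, 0.25, 0.25, 0.25]    # toy "many plausible next tokens" case
peaked = [0.9, 0.05, 0.03, 0.02]   # toy "one obvious next token" case

print(len(top_p_filter(flat, 0.7)))    # 3
print(len(top_p_filter(peaked, 0.7)))  # 1
```

Under this definition a flat distribution keeps more candidates, not fewer, and lowering top_p narrows the candidate set.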