ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run `pip install flash_attn`
I was running Phi-2 on my CPU in a Jupyter notebook. When I tried again just now, it broke :-((
I see that the model has been updated. From the little research I did, flash_attn apparently requires an Nvidia GPU. How do I run this on a CPU now? Or is that no longer an option?
P.S. I am unable to install flash_attn. I have updated torch, transformers, wheel, and other packages, and I now see the following error when trying to install it. I don't have CUDA.
raise OSError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
torch.__version__ = 2.1.2+cpu
I'm facing the same issue. I'm trying to download the model weights and build a Docker image with vLLM, and it gives the same error. It worked perfectly fine 6 hours ago, but something seems to have broken with the latest commit.
In the meantime, how do we pull weights programmatically from a previous commit ID?
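One way to pin a download to an earlier commit is the `revision` argument, which `huggingface_hub.snapshot_download` accepts as a branch name, tag, or commit hash (`from_pretrained` in transformers takes the same argument). A minimal sketch follows; the helper names are hypothetical, and the commit hash is a placeholder you would copy from the repo's commit history, not a real value:

```python
def pinned_download_kwargs(repo_id: str, revision: str) -> dict:
    """Build keyword arguments that pin a Hub download to one revision.

    `revision` may be a branch name, a tag, or a commit hash.
    """
    if not revision:
        raise ValueError("revision must be a non-empty branch, tag, or commit hash")
    return {"repo_id": repo_id, "revision": revision}


def download_pinned(repo_id: str, revision: str) -> str:
    """Download a snapshot of the repo at `revision` and return its local path."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**pinned_download_kwargs(repo_id, revision))


# Usage (placeholder hash; replace it with a commit from the repo history):
# local_dir = download_pinned("microsoft/phi-2", "<commit-hash>")
```

Pinning to a full commit hash rather than a branch keeps a Docker build reproducible even if the repo changes again.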
Hello everyone!
We deployed a fix and it should be working now.
The issue was caused by the combination of using dynamic modules and remote code loading in transformers.
Regards,
Gustavo.
@gugarosa Hey, just curious about the motivation behind renaming layer_norm_epsilon to layer_norm_eps in config.json?
I see vLLM uses layer_norm_epsilon throughout all its models, so the recent commits in this repo are breaking things in vLLM.
I think we will need to update vLLM as well.
There is no particular reason for using layer_norm_eps. It was used in the first implementation of Phi (internally in transformers), and we followed it to minimize friction when merging the integration.
By the way, there is an active PR that will fix it: https://github.com/vllm-project/vllm/pull/2428/files
Since the layer naming was changed for consistency reasons, don't you think it would be better to align with "layer_norm_epsilon" too?
On the other hand, Llama uses "rms_norm_eps"... go figure.
I definitely agree!
Maybe an attribute_map: {"layer_norm_epsilon": "layer_norm_eps"} in configuration_phi.py would fix the issue. And it would be an easier PR.
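To illustrate the aliasing idea, here is a plain-Python sketch of how an `attribute_map` can let old and new names resolve to the same value. `AliasedConfig` is a hypothetical stand-in written for this example, not the transformers `PretrainedConfig` class or the actual configuration_phi.py, and the default epsilon is only illustrative:

```python
class AliasedConfig:
    """Stand-in config showing attribute aliasing via a class-level map."""

    # Maps an alias (old name) to the attribute that actually stores the value.
    attribute_map = {"layer_norm_epsilon": "layer_norm_eps"}

    def __init__(self, layer_norm_eps: float = 1e-5):
        self.layer_norm_eps = layer_norm_eps

    def __getattr__(self, name):
        # Only called when normal lookup fails, so real attributes are unaffected.
        target = type(self).attribute_map.get(name)
        if target is not None:
            return getattr(self, target)
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Redirect writes through the alias to the canonical attribute.
        name = type(self).attribute_map.get(name, name)
        super().__setattr__(name, value)


cfg = AliasedConfig(layer_norm_eps=1e-5)
old_name_value = cfg.layer_norm_epsilon  # read through the alias
cfg.layer_norm_epsilon = 1e-6            # write through the alias
```

With this kind of mapping, code that still reads `layer_norm_epsilon` keeps working while the stored key stays `layer_norm_eps`.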