Inference time is much longer than reported

#25

by jeff-gao - opened Sep 20, 2023

Sep 20, 2023

The paper says the inference speed is < 3 ms per token on a single A100-80G.
However, when I run the sample code on a single A100-80G, the inference speed is around 28 ms per token. My torch version is 2.0.1.
May I know how to bring the inference speed down to around 3 ms per token?
Thank you very much!
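
For anyone trying to reproduce the per-token figure, here is a minimal timing sketch. It is not the method used in the paper, and `ms_per_token` and `fake_generate` are hypothetical names introduced only for illustration. It assumes you wrap your own generation call in a zero-argument function that produces a known number of new tokens.

```python
import time


def ms_per_token(generate, n_new_tokens, warmup=1, repeats=3):
    """Average wall-clock milliseconds per generated token.

    `generate` is a hypothetical zero-argument callable that produces
    exactly `n_new_tokens` new tokens each time it is called.
    """
    # Warm-up calls are not timed, so one-time setup cost is excluded.
    for _ in range(warmup):
        generate()

    start = time.perf_counter()
    for _ in range(repeats):
        generate()
    elapsed_s = time.perf_counter() - start

    return elapsed_s * 1000.0 / (repeats * n_new_tokens)


# Stand-in for a real generation call (assumption: replace with your own).
def fake_generate():
    time.sleep(0.02)


print(f"{ms_per_token(fake_generate, n_new_tokens=10):.2f} ms/token")
```

On a GPU, CUDA kernels run asynchronously, so the wrapped function should call `torch.cuda.synchronize()` before it returns; otherwise the timer may stop before the work has finished.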

jeff-gao changed discussion title from Inference speed is much longer than reported to Inference time is much longer than reported Sep 22, 2023

hugosousa

Sep 22, 2023

Did you use DeepSpeed to run inference? I am not sure, but it seems they used it, since it is mentioned on the model card.

gugarosa

Microsoft org Sep 26, 2023

Hello @jeff-gao !

This mismatch is caused by the absence of Flash-Attention in the model files. We opted not to add it at first to keep the implementation simple, but we plan to add an option that uses it to enable faster inference.

jeff-gao

Sep 27, 2023

Hello @gugarosa, thank you very much! Looking forward to your implementation!

KrishnaKaasyap

Oct 15, 2023

Hey @jeff-gao - since the model takes only 3.16 GB of VRAM at fp16, can we run approximately 24 copies of Phi 1.5 on an A100-80GB GPU?

If that is possible and 3 ms per token is also achievable with Flash-Attention - can we generate 7200 tokens per second (24 copies × 300 tokens per second) on an A100-80GB GPU?

I'm a non-technical guy. Just asking out of curiosity. Thanks. 🙏🏼
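
The arithmetic in the question above can be written out as a short sketch. The helper names are hypothetical, and the result is only an optimistic upper bound. It assumes throughput scales linearly with the number of copies, which is unlikely in practice because all copies share the same GPU compute and memory bandwidth. It also ignores the memory needed for activations and the KV cache.

```python
def max_copies(gpu_mem_gb, model_mem_gb):
    """How many copies of the weights fit in GPU memory (weights only)."""
    return int(gpu_mem_gb // model_mem_gb)


def tokens_per_second(ms_per_token):
    """Convert per-token latency into tokens per second for one copy."""
    return 1000.0 / ms_per_token


copies = max_copies(80, 3.16)        # figures taken from the post above
per_copy = tokens_per_second(3.0)    # assumes the 3 ms/token figure is reached
upper_bound = copies * per_copy      # assumes perfect linear scaling

print(copies, round(per_copy, 1), round(upper_bound))
```

With exact division the bound comes out a little higher than the rounded 24 × 300 estimate, but the caveats above matter far more than the rounding.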

gugarosa changed discussion status to closed Nov 21, 2023
