Issue/Bug replicating HumanEval result
Hi all,
I'm looking to replicate the HumanEval result for this model so that I can then go on to test it on interesting orthogonal benchmarks.
Unfortunately, I find that the model frequently goes off the rails, and its score is likely far from Phind's quoted performance when I attempt to replicate it. Does anyone see an obvious bug here - https://github.com/emrgnt-cmplxty/zero-shot-replication/blob/main/zero_shot_replication/model/hugging_face_model/phind_model.py?
For reference, I am seeing output like the following:
def is_multiply_prime(a):
    """Write a function that returns true if the given number is the multiplication of 3 prime numbers
    and false otherwise.
    Knowing that (a) is less then 100.
    Example:
    is_multiply_prime(30) == True
    30 = 2 * 3 * 5
    """
def is_prime(n))):
if n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n
This model uses a RoPE theta of 1000000. Is there any way to set that in the script?
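To make the theta question concrete, here is a minimal sketch of how the rotary embedding base changes the per-dimension frequencies. It uses the standard RoPE formula; the head dimension of 128 is an assumption about this model, and the function names are made up for illustration. As far as I know, the Llama config in recent transformers releases exposes this value as `rope_theta`, so a library version that ignores it would silently fall back to 10000.

```python
import math


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, theta: float) -> float:
    """Wavelength (in token positions) of the slowest-rotating dimension pair."""
    return 2 * math.pi / rope_inv_freq(head_dim, theta)[-1]


HEAD_DIM = 128  # assumed per-head dimension, not taken from this thread

for theta in (10_000.0, 1_000_000.0):
    print(f"theta={theta:>11,.0f}  longest wavelength ~ "
          f"{longest_wavelength(HEAD_DIM, theta):,.0f} positions")
```

If the weights were trained with one base and inference runs with the other, the position signal the model sees no longer matches what it learned, which would be consistent with output that degrades into repetition.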
Thanks for reporting, we'll investigate
The eval code in the model card just worked for me. Could you please let me know if that works for you?
I will test that explicitly tomorrow. I don't think there are any significant differences from what I am doing, but this can help pinpoint the issue.
Same here: every output ends with the same repeated words. It seems no end token is being generated.
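While the missing end token is being investigated, one possible stopgap is to cut each completion at the first stop sequence before scoring. This is only a sketch: the stop list below is an assumption modeled on common HumanEval-style post-processing, and `truncate_completion` is a hypothetical helper, not something from the model card or the linked repository. It will not rescue output that is already garbage from the first line.

```python
# Assumed stop sequences: text that usually marks the start of unrelated code
# after the function body has been completed.
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]


def truncate_completion(completion: str, stops=STOP_SEQUENCES) -> str:
    """Return the completion cut at the earliest stop sequence, if any."""
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


raw = "    return a + b\n\ndef unrelated():\n    pass"
print(repr(truncate_completion(raw)))
```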
There is some commentary in this Reddit thread: https://www.reddit.com/r/LocalLLaMA/comments/164754t/wizardcoder_eval_results_vs_chatgpt_and_claude_on/
It does seem that the issue is related to the transformers version.
Can confirm: running off a commit from the transformers main branch worked.
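For anyone wanting to check their environment before re-running the eval, a small version guard along these lines may help. The 4.33.0 threshold is my assumption about the first release that contains the fix from main; nobody in this thread named a version, so adjust it if that turns out to be wrong. The parsing is deliberately simple and ignores pre-release suffixes.

```python
ASSUMED_MIN_VERSION = "4.33.0"  # assumption, not confirmed in this thread


def parse_version(version: str) -> tuple:
    """Turn '4.33.0.dev0' into (4, 33, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_new_enough(installed: str, minimum: str = ASSUMED_MIN_VERSION) -> bool:
    """True if the installed version is at least the assumed minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    try:
        import transformers
        print(transformers.__version__, is_new_enough(transformers.__version__))
    except ImportError:
        print("transformers is not installed")
```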
I tried this code on a single GPU, but I am getting bad results.
from transformers import AutoTokenizer, LlamaForCausalLM
from transformers import BitsAndBytesConfig
import torch
import os
model_path = "Phind/Phind-CodeLlama-34B-v2"
model = LlamaForCausalLM.from_pretrained(model_path, load_in_8bit=True, device_map="auto")
#model = LlamaForCausalLM.from_pretrained(model_path, quantization_config=nf4_config)
tokenizer = AutoTokenizer.from_pretrained(model_path, legacy=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
text = "Write a code in python for Inferecing large language models using Transformers library. Give step by step approach."
inputs = tokenizer(text, return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_length=200, temperature=0.9, repetition_penalty=1.5, do_sample=True)
print(tokenizer.decode(out[0][len(inputs['input_ids'][0]):]))
This is the output I am getting.
In order to inferencing with transformer model, we need use the Hugging Face's pytorch-transformers Library.
Step 1: Installation of Libraries
You can install this required useful very necessary important big huge immense massive monstrous enormous vast colossal portentious prodigious sizeable sizable mammoth mind mouth multitudinously numberless numb numerous novel nones none non non nonsensical senseless insignificant inconsequentialist unimportant small sm
python
# Importing Necessary nec es ess ent en env e environments needed environment environments environments
import torch
from transformers import AutoModelForMaskedLM,AutoTokenizerFastBert BertConfigP
class Class Config Model Token BERT For
config = class Auto
Can someone suggest a fix?
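Two things may be worth checking here, offered as guesses rather than a diagnosis. First, the script passes the raw question to the model, while my understanding is that the model card describes an instruction format with "System Prompt", "User Message" and "Assistant" sections. Second, a `repetition_penalty` of 1.5 is quite high and could plausibly push the model into chains of near-synonyms like the ones above, so trying a value close to 1.0 might help. A minimal sketch of the prompt wrapper, assuming that format (the helper name and the default system prompt are illustrative):

```python
def build_phind_prompt(
    user_message: str,
    system_prompt: str = "You are an intelligent programming assistant.",
) -> str:
    """Wrap a question in the assumed System Prompt / User Message / Assistant layout."""
    return (
        f"### System Prompt\n{system_prompt}\n\n"
        f"### User Message\n{user_message}\n\n"
        "### Assistant\n"
    )


text = build_phind_prompt("Write Python code that runs inference with the Transformers library.")
print(text)
```

The resulting `text` would replace the bare question passed to the tokenizer in the script above.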