Issue/Bug replicating HumanEval result

by emrgnt-cmplxty - opened Aug 29, 2023

Discussion

emrgnt-cmplxty

Aug 29, 2023

Hi all,

I'm looking to replicate the HumanEval result for this model so that I can then go on to testing on interesting orthogonal benchmarks.

Unfortunately, I find that the model frequently goes off the rails and is likely far from Phind's quoted performance when I attempt to replicate it. Does anyone see an obvious bug here: https://github.com/emrgnt-cmplxty/zero-shot-replication/blob/main/zero_shot_replication/model/hugging_face_model/phind_model.py?

For reference, I am seeing output like the following:


def is_multiply_prime(a):
    """Write a function that returns true if the given number is the multiplication of 3 prime numbers
    and false otherwise.
    Knowing that (a) is less then 100. 
    Example:
    is_multiply_prime(30) == True
    30 = 2 * 3 * 5
    """

    def is_prime(n))):
        if n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n

acrastt

Aug 29, 2023

This model has a RoPE theta of 1000000. Is there any way to implement that in the script?
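To illustrate why the theta value could matter, here is a minimal sketch, assuming "theta" refers to the base of the rotary position embedding (the `rope_theta` entry in a Llama-style config). The head size of 128 and the fallback base of 10000 are assumptions for illustration only, not values taken from this thread. If a script or library version silently used the smaller base, the position encoding would presumably no longer match what the model was trained with.

```python
def rope_inv_freq(head_dim, theta):
    """Inverse frequencies used by rotary position embeddings for one attention head."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 128  # assumption: a typical Llama head size, used only for illustration

default_base = rope_inv_freq(HEAD_DIM, 10_000.0)        # assumed fallback base
code_llama_base = rope_inv_freq(HEAD_DIM, 1_000_000.0)  # theta mentioned above

# The fastest-rotating pair is the same for both bases...
print(default_base[0], code_llama_base[0])
# ...but the slowest-rotating pair differs by a large factor.
print(default_base[-1] / code_llama_base[-1])
```

The two bases agree on the first frequency and diverge increasingly towards the last one, so a mismatch would distort how the model perceives token positions rather than fail loudly.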

michaelroyzen

Phind org Aug 29, 2023

Thanks for reporting, we'll investigate.

michaelroyzen

Phind org Aug 29, 2023

The eval code in the model card just worked for me. Could you please let me know if that works for you?

emrgnt-cmplxty

Aug 29, 2023

I will test explicitly tomorrow. I don't think there are any significant diffs w.r.t. what I am doing, but this can help pinpoint the issue.

waytohou

Aug 29, 2023

"The eval code in the model card just worked for me. Could you please let me know if that works for you?"

Same here: every output ends with the same repeated words. It seems there is no end_token here.

emrgnt-cmplxty

Aug 29, 2023

There is some commentary in the Reddit thread here: https://www.reddit.com/r/LocalLLaMA/comments/164754t/wizardcoder_eval_results_vs_chatgpt_and_claude_on/

It does seem that the issue is related to transformers version.

Ilianos

Aug 29, 2023

https://huggingface.co/WizardLM/WizardCoder-Python-34B-V1.0/discussions/13

emrgnt-cmplxty

Aug 29, 2023

Can confirm, running off a transformers main branch commit worked.
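For anyone landing here later, a quick way to check whether the installed transformers might be too old is sketched below. The minimum version of 4.33.0 is my assumption about when Code Llama support reached a release; it is not stated in this thread, so adjust it if that turns out to be wrong. The helper names are made up for this example.

```python
import re
from importlib import metadata

# Assumption: Code Llama support first shipped in transformers 4.33.0.
MIN_VERSION = "4.33.0"


def version_tuple(version):
    """Turn '4.33.0.dev0' into (4, 33, 0), keeping only the leading numeric parts."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:  # pad so '4.33' compares like '4.33.0'
        parts.append(0)
    return tuple(parts)


def is_new_enough(installed, minimum=MIN_VERSION):
    """Compare two version strings by their numeric components."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers():
    """Report whether the installed transformers meets the assumed minimum."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if is_new_enough(installed):
        return f"transformers {installed} looks recent enough"
    return f"transformers {installed} may predate Code Llama support; consider upgrading"


if __name__ == "__main__":
    print(check_transformers())
```

The comparison ignores pre-release suffixes such as `.dev0`, which is good enough for a sanity check but not a full version parser.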

Satya4093

Sep 4, 2023

I tried this code on a single GPU, but I am getting bad results.

from transformers import AutoTokenizer, LlamaForCausalLM
from transformers import BitsAndBytesConfig
import torch
import os

model_path = "Phind/Phind-CodeLlama-34B-v2"
model = LlamaForCausalLM.from_pretrained(model_path, load_in_8bit=True, device_map="auto")
#model = LlamaForCausalLM.from_pretrained(model_path, quantization_config=nf4_config)

tokenizer = AutoTokenizer.from_pretrained(model_path, legacy=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

text = "Write a code in python for Inferecing large language models using Transformers library. Give step by step approach."

inputs = tokenizer(text, return_tensors="pt").to("cuda:0")

out = model.generate(**inputs, max_length=200, temperature=0.9, repetition_penalty=1.5, do_sample=True)
print(tokenizer.decode(out[0][len(inputs['input_ids'][0]):]))

This is the output I am getting:

In order to inferencing with transformer model, we need use the Hugging Face's pytorch-transformers Library.
Step 1: Installation of Libraries
You can install this required useful very necessary important big huge immense massive monstrous enormous vast colossal portentious prodigious sizeable sizable mammoth mind mouth multitudinously numberless numb numerous novel nones none non non nonsensical senseless insignificant inconsequentialist unimportant small sm
python
# Importing Necessary nec es ess ent en env e environments  needed environment environments environments
import torch
from transformers import AutoModelForMaskedLM,AutoTokenizerFastBert BertConfigP
class Class Config Model Token BERT For
config = class Auto

Can someone suggest a fix?
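One possible contributor, offered as a guess rather than a confirmed diagnosis: `repetition_penalty=1.5` is quite strong, and code naturally repeats tokens. The toy sketch below shows how such a penalty rescales scores for tokens that have already appeared, using the rule as I understand it (positive scores divided by the penalty, negative ones multiplied). The function name and the two-token vocabulary are made up for illustration.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of logits with already-seen tokens made less likely.

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so a seen token always moves towards "less likely".
    """
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


# Toy vocabulary: token 0 is a word the model wants to reuse (think "import"),
# token 1 is a weaker alternative that has not appeared yet.
logits = [6.0, 4.5]
penalised = apply_repetition_penalty(logits, seen_token_ids=[0], penalty=1.5)
print(penalised)  # [4.0, 4.5]
```

In this toy case the penalty flips the ranking so that the unseen alternative wins, which would be consistent with the chains of near-synonyms in the output above. Trying a penalty close to 1.0 and a lower temperature would be a cheap experiment.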
