How to use eryk-mazus/polka-1.1b with libraries and local apps.

Libraries

How to use eryk-mazus/polka-1.1b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="eryk-mazus/polka-1.1b")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("eryk-mazus/polka-1.1b")
model = AutoModelForCausalLM.from_pretrained("eryk-mazus/polka-1.1b")

Local Apps

vLLM

How to use eryk-mazus/polka-1.1b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "eryk-mazus/polka-1.1b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "eryk-mazus/polka-1.1b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
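The same request can also be sent from Python. The sketch below uses only the standard library and assumes the vLLM server from the previous step is listening on localhost:8000; `build_request` and `complete` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

# Assumption: a server started with `vllm serve` is reachable at this address.
API_URL = "http://localhost:8000/v1/completions"


def build_request(prompt, max_tokens=512, temperature=0.5, url=API_URL):
    """Build the same POST request that the curl command above sends."""
    payload = {
        "model": "eryk-mazus/polka-1.1b",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as response:
        body = json.load(response)
    # The OpenAI-compatible completions format returns the text in choices[i].text.
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` should return the generated continuation as a string.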

Use Docker

docker model run hf.co/eryk-mazus/polka-1.1b

SGLang

How to use eryk-mazus/polka-1.1b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "eryk-mazus/polka-1.1b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "eryk-mazus/polka-1.1b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "eryk-mazus/polka-1.1b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "eryk-mazus/polka-1.1b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'


polka-1.1b

polka-1.1b builds on the TinyLlama-1.1B model by continuing pretraining on an additional 5.7 billion Polish tokens, primarily sourced from the MADLAD-400 dataset. The tokens were sampled in a 10:1 ratio between Polish and English shards using DSIR. In addition, Polka extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, making it more efficient at generating Polish text.

Training took 680 GPU hours on a single machine with 8 x RTX 4090 GPUs, using DeepSpeed ZeRO-2.
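As a rough sanity check on the training cost, the GPU-hour figure can be converted to wall-clock time. The sketch below assumes all eight GPUs were busy for the whole run, which the text does not state explicitly. The throughput line additionally assumes the 5.7 billion tokens were processed exactly once, so it is only an order-of-magnitude estimate.

```python
# Figures quoted in the text above.
GPU_HOURS = 680
NUM_GPUS = 8
TRAINING_TOKENS = 5.7e9

# Assumption: all eight GPUs were busy for the entire run.
wall_clock_hours = GPU_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

# Assumption: a single pass over the 5.7B tokens (not confirmed by the source).
tokens_per_gpu_second = TRAINING_TOKENS / (GPU_HOURS * 3600)

print(f"{wall_clock_hours:.0f} h wall-clock (about {wall_clock_days:.1f} days)")
print(f"roughly {tokens_per_gpu_second:.0f} tokens per GPU-second")
```

Under these assumptions the run corresponds to 85 hours of wall-clock time, or about three and a half days.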

Context size: 2,048 tokens.

Notes

This base model was initially developed as the foundation for instruction tuning, which resulted in polka-1.1b-chat. Nonetheless, I'm sharing it with the community because I see potential value in its combination of relatively good performance and an efficient bilingual tokenizer.

The model can produce coherent Polish text, but given its small size it is prone to hallucination.

Evaluation

The evaluation was performed by OPI-PG, the authors of the Qra models.

PolEval-2018

| Model | Perplexity |
|---|---|
| English models | |
| meta-llama/Llama-2-7b-hf | 24.3 |
| meta-llama/Llama-2-13b-hf | 21.4 |
| mistralai/Mistral-7B-v0.1 | 21.4 |
| TinyLlama/TinyLlama-1.1B | 40.4 |
| Polish models | |
| sdadas/polish-gpt2-small | 134.4 |
| sdadas/polish-gpt2-medium | 100.8 |
| sdadas/polish-gpt2-large | 93.2 |
| sdadas/polish-gpt2-xl | 94.1 |
| Azurro/APT3-275M-Base | 129.8 |
| Azurro/APT3-500M-Base | 153.1 |
| Azurro/APT3-1B-Base | 106.8 |
| eryk-mazus/polka-1.1b | 18.1 |
| szymonrucinski/Curie-7B-v1 | 13.5 |
| OPI-PG/Qra-1b | 14.7 |

Long documents (2024)

Modern LLMs support contexts of thousands of tokens, and their practical applications usually involve processing long documents. Evaluating perplexity on a sentence-based dataset such as PolEval-2018 may therefore not be meaningful. In addition, the PolEval corpus has been publicly available on the internet for several years, so the training sets of some models may have been contaminated by it.

For this reason, we prepared a new collection consisting of long documents published exclusively in 2024, which allows us to test the perplexity of the models more reliably on new knowledge that was not available to them at training time. The corpus consists of 5,000 documents ranging from several hundred to about 20,000 tokens. Half of the set consists of press texts from Polish news portals from February 2024; the other half are scientific articles published since January 2024.

Most of the documents exceed the context size of the evaluated models. To calculate perplexity for these documents, we divided them into chunks of size equal to the model's context length with a stride of 512 tokens, following this example.
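The chunking scheme described above can be sketched as follows. The referenced example is not reproduced here, so this sketch assumes a sliding-window evaluation of the kind shown in the Hugging Face Transformers perplexity guide: each window spans up to one context length, windows start every `stride` tokens, and only the tokens not already scored by an earlier window count toward the result. `sliding_windows` and `perplexity` are hypothetical helper names; the per-token negative log-likelihoods would come from the evaluated model and are not computed here.

```python
import math


def sliding_windows(seq_len, max_length, stride):
    """Yield (begin, end, target_len) for each evaluation window.

    A window holds at most `max_length` tokens. Only its last `target_len`
    tokens are scored; the tokens before them serve purely as context.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break


def perplexity(token_nlls):
    """Exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Example: a 5,000-token document, a 2,048-token context, a stride of 512.
windows = list(sliding_windows(seq_len=5000, max_length=2048, stride=512))
for begin, end, target_len in windows:
    print(begin, end, target_len)
```

Every token is scored exactly once, and every window after the first gives the tokens it scores up to 1,536 tokens of preceding context. A smaller stride gives each scored token more context at the cost of more forward passes.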

| Model | Context | Perplexity |
|---|---|---|
| English models | | |
| meta-llama/Llama-2-7b-hf | 4096 | 5.9 |
| meta-llama/Llama-2-13b-hf | 4096 | 5.3 |
| mistralai/Mistral-7B-v0.1 | 4096 | 4.9 |
| TinyLlama/TinyLlama-1.1B | 2048 | 9.6 |
| Polish models | | |
| sdadas/polish-gpt2-small | 2048 | 27.3 |
| sdadas/polish-gpt2-medium | 2048 | 20.3 |
| sdadas/polish-gpt2-large | 1536 | 18.0 |
| sdadas/polish-gpt2-xl | 1536 | 16.6 |
| Azurro/APT3-275M-Base | 2048 | 77.0 |
| Azurro/APT3-500M-Base | 2048 | 50.5 |
| Azurro/APT3-1B-Base | 2048 | 19.1 |
| eryk-mazus/polka-1.1b | 2048 | 6.9 |
| szymonrucinski/Curie-7B-v1 | 4096 | 4.8 |
| OPI-PG/Qra-1b | 4096 | 6.1 |

Sample code

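import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "eryk-mazus/polka-1.1b"

# Pad on the left so generation continues directly from the end of the prompt.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
# Reuse the EOS token as the padding token.
tokenizer.pad_token = tokenizer.eos_token

# Load the weights in 8-bit (requires the bitsandbytes package and a CUDA GPU).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

# Polish for "Example query to the model".
prompt = """Przykładowe zapytanie do modelu"""

# Move the inputs to the device the model was placed on.
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        do_sample=True,
        penalty_alpha=0.6,
        top_k=5,
    )

# Decode the first (and only) sequence in the batch.
output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)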