Kona2-12B-Base

Kona2-12B-Base is a 12-billion-parameter base language model optimized for Georgian language understanding and generation. It was built by continued pre-training of Mistral-Nemo-Base-2407 on approximately 30 billion tokens of Georgian and English data, with a vocabulary expanded by roughly 20K Georgian tokens.

Model Summary

Property          Value
Parameters        12B
Architecture      Mistral (Transformer)
Context Length    8K tokens
Vocabulary        Extended (~20K added Georgian tokens)
Languages         Georgian (ka), English (en), other (limited)
Training Tokens   ~30B
Training Method   Continued pre-training (embeddings + high-rank LoRA)
Base Model        mistralai/Mistral-Nemo-Base-2407

Intended Uses

Primary Use Cases

  • Base model for Georgian language fine-tuning
  • Georgian text generation and completion
  • Multilingual text understanding (KA/EN primary, others limited)
  • Foundation for instruction-tuned models
  • Translation capabilities (enhanced in fine-tuned versions)

Training

Training Data (~30B Tokens)

  • Open source corpora
  • Web content (custom crawlers/scrapers)
  • Translated texts

Vocabulary Expansion

Added ~20K Georgian tokens to improve tokenization efficiency:

  • Tokenizer fertility: 1.9 tokens/word on Georgian text
  • New embeddings initialized as mean of subtoken embeddings
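
The two bullets above can be illustrated with a small, dependency-free sketch. It assumes that fertility means the average number of tokens per whitespace-delimited word, and that each new token's embedding is the element-wise mean of the embeddings of the subtokens the original tokenizer would have produced for it. The function names, the toy tokenizer, and the toy vectors are hypothetical and are not taken from the Kona2 training code.

```python
from typing import Callable, Dict, List, Sequence


def fertility(texts: Sequence[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def mean_init(subtoken_ids: Sequence[int],
              embeddings: Dict[int, List[float]]) -> List[float]:
    """Element-wise mean of the embeddings of a new token's old subtokens."""
    vectors = [embeddings[i] for i in subtoken_ids]
    if not vectors:
        raise ValueError("a new token needs at least one subtoken")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy data (hypothetical): a tokenizer that cuts each word into
# chunks of at most two characters, and three 2-dimensional embeddings.
def toy_tokenize(text: str) -> List[str]:
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


toy_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}

print(fertility(["abcd ef"], toy_tokenize))   # 3 tokens over 2 words
print(mean_init([0, 1, 2], toy_embeddings))   # mean of the three vectors
```

With a real tokenizer, `tokenize` could be the `tokenize` method of a Hugging Face tokenizer, and the same averaging would be applied to rows of the model's embedding matrix rather than to Python lists.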

Training Procedure

  • Method: Continued pre-training
  • Embeddings: Full training (unfrozen)
  • LoRA: High-rank adaptation on transformer layers
  • Training Context: 8K tokens
  • Precision: BF16
  • Infrastructure: NVIDIA H100 GPUs
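
To give a sense of what training full embeddings alongside LoRA adapters implies for the trainable parameter count, here is a rough arithmetic sketch. LoRA learns the update to a d_out × d_in weight matrix as the product of two low-rank factors (d_out × r and r × d_in), so it trains r · (d_in + d_out) parameters per adapted matrix, while an unfrozen embedding matrix contributes vocab_size · hidden_size parameters. The rank, dimensions, and vocabulary size below are placeholder values chosen for illustration; the actual Kona2 settings are not stated in this card.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one fully trained embedding matrix."""
    return vocab_size * hidden_size


# Placeholder values (assumptions, not the real Kona2 configuration).
hidden = 1024
rank = 256
vocab = 50_000

per_matrix = lora_params(hidden, hidden, rank)
full_matrix = hidden * hidden

print(f"LoRA params per square matrix: {per_matrix}")
print(f"Full params per square matrix: {full_matrix}")
print(f"Embedding params:              {embedding_params(vocab, hidden)}")
```

With these placeholder numbers, a rank of a quarter of the hidden size already trains half as many parameters as the full matrix, which is the sense in which a high-rank adapter sits between conventional low-rank LoRA and full fine-tuning.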

Usage

Installation

pip install transformers torch accelerate

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "tbilisi-ai-lab/kona2-12B-Base",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("tbilisi-ai-lab/kona2-12B-Base")

# Text completion
text = "საქართველო არის"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
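
Because this is a base model rather than an instruction-tuned one, it continues text instead of following instructions. One common way to steer a base model is few-shot prompting: show a few input/output pairs and leave the last output open. The helper below is a hypothetical illustration. Its name, the "English:" and "Georgian:" labels, and the example pairs are assumptions, not a format the model is known to have been trained on. The resulting string could be passed as `text` in the example above.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format (English, Georgian) pairs followed by an open-ended query."""
    blocks = [f"English: {en}\nGeorgian: {ka}" for en, ka in examples]
    # Leave the final "Georgian:" empty so the model completes it.
    blocks.append(f"English: {query}\nGeorgian:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    [("Hello", "გამარჯობა"), ("Thank you", "მადლობა")],
    "Good morning",
)
print(prompt)
```

When generating from such a prompt, a small `max_new_tokens` value and truncating the output at the first blank line usually helps keep the model from continuing with further invented pairs.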

Related Models

Model               Description
kona2-12B-Instruct  Instruction-tuned version (SFT)
kona2-12B           Preference-aligned version (DPO)
kona2-small-3.8B    Smaller 3.8B model

Limitations

  • Training data cutoff: 2024

Training Pipeline

Mistral-Nemo-Base-2407
    │
    ├── Expand Vocabulary (+20K Georgian tokens)
    │   └── Initialize with token average
    │
    └── Continued Pre-training (~30B tokens)
        └── Full embeddings training + high-rank LoRA
            │
            └── kona2-12B-Base ← YOU ARE HERE
                │
                └── SFT (~2.8M examples)
                    │
                    └── kona2-12B-Instruct
                        │
                        └── DPO (387K pairs)
                            │
                            └── kona2-12B

Technical Specifications

  • Precision: BF16/FP16 supported
  • Minimum VRAM: 24GB (with quantization)
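
As a rough sanity check on the VRAM figure, weight memory can be estimated as parameter count times bits per weight. The sketch below counts weights only; activations, the KV cache, and framework overhead come on top. The quantization scheme behind the 24 GB figure is not specified here, so the bit widths shown are assumptions.

```python
def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 12_000_000_000  # nominal 12B, from the model summary

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under this estimate, BF16 weights alone take about 24 GB, which is consistent with a 24 GB GPU needing a quantized checkpoint to leave headroom for activations and the KV cache.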

Citation

@misc{tbilisi2025kona2base,
  title        = {Kona2-12B-Base: A Georgian Language Model},
  author       = {Tbilisi AI Lab Team},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/tbilisi-ai-lab/kona2-12B-Base}}
}

License

This model is released under the Apache 2.0 License.

Contact
