---
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- embeddings
- sentence-transformers
- vllm
- inference-optimized
- inference
license: mit
base_model: cointegrated/rubert-tiny2
---

# rubert-tiny2-vllm

**vLLM-optimized version** of [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) for high-performance embedding inference.

This model produces **numerically identical embeddings** to the original while enabling faster inference through vLLM's optimized kernels and batching.

## Modifications

- **No weight changes**: uses the original query/key/value weights directly
- vLLM automatically converts Q/K/V to its fused qkv_proj format during loading (illustrated in the sketch below)
- Removed the pretraining heads (MLM/NSP), which are not needed for embeddings
- Changed the architecture to `BertModel` for vLLM compatibility
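
The fused qkv_proj layout is just the three projection matrices stacked together, which is why the checkpoint weights can stay untouched. A conceptual illustration (not vLLM's actual loading code; 312 is rubert-tiny2's hidden size):

```python
import torch

hidden = 312  # rubert-tiny2 hidden size

# Separate projections, as stored in the checkpoint
q_w, k_w, v_w = (torch.randn(hidden, hidden) for _ in range(3))

# What "fused qkv_proj" amounts to: one stacked weight matrix
qkv_w = torch.cat([q_w, k_w, v_w], dim=0)  # shape (3 * hidden, hidden)

x = torch.randn(1, hidden)
q, k, v = (x @ qkv_w.T).split(hidden, dim=-1)

# A single fused matmul reproduces the three separate projections
assert torch.allclose(q, x @ q_w.T, atol=1e-6)
assert torch.allclose(v, x @ v_w.T, atol=1e-6)
```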

## Usage

### vLLM Server
```bash
# IMPORTANT: Use fp32 for exact numerical match with original model
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
```

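A quick readiness check from Python (a sketch; assumes the default host/port and that vLLM's OpenAI-compatible server exposes its `/health` endpoint there):

```python
import requests

# Returns 200 once the model has finished loading and the server accepts requests
resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.status_code)
```
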
### OpenAI-compatible API
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm"
)
print(response.data[0].embedding[:5])
```

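The same endpoint also accepts a list of inputs, which is where vLLM's batching helps most. A small extension of the example above (the three sentences are only illustrative):

```python
texts = ["привет мир", "hello world", "здравствуй вселенная"]

# One request embeds the whole batch
response = client.embeddings.create(
    input=texts,
    model="WpythonW/rubert-tiny2-vllm"
)
vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # 3 312
```
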
### Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```

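The same CLS-token + L2-normalization recipe extends naturally to a batch of sentences. A small variant of the function above (a hypothetical helper, not part of the original card):

```python
def embed_bert_cls_batch(texts, model, tokenizer):
    # Tokenize and pad the whole batch at once
    t = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # CLS-token embedding for every sentence, L2-normalized
    embeddings = torch.nn.functional.normalize(model_output.last_hidden_state[:, 0, :])
    return embeddings.cpu().numpy()

print(embed_bert_cls_batch(["привет мир", "hello world"], model, tokenizer).shape)
# (2, 312)
```
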
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings.shape)
```

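To compare the example sentences, pairwise cosine similarities can be computed with the library's `util.cos_sim` helper (a usage sketch building on the block above):

```python
from sentence_transformers import util

# 3x3 matrix of cosine similarities between the sentences
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```
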
## Validation Results

Comparison between vLLM and SentenceTransformers on identical inputs:
```
Max embedding difference:   3.375e-7
Mean embedding difference:  1.136e-7
Cosine similarity matrices: identical (np.allclose with default tolerances)
```

This confirms **numerical equivalence** with the original within float32 precision limits.

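A minimal way to re-check these numbers (a sketch; assumes the vLLM server from the Usage section is running locally):

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

texts = ["привет мир", "hello world", "здравствуй вселенная"]

# Embeddings served by vLLM through the OpenAI-compatible API
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.embeddings.create(input=texts, model="WpythonW/rubert-tiny2-vllm")
vllm_emb = np.array([item.embedding for item in resp.data])

# Embeddings from SentenceTransformers on the same inputs
st_emb = SentenceTransformer("WpythonW/rubert-tiny2-vllm").encode(texts)

print("max diff: ", np.abs(vllm_emb - st_emb).max())
print("mean diff:", np.abs(vllm_emb - st_emb).mean())
```
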
## Conversion

Full conversion notebook with validation: [Google Colab](https://colab.research.google.com/drive/1SS9qEayvwZU1r1khxq9tWf7iEZcxw2yW)

**Conversion process** (a minimal sketch follows the list):
1. Load the original cointegrated/rubert-tiny2 weights
2. Remove the `bert.` prefix from weight names
3. Remove the unused heads (`cls.*`, `bert.pooler.*`)
4. Keep the query/key/value weights as-is (vLLM handles fusion automatically)
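
The steps above amount to a simple state-dict rewrite. A hypothetical sketch (not the exact notebook code; assumes the original checkpoint is available locally as `pytorch_model.bin`):

```python
import torch

state = torch.load("rubert-tiny2/pytorch_model.bin", map_location="cpu")

converted = {}
for name, tensor in state.items():
    # Step 3: drop the pretraining heads and the pooler, unused for embeddings
    if name.startswith("cls.") or name.startswith("bert.pooler."):
        continue
    # Step 2: strip the "bert." prefix so names match a plain BertModel
    new_name = name[len("bert."):] if name.startswith("bert.") else name
    # Steps 1 and 4: tensors are copied unchanged, including query/key/value;
    # vLLM fuses them into qkv_proj at load time
    converted[new_name] = tensor

torch.save(converted, "rubert-tiny2-vllm/pytorch_model.bin")
```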

Tested on a Google Colab Tesla T4 with:
- vLLM 0.11.2
- Transformers 4.57.2
- PyTorch 2.9.0+cu126

## Original Model

For standard PyTorch/Transformers usage, see the original model: [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)