Byrne-Embed
Byrne-Embed is a compact 85M-parameter sentence-embedding model. It maps text to 768-dimensional unit-norm vectors suitable for semantic similarity, retrieval, clustering, and reranking.
The backbone is a custom SpikeWhale decoder (the "Byrne" line). A mean-pooled representation of its last hidden state is projected to 768 dimensions by a learned head and unit-normalized, so cosine similarity between two embeddings is just a dot product.
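The pooling, projection, and normalization steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the model's actual code: the toy sizes (hidden size 4, output size 3), the padding mask, and the weight matrix `W` are made-up stand-ins. The real head projects to 768 dimensions, and whether padding is masked during pooling is assumed here rather than confirmed.

```python
import math

def mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, m in zip(hidden, mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]

def project(vec, weight):
    """Apply a linear head: one output value per row of `weight`."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy inputs: 3 tokens each, hidden size 4.
# The last token of hidden_a is padding and is masked out.
hidden_a = [[0.2, -0.1, 0.4, 0.0], [0.6, 0.3, -0.2, 0.1], [9.0, 9.0, 9.0, 9.0]]
hidden_b = [[0.1, 0.0, 0.5, -0.3], [0.4, 0.2, 0.1, 0.2], [0.3, -0.1, 0.0, 0.1]]
mask_a, mask_b = [1, 1, 0], [1, 1, 1]
# Made-up projection weights (3 outputs x 4 inputs), standing in for the learned head.
W = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.3, 0.2], [0.1, 0.1, 0.6, -0.5]]

raw_a = project(mean_pool(hidden_a, mask_a), W)
raw_b = project(mean_pool(hidden_b, mask_b), W)
emb_a, emb_b = l2_normalize(raw_a), l2_normalize(raw_b)

# After unit-normalization, the dot product equals the cosine of the raw vectors.
print(round(dot(emb_a, emb_b), 6), round(cosine(raw_a, raw_b), 6))
```

The two printed values match, which is the property that lets downstream code skip the norm computation entirely.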
Benchmark vs. EmbeddingGemma-300M
We benchmarked Byrne-Embed against Google's EmbeddingGemma-300M on 4,000 held-out sentences spanning educational web text, encyclopedic text, and instruction/chat text. Byrne-Embed's embedding geometry tracks EmbeddingGemma's closely at roughly 1/3.5 of the parameter count (85M vs. roughly 300M):
| Metric (Byrne-Embed vs EmbeddingGemma) | Result |
|---|---|
| Mean per-sentence cosine | 0.9415 (median 0.945, p10 0.912) |
| Sentences with per-sentence cosine ≥ 0.90 | 94.7% |
| Similarity-structure agreement (Pearson) | 0.9702 |
| Similarity-structure agreement (Spearman) | 0.9599 |
| Per-anchor neighbour-ranking correlation | 0.9494 |
| Retrieval top-1 nearest-neighbour agreement | 72.8% |
| Retrieval Recall@10 overlap | 78.2% |
Reading the numbers. The two most important measures, which capture how closely the two models agree on which sentences are similar, land at Pearson 0.97 and Spearman 0.96: when EmbeddingGemma judges two sentences similar, Byrne-Embed almost always does too. 94.7% of sentences have a per-sentence cosine of at least 0.90. The lower top-1 retrieval figure is expected and does not indicate a quality gap: in a dense pool of real sentences many neighbours are near-ties (for example 0.88 vs 0.87), so the #1 slot flips easily between near-duplicates. That is why Recall@10 overlap stays at about 78% and the neighbour-ranking correlation is 0.95. Both models find the same neighbourhood; they occasionally swap rank 1 and rank 2 among near-identical candidates.
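To make the near-tie argument concrete, the sketch below compares two hypothetical similarity rankings for a single query. The scores, the candidate ids, and the helpers `top_k` and `overlap_at_k` are illustrative assumptions; they are not taken from run_tests.py, which may define its metrics differently.

```python
def top_k(scores, k):
    """Return the k candidate ids with the highest similarity, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def overlap_at_k(scores_a, scores_b, k):
    """Fraction of model A's top-k neighbours that also appear in model B's top-k."""
    a, b = set(top_k(scores_a, k)), set(top_k(scores_b, k))
    return len(a & b) / k

# Made-up similarities of one query against five candidates under two models.
# s1 and s2 are near-ties, so their order flips between the two models.
model_a = {"s1": 0.88, "s2": 0.87, "s3": 0.61, "s4": 0.35, "s5": 0.10}
model_b = {"s1": 0.86, "s2": 0.89, "s3": 0.58, "s4": 0.12, "s5": 0.30}

top1_agree = top_k(model_a, 1) == top_k(model_b, 1)
print(top1_agree)                         # False: rank 1 and rank 2 are swapped
print(overlap_at_k(model_a, model_b, 3))  # 1.0: the same three neighbours
```

In this toy case the top-1 check fails even though both models retrieve an identical neighbourhood, which is the pattern the paragraph above describes.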
Reproduce these numbers with the bundled run_tests.py (it loads both models and prints the full table).
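For readers who want a rough picture of the similarity-structure rows before running the script, one plausible definition is the correlation between the two models' pairwise similarity values over the same sentences. The sketch below assumes that definition and uses tiny made-up 2-D vectors. The helper names are hypothetical, and run_tests.py remains the authoritative source for how the reported numbers are computed.

```python
import math
from itertools import combinations

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def pairwise_sims(vectors):
    """Dot products for every unordered pair (upper triangle, no diagonal)."""
    return [sum(x * y for x, y in zip(vectors[i], vectors[j]))
            for i, j in combinations(range(len(vectors)), 2)]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank positions (0 = smallest); assumes no exact ties, for simplicity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out

def spearman(xs, ys):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up embeddings of the same four sentences from two hypothetical models.
model_a = [unit(v) for v in ([1.0, 0.0], [0.9, 0.3], [0.2, 1.0], [-0.5, 0.8])]
model_b = [unit(v) for v in ([1.0, 0.1], [0.8, 0.4], [0.1, 1.0], [-0.6, 0.7])]

sims_a, sims_b = pairwise_sims(model_a), pairwise_sims(model_b)
print(round(pearson(sims_a, sims_b), 3), round(spearman(sims_a, sims_b), 3))
```

Pearson compares the similarity values themselves, while Spearman compares only their ordering, so the two can differ when one model's scores are a monotone rescaling of the other's.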
MTEB English Benchmark — MTEB(eng, v2)
Evaluated with the official mteb library on the full MTEB(eng, v2) suite (41/41 tasks). Raw results are in mteb_results/; machine-readable scores are in the model-index metadata above.
Overall MTEB(eng, v2) mean: 50.79
| Category | Mean | Tasks |
|---|---|---|
| STS | 71.93 | 9 |
| Classification | 70.57 | 8 |
| PairClassification | 74.07 | 3 |
| Clustering | 37.32 | 8 |
| Reranking | 40.48 | 2 |
| Retrieval | 24.64 | 10 |
| Summarization | 22.39 | 1 |
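The overall figure is consistent with a mean taken over all 41 tasks rather than over the 7 category means. A quick check using only the numbers in the table above (the variable names are illustrative):

```python
# (category mean, number of tasks), copied from the table above.
categories = {
    "STS": (71.93, 9),
    "Classification": (70.57, 8),
    "PairClassification": (74.07, 3),
    "Clustering": (37.32, 8),
    "Reranking": (40.48, 2),
    "Retrieval": (24.64, 10),
    "Summarization": (22.39, 1),
}

n_tasks = sum(count for _, count in categories.values())
# Weight each category mean by its task count to recover the per-task mean.
task_mean = sum(mean * count for mean, count in categories.values()) / n_tasks
# Unweighted mean of the seven category means, for comparison.
category_mean = sum(mean for mean, _ in categories.values()) / len(categories)

print(n_tasks)                  # 41
print(round(task_mean, 2))      # 50.79
print(round(category_mean, 2))  # 48.77
```

The two averages differ because the categories contribute unequal numbers of tasks.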
STS
| Task | Score |
|---|---|
| BIOSSES | 75.56 |
| SICK-R | 69.08 |
| STS12 | 64.88 |
| STS13 | 72.08 |
| STS14 | 67.76 |
| STS15 | 77.13 |
| STS17 | 83.23 |
| STS22.v2 | 60.53 |
| STSBenchmark | 77.08 |
Classification
| Task | Score |
|---|---|
| AmazonCounterfactualClassification | 80.12 |
| Banking77Classification | 74.64 |
| ImdbClassification | 60.97 |
| MTOPDomainClassification | 92.29 |
| MassiveIntentClassification | 63.23 |
| MassiveScenarioClassification | 73.05 |
| ToxicConversationsClassification | 62.94 |
| TweetSentimentExtractionClassification | 57.29 |
PairClassification
| Task | Score |
|---|---|
| SprintDuplicateQuestions | 86.47 |
| TwitterSemEval2015 | 53.19 |
| TwitterURLCorpus | 82.55 |
Clustering
| Task | Score |
|---|---|
| ArXivHierarchicalClusteringP2P | 53.15 |
| ArXivHierarchicalClusteringS2S | 50.39 |
| BiorxivClusteringP2P.v2 | 33.73 |
| MedrxivClusteringP2P.v2 | 32.70 |
| MedrxivClusteringS2S.v2 | 29.04 |
| StackExchangeClustering.v2 | 41.93 |
| StackExchangeClusteringP2P.v2 | 35.22 |
| TwentyNewsgroupsClustering.v2 | 22.39 |
Reranking
| Task | Score |
|---|---|
| AskUbuntuDupQuestions | 52.88 |
| MindSmallReranking | 28.07 |
Retrieval
| Task | Score |
|---|---|
| ArguAna | 37.67 |
| CQADupstackGamingRetrieval | 37.14 |
| CQADupstackUnixRetrieval | 23.48 |
| ClimateFEVERHardNegatives | 13.60 |
| FEVERHardNegatives | 28.70 |
| FiQA2018 | 11.38 |
| HotpotQAHardNegatives | 30.47 |
| SCIDOCS | 10.15 |
| TRECCOVID | 29.30 |
| Touche2020Retrieval.v3 | 24.50 |
Summarization
| Task | Score |
|---|---|
| SummEvalSummarization.v2 | 22.39 |
Usage
from byrne_embedder import ByrneEmbedder
enc = ByrneEmbedder(".")  # load from the model dir
vecs = enc.encode(["The cat sat on the windowsill.",
                   "A feline rested by the window."])  # (2, 768), unit-norm
print(float(vecs[0] @ vecs[1]))  # cosine similarity ~ 0.83
print(enc.similarity("How do I bake bread?",
                     "Photosynthesis converts sunlight to energy."))  # ~ 0.28
encode() returns L2-normalized torch.Tensor rows, so cosine similarity is just a dot product.
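Because the rows are unit-norm, a simple semantic search reduces to ranking corpus vectors by their dot product with the query vector. The sketch below uses small hand-written unit vectors as stand-ins for encode() output, so it runs without the model. The `search` helper is an illustrative assumption and is not part of byrne_embedder.py.

```python
def search(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k highest dot products, best first."""
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in corpus_vecs]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Stand-in unit vectors (each has length 1); real rows would have 768 dimensions.
corpus = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.8, 0.6, 0.0]

hits = search(query, corpus, k=2)
print(hits)  # corpus index 1 ranks first, then index 0
```

With actual torch.Tensor output, the same ranking could be computed with a single matrix product followed by torch.topk.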
Files
| File | Purpose |
|---|---|
| model.safetensors, config.json | SpikeWhale backbone weights + config |
| embed_head.pt | learned projection head to 768-dim |
| tokenizer.json, tokenizer_config.json | byte-level SpikeTokenizer |
| byrne_embedder.py | self-contained loader / encode() API |
| model_v2.py, config.py, spike_tokenizer.py | SpikeWhale architecture + tokenizer code |
| run_tests.py | reproduces the benchmark table |
Limitations
- English-centric evaluation; non-English performance is untested.
- The single residual weak spot observed during evaluation is finance/economics paraphrase retrieval; general semantic similarity is strong.
- Custom architecture: load via the bundled byrne_embedder.py (local modeling code; no remote code execution).
Citation
If you use Byrne-Embed, please cite:
@misc{byrne2026byrneembed,
title = {Byrne-Embed: A Compact 85M Sentence-Embedding Model},
author = {Byrne, Dean},
year = {2026},
howpublished = {\url{https://huggingface.co/Quazim0t0/Byrne-Embed}},
}
License
Apache-2.0.