Ruri: Japanese General Text Embeddings

Notes: v3 models are out!
We recommend using the following v3 models going forward.

| ID | #Param. | Max Len. | Avg. JMTEB |
|---|---|---|---|
| cl-nagoya/ruri-v3-30m | 37M | 8192 | 74.51 |
| cl-nagoya/ruri-v3-70m | 70M | 8192 | 75.48 |
| cl-nagoya/ruri-v3-130m | 132M | 8192 | 76.55 |
| cl-nagoya/ruri-v3-310m | 315M | 8192 | 77.24 |

Usage

First install the Sentence Transformers library, together with the tokenizer dependencies needed for Japanese text:

pip install -U sentence-transformers fugashi sentencepiece unidic-lite

Then you can load this model and run inference.

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the ๐Ÿค— Hub
model = SentenceTransformer("cl-nagoya/ruri-large")

# Don't forget to add the prefix "ใ‚ฏใ‚จใƒช: " for query-side or "ๆ–‡็ซ : " for passage-side texts.
sentences = [
    "ใ‚ฏใ‚จใƒช: ็‘ ็’ƒ่‰ฒใฏใฉใ‚“ใช่‰ฒ๏ผŸ",
    "ๆ–‡็ซ : ็‘ ็’ƒ่‰ฒ๏ผˆใ‚‹ใ‚Šใ„ใ‚๏ผ‰ใฏใ€็ดซใฟใ‚’ๅธฏใณใŸๆฟƒใ„้’ใ€‚ๅใฏใ€ๅŠ่ฒด็Ÿณใฎ็‘ ็’ƒ๏ผˆใƒฉใƒ”ใ‚นใƒฉใ‚บใƒชใ€่‹ฑ: lapis lazuli๏ผ‰ใซใ‚ˆใ‚‹ใ€‚JISๆ…ฃ็”จ่‰ฒๅใงใฏใ€Œใ“ใ„็ดซใฟใฎ้’ใ€๏ผˆ็•ฅๅท dp-pB๏ผ‰ใจๅฎš็พฉใ—ใฆใ„ใ‚‹[1][2]ใ€‚",
    "ใ‚ฏใ‚จใƒช: ใƒฏใ‚ทใ‚„ใ‚ฟใ‚ซใฎใ‚ˆใ†ใซใ€้‹ญใ„ใใกใฐใ—ใจ็ˆชใ‚’ๆŒใฃใŸๅคงๅž‹ใฎ้ณฅ้กžใ‚’็ท็งฐใ—ใฆใ€Œไฝ•้กžใ€ใจใ„ใ†ใงใ—ใ‚‡ใ†?",
    "ๆ–‡็ซ : ใƒฏใ‚ทใ€ใ‚ฟใ‚ซใ€ใƒใ‚ฒใƒฏใ‚ทใ€ใƒใƒคใƒ–ใ‚ตใ€ใ‚ณใƒณใƒ‰ใƒซใ€ใƒ•ใ‚ฏใƒญใ‚ฆใŒไปฃ่กจ็š„ใงใ‚ใ‚‹ใ€‚ใ“ใ‚Œใ‚‰ใฎ็Œ›็ฆฝ้กžใฏใƒชใƒณใƒๅ‰ๅพŒใฎๆ™‚ไปฃ(17~18ไธ–็ด€)ใซใฏ้ทฒ้กžใƒป้ทน้กžใƒป้šผ้กžๅŠใณๆขŸ้กžใซๅˆ†้กžใ•ใ‚ŒใŸใ€‚ใกใชใฟใซใƒชใƒณใƒใฏ็‹ฉใ‚Šใ‚’ใ™ใ‚‹้ณฅใ‚’ๅ˜ไธ€ใฎ็›ฎ(ใ‚‚ใ)ใซใพใจใ‚ใ€vultur(ใ‚ณใƒณใƒ‰ใƒซใ€ใƒใ‚ฒใƒฏใ‚ท)ใ€falco(ใƒฏใ‚ทใ€ใ‚ฟใ‚ซใ€ใƒใƒคใƒ–ใ‚ตใชใฉ)ใ€strix(ใƒ•ใ‚ฏใƒญใ‚ฆ)ใ€lanius(ใƒขใ‚บ)ใฎ4ๅฑžใ‚’ๅซใ‚ใฆใ„ใ‚‹ใ€‚",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# [4, 1024]

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.9429, 0.6565, 0.6997],
#  [0.9429, 1.0000, 0.6579, 0.6768],
#  [0.6565, 0.6579, 1.0000, 0.8933],
#  [0.6997, 0.6768, 0.8933, 1.0000]]
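For retrieval-style use, the prefix convention above can be wrapped in small helpers. The following is a minimal sketch, not part of the model's API: the names `with_prefix`, `cosine`, `rank_passages`, and `search` are hypothetical, and `search` only assumes that `model` exposes a SentenceTransformer-style `encode()` that returns one vector per input text.

```python
import math

# Prefixes taken from the usage example above.
QUERY_PREFIX = "クエリ: "
PASSAGE_PREFIX = "文章: "


def with_prefix(texts, prefix):
    """Prepend the prefix to each text unless it is already present."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted by descending similarity."""
    scores = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def search(model, query, passages):
    """Hypothetical helper: encode one query and several passages, then rank.

    `model` is assumed to behave like SentenceTransformer, i.e. encode()
    takes a list of strings and returns one vector per string.
    """
    texts = with_prefix([query], QUERY_PREFIX) + with_prefix(passages, PASSAGE_PREFIX)
    vectors = [[float(x) for x in vec] for vec in model.encode(texts)]
    return rank_passages(vectors[0], vectors[1:])
```

With the model loaded as in the example above, `search(model, "瑠璃色はどんな色？", passages)` would return the passage indices ordered from most to least similar. Keeping the prefixing in one place makes it harder to forget the prefix on either side.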

Benchmarks

JMTEB

Evaluated with JMTEB.

| Model | #Param. | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| cl-nagoya/sup-simcse-ja-base | 111M | 68.56 | 49.64 | 82.05 | 73.47 | 91.83 | 51.79 | 62.57 |
| cl-nagoya/sup-simcse-ja-large | 337M | 66.51 | 37.62 | 83.18 | 73.73 | 91.48 | 50.56 | 62.51 |
| cl-nagoya/unsup-simcse-ja-base | 111M | 65.07 | 40.23 | 78.72 | 73.07 | 91.16 | 44.77 | 62.44 |
| cl-nagoya/unsup-simcse-ja-large | 337M | 66.27 | 40.53 | 80.56 | 74.66 | 90.95 | 48.41 | 62.49 |
| pkshatech/GLuCoSE-base-ja | 133M | 70.44 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| sentence-transformers/LaBSE | 472M | 64.70 | 40.12 | 76.56 | 72.66 | 91.63 | 44.88 | 62.33 |
| intfloat/multilingual-e5-small | 118M | 69.52 | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 |
| intfloat/multilingual-e5-base | 278M | 70.12 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| intfloat/multilingual-e5-large | 560M | 71.65 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| OpenAI/text-embedding-ada-002 | - | 69.48 | 64.38 | 79.02 | 69.75 | 93.04 | 48.30 | 62.40 |
| OpenAI/text-embedding-3-small | - | 70.86 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 73.97 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| Ruri-Small | 68M | 71.53 | 69.41 | 82.79 | 76.22 | 93.00 | 51.19 | 62.11 |
| Ruri-Base | 111M | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| Ruri-Large (this model) | 337M | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |

Model Details

Model Description

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
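The Pooling configuration above enables only `pooling_mode_mean_tokens`, so each sentence embedding is the mean of its token embeddings, with padding positions excluded through the attention mask. The idea can be sketched in plain Python as follows; this is an illustration under that assumption, not the Sentence Transformers implementation, and the function name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padded positions are ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, x in enumerate(vec):
                total[j] += x
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in total]


# Two real tokens followed by one padding token (toy 2-dimensional vectors).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

In the real model the token vectors have 1024 dimensions (`word_embedding_dimension`), which matches the `[4, 1024]` shape printed in the usage example.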

Framework Versions

  • Python: 3.10.13
  • Sentence Transformers: 3.0.0
  • Transformers: 4.41.2
  • PyTorch: 2.3.1+cu118
  • Accelerate: 0.30.1
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

@misc{
  Ruri,
  title={{Ruri: Japanese General Text Embeddings}}, 
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737}, 
}

License

This model is published under the Apache License, Version 2.0.
