About model_max_length
In tokenizer_config.json it says: "model_max_length": 1000000000000000019884624838656.
Could you tell me the length distribution of the training set? That would help me choose a better chunk length when testing your model.
Our queries and passages are trained with a length of 512, but the maximum length of a query with examples is set to 2048.
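For context, the value in tokenizer_config.json equals int(1e30), which the Transformers library stores as a placeholder when no maximum length is configured, so it does not describe a real limit. The following is a minimal sketch of choosing a practical limit from the numbers given in this thread. The function name and constants are hypothetical, and the lengths are assumed to be token counts.

```python
# Placeholder that Transformers stores when a tokenizer has no configured limit.
UNSET_MAX_LENGTH = int(1e30)

# Lengths reported in this thread (assumed here to be token counts).
PASSAGE_MAX_LENGTH = 512
QUERY_WITH_EXAMPLES_MAX_LENGTH = 2048


def effective_max_length(model_max_length: int, has_examples: bool = False) -> int:
    """Pick a practical max length, ignoring the 'unset' placeholder (hypothetical helper)."""
    trained_limit = QUERY_WITH_EXAMPLES_MAX_LENGTH if has_examples else PASSAGE_MAX_LENGTH
    # A huge model_max_length means "not configured", so the trained limit wins.
    return min(model_max_length, trained_limit)


print(UNSET_MAX_LENGTH)
print(effective_max_length(UNSET_MAX_LENGTH))
print(effective_max_length(UNSET_MAX_LENGTH, has_examples=True))
```

The resulting number can then be passed explicitly as the maximum length when tokenizing, rather than relying on the value in the config file.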
Thanks for the reply. Did you train all your passages with a length of 512, or do their lengths vary and average around 512?
Just like other LLM-based embedding models, we set the length of all passages to 512.
But the model can handle documents larger than 512, right? If so, would it be better to truncate to 512 or not?
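The thread leaves this question open. For anyone who wants to compare both options, here is a minimal sketch that either truncates a list of token ids to 512 or splits it into chunks of at most 512. The function names and the `stride` parameter are hypothetical, the token ids are a stand-in for real tokenizer output, and nothing here is confirmed by the model authors.

```python
from typing import List

TRAINED_LENGTH = 512  # passage length reported in this thread


def truncate(token_ids: List[int], max_length: int = TRAINED_LENGTH) -> List[int]:
    """Keep only the first max_length tokens."""
    return token_ids[:max_length]


def chunk(token_ids: List[int], max_length: int = TRAINED_LENGTH, stride: int = 0) -> List[List[int]]:
    """Split into windows of at most max_length tokens.

    stride is the number of tokens shared by neighbouring windows.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once this window reaches the end of the document.
        if start + max_length >= len(token_ids):
            break
    return chunks


doc = list(range(1200))  # stand-in for real token ids
print(len(truncate(doc)))
print([len(c) for c in chunk(doc)])
```

Embedding the truncated document and the individual chunks separately, then comparing retrieval quality on your own data, is one way to decide which option works better.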