This is a sentence-transformers model fine-tuned from the intfloat/multilingual-e5-large architecture. It maps sentences and paragraphs to a 1024-dimensional dense vector space and is explicitly designed to improve medical information retrieval and search capabilities across unstructured clinical data.
Model Details
Model Description
Existing embedding models are predominantly trained on publicly available datasets and often fall short in non-English healthcare settings containing domain-specific terminology, abbreviations, and nuanced clinical language.
The miracle-german model addresses this gap. It is a domain-specific embedding model fine-tuned on real-world German clinical documents to enhance context retrieval when integrated into Retrieval Augmented Generation (RAG) systems for healthcare applications. To protect patient privacy, all training procedures and evaluations were conducted on pseudonymized documents.
- Model Type: Sentence Transformer
- Base Model: intfloat/multilingual-e5-large
- Language: German
- Domain: Healthcare / Medical Information Retrieval
- Output Dimensionality: 1024 dimensions
Usage
Direct Usage (Sentence Transformers)
First, install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference:
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("SHIPAI-IKIM/miracle-german")
# Run inference
sentences = [
'query: Was ist die Hauptdiagnose des Patienten?',
'passage: Patient wurde notfallstationär aufgenommen, wurde zuletzt wegen NSCLC behandelt.',
'passage: Histologie: Präoperative Indikation: Patientin mit histologisch gesichertem BCC an oben genannter Lokalisation, so dass jetzt die Indikation zur Tumorexzision gegeben ist.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
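As the snippet above shows, and as is standard for E5-family models, inputs must carry a "query: " or "passage: " prefix. Once embeddings are computed, retrieval reduces to ranking passages by cosine similarity against the query. A minimal sketch of that ranking step, using placeholder vectors in place of real model.encode() output (the helper rank_passages is illustrative, not part of the model's API):

```python
import numpy as np

def rank_passages(query_emb, passage_embs):
    """Rank passages by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q                # cosine similarities, shape (n_passages,)
    order = np.argsort(-scores)   # best match first
    return order, scores[order]

# Placeholder 1024-dim vectors standing in for model.encode() output
rng = np.random.default_rng(0)
query = rng.normal(size=1024)
passages = rng.normal(size=(3, 1024))
order, scores = rank_passages(query, passages)
print(order)
```

In a RAG pipeline, the top-ranked chunks would then be passed to the generator as context.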
Training Details
Training Dataset
The model was fine-tuned on a carefully curated dataset comprising German-language clinical notes documented at the University Hospital Essen between 2018 and 2023.
- Document Types: The corpus included 400,000 clinical documents spanning four categories: radiology reports, discharge letters, pathology reports, and surgical operation notes.
- Synthetic Data Generation: The dataset was segmented into chunks, and the SauerkrautLM-SOLAR-Instruct Large Language Model was tasked with generating medically relevant questions alongside the correct answers contained in the chunks.
- Scale: The training data consisted of approximately 11 million synthetically generated question-answer pairs.
- Pseudonymization: Protected Health Information (PHI) in the documents was identified and replaced with surrogates utilizing a dedicated de-identification pipeline.
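The segmentation step above can be pictured as a simple sliding window over each document. The sketch below is purely illustrative: the actual chunk size, overlap, and tokenization used during training are not specified here, so the word-based windowing and the default values are assumptions.

```python
def chunk_document(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split a clinical note into overlapping word-window chunks
    (illustrative only; real pipelines often chunk by tokens or sections)."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(450))
print(len(chunk_document(doc)))  # 3 overlapping chunks for a 450-word document
```

Each resulting chunk would then be handed to the LLM as context for generating a question-answer pair.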
Training Hyperparameters
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 1024
- Loss Function: CachedMultipleNegativesRankingLoss with a mini-batch size of 32
- Epochs: 1 (limited to a single epoch to prevent overfitting to specific linguistic patterns generated by the LLM)
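The ranking loss above treats, for each query in the batch, its paired passage as the positive and every other in-batch passage as a negative, then applies cross-entropy over the scaled similarity matrix. A numpy sketch of that core computation (the cached variant additionally uses gradient caching so that 32-example mini-batches can be accumulated into the effective batch of 1024 without exceeding GPU memory; that mechanism, and the scale value of 20, are omitted or assumed here):

```python
import numpy as np

def mnrl_loss(query_embs, passage_embs, scale=20.0):
    """Multiple-negatives ranking loss: for query i, passage i is the
    positive and all other in-batch passages act as negatives."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    sims = scale * (q @ p.T)  # (batch, batch) cosine-similarity matrix
    # cross-entropy with the diagonal (the true pairing) as the target class
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

q = p = np.eye(4)  # toy batch: query i matches passage i exactly
print(mnrl_loss(q, p))  # near zero: each positive dominates its row
```

Larger effective batches supply more in-batch negatives per query, which is why this loss family benefits from the large (1024) batch size used here.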
Limitations and Risks
- Template Overfitting: Clinical documents from a single institution often share rigid structural templates, creating a risk that the model learns to associate relevance with institutional artifacts rather than purely semantic content.
- Document Diversity: The dataset is limited to four types of clinical documents and may benefit from expansion by including a greater variety of medical texts.
- Synthetic Data Noise: The LLM used for data generation is susceptible to hallucinations, and a manual sample audit revealed that 18.0% of generated pairs contained hallucinations and 6.0% contained factual errors. This introduces potential noise into the training dataset.
- Clinical Verification: Any application of these models in clinical practice must mandate robust downstream filtering and expert human verification to identify and intercept potential retrieval errors.
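One simple building block for the downstream filtering recommended above is a similarity threshold that drops weak retrievals and flags empty result sets for human review. The sketch below is a hypothetical example, not part of this model or its evaluation; the threshold value is an assumption and would need to be tuned per application, and threshold filtering alone does not replace expert verification.

```python
def filter_retrievals(passages, scores, min_score=0.85):
    """Drop retrieved passages whose similarity falls below a threshold;
    flag the query for human review if nothing survives the filter."""
    kept = [(p, s) for p, s in zip(passages, scores) if s >= min_score]
    needs_review = len(kept) == 0
    return kept, needs_review

kept, review = filter_retrievals(["passage a", "passage b"], [0.91, 0.42])
print(kept, review)  # only the high-scoring passage survives; no review flag
```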
Citation
If you use this model, please cite the following publication:
Arzideh K., Schäfer H., Idrissi-Yaghir A. et al. "Improving Retrieval Augmented Generation for Health Care by Fine-Tuning Clinical Embedding Models: Development and Evaluation Study". Journal of Medical Internet Research, 28, e82997, doi: https://doi.org/10.2196/82997