sa-compliance-embeddings-v1
A 384-dimensional sentence embedding model fine-tuned for South African compliance retrieval, especially POPIA (Protection of Personal Information Act, 2013). Given a compliance scenario or query, this model retrieves the most relevant POPIA section text, a task where general-purpose embeddings under-perform because they are not trained on South African regulatory language or POPIA's specific structure.
This is, to our knowledge, the first publicly distributed embedding model fine-tuned on POPIA-grounded data.
Why this exists
General embedding models (intfloat/e5-small-v2, BAAI/bge-small-en-v1.5) are trained on web-scale English. They have no special grounding in South African regulatory language, no exposure to POPIA's section structure, and no concept of which scenarios trigger which clauses. The base model (BAAI/bge-small-en-v1.5) ranks the correct POPIA section first for only ~21% of compliance queries; this fine-tune raises that to ~48%.
For practical use (building "show me the POPIA clauses relevant to this scenario" tools, retrieval-augmented compliance reviews, or audit-pipeline section lookups), that gap is the difference between useful results and noise.
Evaluation
Retrieval task: given a labelled compliance scenario (entailment or contradiction relative to a known POPIA clause), retrieve the canonical POPIA section text for that clause from a corpus of 114 POPIA sections.
| Metric | Stock bge-small-en-v1.5 | sa-compliance-embeddings-v1 | Delta |
|---|---|---|---|
| Recall@1 | 0.211 | 0.477 | +26.6pp |
| Recall@3 | 0.445 | 0.594 | +14.9pp |
| Recall@5 | 0.508 | 0.656 | +14.8pp |
| Recall@10 | 0.680 | 0.766 | +8.6pp |
Eval set: 128 entailment + contradiction scenarios from data/popia_eval.jsonl and data/popia_eval_v2.jsonl, held out from training.
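Recall@k in the table is read here as the fraction of eval scenarios whose gold POPIA section appears among the top k retrieved sections. The sketch below shows that computation, assuming exactly one gold section per scenario. The function name `recall_at_k` and the toy section IDs are illustrative and are not taken from the evaluation code.

```python
def recall_at_k(rankings, gold_ids, k):
    """Fraction of queries whose gold section ID appears in the top-k ranking.

    rankings: one list of section IDs per query, best match first.
    gold_ids: the single correct section ID for each query (assumed one per query).
    """
    if len(rankings) != len(gold_ids):
        raise ValueError("rankings and gold_ids must have the same length")
    if not rankings:
        return 0.0
    hits = sum(1 for ranked, gold in zip(rankings, gold_ids) if gold in ranked[:k])
    return hits / len(rankings)


# Toy example with made-up section IDs (not real evaluation data).
toy_rankings = [
    ["s34", "s35", "s11"],  # gold "s34" is at rank 1
    ["s22", "s19", "s21"],  # gold "s19" is at rank 2
    ["s11", "s13", "s10"],  # gold "s69" is not retrieved
    ["s18", "s17", "s5"],   # gold "s5" is at rank 3
]
toy_gold = ["s34", "s19", "s69", "s5"]

for k in (1, 3):
    print(f"Recall@{k} = {recall_at_k(toy_rankings, toy_gold, k):.2f}")
```

Under this definition, a Recall@1 of 0.477 on 128 scenarios would correspond to 61 first-rank hits (61/128 ≈ 0.477).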
Usage
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("labrat-aiko/sa-compliance-embeddings-v1")
model = AutoModel.from_pretrained("labrat-aiko/sa-compliance-embeddings-v1")

def embed(texts, batch_size=16):
    """Return L2-normalised, mean-pooled embeddings for a list of strings."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True, max_length=256, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state
        # Mean-pool over real tokens only; the attention mask zeroes out padding.
        mask = enc["attention_mask"].unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # Unit-length vectors make the dot product below a cosine similarity.
        pooled = F.normalize(pooled, p=2, dim=1)
        embeddings.append(pooled)
    return torch.cat(embeddings, dim=0)

queries = ["Our app collects under-13 learner data without parental consent."]
docs = [
    "POPIA §34. Prohibition on processing personal information of children...",
    "POPIA §22. Notification of security compromises...",
]
q_emb = embed(queries)
d_emb = embed(docs)
scores = q_emb @ d_emb.T  # cosine similarity; higher = more relevant
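Because the embeddings are L2-normalised, each entry of `scores` is a cosine similarity, so ranking documents for a query is a sort over one row. The sketch below does this in plain Python, assuming the row has been converted with `scores[0].tolist()`. The helper name `top_k` and the example scores are hypothetical.

```python
def top_k(score_row, docs, k=3):
    """Return the k highest-scoring (score, doc) pairs, best first."""
    if len(score_row) != len(docs):
        raise ValueError("need exactly one score per document")
    ranked = sorted(zip(score_row, docs), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


# Hypothetical similarity scores for one query against three documents.
example_docs = ["section A text", "section B text", "section C text"]
example_scores = [0.41, 0.78, 0.65]

for score, doc in top_k(example_scores, example_docs, k=2):
    print(f"{score:.2f}  {doc}")
```

For a corpus the size of the 114 POPIA sections, a full sort per query is cheap. A larger corpus would usually call for an approximate nearest-neighbour index instead.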
Training
- Base: BAAI/bge-small-en-v1.5 (33M params, 384-dim, mean pooling)
- Loss: MultipleNegativesRankingLoss (InfoNCE with in-batch negatives, temperature 1/20)
- Training data: 308 (anchor, positive) pairs built from:
  - The full POPIA Act text (114 sections, extracted from the official PDF)
  - The labelled scenarios in data/popia_seeds*.jsonl and data/popia_paraphrases*.jsonl
  - The clause hypotheses in those same files
- Schedule: 6 epochs, AdamW lr 2e-5, linear warmup over 10% of steps, batch 16
- Compute: NVIDIA GTX 1650 (4 GB), ~45 seconds total
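With in-batch negatives, each anchor's own positive is the correct class and every other positive in the batch acts as a negative. The dependency-free sketch below shows that objective, assuming cosine similarity scaled by 20 (the inverse of the stated temperature 1/20). It illustrates the loss only and is not the training code used for this model. The function names and toy vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_infonce(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target class for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every positive in the batch.
        logits = [scale * cosine(anchor, pos) for pos in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs: correctly aligned versus deliberately swapped positives.
aligned = in_batch_infonce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_infonce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.4f}, swapped pairs: {swapped:.4f}")
```

The loss is near zero when each anchor is closest to its own positive and grows as other in-batch positives score higher. This is why a batch size of 16 also sets the number of negatives each anchor sees per step.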
Intended use
- Primary: retrieving the POPIA section text relevant to a given compliance scenario or query, e.g. for RAG pipelines that need to reason about POPIA.
- Secondary: clustering / similarity over SA-compliance documents, dataset deduplication, weak-supervision labelling for downstream classifiers.
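For the deduplication use case, one simple approach is a greedy filter: keep a text only if its cosine similarity to every already-kept text is below a threshold. The sketch below works on pre-computed, L2-normalised embeddings given as plain lists (for example from `embed(texts).tolist()`). The `0.95` threshold and the function name `greedy_dedup` are assumptions to be tuned, not values from this model card.

```python
def greedy_dedup(embeddings, threshold=0.95):
    """Return indices of items kept after dropping near-duplicates.

    embeddings: list of L2-normalised vectors, so a dot product is a cosine similarity.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        # An item is a duplicate if it is too similar to any item already kept.
        is_duplicate = any(
            sum(a * b for a, b in zip(vec, embeddings[j])) >= threshold
            for j in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept


# Toy unit vectors: items 0 and 1 are nearly identical, item 2 is orthogonal to item 0.
toy = [[1.0, 0.0], [0.9995, 0.0316], [0.0, 1.0]]
print(greedy_dedup(toy))
```

The filter is quadratic in the number of texts, which is fine for small compliance corpora. The first occurrence of each near-duplicate group is the one that survives.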
Limitations
- English only: POPIA materials are predominantly English. Multilingual coverage of South Africa's other 10 official languages is future work.
- POPIA-focused corpus: training data is concentrated on POPIA Act text. Coverage of FSCA AI guidance, SARB circulars, Treasury directives, and Information Regulator media statements is planned for v2.
- Small base model: 33M parameters. A larger base (bge-base-en-v1.5, ~110M) would likely give a few more points of recall but is slower and less deployable.
- Not a legal index: verdict bias still requires human review. Use this for routing and recall, not as the final word on which clause applies.
Roadmap
- v2 corpus expansion: add FSCA AI report (Nov 2025), SARB circulars, IR media statements, and SA Treasury procurement guidance to the training pairs.
- GDPR sibling: the same recipe applied to GDPR articles + ECJ decisions, scheduled before EU AI Act Art. 50 binds (2 Aug 2026).
- Bench: the SA Compliance Retrieval Bench, released alongside POPIA-Bench v1.
License
Apache-2.0 for both code and model weights. Free for commercial use.
Citation
@misc{eland2026sacompliance,
author = {Eland, Akhona},
title = {sa-compliance-embeddings-v1: A POPIA-Grounded Sentence Embedding Model for South African Compliance Retrieval},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/labrat-aiko/sa-compliance-embeddings-v1}
}
Sibling artefacts
- labrat-aiko/nli-popia-v2: clause-level NLI judge over the same 10 POPIA clauses
- labrat-aiko/nli-popia-v1: original 7-clause NLI judge
- semantix-ai: Python library for output-side compliance validation