MERIT-XS v2 (Research Preview)

MERIT-XS is a compact moderation encoder developed by Meridian Safety.

Changes in v2 relative to v1:

  • Multi-head taxonomy (8 production heads) replacing the single binary toxicity head
  • Calibrated per-head thresholds from dev F1 sweep
  • Unicode/leetspeak evasion normaliser included
  • Updated encoder with TPU-optimised attention (chunked local attention, no unfold)
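
The per-head thresholds are described as coming from a dev-set F1 sweep. The following is a minimal sketch of what such a sweep can look like, not the procedure actually used for MERIT-XS. It assumes a candidate grid in steps of 0.05 and a strict `score > threshold` decision rule; the names `f1_at_threshold` and `sweep_threshold` and the toy data are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for binary labels when predicting positive if score > threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_threshold(scores, labels, step=0.05):
    """Return (best_threshold, best_f1) over a grid strictly inside (0, 1)."""
    n_steps = int(round(1 / step))
    grid = [round(i * step, 2) for i in range(1, n_steps)]
    # max() keeps the first (lowest) threshold among ties.
    best = max(grid, key=lambda t: f1_at_threshold(scores, labels, t))
    return best, f1_at_threshold(scores, labels, best)


# Toy dev data for a single head (hypothetical scores and labels).
dev_scores = [0.10, 0.20, 0.45, 0.55, 0.80, 0.90]
dev_labels = [0, 0, 1, 0, 1, 1]
best_t, best_f1 = sweep_threshold(dev_scores, dev_labels)
print(best_t, round(best_f1, 3))
```

In a multi-head setting the sweep would be run once per head, which is why each head can end up with a different threshold.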

Included files

File                        Description
merit_xs_multitask_v2.pt    Multi-head moderation bundle (encoder + 10 heads)
infer_merit_xs.py           CLI inference entrypoint
load_merit_xs.py            Python loader (multi-head API)
evasion_normaliser.py       Unicode homoglyph / leetspeak normaliser
metrics_summary.json        Per-head dev/test AUROC, F1, calibrated thresholds
merit/                      Local model package
assets/tokenizers/merit/    DeBERTa v3 tokenizer (128k vocab)

Setup

pip install -r requirements.txt

CLI usage

# One or more messages (repeat --text for each message)
python infer_merit_xs.py \
  --checkpoint merit_xs_multitask_v2.pt \
  --text "you are awful" \
  --text "send me a picture"

# Batch from file
python infer_merit_xs.py \
  --checkpoint merit_xs_multitask_v2.pt \
  --input-file messages.txt \
  --output-file results.jsonl
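
The batch output can then be post-processed in Python. The sketch below assumes that results.jsonl holds one JSON object per line following the output schema documented in this card; that file layout is an assumption, and `iter_flagged` is a hypothetical helper, not part of the package.

```python
import json


def iter_flagged(lines):
    """Yield parsed records whose "flagged" field is true, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("flagged"):
            yield record


# Stand-in for the contents of results.jsonl (hypothetical values).
sample_lines = [
    '{"text": "you are awful", "scores": {"toxicity": 0.91}, "flags": ["toxicity"], "flagged": true}',
    "",
    '{"text": "what time does school end", "scores": {"toxicity": 0.02}, "flags": [], "flagged": false}',
]
flagged = list(iter_flagged(sample_lines))
for record in flagged:
    print(record["text"], record["flags"])
```

With a real file, pass an open handle instead of the sample list, for example `iter_flagged(open("results.jsonl", encoding="utf-8"))`.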

Python usage

from load_merit_xs import load_merit_xs

model = load_merit_xs()  # auto-detects merit_xs_multitask_v2.pt
results = model.predict([
    "you are awful",
    "send me a pic, just us",
    "what time does school end",
])
for r in results:
    print(r)

Output schema

{
    "text": str,
    "scores": {
        "toxicity": float,           # threshold 0.35
        "harassment_insult": float,  # threshold 0.40
        "threat_violence": float,    # threshold 0.30
        "identity_hate": float,      # threshold 0.35
        "sexual_explicit": float,    # threshold 0.65
        "grooming": float,           # threshold 0.60
        "prompt_injection": float,   # threshold 0.70
        "overall_risk": float,       # threshold 0.40
    },
    "flags": list[str],   # heads that exceeded their threshold
    "flagged": bool,      # any head exceeded threshold
    "evasion_score": float,  # 0.0 = clean text, 1.0 = heavy obfuscation
}
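
To make the relationship between `scores`, `flags`, and `flagged` concrete, here is a small sketch that recomputes the flags from raw scores using the thresholds listed in the schema. It assumes "exceeded" means a strict `>` comparison; `derive_flags` is a hypothetical helper, not part of load_merit_xs.py, and the example scores are made up.

```python
# Default thresholds as listed in the output schema.
THRESHOLDS = {
    "toxicity": 0.35,
    "harassment_insult": 0.40,
    "threat_violence": 0.30,
    "identity_hate": 0.35,
    "sexual_explicit": 0.65,
    "grooming": 0.60,
    "prompt_injection": 0.70,
    "overall_risk": 0.40,
}


def derive_flags(scores, thresholds=THRESHOLDS):
    """Return (flags, flagged) for a dict mapping head names to scores."""
    flags = [
        head
        for head, threshold in thresholds.items()
        if scores.get(head, 0.0) > threshold  # assumption: strict comparison
    ]
    return flags, bool(flags)


# Hypothetical scores for one message.
example_scores = {"toxicity": 0.62, "harassment_insult": 0.41, "grooming": 0.05}
flags, flagged = derive_flags(example_scores)
print(flags, flagged)

# A stricter custom policy for a single head.
strict = dict(THRESHOLDS, toxicity=0.70)
print(derive_flags(example_scores, strict))
```

Recomputing flags this way allows experimenting with different operating points without rerunning the model, since the raw scores are always returned.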

Performance (dev set, calibrated thresholds)

Head                 AUROC    F1       Threshold
toxicity             0.964    0.838    0.35
harassment_insult    0.979    0.897    0.40
threat_violence      0.938    0.682    0.30
identity_hate        0.971    0.855    0.35
sexual_explicit      0.996    0.950    0.65
grooming             0.976    0.750    0.60
prompt_injection     0.986    0.890    0.70
overall_risk         0.964    0.846    0.40
macro average        0.972    0.838    -
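
The macro-average row is the unweighted mean over the eight production heads. A quick arithmetic check from the per-head values in the table:

```python
from statistics import mean

# (AUROC, F1) per head, copied from the table.
table = {
    "toxicity": (0.964, 0.838),
    "harassment_insult": (0.979, 0.897),
    "threat_violence": (0.938, 0.682),
    "identity_hate": (0.971, 0.855),
    "sexual_explicit": (0.996, 0.950),
    "grooming": (0.976, 0.750),
    "prompt_injection": (0.986, 0.890),
    "overall_risk": (0.964, 0.846),
}

macro_auroc = mean(auroc for auroc, _ in table.values())
macro_f1 = mean(f1 for _, f1 in table.values())
print(macro_auroc, macro_f1)
```

The means come out at about 0.97175 and 0.8385, which agree with the reported 0.972 and 0.838 up to rounding of the per-head values.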

self_harm and extremism heads are present in the bundle but excluded from default output due to training data distribution issues.

Evasion normaliser

The included evasion_normaliser.py maps 600+ Unicode homoglyphs, leetspeak substitutions, diacritics, and shattered (deliberately split) words to ASCII before scoring. It is applied by default when the model is loaded through load_merit_xs.py.

model = load_merit_xs()
# evasion normaliser is on by default
results = model.predict(["h3ll0 k1d w4nn4 t4lk"], apply_evasion_normaliser=True)
print(results[0]["evasion_score"])  # > 0 indicates obfuscation detected
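
For intuition only, the sketch below shows the general idea behind this kind of normalisation on a tiny scale. It is not the code in evasion_normaliser.py: the eight-entry `LEET_MAP`, the `normalise` function, and the scoring rule (fraction of non-space characters that had to be changed) are all assumptions made for illustration.

```python
import unicodedata

# Tiny illustrative mapping; the real normaliser is described as covering 600+ entries.
LEET_MAP = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}


def normalise(text):
    """Return (normalised_text, evasion_score) using a toy substitution table."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    out = []
    changed = 0
    for ch in stripped:
        mapped = LEET_MAP.get(ch, ch)
        if mapped != ch:
            changed += 1
        out.append(mapped)
    # Removed combining marks count as changes too.
    changed += len(decomposed) - len(stripped)
    non_space = sum(1 for ch in text if not ch.isspace())
    score = min(1.0, changed / non_space) if non_space else 0.0
    return "".join(out), score


clean_text, score = normalise("h3ll0 k1d w4nn4 t4lk")
print(clean_text, round(score, 3))
```

A table this naive also rewrites genuine digits (a phone number or a time, for instance), so a practical normaliser needs context-aware rules; the sketch only illustrates the mapping and scoring idea.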

License

MERIT Research Preview License (MRPL v1.0): research, evaluation, and benchmarking use permitted. Commercial deployment and hosted/public API use require separate permission from Meridian Safety.

Current limitations

  • Message-level only: does not model conversation trajectory (use MERIT-S for multi-turn)
  • Not a production safety system: do not use as sole enforcement mechanism
  • English-primary: multilingual coverage is partial

Research note

This package is intended for research, benchmarking, representation-transfer experiments, and moderation evaluation. It is not intended for production moderation, safety-critical enforcement, or fully automated policy decisions.
