code-daemon-denoise-v1

A tiny, fast bilingual (EN + RU) word denoiser — it decides whether a single word form is a meaningful technical term (keep) or noise / ballast (drop). It ships with the UltraCode MCP server, where it runs in the knowledge-graph pipeline: classifying the UNKNOWN word forms harvested from a codebase's docs/identifiers so the search vocabulary stays clean.

Frozen encoder: intfloat/multilingual-e5-small (XLM-RoBERTa, 384-dim), no weight changes. Mean-pooling + L2-norm are baked into the graph.
Trained linear head: a logistic-regression probe (scikit-learn) over the 384-dim embedding, folded with its input scaler into a single affine P(keep) = sigmoid(w·e + b). Ships as denoise_head.json ({dim, w[384], b, strip_threshold}), so no Python is needed at runtime; the daemon does the dot product in-process.
Vocab-pruned: the 250k-token SentencePiece vocab is cut by character class to Latin + Cyrillic + punctuation (142k tokens), lossless for EN + RU, dropping the INT8 weights from 121 MB to 76 MB. The pruned-vocab id map is folded into a remap-Gather at the model input.
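
The pruning and remap step can be illustrated with a toy vocabulary. This is only a sketch under assumptions, not the actual build tooling: the character test via unicodedata, the SPECIALS set, the toy vocab, and the choice to send dropped ids to the unknown-token id are all hypothetical.

```python
import unicodedata

SPECIALS = {"<unk>", "<s>", "</s>"}   # assumed special pieces, always kept
WORD_MARK = "\u2581"                  # SentencePiece word-boundary marker

def char_ok(ch):
    """True for Latin or Cyrillic letters, punctuation, and the word marker."""
    if ch == WORD_MARK:
        return True
    if unicodedata.category(ch).startswith("P"):
        return True
    name = unicodedata.name(ch, "")
    return name.startswith("LATIN") or name.startswith("CYRILLIC")

def build_remap(vocab, unk_old_id=0):
    """Return (kept old ids, remap table old id -> new id)."""
    kept = [i for i, piece in enumerate(vocab)
            if piece in SPECIALS or all(char_ok(c) for c in piece)]
    new_id = {old: new for new, old in enumerate(kept)}
    unk_new = new_id[unk_old_id]
    # Assumption: ids of dropped pieces fall back to the unknown token.
    remap = [new_id.get(old, unk_new) for old in range(len(vocab))]
    return kept, remap

# Toy vocabulary: the CJK and Greek pieces are outside the kept classes.
vocab = ["<unk>", "<s>", "</s>", "▁the", "▁при", "вет",
         "▁你好", "▁mutex", ",", "▁λόγος"]
kept, remap = build_remap(vocab)

old_ids = [3, 6, 7, 9]
new_ids = [remap[i] for i in old_ids]   # the Gather: a plain table lookup
```

Because the table is indexed by the original id, callers can keep feeding unpruned SentencePiece ids, and only the rows listed in kept need to stay in the embedding matrix.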

How it was made

Encoder: export the frozen mE5-small to ONNX with mean-pool + L2-norm fused, prune the embedding table to the kept character classes, and PTQ-quantize to INT8 (NNCF) for OpenVINO.
Head: embed a bilingual word-label set (EN: WordNet/BNC mid-frequency lemmas; RU: Taiga/OpenCorpora/Nerus mid-Zipf) plus per-language manual gold, fit LogisticRegression(class_weight="balanced"), then fold StandardScaler + LR into one (w, b). A strip_threshold (default 0.95) trades strip precision vs recall.
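
Folding works because standardisation and the logistic-regression logit are both affine, so their composition is again one affine map. A minimal sketch of that algebra, assuming the scaler exposes a per-dimension mean and scale and the classifier a coefficient vector and intercept (as scikit-learn's StandardScaler and LogisticRegression do); the 3-dim numbers are toy stand-ins for the real 384-dim values:

```python
import math

def fold_scaler_into_lr(mean, scale, coef, intercept):
    """Fold z = (e - mean) / scale, then coef.z + intercept, into w.e + b."""
    w = [c / s for c, s in zip(coef, scale)]
    b = intercept - sum(c * m / s for c, m, s in zip(coef, mean, scale))
    return w, b

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy values, purely illustrative.
mean = [0.1, -0.2, 0.05]
scale = [0.5, 2.0, 1.5]
coef = [1.2, -0.7, 0.3]
intercept = -0.4
e = [0.3, 0.1, -0.6]          # one embedding

w, b = fold_scaler_into_lr(mean, scale, coef, intercept)

# Two-step logit: scale first, then apply the classifier.
two_step = sum(c * (x - m) / s
               for c, x, m, s in zip(coef, e, mean, scale)) + intercept
# Folded logit: a single dot product plus bias.
folded = sum(wi * xi for wi, xi in zip(w, e)) + b

p_two_step = sigmoid(two_step)
p_folded = sigmoid(folded)    # agrees with p_two_step up to rounding
```

The folded (w, b) pair is the kind of object denoise_head.json stores, which is why the runtime needs only a dot product and a sigmoid.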

Words are embedded with a fixed "vocab: " prefix (the daemon prefixes every candidate word the same way) so that very short inputs are not dropped by batch de-duplication. The head is trained on the prefixed embeddings, so standalone use must reproduce the prefix exactly.

Built for speed

Short, single-word inputs — one length bucket only: batch 64 × seq 40 (-s_…_b64_s40).
INT8 weights (OpenVINO CPU); the embedding mean-pool + L2-norm are fused into the graph so the output is already [batch, 384].
CPU-first by design — on the daemon it runs on OpenVINO CPU and is moved to a discrete GPU (TensorRT / TVM) only when the card is large (≥12 GB total VRAM) with free room.
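
To make the fixed bucket and the fused pooling concrete, here is a pure-Python sketch of padding a batch to 64 × 40 and of masked mean-pooling followed by L2 normalisation. It is an illustration, not the daemon's code: pad id 0 is taken from the usage snippet in this card, while the use of all-padding filler rows for short batches is an assumption.

```python
import math

BATCH, SEQ, PAD_ID = 64, 40, 0   # bucket from the text; pad=0 as in the usage snippet

def pad_to_bucket(token_rows):
    """Pad up to BATCH rows of token ids to a fixed [BATCH, SEQ] grid plus mask."""
    if len(token_rows) > BATCH:
        raise ValueError("split the input into chunks of at most 64 words")
    rows = [list(r[:SEQ]) for r in token_rows]      # rows assumed already truncated
    rows += [[] for _ in range(BATCH - len(rows))]  # filler rows (assumption)
    ids = [r + [PAD_ID] * (SEQ - len(r)) for r in rows]
    mask = [[1] * len(r) + [0] * (SEQ - len(r)) for r in rows]
    return ids, mask

def masked_mean_l2(token_vectors, mask_row):
    """Mean over unmasked token vectors, then scale to unit L2 norm."""
    dim = len(token_vectors[0])
    n = max(sum(mask_row), 1)                       # avoid dividing by zero
    mean = [sum(v[d] * m for v, m in zip(token_vectors, mask_row)) / n
            for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

ids, mask = pad_to_bucket([[2, 17, 23, 3], [2, 99, 3]])

# Three 2-dim token vectors; the last one is padding and is ignored.
pooled = masked_mean_l2([[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

With pooling and normalisation inside the graph, the caller never sees per-token outputs; results for filler rows would simply be discarded.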

Intended use

Per-word "is this a technical term?" classification for cleaning a search vocabulary. Encode a word (with the "vocab: " prefix) with the bundled SentencePiece + mE5-small, then apply the linear head:

import onnxruntime as ort, sentencepiece as spm, numpy as np, json

sp   = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
with open("denoise_head.json", encoding="utf-8") as f:
    head = json.load(f)                               # {dim, w[dim], b, strip_threshold}
w, b, thr = np.array(head["w"], np.float32), head["b"], head["strip_threshold"]

def p_keep(words, max_len=40):
    toks = [[2, *sp.encode("vocab: " + x)[: max_len - 2], 3] for x in words]   # bos … eos
    L    = max(len(t) for t in toks)
    ids  = np.array([t + [0] * (L - len(t)) for t in toks], dtype=np.int64)    # pad=0
    mask = (ids != 0).astype(np.int64)
    emb  = sess.run(None, {"input_ids": ids, "attention_mask": mask})[0]       # mean-pooled+L2 [B,384]
    return 1.0 / (1.0 + np.exp(-(emb @ w + b)))                                # P(keep)

scores = p_keep(["mutex", "tensorrt", "пожалуйста", "asdfgh"])
# keep where score >= thr ; the rest is ballast
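
The final thresholding step can be written as a small helper. This sketch follows the comment above (keep where the score is at least the threshold); the helper name is hypothetical and the scores are made-up numbers for illustration, not model output.

```python
def split_by_threshold(words, scores, thr):
    """Partition words into (keep, drop) by comparing each score to thr."""
    keep, drop = [], []
    for word, score in zip(words, scores):
        (keep if score >= thr else drop).append(word)
    return keep, drop

words = ["mutex", "tensorrt", "пожалуйста", "asdfgh"]
scores = [0.99, 0.97, 0.10, 0.40]   # illustrative values only
keep, drop = split_by_threshold(words, scores, thr=0.95)
```

In real use, scores would be the array returned by p_keep and thr the strip_threshold value read from denoise_head.json.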

What's in this repo

Pre-compiled, ready-to-run engines named per runtime × GPU arch × OS (single s bucket):

OpenVINO *_ov_cpu_int8_b64_s40.{xml,bin} — Intel/AMD/any CPU, INT8 (the default lane).
TensorRT *_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine — NVIDIA, BF16 (optional GPU lane; the INT8 lane is OV CPU — this remap-baked SentencePiece ONNX isn't compatible with generic INT8 PTQ).
TVM *_b64_s40_{win_x64,…}_tvm_vulkan.{dll,so} — Vulkan fallback (optional GPU lane).
Head — denoise_head.json (the trained affine; required).
Tokenizer — sentencepiece.bpe.model (+ tokenizer_config.json). The daemon feeds raw SentencePiece ids; the fairseq +1 offset and pruned-vocab remap are baked into the ONNX.
ONNX source — model.onnx (FP32, pruned, mean-pool + L2-norm + remap fused) — the build source for the TRT/TVM engines and for standalone onnxruntime use.
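
Choosing between the lanes listed above follows the CPU-first rule described under "Built for speed". A hedged sketch of such a policy: the function name, the lane labels, and the min_free_gb default are hypothetical, since the card only says the card must be large (≥12 GB total VRAM) "with free room".

```python
def choose_lane(total_vram_gb=None, free_vram_gb=0.0,
                min_total_gb=12.0, min_free_gb=2.0):
    """Return "gpu" only for a large card with free room, else "ov_cpu"."""
    if total_vram_gb is None:            # no discrete GPU detected
        return "ov_cpu"
    if total_vram_gb >= min_total_gb and free_vram_gb >= min_free_gb:
        return "gpu"
    return "ov_cpu"

lane_no_gpu = choose_lane()
lane_small_card = choose_lane(total_vram_gb=8, free_vram_gb=6)
lane_big_card = choose_lane(total_vram_gb=16, free_vram_gb=8)
lane_big_but_full = choose_lane(total_vram_gb=24, free_vram_gb=0.5)
```

Defaulting to the CPU lane keeps the denoiser from competing for VRAM with larger models on the same machine.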

Evaluation

On a frozen held-out word set (EN + RU): SAFE F1 ≈ 0.79, BALLAST F1 ≈ 0.84, strip precision ≈ 0.88 at strip_threshold = 0.95. The INT8 vocab-pruned build matches the full-vocab FP build (F1 0.79 vs 0.79) at 38% of the size.
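
For readers who want to reproduce metrics of this kind on their own labelled words, per-class precision, recall, and F1 can be computed as below. This is a generic sketch: the toy labels are invented, and reading "strip precision" as precision on the BALLAST class is an assumption, not something the card states.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels, purely illustrative.
y_true = ["SAFE", "SAFE", "BALLAST", "BALLAST", "BALLAST", "SAFE"]
y_pred = ["SAFE", "BALLAST", "BALLAST", "BALLAST", "SAFE", "SAFE"]

ballast_p, ballast_r, ballast_f1 = precision_recall_f1(y_true, y_pred, "BALLAST")
safe_p, safe_r, safe_f1 = precision_recall_f1(y_true, y_pred, "SAFE")
```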

License & attribution

The encoder weights are intfloat/multilingual-e5-small (Apache-2.0), redistributed here in compiled form unchanged; this repo is therefore released under Apache-2.0. The linear head and the build/quantization tooling are original to UltraCode. Backbone: XLM-RoBERTa. Not legal advice.
