AIST-87M
AIST-87M is a compact audio + image + speech + text embedding model for
human-memory augmentation workloads.
Current published artifact: matryoshka32-prefixrot-20260625. This refresh
keeps the canonical AIST-87M name because the architecture and parameter
scale are unchanged, but the released weights now include the shared 128d
orthogonal prefix rotation used to support native 64d and 32d Matryoshka
slices. The old and new artifacts are retrieval-equivalent at 128d and above on
the local held-out SALT gate, while the new artifact improves mean R@1 at 64d and 32d.
It is the single-audio evolution of the earlier dual-audio tower line: the
runtime audio path uses one merged native mn20_as EfficientAT encoder instead
of a separate EfficientAT + Whisper dual branch. The LoRA training weights are
merged into the native audio encoder in this release artifact, so there is no
separate LoRA pass at inference time.
Core stack:
- text: MongoDB/mdbr-leaf-ir
- image: mobilenetv4_conv_medium.e180_r384_in12k
- audio: native merged mn20_as EfficientAT encoder
- projection output: 1280d
- Matryoshka slices: [1280, 768, 512, 256, 128, 64, 32]
- exact loaded params: 87,186,755
The canonical name follows the Augmem naming standard:
- AIST = audio + image + speech + text
- 87M = exact loaded parameter count rounded to integer millions
Runtime Contract
This model returns L2-normalized embeddings in a shared 1280-dimensional space. For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:
z1280 = l2norm(model(input))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
z256 = l2norm(z1280[0:256])
z128 = l2norm(z1280[0:128])
z64 = l2norm(z1280[0:64])
z32 = l2norm(z1280[0:32])
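The truncate-and-renormalize contract above can be sketched in dependency-free Python. This is a minimal illustration, not the release's loader: `matryoshka_slice` and `MATRYOSHKA_DIMS` are hypothetical names, and the `z1280` vector is a synthetic stand-in for `l2norm(model(input))`. In practice the same two steps would usually run on NumPy or PyTorch arrays.

```python
import math
from typing import Sequence

# Slice widths listed in this model card.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128, 64, 32)


def l2norm(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit L2 length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def matryoshka_slice(z1280: Sequence[float], dim: int) -> list[float]:
    """Keep the first `dim` coordinates of a 1280-d embedding, then renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported slice width: {dim}")
    if len(z1280) != 1280:
        raise ValueError("expected a 1280-d embedding")
    return l2norm(z1280[:dim])


# Synthetic stand-in for l2norm(model(input)); a real vector comes from the model.
z1280 = l2norm([math.sin(i + 1.0) for i in range(1280)])
z64 = matryoshka_slice(z1280, 64)
```

Renormalizing after truncation matters because a prefix of a unit vector generally has norm below 1, so cosine and dot-product scores would otherwise disagree.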
The release safetensors file is self-contained and includes the text encoder, image encoder, merged native audio encoder, and the three projection heads.
Evaluation Scope
This release uses a human-memory evaluation slice rather than a broad leaderboard sweep. The slice is chosen to match practical memory augmentation surfaces:
- text continuity: duplicate-question and semantic textual similarity tasks
- image recall: Flickr30k text-image and image-text retrieval
- audio recall: speech/general-audio text-audio retrieval tasks
Primary metrics:
- text continuity: main_score
- image recall: NDCG@10
- audio recall: NDCG@10
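For readers unfamiliar with the retrieval metric, the following is a minimal sketch of NDCG@10 using the standard log2 discount. It assumes the ranked list covers every relevant item for the query, so the ideal ordering can be derived from the same list; the evaluation harness used for this release may differ in details such as tie handling or graded relevance. The function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked items (rank starts at 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant items for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant item, retrieved at rank 2 instead of rank 1.
example = ndcg_at_k([0, 1, 0, 0])
```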
Human-Memory Slice
Source: aist87m_memory_slice_release_report.md and
aist87m_memory_slice_release_report.json.
| Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---|---|---|---|---|---|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
Selected 1280d task scores:
| Task | Family | Metric | Score | R@1 | R@10 |
|---|---|---|---|---|---|
| SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
| STSBenchmark | Text continuity | main_score | 0.651 | - | - |
| Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
| Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
| CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
| MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
| UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
| ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |
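The family columns and the Overall column of the 1280d row can be reproduced from the task scores above. The sketch below assumes, based only on the numbers matching, that each family score is the unweighted mean of its tasks and that Overall is the unweighted mean over all eight tasks rather than over the three families; the variable names are illustrative.

```python
from statistics import mean

# (family, 1280d score) copied from the selected task table.
task_scores = {
    "SprintDuplicateQuestions": ("text", 0.875),
    "STSBenchmark": ("text", 0.651),
    "Flickr30kT2IRetrieval": ("image", 0.469),
    "Flickr30kI2TRetrieval": ("image", 0.381),
    "CommonVoiceMini21T2ARetrieval": ("audio", 0.028),
    "MACST2ARetrieval": ("audio", 0.110),
    "UrbanSound8KT2ARetrieval": ("audio", 0.009),
    "ClothoT2ARetrieval": ("audio", 0.269),
}

# Unweighted mean per family.
family_means = {
    family: mean(s for f, s in task_scores.values() if f == family)
    for family in ("text", "image", "audio")
}

# Unweighted mean over all tasks, so the four audio tasks carry half the weight.
overall = mean(s for _, s in task_scores.values())
```

Because audio contributes four of the eight tasks, the Overall figure is pulled toward the audio recall scores.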
Matryoshka32 No-Regression Gate
Source: aist87m_salt_matryoshka32_summary_20260626.json.
The current artifact was compared against the previous published AIST-87M projection checkpoint on the local held-out SALT cached retrieval gate. Mean R@1 averages all six paired directions: audio-text, image-text, and image-audio in both directions.
| Dim | Previous mean R@1 | Current mean R@1 | Delta |
|---|---|---|---|
| 1280 | 0.193905 | 0.193905 | +0.000000 |
| 768 | 0.193072 | 0.193072 | +0.000000 |
| 512 | 0.193105 | 0.193105 | +0.000000 |
| 256 | 0.192472 | 0.192472 | +0.000000 |
| 128 | 0.190671 | 0.190671 | +0.000000 |
| 64 | 0.111089 | 0.142028 | +0.030940 |
| 32 | 0.056745 | 0.102354 | +0.045609 |
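As an illustration of the gate metric, the sketch below computes R@1 for each ordered modality pair and averages the six directions. It assumes row `i` of every modality belongs to the same item and that embeddings are already L2-normalized, so a dot product ranks by cosine similarity. The toy vectors and function names are hypothetical; the actual SALT gate script may pair items or break ties differently.

```python
from itertools import permutations
from typing import Sequence

Vec = Sequence[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def recall_at_1(queries: Sequence[Vec], gallery: Sequence[Vec]) -> float:
    """Fraction of queries whose top-scoring gallery row has the same index."""
    hits = 0
    for i, query in enumerate(queries):
        best = max(range(len(gallery)), key=lambda j: dot(query, gallery[j]))
        hits += int(best == i)
    return hits / len(queries)


def mean_r1(embeddings: dict[str, Sequence[Vec]]) -> float:
    """Average R@1 over every ordered pair of modalities (6 pairs for 3 modalities)."""
    directions = list(permutations(embeddings, 2))
    return sum(recall_at_1(embeddings[a], embeddings[b]) for a, b in directions) / len(directions)


# Toy paired data: three items, 2-d unit vectors; the third image is off-target.
toy = {
    "audio": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    "text": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    "image": [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]],
}
toy_mean_r1 = mean_r1(toy)
```

To evaluate a Matryoshka slice, each vector would first be truncated to the target width and renormalized before scoring.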
Task-Aligned Comparisons
Comparisons below are only for locally available, task-aligned runs from the same raw AIST line and its audio baselines.
| Comparison | Dim | Paired tasks | Read |
|---|---|---|---|
| vs native mn20_as audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
| vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
| vs AIST-95M | 1280 | 2 | only paired Flickr tasks are available locally; AIST-95M remains stronger on that pair |
This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model. Broad diagnostic runs contain many task families that are not part of this release gate.
Runtime Footprint vs Dual-Audio Tower
AIST-87M replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny
audio branches with one merged native mn20_as EfficientAT encoder. The result
is a smaller deployed path with the same 1280d output contract.
| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Loaded parameters | 87,186,755 | 95,315,959 | -8.5% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |
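The Delta column is a relative change against the dual-audio tower, and the audio path row is the audio encoder plus its projection head. A short check, with illustrative names and the figures copied from this card:

```python
def pct_delta(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old


AUDIO_ENCODER = 19_886_566
AUDIO_PROJECTION_HEAD = 12_306_560
audio_path = AUDIO_ENCODER + AUDIO_PROJECTION_HEAD  # audio path incl. projection

# (AIST-87M value, dual-audio tower value) per runtime surface.
rows = {
    "loaded_params": (87_186_755, 95_315_959),
    "safetensors_mb": (348.9, 381.9),
    "audio_encoder_params": (AUDIO_ENCODER, 26_117_671),
    "audio_path_params": (audio_path, 40_390_311),
    "audio_projection_input_width": (1_280, 2_304),
}

deltas = {name: round(pct_delta(new, old), 1) for name, (new, old) in rows.items()}
```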
Exact-gate tradeoff against the same dual-audio local baseline:
| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |
Measured PyTorch audio-stack throughput on an NVIDIA L4, using synthetic 10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. Median wall time is taken over 50 timed iterations after 20 warmup iterations. The measurement excludes audio file decode, dataset download, and MTEB result serialization.
| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---|---|---|---|---|---|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |
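The measurement protocol described above (20 warmup iterations, median of 50 timed iterations, throughput derived from the median) can be sketched with the standard library. This is not the benchmark script that produced the table: `benchmark` and `clips_per_second` are hypothetical helpers, and `step` stands for one full waveform-to-embedding forward pass. On a GPU, `step` would need to synchronize the device (for example with `torch.cuda.synchronize()`) before returning, otherwise only kernel launch time is measured.

```python
import time
from statistics import median
from typing import Callable


def clips_per_second(batch_size: int, median_ms: float) -> float:
    """Throughput implied by a median batch latency in milliseconds."""
    return batch_size / (median_ms / 1e3)


def benchmark(
    step: Callable[[], object],
    batch_size: int,
    clip_seconds: float = 10.0,
    warmup: int = 20,
    iters: int = 50,
) -> dict[str, float]:
    """Run `step` with warmup, then report median wall time and throughput."""
    for _ in range(warmup):
        step()
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        times_ms.append((time.perf_counter() - start) * 1e3)
    med = median(times_ms)
    clips = clips_per_second(batch_size, med)
    return {
        "median_ms": med,
        "clips_per_s": clips,
        "audio_s_per_s": clips * clip_seconds,  # seconds of audio embedded per second
    }


# Placeholder workload standing in for a real batched forward pass.
result = benchmark(lambda: sum(range(10_000)), batch_size=8)
```

The median is less sensitive than the mean to occasional slow iterations, which is why it is a common choice for latency tables like the one above.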
Projection-only throughput at feature batch 2048 is also higher for the
single-audio path: 314k features/s for AIST-87M vs 282k features/s for the
dual-audio tower. Raw benchmark output is included as
aist87m_vs_dual_audio_throughput_l4_20260504.json.
Architecture
Text -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280
The audio encoder in this artifact is the merged native checkpoint:
mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
Parameter Count
| Component | Params |
|---|---|
| Text encoder (MongoDB/mdbr-leaf-ir) | 22,861,056 |
| Image encoder (mobilenetv4_conv_medium.e180_r384_in12k) | 8,502,493 |
| Audio encoder (merged native mn20_as) | 19,886,566 |
| Image projection head | 12,306,560 |
| Audio projection head | 12,306,560 |
| Text projection head | 11,323,520 |
| Total exact loaded params | 87,186,755 |
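The component counts sum exactly to the stated total, and rounding that total to integer millions gives the `87M` suffix in the canonical name. A quick check, with dictionary keys chosen here for illustration:

```python
# Parameter counts copied from the table above.
components = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_502_493,
    "audio_encoder": 19_886_566,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 12_306_560,
    "text_projection_head": 11_323_520,
}

total = sum(components.values())
size_label = f"{round(total / 1e6)}M"  # integer millions, as in the naming standard
```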
Files
| File | Purpose |
|---|---|
| AIST-87M.safetensors | Self-contained release artifact |
| aist_81m_raw_mn20_lora.yaml | Training recipe for the source run |
| manifest.json | Release manifest with checksums and eval coverage |
| parameter_breakdown.json | Exact parameter accounting |
| aist87m_memory_slice_release_report.md | Human-memory slice report |
| aist87m_memory_slice_release_report.json | Machine-readable evaluation summary |
| aist87m_vs_dual_audio_throughput_l4_20260504.json | L4 throughput benchmark vs dual-audio tower |
| aist87m_salt_published_retrieval_20260626.json | Previous published AIST-87M SALT gate output |
| aist87m_salt_matryoshka32_retrieval_20260626.json | Current artifact SALT gate output |
| aist87m_salt_matryoshka32_summary_20260626.json | SALT no-regression summary |
Caveats
- The model is optimized and reported for memory-relevant embedding surfaces, not broad leaderboard coverage.
- The single-audio path is smaller and simpler than the dual-audio tower, but it does not dominate the dual-audio tower on paired diagnostic scores.
- 1280d, 768d, and 512d human-memory slices are complete for the release checkpoint.
- Consumers should treat this hash as a new embedding-space artifact and regenerate caches rather than mixing embeddings from earlier AIST-87M hashes.
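One way to follow the last caveat is to key every embedding cache by a hash of the weights file, so vectors from different artifacts can never share a namespace. The sketch below is an assumption about how a consumer might do this, not part of the release: `artifact_hash` and `cache_namespace` are hypothetical helpers, and the hash is computed directly from the file rather than read from `manifest.json`, whose schema is not described here.

```python
import hashlib
from pathlib import Path


def artifact_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, streamed in chunks so large weights are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_namespace(weights_path: Path, dim: int) -> str:
    """Cache key that changes whenever the weights file or the slice width changes."""
    return f"aist87m-{artifact_hash(weights_path)[:16]}-d{dim}"
```

A call such as `cache_namespace(Path("AIST-87M.safetensors"), 64)` would then name the cache directory or index, and a refreshed artifact would automatically map to a new, empty namespace.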