AIST-87M

AIST-87M is a compact audio + image + speech + text embedding model for human-memory augmentation workloads.

Current published artifact: matryoshka32-prefixrot-20260625. This refresh keeps the canonical AIST-87M name because the architecture and parameter scale are unchanged. The released weights now include the shared 128d orthogonal prefix rotation that supports native 64d and 32d Matryoshka slices. On the local held-out SALT gate, the old and new artifacts are retrieval-equivalent at 128d and above, while the new artifact improves mean R@1 at 64d (+0.031) and 32d (+0.046).

It is the single-audio evolution of the earlier dual-audio tower line: the runtime audio path uses one merged native mn20_as EfficientAT encoder instead of a separate EfficientAT + Whisper dual branch. The LoRA training weights are merged into the native audio encoder in this release artifact, so there is no separate LoRA pass at inference time.

Core stack:

  • text: MongoDB/mdbr-leaf-ir
  • image: mobilenetv4_conv_medium.e180_r384_in12k
  • audio: native merged mn20_as EfficientAT encoder
  • projection output: 1280d
  • Matryoshka slices: [1280, 768, 512, 256, 128, 64, 32]
  • exact loaded params: 87,186,755

The canonical name follows the Augmem naming standard:

  • AIST = audio + image + speech + text
  • 87M = exact loaded parameter count rounded to integer millions

Runtime Contract

This model returns L2-normalized embeddings in a shared 1280-dimensional space. For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:

z1280 = l2norm(model(input))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
z256  = l2norm(z1280[0:256])
z128  = l2norm(z1280[0:128])
z64   = l2norm(z1280[0:64])
z32   = l2norm(z1280[0:32])
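
The truncate-and-renormalize contract above can be sketched in plain Python. This is a minimal illustration rather than the release's own code: `l2norm`, `matryoshka_slice`, and `cosine` are hypothetical helper names, and the input is assumed to be a 1280-d vector already produced by the model.

```python
import math

# Slice widths listed in the core stack.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128, 64, 32)


def l2norm(vec):
    """Scale a vector to unit L2 length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def matryoshka_slice(z1280, dim):
    """Keep the first `dim` coordinates of a full embedding and renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported slice width: {dim}")
    if len(z1280) != 1280:
        raise ValueError("expected a 1280-d embedding")
    return l2norm(z1280[:dim])


def cosine(a, b):
    """For unit vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))
```

Queries and stored items need to be cut to the same width before they are compared; a 64d query scored against 128d gallery vectors is not meaningful.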

The release safetensors file is self-contained and includes the text encoder, image encoder, merged native audio encoder, and the three projection heads.
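
One way to check what a self-contained safetensors file holds, without loading any weights, is to read its header: the format starts with an 8-byte little-endian header length followed by a JSON table of tensor names, dtypes, and shapes. The sketch below uses only the standard library. Grouping by the first dotted component of each tensor name is an assumption for illustration; the actual key prefixes inside AIST-87M.safetensors are not documented here, and a shape-based total may differ from the "exact loaded params" figure if that figure counts buffers differently.

```python
import json
import struct
from collections import Counter
from math import prod


def read_safetensors_header(path):
    """Return the tensor table from a safetensors file without reading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def params_by_prefix(header):
    """Sum element counts per top-level name prefix (hypothetical grouping)."""
    counts = Counter()
    for name, info in header.items():
        prefix = name.split(".", 1)[0]
        counts[prefix] += prod(info["shape"])  # prod([]) == 1 for scalars
    return counts


# Example call (path taken from the Files list; not executed here):
#   counts = params_by_prefix(read_safetensors_header("AIST-87M.safetensors"))
#   print(sum(counts.values()), dict(counts))
```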

Evaluation Scope

This release uses a human-memory evaluation slice rather than a broad leaderboard sweep. The slice is chosen to match practical memory augmentation surfaces:

  • text continuity: duplicate-question and semantic textual similarity tasks
  • image recall: Flickr30k text-image and image-text retrieval
  • audio recall: speech/general-audio text-audio retrieval tasks

Primary metrics:

  • text continuity: main_score
  • image recall: NDCG@10
  • audio recall: NDCG@10

Human-Memory Slice

Source: aist87m_memory_slice_release_report.md and aist87m_memory_slice_release_report.json.

| Dim | Tasks | Text continuity | Image recall | Audio recall | Overall |
|---|---|---|---|---|---|
| 1280 | 8 / 8 | 0.763 | 0.425 | 0.104 | 0.349 |
| 768 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |
| 512 | 8 / 8 | 0.762 | 0.424 | 0.104 | 0.349 |

Selected 1280d task scores:

| Task | Family | Metric | Score | R@1 | R@10 |
|---|---|---|---|---|---|
| SprintDuplicateQuestions | Text continuity | main_score | 0.875 | - | - |
| STSBenchmark | Text continuity | main_score | 0.651 | - | - |
| Flickr30kT2IRetrieval | Image recall | NDCG@10 | 0.469 | 0.296 | 0.672 |
| Flickr30kI2TRetrieval | Image recall | NDCG@10 | 0.381 | 0.082 | 0.407 |
| CommonVoiceMini21T2ARetrieval | Audio recall | NDCG@10 | 0.028 | 0.006 | 0.062 |
| MACST2ARetrieval | Audio recall | NDCG@10 | 0.110 | 0.033 | 0.214 |
| UrbanSound8KT2ARetrieval | Audio recall | NDCG@10 | 0.009 | 0.002 | 0.018 |
| ClothoT2ARetrieval | Audio recall | NDCG@10 | 0.269 | 0.128 | 0.443 |
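
The 1280d aggregate row appears to follow from these task scores by simple averaging: each family column matches the mean of its tasks, and Overall matches the unweighted mean over all eight tasks (not the mean of the three family columns). That reading is inferred from the published, rounded numbers and is not stated by the report, so treat the sketch below as a consistency check under that assumption.

```python
from statistics import mean

# 1280d task scores copied from the table above.
SCORES = {
    "Text continuity": {
        "SprintDuplicateQuestions": 0.875,
        "STSBenchmark": 0.651,
    },
    "Image recall": {
        "Flickr30kT2IRetrieval": 0.469,
        "Flickr30kI2TRetrieval": 0.381,
    },
    "Audio recall": {
        "CommonVoiceMini21T2ARetrieval": 0.028,
        "MACST2ARetrieval": 0.110,
        "UrbanSound8KT2ARetrieval": 0.009,
        "ClothoT2ARetrieval": 0.269,
    },
}


def family_means(scores):
    """Mean score per task family."""
    return {family: mean(tasks.values()) for family, tasks in scores.items()}


def overall_mean(scores):
    """Unweighted mean over every task, regardless of family."""
    return mean(s for tasks in scores.values() for s in tasks.values())
```

Under this reading, the four audio tasks carry half of the Overall weight, which is why the overall figure sits well below the text and image columns.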

Matryoshka32 No-Regression Gate

Source: aist87m_salt_matryoshka32_summary_20260626.json.

The current artifact was compared against the previous published AIST-87M projection checkpoint on the local held-out SALT cached retrieval gate. Mean R@1 averages all six paired directions: audio-text, image-text, and image-audio in both directions.

| Dim | Previous mean R@1 | Current mean R@1 | Delta |
|---|---|---|---|
| 1280 | 0.193905 | 0.193905 | +0.000000 |
| 768 | 0.193072 | 0.193072 | +0.000000 |
| 512 | 0.193105 | 0.193105 | +0.000000 |
| 256 | 0.192472 | 0.192472 | +0.000000 |
| 128 | 0.190671 | 0.190671 | +0.000000 |
| 64 | 0.111089 | 0.142028 | +0.030940 |
| 32 | 0.056745 | 0.102354 | +0.045609 |
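
As an illustration of the metric, a mean R@1 over six directions can be sketched as follows. This is a toy version under stated assumptions: row i of each modality is taken to be the same underlying item, vectors are already L2-normalized, and ties go to the lowest index. The real gate's candidate pools and tie handling are not described here, and `recall_at_1` and `mean_r1` are hypothetical names.

```python
def recall_at_1(queries, gallery):
    """Fraction of queries whose top-scoring gallery row is their own pair.

    Row i of `queries` is assumed to be paired with row i of `gallery`.
    """
    hits = 0
    for i, q in enumerate(queries):
        sims = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += best == i
    return hits / len(queries)


def mean_r1(audio, image, text):
    """Average R@1 over audio-text, image-text, and image-audio, both ways."""
    directions = [
        (audio, text), (text, audio),
        (image, text), (text, image),
        (image, audio), (audio, image),
    ]
    return sum(recall_at_1(q, g) for q, g in directions) / len(directions)
```

To evaluate a Matryoshka width, each embedding would first be truncated and renormalized to that width before being passed in.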

Task-Aligned Comparisons

Comparisons below are only for locally available, task-aligned runs from the same raw AIST line and its audio baselines.

| Comparison | Dim | Paired tasks | Read |
|---|---|---|---|
| vs native mn20_as audio baseline | 768 | 4 | slightly lower selected audio recall on average; UrbanSound8K is flat |
| vs dual-audio tower | 768 | 6 | smaller single-audio runtime, but lower paired text/image/audio scores |
| vs AIST-95M | 1280 | 2 | only paired Flickr tasks are available locally; AIST-95M remains stronger on that pair |

This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model. Broad diagnostic runs contain many task families that are not part of this release gate.

Runtime Footprint vs Dual-Audio Tower

AIST-87M replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny audio branches with one merged native mn20_as EfficientAT encoder. The result is a smaller deployed path with the same 1280d output contract.

| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Loaded parameters | 87,186,755 | 95,315,959 | -8.5% |
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
| Audio encoders | 1 | 2 | removes Whisper branch |
| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
| Audio projection input width | 1,280 | 2,304 | -44.4% |
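
The percentage deltas above are relative changes against the dual-audio tower. A short sketch that recomputes them from the table values is given below; `relative_delta` and `format_deltas` are hypothetical helper names.

```python
def relative_delta(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# (AIST-87M, AIST-95M dual-audio tower) pairs copied from the table above.
ROWS = {
    "Loaded parameters": (87_186_755, 95_315_959),
    "Safetensors artifact (MB)": (348.9, 381.9),
    "Audio encoder parameters": (19_886_566, 26_117_671),
    "Audio path parameters incl. projection": (32_193_126, 40_390_311),
    "Audio projection input width": (1_280, 2_304),
}


def format_deltas(rows):
    """Format each relative change with a sign and one decimal place."""
    return {
        name: f"{relative_delta(new, old):+.1f}%"
        for name, (new, old) in rows.items()
    }
```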

Exact-gate tradeoff against the same dual-audio local baseline:

| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|---|---|---|---|
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |

PyTorch audio-stack throughput was measured on an NVIDIA L4 using synthetic 10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. Each median wall time is taken over 50 timed iterations after 20 warmup iterations. The measurement excludes audio file decode, dataset download, and MTEB result serialization.

| Batch | AIST-87M median ms | AIST-87M throughput | AIST-95M median ms | AIST-95M throughput | Speedup |
|---|---|---|---|---|---|
| 1 | 5.36 | 186.7 clips/s; 1,867 audio-s/s | 10.50 | 95.2 clips/s; 952 audio-s/s | 1.96x |
| 8 | 16.46 | 486.0 clips/s; 4,860 audio-s/s | 60.29 | 132.7 clips/s; 1,327 audio-s/s | 3.66x |
| 16 | 41.19 | 388.5 clips/s; 3,885 audio-s/s | 133.95 | 119.4 clips/s; 1,194 audio-s/s | 3.25x |
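
The measurement protocol (warmup, then a median over timed iterations) and the conversion from latency to throughput can be sketched as below. This is a generic stdlib harness, not the benchmark script behind the published JSON; `median_wall_ms`, `throughput`, and `speedup` are hypothetical names, and for GPU work the timed callable would need to block until the device has finished.

```python
import statistics
import time


def median_wall_ms(fn, warmup=20, iters=50):
    """Median wall time of `fn()` in milliseconds after untimed warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()  # must not return before the work is actually complete
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def throughput(batch_size, median_ms, clip_seconds=10.0):
    """Convert per-batch latency into (clips/s, audio-seconds/s)."""
    clips_per_s = batch_size / (median_ms / 1000.0)
    return clips_per_s, clips_per_s * clip_seconds


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms
```

For the batch-8 row, `throughput(8, 16.46)` gives roughly 486 clips/s and `speedup(60.29, 16.46)` roughly 3.66. The other rows agree to within the rounding of the published latencies.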

Projection-only throughput at feature batch 2048 is also higher for the single-audio path: 314k features/s for AIST-87M vs 282k features/s for the dual-audio tower. Raw benchmark output is included as aist87m_vs_dual_audio_throughput_l4_20260504.json.

Architecture

Text   -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio  -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280

The audio encoder in this artifact is the merged native checkpoint:

mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt

Parameter Count

| Component | Params |
|---|---|
| Text encoder (MongoDB/mdbr-leaf-ir) | 22,861,056 |
| Image encoder (mobilenetv4_conv_medium.e180_r384_in12k) | 8,502,493 |
| Audio encoder (merged native mn20_as) | 19,886,566 |
| Image projection head | 12,306,560 |
| Audio projection head | 12,306,560 |
| Text projection head | 11,323,520 |
| Total exact loaded params | 87,186,755 |
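
The component counts above sum exactly to the stated total, and rounding that total to integer millions gives the 87M in the model name. A small check, with `total_params` and `canonical_name` as hypothetical helper names:

```python
# Component counts copied from the table above.
COMPONENT_PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_502_493,
    "audio_encoder": 19_886_566,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 12_306_560,
    "text_projection_head": 11_323_520,
}


def total_params(components):
    """Sum the per-component parameter counts."""
    return sum(components.values())


def canonical_name(prefix, n_params):
    """Round a parameter count to integer millions, e.g. AIST-87M."""
    return f"{prefix}-{round(n_params / 1_000_000)}M"


def audio_path_params(components):
    """Audio encoder plus its projection head."""
    return components["audio_encoder"] + components["audio_projection_head"]
```

The audio path total from this breakdown, 32,193,126, is the same figure reported in the runtime footprint comparison.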

Files

| File | Purpose |
|---|---|
| AIST-87M.safetensors | Self-contained release artifact |
| aist_81m_raw_mn20_lora.yaml | Training recipe for the source run |
| manifest.json | Release manifest with checksums and eval coverage |
| parameter_breakdown.json | Exact parameter accounting |
| aist87m_memory_slice_release_report.md | Human-memory slice report |
| aist87m_memory_slice_release_report.json | Machine-readable evaluation summary |
| aist87m_vs_dual_audio_throughput_l4_20260504.json | L4 throughput benchmark vs dual-audio tower |
| aist87m_salt_published_retrieval_20260626.json | Previous published AIST-87M SALT gate output |
| aist87m_salt_matryoshka32_retrieval_20260626.json | Current artifact SALT gate output |
| aist87m_salt_matryoshka32_summary_20260626.json | SALT no-regression summary |

Caveats

  • The model is optimized and reported for memory-relevant embedding surfaces, not broad leaderboard coverage.
  • The single-audio path is smaller and simpler than the dual-audio tower, but it does not dominate the dual-audio tower on paired diagnostic scores.
  • 1280d, 768d, and 512d human-memory slices are complete for the release checkpoint.
  • Consumers should treat this hash as a new embedding-space artifact and regenerate caches rather than mixing embeddings from earlier AIST-87M hashes.
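
One way to act on the last caveat is to key every embedding cache by the artifact hash and the slice width, so vectors from different weights or widths can never be mixed. The sketch below is an assumption about how a consumer might do this, not part of the release: `file_sha256`, `cache_namespace`, and `can_reuse` are hypothetical names, and the manifest.json schema is not described here, so the hash is computed from the file directly.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_namespace(artifact_sha256, dim):
    """Namespace that changes whenever the weights or slice width change."""
    return f"aist87m/{artifact_sha256[:16]}/d{dim}"


def can_reuse(stored_namespace, artifact_sha256, dim):
    """True only if cached vectors came from this exact artifact and width."""
    return stored_namespace == cache_namespace(artifact_sha256, dim)
```

When `can_reuse` returns False, the cached vectors would be discarded and re-embedded with the current artifact.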