AIST-87M

AIST-87M is a compact audio + image + speech + text embedding model for human-memory augmentation workloads.

Current published artifact: matryoshka32-prefixrot-20260625. This refresh keeps the canonical AIST-87M name because the architecture and parameter scale are unchanged, but the released weights now include the shared 128d orthogonal prefix rotation used to support native 64d and 32d Matryoshka slices. On the local held-out SALT gate, the old and new artifacts have identical mean R@1 at 128d and above, while the new artifact improves mean R@1 from 0.111 to 0.142 at 64d and from 0.057 to 0.102 at 32d.

AIST-87M is the single-audio evolution of the earlier dual-audio tower line: the runtime audio path uses one merged native mn20_as EfficientAT encoder instead of separate EfficientAT and Whisper branches. The LoRA training weights are merged into the native audio encoder in this release artifact, so there is no separate LoRA pass at inference time.

Core stack:

text: MongoDB/mdbr-leaf-ir
image: mobilenetv4_conv_medium.e180_r384_in12k
audio: native merged mn20_as EfficientAT encoder
projection output: 1280d
Matryoshka slices: [1280, 768, 512, 256, 128, 64, 32]
exact loaded params: 87,186,755

The canonical name follows the Augmem naming standard:

AIST = audio + image + speech + text
87M = exact loaded parameter count rounded to integer millions

Runtime Contract

This model returns L2-normalized embeddings in a shared 1280-dimensional space. For smaller runtime profiles, truncate to a Matryoshka slice and renormalize:

z1280 = l2norm(model(input))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
z256  = l2norm(z1280[0:256])
z128  = l2norm(z1280[0:128])
z64   = l2norm(z1280[0:64])
z32   = l2norm(z1280[0:32])
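
The truncate-and-renormalize contract above can be sketched in plain Python. This is a minimal illustration, not the release's loader: `l2norm`, `matryoshka_slice`, and the synthetic `z1280` vector are stand-ins introduced here, and a real pipeline would take `z1280` from the model output.

```python
import math

# Slice widths listed in this card.
MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128, 64, 32)


def l2norm(vec):
    """Scale vec to unit L2 length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def matryoshka_slice(z1280, dim):
    """Keep the first `dim` coordinates of a 1280d embedding, then renormalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"unsupported slice {dim}; expected one of {MATRYOSHKA_DIMS}")
    if len(z1280) != 1280:
        raise ValueError("expected a 1280-dimensional embedding")
    return l2norm(z1280[:dim])


# Hypothetical stand-in for l2norm(model(input)): any non-zero 1280d vector.
z1280 = l2norm([math.sin(i + 1) for i in range(1280)])
z64 = matryoshka_slice(z1280, 64)
z32 = matryoshka_slice(z1280, 32)
```

Renormalizing after truncation matters because a prefix of a unit vector generally has norm below 1, so dot products between raw prefixes would no longer be cosine similarities.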

The release safetensors file is self-contained and includes the text encoder, image encoder, merged native audio encoder, and the three projection heads.

Evaluation Scope

This release uses a human-memory evaluation slice rather than a broad leaderboard sweep. The slice is chosen to match practical memory augmentation surfaces:

text continuity: duplicate-question and semantic textual similarity tasks
image recall: Flickr30k text-image and image-text retrieval
audio recall: speech/general-audio text-audio retrieval tasks

Primary metrics:

text continuity: main_score
image recall: NDCG@10
audio recall: NDCG@10

Human-Memory Slice

Source: aist87m_memory_slice_release_report.md and aist87m_memory_slice_release_report.json.

Dim	Tasks	Text continuity	Image recall	Audio recall	Overall
1280	8 / 8	0.763	0.425	0.104	0.349
768	8 / 8	0.762	0.424	0.104	0.349
512	8 / 8	0.762	0.424	0.104	0.349

Selected 1280d task scores:

Task	Family	Metric	Score	R@1	R@10
SprintDuplicateQuestions	Text continuity	main_score	0.875	-	-
STSBenchmark	Text continuity	main_score	0.651	-	-
Flickr30kT2IRetrieval	Image recall	NDCG@10	0.469	0.296	0.672
Flickr30kI2TRetrieval	Image recall	NDCG@10	0.381	0.082	0.407
CommonVoiceMini21T2ARetrieval	Audio recall	NDCG@10	0.028	0.006	0.062
MACST2ARetrieval	Audio recall	NDCG@10	0.110	0.033	0.214
UrbanSound8KT2ARetrieval	Audio recall	NDCG@10	0.009	0.002	0.018
ClothoT2ARetrieval	Audio recall	NDCG@10	0.269	0.128	0.443
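
The 1280d family and overall columns are consistent with unweighted means of the per-task primary scores listed above, with "Overall" averaging all eight tasks rather than the three family means. That reading is inferred from the published numbers, not from a documented aggregation rule; the sketch below only re-derives the columns under that assumption.

```python
from statistics import mean

# (task, family, primary score) copied from the 1280d task table.
SCORES_1280 = [
    ("SprintDuplicateQuestions", "Text continuity", 0.875),
    ("STSBenchmark", "Text continuity", 0.651),
    ("Flickr30kT2IRetrieval", "Image recall", 0.469),
    ("Flickr30kI2TRetrieval", "Image recall", 0.381),
    ("CommonVoiceMini21T2ARetrieval", "Audio recall", 0.028),
    ("MACST2ARetrieval", "Audio recall", 0.110),
    ("UrbanSound8KT2ARetrieval", "Audio recall", 0.009),
    ("ClothoT2ARetrieval", "Audio recall", 0.269),
]


def family_means(rows):
    """Unweighted mean of the primary score within each task family."""
    by_family = {}
    for _task, family, score in rows:
        by_family.setdefault(family, []).append(score)
    return {family: mean(values) for family, values in by_family.items()}


family_avg = family_means(SCORES_1280)
# Assumed rule: "Overall" is the mean over tasks, so audio (4 tasks) weighs most.
overall = mean(score for _task, _family, score in SCORES_1280)
```

Because four of the eight tasks are audio retrieval, the overall figure is pulled toward the audio scores more than a family-balanced average would be.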

Matryoshka32 No-Regression Gate

Source: aist87m_salt_matryoshka32_summary_20260626.json.

The current artifact was compared against the previous published AIST-87M projection checkpoint on the local held-out SALT cached retrieval gate. Mean R@1 averages all six paired directions: audio-text, image-text, and image-audio in both directions.

Dim	Previous mean R@1	Current mean R@1	Delta
1280	0.193905	0.193905	+0.000000
768	0.193072	0.193072	+0.000000
512	0.193105	0.193105	+0.000000
256	0.192472	0.192472	+0.000000
128	0.190671	0.190671	+0.000000
64	0.111089	0.142028	+0.030940
32	0.056745	0.102354	+0.045609
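
As a rough illustration of the gate metric, the sketch below computes R@1 for one retrieval direction and averages it over both directions of every modality pair, which gives six directions for audio, image, and text. It assumes one-to-one pairing by index and L2-normalized vectors; the actual SALT gate code, its pairing scheme, and its tie handling are not described in this card, and the toy embeddings are invented for the demo.

```python
def recall_at_1(queries, targets):
    """Fraction of queries whose top-scoring target has the same index.

    Assumes queries[i] is paired with targets[i] and all vectors are unit
    length, so the dot product equals cosine similarity.
    """
    hits = 0
    for i, query in enumerate(queries):
        sims = [sum(a * b for a, b in zip(query, target)) for target in targets]
        best = max(range(len(sims)), key=sims.__getitem__)  # first max wins ties
        hits += int(best == i)
    return hits / len(queries)


def mean_r1_all_directions(embeddings):
    """Average R@1 over both directions of every pair of modalities."""
    names = sorted(embeddings)
    scores = []
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            scores.append(recall_at_1(embeddings[a], embeddings[b]))
            scores.append(recall_at_1(embeddings[b], embeddings[a]))
    return sum(scores) / len(scores)


# Toy data: three perfectly aligned items per modality (hypothetical values).
toy = {
    "audio": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    "image": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    "text": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
}
toy_mean_r1 = mean_r1_all_directions(toy)
```

To reproduce a row of the table under these assumptions, each embedding would first be truncated to the row's dimension and renormalized before scoring.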

Task-Aligned Comparisons

The comparisons below cover only locally available, task-aligned runs from the same raw AIST line and its audio baselines.

Comparison	Dim	Paired tasks	Read
vs native `mn20_as` audio baseline	768	4	slightly lower selected audio recall on average; UrbanSound8K is flat
vs dual-audio tower	768	6	smaller single-audio runtime, but lower paired text/image/audio scores
vs `AIST-95M`	1280	2	only paired Flickr tasks are available locally; `AIST-95M` remains stronger on that pair

This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model. Broad diagnostic runs contain many task families that are not part of this release gate.

Runtime Footprint vs Dual-Audio Tower

AIST-87M replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny audio branches with one merged native mn20_as EfficientAT encoder. The result is a smaller deployed path with the same 1280d output contract.

Runtime surface	AIST-87M	AIST-95M dual-audio tower	Delta
Loaded parameters	87,186,755	95,315,959	-8.5%
Safetensors artifact	348.9 MB	381.9 MB	-8.6%
Audio encoders	1	2	removes Whisper branch
Audio encoder parameters	19,886,566	26,117,671	-23.9%
Audio path parameters incl. projection	32,193,126	40,390,311	-20.3%
Audio projection input width	1,280	2,304	-44.4%
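
The Delta column can be re-derived as a relative change against the dual-audio tower. The snippet below is only an arithmetic check on the figures in the table; the `pct_delta` helper is introduced here for illustration.

```python
# (AIST-87M value, AIST-95M dual-audio tower value) copied from the table.
ROWS = {
    "Loaded parameters": (87_186_755, 95_315_959),
    "Safetensors artifact (MB)": (348.9, 381.9),
    "Audio encoder parameters": (19_886_566, 26_117_671),
    "Audio path parameters incl. projection": (32_193_126, 40_390_311),
    "Audio projection input width": (1_280, 2_304),
}


def pct_delta(new, old):
    """Relative change of `new` against `old`, in percent."""
    return 100.0 * (new - old) / old


deltas = {name: round(pct_delta(new, old), 1) for name, (new, old) in ROWS.items()}
```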

Exact-gate tradeoff against the same dual-audio local baseline:

1280d exact-gate slice	AIST-87M	AIST-95M dual-audio tower	Delta
Speech holdout audio-text R@1 avg	0.724	0.582	+0.142
WavCaps FSD audio-text R@1 avg	0.097	0.105	-0.009
SALT audio-text R@1 avg	0.008	0.007	flat
SALT image-audio R@1 avg	0.138	0.148	-0.010

PyTorch audio-stack throughput was measured on an NVIDIA L4 using synthetic 10 s, 32 kHz CPU waveforms passed through waveform -> audio encoder -> projection -> normalized embedding. Median wall time is taken over 50 timed iterations after 20 warmup iterations. The timing excludes audio file decode, dataset download, and MTEB result serialization.
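
A timing loop of this shape (20 warmup calls, then the median of 50 timed calls) can be sketched with the standard library. This is not the benchmark script shipped with the release: `median_wall_ms` and the stand-in workload are hypothetical, and the optional `sync` hook reflects the common practice of passing something like `torch.cuda.synchronize` on a GPU so queued work finishes before the clock stops; whether the original benchmark did so is not stated here.

```python
import statistics
import time


def median_wall_ms(fn, warmup=20, iters=50, sync=None):
    """Median wall time of fn() in milliseconds over `iters` timed calls."""
    for _ in range(warmup):
        fn()  # untimed warmup calls
    if sync is not None:
        sync()  # drain any work still queued from warmup
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for device work before reading the clock
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; a real run would call the audio encoder and projection.
calls = []
elapsed_ms = median_wall_ms(lambda: calls.append(sum(range(1000))))
```

The median is less sensitive than the mean to occasional slow iterations, which is one reason to report it for short kernels.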

Batch	AIST-87M median ms	AIST-87M throughput	AIST-95M median ms	AIST-95M throughput	Speedup
1	5.36	186.7 clips/s; 1,867 audio-s/s	10.50	95.2 clips/s; 952 audio-s/s	1.96x
8	16.46	486.0 clips/s; 4,860 audio-s/s	60.29	132.7 clips/s; 1,327 audio-s/s	3.66x
16	41.19	388.5 clips/s; 3,885 audio-s/s	133.95	119.4 clips/s; 1,194 audio-s/s	3.25x
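
The throughput and speedup columns follow from the median latencies: clips per second is the batch size divided by the median time, audio-seconds per second multiplies that by the 10 s clip length, and speedup is the ratio of the two medians. The check below uses the rounded medians from the table, so recomputed throughputs can differ from the published ones in the last digit.

```python
CLIP_SECONDS = 10  # each synthetic clip is 10 s long

# batch -> (AIST-87M median ms, AIST-95M median ms), copied from the table.
MEDIANS_MS = {1: (5.36, 10.50), 8: (16.46, 60.29), 16: (41.19, 133.95)}


def throughput(batch, median_ms):
    """Return (clips per second, audio-seconds per second)."""
    clips_per_s = batch / (median_ms / 1000.0)
    return clips_per_s, clips_per_s * CLIP_SECONDS


summary = {}
for batch, (ms_87m, ms_95m) in MEDIANS_MS.items():
    summary[batch] = {
        "aist87m": throughput(batch, ms_87m),
        "aist95m": throughput(batch, ms_95m),
        "speedup": ms_95m / ms_87m,
    }
```

On these figures, per-clip throughput for AIST-87M peaks at batch 8 and drops at batch 16, so larger batches are not automatically better on this hardware.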

Projection-only throughput at feature batch 2048 is also higher for the single-audio path: 314k features/s for AIST-87M vs 282k features/s for the dual-audio tower. Raw benchmark output is included as aist87m_vs_dual_audio_throughput_l4_20260504.json.

Architecture

Text   -> mdbr-leaf-ir (768-d) -----------------------> DeepProjectionHead-d2 -> 1280
Image  -> MobileNetV4-Medium (1280-d) ----------------> DeepProjectionHead-d2 -> 1280
Audio  -> merged native EfficientAT mn20_as (1280-d) -> DeepProjectionHead-d2 -> 1280

The audio encoder in this artifact is the merged native checkpoint:

mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt

Parameter Count

Component	Params
Text encoder (`MongoDB/mdbr-leaf-ir`)	22,861,056
Image encoder (`mobilenetv4_conv_medium.e180_r384_in12k`)	8,502,493
Audio encoder (merged native `mn20_as`)	19,886,566
Image projection head	12,306,560
Audio projection head	12,306,560
Text projection head	11,323,520
Total exact loaded params	87,186,755
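
The component counts above sum to the stated total, and rounding that total to integer millions yields the 87M suffix. The snippet is a bookkeeping check only; the dictionary keys are labels chosen here, not tensor names from the artifact.

```python
# Per-component parameter counts copied from the table.
PARAMS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_502_493,
    "audio_encoder": 19_886_566,
    "image_projection_head": 12_306_560,
    "audio_projection_head": 12_306_560,
    "text_projection_head": 11_323_520,
}

total = sum(PARAMS.values())
# Naming rule from this card: exact loaded count rounded to integer millions.
canonical_name = f"AIST-{round(total / 1_000_000)}M"
# Audio path = audio encoder plus its projection head.
audio_path = PARAMS["audio_encoder"] + PARAMS["audio_projection_head"]
```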

Files

File	Purpose
`AIST-87M.safetensors`	Self-contained release artifact
`aist_81m_raw_mn20_lora.yaml`	Training recipe for the source run
`manifest.json`	Release manifest with checksums and eval coverage
`parameter_breakdown.json`	Exact parameter accounting
`aist87m_memory_slice_release_report.md`	Human-memory slice report
`aist87m_memory_slice_release_report.json`	Machine-readable evaluation summary
`aist87m_vs_dual_audio_throughput_l4_20260504.json`	L4 throughput benchmark vs dual-audio tower
`aist87m_salt_published_retrieval_20260626.json`	Previous published AIST-87M SALT gate output
`aist87m_salt_matryoshka32_retrieval_20260626.json`	Current artifact SALT gate output
`aist87m_salt_matryoshka32_summary_20260626.json`	SALT no-regression summary

Caveats

The model is optimized and reported for memory-relevant embedding surfaces, not broad leaderboard coverage.
The single-audio path is smaller and simpler than the dual-audio tower, but it does not dominate the dual-audio tower on paired diagnostic scores.
1280d, 768d, and 512d human-memory slices are complete for the release checkpoint.
Consumers should treat this hash as a new embedding-space artifact and regenerate caches rather than mixing embeddings from earlier AIST-87M hashes.
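
One way to avoid mixing embeddings across artifacts is to make the artifact identifier part of every cache key, so a refresh invalidates old entries by construction. The sketch below is a hypothetical scheme, not something shipped with the release: `cache_key`, the key layout, and the use of the artifact tag as the namespace are all assumptions.

```python
import hashlib

# Artifact tag quoted from this card; a checksum from manifest.json would also work.
ARTIFACT_ID = "matryoshka32-prefixrot-20260625"


def cache_key(artifact_id, dim, payload):
    """Build a cache key scoped to one artifact and one Matryoshka dimension.

    `payload` is the raw bytes of the input being embedded.
    """
    digest = hashlib.sha256(payload).hexdigest()
    return f"{artifact_id}/{dim}d/{digest}"


key_new = cache_key(ARTIFACT_ID, 64, b"example input")
key_old = cache_key("earlier-aist-87m-artifact", 64, b"example input")
```

Including the dimension in the key also keeps 64d and 32d slices of the same input from colliding.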
