LoSATok: Low-dimensional Semantic-Acoustic Tokenizer


Python 3.10 PyTorch 2.8 License: MIT

LoSATok is a continuous, low-dimensional, 25 Hz audio tokenizer that jointly models semantic and acoustic information in a single latent space. With a 128-dim latent z, LoSATok achieves competitive cross-domain understanding and generation performance.

See our paper "LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation".


✨ Key Features

  • 128-Dim Semantic-Acoustic Latent: A compact continuous representation that unifies semantic and acoustic information for both understanding and generation.
  • Semantic Bottleneck (SemBo): Compresses frozen MiDashengLM features from 1280-dim to 128-dim while preserving temporal relations.
  • Dual-Level Semantic Supervision: Uses both high- and low-dimensional semantic targets to balance semantic richness and acoustic reconstruction.
  • DiT-Efficient Generation: Makes downstream diffusion modeling easier, improving convergence and generation quality for TTS, TTM, and TTA.
  • Cross-Domain Audio Modeling: Works across speech, music, and general audio within a single tokenizer.
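
To make the 25 Hz and 128-dim figures concrete, the sketch below estimates how many latent frames a clip produces and how many values they hold. It assumes a 16 kHz input (inferred only from the config name 16k_16k_25Hz_losatok.yml) and floor rounding; the real encoder may pad or round differently. `latent_budget` is a hypothetical helper, not part of the repository.

```python
SAMPLE_RATE = 16_000   # assumption: inferred from the config name "16k_16k_25Hz"
TOKEN_RATE = 25        # latent frames per second, as stated in the description
LATENT_DIM = 128       # size of each latent frame z


def latent_budget(duration_s: float) -> dict:
    """Estimate the latent footprint of a clip (hypothetical helper)."""
    num_samples = int(duration_s * SAMPLE_RATE)
    samples_per_token = SAMPLE_RATE // TOKEN_RATE   # waveform samples covered by one frame
    num_tokens = num_samples // samples_per_token   # floor; real padding may differ
    return {
        "num_samples": num_samples,
        "samples_per_token": samples_per_token,
        "num_tokens": num_tokens,
        "latent_values": num_tokens * LATENT_DIM,
    }


if __name__ == "__main__":
    print(latent_budget(10.0))
```

Under these assumptions each frame covers 640 waveform samples, so the latent sequence holds five times fewer values than the raw waveform.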

🛠️ Setup

We tested this setup on Ubuntu 20.04 with Python 3.10 and CUDA 12.x.

Env Setup

conda create -n losatok python=3.10 -y
conda activate losatok

Basic Requirements

git clone https://github.com/wxzyd123/LoSATok.git
cd LoSATok

pip install torch==2.8.0 torchaudio==2.8.0
pip install -r requirements.txt

sudo apt install ffmpeg
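
After installation, a quick sanity check can confirm that the interpreter and ffmpeg are visible. The sketch below uses only the standard library; `check_environment` is a hypothetical helper, and treating Python 3.10 as a minimum version is an assumption (the README only states that 3.10 was tested).

```python
import shutil
import sys


def check_environment(min_python=(3, 10)) -> dict:
    """Report interpreter version and ffmpeg availability (hypothetical helper)."""
    return {
        "python_version": ".".join(str(p) for p in sys.version_info[:3]),
        # Assumption: 3.10 is treated as a lower bound, not an exact requirement.
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        # shutil.which returns None when ffmpeg is not on PATH.
        "ffmpeg_path": shutil.which("ffmpeg"),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```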

📦 Checkpoints

LoSATok needs two checkpoints:

| File | Description |
| --- | --- |
| ckpts/semantic_encoder.pth | Frozen MiDashengLM semantic encoder + pretrained Semantic Bottleneck checkpoint. |
| ckpts/losatok_kl1e-3.pth or ckpts/losatok_kl1e-2.pth | LoSATok checkpoint. kl1e-3 and kl1e-2 correspond to different KL clamp strengths. |

Place both files in the ckpts/ directory, using the file names listed above.

The semantic encoder loads the MiDashengLM-7B backbone at construction time.
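
Before running inference, it can help to confirm that the checkpoints are where the scripts expect them. The sketch below checks for the semantic encoder checkpoint plus at least one of the two LoSATok checkpoints inside ckpts/. `missing_checkpoints` is a hypothetical helper, and accepting either KL variant is an assumption based on the file descriptions above.

```python
from pathlib import Path

SEMANTIC_CKPT = "semantic_encoder.pth"
# Assumption: either KL variant is sufficient on its own.
LOSATOK_CKPTS = ("losatok_kl1e-3.pth", "losatok_kl1e-2.pth")


def missing_checkpoints(ckpt_dir="ckpts") -> list:
    """Return descriptions of required checkpoints that are absent."""
    root = Path(ckpt_dir)
    missing = []
    if not (root / SEMANTIC_CKPT).is_file():
        missing.append(SEMANTIC_CKPT)
    if not any((root / name).is_file() for name in LOSATOK_CKPTS):
        missing.append(" or ".join(LOSATOK_CKPTS))
    return missing


if __name__ == "__main__":
    problems = missing_checkpoints()
    print("all checkpoints found" if not problems else f"missing: {problems}")
```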

You can download these checkpoints from the Hugging Face 🤗 links here.


🚀 Usage

1. Reconstruction via CLI

The simplest way to test LoSATok is to reconstruct a wav file:

python infer.py \
    --config_path config/16k_16k_25Hz_losatok.yml \
    --model_path  ckpts/losatok_kl1e-2.pth \
    --input_path  example/en.wav \
    --output_path example/recon.wav \
    --save_features example/en_features.pt

Arguments

| Argument | Description |
| --- | --- |
| --config_path | YAML config that defines the AudioVAE skeleton. Default: config/16k_16k_25Hz_losatok.yml. |
| --model_path | Trained LoSATok checkpoint (*.pth). If omitted, an untrained model is built (debug only). |
| --input_path | Input wav path. |
| --output_path | Reconstructed wav path. |
| --device | cuda (default) or cpu. |
| --sample | If set, use the reparameterized z = mu + eps * std instead of the deterministic mu. |
| --max_duration | Optional float, clip the input to the first N seconds (avoids OOM on long files). |
| --save_features | Optional .pt path to also dump the full encoded feature dict. |
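
The difference between the default and the --sample behavior can be illustrated on a single latent frame. The sketch below works on plain Python lists and assumes logvar holds the log-variance, so std = exp(0.5 * logvar), as is conventional for VAEs; the description above only states z = mu + eps * std. `sample_latent` is a hypothetical helper, and the actual code presumably operates on tensors.

```python
import math
import random


def sample_latent(mu, logvar, rng=None, sample=True):
    """Return one latent frame: mu if sample is False, else mu + eps * std."""
    if not sample:
        return list(mu)                  # deterministic path (no --sample)
    rng = rng or random.Random()
    z = []
    for m, lv in zip(mu, logvar):
        std = math.exp(0.5 * lv)         # assumption: logvar = log(sigma^2)
        eps = rng.gauss(0.0, 1.0)        # eps drawn from a standard normal
        z.append(m + eps * std)
    return z


if __name__ == "__main__":
    mu, logvar = [0.5, -1.0], [0.0, -2.0]
    print(sample_latent(mu, logvar, sample=False))
    print(sample_latent(mu, logvar, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the draw reproducible, which is convenient when comparing reconstructions from the two modes.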

After running, the script prints the shapes of all encoded features, e.g.:

[LoSATok] Encoded feature shapes:
  z                  (1, T_token, 128)
  mu                 (1, T_token, 128)
  logvar             (1, T_token, 128)
  semantic_emb       (1, T_token, 1280)
  acoustic_emb       (1, T_token, 1280)
  unified_emb        (1, T_token, 1280)
  semantic_emb_low   (1, T_token, 128)
  acoustic_emb_low   (1, T_token, 128)
  unified_emb_low    (1, T_token, 128)
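
The printed shapes can also be checked programmatically. The sketch below validates a mapping from feature name to shape against the listing above: three dimensions, the expected last dimension, and a shared T_token. `check_feature_shapes` is a hypothetical helper that takes plain tuples, so it does not depend on the repository or on PyTorch.

```python
# Expected size of the last dimension for each feature, taken from the listing above.
EXPECTED_LAST_DIM = {
    "z": 128, "mu": 128, "logvar": 128,
    "semantic_emb": 1280, "acoustic_emb": 1280, "unified_emb": 1280,
    "semantic_emb_low": 128, "acoustic_emb_low": 128, "unified_emb_low": 128,
}


def check_feature_shapes(shapes: dict) -> list:
    """Return a list of problems found in a {name: shape} mapping."""
    problems = []
    token_counts = set()
    for name, dim in EXPECTED_LAST_DIM.items():
        shape = shapes.get(name)
        if shape is None:
            problems.append(f"missing feature: {name}")
            continue
        if len(shape) != 3 or shape[-1] != dim:
            problems.append(f"{name}: expected (B, T_token, {dim}), got {tuple(shape)}")
            continue
        token_counts.add(shape[1])
    if len(token_counts) > 1:
        problems.append(f"inconsistent T_token values: {sorted(token_counts)}")
    return problems
```

If the saved feature dict holds tensors, which is an assumption here, the input mapping could be built with `{k: tuple(v.shape) for k, v in features.items()}`.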

2. Python API

import librosa, torch
from infer import load_losatok, encode, decode

model = load_losatok(
    config_path="config/16k_16k_25Hz_losatok.yml",
    model_path="ckpts/losatok_kl1e-3.pth",
    device="cuda",
)

wav, sr = librosa.load("example/en.wav", sr=model.sample_rate, mono=True)
audio = torch.from_numpy(wav).unsqueeze(0)

# ---- Encode: waveform -> LoSATok tokens ----
features = encode(model, audio)

z                = features["z"]                   # (B, T_token, 128)  <- LoSATok tokens
mu               = features["mu"]                  # (B, T_token, 128)
logvar           = features["logvar"]              # (B, T_token, 128)
semantic_emb     = features["semantic_emb"]        # (B, T_token, 1280)
acoustic_emb     = features["acoustic_emb"]        # (B, T_token, 1280)
unified_emb      = features["unified_emb"]         # (B, T_token, 1280)
semantic_emb_low = features["semantic_emb_low"]    # (B, T_token, 128)
acoustic_emb_low = features["acoustic_emb_low"]    # (B, T_token, 128)
unified_emb_low  = features["unified_emb_low"]     # (B, T_token, 128)

# ---- Decode: LoSATok tokens -> waveform ----
audio_recon = decode(model, z)

z is the 128-dim continuous LoSATok token, typically fed to downstream models (LLMs, DiTs, etc.).


❤️ Acknowledgements

We sincerely thank the excellent open-source projects that this work builds on.


📄 License

The code in this repository is released under the MIT license. See LICENSE for details.
