| ════════════════════════════════════════════════════════════════════════════════ |
| DETAILED SOURCE FILE LISTING BY CATEGORY |
| ════════════════════════════════════════════════════════════════════════════════ |
|
|
| MAIN INFERENCE PIPELINE FILES |
| ──────────────────────────────────────────────────────────────────────────────── |
|
|
| /home/user/IndexTTS-Rust/indextts/infer_v2.py (739 LINES) ★★★ CRITICAL |
| ├─ Purpose: Main TTS inference class (IndexTTS2) |
| ├─ Key Classes: |
| │  ├─ QwenEmotion (emotion text-to-vector conversion) |
| │  ├─ IndexTTS2 (main inference class) |
| │  └─ Helper functions for emotion/audio processing |
| ├─ Key Methods: |
| │  ├─ __init__() - Initialize all models and codecs |
| │  ├─ infer() - Single text generation with emotion control |
| │  ├─ infer_fast() - Parallel segment generation |
| │  ├─ get_emb() - Extract semantic embeddings |
| │  ├─ remove_long_silence() - Silence token removal |
| │  ├─ insert_interval_silence() - Silence insertion |
| │  └─ Cache management for repeated generation |
| ├─ Models Loaded: |
| │  ├─ UnifiedVoice (GPT model for mel token generation) |
| │  ├─ W2V-BERT (semantic feature extraction) |
| │  ├─ RepCodec (semantic codec) |
| │  ├─ S2Mel model (semantic-to-mel conversion) |
| │  ├─ CAMPPlus (speaker embedding) |
| │  ├─ BigVGAN vocoder |
| │  ├─ Qwen-based emotion model |
| │  └─ Emotion/speaker matrices |
| └─ External Dependencies: torch, transformers, librosa, safetensors |
|
|
| /home/user/IndexTTS-Rust/webui.py (18KB) ★★★ WEB INTERFACE |
| ├─ Purpose: Gradio-based web UI for IndexTTS |
| ├─ Key Components: |
| │  ├─ Model initialization (IndexTTS2 instance) |
| │  ├─ Language selection (Chinese/English) |
| │  ├─ Emotion control modes (4 modes) |
| │  ├─ Example case loading from cases.jsonl |
| │  ├─ Progress bar integration |
| │  └─ Output management |
| ├─ Features: |
| │  ├─ Real-time inference |
| │  ├─ Multiple emotion control methods |
| │  ├─ Batch processing |
| │  ├─ Task caching |
| │  ├─ i18n support |
| │  └─ Pre-loaded example cases |
| └─ Web Framework: Gradio 5.34.1 |
|
|
| /home/user/IndexTTS-Rust/indextts/cli.py (64 LINES) |
| ├─ Purpose: Command-line interface |
| ├─ Usage: python -m indextts.cli <text> -v <voice.wav> -o <output.wav> [options] |
| ├─ Arguments: |
| │  ├─ text: Text to synthesize |
| │  ├─ -v/--voice: Voice reference audio |
| │  ├─ -o/--output_path: Output file path |
| │  ├─ -c/--config: Config file path |
| │  ├─ --model_dir: Model directory |
| │  ├─ --fp16: Use FP16 precision |
| │  ├─ -d/--device: Device (cpu/cuda/mps/xpu) |
| │  └─ -f/--force: Force overwrite |
| └─ Uses: IndexTTS (v1 model) |
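
The argument list above maps naturally onto an argparse parser. The following is a minimal sketch, not the contents of cli.py: the flag names come from the listing, while the default values, help strings, and the `build_parser` helper are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the arguments listed for indextts.cli."""
    parser = argparse.ArgumentParser(prog="indextts.cli")
    parser.add_argument("text", help="Text to synthesize")
    parser.add_argument("-v", "--voice", required=True, help="Voice reference audio")
    # Defaults below are placeholders, not values confirmed from cli.py.
    parser.add_argument("-o", "--output_path", default="gen.wav", help="Output file path")
    parser.add_argument("-c", "--config", default="checkpoints/config.yaml", help="Config file path")
    parser.add_argument("--model_dir", default="checkpoints", help="Model directory")
    parser.add_argument("--fp16", action="store_true", help="Use FP16 precision")
    parser.add_argument("-d", "--device", default=None,
                        choices=["cpu", "cuda", "mps", "xpu"], help="Device")
    parser.add_argument("-f", "--force", action="store_true", help="Force overwrite")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["hello", "-v", "ref.wav", "-o", "out.wav", "--fp16"])
print(args.text, args.voice, args.output_path, args.fp16, args.force)
```

Leaving `--device` unset (None) lets the caller pick a device automatically, which is one common design for CLIs that support several backends.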
|
|
| TEXT PROCESSING & NORMALIZATION FILES |
| ──────────────────────────────────────────────────────────────────────────────── |
|
|
| /home/user/IndexTTS-Rust/indextts/utils/front.py (700 LINES) ★★★ CRITICAL |
| ├─ Purpose: Text normalization and tokenization |
| ├─ Key Classes: |
| │  ├─ TextNormalizer (700+ lines) |
| │  │  ├─ Pattern Definitions: |
| │  │  │  ├─ PINYIN_TONE_PATTERN (regex for pinyin with tones 1-5) |
| │  │  │  ├─ NAME_PATTERN (regex for Chinese names) |
| │  │  │  └─ ENGLISH_CONTRACTION_PATTERN (regex for 's contractions) |
| │  │  ├─ Methods: |
| │  │  │  ├─ normalize() - Main normalization |
| │  │  │  ├─ use_chinese() - Language detection |
| │  │  │  ├─ save_pinyin_tones() - Extract pinyin with tones |
| │  │  │  ├─ restore_pinyin_tones() - Restore pinyin |
| │  │  │  ├─ save_names() - Extract names |
| │  │  │  ├─ restore_names() - Restore names |
| │  │  │  ├─ correct_pinyin() - Pinyin correction (u/ü after j, q, x → v, e.g. ju → jv) |
| │  │  │  └─ char_rep_map - Character replacement dictionary |
| │  │  └─ Normalizers: |
| │  │     ├─ zh_normalizer (Chinese) - Uses WeTextProcessing/wetext |
| │  │     └─ en_normalizer (English) - Uses tn library |
| │  │ |
| │  └─ TextTokenizer (200+ lines) |
| │     ├─ Methods: |
| │     │  ├─ encode() - Text to token IDs |
| │     │  ├─ decode() - Token IDs to text |
| │     │  ├─ convert_tokens_to_ids() |
| │     │  ├─ convert_ids_to_tokens() |
| │     │  └─ Vocab management |
| │     ├─ Special Tokens: |
| │     │  ├─ BOS: "<s>" (ID 0) |
| │     │  ├─ EOS: "</s>" (ID 1) |
| │     │  └─ UNK: "<unk>" |
| │     └─ Tokenizer: SentencePiece (BPE-based) |
| ├─ Language Support: |
| │  ├─ Chinese (simplified & traditional) |
| │  ├─ English |
| │  └─ Mixed Chinese-English |
| └─ Critical Pattern Matching: |
|    ├─ Pinyin tone detection |
|    ├─ Name entity detection |
|    ├─ Email matching |
|    ├─ Character replacement |
|    └─ Punctuation handling |
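
The save/restore pairs listed above follow a common protect-then-restore pattern: tokens that the normalizer must not touch are swapped for placeholders, the text is normalized, and the tokens are put back. A minimal sketch of that pattern for toned pinyin follows. The regex, the placeholder format, and the function names are assumptions for illustration and may differ from PINYIN_TONE_PATTERN and the methods in front.py.

```python
import re

# Assumed pattern: an uppercase pinyin syllable followed by a tone digit 1-5.
PINYIN_TONE = re.compile(r"(?<![A-Za-z])([A-Z]+[1-5])(?![A-Za-z0-9])")


def save_pinyin(text):
    """Replace each toned-pinyin token with a numbered placeholder."""
    saved = []

    def stash(match):
        saved.append(match.group(1))
        return f"<pinyin_{len(saved) - 1}>"

    return PINYIN_TONE.sub(stash, text), saved


def restore_pinyin(text, saved):
    """Put the saved tokens back in place of their placeholders."""
    for index, token in enumerate(saved):
        text = text.replace(f"<pinyin_{index}>", token)
    return text


def correct_pinyin(token):
    """Map U to V after J/Q/X, where the written u stands for ü (JU4 -> JV4)."""
    return re.sub(r"^([JQX])U", r"\1V", token)


original = "晕 XUAN4 是 一 种 GAN3 觉"  # illustrative input
masked, saved = save_pinyin(original)
restored = restore_pinyin(masked, [correct_pinyin(t) for t in saved])
print(masked)
print(restored)
```

Including the closing `>` in the placeholder keeps `<pinyin_1>` from matching inside `<pinyin_10>` during restoration.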
|
|
| GPT MODEL ARCHITECTURE FILES |
| ──────────────────────────────────────────────────────────────────────────────── |
|
|
| /home/user/IndexTTS-Rust/indextts/gpt/model_v2.py (747 LINES) ★★★ CRITICAL |
| ├─ Purpose: UnifiedVoice GPT-based TTS model |
| ├─ Key Classes: |
| │  ├─ UnifiedVoice (700+ lines) |
| │  │  ├─ Architecture: |
| │  │  │  ├─ Input Embeddings: Text (256 vocab), Mel (8194 vocab) |
| │  │  │  ├─ Position Embeddings: Learned embeddings for mel/text |
| │  │  │  ├─ GPT Transformer: Configurable layers/heads |
| │  │  │  ├─ Conditioning Encoder: Conformer or Perceiver-based |
| │  │  │  ├─ Emotion Conditioning: Separate conformer + perceiver |
| │  │  │  └─ Output Heads: Text prediction, Mel prediction |
| │  │  │ |
| │  │  ├─ Parameters: |
| │  │  │  ├─ layers: 8 (transformer depth) |
| │  │  │  ├─ model_dim: 512 (embedding dimension) |
| │  │  │  ├─ heads: 8 (attention heads) |
| │  │  │  ├─ max_text_tokens: 120 |
| │  │  │  ├─ max_mel_tokens: 250 |
| │  │  │  ├─ number_mel_codes: 8194 |
| │  │  │  ├─ condition_type: "conformer_perceiver" or "conformer_encoder" |
| │  │  │  └─ Various activation functions |
| │  │  │ |
| │  │  ├─ Key Methods: |
| │  │  │  ├─ forward() - Forward pass |
| │  │  │  ├─ post_init_gpt2_config() - Initialize for inference |
| │  │  │  ├─ generate_mel() - Mel token generation |
| │  │  │  ├─ forward_with_cond_scale() - With classifier-free guidance |
| │  │  │  └─ Cache management |
| │  │  │ |
| │  │  └─ Conditioning System: |
| │  │     ├─ Speaker conditioning via mel spectrogram |
| │  │     ├─ Conformer encoder for speaker features |
| │  │     ├─ Perceiver for attention pooling |
| │  │     ├─ Emotion conditioning (separate pathway) |
| │  │     └─ Emotion vector support (8-dimensional) |
| │  │ |
| │  ├─ ResBlock (40+ lines) |
| │  │  ├─ Conv1d layers with GroupNorm |
| │  │  └─ ReLU activation with residual connection |
| │  │ |
| │  ├─ GPT2InferenceModel (200+ lines) |
| │  │  ├─ Inference wrapper for GPT2 |
| │  │  ├─ KV cache support |
| │  │  ├─ Model parallelism support |
| │  │  └─ Token-by-token generation |
| │  │ |
| │  ├─ ConditioningEncoder (30 lines) |
| │  │  ├─ Conv1d initialization |
| │  │  ├─ Attention blocks |
| │  │  └─ Optional mean pooling |
| │  │ |
| │  ├─ MelEncoder (30 lines) |
| │  │  ├─ Conv1d layers |
| │  │  ├─ ResBlocks |
| │  │  └─ 4x reduction |
| │  │ |
| │  ├─ LearnedPositionEmbeddings (15 lines) |
| │  │  └─ Learnable positional embeddings |
| │  │ |
| │  └─ build_hf_gpt_transformer() (20 lines) |
| │     └─ Builds HuggingFace GPT2 with custom embeddings |
| │ |
| ├─ External Dependencies: torch, transformers, indextts.gpt modules |
| └─ Critical Inference Parameters: |
|    ├─ Temperature control for generation |
|    ├─ Top-k/top-p sampling |
|    ├─ Classifier-free guidance scale |
|    └─ Generation length limits |
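
The temperature and top-k/top-p parameters listed above shape the distribution from which each mel token is drawn. The sketch below shows the standard form of that filtering in plain Python so the steps are easy to follow. It is not the project's code, which would delegate this to the HuggingFace generation utilities; the function names are hypothetical.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return a renormalized {token_id: prob} after temperature, top-k, top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens by probability, then keep at most top_k of them.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Nucleus (top-p): keep the smallest prefix whose mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]
print(filter_probs(logits, top_k=2))
print(sample_token(logits, random.Random(0), temperature=0.8, top_k=3, top_p=0.9))
```

Lower temperatures sharpen the distribution before filtering, so the same top-p threshold keeps fewer tokens.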
|
|
| /home/user/IndexTTS-Rust/indextts/gpt/conformer_encoder.py (520 LINES) ★★ |
| ├─ Purpose: Conformer-based speaker conditioning encoder |
| ├─ Key Classes: |
| │  ├─ ConformerEncoder (main) |
| │  │  ├─ Modules: |
| │  │  │  ├─ Subsampling layer (Conv2d) |
| │  │  │  ├─ Positional encoding |
| │  │  │  ├─ Conformer blocks |
| │  │  │  ├─ Layer normalization |
| │  │  │  └─ Optional projection layer |
| │  │  │ |
| │  │  ├─ Configuration Parameters: |
| │  │  │  ├─ input_size: 1024 (mel spectrogram bins) |
| │  │  │  ├─ output_size: depends on config |
| │  │  │  ├─ linear_units: hidden dim for FFN |
| │  │  │  ├─ attention_heads: 8 |
| │  │  │  ├─ num_blocks: 4 |
| │  │  │  └─ input_layer: "linear" or "conv2d" |
| │  │  │ |
| │  │  └─ Architecture: Conv → Pos Enc → [Conformer Block] * N → LayerNorm |
| │  │ |
| │  ├─ ConformerBlock (80+ lines) |
| │  │  ├─ Residual connections |
| │  │  ├─ FFN → Attention → Conv → FFN structure |
| │  │  ├─ Feed-forward network (2-layer with dropout) |
| │  │  ├─ Multi-head self-attention |
| │  │  ├─ Convolution module (depthwise) |
| │  │  └─ Layer normalization |
| │  │ |
| │  ├─ ConvolutionModule (50 lines) |
| │  │  ├─ Pointwise Conv 1x1 |
| │  │  ├─ Depthwise Conv with kernel_size (e.g., 15) |
| │  │  ├─ Batch normalization or layer normalization |
| │  │  ├─ Activation (ReLU/SiLU) |
| │  │  └─ Projection |
| │  │ |
| │  ├─ PositionwiseFeedForward (15 lines) |
| │  │  ├─ Dense layer (idim → hidden) |
| │  │  ├─ Activation (ReLU) |
| │  │  ├─ Dropout |
| │  │  └─ Dense layer (hidden → idim) |
| │  │ |
| │  └─ MultiHeadedAttention (custom) |
| │     ├─ Scaled dot-product attention |
| │     ├─ Multiple heads |
| │     └─ Optional relative position bias |
| │ |
| ├─ External Dependencies: torch, custom conformer modules |
| └─ Use Case: Processing mel spectrogram to extract speaker features |
|
|
| /home/user/IndexTTS-Rust/indextts/gpt/perceiver.py (317 LINES) ★★ |
| ├─ Purpose: Perceiver resampler for attention pooling |
| ├─ Key Classes: |
| │  ├─ PerceiverResampler (250+ lines) |
| │  │  ├─ Architecture: |
| │  │  │  ├─ Learnable latent queries |
| │  │  │  ├─ Cross-attention layers |
| │  │  │  ├─ Feed-forward networks |
| │  │  │  └─ Layer normalization |
| │  │  │ |
| │  │  ├─ Parameters: |
| │  │  │  ├─ dim: 512 (embedding dimension) |
| │  │  │  ├─ dim_context: 512 (context dimension) |
| │  │  │  ├─ num_latents: 32 (number of latent queries) |
| │  │  │  ├─ num_latent_channels: 64 |
| │  │  │  ├─ num_layers: 6 |
| │  │  │  ├─ ff_mult: 4 (FFN expansion) |
| │  │  │  └─ heads: 8 |
| │  │  │ |
| │  │  ├─ Key Methods: |
| │  │  │  ├─ forward() - Attend and pool |
| │  │  │  └─ _cross_attend_block() - Single cross-attention layer |
| │  │  │ |
| │  │  └─ Cross-Attention Mechanism: |
| │  │     ├─ Queries: Learnable latents |
| │  │     ├─ Keys/Values: Input context |
| │  │     ├─ Output: Pooled features (num_latents × dim) |
| │  │     └─ FFN projection for dimension mixing |
| │  │ |
| │  └─ FeedForward (15 lines) |
| │     ├─ Dense (dim → hidden) |
| │     ├─ GELU activation |
| │     └─ Dense (hidden → dim) |
| │ |
| ├─ External Dependencies: torch, einsum operations |
| └─ Use Case: Pool conditioning encoder output to fixed-size representation |
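
The reason a Perceiver resampler yields a fixed-size representation is that the queries are a fixed set of latents, so the output has num_latents rows regardless of how long the context is. The sketch below illustrates only that property, assuming a single head and no learned projections, layer norm, or feed-forward layers; the real PerceiverResampler has all of those, and the latents here are random rather than learned.

```python
import math
import random


def cross_attend(latents, context):
    """Single-head scaled dot-product attention: latents query the context."""
    dim = len(latents[0])
    pooled = []
    for query in latents:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in context]
        peak = max(scores)  # stabilize the softmax
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        # Each output row is a softmax-weighted average of the context rows.
        pooled.append([
            sum(w / norm * value[d] for w, value in zip(weights, context))
            for d in range(dim)
        ])
    return pooled


rng = random.Random(0)
DIM, NUM_LATENTS = 8, 32  # 32 latents as listed above; dim shrunk from 512 for brevity
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]
short_ctx = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
long_ctx = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]

pooled_short = cross_attend(latents, short_ctx)
pooled_long = cross_attend(latents, long_ctx)
print(len(pooled_short), len(pooled_short[0]))  # 32 8
print(len(pooled_long), len(pooled_long[0]))    # 32 8
```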
|
|
| VOCODER & AUDIO SYNTHESIS FILES |
| ──────────────────────────────────────────────────────────────────────────────── |
|
|
| /home/user/IndexTTS-Rust/indextts/BigVGAN/models.py (1000+ LINES) ★★★ |
| ├─ Purpose: BigVGAN neural vocoder for mel-to-audio conversion |
| ├─ Key Classes: |
| │  ├─ BigVGAN (400+ lines) |
| │  │  ├─ Architecture: |
| │  │  │  ├─ Initial Conv1d (80 mel bins → 192 channels) |
| │  │  │  ├─ Upsampling layers (transposed conv) |
| │  │  │  ├─ AMP blocks (anti-aliased multi-periodicity) |
| │  │  │  ├─ Final Conv1d (channels → 1 waveform channel) |
| │  │  │  └─ Tanh activation for output |
| │  │  │ |
| │  │  ├─ Upsampling: 256x total, matching hop_size (the stage rates 4x → 8x → 8x → 4x listed here multiply to 1024x and should be re-checked against the config) |
| │  │  │  ├─ Maps mel frames to audio samples at 22050 Hz |
| │  │  │  ├─ Kernel sizes: [16, 16, 4, 4] |
| │  │  │  └─ Padding: [6, 6, 2, 2] |
| │  │  │ |
| │  │  ├─ Parameters: |
| │  │  │  ├─ num_mels: 80 |
| │  │  │  ├─ num_freq: 513 |
| │  │  │  ├─ n_fft: 1024 |
| │  │  │  ├─ hop_size: 256 |
| │  │  │  ├─ win_size: 1024 |
| │  │  │  ├─ sampling_rate: 22050 |
| │  │  │  ├─ freq_min: 0 |
| │  │  │  ├─ freq_max: None |
| │  │  │  └─ use_cuda_kernel: bool |
| │  │  │ |
| │  │  ├─ Key Methods: |
| │  │  │  ├─ forward() - Mel → audio waveform |
| │  │  │  ├─ from_pretrained() - Load from HuggingFace |
| │  │  │  ├─ remove_weight_norm() - Strip weight normalization before inference |
| │  │  │  └─ eval() - Set to evaluation mode |
| │  │  │ |
| │  │  └─ Special Features: |
| │  │     ├─ Weight normalization for training stability |
| │  │     ├─ Spectral normalization option |
| │  │     ├─ CUDA kernel support for activation functions |
| │  │     ├─ Snake/SnakeBeta activation (periodic) |
| │  │     └─ Anti-aliasing filters for high-quality upsampling |
| │  │ |
| │  ├─ AMPBlock1 (50 lines) |
| │  │  ├─ Architecture: Conv1d × 2 with activations |
| │  │  ├─ Multiple dilation patterns [1, 3, 5] |
| │  │  ├─ Residual connections |
| │  │  ├─ Activation1d wrapper for anti-aliasing |
| │  │  └─ Weight normalization |
| │  │ |
| │  ├─ AMPBlock2 (40 lines) |
| │  │  ├─ Similar to AMPBlock1 but simpler |
| │  │  ├─ Dilation patterns [1, 3] |
| │  │  └─ Residual connections |
| │  │ |
| │  ├─ Activation1d (custom, from alias_free_activation/) |
| │  │  ├─ Applies activation function (Snake/SnakeBeta) |
| │  │  ├─ Optional anti-aliasing filter |
| │  │  └─ Optional CUDA kernel for efficiency |
| │  │ |
| │  ├─ Snake Activation (from activations.py) |
| │  │  ├─ Formula: x + (1/alpha) * sin²(alpha * x) |
| │  │  ├─ Periodic nonlinearity |
| │  │  └─ Learnable alpha parameter |
| │  │ |
| │  └─ SnakeBeta Activation (from activations.py) |
| │     ├─ More complex periodic activation |
| │     └─ Improved harmonic modeling |
| │ |
| ├─ External Dependencies: torch, scipy, librosa |
| └─ Model Size: ~100 MB (pretrained weights) |
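
The Snake formula listed above, x + (1/alpha) * sin²(alpha * x), can be checked numerically. This scalar sketch uses a fixed alpha; in the model alpha is a learnable per-channel parameter applied to tensors, which is not reproduced here.

```python
import math


def snake(x, alpha=1.0):
    """Snake activation: identity plus a periodic term with frequency alpha."""
    return x + (1.0 / alpha) * math.sin(alpha * x) ** 2


# The periodic term repeats every pi/alpha, so shifting x by one period
# shifts the output by exactly that amount.
alpha = 2.0
period = math.pi / alpha
x = 0.3
print(snake(x, alpha))
print(snake(x + period, alpha) - snake(x, alpha), period)
```

The linear term keeps the function monotonic (its derivative is 1 + sin(2 * alpha * x), which never goes below zero), while the sin² term adds the periodic inductive bias that suits harmonic audio.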
|
|
| /home/user/IndexTTS-Rust/indextts/s2mel/modules/audio.py (83 LINES) |
| ├─ Purpose: Mel-spectrogram computation (DSP) |
| ├─ Key Functions: |
| │  ├─ load_wav() - Load WAV file with scipy |
| │  ├─ mel_spectrogram() - Compute mel spectrogram |
| │  │  ├─ Parameters: |
| │  │  │  ├─ y: waveform tensor |
| │  │  │  ├─ n_fft: 1024 |
| │  │  │  ├─ num_mels: 80 |
| │  │  │  ├─ sampling_rate: 22050 |
| │  │  │  ├─ hop_size: 256 |
| │  │  │  ├─ win_size: 1024 |
| │  │  │  ├─ fmin: 0 |
| │  │  │  └─ fmax: None or 8000 |
| │  │  │ |
| │  │  ├─ Process: |
| │  │  │  1. Pad input with reflect padding |
| │  │  │  2. Compute STFT (Short-Time Fourier Transform) |
| │  │  │  3. Convert to magnitude spectrogram |
| │  │  │  4. Apply mel filterbank (librosa) |
| │  │  │  5. Apply dynamic range compression (log) |
| │  │  │  └─ Output: [1, 80, T] tensor |
| │  │  │ |
| │  │  └─ Caching: |
| │  │     ├─ Caches mel filterbank matrices |
| │  │     ├─ Caches Hann windows |
| │  │     └─ Device-specific caching |
| │  │ |
| │  ├─ dynamic_range_compression() - Log compression |
| │  ├─ dynamic_range_decompression() - Inverse |
| │  └─ spectral_normalize/denormalize() |
| │ |
| ├─ Critical DSP Parameters: |
| │  ├─ STFT Window: Hann window |
| │  ├─ FFT Size: 1024 |
| │  ├─ Hop Size: 256 (11.6 ms at 22050 Hz) |
| │  ├─ Mel Bins: 80 (perceptual scale) |
| │  ├─ Min Freq: 0 Hz |
| │  └─ Max Freq: Variable (8000 Hz or Nyquist) |
| │ |
| └─ External Dependencies: torch, librosa, scipy |
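
The DSP parameters above determine both the frame duration and the number of frames T in the [1, 80, T] output. The sketch below works through that arithmetic and a scalar form of the log compression. It assumes HiFi-GAN-style framing (reflect padding of (n_fft - hop_size) // 2 per side with a non-centered STFT) and a clip value of 1e-5; neither is confirmed by the listing, and the real functions operate on tensors.

```python
import math

SAMPLING_RATE = 22050
HOP_SIZE = 256
N_FFT = 1024


def frame_duration_ms(hop_size=HOP_SIZE, sampling_rate=SAMPLING_RATE):
    """Time advanced per mel frame, in milliseconds."""
    return 1000.0 * hop_size / sampling_rate


def num_frames(num_samples, n_fft=N_FFT, hop_size=HOP_SIZE):
    """Frame count T under the assumed padding scheme."""
    padded = num_samples + 2 * ((n_fft - hop_size) // 2)
    return 1 + (padded - n_fft) // hop_size


def dynamic_range_compression(x, C=1.0, clip_val=1e-5):
    """Log compression with a floor so that log(0) never occurs."""
    return math.log(max(x, clip_val) * C)


def dynamic_range_decompression(y, C=1.0):
    """Inverse of the compression for values above the clip floor."""
    return math.exp(y) / C


print(round(frame_duration_ms(), 1))  # 11.6
print(num_frames(SAMPLING_RATE))      # 86 frames for one second of audio
print(dynamic_range_decompression(dynamic_range_compression(0.5)))
```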
|
|
| SEMANTIC CODEC & FEATURE EXTRACTION FILES |
| ──────────────────────────────────────────────────────────────────────────────── |
|
|
| /home/user/IndexTTS-Rust/indextts/utils/maskgct_utils.py (250 LINES) |
| ├─ Purpose: Build and manage semantic codecs |
| ├─ Key Functions: |
| │  ├─ build_semantic_model() |
| │  │  ├─ Loads: facebook/w2v-bert-2.0 model |
| │  │  ├─ Extracts: W2V-BERT 2.0 embeddings |
| │  │  ├─ Returns: model, mean, std (for normalization) |
| │  │  └─ Output: 1024-dimensional embeddings |
| │  │ |
| │  ├─ build_semantic_codec() |
| │  │  ├─ Creates: RepCodec (residual vector quantization) |
| │  │  ├─ Quantizes: Semantic embeddings |
| │  │  ├─ Returns: Codec model |
| │  │  └─ Output: Discrete tokens |
| │  │ |
| │  ├─ build_s2a_model() |
| │  │  ├─ Builds: MaskGCT_S2A (semantic-to-acoustic) |
| │  │  └─ Maps: Semantic codes → acoustic codes |
| │  │ |
| │  ├─ build_acoustic_codec() |
| │  │  ├─ Encoder: Encodes acoustic features |
| │  │  ├─ Decoder: Decodes codes → audio |
| │  │  └─ Multiple codec variants |
| │  │ |
| │  └─ Inference_Pipeline (class) |
| │     ├─ Combines all codecs |
| │     ├─ Methods: |
| │     │  ├─ get_emb() - Get semantic embeddings |
| │     │  ├─ get_scode() - Quantize to semantic codes |
| │     │  ├─ semantic2acoustic() - Convert codes |
| │     │  └─ s2a_inference() - Full pipeline |
| │     └─ Diffusion-based generation options |
| │ |
| ├─ External Dependencies: torch, transformers, huggingface_hub |
| └─ Pre-trained Models: |
|    ├─ W2V-BERT-2.0: 614M parameters |
|    ├─ MaskGCT: From amphion/MaskGCT |
|    └─ Various codec checkpoints |
|
|
| CONFIGURATION & UTILITY FILES |
| ──────────────────────────────────────────────────────────────────────────────── |
|
|
| /home/user/IndexTTS-Rust/indextts/utils/checkpoint.py (50 LINES) |
| ├─ Purpose: Load model checkpoints |
| ├─ Key Functions: |
| │  ├─ load_checkpoint() - Load weights into model |
| │  └─ Device handling (CPU/GPU/XPU/MPS) |
| └─ Supported Formats: .pth, .safetensors |
|
|
| /home/user/IndexTTS-Rust/indextts/utils/arch_util.py |
| ├─ Purpose: Architecture utility modules |
| ├─ Key Classes: |
| │  └─ AttentionBlock - Generic attention layer |
| └─ Used in: Conditioning encoder, other modules |
|
|
| /home/user/IndexTTS-Rust/indextts/utils/xtransformers.py (1,600 LINES) |
| ├─ Purpose: Extended transformer utilities |
| ├─ Key Components: |
| │  ├─ Advanced attention mechanisms |
| │  ├─ Relative position bias |
| │  ├─ Cross-attention patterns |
| │  └─ Various position encoding schemes |
| └─ Used in: GPT model, encoders |
|
|
| TESTING FILES |
| ──────────────────────────────────────────────────────────────────────────────── |
|
|
| /home/user/IndexTTS-Rust/tests/regression_test.py |
| ├─ Test Cases: |
| │  ├─ Chinese text with pinyin tones (晕 XUAN4) |
| │  ├─ English text |
| │  ├─ Mixed Chinese-English |
| │  ├─ Long-form text with multiple sentences |
| │  ├─ Named entities (Joseph Gordon-Levitt) |
| │  ├─ Chinese names (约瑟夫·高登-莱维特) |
| │  └─ Extended passages for robustness |
| ├─ Inference Modes: |
| │  ├─ Single inference (infer) |
| │  └─ Fast inference (infer_fast) |
| └─ Output: WAV files in outputs/ directory |
|
|
| /home/user/IndexTTS-Rust/tests/padding_test.py |
| ├─ Test Scenarios: |
| │  ├─ Variable length inputs |
| │  ├─ Batch processing |
| │  ├─ Edge cases |
| │  └─ Padding handling |
| └─ Purpose: Ensure robust padding mechanics |
|
|
| ──────────────────────────────────────────────────────────────────────────────── |
|
|
| KEY ALGORITHMS SUMMARY: |
|
|
| 1. TEXT PROCESSING: |
| - Regex-based pattern matching for pinyin/names |
| - Character-level CJK tokenization |
| - SentencePiece BPE encoding |
| - Language detection (Chinese vs English) |
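
Language detection of the kind listed above is typically a character-class check that decides which normalizer to run. The sketch below is a simplified assumption, not the logic of use_chinese(): it treats text as Chinese when it contains a CJK ideograph or no Latin letters at all. The function name `looks_chinese` is hypothetical.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs block
LATIN_LETTER = re.compile(r"[A-Za-z]")


def looks_chinese(text):
    """Route to the Chinese normalizer if any CJK char or no Latin letters."""
    if CJK_CHAR.search(text):
        return True
    return LATIN_LETTER.search(text) is None


for sample in ["你好 world", "hello world", "2024"]:
    print(repr(sample), looks_chinese(sample))
```

Routing digit-only input to the Chinese normalizer is a design choice of this sketch; a real implementation may also account for toned pinyin and email addresses.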
|
|
| 2. FEATURE EXTRACTION: |
| - W2V-BERT semantic embeddings (1024-dim) |
| - RepCodec quantization |
| - Mel-spectrogram (STFT-based, 80-dim) |
| - CAMPPlus speaker embeddings (192-dim) |
|
|
| 3. SEQUENCE GENERATION: |
| - GPT-based autoregressive generation |
| - Conformer speaker conditioning |
| - Perceiver-based attention pooling of conditioning features |
| - Classifier-free guidance (optional) |
| - Temperature/top-k/top-p sampling |
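
Classifier-free guidance, listed as optional above, combines a conditioned and an unconditioned prediction. The sketch below shows the standard formula on plain lists; whether forward_with_cond_scale() uses exactly this form is an assumption, and `apply_cfg` is a hypothetical name.

```python
def apply_cfg(cond_logits, uncond_logits, scale):
    """Guided logits: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


cond = [2.0, 0.0]
uncond = [1.0, 1.0]
print(apply_cfg(cond, uncond, 0.0))  # [1.0, 1.0]  unconditioned only
print(apply_cfg(cond, uncond, 1.0))  # [2.0, 0.0]  plain conditioned logits
print(apply_cfg(cond, uncond, 2.0))  # [3.0, -1.0] pushed past the conditioned logits
```

A scale of 1 disables guidance, and larger values exaggerate the difference that the conditioning makes. The guided logits then go through the usual temperature and top-k/top-p sampling.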
|
|
| 4. AUDIO SYNTHESIS: |
| - Transposed convolution upsampling (256x) |
| - Anti-aliased activation functions |
| - Residual connections |
| - Weight/spectral normalization |
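
For the 256x figure above to hold, the per-stage upsampling rates must multiply to the hop size, and each transposed convolution must scale the length by exactly its stride. The sketch below checks both using the standard output-length formula for a 1-D transposed convolution and the (kernel - stride) // 2 padding rule used in HiFi-GAN-style vocoders. The rates and kernel sizes are illustrative assumptions, not values read from this project's config.

```python
import math


def conv_transpose1d_length(length, kernel, stride, padding):
    """Output length with dilation 1 and no output_padding."""
    return (length - 1) * stride - 2 * padding + kernel


def upsampled_length(num_frames, rates, kernels):
    """Run the frame count through every upsampling stage."""
    length = num_frames
    for stride, kernel in zip(rates, kernels):
        padding = (kernel - stride) // 2
        length = conv_transpose1d_length(length, kernel, stride, padding)
    return length


RATES = [4, 4, 2, 2, 2, 2]    # hypothetical stage rates
KERNELS = [8, 8, 4, 4, 4, 4]  # hypothetical kernel sizes

print(math.prod(RATES))                      # 256
print(upsampled_length(86, RATES, KERNELS))  # 22016
```

With kernel - stride even, each stage maps a length L to exactly L * stride, so 86 mel frames become 86 * 256 = 22016 samples.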
|
|
| 5. EMOTION CONTROL: |
| - 8-dimensional emotion vectors |
| - Text-based emotion detection (via Qwen) |
| - Audio-based emotion extraction |
| - Emotion matrix interpolation |
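
One plausible reading of "emotion matrix interpolation" is a weighted mix of per-emotion prototype embeddings, with any leftover weight assigned to an embedding taken from reference audio. The sketch below illustrates that reading only. The blending rule, the function name, and the tiny 2-dimensional embeddings are assumptions, not the project's implementation.

```python
def blend_emotion(weights, prototypes, base):
    """Mix prototypes by weight; the remaining weight goes to the base embedding."""
    residual = 1.0 - sum(weights)
    return [
        sum(w * proto[d] for w, proto in zip(weights, prototypes)) + residual * base[d]
        for d in range(len(base))
    ]


# Eight emotion slots, matching the 8-dimensional emotion vector above.
prototypes = [[float(i), float(-i)] for i in range(8)]  # toy 2-dim embeddings
base = [10.0, 20.0]

one_hot = [0.0] * 8
one_hot[3] = 1.0
print(blend_emotion([0.0] * 8, prototypes, base))  # [10.0, 20.0]
print(blend_emotion(one_hot, prototypes, base))    # [3.0, -3.0]

half = [0.0] * 8
half[3] = 0.5
print(blend_emotion(half, prototypes, base))       # [6.5, 8.5]
```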
|
|
| ──────────────────────────────────────────────────────────────────────────────── |
|
|