RNNT decoder stalls after sentence boundaries in streaming mode

#5
by chatboo - opened

Hi NVIDIA team,

We're using nemotron-speech-streaming-en-0.6b for real-time streaming ASR and have observed that the RNNT decoder frequently stalls (emits only blank tokens) for 2-8+ seconds after encountering sentence-ending punctuation.

Environment

  • NeMo Toolkit: 2.6.0
  • Model: nvidia/nemotron-speech-streaming-en-0.6b
  • GPU: NVIDIA L4 (24GB)
  • CUDA: 12.1

Problem Description

During streaming inference, the model processes audio normally until it outputs a sentence-ending period. After the period, conformer_stream_step() continues to return the same transcript for many consecutive chunks (40+ chunks, representing 3-8 seconds of audio), even though new speech is being spoken.

Example from live radio transcription:

[  3.06s] Miller is running restaurant. Louis Miller Barbecue.
[  3.16s] Miller is running restaurant. Louis Miller Barbecue. There's a
...
[  3.72s] Miller is running restaurant. Louis Miller Barbecue. There's a ample tables and
[  3.72s - 6.40s] (STALLED - same transcript returned for 2.7 seconds)
[  6.40s] Miller is running restaurant. Louis Miller Barbecue. There's a ample tables and

The transcript remains frozen at "There's a ample tables and" while the speaker continues talking. Content spoken during the stall is lost.

Reproduction

We're using the exact approach from your Cache-Aware Streaming tutorial:

import torch

# `model` is the loaded nemotron-speech-streaming-en-0.6b ASR model;
# `device`, `num_features`, and `audio_stream` come from our handler.

# Initialize cache state
cache_last_channel, cache_last_time, cache_last_channel_len = \
    model.encoder.get_initial_cache_state(batch_size=1)
previous_hypotheses = None
pred_out_stream = None

# Pre-encode cache
pre_encode_size = model.encoder.streaming_cfg.pre_encode_cache_size[1]
cache_pre_encode = torch.zeros((1, num_features, pre_encode_size), device=device)

# Process each chunk
for audio_chunk in audio_stream:
    # Convert the raw 16kHz chunk to the tensors the preprocessor expects
    audio_tensor = torch.as_tensor(audio_chunk, device=device).unsqueeze(0)
    audio_len = torch.tensor([audio_tensor.shape[-1]], device=device)
    processed_signal, processed_signal_length = model.preprocessor(
        input_signal=audio_tensor, length=audio_len
    )

    # Concatenate pre-encode cache
    processed_signal = torch.cat([cache_pre_encode, processed_signal], dim=-1)
    processed_signal_length += cache_pre_encode.shape[-1]
    cache_pre_encode = processed_signal[:, :, -pre_encode_size:].clone()

    (pred_out_stream, transcribed_texts, cache_last_channel,
     cache_last_time, cache_last_channel_len, previous_hypotheses
    ) = model.conformer_stream_step(
        processed_signal=processed_signal,
        processed_signal_length=processed_signal_length,
        cache_last_channel=cache_last_channel,
        cache_last_time=cache_last_time,
        cache_last_channel_len=cache_last_channel_len,
        keep_all_outputs=False,
        previous_hypotheses=previous_hypotheses,
        previous_pred_out=pred_out_stream,
        return_transcription=True,
    )

Model config shows:

  • chunk_size: [105, 112] (1.12s native chunk)
  • shift_size: [105, 112]
  • pre_encode_cache_size: [0, 9]
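For reference, the 1.12s figure falls out of those values (assuming the model's 10ms feature frame stride and 16kHz input):

# rough conversion of the native chunk size, assuming 10ms frames and 16kHz audio
chunk_frames = model.encoder.streaming_cfg.chunk_size[1]    # 112
chunk_ms = chunk_frames * 10                                # 1120ms
chunk_samples = 16000 * chunk_ms // 1000                    # 17920 samples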

What We've Tried

  1. keep_all_outputs=True - No improvement
  2. Resetting previous_hypotheses after stall detection - Partially helps but loses context (roughly the approach sketched after this list)
  3. Resetting previous_hypotheses on period detection - Makes it worse
  4. Not carrying over previous_hypotheses - Breaks transcription entirely
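For context, attempt 2 looked roughly like this (simplified; StallGuard and the 20-chunk threshold are our own choices, not anything from NeMo):

class StallGuard:
    """Detect a frozen transcript so we can drop decoder state (workaround 2)."""

    def __init__(self, stall_chunks=20):  # ~2s of unchanged output at our chunk rate
        self.stall_chunks = stall_chunks
        self.unchanged = 0
        self.last_text = ""

    def should_reset(self, text):
        self.unchanged = self.unchanged + 1 if text == self.last_text else 0
        self.last_text = text
        if self.unchanged >= self.stall_chunks:
            self.unchanged = 0
            return True  # caller then sets previous_hypotheses = pred_out_stream = None
        return False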

Questions

  1. Is this stalling behavior expected for the RNNT decoder after sentence boundaries?
  2. Is there a recommended way to handle long-form streaming transcription with multiple sentences?
  3. Should we be resetting decoder state at certain points, and if so, how?
  4. Are there any parameters we might be missing that could help?

We're building a real-time voice agent and need continuous transcription without these multi-second gaps. Any guidance would be greatly appreciated.

Thank you!

NVIDIA org

Hi @Chataboo, thank you for the question. We haven’t observed any stalling behavior with the RNN-T decoder at sentence boundaries on our end, and this is not the expected behavior.

To help us investigate further, could you please share more details:

  • Do you observe the same behavior across other audio samples, or only with this particular recording?
  • Could you try running the audio using NeMo's inference script and check if you see the same behavior?
  • If possible, would you be able to share the audio sample so we can reproduce the issue on our end?

Hi @kunaldhawan, thank you for the quick response!

I have good news to share - after implementing proper chunk sizing, the model works correctly for our voice agent use case.

Root Cause: Incorrect Chunk Sizing

Our original implementation was sending 80-200ms chunks, but the model expects 1120ms native chunks (112 frames × 10ms). This caused the decoder to get stuck in a waiting state.

Fix Applied

We updated our handler as follows (a simplified sketch is shown after this list):

  1. Buffer incoming audio until we have a full 1120ms chunk (17920 samples at 16kHz)
  2. Process one model chunk at a time with proper drop_extra_pre_encoded handling
  3. Use the native chunk size from model.encoder.streaming_cfg.chunk_size[1]
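A simplified version of the buffering logic (audio_source and run_stream_step are placeholders for our handler's own pieces; the streaming_cfg values are the ones shown above):

import torch

SAMPLE_RATE = 16000
FRAME_MS = 10  # feature frame stride

# 112 frames x 10ms = 1120ms -> 17920 samples at 16kHz
chunk_frames = model.encoder.streaming_cfg.chunk_size[1]
chunk_samples = chunk_frames * FRAME_MS * SAMPLE_RATE // 1000

buffer = torch.zeros(0)
for incoming in audio_source:                     # small (80-200ms) chunks from the client
    buffer = torch.cat([buffer, torch.as_tensor(incoming, dtype=torch.float32)])
    while buffer.shape[0] >= chunk_samples:
        chunk, buffer = buffer[:chunk_samples], buffer[chunk_samples:]
        run_stream_step(chunk)                    # the conformer_stream_step loop from earlier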

Production Evaluation Results

Ran a full evaluation on our voice agent pipeline with the companion dataset (5 turns of natural conversation):

Metric       P50 (ms)   P95 (ms)
Spec lead         668        736
LLM TTFT           93        857
LLM TTFS          304        873
TTS TTFU          261        548
A2F TTFU           62         66

Key findings:

  • No 11-second stalls observed
  • ASR producing interim transcripts consistently
  • Turn 3 (12s audio): 8 interim transcripts over the duration - good transcript flow
  • Speculative responses ready 668ms before EOU fires

Audio Complexity Observation

We also confirmed that clean single-speaker audio works perfectly:

  • Clean speech (18s, studio quality): 0 stalls despite 5 sentence boundaries
  • Complex audio (BBC radio with music): Occasional 3.4s stalls

For voice agent use cases with clean microphone input from a single speaker, the model performs excellently.

Summary

The issue was our implementation, not the model. With proper 1120ms chunk sizing, the RNNT decoder works as expected. Thank you for pointing us in the right direction!

Test scripts and the full report are available if helpful for documentation.
