Add eos_token to tokenizer implicitly to help downstream vllm integrate

#43

by dongwng - opened 13 days ago

base: refs/heads/main

←

from: refs/pr/43

dongwng

13 days ago

•

edited 13 days ago

Summary

Declare the Parakeet TDT end-of-transcription token in the standard Hugging Face tokenizer and generation config metadata.

This updates:

tokenizer_config.json to mark <|endoftext|> as the tokenizer EOS token
generation_config.json to set eos_token_id to 3
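
The two metadata changes above can be sketched as a small script run against a local checkout. This is only an illustration, not the diff from this PR: the helper name `declare_eos` is hypothetical, and it assumes the standard Hugging Face keys `eos_token` (in tokenizer_config.json, written here as a plain string) and `eos_token_id` (in generation_config.json).

```python
import json
from pathlib import Path


def declare_eos(repo_dir: str, eos_token: str = "<|endoftext|>", eos_token_id: int = 3) -> None:
    """Set the EOS fields in a local checkout's metadata files (hypothetical helper)."""
    root = Path(repo_dir)
    for filename, key, value in (
        ("tokenizer_config.json", "eos_token", eos_token),
        ("generation_config.json", "eos_token_id", eos_token_id),
    ):
        path = root / filename
        # Start from the existing file when present so all other keys are preserved.
        config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
        config[key] = value
        path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

Only the two keys are touched; vocabulary files and weights are left alone, which matches the compatibility claim made further down.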

Rationale

The Parakeet tokenizer already contains <|endoftext|> at token id 3:

tokenizer.convert_tokens_to_ids("<|endoftext|>") == 3
tokenizer.decode([3]) == "<|endoftext|>"

The model also emits token id 3 as the terminal marker during TDT decoding. However, the current repository metadata does not expose that token as EOS:

tokenizer.eos_token_id is None
GenerationConfig.from_pretrained(...).eos_token_id is None

Downstream runtimes that rely on standard Hugging Face metadata therefore cannot discover the model’s stop token from either the tokenizer config or generation_config.json.
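
To make the discovery problem concrete, here is a minimal sketch of how a downstream runtime might look up the stop token from the standard files. It is not the logic of any particular framework. The function name `discover_eos_token_id` is hypothetical, and the fallback assumes tokenizer_config.json carries an `added_tokens_decoder` map from token id to an entry with a `content` field, which may not hold for every checkpoint.

```python
import json
from pathlib import Path
from typing import Optional


def discover_eos_token_id(repo_dir: str) -> Optional[int]:
    """Return the EOS id declared in standard metadata, or None if it is not declared."""
    root = Path(repo_dir)

    def load(name: str) -> dict:
        path = root / name
        return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    # 1. Prefer generation_config.json, where eos_token_id may be an int or a list of ints.
    eos_id = load("generation_config.json").get("eos_token_id")
    if isinstance(eos_id, list):
        eos_id = eos_id[0] if eos_id else None
    if eos_id is not None:
        return int(eos_id)

    # 2. Fall back to the tokenizer's declared EOS token (string or serialized dict form).
    tok = load("tokenizer_config.json")
    eos_token = tok.get("eos_token")
    if isinstance(eos_token, dict):
        eos_token = eos_token.get("content")
    if eos_token is None:
        return None  # the token may exist in the vocabulary, but nothing marks it as EOS

    # 3. Map the declared token back to its id (assumed added_tokens_decoder layout).
    for token_id, entry in tok.get("added_tokens_decoder", {}).items():
        if entry.get("content") == eos_token:
            return int(token_id)
    return None
```

Under these assumptions, a checkpoint whose vocabulary contains `<|endoftext|>` but whose metadata never names it as EOS yields `None`, which is the situation described above.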

Most ASR / speech-generation checkpoints expose this metadata through the standard files. For example, Whisper, Qwen3-ASR, Granite Speech, Fun-ASR, and FireRed ASR/LID publish eos_token_id in generation_config.json, and usually also declare the corresponding tokenizer EOS token.

Adding this metadata makes Parakeet consistent with those checkpoints and avoids downstream framework-specific workarounds.

Compatibility

This does not change tokenizer vocabulary or model weights. It only declares existing semantics:

<|endoftext|> already exists in the tokenizer vocabulary
token id 3 already decodes to <|endoftext|>
token id 3 is already used as the model’s end marker

Expected behavior after the change:

tokenizer.eos_token == "<|endoftext|>"
tokenizer.eos_token_id == 3
GenerationConfig.from_pretrained(...).eos_token_id == 3
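
The expected behavior above can be turned into a small check. The sketch below is duck-typed: it only reads the `eos_token` and `eos_token_id` attributes, so the stand-in objects at the bottom are placeholders for a real tokenizer and generation config. The helper name `check_eos_metadata` is hypothetical.

```python
from types import SimpleNamespace
from typing import List


def check_eos_metadata(tokenizer, generation_config,
                       eos_token: str = "<|endoftext|>", eos_token_id: int = 3) -> List[str]:
    """Return a list of mismatches; an empty list means the metadata is as expected."""
    problems = []
    if tokenizer.eos_token != eos_token:
        problems.append(f"tokenizer.eos_token is {tokenizer.eos_token!r}")
    if tokenizer.eos_token_id != eos_token_id:
        problems.append(f"tokenizer.eos_token_id is {tokenizer.eos_token_id!r}")
    if generation_config.eos_token_id != eos_token_id:
        problems.append(f"generation_config.eos_token_id is {generation_config.eos_token_id!r}")
    return problems


# Stand-ins that mimic the state before and after this change.
before = (SimpleNamespace(eos_token=None, eos_token_id=None), SimpleNamespace(eos_token_id=None))
after = (SimpleNamespace(eos_token="<|endoftext|>", eos_token_id=3), SimpleNamespace(eos_token_id=3))
print(check_eos_metadata(*before))
print(check_eos_metadata(*after))
```

With Transformers installed, the same function should accept the objects returned by `AutoTokenizer.from_pretrained(...)` and `GenerationConfig.from_pretrained(...)`, called with this repository's id.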

Downstream impact

This helps runtimes such as vLLM, Transformers-based serving stacks, and OpenAI-compatible transcription servers stop generation cleanly without adding Parakeet-specific stop-token workarounds. One such vLLM integration PR is https://github.com/vllm-project/vllm/pull/41708; once this PR is merged, that integration becomes simpler.
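
As a rough illustration of what stopping cleanly means for a runtime, the sketch below consumes decoded token ids until it sees the declared EOS id. It is not vLLM's implementation; `collect_until_eos` is a hypothetical name, and the token ids other than 3 are made up.

```python
from typing import Iterable, List, Optional


def collect_until_eos(token_stream: Iterable[int], eos_token_id: Optional[int]) -> List[int]:
    """Collect token ids from a stream, stopping before the first EOS id if one is declared."""
    collected = []
    for token_id in token_stream:
        if eos_token_id is not None and token_id == eos_token_id:
            break  # the model signalled end of transcription
        collected.append(token_id)
    return collected


# With eos_token_id declared as 3, the trailing tokens are dropped.
print(collect_until_eos([10, 11, 3, 12], eos_token_id=3))
# Without a declared EOS id, the generic loop has no way to know where to stop.
print(collect_until_eos([10, 11, 3, 12], eos_token_id=None))
```

When the metadata declares no EOS id, a runtime needs a model-specific special case to get the first behavior, which is the workaround this change removes.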

add eos token implicitly to help downstream vllm integrate this model b51b7dc0

dongwng changed pull request status to open 13 days ago

dongwng

13 days ago

@eustlb @nithinraok

nithinraok

NVIDIA org 3 days ago

LGTM, thanks @dongwng

nithinraok changed pull request status to merged 3 days ago
