---
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
pipeline_tag: text-to-video
tags:
- text-to-video
- video-generation
- audio-video-generation
- long-video
- multi-shot
- dmd
library_name: ltx-video
---
# Echo-LongVideo

🎬 **Pushing the Frontier of Long Video Generation**

Official model weights for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.
## Model Summary
**Echo-LongVideo** (a.k.a. **JoyAI-Echo**) is a long-form, multi-shot, audio-video generation model. A cross-modal audio-visual memory bank keeps character appearance and voice timbre consistent across videos of up to **five minutes**, and a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a **7.5× inference speedup** without sacrificing quality.
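In human evaluation, Echo-LongVideo is preferred over *HappyOyster* (directing mode) on all four long-form aspects (59.4% to 81.7% of preferences). Against the short-video specialist *Wan 2.6* on human-centric tasks, it is clearly preferred on visual aesthetics (58.8% vs. 26.5%), roughly on par for prompt following, and slightly behind on audio quality.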
This repository hosts the **released checkpoint**. Inference code is released separately — see the [Usage](#usage) section.
## Model Details
- **Developed by:** Echo Team @ Joy Future Academy, JD
- **Model type:** Text-to-(Audio+Video) diffusion transformer, DMD 8-step
- **Modality:** Text → synchronized video + audio
- **Backbone:** Built on top of [LTX-Video](https://github.com/Lightricks/LTX-Video)
- **Text encoder:** [`google/gemma-3-12b-it`](https://huggingface.co/google/gemma-3-12b-it) (downloaded separately)
- **Default resolution / length:** 1280 × 736, 241 frames @ 25 fps per shot (about 9.6 s)
- **Max story length:** up to 5 minutes (multi-shot)
- **License:** LTX-2 Community License Agreement
## Highlights
- 🎞️ **Minute-level multi-shot stories** from a single prompt JSON.
- ⚡ **DMD-distilled few-step inference**, ~7.5× faster than the original pipeline.
- 🔊 **Joint audio-video generation** in a single pipeline.
- 🧠 **Paired cross-modal memory bank** for story-level identity and voice consistency.
## Usage
Inference is run with the standalone **Echo-LongVideo** inference repository.
### 1. Download the checkpoint
```bash
huggingface-cli download /Echo-LongVideo \
--local-dir checkpoints
```
Also download the Gemma text encoder:
```bash
huggingface-cli download google/gemma-3-12b-it \
--local-dir checkpoints/gemma-3-12b
```
Expected layout:
```text
checkpoints/
├── echo-longvideo-release.safetensors
└── gemma-3-12b/
```
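Before running inference, it can help to confirm that both downloads landed where the layout above expects them. The following is a minimal sketch, not part of the released code; the function name `missing_checkpoint_items` is hypothetical, and it only checks that the two entries exist, not that their contents are complete.

```python
from pathlib import Path


def missing_checkpoint_items(root: str = "checkpoints") -> list[str]:
    """Return the expected entries that are absent under `root`."""
    base = Path(root)
    missing = []
    # The released generator weights sit at the top level.
    if not (base / "echo-longvideo-release.safetensors").is_file():
        missing.append("echo-longvideo-release.safetensors")
    # The Gemma text encoder lives in its own subdirectory.
    if not (base / "gemma-3-12b").is_dir():
        missing.append("gemma-3-12b/")
    return missing


if __name__ == "__main__":
    absent = missing_checkpoint_items()
    print("layout OK" if not absent else f"missing: {absent}")
```

An empty list means both entries are present.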
### 2. Get the inference code
```bash
git clone https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo.git
cd JoyAI-Echo
```
Environment: **Python 3.11 + PyTorch 2.8 + CUDA 12.8** (see the inference repo's `environment.yml` / `requirements.txt`).
### 3. Write a story prompt
**Enhance your prompt first.** We provide prompt enhancers — system prompts that expand a short story or idea into well-formed shot prompts: **`prompts/long_story_writer_system_prompt.md`** for long, multi-shot video, and **`prompts/short_story_writer_system_prompt.md`** for single-shot short video. We **strongly recommend** running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.
Create a JSON file under `prompts/`. Each file is a single object with a `prompts` list, where **every string is one complete shot**. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.
Inside each string, write these parts in order:
| Part | What to describe |
| --- | --- |
| **Roles & Subjects** | Describe the appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable. |
| **Action & Dialogue** | What the subject does and says. |
| **Style** | The overall visual and emotional aesthetic — e.g. realistic motorsport film language, cool daylight, restrained cinematic tension. |
| **Camera Movement** | The shot type and framing or movement — e.g. a stable close-up on the face, or a medium shot from the waist up. |
| **Background** | The setting and scene details behind the subject. |
| **Sound Effects & BGM** | The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue, or no background music. |
A more convenient prompt-writing workflow will be released as a **director agent** for everyone to use.
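The file format described above (one object with a `prompts` list, one string per shot, six parts in a fixed order) can be sketched in Python as follows. This is an illustration under assumptions: the dictionary keys, the helper names `build_shot` and `write_story`, the file name `prompts/my_story.json`, and the choice to join the parts with single spaces are all hypothetical and not taken from the inference repository. The example shot text is made up for illustration.

```python
import json
from pathlib import Path

# Order of the parts inside each shot string, following the table above.
# The key names are hypothetical; only the order comes from the table.
PART_ORDER = [
    "roles_and_subjects",
    "action_and_dialogue",
    "style",
    "camera_movement",
    "background",
    "sound_effects_and_bgm",
]


def build_shot(parts: dict[str, str]) -> str:
    """Join the six parts into one shot string, in the documented order."""
    missing = [key for key in PART_ORDER if not parts.get(key, "").strip()]
    if missing:
        raise ValueError(f"shot is missing parts: {missing}")
    # Assumption: parts are concatenated as plain sentences.
    return " ".join(parts[key].strip() for key in PART_ORDER)


def write_story(shots: list[dict[str, str]], path: str) -> dict:
    """Write one object with a `prompts` list, one string per shot."""
    story = {"prompts": [build_shot(shot) for shot in shots]}
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(story, indent=2, ensure_ascii=False), encoding="utf-8")
    return story


example_shot = {
    "roles_and_subjects": "A woman in her thirties with short dark hair, a grey racing jacket, and a low, calm voice.",
    "action_and_dialogue": 'She fastens her helmet strap and says, "One more lap."',
    "style": "Realistic motorsport film language, cool daylight, restrained cinematic tension.",
    "camera_movement": "A stable close-up on the face.",
    "background": "A pit lane with crew members moving behind her.",
    "sound_effects_and_bgm": "Wind, footsteps and fabric, with a soft low music bed under the dialogue.",
}

story = write_story([example_shot], "prompts/my_story.json")
```

Passing several shot dictionaries to `write_story` yields a multi-shot story file. Running the result through the matching prompt enhancer first is still the recommended path.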
### 4. Run
```bash
python inference.py
```
Outputs land in `inference_result/outputs//inference_/`.
## Hardware
Peak GPU memory is **~46–50 GB** at the default 1280 × 736, 241-frame setting, so a single 80 GB H100/A100 is sufficient. A 48 GB GPU is borderline and may run out of memory at the upper end of that range. For smaller GPUs, lower the resolution or frame count:
```bash
python inference.py --num-frames 121 --video-height 480 --video-width 832
```
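To get a feel for how much the reduced setting shrinks the workload, the sketch below compares the raw pixel volume (width × height × frames) of the two configurations. This is arithmetic only, under the assumption that activation memory grows roughly with pixel volume. Model weights are a fixed cost, so peak memory should not be expected to drop by the same factor.

```python
def pixel_volume(width: int, height: int, frames: int) -> int:
    """Total number of pixels generated for one shot."""
    return width * height * frames


default = pixel_volume(1280, 736, 241)  # default setting from this card
reduced = pixel_volume(832, 480, 121)   # setting from the command above
ratio = reduced / default

print(f"default: {default:,} px")
print(f"reduced: {reduced:,} px")
print(f"reduced / default: {ratio:.3f}")  # about 0.213, i.e. roughly one fifth
```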
## Results
### Reported Scale
| Item | Value |
| --- | ---: |
| 🎬 Long-form coherent story length | **5 min** |
| ⚡ Speedup over the original multi-step pipeline | **7.5×** |
| 📚 Benchmark stories | **100** |
| 🎞️ Generated evaluation shots | **3,000** |
| 🕒 Frames per shot | **241 @ 25 fps** |
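As a rough cross-check of the figures in the table, the sketch below derives the average story length they imply. It assumes that the 3,000 evaluation shots are spread evenly over the 100 benchmark stories and that every shot uses the default 241 frames at 25 fps. Neither assumption is stated in this card.

```python
FRAMES_PER_SHOT = 241
FPS = 25
BENCHMARK_STORIES = 100
EVALUATION_SHOTS = 3000

seconds_per_shot = FRAMES_PER_SHOT / FPS                 # 9.64 s
shots_per_story = EVALUATION_SHOTS / BENCHMARK_STORIES   # 30, if evenly spread
story_seconds = shots_per_story * seconds_per_shot       # about 289 s
story_minutes = story_seconds / 60                       # about 4.8 min

print(f"{seconds_per_shot:.2f} s per shot")
print(f"{shots_per_story:.0f} shots per story")
print(f"{story_seconds:.1f} s per story (~{story_minutes:.2f} min)")
```

Under these assumptions an average benchmark story runs a little under five minutes, which is consistent with the reported 5-minute story length.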
### Human Evaluation
GSB (Good/Same/Bad) pairwise user study. Each row gives the percentage of user preferences for JoyAI-Echo, for a tie, and for the baseline.
| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
| --- | ---: | ---: | ---: |
| Visual aesthetics | **63.6%** | 8.8% | 27.6% |
| Audio quality | **81.7%** | 6.5% | 11.8% |
| Prompt following | **80.6%** | 13.5% | 5.9% |
| IP consistency | **59.4%** | 12.9% | 27.7% |

| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
| --- | ---: | ---: | ---: |
| Visual aesthetics | **58.8%** | 14.7% | 26.5% |
| Audio quality | 32.3% | 30.9% | 36.8% |
| Prompt following | 33.8% | 36.8% | 29.4% |
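One common way to condense GSB results into a single number per aspect is the net preference, the share preferring one system minus the share preferring the other. The sketch below applies it to the values in the two tables. The net-preference summary is an added illustration, not a metric reported by the authors, and the variable and function names are hypothetical.

```python
# (preferred JoyAI-Echo, tie, preferred baseline) in percent, copied from the tables.
LONG_VS_HAPPYOYSTER = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VS_WAN26 = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(row: tuple[float, float, float]) -> float:
    """Return (good - bad) in percentage points for one table row."""
    good, same, bad = row
    # Each row should account for all raters (allowing for rounding).
    assert abs(good + same + bad - 100.0) < 0.05
    return round(good - bad, 1)


for name, table in (("long vs HappyOyster", LONG_VS_HAPPYOYSTER),
                    ("short vs Wan 2.6", SHORT_VS_WAN26)):
    for aspect, row in table.items():
        print(f"{name:<20} | {aspect:<17} | net {net_preference(row):+.1f}")
```

A positive value favours JoyAI-Echo. The only negative value is short-video audio quality against Wan 2.6.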
## Acknowledgements
We gratefully acknowledge [LTX-Video](https://github.com/Lightricks/LTX-Video) for the base video generator and [Gemma](https://huggingface.co/google/gemma-3-12b-it) for the text encoder, along with the broader open-source community.
## Citation
If Echo-LongVideo helps your research or products, please cite:
```bibtex
@techreport{echo2026longvideo,
title = {Echo-LongVideo: Pushing the Frontier of Long Video Generation},
author = {{Echo Team @ Joy Future Academy, JD}},
institution = {Joy Future Academy, JD},
year = {2026},
month = {June},
url = {https://github.com/Echo-Team-Joy-Future-Academy-JD/JoyAI-Echo}
}
```
## License
Released under the [LTX-2 Community License Agreement](https://github.com/Lightricks/LTX-2/blob/main/LICENSE). By downloading or using these weights, you agree to its terms. The Gemma text encoder, which is downloaded separately, is governed by Google's separate Gemma license.