Instructions to use Lin-Chen/ShareCaptioner-Video with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers

How to use Lin-Chen/ShareCaptioner-Video with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Lin-Chen/ShareCaptioner-Video", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Lin-Chen/ShareCaptioner-Video", trust_remote_code=True, dtype="auto")
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM

How to use Lin-Chen/ShareCaptioner-Video with vLLM:

Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Lin-Chen/ShareCaptioner-Video"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Lin-Chen/ShareCaptioner-Video",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```

Use Docker:

```shell
docker model run hf.co/Lin-Chen/ShareCaptioner-Video
```
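The same OpenAI-compatible endpoint can also be called from Python. Below is a minimal sketch using only the standard library (the `openai` client package would work equally well); it assumes the vLLM server started above is listening on the default port 8000, and `build_request` is a hypothetical helper introduced here for illustration.

```python
import json
from urllib.request import Request, urlopen

# Endpoint of the local vLLM server started above (assumed default port 8000).
URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str) -> Request:
    """Build an OpenAI-compatible chat-completions request for the local server."""
    payload = {
        "model": "Lin-Chen/ShareCaptioner-Video",
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("What is the capital of France?")
# With the server running, the response could be read via:
#   with urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

The request body mirrors the curl example above, so the same payload works against the SGLang server on port 30000 by changing only the URL.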
- SGLang

How to use Lin-Chen/ShareCaptioner-Video with SGLang:

Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Lin-Chen/ShareCaptioner-Video" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Lin-Chen/ShareCaptioner-Video",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```

Use Docker images:

```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "Lin-Chen/ShareCaptioner-Video" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Lin-Chen/ShareCaptioner-Video",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```

- Docker Model Runner
How to use Lin-Chen/ShareCaptioner-Video with Docker Model Runner:

```shell
docker model run hf.co/Lin-Chen/ShareCaptioner-Video
```
# ShareCaptioner-Video Model Card

## Model details
Model type: ShareCaptioner-Video is an open-source video captioner fine-tuned on GPT4V-assisted ShareGPT4Video detailed caption data, supporting videos of various durations, aspect ratios, and resolutions. ShareCaptioner-Video is based on the InternLM-Xcomposer2-4KHD model.
ShareCaptioner-Video supports four roles:
- Fast Captioning: The model employs an image-grid format for direct video captioning, providing rapid generation speeds that are ideal for short videos. In practice, we concatenate all the keyframes of a video into a vertically elongated image and train the model on a caption task.
- Sliding Captioning: The model supports streaming captioning in a differential sliding-window format, yielding high-quality captions that are suitable for long videos. We take the two adjacent keyframes alongside the previous differential caption as input, and train the model to describe the events occurring between them.
- Clip Summarizing: The model can swiftly summarize any clip from ShareGPT4Video or videos that have undergone the differential sliding-window captioning process, eliminating the need to re-process frames. We use all the differential descriptions as input, and the output is the video caption.
- Prompt Re-Captioning: The model can rephrase prompts input by users who prefer specific video generation areas, ensuring that T2VMs trained on high-quality video-caption data maintain format alignment during inference with their training. In practice, we use GPT-4 to generate Sora-style prompts for our dense captions, and we train the re-captioning task in reverse, i.e., by using the generated prompt as input and the dense caption as the training target.
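The sliding-captioning role described above can be sketched as a simple loop over adjacent keyframe pairs. This is an illustrative outline only: `caption_pair` is a hypothetical stand-in for the actual model call, which takes two adjacent keyframes plus the previous differential caption.

```python
from typing import Callable, List

def sliding_captions(
    keyframes: List[object],
    caption_pair: Callable[[object, object, str], str],
) -> List[str]:
    """Differential sliding-window captioning: each step describes the events
    between two adjacent keyframes, conditioned on the previous differential caption."""
    captions: List[str] = []
    prev_caption = ""  # no context exists before the first keyframe pair
    for left, right in zip(keyframes, keyframes[1:]):
        prev_caption = caption_pair(left, right, prev_caption)
        captions.append(prev_caption)
    return captions

# Stub model call for illustration only.
stub = lambda a, b, prev: f"events between {a} and {b}"
print(sliding_captions(["f0", "f1", "f2"], stub))
```

A clip summary (the Clip Summarizing role) would then take all differential captions produced by this loop as input and emit a single video-level caption, with no need to re-process frames.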
Model date: ShareCaptioner-Video was trained in May 2024.
Paper or resources for more information: [Project] [Paper] [Code]
## Intended use
Primary intended uses: The primary use of ShareCaptioner-Video is producing high-quality video captions.
Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
## Finetuning dataset
- 40K GPT4V-generated video-caption pairs
- 40K differential sliding-window captioning conversations
- 40K prompt-to-caption textual data
## Paper
arxiv.org/abs/2406.04325