# $\color{orange}{\textbf{[CVPR 2026]}}$ PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
PyraTok is a language-aligned pyramidal video tokenizer designed for both video understanding and generation.
This repository includes the model code, inference scripts, and an Accelerate-based finetuning pipeline.
## Features

- Language-aligned pyramidal quantization (LaPQ) for semantically meaningful video tokens.
- End-to-end VAE-style reconstruction with optional text conditioning.
- Sliding-window inference over full videos at frame level.
- Finetuning pipeline with multi-GPU support via `accelerate`.
## Model Download

- Hugging Face model page: https://huggingface.co/onkarsus13/PyraTok
Download with the provided script:

```bash
python download.py
```

Or directly:

```bash
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='onkarsus13/PyraTok', local_dir='./checkpoints', local_dir_use_symlinks=False)"
```

## Installation

```bash
git clone <your-repo-url>
cd CVPR25-PyraTok
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

## Data Format

The training/inference manifest supports `.json` or `.jsonl`.
Each row must contain:

- `path`: relative or absolute video path
- `caption`: text instruction/caption
Example `.json`:

```json
[
  {"path": "videos/clip_0001.mp4", "caption": "a person is walking in a park"},
  {"path": "videos/clip_0002.mp4", "caption": "two people talking indoors"}
]
```

Example `.jsonl`:
{"path": "videos/clip_0001.mp4", "caption": "a person is walking in a park"}
{"path": "videos/clip_0002.mp4", "caption": "two people talking indoors"}Run with explicit inputs:
## Inference

Run with explicit inputs:

```bash
python infer.py /abs/path/to/video.mp4 "your prompt here"
```

Outputs are written under `./reconstructions/`:

- `*_input.mp4`
- `*_recon.mp4`
- `*_sbs.mp4`
- `reconstruction_metadata.json`
Key inference settings:

- `vae_model_path`: path to the downloaded/finetuned VAE checkpoint.
- `qwen_model_path`: path to the text encoder checkpoint.
- `use_text_condition`: enable/disable language conditioning.
- `num_frames`: sliding window size.
- `window_stride`: sliding step (`None` means no overlap, i.e. stride = `num_frames`; see the sketch below).
- `height`, `width`, `fps`, `device`, `dtype`.
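As a sketch of how `num_frames` and `window_stride` interact (assuming the semantics described above; this is not the repo's exact code):

```python
def window_starts(total_frames: int, num_frames: int, window_stride: int | None = None) -> list[int]:
    """Start indices for sliding-window inference over a video.

    window_stride=None falls back to non-overlapping windows
    (stride == num_frames), per the config description above.
    """
    stride = window_stride if window_stride is not None else num_frames
    starts = list(range(0, max(total_frames - num_frames, 0) + 1, stride))
    # If a remainder is left uncovered, add one final (possibly overlapping) window.
    if starts and starts[-1] + num_frames < total_frames:
        starts.append(total_frames - num_frames)
    return starts

# 100 frames, windows of 16, no overlap:
# window_starts(100, 16) -> [0, 16, 32, 48, 64, 80, 84]
```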
## Finetuning

This project uses `finetune.py` as the Accelerate entrypoint, which calls `fine_tune.py`.

Update these paths in `fine_tune.py` (`TrainConfig`):

- `video_base_path`
- `fallback_video_base_paths`
- `train_manifest_path`
- `pyratok_pretrained_path`
- `qwen_model_path`
- `output_dir`
In `fine_tune.py`, also set the following (an illustrative sketch of the dataclass appears after this list):

- Core training: `num_epochs`, `max_steps`, `learning_rate`, `weight_decay`, `grad_clip_norm`
- Batch/layout: `batch_size`, `gradient_accumulation_steps`, `num_workers`, `num_frames`
- Precision/distributed: `mixed_precision`, `offload_text_encoder_to_cpu`
- LaPQ params: `lapq_num_codes`, `lapq_num_quantizers`, `lapq_codebook_dim`, etc.
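For orientation, a sketch of the shape of `TrainConfig`; the field names come from the lists above, but the default values here are placeholders, not the repo's shipped defaults:

```python
from dataclasses import dataclass, field

@dataclass
class TrainConfig:
    # Paths (edit these for your machine)
    video_base_path: str = "/data/videos"
    fallback_video_base_paths: list = field(default_factory=list)
    train_manifest_path: str = "manifests/train.jsonl"
    pyratok_pretrained_path: str = "./checkpoints"
    qwen_model_path: str = "./checkpoints/qwen"
    output_dir: str = "./outputs"
    # Core training
    num_epochs: int = 1
    max_steps: int = 100_000
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    grad_clip_norm: float = 1.0
    # Batch/layout
    batch_size: int = 1
    gradient_accumulation_steps: int = 1
    num_workers: int = 4
    num_frames: int = 16
    # Precision/distributed
    mixed_precision: str = "bf16"
    offload_text_encoder_to_cpu: bool = False
    # LaPQ
    lapq_num_codes: int = 1024
    lapq_num_quantizers: int = 4
    lapq_codebook_dim: int = 32
```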
Configure Accelerate once:

```bash
accelerate config
```

Then launch with a single command:

```bash
accelerate launch --mixed_precision bf16 finetune.py
```

Or use the provided script:

```bash
bash train.sh
```

`train.sh` example:

```bash
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --mixed_precision bf16 finetune.py
```

## Outputs

Saved in `output_dir`:
- Periodic checkpoints: `checkpoint_step_XXXXXXX.pt` (see the inspection sketch below)
- Diffusers-format VAE folders: `vae_checkpoint_step_XXXXXXX/`
- Dumped config: `hardcoded_train_config.json`
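To sanity-check a periodic checkpoint, a minimal sketch (assuming the `.pt` files are plain `torch.save` payloads; the key layout is repo-specific and not confirmed here, and the step number is illustrative):

```python
import torch

# Load a periodic checkpoint on CPU for inspection.
state = torch.load("outputs/checkpoint_step_0010000.pt", map_location="cpu")

# Peek at the top-level structure before resuming or exporting.
if isinstance(state, dict):
    print(list(state.keys())[:10])
```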
## Notes

- Video decoding tries `torchvision`, then `imageio`, then `opencv` (sketched below).
- Unreadable videos are skipped during dataset loading instead of crashing training.
- If you see missing decoder/backend errors, install both `torchvision` and `opencv-python` (or `imageio-ffmpeg`).
- `fine_tune.py` currently uses a hardcoded config via `TrainConfig`, so editing that dataclass is the primary way to change runs.
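A sketch of that decoder fallback order (illustrative, not the repo's exact implementation):

```python
import numpy as np

def read_video_frames(path):
    """Try decoder backends in order; return None so the loader can skip the clip."""
    # 1) torchvision
    try:
        from torchvision.io import read_video
        frames, _, _ = read_video(path, pts_unit="sec")  # (T, H, W, C) uint8
        return frames.numpy()
    except Exception:
        pass
    # 2) imageio (ffmpeg backend)
    try:
        import imageio
        with imageio.get_reader(path) as reader:
            return np.stack([frame for frame in reader])
    except Exception:
        pass
    # 3) OpenCV
    try:
        import cv2
        cap = cv2.VideoCapture(path)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            ok, frame = cap.read()
        cap.release()
        return np.stack(frames) if frames else None
    except Exception:
        return None
```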
If you run into any issues while installing or finetuning, please contact onkarsus13@gmail.com.
## Citation

⭐ If you find this work useful, please cite our paper:

```bibtex
@inproceedings{susladkar2026pyratok,
  title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
  author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```