# $\color{orange}{\textbf{[CVPR 2026]}}$ PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
PyraTok is a language-aligned pyramidal video tokenizer designed for both video understanding and generation.
This repository includes the model code, inference scripts, and an Accelerate-based finetuning pipeline.
## Features

- Language-aligned pyramidal quantization (LaPQ) for semantically meaningful video tokens.
- End-to-end VAE-style reconstruction with optional text conditioning.
- Sliding-window inference over full videos at frame level.
- Finetuning pipeline with multi-GPU support via `accelerate`.
## Model Download

- Hugging Face model page: https://huggingface.co/onkarsus13/PyraTok
Download with the provided script:

```bash
python download.py
```

Or directly:

```bash
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='onkarsus13/PyraTok', local_dir='./checkpoints', local_dir_use_symlinks=False)"
```

## Installation

```bash
git clone <your-repo-url>
cd CVPR25-PyraTok
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

## Data Format

The training/inference manifest supports `.json` or `.jsonl`.
Each row must contain:

- `path`: relative or absolute video path
- `caption`: text instruction/caption
Example `.json`:

```json
[
  {"path": "videos/clip_0001.mp4", "caption": "a person is walking in a park"},
  {"path": "videos/clip_0002.mp4", "caption": "two people talking indoors"}
]
```

Example `.jsonl`:
{"path": "videos/clip_0001.mp4", "caption": "a person is walking in a park"}
{"path": "videos/clip_0002.mp4", "caption": "two people talking indoors"}Run with explicit inputs:
## Inference

Run with explicit inputs:

```bash
python infer.py /abs/path/to/video.mp4 "your prompt here"
```

Outputs are written under `./reconstructions/`:

- `*_input.mp4`
- `*_recon.mp4`
- `*_sbs.mp4`
- `reconstruction_metadata.json`
Key inference settings:

- `vae_model_path`: path to the downloaded/finetuned VAE checkpoint.
- `qwen_model_path`: path to the text encoder checkpoint.
- `use_text_condition`: enable/disable language conditioning.
- `num_frames`: sliding window size.
- `window_stride`: sliding step (`None` means no overlap, i.e. stride = `num_frames`; see the sketch below).
- `height`, `width`, `fps`, `device`, `dtype`.
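As a sketch of how `num_frames` and `window_stride` interact (assuming the semantics described above; this is not the repo's exact code):

```python
def window_starts(total_frames: int, num_frames: int, window_stride: int | None = None) -> list[int]:
    """Start indices for sliding-window inference over a video.

    window_stride=None falls back to non-overlapping windows
    (stride == num_frames), per the config description above.
    """
    stride = window_stride if window_stride is not None else num_frames
    starts = list(range(0, max(total_frames - num_frames, 0) + 1, stride))
    # If a remainder is left uncovered, add one final (possibly overlapping) window.
    if starts and starts[-1] + num_frames < total_frames:
        starts.append(total_frames - num_frames)
    return starts

# 100 frames, windows of 16, no overlap:
# window_starts(100, 16) -> [0, 16, 32, 48, 64, 80, 84]
```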
## Finetuning

This project uses `finetune.py` as the Accelerate entrypoint, which calls `fine_tune.py`.

Update these paths in `fine_tune.py` (`TrainConfig`):

- `video_base_path`
- `fallback_video_base_paths`
- `train_manifest_path`
- `pyratok_pretrained_path`
- `qwen_model_path`
- `output_dir`
In `fine_tune.py`, also set the following (an illustrative sketch of the dataclass appears after this list):

- Core training: `num_epochs`, `max_steps`, `learning_rate`, `weight_decay`, `grad_clip_norm`
- Batch/layout: `batch_size`, `gradient_accumulation_steps`, `num_workers`, `num_frames`
- Precision/distributed: `mixed_precision`, `offload_text_encoder_to_cpu`
- LaPQ params: `lapq_num_codes`, `lapq_num_quantizers`, `lapq_codebook_dim`, etc.
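For orientation, a sketch of the shape of `TrainConfig`; the field names come from the lists above, but the default values here are placeholders, not the repo's shipped defaults:

```python
from dataclasses import dataclass, field

@dataclass
class TrainConfig:
    # Paths (edit these for your machine)
    video_base_path: str = "/data/videos"
    fallback_video_base_paths: list = field(default_factory=list)
    train_manifest_path: str = "manifests/train.jsonl"
    pyratok_pretrained_path: str = "./checkpoints"
    qwen_model_path: str = "./checkpoints/qwen"
    output_dir: str = "./outputs"
    # Core training
    num_epochs: int = 1
    max_steps: int = 100_000
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    grad_clip_norm: float = 1.0
    # Batch/layout
    batch_size: int = 1
    gradient_accumulation_steps: int = 1
    num_workers: int = 4
    num_frames: int = 16
    # Precision/distributed
    mixed_precision: str = "bf16"
    offload_text_encoder_to_cpu: bool = False
    # LaPQ
    lapq_num_codes: int = 1024
    lapq_num_quantizers: int = 4
    lapq_codebook_dim: int = 32
```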
Configure Accelerate once:

```bash
accelerate config
```

Then launch with a single command:

```bash
accelerate launch --mixed_precision bf16 finetune.py
```

Or use the provided script:

```bash
bash train.sh
```

`train.sh` example:

```bash
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --mixed_precision bf16 finetune.py
```

## Outputs

Saved in `output_dir`:
- Periodic checkpoints: `checkpoint_step_XXXXXXX.pt` (see the inspection sketch below)
- Diffusers-format VAE folders: `vae_checkpoint_step_XXXXXXX/`
- Dumped config: `hardcoded_train_config.json`
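To sanity-check a periodic checkpoint, a minimal sketch (assuming the `.pt` files are plain `torch.save` payloads; the key layout is repo-specific and not confirmed here, and the step number is illustrative):

```python
import torch

# Load a periodic checkpoint on CPU for inspection.
state = torch.load("outputs/checkpoint_step_0010000.pt", map_location="cpu")

# Peek at the top-level structure before resuming or exporting.
if isinstance(state, dict):
    print(list(state.keys())[:10])
```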
## Notes

- Video decoding tries `torchvision`, then `imageio`, then `opencv` (sketched below).
- Unreadable videos are skipped during dataset loading instead of crashing training.
- If you see missing decoder/backend errors, install both `torchvision` and `opencv-python` (or `imageio-ffmpeg`).
- `fine_tune.py` currently uses a hardcoded config via `TrainConfig`, so editing that dataclass is the primary way to change runs.
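A sketch of that decoder fallback order (illustrative, not the repo's exact implementation):

```python
import numpy as np

def read_video_frames(path):
    """Try decoder backends in order; return None so the loader can skip the clip."""
    # 1) torchvision
    try:
        from torchvision.io import read_video
        frames, _, _ = read_video(path, pts_unit="sec")  # (T, H, W, C) uint8
        return frames.numpy()
    except Exception:
        pass
    # 2) imageio (ffmpeg backend)
    try:
        import imageio
        with imageio.get_reader(path) as reader:
            return np.stack([frame for frame in reader])
    except Exception:
        pass
    # 3) OpenCV
    try:
        import cv2
        cap = cv2.VideoCapture(path)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            ok, frame = cap.read()
        cap.release()
        return np.stack(frames) if frames else None
    except Exception:
        return None
```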
If you run into any issues while installing or finetuning, please contact onkarsus13@gmail.com.
## Citation

⭐ If you find this work useful, please cite our paper:

```bibtex
@inproceedings{susladkar2026pyratok,
  title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
  author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```