Multi-modality Associative Bridging through Memory.
Application in Lip Reading : Cross-modal Memory Augmented Visual Speech Recognition
This repository contains the official PyTorch implementation of the following paper:
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face
Minsu Kim, Joanna Hong, Sejin Park, and Yong Man Ro
Paper: https://openaccess.thecvf.com/content/ICCV2021/papers/Kim_Multi-Modality_Associative_Bridging_Through_Memory_Speech_Sound_Recollected_From_Face_ICCV_2021_paper.pdf
- python 3.7
- pytorch 1.6 ~ 1.9
- torchvision
- torchaudio
- av
- tensorboard
- pillow
LRW dataset can be downloaded from the below link.
The pre-processing will be done in the data loader.
The video is cropped with the bounding box [x1:59, y1:95, x2:195, y2:231].
main.py saves the weights in --checkpoint_dir and shows the training logs in ./runs.
To train the model, run following command:
# Distributed training example for LRW
python -m torch.distributed.launch --nproc_per_node='number of gpus' main.py \
--lrw 'enter_data_path' \
--checkpoint_dir 'enter_the_path_for_save' \
--batch_size 80 --epochs 200 \
--mode train --radius 16 --n_slot 88 \
--augmentations True --distributed True --dataparallel False\
--gpu 0,1...# Data Parallel training example for LRW
python main.py \
--lrw 'enter_data_path' \
--checkpoint_dir 'enter_the_path_for_save' \
--batch_size 320 --epochs 200 \
--mode train --radius 16 --n_slot 88 \
--augmentations True --distributed False --dataparallel True \
--gpu 0,1...Descriptions of training parameters are as follows:
--lrw: training dataset location (lrw)--checkpoint_dir: directory for saving checkpoints--batch_size: batch size--epochs: number of epochs--mode: train / val / test--augmentations: whether performing augmentation--distributed: Use DataDistributedParallel--dataparallel: Use DataParallel--gpu: gpu for using--lr: learning rate--n_slot: memory slot size--radius: scaling factor for addressing score- Refer to
train.pyfor the other training parameters
To test the model, run following command:
# Testing example for LRW
python main.py \
--lrw 'enter_data_path' \
--checkpoint 'enter_the_checkpoint_path' \
--batch_size 80 \
--mode test --radius 16 --n_slot 88 \
--test_aug True --distributed False --dataparallel False \
--gpu 0Descriptions of training parameters are as follows:
--lrw: training dataset location (lrw)--checkpoint: the checkpoint file--batch_size: batch size--mode: train / val / test--test_aug: whether performing test time augmentation--distributed: Use DataDistributedParallel--dataparallel: Use DataParallel--gpu: gpu for using--lr: learning rate--n_slot: memory slot size--radius: scaling factor for addressing score- Refer to
test.pyfor the other testing parameters
You can download the pretrained models.
Put the ckpt in './data/'
Bi-GRU Backend
To test the pretrained model, run following command:
# Testing example for LRW
python main.py \
--lrw 'enter_data_path' \
--checkpoint ./data/GRU_Back_Ckpt.ckpt \
--batch_size 80 --backend GRU\
--mode test --radius 16 --n_slot 88 \
--test_aug True --distributed False --dataparallel False \
--gpu 0MS-TCN Backend
To test the pretrained model, run following command:
# Testing example for LRW
python main.py \
--lrw 'enter_data_path' \
--checkpoint ./data/MSTCN_Back_Ckpt.ckpt \
--batch_size 80 --backend MSTCN\
--mode test --radius 16 --n_slot 168 \
--test_aug True --distributed False --dataparallel False \
--gpu 0| Architecture | Acc. |
|---|---|
| Resnet18 + MS-TCN + Multi-modal Mem | 85.408 |
| Resnet18 + Bi-GRU + Multi-modal Mem | 85.864 |
You can also use the pre-trained model to perform Audio Visual Speech Recognition (AVSR), since it is trained with both audio and video inputs.
In order to use AVSR, just use ''tr_fusion'' (refer to the train code) for prediction.
If you find this work useful in your research, please cite the paper:
@inproceedings{kim2021multimodalmem,
title={Multi-Modality Associative Bridging Through Memory: Speech Sound Recollected From Face Video},
author={Kim, Minsu and Hong, Joanna and Park, Se Jin and Ro, Yong Man},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={296--306},
year={2021}
}
