This is the PyTorch implementation of the paper:
Baoyu Jing, Zeyu You, Tao Yang, Wei Fan and Hanghang Tong. Multiplex Graph Neural Network for Extractive Text Summarization, EMNLP 2021.
- torch>=1.7.1
- stanza==1.1.1
- pyrouge==0.1.3
- numpy>=1.19.5
- scipy>=1.5.4
- PyYAML>=6.0
- sklearn==0.0
- six >= 1.16.0
- tqdm>=4.59.0
Packages can be installed via: pip install -r requirements.txt
The preprocessed CNN/DailyMail dataset can be downloaded here. If you would like to process the raw data yourself, please follow the instructions below.
The raw data for CNN/DailyMail can be downloaded from https://cs.nyu.edu/~kcho/DMQA/.
The data splits come from https://github.com/abisee/cnn-dailymail and are already included in this repository.
- Tokenization
We follow Get To The Point: Summarization with Pointer-Generator Networks and use Stanford CoreNLP to tokenize the dataset. We use stanza to access CoreNLP. Here are the instructions.
- Graph Construction
We build multiplex graphs at both word level and sentence level.
At word level, we consider the syntactic and semantic relations.
At sentence level, we consider the natural connection (same keywords) and semantic relations.
The semantic graphs are computed on-the-fly within the model, and the other graphs are constructed during preprocessing.
The syntactic graph for the words within a sentence is constructed from the dependency graph of the sentence, which is obtained from Stanford CoreNLP.
The natural connection graph for sentences is constructed from their TF-IDF vectors.
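As a rough illustration of the sentence-level natural connection graph, the sketch below builds TF-IDF vectors for tokenized sentences and links two sentences when their cosine similarity exceeds a threshold. It is not the repository's preprocessing code: the function names, the smoothed IDF formula and the `threshold=0.2` value are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict: token -> weight) per tokenized sentence."""
    n = len(sentences)
    # Document frequency: the number of sentences in which each token occurs.
    df = Counter(tok for sent in sentences for tok in set(sent))
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sentence_graph(sentences, threshold=0.2):
    """Build a symmetric 0/1 adjacency matrix; the threshold is hypothetical."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vecs[i], vecs[j]) > threshold:
                adj[i][j] = adj[j][i] = 1
    return adj


doc = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "chased", "a", "mouse"],
    ["stocks", "fell", "sharply", "today"],
]
adjacency = sentence_graph(doc)
print(adjacency)
```

In this toy document the first two sentences share keywords and become neighbours, while the third sentence stays isolated.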
- Oracle Extraction
We follow Text Summarization with Pretrained Encoders and greedily select the document sentences with the highest ROUGE scores as oracles.
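The greedy selection can be sketched as follows. This is a simplified stand-in rather than the repository's implementation: it scores candidates with set-based unigram and bigram F1 instead of calling pyrouge, and the function names and the `max_sents=3` limit are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return the set of n-grams in a token list."""
    return set(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(candidate, reference):
    """Set-based F1 overlap, used here as a cheap proxy for ROUGE."""
    if not candidate or not reference:
        return 0.0
    overlap = len(candidate & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(doc_sents, ref_tokens, max_sents=3):
    """Greedily add the sentence that most improves the proxy score."""
    ref_1, ref_2 = ngrams(ref_tokens, 1), ngrams(ref_tokens, 2)
    sent_1 = [ngrams(s, 1) for s in doc_sents]
    sent_2 = [ngrams(s, 2) for s in doc_sents]
    selected, best_score = [], 0.0
    for _ in range(max_sents):
        best_idx = None
        for i in range(len(doc_sents)):
            if i in selected:
                continue
            cand = selected + [i]
            uni = set().union(*(sent_1[k] for k in cand))
            bi = set().union(*(sent_2[k] for k in cand))
            score = f1(uni, ref_1) + f1(bi, ref_2)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:
            # No remaining sentence improves the score, so stop early.
            break
        selected.append(best_idx)
    return sorted(selected)


doc = [
    ["the", "mayor", "opened", "a", "new", "bridge"],
    ["it", "rained", "all", "day"],
    ["the", "bridge", "cost", "ten", "million", "dollars"],
]
ref = ["the", "mayor", "opened", "a", "bridge", "that",
       "cost", "ten", "million", "dollars"]
oracle = greedy_oracle(doc, ref)
print(oracle)
```

The loop stops as soon as adding another sentence would not raise the score, so the oracle may contain fewer than `max_sents` sentences.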
We use GloVe embeddings as the initial word embeddings.
Since we use different numbers of the extracted sentences as summaries for CNN and DailyMail, we need to know whether a document is from CNN or DailyMail during evaluation.
Therefore, you need to run cnn_split.py to obtain the split files of CNN.
Following previous work, we use ROUGE-1.5.5 to evaluate the model.
For efficiency, we use pyrouge to calculate ROUGE scores when extracting oracles.
- Download ROUGE-1.5.5 here
export ROUGE_EVAL_HOME="/absolute_path_to/ROUGE-1.5.5/data/"
- Install Perl Packages
sudo apt-get update
sudo apt-get install perl
sudo cpan install XML::DOM
- Remove the old WordNet exception database and rebuild it to avoid errors with the .db files
rm WordNet-2.0.exc.db
./WordNet-2.0-Exceptions/buildExeptionDB.pl ./WordNet-2.0-Exceptions ./smart_common_words.txt ./WordNet-2.0.exc.db
- Install pyrouge
pip install pyrouge
pyrouge_set_rouge_path /absolute_path_to/ROUGE-1.5.5/
- Specify the configurations in model.yml, dataloader.yml, trainer.yml and vocabulary.yml.
- Specify the paths of the configuration files in trainer.py.
python train.py
- Specify the configurations in model.yml, dataloader.yml and vocabulary.yml.
- Specify the paths of the configuration files in evaluate.py.
python evaluate.py
Please cite the following paper if you find the repository or the paper useful.
Baoyu Jing, Zeyu You, Tao Yang, Wei Fan and Hanghang Tong. Multiplex Graph Neural Network for Extractive Text Summarization, EMNLP 2021.
@article{jing2021multiplex,
title={Multiplex Graph Neural Network for Extractive Text Summarization},
author={Jing, Baoyu and You, Zeyu and Yang, Tao and Fan, Wei and Tong, Hanghang},
journal={arXiv preprint arXiv:2108.12870},
year={2021}
}