# ExtractGuardian

Code to extract articles from The Guardian using their Open Platform API. The code extracts articles based on a set of tags and filters out articles that share one or more tags across categories.

The list of tags is defined in `config.py`.
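As an illustration, the tag configuration might look like the sketch below. The category names mirror those mentioned later in this README, but the tag lists are hypothetical; the actual configuration lives in `config.py`:

```python
# Hypothetical sketch of the tag configuration in config.py.
# Maps label categories to lists of Guardian tags; the real tag
# lists in the repository will differ.
TAGS = {
    "CLIMATE": ["environment/climate-crisis", "environment/climate-change"],
    "SIMILAR_BUT_NOT_CLIMATE": ["environment/pollution"],
    "UNRELATED_TO_CLIMATE": ["sport/football", "culture/film"],
}
```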

## Docs

### Setup

1. Get an access key from the Guardian Open Platform. A free developer key should be sufficient. Place your key inside the `.env` file. Example:

   ```
   ACCESS_KEY="5c32337f2-aa32-4066-ass3-25fdcebee2fc"
   ```

2. Download Poetry, then install the dependencies and activate the virtual environment:

   ```
   poetry install
   poetry env activate
   ```
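The scripts read the access key from `.env`. As a hedged sketch, such a file can be parsed with the standard library alone; the repository may instead use a package such as python-dotenv, and the helper name below is hypothetical:

```python
import os

def load_dotenv_minimal(path=".env"):
    """Parse simple KEY="value" lines from a .env file into os.environ.

    Hypothetical helper for illustration only; the repository's scripts
    may load the key differently.
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')
```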

### Run the code

To get the data through the API, run:

```
poetry run python3 get_data.py --mode tags --output_dir articles --tag all
```
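Under the hood, tag-based extraction amounts to querying the Guardian Open Platform content search endpoint. A minimal sketch of building such a request URL follows; the helper name and parameter choices are illustrative, not the repository's actual code:

```python
from urllib.parse import urlencode

# Guardian Open Platform content search endpoint.
API_URL = "https://content.guardianapis.com/search"

def build_query(tag, api_key, page=1, page_size=50):
    """Build a search URL for a single Guardian tag (illustrative helper)."""
    params = {
        "tag": tag,
        "api-key": api_key,
        "page": page,
        "page-size": page_size,
        "show-fields": "bodyText",  # request the full article text
    }
    return f"{API_URL}?{urlencode(params)}"
```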

To adjust the tags from which data is pulled, or to add new tag categories, modify `config.py` as needed.

To split the data into a train, dev, and test set, run:

```
poetry run python3 prep_data.py --data_dir articles --output_dir dataset --filter_by_tags Y
```
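The split itself can be sketched as a seeded shuffle followed by slicing. The function name and split fractions below are illustrative assumptions; the proportions actually used by `prep_data.py` may differ:

```python
import random

def split_dataset(articles, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle and slice articles into train/dev/test splits.

    Illustrative sketch: fractions and seed are assumptions, not the
    repository's actual settings.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    items = list(articles)
    rng.shuffle(items)
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    return {
        "dev": items[:n_dev],
        "test": items[n_dev:n_dev + n_test],
        "train": items[n_dev + n_test:],
    }
```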

To split the data without filtering out articles that share tags across categories, set the `--filter_by_tags` flag to `N`:

```
poetry run python3 prep_data.py --data_dir articles --output_dir dataset --filter_by_tags N
```
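The filtering behaviour — dropping articles whose tags map to more than one label category — can be sketched as follows. This is a hypothetical helper, not the repository's actual implementation:

```python
def filter_shared(articles, tag_to_category):
    """Drop articles whose tags span more than one label category.

    Illustrative sketch of tag-based filtering; `articles` is assumed to be
    a list of dicts with a "tags" key.
    """
    kept = []
    for article in articles:
        # Collect the distinct categories this article's tags belong to.
        categories = {tag_to_category[t]
                      for t in article["tags"] if t in tag_to_category}
        if len(categories) <= 1:  # unambiguous (or untagged): keep it
            kept.append(article)
    return kept
```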

The script above supports pulling articles by tags and by keywords; however, prepping data pulled based on keywords is not fully implemented.

> **Warning:** In `prep_data.py`, the `UNRELATED_TO_CLIMATE` and `SIMILAR_BUT_NOT_CLIMATE` categories are merged.
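A minimal sketch of what such a merge can look like; the direction of the mapping is an assumption here, so check `prep_data.py` for the actual behaviour:

```python
# Assumed mapping: SIMILAR_BUT_NOT_CLIMATE is folded into
# UNRELATED_TO_CLIMATE (direction not confirmed by the README).
MERGE_MAP = {"SIMILAR_BUT_NOT_CLIMATE": "UNRELATED_TO_CLIMATE"}

def merge_label(label):
    """Return the merged label, leaving unmapped labels unchanged."""
    return MERGE_MAP.get(label, label)
```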

### Plot the distribution of labels per year

```
poetry run python3 plot_data.py --data_dir dataset --data_split all
```
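The counting step behind such a plot can be sketched as follows. This is a hypothetical helper; the field names `date` and `label` are assumptions about the dataset schema:

```python
from collections import Counter

def label_counts_per_year(articles):
    """Count (year, label) pairs, e.g. to feed a bar chart of labels per year.

    Illustrative sketch: assumes each article dict has an ISO-formatted
    "date" string and a "label" field.
    """
    counts = Counter()
    for article in articles:
        year = article["date"][:4]  # "2020-01-15" -> "2020"
        counts[(year, article["label"])] += 1
    return counts
```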

### Get help

Use the help function if you are lost:

```
poetry run python3 get_data.py --help
poetry run python3 prep_data.py --help
```

Otherwise, you are welcome to open a new issue on GitHub.

## The Guardian Climate News Corpus

Are you looking for The Guardian Climate News Corpus, as part of the ClimateEval pipeline?

The Guardian Climate News Corpus was created using this same script, but the results are not deterministic. If you are looking to replicate the ClimateEval evaluation results described in our paper, use the dataset provided on HuggingFace instead.

This script is intended for creating similar datasets and for extending the existing one. To evaluate your model on a newly created dataset using the LM Evaluation Harness as part of ClimateEval, you first need to upload the dataset to HuggingFace. You can follow this format. Once your new dataset has been uploaded, adjust the LM harness YAML configs found here to point to your newly created dataset instead.
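A minimal LM Evaluation Harness task YAML might look like the sketch below. The task name and dataset path are placeholders, and the field values are assumptions to adapt to your own uploaded dataset and the existing ClimateEval configs:

```yaml
task: guardian_climate_news               # hypothetical task name
dataset_path: your-hf-username/your-dataset  # placeholder: your uploaded dataset
output_type: multiple_choice
test_split: test
doc_to_text: "{{text}}"                   # assumed field name in your dataset
doc_to_target: "{{label}}"                # assumed field name in your dataset
metric_list:
  - metric: acc
```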

## Citation

```bibtex
@inproceedings{
    kurfali2025climateeval,
    title={ClimateEval: A Comprehensive Benchmark for {NLP} Tasks Related to Climate Change},
    author={Murathan Kurfali and Shorouq Zahra and Joakim Nivre and Gabriele Messori},
    booktitle={The 2nd Workshop of Natural Language Processing meets Climate Change},
    year={2025},
    url={https://openreview.net/forum?id=183GtY94tB}
}
```

## Disclaimer

All articles extracted here are courtesy of Guardian News & Media Ltd. The Open License Terms can be found here.
