Code to extract articles from The Guardian using their Open Platform API. The code pulls articles based on a set of tags and filters out articles that share tags across categories.
The list of tags is found in config.py.
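The exact contents of config.py are defined by this repository; as a rough sketch, the tag lists might be organised per category along these lines (the variable name and the tags themselves are illustrative, not the repository's actual values):

```python
# Hypothetical sketch of config.py: one list of Guardian tags per category.
# The actual variable names and tag lists live in the repository's config.py.
TAGS = {
    "CLIMATE": [
        "environment/climate-crisis",
        "environment/climate-change",
    ],
    "UNRELATED_TO_CLIMATE": [
        "sport/football",
        "culture/film",
    ],
    "SIMILAR_BUT_NOT_CLIMATE": [
        "environment/pollution",
    ],
}
```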
Get an access key from the Guardian Open Platform. A free developer key should be sufficient. Place your key inside the .env file. Example:
ACCESS_KEY="5c32337f2-aa32-4066-ass3-25fdcebee2fc"
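The scripts presumably read ACCESS_KEY from the .env file (for example via python-dotenv); a minimal stdlib-only sketch of that lookup, with a throwaway file and a placeholder key:

```python
import tempfile
from pathlib import Path

def load_env(path: str = ".env") -> dict:
    """Parse simple KEY="value" lines from a dotenv-style file."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env

# Demo with a throwaway file; in the real setup the key lives in ./.env.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write('ACCESS_KEY="not-a-real-key"\n')
print(load_env(f.name)["ACCESS_KEY"])  # not-a-real-key
```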
To install the dependencies and activate the virtual environment, download Poetry and run:
poetry install
poetry env activate
To get the data through the API, run:
poetry run python3 get_data.py --mode tags --output_dir articles --tag all
To adjust the tags from which data is pulled, or to add new tag categories, modify the config.py file as needed.
To split the data into a train, dev, and test set, run:
poetry run python3 prep_data.py --data_dir articles --output_dir dataset --filter_by_tags Y
To split the data without filtering out articles that share tags across categories, change the --filter_by_tags flag:
poetry run python3 prep_data.py --data_dir articles --output_dir dataset --filter_by_tags N
The scripts above support pulling articles by tags and by keywords; prepping data pulled based on keywords is not fully implemented.
Warning
Keep in mind:
In prep_data.py, the UNRELATED_TO_CLIMATE and SIMILAR_BUT_NOT_CLIMATE categories are merged.
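The merge mentioned above amounts to a simple relabelling; the category names match prep_data.py, but the helper and the merged label name below are illustrative:

```python
def merge_label(category: str) -> str:
    """Collapse the two non-climate categories into one negative label.

    "NOT_CLIMATE" is a hypothetical merged label name, not necessarily
    the one used by prep_data.py.
    """
    if category in {"UNRELATED_TO_CLIMATE", "SIMILAR_BUT_NOT_CLIMATE"}:
        return "NOT_CLIMATE"
    return category

print(merge_label("SIMILAR_BUT_NOT_CLIMATE"))  # NOT_CLIMATE
```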
To plot the data, run:
poetry run python3 plot_data.py --data_dir dataset --data_split all
Use the help function if you are lost:
poetry run python3 get_data.py --help
poetry run python3 prep_data.py --help
Otherwise, you are welcome to open a new issue on GitHub.
Are you looking for The Guardian Climate News Corpus, as part of the ClimateEval pipeline?
The Guardian Climate News Corpus was created using this same script, but the results are not deterministic. So, if you are looking to replicate the ClimateEval evaluation results described in our paper, use the dataset provided on HuggingFace.
This script is intended for creating similar datasets and for extending the existing one. To evaluate your model on a newly created dataset with the LM Evaluation Harness as part of ClimateEval, you first need to upload the dataset to HuggingFace. You can follow this format. Once your new dataset has been uploaded, modify the LM Evaluation Harness YAML configs found here to point to your newly created dataset instead.
@inproceedings{
kurfali2025climateeval,
title={ClimateEval: A Comprehensive Benchmark for {NLP} Tasks Related to Climate Change},
author={Murathan Kurfali and Shorouq Zahra and Joakim Nivre and Gabriele Messori},
booktitle={The 2nd Workshop of Natural Language Processing meets Climate Change},
year={2025},
url={https://openreview.net/forum?id=183GtY94tB}
}

All articles extracted here are courtesy of Guardian News & Media Ltd. Open License Terms can be found here.