These tools are designed to access unit process data from the National Pollutant Discharge Elimination System (NPDES) permits for California.
Researchers have used the Clean Watersheds Needs Survey (CWNS) to quantify greenhouse gas emissions. However, CWNS data are collected infrequently, reported voluntarily, and sparse. To address these limitations, we use NPDES permits. The following Python tools collect and process the data:
- Build CWNS process tables
  - wwtp_process_extraction/step0_build_cwns_table.py: creates cwns_processes_by_facility.csv from CWNS 2004, 2008, and 2012 data. The 2022 survey does not include California.
- Scrape permits and site metadata
  - wwtp_process_extraction/step1_scrape_npdes.py: downloads NPDES permit PDFs and writes site_data.csv and matched_cwns_npdes_ca.csv. Uses helpers from npdes_detection.py to detect which downloaded files are actually NPDES permits.
- Detect treatment processes in permits with keyword search
  - wwtp_process_extraction/step2_search_npdes_text.py: scans PDFs against unitprocess_keywords and writes kw_unit_processes.csv with present/future status
- Detect treatment processes in permits with LLM search
  - wwtp_process_extraction/step3a_llm_ontology.py: runs the LLM extraction using the ontology format
    - use --init_ontology to reload the ontology and save an up-to-date copy as a .txt file under wwtp_process_extraction/data
    - use --model "model_name" --pdf_folder "path_to_pdf_folder" --facilities_information "path_to_facilities_csv" to run the LLM extraction on one PDF per facility (the first PDF_File listed for each Facility Name); the results are saved as a JSON file under output/date/llm_search_ontology
  - wwtp_process_extraction/step3b_llm_list.py: runs the LLM extraction using the unitprocess_list format
    - use --model "model_name" --pdf "pdf_file_or_pdf_folder_path" to run the LLM extraction with the given model on the given PDF(s); the results are saved as a JSON file under output/date/llm_search_list
- Post-process LLM output back to CWNS format
  - wwtp_process_extraction/step4_postprocess_llm_output.py: post-processes the LLM outputs using the ontology and writes llm_ontology_cwns_processes_by_facility.csv with present/planned/past status
- Compare NPDES text extraction vs CWNS survey data
  - wwtp_process_extraction/step5a_compare_aggregate_results.py: compares llm_unit_processes.csv to cwns_processes_by_facility.csv and plots bar chart comparisons
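The keyword search step can be illustrated with a minimal sketch. It is not the code in step2_search_npdes_text.py: the keyword map, the future-tense cue words, and the function name `detect_processes` are all hypothetical, and the sketch works on plain text rather than on PDFs. It only shows one plausible way to assign a present/future status to each detected process.

```python
import re

# Hypothetical keyword map: unit process -> phrases that indicate it.
# The repository's real list is unitprocess_keywords; these entries are placeholders.
UNIT_PROCESS_KEYWORDS = {
    "activated sludge": ["activated sludge"],
    "anaerobic digestion": ["anaerobic digester", "anaerobic digestion"],
    "uv disinfection": ["uv disinfection", "ultraviolet disinfection"],
}

# Assumed cue words suggesting a process is planned rather than installed.
FUTURE_CUES = re.compile(r"\b(will|planned|proposed|future|upgrade)\b", re.IGNORECASE)


def detect_processes(text):
    """Return {process: "present" | "future"} for each keyword found in text."""
    results = {}
    # Split into rough sentences so a future cue only affects nearby keywords.
    sentences = re.split(r"(?<=[.;])\s+", text)
    for sentence in sentences:
        lowered = sentence.lower()
        for process, phrases in UNIT_PROCESS_KEYWORDS.items():
            if not any(phrase in lowered for phrase in phrases):
                continue
            status = "future" if FUTURE_CUES.search(sentence) else "present"
            # A "present" mention anywhere outranks a "future" one.
            if results.get(process) != "present":
                results[process] = status
    return results


sample = (
    "The facility uses activated sludge for secondary treatment. "
    "The Discharger has proposed adding UV disinfection by 2027."
)
print(detect_processes(sample))
```

Sentence-level matching keeps a single "proposed" elsewhere in the permit from marking every process as future; a real permit would likely need a more careful rule.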
Run from the repository root directory:
python wwtp_process_extraction/step0_build_cwns_table.py
python wwtp_process_extraction/step1_scrape_npdes.py
python wwtp_process_extraction/step2_search_npdes_text.py
python wwtp_process_extraction/step3a_llm_ontology.py --init_ontology
python wwtp_process_extraction/step3a_llm_ontology.py --model gemini-2.0-flash-001 --pdf_folder wwtp_process_extraction/output/2026-2-18/npdes --facilities_information wwtp_process_extraction/data/test_set_npdes_manual.csv
python wwtp_process_extraction/step4_postprocess_llm_output.py
python wwtp_process_extraction/step5a_compare_aggregate_results.py
python wwtp_process_extraction/step5b_compare_facility_results.py
- When first running permit_scrape.py, a timeout error may appear. Rerun until the program successfully opens ChromeDriver, and delete the "MM-DD-YYYY" folder before each rerun.
- permit_scrape.py is slow at two distinct points:
  - after the region is selected
  - after "ALL" is selected as the display range
- ...
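The rerun advice above can be automated with a small wrapper. The sketch below is an assumption, not part of the repository: `run_with_retries` and `flaky_scrape` are hypothetical names, and it catches Python's built-in `TimeoutError`. If the scraper raises a different exception (for example Selenium's `TimeoutException`), that type would need to be caught instead.

```python
import shutil
import tempfile
import time
from pathlib import Path


def run_with_retries(task, output_dir, max_attempts=5, wait_seconds=0.0):
    """Call task() until it succeeds, clearing output_dir after each timeout."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TimeoutError:
            # Remove the partially written dated folder so the next run starts clean.
            shutil.rmtree(Path(output_dir), ignore_errors=True)
            if attempt == max_attempts:
                raise
            time.sleep(wait_seconds)


# Demo with a stand-in for the scraper: it times out twice, then succeeds.
attempts = {"count": 0}


def flaky_scrape():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("browser did not start")
    return "done"


# A temporary folder plays the role of the dated "MM-DD-YYYY" output folder.
dated_folder = tempfile.mkdtemp()
result = run_with_retries(flaky_scrape, output_dir=dated_folder)
print(result, attempts["count"])
```

Capping the number of attempts avoids an endless loop when the failure is not transient, such as a missing browser driver.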
Constance Rouffet - rouffetc@stanford.edu
Ashley Ramirez - ashlecr3@uci.edu
Daly Wettermark - dalyw@stanford.edu
Fletcher Chapin - fchapin@stanford.edu
This work is funded in part by the Stanford SURGE program and the National Alliance for Water Innovation.