we3lab/wwtp-process-extraction
NPDES Permit Analysis Tools

These tools extract unit process data from National Pollutant Discharge Elimination System (NPDES) permits for California.

Description

Researchers have used the Clean Watersheds Needs Survey (CWNS) to quantify greenhouse gas emissions. However, CWNS data are collected infrequently, reported voluntarily, and sparse. To address these limitations, we use NPDES permits as a data source. The following Python tools collect and process the permit data:

  1. Build CWNS process tables

  2. Scrape permits and site metadata

    Uses npdes_detection.py helpers to detect which downloaded files are actually NPDES permits

  3. Detect treatment processes in permits with keyword search

  4. Detect treatment processes in permits with LLM search

    • wwtp_process_extraction/step3a_llm_ontology.py: runs the LLM extraction using the ontology format
      • use --init_ontology to regenerate the ontology as an up-to-date .txt file under wwtp_process_extraction/data
      • use --model "model_name" --pdf_folder "path_to_pdf_folder" --facilities_information "path_to_facilities_csv" to run the LLM extraction on one PDF per facility (the first PDF_File for each Facility Name); results are saved as a JSON file under output/date/llm_search_ontology
    • wwtp_process_extraction/step3b_llm_list.py: runs the LLM extraction using the unitprocess_list format
      • use --model "model_name" --pdf "pdf_file_or_pdf_folder_path" to run the LLM extraction with the specified model on the specified PDF(s); results are saved as a JSON file under output/date/llm_search_list
  5. Post-process LLM output back to CWNS format

  6. Compare NPDES text extraction vs CWNS survey data
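
The "one PDF per facility" rule mentioned for step3a_llm_ontology.py can be illustrated with a minimal sketch. It assumes the facilities CSV has columns named exactly "Facility Name" and "PDF_File" and that "first" means first in row order; the function name and sample rows are hypothetical and are not taken from the repository.

```python
import csv
import io


def first_pdf_per_facility(rows, facility_col="Facility Name", pdf_col="PDF_File"):
    """Map each facility to the first non-empty PDF listed for it, in row order.

    Column names are assumptions based on the README wording.
    """
    selected = {}
    for row in rows:
        facility = (row.get(facility_col) or "").strip()
        pdf = (row.get(pdf_col) or "").strip()
        # Keep only the first PDF seen for each facility; skip incomplete rows.
        if facility and pdf and facility not in selected:
            selected[facility] = pdf
    return selected


# Hypothetical sample data standing in for the facilities CSV.
sample = io.StringIO(
    "Facility Name,PDF_File\n"
    "Plant A,a_2019.pdf\n"
    "Plant A,a_2024.pdf\n"
    "Plant B,b_2021.pdf\n"
)
print(first_pdf_per_facility(csv.DictReader(sample)))
# {'Plant A': 'a_2019.pdf', 'Plant B': 'b_2021.pdf'}
```

Keeping the first row per facility makes the selection deterministic for a given CSV, so reordering the CSV is the simplest way to change which permit is sent to the model.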

How to Run

Run the following commands from the repository root directory:

python wwtp_process_extraction/step0_build_cwns_table.py
python wwtp_process_extraction/step1_scrape_npdes.py
python wwtp_process_extraction/step2_search_npdes_text.py
python wwtp_process_extraction/step3a_llm_ontology.py --init_ontology
python wwtp_process_extraction/step3a_llm_ontology.py --model gemini-2.0-flash-001 --pdf_folder wwtp_process_extraction/output/2026-2-18/npdes --facilities_information wwtp_process_extraction/data/test_set_npdes_manual.csv
python wwtp_process_extraction/step4_postprocess_llm_output.py
python wwtp_process_extraction/step5a_compare_aggregate_results.py
python wwtp_process_extraction/step5b_compare_facility_results.py
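
The command sequence above could also be driven from a single script. The following is a minimal sketch, not part of the repository: the helper names build_commands and run_pipeline are hypothetical, and the script paths and flags are copied from the commands listed above. It defaults to a dry run that only prints each command.

```python
import subprocess
import sys

STEPS_DIR = "wwtp_process_extraction"


def build_commands(model, pdf_folder, facilities_csv):
    """Return the pipeline steps as argument lists, in execution order."""
    py = sys.executable  # run every step with the current interpreter

    def step(script, *args):
        return [py, f"{STEPS_DIR}/{script}", *args]

    return [
        step("step0_build_cwns_table.py"),
        step("step1_scrape_npdes.py"),
        step("step2_search_npdes_text.py"),
        step("step3a_llm_ontology.py", "--init_ontology"),
        step(
            "step3a_llm_ontology.py",
            "--model", model,
            "--pdf_folder", pdf_folder,
            "--facilities_information", facilities_csv,
        ),
        step("step4_postprocess_llm_output.py"),
        step("step5a_compare_aggregate_results.py"),
        step("step5b_compare_facility_results.py"),
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


commands = build_commands(
    model="gemini-2.0-flash-001",
    pdf_folder="wwtp_process_extraction/output/2026-2-18/npdes",
    facilities_csv="wwtp_process_extraction/data/test_set_npdes_manual.csv",
)
run_pipeline(commands, dry_run=True)
```

Calling run_pipeline(commands, dry_run=False) from the repository root would execute the steps in order and stop at the first failure.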

Known issues and limitations

  • When first running permit_scrape.py, a timeout error may appear. Rerun the script until it successfully opens ChromeDriver, and delete the "MM-DD-YYYY" folder before each rerun.
  • permit_scrape.py is slow at two distinct points:
    1. After the region is selected
    2. When the "ALL" display range is selected
  • ...
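
The rerun-after-timeout workaround described in the first item can be automated. The sketch below is one possible approach under stated assumptions: run_with_retries is a hypothetical helper, the scraper is represented by a stand-in function, and the built-in TimeoutError is used as a placeholder for whatever exception the scraper actually raises. The dated folder is deleted after every failed attempt so the next attempt starts clean.

```python
import shutil
import tempfile
import time
from pathlib import Path


def run_with_retries(action, output_dir, max_attempts=5, delay_s=0.0,
                     retry_on=(TimeoutError,)):
    """Call action(); on a retryable error, delete output_dir and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on:
            # Remove the partially written run folder before the next attempt.
            shutil.rmtree(output_dir, ignore_errors=True)
            if attempt == max_attempts:
                raise
            time.sleep(delay_s)


# Demo with a stand-in for the scraper that times out twice, then succeeds.
calls = {"n": 0}
demo_dir = Path(tempfile.mkdtemp()) / "demo-run-folder"


def flaky_scrape():
    calls["n"] += 1
    demo_dir.mkdir(parents=True, exist_ok=True)
    if calls["n"] < 3:
        raise TimeoutError("simulated browser start-up timeout")
    return "ok"


result = run_with_retries(flaky_scrape, demo_dir)
print(result, calls["n"])  # ok 3
```

In real use, action would wrap the call that launches the scraper, and retry_on would be set to the exception type that the browser driver raises on timeout.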

Contact

Constance Rouffet - rouffetc@stanford.edu

Ashley Ramirez - ashlecr3@uci.edu

Daly Wettermark - dalyw@stanford.edu

Fletcher Chapin - fchapin@stanford.edu

Acknowledgements

This work is funded in part by the Stanford SURGE program and the National Alliance for Water Innovation.
