Homebrew Python is "externally managed" (PEP 668), so install dependencies in a project virtual environment.
```bash
cd /path/to/ExternalLinkScraper
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install scrapy
python -m pip install scrapy-playwright
python -m playwright install
```

If `python` is aliased to `/usr/bin/python3`, either remove the alias or call the venv Python directly:

```bash
.venv/bin/python -m pip install scrapy
```

Use the helper script to prompt for URL, click-test, format, and modal summary options:
```bash
cd ./ExternalLinkScraper
chmod +x run_crawl.sh
./run_crawl.sh
```

Outputs are written to `./output/` as `{datetime}-links.{format}`.
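The `{datetime}-links.{format}` naming can be reproduced in a few lines of Python. This is only an illustration of the pattern, not the helper script's actual code, and the exact timestamp layout is an assumption:

```python
from datetime import datetime
from pathlib import Path

def output_path(fmt: str, outdir: str = "output") -> Path:
    """Build a filename matching the {datetime}-links.{format} pattern.

    The timestamp format here is an assumption; run_crawl.sh may differ.
    """
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return Path(outdir) / f"{stamp}-links.{fmt}"

# e.g. output/2025-01-01_12-00-00-links.csv
print(output_path("csv"))
```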
The spider crawls a start URL you provide and emits every link it finds.
It only follows links within the start URL's domain. Links to
cdn.prod.website-files.com are treated as internal (so they do not trigger the
external-link modal logic).
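The domain-scoping rule described above can be sketched as follows. This is a hypothetical helper, not the spider's actual code: the allowlist name, the exact-host comparison, and the treatment of relative links are assumptions.

```python
from urllib.parse import urlparse

# Hosts treated as internal even though they differ from the start domain
# (assumption: mirrors the cdn.prod.website-files.com rule described above).
INTERNAL_HOSTS = {"cdn.prod.website-files.com"}

def is_internal(link: str, start_url: str) -> bool:
    """Return True if `link` should be treated as internal to the crawl."""
    start_host = urlparse(start_url).netloc.lower()
    host = urlparse(link).netloc.lower()
    # Relative links have no host component and are always internal.
    if not host:
        return True
    return host == start_host or host in INTERNAL_HOSTS
```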
To run the spider directly:

```bash
cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/ -O links.json
```

Enable click-test mode to verify that modal links actually open the modal:
```bash
cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/ -a click_test=1 -O links.json
```

To skip the modal summary entry (useful for CSV exports):
```bash
cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/ -a include_modal_info=0 -a click_test=1 -O links.csv
```

To crawl every page from a sitemap (including nested sitemaps), pass the sitemap URL as the start URL:
```bash
cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/sitemap.xml -a click_test=1 -a include_modal_info=0 -O links.csv
```
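After a crawl finishes, the JSON export can be inspected with a short script. This sketch assumes the export is a JSON array of objects with a `url` field, which is a guess about the spider's schema; adjust the field name to match your actual output:

```python
import json
from collections import Counter
from urllib.parse import urlparse

def links_per_host(path: str) -> Counter:
    """Count exported links per host.

    Assumes a JSON array of objects each carrying a 'url' field
    (hypothetical; check the spider's real item schema).
    """
    with open(path) as f:
        items = json.load(f)
    return Counter(urlparse(i["url"]).netloc for i in items if "url" in i)
```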