ExternalLinkScraper

Setup Instructions

Homebrew Python is "externally managed" (PEP 668), so install dependencies in a project virtual environment.

cd /path/to/ExternalLinkScraper
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install scrapy
python -m pip install scrapy-playwright
python -m playwright install

If python is a shell alias for /usr/bin/python3, the alias takes precedence over the activated venv. Either remove the alias or call the venv Python directly:

.venv/bin/python -m pip install scrapy
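
To confirm which interpreter is actually in use, a small check like the one below may help. It is a minimal sketch, not part of the repository: it relies on the fact that inside a virtual environment sys.prefix differs from sys.base_prefix, and the function name in_virtualenv is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    # Print the interpreter path so an alias to /usr/bin/python3 is easy to spot.
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtualenv())
```

Saved under a name of your choice (for example check_env.py, a hypothetical filename), it can be run once as python check_env.py and once as .venv/bin/python check_env.py to compare the two results.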

Interactive Run

The helper script prompts for the start URL, click-test mode, output format, and modal summary option, then runs the crawl:

cd ./ExternalLinkScraper
chmod +x run_crawl.sh
./run_crawl.sh

Outputs are written to ./output/ as {datetime}-links.{format}.
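
The naming pattern above can be sketched as follows. This is an illustration only: the README does not state the timestamp layout that run_crawl.sh uses, so the strftime format here is an assumption, and output_path is a hypothetical helper name.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def output_path(fmt: str, now: Optional[datetime] = None, out_dir: str = "output") -> Path:
    """Build a path of the form <out_dir>/<datetime>-links.<fmt>."""
    # ASSUMPTION: the timestamp layout is a guess; the helper script
    # may format the datetime differently.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return Path(out_dir) / f"{stamp}-links.{fmt}"


if __name__ == "__main__":
    print(output_path("json"))
    print(output_path("csv"))
```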

Run the Scraper

The spider crawls a start URL you provide and emits every link it finds. It only follows links within the start URL's domain. Links to cdn.prod.website-files.com are treated as internal (so they do not trigger the external-link modal logic).

cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/ -O links.json
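
The internal/external rule described above could look roughly like this. It is a sketch under stated assumptions, not the spider's actual code: it compares hostnames and ignores a leading "www.", which may be stricter or looser than what the spider does, and is_internal and ALWAYS_INTERNAL are hypothetical names.

```python
from urllib.parse import urljoin, urlparse

# Hosts treated as internal even though they differ from the start URL's host.
ALWAYS_INTERNAL = {"cdn.prod.website-files.com"}


def _host(url: str) -> str:
    """Return the lowercase hostname of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    # ASSUMPTION: www.example.com and example.com count as the same site.
    return host[4:] if host.startswith("www.") else host


def is_internal(link: str, start_url: str) -> bool:
    """Return True when a link should be followed as part of the crawled site."""
    # Resolve relative links such as "/about" against the start URL first.
    host = _host(urljoin(start_url, link))
    return host in ALWAYS_INTERNAL or host == _host(start_url)


if __name__ == "__main__":
    start = "https://www.fiveriversbank.com/"
    for link in ("/about", "https://cdn.prod.website-files.com/a.png", "https://example.com/"):
        print(link, is_internal(link, start))
```

Links without a hostname after resolution, such as mailto: addresses, fall through to the external branch in this sketch.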

Enable click-test mode to verify that modal links actually open the modal:

cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/ -a click_test=1 -O links.json

To skip the modal summary entry (useful for CSV exports), pass include_modal_info=0. This example also keeps click-test mode on:

cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/ -a include_modal_info=0 -a click_test=1 -O links.csv
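
Scrapy passes every -a argument to the spider as a string, so a value such as click_test=1 or include_modal_info=0 has to be converted before it can be used as a boolean. One way to do that is sketched below. The helper name as_flag and the set of accepted spellings are assumptions, not taken from the spider.

```python
from typing import Optional


def as_flag(value: Optional[str], default: bool = False) -> bool:
    """Interpret a Scrapy -a argument ("1", "0", "true", ...) as a boolean."""
    if value is None:
        # The argument was not passed on the command line.
        return default
    # ASSUMPTION: these spellings count as "on"; anything else is "off".
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    print(as_flag("1"))                  # click_test=1
    print(as_flag("0"))                  # include_modal_info=0
    print(as_flag(None, default=True))   # argument omitted
```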

To crawl every page from a sitemap (including nested sitemaps), pass the sitemap URL as the start URL:

cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/sitemap.xml -a click_test=1 -a include_modal_info=0 -O links.csv
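
Nested sitemaps follow the sitemaps.org format: a sitemapindex document lists further sitemaps, and a urlset document lists pages. The recursion can be sketched with the standard library alone. This is an illustration of the idea, not the spider's implementation. iter_page_urls is a hypothetical name, and fetch is a caller-supplied function that returns the XML text for a URL.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_page_urls(sitemap_url: str, fetch: Callable[[str], str]) -> Iterator[str]:
    """Yield page URLs from a sitemap, descending into nested sitemaps."""
    root = ET.fromstring(fetch(sitemap_url))
    is_index = root.tag == f"{SITEMAP_NS}sitemapindex"
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if not url:
            continue
        if is_index:
            # Each <loc> in an index points at another sitemap.
            yield from iter_page_urls(url, fetch)
        else:
            # Each <loc> in a urlset is a page to crawl.
            yield url


if __name__ == "__main__":
    ns = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
    docs = {
        "index.xml": f"<sitemapindex {ns}><sitemap><loc>pages.xml</loc></sitemap></sitemapindex>",
        "pages.xml": f"<urlset {ns}><url><loc>https://example.com/a</loc></url></urlset>",
    }
    print(list(iter_page_urls("index.xml", docs.__getitem__)))
```

Passing fetch as a parameter keeps the sketch free of network calls. In practice it could wrap any HTTP client.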
