skyrocket-digital/ExternalLinkScraper

ExternalLinkScraper

Setup Instructions

Homebrew's Python is marked "externally managed" (PEP 668), so pip refuses to install packages into it directly. Install the dependencies in a project virtual environment instead.

cd /path/to/ExternalLinkScraper
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install scrapy
python -m pip install scrapy-playwright
python -m playwright install

If your shell aliases python to /usr/bin/python3, the alias takes precedence over the activated venv. Either remove the alias or call the venv Python directly:

.venv/bin/python -m pip install scrapy
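
To confirm that commands are really running inside the virtual environment (and not through an alias to the system interpreter), a quick check like the following can help. This is a minimal sketch using only the standard library; the function name in_virtualenv is illustrative and not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix points at the interpreter the venv was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtualenv())
```

Saved as, say, check_venv.py (a hypothetical file name) and run with .venv/bin/python check_venv.py, it should report the interpreter under .venv/bin.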

Interactive Run

The helper script prompts for the start URL, click-test mode, output format, and whether to include the modal summary:

cd ./ExternalLinkScraper
chmod +x run_crawl.sh
./run_crawl.sh

Outputs are written to ./output/ as {datetime}-links.{format}.
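
The naming pattern above can be sketched in Python as follows. The exact timestamp layout used by run_crawl.sh is not documented here, so the %Y%m%d-%H%M%S format and the output_path helper are assumptions made for illustration only.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def output_path(fmt: str, now: Optional[datetime] = None, out_dir: str = "output") -> Path:
    """Build a path shaped like ./output/{datetime}-links.{format}."""
    # Assumed timestamp layout; the helper script may use a different one.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return Path(out_dir) / f"{stamp}-links.{fmt}"


if __name__ == "__main__":
    print(output_path("json"))
    print(output_path("csv"))
```

Passing an explicit now value makes the result deterministic, which is convenient when comparing runs.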

Run the Scraper

The spider crawls a start URL you provide and emits every link it finds. It only follows links within the start URL's domain. Links to cdn.prod.website-files.com are treated as internal (so they do not trigger the external-link modal logic).

cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/ -O links.json
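
The internal/external rule described above can be approximated with a small classifier. This is a sketch under stated assumptions, not the spider's actual code: the is_internal function, the INTERNAL_EXTRA_HOSTS set, and the choice to ignore a leading "www." when comparing hosts are all illustrative.

```python
from urllib.parse import urljoin, urlparse

# Hosts treated as internal even though they differ from the start domain.
INTERNAL_EXTRA_HOSTS = {"cdn.prod.website-files.com"}


def _host(url: str) -> str:
    """Lower-cased hostname with a leading 'www.' dropped (an assumption)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_internal(link: str, start_url: str) -> bool:
    """Classify a link found on a page as internal (True) or external (False)."""
    # Resolve relative links such as "/about" against the start URL first.
    host = _host(urljoin(start_url, link))
    if host in INTERNAL_EXTRA_HOSTS:
        return True
    return host != "" and host == _host(start_url)


if __name__ == "__main__":
    start = "https://www.fiveriversbank.com/"
    for link in ("/about", "https://cdn.prod.website-files.com/a.png", "https://example.com/"):
        print(link, "->", "internal" if is_internal(link, start) else "external")
```

Whether subdomains of the start domain should count as internal is not stated in this README; the sketch treats them as external.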

Enable click-test mode to verify that modal links actually open the modal:

cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/ -a click_test=1 -O links.json

To skip the modal summary entry (useful for CSV exports):

cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/ -a include_modal_info=0 -a click_test=1 -O links.csv
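
When the crawl is launched from another script, the command-line variants shown so far can be assembled programmatically. The sketch below only uses the spider arguments that appear in this README (start_url, click_test, include_modal_info) and Scrapy's -O output flag; the build_crawl_args helper itself is hypothetical.

```python
import shlex
from typing import List


def build_crawl_args(
    start_url: str,
    output: str,
    click_test: bool = False,
    include_modal_info: bool = True,
    scrapy_bin: str = "../.venv/bin/scrapy",
) -> List[str]:
    """Return the argv list for one 'scrapy crawl links' invocation."""
    args = [scrapy_bin, "crawl", "links", "-a", f"start_url={start_url}"]
    if click_test:
        args += ["-a", "click_test=1"]
    if not include_modal_info:
        args += ["-a", "include_modal_info=0"]
    # -O overwrites the output file; the extension selects the export format.
    args += ["-O", output]
    return args


if __name__ == "__main__":
    argv = build_crawl_args(
        "https://www.fiveriversbank.com/",
        "links.csv",
        click_test=True,
        include_modal_info=False,
    )
    print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, cwd="e_link_scraper") so that no shell quoting is needed.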

To crawl every page from a sitemap (including nested sitemaps), pass the sitemap URL as the start URL:

cd ./e_link_scraper
../.venv/bin/scrapy crawl links -a start_url=https://www.fiveriversbank.com/sitemap.xml -a click_test=1 -a include_modal_info=0 -O links.csv
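
For reference, nested sitemaps follow the sitemaps.org format: a sitemapindex document lists further sitemap files, while a urlset document lists page URLs. The sketch below shows one way to tell the two apart with the standard library. It illustrates the format only and is not taken from the spider; parse_sitemap is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> Tuple[str, List[str]]:
    """Return ('index', child sitemap URLs) or ('urlset', page URLs)."""
    root = ET.fromstring(xml_text)
    # root.tag looks like '{namespace}sitemapindex'; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        kind, locs = "index", root.findall("sm:sitemap/sm:loc", NS)
    else:
        kind, locs = "urlset", root.findall("sm:url/sm:loc", NS)
    return kind, [loc.text.strip() for loc in locs if loc.text]


if __name__ == "__main__":
    sample = (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/pages.xml</loc></sitemap>"
        "</sitemapindex>"
    )
    print(parse_sitemap(sample))
```

A crawler would fetch each URL returned for an index and parse it again, repeating until only urlset documents remain.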
