A comprehensive, production-ready web crawler built in Rust to analyze and improve website quality.
RustCrawler provides three types of website analysis and multiple output formats.

The SEO crawler analyzes search engine optimization aspects (a simplified example check is sketched after the list):
- Title tag presence and length
- Meta description tags
- H1 heading tags
- Canonical URL tags
- Robots meta tags
- Internal link validation (configurable limit)
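Since an HTML parser is only listed as a future improvement at the end of this README, these checks are presumably string based. As a rough, hypothetical sketch of what one such check could look like (the real logic lives in `src/crawlers/seo.rs` and may differ):

```rust
/// Hypothetical sketch of a string-based title check; the actual
/// implementation in src/crawlers/seo.rs may differ. Assumes a lowercase
/// <title> tag, which is one reason an HTML parser is a planned improvement.
fn check_title(html: &str) -> Option<String> {
    let start = html.find("<title>")? + "<title>".len();
    let end = start + html[start..].find("</title>")?;
    let title = html[start..end].trim().to_string();
    // Common SEO guidance: keep titles non-empty and under roughly 60 characters.
    if title.is_empty() || title.len() > 60 {
        eprintln!("title present but outside the recommended length ({} chars)", title.len());
    }
    Some(title)
}
```

For example, `check_title("<title>Home</title>")` returns `Some("Home".to_string())`.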
The Performance crawler evaluates website performance metrics (a measurement sketch follows the list):
- Response time measurement
- Page size analysis
- External resource counting (scripts, stylesheets)
- Compression detection (Brotli, Gzip, Deflate)
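Response time and compression can both be read straight off the HTTP response. Below is a minimal sketch using the `reqwest` and `tokio` dependencies listed further down; the function name and shape are illustrative, not the project's actual API:

```rust
use std::time::Instant;

// Minimal sketch of the kind of measurement the Performance crawler performs;
// illustrative only, not the project's actual API.
async fn measure(url: &str) -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    let start = Instant::now();
    let response = client.get(url).send().await?;
    // Compression shows up in the Content-Encoding header (br, gzip, deflate).
    // Note: depending on which reqwest features are enabled, the body may
    // already be transparently decompressed and this header stripped.
    let encoding = response
        .headers()
        .get(reqwest::header::CONTENT_ENCODING)
        .and_then(|value| value.to_str().ok())
        .unwrap_or("none")
        .to_string();
    let body = response.bytes().await?;
    println!(
        "{url}: {} ms, {} bytes, encoding: {encoding}",
        start.elapsed().as_millis(),
        body.len()
    );
    Ok(())
}

#[tokio::main] // assumes tokio's macros/runtime features are enabled
async fn main() -> Result<(), reqwest::Error> {
    measure("https://example.com").await
}
```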
The accessibility (A11Y) crawler checks web accessibility standards (an example check follows the list):
- HTML lang attribute
- Image alt attributes
- ARIA landmarks and attributes
- Semantic HTML5 tags
- Form label associations
- Skip navigation links
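As with the other crawlers, here is a hypothetical, simplified illustration of one check, counting `<img>` tags without an `alt` attribute; the project's real check in `src/crawlers/a11y.rs` may differ:

```rust
// Hypothetical, simplified accessibility check: count <img> tags lacking an
// alt attribute. String based, since an HTML parser is only a planned improvement.
fn images_missing_alt(html: &str) -> usize {
    html.match_indices("<img")
        .filter(|&(start, _)| {
            // Look only inside this tag, i.e. up to the closing '>'.
            let rest = &html[start..];
            let tag_end = rest.find('>').unwrap_or(rest.len());
            !rest[..tag_end].contains("alt=")
        })
        .count()
}
```

For example, `images_missing_alt(r#"<img src="a.png"><img src="b.png" alt="B">"#)` returns `1`.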
Reports can be produced in three formats:
- Terminal: Color-coded, human-readable output
- JSON: Machine-readable format for integration
- HTML: Styled report for sharing
The only tools required on the host machine are:
- Docker
- Make
Note: Rust and Cargo are NOT required on your host machine. They are included in the Docker container.
First, build the Docker image with the latest Rust version:
```bash
make install
```

This command downloads and sets up a Docker container with the latest version of Rust.
```bash
make run          # Run in debug mode
make run-release  # Run in release mode
```

Either command launches an interactive prompt to select the URL and the crawlers to run.
```bash
# Analyze a URL with all crawlers
docker run --rm rustcrawler cargo run -- --url https://example.com --all

# Run specific crawlers
docker run --rm rustcrawler cargo run -- --url https://example.com --seo --performance

# Generate JSON report
docker run --rm rustcrawler cargo run -- --url https://example.com --all --format json --output report.json

# Generate HTML report
docker run --rm rustcrawler cargo run -- --url https://example.com --all --format html --output report.html

# Use custom configuration
docker run --rm rustcrawler cargo run -- --url https://example.com --all --config config.json

# Override settings
docker run --rm rustcrawler cargo run -- --url https://example.com --all --timeout 60 --max-links 20
```

The supported flags (a hypothetical clap sketch follows the list):
- `--url <URL>`: URL to analyze
- `--seo`: Run SEO crawler
- `--performance`: Run Performance crawler
- `--a11y`: Run A11Y crawler
- `--all`: Run all crawlers
- `--format <terminal|json|html>`: Output format (default: terminal)
- `--output <FILE>`: Output file for JSON/HTML
- `--config <FILE>`: Configuration file path
- `--timeout <SECONDS>`: Request timeout
- `--max-links <N>`: Maximum internal links to check
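A hypothetical sketch of how these flags could be declared with clap's derive API (requires clap's `derive` feature); the project's actual definitions live in `src/cli.rs` and may differ:

```rust
use clap::Parser;

/// Hypothetical sketch of the CLI definition; the real one is in src/cli.rs.
#[derive(Parser, Debug)]
#[command(name = "rustcrawler")]
struct Cli {
    /// URL to analyze
    #[arg(long)]
    url: Option<String>,
    /// Run the SEO crawler
    #[arg(long)]
    seo: bool,
    /// Run all crawlers
    #[arg(long)]
    all: bool,
    /// Output format: terminal, json, or html
    #[arg(long, default_value = "terminal")]
    format: String,
    /// Maximum internal links to check
    #[arg(long)]
    max_links: Option<usize>,
    // --performance, --a11y, --output, --config, and --timeout would follow the same pattern.
}

fn main() {
    let cli = Cli::parse();
    println!("{cli:?}");
}
```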
Create a `config.json`:

```json
{
  "timeout_secs": 30,
  "max_links_to_check": 10,
  "user_agent": "RustCrawler/0.1.0",
  "follow_redirects": true,
  "max_redirects": 5
}
```
All commands run inside the Docker container, so you don't need Rust installed locally.
```bash
make build          # Build in debug mode
make build-release  # Build in release mode
make test           # Run all tests (17 tests)
make test-verbose   # Run tests with verbose output
make format         # Format code with rustfmt
make format-check   # Check formatting without modifying files
make lint           # Run clippy linter
make check          # Check if code compiles
make shell          # Open a shell in the Docker container
make clean          # Remove build artifacts and Docker image
make help           # Display all available targets
```

The project uses the following main dependencies:
- `reqwest` - HTTP client for making requests
- `url` - URL parsing and validation
- `colored` - Terminal color output
- `tokio` - Async runtime
- `thiserror` - Custom error types
- `serde` / `serde_json` - Serialization for JSON output
- `clap` - Command-line argument parsing
- `chrono` - Date/time handling for reports
The project follows Rust best practices with a modular architecture:
```
RustCrawler/
├── src/
│   ├── main.rs            # Application entry point with CLI
│   ├── lib.rs             # Library root with public exports
│   ├── cli.rs             # CLI argument definitions
│   ├── config.rs          # Configuration management
│   ├── error.rs           # Custom error types
│   ├── client.rs          # HTTP client wrapper
│   ├── models.rs          # Data models and validation
│   ├── output.rs          # JSON/HTML report generation
│   ├── utils.rs           # Utility functions for I/O and display
│   └── crawlers/
│       ├── mod.rs         # Crawler trait and common functions
│       ├── seo.rs         # SEO crawler implementation
│       ├── performance.rs # Performance crawler implementation
│       └── a11y.rs        # Accessibility crawler implementation
├── Cargo.toml             # Rust dependencies and project configuration
├── Dockerfile             # Docker container setup
├── Makefile               # Build and run commands
├── ARCHITECTURE.md        # Detailed architecture documentation
└── README.md              # This file
```
- Modular Design: Each crawler is implemented in its own module with the `Crawler` trait
- Separation of Concerns: HTTP client, models, configuration, and utilities are separate modules
- Error Handling: Custom error types using `thiserror` for better error messages
- Configuration: Externalized configuration with JSON file support
- CLI + Interactive: Supports both command-line and interactive modes
- Multiple Outputs: Terminal, JSON, and HTML report formats
- Testable: 17 unit tests covering all major functionality
- Extensible: Easy to add new crawlers by implementing the `Crawler` trait (see the sketch after this list)
- Type Safety: Strong typing with custom models for data structures
- Library + Binary: Can be used as a library or standalone application
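The `Crawler` trait's exact signature is defined in `src/crawlers/mod.rs`. Purely as an illustration of the extension point (the trait and crawler shown here are hypothetical, not the project's actual definitions), a new crawler might plug in like this:

```rust
// Hypothetical illustration of the extension point; the real `Crawler` trait
// in src/crawlers/mod.rs has its own signature and result types.
trait Crawler {
    fn name(&self) -> &str;
    fn analyze(&self, url: &str, html: &str) -> Vec<String>;
}

struct SecurityHeadersCrawler;

impl Crawler for SecurityHeadersCrawler {
    fn name(&self) -> &str {
        "security-headers"
    }

    fn analyze(&self, url: &str, _html: &str) -> Vec<String> {
        // A real implementation would inspect response headers via the shared
        // HTTP client; this stub only reports what it would check.
        vec![format!("Would check security headers for {url}")]
    }
}

fn main() {
    let crawler = SecurityHeadersCrawler;
    for finding in crawler.analyze("https://example.com", "") {
        println!("[{}] {finding}", crawler.name());
    }
}
```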
When contributing to this project:
- Ensure your code builds with `make build`
- Run tests with `make test` (17 tests should pass)
- Format code with `make format`
- Check for linting issues with `make lint`
- Follow Rust naming conventions and best practices
- Add tests for new functionality
The following features are already implemented:
- ✅ Custom error types with `thiserror`
- ✅ Configuration management (JSON file support)
- ✅ CLI with `clap` for non-interactive use
- ✅ JSON and HTML output formats
- ✅ Configurable timeouts and limits
- ✅ User-agent customization
- ✅ Redirect policy configuration
- ✅ 17 comprehensive unit tests
Planned improvements, not yet implemented (a retry sketch follows the list):
- Async/await for parallel crawling
- HTML parser (`scraper` crate) for more accurate analysis
- Integration tests with mock servers
- Sitemap crawling
- Rate limiting
- Retry logic with exponential backoff
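None of these exist in the current code base. For the last item, a minimal sketch of what retry with exponential backoff could look like on top of `reqwest` (illustrative only; assumes tokio's `time` feature is enabled):

```rust
use std::time::Duration;

// Illustrative only: retry with exponential backoff is a planned improvement,
// not something the current code base implements.
async fn get_with_retry(
    client: &reqwest::Client,
    url: &str,
    max_attempts: u32,
) -> Result<reqwest::Response, reqwest::Error> {
    let mut attempt = 0;
    loop {
        match client.get(url).send().await {
            Ok(response) => return Ok(response),
            Err(_) if attempt + 1 < max_attempts => {
                // Back off 1s, 2s, 4s, ... between attempts.
                tokio::time::sleep(Duration::from_secs(1u64 << attempt)).await;
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```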