ScrapAPI is a FastAPI-based web service for scraping and analyzing web content. It uses Celery for task management and Redis as the message broker. Users submit URLs for scraping and receive structured data such as the article title, authors, publish date, text, keywords, and summary.
- Web Scraping: Extract structured data from web articles using newspaper4k.
- Task Management: Asynchronous task processing with Celery.
- API Endpoints: RESTful API for submitting scrape tasks and retrieving results.
- Redis Integration: Redis is used as a message broker for Celery tasks.
- Flower Integration: Monitor Celery tasks with Flower.
- Proxy Support: Configurable HTTP/HTTPS proxies for scraping.
- Python 3.12
- Docker and Docker Compose
- Clone the repository:

  ```bash
  git clone https://github.com/legard/scrapapi.git
  cd scrapapi
  ```

- Build and start the services:

  ```bash
  docker-compose up --build
  ```
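For orientation, a stack like this is typically wired together in docker-compose.yml as four services: the API, the Celery worker, Redis, and Flower. The sketch below is an assumption about that layout — service names, the build context, and the Flower invocation may differ from the repository's actual file; only the ports 8000/5555, the Redis broker, and the worker command are taken from this README:

```yaml
services:
  api:
    build: .
    ports:
      - "8000:8000"            # FastAPI application
    environment:
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - redis
  worker:
    build: .
    command: celery -A app.core.celery.celery_app worker -Q main-queue --loglevel=info
    environment:
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - redis
  flower:
    build: .
    command: celery -A app.core.celery.celery_app flower
    ports:
      - "5555:5555"            # Flower monitoring UI
    depends_on:
      - redis
  redis:
    image: redis:7
```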
- Access the API:
  - The FastAPI application will be available at http://localhost:8000.
  - The Flower monitoring tool will be available at http://localhost:5555.
- URL: `/api/v1/scrape`
- Method: `POST`
- Request Body:

  ```json
  {
    "url": "https://example.com/article"
  }
  ```

- Response:

  ```json
  {
    "task_id": "task-id",
    "status": "PENDING"
  }
  ```
- URL: `/api/v1/tasks/{task_id}`
- Method: `GET`
- Response:

  ```json
  {
    "task_id": "task-id",
    "status": "SUCCESS",
    "result": {
      "title": "Article Title",
      "authors": ["Author 1", "Author 2"],
      "publish_date": "2023-10-01T00:00:00",
      "text": "Article content...",
      "keywords": ["keyword1", "keyword2"],
      "summary": "Article summary...",
      "url": "https://example.com/article"
    }
  }
  ```
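The two endpoints above can be exercised from Python with only the standard library. This is a minimal client sketch: it assumes the service is reachable at whatever base URL you pass in, and the helper names `submit_scrape` and `get_task` are illustrative, not part of ScrapAPI.

```python
import json
import urllib.request


def submit_scrape(base_url: str, article_url: str) -> dict:
    """POST an article URL to /api/v1/scrape and return the task info."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/scrape",
        data=json.dumps({"url": article_url}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def get_task(base_url: str, task_id: str) -> dict:
    """GET the status (and, once finished, the result) of a task."""
    with urllib.request.urlopen(f"{base_url}/api/v1/tasks/{task_id}") as resp:
        return json.load(resp)


if __name__ == "__main__":
    task = submit_scrape("http://localhost:8000", "https://example.com/article")
    print(task["task_id"], task["status"])
```

In practice you would poll `get_task` until `status` is no longer `PENDING`, then read `result`.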
Configuration is managed via environment variables or the `Settings` class in `app/core/config.py`. Key settings include:
- REDIS_URL: URL for the Redis instance.
- HTTP_PROXY: HTTP proxy for scraping (optional).
- HTTPS_PROXY: HTTPS proxy for scraping (optional).
To use proxies, set the HTTP_PROXY and HTTPS_PROXY environment variables for the worker service in docker-compose.yml:

```yaml
environment:
  - HTTP_PROXY=${HTTP_PROXY}
  - HTTPS_PROXY=${HTTPS_PROXY}
```
To set up the development environment:

- Install dependencies:

  ```bash
  uv sync
  ```

- Run the application:

  ```bash
  uvicorn app.main:app --reload
  ```

- Run a Celery worker:

  ```bash
  celery -A app.core.celery.celery_app worker -Q main-queue --loglevel=info
  ```
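Conceptually, the worker above implements a submit-then-poll pattern: the API enqueues a task, returns a task ID immediately, and the worker fills in the result later. The stdlib-only sketch below illustrates that pattern in miniature — it is not ScrapAPI's actual Celery code, and `TaskBroker` and its methods are hypothetical names.

```python
# Stdlib illustration of the submit-then-poll pattern Celery provides.
# TaskBroker is a hypothetical stand-in, not part of ScrapAPI.
import uuid
from concurrent.futures import Future, ThreadPoolExecutor


class TaskBroker:
    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._tasks: dict[str, Future] = {}

    def submit(self, fn, *args) -> dict:
        """Enqueue work and return immediately, like POST /api/v1/scrape."""
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = self._pool.submit(fn, *args)
        return {"task_id": task_id, "status": "PENDING"}

    def status(self, task_id: str) -> dict:
        """Report task state, like GET /api/v1/tasks/{task_id}."""
        fut = self._tasks[task_id]
        if not fut.done():
            return {"task_id": task_id, "status": "PENDING"}
        return {"task_id": task_id, "status": "SUCCESS", "result": fut.result()}
```

With Celery, the thread pool is replaced by worker processes consuming from `main-queue` via Redis, and task state is tracked in the result backend rather than in memory.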
Contributions are welcome! Please open an issue or submit a pull request.
This project is licensed under the MIT License.