This repository was archived by the owner on Apr 24, 2021. It is now read-only.

Distributed Web Crawler

High-performance web crawler

Task (in Russian)

Features

  • asynchronous
  • search REST API
  • throttling (by domain)
  • distributed
  • auth

Usage

  1. Start docker-compose
    docker-compose up -d
  2. Signup
    curl -X POST -d "email=oleg@mail.net&name=Oleg&password=qwerty123" http://localhost:8080/api/v1/signup
  3. Login
    curl -X POST -d "email=oleg@mail.net&password=qwerty123" http://localhost:8080/api/v1/login
  4. Add crawl task
    curl -X POST -H "X-Token: <Token ID>" -d "domain=docs.python.org&https=1" http://localhost:8080/api/v1/index
  5. Get result
    curl "http://localhost:8080/api/v1/search?q=asyncio&limit=20&offset=5"
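
The same calls can be scripted. The sketch below uses only the Python standard library and mirrors the curl commands above: endpoints, field names and the `X-Token` header come from those commands, while the helper names (`form_request`, `search_url`, `send`) are hypothetical. The README does not document the response format, so reading the token out of the login response is left to the caller.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080/api/v1"  # host and port taken from the curl examples


def form_request(path, fields, token=None):
    """Build a POST request with a form-encoded body, like `curl -X POST -d ...`."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(f"{BASE_URL}/{path}", data=body, method="POST")
    if token is not None:
        request.add_header("X-Token", token)
    return request


def search_url(query, limit=20, offset=0):
    """Build the GET URL for the search endpoint with an escaped query string."""
    params = urllib.parse.urlencode({"q": query, "limit": limit, "offset": offset})
    return f"{BASE_URL}/search?{params}"


def send(request_or_url):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request_or_url, timeout=10) as response:
        return response.read().decode("utf-8")
```

With the services running, `send(form_request("login", {"email": "oleg@mail.net", "password": "qwerty123"}))` corresponds to step 3, and `send(search_url("asyncio", limit=20, offset=5))` to step 5. Building the query string with `urlencode` avoids the shell-quoting issue that a bare `&` causes on the command line.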

Architecture

                     +-------------+                      
+---------------+    |             |     +-----------------+
|               |<---|  Rest API   |<--->|                 |
|               |    |             |     |                 |
|               |    +-------------+     |                 |
|               |    +-------------+     |                 |
|               |    |             |     |                 |
|               |--->|   Crawler   |---->|  ElasticSearch  |
|               |    |             |     | (crawled pages) |
|   RabbitMQ    |    +-------------+     |                 |
|               |    +-------------+     |                 |
|               |    |             |     |                 |
|               |--->|   Crawler   |---->|                 |
|               |    |             |     +-----------------+
|               |    +-------------+     +---------------+ 
|               |    +-------------+     |               |
|               |    |             |     |  PostgreSQL   |
|               |<-->|    Auth     |<--->|  (user data)  |
+---------------+    |             |     |               |
                     +-------------+     +---------------+
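
One way to read the diagram: the REST API puts crawl tasks on the queue, several crawler processes consume them, and each crawler writes the pages it fetched to the search index. The sketch below imitates that flow in a single process and is only an illustration, not this project's code. An `asyncio.Queue` stands in for RabbitMQ, a plain dict stands in for ElasticSearch, and `crawler`, `run_pipeline` and `fake_fetch` are hypothetical names.

```python
import asyncio


async def crawler(name, tasks, index, fetch):
    """Consume URLs from the queue, fetch each one, and store the text in the index."""
    while True:
        url = await tasks.get()
        try:
            index[url] = {"crawler": name, "text": await fetch(url)}
        finally:
            tasks.task_done()  # lets tasks.join() know this item is finished


async def run_pipeline(urls, fetch, workers=2):
    """Queue the URLs (the REST API's role) and let several crawlers drain the queue."""
    tasks = asyncio.Queue()  # stand-in for RabbitMQ
    index = {}               # stand-in for ElasticSearch
    for url in urls:
        tasks.put_nowait(url)
    pool = [
        asyncio.create_task(crawler(f"crawler-{i}", tasks, index, fetch))
        for i in range(workers)
    ]
    await tasks.join()  # wait until every queued URL has been processed
    for task in pool:
        task.cancel()   # the workers loop forever, so stop them explicitly
    await asyncio.gather(*pool, return_exceptions=True)
    return index


async def fake_fetch(url):
    """Placeholder for a real HTTP download."""
    await asyncio.sleep(0)
    return f"contents of {url}"
```

Calling `asyncio.run(run_pipeline(["a", "b", "c"], fake_fetch))` returns a dict with one entry per URL. In the real deployment the queue and the index are separate services, which is what lets crawlers be added or removed independently of the API.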