The link_lizard from yogeshexplore

Important

For a production project, I would use celery, rabbitmq/kafka, dockerized producer/consumer to achieve if high scale is needed.
For the assignment purpose I have just used gevent and take care of thread-safety
If you want I can explain the project over call or you can run it once in the debug mode.

Dependency

python 3.10.2 ( You can pyenv to maintain multiple version of python in your system )
pipenv ( For virtual environment)

Setup

cd link_lizard
pipenv install ( I have created .venv directory so that the dependency would be installed in that directory only )
pipenv shell ( or other way to activate virtual env)
create .env and copy the content of .env.dist to .env. Please change environment variable accordingly

Run

Arguments

--url

The website url you want to crawl, example http://bbc.co.uk , http://example.com, http://docs.celeryq.dev, http://gevent.org Default is http://example.com ( It has no links, so please pass this)

--workers

Gevent workers, please do not put a high number, the target website can block you ( proxy and capcha is not implement yet) Also you can affect the real users of the website

Put --workers 1 if you want to debug it, monkey patch is only applied if workers are greater that 1. It helps in debugging

--output

How do you want to save the output Currently supporting file and mongodb ( For mongo db , please update the .env) Default file

--type

Might be required in future, if you want to add other kind of crawlers For now only --type website_links works Default is website_links

From your activated virtualenv please run python -m app.main --url 'http://bbc.co.uk' --workers 50 --output file --type website_links The result file would be present in output directory

Highlights of projects

Tried to used OOPS, SOLID and 12-factor app design
app.connection module - connection wrapper, can be used in any project. Generally I would ship this to python libs which the organisation can install and use
app.services ( store services ) - based on input, I inject the storage engine. It can be also be done via Factory method and dependency injection via configuration file These services are also reusable and not not dependent on anything
app_logging.py - Control complete logging of app and injected in the very beginning. The log files are stored in logs directory and rotated automatically.
proxy - Is not implemented yet, but the lrequest service can directly use it when request start failing
config.py - Handles environment (maily from os or .env )
setting.py - Control all the setting of app
main.py - Entry point
services.lrequest.lizard_request.py -> Handle all request and retries
website_links.py -> Basically use gevent and all other services to give the desired results
The project make sure there is no deadlock or thread-safety is considered.
All edge cases are considered ( http == https, remove fragment, what is queue is empty, no url requested twice)
Result is written into file or db in chunks

Todos

add return types and data types for services - Not needed for now , partially done
add auth make sure right people are accessing it - Not needed
use session in requests
monkey patch only required libs - Not needed
push error-urls ( 404, 401, 408 etc ) into separate output files, although they are already getting printed and written to log files
canonical host names. The use might want bbc.com also as it is canonical to bbc.co.uk in some cases
Ignore files ( urls end with .pdf. .mp3 etc ) - Not needed
Add S3 output - Not needed
Add MongoDB output - Done
Add Postgres output - Not needed
Setup Proxy Services ( Request Modify, Free Proxy, Paid Proxy) - Setup done , but not usable, implementation pending
Setup Captcha bypass - Not needed
Factory method and dependency injection - Partially done, rest not needed for now
No difference between http and https ( mark both as crawled ) - Done
Input url validation and modification - Done
Dockerize - Will do on Sunday if time, simple docker file only
README.md - Partially done

yogeshexplore / link_lizard Goto Github PK

link_lizard's Introduction

Important

Dependency

Setup

Run

Arguments

Highlights of projects

Todos

link_lizard's People

Contributors

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent