
my8100 / scrapyd-cluster-on-heroku

Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks. DEMO :point_right:

Home Page: https://scrapydweb.herokuapp.com/

License: GNU General Public License v3.0

Python 100.00%
scrapy scrapyd cluster heroku python scrapydweb logparser web-crawling web-scraping

scrapyd-cluster-on-heroku's Introduction

🔤 English | 🀄 简体中文

How to set up Scrapyd cluster on Heroku

Demo

scrapydweb.herokuapp.com

Network topology


Create accounts

  1. Heroku

Visit heroku.com to create a free account, with which you can create and run up to 5 apps.


  2. Redis Labs (optional)

Visit redislabs.com to create a free account, which provides 30 MB of storage that scrapy-redis can use for distributed crawling.


Deploy Heroku apps in the browser

  1. Visit my8100/scrapyd-cluster-on-heroku-scrapyd-app to deploy the Scrapyd app. (Don't forget to update the host, port and password of your Redis server in the form)
  2. Repeat step 1 to deploy up to 4 Scrapyd apps, assuming their names are svr-1, svr-2, svr-3 and svr-4 (a quick reachability check is sketched after this list)
  3. Visit my8100/scrapyd-cluster-on-heroku-scrapydweb-app-git to deploy the ScrapydWeb app named myscrapydweb
  4. (optional) Click the Reveal Config Vars button on dashboard.heroku.com/apps/myscrapydweb/settings to add more Scrapyd servers as needed, e.g. SCRAPYD_SERVER_2 as the KEY and svr-2.herokuapp.com:80#group2 as the VALUE.
  5. Visit myscrapydweb.herokuapp.com
  6. Jump to the Deploy and run distributed spiders section below and move on.
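
Before moving on, you can optionally confirm that each deployed Scrapyd app is reachable. A minimal sketch (not part of the repo), assuming the app names used above, that the apps are publicly accessible without HTTP auth, and that requests is installed (pip install requests):

# Poll Scrapyd's daemonstatus.json endpoint on each deployed app.
# The app names svr-1 ... svr-4 are the ones assumed above; adjust as needed.
import requests

for name in ('svr-1', 'svr-2', 'svr-3', 'svr-4'):
    url = 'https://%s.herokuapp.com/daemonstatus.json' % name
    try:
        resp = requests.get(url, timeout=10)
        print(name, resp.status_code, resp.json())  # expect {"status": "ok", ...}
    except requests.RequestException as exc:
        print(name, 'unreachable:', exc)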

Custom deployment


Install tools

  1. Git
  2. Heroku CLI
  3. Python client for Redis: Simply run the pip install redis command (a quick connection check is sketched below).
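
Optionally, verify that the Redis server created earlier is reachable from your machine. A minimal sketch (not part of the repo); replace the placeholders with your own Redis Labs credentials:

# Quick connection check with the redis client installed above.
import redis

r = redis.Redis(host='your-redis-host', port=12345, password='your-redis-password')  # placeholders
print(r.ping())  # prints True if the host, port and password are valid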

Download config files

Open a new terminal:

git clone https://github.com/my8100/scrapyd-cluster-on-heroku
cd scrapyd-cluster-on-heroku

Log in to Heroku

# Or run 'heroku login -i' to login with username/password
heroku login
# outputs:
# heroku: Press any key to open up the browser to login or q to exit:
# Opening browser to https://cli-auth.heroku.com/auth/browser/12345-abcde
# Logging in... done
# Logged in as [email protected]

Set up Scrapyd cluster

  1. New Git repo
cd scrapyd
git init
# explore and update the files if needed
git status
git add .
git commit -a -m "first commit"
git status
  2. Deploy Scrapyd app
heroku apps:create svr-1
heroku git:remote -a svr-1
git remote -v
git push heroku master
heroku logs --tail
# Press ctrl+c to stop logs outputting
# Visit https://svr-1.herokuapp.com
  3. Add environment variables

    • Timezone
    # python -c "import tzlocal; print(tzlocal.get_localzone())"
    heroku config:set TZ=US/Eastern
    # heroku config:get TZ
    
    • Redis account (optional, see settings.py in the scrapy_redis_demo_project.zip; a sketch of how these variables are consumed follows this list)
    heroku config:set REDIS_HOST=your-redis-host
    heroku config:set REDIS_PORT=your-redis-port
    heroku config:set REDIS_PASSWORD=your-redis-password
    
  4. Repeat steps 2 and 3 to get the remaining Scrapyd apps ready: svr-2, svr-3 and svr-4
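
For reference, here is a sketch of how the REDIS_* variables set above are typically consumed by a scrapy-redis project. This is illustrative only; the actual wiring lives in settings.py inside scrapy_redis_demo_project.zip:

# Illustrative settings.py fragment (assumption, not the repo's actual file).
import os

REDIS_HOST = os.environ.get('REDIS_HOST', '127.0.0.1')
REDIS_PORT = int(os.environ.get('REDIS_PORT', 6379))
REDIS_PASSWORD = os.environ.get('REDIS_PASSWORD', '')

# Standard scrapy-redis settings: shared scheduler, dupefilter and item pipeline.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 300}
REDIS_URL = 'redis://:%s@%s:%s' % (REDIS_PASSWORD, REDIS_HOST, REDIS_PORT)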

Set up ScrapydWeb app

  1. New Git repo
cd ..
cd scrapydweb
git init
# explore and update the files if needed
git status
git add .
git commit -a -m "first commit"
git status
  2. Deploy ScrapydWeb app
heroku apps:create myscrapydweb
heroku git:remote -a myscrapydweb
git remote -v
git push heroku master
  3. Add environment variables

    • Timezone
    heroku config:set TZ=US/Eastern
    
    • Scrapyd servers (see scrapydweb_settings_vN.py in the scrapydweb directory; the address format is illustrated after this list)
    heroku config:set SCRAPYD_SERVER_1=svr-1.herokuapp.com:80
    heroku config:set SCRAPYD_SERVER_2=svr-2.herokuapp.com:80#group1
    heroku config:set SCRAPYD_SERVER_3=svr-3.herokuapp.com:80#group1
    heroku config:set SCRAPYD_SERVER_4=svr-4.herokuapp.com:80#group2
    
  4. Visit myscrapydweb.herokuapp.com
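
As an illustration of the address format used above, each SCRAPYD_SERVER_N value is a host:port pair with an optional #group suffix. A hypothetical helper (not ScrapydWeb's actual code) that breaks such values into their parts:

# Hypothetical parser for the 'host:port#group' format (illustrative only).
import os

def parse_scrapyd_server(value):
    """Split 'host:port#group'; the '#group' suffix is optional."""
    address, _, group = value.partition('#')
    host, _, port = address.partition(':')
    return host, port or '6800', group or None  # 6800 is Scrapyd's default port

servers = [os.environ[k] for k in sorted(os.environ) if k.startswith('SCRAPYD_SERVER_')]
print([parse_scrapyd_server(s) for s in servers])
# e.g. [('svr-1.herokuapp.com', '80', None), ('svr-2.herokuapp.com', '80', 'group1'), ...]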

Deploy and run distributed spiders

  1. Simply upload the compressed file scrapy_redis_demo_project.zip which resides in the scrapyd-cluster-on-heroku directory
  2. Push seed URLs into mycrawler:start_urls to start crawling, then check out the scraped items:
In [1]: import redis  # pip install redis

In [2]: r = redis.Redis(host='your-redis-host', port=your-redis-port, password='your-redis-password')

In [3]: r.delete('mycrawler_redis:requests', 'mycrawler_redis:dupefilter', 'mycrawler_redis:items')
Out[3]: 0

In [4]: r.lpush('mycrawler:start_urls', 'http://books.toscrape.com', 'http://quotes.toscrape.com')
Out[4]: 2

# wait for a minute
In [5]: r.lrange('mycrawler_redis:items', 0, 1)
Out[5]:
[b'{"url": "http://quotes.toscrape.com/", "title": "Quotes to Scrape", "hostname": "d6cf94d5-324e-4def-a1ab-e7ee2aaca45a", "crawled": "2019-04-02 03:42:37", "spider": "mycrawler_redis"}',
 b'{"url": "http://books.toscrape.com/index.html", "title": "All products | Books to Scrape - Sandbox", "hostname": "d6cf94d5-324e-4def-a1ab-e7ee2aaca45a", "crawled": "2019-04-02 03:42:37", "spider": "mycrawler_redis"}']


Conclusion

  • Pros
  • Cons
    • Heroku apps are restarted (cycled) at least once per day, and any changes to the local filesystem are discarded, so you need an external database to persist data. Check out devcenter.heroku.com for more info.


scrapyd-cluster-on-heroku's Issues

[Question] Postgresql setup

I added the Heroku Postgres add-on, which creates the DATABASE_URL config var.

Once I restarted the ScrapydWeb server, I got this error:

2020-05-23T07:13:33.128467+00:00 heroku[web.1]: Starting process with command `scrapydweb`
2020-05-23T07:13:35.447284+00:00 app[web.1]: Traceback (most recent call last):
2020-05-23T07:13:35.447303+00:00 app[web.1]: File "/app/.heroku/python/bin/scrapydweb", line 5, in <module>
2020-05-23T07:13:35.447478+00:00 app[web.1]: from scrapydweb.run import main
2020-05-23T07:13:35.447489+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/__init__.py", line 14, in <module>
2020-05-23T07:13:35.447642+00:00 app[web.1]: from .common import handle_metadata
2020-05-23T07:13:35.447652+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/common.py", line 15, in <module>
2020-05-23T07:13:35.447771+00:00 app[web.1]: from .models import Metadata, db
2020-05-23T07:13:35.447781+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/models.py", line 8, in <module>
2020-05-23T07:13:35.447901+00:00 app[web.1]: from .vars import STATE_RUNNING
2020-05-23T07:13:35.447911+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/vars.py", line 61, in <module>
2020-05-23T07:13:35.448037+00:00 app[web.1]: results = setup_database(DATABASE_URL, DATABASE_PATH)
2020-05-23T07:13:35.448047+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/utils/setup_database.py", line 37, in setup_database
2020-05-23T07:13:35.448172+00:00 app[web.1]: setup_postgresql(*m_postgres.groups())
2020-05-23T07:13:35.448194+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/utils/setup_database.py", line 134, in setup_postgresql
2020-05-23T07:13:35.448345+00:00 app[web.1]: conn = psycopg2.connect(host=host, port=int(port), user=username, password=password)
2020-05-23T07:13:35.448355+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/psycopg2/__init__.py", line 127, in connect
2020-05-23T07:13:35.448492+00:00 app[web.1]: conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
2020-05-23T07:13:35.448526+00:00 app[web.1]: psycopg2.OperationalError: FATAL:  database "wrzusiwvpowsmj" does not exist

What else do I need to do to get this working?

Log Error

I'm seeing this in the logs. Is there something that needs to be done?

2020-05-05T18:15:16.297349+00:00 app[web.1]: [2020-05-05 21:15:16,297] ERROR in logparser.logparser: No logfiles found in /app/logs/*/*/, check and update the SCRAPYD_LOGS_DIR option in /app/.heroku/python/lib/python3.7/site-packages/logparser/settings.py
2020-05-05T18:15:16.297863+00:00 app[web.1]: [2020-05-05 21:15:16,297] INFO in logparser.logparser: Saved to /app/logs/stats.json (887 bytes). Visit stats at: http://127.0.0.1:39971/logs/stats.json
2020-05-05T18:15:16.298134+00:00 app[web.1]: [2020-05-05 21:15:16,298] INFO in logparser.logparser: Sleeping for 10s

Deployed spiders disappear after a few hours

After deploying Scrapyd and ScrapydWeb to Heroku, I can deploy a spider and it works perfectly fine; however, after a few hours the application restarts and my spiders disappear. Is there a workaround for this issue?
Thank you very much

Not able to select spider after uploading to Heroku

I'm trying to create a crawler from scratch after looking through the posted example. I have created a basic Scrapy spider with a spiders.py file and a spider called "myspider"; however, when I zip the project up and upload it, it does not show up as a selectable spider. It just says "Select a version first."

Is there any configuration needed within the spider for Scrapyd to recognize it when I upload it?

Scrapy Selenium

In my Scrapy app, I use scrapy-selenium with geckodriver. When I try to run the 'scrapyd-deploy' command, I get an error:
file "/tmp/scrapy_job_it-1590437526-tezy3xue.egg/scrapy_job_it/spiders/bdcrawler.py", line 3, in \nModuleNotFoundError: No module named 'scrapy_selenium'\n"}
What can I do to deploy an app that uses Selenium?
The full traceback of the error is here:
Server response (200):
{"node_name": "cbb42d40-da02-4d65-bb04-ff436c5aacc1", "status": "error", "message": "/app/.heroku/python/lib/python3.6/site-packages/scrapy/utils/project.py:94: ScrapyDeprecationWarning: Use of environment variables prefixed with SCRAPY_ to override settings is deprecated. The following environment variables are currently defined: EGG_VERSION\n ScrapyDeprecationWarning\nTraceback (most recent call last):\n File "/app/.heroku/python/lib/python3.6/runpy.py", line 193, in _run_module_as_main\n "main", mod_spec)\n File "/app/.heroku/python/lib/python3.6/runpy.py", line 85, in _run_code\n exec(code, run_globals)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapyd/runner.py", line 40, in \n main()\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapyd/runner.py", line 37, in main\n execute()\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/cmdline.py", line 142, in execute\n cmd.crawler_process = CrawlerProcess(settings)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/crawler.py", line 280, in init\n super(CrawlerProcess, self).init(settings)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/crawler.py", line 152, in init\n self.spider_loader = self._get_spider_loader(settings)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/crawler.py", line 146, in _get_spider_loader\n return loader_cls.from_settings(settings.frozencopy())\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spiderloader.py", line 60, in from_settings\n return cls(settings)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spiderloader.py", line 24, in init\n self._load_all_spiders()\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spiderloader.py", line 46, in _load_all_spiders\n for module in walk_modules(name):\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/utils/misc.py", line 77, in walk_modules\n submod = import_module(fullpath)\n File "/app/.heroku/python/lib/python3.6/importlib/init.py", line 126, in import_module\n return _bootstrap._gcd_import(name[level:], package, level)\n File "", line 994, in _gcd_import\n File "", line 971, in _find_and_load\n File "", line 955, in _find_and_load_unlocked\n File "", line 656, in _load_unlocked\n File "", line 626, in _load_backward_compatible\n File "/tmp/scrapy_job_it-1590437526-tezy3xue.egg/scrapy_job_it/spiders/bdcrawler.py", line 3, in \nModuleNotFoundError: No module named 'scrapy_selenium'\n"}

About Redis Labs account

Hi there, I'd like to ask a question about this cool repo for Scrapy.

I was wondering whether you select Cache or Standard when signing up for Redis Labs.

I'm a bit reluctant to put my credit card details into a Redis Labs account. If I were running two spiders on a crawl three times a day, would I exceed the 30 MB after, say, one month?

Is it possible to use this repo with Heroku hosting only?

Thanks
Tom

Scheduler doesn't work on Free Dynos

Scheduled jobs don't work on Heroku free dynos because they sleep after 30 minutes of inactivity.
According to the Heroku tutorial, using APScheduler requires a clock process in the Procfile. Is that really needed?

How to specify the Python version and requirements.txt

my8100/scrapydweb#87 (comment)

Thanks for working on this great project. I have followed your instructions to set up Scrapyd and ScrapydWeb on Heroku, but ran into an issue: my Scrapy project has other package dependencies, such as SQLAlchemy.

When I deploy to scrapinghub.com, I specify that in the scrapinghub.yml in my project as follows:

project: 404937

stacks:
    default: scrapy:1.7-py3
requirements:
  file: requirements.txt

How can I do this for ScrapydWeb?

Thanks a lot!

[Question] How do you protect spiders from being easily accessed?

I noticed that if I visit the spider page, I see the following:

`Scrapyd
Available projects: ScrapydWeb_demo

Jobs
Items
Logs
Documentation
How to schedule a spider?
To schedule a spider you need to use the API (this web UI is only for monitoring)

Example using curl:

curl http://localhost:6800/schedule.json -d project=default -d spider=somespider

For more information about the API, see the Scrapyd documentation`

How can this be protected?
