
my8100 / scrapyd-cluster-on-heroku

Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks. DEMO :point_right:

Home Page: https://scrapydweb.herokuapp.com/

License: GNU General Public License v3.0

Python 100.00%
scrapy scrapyd cluster heroku python scrapydweb logparser web-crawling web-scraping

scrapyd-cluster-on-heroku's Introduction

🔤 English | 🀄 简体中文

How to set up Scrapyd cluster on Heroku

Demo

scrapydweb.herokuapp.com

Network topology


Create accounts

  1. Heroku

Visit heroku.com to create a free account, with which you can create and run up to 5 apps.


  2. Redis Labs (optional)

Visit redislabs.com to create a free account, which provides 30 MB of storage that scrapy-redis can use for distributed crawling.


Deploy Heroku apps in the browser

  1. Visit my8100/scrapyd-cluster-on-heroku-scrapyd-app to deploy the Scrapyd app. (Don't forget to update the host, port and password of your Redis server in the form)
  2. Repeat step 1 to deploy up to 4 Scrapyd apps, assuming their names are svr-1, svr-2, svr-3 and svr-4 (a quick reachability check is sketched after this list)
  3. Visit my8100/scrapyd-cluster-on-heroku-scrapydweb-app-git to deploy the ScrapydWeb app named myscrapydweb
  4. (optional) Click the Reveal Config Vars button on dashboard.heroku.com/apps/myscrapydweb/settings to add more Scrapyd servers as needed, e.g. SCRAPYD_SERVER_2 as the KEY and svr-2.herokuapp.com:80#group2 as the VALUE.
  5. Visit myscrapydweb.herokuapp.com
  6. Jump to the Deploy and run distributed spiders section below and move on.
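
Before moving on, you can optionally confirm that each deployed Scrapyd app is reachable. A minimal sketch (not part of the repo), assuming the app names used above, that the apps are publicly accessible without HTTP auth, and that requests is installed (pip install requests):

# Poll Scrapyd's daemonstatus.json endpoint on each deployed app.
# The app names svr-1 ... svr-4 are the ones assumed above; adjust as needed.
import requests

for name in ('svr-1', 'svr-2', 'svr-3', 'svr-4'):
    url = 'https://%s.herokuapp.com/daemonstatus.json' % name
    try:
        resp = requests.get(url, timeout=10)
        print(name, resp.status_code, resp.json())  # expect {"status": "ok", ...}
    except requests.RequestException as exc:
        print(name, 'unreachable:', exc)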

Custom deployment


Install tools

  1. Git
  2. Heroku CLI
  3. Python client for Redis: Simply run the pip install redis command (a quick connection check is sketched below).
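
Optionally, verify that the Redis server created earlier is reachable from your machine. A minimal sketch (not part of the repo); replace the placeholders with your own Redis Labs credentials:

# Quick connection check with the redis client installed above.
import redis

r = redis.Redis(host='your-redis-host', port=12345, password='your-redis-password')  # placeholders
print(r.ping())  # prints True if the host, port and password are valid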

Download config files

Open a new terminal:

git clone https://github.com/my8100/scrapyd-cluster-on-heroku
cd scrapyd-cluster-on-heroku

Log in to Heroku

# Or run 'heroku login -i' to login with username/password
heroku login
# outputs:
# heroku: Press any key to open up the browser to login or q to exit:
# Opening browser to https://cli-auth.heroku.com/auth/browser/12345-abcde
# Logging in... done
# Logged in as [email protected]

Set up Scrapyd cluster

  1. New Git repo
cd scrapyd
git init
# explore and update the files if needed
git status
git add .
git commit -a -m "first commit"
git status
  2. Deploy Scrapyd app
heroku apps:create svr-1
heroku git:remote -a svr-1
git remote -v
git push heroku master
heroku logs --tail
# Press ctrl+c to stop logs outputting
# Visit https://svr-1.herokuapp.com
  3. Add environment variables

    • Timezone
    # python -c "import tzlocal; print(tzlocal.get_localzone())"
    heroku config:set TZ=US/Eastern
    # heroku config:get TZ
    
    • Redis account (optional, see settings.py in the scrapy_redis_demo_project.zip; a sketch of how these variables are consumed follows this list)
    heroku config:set REDIS_HOST=your-redis-host
    heroku config:set REDIS_PORT=your-redis-port
    heroku config:set REDIS_PASSWORD=your-redis-password
    
  4. Repeat steps 2 and 3 to get the remaining Scrapyd apps ready: svr-2, svr-3 and svr-4
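
For reference, here is a sketch of how the REDIS_* variables set above are typically consumed by a scrapy-redis project. This is illustrative only; the actual wiring lives in settings.py inside scrapy_redis_demo_project.zip:

# Illustrative settings.py fragment (assumption, not the repo's actual file).
import os

REDIS_HOST = os.environ.get('REDIS_HOST', '127.0.0.1')
REDIS_PORT = int(os.environ.get('REDIS_PORT', 6379))
REDIS_PASSWORD = os.environ.get('REDIS_PASSWORD', '')

# Standard scrapy-redis settings: shared scheduler, dupefilter and item pipeline.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 300}
REDIS_URL = 'redis://:%s@%s:%s' % (REDIS_PASSWORD, REDIS_HOST, REDIS_PORT)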

Set up ScrapydWeb app

  1. New Git repo
cd ..
cd scrapydweb
git init
# explore and update the files if needed
git status
git add .
git commit -a -m "first commit"
git status
  2. Deploy ScrapydWeb app
heroku apps:create myscrapydweb
heroku git:remote -a myscrapydweb
git remote -v
git push heroku master
  3. Add environment variables

    • Timezone
    heroku config:set TZ=US/Eastern
    
    • Scrapyd servers (see scrapydweb_settings_vN.py in the scrapydweb directory; the address format is illustrated after this list)
    heroku config:set SCRAPYD_SERVER_1=svr-1.herokuapp.com:80
    heroku config:set SCRAPYD_SERVER_2=svr-2.herokuapp.com:80#group1
    heroku config:set SCRAPYD_SERVER_3=svr-3.herokuapp.com:80#group1
    heroku config:set SCRAPYD_SERVER_4=svr-4.herokuapp.com:80#group2
    
  4. Visit myscrapydweb.herokuapp.com
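
As an illustration of the address format used above, each SCRAPYD_SERVER_N value is a host:port pair with an optional #group suffix. A hypothetical helper (not ScrapydWeb's actual code) that breaks such values into their parts:

# Hypothetical parser for the 'host:port#group' format (illustrative only).
import os

def parse_scrapyd_server(value):
    """Split 'host:port#group'; the '#group' suffix is optional."""
    address, _, group = value.partition('#')
    host, _, port = address.partition(':')
    return host, port or '6800', group or None  # 6800 is Scrapyd's default port

servers = [os.environ[k] for k in sorted(os.environ) if k.startswith('SCRAPYD_SERVER_')]
print([parse_scrapyd_server(s) for s in servers])
# e.g. [('svr-1.herokuapp.com', '80', None), ('svr-2.herokuapp.com', '80', 'group1'), ...]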

Deploy and run distributed spiders

  1. Simply upload the compressed file scrapy_redis_demo_project.zip which resides in the scrapyd-cluster-on-heroku directory
  2. Push seed URLs into mycrawler:start_urls to start crawling, then check out the scraped items:
In [1]: import redis  # pip install redis

In [2]: r = redis.Redis(host='your-redis-host', port=your-redis-port, password='your-redis-password')

In [3]: r.delete('mycrawler_redis:requests', 'mycrawler_redis:dupefilter', 'mycrawler_redis:items')
Out[3]: 0

In [4]: r.lpush('mycrawler:start_urls', 'http://books.toscrape.com', 'http://quotes.toscrape.com')
Out[4]: 2

# wait for a minute
In [5]: r.lrange('mycrawler_redis:items', 0, 1)
Out[5]:
[b'{"url": "http://quotes.toscrape.com/", "title": "Quotes to Scrape", "hostname": "d6cf94d5-324e-4def-a1ab-e7ee2aaca45a", "crawled": "2019-04-02 03:42:37", "spider": "mycrawler_redis"}',
 b'{"url": "http://books.toscrape.com/index.html", "title": "All products | Books to Scrape - Sandbox", "hostname": "d6cf94d5-324e-4def-a1ab-e7ee2aaca45a", "crawled": "2019-04-02 03:42:37", "spider": "mycrawler_redis"}']


Conclusion

  • Pros
  • Cons
    • Heroku apps are restarted (cycled) at least once per day, and any changes to the local filesystem are discarded, so you need an external database to persist data. Check out devcenter.heroku.com for more info.


scrapyd-cluster-on-heroku's Issues

[Question] Postgresql setup

I added the Heroku Postgres add-on, which creates the DATABASE_URL config var.

Once I restarted the ScrapydWeb server, I got this error:

2020-05-23T07:13:33.128467+00:00 heroku[web.1]: Starting process with command `scrapydweb`
2020-05-23T07:13:35.447284+00:00 app[web.1]: Traceback (most recent call last):
2020-05-23T07:13:35.447303+00:00 app[web.1]: File "/app/.heroku/python/bin/scrapydweb", line 5, in <module>
2020-05-23T07:13:35.447478+00:00 app[web.1]: from scrapydweb.run import main
2020-05-23T07:13:35.447489+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/__init__.py", line 14, in <module>
2020-05-23T07:13:35.447642+00:00 app[web.1]: from .common import handle_metadata
2020-05-23T07:13:35.447652+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/common.py", line 15, in <module>
2020-05-23T07:13:35.447771+00:00 app[web.1]: from .models import Metadata, db
2020-05-23T07:13:35.447781+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/models.py", line 8, in <module>
2020-05-23T07:13:35.447901+00:00 app[web.1]: from .vars import STATE_RUNNING
2020-05-23T07:13:35.447911+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/vars.py", line 61, in <module>
2020-05-23T07:13:35.448037+00:00 app[web.1]: results = setup_database(DATABASE_URL, DATABASE_PATH)
2020-05-23T07:13:35.448047+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/utils/setup_database.py", line 37, in setup_database
2020-05-23T07:13:35.448172+00:00 app[web.1]: setup_postgresql(*m_postgres.groups())
2020-05-23T07:13:35.448194+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/scrapydweb/utils/setup_database.py", line 134, in setup_postgresql
2020-05-23T07:13:35.448345+00:00 app[web.1]: conn = psycopg2.connect(host=host, port=int(port), user=username, password=password)
2020-05-23T07:13:35.448355+00:00 app[web.1]: File "/app/.heroku/python/lib/python3.7/site-packages/psycopg2/__init__.py", line 127, in connect
2020-05-23T07:13:35.448492+00:00 app[web.1]: conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
2020-05-23T07:13:35.448526+00:00 app[web.1]: psycopg2.OperationalError: FATAL:  database "wrzusiwvpowsmj" does not exist

What else do I need to do to get this working?

Log Error

I'm seeing this in the logs. Is there something that needs to be done?

2020-05-05T18:15:16.297349+00:00 app[web.1]: [2020-05-05 21:15:16,297] ERROR in logparser.logparser: No logfiles found in /app/logs/*/*/, check and update the SCRAPYD_LOGS_DIR option in /app/.heroku/python/lib/python3.7/site-packages/logparser/settings.py
2020-05-05T18:15:16.297863+00:00 app[web.1]: [2020-05-05 21:15:16,297] INFO in logparser.logparser: Saved to /app/logs/stats.json (887 bytes). Visit stats at: http://127.0.0.1:39971/logs/stats.json
2020-05-05T18:15:16.298134+00:00 app[web.1]: [2020-05-05 21:15:16,298] INFO in logparser.logparser: Sleeping for 10s

Deployed spiders disappear after a few hours

After deploying Scrapyd and ScrapydWeb to Heroku, I can deploy a spider and it works perfectly fine; however, after a few hours the application restarts and my spiders disappear. Is there a workaround for this issue?
Thank you very much

Not able to select spider after uploading to Heroku

I'm trying to create a crawler from scratch after looking through the posted example. I have created a basic Scrapy spider with a spiders.py file and a spider called "myspider"; however, when I zip the project up and upload it, it does not show up as a selectable spider. It just says "Select a version first."

Is there any configuration needed within the spider for Scrapyd to recognize it when I upload it?

Scrapy Selenium

In my Scrapy app, I use scrapy-selenium with geckodriver. When I try to run the 'scrapyd-deploy' command, I get an error:
file "/tmp/scrapy_job_it-1590437526-tezy3xue.egg/scrapy_job_it/spiders/bdcrawler.py", line 3, in \nModuleNotFoundError: No module named 'scrapy_selenium'\n"}
What can I do to deploy an app that uses Selenium?
The full traceback of the error is here:
Server response (200):
{"node_name": "cbb42d40-da02-4d65-bb04-ff436c5aacc1", "status": "error", "message": "/app/.heroku/python/lib/python3.6/site-packages/scrapy/utils/project.py:94: ScrapyDeprecationWarning: Use of environment variables prefixed with SCRAPY_ to override settings is deprecated. The following environment variables are currently defined: EGG_VERSION\n ScrapyDeprecationWarning\nTraceback (most recent call last):\n File "/app/.heroku/python/lib/python3.6/runpy.py", line 193, in _run_module_as_main\n "main", mod_spec)\n File "/app/.heroku/python/lib/python3.6/runpy.py", line 85, in _run_code\n exec(code, run_globals)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapyd/runner.py", line 40, in \n main()\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapyd/runner.py", line 37, in main\n execute()\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/cmdline.py", line 142, in execute\n cmd.crawler_process = CrawlerProcess(settings)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/crawler.py", line 280, in init\n super(CrawlerProcess, self).init(settings)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/crawler.py", line 152, in init\n self.spider_loader = self._get_spider_loader(settings)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/crawler.py", line 146, in _get_spider_loader\n return loader_cls.from_settings(settings.frozencopy())\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spiderloader.py", line 60, in from_settings\n return cls(settings)\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spiderloader.py", line 24, in init\n self._load_all_spiders()\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spiderloader.py", line 46, in _load_all_spiders\n for module in walk_modules(name):\n File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/utils/misc.py", line 77, in walk_modules\n submod = import_module(fullpath)\n File "/app/.heroku/python/lib/python3.6/importlib/init.py", line 126, in import_module\n return _bootstrap._gcd_import(name[level:], package, level)\n File "", line 994, in _gcd_import\n File "", line 971, in _find_and_load\n File "", line 955, in _find_and_load_unlocked\n File "", line 656, in _load_unlocked\n File "", line 626, in _load_backward_compatible\n File "/tmp/scrapy_job_it-1590437526-tezy3xue.egg/scrapy_job_it/spiders/bdcrawler.py", line 3, in \nModuleNotFoundError: No module named 'scrapy_selenium'\n"}

About Redis Labs account

Hi there, I'd like to ask a question about this cool repo for Scrapy.

I was wondering whether you select Cache or Standard when signing up for Redis Labs.

I'm a bit reluctant to put my credit card details into a Redis Labs account. If I were running two spiders on a crawl three times a day, would I exceed the 30 MB after, say, one month?

Is it possible to use this repo with Heroku hosting only?

Thanks
Tom

Scheduler doesn't work on Free Dynos

Scheduled jobs don't work on Heroku free dynos because they sleep after 30 minutes of inactivity.
According to the Heroku tutorial, using APScheduler requires a clock process in the Procfile. Is that really needed?

How to specify the Python version and requirements.txt

my8100/scrapydweb#87 (comment)

Thanks for working on this great project. I have followed your instructions to set up Scrapyd and ScrapydWeb on Heroku, but ran into an issue: my Scrapy project has other package dependencies, such as SQLAlchemy.

When I deploy to scrapinghub.com, I specify that in the scrapinghub.yml in my project as follows:

project: 404937

stacks:
    default: scrapy:1.7-py3
requirements:
  file: requirements.txt

How can I do this for ScrapydWeb?

Thanks a lot!

[Question] How do you protect spiders from being easily accessed?

I noticed that if I visit the spider page, I see the following:

`Scrapyd
Available projects: ScrapydWeb_demo

Jobs
Items
Logs
Documentation
How to schedule a spider?
To schedule a spider you need to use the API (this web UI is only for monitoring)

Example using curl:

curl http://localhost:6800/schedule.json -d project=default -d spider=somespider

For more information about the API, see the Scrapyd documentation`

How can this be protected?
