tonywangcn / scaleable-crawler-with-docker-cluster
A scalable and efficient crawler with a Docker cluster: crawl a million pages in 2 hours on a single machine.
I got the error below:
worker_1 | /usr/local/lib/python2.7/site-packages/celery/platforms.py:793: RuntimeWarning: You're running the worker with superuser privileges: this is
worker_1 | absolutely not recommended!
worker_1 |
worker_1 | Please specify a different user using the -u option.
worker_1 |
worker_1 | User information: uid=0 euid=0 gid=0 egid=0
worker_1 |
worker_1 | uid=uid, euid=euid, gid=gid, egid=egid,
scaleable-crawler-with-docker-cluster_worker_2 exited with code 1
worker_1 | Traceback (most recent call last):
worker_1 | File "/usr/local/bin/celery", line 10, in <module>
worker_1 | sys.exit(main())
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/__main__.py", line 14, in main
worker_1 | _main()
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 326, in main
worker_1 | cmd.execute_from_commandline(argv)
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
worker_1 | super(CeleryCommand, self).execute_from_commandline(argv)))
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
worker_1 | return self.handle_argv(self.prog_name, argv[1:])
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 480, in handle_argv
worker_1 | return self.execute(command, argv)
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/bin/celery.py", line 412, in execute
worker_1 | ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/bin/worker.py", line 221, in run_from_argv
worker_1 | return self(*args, **options)
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/bin/base.py", line 244, in __call__
worker_1 | ret = self.run(*args, **kwargs)
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/bin/worker.py", line 255, in run
worker_1 | **kwargs)
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/worker/worker.py", line 99, in __init__
worker_1 | self.setup_instance(**self.prepare_args(**kwargs))
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/worker/worker.py", line 122, in setup_instance
worker_1 | self.should_use_eventloop() if use_eventloop is None
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/worker/worker.py", line 241, in should_use_eventloop
worker_1 | self._conninfo.transport.implements.async and
worker_1 | File "/usr/local/lib/python2.7/site-packages/kombu/transport/base.py", line 127, in __getattr__
worker_1 | raise AttributeError(key)
worker_1 | AttributeError: async
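This AttributeError: async is a Celery/kombu version mismatch: kombu 4.2 renamed its transport attribute async to asynchronous, while Celery 4.1 still looks up the old name (see the should_use_eventloop frame above). One way to fix it is pinning a matching kombu in requirements.txt; upgrading to Celery >= 4.2 together with a newer kombu also works:

```
celery==4.1.0
kombu==4.1.0
```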
Thank you so much for open-sourcing the code and the detailed article.
I found a few tips that need to be followed to make it work:
In test_celery/celery.py, you have to set the broker URL as broker='amqp://admin:mypass@rabbit:5672'.
You have to execute run_tasks inside a worker container instead of on the host machine:
sudo docker exec -i -t scaleablecrawlerwithdockercluster_worker_1 /bin/bash
python -m test_celery.run_tasks
Here scaleablecrawlerwithdockercluster_worker_1 is the container name; make sure you replace it with your own worker container's name or id.
To connect to MongoDB from the host:
mongo --host 172.18.0.1 --port 27018
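For context, the host names used above (rabbit for the broker, and the 27018 host port for Mongo) come from the project's docker-compose file. A hypothetical sketch of the relevant services, matching the names and ports used in this thread (the actual file in the repo may differ):

```
version: '2'
services:
  rabbit:
    image: rabbitmq:3
    environment:
      RABBITMQ_DEFAULT_USER: admin
      RABBITMQ_DEFAULT_PASS: mypass
  mongodb:
    image: mongo
    ports:
      - "27018:27017"   # host port 27018 -> container port 27017
  worker:
    build: .
    depends_on:
      - rabbit
      - mongodb
```

Inside the compose network, containers reach each other by service name (rabbit, mongodb), which is why localhost does not work from within a worker container.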
In the Medium article, you mentioned installing docker-engine. I think we only need to install docker and docker-compose. I installed both using the following steps:
1. Install Docker
Refer: https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu/
sudo apt-get remove docker docker-engine docker.io
sudo apt-get update
sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
software-properties-common
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
sudo apt-get update
sudo apt-get install docker-ce
sudo docker --version
2. Install Docker-compose
sudo curl -L https://github.com/docker/compose/releases/download/1.18.0/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo docker-compose --version
Could you please update the article accordingly?
Last night, when I ran python -m test_celery.run_tasks, I got the errors below. Does anyone know how to fix them?
worker_1 | [2018-05-19 05:12:06,257: WARNING/ForkPoolWorker-3] /usr/local/lib/python2.7/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
worker_1 | "MongoClient opened before fork. Create MongoClient "
and
worker_1 | [2018-05-19 05:12:36,612: ERROR/ForkPoolWorker-3] Task test_celery.tasks.longtime_add[287b7121-cb07-42b5-868b-785e6aab74cc] raised unexpected: ServerSelectionTimeoutError('127.0.0.1:27017: [Errno 111] Connection refused',)
worker_1 | Traceback (most recent call last):
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
worker_1 | R = retval = fun(*args, **kwargs)
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
worker_1 | return self.run(*args, **kwargs)
worker_1 | File "/app/test_celery/tasks.py", line 17, in longtime_add
worker_1 | raise self.retry(exc=exc)
worker_1 | File "/usr/local/lib/python2.7/site-packages/celery/app/task.py", line 668, in retry
worker_1 | raise_with_context(exc)
worker_1 | File "/app/test_celery/tasks.py", line 14, in longtime_add
worker_1 | post.insert({'status':r.status_code,"creat_time":time.time()}) # store status code and current time to mongodb
worker_1 | File "/usr/local/lib/python2.7/site-packages/pymongo/collection.py", line 2467, in insert
worker_1 | with self._socket_for_writes() as sock_info:
worker_1 | File "/usr/local/lib/python2.7/contextlib.py", line 17, in __enter__
worker_1 | return self.gen.next()
worker_1 | File "/usr/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 823, in _get_socket
worker_1 | server = self._get_topology().select_server(selector)
worker_1 | File "/usr/local/lib/python2.7/site-packages/pymongo/topology.py", line 214, in select_server
worker_1 | address))
worker_1 | File "/usr/local/lib/python2.7/site-packages/pymongo/topology.py", line 189, in select_servers
worker_1 | self._error_message(selector))
worker_1 | ServerSelectionTimeoutError: 127.0.0.1:27017: [Errno 111] Connection refused
Currently, I have
pymongo==3.6.1
celery==4.1.0
python==3.5.2
Ubuntu==16.04