
insutanto / scrapy-distributed

A series of distributed components for Scrapy, including RabbitMQ-based, Kafka-based, and RedisBloom-based components.

Python 100.00%
crawler crawling distributed-spider kafka python rabbitmq rabbitmq-pipeline redis redisbloom scraping scrapy spider

scrapy-distributed's Introduction

Hello, World! 👋

  • 📙 Focusing on backend development
  • 🏗️ Have a passion for building software and frameworks
  • 📸 A photographer and video maker
  • 🔊 Love music

scrapy-distributed's People

Contributors

insutanto


scrapy-distributed's Issues

Implementation proposal

Hi @Insutanto

You're doing nice work in this repo. I have the same desire: different message queues should be supported in Scrapy.

Old implementations of this idea, and the one you have here, share a common disadvantage: for every type of queue you need to implement a separate scheduler. Besides the amount of work required, such implementations can't benefit from work done on improving scheduling. I am talking mostly about scrapy/scrapy#3520. The reason for going distributed (at least for me) is having a lot of domains in a single crawl, and not using DownloaderAwarePriorityQueue makes crawling much slower (around 10 times slower) according to the benchmarks in the mentioned PR.

To overcome this, I developed and merged in scrapy/scrapy#3884 a separation between the scheduler logic and the external message queue.

It would be great for your project and the Scrapy community if you changed from a scheduler-based to a queue-based approach.

More details and discussion can be found in scrapy/scrapy#4326. An example of such an implementation for Redis can be found in https://github.com/whalebot-helmsman/scrapy/blob/redis/scrapy/squeues.py#L101-L173 .

There is also a PR for an external queue protocol: scrapy/scrapy#4783.
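To make the queue-based approach concrete, below is a minimal sketch, not the code from scrapy/scrapy#3884 or the linked branch, of an external FIFO queue backed by Redis. It only stores and returns already-serialized requests as bytes, mirroring the push/pop/__len__/close shape of Scrapy's low-level queues such as those in scrapy/squeues.py; the class name, key name, and constructor arguments are placeholders.

from typing import Optional

import redis


class RedisFifoQueue:
    """Sketch of an external FIFO queue keeping serialized requests in a Redis list."""

    def __init__(self, key="scrapy:requests", host="localhost", port=6379):
        # "scrapy:requests" and the connection arguments are placeholder values.
        self.key = key
        self.server = redis.Redis(host=host, port=port)

    def push(self, data: bytes) -> None:
        # Append a serialized request to the tail of the list.
        self.server.rpush(self.key, data)

    def pop(self) -> Optional[bytes]:
        # Remove and return the serialized request at the head, or None if empty.
        return self.server.lpop(self.key)

    def close(self) -> None:
        # Nothing to clean up in this sketch; a real queue might release the
        # connection or report remaining items here.
        pass

    def __len__(self) -> int:
        return self.server.llen(self.key)

Request serialization (for example converting requests to dicts and pickling them) and batching are left out; the point is only that the Redis-specific code lives in a small queue class instead of a whole custom scheduler.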

dynamic web crawlers

Could you add support for crawling dynamic web pages to your project? It would need simulated clicks (e.g. via a headless browser) and a way to deal with anti-scraping mechanisms. Thanks!
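For reference, one common way to handle this outside of scrapy-distributed is to render pages with a headless browser from a Scrapy downloader middleware. The sketch below is only an illustration: the class name and the use_browser meta flag are made up, and handling anti-scraping measures is not covered.

from scrapy.http import HtmlResponse
from selenium import webdriver


class HeadlessBrowserMiddleware:
    """Illustrative downloader middleware that renders selected pages in headless Chrome."""

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # Only render requests that explicitly ask for a browser.
        if not request.meta.get("use_browser"):
            return None
        self.driver.get(request.url)
        # Simulated clicks, scrolling, and explicit waits would go here.
        body = self.driver.page_source.encode("utf-8")
        return HtmlResponse(request.url, body=body, encoding="utf-8", request=request)

Such a middleware would still need to be enabled in DOWNLOADER_MIDDLEWARES and to quit the driver when the spider closes.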

First feedback from users

Hi @Insutanto

Great! Your work is really impressive. However, I would like to add some suggestions.

First of all, I wanted to open the RabbitMQ management console (http://127.0.0.1:15672), but it didn't work.
I fixed it by entering the RabbitMQ container, enabling the rabbitmq_management plugin, and refreshing the console. The commands are:

docker exec -it <rabbitmq-container-id> /bin/bash
cd /etc/rabbitmq/
rabbitmq-plugins enable rabbitmq_management

Secondly, to delete a key from Redis, the commands are:

docker exec -it <redis-container-id> /bin/bash
redis-cli
keys *
del <key_name>
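One small note on the Redis commands above: keys * blocks the server while it scans the whole keyspace, so outside of local debugging a non-blocking alternative is redis-cli's --scan option, for example (the pattern is just a placeholder):

redis-cli --scan --pattern '<key_pattern>' | xargs redis-cli del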

Thirdly, I added image and file download support to the RabbitMQ common example. The code is:

PATH: examples/rabbitmq_example/simple_example/settings.py

ITEM_PIPELINES = {
   'simple_example.pipelines.SimpleExamplePipeline': 201,
   'scrapy_distributed.pipelines.amqp.RabbitPipeline': 200,
   'simple_example.pipelines.ImagePipeline': 202,
   'simple_example.pipelines.MyFilesPipeline': 203,
}

FILES_STORE = './test_data/example_common/files_dir'
IMAGES_STORE = './test_data/example_common/images_dir'

PATH: examples/rabbitmq_example/simple_example/pipelines.py

from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class ImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item, 'index': item['image_urls'].index(image_url)})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    # "item" keyword added so the override keeps working on newer Scrapy, which passes item=
    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        image_guid = request.url.split('/')[-1]
        filename = './{}/{}/{}'.format(item["url"].replace("/", "_"), item['title'], image_guid)
        return filename


class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        for file_url in item['file_urls']:
            yield Request(file_url, meta={'item': item, 'index': item['file_urls'].index(file_url)})

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        return item

    # "item" keyword added so the override keeps working on newer Scrapy, which passes item=
    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        file_guid = request.url.split('/')[-1]
        filename = './{}/{}/{}'.format(item["url"].replace("/", "_"), item['title'], file_guid)
        return filename

PATH: examples/rabbitmq_example/simple_example/items.py

class CommonExampleItem(scrapy.Item):

    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()

PATH: examples/rabbitmq_example/simple_example/spiders/example.py

    def parse(self, response):
        self.logger.info(f"parse response, url: {response.url}")
        for link in response.xpath("//a/@href").extract():
            if not link.startswith('http'):
                link = response.url + link
            yield Request(url=link)
        item = CommonExampleItem()
        item['url'] = response.url
        item['title'] = response.xpath("//title/text()").extract_first()
        item["content"] = response.text

        image_urls = []
        for image_url in response.xpath('//a/img/@src').extract():
            if image_url.endswith(('jpg', 'png')):
                if not image_url.startswith('http'):
                    image_url = re.match("(.*?//.*?)/", response.url).group(1) + image_url
                    image_urls.append(image_url)
                else:
                    image_urls.append(image_url)
        item['image_urls'] = image_urls

        file_urls = []
        for file_url in response.xpath(
                r"//a[re:match(@href,'.*(\.docx|\.doc|\.xlsx|\.pdf|\.xls|\.zip)$')]/@href").extract():
            if not file_url.startswith('http'):
                file_url = re.match("(.*?//.*?)/", response.url).group(1) + file_url
                file_urls.append(file_url)
            else:
                file_urls.append(file_url)
        item['file_urls'] = file_urls

        yield item

Finally, I hope the author can add a tutorial on crawling dynamic web pages. Thanks again!

Congratulations

As for distributed crawlers, I think it comes down to good URL management with reasonable scheduling rules, with a semaphore acting as the starting gun. This is a really great open source project, and I'm looking forward to it.

Or we could try to use Raft.
