
insutanto / scrapy-distributed

A series of distributed components for Scrapy, including RabbitMQ-based, Kafka-based, and RedisBloom-based components.

Python 100.00%
crawler crawling distributed-spider kafka python rabbitmq rabbitmq-pipeline redis redisbloom scraping scrapy spider

scrapy-distributed's Introduction

Hello, World! 👋

  • 📙 Focusing on backend development
  • 🏗️ Have a passion for building software and frameworks
  • 📸 A photographer and video maker
  • 🔊 Love music

scrapy-distributed's People

Contributors

insutanto


scrapy-distributed's Issues

Implementation proposal

Hi @Insutanto

You're doing nice work in this repo. I have the same desire: different message queues should be supported in Scrapy.

Old implementations of this idea, and the one you have here, share a common disadvantage: for every type of queue you need to implement a separate scheduler. Besides the amount of work required, such implementations can't benefit from work done on improving scheduling. I am talking mostly about scrapy/scrapy#3520. The reason for going distributed (at least for me) is having a lot of domains in a single crawl, and not using DownloaderAwarePriorityQueue makes crawling much slower (around 10 times slower) according to the benchmarks in the mentioned PR.

To overcome this, I developed and merged in scrapy/scrapy#3884 a separation between the scheduler logic and the external message queue.

It would be great for your project and the Scrapy community if you changed from a scheduler-based to a queue-based approach.

More details and discussion can be found in scrapy/scrapy#4326. An example of such an implementation for Redis can be found in https://github.com/whalebot-helmsman/scrapy/blob/redis/scrapy/squeues.py#L101-L173 .

There is also a PR for an external queue protocol: scrapy/scrapy#4783.
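To make the queue-based approach concrete, below is a minimal sketch, not the code from scrapy/scrapy#3884 or the linked branch, of an external FIFO queue backed by Redis. It only stores and returns already-serialized requests as bytes, mirroring the push/pop/__len__/close shape of Scrapy's low-level queues such as those in scrapy/squeues.py; the class name, key name, and constructor arguments are placeholders.

from typing import Optional

import redis


class RedisFifoQueue:
    """Sketch of an external FIFO queue keeping serialized requests in a Redis list."""

    def __init__(self, key="scrapy:requests", host="localhost", port=6379):
        # "scrapy:requests" and the connection arguments are placeholder values.
        self.key = key
        self.server = redis.Redis(host=host, port=port)

    def push(self, data: bytes) -> None:
        # Append a serialized request to the tail of the list.
        self.server.rpush(self.key, data)

    def pop(self) -> Optional[bytes]:
        # Remove and return the serialized request at the head, or None if empty.
        return self.server.lpop(self.key)

    def close(self) -> None:
        # Nothing to clean up in this sketch; a real queue might release the
        # connection or report remaining items here.
        pass

    def __len__(self) -> int:
        return self.server.llen(self.key)

Request serialization (for example converting requests to dicts and pickling them) and batching are left out; the point is only that the Redis-specific code lives in a small queue class instead of a whole custom scheduler.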

dynamic web crawlers

Could you add support for crawling dynamic web pages to your project? It would need simulated clicks (e.g. via a headless browser) and a way to deal with anti-scraping mechanisms. Thanks!
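For reference, one common way to handle this outside of scrapy-distributed is to render pages with a headless browser from a Scrapy downloader middleware. The sketch below is only an illustration: the class name and the use_browser meta flag are made up, and handling anti-scraping measures is not covered.

from scrapy.http import HtmlResponse
from selenium import webdriver


class HeadlessBrowserMiddleware:
    """Illustrative downloader middleware that renders selected pages in headless Chrome."""

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # Only render requests that explicitly ask for a browser.
        if not request.meta.get("use_browser"):
            return None
        self.driver.get(request.url)
        # Simulated clicks, scrolling, and explicit waits would go here.
        body = self.driver.page_source.encode("utf-8")
        return HtmlResponse(request.url, body=body, encoding="utf-8", request=request)

Such a middleware would still need to be enabled in DOWNLOADER_MIDDLEWARES and to quit the driver when the spider closes.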

First feedback from users

Hi @Insutanto

Great! Your work is really impressive. However, I would like to add some suggestions.

First of all, I wanted to open the RabbitMQ management console (http://127.0.0.1:15672), but it didn't work.
I fixed it by entering the RabbitMQ container, enabling the rabbitmq_management plugin, and refreshing the console. The commands are:

docker exec -it <rabbitmq-container-id> /bin/bash
cd /etc/rabbitmq/
rabbitmq-plugins enable rabbitmq_management

Secondly, to delete a key from Redis, the commands are:

docker exec -it <redis-container-id> /bin/bash
redis-cli
keys *
del <key_name>
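One small note on the Redis commands above: keys * blocks the server while it scans the whole keyspace, so outside of local debugging a non-blocking alternative is redis-cli's --scan option, for example (the pattern is just a placeholder):

redis-cli --scan --pattern '<key_pattern>' | xargs redis-cli del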

Thirdly, I added image and file download support to the RabbitMQ common example. The code is:

PATH: examples/rabbitmq_example/simple_example/settings.py

ITEM_PIPELINES = {
   'simple_example.pipelines.SimpleExamplePipeline': 201,
   'scrapy_distributed.pipelines.amqp.RabbitPipeline': 200,
   'simple_example.pipelines.ImagePipeline': 202,
   'simple_example.pipelines.MyFilesPipeline': 203,
}

FILES_STORE = './test_data/example_common/files_dir'
IMAGES_STORE = './test_data/example_common/images_dir'

PATH: examples/rabbitmq_example/simple_example/pipelines.py

from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class ImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item, 'index': item['image_urls'].index(image_url)})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    # "item" keyword added so the override keeps working on newer Scrapy, which passes item=
    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        image_guid = request.url.split('/')[-1]
        filename = './{}/{}/{}'.format(item["url"].replace("/", "_"), item['title'], image_guid)
        return filename


class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        for file_url in item['file_urls']:
            yield Request(file_url, meta={'item': item, 'index': item['file_urls'].index(file_url)})

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        return item

    # "item" keyword added so the override keeps working on newer Scrapy, which passes item=
    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        file_guid = request.url.split('/')[-1]
        filename = './{}/{}/{}'.format(item["url"].replace("/", "_"), item['title'], file_guid)
        return filename

PATH: examples/rabbitmq_example/simple_example/items.py

class CommonExampleItem(scrapy.Item):

    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()

PATH: examples/rabbitmq_example/simple_example/spiders/example.py

    def parse(self, response):
        self.logger.info(f"parse response, url: {response.url}")
        for link in response.xpath("//a/@href").extract():
            if not link.startswith('http'):
                link = response.url + link
            yield Request(url=link)
        item = CommonExampleItem()
        item['url'] = response.url
        item['title'] = response.xpath("//title/text()").extract_first()
        item["content"] = response.text

        image_urls = []
        for image_url in response.xpath('//a/img/@src').extract():
            if image_url.endswith(('jpg', 'png')):
                if not image_url.startswith('http'):
                    image_url = re.match("(.*?//.*?)/", response.url).group(1) + image_url
                    image_urls.append(image_url)
                else:
                    image_urls.append(image_url)
        item['image_urls'] = image_urls

        file_urls = []
        for file_url in response.xpath(
                r"//a[re:match(@href,'.*(\.docx|\.doc|\.xlsx|\.pdf|\.xls|\.zip)$')]/@href").extract():
            if not file_url.startswith('http'):
                file_url = re.match("(.*?//.*?)/", response.url).group(1) + file_url
                file_urls.append(file_url)
            else:
                file_urls.append(file_url)
        item['file_urls'] = file_urls

        yield item

Finally, I hope the author can add a tutorial on crawling dynamic web pages. Thanks again!

Congratulations

As for distributed crawlers, I think it comes down to good URL management with reasonable scheduling rules, with a semaphore acting as the starting gun. This is a really great open source project, and I'm looking forward to it.

Or we could try to use Raft.
