
Comments (4)

ramsayleung avatar ramsayleung commented on May 31, 2024

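https://github.com/rmax/scrapy-redis is the scrapy-redis repository. I call scrapy-redis directly, as a middleware. For reasons of my own I had earlier switched RedisSpider back to scrapy.Spider, so the project as you see it now uses scrapy.Spider. Note that a distributed crawler built on scrapy-redis is not distributed in the strict sense. Compared with "purely distributed" systems such as hbase/zookeeper, it can only be called "pseudo-distributed", because it has a single point: the machine running the master redis. Even if you use redis master-slave, the single point still exists.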
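For readers who have not wired scrapy-redis into a project, the fragment below is a minimal sketch of the `settings.py` entries that scrapy-redis documents for swapping in its components. It is not taken from jd_spider; the Redis address and the pipeline priority `300` are placeholder assumptions.

```python
# settings.py (sketch): route scheduling, de-duplication and item storage
# through one shared Redis instance instead of each Scrapy process.

# Use the scrapy-redis scheduler so pending requests live in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Use the scrapy-redis duplicate filter so request fingerprints are shared.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprint set in Redis between runs.
SCHEDULER_PERSIST = True

# Push scraped items into Redis; 300 is an arbitrary priority (assumption).
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Address of the single Redis instance every worker connects to.
# localhost is a placeholder; this host is the single point discussed above.
REDIS_URL = "redis://localhost:6379"
```

Every worker points at the same `REDIS_URL`, which is why that one host remains the single point regardless of how many workers are added.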

from jd_spider.

yangcongtougg avatar yangcongtougg commented on May 31, 2024

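Someone said earlier that scrapy-redis is pseudo-distributed. My understanding is that multiple workers fetch the spider's start_request from redis, while the URLs parsed afterwards still go through each worker's own scheduler for filtering. If that is the case, wouldn't pushing the parsed URLs back into redis solve it? I still don't quite understand "pseudo-distributed"; I have never worked with a cluster. -_-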


ramsayleung avatar ramsayleung commented on May 31, 2024

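The components scrapy-redis provides include Scheduler, Dupefilter, Pipeline, and Spider. When multiple workers obtain parsed URLs, those URLs go through the scrapy-redis scheduler. If each worker filtered with its own scheduler, how would a worker know whether some other worker had already crawled a given URL? So scrapy-redis stores the pending tasks, scrapy-redis hands out the URLs to crawl, and scrapy-redis also saves the results in the pipeline. The only difference is that, instead of one person being both boss and employee (single-machine mode), there is now one boss (scrapy-redis) directing a group of employees (workers) to do the work (crawl). That is the pseudo-distributed setup. It is called pseudo-distributed because, unlike a purely distributed system, not every employee (worker) knows what work has been done and what remains; a purely distributed system has no boss (no single point). Just my personal view.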
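The boss/employee picture can be sketched in a few lines of plain Python. This is an illustration only, not scrapy-redis code: `SharedStore` is a hypothetical in-memory stand-in for the single Redis instance, and `LINKS` is a made-up link graph.

```python
from collections import deque


class SharedStore:
    """In-memory stand-in for the single Redis instance (the 'boss')."""

    def __init__(self):
        self.queue = deque()  # pending URLs, shared by all workers
        self.seen = set()     # URLs already submitted by any worker

    def push(self, url):
        """Enqueue url unless some worker has already submitted it."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        """Hand out the next pending URL, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


# Hypothetical site: each page maps to the links found on it.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/c", "/"],
    "/c": [],
}


def crawl(store, worker_names):
    """Workers take turns pulling from the store and pushing parsed links back."""
    crawled = []
    store.push("/")
    turn = 0
    while (url := store.pop()) is not None:
        worker = worker_names[turn % len(worker_names)]
        turn += 1
        crawled.append((worker, url))
        for link in LINKS[url]:
            store.push(link)  # duplicates are rejected centrally
    return crawled


result = crawl(SharedStore(), ["worker-1", "worker-2"])
for worker, url in result:
    print(worker, "crawled", url)
```

No worker keeps its own record of what has been crawled; each one only asks the store. If the store goes away, every worker stops, which is the single point described above.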


yangcongtougg avatar yangcongtougg commented on May 31, 2024

OK, thanks a lot~

