Comments (5)
Setting self.item_db_parameter_name = "parameter" means that self.item_db_parameter_name is used as the key in self.db[self.item_db_parameter_name], which is equivalent to self.db["parameter"]. self.db is assigned by self.db = init_mongodb(), i.e. it initializes MongoDB, and the product parameters are then read from that MongoDB database. To crawl, first go to the jd_spider/jd directory and run scrapy crawl jindong, then go to jd_spider/jd_comment and run scrapy crawl jd_comment.
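The key lookup described above can be sketched as a toy example. A plain dict stands in for the MongoDB handle returned by init_mongodb(), and the document contents are made up for illustration:

```python
# A toy sketch of the lookup described above. A plain dict stands in for
# the MongoDB handle returned by init_mongodb(); the stored documents
# are invented for illustration.
class ItemSpider:
    def __init__(self, db):
        self.db = db                                  # would be init_mongodb()
        self.item_db_parameter_name = "parameter"

    def load_parameters(self):
        # self.db[self.item_db_parameter_name] is just self.db["parameter"]
        return self.db[self.item_db_parameter_name]

fake_db = {"parameter": [{"product_id": "12345"}]}    # fake MongoDB contents
spider = ItemSpider(fake_db)
print(spider.load_parameters())                       # [{'product_id': '12345'}]
```

If the "parameter" collection is empty, this lookup returns nothing and the spider has no products to crawl, which matches the symptom reported further down the thread.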
This spider is distributed, and the distribution is implemented through redis. If you have several machines, you only need to tell each of them which machine's redis to read the pending requests from, and the machines will then cooperate on the crawl. There are many strategies for implementing a distributed crawler; using redis is just one of them, and it saves a lot of trouble. If you still have doubts about whether this project is distributed, take a look at scrapy-redis :)
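A minimal sketch of what that multi-machine setup looks like in each worker's settings.py, assuming scrapy-redis: the scheduler and dupefilter lines appear in the settings quoted later in this thread, while the redis address here is a placeholder you would point at the one shared instance.

```python
# settings.py on every worker machine (a sketch; the address is a placeholder)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Every machine reads and writes the same request queue by pointing at
# one shared redis instance, e.g. the one running on "machine A":
REDIS_URL = "redis://192.168.1.10:6379"
```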
I see that your spider inherits from Spider rather than RedisSpider. May I ask how the crawl is made distributed in that case?
Have you seen the usage notes in the scrapy-redis README? Use the following settings in your project:
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"
# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True
# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10
# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
'scrapy_redis.pipelines.RedisPipeline': 300
}
As for scrapy-redis: because redis replaces scrapy's built-in collections.deque, the queue of pending requests moves from process memory into an in-memory database. But at that point the scheduler built around collections.deque can no longer be used and cannot do distributed scheduling, so scrapy-redis rewrites scrapy's scheduler. That is why it is enough to point the settings file at scrapy-redis's scheduler, duplicate filter, and pipeline.
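The deque-vs-redis point above can be sketched without a real redis server. Here a plain dict of lists mimics the redis server so that two "workers" share one queue; the key name and URLs are made up for illustration:

```python
import collections

# Scrapy's default in-memory request queue lives inside one process:
memory_queue = collections.deque()
memory_queue.append("https://item.jd.com/12345.html")

# A toy stand-in for what scrapy-redis does instead: the queue lives in
# redis (shared by all workers) and is accessed via LPUSH/RPOP. A plain
# dict of lists mimics the redis server so this sketch runs without one.
class FakeRedisQueue:
    def __init__(self, store, key):
        self.store, self.key = store, key
        store.setdefault(key, [])

    def push(self, request):          # redis: LPUSH jingdong:requests <req>
        self.store[self.key].insert(0, request)

    def pop(self):                    # redis: RPOP jingdong:requests
        return self.store[self.key].pop() if self.store[self.key] else None

shared_store = {}                     # stands in for the shared redis server
worker_a = FakeRedisQueue(shared_store, "jingdong:requests")
worker_b = FakeRedisQueue(shared_store, "jingdong:requests")
worker_a.push("https://item.jd.com/12345.html")
print(worker_b.pop())                 # another machine sees the same request
```

Because the queue lives outside any single process, every worker that connects to the same redis instance can pull the next request, which is exactly what the rewritten scheduler provides.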
When I try to run the jd_comment spider, the log scrolls for a moment and then the spider finishes. Looking at the code, what is self.item_db_parameter_name = "parameter" for? The parameters found via this key come back empty, so the final goods list is empty too, and the spider exits.
Also, when starting the jd spider, both scrapy crawl jingdong and scrapy crawl jd return a KeyError, as if the spider could not be found. I don't know where the problem is. Newbie asking for help.
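The KeyError above usually means the name passed to scrapy crawl does not match any spider's `name` attribute; running `scrapy list` in the project directory prints the registered names. A toy sketch of the lookup mechanism (the registry contents here are invented, so check the real name in jd/spiders/):

```python
# A toy sketch of why `scrapy crawl <name>` can raise KeyError: scrapy
# keeps a registry keyed by each spider class's `name` attribute, and an
# unknown name is simply a missing dict key. The registry contents are
# invented; verify the real name via `scrapy list` or jd/spiders/.
registered_spiders = {"jindong": "JindongSpider"}     # name attr -> class

def crawl(name):
    try:
        return registered_spiders[name]
    except KeyError:
        raise KeyError(f"Spider not found: {name}")

print(crawl("jindong"))                               # JindongSpider
```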