
Comments (5)

ramsayleung commented on June 9, 2024

self.item_db_parameter_name = "parameter" means that self.item_db_parameter_name is used as the key into self.db, i.e. self.db[self.item_db_parameter_name], which is equivalent to self.db["parameter"]. self.db is assigned by self.db = init_mongodb(), which initializes MongoDB; the item parameters are then fetched from that MongoDB database. To crawl, first run scrapy crawl jindong in the jd_spider/jd directory, then run scrapy crawl jd_comment in jd_spider/jd_comment.
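A minimal sketch of the lookup described above. The class and document contents here are hypothetical stand-ins (the real init_mongodb() returns a pymongo database; a plain dict is substituted so the sketch runs without a MongoDB server):

```python
def init_mongodb():
    # Stand-in for the project's init_mongodb(): maps a collection
    # name to its documents, the way a pymongo Database maps names
    # to collections. The sample documents are illustrative only.
    return {"parameter": [{"good_id": "sku-1"}, {"good_id": "sku-2"}]}

class CommentSpiderSketch:
    def __init__(self):
        self.db = init_mongodb()
        self.item_db_parameter_name = "parameter"

    def load_parameters(self):
        # self.db[self.item_db_parameter_name] == self.db["parameter"]
        return self.db[self.item_db_parameter_name]

print(len(CommentSpiderSketch().load_parameters()))  # -> 2
```

If the "parameter" collection is empty (e.g. the jindong spider was never run first), this lookup returns nothing and the comment spider has no items to work with, which matches the behavior reported below.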

from jd_spider.

ramsayleung commented on June 9, 2024

This crawler is distributed, and the distribution is implemented through Redis. If you have several machines, you only need to specify which machine's Redis to read the pending requests from, and the machines can then crawl cooperatively. There are many strategies for building a distributed crawler; using Redis is just one of them, and it saves a lot of trouble. If you still have doubts about whether this project is distributed, take a look at scrapy-redis :)
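The coordination idea can be sketched in a few lines: every worker pops requests from one shared queue, so adding machines just adds consumers. In this hypothetical sketch a thread-safe in-process queue stands in for the shared Redis list, and the URLs and worker names are illustrative:

```python
from queue import Queue, Empty
from threading import Thread

todo = Queue()  # stands in for the Redis queue all machines point at
for url in ["https://item.jd.com/1.html",
            "https://item.jd.com/2.html",
            "https://item.jd.com/3.html"]:
    todo.put(url)

done = []

def worker(name):
    # Each "machine" keeps pulling until the shared queue is empty.
    while True:
        try:
            url = todo.get_nowait()
        except Empty:
            return
        done.append((name, url))  # a real worker would fetch and parse here

threads = [Thread(target=worker, args=(f"machine-{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(done))  # -> 3: every URL crawled exactly once, split across workers
```

Because each URL is popped exactly once, no two machines duplicate work; that is the property the Redis-backed queue provides across real hosts.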


ShayChris commented on June 9, 2024

I see that your spider inherits from Spider rather than RedisSpider. May I ask how the crawl is distributed in that case?


ramsayleung commented on June 9, 2024

You mean the usage notes from scrapy-redis? Use the following settings in your project:

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if the queue class is SpiderQueue or SpiderStack,
# and may also block for the same time when your spider starts for the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

As far as scrapy-redis is concerned: because Redis replaces Scrapy's built-in collections.deque, the queue of pending requests moves from in-process memory into an in-memory database. But at that point the original scheduler, which was built around collections.deque, can no longer be used, and it cannot do distributed scheduling anyway, so scrapy-redis rewrites Scrapy's scheduler. That is why it is enough to point the settings file at scrapy-redis's scheduler, its duplicate filter, and its pipeline.
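The scheduler swap described above can be sketched abstractly. Both queue backends expose the same push/pop interface, which is why only the SCHEDULER setting has to change and why a plain Spider still works; the classes below are hypothetical stand-ins (a class-level list plays the role of the Redis server):

```python
from collections import deque

class MemoryQueue:
    """Stand-in for Scrapy's default in-process request queue."""
    def __init__(self):
        self.q = deque()  # visible only to this one process
    def push(self, req):
        self.q.append(req)
    def pop(self):
        return self.q.popleft() if self.q else None

class SharedQueue:
    """Stand-in for scrapy_redis's Redis-backed queue."""
    store = []  # class-level list plays the role of the Redis server
    def push(self, req):
        SharedQueue.store.append(req)
    def pop(self):
        return SharedQueue.store.pop(0) if SharedQueue.store else None

# Two "machines", each with its own scheduler instance:
a, b = SharedQueue(), SharedQueue()
a.push("https://item.jd.com/1.html")
print(b.pop())  # -> https://item.jd.com/1.html
```

Worker b sees the request that worker a enqueued: shared queue state is what turns independent processes into one distributed crawl, without the spider class itself changing.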


kaixinbaba commented on June 9, 2024

When I tried to run the jd-comment spider, the log scrolled for a moment and then the crawl ended. Looking at the code, what is self.item_db_parameter_name = "parameter" for? The parameters found through this key come back empty, so the final goods list is empty as well, and the spider exits.
Also, when starting the jd spider, both scrapy crawl jingdong and scrapy crawl jd return a KeyError, as if the spider cannot be found. I'm not sure what's wrong. Beginner here, asking for help.
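The KeyError can be understood from how `scrapy crawl <name>` works: the CLI looks the argument up in a registry keyed by each spider class's `name` attribute, so the command must match that attribute exactly and must be run from the project directory containing the spider. A hypothetical sketch of that lookup (class and spider names here are illustrative, not the project's actual ones):

```python
class JingdongSpider:
    name = "jingdong"  # whatever the class defines is the only valid key

# Scrapy builds a similar name -> class registry when it loads a project.
registry = {cls.name: cls for cls in [JingdongSpider]}

def crawl(name):
    try:
        return registry[name]
    except KeyError:
        # This is the kind of KeyError `scrapy crawl` surfaces for an
        # unknown name, or when run outside the project directory so
        # the registry was never populated.
        raise KeyError(f"Spider not found: {name}")

print(crawl("jingdong").__name__)  # -> JingdongSpider
```

Checking the spider class's `name` attribute in the source, and running the command from the directory that holds scrapy.cfg, are the two things this lookup depends on.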

