Comments (5)
Setting self.item_db_parameter_name = "parameter" means that self.item_db_parameter_name is used as the key in self.db[self.item_db_parameter_name], which is equivalent to self.db["parameter"]. self.db is assigned by self.db = init_mongodb(), i.e. it initializes MongoDB, and the product parameters are then read from that MongoDB database. To crawl, first go to the jd_spider/jd directory and run scrapy crawl jindong, then go to jd_spider/jd_comment and run scrapy crawl jd_comment.
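The key lookup described above can be sketched as a toy example. A plain dict stands in for the MongoDB handle returned by init_mongodb(), and the document contents are made up for illustration:

```python
# A toy sketch of the lookup described above. A plain dict stands in for
# the MongoDB handle returned by init_mongodb(); the stored documents
# are invented for illustration.
class ItemSpider:
    def __init__(self, db):
        self.db = db                                  # would be init_mongodb()
        self.item_db_parameter_name = "parameter"

    def load_parameters(self):
        # self.db[self.item_db_parameter_name] is just self.db["parameter"]
        return self.db[self.item_db_parameter_name]

fake_db = {"parameter": [{"product_id": "12345"}]}    # fake MongoDB contents
spider = ItemSpider(fake_db)
print(spider.load_parameters())                       # [{'product_id': '12345'}]
```

If the "parameter" collection is empty, this lookup returns nothing and the spider has no products to crawl, which matches the symptom reported further down the thread.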
This spider is distributed, and the distribution is implemented through redis. If you have several machines, you only need to tell each of them which machine's redis to read the pending requests from, and the machines will then cooperate on the crawl. There are many strategies for implementing a distributed crawler; using redis is just one of them, and it saves a lot of trouble. If you still have doubts about whether this project is distributed, take a look at scrapy-redis :)
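A minimal sketch of what that multi-machine setup looks like in each worker's settings.py, assuming scrapy-redis: the scheduler and dupefilter lines appear in the settings quoted later in this thread, while the redis address here is a placeholder you would point at the one shared instance.

```python
# settings.py on every worker machine (a sketch; the address is a placeholder)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Every machine reads and writes the same request queue by pointing at
# one shared redis instance, e.g. the one running on "machine A":
REDIS_URL = "redis://192.168.1.10:6379"
```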
I see that your spider inherits from Spider rather than RedisSpider. May I ask how the crawl is made distributed in that case?
Have you seen the usage notes in the scrapy-redis README? Use the following settings in your project:
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"
# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True
# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10
# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
'scrapy_redis.pipelines.RedisPipeline': 300
}
As for scrapy-redis: because redis replaces scrapy's built-in collections.deque, the queue of pending requests moves from process memory into an in-memory database. But at that point the scheduler built around collections.deque can no longer be used and cannot do distributed scheduling, so scrapy-redis rewrites scrapy's scheduler. That is why it is enough to point the settings file at scrapy-redis's scheduler, duplicate filter, and pipeline.
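The deque-vs-redis point above can be sketched without a real redis server. Here a plain dict of lists mimics the redis server so that two "workers" share one queue; the key name and URLs are made up for illustration:

```python
import collections

# Scrapy's default in-memory request queue lives inside one process:
memory_queue = collections.deque()
memory_queue.append("https://item.jd.com/12345.html")

# A toy stand-in for what scrapy-redis does instead: the queue lives in
# redis (shared by all workers) and is accessed via LPUSH/RPOP. A plain
# dict of lists mimics the redis server so this sketch runs without one.
class FakeRedisQueue:
    def __init__(self, store, key):
        self.store, self.key = store, key
        store.setdefault(key, [])

    def push(self, request):          # redis: LPUSH jingdong:requests <req>
        self.store[self.key].insert(0, request)

    def pop(self):                    # redis: RPOP jingdong:requests
        return self.store[self.key].pop() if self.store[self.key] else None

shared_store = {}                     # stands in for the shared redis server
worker_a = FakeRedisQueue(shared_store, "jingdong:requests")
worker_b = FakeRedisQueue(shared_store, "jingdong:requests")
worker_a.push("https://item.jd.com/12345.html")
print(worker_b.pop())                 # another machine sees the same request
```

Because the queue lives outside any single process, every worker that connects to the same redis instance can pull the next request, which is exactly what the rewritten scheduler provides.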
When I try to run the jd_comment spider, the log scrolls for a moment and then the spider finishes. Looking at the code, what is self.item_db_parameter_name = "parameter" for? The parameters found via this key come back empty, so the final goods list is empty too, and the spider exits.
Also, when starting the jd spider, both scrapy crawl jingdong and scrapy crawl jd return a KeyError, as if the spider could not be found. I don't know where the problem is. Newbie asking for help.
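The KeyError above usually means the name passed to scrapy crawl does not match any spider's `name` attribute; running `scrapy list` in the project directory prints the registered names. A toy sketch of the lookup mechanism (the registry contents here are invented, so check the real name in jd/spiders/):

```python
# A toy sketch of why `scrapy crawl <name>` can raise KeyError: scrapy
# keeps a registry keyed by each spider class's `name` attribute, and an
# unknown name is simply a missing dict key. The registry contents are
# invented; verify the real name via `scrapy list` or jd/spiders/.
registered_spiders = {"jindong": "JindongSpider"}     # name attr -> class

def crawl(name):
    try:
        return registered_spiders[name]
    except KeyError:
        raise KeyError(f"Spider not found: {name}")

print(crawl("jindong"))                               # JindongSpider
```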