Comments (2)
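The crawler's deduplication is handled by scrapy-redis, and I did not override its deduplication rules. scrapy-redis decides whether a request is a duplicate based on the Request's fingerprint, so the same product (that is, the same URL) may not be filtered out when the Request objects differ. As a result, duplicate products can appear.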
from jd_spider.
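To illustrate why fingerprint-based deduplication can let the same product through, here is a minimal sketch of a request fingerprint. It is a simplified model, not Scrapy's or scrapy-redis's actual code: it assumes the fingerprint is a hash over the HTTP method, the URL, and the request body. The function name `simple_fingerprint` and the example URL are hypothetical.

```python
import hashlib


def simple_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Simplified stand-in for a request fingerprint (assumption: method + URL + body)."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(url.encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


# Hypothetical product URL, used only for illustration.
url = "https://item.example.com/100001.html"

# Two identical requests hash to the same value, so the second would be dropped.
fp_get = simple_fingerprint("GET", url)
fp_get_again = simple_fingerprint("GET", url)

# Same URL, but a different method and body: the fingerprint differs,
# so a fingerprint-based filter would treat this as a new request.
fp_post = simple_fingerprint("POST", url, b"page=2")

print(fp_get == fp_get_again)
print(fp_get == fp_post)
```

Under this model, any difference in the hashed fields produces a new fingerprint even when the URL points at the same product, which is one way duplicates can slip past a request-level filter.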
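Actually, at the very beginning there was a duplicate check for products, but I did it through a MongoDB unique index: when an insert hit a duplicate product, it raised an exception, and I swallowed that exception. That approach was inelegant. I have now rewritten the product deduplication logic: each product's sku-id is stored in Redis, and before inserting into MongoDB the code checks whether Redis already holds that sku-id. If it does, the product is not inserted; otherwise it is.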
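The sku-id check described above could look roughly like the following sketch. It is not the project's actual code. It assumes a Redis set is used for the seen sku-ids and relies on `SADD` returning the number of newly added members, which combines "check" and "record" in one step. To keep the example runnable without a Redis server, `InMemorySet` is a hypothetical stand-in that mimics only that return value; the key name `jd:seen_sku_ids`, the `sku_id` field, and the helper names are also assumptions.

```python
class InMemorySet:
    """Hypothetical stand-in for a Redis client; mimics only SADD's return value."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        members = self._sets.setdefault(name, set())
        added = 0
        for value in values:
            if value not in members:
                members.add(value)
                added += 1
        return added  # number of members that were not already present


SEEN_KEY = "jd:seen_sku_ids"  # hypothetical key name


def is_new_sku(client, sku_id) -> bool:
    """Record the sku-id and report whether it was seen for the first time."""
    return client.sadd(SEEN_KEY, str(sku_id)) == 1


def store_items(client, items, insert) -> int:
    """Call insert(item) only for items whose sku-id has not been seen before."""
    inserted = 0
    for item in items:
        if is_new_sku(client, item["sku_id"]):
            insert(item)  # in a real pipeline this would be the MongoDB insert
            inserted += 1
    return inserted


client = InMemorySet()
saved = []
items = [{"sku_id": "1001"}, {"sku_id": "1002"}, {"sku_id": "1001"}]
count = store_items(client, items, saved.append)
print(count, [item["sku_id"] for item in saved])
```

With a real Redis client that exposes `sadd` with the same return convention, `client` could be swapped without changing `store_items`. Recording the sku-id before the insert means a failed insert would leave the id marked as seen, so a production version would need to decide how to handle that case.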