Coder Social home page Coder Social logo

wjianwei126 / sinahousecrawler Goto Github PK

View Code? Open in Web Editor NEW

This project forked from backto17/sinahousecrawler

0.0 2.0 0.0 35 KB

基于scrapy,scrapy-redis实现的一个分布式网络爬虫,爬取了新浪房产的楼盘信息及户型图片,实现了常用的爬虫功能需求.

Python 100.00%

sinahousecrawler's Introduction

#SinaHouseCrawler ##基于scrapy,scrapy-redis实现的一个分布式网络爬虫,爬取了新浪房产的楼盘信息及户型图片,实现了常用的爬虫功能需求.
1.'sinahouse.pipelines.MongoPipeline'实现数据持久化到mongodb,'sinahouse.pipelines.MySQLPipeline'实现数据异步写入mysql
2.'sinahouse.middlewares.UserAgentMiddleware','sinahouse.middlewares.ProxyMiddleware' 分别实现用户代理UserAgent变换和IP代理变换
3.'sinahouse.pipelines.ImagePipeline','sinahouse.pipelines.CustomImagesPipeline'分别是基于多线程的图片下载保存和继承scrapy自带 ImagePipline的实现的图片下载保存
4.'scrapy.extensions.statsmailer.StatsMailer'是通过设置settings中的mai等相关参数实现发送爬虫运行状态信息到指定邮件.scrapy.mail中的 MailSender也可以实现发送自定义内容邮件
5.通过设置setting中的scrapy-redis的相关参数,实现爬虫的分布式运行,或者单机多进程运行.无redis环境时,可以注释掉相关参数,转化为普通的 scrapy爬虫程序
6.运行日志保存

其他说明.LOG_FORMATTER = 'sinahouse.utils.PoliteLogFormatter', 实现raise DropItem()时避免scrapy弹出大量提示信息; 图片保存路径,数据库连接等参数,请根据自己环境设置; 更多相关信息请查阅scrapy以及scrapy-redis文档

运行前,请先安装 requirements.txt 中的模块!
测试方法:
scrapy parse --spider=sinahouse -c parse_item -d 5 "http://data.house.sina.com.cn/jx108948?wt_source=search_nr_bt02"
查看item是否提取成功,item中字段意义,请查看SinaHouseItem中的注释。

运行方法:
单机:
cd SinaHouseCrawler
scrapy crawl sinahouse
分布式:
配置好setting中的scrapy-redis的相关参数,在各机器中分别按单机方式启动即可

爬取目标网站:新浪房产

sinahousecrawler's People

Contributors

backto17 avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.