
picture_scrapy

Project URL: https://github.com/HeLiangHIT/picture_scrapy

.=======================================================================================================.
||           _          _                                                                              ||
||   _ __   (_)   ___  | |_   _   _   _ __    ___           ___    ___   _ __    __ _   _ __    _   _  ||
||  | '_ \  | |  / __| | __| | | | | | '__|  / _ \  _____  / __|  / __| | '__|  / _` | | '_ \  | | | | ||
||  | |_) | | | | (__  | |_  | |_| | | |    |  __/ |_____| \__ \ | (__  | |    | (_| | | |_) | | |_| | ||
||  | .__/  |_|  \___|  \__|  \__,_| |_|     \___|         |___/  \___| |_|     \__,_| | .__/   \__, | ||
||  |_|                                                                                |_|      |___/  ||
|'-----------------------------------------------------------------------------------------------------'|
||                                                              -- a beauty-picture scraping framework. ||
|'====================================================================================================='|
||                                                  .::::.                                             ||
||                                                .::::::::.                                           ||
||                                                :::::::::::                                          ||
||                                                ':::::::::::..                                       ||
||                                                .:::::::::::::::'                                    ||
||                                                  '::::::::::::::.`                                  ||
||                                                    .::::::::::::::::.'                              ||
||                                                  .::::::::::::..                                    ||
||                                                .::::::::::::::''                                    ||
||                                     .:::.       '::::::::''::::                                     ||
||                                   .::::::::.      ':::::'  '::::                                    ||
||                                  .::::':::::::.    :::::    '::::.                                  ||
||                                .:::::' ':::::::::. :::::.     ':::.                                 ||
||                              .:::::'     ':::::::::.::::::.      '::.                               ||
||                            .::::''         ':::::::::::::::'       '::.                             ||
||                           .::''              '::::::::::::::'        ::..                           ||
||                        ..::::                  ':::::::::::'         :'''`                          ||
||                     ..''''':'                    '::::::.'                                          ||
|'====================================================================================================='|
||                                                                              [email protected] ||
||                                                                       https://github.com/HeLiangHIT ||
'======================================================================================================='

Introduction

A picture-scraping framework built on scrapy. It integrates the UserAgentMiddleware and ChromeDownloaderMiddleware downloader middlewares, plus a RedisSetPipeline pipeline that stores each scraped picture record in a Redis set. A separate asynchronous coroutine downloader then pops the picture URLs from Redis one by one and downloads them to disk in batches.
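
For reference, here is a minimal sketch of what such a set-based pipeline can look like, assuming redis-py; the details (field handling, key naming) are illustrative, not a copy of this repo's actual pipelines code.

```python
# Minimal sketch of a set-based Redis pipeline (illustrative only; the
# real implementation in this repo may differ).
import json

import redis


class RedisSetPipeline:
    """Push each scraped picture record into a Redis set, deduplicating by value."""

    def open_spider(self, spider):
        # REDIS_IP / REDIS_PORT come from settings.py (see the install steps below).
        self.client = redis.Redis(
            host=spider.settings.get('REDIS_IP', '127.0.0.1'),
            port=spider.settings.getint('REDIS_PORT', 6379),
        )
        self.key = f'picture:{spider.name}'  # e.g. picture:jiandan

    def process_item(self, item, spider):
        # SADD ignores duplicates, so a re-crawled URL is stored only once.
        self.client.sadd(self.key, json.dumps(dict(item)))
        return item
```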

Sample download results:

(screenshots: crawling in progress... crawling in progress... final results...)

P.S. The crawl looks a bit slow but the volume is astonishing: one night of crawling yields more pictures 👀 than I could look through in a lifetime... there are so many that I've already grown numb to them.

Architecture

Four picture spiders are implemented: jiandan, meizitu, mzitu, and mmjpg (see the start_crawl example below).

The advantage of this design is that it supports distributed crawling plus distributed downloading: for example, I crawl picture URLs on a Mac while a Windows machine with an external hard drive downloads the files. Mac and Windows share the work nicely, and more machines can join in the same way.

Installation && Usage

  1. On one machine, start redis-server path/to/redis.conf. In the config file, comment out bind 127.0.0.1 ::1 and set protected-mode no so that other machines can connect.
  2. git clone this repository on each machine, then run pip install -r requirement.txt in the project directory (or use pipenv shell).
  3. Set the correct REDIS_IP and REDIS_PORT in settings.py (a sketch follows this list).
  4. Run scrapy crawl xxx to crawl a given site.
  5. Run python picture_downloader.py --key='xxx' --dir='xxx' to download that site's pictures; see python picture_downloader.py --help for the full option list.
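
A hypothetical settings.py excerpt for step 3 (the values are placeholders; point them at whichever machine runs redis-server in step 1):

```python
# settings.py (excerpt) -- placeholder values for illustration.
REDIS_IP = '192.168.1.100'  # host running redis-server (step 1)
REDIS_PORT = 6379           # default Redis port
```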
Asynchronous coroutine downloader: it continuously reads picture JSON records from Redis and uses coroutines to download them into the target folder. A valid record looks like:
`{"url": "http://www.a.com/a/a.jpg", "name": "a.jpg", "folder": "a", "page":"www.a.com"}`
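
Such a record maps naturally onto a Scrapy item; here is a hypothetical sketch of the matching fields (the real class in this repo may be named and structured differently):

```python
# Hypothetical item matching the JSON record above.
import scrapy


class PictureItem(scrapy.Item):
    url = scrapy.Field()     # direct URL of the image file
    name = scrapy.Field()    # file name to save as, e.g. "a.jpg"
    folder = scrapy.Field()  # sub-folder under the download directory
    page = scrapy.Field()    # page on which the image was found
```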

Usage:
  picture_downloader.py [--dir=dir] [--ip=ip] [--port=port] [--key=key] [--empty_exit=empty_exit] [--concurrency=concurrency]
  picture_downloader.py --version
Options:
  --dir=dir                    select picture save dir. * default: '$HOME/Pictures/scrapy/'
  --ip=ip                      select redis ip. [default: 127.0.0.1]
  --port=port                  select redis port. [default: 6379]
  --key=key                    select redis key. [default: picture:jiandan]
  --empty_exit=empty_exit      whether to exit when the redis set is empty. [default: true]
  --concurrency=concurrency    concurrency level of the downloader. [default: 20]
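
To make the options concrete, here is a minimal sketch of the core download loop, assuming aiohttp and redis-py; the real picture_downloader.py adds retries, logging, and the --empty_exit wait behavior on top of this.

```python
# Minimal sketch of the coroutine download loop (illustrative only).
import asyncio
import json
import os

import aiohttp
import redis


async def download_one(session, sem, save_dir, info):
    """Fetch one picture described by a JSON record popped from Redis."""
    folder = os.path.join(save_dir, info.get('folder', ''))
    os.makedirs(folder, exist_ok=True)
    async with sem:  # cap in-flight requests (the --concurrency option)
        async with session.get(info['url']) as resp:
            data = await resp.read()
    with open(os.path.join(folder, info['name']), 'wb') as f:
        f.write(data)


async def drain(key, save_dir, ip='127.0.0.1', port=6379, concurrency=20):
    client = redis.Redis(host=ip, port=port)  # --ip / --port
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = []
        # Sketch simplification: pop everything up front, then download.
        while (raw := client.spop(key)) is not None:
            tasks.append(download_one(session, sem, save_dir, json.loads(raw)))
        await asyncio.gather(*tasks)


# Example: asyncio.run(drain('picture:jiandan', os.path.expanduser('~/Pictures/scrapy')))
```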

My own helper functions, for example:

function start_crawl(){
    name=$1
    rm -f log/${name}.log  # clear the previous log for this spider
    scrapy crawl ${name} &
    sleep 2 && python picture_downloader.py --key=picture:${name} --dir=/Users/heliang/Pictures/scrapy/${name} --empty_exit=0 --concurrency=20
}
function stop_crawl(){
    name=$1
    while [ $(ps -ef | grep "scrapy crawl ${name}" | grep -v grep | wc -l) -ge 1 ]; do
        ps -ef | grep "scrapy crawl ${name}" | grep -v grep | awk '{print $2}' | xargs kill # stop the spider
        sleep 1
    done
}
function clear_all(){
    while [ $(ps -ef | grep "scrapy crawl" | grep -v grep | wc -l) -ge 1 ]; do
        ps -ef | grep 'scrapy crawl' | grep -v grep | awk '{print $2}' | xargs kill # stop all spiders
    done
    while [ $(ps -ef | grep "chromedriver" | grep -v grep | wc -l) -ge 1 ]; do
        ps -ef | grep chromedriver | grep -v grep | awk '{print $2}' | xargs kill -9 # kill any leftover chromedriver processes
    done
    rm -f log/*.log
}

start_crawl jiandan # or: meizitu / mzitu / mmjpg

TODO

  1. Proxy IPs: no IP bans have been encountered so far, so no proxy pool is implemented; one can be added later if needed.
  2. Skip downloading files that already exist locally -- done
  3. Skip pages that were already crawled (certain index pages excepted) - how to persist this across spider/machine restarts? -- use RedisCrawlSpider
  4. Extend to video crawling: implement a video_downloader.py modeled on picture_downloader.py.
  5. Add mysql/sqlite3-backed pipelines to store data more conveniently and ease later feature work.
  6. Drop the dependency on some of the default middlewares to speed up processing.

Contributing

  1. Fork this repository
  2. Create a Feat_xxx branch
  3. Commit your code
  4. Open a Pull Request

Scan the QR code to follow the author for more updates. If this project helps you, please star it to support continued development.

(QR code: follow the author)
