photography-crawler's Introduction

saint-yellow/beauty-photography

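Crawler Exercise No. 2: Targeted Crawling of an Image Website

Goal

Crawl the photo set metadata and the images of the 四海资讯 photo gallery (https://www.shzx.org/b/12-0.html). [screenshot]

Tech stack

Mainly Scrapy combined with SQLAlchemy. Scrapy crawls the pages and extracts items (Item); SQLAlchemy stores the items in a database.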

Project structure

.
├── logs
│   └── 2019-11-16.log
├── photos
│   └── img.shzx.org
│       ├── 138-7500
│       │   ├── 138-7500-1573929540.webp
│       │   └── 138-7500-1573929818.webp
│       ├── 141-7501
│       │   ├── 141-7501-1573929546.webp
│       │   └── 141-7501-1573929587.webp
│       └── 162-2550
│           ├── 162-2550-1573929592.jpg
│           └── 162-2550-1573929596.jpg
├── README.md
├── run.py
├── scraper
│   ├── database.py
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── photo.py
│       ├── photoset.py
│       └── __pycache__
├── scrapy.cfg
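  • The logs folder holds the crawler's run logs
  • The photos folder holds the images downloaded by the crawler
  • The scraper folder holds the crawler's main code
  • scraper/database.py defines the database tables and the structure of each table
  • scraper/spiders/photo.py defines PhotoSpider, which crawls the images
  • scraper/spiders/photoset.py defines PhotoSetSpider, which crawls the photo set metadata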

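Code organization - One Item, One Spider, One Pipeline

This project has two jobs: storing photo set metadata in the database and downloading images. The author therefore gives each job its own item, its own spider, and its own pipeline. Specifically:
Crawling photo sets (photoset) = PhotoSetItem + PhotoSetSpider + PhotoSetPipeline
Crawling photos (photo) = PhotoItem + PhotoSpider + PhotoPipeline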
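
Scrapy passes every item to every pipeline that is enabled project-wide, so the pairing above needs some wiring. One possible approach, sketched here, is Scrapy's per-spider `custom_settings`. This is an assumption about how the pairing could be configured, not the author's confirmed setup: the dotted paths are inferred from the project tree, and the priority `300` and the `photos` store directory are illustrative values.

```python
# Hypothetical per-spider settings; the dotted paths and numbers are assumptions.
PHOTOSET_SETTINGS = {
    "ITEM_PIPELINES": {"scraper.pipelines.PhotoSetPipeline": 300},
}

PHOTO_SETTINGS = {
    "ITEM_PIPELINES": {"scraper.pipelines.PhotoPipeline": 300},
    # Pipelines derived from ImagesPipeline need a directory for downloads.
    "IMAGES_STORE": "photos",
}


def pipelines_for(settings: dict) -> list:
    """Return the pipeline paths a settings dict enables, lowest priority number first."""
    pipelines = settings.get("ITEM_PIPELINES", {})
    return sorted(pipelines, key=pipelines.get)
```

In a spider class, such a dict would be assigned as a class attribute, for example `custom_settings = PHOTOSET_SETTINGS`, so that only the matching pipeline runs for that spider.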

Spider responsibilities

  • PhotoSetSpider: crawls the photo set metadata [screenshots]
  • PhotoSpider: crawls the images [screenshot]

Other details

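  • Photo set metadata is inserted into the database through the ORM
from scrapy import Spider
from scraper.database import PhotoSet  # assumed location of the model

class PhotoSetPipeline(object):
    def process_item(self, item, spider: Spider):
        # self.session is a SQLAlchemy session, assumed to be created elsewhere
        record = PhotoSet(
            url=item['url'],
            title=item['title'],
            author=item['author'],
            source=item['source'],
            datetime_published=item['datetime_published'],
            description=item['description'],
            tags=item['tags'])
        self.session.add(record)
        self.session.commit()
        # Scrapy expects process_item to return the item (or raise DropItem)
        return item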
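
The pipeline above uses `self.session` without showing where it comes from. A minimal sketch of one way to manage it follows, using Scrapy's `open_spider` and `close_spider` pipeline hooks. Everything beyond the field names is an assumption made for illustration: the `photoset` table name, the column types, the in-memory SQLite URL, and the rollback on failure are not taken from the project.

```python
from sqlalchemy import Column, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()
FIELDS = ("url", "title", "author", "source",
          "datetime_published", "description", "tags")


class PhotoSet(Base):
    """Minimal stand-in for the table defined in scraper/database.py."""
    __tablename__ = "photoset"  # hypothetical table name
    id = Column(Integer, primary_key=True)
    url = Column(String(255), unique=True)
    title = Column(String(255))
    author = Column(String(255))
    source = Column(String(255))
    datetime_published = Column(String(32))  # stored as text in this sketch
    description = Column(Text)
    tags = Column(String(255))  # assumes tags arrive as a single string


class PhotoSetPipeline:
    def __init__(self, database_url="sqlite:///:memory:"):  # hypothetical default
        self.database_url = database_url

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.engine = create_engine(self.database_url)
        Base.metadata.create_all(self.engine)
        self.session = sessionmaker(bind=self.engine)()

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        self.session.close()
        self.engine.dispose()

    def process_item(self, item, spider):
        record = PhotoSet(**{field: item[field] for field in FIELDS})
        try:
            self.session.add(record)
            self.session.commit()
        except Exception:
            # Leave the session usable for the next item, then re-raise.
            self.session.rollback()
            raise
        return item
```

Committing once per item is simple, but it costs one transaction per record. Batching several records per commit is a possible direction if insertion speed matters.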
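  • A Referer header is added to each image download request, which lowers the chance that the download fails
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from scraper.items import PhotoItem  # assumed location of the item class
class PhotoPipeline(ImagesPipeline):
    def get_media_requests(self, item: PhotoItem, info):
        for photo_url in item['photo_urls']:
            # The page that embeds the photo is sent as the Referer
            yield Request(photo_url, meta={'name': item['notation']}, headers={'Referer': item['webpage_url']})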
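
The request above carries a `name` in its `meta`, and the project tree shows files stored as `photos/img.shzx.org/138-7500/138-7500-1573929540.webp`. The sketch below shows how such a relative path could be derived. The naming scheme (image host, then name, then name plus a Unix timestamp) is inferred from the tree only; the author's actual implementation is not shown, and `build_photo_path` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlparse


def build_photo_path(photo_url: str, name: str, timestamp: int) -> str:
    """Build '<host>/<name>/<name>-<timestamp><ext>' relative to the image store."""
    parsed = urlparse(photo_url)
    # Keep the extension of the remote file, e.g. ".webp" or ".jpg".
    extension = posixpath.splitext(parsed.path)[1]
    return f"{parsed.netloc}/{name}/{name}-{timestamp}{extension}"
```

In a Scrapy project, a helper like this would typically be called from an override of `ImagesPipeline.file_path`, with `request.meta['name']` as the name and the current time as the timestamp.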

How to run

  • Run a single spider on its own

[screenshots]

  • Run several spiders at the same time

[screenshot]
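
The exact commands are not reproduced in this text, and the contents of run.py are not shown. As a rough sketch, both modes can be driven through Scrapy's `scrapy crawl <name>` command. The spider names `photoset` and `photo` are assumptions inferred from the module names, and the helpers below are hypothetical, not the project's run.py.

```python
import subprocess


def crawl_command(spider_name: str) -> list:
    """Build the argv that runs one spider through Scrapy's command line."""
    return ["scrapy", "crawl", spider_name]


def run_spiders(spider_names) -> list:
    """Start each spider in its own process, wait for all, and return the exit codes."""
    processes = [subprocess.Popen(crawl_command(name)) for name in spider_names]
    return [process.wait() for process in processes]
```

Calling `run_spiders(["photoset"])` would run a single spider and `run_spiders(["photoset", "photo"])` would run both side by side. Either call has to be made from the directory that contains scrapy.cfg.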

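The run output is written to a log file in the logs folder that is named after the date, for example: 2019-11-11.log
The actual name depends on the date of the run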
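
A date-based log name like this can be computed in a few lines. The sketch below assumes the ISO date format seen in the examples; `log_file_for` is a hypothetical helper, and how the project actually sets its log file is not shown.

```python
from datetime import date
from pathlib import Path


def log_file_for(day: date, log_dir: str = "logs") -> str:
    """Return the log path for a given day, e.g. 'logs/2019-11-11.log'."""
    return (Path(log_dir) / f"{day.isoformat()}.log").as_posix()
```

In settings.py, the result could be assigned to Scrapy's `LOG_FILE` setting, for example `LOG_FILE = log_file_for(date.today())`.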

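Results

  • Crawling photo sets [screenshot]
  • Crawling photos [screenshot]

A full-site crawl was never attempted; the results shown are for reference only

Known defects

  • Image downloads sometimes fail

Planned improvements

  • Improve crawling efficiency
  • Improve database insertion speed
  • Improve the image download success rate

Possible extensions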
