
generals-space / site-mirror-py

[Gitee](https://gitee.com/generals-space/site-mirror-py) General-purpose crawler, site-cloning tool, whole-site downloader

License: MIT License

Python 100.00%
crawler mirror commoncrawl spider

site-mirror-py's Introduction

site-mirror-py

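This is a general-purpose crawler and whole-site download tool. It can download all of a site's resources, including pages, images, CSS stylesheets, and JS files, and store them in a specified local directory.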

Python version: 3.7

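Features

  1. Configurable crawl depth (0 means unlimited depth, 1 means fetch only a single page)
  2. Configuration options to skip downloading images, CSS, JS, fonts, and other resources
  3. A blacklist to block resources at specified links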
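
The three features above amount to a per-link decision: is the link within the depth limit, is its resource type enabled, and is it off the blacklist? The following is a minimal sketch of that decision, not the project's actual code. `SKIP_EXTENSIONS`, `BLACKLIST`, and `should_fetch` are hypothetical names (the real options live in crawler/config.py), and the sketch assumes the start page counts as depth 1.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical settings for illustration; the real option names are in crawler/config.py.
SKIP_EXTENSIONS = {".png", ".jpg", ".woff"}    # resource types that should not be downloaded
BLACKLIST = ["ads.example.com", "/tracking/"]  # substrings of links that should be blocked


def should_fetch(url: str, depth: int, max_depth: int) -> bool:
    """Return True if a link discovered at `depth` should be downloaded."""
    # Assumption: the start page is depth 1, so max_depth == 1 fetches a single page.
    # max_depth == 0 disables the depth limit entirely.
    if max_depth != 0 and depth > max_depth:
        return False
    # Reject any link that contains a blacklisted substring.
    if any(pattern in url for pattern in BLACKLIST):
        return False
    # Compare the file extension of the URL path against the disabled resource types.
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension not in SKIP_EXTENSIONS
```

With these settings, `should_fetch("https://example.com/a.html", 2, 1)` is rejected by the depth limit, while the same URL at any depth passes when `max_depth` is 0.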

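When the download finishes, you can use the docker-compose.yml in the repository to start an nginx container and browse the mirrored site locally.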

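Note: this tool can only download static pages. It cannot handle content loaded dynamically through JS (bilibili, for example), so it is generally limited to sites made up of articles, images, news, and similar content.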

Usage

Install dependencies

pip install -r requirements.txt

Edit the main.py entry script. There are two main settings:

  1. main_url: the home page of the target site
  2. max_depth: the crawl depth

For other settings, see crawler/config.py.
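
As an illustration of the two settings, here is a small sketch that groups and validates them before a crawl starts. Only the names `main_url` and `max_depth` come from this README. The `CrawlSettings` class and its checks are assumptions made for the example and are not how main.py is actually written.

```python
from dataclasses import dataclass


@dataclass
class CrawlSettings:
    """Hypothetical container for the two main settings (not the project's real class)."""

    main_url: str       # home page of the target site
    max_depth: int = 0  # 0 = unlimited depth, 1 = only the single start page

    def __post_init__(self) -> None:
        # A crawler needs an absolute HTTP(S) URL as its starting point.
        if not self.main_url.startswith(("http://", "https://")):
            raise ValueError("main_url must start with http:// or https://")
        # Negative depths have no meaning under the 0 = unlimited convention.
        if self.max_depth < 0:
            raise ValueError("max_depth must be 0 (unlimited) or a positive integer")


# Example: mirror only the start page of a placeholder site.
settings = CrawlSettings(main_url="https://example.com/", max_depth=1)
```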

See also

An equivalent Golang version also exists.

Its implementation logic is the same.


Screenshots

site-mirror-py's People

Contributors

generals-space


site-mirror-py's Issues

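Duplicate requests

When the crawl depth is set to 0 (or to any other level), some shared resource URLs are crawled repeatedly.
Suspected cause: when a task is enqueued, is there no check for whether the URL already exists in the database?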
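
One common way to avoid this kind of duplication is to check each URL against a record of already-seen URLs at enqueue time. The sketch below shows the idea under stated assumptions: `DedupQueue` is a hypothetical class, and an in-memory set stands in for the database mentioned in the report. It is not the project's code or a confirmed fix.

```python
from collections import deque
from urllib.parse import urldefrag


class DedupQueue:
    """Hypothetical task queue that refuses URLs it has already accepted."""

    def __init__(self) -> None:
        self._seen = set()     # stand-in for the "does this URL exist?" database lookup
        self._queue = deque()  # pending download tasks in FIFO order

    def enqueue(self, url: str) -> bool:
        """Add `url` unless it was enqueued before; return True if it was added."""
        # Drop the #fragment so page.html and page.html#top count as one resource.
        key, _fragment = urldefrag(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._queue.append(key)
        return True

    def dequeue(self) -> str:
        """Remove and return the oldest pending URL."""
        return self._queue.popleft()

    def __len__(self) -> int:
        return len(self._queue)
```

A stylesheet linked from every page would then be accepted on the first `enqueue` call and rejected on every later one, so it is downloaded once.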
