
qichachascrapy's Introduction

qichachaScrapy

Project overview

Collects supplier information from Qichacha (https://www.qichacha.com).

Interface preview

[Screenshots: qcc_list, qcc_info_title, qcc_info_detail]

Execution process

[Screenshots: qcc_run, qcc_run2]

Results

[Screenshot: qcc_data]

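Software architecture

Architecture overview:
1. Scrapy + Scrapy-redis for distributed crawling
2. Bloom filter filtering, with Redis SADD for deduplication
3. Selenium for rendering JavaScript and passing the slider verification
4. Data stored in MySQL (via SQLAlchemy) and in MongoDB (via Motor, asynchronous)
5. User-Agent spoofing, IP proxies, and cookies to keep the login state
6. Packaging with scrapyd-client
7. Gerapy as the crawler management platform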
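
The Bloom filter deduplication in item 2 can be illustrated with a minimal in-memory sketch. This is not the project's code: the project relies on a Redis-backed filter, while the class below keeps its bits in a local `bytearray`. The names `BloomFilter` and `seen_before`, the bit-array size, and the hash count are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny in-memory Bloom filter (illustrative only, not Redis-backed)."""

    def __init__(self, num_bits=1 << 20, num_hashes=6):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        data = value.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(2, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def seen_before(bloom, url):
    """Return True if url was probably seen already; otherwise record it."""
    if url in bloom:
        return True
    bloom.add(url)
    return False
```

A Bloom filter can report false positives but never false negatives, so a crawler using it may occasionally skip a new URL but will not fetch a recorded one twice.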

Installation

Install the Python libraries: Scrapy, scrapy-redis-bloomfilter, scrapyd, scrapyd-client, Redis, sqlalchemy, motor, and so on
Install Redis
Install Chrome
Install MySQL and MongoDB
Set up the Gerapy environment (omitted)
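
After installing, a quick check like the one below can confirm that the Python libraries are importable. It is a sketch only: the import names in `REQUIRED_MODULES` are assumed from the package names listed above and may differ between versions, and `missing_modules` is a hypothetical helper, not part of the project.

```python
from importlib.util import find_spec

# Import names assumed from the package list; adjust them if yours differ.
REQUIRED_MODULES = [
    "scrapy",
    "scrapy_redis_bloomfilter",
    "scrapyd",
    "scrapyd_client",
    "redis",
    "sqlalchemy",
    "motor",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

This checks only the Python side. Redis, Chrome, MySQL, and MongoDB still need to be verified separately.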

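For the lazy: one-step Docker setup

1. Install Docker (omitted)
2. Download the Docker image: https://pan.baidu.com/s/1JU9xqbsDkYh-3ehhiuK1Jg
3. Load the image (this takes a while): docker load -i centos7:python3.tar
4. Create and run the container: docker run -itd -p 8002:8001 -p 6802:6800 --privileged=true centos7:python3 /usr/sbin/init
5. Open Gerapy to manage and deploy the crawlers: http://IP:8002/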

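If http://IP:8002/ returns a 500 error, fix it as follows:
1. Find the container ID (CID): docker ps -a
2. Open a shell in the container: docker exec -it <container CID> /bin/bash
3. Restart the Gerapy service from the container shell:
ps -ef|grep gerapy
kill -9 <gerapy process ID>
/gerapy/gerapy_start.sh # script that restarts gerapy
[Screenshot: docker_python3]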
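
To confirm whether the Gerapy page is returning 500 before and after the restart, a small probe such as the following may help. It is a sketch using only the standard library, and `http_status` is a hypothetical helper name. Replace `IP` in the URL with the Docker host's address.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is kept.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return None
```

For example, `http_status("http://IP:8002/")` returning `500` matches the failure described above, while `None` suggests the container or the port mapping is not up at all.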

Usage

1. Run as a plain script: python run.py
2. Gerapy crawler management platform: deploy, start, stop, and delete crawlers (omitted)
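
The contents of run.py are not shown here. A launcher script for a Scrapy project often looks roughly like the sketch below, which hands a `scrapy crawl` command line to `scrapy.cmdline.execute`. The spider name `qcc` and the helpers `build_argv` and `main` are assumptions for illustration, not taken from the project.

```python
import sys

# Hypothetical spider name; use the `name` attribute of the real spider.
SPIDER_NAME = "qcc"


def build_argv(spider_name, extra_args=()):
    """Build the argv list equivalent to `scrapy crawl <spider_name> ...`."""
    return ["scrapy", "crawl", spider_name, *extra_args]


def main(argv=None):
    """Start the crawl; must run from the directory containing scrapy.cfg."""
    # Imported lazily so build_argv stays usable without Scrapy installed.
    from scrapy.cmdline import execute

    extra = sys.argv[1:] if argv is None else list(argv)
    execute(build_argv(SPIDER_NAME, extra))
```

Calling `main()` under an `if __name__ == "__main__":` guard at the bottom of the file would make `python run.py` behave like `scrapy crawl qcc`, and extra arguments such as `-a key=value` would be passed through.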

Gerapy management platform

[Screenshots: qcc_gerapy_1, qcc_gerapy_2, qcc_gerapy_3]
