magical_spider

Magical Spider 🕷: a scraping solution that works for almost any website.

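Background

The tongue-in-cheek version: in 2022, with global warming and cut-throat competition in every industry, the crawler world has it worse, where Douyin is now the entry level and Rui Shu (瑞数) the starting point. magical_spider was released to slow the loss of talent.

The real reason: a whim. Our generation must stand strong and restore the glory of selenium!

Blog: lxspider. Crawler reverse-engineering tools site: lxtools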

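Project overview

  • Does not use the conventional driver.pageSource approach.
  • Calls chromedriver remotely through Flask to issue XMLHttpRequest calls.
  • Records task status in sqlite.
  • Bypasses some checks with undetected_selenium + stealth.min.js.
  • Currently works for cookie encryption such as Rui Shu (瑞数) and Jiasule (加速乐), and for the request-process encryption used by the Toutiao family of sites.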
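
To illustrate the idea of issuing an XMLHttpRequest from inside the browser rather than reading the page source, here is a minimal sketch. It is not the project's actual implementation: `build_xhr_script` is a hypothetical helper that only builds the JavaScript, on the assumption that it would then be handed to Selenium's `execute_async_script`, which passes a completion callback as the last entry of `arguments`.

```python
import json


def build_xhr_script(url, method="GET", body=None):
    """Return JavaScript that performs an XMLHttpRequest inside the page.

    Hypothetical helper: the script is written for Selenium's
    execute_async_script, which supplies a callback as the last argument.
    """
    # json.dumps produces safely quoted JavaScript literals (None -> null).
    return (
        "var done = arguments[arguments.length - 1];\n"
        "var xhr = new XMLHttpRequest();\n"
        f"xhr.open({json.dumps(method.upper())}, {json.dumps(url)}, true);\n"
        "xhr.onload = function () { done(xhr.responseText); };\n"
        "xhr.onerror = function () { done(null); };\n"
        f"xhr.send({json.dumps(body)});\n"
    )


if __name__ == "__main__":
    print(build_xhr_script("https://www.cnipa.gov.cn/col/col57/index.html"))
    # With a live driver (not created here), the call would look like:
    # html = driver.execute_async_script(build_xhr_script(url))
```

Because the request runs in the page's own context, it is sent with the cookies the browser already holds, which is presumably why this route helps with cookie-based protections. A real POST would also need a Content-Type header, which this sketch leaves out.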

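Disclaimer

  • This project is for study and reference only.
  • Any risk-control checks must be handled on your own; for slider captchas, see middlerware.py.
  • This approach suits emergency scenarios or cases where the required data volume is low. If time allows, reverse engineering is the recommended route. Recommended reading: 《爬虫逆向进阶实战》 (Advanced Crawler Reverse Engineering in Practice)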

Deployment

Linux deployment documentation


Usage

1. Configure settings.py and start the Flask service.

2. For how to run tasks, see the files in the demo folder, mainly runflow.py.

3. Test code

GET request

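from demo.runflow import magical_start, magical_request, magical_close

project_name = 'cnipa'
base_url = 'https://www.cnipa.gov.cn'

# Start a session; returns the session id and the process URL used by later calls
session_id, process_url = magical_start(project_name, base_url)

# Request the page through the session and print the length of the response
print(len(magical_request(session_id, process_url, 'https://www.cnipa.gov.cn/col/col57/index.html')))

# Close the session when finished
magical_close(session_id, process_url, project_name)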

POST request

import json

from demo.runflow import magical_start, magical_request, magical_close

project_name = 'chinadrugtrials'
base_url = 'http://www.chinadrugtrials.org.cn'

# Start a session; returns the session id and the process URL used by later calls
session_id, process_url = magical_start(project_name, base_url)

# Search form fields; "currentpage" selects the result page
data = {
    "id": "", "ckm_index": "", "sort": "desc", "sort2": "", "rule": "CTR",
    "secondLevel": "0", "currentpage": "2", "keywords": "", "reg_no": "",
    "indication": "", "case_no": "", "drugs_name": "", "drugs_type": "",
    "appliers": "", "communities": "", "researchers": "", "agencies": "", "state": "",
}
# The form data is passed to magical_request as a JSON string
formdata = json.dumps(data)

print(magical_request(session_id=session_id, process_url=process_url,
                      request_url='http://www.chinadrugtrials.org.cn/clinicaltrials.searchlist.dhtml',
                      request_type='post', formdata=formdata))

# Close the session when finished
magical_close(session_id, process_url, project_name)
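
Both examples follow the same start, request, close pattern, and a failed request would skip the final `magical_close` call. One possible way to guard against that is a small context manager, sketched below. `magical_session` is a hypothetical wrapper, not part of the project. It takes the start and close functions as parameters, using the call signatures shown in the examples above, so the sketch runs without the Flask service. The stand-in functions and the `http://127.0.0.1:5000` address are placeholders.

```python
from contextlib import contextmanager


@contextmanager
def magical_session(project_name, base_url, start, close):
    """Open a session with `start` and always release it with `close`."""
    session_id, process_url = start(project_name, base_url)
    try:
        yield session_id, process_url
    finally:
        # Runs even if a request raises, so the session is not left open.
        close(session_id, process_url, project_name)


if __name__ == "__main__":
    # Stand-ins so the sketch runs on its own; in real use, pass
    # magical_start and magical_close from demo.runflow instead.
    def fake_start(project_name, base_url):
        return "session-1", "http://127.0.0.1:5000"  # placeholder values

    def fake_close(session_id, process_url, project_name):
        print("closed", session_id, project_name)

    with magical_session("cnipa", "https://www.cnipa.gov.cn",
                         fake_start, fake_close) as (sid, url):
        print("working with", sid, url)
```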

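4. The index page lets you view and manage the tasks that are currently running, and also shows system memory and disk usage.

5. The demo folder contains runflow.py, which summarizes the task flow, along with Douyin and drug administration (药监局) examples, including single-task and multi-task samples.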
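
As a rough picture of how task status might be kept in sqlite and then listed on a page like the index page, here is a minimal sketch using Python's standard `sqlite3` module. The table layout, the status values, and the function names (`open_task_db`, `set_status`, `running_tasks`) are all assumptions for illustration; the project's real schema is not shown in this README.

```python
import sqlite3


def open_task_db(path=":memory:"):
    """Open the database and create the (assumed) tasks table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "session_id TEXT PRIMARY KEY, "
        "project_name TEXT NOT NULL, "
        "status TEXT NOT NULL)"
    )
    return conn


def set_status(conn, session_id, project_name, status):
    # Insert the task, or overwrite its row if the session is already known.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO tasks (session_id, project_name, status) "
            "VALUES (?, ?, ?)",
            (session_id, project_name, status),
        )


def running_tasks(conn):
    """Return (session_id, project_name) for every task marked running."""
    rows = conn.execute(
        "SELECT session_id, project_name FROM tasks "
        "WHERE status = 'running' ORDER BY session_id"
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_task_db()
    set_status(conn, "s1", "cnipa", "running")
    set_status(conn, "s2", "chinadrugtrials", "running")
    set_status(conn, "s1", "cnipa", "closed")
    print(running_tasks(conn))
```

Keying the table on the session id means each start or close call is a single upsert, and the list of running tasks is a single query.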
