
zhihucrawler's Introduction

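🐞 Crawling Zhihu with WebMagic

  • Crawl the Zhihu hot list
  • Crawl the Zhihu topic list and fetch highly upvoted answers
  • Trigger the jobs automatically with a GitHub workflow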

🔨 Project environment

  • JDK 17
  • Gradle 7.4.2

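📄 Archived files

The data folder contains:

  • archives: daily hot list data, updated automatically at 12:00 every day

  • json: daily hot list JSON data and answer JSON data

  • topic: topic data, limited to topics with more than 20,000 followers

  • answer: highly upvoted answers for each topic, limited to answers with more than 10,000 upvotes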

The project structure is as follows:

├─.github
│  └─workflows
├─data
│  ├─answer
│  ├─archives
│  ├─json
│  └─topic
└─src
    ├─main
    │  ├─java
    │  │  └─com
    │  │      └─zwl
    │  │          ├─constant
    │  │          ├─listener
    │  │          ├─model
    │  │          ├─process
    │  │          └─util
    │  └─resources
    └─test
        ├─java
        │  └─com
        │      └─zwl
        │          └─test
        └─resources

Sample screenshots:

  • Hot list data:

  • Answer data:

Zhihu API

Get the Zhihu hot list

Get the Zhihu topic list

method=next&params={"topic_id":1761,"offset":20,"hash_id":""}

Note: crawling topics requires a cookie. Use the value of the z_c0 cookie.
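The request parameters and cookie described above can be sketched with the JDK's built-in HTTP client. This is only an illustration, not the project's own WebMagic code: the class name `TopicRequestSketch`, the placeholder endpoint `https://example.invalid/topic-list`, and the choice of a form-encoded POST are all assumptions, since the README does not give the endpoint URL or the HTTP method.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds (but does not send) a topic list request.
class TopicRequestSketch {

    // Produces method=next&params={"topic_id":...,"offset":...,"hash_id":""}
    // with the JSON part percent-encoded for a form body.
    static String buildBody(long topicId, int offset) {
        String params = "{\"topic_id\":" + topicId
                + ",\"offset\":" + offset
                + ",\"hash_id\":\"\"}";
        return "method=next&params=" + URLEncoder.encode(params, StandardCharsets.UTF_8);
    }

    // Attaches the z_c0 cookie value; POST and the content type are assumptions.
    static HttpRequest buildRequest(URI endpoint, String zC0, long topicId, int offset) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Cookie", "z_c0=" + zC0)
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(topicId, offset)))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder endpoint; replace it with the real topic list URL.
        URI endpoint = URI.create("https://example.invalid/topic-list");
        HttpRequest request = buildRequest(endpoint, "YOUR_Z_C0_VALUE", 1761, 20);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(buildBody(1761, 20));
    }
}
```

Raising `offset` by the page size on each call (20 in the example parameters) would presumably page through the topic list, though the README does not state the page size explicitly.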

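XPath syntax

Selecting nodes

Expression   Description
nodename     Selects the child elements of the current node that are named nodename.
/            Selects from the root node (as a step separator, selects child nodes).
//           Selects matching nodes among the descendants of the current node, regardless of their position.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.

Path expressions

Expression        Description
bookstore         Selects the bookstore elements that are children of the current node.
/bookstore        Selects the root element bookstore.
bookstore/book    Selects all book elements that are direct children of bookstore.
//book            Selects all book elements, regardless of where they are in the document.
bookstore//book   Selects all book elements that are descendants of bookstore, regardless of where they sit beneath it.
//@lang           Selects all attributes named lang.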
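The path expressions above can be tried out with the JDK's `javax.xml.xpath` package. The sketch below is a standalone illustration rather than the crawler's WebMagic selectors. The sample XML and the class name `XPathSketch` are made up for the example, with one book nested inside a `section` element to show the difference between `/` and `//`.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: counts the nodes each expression selects.
class XPathSketch {

    // Two books directly under bookstore, one nested deeper inside section.
    static final String XML = "<bookstore>"
            + "<book><title lang=\"en\">Harry Potter</title></book>"
            + "<book><title lang=\"en\">Everyday Italian</title></book>"
            + "<section><book><title lang=\"zh\">Learning XML</title></book></section>"
            + "</bookstore>";

    // Evaluates the expression with the document node as the current node.
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("/bookstore"));      // 1: the root element
        System.out.println(count("bookstore/book"));  // 2: direct children only
        System.out.println(count("//book"));          // 3: anywhere in the document
        System.out.println(count("bookstore//book")); // 3: all descendants of bookstore
        System.out.println(count("//@lang"));         // 3: every lang attribute
    }
}
```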

zhihucrawler's Issues

How was the login restriction resolved?

{"error":{"need_login":true,"redirect":"https://www.zhihu.com/account/unhuman?type=unhuman&message=%E7%B3%BB%E7%BB%9F%E7%9B%91%E6%B5%8B%E5%88%B0%E6%82%A8%E7%9A%84%E7%BD%91%E7%BB%9C%E7%8E%AF%E5%A2%83%E5%AD%98%E5%9C%A8%E5%BC%82%E5%B8%B8%EF%BC%8C%E4%B8%BA%E4%BF%9D%E8%AF%81%E6%82%A8%E7%9A%84%E6%AD%A3%E5%B8%B8%E8%AE%BF%E9%97%AE%EF%BC%8C%E8%AF%B7%E8%BE%93%E5%85%A5%E9%AA%8C%E8%AF%81%E7%A0%81%E8%BF%9B%E8%A1%8C%E9%AA%8C%E8%AF%81%E3%80%82%E8%8B%A5%E9%A2%91%E7%B9%81%E5%87%BA%E7%8E%B0%E6%AD%A4%E9%A1%B5%E9%9D%A2%EF%BC%8C%E5%8F%AF%E5%B0%9D%E8%AF%95%E7%99%BB%E5%BD%95%E5%90%8E%E8%AE%BF%E9%97%AE%E7%9F%A5%E4%B9%8E%E3%80%82&need_login=true","code":40352,"message":"系统监测到您的网络环境存在异常,为保证您的正常访问,请输入验证码进行验证。若频繁出现此页面,可尝试登录后访问知乎。"}}
