doubanspider's Introduction

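Douban group data query tool

Purpose:

Make it easy to find the information you need among the posts published in various groups, and reduce the work of scanning for keywords by eye.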

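Usage:

  1. First install Python and Scrapy. See: https://docs.scrapy.org/en/latest/intro/install.html

  2. In the project directory, run: scrapy crawl douban -a start_url=http://www.douban.com/group/xxxxx/discussion -a max_page=30 -a search_strs='XXX,XX' -a proxy='http://xxx.xxx.xxx.xxx:xxxx' The proxy parameter is optional. start_url is the URL of the first page of the Douban group; note that it must be the first page of the paginated listing (the one with page numbers). max_page is the maximum number of pages to query. search_strs is the list of keywords to filter by, separated by commas. Configuring a proxy is strongly recommended to keep your IP from being banned.

  3. When the run finishes, an output.html file is generated in the same directory; open it in a browser.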
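The keyword filtering described in step 2 can be pictured with a minimal sketch. This is not the project's actual code: the helper names `parse_search_strs` and `matches_any` are hypothetical, and plain substring matching is an assumption about how a title is compared against the keywords.

```python
def parse_search_strs(search_strs):
    """Split a comma-separated keyword string into a clean list.

    Hypothetical helper: mirrors the format of the search_strs argument,
    dropping empty entries and surrounding whitespace.
    """
    return [kw.strip() for kw in search_strs.split(",") if kw.strip()]


def matches_any(title, keywords):
    """Return True if the title contains at least one keyword (substring match)."""
    return any(kw in title for kw in keywords)


if __name__ == "__main__":
    keywords = parse_search_strs("rent, sublet")
    titles = ["Room for rent near station", "Selling a bike", "Summer sublet"]
    for title in titles:
        if matches_any(title, keywords):
            print(title)
```

A post is kept when any one keyword appears in its title, which matches the stated goal of replacing manual keyword scanning.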

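    You can also use Docker to install Scrapy:

    1. Download and install Docker for your operating system; see https://docs.docker.com/get-docker/

    2. The Dockerfile in the project directory already defines the Scrapy image. Run docker image build -t scrapy . in that directory to create a Docker image named scrapy.

    3. Run docker run -v /xxxx/xxx/doubanSpider:/usr/src/app/spider -w /usr/src/app/spider scrapy scrapy crawl douban -a start_url=https://www.douban.com/group/xxxxx/discussion -a search_strs='xxx,xx' -a max_page=50 -a proxy='xxx.xxx.xxx.xxx:xxxx' to start the query program, where /xxxx/xxx/doubanSpider is the project directory and the remaining parameters have the meanings described above. When the run finishes, an output.html file is likewise generated in this directory; open it in a browser.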
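Because the docker run command in step 3 is long, it may help to see how its parts fit together. The sketch below assembles the same argument list in Python; `build_docker_command` is a hypothetical helper, not part of the project, and the placeholder paths and values are copied from the command above.

```python
import shlex


def build_docker_command(project_dir, spider_args, image="scrapy"):
    """Assemble the docker run argument list for the douban spider.

    Hypothetical helper: project_dir is mounted at /usr/src/app/spider,
    which is also used as the working directory inside the container.
    """
    mount_point = "/usr/src/app/spider"
    cmd = [
        "docker", "run",
        "-v", f"{project_dir}:{mount_point}",  # mount the project into the container
        "-w", mount_point,                      # run from the mounted project directory
        image,
        "scrapy", "crawl", "douban",
    ]
    # Each spider argument becomes its own "-a key=value" pair.
    for key, value in spider_args.items():
        cmd += ["-a", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    cmd = build_docker_command(
        "/xxxx/xxx/doubanSpider",
        {
            "start_url": "https://www.douban.com/group/xxxxx/discussion",
            "search_strs": "xxx,xx",
            "max_page": 50,
        },
    )
    # shlex.join (Python 3.8+) quotes the arguments for pasting into a shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would avoid shell quoting issues entirely; printing it is only meant to show the resulting command line.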

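Login check

Since a Douban update, users who are not logged in cannot view group pages with higher page numbers; during crawling, such requests are redirected to the login page.

Workaround: log in with your own account in a browser, open the developer tools, find the Cookies, and copy the value of dbcl2. Then add the following to the spider's launch arguments: -a cookies='{"dbcl2":"xxxxxxyour_cookie_valuexxxxx"}'. The cookies value must be JSON. You can also try adding other cookies; in my testing, this one alone was enough.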
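Since the cookies argument arrives as a JSON string, it has to be decoded into a mapping before it can be attached to requests. The sketch below shows one way to do that; `parse_cookies` is a hypothetical helper and how the spider actually handles the argument is an assumption. The resulting dict has the shape that Scrapy's `Request` accepts through its `cookies` parameter.

```python
import json


def parse_cookies(cookies_arg):
    """Decode the -a cookies='...' argument into a name -> value dict.

    Hypothetical helper: rejects anything that is not a JSON object,
    so a malformed argument fails early instead of mid-crawl.
    """
    cookies = json.loads(cookies_arg)
    if not isinstance(cookies, dict):
        raise ValueError('cookies must be a JSON object, e.g. {"dbcl2": "..."}')
    # Cookie names and values are sent as strings.
    return {str(name): str(value) for name, value in cookies.items()}


if __name__ == "__main__":
    print(parse_cookies('{"dbcl2":"xxxxxxyour_cookie_valuexxxxx"}'))
```

Note the single quotes around the JSON on the command line: they keep the shell from stripping the double quotes that JSON requires.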

doubanspider's People

Contributors

wuruiliang
