
weibo_crawler's Introduction

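weibo_crawler -- a Weibo data crawler

Crawling Sina Weibo posts by keyword: uses Weibo's advanced search to collect posts that contain a keyword within a given time range.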

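  • Approach: build the search URL, fetch the page, and parse the Weibo post IDs out of it. The post data can then be loaded into a database through the Weibo API. This program only collects post IDs; alternatively, you can parse the post content yourself with lxml.

    Log in to Sina Weibo, open Advanced Search, enter the keyword "空气污染" (air pollution), select "Real-time", set the time range to "2013-07-02-2:2013-07-09-2" and the region to "Beijing", then submit the search. The address bar changes to:
http://s.weibo.com/wb/%25E7%25A9%25BA%25E6%25B0%2594%25E6%25B1%25A1%25E6%259F%2593&xsort=time&region=custom:11:1000&timescope=custom:2013-07-02-2:2013-07-09-2&Refer=g
  Fixed part: http://s.weibo.com/wb/
  Keyword, UTF-8 URL-encoded twice: %25E7%25A9%25BA%25E6%25B0%2594%25E6%25B1%25A1%25E6%259F%2593
  Sort by "Real-time": xsort=time
  Search region: region=custom:11:1000
  Search time range: timescope=custom:2013-07-02-2:2013-07-09-2
  Can be ignored: Refer=g
  Show similar posts: nodup=1    Note: this option collects more posts and is recommended. It is not set by default, in which case some similar posts are omitted.
  Page number of a given request: page=1
  Advanced search returns at most 50 pages of posts, so the time interval should be kept as small as practical. The crawler class is therefore set up to collect at most 50 pages of posts per time period.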
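  • Dependencies: lxml (HTML parsing) and py2exe (needed to build the Windows GUI program).

  • Usage:

    1. Log in to Weibo and paste your own cookie in place of your_cookie on line 122 (automatic login is not supported yet).
    2. Run python fetch_weibo_by_keyword.py from the command line. To build the Windows GUI program, open a console on Windows and run python setup.py py2exe.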
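The URL layout described above can be sketched as follows. This is a minimal illustration, not the code in fetch_weibo_by_keyword.py: the function names `build_search_url`, `format_hour`, and `hourly_windows` are hypothetical, and the parameter order is an assumption based on the example URL. The second helper splits a long period into short windows, which is one way to stay under the 50-page limit per search.

```python
from datetime import datetime, timedelta
from urllib.parse import quote

BASE_URL = "http://s.weibo.com/wb/"


def format_hour(moment):
    # The example timescope uses YYYY-MM-DD-H with an unpadded hour (2013-07-02-2).
    return f"{moment:%Y-%m-%d}-{moment.hour}"


def build_search_url(keyword, start, end, region="custom:11:1000", page=1, nodup=True):
    """Assemble an advanced-search URL from the parts listed in the README."""
    # Percent-encode the UTF-8 keyword twice: the second pass turns each '%' into '%25'.
    encoded = quote(quote(keyword, safe=""), safe="")
    parts = [
        BASE_URL + encoded,
        "xsort=time",  # sort by "Real-time"
        f"region={region}",
        f"timescope=custom:{format_hour(start)}:{format_hour(end)}",
    ]
    if nodup:
        parts.append("nodup=1")  # also return similar posts
    parts.append(f"page={page}")
    return "&".join(parts)


def hourly_windows(start, end, hours=1):
    """Yield consecutive (window_start, window_end) pairs that cover [start, end)."""
    step = timedelta(hours=hours)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


if __name__ == "__main__":
    begin, finish = datetime(2013, 7, 2, 2), datetime(2013, 7, 9, 2)
    print(build_search_url("空气污染", begin, finish))
    for lo, hi in hourly_windows(begin, begin + timedelta(hours=3)):
        print(build_search_url("空气污染", lo, hi, page=1))
```

Each window would then be requested for page=1 through page=50 at most, with the logged-in cookie sent in the request headers.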

Crawling Sina Weibo posts with GPS coordinates: uses the Weibo API to collect geotagged posts within a given area and time range.

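  • Approach:

    1. Choose several centre points and draw a buffer of 10 km radius around each, so that the buffers together cover the whole city;
    2. Because there are many circular regions, the crawl can be multithreaded: each buffer is one circular region and is handled by one thread;
    3. A separate thread writes the collected posts to the database.
  • Dependencies: yaml (crawl parameters) and pymongo (connection to the MongoDB database).

  • Usage: run python fetch_weibo_by_geo.py from the command line.

  • Configuration: see config.yaml.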
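The three steps of the approach can be sketched as below. This is an assumption-laden illustration rather than the logic of fetch_weibo_by_geo.py: `buffer_centres` and `crawl` are hypothetical names, the square-grid layout is one possible way to pick centre points, and `fetch` and `store` are placeholders for the Weibo API call and the database write, which are not reproduced here.

```python
import math
import queue
import threading

KM_PER_DEG_LAT = 111.32  # spherical approximation, adequate at city scale
_DONE = object()         # sentinel that tells the writer thread to stop


def buffer_centres(south, west, north, east, radius_km=10.0):
    """Return (lat, lon) centres whose circles of radius_km cover the bounding box."""
    # Squares of side r*sqrt(2) fit inside circles of radius r, so tiling the
    # box with such squares leaves no uncovered gaps (up to the approximation).
    step_km = radius_km * math.sqrt(2)
    lat_step = step_km / KM_PER_DEG_LAT
    # Use the latitude nearest the equator, where a degree of longitude is longest.
    widest_lat = min(abs(south), abs(north)) if south * north > 0 else 0.0
    lon_step = step_km / (KM_PER_DEG_LAT * math.cos(math.radians(widest_lat)))
    centres = []
    lat = south + lat_step / 2
    while lat - lat_step / 2 < north:
        lon = west + lon_step / 2
        while lon - lon_step / 2 < east:
            centres.append((lat, lon))
            lon += lon_step
        lat += lat_step
    return centres


def crawl(centres, fetch, store, radius_km=10.0):
    """Run one fetch thread per circle and one extra thread that stores results."""
    results = queue.Queue()

    def worker(lat, lon):
        # fetch is a placeholder: it should return the posts found in one circle.
        for post in fetch(lat, lon, radius_km):
            results.put(post)

    def writer():
        while True:
            post = results.get()
            if post is _DONE:
                break
            store(post)  # placeholder, e.g. an insert into a MongoDB collection

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    workers = [threading.Thread(target=worker, args=centre) for centre in centres]
    for thread in workers:
        thread.start()
    for thread in workers:
        thread.join()
    results.put(_DONE)  # all fetchers are done; let the writer drain and exit
    writer_thread.join()


if __name__ == "__main__":
    collected = []
    points = buffer_centres(39.8, 116.2, 40.0, 116.6)  # illustrative box only
    crawl(points, lambda lat, lon, r: [{"lat": lat, "lon": lon}], collected.append)
    print(len(points), "circles,", len(collected), "posts stored")
```

Routing every write through a single queue keeps database access in one thread, so the fetch threads never block on storage.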

weibo_crawler's People

Contributors

heloowird
