
weibospider's Introduction




An actively maintained Sina Weibo crawler 🚀🚀🚀

Features

  • Built on the new weibo.com API, providing the richest set of field information
  • Multiple crawling modes: user profiles, tweets, fans, followings, reposts, comments, and keyword search
  • The core code is only about 100 lines, highly readable and easy to customize to your needs

Quick Start

Clone & Install

git clone https://github.com/nghuyong/WeiboSpider.git --depth 1 
cd WeiboSpider
pip install -r requirements.txt

Replace the Cookie

Visit https://weibo.com/, log in to your account, open the browser's developer tools, and refresh the page.

In the Network tab, find a weibo.com request and copy its cookie value. Edit weibospider/cookie.txt and replace its contents with the cookie you just copied.

Add a Proxy IP (Optional)

Override the fetch_proxy method, which must return a proxy IP; see here for details.
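
A minimal sketch of an overridden fetch_proxy, assuming a local proxy-pool HTTP service at 127.0.0.1:5010 and a plain "http://ip:port" return value (both are illustrative assumptions; follow the project's own fetch_proxy signature and the guide linked above):

import requests

def fetch_proxy():
    # Ask a hypothetical local proxy-pool service for one proxy address.
    resp = requests.get("http://127.0.0.1:5010/get", timeout=5)
    ip_port = resp.json().get("proxy")
    if not ip_port:
        return None  # no proxy available; crawl directly
    # Return the proxy in http://ip:port form.
    return f"http://{ip_port}"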

Run the Spiders

Rewrite the start_requests function in ./weibospider/spiders/* according to your actual needs.
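
For example, a minimal sketch of a customized start_requests for the user spider (the user IDs, the endpoint URL, and the Scrapy skeleton here are illustrative assumptions; see the project's spiders for the real request construction):

from scrapy import Spider, Request

class UserSpider(Spider):
    name = "user"

    def start_requests(self):
        # Replace with the user IDs you actually want to crawl.
        user_ids = ["1749127163", "1087770692"]
        for user_id in user_ids:
            # Illustrative endpoint; the project's user spider builds the real URL.
            url = f"https://weibo.com/ajax/profile/info?uid={user_id}"
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Minimal handling; the real spider parses the JSON payload into items.
        yield response.json()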

The collected data is saved in the output directory, in files named {spider.name}_{datetime}.jsonl.
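
Since each line of the output file is a standalone JSON object, it can be loaded with a few lines of Python (the file name below is illustrative):

import json

records = []
with open("output/user_20221027.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))
print(len(records))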

Crawling User Profiles

cd weibospider
python run_spider.py user
{
  "crawl_time": 1666863485,
  "_id": "1749127163",
  "avatar_hd": "https://tvax4.sinaimg.cn/crop.0.0.1080.1080.1024/001Un9Srly8h3fpj11yjyj60u00u0q7f02.jpg?KID=imgbed,tva&Expires=1666874283&ssig=a%2FMfgFzvRo",
  "nick_name": "雷军",
  "verified": true,
  "description": "小米董事长,金山软件董事长。业余爱好是天使投资。",
  "followers_count": 22756103,
  "friends_count": 1373,
  "statuses_count": 14923,
  "gender": "m",
  "location": "北京 海淀区",
  "mbrank": 7,
  "mbtype": 12,
  "verified_type": 0,
  "verified_reason": "小米创办人,董事长兼CEO;金山软件董事长;天使投资人。",
  "birthday": "",
  "created_at": "2010-05-31 23:07:59",
  "desc_text": "小米创办人,董事长兼CEO;金山软件董事长;天使投资人。",
  "ip_location": "IP属地:北京",
  "sunshine_credit": "信用极好",
  "label_desc": [
    "V指数 财经 75.30分",
    "热门财经博主 数据飙升",
    "昨日发博3,阅读数100万+,互动数1.9万",
    "视频累计播放量9819.3万",
    "群友 3132"
  ],
  "company": "金山软件",
  "education": {
    "school": "武汉大学"
  }
}

Crawling a User's Fan List

python run_spider.py fan
{
  "crawl_time": 1666863563,
  "_id": "1087770692_5968044974",
  "follower_id": "1087770692",
  "fan_info": {
    "_id": "5968044974",
    "avatar_hd": "https://tvax1.sinaimg.cn/default/images/default_avatar_male_180.gif?KID=imgbed,tva&Expires=1666874363&ssig=UuzaeK437R",
    "nick_name": "用户5968044974",
    "verified": false,
    "description": "",
    "followers_count": 0,
    "friends_count": 195,
    "statuses_count": 9,
    "gender": "m",
    "location": "其他",
    "mbrank": 0,
    "mbtype": 0,
    "credit_score": 80,
    "created_at": "2016-06-25 22:30:13"
  }
}
...

Crawling a User's Following List

python run_spider.py follow
{
  "crawl_time": 1666863679,
  "_id": "1087770692_7083568088",
  "fan_id": "1087770692",
  "follower_info": {
    "_id": "7083568088",
    "avatar_hd": "https://tvax4.sinaimg.cn/crop.0.0.1080.1080.1024/007JnVEcly8gyqd9jadjlj30u00u0gpn.jpg?KID=imgbed,tva&Expires=1666874479&ssig=9zhfeMPLzr",
    "nick_name": "蒋昀霖",
    "verified": true,
    "description": "工作请联系:[email protected]",
    "followers_count": 329216,
    "friends_count": 58,
    "statuses_count": 342,
    "gender": "m",
    "location": "北京",
    "mbrank": 6,
    "mbtype": 12,
    "credit_score": 80,
    "created_at": "2019-04-17 16:25:43",
    "verified_type": 0,
    "verified_reason": "东申未来 演员"
  }
}
...

Crawling Tweet Comments

python run_spider.py comment
{
  "crawl_time": 1666863805,
  "_id": 4826279188108038,
  "created_at": "2022-10-19 13:41:29",
  "like_counts": 1,
  "ip_location": "来自河南",
  "content": "五周年快乐呀,请坤哥哥继续保持这份热爱,奔赴下一场山海",
  "comment_user": {
    "_id": "2380967841",
    "avatar_hd": "https://tvax4.sinaimg.cn/crop.0.0.888.888.1024/002B8iv7ly8gv647ipgxvj60oo0oojtk02.jpg?KID=imgbed,tva&Expires=1666874604&ssig=%2FdGaaIRkhf",
    "nick_name": "流年执念的二瓜娇",
    "verified": false,
    "description": "蓝桉已遇释怀鸟,不爱万物唯爱你。",
    "followers_count": 238,
    "friends_count": 1655,
    "statuses_count": 12546,
    "gender": "f",
    "location": "河南",
    "mbrank": 6,
    "mbtype": 11
  }
}
...

Crawling Tweet Reposts

python run_spider.py repost
{
  "_id": "4826312651310475",
  "mblogid": "Mb2vL5uUH",
  "created_at": "2022-10-19 15:54:27",
  "geo": null,
  "ip_location": "发布于 德国",
  "reposts_count": 0,
  "comments_count": 0,
  "attitudes_count": 0,
  "source": "iPhone客户端",
  "content": "共享[鼓掌][太开心][鼓掌]五周年快乐!//@陈坤:#山下学堂五周年# 五年, 感谢同行。",
  "pic_urls": [],
  "pic_num": 0,
  "user": {
    "_id": "2717869081",
    "avatar_hd": "https://tvax1.sinaimg.cn/crop.0.0.160.160.1024/a1ff6419ly8gz1xoq9oolj204g04g745.jpg?KID=imgbed,tva&Expires=1666876939&ssig=Cl93CLjdB%2F",
    "nick_name": "YuFeeC",
    "verified": false,
    "mbrank": 0,
    "mbtype": 0
  },
  "url": "https://weibo.com/2717869081/Mb2vL5uUH",
  "crawl_time": 1666866139
}
...

Crawling Tweets by Tweet ID

python run_spider.py tweet_by_tweet_id
{
    "_id": "4762810834227120",
    "mblogid": "LqlZNhJFm",
    "created_at": "2022-04-27 10:20:54",
    "geo": null,
    "ip_location": null,
    "reposts_count": 1890,
    "comments_count": 1924,
    "attitudes_count": 12167,
    "source": "三星Galaxy S22 Ultra",
    "content": "生于乱世纵横四海,义之所在不计生死,孤勇者陈恭一生当如是。#风起陇西今日开播# #风起陇西#  今晚,恭候你!",
    "pic_urls": [],
    "pic_num": 0,
    "isLongText": false,
    "user": {
        "_id": "1087770692",
        "avatar_hd": "https://tvax1.sinaimg.cn/crop.0.0.1080.1080.1024/40d61044ly8gbhxwgy419j20u00u0goc.jpg?KID=imgbed,tva&Expires=1682768013&ssig=r1QurGoc2L",
        "nick_name": "陈坤",
        "verified": true,
        "mbrank": 7,
        "mbtype": 12,
        "verified_type": 0
    },
    "video": "http://f.video.weibocdn.com/o0/CmQEWK1ylx07VAm0nrxe01041200YDIc0E010.mp4?label=mp4_720p&template=1280x720.25.0&ori=0&ps=1CwnkDw1GXwCQx&Expires=1682760813&ssig=26udcPSXFJ&KID=unistore,video",
    "url": "https://weibo.com/1087770692/LqlZNhJFm",
    "crawl_time": 1682757213
}
...

Crawling Tweets by User ID

python run_spider.py tweet_by_user_id
{
  "crawl_time": 1666864583,
  "_id": "4762810834227120",
  "mblogid": "LqlZNhJFm",
  "created_at": "2022-04-27 10:20:54",
  "geo": null,
  "ip_location": null,
  "reposts_count": 1907,
  "comments_count": 1924,
  "attitudes_count": 12169,
  "source": "三星Galaxy S22 Ultra",
  "content": "生于乱世纵横四海,义之所在不计生死,孤勇者陈恭一生当如是。#风起陇西今日开播# #风起陇西#  今晚,恭候你!",
  "pic_urls": [],
  "pic_num": 0,
  "video": "http://f.video.weibocdn.com/o0/CmQEWK1ylx07VAm0nrxe01041200YDIc0E010.mp4?label=mp4_720p&template=1280x720.25.0&ori=0&ps=1CwnkDw1GXwCQx&Expires=1666868183&ssig=RlIeOt286i&KID=unistore,video",
  "url": "https://weibo.com/1087770692/LqlZNhJFm"
}
...

Crawling Tweets by Keyword

python run_spider.py tweet_by_keyword
{
  "crawl_time": 1666869049,
  "keyword": "丽江",
  "_id": "4829255386537989",
  "mblogid": "Mch46rqPr",
  "created_at": "2022-10-27 18:47:50",
  "geo": {
    "type": "Point",
    "coordinates": [
      26.962427,
      100.248299
    ],
    "detail": {
      "poiid": "B2094251D06FAAF44299",
      "title": "山野文创旅拍圣地",
      "type": "checkin",
      "spot_type": "0"
    }
  },
  "ip_location": "发布于 云南",
  "reposts_count": 0,
  "comments_count": 0,
  "attitudes_count": 1,
  "source": "iPhone1314iPhone客户端",
  "content": "丽江小漾日出\n推出户外移动餐桌\n接受私人定制\n让美食融入美景心情自然美丽了!\n#小众宝藏旅行地##超出片的艺术街区#  ",
  "pic_urls": [
    "https://wx1.sinaimg.cn/orj960/4b138405gy1h7k1a56c4oj234022onph",
    "https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19eb2kxj22ts1vvb2a",
    "https://wx1.sinaimg.cn/orj960/4b138405gy1h7k1a0wzglj22ua1w7hdw",
    "https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19wsafnj231x21a7wj",
    "https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19jd1xkj22oh1sbkjo",
    "https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19mma74j22ru1ukx6q",
    "https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19tf1bfj234022oe85",
    "https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19pk37pj234022okjm",
    "https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19g6nzfj20wi0lo7my"
  ],
  "pic_num": 9,
  "user": {
    "_id": "1259570181",
    "avatar_hd": "https://tvax1.sinaimg.cn/crop.0.0.1080.1080.1024/4b138405ly8gzfkfikyqvj20u00u0ag1.jpg?KID=imgbed,tva&Expires=1666879848&ssig=6PUDG5RonQ",
    "nick_name": "飞鸟与鱼",
    "verified": true,
    "mbrank": 7,
    "mbtype": 12,
    "verified_type": 0
  },
  "url": "https://weibo.com/1259570181/Mch46rqPr"
}
...

Changelog

  • 2024.02: Support collecting the read count of your own tweets #313
  • 2024.02: Support collecting video play counts #315
  • 2024.01: Support tracing reposted tweets back to the original tweet #314
  • 2023.12: Support collecting second-level (reply) comments on tweets #302
  • 2023.12: Support collecting a user's tweets within a specified time range #308
  • 2023.04: Support crawling tweets by tweet ID #272
  • 2022.11: Support retrieving more than 1200 pages of search results per day for a single keyword #257
  • 2022.11: Support fetching the full text of long tweets
  • 2022.11: Keyword-based tweet search supports specifying a time range
  • 2022.10: Add collection of IP location information for user, tweet, and comment data
  • 2022.10: Rebuild the project on the weibo.com site

Other Work

  • A very large-scale dataset, WeiboCOV, has been built and can be requested for free; it contains 20 million active Weibo users and 60 million tweets. See here.

weibospider's People

Contributors

incandescentxxc, lf-lin, lvehaoshen, nghuyong, xyangwu


weibospider's Issues

Weibo_spider error

Running weibo_spider.py raises ModuleNotFoundError: No module named 'sina.items'.
Is this because a third-party library is missing?

Cannot fetch tweet content

Hi, I'm a newcomer and would like to ask for some help. I tried your example, but why am I not getting any tweet data?
Many thanks!

2018-10-03 02:05:33 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.10 (default, Oct  6 2017, 22:29:07) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-10-03 02:05:33 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sina.spiders', 'SPIDER_MODULES': ['sina.spiders'], 'DOWNLOAD_DELAY': 3, 'BOT_NAME': 'sina'}
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-03 02:05:33 [scrapy.middleware] INFO: Enabled item pipelines:
['sina.pipelines.MongoDBPipeline']
2018-10-03 02:05:33 [scrapy.core.engine] INFO: Spider opened
2018-10-03 02:05:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-03 02:05:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-03 02:05:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/2803301701/info> (referer: None)
2018-10-03 02:05:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/1699432410/info> (referer: None)
2018-10-03 02:05:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/u/2803301701> (referer: https://weibo.cn/2803301701/info)
2018-10-03 02:05:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weibo.cn/u/2803301701>
{'_id': '2803301701',
 'authentication': u'\u300a\u4eba\u6c11\u65e5\u62a5\u300b\u6cd5\u4eba\u5fae\u535a',
 'birthday': u'1948-06-15',
 'crawl_time': 1538546734,
 'fans_num': 72033515,
 'follows_num': 3033,
 'gender': u'\u7537',
 'nick_name': u'\u4eba\u6c11\u65e5\u62a5',
 'province': u'\u5317\u4eac',
 'tweets_num': 91312,
 'vip_level': u'6\u7ea7'}
2018-10-03 02:05:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/u/1699432410> (referer: https://weibo.cn/1699432410/info)
2018-10-03 02:05:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weibo.cn/u/1699432410>
{'_id': '1699432410',
 'authentication': u'\u65b0\u534e\u793e\u6cd5\u4eba\u5fae\u535a',
 'birthday': u'1931-11-07',
 'crawl_time': 1538546737,
 'fans_num': 42741520,
 'follows_num': 4242,
 'gender': u'\u7537',
 'nick_name': u'\u65b0\u534e\u89c6\u70b9',
 'province': u'\u5317\u4eac',
 'tweets_num': 100178,
 'vip_level': u'5\u7ea7'}
2018-10-03 02:05:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://weibo.cn/2803301701/profile?page=1> (referer: https://weibo.cn/u/2803301701)
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
2018-10-03 02:05:48 [weibo_spider] ERROR: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

Exception: 当前账号池为空 (the account pool is empty)

Hi, on the search branch with 2 accounts, how can I fix the problem of getting the "当前账号池为空" (account pool is empty) message after crawling for a while? Thanks.

Can the crawled data be viewed through a table interface?

Some questions that may involve MongoDB operations, thanks:

  1. Besides accessing the data in the mongo shell, can it be viewed as a table, like in data_structure.md?

  2. Can the crawled data be exported to other formats that are easier to process?

Can MongoDB handle inserts at the scale of tens of millions of records per day in a distributed setup?

In a distributed setup with multiple hosts writing to a single MongoDB and tens of millions of records per day, can the write speed keep up? Will writes block? MongoDB performance should degrade as the data volume grows, right? I don't have such an environment, so I'm asking here.

Content after an img tag is dropped when crawling tweets that contain linked emojis


For tweets that contain an img emoji, the emoji is usually followed by a closing tag, so the crawl ends at the emoji and the text after it is dropped. How can this be solved? I noticed that a tweet ends with a    ; I'm not sure whether that could be used as the condition to detect the end.

ERROR: Spider error processing

2018-07-23 22:16:08 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: sina)
2018-07-23 22:16:08 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) - [GCC 7.2.0], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.2.2, Platform Linux-3.10.0-862.9.1.el7.x86_64-x86_64-with-centos-7.5.1804-Core
2018-07-23 22:16:08 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'sina', 'CONCURRENT_REQUESTS': 32, 'DOWNLOAD_DELAY': 0.5, 'NEWSPIDER_MODULE': 'sina.spiders', 'SPIDER_MODULES': ['sina.spiders']}
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'sina.middlewares.UserAgentMiddleware',
'sina.middlewares.CookiesMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-23 22:16:09 [scrapy.middleware] INFO: Enabled item pipelines:
['sina.pipelines.MongoDBPipeline']
2018-07-23 22:16:09 [scrapy.core.engine] INFO: Spider opened
2018-07-23 22:16:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-23 22:16:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-23 22:16:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn&uid=5303798085> from <GET https://weibo.cn/5303798085/info>
2018-07-23 22:16:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn&uid=5303798085> (referer: None)
2018-07-23 22:16:09 [scrapy.core.scraper] ERROR: Spider error processing <GET https://passport.weibo.cn/signin/login?entry=mweibo&r=http%3A%2F%2Fweibo.cn&uid=5303798085> (referer: None)
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/root/WeiboSpider/sina/spiders/weibo_spider.py", line 29, in parse_information
ID = re.findall('(\d+)/info', response.url)[0]
IndexError: list index out of range
2018-07-23 22:16:09 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-23 22:16:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 825,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2765,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 7, 23, 14, 16, 9, 992802),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 54239232,
'memusage/startup': 54239232,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2018, 7, 23, 14, 16, 9, 126466)}
2018-07-23 22:16:09 [scrapy.core.engine] INFO: Spider closed (finished)

Has the account-pool + distributed branch of the crawler stopped working?

I bought 300+ fresh accounts online. They still worked two months ago, but since last week there have been problems: it keeps reporting twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out. or 418, and eventually the account pool ends up empty. Logging in again doesn't help either, even though the accounts work fine when I test them in the browser.

How can an IP pool be added on top of this code?

As the title says. This crawler is great! But after a few days of use I found that within a few minutes I get "[weibo_spider] ERROR: ip 被封了!!!请更换ip,或者停止程序..." (the IP has been banned; change the IP or stop the program). I'd like to ask how to add an IP pool to the source code.

How to crawl Weibo content in real time

Asking for advice!
How can I meet the following requirements:

01 Crawl all tweets for a given keyword
02 Starting today, crawl every day or every hour for a period of time

Cannot log in to weibo.cn to obtain a cookie

After commenting out the captcha-image verification code, I found that the cookie format was abnormal. Logging in manually, I found that a JS bundle returned 404, so even a human cannot log in, let alone a machine. The restrictions are really heavy.

Did you capture the full content of a single tweet?

For a single tweet you only take the text inside the span selected by tweet_node.xpath('.//span[@Class="ctt"]')[0]. That doesn't capture the full tweet content, does it? Some of the content sits outside that span tag. How did you handle this?

The account-pool approach has completely stopped working, though it still worked a few days ago

A few days ago I crawled 8 million records in a single day, but since yesterday IPs suddenly started getting blocked. I don't know whether switching to other IPs would help. Does the author plan to make a version that works around the IP blocking? Also, are you sure 418 is the code for a blocked IP? I've noticed the so-called block seems to lift after a short while, so maybe it isn't an IP block at all?

Incomplete crawled data

Hi, I noticed that for several days the crawl only picked up data posted after 5 pm; a manual search shows there is also data before 5 pm on those days, but it just can't be crawled. Why is that, and how can it be fixed? Thanks!

https://weibo.com/1631641650/H7cTe8otL

This URL only yielded:

木兔这句话真的 我已经哭到窒息了…… ​​​​

None of the other repost information was kept. Is that correct?

Weibo limits the visible following list to 5 pages

When crawling all of a user's followings, I found that weibo.com only shows 5 pages while weibo.cn shows 20 pages, but neither shows the complete following list. Have you run into this problem?

Cannot crawl all tweet content

Hi, I made some changes to your code hoping to crawl in a loop: at the end of the method that crawls the fans list, I also crawl each fan's profile, just two lines of code, as shown in the red box in the screenshot. But I ran into a problem: although the spider keeps running and keeps collecting data, the collected data is incomplete. For example, People's Daily has 90k tweets, but I only got 250 before it stopped. I'm not sure whether I put the looped crawl in the wrong place or whether the approach itself is wrong, so I'd like to ask where the mistake is. If the approach is wrong, please advise how a looped crawl should be written. Thank you very much!

AttributeError: 'NoneType' object has no attribute 'xpath'

2019-03-05 18:28:58 [scrapy.core.scraper] ERROR: Spider error processing <GET https://weibo.cn/6132874125/profile?page=1> (referer: https://weibo.cn/u/6132874125)
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in
return (_set_referer(r) for r in result or ())
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "/root/WeiboSpider/WeiboSpider/sina/spiders/weibo_spider.py", line 115, in parse_tweet
tweet_nodes = tree_node.xpath('//div[@Class="c" and @id]')
AttributeError: 'NoneType' object has no attribute 'xpath'

How to get the info of users @-mentioned in a tweet

Hello!
I want to crawl the info of the users @-mentioned in a tweet, for example the mention @叶婉婷cici: I want their basic profile. Inspecting the element in Chrome gives <a href="/n/%E5%8F%B6%E5%A9%89%E5%A9%B7cici">@叶婉婷cici</a>.
On top of weibo_spider.py on your search branch I added yield Request(url=self.base_url+href, callback=self.parse_atwho), but when running the spider it keeps reporting:

[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://weibo.cn/n/%E5%90%8D%E4%BA%BA%E5%9D%8A%E9%97%B4%E5%85%AB%E5%8D%A6> (failed 3 times): TCP connection timed out: 10060: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.

How can this be solved?
