lucasjinreal / weibo_terminater

Final Weibo crawler: scrape anything from Weibo (comments, weibo contents, followers, anything). The Terminator

Python 100.00%
scraper weibo sina corpus chinese chatbot

weibo_terminater's Introduction

Weibo Terminater

An NLP corpus preparation tool. Friendly reminder: this project is for academic research only; the author takes no responsibility for any consequences arising from other uses. About two years have passed, and I'm updating this project once more, purely out of a sense of responsibility and conviction. This update:

  • Added some logging helpers for clearer output; the log library comes from alfred: http://github.com/jinfagang/alfred;
  • Dropped the PhantomJS driver and switched to Firefox as the default, which means you may need to install the Firefox Selenium driver, geckodriver. Google it and install it into /usr/bin;
  • Removed some unnecessary files.

The old image links no longer seem to work, so they have simply been removed. Domestic cloud providers are ruthless about cutting your links the moment you stop paying; or maybe the market is just bad and the ones that should fold have folded.

In fact, two years on the author no longer works on NLP; after graduate school I moved into autonomous driving. But I have kept following NLP and never lost interest in it, and I've put together a few recent, interesting resources to share with you.

This project will continue to be updated and maintained. Thanks for your attention.

A long-overdue update

More than twenty days have passed since this project was launched; it feels like five hundred years, but things are finally, truly getting under way!! This project will keep being updated. To make it easier to contribute together, I have started a new project: https://github.com/jinfagang/weibo_terminator_workflow.git . If you want to help crawl the corpus, please also star the workflow project; if you just want to play with the Weibo crawler, keep following this one.

2017-04-19 major update!!! Launching the Weibo Terminator Plan (WT Plan)

The weibo_terminator crawler is essentially ready:

This update adds the following features:

  • Added a delay strategy: after every 10 pages crawled, the scraper pauses for five minutes. This still cannot fully guarantee that an account won't be banned, but we have another strategy for that!!
  • We now crawl with a dozen or so accounts at the same time; weibo_scraper automatically switches to the next account once one gets banned!!
  • No more manual cookie setup!!! To repeat: we no longer set cookies by hand. Just fill in the corresponding accounts in accounts.py and WT fetches the cookies automatically; you can refresh them later, or delete the cookie cache to force an update (see the sketch right after this list).
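The README does not show settings/accounts.py itself, so the following is only a hedged sketch: the {'id': ..., 'password': ...} shape matches the "preparing cookies for account {...}" log line quoted in the issues further down, while the variable name `accounts` is an assumption.

# settings/accounts.py -- illustrative sketch only, not the project's actual file.
# The dict shape follows the debug output quoted in the issues below;
# the variable name `accounts` is an assumption.
accounts = [
    {'id': 'weibo_account_1', 'password': 'password_1'},
    {'id': 'weibo_account_2', 'password': 'password_2'},
    # add more throwaway accounts; the dispatcher switches to the next
    # one when the current account's cookie gets banned.
]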

If you think that's all, you are too young, too simple, sometimes naive. The more important updates are:

  • IDs are no longer limited to numeric ones; the alphabetic IDs of celebrities and big-V accounts can be crawled just as well. The default ID in this update is Angelababy's weibo, whose ID is realangelababy;
  • The author has finished a script that extracts chat pairs from the dialogue structure of weibo content, with roughly 99% accuracy (considering copyright issues, we will open source it later);
  • The author has committed a categorized list of nearly 8 million user IDs so the whole site can be crawled (considering Weibo's official limitations, we can't distribute the full list; it is just a sample. Join our contributor team and every contributor will get a unique slice of the id_file);
  • The author added resume support: the crawler now remembers where it stopped last time and picks up from there on the next run, until the whole account has been crawled. So when your cookies get banned, just switch to another account and keep going;
  • All of this will be finished within half a month. The resulting corpus is restricted to contributors, and everyone is welcome to contribute to WT.

Building on the vast Weibo network, we are launching the Terminator Plan: pooling everyone's effort to crawl a Chinese corpus from Weibo. This update ships a weibo_id.list file containing nearly 8 million categorized user IDs (don't ask where they came from). Next, we will assign each contributor a range of IDs, crawl all of their weibos, and upload the results to our internal Baidu Cloud drive; only contributors and the weibo_terminator authors can access the data. One final note: this project took some similar projects as references, but the features implemented here and the complexity handled go beyond what those projects can match. Everything is built on the latest web API and Python 3. Many other projects are built on scrapy; this project uses no such crawler framework at all, simply because projects built on those libraries lack flexibility and we don't like that. We hope you understand.

As always, feel free to submit issues; we will stay open source and keep maintaining and updating the project!! (Multiple accounts are dispatched automatically.)

Contribution tips:

  • Clone this repo: git clone https://github.com/jinfagang/weibo_terminater.git;
  • Install PhantomJS so that weibo_terminator can fetch cookies automatically; download it and set your unzip path in settings/config.py, following the instructions there;
  • Set your accounts in settings/accounts.py; multiple accounts are supported now and the terminator will dispatch them automatically;
  • Run python3 main.py -i realangelababy to scrape a single user; set settings/id_file for multi-user scraping;
  • Contact the project administrator via WeChat jintianiloveu if you want to contribute; the administrator will hand you an id_file slice that is unique within our project;
  • All data will be saved into ./weibo_detail, one file per id.
  • Send the collected data to the project administrator.
  • When all the work is finished, the administrator will distribute the merged data as a single file to all contributors, to be used under the WT & TIANEYE copyright.

Research & Discussion Groups

We have founded several groups for our project:

QQ
AI智能自然语言处理: 476464663
Tensorflow智能聊天Bot: 621970965
GitHub深度学习开源交流: 263018023

WeChat
Add the administrator `jintianiloveu` to be invited into the group.

Tutorial

This part was missing from the first commit; usage help:

# -h  show help
python3 main.py -h

# -i  specify a single id or an id_file path (one id per line; see the example below)
python3 main.py -i 167385960
python3 main.py -i ./id_file

# -f  specify the filter mode: 0 keeps only original weibos, 1 also keeps reposts; default is 0
python3 main.py -i 16758795 -f 0

# -d  specify debug mode for testing; note that debug mode only supports a single id
python3 main.py -i 178600077 -d 1

That's all, simple and easy.
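For the ./id_file form above, the file is plain text with one id per line; numeric and alphabetic ids both work. A minimal example (all three ids appear elsewhere in this README):

realangelababy
167385960
1669879400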

About cookies

Cookies may still get banned if our scraper keeps pulling information from Weibo; that is exactly why this job has to be done through collective effort, since no one can build such a big corpus single-handedly. If your cookies expire or get banned, we strongly recommend switching to another Weibo account, which can be a friend's or anyone else's, and continuing to scrape. One thing to remember: weibo_terminator records its scraping progress and will resume from where it stopped last time. :)

Weibo Terminator Crawler

I will open-source a chatbot project later; the purpose of this repo is to build a high-quality dialogue corpus from Weibo. The project will keep being developed, so please star it!! Forever open source!

This project is dedicated to working around Weibo's anti-crawling mechanisms and pooling everyone's effort to crawl the countless weibo comments into an open, high-quality Chinese dialogue corpus, pushing forward research on Chinese dialogue systems. The system currently implements:

  • Crawling, for a given user id, the weibo count, the following and follower counts, all weibo contents, and all comments on every weibo;
  • Considering the feasibility of building a dialogue system and how messy weibo text is, all weibos are saved in an easily parsable format during crawling; see the sample results below;
  • The project does not depend on any third-party crawling framework; it implements its own multi-threading helper instead, spawning hundreds of threads when crawling multiple users and reaching roughly a million items per hour;
  • The ultimate goal is to use the huge Weibo platform to build an open, high-quality Chinese dialogue corpus (as far as the author knows, many companies guard their own data like treasure; shame on that);
  • Besides that, the project can be used to analyze the comments of a specific user; for example, crawl Luo Yonghao's weibo and analyze next year's Smartisan phone sales.

We hope more folks will contribute; there is still a lot of work to do, and PRs are welcome!

Born for AI

Chinese corpora have long been a sore point: no institution or organization builds public datasets, whereas English corpora abroad are abundant and already well curated.

The author believes weibo text is the broadest, most active and freshest corpus available. Whether or not a dialogue system built on it ends up accurate, the fresh vocabulary is guaranteed.

Crawling results

The weibos and comments of a given user look like this:

E
4月15日#傲娇与偏见# 超前点映,跟我一起去抢光它 [太开心]  傲娇与偏见 8.8元超前点映  顺便预告一下,本周四(13号)下
午我会微博直播送福利,不见不散哦[坏笑]   电影傲娇与偏见的秒拍视频 <200b><200b><200b>
E
F
<哈哈哈哈哈哈狗->: 还唱吗[doge]
<緑麓>: 绿麓!
<哈哈哈哈哈哈狗->: [doge][doge]
<至诚dliraba>: 哈哈哈哈哈哈哈
<五只热巴肩上扛>: 大哥已经唱完了[哆啦A梦吃惊]
<哈哈哈哈哈哈狗->: 大哥[哆啦A梦吃惊]
<独爱Dear>: 10:49坐等我迪的直播[喵喵][喵喵][喵喵]
<四只热巴肩上扛>: 对不起[可怜]我不赶
<四只热巴肩上扛>: 哈狗[哆啦A梦花心][哆啦A梦花心]
<至诚dliraba>: 哈狗来了 哈哈哈
<四只热巴肩上扛>: [摊手]绿林鹿去哪里了!!!!
<哈哈哈哈哈哈狗->: 阿健[哆啦A梦花心]
<至诚dliraba>: 然而你还要赶我出去[喵喵]
<四只热巴肩上扛>: 我也很绝望
<至诚dliraba>: 只剩翻墙而来的我了
<四只热巴肩上扛>: [摊手]我能怎么办
<四只热巴肩上扛>: [摊手]一首歌唱到一半被掐断是一个歌手的耻辱[摊手]
<至诚dliraba>: 下一首
<四只热巴肩上扛>: 最害怕就是黑屋[摊手]
<至诚dliraba>: 我脑海一直是 跨过傲娇与偏见 永恒的信念
F

Notes:

  • E ... E marks the beginning and end of a weibo's content
  • F ... F marks the beginning and end of all its comments
  • In each comment, <> wraps the id of the commenting user, and $$ wraps the id of the @-mentioned user (see the parsing sketch below)
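As an illustration of how this format can be consumed, here is a minimal parsing sketch (not part of the project) that assumes the E/F markers sit on their own lines, exactly as in the sample above:

def parse_weibo_detail(path):
    """Return a list of (weibo_text, comments) pairs; comments is a list of (user, text)."""
    pairs = []
    weibo_lines, comments = [], []
    state = None  # None, 'weibo', or 'comment'
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if line == 'E':
                if state == 'weibo':
                    state = None          # closing E: weibo content finished
                else:
                    state = 'weibo'       # opening E: start collecting content
                    weibo_lines = []
            elif line == 'F':
                if state == 'comment':
                    pairs.append(('\n'.join(weibo_lines), comments))
                    state = None          # closing F: one (weibo, comments) pair done
                else:
                    state = 'comment'     # opening F: start collecting comments
                    comments = []
            elif state == 'weibo':
                weibo_lines.append(line)
            elif state == 'comment':
                # '<user>: text' -> (user, text); keep the raw line if it does not match
                if line.startswith('<') and '>:' in line:
                    user, _, text = line.partition('>:')
                    comments.append((user.lstrip('<'), text.strip()))
                else:
                    comments.append(('', line))
    return pairs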

Future Work

The corpus crawled right now is the raw version; you can already start from here for your own purposes, for example a topic-comment bot. The author will keep developing post-processing scripts that turn the raw Weibo data into dialogue form, and will open-source them. PRs from interested folks are of course welcome; we will pick the best approach to push the project forward.

Contact

For any questions about the project, contact me on WeChat: jintianiloveu, or open an issue.

Copyright

(c) 2017 Jin Fagang & Tianmu Inc. & weibo_terminator authors. License: Apache 2.0

weibo_terminater's People

Contributors

af1ynch, chenleilei, flysky1991, lucasjinreal, xiaochao


weibo_terminater's Issues

Fix for extracting reply comments

weibo_scraper.py line 224:

single_comment_content = span_element.xpath('/text()')

should be changed to:

single_comment_content = child.xpath('span[1]/text()')

so that the reply content is actually extracted.

Language Issues.

I was searching for a Chinese dialogue dataset and found this project. To be honest, it is a very meaningful thing to do and you deserve a thumbs up.

I read README.md and found two tiny language issues:

... 本项目根本使用任何类似的爬虫库 ...

"没有" is missing.

We fund several group for our project.

which perhaps should be 'We found several groups for our project.'

I sincerely wish that this project can make a greater difference in the future.

Running into many problems

Error message:
have not get cookies yet. headers: {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'} http://weibo.cn/u/1?filter=0&page=1 'WeiBoScraper' object has no attribute 'cookie' -- getting user name 'WeiBoScraper' object has no attribute 'html' html not properly loaded, maybe cookies out of date or account being banned. change an account please
I don't know Python, but I want to crawl Sina Weibo data; could you share the data with me? Contact: [email protected]
I have written some Java crawlers (for Sina iAsk, Dianping, and so on). Please get in touch, thanks for your cooperation.

Is simulated login no longer possible?

I finally got everything configured and running, but the login seems to fail.
File "D:\weibo_terminater-master\core\dispatch_center.py", line 50, in execute
error, account id 18991158502 is not valid, pass this account, you can edit it and then update cookies.
I then tried logging in at https://passport.weibo.cn/signin/login and found that it requires manually completing a verification, dragging the mouse along an arrow to confirm the login. Confusing. Did Weibo just add this feature?

Has the dialogue-extraction script been released?

Hi, I saw the script for extracting the dialogue format mentioned in README.md and am very interested. Has the script been released, or is it usable yet? Thanks!

The author has finished a script that extracts chat pairs from the dialogue structure of weibo content, with roughly 99% accuracy (considering copyright issues, we will open source it later);

There are some errors, how can I fix them?

can not find PhantomJS driver, please download from http://phantomjs.org/download.html based on your system.
all accounts getting cookies finished. starting scrap..
[Errno 2] No such file or directory: 'settings/cookies.pkl'
error, not find cookies file.
Traceback (most recent call last):
File "main.py", line 56, in
dispatcher.execute()
File "/Users/samchen/Documents/weibo-crawer/weibo_terminater/core/dispatch_center.py", line 50, in execute
self._init_single_mode()
File "/Users/samchen/Documents/weibo-crawer/weibo_terminater/core/dispatch_center.py", line 93, in _init_single_mode
scraper = WeiBoScraper(using_account=self.all_accounts[0], uuid=self.user_id, filter_flag=self.filter_flag)
AttributeError: 'Dispatcher' object has no attribute 'all_accounts'

Could not get all the weibo?

Hello, I would like to get all the weibos from the account 6252963398, but I got none of them, and the file in weibo_detail is empty as well.

The logic in WeiBoScraper._get_weibo_detail_comment() for creating the folder and the per-user comment .txt file is not quite right

weibo_scraper.py lines 186-188:

weibo_comments_save_path = './weibo_detail/{}.txt'.format(self.user_id)
if not os.path.exists(weibo_comments_save_path):
    os.makedirs(os.path.dirname(weibo_comments_save_path))

If you crawl weibos for a different user ID, the .txt for the new ID does not exist yet, so the if check on line 187 is True, and creating the "./weibo_detail" folder again raises a "folder already exists, cannot create" error.

Wouldn't something like this be better?

weibo_comments_save_path = './weibo_detail'
weibo_comments_save_name = '{}.txt'.format(self.user_id)
if not os.path.exists(weibo_comments_save_path):
    os.makedirs(weibo_comments_save_path)
with open(os.path.join(weibo_comments_save_path, weibo_comments_save_name), 'w+') as f:
    ...
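For reference, a slightly simpler variant of the proposed fix uses os.makedirs(..., exist_ok=True) (available since Python 3.2), so the existence check cannot fail when the folder is already there; the helper name below is made up for illustration:

import os

def comment_save_path(user_id, save_dir='./weibo_detail'):
    # exist_ok avoids the "folder already exists" error reported in this issue
    os.makedirs(save_dir, exist_ok=True)
    return os.path.join(save_dir, '{}.txt'.format(user_id))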

Long weibos are not expanded

(screenshot from 2017-10-18 omitted)

My idol tends to post rather long weibos...

By the way, where can I change the setting that pauses five minutes after each page crawled?

Thanks

Weibo content can be read, but saving fails at the end

Running it in PyCharm, I can see the weibo content has been read, but saving fails at the end with this error:
[WinError 183] 当文件已存在时,无法创建该文件。: './weibo_detail'
current account being banned, return to dispatch center, resign for new account..
scrap not finish, account resource run out. update account move on scrap.
Shouldn't the content be saved as the very first step? Save every 10 pages, and the next run could skip what has already been saved and fetch only the new content.

Could not save file

Hello, here are two different cases in which I cannot save a file.
Ubuntu virtual machine (ubuntu-16.04.2-desktop-amd64), Python 3.5.2
1. With the weibo_detail folder
When I try the command: sudo python3 main -i 5979819802
I fail to get all the data and save the file if I have not deleted the weibo_detail folder, as indicated in the following figure. (screenshot omitted)
2. Without the weibo_detail folder
When I try the command: sudo python3 main -i 5979819802
I fail to get all the data and save the file if I have deleted the weibo_detail folder, as indicated in the following figure. (screenshot omitted)

I have no idea what to do with it, although I also referred to a few closed issues.
Thanks for your efforts!

Account settings in the settings/accounts.py file

Hi! I tested with one of my own throwaway Weibo accounts (the xxxxxx / xxxx parts below) and got the following error message:

apple$ python3 main.py
[debug mode] crawling weibo from id realangelababy
preparing cookies for account {'password': 'xxxxxx', 'id': 'xxxx'}
loading PhantomJS from /Users/apple/phantomjs-2.1.1-macosx/bin/phantomjs
opening weibo login page, this is first done for prepare for cookies. be patience to waite load complete.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:20<00:00, 1.99it/s]
Message: {"errorMessage":"Element is not currently interactable and may not be manipulated","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"170","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:56354","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{"value": ["1", "3", "3", "1", "7", "3", "1", "6", "2", "8", "2"], "sessionId": "f48650a0-38c7-11e7-8cf6-7ba4e9053550", "text": "xxxx", "id": ":wdc:1494781733275"}","url":"/value","urlParsed":{"anchor":"","query":"","file":"value","directory":"/","path":"/value","relative":"/value","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/value","queryKey":{},"chunks":["value"]},"urlOriginal":"/session/f48650a0-38c7-11e7-8cf6-7ba4e9053550/element/:wdc:1494781733275/value"}}
Screenshot: available via screen

error, account id xxx is not valid, pass this account, you can edit it and then update cookies.

all accounts getting cookies finished. starting scrap..
[Errno 2] No such file or directory: 'settings/cookies.pkl'
error, not find cookies file.
Traceback (most recent call last):
File "main.py", line 56, in
dispatcher.execute()
File "/Users/apple/Documents/weibo_terminater/core/dispatch_center.py", line 50, in execute
self._init_single_mode()
File "/Users/apple/Documents/weibo_terminater/core/dispatch_center.py", line 93, in _init_single_mode
scraper = WeiBoScraper(using_account=self.all_accounts[0], uuid=self.user_id, filter_flag=self.filter_flag)
AttributeError: 'Dispatcher' object has no attribute 'all_accounts'

Is this because the account id in accounts.py is set incorrectly? I have tried every variant, including the phone number, the Chinese display name, and the numeric string xxxxxx from http://weibo.com/xxxxxx/profile?topnav=1&wvr=6, and they all produce an error like the one above. What is going on?

Thanks!

seems that something is wrong with phantomjs

Message: {"errorMessage":"Element is not currently interactable and may not be manipulated","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"164","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:42484","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{\"value\": [\"5\", \"5\", \"8\", \"5\", \"2\", \"8\", \"5\", \"6\", \"5\", \"2\"], \"sessionId\": \"69079350-50e0-11e7-883b-9988ae14f7a9\", \"id\": \":wdc:1497431064179\", \"text\": \"5585285652\"}","url":"/value","urlParsed":{"anchor":"","query":"","file":"value","directory":"/","path":"/value","relative":"/value","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/value","queryKey":{},"chunks":["value"]},"urlOriginal":"/session/69079350-50e0-11e7-883b-9988ae14f7a9/element/:wdc:1497431064179/value"}}
Screenshot: available via screen

After googling, I found this reported in the PhantomJS issue tracker. What might be the problem?

The cookie-checking logic seems slightly off

Hi, I have read through most of the project's source code. When running the tests I noticed that get_cookie_from_network seems to have a problem: the cookie is only updated when cookies_dict[account_id] is not None. Imagine the first simulated login fails and the returned cookie, None, gets written to the cookies file; from then on, logging in again with that account can never write a new cookie, because the stored value stays None. Is my understanding correct?
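For illustration only (the real get_cookie_from_network lives in utils/cookies.py and is not reproduced here), a cache-update rule that avoids this trap would never cache a None; the helper below is a hypothetical sketch, not the project's code:

import pickle

COOKIE_FILE = 'settings/cookies.pkl'   # path taken from the error messages in other issues

def update_cookie_cache(account_id, new_cookie, cookie_file=COOKIE_FILE):
    """Hypothetical helper: persist a freshly fetched cookie, never caching None."""
    try:
        with open(cookie_file, 'rb') as f:
            cookies_dict = pickle.load(f)
    except (FileNotFoundError, EOFError):
        cookies_dict = {}
    if new_cookie:
        # only real cookies are written, so a failed login today
        # does not block a successful login tomorrow
        cookies_dict[account_id] = new_cookie
        with open(cookie_file, 'wb') as f:
            pickle.dump(cookies_dict, f)
    return cookies_dict.get(account_id)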

'Dispatcher' object has no attribute 'all_accounts'

Running it fails with "'Dispatcher' object has no attribute 'all_accounts'".
Looking at the output, it also reports "[Errno 2] No such file or directory: 'settings/cookies.pkl'". Could the missing cookies.pkl file be what causes the "'Dispatcher' object has no attribute 'all_accounts'" error?

Suggestion: add IP proxy support

Since Weibo account login behaves differently in different environments (#47), I suggest adding IP proxy support.
The following code is for reference only:

import re
import requests
import pymysql
import time
import random

class SpiderProxy(object):
    def __init__(self):
        self.req = requests.Session()
        self.headers = {
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Referer': 'http://www.ip181.com/',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Ubuntu Chromium/60.0.3112.113 Chrome/60.0.3112.113 Safari/537.36',
        }
        self.proxyHeaders = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Ubuntu Chromium/60.0.3112.113 Chrome/60.0.3112.113 Safari/537.36',
        }
        self.con = pymysql.Connect(
            host='127.0.0.1',
            user='root',
            password="password",
            database='xici',
            port=3306,
            charset='utf8',
        )
        self.cur = self.con.cursor()

    def getPage(self, url):
        content = self.req.get(url, headers=self.headers).text
        return content

    def Page(self, text):
        time.sleep(2)
        # pattern = re.compile(u'<tr class=".*?">.*?'
        #                      + u'<td class="country"><img.*?/></td>.*?'
        #                      + u'<td>(\d+\.\d+\.\d+\.\d+)</td>.*?'
        #                      + u'<td>(\d+)</td>.*?'
        #                      + u'<td>.*?'
        #                      + u'<a href=".*?">(.*?)</a>.*?'
        #                      + u'</td>.*?'
        #                      + u'<td>([A-Z]+)</td>.*?'
        #                      + '</tr>'
        #                      , re.S)
        pattern = re.compile(u'<td>(\d+\.\d+\.\d+\.\d+)</td>.*?'
                             + u'<td>(\d+)</td>.*?'
                             + u'<td>.*?</td>.*?'
                             + u'<td>([A-Z]+)</td>.*?'
                             + u'<td>.*?</td>.*?'
                             + u'<td>.*?</td>.*?'
                             , re.S)
        l = re.findall(pattern, text)
        return l
    def getUrl(self):
        url = 'http://www.ip181.com/'
        return url

    def insert(self, l):
        print("插入{}条".format(len(l)))
        self.cur.executemany("insert into xc values(%s,%s,%s)", l)
        self.con.commit()

    def select(self):
        a = self.cur.execute("select ip,port,protocol from xc")
        info = self.cur.fetchall()
        return info

    def getAccessIP(self):
        content = self.getPage(self.getUrl())
        proxys = self.Page(content)
        p = {}
        for i in proxys:
            try:
                proxy = {"{}".format(i[2]).lower(): "{}://{}:{}".format(i[2], i[0], i[1]).lower()}
                r = self.req.get("http://ip.taobao.com/service/getIpInfo.php?ip=myip",
                                 proxies=proxy,
                                 timeout=5)

                print("原始ip:", "xxx.xxx.xxx.xxx  获取到的代理ip:", r.json()['ip'])
                # keep the first proxy that answers and return it
                p.update(proxy)
                if len(p) == 1:
                    return p
            except Exception as e:
                # TODO: remove unusable ips from the pool
                print("{} is invalid".format(i))
                print(e)

    def getNewipToMysql(self):
        content = self.getPage(self.getUrl())
        proxys = self.Page(content)
        self.insert(proxys)  # store the freshly scraped proxies into MySQL

if __name__ == '__main__':
    p = SpiderProxy()
    p.getAccessIP()

After obtaining a usable proxy you can set it directly on the get request; I have tested this and it works. Since I modified a lot of the source code, I'm not submitting a pull request for now.
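For reference, once SpiderProxy().getAccessIP() returns a working proxy, passing it to a single request could look like the sketch below; the proxy address is a placeholder, and the target URL is the one that appears in logs elsewhere in these issues:

import requests

# placeholder proxy; substitute the dict returned by SpiderProxy().getAccessIP()
proxies = {'http': 'http://1.2.3.4:8080'}
resp = requests.get('http://weibo.cn/u/1669879400?filter=0&page=1',
                    proxies=proxies, timeout=10)
print(resp.status_code)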

Baidu Netdisk? Are you crazy?!

"Crawl all the weibos, then upload the results to our internal Baidu Cloud drive."

Just one point: Baidu Netdisk is known to ban accounts and leave you with no way to appeal.

hey, I have a little question

I have succeeded in running the terminater, and now I wonder how to save the data? :)
Admire your nice work, jinfagang

Cannot crawl any data

After adding an id and running python main.py I get the following output:
[debug mode] crawling weibo from id 1669879400 setting cookies.. headers: {'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate'} http://weibo.cn/u/1669879400?filter=0&page=1 success load html.. -- getting user name list index out of range html not properly loaded, maybe cookies out of date.

WinError 5 Access is denied

(screenshot omitted)

Hello, I ran this project on my Windows 10 system and it did not work; the problem (WinError 5, access is denied) is shown in the figure above.
I also tried on a Windows 7 system and ran into the same problem. The Python versions were 3.5.3 and 3.6.1; neither worked. What is more, I ran the Command Prompt as an administrator, not as a common user.
Thanks

Run attempt fails with No module named 'lxml'; log below

skyArraondeMacBook-Pro:weibo_terminater skyArraon$ python3 main.py -i 2956950657
Traceback (most recent call last):
  File "main.py", line 21, in <module>
    from core.dispatch_center import Dispatcher
  File "/Users/skyArraon/Downloads/tech/python/weibo_terminater/core/dispatch_center.py", line 19, in <module>
    from scraper.weibo_scraper import WeiBoScraper
  File "/Users/skyArraon/Downloads/tech/python/weibo_terminater/scraper/weibo_scraper.py", line 34, in <module>
    from lxml import etree
ModuleNotFoundError: No module named 'lxml'

Suggested improvements

1. Set up a shared address that all spiders can query to check whether a piece of content has already been crawled, e.g. each spider requests xx.xx.xxx.xxx:80 to obtain the list of ids that have already been fetched.
2. Keep the crawl status in redis, i.e. save all crawl progress into redis; managing the weibo_id list is the key point of the project (see the sketch after this list).
3. "oray" (huashengge, Peanut Shell) can make exposing such an address easy.

PermissionError: [Errno 13] Permission denied on macOS

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/peco/PycharmProjects/weibo_terminater/tests.py", line 51, in
test()
File "/Users/peco/PycharmProjects/weibo_terminater/tests.py", line 27, in test
cookies = get_cookie_from_network(accounts[0]['id'], accounts[0]['password'])
File "/Users/peco/PycharmProjects/weibo_terminater/utils/cookies.py", line 49, in get_cookie_from_network
driver = webdriver.PhantomJS(phantom_js_driver_file)
File "/Users/peco/anaconda/envs/tensorflow_3.4/lib/python3.4/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 52, in init
self.service.start()
File "/Users/peco/anaconda/envs/tensorflow_3.4/lib/python3.4/site-packages/selenium/webdriver/common/service.py", line 86, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'phantomjs' executable may have wrong permissions.

Getting "error, not find cookies file." at runtime

I already added a cookies.pkl file under settings, but after adding it the run no longer fetches cookies at all and starts crawling realangelababy directly. Before that I could at least see the progress bar, but cookies were never fetched successfully.

Error when generating cookies

Error message:
error, account id [email protected] is not valid, pass this account, you can edit it and then update cookies.

PhantomJS is installed, too.
I get this same error on macOS, Windows, and a Linux cloud server, and the three machines are not even on the same network segment.
Any suggestions?
