weixin_sogou's Introduction

weixin_sogou

Crawls articles from WeChat official accounts.

Service: WeiRSS

UPDATE: The service is currently unstable because the Sogou WeChat interface has changed...

Dependencies

  1. Python 3.4+
  2. BeautifulSoup
  3. requests
  4. selenium
  5. phantomjs

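Usage

Search for the official account's name on the Sogou WeChat search platform and take the account's openid from the URL.

get_account_info() returns the account information; it accepts openid, url, and cookies.

parse_list() returns the list of articles; it accepts openid and link.

parse_essay() returns the content of an article; pass it the article link.

update_cookies() refreshes the cookies; use it when the anti-crawler mechanism is triggered.

Example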

open_id = 'oIWsFt3nvJ2jaaxm9UOB_LUos02k'
cookies = update_cookies()
print(get_account_info(open_id,cookies=cookies))
#{'description': '一个基于内容分享的社区——「交流故事·沟通想法」', 'logo'...
print(parse_list(open_id))
#[{'link': 'http://mp.weixin.qq.com/s?__biz=MjM5NjM4OTAyMA==&mid=206650

weixin_sogou's People

Contributors

iberryful, taoalpha

weixin_sogou's Issues

Cannot get direct essay link from parse_list

After calling parse_list, I only get URLs like this: http://weixin.sogou.com/websearch/art.jsp?sg=CBf80b2xkgZWehj5vWa6p7H14b.... However, the direct link to most WeChat articles looks like this: http://mp.weixin.qq.com/s?__biz=MjM5NjM4OTAyMA=.... Requesting the first link returns a 302 redirect. Response headers:

HTTP/1.1 302 Found
Server: nginx
Date: Sun, 08 Nov 2015 07:52:29 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
x_log_ext: suv=008A25C47B7F4D26563EFED6A708B610&openid=oIWsFt3nvJ2jaaxm9UOB_LUos02k&query=%E7%AE%80%E4%B9%A6&rank=1&from=gzhjs&page=1&snuid=76022F344F557452C69017DF5064A727&dec_art=succ
Location: http://mp.weixin.qq.com/s?__biz=MjM5NjM4OTAyMA==&mid=400225618&idx=1&sn=4560366d10e320d2feddf1ce0e00bf0e&3rd=MzA3MDU4NTYzMw==&scene=6#rd
Set-Cookie: black_passportid=1; domain=.sogou.com; path=/; expires=Thu, 01-Dec-1994 16:00:00 GMT
Expires: Sun, 08 Nov 2015 07:52:29 GMT
Cache-Control: max-age=0

The problem is that I cannot get the redirect URL using the requests package:

r = requests.get(link, headers=headers, cookies=cookies)
print(r.headers)
print(r.url)
for resp in r.history:
    print(resp.status_code, resp.url)

I tried this code to read the Location header from the response, but I always get a 200 status code instead of 302, along with the error 当前请求已过期,请点击重新加载 ("The current request has expired, please click to reload"). Did I miss something?
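For reference, requests follows redirects by default, so a 302 would normally appear in r.history rather than as the final status. The sketch below shows one way to read the Location header directly by disabling redirect following. The helper names redirect_target and fetch_redirect_target are hypothetical and not part of this project. This only illustrates the requests behaviour; it is not expected to help if Sogou answers with the expired-request page instead of a redirect.

```python
def redirect_target(status_code, headers):
    """Return the Location header of a 3xx response, or None otherwise."""
    if 300 <= status_code < 400:
        return headers.get("Location")
    return None


def fetch_redirect_target(link, headers=None, cookies=None):
    """Request `link` without following redirects and return its target, if any.

    Hypothetical helper: `link`, `headers` and `cookies` mirror the variables
    used in the snippet above.
    """
    # Imported here so redirect_target() can be used without requests installed.
    import requests

    r = requests.get(
        link,
        headers=headers,
        cookies=cookies,
        allow_redirects=False,  # keep the 302 instead of following it
        timeout=10,
    )
    return redirect_target(r.status_code, r.headers)
```

If fetch_redirect_target(link, headers, cookies) returns None, the server did not answer with a redirect for that request, which matches the 200 response described here.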

Is there a problem with the JS parsing?

Exception AttributeError: "'Service' object has no attribute 'process'" in <bound method Service.__del__ of <selenium.webdriver.phantomjs.service.Service object at 0x104bd4ad0>> ignored
Is this Sogou's anti-crawler mechanism?

Feeds not updating since 07/15/2015

Issue as stated in the title. Is the daemon on the public server not functioning today? If I may, please take a look at the problem while you have time. Thanks so much for all you have done so far!

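A few problems I found

I am using pyspider, and I ran into two problems when crawling official accounts through Sogou:
1. The data on Sogou is not always accurate; sometimes older articles are listed first.
2. After crawling once an hour for a few days, my IP was blocked by Sogou.
How do you limit the request frequency so that Sogou does not block you?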
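The project does not document how it limits its request rate. As a general illustration only, one common approach is to wait a randomized interval between requests so that they are not sent at a fixed rhythm. The names jittered_delay and crawl_politely, and the 60-second default, are assumptions made for this sketch; it does not guarantee that Sogou will not block an IP.

```python
import random
import time


def jittered_delay(base_seconds, jitter_fraction=0.5, rng=random):
    """Return base_seconds plus or minus up to jitter_fraction of it."""
    spread = base_seconds * jitter_fraction
    return base_seconds + rng.uniform(-spread, spread)


def crawl_politely(links, fetch, base_seconds=60.0, sleep=time.sleep):
    """Call fetch(link) for each link, pausing a jittered delay between calls.

    `fetch` is any callable taking a link (for example parse_essay);
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = []
    for i, link in enumerate(links):
        if i > 0:
            # No pause before the first request, a randomized one afterwards.
            sleep(jittered_delay(base_seconds))
        results.append(fetch(link))
    return results
```

For example, crawl_politely(links, parse_essay, base_seconds=120.0) would fetch each article with a pause of roughly one to three minutes between requests.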

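Cannot add accounts that are not indexed

For example, 「厚大司考」/houdask does not even show up as "not indexed",
while 「清法LAWYERS」 is shown as not indexed.
Clicking add always shows 「添加失败, 请点此重试」 ("Add failed, click here to retry"), and retrying does not help.
Firefox 39 x64, Windows 8.1 x64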
