Comments (12)
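It looks like Tencent changed the Base64 decoding algorithm used on the page.
Take any comic page, for example: http://ac.qq.com/ComicView/index/id/634670/cid/2
Its source always contains something like: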
<script>
var DATA = 'eyJjb21pYyI6ecyJpZCI6NjM0NjcwLCJ0aXRsZbcSI6Ilx1NGYwYVx1NzUzOFx1NjYxZlx1NTM5ZiBFREVOUyBaRVJPIiwcffiY29sbGVjdCI6IjMwMTUiLCJpc0phcGFuQ29taWMfiOmZhbHNlLCJpc0xpZ2h0Tm92ZWwiOmZhbHNlLCJpc0xpZ2h0Q29taWMiOmZhbHNlLCJpc0ZpbmlzaCI6ZmFsc2UsImlzUm9hc3RhYmxlIjp0cnVlLCJlSWQiOiJLbEJQVEVKQlZGUlZDUXNmQWdZQ0FROEpIRUl5In0sImNoYXB0ZXIiOnsiY2lkIjoyLCJjVGl0bGUiOiJcdTY1NmNcdThiZjdcdTY3MWZcdTVmODUiLCJjU2VxIjoiMSIsInZpcFN0YXR1cyI6MSwicHJldkNpZCI6MCwibmV4dENpZCI6NCwiYmxhbmtGaXJzdCI6MSwiY2FuUmVhZCI6dHJ1ZX0sInBpY3R1cmUiOlt7InBpZCI6IjcwMjgiLCJ3aWR0aCI6MTEwMCwiaGVpZ2h0Ijo1NDgsInVybCI6Imh0dHBzOlwvXC9tYW5odWEucXBpYy5jblwvbWFuaHVhX2RldGFpbFwvMFwvMjZfMTBfNTZfZTZjMDhhMzMxNGE4MTY4MThmOGI0NTM4OTY0ODAwZjVfNzAyOC5qcGdcLzAifV0sImFkcyI6eyJ0b3AiOiIiLCJsZWZ0IjpbXSwiYm90dG9tIjp7InRpdGxlIjoiXHU5MDFhXHU3MDc1XHU1OTgzXHU2NzA5XHU1OGYwXHU2ZjJiXHU3NTNiIiwicGljIjoiaHR0cHM6XC9cL21hbmh1YS5xcGljLmNuXC9vcGVyYXRpb25cLzBcLzA3XzEyXzM2X2UyNjY2ZGQ4NTFiMzY1M2NlMDAxMjRkMDk2ZjdlYjEyXzE1NDE1NjU0MDYxODguanBnXC8wIiwidXJsIjoiaHR0cHM6XC9cL3YucXEuY29tXC94XC9wYWdlXC94MDc4NjJrd2VsaS5odG1sIiwid2lkdGgiOiI2NTAiLCJoZWlnaHQiOiIxMTAifX0sImFydGlzdCI6eyJhdmF0YXIiOiJodHRwOlwvXC90aGlyZHFxLnFsb2dvLmNuXC9nP2I9c2RrJms9NjlpY1gwNzFZT0xRZ0R2RVJ1MmhMVHcmcz02NDAmdD0xNDgzMzY2MTE5IiwibmljayI6Ilx1OGJiMlx1OGMwOFx1NzkzZVx1NTMxN1x1NGVhYyIsInVpbkNyeXB0IjoiYUc5elZ6SXplV2RFY0hnMldVUXlVbkUyWm14WWR6MDkifX0=',
PRELOAD_NUM = 2,
NOTICE_TIME = 15,
ROAST_SIZE = 300,
ROAST_PRE = 5,
ROAST_VIEW = 11,
DANMU_TIME = 10000;
</script>
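The DATA variable actually holds the JSON that contains the chapter's images, but it can't be decoded directly. The old algorithm was to drop the first character; the rest of the string was standard Base64. That seems to have changed: this Base64 no longer decodes, so we need to set breakpoints in the page and find the actual decoding algorithm.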
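For reference, the older scheme described above could be sketched in Python roughly like this. It is only an illustration: decode_legacy_data is a hypothetical name, and the sample string in the note below is made up, not taken from a real page.

```python
import base64
import json


def decode_legacy_data(data: str) -> dict:
    """Old scheme as described in this thread: drop the first
    character, then treat the rest as standard Base64-encoded JSON."""
    base64_str = data[1:]
    json_str = base64.b64decode(base64_str).decode("utf-8")
    return json.loads(json_str)
```

For example, decode_legacy_data("x" + "eyJhIjogMX0=") returns {"a": 1}. It fails on the current DATA value, which is exactly the symptom reported here.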
from getcomic.
Is there a way to find the part of the page's JS functions that decrypts the Base64?
from getcomic.
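What I did originally was set breakpoints on the JS in Chrome's developer tools, watch how the variables change, and find which function parses DATA. Reading just that piece of code then tells you how it decodes the DATA variable.
From a first look, the parsing should happen in this JS file; we still need to find exactly which piece of code decodes DATA: http://ac.gtimg.com/media/js/ac.page.chapter.view_v2.4.0.js?v=20170622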
from getcomic.
So the breakpoint approach is quite limited? There's no way to find directly which function does the decryption?
from getcomic.
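It isn't that hard, it just takes patience. Pretty-print the JS in the developer tools and set breakpoints function by function; you will usually find it quickly.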
from getcomic.
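As I expected, it is in that JS file. The _v variable holds the JSON parsed from DATA, so tracing upward from there should quickly lead to the actual decoding algorithm.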
from getcomic.
Got it. With breakpoints plus pretty-printing I have found the algorithm. The JS is roughly as follows:
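var B = new Base(),
    T = W['DATA'].split(''),   // obfuscated Base64 text, as a character array
    N = W['nonce'],            // the page's window.nonce
    len,
    locate,
    str;
// Split the nonce into tokens of digits followed by letters.
N = N.match(/\d+[a-zA-Z]+/g);
len = N.length;
// Walk the tokens from last to first.
while (len--) {
    locate = parseInt(N[len]) & 255;     // numeric prefix, masked to 0..255, is the offset
    str = N[len].replace(/\d+/g, '');    // the letters left after removing the digits
    T.splice(locate, str.length);        // drop that many characters at the offset
}
T = T.join('');
_v = JSON.parse(B.decode(T));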
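The key is the T and N variables. T comes from the page's DATA value, and N comes from the page's window.nonce. Running them through the loop in the snippet above restores valid Base64, so translating this JS into Python is all it takes.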
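A rough Python translation of that loop might look like the sketch below. restore_base64 is a hypothetical name, and it assumes nonce has already been obtained as a plain string; how to extract it from the page is not covered here.

```python
import re


def restore_base64(data: str, nonce: str) -> str:
    """Remove the filler characters that the nonce points at."""
    chars = list(data)
    # Tokens are runs of digits followed by letters, e.g. "12ab".
    tokens = re.findall(r"\d+[a-zA-Z]+", nonce)
    # The JS uses `while (len--)`, so tokens are handled last to first.
    for token in reversed(tokens):
        # parseInt(token) & 255: numeric prefix masked to one byte.
        locate = int(re.match(r"\d+", token).group()) & 255
        # Number of characters left once the digits are stripped.
        count = len(re.sub(r"\d+", "", token))
        # Same effect as T.splice(locate, count); out-of-range is a no-op.
        del chars[locate:locate + count]
    return "".join(chars)
```

With a made-up input, restore_base64("ABCxyDEqFGHIJ", "!!3xy--7q") gives "ABCDEFGHIJ": the token "7q" removes one character at offset 7 first, then "3xy" removes two characters at offset 3.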
And this B.decode function is rather sneaky: on the surface it just decodes Base64, but its entry point does a regex replacement first:
function Base() {
    _keyStr = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";
    this.decode = function(c) {
        var a = "",
            b, d, h, f, g, e = 0;
        // The for-initializer strips every character outside the Base64 alphabet.
        for (c = c.replace(/[^A-Za-z0-9\+\/\=]/g, ""); e < c.length;) b = _keyStr.indexOf(c.charAt(e++)),
            d = _keyStr.indexOf(c.charAt(e++)),
            f = _keyStr.indexOf(c.charAt(e++)),
            g = _keyStr.indexOf(c.charAt(e++)),
            b = b << 2 | d >> 4,
            d = (d & 15) << 4 | f >> 2,
            h = (f & 3) << 6 | g,
            a += String.fromCharCode(b),
            64 != f && (a += String.fromCharCode(d)),
            64 != g && (a += String.fromCharCode(h));
        return a = _utf8_decode(a)
    };
    _utf8_decode = function(c) {
        for (var a = "",
                b = 0,
                d = c1 = c2 = 0; b < c.length;) d = c.charCodeAt(b),
            128 > d ? (a += String.fromCharCode(d), b++) : 191 < d && 224 > d ? (c2 = c.charCodeAt(b + 1), a += String.fromCharCode((d & 31) << 6 | c2 & 63), b += 2) : (c2 = c.charCodeAt(b + 1), c3 = c.charCodeAt(b + 2), a += String.fromCharCode((d & 15) << 12 | (c2 & 63) << 6 | c3 & 63), b += 3);
        return a
    }
}
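Apart from the initial c = c.replace(/[^A-Za-z0-9\+\/\=]/g, ""), everything else is the ordinary Base64 decoding algorithm, so a standard Base64 decoder will do. However, after T has been processed it must go through that regex replacement once more to strip the illegal characters; only then does it decode correctly.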
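In Python, the strip-then-decode step could be sketched as below. decode_payload is a hypothetical name; the input is assumed to be the string left after the nonce-based cleanup.

```python
import base64
import json
import re


def decode_payload(b64_text: str) -> dict:
    """Mirror B.decode: drop non-Base64 characters, then decode."""
    # Same character class as the JS regex /[^A-Za-z0-9\+\/\=]/g.
    cleaned = re.sub(r"[^A-Za-z0-9+/=]", "", b64_text)
    json_str = base64.b64decode(cleaned).decode("utf-8")
    return json.loads(json_str)
```

For example, decode_payload("eyJh*Ijog#MX0=") returns {"a": 1}. As far as I know, Python's base64.b64decode already discards non-alphabet characters unless validate=True is passed, so the explicit strip mainly mirrors the JS and keeps the intent visible.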
I'll fix this tomorrow.
from getcomic.
Fixed. Thanks for the report.
from getcomic.
It's broken again... http://ac.qq.com/Comic/comicInfo/id/634670
from getcomic.
It looks like the http://ac.qq.com/Comic/comicInfo/id/{} endpoint has been deprecated. When I have time I'll look at how the page gets the chapter list now.
from getcomic.
getComic.py -u http://ac.qq.com/Comic/ComicInfo/id/634393
正在下载第0001话: 预告
下载失败,重试1次
下载失败,重试2次
下载失败,重试3次
下载失败,重试4次
Traceback (most recent call last):
  File "./getComic.py", line 338, in <module>
    main(url, path, lst, one_folder)
  File "./getComic.py", line 294, in main
    imgList = getImgList(contentList[i - 1]['url'])
  File "./getComic.py", line 105, in getImgList
    img_detail_json = __decode_data(data, nonce)
  File "./getComic.py", line 166, in __decode_data
    json_str = base64.b64decode(base64_str).decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8c in position 7: invalid start byte
from getcomic.
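Known issue. I haven't had time lately to study the page changes closely. It feels like Tencent is watching this project: as soon as I change something here, they change things on their side. PRs to fix this are welcome.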
from getcomic.