vict-cn / crawlbaiduwenku
This may be the most complete project for scraping Baidu Wenku
License: MIT License
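When screenshots of some Baidu Wenku documents, for example 《高等学院研究生英语上reading more中英对照翻译》, are stitched into a PDF, the image numbers do not match up.
Environment
OS: Windows 10 Pro
Python version: 3.7.4
Installed package versions: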
Package Version
---------- ----------
baidu-aip 2.2.18.0
certifi 2019.11.28
chardet 3.0.4
idna 2.8
Pillow 6.2.1
pip 19.3.1
reportlab 3.5.32
requests 2.22.0
selenium 2.40.0
setuptools 42.0.2
urllib3 1.25.7
wheel 0.33.6
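The log is below. The script prints "第N页截图开始" / "第N页截图成功" when the screenshot of page N starts / succeeds, and "开始写入pdf" when it begins writing the PDF. Screenshots succeed for pages 1 to 21, but the PDF step then looks for a page 22 image: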
(venv) C:\Users\Martin\Desktop\crawlBaiduWenku>python Screenshot_to_pdf.py
请输入你要获取的网页:https://wenku.baidu.com/view/b5564dd0a76e58fafbb00379.html
此过程较慢请稍后
开始截图
第1页截图开始
第1页截图成功
第2页截图开始
第2页截图成功
第3页截图开始
第3页截图成功
第4页截图开始
第4页截图成功
第5页截图开始
第5页截图成功
第6页截图开始
第6页截图成功
第7页截图开始
第7页截图成功
第8页截图开始
第8页截图成功
第9页截图开始
第9页截图成功
第10页截图开始
第10页截图成功
第11页截图开始
第11页截图成功
第12页截图开始
第12页截图成功
第13页截图开始
第13页截图成功
第14页截图开始
第14页截图成功
第15页截图开始
第15页截图成功
第16页截图开始
第16页截图成功
第17页截图开始
第17页截图成功
第18页截图开始
第18页截图成功
第19页截图开始
第19页截图成功
第20页截图开始
第20页截图成功
第21页截图开始
第21页截图成功
开始写入pdf
Traceback (most recent call last):
File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\lib\utils.py", line 653, in open_for_read
return open_for_read_by_name(name,mode)
File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\lib\utils.py", line 597, in open_for_read_by_name
return open(name,mode)
FileNotFoundError: [Errno 2] No such file or directory: '高等学院研究生英语上reading more中英对照翻译/第22页图片.png'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\lib\utils.py", line 656, in open_for_read
return getBytesIO(datareader(name) if name[:5].lower()=='data:' else urlopen(name).read())
File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 510, in open
req = Request(fullurl, data)
File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 328, in __init__
self.full_url = url
File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 354, in full_url
self._parse()
File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 383, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '高等学院研究生英语上reading more中英对照翻译/第22页图片.png'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Screenshot_to_pdf.py", line 124, in <module>
main()
File "Screenshot_to_pdf.py", line 122, in main
parse_pdf(url,wenku_title)
File "Screenshot_to_pdf.py", line 116, in parse_pdf
screenshot(browser,wenku_title)
File "Screenshot_to_pdf.py", line 103, in screenshot
pic_to_pdf(page_count,wenku_title)
File "Screenshot_to_pdf.py", line 53, in pic_to_pdf
c.drawImage('{}/第{}页图片.png'.format(wenku_title,i),0,0,w,h)
File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\pdfgen\canvas.py", line 953, in drawImage
imgObj = pdfdoc.PDFImageXObject(name, image, mask=mask)
File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\pdfbase\pdfdoc.py", line 2089, in __init__
src = open_for_read(source)
File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\lib\utils.py", line 658, in open_for_read
raise IOError('Cannot open resource "%s"' % name)
OSError: Cannot open resource "高等学院研究生英语上reading more中英对照翻译/第22页图片.png"
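One possible workaround, offered only as a sketch: rather than trusting a page count, collect the screenshot files that actually exist in the output folder and draw only those. The helper name `existing_page_images` is hypothetical and not part of the project; the file name pattern `第{}页图片.png` is taken from the traceback above. The real cause of the mismatch in `Screenshot_to_pdf.py` is not visible here, so this avoids the crash without necessarily fixing the count itself.

```python
import os
import re

# File name pattern copied from the traceback: '<title>/第{N}页图片.png'.
PAGE_IMAGE = re.compile(r"^第(\d+)页图片\.png$")


def existing_page_images(folder):
    """Return paths of the page screenshots present in `folder`,
    sorted by page number (numeric order, so page 10 follows page 9)."""
    pages = []
    for name in os.listdir(folder):
        match = PAGE_IMAGE.match(name)
        if match:
            pages.append((int(match.group(1)), os.path.join(folder, name)))
    return [path for _, path in sorted(pages)]
```

Inside `pic_to_pdf`, the loop could then iterate over `existing_page_images(wenku_title)` and pass each path to `c.drawImage(path, 0, 0, w, h)`, so a missing page 22 is never requested.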
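From what I can see, the URLs all look like this now. Whatever the document format, the address seems to have changed to this form:
https://wenku.baidu.com/view/da21e5cd1be8b8f67c1cfad6195f312b3169eb9f.html
The URL form written in this project's source code is this:
content_url = "https://wenku.baidu.com/browse/getbcsurl?doc_id=" + wenku_id + "&pn=1&rn=99999&type=ppt"
Could you advise on this?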
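For reference, a minimal sketch of how the two URL forms relate, assuming the path segment after `/view/` is the value the project passes as `wenku_id`. The function names `extract_wenku_id` and `build_content_url` are hypothetical. Whether the `getbcsurl` endpoint still returns data for documents with the newer, longer IDs is not something this thread establishes.

```python
import re


def extract_wenku_id(view_url):
    """Pull the document ID out of a .../view/<id>.html URL."""
    match = re.search(r"/view/([0-9a-zA-Z]+)", view_url)
    if match is None:
        raise ValueError("no document ID found in: " + view_url)
    return match.group(1)


def build_content_url(wenku_id):
    """Rebuild the request URL in the form quoted from the project source."""
    return ("https://wenku.baidu.com/browse/getbcsurl?doc_id="
            + wenku_id + "&pn=1&rn=99999&type=ppt")
```

With the address quoted above, `extract_wenku_id` yields `da21e5cd1be8b8f67c1cfad6195f312b3169eb9f`, which `build_content_url` then places in the `doc_id` parameter.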