crawlbaiduwenku's People

Contributors

vict-cn

crawlbaiduwenku's Issues

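Screenshot and stitching problem when converting a web page to PDF

When taking screenshots of some Baidu Wenku documents and stitching them into a PDF, for example 《高等学院研究生英语上reading more中英对照翻译》, the image numbers do not match: the PDF step asks for a page image that was never captured.
Environment

OS: Windows 10 Pro
Python version: 3.7.4
Installed package versions: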
Package    Version
---------- ----------
baidu-aip  2.2.18.0
certifi    2019.11.28
chardet    3.0.4
idna       2.8
Pillow     6.2.1
pip        19.3.1
reportlab  3.5.32
requests   2.22.0
selenium   2.40.0
setuptools 42.0.2
urllib3    1.25.7
wheel      0.33.6

The log is as follows (console output is reproduced verbatim):

(venv) C:\Users\Martin\Desktop\crawlBaiduWenku>python Screenshot_to_pdf.py
请输入你要获取的网页:https://wenku.baidu.com/view/b5564dd0a76e58fafbb00379.html
此过程较慢请稍后
开始截图

第1页截图开始
第1页截图成功
第2页截图开始
第2页截图成功
第3页截图开始
第3页截图成功
第4页截图开始
第4页截图成功
第5页截图开始
第5页截图成功
第6页截图开始
第6页截图成功
第7页截图开始
第7页截图成功
第8页截图开始
第8页截图成功
第9页截图开始
第9页截图成功
第10页截图开始
第10页截图成功
第11页截图开始
第11页截图成功
第12页截图开始
第12页截图成功
第13页截图开始
第13页截图成功
第14页截图开始
第14页截图成功
第15页截图开始
第15页截图成功
第16页截图开始
第16页截图成功
第17页截图开始
第17页截图成功
第18页截图开始
第18页截图成功
第19页截图开始
第19页截图成功
第20页截图开始
第20页截图成功
第21页截图开始
第21页截图成功
开始写入pdf
Traceback (most recent call last):
  File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\lib\utils.py", line 653, in open_for_read
    return open_for_read_by_name(name,mode)
  File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\lib\utils.py", line 597, in open_for_read_by_name
    return open(name,mode)
FileNotFoundError: [Errno 2] No such file or directory: '高等学院研究生英语上reading more中英对照翻译/第22页图片.png'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\lib\utils.py", line 656, in open_for_read
    return getBytesIO(datareader(name) if name[:5].lower()=='data:' else urlopen(name).read())
  File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 510, in open
    req = Request(fullurl, data)
  File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 328, in __init__
    self.full_url = url
  File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 354, in full_url
    self._parse()
  File "c:\users\martin\appdata\local\programs\python\python37\Lib\urllib\request.py", line 383, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '高等学院研究生英语上reading more中英对照翻译/第22页图片.png'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Screenshot_to_pdf.py", line 124, in <module>
    main()
  File "Screenshot_to_pdf.py", line 122, in main
    parse_pdf(url,wenku_title)
  File "Screenshot_to_pdf.py", line 116, in parse_pdf
    screenshot(browser,wenku_title)
  File "Screenshot_to_pdf.py", line 103, in screenshot
    pic_to_pdf(page_count,wenku_title)
  File "Screenshot_to_pdf.py", line 53, in pic_to_pdf
    c.drawImage('{}/第{}页图片.png'.format(wenku_title,i),0,0,w,h)
  File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\pdfgen\canvas.py", line 953, in drawImage
    imgObj = pdfdoc.PDFImageXObject(name, image, mask=mask)
  File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\pdfbase\pdfdoc.py", line 2089, in __init__
    src = open_for_read(source)
  File "C:\Users\Martin\Desktop\crawlBaiduWenku\venv\lib\site-packages\reportlab\lib\utils.py", line 658, in open_for_read
    raise IOError('Cannot open resource "%s"' % name)
OSError: Cannot open resource "高等学院研究生英语上reading more中英对照翻译/第22页图片.png"
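
The log shows 21 screenshots saved, while `pic_to_pdf` tries to open `第22页图片.png`, so the page count passed to it appears to be one higher than the number of files on disk. A minimal sketch of one possible workaround follows: build the list of pages from the files that actually exist instead of trusting `page_count`. The helper names `existing_page_images` and `missing_pages` are hypothetical and not part of `Screenshot_to_pdf.py`; only the file name pattern `第{}页图片.png` is taken from the traceback.

```python
import os
import re

# Hypothetical helpers, not part of Screenshot_to_pdf.py.
# The file name pattern is the one shown in the traceback: 第{n}页图片.png
PAGE_PATTERN = re.compile(r"^第(\d+)页图片\.png$")


def existing_page_images(folder):
    """Return (page_number, path) pairs for screenshots found in folder,
    sorted numerically by page number."""
    pages = []
    for name in os.listdir(folder):
        match = PAGE_PATTERN.match(name)
        if match:
            pages.append((int(match.group(1)), os.path.join(folder, name)))
    return sorted(pages)


def missing_pages(folder, page_count):
    """Return the page numbers in 1..page_count that have no screenshot."""
    found = {number for number, _ in existing_page_images(folder)}
    return [n for n in range(1, page_count + 1) if n not in found]
```

Under these assumptions, `pic_to_pdf` could loop over `existing_page_images(wenku_title)` and pass each path to `c.drawImage(path, 0, 0, w, h)`, and `missing_pages(wenku_title, page_count)` could be printed first to show which pages were never captured. This only avoids the crash; if page 22 is a real page, the screenshot loop itself would still need to capture it.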
