jean-humann / docs-to-pdf

93.0 0.0 17.0 21.86 MB

Generate PDF for document website 🧑‍🔧

Home Page: https://www.npmjs.com/package/docs-to-pdf

License: MIT License

JavaScript 0.94% TypeScript 10.89% Shell 0.07% Dockerfile 0.22% MDX 0.58% HTML 87.04% CSS 0.25%

documentation docusaurus docusaurus-documentation pdf-generation pdf pdf-converter

docs-to-pdf's Introduction

Hey, I'm Jean Human! 👋

👨‍💻 About Me

I'm the Technical Director at Cleyrop, where we're on a mission to create an end-to-end data platform that prioritizes security and sovereignty. With a background in Machine Learning and a passion for technology, I'm excited about exploring new frontiers in data, containers, and transformers.

🌟 Professional Goals

🚀 Lead a talented team to build innovative data solutions that make an impact.
🌐 Create a secure and sovereign data platform that empowers businesses to harness the full potential of their data.
🧠 Foster a culture of continuous learning and exploration, keeping up with the latest tech trends.

🌱 Personal Goals

📚 Dive deeper into cutting-edge Natural Language Processing (NLP) techniques and models.
🏃‍♂️ Balance work with my passion for outdoor activities like biking and running.
🍎 Explore the intersection of technology and the Apple ecosystem, finding creative ways to integrate the two.

🤝 Let's Connect

💬 I'm always open to engaging discussions about data, tech, and everything in between.
📫 You can reach me at [email protected].
🐦 Connect with me on Twitter.
💼 Let's connect on LinkedIn

docs-to-pdf's People

Contributors

Stargazers

Forkers

codingluke jan-dix vintagentleman giraffesyo ds4497 anatolykopyl clayshoaf cbeeler ilinksolutionsbr mrtomyshellby pangeoradar pfdgithub westorres9 tobi1220 ngrayluna

docs-to-pdf's Issues

Error on generating - timeout

I am trying to generate PDF from

npx docs-to-pdf --initialDocURLs="https://ignatandrei.github.io/RSCG_Examples/v2/docs/List-of-RSCG" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page"  --coverTitle="RSCG --protocolTimeout=54000"

Everything goes well until the final step:
[30.08.2023 23:15.27.852] [LOG] Start generating PDF...
[30.08.2023 23:15.27.852] [LOG] Generate cover...
[30.08.2023 23:15.27.852] [LOG] Start generating TOC...
[30.08.2023 23:15.27.958] [LOG] Restructuring the html of a document...
[30.08.2023 23:15.35.378] [LOG] Remove unnecessary HTML...
[30.08.2023 23:15.35.379] [LOG] Scroll to the bottom of the page...
[30.08.2023 23:16.29.393] [ERROR] ProtocolError: Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed.
at <instance_members_initializer> (C:\Users\ignat\AppData\Local\npm-cache_npx\c16ac64a6c7aba73\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:49:14)
at new Callback (C:\Users\ignat\AppData\Local\npm-cache_npx\c16ac64a6c7aba73\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:53:16)
at CallbackRegistry.create (C:\Users\ignat\AppData\Local\npm-cache_npx\c16ac64a6c7aba73\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:93:26)

Could you please help?

Idea: Align headers level to the sidebar nesting, or make page level configurable by meta keywords

At the moment, when generating a PDF from a website, every subpage starts with an <h1>. However, on the website some pages are nested under higher-level pages.

For example:

Here, "getting started" is the entry point and has multiple subpages such as "installation" and "configuration".

I wonder whether it would be worthwhile to detect if a page is a parent or a child and, for a child, automatically shift each heading down one level. On the installation page, the <h1> would become an <h2>, and so on.

💡 We could also manage this with meta keywords, so it would be manually configurable per page :)
Together with the bookmarks enhancement, this would make it superior to Word and Google Docs.

What do you think?
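
To make the idea concrete, here is a minimal sketch of the heading shift, assuming the nesting depth of a page is already known (for example from the sidebar). The function name `shiftHeadings` and the `depth` parameter are hypothetical and not part of docs-to-pdf; the sketch works on an HTML string with a regular expression rather than on the DOM.

```typescript
// Hypothetical helper: demote every heading in an HTML fragment by `depth` levels.
// `depth` is assumed to be the page's nesting level in the sidebar (0 = top level).
function shiftHeadings(html: string, depth: number): string {
  return html.replace(
    /<(\/?)h([1-6])\b/gi,
    (_match: string, slash: string, level: string): string => {
      // HTML only defines h1..h6, so clamp the result to that range.
      const shifted = Math.max(1, Math.min(6, Number(level) + depth));
      return `<${slash}h${shifted}`;
    }
  );
}

// A child page one level below "getting started".
const childPage = '<h1 id="install">Installation</h1><h2>Requirements</h2>';
console.log(shiftHeadings(childPage, 1));
// <h2 id="install">Installation</h2><h3>Requirements</h3>
```

Attributes on the heading tags are left untouched, and headings that would go past <h6> stay at <h6>.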

Quick Start example doesn't work

I tried running the example from the README

npx docs-to-pdf --initialDocURLs="https://docusaurus.io/docs/" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page" --coverImage="https://docusaurus.io/img/docusaurus.png" --coverTitle="Docusaurus v2"

and I got this error:

[10.10.2023 11:08.19.379] [DEBUG] Using Chromium from /home/kkovacs/.cache/puppeteer/chrome/linux-117.0.5938.149/chrome-linux64/chrome
[10.10.2023 11:08.19.607] [DEBUG] Chrome user data dir: /tmp/puppeteer_dev_chrome_profile-2V52e1
[10.10.2023 11:08.19.646] [LOG]   Retrieving html from https://docusaurus.io/docs/
[10.10.2023 11:08.21.047] [DEBUG] Found 0 elements
[10.10.2023 11:08.21.049] [LOG]   Success
[10.10.2023 11:08.21.051] [LOG]   Retrieving html from https://docusaurus.io/docs/category/getting-started
[10.10.2023 11:08.22.165] [DEBUG] Found 0 elements
[10.10.2023 11:08.22.166] [LOG]   Success


...


[10.10.2023 11:09.23.630] [LOG]   Success
[10.10.2023 11:09.23.634] [LOG]   Retrieving html from https://docusaurus.io/docs/deployment
[10.10.2023 11:09.25.372] [DEBUG] Found 6 elements
[10.10.2023 11:09.25.379] [DEBUG] Clicking summary: How much resource (person-hours, money) am I willing to invest in this?
[10.10.2023 11:09.26.267] [DEBUG] Clicking summary: How much server-side configuration would I need?
[10.10.2023 11:09.27.104] [DEBUG] Clicking summary: Do I have needs to cooperate?
[10.10.2023 11:09.27.944] [DEBUG] Clicking summary: GitHub action files
[10.10.2023 11:09.28.771] [DEBUG] Clicking summary: GitHub action file
[10.10.2023 11:09.28.780] [ERROR] Error: Node is either not clickable or not an Element
    at CdpElementHandle.clickablePoint (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/puppeteer-core/lib/cjs/puppeteer/api/ElementHandle.js:680:23)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async CdpElementHandle.<anonymous> (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/puppeteer-core/lib/cjs/puppeteer/api/ElementHandle.js:258:32)
    at async CdpElementHandle.click (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/puppeteer-core/lib/cjs/puppeteer/api/ElementHandle.js:710:30)
    at async CdpElementHandle.<anonymous> (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/puppeteer-core/lib/cjs/puppeteer/api/ElementHandle.js:261:36)
    at async openDetails (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/docs-to-pdf/lib/utils.js:212:13)
    at async generatePDF (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/docs-to-pdf/lib/utils.js:82:21)

Just wanted to point this out because I'm struggling to get this to work on my own site, so I wanted a working example reference.

Error: Could not find Google Chrome executable for channel 'stable' at '/opt/google/chrome/chrome'.

Since v0.3.1 this error shows up on start; with v0.3.0 everything works fine.

Error: Node is either not clickable or not an Element when <details> is inside <tabs>

Hello!

I have a page with <tabs>, one of which contains <details>.

Last logs before the error:

[LOG]   Retrieving html from <page url>
[DEBUG] Found 1 elements
[DEBUG] Clicking summary: <element name>

and then the error:

Error: Node is either not clickable or not an Element
    at CdpElementHandle.clickablePoint (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\node_modules\puppeteer-core\lib\cjs\puppeteer\api\ElementHandle.js:682:23)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async CdpElementHandle.<anonymous> (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\node_modules\puppeteer-core\lib\cjs\puppeteer\api\ElementHandle.js:259:32)
    at async CdpElementHandle.click (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\node_modules\puppeteer-core\lib\cjs\puppeteer\api\ElementHandle.js:712:30)
    at async CdpElementHandle.<anonymous> (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\node_modules\puppeteer-core\lib\cjs\puppeteer\api\ElementHandle.js:262:36)
    at async openDetails (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\lib\utils.js:212:13)
    at async generatePDF (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\lib\utils.js:82:21)

Basic Auth support

Hi Jean,

thanks for creating this project.
It works great for me.

The production version of my documentation is behind a basic auth access.
Would it be possible to add the credentials at startup of the crawler?
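
For reference, HTTP Basic authentication boils down to one request header. The sketch below builds that header value from a username and password; the function name `basicAuthHeader` is hypothetical, and how docs-to-pdf would accept the credentials is not specified here. Puppeteer itself offers `page.authenticate({ username, password })`, which is probably the more natural hook for a crawler built on it.

```typescript
// Build the value of an HTTP Basic `Authorization` header:
// "Basic " followed by base64("username:password").
// Assumes a Node.js runtime, where `Buffer` is a global.
function basicAuthHeader(username: string, password: string): string {
  const token = Buffer.from(`${username}:${password}`, "utf8").toString("base64");
  return `Basic ${token}`;
}

console.log(basicAuthHeader("user", "pass")); // Basic dXNlcjpwYXNz
```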

Kind regards

Parametrize Arguments to Table of Contents

Would it be possible to parameterize the title Table of Contents when generating the PDF?

Incorrect requirements documented

The docs state --initialDocURLs as the only required parameter. That's incorrect.

Can you provide a Docker image?

bookmarks

Could generating PDF bookmarks be supported?

docs-to-pdf runs forever with circular links

Converting https://python.langchain.com/ runs forever because the "next" link on https://python.langchain.com/docs/expression_language/cookbook/tools leads to a previous page.
npx docs-to-pdf --initialDocURLs="https://python.langchain.com/" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --coverImage="https://upload.wikimedia.org/wikipedia/commons/3/3f/LangChain_logo.png" --coverTitle="LangChain"
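
One common guard against this kind of loop is to remember every URL already visited and stop as soon as the "next" link points at one of them. The sketch below shows the idea in isolation; `collectPages` and `getNext` are hypothetical names, and `getNext` stands in for whatever resolves the pagination link of a page.

```typescript
// Follow "next" links until there is none or a URL repeats.
function collectPages(
  start: string,
  getNext: (url: string) => string | undefined
): string[] {
  const visited = new Set<string>();
  const order: string[] = [];
  let current: string | undefined = start;
  while (current !== undefined && !visited.has(current)) {
    visited.add(current);
    order.push(current);
    current = getNext(current);
  }
  return order;
}

// Simulated site where page "c" links back to page "a".
const links: Record<string, string | undefined> = { a: "b", b: "c", c: "a" };
console.log(collectPages("a", (url) => links[url])); // [ 'a', 'b', 'c' ]
```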

Option to restrict the subpath range

npx docs-to-pdf --initialDocURLs="https://docusaurus.io/docs/markdown-features" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page" --coverImage="https://docusaurus.io/img/docusaurus.png" --coverTitle="Docusaurus v2"
[13.08.2023 17:17.08.551] [DEBUG] Using Chromium from C:\Program Files\Google\Chrome\Application\chrome.exe
[13.08.2023 17:17.08.781] [DEBUG] Chrome user data dir: C:\Users\tatsu\AppData\Local\Temp\puppeteer_dev_chrome_profile-wjQgPd
[13.08.2023 17:17.08.870] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features
[13.08.2023 17:17.10.684] [LOG]   Success
[13.08.2023 17:17.10.689] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/react
[13.08.2023 17:17.12.843] [LOG]   Success
[13.08.2023 17:17.12.844] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/tabs
[13.08.2023 17:17.14.508] [LOG]   Success
[13.08.2023 17:17.14.510] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/code-blocks
[13.08.2023 17:17.16.113] [LOG]   Success
[13.08.2023 17:17.16.114] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/admonitions
[13.08.2023 17:17.17.707] [LOG]   Success
[13.08.2023 17:17.17.711] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/toc
[13.08.2023 17:17.19.122] [LOG]   Success
[13.08.2023 17:17.19.127] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/assets
[13.08.2023 17:17.21.602] [LOG]   Success
[13.08.2023 17:17.21.603] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/links
[13.08.2023 17:17.23.143] [LOG]   Success
[13.08.2023 17:17.23.144] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/plugins
[13.08.2023 17:17.24.639] [LOG]   Success
[13.08.2023 17:17.24.641] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/math-equations
[13.08.2023 17:17.26.649] [LOG]   Success
[13.08.2023 17:17.26.650] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/diagrams
[13.08.2023 17:17.28.193] [LOG]   Success
[13.08.2023 17:17.28.194] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/head-metadata
[13.08.2023 17:17.29.655] [LOG]   Success
[13.08.2023 17:17.29.658] [LOG]   Retrieving html from https://docusaurus.io/docs/styling-layout
[13.08.2023 17:17.30.985] [LOG]   Success
[13.08.2023 17:17.30.987] [LOG]   Retrieving html from https://docusaurus.io/docs/swizzling
[13.08.2023 17:17.32.235] [LOG]   Success
︙

Is there an option to prevent this software from fetching pages out of https://docusaurus.io/docs/markdown-features?
It can't be covered by --excludeURLs.
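
A sketch of the check such an option might perform, assuming the restriction is "same origin, and the path is the base path or lies below it". The name `isWithinSubpath` is hypothetical; it relies only on the standard `URL` class.

```typescript
// Return true when `candidate` is the base URL itself or a page below it.
function isWithinSubpath(candidate: string, base: string): boolean {
  const c = new URL(candidate);
  const b = new URL(base);
  if (c.origin !== b.origin) return false;
  // Drop trailing slashes so "/docs/" and "/docs" behave the same.
  const basePath = b.pathname.replace(/\/+$/, "");
  // Require a "/" boundary so "/docs/markdown-features-x" does not match.
  return c.pathname === basePath || c.pathname.startsWith(basePath + "/");
}

const base = "https://docusaurus.io/docs/markdown-features";
console.log(isWithinSubpath("https://docusaurus.io/docs/markdown-features/react", base)); // true
console.log(isWithinSubpath("https://docusaurus.io/docs/styling-layout", base)); // false
```

With a check like this, the crawl in the log above would stop before https://docusaurus.io/docs/styling-layout.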

An option to control whether all of `<details>` elements are opened

https://docusaurus.io/docs/markdown-features#details

<details> lets us hide content that is meant only for experts. It would be nice if we could control whether <details> elements are opened.

In the current version, all <details> elements are closed.

For beginners

For experts

Can Puppeteer perform this operation before printing the joined page?

flowchart TD

S(Start) --> F[Find and open closed elements]
F --> C{New closed\nelements appeared?}
C -->|Yes| F
C -->|No| Done(Done)
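
The loop in the flowchart can be sketched as follows. This is only an illustration under assumptions: `DetailsLike` is a minimal stand-in for `HTMLDetailsElement`, `findAll` stands in for a query such as `document.querySelectorAll("details")`, and the sketch sets the `open` property instead of clicking the summary. The `maxPasses` limit is an added safety net, not something from the flowchart.

```typescript
// Minimal structural type: anything with an `open` flag.
interface DetailsLike {
  open: boolean;
}

// Repeat "find closed elements, open them" until a pass finds none closed.
// Returns how many elements were opened.
function openAllDetails(findAll: () => DetailsLike[], maxPasses: number = 20): number {
  let opened = 0;
  for (let pass = 0; pass < maxPasses; pass++) {
    const closed = findAll().filter((d) => !d.open);
    if (closed.length === 0) break; // nothing new appeared: done
    for (const d of closed) {
      d.open = true;
      opened++;
    }
  }
  return opened;
}

// Simulated nesting: `inner` only becomes findable once `outer` is open.
const outer: DetailsLike = { open: false };
const inner: DetailsLike = { open: false };
const findVisible = (): DetailsLike[] => (outer.open ? [outer, inner] : [outer]);
console.log(openAllDetails(findVisible)); // 2
```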

how to inject vars into html template

As per https://pptr.dev/api/puppeteer.pdfoptions/#properties, how do you pass

- date: formatted print date
- title: document title
- url: document location
- pageNumber: current page number
- totalPages: total pages in the document

to --headerTemplate?

Is it `--headerTemplate="${date}"`, etc.?
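
For what it's worth, the Puppeteer documentation linked above describes these names as CSS classes, not `${...}` variables: an element carrying one of the classes gets the value injected into it when the PDF is printed. Below is a minimal sketch of building such a template string. The helper `fieldSpan` is hypothetical, and it is an assumption that docs-to-pdf hands `--headerTemplate` to Puppeteer unchanged.

```typescript
// The class names Puppeteer's PDFOptions documentation lists for header/footer templates.
type HeaderField = "date" | "title" | "url" | "pageNumber" | "totalPages";

// An empty element with the class; the value is filled in at print time.
function fieldSpan(field: HeaderField): string {
  return `<span class="${field}"></span>`;
}

const headerTemplate: string =
  `<div style="font-size:10px; width:100%; text-align:center;">` +
  `${fieldSpan("title")} | ${fieldSpan("date")} | ` +
  `page ${fieldSpan("pageNumber")} of ${fieldSpan("totalPages")}` +
  `</div>`;

console.log(headerTemplate);
```

The resulting string is what would go into the option value. The inline `font-size` is there because header text can otherwise render too small to notice.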

Clean puppeteer_dev_chrome_profile

Puppeteer saves many gigabytes in the tmp folder and never clears them. I ran out of disk space. It would be nice if this were cleaned up.
puppeteer/puppeteer#1791 (comment)
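
As a stopgap, leftover profiles can be removed by hand. The sketch below deletes directories whose names start with `puppeteer_dev_chrome_profile-` (the prefix seen in the debug logs) from a given folder. The function name `cleanChromeProfiles` is hypothetical, and the example deliberately runs against a scratch directory: pointing it at the real tmp folder while a browser is running would also remove that browser's live profile.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Delete leftover Puppeteer profile directories from `tmpDir`
// and return the names that were removed.
function cleanChromeProfiles(tmpDir: string): string[] {
  const removed: string[] = [];
  for (const entry of fs.readdirSync(tmpDir, { withFileTypes: true })) {
    if (entry.isDirectory() && entry.name.startsWith("puppeteer_dev_chrome_profile-")) {
      fs.rmSync(path.join(tmpDir, entry.name), { recursive: true, force: true });
      removed.push(entry.name);
    }
  }
  return removed;
}

// Demonstration on a scratch directory instead of the real tmp folder.
const scratch = fs.mkdtempSync(path.join(os.tmpdir(), "cleanup-demo-"));
fs.mkdirSync(path.join(scratch, "puppeteer_dev_chrome_profile-abc123"));
fs.mkdirSync(path.join(scratch, "unrelated"));
console.log(cleanChromeProfiles(scratch)); // [ 'puppeteer_dev_chrome_profile-abc123' ]
```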

Line Break Control / Prevent page breaks after headers

A lot of my pages break at suboptimal places.

I would love to be able to make it so that a header is never the last thing printed on a page.
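
In CSS, this is what `break-after: avoid` (and the older `page-break-after: avoid`) on headings is meant for. The sketch below only builds such a rule as a string; `avoidBreakAfterHeadings` is a hypothetical name, and whether the rule is honored depends on the Chromium version doing the printing. With Puppeteer, a rule like this could be injected through `page.addStyleTag({ content })`.

```typescript
// Build a CSS rule asking the print engine not to break right after headings h1..h<maxLevel>.
function avoidBreakAfterHeadings(maxLevel: number = 6): string {
  const selectors: string[] = [];
  for (let level = 1; level <= Math.min(6, maxLevel); level++) {
    selectors.push(`h${level}`);
  }
  return `${selectors.join(", ")} { break-after: avoid; page-break-after: avoid; }`;
}

console.log(avoidBreakAfterHeadings(3));
// h1, h2, h3 { break-after: avoid; page-break-after: avoid; }
```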

ProtocolError: Runtime.callFunctionOn timed out.

How to disable cover and TOC title

Without the coverTitle, coverImage, and coverSub options, a blank cover is still generated.
The TOC title "Table of contents:" cannot be modified or disabled.

Search / Select in Mac Preview not working

Hi @jean-humann

I just noticed something very strange. When I open the generated PDF in Firefox, I can select and search text just fine. However, when I open the same file in Mac Preview, the text is not correctly selectable.

Here is a video showing it with the example PDF.

Screenshot_2023-08-10_000075.mp4

When I try the same with the PDFs generated by marp, which also uses Puppeteer/Chromium to generate PDFs from HTML, everything works fine. @yhatt do you maybe have some idea on this?

Best codingluke

Templates for arguments

--contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page"

This software always requires such long options. They are so long that no one can type them without reading the README. It would be nice if we could shorten this to something like:

--template docusaurus2
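
One way such a `--template` option could work is a lookup table from preset name to the long selector options. The sketch below is only an illustration: the `presets` table, the `SelectorPreset` type, and `toCliArgs` are hypothetical, and the single preset simply reuses the selectors quoted above.

```typescript
interface SelectorPreset {
  contentSelector: string;
  paginationSelector: string;
  excludeSelectors: string[];
}

// Hypothetical preset table; only the Docusaurus v2 selectors from this issue are known.
const presets: Record<string, SelectorPreset> = {
  docusaurus2: {
    contentSelector: "article",
    paginationSelector: "a.pagination-nav__link.pagination-nav__link--next",
    excludeSelectors: [
      ".margin-vert--xl a",
      "[class^='tocCollapsible']",
      ".breadcrumbs",
      ".theme-edit-this-page",
    ],
  },
};

// Expand a preset name into the long-form arguments it stands for.
function toCliArgs(name: string): string[] {
  const preset: SelectorPreset | undefined = presets[name];
  if (preset === undefined) {
    throw new Error(`Unknown template: ${name}`);
  }
  return [
    `--contentSelector=${preset.contentSelector}`,
    `--paginationSelector=${preset.paginationSelector}`,
    `--excludeSelectors=${preset.excludeSelectors.join(",")}`,
  ];
}

console.log(toCliArgs("docusaurus2"));
```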