fdawgs / node-poppler
Asynchronous node.js wrapper for the Poppler PDF rendering library
Home Page: https://npmjs.com/package/node-poppler
License: MIT License
Is your feature request related to a problem? Please describe.
The OSX Poppler binaries included with this module are currently at v0.89.0 (found in ./src/lib/darwin/poppler-0.89.0).
The latest at time of writing is v20.12.1 and provides the following fixes/enhancements to the binaries:
Node 14 is already EOL and Node 16 becomes EOL on 2023-09-11.
It's a waste of time and CI resources/electricity to continue to support these as users should be moving off of them.
Will drop support on 2023-10-01.
The API docs do not mention how to flatten a PDF.
I have written a descriptive title
I have searched existing feature requests to ensure it has not already been proposed
I agree to follow the Code of Conduct that this project adheres to
Thank you for a really nice library.
I have been wrapping parts of Poppler myself for some time, but I will probably switch over to node-poppler. Node-poppler is more thought through with argument handling and overall more neat than my code.
Is there any reason you don't support streams? Using streams is quite neat since you can build a pipeline with for example Sharp to convert output files and then stream them to storage (for example AWS S3).
I think stream support could fit nicely in your API, but I'm wondering if there are any drawbacks that I'm not aware of.
Hello, if I choose not to save to file directly, but rather get the output to a buffer to do whatever I need to do with it, it seems that the returned buffer is corrupted.
const pdfBuffer = await fs.promises.readFile(inputPdfFile)
const result = await poppler.pdfToCairo(pdfBuffer, null, { singleFile: true, pngFile: true }) // jpg too!
const pngBuffer = Buffer.from(result, 'utf-8') // this buffer is always broken
await fs.promises.writeFile(outputPath, pngBuffer);
I tried setting the encoding of Buffer.from() to binary as well, but when the file is saved it is always broken. From a quick look at the code, it seems that the problem comes from the PNG contents being converted to utf-8 on the way. One clue is that SVG output works, because SVG is a text-based format, while PNG (and JPEG, for that matter) is a binary format and gets corrupted.
Lines 749 to 753 in 3cfb17b
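The corruption described above can be reproduced without Poppler at all. The sketch below is a minimal illustration using only Node's Buffer API, not the module's own code: it round-trips the eight-byte PNG signature through a UTF-8 string and through a latin1 ('binary') string, to show why decoding binary output as UTF-8 is lossy.

```javascript
// The PNG signature. Its first byte (0x89) is not valid UTF-8 on its own.
const original = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Decoding as UTF-8 replaces the invalid byte with U+FFFD, so the original
// value cannot be recovered when the string is encoded again.
const viaUtf8 = Buffer.from(original.toString('utf8'), 'utf8');

// latin1 maps every byte value to exactly one character, so it round-trips.
const viaLatin1 = Buffer.from(original.toString('latin1'), 'latin1');

const utf8Intact = viaUtf8.equals(original); // false
const latin1Intact = viaLatin1.equals(original); // true

console.log({ utf8Intact, latin1Intact });
```

Once the bytes have been decoded as UTF-8, no choice of encoding in Buffer.from() can restore them, which would match the behaviour reported here.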
I have written a descriptive title
I have searched existing feature requests to ensure it has not already been proposed
I agree to follow the Code of Conduct that this project adheres to
Poppler provides a C++ API for PDF manipulation.
I can write some wrapper functions to export these APIs,
and I can compile them to wasm using emscripten.
If there is an existing wasm package, I hope to use it directly.
Otherwise, I can submit a PR.
Currently sat at 96%
When pdftocairo converts a PDF to an image, some characters cannot be displayed, and the following error messages appear:
Missing language pack for 'Adobe-GB1' mapping
Missing language pack for 'Adobe-CNS1' mapping
No font in show
I suggest building the win32 Poppler binaries with poppler-data.
The win32 build below includes poppler-data, for your reference:
https://tm23forest.com/contents/poppler-for-windows
Without poppler-data.
$ pdfinfo -listenc
Available encodings are:
ASCII7
Latin1
Symbol
UTF-16
UTF-8
ZapfDingbats
With poppler-data.
$ pdfinfo -listenc
Available encodings are:
ASCII7
Big5
Big5ascii
EUC-CN
EUC-JP
GBK
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
KOI8-R
Latin1
Latin2
Shift-JIS
Symbol
TIS-620
UTF-16
UTF-8
Windows-1255
ZapfDingbats
I have written a descriptive issue title
I have searched existing issues to ensure it has not already been reported
I agree to follow the Code of Conduct that this project adheres to
Any
16
Windows
10
How to handle errors?
Error on server Error: Command failed: C:\helloworld\node_modules\node-poppler\src\lib\win32\poppler-21.03.0\Library\bin\pdfseparate C:\helloworld\cache\main.pdf C:\helloworld\cache\pdfs\small-pdf-%d.pdf
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error (1): Illegal character '{'
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Syntax Error: Could not extract page(s) from damaged file ('C:\helloworld\cache\main.pdf')
at ChildProcess.exithandler (node:child_process:397:12)
at ChildProcess.emit (node:events:390:28)
at maybeClose (node:internal/child_process:1062:16)
at Process.ChildProcess._handle.onexit (node:internal/child_process:301:5) {
killed: false,
code: 99,
signal: null,
cmd: 'C:\\helloworld\\node_modules\\node-poppler\\src\\lib\\win32\\poppler-21.03.0\\Library\\bin\\pdfseparate C:\\helloworld\\cache\\main.pdf C:\\helloworld\\cache\\pdfs\\small-pdf-%d.pdf',
stdout: '',
stderr: 'Syntax Warning: May not be a PDF file (continuing anyway)\r\n' +
"Syntax Error (1): Illegal character '{'\r\n" +
"Syntax Error: Couldn't find trailer dictionary\r\n" +
"Syntax Error: Couldn't find trailer dictionary\r\n" +
"Syntax Error: Couldn't read xref table\r\n" +
"Syntax Error: Could not extract page(s) from damaged file ('C:\\helloworld\\cache\\main.pdf')\r\n"
Just used function like this:
await poppler.pdfSeparate(mainPdf, pdfsDir)
This should have thrown an error that I can handle with try/catch, but I can't catch it.
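For reference, the general pattern for catching a failure from a promise-based child process call looks like the sketch below. It uses Node's own child_process.execFile as a stand-in, since the exact rejection behaviour of poppler.pdfSeparate is not shown in this report; runAndCapture is a hypothetical helper, not part of node-poppler.

```javascript
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Hypothetical helper: run a command and turn a non-zero exit into a plain
// result object instead of letting the rejection escape.
async function runAndCapture(cmd, args) {
  try {
    const { stdout } = await execFileAsync(cmd, args);
    return { ok: true, output: stdout };
  } catch (err) {
    // For a failed child process, err.code holds the exit code and
    // err.stderr holds whatever the process wrote to stderr.
    return { ok: false, code: err.code, stderr: err.stderr };
  }
}

// Usage: a child process that fails on purpose with exit code 99.
runAndCapture(process.execPath, ['-e', 'process.exit(99)']).then((result) => {
  console.log(result.ok, result.code);
});
```

The try/catch only works if the call is awaited (or a .catch() handler is attached) inside an async function; a missing await lets the rejection escape the try block.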
I have written a descriptive title
I have searched existing feature requests to ensure it has not already been proposed
I agree to follow the Code of Conduct that this project adheres to
First, thank you for making this!
It took me a while to figure out what I was looking for in terms of the poppler-utils directory, but I figured it out. It does seem like brew install poppler put the utils onto the path, though, so I'm just wondering why this path is needed. On my local machine this path will be different from the one in CI, so I'm curious whether there is a trick to get around providing the path.
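One way to avoid hard-coding a directory that differs between a local machine and CI is to look the utilities up on PATH at runtime. The sketch below assumes the binaries are already on PATH (as after brew install poppler); findBinaryDir is a hypothetical helper, not part of node-poppler, and the directory it returns could then be handed to the Poppler constructor.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: return the first directory in a PATH-style string
// that contains the named binary, or undefined if none does.
function findBinaryDir(binaryName, searchPath = process.env.PATH || '') {
  // On Windows the executable carries an .exe suffix.
  const candidates =
    process.platform === 'win32' ? [binaryName, `${binaryName}.exe`] : [binaryName];

  for (const dir of searchPath.split(path.delimiter)) {
    if (!dir) continue;
    for (const candidate of candidates) {
      if (fs.existsSync(path.join(dir, candidate))) return dir;
    }
  }
  return undefined;
}

// Usage: prints the directory holding pdftotext, or undefined if absent.
console.log(findBinaryDir('pdftotext'));
```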
Self explanatory.
I consider this more as a caveat than an actual bug:
When using this library in an app that is packaged as asar archive, e.g. an Electron application packaged with electron-packager, electron-forge or electron-builder, the binaries will not be found when running the installed application.
This is due to execa using child_process.spawn and not child_process.execFile. Only the latter will cause unpacking the binaries/executing the unpacked binaries, while the former will try to execute a path like ...\resources\app.asar\node_modules\node-poppler\src\lib\win32\poppler-0.90.1\bin. This behaviour is described in the Electron docs.
This can be mitigated by setting the popplerPath in the Poppler constructor manually.
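A minimal sketch of that mitigation follows. It assumes the packager has been configured to leave the Poppler binaries outside the archive, in the app.asar.unpacked directory that Electron packagers place next to app.asar; toUnpackedPath is a hypothetical helper name.

```javascript
const path = require('node:path');

// Hypothetical helper: rewrite a path that points inside app.asar so that it
// points at the unpacked copy next to the archive instead.
function toUnpackedPath(p, sep = path.sep) {
  return p.split(`app.asar${sep}`).join(`app.asar.unpacked${sep}`);
}

// Usage with an illustrative POSIX-style path; the result would be passed as
// the popplerPath argument of the Poppler constructor.
const packed = '/opt/App/resources/app.asar/node_modules/node-poppler/src/lib';
console.log(toUnpackedPath(packed, '/'));
```

Paths that already point at app.asar.unpacked are left unchanged, so the helper can be applied unconditionally.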
Electron: ^9.0.0 and ^10.0.0
I have written a descriptive issue title
I have searched existing issues to ensure it has not already been reported
I agree to follow the Code of Conduct that this project adheres to
5.1.6
v16.13.2
macOS
12.4
I am trying to fetch a single page from a pdf without writing it down in a separate image file. This works beautifully for jpg's and png's as documented in https://github.com/Fdawgs/node-poppler/blob/master/README.md#popplerpdftocairo.
For TIFF files it's a different story though: it works only if I give an output file as the second parameter of the pdfToCairo function, but not when I use 'undefined'. The result is way too small, as if it were only the header of the TIFF file or something.
I checked whether my Poppler version (22.05.0) works correctly on the command line. It does.
pdftocairo -tiff -f 1 -l 1 -singlefile example.pdf - > example.tiff works perfectly on the shell.
As far as I can see, node-poppler does send the correct params to the spawned child process. But the result is a very short string, something like this:
From debugging index.js in node-poppler I can see that
is only called once for the whole childprocess. This could be the problem.
This code should be enough to see that foo.tif is not a valid tiff file.
You can use any PDF or the provided file: example.pdf
import { Poppler } from 'node-poppler';
import fs from 'fs';
import path from 'path';
const file = fs.readFileSync(path.join(__dirname, 'example.pdf'));
(async () => {
  const poppler = new Poppler('/usr/bin');
  const res: string | Error = await poppler.pdfToCairo(file, undefined, {
    firstPageToConvert: 1,
    lastPageToConvert: 1,
    singleFile: true,
    tiffCompression: 'jpeg',
    tiffFile: true
    //pngFile: true
  });
  if (res instanceof Error) {
    console.log('Error: ' + JSON.stringify(res));
    return;
  }
  fs.writeFileSync('foo.tif', res, { encoding: 'binary' })
})();
Additional information:
Though I wrote this code on OSX I also tried it on a docker container with alpine linux expecting the behaviour to be an OSX glitch. But I could also reproduce the problem on linux successfully.
The expected behaviour should be equal for all possible output formats - meaning when using an 'undefined' outputfile and the -singleFile Option the resulting string should contain valid image data.
The first example in the documentation returns an error when run because an output file is required. I suggest changing it to include the parameter, something like ./filepath.png.
Please add Linux/Ubuntu support for pdftotext and the other utilities. This is really in demand and I can't find it anywhere. PDF.js is one alternative, but it doesn't extract all of the text, whereas pdftotext from Poppler does.
Please add support. The OS dependency is a problem because we want to use it in our backend server in Firebase Functions, which uses Ubuntu 18.04 LTS and won't allow installing system-wide libraries.
Thank you. Hoping for a positive response.
When I try to run Node.js code that uses the node-poppler library, I get the error "Library not loaded: @rpath/libpoppler.100.dylib", as can be seen in the screenshot below.
I have a simple function for converting PDF to JPG, just like one of the examples on the library's page. When run on macOS, it produced the error.
I expected it to convert the files as normal, as it does on my Windows system.
I have written a descriptive title
I have searched existing feature requests to ensure it has not already been proposed
I agree to follow the Code of Conduct that this project adheres to
At present, all functions use the popplerPath property for their bin path.
This should be broken down to an individual level so that it can be modified if need be:
const { Poppler } = require('node-poppler');
const poppler = new Poppler('/usr/bin');
poppler.pdfToTextPath = '/totallydifferentpath/bin';
poppler.pdfToHtmlPath = '/anotherpath';
// Buffer.from() replaces the deprecated new Buffer() constructor
await poppler.pdfToText(Buffer.from('bleh'));
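One possible shape for this proposal is a small lookup that falls back to the shared directory unless a utility has its own override. The sketch below is purely hypothetical: BinaryPathResolver, setPath and resolve are illustrative names, not part of the module, and path.posix is used only to keep the output the same on every platform.

```javascript
const path = require('node:path');

// Hypothetical sketch: one default directory plus optional per-binary overrides.
class BinaryPathResolver {
  constructor(defaultDir) {
    this.defaultDir = defaultDir;
    this.overrides = new Map();
  }

  // Register a directory for a single utility; returns this for chaining.
  setPath(binaryName, dir) {
    this.overrides.set(binaryName, dir);
    return this;
  }

  // Full path of a utility, preferring its override over the default.
  resolve(binaryName) {
    const dir = this.overrides.get(binaryName) ?? this.defaultDir;
    return path.posix.join(dir, binaryName);
  }
}

const resolver = new BinaryPathResolver('/usr/bin').setPath(
  'pdftotext',
  '/totallydifferentpath/bin'
);

console.log(resolver.resolve('pdftotext'));
console.log(resolver.resolve('pdftohtml'));
```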
Describe the solution you'd like
New releases of Poppler introduce new options/args to the util binaries, which are subsequently added to this module's functions.
Users may be using an older version of the Poppler util binaries with this module, and may attempt to use the new options.
The module should determine whether the Poppler util binaries provided to this module have the options passed to the functions, and throw an error if not.
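A rough sketch of how such a check might work: collect the flags that a binary lists in its help output and compare them with the flags about to be passed. The helper name unsupportedFlags and the sample help text are illustrative assumptions, not the module's implementation or real output from any Poppler release.

```javascript
// Hypothetical helper: return the requested flags that do not appear at the
// start of any line in the given help text.
function unsupportedFlags(helpText, requestedFlags) {
  const known = new Set();
  for (const line of helpText.split(/\r?\n/)) {
    // A flag line is assumed to start with optional whitespace, then "-name".
    const match = /^\s*(-[A-Za-z0-9-]+)/.exec(line);
    if (match) known.add(match[1]);
  }
  return requestedFlags.filter((flag) => !known.has(flag));
}

// Illustrative help excerpt, written for this example.
const sampleHelp = [
  'Usage: pdftotext [options] <PDF-file> [<text-file>]',
  '  -f <int>   : first page to convert',
  '  -l <int>   : last page to convert',
  '  -layout    : maintain original physical layout',
].join('\n');

const missing = unsupportedFlags(sampleHelp, ['-layout', '-tsv']);
console.log(missing);
```

A function in the module could throw when the returned list is non-empty, before the child process is started.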
I have written a descriptive issue title
I have searched existing issues to ensure it has not already been reported
I agree to follow the Code of Conduct that this project adheres to
5.1.5
16
Linux
20.04
poppler.pdfInfo always reports 0 bytes in fileSize if the PDF file is a Buffer.
Add the below code to index.test.js and run to see it
test("Should list info of PDF file as Buffer as a JSON object", async () => {
const poppler = new Poppler(testBinaryPath);
const attachmentFile = await fs.promises.readFile(file);
const res = await poppler.pdfInfo(attachmentFile, {
printAsJson: true,
});
expect(res).toMatchObject({
tagged: "yes",
userProperties: "no",
suspects: "no",
form: "AcroForm",
javaScript: "no",
pages: "16",
encrypted: "no",
pageSize: "595.276 x 841.89 pts (A4)",
pageRot: "0",
fileSize: "583094 bytes",
optimized: "no",
pdfVersion: "1.3",
});
});
The test report is:
● Node-Poppler Module › pdfInfo Function › Should list info of PDF file as Buffer as a JSON object
expect(received).toMatchObject(expected)
- Expected - 1
+ Received + 1
@@ -1,8 +1,8 @@
Object {
"encrypted": "no",
- "fileSize": "583094 bytes",
+ "fileSize": "0 bytes",
"form": "AcroForm",
"javaScript": "no",
"optimized": "no",
"pageRot": "0",
"pageSize": "595.276 x 841.89 pts (A4)",
414 | });
415 |
> 416 | expect(res).toMatchObject({
| ^
417 | tagged: "yes",
418 | userProperties: "no",
419 | suspects: "no",
at Object.toMatchObject (src/index.test.js:416:16)
Test Suites: 1 failed, 1 total
Tests: 1 failed, 90 passed, 91 total
Snapshots: 0 total
Time: 32.347 s, estimated 33 s
Ran all test suites.
It should report the correct fileSize.
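Until that is addressed, the size is already known on the caller's side, because a Buffer carries its own length. The sketch below is a caller-side workaround rather than a fix in the module; withBufferFileSize is a hypothetical helper, and it assumes the parsed result is a plain object with a fileSize string in the format shown in the test above.

```javascript
// Hypothetical helper: when the input was a Buffer, overwrite fileSize with
// the Buffer's byte length, using the "<n> bytes" format from the test above.
function withBufferFileSize(info, input) {
  if (Buffer.isBuffer(input)) {
    return { ...info, fileSize: `${input.length} bytes` };
  }
  return info;
}

// Usage with a stand-in result object and a zero-filled stand-in Buffer.
const patched = withBufferFileSize(
  { pages: '16', fileSize: '0 bytes' },
  Buffer.alloc(583094)
);
console.log(patched.fileSize);
```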
I have written a descriptive title
I have searched existing feature requests to ensure it has not already been proposed
I agree to follow the Code of Conduct that this project adheres to
Currently, when I use the poppler.pdfToCairo method to export images, I don't want to write them directly to a specified directory. I want to get a file stream, buffer, or similar, and upload it directly to OSS so that I don't have to read local files. I looked through the relevant documentation but haven't found any similar API. Can the user decide whether to export to a file?
I have written a descriptive issue title
I have searched existing issues to ensure it has not already been reported
I agree to follow the Code of Conduct that this project adheres to
node-poppler:5.1.1 | poppler:21.11.0
v17.9.0
Linux
node:17-alpine
Text is truncated to 1MB.
Is there some limit?
const { Poppler } = require('node-poppler');
const prettyBytes = require('pretty-bytes');
let popplerOptions = {};
popplerOptions.firstPageToConvert = 3; // 3rd page of 700
const poppler = new Poppler('/usr/bin/');
poppler.pdfToText('my-pdf-with-700-pages-and-30mb.pdf', undefined, popplerOptions)
.then((res) => {
// res is only 1MB - text is cut off
let textSize = prettyBytes(res.length);
console.log(JSON.stringify([res.length, textSize]));
});
// [1028143, "1.03 MB"]
# running as cli works OK and extracts all text in file
pdftotext -f 3 my-pdf-with-700-pages-and-30mb.pdf
Returns all text found in PDF.
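The reported length is close to 1 MiB, which is the default maxBuffer of Node's child_process.exec and execFile; whether that limit is the actual cause here is an assumption. The sketch below shows one way to collect output of any size by reading stdout from child_process.spawn, which has no such cap. collectStdout is a hypothetical helper, and a Node child stands in for pdftotext.

```javascript
const { spawn } = require('node:child_process');

// Hypothetical helper: run a command and gather all of stdout as one Buffer.
function collectStdout(cmd, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args);
    const chunks = [];

    // 'data' fires once per chunk, usually many times for large output.
    child.stdout.on('data', (chunk) => chunks.push(chunk));
    child.on('error', reject);
    child.on('close', (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`${cmd} exited with code ${code}`));
    });
  });
}

// Usage: a stand-in child process that prints 2 MiB of text.
const printTwoMiB = 'process.stdout.write("a".repeat(2 * 1024 * 1024))';
collectStdout(process.execPath, ['-e', printTwoMiB]).then((output) => {
  console.log(output.length);
});
```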
Hello, I've noticed that this package bundles all the windows binaries regardless of what platform it's being installed on.
I suggest putting all the windows binaries into a separate package and adding
"os": [
"win32"
]
to that package's package.json.
You can then include that package using optionalDependencies, which will only install the package on Windows and ignore it on any other platform.
Doing this will remove over 99% of the unpacked package size.
Having to include almost 50MB of unused files in a docker image is not great (especially since you need to install the Linux binaries on top of that anyway).
Also might as well include
"cpu": [
"x64"
]
while you're at it, since the binaries won't work on x86 or ARM builds of Windows.
When I run the application on Windows, it works as expected (PDF-to-image conversion using poppler.pdftocairo).
When the application is deployed in a Linux environment, PDF-to-image conversion does not work.
I am getting the error below. Please let me know how to fix it. If any package needs to be added, please share the procedure.
PDF-to-TIFF conversion with pdfToCairo() throws Internal process error, only converts first page
Steps to reproduce the behavior:
const poppler = new Poppler();
const options = {
tiffFile: true,
};
const outputFile = `${testDirectory}pdf_1.3_NHS_Constitution`;
const res = await poppler.pdfToCairo(file, outputFile, options);
Appears to be a known issue.
See Belval/pdf2image#206
Helps to handle Windows UNC paths.
I have written a descriptive issue title
I have searched existing issues to ensure it has not already been reported
I agree to follow the Code of Conduct that this project adheres to
6.0.3
16.18.1
Linux
Amazon Linux 2
I moved my server from Heroku to AWS EC2 instance. On heroku everything worked fine, but on the AWS instance I get this error:
Error: Error opening output file fd://0.png
at ChildProcess.<anonymous> (/home/ec2-user/repo/node_modules/node-poppler/src/index.js:774:14)
at ChildProcess.emit (node:events:513:28)
at ChildProcess.emit (node:domain:489:12)
at maybeClose (node:internal/child_process:1100:16)
at Process.ChildProcess._handle.onexit (node:internal/child_process:304:5)
To install the Poppler dependencies on the Heroku instance, I added a buildpack in the Heroku settings: https://github.com/amitree/heroku-buildpack-poppler
To install the Poppler dependencies on the AWS EC2 instance, I installed them with:
sudo yum install poppler-data
sudo yum install poppler-utils
I found this StackOverflow issue, which says that this is a bug in pdfToCairo. But the same code worked on Heroku.
Do you think this is caused by the different Linux distributions? Or is there something I am missing, such as additional dependencies that need to be installed for this to work?
On Heroku, this is called the "stack"—an operating system image curated and maintained by Heroku. The stack is based on Ubuntu, the open source Linux distribution.
AWS's Amazon Linux will be based on Red Hat's Fedora community Linux.
I just created a AWS EC2 instance with default settings, installed node, installed poppler dependencies and tried running the code below.
I need to generate a png file from pdf which has a single page.
First I generate the pdf buffer:
const PdfOptions = {
  base: `file:///${base}/`,
  format: 'letter',
  height: 2551,
  localUrlAccess: true,
  orientation: 'landscape',
  timeout: '100000',
  width: 3295,
};
const html = this.getHtml();
const fileName = await PdfService.GenerateFileName(FileExtension.Pdf);
return new Promise((resolve, reject) => {
  pdf.create(html, PdfOptions).toBuffer(function (err, buffer) {
    if (err) {
      reject(err);
      return logger.error(err);
    }
    resolve({ buffer, fileName });
  });
});
Then I try to generate the png from the pdf buffer:
const pdfToCairoOptions = {
pngFile: true,
singleFile: true,
resolutionXYAxis: 72,
};
const pngBuffer = await poppler.pdfToCairo(pdfPath, undefined, pdfToCairoOptions); // Crashes here
const binaryBuffer = Buffer.from(pngBuffer, 'binary');
return { pngBuffer: binaryBuffer };
Expected behaviour should be that the png file is generated. On my development machine(macOS) and heroku it works. But on AWS EC2 instance it doesn't work.
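Since the same code behaves differently on the two hosts, one thing worth comparing is the Poppler version that each package manager installed. The sketch below parses a version string and compares it with a minimum. It assumes the utilities print a line of the form "pdftocairo version X.Y.Z" when asked for their version; the sample strings and the minimum used here are illustrative values, not captured output or a known requirement.

```javascript
// Hypothetical helper: extract [major, minor, patch] from version output.
function parsePopplerVersion(versionOutput) {
  const match = /version\s+(\d+)\.(\d+)(?:\.(\d+))?/.exec(versionOutput);
  if (!match) return undefined;
  return [Number(match[1]), Number(match[2]), Number(match[3] ?? 0)];
}

// True when version is greater than or equal to minimum, part by part.
function isAtLeast(version, minimum) {
  for (let i = 0; i < 3; i += 1) {
    if (version[i] !== minimum[i]) return version[i] > minimum[i];
  }
  return true;
}

// Illustrative inputs only.
const older = parsePopplerVersion('pdftocairo version 0.26.5');
const newer = parsePopplerVersion('pdftocairo version 22.05.0');

console.log(isAtLeast(older, [21, 3, 0]), isAtLeast(newer, [21, 3, 0]));
```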
I have written a descriptive issue title
I have searched existing issues to ensure it has not already been reported
I agree to follow the Code of Conduct that this project adheres to
5.0.2
16.13.0
Windows
10
Use of the singleFile option, when combined with any image file option in pdfToCairo, produces corrupted files when writing to stdout.
const file = 'test_file.pdf';
const poppler = new Poppler();
const options = {
jpegFile: true,
singleFile: true,
};
const res = await poppler.pdfToCairo(file, undefined, options);
fs.writeFileSync(`${testDirectory}test1.jpg`, res);
Resulting file also ends up being double the size of what would be generated if passing an output path to pdfToCairo.
Corruption does not occur when output path is defined.
No response
I have written a descriptive issue title
I have searched existing issues to ensure it has not already been reported
I agree to follow the Code of Conduct that this project adheres to
No response
16.16.0
Linux
Ubuntu
Just a heads up... I spent some time debugging and feel like the docs should say
/usr/bin
in place of
./usr/bin
At least that is what I needed to get it to work, but I am no Linux expert :)
Cheers
Just a documentation note.
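The difference between the two spellings can be checked with Node's path module: a leading ./ makes the path relative to the current working directory, so it only points at the system directory when the process happens to run from /. The working directory /home/app below is an illustrative assumption.

```javascript
const path = require('node:path');

// path.posix gives Linux-style behaviour regardless of the host platform.
const absoluteForm = path.posix.isAbsolute('/usr/bin'); // true
const relativeForm = path.posix.isAbsolute('./usr/bin'); // false

// With an assumed working directory of /home/app, the relative form resolves
// to a directory under it, not to the system /usr/bin.
const resolved = path.posix.resolve('/home/app', './usr/bin');

console.log(absoluteForm, relativeForm, resolved);
```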
No response
I have written a descriptive issue title
I have searched existing issues to ensure it has not already been reported
I agree to follow the Code of Conduct that this project adheres to
22.04.0
14.18
Windows
21H2
Converting a PDF file to JPG files fails with the following errors:
I/O Error: Couldn't open 'nameToUnicode' file 'node_modules\node-poppler\src\lib\win32\poppler-22.04.0\share\poppler\nameToUnicode\Bulgarian'
I/O Error: Couldn't open 'nameToUnicode' file 'node_modules\node-poppler\src\lib\win32\poppler-22.04.0\share\poppler\nameToUnicode\Greek'
I/O Error: Couldn't open 'nameToUnicode' file 'node_modules\node-poppler\src\lib\win32\poppler-22.04.0\share\poppler\nameToUnicode\Thai'
const _opts = {
scalePageTo:3072,
jpegFile:true
};
poppler.pdfToCairo(pdfFilePath,undefined, _opts)
.then(res => {
...
})
.catch(error => {
console.error(error);
reject(error);
})
It should produce JPG files.
Describe the solution you'd like
Support TypeScript usage with a d.ts TypeScript declaration file for the module.
Describe the solution you'd like
Poppler Utils support the ability to read PDF files from stdin for most of the binaries (besides pdfattach, pdfdetach, pdfseparate, pdfunite).
The functionality for this would ideally be reflected in this wrapper module as well.
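At the Node level, feeding a Buffer to a utility's stdin could look like the sketch below. It is a generic illustration with child_process.spawn, not the wrapper's implementation: runWithStdin is a hypothetical helper, a Node child that echoes stdin stands in for a Poppler binary, and the arguments a real binary needs to read from stdin are left out.

```javascript
const { spawn } = require('node:child_process');

// Hypothetical helper: write `input` to the child's stdin and resolve with
// everything the child wrote to stdout, kept as a Buffer.
function runWithStdin(cmd, args, input) {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args);
    const chunks = [];

    child.stdout.on('data', (chunk) => chunks.push(chunk));
    child.on('error', reject);
    // Writing can fail (for example EPIPE) if the child exits early.
    child.stdin.on('error', reject);
    child.on('close', (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`${cmd} exited with code ${code}`));
    });

    // end() writes the Buffer and then closes stdin so the child sees EOF.
    child.stdin.end(input);
  });
}

// Usage: echo a few bytes, including non-text ones, through a stand-in child.
const sampleInput = Buffer.from([0x25, 0x50, 0x44, 0x46, 0x89, 0xff]);
runWithStdin(process.execPath, ['-e', 'process.stdin.pipe(process.stdout)'], sampleInput)
  .then((output) => console.log(output.equals(sampleInput)));
```

Keeping both directions as Buffers avoids any string decoding of binary data along the way.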