dbashford / textract
node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!
License: MIT License
With larger docx files a buffer exceeded error is generated.
I got around this by modifying:
lib/extractors/docx.js
adding the following to the exec statement near the top of the file:
{maxBuffer: 50000*1024},
Ideally this could be a configurable parameter.
Cheers!
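A rough sketch of how the buffer limit could be made configurable, using only Node's child_process module. The `options.exec` shape and the names `DEFAULT_EXEC_OPTIONS`, `buildExecOptions` and `runCommand` are assumptions for illustration, not textract's actual code. Node itself stands in for the external command so the example runs anywhere.

```javascript
const { execFileSync } = require("child_process");

// Hypothetical default; older Node releases used 200 KB for exec's maxBuffer.
const DEFAULT_EXEC_OPTIONS = { maxBuffer: 200 * 1024 };

// Merge caller-supplied exec options over the defaults (illustrative names).
function buildExecOptions(userOptions) {
  const overrides = (userOptions && userOptions.exec) || {};
  return Object.assign({}, DEFAULT_EXEC_OPTIONS, overrides);
}

// Run a command with the merged options and return its stdout as a string.
function runCommand(file, args, userOptions) {
  const execOptions = Object.assign({ encoding: "utf8" }, buildExecOptions(userOptions));
  return execFileSync(file, args, execOptions);
}

// Usage: produce 300 KB of output, more than the 200 KB default allows,
// and raise the limit to the value from the workaround above.
const big = runCommand(
  process.execPath,
  ["-e", "process.stdout.write('x'.repeat(300 * 1024))"],
  { exec: { maxBuffer: 50000 * 1024 } }
);
console.log(big.length);
```

With this shape the caller decides the limit, and the extractor code never needs to be edited by hand.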
I want to use the language chi_sim.
Where can I set options?
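For reference, the tesseract command line itself takes the OCR language through its `-l` flag (for example `-l chi_sim`). How textract forwards such an option is not shown on this page, so `buildTesseractArgs` and its `lang` parameter below are hypothetical. The sketch only assembles the argument list.

```javascript
// Build the argument list for the tesseract CLI.
// Hypothetical helper: textract's real option name for the language is not shown here.
function buildTesseractArgs(inputPath, outputBase, lang) {
  const args = [inputPath, outputBase];
  if (lang) {
    args.push("-l", lang); // tesseract's own language flag
  }
  args.push("quiet"); // config name, as used in the tesseract exec call quoted on this page
  return args;
}

console.log(buildTesseractArgs("scan.png", "scan-out", "chi_sim"));
```

The resulting array could be passed to `child_process.execFile("tesseract", args, callback)`, provided the chi_sim language data is installed for tesseract.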
I've just made a deployment with the latest version of the lib (0.17) and get the following error in the log:
/graspeo/current/node_modules/mongoose/node_modules/mongodb/lib/mongodb/db.js:297
throw err;
^
Error: Cannot find module 'ppt'
at Function.Module._resolveFilename (module.js:338:15)
at Function.Module._load (module.js:280:25)
at Function.cls_wrapMethod [as _load] (/graspeo/current/node_modules/newrelic/lib/shimmer.js:208:38)
at Module.require (module.js:364:17)
at require (module.js:380:17)
at Object.<anonymous> (/graspeo/current/node_modules/textract/lib/extractors/ppt.js:2:11)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
Earlier today everything was fine so I assume this is because of the new release. Changing version back to 0.16 made things work again.
By the way, thanks for the great lib!
It looks like the options passed to other extractors are not used for the .docx extraction process. The textract APIs pass an empty string back to the callback for large .docx files (tested with a .docx of around 400 pages).
I get a lot of extractions that look something like this:
some text more text some other text
No need for all the white space.
analyze the output of catdoc __filename
to see if catdoc is there but just can't find the file.
Hi David!
Trying to create an endpoint in an Express server like this:
app.get('/textract', function(req, res, next) {
  textract("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", function(error, text) {
    console.log(error);
    res.end();
  });
});
Console returns [Error: File at path [[ https://s3.amazonaws.com/testbucket1a2b3c/test.pdf ]] does not exist.]
What does this mean exactly? Does textract only work with local files? (In this case my file is uploaded to S3.) Thanks!
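The error suggests the path is checked against the local file system. A possible workaround, assuming only local paths are accepted, is to download the remote file to a temporary location first and hand that path to textract. `isRemote` and `downloadToTemp` are illustrative names, not part of textract.

```javascript
const fs = require("fs");
const http = require("http");
const https = require("https");
const os = require("os");
const path = require("path");

// True when the location looks like an http(s) URL rather than a local path.
function isRemote(location) {
  return /^https?:\/\//i.test(location);
}

// Download a URL into the OS temp directory and pass the local path to the callback.
function downloadToTemp(url, callback) {
  const client = url.toLowerCase().startsWith("https:") ? https : http;
  const extension = path.extname(new URL(url).pathname);
  const target = path.join(os.tmpdir(), "download-" + Date.now() + extension);

  client.get(url, function (res) {
    if (res.statusCode !== 200) {
      res.resume(); // discard the response body
      return callback(new Error("Unexpected status " + res.statusCode));
    }
    const file = fs.createWriteStream(target);
    res.pipe(file);
    file.on("finish", function () { callback(null, target); });
    file.on("error", callback);
  }).on("error", callback);
}

// Usage (not executed here: needs network access and textract installed):
// downloadToTemp("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", function (err, localPath) {
//   if (err) return console.log(err);
//   textract(localPath, function (error, text) { console.log(error || text); });
// });
console.log(isRemote("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf"));
```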
Hi David,
Do you have plans to update textract to support .ppt, .xlsx, .xltx, .potx, .key, .pages, .xml? I'd also love to see support for OpenOffice file formats, like .odt, .ott, .ods, .ots, .odg, .otg, .odp, .otp.
Thanks!
For me, the preserve newline behaviour isn't quite working as I expected (tested with the docx extractor).
I have text like this in a docx file:
2 downlighters; door to hall.
Hall
Double glazed window to front;
With preserveLineBreaks I get this output:
2 downlighters; door to hall. Hall
Double glazed window to front;
After outputting some stuff to the console I can see the newlines are there as expected but then they get parsed out.
Taking a look at how preserveLineBreaks is implemented, I see it's a big, hairy regex, so I'm not sure what it is doing at first glance. From my naive point of view it would be nicer to get the raw text output; if I need to filter further I can decide that for myself. Or, if there were a 'clean' function as a configuration option, I could use it to override the default behaviour.
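To illustrate what a simpler cleanup could look like, here is a sketch that collapses spaces and tabs within each line and only merges lines when asked to. `cleanText` and its `preserveLineBreaks` argument are assumptions for illustration; this is not the regex textract uses.

```javascript
// Collapse runs of spaces/tabs per line; optionally keep the line structure.
function cleanText(text, preserveLineBreaks) {
  const lines = text.split(/\r?\n/).map(function (line) {
    return line.replace(/[ \t]+/g, " ").trim();
  });
  if (preserveLineBreaks) {
    return lines.join("\n");
  }
  // Without line breaks, drop empty lines and join everything with single spaces.
  return lines.filter(function (line) { return line.length > 0; }).join(" ");
}

const sample = "2 downlighters;   door to hall.\nHall\nDouble glazed window to front;";
console.log(cleanText(sample, true));
console.log(cleanText(sample, false));
```

Passing a function like this in as an option would let callers replace the default cleanup entirely.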
$ textract some-file.docx
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable to extract DXFs.
[Error: extractNewWordDocument exec error: Error: stdout maxBuffer exceeded.]
Any way to avoid this error? Or is it just something I'm doing wrong?
I don't need drawingtotext, just doc and docx, I guess.
Use case: An Express API for taking DOC/PDF/DOCX and returning text.
Rather than uploading a file to the server and then having textract read that off the disk it would be preferable to take a base64 encoded DOC/PDF/DOCX file sent as a string in a POST request, put it in a buffer, and then have textract read that buffer.
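Until buffers are accepted directly, one stopgap is to decode the base64 string and write it to a temporary file whose path can be passed on. This is a sketch under that assumption; `writeBase64ToTemp` is a made-up helper name and the payload below is plain text standing in for a real document.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Decode a base64 payload and write it into a fresh temp directory.
// Returns the file path, which could then be handed to a path-based extractor.
function writeBase64ToTemp(base64String, fileName) {
  const buffer = Buffer.from(base64String, "base64");
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "upload-"));
  const filePath = path.join(dir, path.basename(fileName)); // basename guards against "../"
  fs.writeFileSync(filePath, buffer);
  return filePath;
}

// Usage: simulate the string that would arrive in a POST body.
const encoded = Buffer.from("hello from a POST body").toString("base64");
const savedPath = writeBase64ToTemp(encoded, "resume.docx");
console.log(fs.readFileSync(savedPath, "utf8"));
```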
I am finding that textract is removing all of the line breaks within a document. Commenting out cleanseText seemed to fix it but perhaps a better way would be to specify whether text is 'cleansed' with params?
I don't have the DXF conversion software installed, so every time I do a text extraction I get an info message saying INFO: 'drawingtotext' does not appear to be installed, so textract will be unable to extract DXFs.
and then the text of my document after it. Is there any way to disable this?
Will end up with double digit pages showing up first.
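Assuming this refers to page or slide files being sorted as plain strings (so that "slide10.xml" lands before "slide2.xml"), comparing the embedded numbers would restore the expected order. The file names below are made up for illustration.

```javascript
// Extract the first run of digits from a name; 0 when there is none.
function pageNumber(name) {
  const match = /(\d+)/.exec(name);
  return match ? parseInt(match[1], 10) : 0;
}

// Return a copy of the names ordered by their embedded page number.
function sortByPageNumber(names) {
  return names.slice().sort(function (a, b) {
    return pageNumber(a) - pageNumber(b);
  });
}

const names = ["slide10.xml", "slide2.xml", "slide1.xml", "slide11.xml"];
console.log(names.slice().sort());    // plain string sort: double digits come before slide2
console.log(sortByPageNumber(names)); // numeric order
```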
I took the test file and used powerpoint to save as an RTF file. Using textutil on OSX, I generated a baseline. Ideally, textract should produce the exact text:
$ textutil -convert txt layout_types_2011.rtf # creates layout_types_2011.txt
$ textract layout_types_2011.pptx 2>/dev/null >layout_types_2011.textract
$ diff layout_types_2011.txt layout_types_2011.textract
While the differences might be conscious decisions, it's worth clarifying:
A) the line "textract not ready, retrying in .5 seconds" is printed to stdout. This probably should be printed to stderr: https://github.com/dbashford/textract/blob/master/lib/extract.js#L72 should use console.error
rather than console.log
B) Newlines are completely lost. For example, slide 10 reads
Who thought this would be a good idea?
Unfortunately the arrow keys act relative to the screen rather than the text
The entire input situation is confusing
but textract is writing
Who thought this would be a good idea? Unfortunately the arrow keys act relative to the screen rather than the text The entire input situation is confusing
C) The … character (U+2026) is missing (is that intentional?)
We've recently begun to shard out our text extraction processes and I noticed a significant spike in memory usage. Looks like it's coming from this module. Running the following:
var textract = require('textract');
setInterval(function () {
console.error(process.memoryUsage());
}, 1000);
Results in around 135 MB of memory being used. Comment out the first line and that shoots down to around 10 MB.
Any ideas what's causing this?
2126150Microsoft Macintosh Word011falseW
Only seen this with docx, usually with things like complex footers.
Verify and fix
Pre-2007 powerpoint
We have a requirement for CSV support in a project. Would it be useful to add this using a popular npm library, exposed through the same interface as textract?
I will be able to PR my work early next week.
I've quickly implemented this on my fork, taken from a project that currently uses it; the fork includes a contribution guide. It's the smallest image out there doing the same, at 86MB, and you should be able to build the container locally with different versions of node after pulling from the image repository.
In Node v4.2.1 I'm getting child deprecation warnings, which are failing the command line tests, and we would have to work out how to compile the drawingtotext binary, as I can't find much documentation other than making. This might be a separate container which generates the package and hosts it on github.
Let me know your thoughts!
Fork: https://github.com/sidhuko/textract
Github: https://github.com/sidhuko/docker-textract
Docker hub: https://hub.docker.com/r/sidhuko/textract/
Does it work? Because for me it does not.
$ textract 'test.pptx'
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable to extract DXFs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extract DOCs.
[Error: extract powerpoint, pptx, exec error: Error: stdout maxBuffer exceeded.]
I added a simple call to "textract(filePath, callback)" in my "app.js", like this:
var textract = require('textract');
var filePath = "examples/Cosmos.pdf";
textract(filePath, function( error, text )
{
    if (error)
    {
        console.log("%s", error);
    }
    else if (!text)
    {
        console.log("Error: no text received");
    }
    else
    {
        // Ignore punctuation for now...
        var terms = text.split(" ");
        console.log("terms found: #%d", terms.length);
    }
});
When running it via "node app" it reports that "Error: textract does not currently extract files of type [[ application/pdf ]]".
Reading the source I found that the extractor for PDFs was indeed there (under "lib/extractors/") so I added a "console.log()" to "registerExtractor(extractor)" in "lib/extract.js" and I found that the PDF extractor was loaded AFTER my call to "textract()" was "completed".
I rearranged my code as follows and it works (because I'm now waiting 5 seconds for the extractors to be loaded):
var delayedExtraction = function()
{
    textract(filePath, function( error, text )
    {
        if (error)
        {
            console.log("%s", error);
        }
        else if (!text)
        {
            console.log("Error: no text received");
        }
        else
        {
            // Ignore punctuation for now...
            var terms = text.split(" ");
            console.log("terms found: #%d", terms.length);
        }
    });
};
setTimeout(delayedExtraction, 5000);
I know this way it works, but I'd like textract to take care of this concurrency issue in a deterministic way ;-)
Thanks!
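One deterministic alternative to the 5 second timeout is a small queue: calls made before the extractors are registered are held back and replayed once loading finishes. This is a sketch of the general pattern only; `createReadyQueue`, `run` and `markReady` are invented names and nothing here is textract's real initialization code.

```javascript
// Hold tasks until markReady() is called, then run them in arrival order.
function createReadyQueue() {
  let ready = false;
  const pending = [];
  return {
    run: function (task) {
      if (ready) {
        task();
      } else {
        pending.push(task);
      }
    },
    markReady: function () {
      ready = true;
      while (pending.length > 0) {
        pending.shift()();
      }
    }
  };
}

// Usage: the first extraction is requested before loading has finished.
const queue = createReadyQueue();
const log = [];
queue.run(function () { log.push("extract A"); }); // queued, not run yet
log.push("extractors loaded");
queue.markReady();                                  // flushes "extract A"
queue.run(function () { log.push("extract B"); }); // runs immediately
console.log(log);
```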
Whenever I run the project I keep getting the following warnings:
textract: 'unzip' does not appear to be installed, so textract will be unable to extract DOCXs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extract DOCs.
I have properly installed the catdoc command and it works in the command prompt via the PATH environment variable.
Also, I am unable to install the unzip module, as there is no link provided for it, and I would like to know how to install it.
If anybody could provide some information on this I would be very thankful.
From here: #5 (comment)
This change has caused random spaces in the middle of words in the .docx files I've been using. It seems to be an issue when either the w:t tag has an attribute of xml-spacing="preserve" or the sibling to the w:t tag w:rPr has a child node of
Here you go:
https://docs.google.com/file/d/0Bxcbem1SSxNoaXRRazcwWG82Y1k/edit
the extracted text will be this:
this is a test docu ment that won t be extracted properly.
should be:
this is a test document that won't be extracted properly.
(the quote thing might be a little harder to fix than the space).
cleanseText
removes cyrillic letters.
The cause is that WHITELIST_PRESERVE_LINEBREAKS
and WHITELIST_STRIP_LINEBREAKS
will remove all unknown characters.
See RegEx with extended alphabet to match all unicode letters.
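As a sketch of the suggested direction, JavaScript regular expressions with the `u` flag support Unicode property escapes, so a whitelist can be written in terms of character categories instead of an ASCII-centred list. The pattern and the name `stripUnknownCharacters` are assumptions for illustration, not the contents of the WHITELIST constants mentioned above.

```javascript
// Match anything that is NOT a letter, combining mark, number, punctuation or whitespace.
const NOT_ALLOWED = /[^\p{L}\p{M}\p{N}\p{P}\s]/gu;

// Remove every character outside the allowed Unicode categories.
function stripUnknownCharacters(text) {
  return text.replace(NOT_ALLOWED, "");
}

// Cyrillic letters survive; the NUL control character is dropped.
console.log(stripUnknownCharacters("Привет, мир!\u0000 text"));
```

Symbols such as currency signs fall outside these categories and would be stripped too; adding `\p{S}` to the class would keep them.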
Is ODT support in the pipeline?
Also with docx files "preserveLineBreaks" does not seem to work.
How do I pass the language that should be used for ocr?
if ( extractor.test ) {
extractor.test();
}
return extractor;
I tried installing catdoc on osx 10.9.3 (for RTF support) using brew as well as from source, and for whatever reason it just does not want to play nice. What formats currently use catdoc? Are there pure-JS text extractors for those formats?
Hi David,
I'm getting this error when trying to textract a big .docx file (1.8MB). I tried increasing the maxBuffer setting by doing $ textract big-file.docx --exec.maxBuffer 512000
but it's not working (tried many values, but none seem to work).
Do you know a possible fix?
Thanks!
If your filename contains brackets, for instance "new doc(1).docx", the extraction fails (at least for docx files). Escaping the brackets won't work because then fs.exists on line 7 of index.js fails.
Many (most) extractors no longer need the file to be on disk in order to extract it. Would be nice to avoid that step.
Any plans on using node streams?
Ref #42
Is it possible to build an extractor for Excel (*.xlsx) files?
Any chance of getting an update on NPM so we can have the pptx extractor? I actually wrote a pptx extractor, then noticed you had done it already!
We've noticed that the pdf-text-extract npm module has been updated (now at 1.1.2).
This new version fixes some problems we have been having where warnings in the extraction process come back as errors and thus we do not get the extracted text.
Any chance we can get the package.json file updated to use 1.1.2 for pdf-text-extract?
Why does the pdf extractor ignore the options?
Error:
< 29 Mar 22:40:41 - error: [App] Error extracting [[ /XXX/Screen Shot 2014-03-06 at 14.43.23.png ]], exec error: Error: Command failed: read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23.png
< read_params_file: Can't open /YYY/node_modules/textract/lib/extractors/temp/Screen
< read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23
< Cannot open input file: /XXX
The problem is that the paths are not escaped before calling the tesseract command:
exec( "tesseract " + filePath + " " + fileTempOutPath + " quiet",
Will submit a pull request fixing the issue.
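One way to sidestep escaping altogether is to pass the paths as separate arguments with `child_process.execFile`, which does not go through a shell. The sketch below uses the synchronous variant and lets node itself stand in for tesseract, so it runs without tesseract installed; `runWithArgs` is an illustrative name and this is not necessarily what the pull request does.

```javascript
const { execFileSync } = require("child_process");

// Run a program with an argument array; no shell is involved, so spaces and
// brackets in file names reach the program unchanged.
function runWithArgs(command, args) {
  return execFileSync(command, args, { encoding: "utf8" });
}

// Demonstration: a tiny script that echoes back the arguments it received.
const echoScript = "console.log(JSON.stringify(process.argv.slice(1)))";
const received = JSON.parse(
  runWithArgs(process.execPath, [
    "-e",
    echoScript,
    "/XXX/Screen Shot 2014-03-06 at 14.43.23.png",
    "temp/new doc(1)"
  ])
);
console.log(received);

// The tesseract call could then look like this (same three arguments as the exec above):
// execFile("tesseract", [filePath, fileTempOutPath, "quiet"], callback);
```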
See #58
Because nothing is wrong other than textract won't be able to extract that type.
When I run a js file with node.js with the following content (for example with a .doc file):
var textract = require('textract');
textract.fromFileWithPath('test.doc', function( error, text ) {
if (error) throw error;
console.log(text);
})
With the .doc file, all Cyrillic symbols are unreadable (but when I run catdoc directly, I can read them),
and with a .docx file all Cyrillic symbols are removed.
I installed textract:
$ sudo npm install -g textract
Every invocation of textract seems to fail:
$ textract -h
fs.js:647
return binding.mkdir(pathModule._makeLong(path),
^
Error: EACCES, permission denied '/usr/local/lib/node_modules/textract/lib/extractors/temp'
at Object.fs.mkdirSync (fs.js:647:18)
at Object.<anonymous> (/usr/local/lib/node_modules/textract/lib/extractors/images.js:83:8)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.require (module.js:364:17)
at require (module.js:380:17)
at module.exports (/usr/local/lib/node_modules/textract/lib/extract.js:85:10)
at Array.map (native)
This happens on OSX because the module was installed as root but invoked as a normal user. On linux and osx the temp folder should probably be a proper temporary directory in a location like /tmp
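A minimal sketch of that suggestion, using Node's `os.tmpdir()` and `fs.mkdtempSync` so the directory is created somewhere the invoking user can write to. The `textract-` prefix and the helper name `createTempDir` are assumptions for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Create a uniquely named directory under the system temp location.
function createTempDir(prefix) {
  return fs.mkdtempSync(path.join(os.tmpdir(), prefix));
}

const tempDir = createTempDir("textract-");
console.log(tempDir);
fs.rmdirSync(tempDir); // remove the (still empty) directory again
```

Because the directory is created per process rather than inside the installed module, a global install done as root no longer matters.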
Receiving the following error when trying to get text from a simple docx. http://www.filedropper.com/testres
[Error: extractNewWordDocument exec error: Error: Command failed: [tests/testres.docx] End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive. note: tests/testres.docx may be a plain executable, not an archive ]
Do you support rtf? Should I be forcing a file type?
{ [Error: textract does not currently extract files of type [[ application/rtf ]]] typeNotFound: true }
Parsing a plain .txt http://www.filedropper.com/testres_1 I receive
*********************** C o u r i e r N e w
Here's the server that's trying to parse these files. Using express and node.js
exports.indexFile = function (req, res) {
  console.log(JSON.stringify(req.body));
  var path = req.body.path,
      ext = req.body.extension,
      ext = ext.toString().toLowerCase();
  if(ext == "pdf" || ext == "doc" || ext == "docx" || ext == "rtf" || ext == "txt") {
    textract(path, function(err, text) {
      console.log(err);
      console.log(text);
      res.send(text);
    });
  } else {
    res.send("File type not supported.");
  }
}
Please let me know asap.
EDIT: I forgot to close the document creator before uploading the files, resulting in a corrupted document. But the RTF question is still open.