dbashford / textract

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

License: MIT License

JavaScript 35.29% CSS 0.01% HTML 54.84% Rich Text Format 9.87%

extract-text extraction nodejs

textract's Issues

Exceeding buffer error

With larger docx files a "buffer exceeded" error is generated.

I got around this by modifying:
lib/extractors/docx.js

adding the following to the exec statement near the top of the file:
{maxBuffer: 50000*1024},

Ideally this could be a configurable parameter.

Cheers!
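One way the `maxBuffer` value could be made configurable is to merge a caller-supplied setting over a default before calling `child_process.exec`. This is only a sketch under assumptions, not textract's actual code: the `buildExecOptions` and `runCommand` helpers, the `options.exec` shape, and the default size are all hypothetical.

```javascript
const { exec } = require('child_process');

// Hypothetical default: the value used in the workaround above.
const DEFAULT_MAX_BUFFER = 50000 * 1024;

// Merge caller-supplied exec settings over the default (illustrative helper).
function buildExecOptions(options) {
  const userExec = (options && options.exec) || {};
  return Object.assign({ maxBuffer: DEFAULT_MAX_BUFFER }, userExec);
}

// Run a command with the merged options; maxBuffer is a documented exec option.
function runCommand(command, options, callback) {
  exec(command, buildExecOptions(options), function (error, stdout) {
    callback(error, stdout);
  });
}
```

A caller who needs more room could then pass something like `{ exec: { maxBuffer: 200 * 1024 * 1024 } }` without editing the extractor.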

How can I set options with language?

I want to use the language chi_sim.

Where can I set options?

Error: Cannot find module `ppt`

I've just made a deployment with the latest version of the lib (0.17) and get the following error in the log:

/graspeo/current/node_modules/mongoose/node_modules/mongodb/lib/mongodb/db.js:297
          throw err;
                ^
Error: Cannot find module 'ppt'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Function.cls_wrapMethod [as _load] (/graspeo/current/node_modules/newrelic/lib/shimmer.js:208:38)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/graspeo/current/node_modules/textract/lib/extractors/ppt.js:2:11)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)

Earlier today everything was fine so I assume this is because of the new release. Changing version back to 0.16 made things work again.

By the way, thanks for the great lib!

It looks like the options passed to other extractors are not used by the .docx extraction process. textract's APIs pass an empty string back to the callback for large .docx files (tested with a .docx of around 400 pages).

Remove extraneous white space

Get a lot of extractions that'll look something like this

some text                more text             some other text

No need for all the white space.

Capture catdoc existing differently

Analyze the output of catdoc __filename to see whether catdoc is installed but just can't find the file.

Reading files from S3

Hi David!

Trying to create an endpoint in an Express server like this:

app.get('/textract', function(req, res, next) {
  textract("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", function(error, text) {
    console.log(error);
    res.end();
  });
});

Console returns [Error: File at path [[ https://s3.amazonaws.com/testbucket1a2b3c/test.pdf ]] does not exist.]

What does this mean exactly? Textract only works with local files? (in this case my file is uploaded to S3). Thanks!

Support for more file formats

Hi David,

Do you have plans to update textract to support .ppt, .~~xlsx~~, .~~xltx~~, .~~potx~~, .key, .pages, ~~.xml~~? I'd also love to see support for OpenOffice file formats, like ~~.odt~~, ~~.ott~~, ~~.ods~~, ~~.ots~~, ~~.odg~~, ~~.otg~~, ~~.odp~~, ~~.otp~~.

Thanks!

Preserve newline behaviour

For me, the preserve newline behaviour isn't quite working as I expected (tested with the docx extractor).

I have text like this in a docx file:

2 downlighters; door to hall.

Hall
Double glazed window to front;

With preserveLineBreaks I get this output:

2 downlighters; door to hall. Hall
Double glazed window to front;

After outputting some stuff to the console I can see the newlines are there as expected but then they get parsed out.

Taking a look at how preserveLineBreaks is implemented, I see it's a big, hairy regex, so I'm not sure what it is doing at first glance. From my naive point of view it would be nicer to get the raw text output; if I need to filter further, I can make up my own mind. Or, if there were a 'clean' function as a configuration option, I could use it to override the default behaviour.
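A simpler cleaning step of the kind suggested above might look like the following. This is a sketch, not textract's implementation: `collapseWhitespace` is a hypothetical name, and capping blank lines at one is an assumption about the desired output.

```javascript
// Collapse runs of spaces and tabs; optionally keep line breaks.
function collapseWhitespace(text, preserveLineBreaks) {
  if (preserveLineBreaks) {
    return text
      .split(/\r?\n/)
      .map(function (line) { return line.replace(/[ \t]+/g, ' ').trim(); })
      .join('\n')
      .replace(/\n{3,}/g, '\n\n') // cap consecutive blank lines at one
      .trim();
  }
  // Without line breaks, every whitespace run becomes a single space.
  return text.replace(/\s+/g, ' ').trim();
}
```

With `preserveLineBreaks` set, the blank line between "door to hall." and "Hall" in the sample above would survive, while repeated spaces inside a line would still collapse.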

Error: extractNewWordDocument exec error: Error: stdout maxBuffer exceeded

$ textract some-file.docx 
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable to extract DXFs.
[Error: extractNewWordDocument exec error: Error: stdout maxBuffer exceeded.]

Any way to avoid this error? Or is it just something I'm doing wrong?
I don't need drawingtotext, just doc and docx, I guess?

Support for Buffer objects containing a base64 encoded string?

Use case: An Express API for taking DOC/PDF/DOCX and returning text.

Rather than uploading a file to the server and then having textract read that off the disk it would be preferable to take a base64 encoded DOC/PDF/DOCX file sent as a string in a POST request, put it in a buffer, and then have textract read that buffer.
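The decoding half of this request can be done with Node's built-in `Buffer`. The sketch below covers only that step; whether textract accepts the resulting buffer is exactly what this issue asks, so no textract call is shown. `decodeBase64Document` is a hypothetical helper name.

```javascript
// Decode a base64 string (for example, a field of a POST body) into a Buffer.
function decodeBase64Document(base64String) {
  return Buffer.from(base64String, 'base64');
}

// Round trip with a stand-in payload to show the shape of the data.
const encodedSample = Buffer.from('stand-in file contents').toString('base64');
const decodedSample = decodeBase64Document(encodedSample);
console.log(Buffer.isBuffer(decodedSample), decodedSample.length);
```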

Removes too much whitespace

I am finding that textract is removing all of the line breaks within a document. Commenting out cleanseText seemed to fix it but perhaps a better way would be to specify whether text is 'cleansed' with params?

Disable info text?

I don't have the DXF conversion software installed, so every time I do a text extraction I get an info warning saying INFO: 'drawingtotext' does not appear to be installed, so textract will be unable to extract DXFs. followed by the text of my document. Is there any way to disable this?

PPTX beyond 9 pages will end up out of order

Will end up with double digit pages showing up first.
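This symptom is consistent with slide files being sorted as strings, where "slide10.xml" sorts before "slide2.xml". Assuming that is the cause (the issue does not confirm it), a numeric sort on the slide number would restore the order. The helper names here are hypothetical.

```javascript
// Extract the numeric suffix from a name such as "ppt/slides/slide10.xml".
function slideNumber(name) {
  const match = /slide(\d+)\.xml$/.exec(name);
  // Names that do not match sort last.
  return match ? parseInt(match[1], 10) : Number.MAX_SAFE_INTEGER;
}

// Return a new array ordered by slide number rather than by string comparison.
function sortSlides(names) {
  return names.slice().sort(function (a, b) {
    return slideNumber(a) - slideNumber(b);
  });
}
```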

PPTX missing newlines, writes error messages to stdout

I took the test file and used powerpoint to save as an RTF file. Using textutil on OSX, I generated a baseline. Ideally, textract should produce the exact text:

$ textutil -convert txt layout_types_2011.rtf # creates layout_types_2011.txt
$ textract layout_types_2011.pptx 2>/dev/null >layout_types_2011.textract
$ diff layout_types_2011.txt layout_types_2011.textract

While the differences might be conscious decisions, it's worth clarifying:

A) the line "textract not ready, retrying in .5 seconds" is printed to stdout. This probably should be printed to stderr: https://github.com/dbashford/textract/blob/master/lib/extract.js#L72 should use console.error rather than console.log

B) Newlines are completely lost. For example, slide 10 reads

Who thought this would be a good idea?

Unfortunately the arrow keys act relative to the screen rather than the text

The entire input situation is confusing

but textract is writing

Who thought this would be a good idea? Unfortunately the arrow keys act relative to the screen rather than the text The entire input situation is confusing

C) The … character U+2026 is missing (is that intentional?)

Excessive memory usage?

We recently began to shard out our text extraction processes and I noticed a significant spike in memory usage. It looks like it's coming from this module. Running the following:

var textract = require('textract');
setInterval(function () {
  console.error(process.memoryUsage());
}, 1000);

Results in around 135 MB of memory being used. Comment out the first line and that shoots down to around 10 MB.

Any ideas what's causing this?

Look into yauzl for replacing requirement for unzip

Put checks in to warn if binaries are not in place

With at least docx words can end up smashed together.

2126150Microsoft Macintosh Word011falseW

Only seen this with docx, usually with things like complex footers.

docx files "preserveLineBreaks" does not seem to work.

Verify and fix

PPT Support?

Pre-2007 powerpoint

use node modules instead of external programs

For example:

pdf-text for PDF files
xlsjs for XLS files
xlsx for XLSX/XLSM/XLSB files

I'm sure more pure-JS parsers exist

Add CSV support

We have a requirement for CSV support in a project. Would it be useful to add this using a popular npm library, exposed through the same interface as textract?

I will be able to PR my work early next week.

Using a docker container for dependencies

I've quickly implemented this in my fork, based on a project that currently uses it; the fork includes a contribution guide. It's the smallest image out there doing the same job, at 86MB, and you should be able to build the container locally with different versions of node after pulling from the image repository.

In Node v4.2.1 I'm getting child deprecation warnings, which cause the command line tests to fail, and we would have to work out how to compile the drawingtotext binary, as I can't find much documentation other than running make. This might be a separate container which generates the package and hosts it on GitHub.

Let me know your thoughts!

Fork: https://github.com/sidhuko/textract
Github: https://github.com/sidhuko/docker-textract
Docker hub: https://hub.docker.com/r/sidhuko/textract/

PPTX support?

Does it work? Because for me it does not.

$ textract 'test.pptx'
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable
to extract DXFs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extr
act DOCs.
[Error: extract powerpoint, pptx, exec error: Error: stdout maxBuffer exceeded.]

textract function can be invoked before all extractors are loaded

I added a simple call to "textract(filePath, callback)" in my "app.js", like this:

    var textract = require('textract');
    var filePath = "examples/Cosmos.pdf";
    textract(filePath, function( error, text )
    {
        if (error)
        {
            console.log("%s", error);
        }
        else if (!text)
        {
            console.log("Error: no text received");
        }
        else
        {
            // Ignore punctuation for now...
            var terms = text.split(" ");
            console.log("terms found: #%d", terms.length);
        }
    });

When running it via "node app" it reports that "Error: textract does not currently extract files of type [[ application/pdf ]]".

Reading the source I found that the extractor for PDFs was indeed there (under "lib/extractors/") so I added a "console.log()" to "registerExtractor(extractor)" in "lib/extract.js" and I found that the PDF extractor was loaded AFTER my call to "textract()" was "completed".

I rearranged my code as follows and it works (because I'm now waiting 5 seconds for the extractors to be loaded):

var delayedExtraction = function()
{
    textract(filePath, function( error, text )
    {
        if (error)
        {
            console.log("%s", error);
        }
        else if (!text)
        {
            console.log("Error: no text received");
        }
        else
        {
            // Ignore punctuation for now...
            var terms = text.split(" ");
            console.log("terms found: #%d", terms.length);
        }
    });
};
setTimeout(delayedExtraction, 5000);

I know this way it works, but I'd like textract to take care of this concurrency issue in a deterministic way ;-)

Thanks!
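A deterministic alternative to the five-second timeout is to queue calls until loading has finished. The following is a generic sketch of that pattern, not textract's code: `createReadyGate`, `whenReady`, and `markReady` are hypothetical names.

```javascript
// Queue work until initialization finishes, then run it in arrival order.
function createReadyGate() {
  let ready = false;
  const pending = [];
  return {
    // Run the task now if ready, otherwise hold it.
    whenReady: function (task) {
      if (ready) { task(); } else { pending.push(task); }
    },
    // Called once after the last extractor has registered.
    markReady: function () {
      ready = true;
      while (pending.length > 0) { pending.shift()(); }
    }
  };
}
```

Under this design the library would wrap each extraction request in `whenReady` and call `markReady` once all extractors are registered, so callers would not need to know about load timing.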

Unable to extract text from doc and docx files

Whenever i run the project i keep getting the following warnings:

textract: 'unzip' does not appear to be installed, so textract will be unable to
extract DOCXs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extr
act DOCs.

I have properly installed the catdoc command and it works from the command prompt via the PATH environment variable.

Also, I am unable to install unzip, as no link is provided for it; I would like to know how to install it.

If anybody can provide some information on this I will be very thankful.

Some spaces showing up in the middle of words

From here: #5 (comment)

This change has caused random spaces in the middle of words in the .docx files I've been using. It seems to be an issue when either the w:t tag has an attribute of xml:space="preserve" or the sibling to the w:t tag w:rPr has a child node of

Here you go:
https://docs.google.com/file/d/0Bxcbem1SSxNoaXRRazcwWG82Y1k/edit
the extracted text will be this:
this is a test docu ment that won t be extracted properly.
should be:
this is a test document that won't be extracted properly.
(the quote thing might be a little harder to fix than the space).

support cyrillic

cleanseText removes cyrillic letters.

The cause is that WHITELIST_PRESERVE_LINEBREAKS and WHITELIST_STRIP_LINEBREAKS will remove all unknown characters.

See RegEx with extended alphabet to match all unicode letters.
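One way to avoid dropping non-Latin letters is a pattern built on Unicode property escapes, which Node supports with the `u` flag (Node 10 and later). This sketch is not the module's whitelist: the `DISALLOWED` pattern and `stripDisallowed` name are hypothetical, and the set of categories kept is an assumption.

```javascript
// Keep any Unicode letter, mark, number, punctuation, or whitespace; drop the rest.
const DISALLOWED = /[^\p{L}\p{M}\p{N}\p{P}\s]/gu;

function stripDisallowed(text) {
  return text.replace(DISALLOWED, '');
}

// Cyrillic text passes through unchanged, while control characters are removed.
console.log(stripDisallowed('Привет, мир!'));
```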

ODT Support

Is ODT support in the pipeline?

Also with docx files "preserveLineBreaks" does not seem to work.

lang parameter

How do I pass the language that should be used for ocr?

Extractor that fails test still registers

if ( extractor.test ) {
  extractor.test();
}
return extractor;
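The fragment above returns the extractor regardless of what `test` reports. A guarded version might look like this sketch, which assumes a synchronous `test()` that returns false or throws on failure; the real signature may differ, and `loadExtractor` and `loadAll` are hypothetical names.

```javascript
// Assumption: extractor.test() returns false or throws when its dependency is missing.
function loadExtractor(extractor) {
  if (typeof extractor.test === 'function') {
    try {
      if (extractor.test() === false) { return null; }
    } catch (err) {
      return null;
    }
  }
  return extractor;
}

// Keep only the extractors whose self-test passed (or that have no test).
function loadAll(extractors) {
  return extractors.map(loadExtractor).filter(Boolean);
}
```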

Consider replacing catdoc

I tried installing catdoc on osx 10.9.3 (for RTF support) using brew as well as from source, and for whatever reason it just does not want to play nice. What formats currently use catdoc? Are there pure-JS text extractors for those formats?

[Error: extract docx unzip exec error: Error: stdout maxBuffer exceeded.]

Hi David,

I'm getting this error when trying to textract a big .docx file (1.8MB). I tried increasing the maxBuffer setting by doing $ textract big-file.docx --exec.maxBuffer 512000 but it's not working (tried many values, but none seem to work).

Do you know a possible fix?

Thanks!

Filenames with round brackets "(" or ")" break the extraction process

If your filename contains brackets, for instance "new doc(1).docx", the extraction fails (at least for docx files). Escaping the brackets won't work, because then fs.exists on line 7 of index.js fails.

Add ability to optionally write file to disk

Many (most) extractors now do not need to be on disk to be extracted. Would be nice to avoid that step.

Streams?

Any plans on using node streams?

Add support for .key, .pages

Ref #42

xlsx extractor?

Is it possible to build an extractor for Excel (*.xlsx) files?

Update NPM?

Any chance of getting an update on NPM so we can have the pptx extractor? I actually wrote a pptx extractor, then noticed you had done it already!

pdf-to-text version upgrade

We've noticed that the pdf-text-extract npm module has been updated (now at 1.1.2).

This new version fixes some problems we have been having where warnings in the extraction process come back as errors and thus we do not get the extracted text.

Any chance we can get the package.json file updated to use 1.1.2 for pdf-text-extract?

PDF extractor options are ignored

Why does the pdf extractor ignore the options?

A bug when extracting from an image with tesseract

Error:

< 29 Mar 22:40:41 - error: [App] Error extracting [[ /XXX/Screen Shot 2014-03-06 at 14.43.23.png ]], exec error: Error: Command failed: read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23.png
< read_params_file: Can't open /YYY/node_modules/textract/lib/extractors/temp/Screen
< read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23
< Cannot open input file: /XXX

The problem is that the paths are not escaped before calling the tesseract command:

  exec( "tesseract " + filePath + " " + fileTempOutPath + " quiet",

Will submit a pull request fixing the issue.
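One possible fix, which may or may not match the pull request mentioned above, is to avoid the shell entirely by passing the paths as separate arguments to `child_process.execFile`. The `tesseractArgs` and `runTesseract` helpers are hypothetical; the `quiet` argument is taken from the command shown above.

```javascript
const { execFile } = require('child_process');

// Build the argument list; each path stays a single argument, spaces and all.
function tesseractArgs(filePath, fileTempOutPath) {
  return [filePath, fileTempOutPath, 'quiet'];
}

function runTesseract(filePath, fileTempOutPath, callback) {
  // execFile does not spawn a shell by default, so no escaping is needed.
  execFile('tesseract', tesseractArgs(filePath, fileTempOutPath), callback);
}
```

Because no shell parses the command line, a name such as "Screen Shot 2014-03-06 at 14.43.23.png" is not split on spaces.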

Determine if replaceTextChars is still necessary and remove if not

See #58

Add support for speech to text via google API

Make messages about failed extractors clear "Info" messages.

Because nothing is wrong other than textract won't be able to extract that type.

Problems with cyrillic symbols

When I run a js file with the following content in node.js (for example, with a .doc file):

var textract = require('textract');

textract.fromFileWithPath('test.doc', function( error, text ) {
  if (error) throw error;
  console.log(text);
})

with a .doc file, all Cyrillic symbols are unreadable (but when I run catdoc directly, I can read them),
and with a .docx file all Cyrillic symbols are removed.

make the temp folder in an actual temp location

I installed textract:

$ sudo npm install -g textract

Every invocation of textract seems to fail:

$ textract -h

fs.js:647
  return binding.mkdir(pathModule._makeLong(path),
                 ^
Error: EACCES, permission denied '/usr/local/lib/node_modules/textract/lib/extractors/temp'
    at Object.fs.mkdirSync (fs.js:647:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/textract/lib/extractors/images.js:83:8)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at module.exports (/usr/local/lib/node_modules/textract/lib/extract.js:85:10)
    at Array.map (native)

This happens on OSX because the module was installed as root but invoked as a normal user. On linux and osx the temp folder should probably be a proper temporary directory in a location like /tmp
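Node's standard library can supply such a location. A minimal sketch, assuming a `textract-` prefix for the directory name (the prefix and the `makeTempDir` helper are hypothetical):

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Create a unique directory under the system temp location for the current user.
function makeTempDir() {
  return fs.mkdtempSync(path.join(os.tmpdir(), 'textract-'));
}
```

Because the directory is created at run time by the invoking user, it does not depend on who ran `npm install -g`.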

Parsing issues.

Receiving the following error when trying get text from simple docx. http://www.filedropper.com/testres
[Error: extractNewWordDocument exec error: Error: Command failed: [tests/testres.docx] End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive. note: tests/testres.docx may be a plain executable, not an archive ]

Do you support rtf? Should I be forcing a file type?
{ [Error: textract does not currently extract files of type [[ application/rtf ]]] typeNotFound: true }

~~Parsing a plain .txt http://www.filedropper.com/testres_1 I receive~~

~~*********************** C o u r i e r N e w~~

Here's the server thats trying to parse these files. Using express and node.js

exports.indexFile = function (req, res) {
  console.log(JSON.stringify(req.body));
  var path = req.body.path,
      ext = req.body.extension.toString().toLowerCase();
  if (ext == "pdf" || ext == "doc" || ext == "docx" || ext == "rtf" || ext == "txt") {
    textract(path, function(err, text) {
      console.log(err);
      console.log(text);
      res.send(text);
    });
  } else {
    res.send("File type not supported.");
  }
}

Please let me know asap.

EDIT: I forgot to close the document creator before uploading the files, resulting in a corrupted document. But the RTF question is still open.
