
dbashford / textract


node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

License: MIT License



textract's Issues

Exceeding buffer error

With larger docx files, a buffer exceeded error is generated.

I got around this by modifying:
lib/extractors/docx.js

adding the following to the exec statement near the top of the file:
{maxBuffer: 50000*1024},

Ideally this could be a configurable parameter.

Cheers!
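
One way the configurable parameter requested above could look is to merge a caller-supplied limit into the options handed to `child_process.exec`. This is only a sketch: the `options.exec` shape and the helper names `buildExecOptions` and `runExtractor` are hypothetical and not part of textract. Only the `maxBuffer` option itself comes from Node.

```javascript
const { exec } = require('child_process');

// Default taken from the workaround value quoted above (50000 * 1024 bytes).
const DEFAULT_MAX_BUFFER = 50000 * 1024;

// Build the options object for child_process.exec, letting the caller
// override maxBuffer. `options.exec` is an assumed shape for illustration.
function buildExecOptions(options) {
  const userExec = (options && options.exec) || {};
  return Object.assign({ maxBuffer: DEFAULT_MAX_BUFFER }, userExec);
}

// Run a shell command with the merged options and hand stdout to the callback.
function runExtractor(command, options, callback) {
  exec(command, buildExecOptions(options), function (error, stdout) {
    if (error) {
      return callback(error);
    }
    callback(null, stdout);
  });
}
```

A caller who needs more room could then pass something like `{ exec: { maxBuffer: 200 * 1024 * 1024 } }` without editing the extractor.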

Error: Cannot find module `ppt`

I've just made a deployment with the latest version of the lib (0.17) and get the following error in the log:

/graspeo/current/node_modules/mongoose/node_modules/mongodb/lib/mongodb/db.js:297
          throw err;
                ^
Error: Cannot find module 'ppt'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Function.cls_wrapMethod [as _load] (/graspeo/current/node_modules/newrelic/lib/shimmer.js:208:38)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/graspeo/current/node_modules/textract/lib/extractors/ppt.js:2:11)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)

Earlier today everything was fine so I assume this is because of the new release. Changing version back to 0.16 made things work again.

By the way, thanks for the great lib!

.docx extractor options

It looks like the options passed to other extractors are not used by the .docx extraction process. textract's APIs pass an empty string back to the callback for large .docx files (tested with a .docx of around 400 pages).

Remove extraneous white space

I get a lot of extractions that look something like this:

some text                more text             some other text

No need for all the white space.
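
As a rough illustration of the cleanup being asked for, runs of spaces and tabs can be collapsed after extraction. The helper name `collapseSpaces` is made up for this sketch and is not a textract function:

```javascript
// Collapse runs of spaces and tabs into a single space and trim each line.
// Line breaks are left alone so the caller can decide what to do with them.
function collapseSpaces(text) {
  return text
    .split('\n')
    .map(function (line) {
      return line.replace(/[ \t]+/g, ' ').trim();
    })
    .join('\n');
}

console.log(collapseSpaces('some text                more text             some other text'));
// prints "some text more text some other text"
```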

Reading files from S3

Hi David!

Trying to create an endpoint in an Express server like this:

app.get('/textract', function(req, res, next) {
  textract("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", function(error, text) {
    console.log(error);
    res.end();
  });
});

Console returns [Error: File at path [[ https://s3.amazonaws.com/testbucket1a2b3c/test.pdf ]] does not exist.]

What does this mean exactly? Does textract only work with local files? (In this case my file is uploaded to S3.) Thanks!
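
Since the error suggests the argument is treated as a local path, one possible workaround is to download the file to a temporary location first and pass that path on. The sketch below uses only Node's built-in modules. The helper names `tempPathForUrl` and `downloadToTemp` are hypothetical. Keeping the file extension is an assumption, made in case type detection relies on it.

```javascript
const fs = require('fs');
const https = require('https');
const os = require('os');
const path = require('path');

// Derive a temp file path that keeps the remote file's extension.
function tempPathForUrl(fileUrl) {
  const ext = path.extname(new URL(fileUrl).pathname);
  return path.join(os.tmpdir(), 'download-' + Date.now() + ext);
}

// Download an https URL to a temp file and call back with the local path.
function downloadToTemp(fileUrl, callback) {
  const target = tempPathForUrl(fileUrl);
  https.get(fileUrl, function (response) {
    if (response.statusCode !== 200) {
      response.resume(); // discard the body so the socket is released
      return callback(new Error('Unexpected status ' + response.statusCode));
    }
    const out = fs.createWriteStream(target);
    response.pipe(out);
    out.on('finish', function () { callback(null, target); });
    out.on('error', callback);
  }).on('error', callback);
}
```

The path handed to the callback could then be passed to textract in place of the URL, and the temp file deleted once extraction finishes.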

Support for more file formats

Hi David,

Do you have plans to update textract to support .ppt, .xlsx, .xltx, .potx, .key, .pages, .xml? I'd also love to see support for OpenOffice file formats, like .odt, .ott, .ods, .ots, .odg, .otg, .odp, .otp.

Thanks!

Preserve newline behaviour

For me, the preserve newline behaviour isn't quite working as I expected (tested with the docx extractor).

I have text like this in a docx file:

2 downlighters; door to hall.

Hall
Double glazed window to front;

With preserveLineBreaks I get this output:

2 downlighters; door to hall. Hall
Double glazed window to front;

After outputting some stuff to the console I can see the newlines are there as expected but then they get parsed out.

Taking a look at how preserveLineBreaks is implemented, I see it's a big, hairy regex, so I'm not sure what it is doing at first glance. From my naive point of view it would be nicer to get the raw text output; if I need to filter further, I can make up my own mind. Or, if there were a 'clean' function as a configuration option, I could use it to override the default behaviour.
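
The 'clean' override suggested above might be sketched as follows. The option name `options.clean` and the helpers `defaultClean` and `finalizeText` are hypothetical, and the default cleaner here is a simple stand-in, not textract's actual regex:

```javascript
// Stand-in default cleaner: collapse spaces and tabs but keep every line break.
function defaultClean(text) {
  return text.replace(/[ \t]+/g, ' ');
}

// Use the caller's cleaner when one is supplied; `options.clean` is a
// hypothetical option name used only for this illustration.
function finalizeText(rawText, options) {
  const clean = options && typeof options.clean === 'function'
    ? options.clean
    : defaultClean;
  return clean(rawText);
}

// Passing an identity function returns the extractor output untouched.
const raw = '2 downlighters; door to hall.\n\nHall\nDouble glazed window to front;';
console.log(finalizeText(raw, { clean: function (text) { return text; } }) === raw);
// prints true
```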

Error: extractNewWordDocument exec error: Error: stdout maxBuffer exceeded

$ textract some-file.docx 
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable to extract DXFs.
[Error: extractNewWordDocument exec error: Error: stdout maxBuffer exceeded.]

Is there any way to avoid this error? Or is it just something I'm doing wrong?
I don't need drawingtotext, just doc and docx support, I guess?

Support for Buffer objects containing a base64 encoded string?

Use case: An Express API for taking DOC/PDF/DOCX and returning text.

Rather than uploading a file to the server and then having textract read that off the disk it would be preferable to take a base64 encoded DOC/PDF/DOCX file sent as a string in a POST request, put it in a buffer, and then have textract read that buffer.
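
Until buffers are accepted directly, the decoding half of this use case can be done with Node's `Buffer`, with a temp file bridging the gap to a path-based extractor. A minimal sketch; the helper names `decodeBase64Document` and `writeTempDocument` and the `upload-` prefix are assumptions:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Decode a base64 string (for example, a field from a POST body) into a Buffer.
function decodeBase64Document(encoded) {
  return Buffer.from(encoded, 'base64');
}

// Write the bytes into a freshly created temp directory and return the file
// path, so a path-based extractor can read it. The caller supplies the extension.
function writeTempDocument(buffer, extension) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'upload-'));
  const filePath = path.join(dir, 'document' + extension);
  fs.writeFileSync(filePath, buffer);
  return filePath;
}
```

This still touches the disk, so it is a stopgap rather than the buffer support the issue asks for.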

Removes too much whitespace

I am finding that textract is removing all of the line breaks within a document. Commenting out cleanseText seemed to fix it but perhaps a better way would be to specify whether text is 'cleansed' with params?

Disable info text?

I don't have the DXF conversion software installed, so every time I do a text extraction I get info warning text saying INFO: 'drawingtotext' does not appear to be installed, so textract will be unable to extract DXFs. and then the text of my document after it. Is there any way to disable this?

PPTX missing newlines, writes error messages to stdout

I took the test file and used powerpoint to save as an RTF file. Using textutil on OSX, I generated a baseline. Ideally, textract should produce the exact text:

$ textutil -convert txt layout_types_2011.rtf # creates layout_types_2011.txt
$ textract layout_types_2011.pptx 2>/dev/null >layout_types_2011.textract
$ diff layout_types_2011.txt layout_types_2011.textract

While the differences might be conscious decisions, it's worth clarifying:

A) the line "textract not ready, retrying in .5 seconds" is printed to stdout. This probably should be printed to stderr: https://github.com/dbashford/textract/blob/master/lib/extract.js#L72 should use console.error rather than console.log

B) Newlines are completely lost. For example, slide 10 reads

Who thought this would be a good idea?

Unfortunately the arrow keys act relative to the screen rather than the text

The entire input situation is confusing

but textract is writing

Who thought this would be a good idea? Unfortunately the arrow keys act relative to the screen rather than the text The entire input situation is confusing

C) The … character U+2026 is missing (is that intentional?)

Excessive memory usage?

We recently began to shard out our text extraction processes and I noticed a significant spike in memory usage. It looks like it's coming from this module. Running the following:

var textract = require('textract');
setInterval(function () {
  console.error(process.memoryUsage());
}, 1000);

Results in around 135 MB of memory being used. Comment out the first line and that shoots down to around 10 MB.

Any ideas what's causing this?

Add CSV support

We have a requirement for CSV support in a project. Would it be useful to add this using a popular npm library behind the same interface as textract?

I will be able to PR my work early next week.

Using a docker container for dependencies

I've quickly implemented this on my fork, based on a project that currently uses it; the fork also has a contribution guide. It's the smallest image out there doing the same job, at 86MB, and you should be able to build the container locally with different versions of node after pulling from the image repository.

In Node v4.2.1 I'm getting child deprecation warnings, which are failing the command line tests, and we would have to work out how to compile the drawingtotext binary as I can't find much documentation other than making. This might be a separate container which generates the package and hosts it on github.

Let me know your thoughts!

Fork: https://github.com/sidhuko/textract
Github: https://github.com/sidhuko/docker-textract
Docker hub: https://hub.docker.com/r/sidhuko/textract/

PPTX support?

Does it work? Because for me it does not.

$ textract 'test.pptx'
textract not ready, retrying in .5 seconds
textract: 'drawingtotext' does not appear to be installed, so it will be unable to extract DXFs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extract DOCs.
[Error: extract powerpoint, pptx, exec error: Error: stdout maxBuffer exceeded.]

textract function can be invoked before all extractors are loaded

I added a simple call to "textract(filePath, callback)" in my "app.js", like this:

    var textract = require('textract');
    var filePath = "examples/Cosmos.pdf";
    textract(filePath, function( error, text )
    {
        if (error)
        {
            console.log("%s", error);
        }
        else if (!text)
        {
            console.log("Error: no text received");
        }
        else
        {
            // Ignore punctuation for now...
            var terms = text.split(" ");
            console.log("terms found: #%d", terms.length);
        }
    });

When running it via "node app" it reports that "Error: textract does not currently extract files of type [[ application/pdf ]]".

Reading the source I found that the extractor for PDFs was indeed there (under "lib/extractors/") so I added a "console.log()" to "registerExtractor(extractor)" in "lib/extract.js" and I found that the PDF extractor was loaded AFTER my call to "textract()" was "completed".

I rearranged my code as follows and it works (because I'm now waiting 5 seconds for the extractors to be loaded):

var delayedExtraction = function()
{
    textract(filePath, function( error, text )
    {
        if (error)
        {
            console.log("%s", error);
        }
        else if (!text)
        {
            console.log("Error: no text received");
        }
        else
        {
            // Ignore punctuation for now...
            var terms = text.split(" ");
            console.log("terms found: #%d", terms.length);
        }
    });
};
setTimeout(delayedExtraction, 5000);

I know this way it works, but I'd like textract to take care of this concurrency issue in a deterministic way ;-)

Thanks!
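
A deterministic alternative to the 5-second timeout is to queue calls until the extractors have finished registering. The sketch below shows the general pattern only. `createReadyQueue`, `run` and `markReady` are hypothetical names, not how textract is implemented.

```javascript
// Create a queue that holds tasks until markReady() is called, then runs
// them in order. Tasks submitted after that point run immediately.
function createReadyQueue() {
  let ready = false;
  const pending = [];
  return {
    run: function (task) {
      if (ready) {
        task();
      } else {
        pending.push(task);
      }
    },
    markReady: function () {
      ready = true;
      while (pending.length > 0) {
        pending.shift()();
      }
    }
  };
}

// Usage: an early call is deferred instead of failing.
const queue = createReadyQueue();
const log = [];
queue.run(function () { log.push('extract Cosmos.pdf'); });
log.push('extractors loaded');
queue.markReady();
console.log(log.join(' -> '));
// prints "extractors loaded -> extract Cosmos.pdf"
```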

Unable to extract text from doc and docx files

Whenever I run the project I keep getting the following warnings:

textract: 'unzip' does not appear to be installed, so textract will be unable to extract DOCXs.
textract: 'catdoc' does not appear to be installed, so it will be unable to extract DOCs.

I have properly installed the catdoc command and it works in the command prompt via the PATH environment variable.

Also, I am unable to install the unzip module, as no link is provided for it, and I would like to know how to install it.

If anybody could provide some information on this, I would be very thankful.

Some spaces showing up in the middle of words

From here: #5 (comment)

This change has caused random spaces in the middle of words in the .docx files I've been using. It seems to be an issue when either the w:t tag has an attribute of xml-spacing="preserve" or the sibling to the w:t tag w:rPr has a child node of

Here you go:
https://docs.google.com/file/d/0Bxcbem1SSxNoaXRRazcwWG82Y1k/edit
the extracted text will be this:
this is a test docu ment that won t be extracted properly.
should be:
this is a test document that won't be extracted properly.
(the quote thing might be a little harder to fix than the space).

support cyrillic

cleanseText removes cyrillic letters.

The cause is that WHITELIST_PRESERVE_LINEBREAKS and WHITELIST_STRIP_LINEBREAKS will remove all unknown characters.

See RegEx with extended alphabet to match all unicode letters.
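
For illustration, a whitelist built on Unicode property escapes keeps letters from any script. This is a sketch of the general technique, not the library's `WHITELIST_*` expressions. The function name `stripUnknown` is made up, and the `u` flag with `\p{...}` needs a reasonably recent Node version.

```javascript
// Remove every character that is not a letter, number, punctuation mark or
// whitespace in any script. \p{L}, \p{N} and \p{P} are Unicode general
// categories, so Cyrillic letters pass through unchanged.
function stripUnknown(text) {
  return text.replace(/[^\p{L}\p{N}\p{P}\s]/gu, '');
}

console.log(stripUnknown('Привет, мир! Hello.'));
// prints "Привет, мир! Hello."
```

Symbols such as currency signs fall outside these three categories and would be stripped. Adding `\p{S}` to the class would keep them.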

ODT Support

Is ODT support in the pipeline?

Also with docx files "preserveLineBreaks" does not seem to work.

lang parameter

How do I pass the language that should be used for ocr?

Consider replacing catdoc

I tried installing catdoc on osx 10.9.3 (for RTF support) using brew as well as from source, and for whatever reason it just does not want to play nice. What formats currently use catdoc? Are there pure-JS text extractors for those formats?

Streams?

Any plans on using node streams?

xlsx extractor?

Is it possible to build an extractor for Excel (*.xlsx) files?

Update NPM?

Any chance of getting an update on NPM so we can have the pptx extractor? I actually wrote a pptx extractor, then noticed you had done it already!

pdf-to-text version upgrade

We've noticed that the pdf-text-extract npm module has been updated (now at 1.1.2).

This new version fixes some problems we have been having where warnings in the extraction process come back as errors and thus we do not get the extracted text.

Any chance we can get the package.json file updated to use 1.1.2 for pdf-text-extract?

A bug when extracting from an image with tesseract

Error:

< 29 Mar 22:40:41 - error: [App] Error extracting [[ /XXX/Screen Shot 2014-03-06 at 14.43.23.png ]], exec error: Error: Command failed: read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23.png
< read_params_file: Can't open /YYY/node_modules/textract/lib/extractors/temp/Screen
< read_params_file: Can't open Shot
< read_params_file: Can't open 2014-03-06
< read_params_file: Can't open at
< read_params_file: Can't open 14.43.23
< Cannot open input file: /XXX

The problem is that the paths are not escaped before calling the tesseract command:

  exec( "tesseract " + filePath + " " + fileTempOutPath + " quiet",

Will submit a pull request fixing the issue.
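
An alternative to escaping the paths by hand is to avoid the shell altogether: `child_process.execFile` takes its arguments as an array, so a path containing spaces stays a single argument. This is a different technique from the escaping fix mentioned above, shown only as a sketch; the helper names `buildTesseractArgs` and `runTesseract` are hypothetical:

```javascript
const { execFile } = require('child_process');

// Keep each path as its own array element; no quoting is needed because
// execFile does not pass the arguments through a shell.
function buildTesseractArgs(filePath, fileTempOutPath) {
  return [filePath, fileTempOutPath, 'quiet'];
}

// Invoke tesseract with the same arguments as the concatenated command above.
function runTesseract(filePath, fileTempOutPath, callback) {
  execFile('tesseract', buildTesseractArgs(filePath, fileTempOutPath), function (error, stdout, stderr) {
    callback(error, stdout, stderr);
  });
}
```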

Problems with cyrillic symbols

When I run a js file with the following content in node.js (for example, with a .doc file):

var textract = require('textract');

textract.fromFileWithPath('test.doc', function( error, text ) {
  if (error) throw error;
  console.log(text);
})

with a .doc file, all Cyrillic symbols are unreadable (but when I run catdoc directly, I can read them), and with a .docx file all Cyrillic symbols are removed.

make the temp folder in an actual temp location

I installed textract:

$ sudo npm install -g textract

Every invocation of textract seems to fail:

$ textract -h

fs.js:647
  return binding.mkdir(pathModule._makeLong(path),
                 ^
Error: EACCES, permission denied '/usr/local/lib/node_modules/textract/lib/extractors/temp'
    at Object.fs.mkdirSync (fs.js:647:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/textract/lib/extractors/images.js:83:8)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at module.exports (/usr/local/lib/node_modules/textract/lib/extract.js:85:10)
    at Array.map (native)

This happens on OSX because the module was installed as root but invoked as a normal user. On linux and osx the temp folder should probably be a proper temporary directory in a location like /tmp
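
A minimal sketch of that suggestion, using Node's `os.tmpdir()` so the directory is writable by whoever runs the command. The helper name `ensureTempDir` and the `textract-temp` folder name are assumptions, and the `recursive` option of `mkdirSync` needs a reasonably recent Node version.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Resolve a working directory under the OS temp location instead of inside
// the installed package. The 'textract-temp' name is an assumption.
function ensureTempDir() {
  const dir = path.join(os.tmpdir(), 'textract-temp');
  fs.mkdirSync(dir, { recursive: true }); // does not throw if it already exists
  return dir;
}

console.log(ensureTempDir());
```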

Parsing issues.

I'm receiving the following error when trying to get text from a simple docx: http://www.filedropper.com/testres
[Error: extractNewWordDocument exec error: Error: Command failed: [tests/testres.docx] End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive. note: tests/testres.docx may be a plain executable, not an archive ]

Do you support rtf? Should I be forcing a file type?
{ [Error: textract does not currently extract files of type [[ application/rtf ]]] typeNotFound: true }

Parsing a plain .txt http://www.filedropper.com/testres_1 I receive

*********************** C o u r i e r N e w

Here's the server that's trying to parse these files, using Express and node.js:

exports.indexFile = function (req, res) {
  console.log(JSON.stringify(req.body));
  var path = req.body.path,
      ext = req.body.extension.toString().toLowerCase();
  if (ext == "pdf" || ext == "doc" || ext == "docx" || ext == "rtf" || ext == "txt") {
    textract(path, function(err, text) {
      console.log(err);
      console.log(text);
      res.send(text);
    });
  } else {
    res.send("File type not supported.");
  }
}

Please let me know asap.

EDIT: I forgot to close the document creator before uploading the files, resulting in a corrupted document. But the RTF question is still open.
