
Comments (5)

deanmalmgren commented on May 28, 2024

Hmmm... Interesting idea. I really appreciate the suggestion.

At the moment I'm leaning toward not incorporating this into textract. Considering how easy it is to do something like this from the command line with something like:

#!/bin/bash
# -print0 with read -d '' keeps file names containing spaces intact
find /path/to/some/directory -name '*.html' -print0 |
while IFS= read -r -d '' filename; do
    textract "$filename" >> output.txt
done

or to do this natively in Python with something like glob2, it seems a bit unnecessary to bake this into textract. The goal of this package is to streamline the interface for extracting the raw text from any document type, and I'd like to keep it as simple as possible while achieving that goal.
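For anyone who wants the Python route, here is a minimal sketch of the same batch job. It is an illustration, not something textract ships: the helper name `extract_all` and its `extract` parameter are made up for this example, and it uses the standard library's `glob` with `recursive=True` in place of glob2. By default it calls `textract.process`, which is imported lazily inside the function.

```python
import glob
import os


def extract_all(root, pattern="*.html", extract=None):
    """Return {path: extracted text} for every file under root matching pattern.

    `extract_all` and the `extract` hook are hypothetical names for this sketch.
    """
    if extract is None:
        # Imported lazily so the helper can be tried out with a stand-in extractor.
        import textract
        extract = textract.process
    # "**" combined with recursive=True descends into subdirectories, like `find`.
    paths = sorted(glob.glob(os.path.join(root, "**", pattern), recursive=True))
    # Each path already includes `root`, so it can be opened from any working directory.
    return {path: extract(path) for path in paths}
```

Calling `extract_all('/path/to/some/directory')` and writing each value to `output.txt` would roughly mirror the shell loop; writing each value to its own file gives one output per input instead.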

I'll keep this issue open for a while in case others would like to comment on this concept, share other use cases where this would be helpful, or have other ideas for implementation.

from textract.

ShawnMilo commented on May 28, 2024

In the spirit of the Unix philosophy, I agree with @deanmalmgren on this one. A program should do only one thing, and it is preferable to chain commands together rather than add non-essential features to them.


MalikRumi commented on May 28, 2024

Ok, I am going to ask a naive question here, and I hope you don't mind enlightening me. I tried your script in a Python for loop, and was surprised to find I couldn't make it work. That is what led me here. It is one thing to say some sort of internal for loop is 'extra', I get that, but why doesn't it work in a regular Python for loop? That I don't get. Of course, it is entirely possible I just did it wrong. Nah, that can't be it. But your bash loop does work. Same for all the output going to a single file, instead of one output file for each input file, but that part I was able to figure out. Thanks for sharing your insight, wisdom and experience with me!


deanmalmgren commented on May 28, 2024

@MalikRumi can you provide an example? A Python for loop should work just fine...


MalikRumi commented on May 28, 2024

`
from os import listdir, environ
import textract
import django
environ['DJANGO_SETTINGS_MODULE'] = 'chronicle.settings'
django.setup()
from ktab.models import Entry

path = '/home/malikarumi/010417_odt_tests/'
filenames = listdir(path)

for filename in filenames:
    text = textract.process(filename, encoding='utf_8')
    text.write(Entry.objects.create(
        title=filename, content=text, chron_date='2018-01-05',
        clock='23:59:59', tag__tag='tagg'))
    text.save()
`
The code backticks seem not to be working for me.

(lifeandtimes) malikarumi@Tetuoan2:~/Projects/lifeandtimes/chronicle$ python django_textract_2.py
Traceback (most recent call last):
  File "django_textract_2.py", line 15, in <module>
    text = textract.process(filename, encoding='utf_8')
  File "/home/malikarumi/Projects/lifeandtimes/lib/python3.6/site-packages/textract/parsers/__init__.py", line 39, in process
    raise exceptions.MissingFileError(filename)
textract.exceptions.MissingFileError: The file "2018-01-01_psycopg2-error-at-or-near.odt" can not be found.
Is this the right path/to/file/you/want/to/extract.odt?

Now, if the file can't be found, how does python know the name of it? This script uses variables for file names in the expectation that it will iterate over all of them. I don't know what additional change I am supposed to make so that textract / Python can 'see' the file.

Note the tag insert should be changed to be directly into the Tag model, not into Entry.

Thanks.
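A likely explanation, offered as a sketch rather than a confirmed diagnosis from this thread: `os.listdir` returns bare file names without the directory, so Python knows the name, but a bare name is looked up relative to the current working directory, not relative to `path`. Joining the directory back on with `os.path.join` gives a path that can actually be opened. The helper `full_paths` and the file name `example.odt` below are hypothetical, and the demo uses a temporary directory so it runs anywhere.

```python
import os
import tempfile


def full_paths(directory):
    """Return the full path of every entry in `directory`, sorted by name."""
    # os.listdir yields bare names such as "example.odt", so each one is joined
    # back onto the directory before being handed to anything that opens it.
    return [os.path.join(directory, name) for name in sorted(os.listdir(directory))]


# Demonstration with a throwaway directory and a hypothetical file name.
with tempfile.TemporaryDirectory() as directory:
    with open(os.path.join(directory, "example.odt"), "w") as handle:
        handle.write("placeholder")

    bare_name = os.listdir(directory)[0]
    joined = full_paths(directory)[0]

    # The bare name is resolved against the current working directory.
    bare_found = os.path.exists(bare_name)
    # The joined path points inside `directory`, wherever the script is run from.
    joined_found = os.path.exists(joined)
    print(bare_name, bare_found)
    print(joined, joined_found)
```

In the script above, that would mean passing `os.path.join(path, filename)` to `textract.process` instead of `filename` alone.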

