As far as I can see, a directory is not an option when choosing a file.

Textract should allow directories as a supported file type

deanmalmgren commented on May 28, 2024

Textract should allow directories as a supported file type

Comments (5)

deanmalmgren commented on May 28, 2024

Hmmm... Interesting idea. I really appreciate the suggestion.

At the moment I'm leaning toward not incorporating this into textract. Considering how easy it is to do this from the command line with something like:

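#!/bin/bash
# Append the extracted text of every HTML file under the directory to output.txt.
# NUL-delimited names and quoting keep paths with spaces intact.
find /path/to/some/directory -name '*.html' -print0 |
while IFS= read -r -d '' filename; do
    textract "$filename" >> output.txt
done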

or to do this natively in Python with something like glob2, it seems a bit unnecessary to bake this into textract. The goal of this package is to streamline the interface for extracting the raw text from any document type, and I'd like to keep it as simple as possible while achieving that goal.
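
For the Python route, a minimal sketch might look like the following. It uses the standard library's pathlib for the directory walk rather than glob2. The helper name `extract_directory` and its `extract` parameter are hypothetical, introduced only for illustration; when no extractor is given it falls back to `textract.process`.

```python
from pathlib import Path


def extract_directory(directory, pattern="*.html", extract=None):
    """Yield (path, text) for each file under `directory` matching `pattern`.

    `extract` is any callable that takes a path string and returns the text.
    It is a hypothetical hook for this sketch; the default is textract.process.
    """
    if extract is None:
        # Imported lazily so the helper can be tried without textract installed.
        import textract
        extract = textract.process
    # rglob searches the directory and all of its subdirectories.
    for path in sorted(Path(directory).rglob(pattern)):
        if path.is_file():
            # Hand over the full path, not just the bare file name.
            yield path, extract(str(path))
```

To collect everything in one file, as the bash loop does, each returned text could be written to an output file opened in binary mode, assuming the extractor returns bytes as `textract.process` does.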

I'll keep this issue open for a while in case others would like to comment on this concept, share other use cases where this would be helpful, or have other ideas for implementation.

ShawnMilo commented on May 28, 2024

In the spirit of the Unix philosophy, I agree with @deanmalmgren on this one. A program should do only one thing, and it is preferable to chain commands together rather than add non-essential features to commands.

MalikRumi commented on May 28, 2024

OK, I am going to ask a naive question here, and I hope you don't mind enlightening me. I tried your script in a Python for loop, and was surprised to find I couldn't make it work. That is what led me here. It is one thing to say some sort of internal for loop is 'extra', I get that, but why doesn't it work in a regular Python for loop? That I don't get. Of course, it is entirely possible I just did it wrong. Nah, that can't be it. But your bash loop does work. I had the same kind of trouble with all the output going to a single file instead of one output file for each input file, but that part I was able to figure out. Thanks for sharing your insight, wisdom and experience with me!

deanmalmgren commented on May 28, 2024

@MalikRumi can you provide an example? A Python for loop should work just fine...

MalikRumi commented on May 28, 2024

`
from os import listdir, environ
import textract
import django
environ['DJANGO_SETTINGS_MODULE'] = 'chronicle.settings'
django.setup()
from ktab.models import Entry

path = '/home/malikarumi/010417_odt_tests/'
filenames = listdir(path)

for filename in filenames:
    text = textract.process(filename, encoding='utf_8')
    text.write(Entry.objects.create(
        title=filename, content=text, chron_date='2018-01-05',
        clock='23:59:59', tag__tag='tagg'))
    text.save()
`
The code backticks seem not to be working for me.

(lifeandtimes) malikarumi@Tetuoan2:~/Projects/lifeandtimes/chronicle$ python django_textract_2.py
Traceback (most recent call last):
  File "django_textract_2.py", line 15, in <module>
    text = textract.process(filename, encoding='utf_8')
  File "/home/malikarumi/Projects/lifeandtimes/lib/python3.6/site-packages/textract/parsers/__init__.py", line 39, in process
    raise exceptions.MissingFileError(filename)
textract.exceptions.MissingFileError: The file "2018-01-01_psycopg2-error-at-or-near.odt" can not be found.
Is this the right path/to/file/you/want/to/extract.odt?

Now, if the file can't be found, how does python know the name of it? This script uses variables for file names in the expectation that it will iterate over all of them. I don't know what additional change I am supposed to make so that textract / Python can 'see' the file.

Note the tag insert should be changed to be directly into the Tag model, not into Entry.

Thanks.
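
A likely cause of the MissingFileError in the traceback above is that `os.listdir` returns bare file names, not full paths, so the name is known but the file is then looked up relative to the current working directory. The sketch below shows the distinction using only the standard library, with a temporary directory standing in for the real one; `full_paths` is a hypothetical helper name, not part of textract.

```python
import os
import tempfile


def full_paths(directory):
    """Return the full path of every regular file directly inside `directory`."""
    return [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if os.path.isfile(os.path.join(directory, name))
    ]


# Demonstration with a throwaway directory standing in for the real one.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "example.odt"), "w").close()
    bare_names = os.listdir(demo_dir)   # file names only, without the directory
    joined = full_paths(demo_dir)       # directory joined with each file name
    print(bare_names)
    print(joined)
```

In the script above, that would suggest passing `os.path.join(path, filename)` to `textract.process` instead of `filename` alone.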
