
Comments (6)

eikek commented on May 19, 2024

Regarding document storage: We might have different opinions here, but we seem to agree (I think) that a database is necessary :). In my experience, today's database systems have quite good support for binary files. At work we once used postgres for binary data, with much larger amounts of data than I would expect for docspell. Of course, there are pros and cons for both ways! And I'm not convinced that changing the status quo yields big improvements. So I'm not going to work on changing this in the foreseeable future. It is certainly ok to create an issue for discussion – totally possible that I change my mind :) But even then there is so much in the queue that I'd like to have first….

Since docspell is not meant to deal with really large files and targets individuals to small organizations, I found it ok to store files in the database instead of the file system. The pros are: easier backup, a simpler interface, no need to deal with filesystem locks, buffering etc. in the app – simpler development and consistent data (ACID). I might go a different way if it should run for millions of users etc….

Currently it is designed to run in a distributed way, where the database is the central synchronization point. Now when multiple joex and restserver nodes are running, the files must be transferred to them somehow. This is a no-brainer when all nodes are given access to the db. Of course, this way doesn't allow to efficiently scale horizontally to thousands of nodes – but this is not a goal. If it ever would be one, it should be easy to use a different architecture as the components are completely separated.

I think (imho) backups are a lot easier with only the db, because you can easily get out-of-sync backups when first backing up all files and then the database, or vice versa. You also need to back up two things instead of just one. With postgres at least, backup is just one command – to me it compares well to rsync or other file backup tools.
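
For illustration, a minimal sketch of what that one command could look like (the database name and target path are assumptions):

    # Dump the whole docspell database, including the binary file data, in
    # one step. "docspell" as db name and the target path are assumptions.
    pg_dump -Fc -f /backups/docspell-$(date +%F).dump docspell

    # With files on the filesystem, you would additionally need something
    # like: rsync -a /var/docspell/files/ /backups/docspell-files/
    # and the two backups could still drift out of sync with each other.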

Multi-User: I meant in comparison with tools like paperless that run on a single directory (iirc); a database is necessary for multi-user to store things like accounts, passwords etc. It usually evolves into some relational data model. The single-directory model is really nice and simple, but for a "true" multi-user application, imho this model doesn't work well. I would have to deal with all the filesystem stuff, like locking, buffering etc. Data can easily get out of sync, as other processes can delete/move/change files. Currently docspell can't edit files, but if this feature is wanted, it's much harder to implement if you need to deal with the filesystem.

Then, when thinking about implementing it in the local file system, I would probably also need to store files using their hash, maybe similar to how git stores blobs. This avoids errors in filesystems that don't support unicode or have length limits etc. But then, imho, there is no practical benefit anymore: you won't find any file by looking at the file system, you would really need the database, too, in order to get any meaningful information.
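
As a rough sketch of what such a git-style layout would mean (all paths and names here are hypothetical):

    # Store a file under its SHA-256 hash, similar to how git stores blobs:
    # the first two hex chars become a directory, the rest the file name.
    store_blob() {
      local file="$1" root="/var/docspell/objects"   # hypothetical root
      local hash
      hash=$(sha256sum "$file" | cut -d' ' -f1)
      mkdir -p "$root/${hash:0:2}"
      cp "$file" "$root/${hash:0:2}/${hash:2}"
      echo "$hash"   # only the database can map this back to a file name
    }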

So, I'm not convinced yet that going in this direction would have any real benefit, especially compared against the effort that is required.

It would of course be nice to have some abstraction that can be used to implement both db and file-system storage. But again, it is quite some effort (it also has to be maintained…).

Export: this is something I want to have! It should also be possible to import the dump again. My current vision/rough plan is to create a CLI tool and later add some admin functions like import/export to it. For a quick and dirty workaround, it shouldn't be much effort to provide a script in the tools/ directory that downloads all files/items to disk. wdyt about creating this?
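
A minimal sketch of such a tools/ script, assuming the REST endpoints, response fields and the X-Docspell-Auth header behave as sketched here – all of them should be verified against the openapi docs first:

    #!/usr/bin/env bash
    # Hypothetical sketch: log in, page over items, download each
    # attachment's original file. Endpoint paths and JSON fields are
    # assumptions; the account is a placeholder.
    BASE="http://localhost:7880/api/v1"
    TOKEN=$(curl -s -XPOST "$BASE/open/auth/login" \
      -d '{"account":"collective/user","password":"secret"}' | jq -r '.token')

    mkdir -p export
    curl -s -H "X-Docspell-Auth: $TOKEN" "$BASE/sec/item/search?limit=100" |
      jq -r '.groups[].items[].id' |
      while read -r item; do
        curl -s -H "X-Docspell-Auth: $TOKEN" "$BASE/sec/item/$item" |
          jq -r '.attachments[].id' |
          while read -r att; do
            curl -s -H "X-Docspell-Auth: $TOKEN" -o "export/$item-$att" \
              "$BASE/sec/attachment/$att/original"
          done
      done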

The consumption dir workaround is really just a workaround. I don't find it too far off, though? If concerned about files deleted in the software, you could e.g. run periodic cleanups against the directory. The other way around, it is possible to set up an inotify event for deletes in the filesystem that then goes and deletes the item in the software. Of course, this is not provided out of the box and is also not a goal. But it is possible using the api and some scripting effort….
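
A minimal sketch of the inotify direction; the delete endpoint is an assumption, and mapping a file name back to an item id (lookup_item_id below) is exactly the scripting effort mentioned – it is a hypothetical helper here:

    # Hypothetical sketch: propagate filesystem deletes into docspell.
    inotifywait -m -e delete /opt/docs | while read -r dir event name; do
      item_id=$(lookup_item_id "$name")   # hypothetical helper, e.g. a
                                          # small index kept at upload time
      curl -s -X DELETE -H "X-Docspell-Auth: $TOKEN" \
        "http://localhost:7880/api/v1/sec/item/$item_id"
    done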

Edit: Fixed some mistakes after reading again :-)


eikek commented on May 19, 2024

Hi @totti4ever and thank you for your kind words! To your questions:

  1. The files and everything else is stored in the database. I also find the simplicity of the filesystem very appealing! But I eventually decided to put everything in the db. A db is required anyway for all the user-supplied data. Features like multi-user and distributed setup are too hard if files are on the ("normal") filesystem (at least for me…). Another concern is backups: since there is a database anyway, it should be backed up, too. But the files in the filesystem must also be backed up. So in the end there must be two backups instead of one. If the database is not important, then you could still use docspell and only import from a folder. The consumedir script allows for renaming/moving files around – it doesn't upload files twice. But they will be imported into docspell and use extra space. You could start with an H2 database, which works solely on the filesystem (it's like sqlite), then there is no need for a separate db server. These were my thoughts back then :) At some point in the future, I want to have a decent CLI that lets you export all files back into the filesystem.
  2. Thumbnails are currently not implemented. I experimented with this early and found that I don't need them :-) My documents were too similar and I found myself always looking at correspondent and tags. But it is planned to add thumbnails! I just don't have an ETA.
  3. I'm not sure if I understand. Docspell itself doesn't watch a folder. It must be done with the consumedir script (or something like this). If you use the docker container for consumedir, then it should all be set up. The problem might be when mounting network filesystems like nfs or cifs. There, the kernel doesn't support watching a directory and therefore the consumedir script won't work. I have no solution to this, only polling periodically via cron or systemd (see the sketch after this list). A simple workaround is to put the consumedir script on the other machine and let the script upload the docs from there.
  4. Documents are normally not OCRed twice. Docspell first extracts the text from a pdf. If this is below some minimum length, it will still run OCR to see if that gives more. Then the longer of the two texts is taken. By default it will hand all pdfs to ocrmypdf, but (I think) this will skip already-OCRed files. If not, I missed this and would try to fix it in docspell. The whole ocrmypdf process can be switched off in the config file. So if you only have these pdfs, this would be an option, I guess. Alternatively, it is possible to change the ocrmypdf options in docspell's config file to fit your requirements.
  5. Yes, nice idea - thank you!
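
For the polling workaround in point 3, a crontab entry could run the consumedir script periodically instead of watching the directory. The paths, the upload URL and the exact consumedir.sh options are assumptions and should be checked against the script's help:

    # Hypothetical crontab entry: poll /mnt/scans every 5 minutes, since
    # inotify does not work on nfs/cifs mounts. Flags/URL are assumptions.
    */5 * * * * /opt/docspell/tools/consumedir.sh --once --path /mnt/scans http://localhost:7880/api/v1/open/upload/item/<source-id>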

I hope this helps… if not, please come back with any questions/ideas
etc. Thanks a lot for your input! Docspell is still a young project, so there is probably a lot missing that is known from other tools.


totti4ever commented on May 19, 2024

Okay, after fiddling around a lot with the docker stuff and thinking about the points, I have some further remarks:

  1. Document storage
    Okay, I think we have different opinions on whether binary files belong in a database ;-) From my understanding, DBs are not made for storing files, and looking at nextcloud for example (also with heavy file usage plus multi-user), the files are in the filesystem as well. For serving multi-user purposes, the best way, I guess, is to have some middleware which checks permissions and then grabs the file from disk (or db or whatever), so the program itself wouldn't have to care. I don't quite get where multi-user is a reason for a DB :-) - of course one shouldn't use OS users and the OS permission system!
    Regarding backups, it is easier in the first place to just back up one thing (the DB). But backing up files shouldn't be something new and/or big for people who set up docspell, I think. Using docker volumes, it'd be even easier!
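
For example, a named docker volume could be archived with a throwaway container; the volume name below is an assumption for your compose setup:

    # Archive the (assumed) volume holding docspell's data to the current
    # directory; mounted read-only so the backup container cannot modify it.
    docker run --rm \
      -v docspell_docspell-data:/data:ro \
      -v "$(pwd)":/backup \
      alpine tar czf /backup/docspell-data.tgz -C /data .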

Having the files reside in the consumption directory doesn't really help as long as they are not deleted there when deleted in the software. It would end up in a mess.

I don't want the DB to vanish completely from docspell, and I have linked it to my running mariadb instance, as all the relations and so on belong there, of course.

So I think you should at least consider an export back from the DB, or even better, moving files from the db to an internal file storage (which ideally represents the folder structure).

Would it be okay to open a FR issue for discussion on this (like also in #289 (comment)), or are you saying you won't do it anyway?

Conversation

Where are the documents stored? I'm afraid they are stored in the database or so? What I really, really would need is storage in the filesystem :-)

The files and everything else is stored in the database. I also find the simplicity of the filesystem very appealing! But I eventually decided to put everything in the db. A db is required anyway for all the user-supplied data. Features like multi-user and distributed setup are too hard if files are on the ("normal") filesystem (at least for me…). Another concern is backups: since there is a database anyway, it should be backed up, too. But the files in the filesystem must also be backed up. So in the end there must be two backups instead of one. If the database is not important, then you could still use docspell and only import from a folder. The consumedir script allows for renaming/moving files around – it doesn't upload files twice. But they will be imported into docspell and use extra space. You could start with an H2 database, which works solely on the filesystem (it's like sqlite), then there is no need for a separate db server. These were my thoughts back then :) At some point in the future, I want to have a decent CLI that lets you export all files back into the filesystem.

  2. Previews in overview
    I get your point, but I think they are still very helpful for new users to get warm with docspell. I added a feature request, so we don't forget ;-) #327
Conversation

There is no chance to see thumbnails in the overview, is there? That really helps a lot in identifying documents

Thumbnails are currently not implemented. I experimented with this early and found that I don't need them :-) My documents were too similar and I found myself always looking at correspondent and tags. But it is planned to add thumbnails! I just don't have an ETA.

  3. Usage of consumedir
    I understand the process now: one does not need to do anything more than set up the consumedir container in the docker setup. Back then, nothing happened here because of the whitespace in the file/path names. But this has been fixed and released by now, and everything is working as expected :-)
Conversation

If I mount /opt/docs of the consumer to a directory, should I still setup a watch folder using consumedir.sh?

I'm not sure if I understand. Docspell itself doesn't watch a folder. It must be done with the consumedir script (or something like this). If you use the docker container for consumedir, then it should all be set up. The problem might be when mounting network filesystems like nfs or cifs. There, the kernel doesn't support watching a directory and therefore the consumedir script won't work. I have no solution to this, only polling periodically via cron or systemd. A simple workaround is to put the consumedir script on the other machine and let the script upload the docs from there.

  4. OCRmyPDF settings / duplicate scanning
    I think you were right - but I am still not 100% sure.
    I will make a PR to have the ocrmypdf params more easily editable in a docker setup, play around with ocrmypdf during this, and come back regarding this point.
Conversation

Documents are normally not OCRed twice. Docspell first extracts the text from a pdf. If this is below some minimum length, it will still run OCR to see if that gives more. Then the longer of the two texts is taken. By default it will hand all pdfs to ocrmypdf, but (I think) this will skip already-OCRed files. If not, I missed this and would try to fix it in docspell. The whole ocrmypdf process can be switched off in the config file. So if you only have these pdfs, this would be an option, I guess. Alternatively, it is possible to change the ocrmypdf options in docspell's config file to fit your requirements.

  5. Improving management/information on pages in documents
    I added an issue for this: #325
Conversation

Could you add the number of pages to the tiles in the overview? Would also help to identify documents quicker.

Yes, nice idea - thank you!


totti4ever commented on May 19, 2024

Regarding document storage: We might have different opinions here, but we seem to agree (I think) that a database is necessary :)

a 100% confirmation!!

In my experience, today's database systems have quite good support for binary files. At work we once used postgres for binary data, with much larger amounts of data than I would expect for docspell. Of course, there are pros and cons for both ways! And I'm not convinced that changing the status quo yields big improvements. So I'm not going to work on changing this in the foreseeable future. It is certainly ok to create an issue for discussion – totally possible that I change my mind :) But even then there is so much in the queue that I'd like to have first….

Since docspell is not meant to deal with really large files and targets individuals to small organizations, I found it ok to store files in the database instead of the file system. The pros are: easier backup, a simpler interface, no need to deal with filesystem locks, buffering etc. in the app – simpler development and consistent data (ACID). I might go a different way if it should run for millions of users etc….

  • easier backup --> depends, as one (at least I) might have other files to back up anyway, so backing up files from the filesystem shouldn't be something new
  • simpler interface --> is it?
  • filesystem locks and permissions --> okay, fair point
  • buffering --> are DBMSs really that good at buffering binary files?
  • simpler development --> could be true
  • acid --> this is a fair point, especially in distributed environments, see next paragraph

Currently it is designed to run in a distributed way, where the database is the central synchronization point. Now when multiple joex and restserver nodes are running, the files must be transferred to them somehow. This is a no-brainer when all nodes are given access to the db. Of course, this way doesn't allow to efficiently scale horizontally to thousands of nodes – but this is not a goal. If it ever would be one, it should be easy to use a different architecture as the components are completely separated.

Well, if the target audience is individuals to small organizations, then several nodes on different servers is not really a realistic scenario, is it?

So, I'm not convinced yet that going in this direction would have any real benefit, especially compared against the effort that is required.

I see two main benefits (today):

  1. Access to the files as an exit strategy (don't underestimate this, as many people will check for this before moving)

  2. Access to the files from other systems, especially as a Samba share

Benefit 1 can be addressed with an export (I might start with a script for this, as a stopgap until something proper comes from you ;-) ). Benefit 2 cannot be addressed, but that doesn't matter if the software is so good that it fulfills all the requirements which might otherwise lead to the wish to access the files via Samba, for instance. I'd say if the multi-user thing works and there are improvements regarding users accessing several collections, it is fully addressed.

It would of course be nice to have some abstraction that can be used to implement both db and file-system storage. But again, it is quite some effort (it also has to be maintained…).

Okay, I am fine with your arguments. Thanks a lot for discussing them in detail!

Export: this is something I want to have! It should also be possible to import the dump again. My current vision/rough plan is to create a CLI tool and later add some admin functions like import/export to it. For a quick and dirty workaround, it shouldn't be much effort to provide a script in the tools/ directory that downloads all files/items to disk. wdyt about creating this?

I will give it a try!

The consumption dir workaround is really just a workaround. I don't find it too far off, though? If concerned about files deleted in the software, you could e.g. run periodic cleanups against the directory. The other way around, it is possible to set up an inotify event for deletes in the filesystem that then goes and deletes the item in the software. Of course, this is not provided out of the box and is also not a goal. But it is possible using the api and some scripting effort….

I'm not sure yet whether it's necessary to have the files lying around in the DB plus the file system. I think it is not a good idea; I'd rather make sure that files, once uploaded successfully, vanish from the filesystem. So the only script one might need is one that double-checks that consumed files are in the DB and then removes them from disk.
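
A minimal sketch of that double-check script, modeled on the duplicate check the consumedir script performs before uploading; the checkfile endpoint path and its "exists" field are assumptions to verify against the API docs:

    #!/usr/bin/env bash
    # Hypothetical sketch: remove a consumed file from disk only after
    # docspell confirms it knows the file's checksum.
    BASE="http://localhost:7880/api/v1"
    SOURCE="<source-id>"   # the upload source id, as used by consumedir.sh

    for f in /opt/docs/*; do
      [ -f "$f" ] || continue
      sum=$(sha256sum "$f" | cut -d' ' -f1)
      exists=$(curl -s "$BASE/open/checkfile/$SOURCE/$sum" | jq -r '.exists')
      if [ "$exists" = "true" ]; then
        rm -- "$f"   # safely stored in the db, drop it from disk
      fi
    done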


totti4ever commented on May 19, 2024

I'll close this as I, personally, am fine and some follow-up tickets have been created


eikek commented on May 19, 2024

Thank you!

  • Yeah, I see your backup point: if the backup is already there, then it's of course nice to simply include the new files. I also completely agree that from a user perspective this is a very convenient way to store files. I would probably also start with this option myself….
  • For the interface: to me it is simpler to use jdbc than a somewhat abstracted interface to the filesystem. From my experience it is simpler – it may be just pure habit, not sure. For the app, there is now just one storage system, which makes it a bit "simpler".
  • The buffering is not really a fair point; it is just the ACID properties and the interface. While I don't need to think about this, I chose to have other problems when using the network :)
  • Running distributed is a goal for me, although I know that it will not be used much. But I think even with only a few nodes, it makes the whole app much more flexible to use. For example, when running "in production" it is easy to have a redundant setup, load balancing etc. Or if you use it occasionally, you can put the restserver and db on a limited RPi and the joex on your laptop…. It also reminds me to design things in a more decoupled way :-)
  • I also think that there should be a good exit strategy! For me it is important, too. I was thinking to provide an export that stores items in a directory structure with the metadata in a json file next to them (there are some difficulties, but that's a different ticket). I would think it is ok to run some scripts/programs etc., since it is (hopefully :-)) not used often (and then just once ;-)).
  • For the different access you mentioned: I think it wouldn't be too hard to create something that exposes the files via a protocol like FTP or SSH (there are java libraries that do the difficult part of that). An item could be represented as a folder and the files in it as files :-) …. I would probably create an ftp-to-http bridge-like program. That could be started separately and talk to the docspell server. It's of course not at all like having the files in the filesystem, but many unix tools would work with that (like rsync).
  • What do you mean by "users accessing several collections"?

