Comments (7)

Hisma avatar Hisma commented on August 23, 2024

Looks like you don't have to use their hosted services. The project is open source and you can host a local API.
https://github.com/Unstructured-IO/unstructured-api

The pre-processing libraries are all open source as well.
https://github.com/Unstructured-IO/unstructured?tab=readme-ov-file

IMO this makes the solution even more appealing. Simply leave it to the end user to install the libraries and run the API, and consider using this for document pre-processing.

from llm-search.

snexus avatar snexus commented on August 23, 2024

Hi @Hisma, thanks for the suggestion.

Unstructured I/O is already installed as a requirement and supported in the package as an alternative back-end in case native parsing isn't available (for anything that isn't .pdf, .md, and .docx). The project is using the core library though, not the API version.

The last time I checked (approximately six months ago), the PDF parsing wasn't better than what could be achieved with other, much faster parsers. However, things are moving fast, and perhaps it is a good alternative at the moment. I haven't explored Unstructured I/O's OCR capabilities for parsing PDFs and images.

I will watch the course and try it out. It shouldn't be a problem to offer the user a choice between Unstructured I/O or the native parser.
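The fallback described in this comment (native parsing for .pdf, .md, and .docx, Unstructured for everything else, plus an optional user choice) could be sketched roughly as follows. This is only an illustration under stated assumptions: `NATIVE_EXTENSIONS`, `choose_backend`, and the `prefer_unstructured` flag are hypothetical names, not llm-search's actual code or configuration.

```python
from pathlib import Path

# Extensions the comment above lists as handled by native parsers.
# The set name and its use here are assumptions for illustration.
NATIVE_EXTENSIONS = {".pdf", ".md", ".docx"}


def choose_backend(path: str, prefer_unstructured: bool = False) -> str:
    """Return which parsing back-end would handle `path`.

    Hypothetical helper: "native" for the natively supported formats,
    "unstructured" otherwise, or whenever the user explicitly asks for it.
    """
    suffix = Path(path).suffix.lower()
    if suffix in NATIVE_EXTENSIONS and not prefer_unstructured:
        return "native"
    return "unstructured"


# Example: a PDF goes to the native parser unless the user opts out.
backend_default = choose_backend("report.pdf")
backend_opt_in = choose_backend("report.pdf", prefer_unstructured=True)
backend_other = choose_backend("slides.pptx")
```

With a switch like this, offering the choice mentioned above would amount to exposing one boolean (or an equivalent setting) to the user.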

Hisma avatar Hisma commented on August 23, 2024

Thank you! I wasn't aware it was already an option. I recently stumbled on the course (which was created very recently) and was impressed with the Unstructured library's capabilities, particularly around handling images and PDFs that contain nested structured data like tables. I do not recall whether llmsearch handles that data well in its current form, but it's a feature that would be useful for various industries.

I appreciate that you're willing to watch the course and see if it's something that can enhance your application. Let us know what you think!

snexus avatar snexus commented on August 23, 2024

Sorry for the delay. I watched the course and tried some of the approaches mentioned there. Advanced, model-based methods for PDF parsing are definitely an improvement, especially for documents with tables.

However, on a consumer GPU, parsing is a few orders of magnitude slower (it took me 4 minutes to parse a 10-page PDF), which makes it impractical for large document bases.
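To put the reported figure in perspective, here is a back-of-the-envelope estimate. It assumes parse time scales linearly with page count, which the thread does not confirm, and the 10,000-page corpus is a hypothetical size chosen only for illustration.

```python
# Reported measurement: 4 minutes for a 10-page PDF.
seconds_per_page = 4 * 60 / 10  # 24.0 seconds per page


def estimated_hours(total_pages: int, per_page: float = seconds_per_page) -> float:
    """Estimate total parse time in hours, assuming linear scaling (an assumption)."""
    return total_pages * per_page / 3600


# Hypothetical corpus of 10,000 pages: roughly 66.7 hours of parsing.
hours_for_corpus = estimated_hours(10_000)
```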

These methods might be useful, however, if a feature is added for in-memory (online) processing of one or two documents fetched directly from the internet.

Hisma avatar Hisma commented on August 23, 2024

No problem! This was obviously a "nice to have" type of enhancement, especially as I wanted to see what could be done to address exactly what you mentioned: working with documents that contain tables, which are commonly encountered in financial and scientific documentation.
What I would like to do is see what other model-based PDF parsing options are out there, to find a good balance of performance and quality for this area. Thanks for looking into this!

Hisma avatar Hisma commented on August 23, 2024

I'll leave this open as I research different approaches.

snexus avatar snexus commented on August 23, 2024

Great! Happy to look into other approaches. Agreed, let's leave it open. Quality parsing of complex PDFs remains a holy grail of RAG.
