As we know, data used in RAG applications comes from all kinds of sources, and while i

add support for using unstructured SaaS API for handling unstructured data pre-processing about llm-search HOT 7 OPEN

snexus commented on August 23, 2024

add support for using unstructured SaaS API for handling unstructured data pre-processing

from llm-search.

Comments (7)

Hisma commented on August 23, 2024

Looks like you don't have to use their hosted services. The project is open source and you can host the API locally.
https://github.com/Unstructured-IO/unstructured-api

The pre-processing libraries are all open source as well.
https://github.com/Unstructured-IO/unstructured?tab=readme-ov-file

IMO this makes the solution even more appealing: simply leave it to the end user to set up the libraries and API, and consider using this for document pre-processing.

snexus commented on August 23, 2024

Hi @Hisma , thanks for the suggestion.

Unstructured I/O is already installed as a requirement and supported in the package as an alternative back-end in case native parsing isn't available (for anything that isn't .pdf, .md, and .docx). The project is using the core library though, not the API version.

The last time I checked (approximately six months ago), the PDF parsing wasn't better than what could be achieved with other, much faster parsers. However, things are moving fast, and perhaps it is a good alternative at the moment. I haven't explored Unstructured I/O's OCR capabilities for parsing PDFs and images.

I will watch the course and try it out. It shouldn't be a problem to offer the user a choice between Unstructured I/O or the native parser.
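The choice between back-ends described above can be sketched as a small dispatch function. This is only an illustrative sketch, not llm-search's actual code: `parse_native`, `parse_unstructured`, `choose_parser`, and the `prefer_unstructured` flag are hypothetical names, and the two parsers are stubs standing in for real ones.

```python
from pathlib import Path
from typing import Callable

# Formats the comment above says are handled by native parsing.
NATIVE_SUFFIXES = {".pdf", ".md", ".docx"}


def parse_native(path: Path) -> str:
    """Hypothetical stub for a fast native parser."""
    return f"native:{path.name}"


def parse_unstructured(path: Path) -> str:
    """Hypothetical stub; a real back-end would call the Unstructured library here."""
    return f"unstructured:{path.name}"


def choose_parser(path: Path, prefer_unstructured: bool = False) -> Callable[[Path], str]:
    """Pick a parser: Unstructured on request, or as a fallback for non-native formats."""
    if prefer_unstructured or path.suffix.lower() not in NATIVE_SUFFIXES:
        return parse_unstructured
    return parse_native


if __name__ == "__main__":
    for name in ("report.pdf", "page.html"):
        p = Path(name)
        print(choose_parser(p)(p))
```

With a flag like this, the native parser stays the default for speed, and the user opts in to Unstructured only for documents where it helps.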

Hisma commented on August 23, 2024

Thank you! I wasn't aware it was already an option. I just recently stumbled on the course (which was created very recently) and was impressed with the Unstructured library's capabilities, particularly around handling images and PDFs that contain nested structured data like tables. I do not recall whether llmsearch handles that data well in its current form, but it's a feature that would be useful for various industries.

I appreciate that you're willing to watch the course and see if it's something that can enhance your application. Let us know what you think!

snexus commented on August 23, 2024

Sorry for the delay. I watched the course and tried some of the approaches mentioned there. Advanced, model-based methods for PDF parsing are definitely an improvement, especially for documents with tables.

However, on a consumer GPU it is a few orders of magnitude slower (it took me 4 minutes to parse a 10-page PDF), which makes it impractical for large document bases.

These methods might be useful, however, if there were a feature for in-memory (online) processing of 1-2 documents fetched directly from the internet.
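To put the timing in perspective, here is a rough back-of-the-envelope sketch. It assumes parse time scales linearly with page count at the measured rate of 4 minutes per 10 pages (24 seconds per page); the corpus sizes are hypothetical and `estimate_parse_hours` is an illustrative helper, not part of llm-search.

```python
def estimate_parse_hours(total_pages: int, seconds_per_page: float = 24.0) -> float:
    """Estimate parsing time in hours, assuming linear scaling with page count."""
    return total_pages * seconds_per_page / 3600.0


if __name__ == "__main__":
    # Two fetched documents of 10 pages each vs. a hypothetical 10,000-page document base.
    for pages in (20, 10_000):
        print(f"{pages} pages: about {estimate_parse_hours(pages):.1f} hours")
```

Under that assumption, a couple of short documents take minutes, while a document base of ten thousand pages would take days of GPU time.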

Hisma commented on August 23, 2024

No problem! This was obviously a "nice to have" type of enhancement. I mainly wanted to see what could be done about exactly what you mentioned: documents with tables, which are common in financial and scientific documentation.
What I would like to do is look at other model-based PDF parsing options to find a good balance of performance and quality in this area. Thanks for looking into this!

Hisma commented on August 23, 2024

I'll leave this open as I research different approaches.

snexus commented on August 23, 2024

Great! Happy to look into other approaches. Agreed, let's leave it open. Quality parsing of complex PDFs remains a holy grail of RAG.
