Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections, complete with options for text validation and hallucination filtering.
For document format conversion and text summarization tasks like these, I think a key feature would be to include all or some of the images, charts, and tables from the original document, as those elements are often informative for readers.
Hi,
Your code uses the offline Llama 2 chat LLM, but I want to use an alternative offline model such as Hugging Face's DistilBERT, RoBERTa, or ALBERT. Do you have any suggestions for using those models from Python?
Your [tesseract_with_llama2_corrections.py] code snippet is set up for the Llama 2 chat GGML q3 k_s.bin model, but huggingface.co now recommends GGUF instead, saying that GGML is deprecated. Can I point model_file_path in the code snippet at a GGUF file?
I need your help because I want confirmation before downloading 108GB of data.
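On the GGUF question, one cheap check before wiring a file into model_file_path is to confirm that it really is GGUF. The sketch below is not part of the original script: `is_gguf_file` is a hypothetical helper, and it relies only on the fact that GGUF files begin with the four magic bytes `GGUF`. It does not tell you whether the loader used by the script accepts GGUF, which depends on the library version installed.

```python
from pathlib import Path

# GGUF files start with these four ASCII bytes.
GGUF_MAGIC = b"GGUF"


def is_gguf_file(path):
    """Return True if the file at `path` starts with the GGUF magic bytes.

    Hypothetical helper for illustration; only the first 4 bytes are read,
    so it is safe to run on multi-gigabyte model files.
    """
    with Path(path).open("rb") as f:
        return f.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    import sys

    # Usage: python check_gguf.py path/to/model-file [more files...]
    for p in sys.argv[1:]:
        print(p, "GGUF" if is_gguf_file(p) else "not GGUF")
```

Model repositories usually publish each quantization as a separate file, so downloading only the one file you intend to load, and checking it this way, should avoid pulling the whole repository.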
Hi,
As per your code's instructions, the Llama 2 chat GGML files will be downloaded, but 'TheBloke' currently recommends downloading GGUF models instead of GGML files. Could you provide updated code that downloads the GGUF files from the repository?
Thanks.