dicklesworthstone / llama2_aided_tesseract Goto Github PK

Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections, complete with options for text validation and hallucination filtering.

Python 100.00%

ai-assist hallucinations llama2 llm ocr tesseract

llama2_aided_tesseract's People

Contributors

Stargazers

Watchers

Forkers

francyjglisboa johndpope badjeff dkzdev enth77 hbcbh1999 jianlirong vladg-cloudml manuthvann216 ifitsmanu per25 advancedaiandml

llama2_aided_tesseract's Issues

include original images/charts/tables to output doc

for these doc format convertion, text summarization tasks, I think one of key feature is to include all or some of the images/charts/tables from original doc, as those elements often informative for readers.

Alternative offline LLMs

Hi,
Your code used llma2 chat offline LLM. But, I wanted to use alternative offline LLMs such as huggingface's distilbert or roberta or albert. Do you have any suggestion for those LLMs to apply on python base?

GGUF file inclusion in the code snippet

Your provided [tesseract_with_llama2_corrections.py] code snippet is equipped with the llma2 chat ggml q3 k_s.bin LLM model but the huggingface.co is referring to use GGUF instead saying the GGML is deprecated. Now, I need to know whether I can write the GGUF in the model_file_path in the code snippet.
I need your help because I have to be confirmed before downloading 108GB of data.

Requesting new code to download .gguf files

Hi,
As per your code instruction, llma2 chat ggml files will be downloaded but currently 'TheBloke' recommends downloading gguf models instead of ggml files. So, can you provide new code stating to download gguf files from repository?
Thanks.

Support APIs

Is there any plan to restructure the code to be uniform to use it with Llama2/API like (gpt-3.5-turbo, gpt-4) to use this PDF-to-text in any hardware.

llama2_aided_tesseract/tesseract_with_llama2_corrections.py

Line 180 in 5719a9a

llm = Llama(model_path=model_file_path, n_ctx=2048)

llama2_aided_tesseract/tesseract_with_llama2_corrections.py

Line 122 in 5719a9a

llama = LlamaCppEmbeddings(model_path=model_file_path)

llama2_aided_tesseract/tesseract_with_llama2_corrections.py

Line 173 in 5719a9a

model_file_path = "./Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q5_K_S.bin"

I wonder if this could be applicable for music sheets?

AIGText/GlyphControl-release#3

dicklesworthstone / llama2_aided_tesseract Goto Github PK

llama2_aided_tesseract's People

Contributors

Stargazers

Watchers

Forkers

llama2_aided_tesseract's Issues

include original images/charts/tables to output doc

Alternative offline LLMs

GGUF file inclusion in the code snippet

Requesting new code to download .gguf files

Support APIs

I wonder if this could be applicable for music sheets?

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent