

GPT-4V_OCR


[arXiv 2310.16809] Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation

This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks: scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually rich documents. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin content, but struggles with multilingual scenarios and complex tasks. Based on these observations, we examine whether specialized OCR models remain necessary and discuss strategies to fully harness pretrained general LMMs such as GPT-4V for downstream OCR tasks. The study offers a critical reference for future research in OCR with LMMs.

Scene Text Recognition

[Figure: STR results]
Scene Text Recognition (STR) aims to recognize textual information in natural scene pictures.

Handwritten Text Recognition

[Figure: HTR results]
Handwritten Text Recognition (HTR) aims to recognize handwritten text.

Handwritten Mathematical Expression Recognition

[Figure: HMER results]
Handwritten Mathematical Expression Recognition (HMER) aims to recognize handwritten mathematical formulas.

Visual Information Extraction

[Figure: VIE results]

To learn more about Visual Information Extraction, please refer to Document-AI-Recommendations.

Visual Information Extraction (VIE) aims to mine, analyze, and extract the key fields and entities contained in visually rich documents. For example, given an image of a receipt, a VIE algorithm returns information such as the store name, product details, and prices. For documents such as forms, it returns the key-value pairs they contain.
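
As a rough illustration of the kind of output described above, the sketch below models the fields extracted from a receipt and the key-value pairs extracted from a form. It is not a VIE model and is not taken from the paper: the names `LineItem`, `ReceiptFields`, and `form_to_pairs`, and all sample values, are hypothetical and chosen only to show the shape of the data.

```python
from dataclasses import dataclass, field


@dataclass
class LineItem:
    """One product line on a receipt (hypothetical schema)."""
    name: str
    price: float


@dataclass
class ReceiptFields:
    """Key fields a VIE system might return for a receipt image (hypothetical schema)."""
    store_name: str
    items: list[LineItem] = field(default_factory=list)

    def total(self) -> float:
        # Sum of item prices, rounded to cents.
        return round(sum(item.price for item in self.items), 2)


def form_to_pairs(lines: list[str]) -> dict[str, str]:
    """Toy stand-in for form extraction: split 'Key: value' lines into pairs.

    Lines without a colon are skipped. A real VIE system works on the image
    and its layout, not on clean text lines.
    """
    pairs: dict[str, str] = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:
            pairs[key.strip()] = value.strip()
    return pairs


# Made-up sample data, for illustration only.
receipt = ReceiptFields("Example Mart", [LineItem("Tea", 3.50), LineItem("Bread", 2.25)])
form = form_to_pairs(["Name: Jane Doe", "Date: 2023-10-25", "no separator here"])
```

Here `receipt` holds the receipt-style fields (store name, product details, prices) and `form` holds the key-value pairs recovered from the form lines.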

Table Structure Recognition

[Figure: TSR results]

To learn more about Table Structure Recognition, please refer to Document-AI-Recommendations.

Table Structure Recognition (TSR) aims to recognize the cell structure of a table from a table image by extracting the coordinates of the cell boxes and the row/column spanning information. This task is very challenging because tables may have complex structures and diverse styles and contents, and may become geometrically distorted or even curved during image capture.

Data Download

Citation

@misc{shi2023exploring,
      title={Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation}, 
      author={Yongxin Shi and Dezhi Peng and Wenhui Liao and Zening Lin and Xinhong Chen and Chongyu Liu and Yuyi Zhang and Lianwen Jin},
      year={2023},
      eprint={2310.16809},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
