amenezes / aiopytesseract

A Python asyncio wrapper for Tesseract-OCR.

License: Apache License 2.0

Languages: Python 96.99%, Makefile 2.52%, Dockerfile 0.48%
Topics: ocr, tesseract, asyncio, tesseract-ocr, optical-character-recognition, text-extraction, pdftotext, pytesseract, pytesseract-ocr

aiopytesseract's Introduction

Hi, I'm Alexandre Menezes

Currently I am a programmer at Supremo Tribunal Federal.

Status

  • Working on STF

Interests

  • ๐Ÿ Programming: ๐Ÿ Python/Cython / ๐Ÿ’ป C / ๐Ÿฆ€ Rust / โšก Zig
  • ๐Ÿค– Machine Learning
  • ๐Ÿฅ Debian OS

aiopytesseract's People

Contributors

amenezes, shizacat

aiopytesseract's Issues

Empty string output

Platform

Linux

Steps to Reproduce

1. Install
2. Add to code
3. Run
4. Get no output

Expected Result

Some text output is returned.

Actual Result

[two screenshots not preserved]
When using tesseract directly:
[screenshot not preserved]

Char Whitelist support

Need/Ideas Statement

It would be nice to have access to tessedit_char_whitelist so that I can whitelist only specific characters.
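
For reference, the Tesseract command line accepts config variables through `-c name=value`, so a whitelist could be passed along roughly as sketched below. The helper name `build_whitelist_args` is hypothetical and not part of aiopytesseract; it only builds the argument list and does not run anything.

```python
from typing import List


def build_whitelist_args(image_path: str, whitelist: str, lang: str = "eng") -> List[str]:
    """Build a tesseract command line that restricts recognition to `whitelist`.

    Hypothetical helper for illustration; not part of aiopytesseract.
    """
    return [
        "tesseract",
        image_path,
        "stdout",  # write the recognized text to standard output
        "-l", lang,
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


# Example: only digits are allowed in the result.
args = build_whitelist_args("receipt.png", "0123456789")
```

The resulting list can be handed to `asyncio.create_subprocess_exec(*args, ...)` or any other process runner.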

Configurable TESSERACT_CMD

Need/Ideas Statement

Hello,

It would be really nice if TESSERACT_CMD were configurable rather than a constant, so that the library could be used on Windows too. Thanks.
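
One way this could look, as a minimal sketch: read an override from the environment and fall back to a PATH lookup. Both the function name `resolve_tesseract_cmd` and the use of a `TESSERACT_CMD` environment variable are assumptions made for illustration, not existing aiopytesseract behavior.

```python
import os
import shutil


def resolve_tesseract_cmd(default: str = "tesseract") -> str:
    """Return the tesseract executable to run.

    Checks the TESSERACT_CMD environment variable first (an assumed
    convention), then searches PATH, and finally returns the bare name.
    """
    override = os.environ.get("TESSERACT_CMD")
    if override:
        return override
    # shutil.which returns None when the executable is not on PATH.
    return shutil.which(default) or default
```

On Windows the override could then point at the full path of `tesseract.exe` without any change to the library code.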

Exception from one parent

Need/Ideas Statement

Handling exceptions is currently difficult, because every exception type has to be caught separately.
They could all derive from one or two base classes; a single base class would be better.

Example:

class TesseractRuntimeError(RuntimeError):
    pass


class TesseractError(Exception):
    """Base exception for tesseract"""

    def __init__(self, message=""):
        self.message = message
    
    def __str__(self):
        return self.message


class PSMInvalidException(TesseractError):    
    def __init__(self, message="PSM Invalid"):
        super().__init__(message)


class OEMInvalidException(TesseractError):
    def __init__(self, message="OEM Invalid"):
        super().__init__(message)


class NoSuchFileException(TesseractError):
    def __init__(self, message="No such file"):
        super().__init__(message)


class LanguageInvalidException(TesseractError):
    def __init__(self, message="Language invalid"):
        super().__init__(message)

It would be great if 'TesseractRuntimeError' also inherited from 'TesseractError'. That is up to you...
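
To show what the single base class buys the caller, here is a small self-contained sketch. It repeats two of the classes above in shortened form and, as one possible design rather than the library's actual one, makes 'TesseractRuntimeError' inherit from both 'TesseractError' and 'RuntimeError'. The `describe` helper is hypothetical.

```python
class TesseractError(Exception):
    """Base exception for tesseract."""

    def __init__(self, message: str = "") -> None:
        super().__init__(message)
        self.message = message


class TesseractRuntimeError(TesseractError, RuntimeError):
    """Still a RuntimeError, so existing `except RuntimeError` code keeps working."""


class PSMInvalidException(TesseractError):
    def __init__(self, message: str = "PSM Invalid") -> None:
        super().__init__(message)


def describe(exc_type) -> str:
    """Raise exc_type and catch it through the common base class."""
    try:
        raise exc_type()
    except TesseractError as exc:  # one handler covers every library error
        return f"{type(exc).__name__}: {exc}"
```

A caller then needs only `except TesseractError` to handle everything the wrapper raises.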

Default dpi

Need/Ideas Statement

Next one )
There is still a problem.

By default Tesseract does not need a DPI value; without one it tries to determine an appropriate DPI itself.

# man tesseract
--dpi N
           Specify the resolution N in DPI for the input image(s). A typical value for N is 300. Without this option, the resolution is read from the metadata
           included in the image. If an image does not include that information, Tesseract tries to guess it.

This is not possible in your implementation: 'dpi' always has some value.
Maybe change it? If needed, I can create a PR )
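
A minimal sketch of the requested behavior, assuming the command line is assembled as a list of arguments: when `dpi` is `None`, the `--dpi` flag is simply left out and Tesseract falls back to the image metadata or its own guess. The function name `build_ocr_args` is hypothetical.

```python
from typing import List, Optional


def build_ocr_args(image_path: str, dpi: Optional[int] = None) -> List[str]:
    """Build a tesseract command line, passing --dpi only when it is given."""
    args = ["tesseract", image_path, "stdout"]
    if dpi is not None:
        # Explicit resolution requested by the caller.
        args += ["--dpi", str(dpi)]
    return args
```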

Absence of parameters

Platform

Linux

Steps to Reproduce

import aiopytesseract

params = await aiopytesseract.tesseract_parameters()
[p for p in params if p.name == 'tessedit_char_blacklist']

Expected Result

[Parameter(name='tessedit_char_blacklist', description='Blacklist of chars not to recogniz', value='-')]

Actual Result

[]
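
One possible cause, which is an assumption and not confirmed in this issue, is that parameters with an empty value get dropped when the output of `tesseract --print-parameters` is split on arbitrary whitespace. The sketch below assumes each line is tab-separated as name, value, description, and uses a hand-written sample instead of real Tesseract output. `Parameter` here is a stand-in type, and the `-` placeholder mirrors the expected result above.

```python
from typing import List, NamedTuple


class Parameter(NamedTuple):
    name: str
    description: str
    value: str


def parse_parameters(raw: str) -> List[Parameter]:
    """Parse tab-separated 'name<TAB>value<TAB>description' lines."""
    params = []
    for line in raw.splitlines():
        fields = line.split("\t")  # keeps empty fields, unlike str.split()
        if len(fields) != 3:
            continue  # skip the header line and anything malformed
        name, value, description = fields
        params.append(Parameter(name, description, value or "-"))
    return params


# Hand-written sample; the second parameter name is made up.
sample = (
    "Tesseract parameters:\n"
    "tessedit_char_blacklist\t\tBlacklist of chars not to recognize\n"
    "example_int_param\t10\tAn illustrative integer parameter\n"
)
parsed = parse_parameters(sample)
```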

image_to_data psm and output_type

Need/Ideas Statement

Hi,
Thanks for sharing this code!

Is there any way to add these two parameters (psm and output_type) to the image_to_data function?

Best regards!
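
For context, page segmentation mode maps to Tesseract's `--psm N` flag, and an output type option typically means converting the TSV output into another structure. The sketch below covers both pieces under those assumptions; `build_tsv_args` and `tsv_to_dict` are hypothetical names, and the TSV sample is hand-written.

```python
import csv
import io
from typing import Dict, List


def build_tsv_args(image_path: str, psm: int = 3) -> List[str]:
    """Build a tesseract command line that emits TSV with the given PSM."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "tsv"]


def tsv_to_dict(tsv: str) -> Dict[str, List[str]]:
    """Turn TSV text into a dict mapping each column name to its values."""
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    columns: Dict[str, List[str]] = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            columns[name].append(row.get(name) or "")
    return columns


# Hand-written sample using the column names of Tesseract's TSV output.
sample_tsv = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t10\t12\t40\t15\t96\tHello\n"
)
data = tsv_to_dict(sample_tsv)
```

Values are kept as strings here; converting numeric columns is left to the caller.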
