bweigel / aws-lambda-tesseract-layer

A layer for AWS Lambda containing the tesseract C libraries and tesseract executable.

License: Apache License 2.0

Languages: Dockerfile 0.29%, Shell 1.37%, Python 6.66%, TypeScript 90.07%, JavaScript 1.61%
Topics: lambda, aws-lambda, lambda-layer, serverless, serverless-framework, tesseract, amazon-linux

aws-lambda-tesseract-layer's Introduction

🚀 Hey there! I'm Ben (🏳️‍🌈He/Him🏳️‍🌈)

📚 Background:

  • PhD in Chemistry turned Cloud Coordinator & Tech Lead.
  • Years of experience in fintech, data engineering, cloud architectures, and leading cloud migrations.

🛠 Tech Stack:

  • 💻 AWS (EKS, ECS, Lambda, S3, SageMaker, Athena, etc.)
  • 🌐 AWS CDK, Terraform, Kubernetes, CI/CD, DevOps
  • 📊 Data Engineering
  • ☁️ Serverless Architectures

🌱 Professional Goals:

  • 🔄 Continual learning & innovation.
  • 🌍 Foster diverse, inclusive tech environments.
  • 🛤 Finding roles that meld technical depth with strategic breadth.

🎸 Outside Work:

  • 📷 Avid photographer capturing fleeting moments.
  • 🎡 Guitar enthusiast adding rhythm to free time.
  • 🌊 Passionate skipper, sailing both local lakes and distant seas.

🔗 Connect with me

weigelb

Languages and Tools:

aws bash docker flask git grafana html5 jest kubernetes linux mongodb nginx pandas postgresql postman python scala scikit_learn typescript

aws-lambda-tesseract-layer's Issues

Cannot import ready-to-use Lambda layer

I'm unable to use the ready-to-use Lambda layer.

Steps I followed:

Step 1:
Download the repo

Step 2:
cd ready-to-use

Step 3:
Zip the "amazonlinux-2" folder and attach it as a layer to my Python 3.8 Lambda function

Step 4:
Inside my function, I run the code below, which downloads a file from S3 and passes it to pytesseract.image_to_string:

import boto3
import botocore
import pytesseract
from PIL import Image

def lambda_handler(event, context):

    s3 = boto3.resource('s3')
    # Download the file from S3 to the /tmp directory
    BUCKET_NAME = 'xxxxxxx' # replace with your bucket name
    KEY = 'test.jpg'
    try:
        s3.Bucket(BUCKET_NAME).download_file(KEY, '/tmp/my_local_image.jpg')
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            # Stop here: there is no local file to run OCR on
            print("The object does not exist.")
            return "Object not found"
        else:
            raise

    image = '/tmp/my_local_image.jpg'

    # Run OCR on the downloaded image and log the recognized text
    text = pytesseract.image_to_string(Image.open(image))
    print(text)

    return "Hello world"

Result:
The function fails with the error: Unable to import module 'lambda_function': No module named 'pytesseract'.

Could you suggest how to use this layer, or provide a .zip file that can be uploaded as a layer directly?
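
The repository description says the layer contains the tesseract C libraries and the tesseract executable, so the error above is consistent with the pytesseract Python package simply not being part of it; in that case the package would have to be bundled with the function or supplied by a separate layer. The following is a minimal diagnostic sketch, not the author's method. It assumes the layer's binary ends up on PATH (Lambda extracts layers under /opt, and /opt/bin is on the default PATH); the name check_ocr_dependencies is hypothetical.

```python
import importlib.util
import shutil


def check_ocr_dependencies(module="pytesseract", binary="tesseract"):
    """Report whether the Python wrapper and the tesseract binary are visible.

    Hypothetical helper: it only inspects the import path and PATH,
    it does not run OCR.
    """
    return {
        # True if "import <module>" would find a top-level package
        "python_module_found": importlib.util.find_spec(module) is not None,
        # Absolute path of the executable, or None if it is not on PATH
        "binary_path": shutil.which(binary),
    }


def lambda_handler(event, context):
    # Returning the report makes it visible in the Lambda test console
    return check_ocr_dependencies()
```

If python_module_found is False while binary_path is set, the layer itself is attached correctly and only the Python wrapper is missing.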

Question about Dockerfile in al2-serverless example

I got the example at https://github.com/bweigel/aws-lambda-tesseract-layer/tree/master/example/al2-serverless to work.

What is the Dockerfile for? https://github.com/bweigel/aws-lambda-tesseract-layer/blob/master/example/al2-serverless/Dockerfile

Do I need it? How and why? What can I practically do with it? Or should I leave it alone?

I used to use Docker to handle my Lambdas, but I had lots of issues creating a Docker image with tesseract, so your layer is a blessing!

In principle I could simply adapt your example handler.py to my application's needs (which are very similar to the example, except that I need to fetch images from an S3 bucket and save the output there as well), so I don't see why I would need this Dockerfile.

So, in the end, I just want to know about that Dockerfile. I removed it and everything still appeared to work.

Update for Python 3.8 and lambda-base-2:build

I'm happy to put up a PR for this.

As suggested in the known issues in the README, updating the base image to FROM lambci/lambda-base-2:build for Python 3.8 requires some changes.

I found this issue with Klayers for tesseract:

"errorMessage": "(127, 'tesseract: error while loading shared libraries: libjpeg.so.62: cannot open shared object file: No such file or directory')", "errorType": "TesseractError"

I'm still working through the errors, but I was able to resolve each missing library by copying it into lib/:

WORKDIR /opt
RUN mkdir -p ${DIST}/lib && mkdir -p ${DIST}/bin && \
    cp ${TESSERACT}/bin/tesseract ${DIST}/bin/ && \
    cp ${TESSERACT}/lib/libtesseract.so.4  ${DIST}/lib/ && \
    cp ${LEPTONICA}/lib/liblept.so.5 ${DIST}/lib/liblept.so.5 && \
    cp /usr/lib64/libwebp.so.4 ${DIST}/lib/ && \
    cp /usr/lib64/libpng15.so.15 ${DIST}/lib/ && \
    cp /usr/lib64/libjpeg.so.62 ${DIST}/lib/ && \
    cp /usr/lib64/libtiff.so.5 ${DIST}/lib/ && \
    echo -e "LEPTONICA_VERSION=${LEPTONICA_VERSION}\nTESSERACT_VERSION=${TESSERACT_VERSION}\nTESSERACT_DATA_FILES=tessdata${TESSERACT_DATA_SUFFIX}/${TESSERACT_DATA_VERSION}" > ${DIST}/TESSERACT-README.md && \
    find ${DIST}/lib -name '*.so*' | xargs strip -s
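
Finding the missing libraries one error at a time can be shortened by inspecting the binary with ldd inside the build container. The sketch below is an illustration under assumptions, not part of the repository: it parses ldd output (lines of the form "name => not found") and returns the unresolved library names. The function name missing_shared_libs and the sample text are hypothetical.

```python
import re

# ldd prints unresolved dependencies as "<tab>libfoo.so.N => not found"
_NOT_FOUND = re.compile(r"^[ \t]*(\S+)[ \t]+=>[ \t]+not found[ \t]*$", re.MULTILINE)


def missing_shared_libs(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    return _NOT_FOUND.findall(ldd_output)


# Hypothetical sample, shaped like the output of "ldd ${DIST}/bin/tesseract"
sample = (
    "\tlibtesseract.so.4 => /opt/lib/libtesseract.so.4 (0x00007f0000000000)\n"
    "\tlibjpeg.so.62 => not found\n"
    "\tlibpng15.so.15 => not found\n"
)
print(missing_shared_libs(sample))
```

Each name returned is a candidate for another cp line into ${DIST}/lib/, as in the Dockerfile fragment above.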

Compiling for Python 3.9

How would I do this?

For context, I have a Lambda function that runs on Python 3.9 and needs to use Tesseract, so I was going to build the layer following your instructions and then attach it. Would I even need to build this for Python 3.9, or would Python 3.8 be fine, since it is a separate layer?

My bad, this is not really an issue, more of a general question.

Error opening data file

Hi Benjamin,

Thanks very much for detailing how to set up tesseract as a Lambda layer. It has been super helpful to me.

Although I recognize this repository is probably not actively maintained anymore, I thought I'd report a bug to help others avoid running into it as well.

The locations from which the Dockerfile (https://github.com/bweigel/aws-lambda-tesseract-layer/blob/master/Dockerfile) downloads the tessdata files seem to be outdated. This causes a slightly misleading error (in pytesseract):

'Error opening data file Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.

The problem is not the location of the tessdata directory, but its contents which are downloaded from stale links. I managed to get things working by updating:
curl -L https://github.com/tesseract-ocr/tessdata${TESSERACT_DATA_SUFFIX}/raw/${TESSERACT_DATA_VERSION}/osd.traineddata > osd.traineddata && \

to:
curl -L https://github.com/tesseract-ocr/tessdata${TESSERACT_DATA_SUFFIX}/raw/master/osd.traineddata > osd.traineddata && \
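
Because curl -L without --fail saves whatever the server returns, a stale link can leave a small error page on disk under the name osd.traineddata. The sketch below is one way to catch that during the build, assuming that genuine .traineddata files are large binaries; the 100,000-byte threshold and both function names are hypothetical choices, and the URL template simply mirrors the curl commands above.

```python
from pathlib import Path

# Same URL shape as the curl commands above; suffix is e.g. "" or "_fast"
URL_TEMPLATE = (
    "https://github.com/tesseract-ocr/tessdata{suffix}/raw/{ref}/{lang}.traineddata"
)


def tessdata_url(lang, ref="master", suffix=""):
    """Build the download URL for one language file at a given git ref."""
    return URL_TEMPLATE.format(suffix=suffix, ref=ref, lang=lang)


def looks_like_traineddata(path, min_bytes=100_000):
    """Heuristic check that a download is not an empty file or an error page.

    min_bytes is an assumed lower bound, not a documented property of
    the traineddata format.
    """
    data = Path(path)
    return data.is_file() and data.stat().st_size >= min_bytes
```

Running looks_like_traineddata on each downloaded file right after the curl step would turn the misleading runtime error into an early build failure.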

Pytesseract image_to_pdf_or_hocr function throws an error.

First of all, thank you for developing this and making our lives easier. It has helped a lot.

Other pytesseract functions, such as image_to_string and image_to_data, work without any hiccups.

But when I try to use image_to_pdf_or_hocr like this:

pdf = pytesseract.image_to_pdf_or_hocr(f'/tmp/{file_name}/{page.number}.png', extension='pdf')

it fails with the following error:

Traceback (most recent call last):
  File "/var/task/helpers/ocr_helper.py", line 36, in save_searchable_pdf
    f'/tmp/{file_name}/{page.number}.png', extension='pdf')
  File "/var/task/pytesseract/pytesseract.py", line 432, in image_to_pdf_or_hocr
    return run_and_get_output(*args)
  File "/var/task/pytesseract/pytesseract.py", line 289, in run_and_get_output
    with open(filename, 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tess_6_hu78b0.pdf'

  1. It says that the file tess_6_hu78b0.pdf does not exist. What does this mean? I never created a file named tess_6_hu78b0.
  2. The path that I am passing to image_to_pdf_or_hocr is definitely correct, and an image is present there. I have confirmed this, and the same code works on my local machine.

I read somewhere that I needed to install libtesseract-dev too, so I modified my Dockerfile as follows:

FROM lambci/lambda:build-python3.6
RUN sudo apt install tesseract-ocr
RUN sudo apt install libtesseract-dev

but unfortunately this too did not work.

Can someone please help me out on this? Thank you.
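
The traceback shows pytesseract failing while opening an output file, which suggests the tess_... name is a temporary file that the tesseract process was expected to write and did not. Calling the executable directly exposes its own error output. This is a minimal sketch, assuming the tesseract binary is on PATH and using the standard command-line form "tesseract IMAGE OUTPUTBASE pdf"; the helper names are hypothetical. One possible cause, which is an assumption and not confirmed by this report, is that the pdf config file is missing from the layer's tessdata directory.

```python
import subprocess


def build_pdf_command(image_path, output_base, tesseract="tesseract"):
    """Build the CLI call that should write <output_base>.pdf."""
    # "pdf" is the name of a tesseract config file, passed after the output base
    return [tesseract, str(image_path), str(output_base), "pdf"]


def run_pdf_ocr(image_path, output_base, tesseract="tesseract"):
    """Run tesseract and return (exit code, stderr) for inspection."""
    result = subprocess.run(
        build_pdf_command(image_path, output_base, tesseract),
        capture_output=True,  # keep stderr instead of discarding it
        text=True,
    )
    return result.returncode, result.stderr
```

Logging the returned stderr from inside the Lambda function should show why no PDF was produced, for example a message about a config or data file that cannot be read.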
