aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract

License: Apache License 2.0

Languages: TypeScript 66.93%, Python 28.10%, C# 2.66%, JavaScript 2.03%, HTML 0.28%
Topics: amazon-textract

amazon-textract-response-parser's Introduction

Textract Response Parser

You can use the Textract Response Parser library to easily parse the JSON returned by Amazon Textract. The library parses the JSON and provides programming-language-specific constructs for working with different parts of the document. textractor is an example of a PoC batch-processing tool that takes advantage of the Textract Response Parser library and generates output in multiple formats.

Python Usage

For documentation on usage see: src-python/README.md
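As a quick orientation, here is a minimal sketch of the Python package (the linked README is authoritative; the local file name and feature types below are placeholders):

import boto3
from trp import Document

# Analyze a local image with the FORMS and TABLES features
textract = boto3.client("textract")
with open("document.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["FORMS", "TABLES"],
    )

# Wrap the raw JSON response in the parser's Document class
doc = Document(response)
for page in doc.pages:
    for line in page.lines:
        print(line.text, line.confidence)
    for field in page.form.fields:
        print(field.key, field.value)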

JavaScript/TypeScript Usage

For documentation on usage see: src-js/README.md

C# Usage

Forms

document.Pages.ForEach(page => {
    Console.WriteLine("Print Lines and Words:");
    page.Lines.ForEach(line => {
        Console.WriteLine("{0}--{1}", line.Text, line.Confidence);
        line.Words.ForEach(word => {
            Console.WriteLine("{0}--{1}", word.Text, word.Confidence);
        });
    });
    Console.WriteLine("Print Fields:");
    page.Form.Fields.ForEach(f => {
        Console.WriteLine("Field: Key: {0}, Value {1}", f.Key, f.Value);
    });
    Console.WriteLine("Get Field by Key:");
    var key = "Phone Number:";
    var field = page.Form.GetFieldByKey(key);
    if(field != null) {
        Console.WriteLine("Field: Key: {0}, Value: {1}", field.Key, field.Value);
    }
});

Tables

document.Pages.ForEach(page => {
    page.Tables.ForEach(table => {
        var r = 0;
        table.Rows.ForEach(row => {
            r++;
            var c = 0;
            row.Cells.ForEach(cell => {
                c++;
                Console.WriteLine("Table [{0}][{1}] = {2}--{3}", r, c, cell.Text, cell.Confidence);
            });
        });
    });
});

Check out the src-csharp folder for instructions on how to run the .NET Core C# samples.

Other Resources

License Summary

This sample code is made available under the Apache License 2.0. See the LICENSE file.

amazon-textract-response-parser's People

Contributors

athewsey, avisaws, belval, cgarces, darwaishx, dependabot[bot], dhawalkp, douglasqian, jmalha, kmascar, mehran22000, michaelwalker-git, mvonlanthen, richardscottoz, sahays, schadem, seanstrom, stevenmapes, tb102122, victormartingarcia, yuajia


amazon-textract-response-parser's Issues

trp2 fails for new entity types

trp2 fails with a KeyError when it encounters a new entity type while creating the hash map. It should be more robust and add the new key to the hash map instead.

KeyError exception in Python trp package when parsing a page that doesn't have a Polygon element

A 12-page PDF document was processed by Textract, and I'm trying to use this package to parse the resulting response.json. The very first block is a PAGE block that has the following Geometry element:

{
    "DocumentMetadata": { "Pages": 12 },
    "JobStatus": "SUCCEEDED",
    "NextToken": "RYAd635ujGFqn4t5XLy4H+7BT1mguxFfHvBA8pGfJ3C9FnC8Pv7Cz/+qj+v/MisnIcNR7fwh+/CfJVGIdHn/sSplCQcE2ra4ZXjtDJ9SIp6Z9v5ICHmkzGNrVtS4m4GG",
    "Blocks": [
      {
        "BlockType": "PAGE",
        "Geometry": {
          "BoundingBox": {
            "Width": 1.0,
            "Height": 1.0,
            "Left": 0.0,
            "Top": 0.0
          }
        },
        "Id": "e5413485-55aa-405c-b547-25d6f3db1251",
       "...","...."
  }]}

I've loaded the response into a dictionary and then tried to instantiate the Document class, passing the document dictionary to the constructor; when I do so, I get the following error:

./tests/TextractOutputProcessor_test.py::test_processResponseJson Failed: [undefined]KeyError: 'Polygon'
responseJsonFile = './tests/textract/response.json'

    def test_processResponseJson(responseJsonFile):
        """Test the processResponseJson method"""
    
        assert isinstance(responseJsonFile, str)
        processor = TextractOutputProcessor()
    
        try:
>           processor.loadResponseJson(responseJsonFile)

tests/TextractOutputProcessor_test.py:17: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
TextractOutputProcessor.py:24: in loadResponseJson
    self.document = Document(self.metadata)
venv/lib/python3.8/site-packages/trp/__init__.py:638: in __init__
    self._parse()
venv/lib/python3.8/site-packages/trp/__init__.py:675: in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
venv/lib/python3.8/site-packages/trp/__init__.py:522: in __init__
    self._parse(blockMap)
venv/lib/python3.8/site-packages/trp/__init__.py:533: in _parse
    self._geometry = Geometry(item['Geometry'])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <trp.Geometry object at 0x7fe2a06e1910>
geometry = {'BoundingBox': {'Height': 1.0, 'Left': 0.0, 'Top': 0.0, 'Width': 1.0}}

    def __init__(self, geometry):
        boundingBox = geometry["BoundingBox"]
>       polygon = geometry["Polygon"]
E       KeyError: 'Polygon'

venv/lib/python3.8/site-packages/trp/__init__.py:111: KeyError

It seems that the Geometry class expects there to be a Polygon element within every Geometry element in the response JSON, even though Textract did not create such an element when it processed my PDF document.
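Until the library tolerates a missing Polygon, one possible workaround is to synthesize a rectangular Polygon from the BoundingBox before handing the response to Document. A sketch (ensure_polygons is a hypothetical helper, not part of trp):

from trp import Document

def ensure_polygons(response: dict) -> dict:
    """Add a rectangular Polygon derived from the BoundingBox to any
    block whose Geometry lacks one, so trp's Geometry class can parse it."""
    for block in response.get("Blocks", []):
        geometry = block.get("Geometry")
        if geometry and "Polygon" not in geometry:
            bb = geometry["BoundingBox"]
            left, top = bb["Left"], bb["Top"]
            right, bottom = left + bb["Width"], top + bb["Height"]
            geometry["Polygon"] = [
                {"X": left, "Y": top},
                {"X": right, "Y": top},
                {"X": right, "Y": bottom},
                {"X": left, "Y": bottom},
            ]
    return response

doc = Document(ensure_polygons(response))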

TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

I'm trying to load into trp.Document a .json file that is the result of a Textract job started with start_document_analysis.

The code runs on Python 3.9.

I'm loading the JSON file into a variable and trying to parse it with trp:

    # Load json
    with open(f"/tmp/{file_name}", "rb") as json_file:
        textract_json = json.load(json_file)

    doc = Document(textract_json)

It works fine with synchronous calls but not with asynchronous calls

Traceback (most recent call last):
  File "/var/task/app.py", line 115, in lambda_handler
    process_file(file_name)
  File "/var/task/app.py", line 53, in process_file
    doc = Document(textract_json)
  File "/var/lang/lib/python3.9/site-packages/trp/__init__.py", line 638, in __init__
    self._parse()
  File "/var/lang/lib/python3.9/site-packages/trp/__init__.py", line 675, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
  File "/var/lang/lib/python3.9/site-packages/trp/__init__.py", line 522, in __init__
    self._parse(blockMap)
  File "/var/lang/lib/python3.9/site-packages/trp/__init__.py", line 543, in _parse
    t = Table(item, blockMap)
  File "/var/lang/lib/python3.9/site-packages/trp/__init__.py", line 438, in __init__
    cell = Cell(blockMap[cid], blockMap)
  File "/var/lang/lib/python3.9/site-packages/trp/__init__.py", line 361, in __init__
    self._text = self._text + w.text + ' '

Using doc = TDocumentSchema().load(textract_json) also gives me validation exceptions.

Any clue about what I'm doing wrong?
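For reference, start_document_analysis results are paginated via NextToken, and saving only one page of get_document_analysis output leaves lines and tables referencing word blocks that live in later result pages, which can surface as errors like the one above. A sketch of stitching the pages together first (get_full_analysis is a hypothetical helper, and it assumes the job has already succeeded):

import boto3
from trp import Document

textract = boto3.client("textract")

def get_full_analysis(job_id):
    """Fetch every result page of an async Textract job."""
    pages = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        result = textract.get_document_analysis(**kwargs)
        pages.append(result)
        next_token = result.get("NextToken")
        if not next_token:
            break
    return pages

# trp's Document should accept a list of response pages as well as a single response
doc = Document(get_full_analysis("your-job-id"))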

[Feature Request] JavaScript version

Although JS has cleaner native JSON handling than, say, Python, I think there is still value in a TRP library for JavaScript because of the know-how it encodes, e.g. mapping all the block relationships.

I would like to see a JS version of the library to support both NodeJS users on the back end and potential front-end applications for in-browser JS.

I just went through creating a very basic/direct translation myself for a front-end project, so I can raise a draft PR shortly.

NodeJS import/require errors

I ran into an error while requiring the NodeJS version following an npm install.

NodeJS: v14.17.6

To reproduce:

npm init -y
npm install amazon-textract-response-parser # as per instructions in this package's top-level readme
touch index.js
echo 'const { TextractDocument, TextractExpense } = require("textract-response-parser");' >> index.js
node index.js

Output:

Error: Cannot find module 'textract-response-parser'
Require stack:
- /{...}/index.js
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:889:15)
    at Function.Module._load (internal/modules/cjs/loader.js:745:27)
    at Module.require (internal/modules/cjs/loader.js:961:19)
    at require (internal/modules/cjs/helpers.js:92:18)
    at Object.<anonymous> ({...}/index.js:1:47)
    at Module._compile (internal/modules/cjs/loader.js:1072:14)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1101:10)
    at Module.load (internal/modules/cjs/loader.js:937:32)
    at Function.Module._load (internal/modules/cjs/loader.js:778:12)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12) {
  code: 'MODULE_NOT_FOUND',
  requireStack: [ '/{...}/index.js' ]
}

Lambda layer

Is there a Lambda layer for amazon-textract-response-parser?

various improvements

Hey there,

Thanks for this great project. I found it quite useful. I implemented a few functions in the python version that I would like to submit to you for review. These functions were useful in my case, and I hope they will be useful for others:

  1. Field: added __repr__ and __eq__.

  2. Form: getFieldByKey2: each field key and the search key are lower()-ed and trim()-med before comparison.

  3. Table:
  • toDictionary: useful for various things, including importing tables into pandas.
  • to2DList: useful for various things, including pandas.

  4. Page: added a _page_number field and property.

If these changes sound interesting, please let me know how to proceed.

Form type in Textract not getting data in sequential order

Hello,
Currently I am performing OCR on a 1-page document that contains multiple entities with the same name, each with a checkbox in front of it. Using FORMS in AWS Textract I can detect all the values and whether each checkbox is selected, but the data does not come back in any particular sequence.
Below I have attached 2 files with the same data; in both files all entities are detected, but in random order.
Here is the code I am using:

import boto3
import sys
import re
import json
from collections import defaultdict


def get_kv_map(file_name):
    with open(file_name, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', file_name)

    # process using image bytes
    client = boto3.client('textract')
    response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])

    # Get the text blocks
    blocks = response['Blocks']

    # get key and value maps
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:
        block_id = block['Id']
        block_map[block_id] = block
        if block['BlockType'] == "KEY_VALUE_SET":
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block
    return key_map, value_map, block_map


def get_kv_relationship(key_map, value_map, block_map):
    kvs = defaultdict(list)
    for block_id, key_block in key_map.items():
        value_block = find_value_block(key_block, value_map)
        key = get_text(key_block, block_map)
        val = get_text(value_block, block_map)

        kvs[key].append(val)
    return kvs


def find_value_block(key_block, value_map):
    for relationship in key_block['Relationships']:
        if relationship['Type'] == 'VALUE':
            for value_id in relationship['Ids']:
                value_block = value_map[value_id]
    return value_block


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '
    return text


def print_kvs(kvs):
    for key, value in kvs.items():
        print(key, ":", value)

        

def search_value(kvs, search_key):
    for key, value in kvs.items():
        if re.search(search_key, key, re.IGNORECASE):
            return value


def main(file_name):
    key_map, value_map, block_map = get_kv_map(file_name)

    # Get Key Value relationship
    kvs = get_kv_relationship(key_map, value_map, block_map)
    print("\n\n== FOUND KEY : VALUE pairs ===\n")
    print_kvs(kvs)
    return kvs


if __name__ == "__main__":
    file_name = sys.argv[1]
    d = main(file_name)  # was main("./data.png"), which ignored the command-line argument

file1.pdf
file2.pdf

So how can I get the details in sequence rather than in random order?

For the buyer entity, this is the data from one file: ['', '', '', '', '', '', '', '', '', '', '', '', 'X ', '', 'X ', '', '']
For the same data, this is the buyer response from the other file: ['', '', '', '', '', '', '', '', '', 'X ', '', '', '', 'X ', '', '', '']
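Since Textract does not guarantee any particular order for KEY_VALUE_SET blocks, one possible approach is to carry each key block's bounding box along and sort the pairs top-to-bottom, left-to-right. A sketch (get_kv_relationship_ordered is a hypothetical variant of the function above, reusing its find_value_block and get_text helpers):

def get_kv_relationship_ordered(key_map, value_map, block_map):
    """Like get_kv_relationship, but returns (key, value) pairs sorted
    by the key block's position on the page."""
    pairs = []
    for block_id, key_block in key_map.items():
        value_block = find_value_block(key_block, value_map)
        key = get_text(key_block, block_map)
        val = get_text(value_block, block_map)
        box = key_block["Geometry"]["BoundingBox"]
        pairs.append((box["Top"], box["Left"], key, val))
    # Round Top so keys on (roughly) the same line sort left-to-right
    pairs.sort(key=lambda p: (round(p[0], 2), p[1]))
    return [(key, val) for _top, _left, key, val in pairs]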

TypeError Boundingbox Undefined

On some expense bills when I call const expense = new TextractExpense(textractResponse);
I receive this error: TypeError: Cannot read properties of undefined (reading 'BoundingBox')

The error is thrown because Textract detected an empty Text in the result, and there is no Geometry with BoundingBox and Polygon in the JSON object.
The snippet in the JSON where the error occurs looks as follows:
{ "Type": { "Text": "CITY", "Confidence": 60.71051788330078 }, "ValueDetection": { "Text": "", "Confidence": 60.71051788330078 }, "PageNumber": 1, "GroupProperties": [ { "Types": ["RECEIVER"], "Id": "a62ec926-a04c-40dd-b59c-f3fbf7249d77" } ] },

I solved this by adding an if statement to the constructor of Geometry.js:
[screenshot of the fix]

Option to avoid warning/error logs

Using NodeJS.
In deployment, logs containing the following messages overwhelm the log-watch services. Is there an option to make them stop? If not, there should be an option to suppress them.

Provided Textract JSON contains a NextToken: Content may be truncated!
Document missing word block 1f35c661-417c-4673-868c-be50bd7c760a referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block becc58b1-e7bc-4e38-a8be-70603799f928 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 07e0f420-5cf6-434e-82e5-df22b7f8bc09 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 3cad749c-52e3-4519-a831-107775ff9c32 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block ec0283a9-bccb-487f-9ffc-937a03a15333 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 9445e997-e4ae-47a7-b693-6097f9949792 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 28bc973b-72c9-4dcf-92fa-e52968616a72 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 35eff2e4-35aa-45c3-86c0-2aae0633b0ff referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 52e43a8c-344a-4ede-b6d0-624aeea36f99 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block af741903-4086-4a0e-b15f-16523b1c17cb referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 432fa0ec-b059-42b1-bff6-91f9cfcb2057 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block a139c666-4db3-48f0-baa2-69c43b70aab7 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 24755216-f559-4dd9-a82a-8e2fb425829f referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block ed384488-94c5-45de-be23-946e1d8cd710 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 68b7678f-68ef-444e-b0d4-589665fee28d referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block d6720810-b6d7-4dcc-9453-f8d4bcd0c850 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 51092d43-c46a-41ab-9cd5-97b28397202a referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block ac5a7051-319c-417b-96d2-fd528f7afc19 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block d70aa11e-383d-41e0-a289-b35723cb1afd referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block d4d27afa-533f-4a35-b273-ffb2b448ffae referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 611007dd-5df7-4ac6-9198-a647a0814670 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 723157b0-b94f-47fe-9479-f4a910520b40 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block fab46976-da2d-4f7c-94ad-425f130b3867 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 5ed59f30-1430-4ab1-ac83-68bb9eb898d1 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block 5b2c2218-05f5-4c92-bf42-7c32606ad5d8 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block d15da117-069b-4d14-ae32-16085a78c584 referenced by line cbd699e2-34c0-409d-bbcc-fd19d3500960
Document missing word block c291dbdc-5427-45a3-b2b0-b76df9f32bcc referenced by line 6cf6f6a5-1f12-47c6-8966-74541d71cd13
Document missing word block 4209498e-826f-4b07-89bd-cf3ee92294b2 referenced by line 8e99a3c6-8418-4455-94fb-3a2c90b1c493
Document missing word block 8749b2ee-bdc8-4cbe-8ced-e8c847f62d35 referenced by line 8e99a3c6-8418-4455-94fb-3a2c90b1c493
Document missing word block 52602fb2-303b-4c7d-b0ee-d29470e13ba7 referenced by line 8e99a3c6-8418-4455-94fb-3a2c90b1c493
Document missing word block 620e8fa2-f189-457d-988c-1929f4e26726 referenced by line 8e99a3c6-8418-4455-94fb-3a2c90b1c493

Null ref on "block" parameter in Word constructor for some results

Expected result:
API result is parsed correctly.

Actual result:
For some document results, the Word constructor fails due to a null ref on the block parameter.
src-csharp/TextractExtensions.cs line 9.
The same document works fine via the web test UI on AWS.

The calling code is: https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-csharp/TextractExtensions.cs#L318

Looks to be an issue in the API result relating to Relationships, CHILD, WORD and Id?
This happens on API results from completely different source PDFs (all invoices but from separate suppliers).

Validation Error For Proxy HTTP headers

With additional HTTP response header information, marshalling fails.
Example:
ValidationError: {'ResponseMetadata': {'HTTPHeaders': {'age': ['Unknown field.'], 'via': ['Unknown field.']}}}
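A common workaround (a sketch, not an official fix) is to drop the boto3 ResponseMetadata before loading, since the schema only models the Textract payload; response here stands for the raw boto3 response dict:

from trp.trp2 import TDocumentSchema

# The schema doesn't model HTTP metadata, so remove it before loading
response.pop("ResponseMetadata", None)
t_document = TDocumentSchema().load(response)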

How to print all multi-column variable text in reading order

Hi,
I've been trying to extract only the text in reading order for multi-column layouts with the code below.
Its problem is that the number of columns is manual.
I've been trying to apply the response parser in this code but couldn't; could you give an example of how to do it?
What I've achieved so far with amazon-textract-response-parser keeps mixing up the reading order.

import boto3

textract = boto3.client('textract')

# Document
s3BucketName = "your-bucket-name"
documentName = "your-image.png"


# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

print(response)

# Detect columns and print lines
columns = []
lines = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        column_found = False
        for index, column in enumerate(columns):
            bbox_left = item["Geometry"]["BoundingBox"]["Left"]
            bbox_right = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]
            bbox_centre = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"] / 2
            # Midpoint of the column span, i.e. (left + right) / 2
            column_centre = (column['left'] + column['right']) / 2

            if (column['left'] < bbox_centre < column['right']) or (bbox_left < column_centre < bbox_right):
                # Bbox appears inside the column
                lines.append([index, item["Text"]])
                column_found = True
                break
        if not column_found:
            # Start a new column; this is the manual/heuristic part that
            # breaks down when the number of columns varies
            columns.append({'left': item["Geometry"]["BoundingBox"]["Left"],
                            'right': item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]})
            lines.append([len(columns) - 1, item["Text"]])

lines.sort(key=lambda x: x[0])
for line in lines:
    print(line[1])
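For reference, the Python parser ships a reading-order helper built on a similar column heuristic, which avoids hard-coding the column count. A minimal sketch (response is assumed to be a detect_document_text result; the same caveats about column detection apply, see the *InReadingOrder issue below):

from trp import Document

doc = Document(response)
for page in doc.pages:
    # Each entry is a [column_index, line_text] pair in detected reading order
    for _column, text in page.getLinesInReadingOrder():
        print(text)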

Tables spanning pages from PDFs with more than one table spanning pages are not merged correctly

The code sample below works well for merging a single table that spans multiple pages, but we cannot get it to fully work when there are many tables in a document that span multiple pages. If the first table spans multiple pages it is merged correctly, but subsequent tables are not merged together when they span multiple pages. From https://github.com/aws-samples/amazon-textract-multipage-tables-processing, here is the code we are using for the test:

# Imports as used in the linked sample (module paths may vary by version):
from textractcaller.t_call import call_textract, Textract_Features
import trp.trp2 as t2
from trp.t_pipeline import pipeline_merge_tables, MergeOptions, HeaderFooterType

textract_json = call_textract(input_document=s3_uri_of_documents, features=[Textract_Features.TABLES], boto3_textract_client=textract_client)
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)

To test this, just produce a PDF with two tables (I have attached a test PDF document to this issue), the first table spanning pages one and two, and the second table spanning pages two and three. In our test the merge of the first table works fine, but the merge of the second table does not, and our final result is three tables rather than two.

test_textract_tables.pdf

No confidence levels in query output (with solution)

I would like to know the confidence level of query results, but this is not made available.
I suggest a small change to get_query_answers in trp2.py:

    def get_query_answers(self, page: TBlock) -> List[List[str]]:
        result_list: List[List[str]] = list()
        for query in self.queries(page=page):
            answers = [x for x in self.get_answers_for_query(block=query)]
            if answers:
                for answer in answers:
                    result_list.append([query.query.text, query.query.alias, answer.text, answer.confidence])
            else:
                result_list.append([query.query.text, query.query.alias, "", 0])
        return result_list

This makes the confidence available as the fourth item in the returned list and maintains compatibility with the previous output.

Better *InReadingOrder APIs

Hi folks & thanks for your work maintaining TRP.

Using the tool to post-process Textract results, I find the idea of the getLinesInReadingOrder function really useful... but the returned data model today is frustratingly unhelpful!

What I'd really like is methods that return the actual Line or Word objects (rather than just text), so I can still access things like the block IDs and geometries.

Today, the getTextInReadingOrder() method just returns a text string and the getLinesInReadingOrder() method returns a (particularly un-intuitive) list of [ColumnId, LineText] pairs.

  1. It doesn't make sense to me that just text is returned instead of the full objects, given the method name is getLines... and not e.g. getLineText...
  2. The concept of columns is an implementation detail of getLinesInReadingOrder() and should either be:
    a. Explicitly committed to by docstring and method renaming e.g. getLineTextsByColumn(), or
    b. Recognised as an internal heuristic and hidden from the output.

I also see that the column detection seems pretty simple as it's implemented so far and likely to do some weird things on documents like forms or posters that might have less vertically-static column layouts down the page.

So I would ask:

  • How open/resistant would we be to making breaking changes to the existing getLinesInReadingOrder API, to try to bring the naming and functionality closer together?
  • What's the perspective on documents with more advanced not-quite-columns structure: is the raw order of tokens output from Textract likely to be a better approximation of the reading order? Is there appetite to develop more sophisticated rules in TRP, or does the complexity make it a bit of a losing battle?

Amazon Textract amazon-textract-response-parser library with python throws "ValidationError: {'_schema': ['Invalid input type.']}" on expense analysis textract response

Hi,

I tried following the instructions at https://aws.amazon.com/fr/blogs/machine-learning/announcing-expanded-support-for-extracting-data-from-invoices-and-receipts-using-amazon-textract/ to parse the Textract response from a call to the get_expense_analysis API (an asynchronous API call on a PDF file).

I get a response from the API which seems to be valid json and compliant with the Textract Expense API documentation.

But, when I execute the following code that I copied from the article

from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema
t_doc = TAnalyzeExpenseDocumentSchema().load(out)
# out is the json output from the analyze expense API call (tried both text and json dictionary form to be sure)

I get following error

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
/tmp/ipykernel_7538/447086103.py in <cell line: 2>()
      1 from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema
----> 2 t_doc = TAnalyzeExpenseDocumentSchema().load(out)

~/anaconda3/envs/python3/lib/python3.8/site-packages/marshmallow/schema.py in load(self, data, many, partial, unknown)
    717             if invalid data are passed.
    718         """
--> 719         return self._do_load(
    720             data, many=many, partial=partial, unknown=unknown, postprocess=True
    721         )

~/anaconda3/envs/python3/lib/python3.8/site-packages/marshmallow/schema.py in _do_load(self, data, many, partial, unknown, postprocess)
    902             exc = ValidationError(errors, data=data, valid_data=result)
    903             self.handle_error(exc, data, many=many, partial=partial)
--> 904             raise exc
    905 
    906         return result

ValidationError: {'_schema': ['Invalid input type.']}

It is supposed to work out of the box as documented in the article.

I am using Python 3.8.12 and successfully installed amazon-textract-response-parser-0.1.30 botocore-1.24.46 marshmallow-3.14.1

What am I missing? (The output of the API call is attached: json.txt)

Any help welcome.

Christian.
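For reference, marshmallow reports {'_schema': ['Invalid input type.']} when the value passed to load() is not a mapping, for example a raw JSON string rather than a parsed dict, so that is worth ruling out first. A sketch (the file name is a placeholder):

import json
from trp.trp2_expense import TAnalyzeExpenseDocumentSchema

with open("expense_response.json") as f:
    out = json.load(f)  # must yield a dict, not the raw JSON text

assert isinstance(out, dict), f"expected dict, got {type(out)}"
t_doc = TAnalyzeExpenseDocumentSchema().load(out)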

Upgrade marshmallow to >3.18.0

Because marshmallow@3.14.1 is pinned as a hard requirement, this parser library doesn't work with marshmallow > 3.14.1 when managed by Poetry. If there is no blocker, can you please upgrade the marshmallow dependency to 3.18.0, which introduces Enum support, at least?

JavaScript vs TypeScript issues.

This code works with JavaScript but does not work with TypeScript:

[screenshot: the code]

When inspecting my combined.json file I see that it has the expected contents:

[screenshot: combined.json contents]

TypeScript (run from an AWS Lambda) is throwing this error:

[screenshot: the error]

ExecuteTableValidations on documents with no tables in some pages

Is it expected behaviour that the function ExecuteTableValidations stops looping through pages if it finds a page that does not contain any table?

By modifying the break statement to a continue statement here:

if len(current_page.tables) == 0:
    page_compare_proc += 1
    break

we can make sure to loop through other pages to check for mergeable tables.

Can't merge pipeline_merge_tables if 1st page is missing a table #1

Issue

I was trying to get pipeline_merge_tables working and ended up finding a small issue. The default validation function breaks when there are no tables in the current or next page, which means the pipeline can't scan any later pages for merging.

After poking around a bit I noticed that it's because of the breaks here:


Workaround

Opening a PR to fix this, but for now, if you need a workaround, just change these break statements to continue locally.

README License mismatch

Hi folks - I just noticed the README.md footer cites an MIT-0 license, which is different from the LICENSE file, Python setup.py, etc.

Please update to avoid any possible confusion!

[DOCUMENTATION] Wrong package for pip install

The documentation in the src-python/README.md file says to install the parser via pip install textract-response-parser, but the install does not work; pip says the package is not available. Moreover, the source file is no longer available.

GoLang version?

Is there a Golang version in the works? If not, I'd love to contribute!

`merge_tables` does not handle merged child-table header columns properly

Issue

The expectation from using merge_tables is that it will convert 2 tables like this:

Headers: 
C1 C2 C3
Rows:
R1
R2

Headers: 
R3
Rows:
R4

into:

Headers: 
C1 C2 C3
Rows:
R1
R2
R3
R4

But instead the output is:

Headers: 
C1 C2 C3
R3
Rows:
R1
R2
R4

This is because headers are identified by the presence of COLUMN_HEADER in a block's EntityTypes field, and the child table's top blocks keep this entity type even after merging. The solution is to simply drop this value from EntityTypes, if present, while merging.
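A sketch of that fix (illustrative only; demote_child_header_cells is a hypothetical helper, not the library's actual merge code):

COLUMN_HEADER = "COLUMN_HEADER"

def demote_child_header_cells(child_header_cell_blocks):
    """Strip the COLUMN_HEADER entity type from the cells of a table
    being merged in, so only the parent table keeps a header row."""
    for cell_block in child_header_cell_blocks:
        entity_types = cell_block.get("EntityTypes") or []
        if COLUMN_HEADER in entity_types:
            entity_types.remove(COLUMN_HEADER)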

EntityTypes in APICellBlock in src-js

Would it be a good idea to add EntityTypes to ApiCellBlock in src-js? EntityTypes for BlockType "CELL" can be useful for finding COLUMN_HEADERs instead of assuming that cellsAt(1, null) will always be the header, since header detection misses in certain cases. This would give developers the flexibility to fall back to cellsAt(1, null) as headers when a cell block with "EntityTypes": [ "COLUMN_HEADER" ] is missing.

{
    "BlockType": "CELL",
    "Confidence": 93.32925415039062,
    "RowIndex": 1,
    "ColumnIndex": 1,
    "RowSpan": 1,
    "ColumnSpan": 1,
    "Geometry": {...},
    "Id": "cda64a58-28d2-47d9-857e-bd7fd9f99d57",
    "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": [  "b49e883c-bd8b-43e2-aed6-0aa93a52b2b1" ]
                }
    ],
    "EntityTypes": [  "COLUMN_HEADER" ]
}

Table.rows_without_header function adds duplicate non_header_rows - the call is 1-level too deep

The function checks whether a row is a header and appends the row inside the cell for-loop (adding the row once per cell). The append should be moved one level out, into the row for-loop instead:
https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/trp/__init__.py#L431

Original:

    @property
    def rows_without_header(self) -> List[Row]:
        non_header_rows: List[Row] = list()
        for row in self.rows:
            header = False
            for cell in row.cells:
                for entity_type in cell.entityTypes:
                    if entity_type == ENTITY_TYPE_COLUMN_HEADER:
                        header = True
                if not header:
                    non_header_rows.append(row)
        return non_header_rows

New:

    @property
    def rows_without_header(self) -> List[Row]:
        non_header_rows: List[Row] = list()
        for row in self.rows:
            header = False
            for cell in row.cells:
                for entity_type in cell.entityTypes:
                    if entity_type == ENTITY_TYPE_COLUMN_HEADER:
                        header = True
            if not header: # moved this left one tab
                non_header_rows.append(row) # moved this left one tab
        return non_header_rows

Bug in parsing for multi-page Documents

When parsing some multi-page outputs, there's a bug in the trp.py file: the keys in the block map at line 119 are not found, and a KeyError exception is thrown.

PS: I have also filed this issue in the AWS Samples repo and with another source for trp.

The full traceback, if needed, is here:

Traceback (most recent call last):
  File "textract_kv.py", line 72, in <module>
    lines_to_file(resp_to_keyValues(response), fp)
  File "/path/to/proj/resp_parser.py", line 28, in resp_to_keyValues
    doc = Document(response)
  File "/path/to/user/.pyenv/versions/anaconda3-2019.03/envs/textract/lib/python3.7/site-packages/trp/__init__.py", line 633, in __init__
    self._parse()
  File "/path/to/user/.pyenv/versions/anaconda3-2019.03/envs/textract/lib/python3.7/site-packages/trp/__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
  File "/path/to/user/.pyenv/versions/anaconda3-2019.03/envs/textract/lib/python3.7/site-packages/trp/__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "/path/to/user/.pyenv/versions/anaconda3-2019.03/envs/textract/lib/python3.7/site-packages/trp/__init__.py", line 530, in _parse
    l = Line(item, blockMap)
  File "/path/to/user/.pyenv/versions/anaconda3-2019.03/envs/textract/lib/python3.7/site-packages/trp/__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
KeyError: '2a50d992-0aa4-4cda-8c87-0b88e8dccab9'

Align classes for Lending extraction objects

The TExtraction classes for expense_document and identity_document are not aligned with the actual deserializer. These have to be changed:
TAnalyzeExpenseDocument -> TExpense
TAnalyzeIdDocument -> TIdentityDocument

To add fields to ID Document schema

The ID schema cannot be consistent across all document types (passport, license, bank cheques, etc.), so is there any way to add new fields to the schema? That way more use cases could be covered, since it could be used with any document template.

Not able to extract Textract merge cell text properly

Not able to extract merged-cell text properly. There is some issue with the combine_headers function; Textract is not able to extract the top header text properly.

Reference:
t_doc = TDocumentSchema().load(textract_json)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = Document(TDocumentSchema().dump(ordered_doc))
Now let’s iterate through the tables’ content, and extract the data into a DataFrame:

import pandas as pd

table_index = 1
dataframes = []

def combine_headers(top_h, bottom_h):
    bottom_h[3] = top_h[2] + " " + bottom_h[3]
    bottom_h[4] = top_h[2] + " " + bottom_h[4]

for page in trp_doc.pages:
    for table in page.tables:
        table_data = []
        headers = table.get_header_field_names()  # New Table method to retrieve header column names
        if len(headers) > 0:  # Let's retain only the tables with headers
            print("Statement headers: " + repr(headers))
            top_header = headers[0]
            bottom_header = headers[1]
            combine_headers(top_header, bottom_header)  # The statement has two headers; let's combine them
            for r, row in enumerate(table.rows_without_header):  # New Table attribute returning rows without headers
                table_data.append([])
                for c, cell in enumerate(row.cells):
                    table_data[r].append(cell.mergedText)  # New Cell attribute returning merged cells' common values
            if len(table_data) > 0:
                df = pd.DataFrame(table_data, columns=bottom_header)

Document table format:
[screenshot]

With the above logic:
[screenshot]

With small changes in the combine_headers function, my issue got solved to some extent:

def combine_headers(top_h, bottom_h):
    for i in range(len(top_h)):
        if bottom_h[i] != top_h[i]:
            bottom_h[i] = top_h[i] + ' ' + bottom_h[i]
        else:
            bottom_h[i] = bottom_h[i]

But there is still some issue with Textract's top-header detection:
[screenshot]

Block type: PARAGRAPH?

Hi.

Thank you for the sample!

I saw a few intro videos about Textract where they mentioned that there are several block types: page, paragraph, line, word. But actually there is no paragraph type, is there?

I need to extract paragraphs from a PDF document, so if there is a way to do that, please let me know!

Thank you!

__compare_table_headers always returns False

In the __compare_table_headers function, the cells themselves are compared rather than their values, and since the Cell class implements no __eq__ method, the comparison checks whether the objects are identical.

Either Cell needs an __eq__ method, or __compare_table_headers should compare cell text rather than cell objects.
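A sketch of the second option (illustrative, not the library's code; it assumes trp-style rows whose cells expose a text property):

def compare_table_headers(header_row_1, header_row_2):
    """Compare two header rows by cell text rather than object identity."""
    texts_1 = [cell.text.strip() for cell in header_row_1.cells]
    texts_2 = [cell.text.strip() for cell in header_row_2.cells]
    return texts_1 == texts_2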
