aws-solutions / document-understanding-solution

Example of integrating & using Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, Amazon Kendra to automate the processing of documents for use cases such as enterprise search and discovery, control and compliance, and general business process workflow.

Home Page: https://aws.amazon.com/solutions/implementations/document-understanding-solution/

License: Apache License 2.0

Languages: JavaScript 53.70%, Python 23.43%, TypeScript 10.68%, SCSS 10.19%, Shell 1.99%
Topics: amazon-comprehend, amazon-textract, aws-machine-learning, machine-learning, aws, amazon-kendra, amazon-elasticsearch, cdk, aws-cdk

document-understanding-solution's Introduction

Deprecation Notice

As of 09/14/2023, Document Understanding Solution has been deprecated and will not be receiving any additional features or updates. We encourage customers to explore the new solution: https://aws.amazon.com/solutions/implementations/enhanced-document-understanding-on-aws/.

Document Understanding Solution

DUS leverages the power of Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, Amazon OpenSearch Service, and Amazon Kendra to provide digitization, domain-specific data discovery, redaction controls, structural component extraction, and other document processing and understanding capabilities.


Architecture Diagram

(Architecture diagram image.)

Note

Current document formats supported: PDF, JPG, PNG

Current maximum document file size supported: 150 MB

Current concurrent document uploads (via UI) supported: 100

1. CICD Deploy

Requirements

  • aws cli

    sudo yum -y install aws-cli

  • pip3 (Required to install packages)

    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py

Getting Started with CICD Deploy

  • Create a bucket to act as the target Amazon S3 distribution bucket

Note: You will have to create an S3 bucket named using the template 'my-bucket-name-<aws_region>', where aws_region is the region in which you are testing the customized solution.

For example, for us-east-1 you would create a bucket called my-solutions-bucket-us-east-1.
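
A minimal sketch of creating that distribution bucket with the AWS CLI (the bucket name is just the example above; substitute your own name and region):

    aws s3 mb s3://my-solutions-bucket-us-east-1 --region us-east-1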

  • Now build the distributable:
chmod +x ./deployment/build-s3-dist.sh
./deployment/build-s3-dist.sh <bucket-name-minus-region> <solution-name> <version>

For example,

./deployment/build-s3-dist.sh my-solutions-bucket document-understanding-solution v1.0.0
  • Deploy the distributable to an Amazon S3 bucket in your account. Note: you must have the AWS Command Line Interface installed.
aws s3 cp ./deployment/global-s3-assets/ s3://my-bucket-name-<aws_region>/<solution_name>/<my-version>/ --recursive --acl bucket-owner-full-control --profile aws-cred-profile-name
aws s3 cp ./deployment/regional-s3-assets/ s3://my-bucket-name-<aws_region>/<solution_name>/<my-version>/ --recursive --acl bucket-owner-full-control --profile aws-cred-profile-name
  • Get the link of the document-understanding-solution.template uploaded to your Amazon S3 bucket.
  • Deploy the Document Understanding solution to your account by launching a new AWS CloudFormation stack using the link of the document-understanding-solution.template.
  • If you wish to manually choose whether to enable Kendra or Read-only mode (defaults 'true' and 'false', respectively), you need to add ParameterKey=KendraEnabled,ParameterValue=<true_or_false> and ParameterKey=ReadOnlyMode,ParameterValue=<true_or_false> after the email parameter when calling create-stack, as shown in the sketch after the command below.
aws cloudformation create-stack --stack-name DocumentUnderstandingSolutionCICD --template-url https://my-bucket-name-<aws_region>.s3.amazonaws.com/<solution_name>/<my_version>/document-understanding-solution.template --parameters ParameterKey=Email,ParameterValue=<my_email> --capabilities CAPABILITY_NAMED_IAM --disable-rollback
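
For reference, a sketch of the same create-stack call with the optional mode parameters from the previous step added; the parameter values shown here are examples only:

    aws cloudformation create-stack \
      --stack-name DocumentUnderstandingSolutionCICD \
      --template-url https://my-bucket-name-<aws_region>.s3.amazonaws.com/<solution_name>/<my_version>/document-understanding-solution.template \
      --parameters ParameterKey=Email,ParameterValue=<my_email> \
                   ParameterKey=KendraEnabled,ParameterValue=true \
                   ParameterKey=ReadOnlyMode,ParameterValue=false \
      --capabilities CAPABILITY_NAMED_IAM --disable-rollback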

This solution will create 7 S3 buckets that need to be manually deleted when the stack is destroyed (CloudFormation will only delete the solution-specific CDK toolkit bucket; the rest are preserved to prevent accidental data loss). A cleanup sketch follows the list below.

  • 2 for CICD
  • 1 for the solution-specific CDK Toolkit
  • 2/3 for documents (sample and general documents, and optionally 1 for Medical sample documents if opting for Amazon Kendra integration)
  • 1 for the client bucket
  • 1 for access logs
  • 1 for CDK toolkit (if this is the customer's first try with CDK)
  • 1 for document bulk processing pipeline
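
Because these buckets are preserved, they have to be emptied and removed by hand once the stack is destroyed. A minimal sketch with the AWS CLI; the bucket name is a placeholder, and the commands would be repeated for each leftover bucket:

    # Empty the bucket first, then remove it
    aws s3 rm s3://<leftover-bucket-name> --recursive
    aws s3 rb s3://<leftover-bucket-name>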

The solution is set up to reserve Lambda concurrency quota. This both limits the scale of concurrent Lambda invocations and ensures sufficient capacity is available for the smooth functioning of the demo. You can tweak the "API_CONCURRENT_REQUESTS" value in source/lib/cdk-textract-stack.ts to change the Lambda concurrency limits.
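
To locate the setting before editing it, a quick search works; the constant name and file path are the ones given in the note above:

    grep -n "API_CONCURRENT_REQUESTS" source/lib/cdk-textract-stack.ts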

Notes

  • Do NOT change the cicd field in package.json. This field is for the deployment system to use in CodePipeline.
  • Due to limitations of CodeCommit, you cannot use this deploy approach if you add a file to the solution that is above 6 MB (for good measure, stay below 5 MB).

Development Deploy

The instructions below cover installation on Unix-based operating systems like macOS and Linux. You can use an AWS Cloud9 environment or an EC2 instance (recommended: t3.large or higher on the Amazon Linux platform) to deploy the solution.

Requirements

Please ensure you install all requirements before beginning the deployment. (A combined install sketch for Amazon Linux follows this list.)

  • aws cli

    sudo yum -y install aws-cli

  • node 10+

    sudo yum -y install nodejs

  • yarn

    curl --silent --location https://dl.yarnpkg.com/rpm/yarn.repo | sudo tee /etc/yum.repos.d/yarn.repo

    sudo yum -y install yarn

  • tsc

    npm install -g typescript

  • jq

    sudo yum -y install jq

  • moto (Required for running the tests)

    pip install moto==2.3.2

  • pip3 (Required to install packages)

    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py
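
For a fresh Amazon Linux environment, the individual commands above can be combined into one pass. A minimal sketch using the same packages and versions listed here, with pip installed before moto so that the moto install works:

    # Install CLI tooling available from yum
    sudo yum -y install aws-cli nodejs jq
    # Add the Yarn repository, then install yarn
    curl --silent --location https://dl.yarnpkg.com/rpm/yarn.repo | sudo tee /etc/yum.repos.d/yarn.repo
    sudo yum -y install yarn
    # TypeScript compiler
    npm install -g typescript
    # pip3, then moto for the tests
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py
    pip install moto==2.3.2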

To deploy using this approach, you must first set a few values inside the package.json file in the source folder.

  • Set your deployment region in the stack->region property, replacing "%%REGION%%". This deployment will not pull the AWS region from your current AWS profile.

    Note: The AWS services used in this solution are not all available in all AWS Regions. Supported regions include us-east-1, us-west-2, and eu-west-1. Please refer to the AWS Regions Table for the most up-to-date information on which Regions support all of the services used in DUS.

  • Enter your email into the email property, replacing "%%USER_EMAIL%%"

  • If you want to use the Classic mode, set the enableKendra flag to false. For Kendra-enabled mode, set the flag to true.

  • If you want to use the Read-only (RO) mode, set the isROMode flag to true. (A sketch of these package.json fields follows this list.)
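
A rough sketch of how the fields described above might look once filled in. The key names come from the notes above; the exact placement of each key should follow wherever it already appears in source/package.json, all other fields in the file are omitted, and the values shown are examples only:

    {
      "stack": {
        "region": "us-east-1"
      },
      "email": "your-email@example.com",
      "enableKendra": "false",
      "isROMode": "false"
    }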

Now switch to the source directory, and use yarn to deploy the solution:

cd ./source
yarn && yarn deploy

The CLI will prompt for approval of IAM roles and permissions twice during the full deploy: once for the backend stack and again for the client stack. The CLI will also prompt for an email. After the deploy is complete, an email with credentials for logging in will be sent to the address provided.

Note:

This will create 5 or 6 S3 buckets that will have to be manually deleted when the stack is destroyed (CloudFormation does not delete them, in order to avoid data loss).

  • 2/3 for documents (sample and general documents and optionally 1 for Medical sample documents if opting for Amazon Kendra Integration)
  • 1 for the client stack
  • 1 for document bulk processing pipeline
  • 1 for CDK toolkit (if this is your first time using CDK)

The solution is set up to reserve Lambda concurrency quota. This both limits the scale of concurrent Lambda invocations and ensures sufficient capacity is available for the smooth functioning of the demo. You can tweak the "API_CONCURRENT_REQUESTS" value in source/lib/cdk-textract-stack.ts to change the Lambda concurrency limits.

Development Deploy Commands

  • yarn deploy:backend : deploys or updates the backend stack
  • yarn deploy:client : deploys or updates the client app
  • yarn deploy:setup-samples : push sample docs to s3
  • yarn deploy:setup-user : initiates prompts to set up a user
  • yarn deploy:show : displays the url of the client app
  • yarn destroy : tears down the CloudFormation backend and client stacks

Development Deploy Workflow and stack naming

The stackname value in the package.json scripts sets the stack name for the deploy commands. Throughout development it has been necessary to maintain multiple stacks so that client app development and stack architecture development can proceed without creating breaking changes. When a new stackname is merged into develop, it should have the most up-to-date deployments.

Developing Locally

Once deployed into the AWS account, you can also run the application locally for web development. This application uses next.js along with next-scss; all documentation for those packages applies here. NOTE: This application uses the static export feature of next.js, so be aware of the limited features available when using static export.

Start Dev Server

  • Clone this repository
  • Run yarn to install/update packages
  • Run yarn dev
  • Navigate to http://localhost:3000
  • NOTE: The dev build is noticeably slower than the production build because pages are built/unbuilt on-demand. Also, the code in the dev build is uncompressed and includes extra code for debugging purposes.

Generate Production Build

  • Run yarn export to create a static export of the application.
  • In a terminal, go to the app/out directory and run python -m SimpleHTTPServer (see the sketch below for a Python 3 alternative)
  • Navigate to http://localhost:8000
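
The steps above can be chained in a single line; python3 -m http.server is the Python 3 equivalent of the Python 2-only SimpleHTTPServer module mentioned in the list:

    # Export the static site to app/out, then serve it locally on port 8000
    yarn export && cd app/out && python3 -m http.server 8000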

Code Quality Tools

This project uses Prettier to format code. It is recommended to install a Prettier extension for your editor and configure it to format on save. You can also run yarn prettier to auto-format all files in the project (make sure you do this on a clean working copy so you only commit formatting changes).

This project also uses ESLint and sass-lint to help find bugs and enforce code quality/consistency. Run yarn lint:js to run ESLint. Run yarn lint:css to run sass-lint. Run yarn lint to run them both.

Generating License Report

Run yarn license-report to generate a license report for all npm packages. See output in license-report.txt.

DUS Modes:

Classic Mode

This mode corresponds to the first release of the DUS solution. The major services included in this mode are Amazon OpenSearch Service, Amazon Textract, Amazon Comprehend, and Amazon Comprehend Medical, which enable digitization, information extraction, and indexing in DUS.

Kendra-Enabled Mode

In the Classic version, DUS supports searching/indexing of documents using Amazon OpenSearch Service. In the Kendra-enabled mode, Amazon Kendra is added as an additional capability and can be used to explore features such as semantic search, adding FAQs, and access control lists. Simply set enableKendra: "true" in package.json. Note: the Amazon Kendra Developer Edition is deployed as part of this deployment.

Read-Only Mode

In this mode, DUS will only be available in Read-Only mode and you will only be able to analyze the pre-loaded documents. You will not be able to upload documents from the web application UI. In order to enable the Read-Only mode, set isROMode: "true" in package.json. By default, this mode is disabled.

Notes

Document Bulk Processing

DUS supports bulk processing of documents. During deploy, an S3 bucket for document bulk processing is created. To use the bulk processing mode, simply upload documents under the documentDrop/ prefix. In Kendra mode, you can also upload the corresponding access control list under the policy/ prefix in the same bucket, following the naming convention <document-name>.metadata.json. Be sure to upload the access control policy first and then the document, as sketched below.
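
A minimal sketch of a bulk upload with the AWS CLI. The bucket and file names are placeholders, and the metadata file name assumes the <document-name>.metadata.json convention above resolves to report.pdf.metadata.json for a document named report.pdf:

    # Kendra mode only: upload the access control policy first
    aws s3 cp report.pdf.metadata.json s3://<bulk-processing-bucket>/policy/report.pdf.metadata.json
    # Then upload the document itself under documentDrop/ to trigger processing
    aws s3 cp report.pdf s3://<bulk-processing-bucket>/documentDrop/report.pdf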

Other

  • To switch between the DUS Classic version and the Amazon Kendra-enabled version, please perform a fresh deploy (either in a different region or after deleting the existing stack) and avoid updating the CloudFormation stack of the existing mode. Currently, DUS does not support seamlessly switching between the two modes. More info is available in this issue.
  • Do NOT change the cicd field in package.json. This field is for the deployment system to use in CodePipeline.
  • Due to limitations of CodeCommit, you cannot use this deploy approach if you add a file to the solution that is above 6 MB (for good measure, stay below 5 MB).

Cost

  • As you deploy this sample application, it creates different resources (an Amazon S3 bucket, an Amazon SQS queue, an Amazon DynamoDB table, OpenSearch Service (and potentially Amazon Kendra) clusters, AWS Lambda functions, etc.). When you analyze documents, it calls different APIs (Amazon Textract, Amazon Comprehend, and Amazon Comprehend Medical) in your AWS account. You will be charged for all the API calls made as part of the analysis as well as for any AWS resources created as part of the deployment. To avoid any recurring charges, delete the stack using "yarn destroy".

  • The CDK Toolkit stacks that are created during deploy of this solution are not destroyed when you tear down the solution stacks. If you want to remove these resources, delete the S3 bucket that contains staging-bucket in the name, and then delete the CDKToolkit stack (see the sketch after this list).

  • You are responsible for the cost of the AWS services used while running this reference deployment. The solution consists of some resources that are billed by the hour or by size, such as Amazon OpenSearch Service, Amazon Kendra, and Amazon S3, while others are serverless technologies where costs are incurred depending on the number of requests. The approximate cost of the solution for 100 documents/day comes to under $20/day for the Classic Mode and under $80/day for the Kendra-Enabled Mode. For accurate and up-to-date pricing information, refer to AWS Pricing.
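
A minimal cleanup sketch for the CDK Toolkit leftovers mentioned above. The bucket name is a placeholder (look for the one containing staging-bucket in the name), and the stack name shown is the CDKToolkit stack referenced above; confirm both in your account before deleting:

    # Empty and remove the CDK staging bucket
    aws s3 rb s3://<bucket-containing-staging-bucket-in-the-name> --force
    # Then delete the toolkit stack itself
    aws cloudformation delete-stack --stack-name CDKToolkit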

Delete demo application

  1. CICD Deploy:

Either run aws cloudformation delete-stack --stack-name {CICD stack}, or go to CloudFormation in the AWS Console and delete the stack that ends with "CICD". You will also have to go to CodeCommit in the console and manually delete the repository that was created during the deploy; a CLI sketch of both steps follows.
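
A sketch of performing both steps from the CLI. The stack and repository names are placeholders; the repository name is whatever was created in CodeCommit during the deploy:

    aws cloudformation delete-stack --stack-name <your-CICD-stack-name>
    aws codecommit delete-repository --repository-name <repository-created-during-deploy>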

  2. Development Deploy:

Make sure you are in the source directory, and then run yarn destroy.

License

This project is licensed under the Apache-2.0 License. You may not use this file except in compliance with the License. A copy of the License is located at http://www.apache.org/licenses/

Additional Notes

The intended use is for users to use this application as a reference architecture to build production ready systems for their use cases. Users will deploy this solution in their own AWS accounts and own the deployment, maintenance and updates of their applications based on this solution.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

The searchable PDF functionality is included as a pre-compiled jar binary. See the following README for more information: source/lambda/pdfgenerator/README.md

document-understanding-solution's People

Contributors

alexchirayath, amazon-auto, bios6, dependabot[bot], fhoueto-amz, gwp, jamesnixon-aws, jbt, kazbaig, knihit, pierreaws, shivanimehendarge, tabdunabi, tbelmega, vishaalkapoor


document-understanding-solution's Issues

Error using `--log-group-name` with `aws logs` command

Describe the bug
The solution will not deploy, due to errors with some AWS CLI commands.

To Reproduce
Try to deploy, either CI/CD or "Development" style.

Expected behavior
It should deploy.

Please complete the following information about the solution:

  • Version: latest
  • Region: us-east-1
  • Was the solution modified from the version published on this repository? no
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses? yes
  • Were there any errors in the CloudWatch Logs? yes

Screenshots
Sorry, I didn't get any at the time.

Additional context
In the update-es-logs-and-client-stack-vars.sh file, the following commands fail, due to ambiguity:

INDEX_LOG_ARN=$(aws logs describe-log-groups --region $AWS_REGION --log-group-name $ElasticSearchIndexLogGroup | jq -r '.logGroups[0].arn')
SEARCH_LOG_ARN=$(aws logs describe-log-groups --region $AWS_REGION --log-group-name $ElasticSearchSearchLogGroup | jq -r '.logGroups[0].arn')

The --log-group-name parameter needs to be changed to --log-group-name-prefix.

INDEX_LOG_ARN=$(aws logs describe-log-groups --region $AWS_REGION --log-group-name-prefix $ElasticSearchIndexLogGroup | jq -r '.logGroups[0].arn')
SEARCH_LOG_ARN=$(aws logs describe-log-groups --region $AWS_REGION --log-group-name-prefix $ElasticSearchSearchLogGroup | jq -r '.logGroups[0].arn')

I'm not sure if this is because of changes to the build image that is being used in the CI/CD process, but it also fails locally, when doing the source deploy. I am using the latest version of the AWS CLI: aws-cli/2.10.1 Python/3.11.2 Darwin/22.4.0 source/arm64 prompt/off.

Error: Class not found: DemoLambdaV2

Describe the bug
When ingesting a document in the web UI, job completes. But when I try to see the results in the solution, nothing comes up. Errors show that the searchable.pdf file that should be created is not found. Digging into the Lambdas, I found that "DUSStack-dusstackpdfgenerator" had an error in the CloudWatch logs.

Class not found: DemoLambdaV2: java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: DemoLambdaV2
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)

To Reproduce

  1. Deploy following "Development Deployment" section
  2. After deployment successfully completes, log into solution
  3. "Upload your own documents"
  4. Add Management Report example
  5. Wait for pipeline to complete
  6. Click on "management" in document list

Expected behavior
Expect to see the document to show up and rest of the metadata/interface.

Please complete the following information about the solution:

  • Version: v1.0.3
  • Region: us-east-1
  • Was the solution modified from the version published on this repository? Yes
  • If the answer to the previous question was yes, are the changes available on GitHub? Yes; included updates mentioned in older Issues to enable the solution to deploy
  • Have you checked your service quotas for the services this solution uses? Yes
  • Were there any errors in the CloudWatch Logs? Yes

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context
Add any other context about the problem here.

Slow search experience

The current version of the solution uses Elasticsearch in the backend to retrieve search results. Before returning the results to the browser, the solution uses Lambda code to parse the matching documents to count the number of matches as well as to extract snippets of the documents that match the search phrase.
This makes search noticeably slower, especially when several large documents are present with multiple occurrences of the search phrase.

[CloudFormation Deployment Failure] Nodejs10.x is no longer supported in AWS Lambda

Describe the bug

End of support of node.js 10x on AWS Lambda has been reached on July 30, 2021. So installing the stack referred to here: https://docs.aws.amazon.com/solutions/latest/document-understanding-solution/automated-deployment.html fails with the following error

Resource handler returned message: "The runtime parameter of nodejs10.x is no longer supported for creating or updating AWS Lambda functions. We recommend you use the new runtime (nodejs14.x) while creating or updating functions. 

To Reproduce
Attempt to Install the stack.

Expected behavior
Installation completes correctly.

Please complete the following information about the solution:

  • Version: [v1.0.2]
  • Region: [eu-west-1]

Allow control over the size of the Elasticsearch instance when deploying

Is your feature request related to a problem? Please describe.
As of now, the solution fixes the instance size for ES to m5.large.elasticsearch, which is larger than what we need in our use case. This can drive up the base monthly cost.

Describe the feature you'd like
Set the value of instanceType based on the value of a new CloudFormation template parameter (a new prop in TextractStackProps)
https://github.com/awslabs/document-understanding-solution/blob/b375268de159ea52296a3747506eb154962b4a2f/source/lib/cdk-textract-stack.ts#L304

Bug described in AWS DUS (Case 9497917291)

Describe the bug
Uploaded 3 pdf files (reasonably large). 2 files resulted in status "Failed", 1 file has Status "Ready", but when clicked upon, no data appears.

To Reproduce
Please upload attached files into a DUS (kendra enabled)

Expected behavior

Please complete the following information about the solution:

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context
Add any other context about the problem here.

ApiGateway not updating via CodeBuild

Describe the bug
My client is trying to hit an old APIGateway endpoint after updating the stack via codebuild. I have calls being made to https://oakcalov4g.execute-api..... when the only live gateway is now https://bed2g6zyd1.execute-api....

To Reproduce
Deploy stack
Make some change to the codecommit repo
Kick off a codeBuild for the stack

Expected behavior
Updated stack should point to the correct endpoint.

Please complete the following information about the solution:

  • Version: v1.0.1
  • Region: us-east-1
  • [Y] Was the solution modified from the version published on this repository?
  • [N] If the answer to the previous question was yes, are the changes available on GitHub?
    No changes have been made to any sort of deployment scripts. Just tinkering with the site & excel file generation.
  • [Y] Have you checked your service quotas for the services this solution uses?
  • [N] Were there any errors in the CloudWatch Logs?

I would assume that the API Gateway endpoint needs to be updated in what seems like a .env file, but I have no idea where that file lives.
I also kicked a CodeBuild off manually, not via the CodePipeline. I can't see why this would matter, but I recognize that's different from the turnkey solution.

DUSClient is not creating on our cloudformation

Hi Team,

I got the email from DUS login, but the problem here is that it is not creating the DUSClient CloudFormation stack.

I got the email below when the DUS stack was created:

You are invited to try the Document Understanding Solution. Your credentials are:

Username: ************
Password:************

Please wait until the deployment has completed for both DUS and DUSClient stacks before accessing the website

Please sign in with the user name and your temporary password provided above at:
https://*************

The email clearly states that I need both the DUS and DUSClient stacks created before accessing the website, but the DUSClient stack is not being created.

Thanks
Chandra

CodeBuild Fails

Describe the bug
I am launching the DUS from the Launch in AWS Console link here https://aws.amazon.com/solutions/implementations/document-understanding-solution/

the Cloudformation deploys but the CodeBuild fails during Build phase with the following error:

[Container] 2021/01/12 17:38:53 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: yarn deploy-all. Reason: exit status 1

522 | [Container] 2021/01/12 17:38:53 Entering phase POST_BUILD

Please complete the following information about the solution:
Version: [e.g. v1.0.0]
Region: [e.g. us-east-1]

CICD Deployment failure

Describe the bug
CICD Deployment failure

To Reproduce
Follow the CICD Deployment steps.

Expected behavior
Deployment succeeds.

Please complete the following information about the solution:

  • Version: v1.0.3
  • Region: us-east-1
  • Was the solution modified from the version published on this repository? No
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses? N/A
  • Were there any errors in the CloudWatch Logs? No

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context
I did all the steps from CICD Deploy section
it fails when reaching
aws cloudformation create-stack --stack-name DocumentUnderstandingSolutionCICD --template-url...

  1. It complains it cannot find an S3 bucket when creating the CICDHelper Lambda
    This seems to be caused by
    https://github.com/awslabs/document-understanding-solution/blob/main/deployment/document-understanding-solution.template#L263
    I don't see the reason to join bucket name with AWS:Region. (bucket name should be enough)

  2. After fixing the bucket problem, it complains about missing the S3 Key:
    https://github.com/awslabs/document-understanding-solution/blob/main/deployment/document-understanding-solution.template#L271
    There is no such file in the repo. I guess it is supposed to be an archive (zipped) with the files from here:
    https://github.com/awslabs/document-understanding-solution/tree/main/deployment/document-understanding-cicd

  3. CloudFormation did not receive a response from your Custom Resource.
    After fixing 1&2, I've reached this error: "CloudFormation did not receive a response from your Custom Resource.Please check your logs for requestId..."
    CW Logs ? I don't have any useful CW logs. Any idea?

cdk always redeploying full stack instead of update

The CDK stack seems to be initialized with a different uuid every time it gets compiled.
This causes the resources to be completely different and thus causes a full deployment instead of an incremental update.

 this.resourceName = (name: any) =>
      `${id}-${name}-${this.uuid}`.toLowerCase();

    this.uuid = uuid.generate();

Possible Fix:
Provide a custom/random but fixed suffix for the resource names.

Switching between Kendra and Classic DUS mode

Describe the bug
Currently, when I switch from the Classic DUS mode to the Amazon Kendra-enabled version (or vice versa), the changes are not reflected in the UI even though the infrastructure pieces are updated.

To Reproduce
Switch the enableKendra flag in package.json file
Redeploy on existing stack

Expected behavior
Seamless transition between the 2 DUS modes after redeploy

Please complete the following information about the solution:

  • Version: kendraMaster branch, v2.0.0
  • Region: us-east-1
  • Was the solution modified from the version published on this repository? No
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses?
  • Were there any errors in the CloudWatch Logs?

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).
N.A
Additional context
Add any other context about the problem here.

Cloudfront url errors "No Such Key" after successful deployment

Describe the bug
I deployed the DUS solution following the exact steps in CI/CD deployment documentation. It was successful but nothing loads when I access the link received in the email.

To Reproduce
Same steps as in documentation

Expected behavior
Should load the web app

Please complete the following information about the solution:

  • Version: v1.0.0
  • Region: us-east-1
  • Was the solution modified from the version published on this repository? No
  • Have you checked your service quotas for the services this solution uses? Yes
  • Were there any errors in the CloudWatch Logs? No

Screenshots

Additional context
Lambda concurrency quota available - 1000
Client apps s3 bucket is empty

Test and Deployment Failure

Describe the bug
A clear and concise description of what the bug is.
The deployment for DUS is failing due to the test failure

To Reproduce
Deploy the solution

Expected behavior

Successful deployment

Please complete the following information about the solution:

  • Version: [e.g. v1.0.0]
  • Region: [e.g. us-east-1] All
  • Was the solution modified from the version published on this repository? N
  • If the answer to the previous question was yes, are the changes available on GitHub? N
  • Have you checked your service quotas for the services this solution uses? Y
  • Were there any errors in the CloudWatch Logs? CodeBuild logs which are a part of the CICD deployment show the tests failing

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context

The test file associated with DUS is failing.
This test is run once as part of the deployment command yarn:deploy-all. Hence, all deployments that include the test are now failing.

This affects the

  • CICD deployment methods
  • AWS Solutions deployment (which essentially uses the CICD deploy)
  • Development Deploy (by default the command runs the test file, but it can be excluded if the user updates the yarn:deploy-all command)

Initial research shows that this is likely related to upgrading ini and moto, which the development team is now in the process of integrating.

self.default_session_mock.stop()
  File "/root/.pyenv/versions/3.8.1/lib/python3.8/site-packages/mock/mock.py", line 1563, in stop
    return self.__exit__(None, None, None)
  File "/root/.pyenv/versions/3.8.1/lib/python3.8/site-packages/mock/mock.py", line 1529, in __exit__
    if self.is_local and self.temp_original is not DEFAULT:
AttributeError: '_patch' object has no attribute 'is_local'

Quick Fix

The Development Deploy and AWS Solutions site deployment will not work until the issue is fixed.
However, users can choose to deploy via the Development Deploy mode by manually updating the yarn:deploy-all command in source/package.json to exclude running the tests, and proceed with that.

Error when importing sample medical document

Describe the bug
When adding 4 sample medical documents, "Medical History Form" fails to be processed

To Reproduce
Click "upload your own document" link
Click "Add" for Medical (4 documents)

Expected behavior
All 4 medical documents processed successfully

Please complete the following information about the solution:

  • Version: v1.0.11, v1.0.12
  • Region: us-east-1, us-east-2
  • Was the solution modified from the version published on this repository? No
  • [n/a] If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses?
  • Were there any errors in the CloudWatch Logs? Yes

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context
Error from CloudWatch:
Log group: /aws/lambda/DUSStack-dusstacksyncprocessorpzdnsmffxjbep3hg1wbq-Iajz2TVeUv28

[ERROR] AttributeError: 'list' object has no attribute 'add'
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 187, in lambda_handler
    return processRequest(request)
  File "/var/task/lambda_function.py", line 160, in processRequest
    processImage(documentId, features, bucketName, outputBucketName,
  File "/var/task/lambda_function.py", line 131, in processImage
    comprehendAndMedicalEntities[key].add(val)

DUS fails to deploy due to nodejs12 is not longer supported by Lambda

Describe the bug
DUS deployment fails because Lambda functions created by CDK for custom resources still uses nodejs12 environment, which is no longer supported by Lambda.

To Reproduce
Deploy DUS solution, v1.0.11

Expected behavior
Successful deployment

Please complete the following information about the solution:

  • Version: v1.0.11
  • Region: us-east-1
  • Was the solution modified from the version published on this repository? No
  • [n/a] If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses?
  • [n/a] Were there any errors in the CloudWatch Logs?

Screenshots
Error message from failed DUSStack:

Resource handler returned message: "The runtime parameter of nodejs12.x is no longer supported for creating or updating AWS Lambda functions. We recommend you use the new runtime (nodejs18.x) while creating or updating functions. 

For resources: dusstackkendraindexprovideruanfh4dbcex3irvekny2amframeworkonTimeout88FF4ADB

Additional context
Looks like DUS currently uses CDK v1.58.0; the fix for the deprecated nodejs12 environment was introduced in v1.60.0.

CodeBuild Failing

Describe the bug
I am launching the DUS from the Launch in AWS Console link here https://aws.amazon.com/solutions/implementations/document-understanding-solution/

the Cloudformation deploys but the CodeBuild fails during Build phase with the following error:

[Container] 2021/01/12 17:38:53 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: yarn deploy-all. Reason: exit status 1

522 | [Container] 2021/01/12 17:38:53 Entering phase POST_BUILD

Please complete the following information about the solution:
Version: [e.g. v1.0.0]
Region: [e.g. us-east-1]

Allow disabling Amazon Comprehend Medical when deploying

Is your feature request related to a problem? Please describe.
The use case we have for this solution doesn't require the use of Amazon Comprehend Medical as the data is not in the medical domain. The implicit use of Amazon Comprehend Medical only adds to the variable cost without adding value.

Describe the feature you'd like

  1. Add a parameter to opt out of using Amazon Comprehend Medical when deploying.
  2. Add a way to opt out/in of using Amazon Comprehend Medical after deployment, maybe as an Environment Variable for the relevant Lambda function(s)

Web App is showing an error instead of document list

Describe the bug
When opening the web app and, for example, the Discovery page, the error "Something went wrong, please refresh the page to try again." is shown instead of the document list (see screenshot). The demo worked without problems and showed the document list three days before. There are no errors shown in the network console of the browser.

To Reproduce

  • Open DUS Demo Start page in Browser
  • click on "Discovery"
  • after a few seconds of loading, the error message is shown

Expected behavior

  • After Opening the DUS Demo Discovery Page, the document list should be shown

Please complete the following information about the solution:

  • Version: v1.0.13
  • Region: eu-west-1
  • Was the solution modified from the version published on this repository? - no
  • Have you checked your service quotas for the services this solution uses? yes
  • Were there any errors in the CloudWatch Logs? no errors

Screenshots

DUSDemo-error

Inconsistencies in upload/processing

Describe the bug
~65% of files uploaded fail to upload/be processed

To Reproduce
Deploy solution
Upload 100 pdfs
Note that on average, only ~65 of them process successfully

Expected behavior
I would've expected no more than a couple to fail per 100.

Please complete the following information about the solution:

  • [] Version: [e.g. v1.0.0]
  • [] Region: [e.g. us-east-1]
  • [y] Was the solution modified from the version published on this repository?
    I downsized the elasticsearch container is all.
  • [n] If the answer to the previous question was yes, are the changes available on GitHub?
  • [y] Have you checked your service quotas for the services this solution uses?
  • [y] Were there any errors in the CloudWatch Logs?
  • I found errors in the lambda dashboard but couldn't hunt them down in cloudwatch successfully.

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context
I thought it was odd that so many failed to process, and that it must be a bug of some sort given that many failed to even upload in the first place. Additionally, all documents that pass processing do so rather quickly (~1-2 mins), while all the failures seem to wait for a max timeout (5-10 mins). This puzzled me, as it seems like there is a max timeout with no retry logic?

yarn dev fails because of undefined publicRuntimeConfig

Describe the bug
I get an error when trying to run the solution locally (next dev). It seems like there's a missing manual step in the readme about defining the env configs.parsed.APIGateway?

To Reproduce
Follow 'running locally' steps in readme to setup, then run: yarn dev

Expected behavior
App to load on localhost:3000.

Please complete the following information about the solution:

  • Version: e.g. v1.0.1
  • Region: us-east-1
  • [n] Was the solution modified from the version published on this repository?

Additional context
ERROR:
TypeError: Cannot read property 'APIGateway' of undefined
at Object.<anonymous> (/home/shenry/freelance/AWS_Textract/document-understanding-solution/source/next.config.js:20:5)
at Module._compile (node:internal/modules/cjs/loader:1108:14)
at Object.Module._extensions..js (node:internal/modules/cjs/loader:1137:10)
at Module.load (node:internal/modules/cjs/loader:973:32)
at Function.Module._load (node:internal/modules/cjs/loader:813:14)
at Module.require (node:internal/modules/cjs/loader:997:19)
at require (node:internal/modules/cjs/helpers:92:18)
at loadConfig (/home/shenry/freelance/AWS_Textract/document-understanding-solution/source/node_modules/next/dist/next-server/server/config.js:8:100)
at new Server (/home/shenry/freelance/AWS_Textract/document-understanding-solution/source/node_modules/next/dist/next-server/server/next-server.js:1:4383)
at new DevServer (/home/shenry/freelance/AWS_Textract/document-understanding-solution/source/node_modules/next/dist/server/next-dev-server.js:1:2964)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Global document search not returning results

No documents are returned when using the search box on the "View Existing Documents" pages.

Search is functioning when opening a specific document and searching for terms within that single document.

File name uploaded with spaces fail with Bulk Processing feature

Describe the bug
When a file with spaces in its name is uploaded to the S3 bucket for bulk upload, it does not show up either in the DUS Web App list of processed files or in the output S3 bucket.

To Reproduce

  1. Select a pdf file whose file name has spaces.
  2. Upload the file to the S3 bucket for bulk upload
  3. This file will not show up in the DUS Web App.
  4. The file does not show up in the list of processed files either in the output S3 bucket.
  5. Check the lambda for bulk processing, it shows an error in the log file that the file was not found.

Expected behavior
The file should be processed successfully and should also appear in the Web App and the S3 bucket containing processed files.

Please complete the following information about the solution:

  • Version: v1.0.3
  • Region: All
  • Was the solution modified from the version published on this repository? No
  • Have you checked your service quotas for the services this solution uses? Yes
  • Were there any errors in the CloudWatch Logs? Yes

Error: Invalid S3 bucket name (value: SOURCE_BUCKET) during Dev deploy

Describe the bug
Getting an error during the initial deployment. I am trying to do the Dev approach for the install. It seems it is trying to create a bucket with no name or a bad name. I've installed all prereqs. This is in a Cloud9 IDE. It passes all previous tests up to this point. I have tried to dig into /cdk-textract-client-stack.ts but I cannot see where SOURCE_BUCKET actually gets its value. Your expert insight would be greatly appreciated.

error:

Running tests for datastore
Test region is us-east-1
/home/ec2-user/.local/lib/python3.7/site-packages/responses/__init__.py:484: DeprecationWarning: stream argument is deprecated. Use stream parameter in request directly
DeprecationWarning,
.Error : An error occurred (ConditionalCheckFailedException) when calling the UpdateItem operation: A condition specified in the operation could not be evaluated.
A condition specified in the operation could not be evaluated.
....response: {'Items': [{'documentId': 'b1a54fda-1809-49d7-8f19-0d1688eb65b9', 'objectName': 'public/samples/Misc/expense.png', 'bucketName': 'dusstack-sample-s3-bucket', 'documentStatus': 'IN_PROGRESS'}, {'documentId': 'b1a99fda-1809-49d7-8f19-0d1688eb65b9', 'objectName': 'public/samples/Misc/expense.png', 'bucketName': 'dusstack-sample-s3-bucket', 'documentStatus': 'IN_PROGRESS'}], 'Count': 2, 'ScannedCount': 2, 'ConsumedCapacity': {'TableName': 'DocumentsTestTable', 'CapacityUnits': 1}, 'ResponseMetadata': {'RequestId': 'C5M72NYUQUFHHS0ILX3HFC7W4LFLPWXEZGYQZY138HS6KE7ZH8CT', 'HTTPStatusCode': 200, 'HTTPHeaders': {'server': 'amazon.com', 'x-amzn-requestid': 'C5M72NYUQUFHHS0ILX3HFC7W4LFLPWXEZGYQZY138HS6KE7ZH8CT'}, 'RetryAttempts': 0}}
....A condition specified in the operation could not be evaluated.
.

Ran 10 tests in 1.069s

OK
sys:1: ResourceWarning: unclosed <socket.socket fd=3, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('0.0.0.0', 0)>
warning ../../../package.json: No license field
$ yarn compile-ts-backend-stack && yarn compile-ts-client-stack
warning ../../../package.json: No license field
$ tsc lib/cdk-textract-stack.ts --target es2018 --module commonjs --allowjs true
warning ../../../package.json: No license field
$ tsc lib/cdk-textract-client-stack.ts --target es2018 --module commonjs --allowjs true
warning ../../../package.json: No license field
$ AWS_REGION=$npm_package_stack_region USER_EMAIL=$npm_package_email cdk bootstrap --toolkit-stack-name DocumentUnderstandingCDKToolkit
/home/ec2-user/environment/document-understanding-solution/source/node_modules/@aws-cdk/aws-s3/lib/bucket.js:750
throw new Error(`Invalid S3 bucket name (value: ${bucketName})${os_1.EOL}${errors.join(os_1.EOL)}`);
^

Error: Invalid S3 bucket name (value: SOURCE_BUCKET)
Bucket name must only contain lowercase characters and the symbols, period (.) and dash (-) (offset: 0)
Bucket name must start and end with a lowercase character or number (offset: 0)
Bucket name must start and end with a lowercase character or number (offset: 12)
at Function.validateBucketName (/home/ec2-user/environment/document-understanding-solution/source/node_modules/@aws-cdk/aws-s3/lib/bucket.js:750:19)
at Function.fromBucketAttributes (/home/ec2-user/environment/document-understanding-solution/source/node_modules/@aws-cdk/aws-s3/lib/bucket.js:673:16)
at Function.fromBucketName (/home/ec2-user/environment/document-understanding-solution/source/node_modules/@aws-cdk/aws-s3/lib/bucket.js:654:23)
at new CdkTextractStack (/home/ec2-user/environment/document-understanding-solution/source/lib/cdk-textract-stack.js:509:62)
at Object.<anonymous> (/home/ec2-user/environment/document-understanding-solution/source/bin/deploy-backend.js:32:1)
at Module._compile (internal/modules/cjs/loader.js:1085:14)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
at Module.load (internal/modules/cjs/loader.js:950:32)
at Function.Module._load (internal/modules/cjs/loader.js:790:12)
at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
Subprocess exited with error 1
error Command failed with exit code 1.

To Reproduce
run yarn deploy

Expected behavior
A clear and concise description of what you expected to happen.

Please complete the following information about the solution:

  • Version: v1.0.3
  • Region: us-east-1
  • Was the solution modified from the version published on this repository? negative, just trying to test out initial template
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses? No, but this is basically a vanilla AWS org/account with nothing in it; let me know if I need to check something
  • Were there any errors in the CloudWatch Logs? no

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context
Add any other context about the problem here.
