googlecloudplatform / terraform-genai-doc-summarization

Summarizes document using OCR and Vertex Generative AI LLM

Home Page: https://registry.terraform.io/modules/GoogleCloudPlatform/terraform-genai-doc-summarization/google

License: Apache License 2.0

Python 30.75% HCL 48.81% Makefile 8.63% Go 6.62% Shell 5.19%
cft-terraform

terraform-genai-doc-summarization's Introduction

Generative AI Document Summarization

Description

Tagline

Create summaries of a large corpus of documents using Generative AI.

Detailed

This solution showcases how to summarize a large corpus of documents using Generative AI. It provides an end-to-end demonstration of document summarization: starting from raw documents, it detects the text in each document and summarizes it on demand using Vertex AI LLM APIs, Document AI Optical Character Recognition (OCR), and BigQuery.

PreDeploy

To deploy this blueprint you must have an active billing account and billing permissions.

Architecture

Document Summarization using Generative AI

  • User uploads a new document triggering the webhook Cloud Function.
  • Document AI extracts the text from the document file.
  • A Vertex AI Large Language Model summarizes the document text.
  • The document summaries are stored in BigQuery.
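
The flow above can be sketched as a small pipeline. This is a minimal illustration only, not the blueprint's actual webhook code: `process_upload`, `SummaryStore`, and the injected `extract_text` and `summarize` callables are hypothetical names standing in for the Document AI, Vertex AI, and BigQuery calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SummaryStore:
    """In-memory stand-in (hypothetical) for the BigQuery table of summaries."""

    rows: List[Dict[str, str]] = field(default_factory=list)

    def insert(self, document: str, summary: str) -> None:
        self.rows.append({"document": document, "summary": summary})


def process_upload(
    document_name: str,
    extract_text: Callable[[str], str],
    summarize: Callable[[str], str],
    store: SummaryStore,
) -> str:
    """Run the stages in order for one uploaded document."""
    text = extract_text(document_name)    # OCR stage (Document AI in the real system)
    summary = summarize(text)             # LLM stage (Vertex AI in the real system)
    store.insert(document_name, summary)  # storage stage (BigQuery in the real system)
    return summary


if __name__ == "__main__":
    # Stub stages so the sketch runs without any cloud access.
    store = SummaryStore()
    print(process_upload("paper.pdf", lambda n: f"text of {n}", lambda t: t[:12], store))
    print(store.rows)
```

Passing the stages in as callables keeps the ordering logic testable without credentials; the real services would be wired in at the call site.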

Documentation

Deployment Duration

Configuration: 1 min
Deployment: 5 mins

Cost

Cost Details

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| disable_services_on_destroy | Whether project services will be disabled when the resources are destroyed. | bool | false | no |
| documentai_location | Document AI location, see https://cloud.google.com/document-ai/docs/regions | string | "us" | no |
| labels | A set of key/value label pairs to assign to the resources deployed by this blueprint. | map(string) | {} | no |
| project_id | The Google Cloud project ID to deploy to | string | n/a | yes |
| region | The Google Cloud region to deploy to | string | "us-central1" | no |
| unique_names | Whether to use unique names for resources | bool | false | no |

Outputs

| Name | Description |
|------|-------------|
| bigquery_dataset_id | The name of the BigQuery dataset created |
| bucket_docs_name | The name of the docs bucket created |
| bucket_main_name | The name of the main bucket created |
| documentai_processor_id | The full Document AI processor path ID |
| neos_walkthrough_url | The URL to launch the in-console tutorial for the Generative AI Document Summarization solution |
| unique_id | The unique ID for this deployment |

Requirements

These sections describe requirements for using this module.

Software

The following dependencies must be available:

Service Account

A service account with the following roles must be used to provision the resources of this module:

  • Storage Admin: roles/storage.admin

APIs

A project with the following APIs enabled must be used to host the resources of this module:

  • Google Cloud Storage JSON API: storage-api.googleapis.com

Contributing

Refer to the contribution guidelines for information on contributing to this module.

Security Disclosures

Please see our security disclosure process.

terraform-genai-doc-summarization's People

Contributors

asrivas, balajismaniam, cloud-foundation-bot, davidcavazos, dependabot[bot], donmccasland, kweinmeister, nicain, nimjay, release-please[bot], renovate-bot, yil532

terraform-genai-doc-summarization's Issues

document_extract doesn't use my own credentials

I am running this example from the Jump Start Solution center; it uses the gen_ai_jss notebook.

Cell 29

bucket = "arxiv-dataset"
pdf_name = "arxiv/cmp-lg/pdf/9404/9404002v1.pdf"
output_bucket = f"{PROJECT_ID}_output"

complete_text = document_extract(bucket=bucket,
                                 name=pdf_name,
                                 output_bucket=output_bucket,
                                 project_id=PROJECT_ID)

# Entire text is long; print just first 1000 characters
print(complete_text[:1000])

fails for me with the below error

_InactiveRpcError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     71         try:
---> 72             return callable_(*args, **kwargs)
     73         except grpc.RpcError as exc:

9 frames
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.PERMISSION_DENIED
	details = "Cloud Vision API has not been used in project 522309567947 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/vision.googleapis.com/overview?project=522309567947 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry."
	debug_error_string = "UNKNOWN:Error received from peer ipv4:172.253.63.95:443 {created_time:"2023-10-22T18:48:21.432087561+00:00", grpc_status:7, grpc_message:"Cloud Vision API has not been used in project 522309567947 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/vision.googleapis.com/overview?project=522309567947 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry."}"
>

The above exception was the direct cause of the following exception:

PermissionDenied                          Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     72             return callable_(*args, **kwargs)
     73         except grpc.RpcError as exc:
---> 74             raise exceptions.from_grpc_error(exc) from exc
     75 
     76     return error_remapped_callable

PermissionDenied: 403 Cloud Vision API has not been used in project 522309567947 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/vision.googleapis.com/overview?project=522309567947 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry. [links {
  description: "Google developers console API activation"
  url: "https://console.developers.google.com/apis/api/vision.googleapis.com/overview?project=522309567947"
}
, reason: "SERVICE_DISABLED"
domain: "googleapis.com"
metadata {
  key: "consumer"
  value: "projects/522309567947"
}
metadata {
  key: "service"
  value: "vision.googleapis.com"
}
]

You might wonder whether I forgot to enable something, but that is not the case.
I don't recognize the project number, so it's not mine.

Even though the notebook is authenticated as my user, the script somehow tries to use a different project to run the Vision API call, perhaps the project of the "bucket" owner. This should be fixed.
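
A quick way to confirm this kind of mismatch is to pull the project number out of the error text and compare it with your own. The sketch below is only a diagnostic aid based on the message wording quoted above; `billed_project` and `is_foreign_project` are hypothetical helper names, and the caller has to supply their own project number.

```python
import re
from typing import Optional

# Matches the wording of the SERVICE_DISABLED message quoted in this issue.
_PROJECT_RE = re.compile(r"has not been used in project (\d+) before")


def billed_project(error_message: str) -> Optional[str]:
    """Return the project number named in a SERVICE_DISABLED message, if any."""
    match = _PROJECT_RE.search(error_message)
    return match.group(1) if match else None


def is_foreign_project(error_message: str, my_project_number: str) -> bool:
    """True when the error names a project other than the caller's own."""
    found = billed_project(error_message)
    return found is not None and found != my_project_number
```

If `is_foreign_project` returns True, the call was attributed to a project other than the one the notebook user expects, which matches the symptom described here.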

Evaluate whether to replace or refactor text truncation function

The utils.truncate_complete_text function uses a heuristic to extract the abstract and conclusion from an OCR result. This approach has multiple issues: it cannot handle corner cases, and it does not capture all of, or only, the abstract and conclusion.

I recommend replacing or refactoring this module in one of the following ways:

  1. Replace the heuristic string manipulation code with a call to an in-memory NLP or LLM model.
  2. Investigate improvements to Document AI templating to get better results from OCR.
  3. Use regex (ugh) to better isolate the abstract and conclusion
  4. Other?
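
As a rough illustration of option 3, the sketch below isolates the two sections with regular expressions. It is not the existing utils.truncate_complete_text code; the heading patterns (an "Abstract" line, an optionally numbered "Introduction" or "Conclusion" line, a "References" or "Acknowledgments" line) are assumptions that many papers will not satisfy.

```python
import re
from typing import Optional, Tuple

_FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL

# Body runs from an "Abstract" heading line to the next "Introduction" heading.
ABSTRACT_RE = re.compile(
    r"^[ \t]*abstract[ \t]*\n(?P<body>.*?)"
    r"(?=^[ \t]*(?:\d+\.?[ \t]+)?introduction[ \t]*$|\Z)",
    _FLAGS,
)

# Body runs from a "Conclusion(s)" heading line to references/acknowledgments.
CONCLUSION_RE = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?conclusions?[ \t]*\n(?P<body>.*?)"
    r"(?=^[ \t]*(?:references|acknowledge?ments?)[ \t]*$|\Z)",
    _FLAGS,
)


def abstract_and_conclusion(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (abstract, conclusion); each is None when its heading is missing."""

    def body(pattern: "re.Pattern[str]") -> Optional[str]:
        match = pattern.search(text)
        return match.group("body").strip() if match else None

    return body(ABSTRACT_RE), body(CONCLUSION_RE)
```

Returning None for a missing section lets the caller fall back to another strategy, such as summarizing the full text, instead of silently truncating.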

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Pending Status Checks

These updates await pending status checks. To force their creation now, click the checkbox below.

  • chore(deps): update dependency functions-framework to v3.7.0

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Detected dependencies

github-actions
.github/workflows/periodic-reporter.yaml
  • actions/github-script v7
.github/workflows/webhook.yml
  • actions/checkout v4
  • actions/setup-python v5
  • actions/checkout v4
  • hashicorp/setup-terraform v3
  • actions/setup-python v5
  • google-github-actions/auth v2
  • actions/setup-python v5
  • actions/cache v4
gomod
test/integration/go.mod
  • go 1.21
  • go 1.22.1
  • github.com/GoogleCloudPlatform/cloud-foundation-toolkit/infra/blueprint-test v0.14.0
  • github.com/stretchr/testify v1.9.0
pip_requirements
webhook/requirements-test.txt
  • mypy ==1.10.0
  • pytest ==8.2.0
webhook/requirements.txt
  • flask ==3.0.3
  • functions-framework ==3.5.0
  • google-cloud-aiplatform ==1.51.0
  • google-cloud-documentai ==2.27.0
  • google-cloud-bigquery ==3.22.0
  • google-cloud-storage ==2.16.0
regex
Makefile
  • cft/developer-tools 1
build/int.cloudbuild.yaml
  • cft/developer-tools 1
terraform
examples/simple_example/main.tf
main.tf
  • terraform-google-modules/project-factory/google ~> 14.5
test/setup/main.tf
  • terraform-google-modules/project-factory/google ~> 14.5
test/setup/versions.tf
  • google >= 5.24, < 6
  • google-beta >= 5.24, < 6
  • hashicorp/terraform >= 0.13
versions.tf
  • archive ~> 2.4
  • google >= 5.24, < 6
  • google-beta >= 5.24, < 6
  • random ~> 3.5
  • hashicorp/terraform >= 0.13

pubsub service account missing roles/iam.serviceAccountTokenCreator

I was able to reproduce an issue when clicking Deploy from this URL:
https://console.cloud.google.com/products/solutions/deployments/details/us-central1/generative-ai-document-summarization

I used an existing project

All deployments succeed, the notebook works, the file is uploaded, Eventarc picks up the event, and Pub/Sub receives the message, but Pub/Sub does not call the Cloud Function even though everything looks set up. The Cloud Function exists.

I reproduced this many times by undeploying and redeploying.

After that, I observed that the Edit subscription page of Pub/Sub complains that roles/iam.serviceAccountTokenCreator is not applied. Once I granted this role manually, the subscription started to fire the push job.

Please fix the scripts so that this role is applied.

Location is defined explicitly to us-central1

def predict_large_language_model(
    project_id: str,
    model_name: str,
    temperature: float,
    max_decode_steps: int,
    top_p: float,
    top_k: int,
    content: str,
    location: str = "us-central1",

Remove test skip from document_extract_test

The integration test for document_extract sometimes encounters a race condition where it attempts to access an OCR page that doesn't exist yet in the Storage bucket. We need to adjust the test so that this race condition is avoided, and then remove the test skip.
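
One common way to tolerate this kind of race is to poll with backoff until the object appears. The sketch below shows that pattern in isolation; `wait_for` and its `fetch` callable are hypothetical, and in the real test `fetch` would be whatever lookup returns the OCR page or None when it is not there yet.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(
    fetch: Callable[[], Optional[T]],
    attempts: int = 5,
    initial_delay: float = 0.5,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it returns a non-None value, sleeping between tries."""
    delay = initial_delay
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:  # no point sleeping after the last try
            sleep(delay)
            delay *= backoff
    raise TimeoutError(f"no result after {attempts} attempts")
```

The `sleep` parameter is injectable so the helper itself can be tested without real waiting.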
