
Comments (3)

HannaHUp commented on May 29, 2024

I printed all of the input texts and fed them one by one to the embedding model, and I found the text that causes the error. Please take a look.
Here's my code:

import os
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p1544.pdf",
    "https://www.irs.gov/pub/irs-pdf/p15.pdf",
    "https://www.irs.gov/pub/irs-pdf/p1212.pdf",
    "https://www.irs.gov/pub/irs-pdf/p3.pdf",
    "https://www.irs.gov/pub/irs-pdf/p17.pdf",
    "https://www.irs.gov/pub/irs-pdf/p51.pdf",
    "https://www.irs.gov/pub/irs-pdf/p54.pdf",
]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)
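As an aside, the `url.rpartition("/")[2]` expression in the loop above extracts the filename from each URL; a minimal illustration:

```python
# rpartition("/") splits on the LAST "/", returning (head, sep, tail);
# index [2] is everything after that last slash, i.e. the filename.
url = "https://www.irs.gov/pub/irs-pdf/p17.pdf"
filename = url.rpartition("/")[2]
print(filename)  # → p17.pdf
```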

import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()
# - in our testing, the character-based splitter works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # a relatively small chunk size, just for demonstration
    chunk_size=1000, chunk_overlap=100
)
docs = text_splitter.split_documents(documents)

avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)
print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(docs)} documents more than the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')

texts = [d.page_content for d in docs]
metadatas = [d.metadata for d in docs]
print(len(texts), len(metadatas))

import json

# boto3_bedrock is the Bedrock client created earlier in the notebook.

def embedding_func(text: str):
    # this function is adapted from langchain/embeddings/bedrock.py
    """Call out to the Bedrock embedding endpoint."""
    # replace newlines, which can negatively affect performance
    text = text.replace(os.linesep, " ")
    print("\n text", text)
    _model_kwargs = {}

    input_body = {**_model_kwargs, "inputText": text}
    print("input_body", input_body)
    body = json.dumps(input_body)

    try:
        response = boto3_bedrock.invoke_model(
            body=body,
            modelId="amazon.titan-e1t-medium",
            accept="application/json",
            contentType="application/json",
        )
        response_body = json.loads(response.get("body").read())
    except Exception as e:
        raise ValueError(f"Error raised by inference endpoint: {e}")
    return response_body

for text in texts:
    response = embedding_func(text)

and when the input text is "18,100 18,150 1,970 1,813 1,970 1,882 18,150 18,200 1,976 1,818 1,976 1,888 18,200 18,250 1,982 1,823 1,982 1,894 18,250 18,300 1,988 1,828 1,988 1,900 18,300 18,350 1,994 1,833 1,994 1,906 18,350 18,400 2,000 1,838 2,000 1,912 18,400 18,450 2,006 1,843 2,006 1,918 18,450 18,500 2,012 1,848 2,012 1,924 18,500 18,550 2,018 1,853 2,018 1,930 18,550 18,600 2,024 1,858 2,024 1,936 18,600 18,650 2,030 1,863 2,030 1,942 18,650 18,700 2,036 1,868 2,036 1,948 18,700 18,750 2,042 1,873 2,042 1,954 18,750 18,800 2,048 1,878 2,048 1,960 18,800 18,850 2,054 1,883 2,054 1,966 18,850 18,900 2,060 1,888 2,060 1,972 18,900 18,950 2,066 1,893 2,066 1,978 18,950 19,000 2,072 1,898 2,072 1,984 19,000 19,000 19,050 2,078 1,903 2,078 1,990 19,050 19,100 2,084 1,908 2,084 1,996 19,100 19,150 2,090 1,913 2,090 2,002 19,150 19,200 2,096 1,918 2,096 2,008 19,200 19,250 2,102 1,923 2,102 2,014 19,250 19,300 2,108 1,928 2,108 2,020 19,300 19,350 2,114 1,933 2,114 2,026 19,350 19,400 2,120 1,938 2,120 2,032"
the embedding model gives me: ValueError: Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: The provided inference configurations are invalid

I believe you can reproduce it. Thank you.
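One guess at a workaround (an assumption, not a confirmed cause: the failing chunk above is over a thousand characters of dense tax-table text, which may exceed the model's input limit): re-split any over-long chunk on whitespace before embedding, so each piece stays under a chosen character budget. The helper below is a hypothetical sketch, not part of the workshop code:

```python
def resplit(text: str, max_chars: int = 512) -> list[str]:
    """Split `text` into whitespace-delimited pieces of at most `max_chars`.

    Note: a single word longer than `max_chars` is kept whole.
    """
    pieces, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            pieces.append(current)  # flush the current piece, start a new one
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces

# usage: re-split each chunk, then embed the smaller pieces
# for text in texts:
#     for piece in resplit(text):
#         response = embedding_func(piece)
```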

from amazon-bedrock-workshop.

HannaHUp commented on May 29, 2024

You can reproduce the error simply by replacing your data-preparation code with this:

import os
from urllib.request import urlretrieve

os.makedirs("data_17", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p17.pdf",
]
for url in files:
    file_path = os.path.join("data_17", url.rpartition("/")[2])
    urlretrieve(url, file_path)

Then run all the other code in the notebook.


lauerarnaud commented on May 29, 2024

@HannaHUp hello, I ran the same list of PDFs, with chunk_size = 1000 and chunk_overlap = 100:

import os
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p1544.pdf",
    "https://www.irs.gov/pub/irs-pdf/p15.pdf",
    "https://www.irs.gov/pub/irs-pdf/p1212.pdf",
    "https://www.irs.gov/pub/irs-pdf/p3.pdf",
    "https://www.irs.gov/pub/irs-pdf/p17.pdf",
    "https://www.irs.gov/pub/irs-pdf/p51.pdf",
    "https://www.irs.gov/pub/irs-pdf/p54.pdf",
]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

Got this as the split.

Average length among 314 documents loaded is 6397 characters.
After the split we have 2351 documents more than the original 314.
Average length among 2351 documents (after split) is 920 characters.
We had 3 PDF documents which have been split into smaller ~500 chunks.

It worked fine for me. Would you be able to try the latest Titan Embeddings model, "amazon.titan-embed-text-v1", which was released last week at General Availability of the service, together with the latest workshop code, and see if you get the same problem?
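Swapping in the GA model id amounts to changing the `modelId` in the invoke call. A minimal sketch (the function name and the client-as-parameter wiring are assumptions, not the workshop's exact code; `client` would be the boto3 Bedrock runtime client):

```python
import json

def embed_text(client, text: str, model_id: str = "amazon.titan-embed-text-v1"):
    """Call a Bedrock embeddings model and return the embedding vector.

    `client` is a boto3 Bedrock runtime client created elsewhere.
    """
    # replace newlines, which can negatively affect performance
    body = json.dumps({"inputText": text.replace("\n", " ")})
    response = client.invoke_model(
        body=body,
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response.get("body").read())["embedding"]
```

Passing the client in as a parameter also makes the function easy to exercise against a stub, without touching the real endpoint.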

