Comments (3)
I printed all the input texts and fed them one by one to the embedding model, and found the text that causes the error. Please take a look:
here's my code:

```python
import os
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p1544.pdf",
    "https://www.irs.gov/pub/irs-pdf/p15.pdf",
    "https://www.irs.gov/pub/irs-pdf/p1212.pdf",
    "https://www.irs.gov/pub/irs-pdf/p3.pdf",
    "https://www.irs.gov/pub/irs-pdf/p17.pdf",
    "https://www.irs.gov/pub/irs-pdf/p51.pdf",
    "https://www.irs.gov/pub/irs-pdf/p54.pdf",
]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)
```
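If the downloads are flaky, the loop above can be hardened to skip files that are already present and to collect per-file failures instead of aborting on the first one. This is a sketch using only the standard library; the `fetch` helper is hypothetical, not part of the workshop code:

```python
import os
from urllib.request import urlretrieve
from urllib.error import URLError

def fetch(urls, dest="data"):
    """Download each URL into dest, skipping files already present.

    Returns (saved_paths, failed) where failed is a list of (url, error) pairs.
    """
    os.makedirs(dest, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        path = os.path.join(dest, url.rpartition("/")[2])
        if os.path.exists(path):
            saved.append(path)  # already downloaded; skip
            continue
        try:
            urlretrieve(url, path)
            saved.append(path)
        except (URLError, OSError) as e:
            failed.append((url, str(e)))
    return saved, failed
```

Re-running the cell then only fetches what is missing, which also makes it cheap to retry after a partial failure.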
```python
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")
documents = loader.load()

# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=1000, chunk_overlap=100
)
docs = text_splitter.split_documents(documents)

avg_doc_length = lambda documents: sum(len(doc.page_content) for doc in documents) // len(documents)
avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)
print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(docs)} documents more than the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')

texts = [d.page_content for d in docs]
metadatas = [d.metadata for d in docs]
print(len(texts), len(metadatas))
```
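For intuition, `chunk_size`/`chunk_overlap` behave roughly like a sliding character window. Here is a simplified sketch; the real `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and sentence boundaries, so actual chunk edges will differ:

```python
def naive_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With `chunk_size=1000` and `chunk_overlap=100`, a 2,500-character page yields three chunks starting at offsets 0, 900, and 1800, which is why the split produces far more documents than the original page count.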
```python
import json
import os

# boto3_bedrock is the Bedrock boto3 client created earlier in the notebook

def embedding_func(text: str):
    # this function is from langchain/embeddings/bedrock.py
    """Call out to Bedrock embedding endpoint."""
    # replace newlines, which can negatively affect performance
    text = text.replace(os.linesep, " ")
    print("\n text", text)
    _model_kwargs = {}
    input_body = {**_model_kwargs, "inputText": text}
    print("input_body", input_body)
    body = json.dumps(input_body)
    try:
        response = boto3_bedrock.invoke_model(
            body=body,
            modelId="amazon.titan-e1t-medium",
            accept="application/json",
            contentType="application/json",
        )
        response_body = json.loads(response.get("body").read())
    except Exception as e:
        raise ValueError(f"Error raised by inference endpoint: {e}")
    return response_body.get("embedding")

for text in texts:
    response = embedding_func(text)
```
and when the input text is "18,100 18,150 1,970 1,813 1,970 1,882 18,150 18,200 1,976 1,818 1,976 1,888 18,200 18,250 1,982 1,823 1,982 1,894 18,250 18,300 1,988 1,828 1,988 1,900 18,300 18,350 1,994 1,833 1,994 1,906 18,350 18,400 2,000 1,838 2,000 1,912 18,400 18,450 2,006 1,843 2,006 1,918 18,450 18,500 2,012 1,848 2,012 1,924 18,500 18,550 2,018 1,853 2,018 1,930 18,550 18,600 2,024 1,858 2,024 1,936 18,600 18,650 2,030 1,863 2,030 1,942 18,650 18,700 2,036 1,868 2,036 1,948 18,700 18,750 2,042 1,873 2,042 1,954 18,750 18,800 2,048 1,878 2,048 1,960 18,800 18,850 2,054 1,883 2,054 1,966 18,850 18,900 2,060 1,888 2,060 1,972 18,900 18,950 2,066 1,893 2,066 1,978 18,950 19,000 2,072 1,898 2,072 1,984 19,000 19,000 19,050 2,078 1,903 2,078 1,990 19,050 19,100 2,084 1,908 2,084 1,996 19,100 19,150 2,090 1,913 2,090 2,002 19,150 19,200 2,096 1,918 2,096 2,008 19,200 19,250 2,102 1,923 2,102 2,014 19,250 19,300 2,108 1,928 2,108 2,020 19,300 19,350 2,114 1,933 2,114 2,026 19,350 19,400 2,120 1,938 2,120 2,032"
the embedding model raises:

`ValueError: Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: The provided inference configurations are invalid`

I believe you can reproduce it. Thank you.
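One way to narrow this down without scanning printed output is to collect every failing chunk index in a single pass. A small sketch; `find_failing_inputs` and the stub are hypothetical stand-ins for the real `embedding_func` and endpoint:

```python
def find_failing_inputs(texts, embed):
    """Call embed on every text; return (index, error) pairs
    instead of stopping at the first failure."""
    failures = []
    for i, t in enumerate(texts):
        try:
            embed(t)
        except ValueError as e:
            failures.append((i, str(e)))
    return failures

# usage with a stub that rejects long inputs, mimicking the endpoint's behavior
def stub_embed(t):
    if len(t) > 50:
        raise ValueError("ValidationException")
    return [0.0]

print(find_failing_inputs(["short text", "x" * 100], stub_embed))
```

Running this over the real `texts` list would show whether only dense table-of-numbers chunks like the one above fail, or others too.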
from amazon-bedrock-workshop.
You can reproduce the error simply by replacing the data preparation code with this:

```python
import os
from urllib.request import urlretrieve

os.makedirs("data_17", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p17.pdf"
]
for url in files:
    file_path = os.path.join("data_17", url.rpartition("/")[2])
    urlretrieve(url, file_path)
```

Then run all the other code in the notebook.
@HannaHUp hello, I ran the same list of PDFs, with chunk size = 1000 and chunk overlap = 100.
```python
import os
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p1544.pdf",
    "https://www.irs.gov/pub/irs-pdf/p15.pdf",
    "https://www.irs.gov/pub/irs-pdf/p1212.pdf",
    "https://www.irs.gov/pub/irs-pdf/p3.pdf",
    "https://www.irs.gov/pub/irs-pdf/p17.pdf",
    "https://www.irs.gov/pub/irs-pdf/p51.pdf",
    "https://www.irs.gov/pub/irs-pdf/p54.pdf",
]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)
```
Got this as the split:

```
Average length among 314 documents loaded is 6397 characters.
After the split we have 2351 documents more than the original 314.
Average length among 2351 documents (after split) is 920 characters.
```
We had 3 PDF documents which have been split into smaller ~500 chunks.
It worked fine for me. Would you be able to try the latest Titan Embeddings model, `"amazon.titan-embed-text-v1"`, which was released last week at General Availability of the service, together with the latest workshop code, and see if you get the same problem?
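For what it's worth, switching models only changes the `modelId`; the request body shape is the same. A minimal sketch of building the request (the 8,000-character clip is an illustrative guess at a safety margin, not a documented limit, and the client setup and `invoke_model` call are assumed to exist elsewhere in the notebook):

```python
import json

MODEL_ID = "amazon.titan-embed-text-v1"  # GA embeddings model mentioned above
MAX_CHARS = 8000  # illustrative guard; the model's real limit is token-based, not characters

def build_request(text: str):
    """Return (modelId, JSON body) for a Titan embedding call,
    flattening newlines and clipping oversized inputs."""
    clipped = text.replace("\n", " ")[:MAX_CHARS]
    return MODEL_ID, json.dumps({"inputText": clipped})

model_id, body = build_request("hello\nworld " + "x" * 10000)
```

The pair returned here would be passed straight to `invoke_model(body=body, modelId=model_id, ...)` as in the earlier snippet.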