
Comments (6)

andreakiro avatar andreakiro commented on August 17, 2024 3

@imartinez I just realized, while looking into this issue, that the ingest script currently loads only a single file, regardless of how many documents are in the source_documents directory. The loop reassigns the document loader for each file but only calls load() once, after the loop, so only the very last file's content is loaded.

# Load document and split in chunks
for root, dirs, files in os.walk("source_documents"):
    for file in files:
        if file.endswith(".txt"):
            loader = TextLoader(os.path.join(root, file), encoding="utf8")
        elif file.endswith(".pdf"):
            loader = PDFMinerLoader(os.path.join(root, file))
        elif file.endswith(".csv"):
            loader = CSVLoader(os.path.join(root, file))
documents = loader.load() # loads only the last file content! 

I am working on a fix right now.
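A minimal sketch of the kind of fix described above: accumulate the results of every loader inside the loop instead of overwriting `loader` and calling `load()` once at the end. The extension-to-loader mapping and the `.load()` interface mirror the snippet above (in privateGPT these are langchain's `TextLoader`, `PDFMinerLoader`, and `CSVLoader`); the generic `loaders` parameter here is an illustrative assumption, not the actual patch.

```python
import os

def load_documents(source_dir, loaders):
    """Walk source_dir and accumulate documents from every matching file.

    `loaders` maps a file extension to a factory that takes a path and
    returns an object with a .load() method (the interface langchain's
    document loaders expose); this parameterization is for illustration.
    """
    documents = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            ext = os.path.splitext(name)[1]
            factory = loaders.get(ext)
            if factory is None:
                continue  # skip unsupported file types
            loader = factory(os.path.join(root, name))
            # extend() inside the loop, instead of a single load() after it,
            # is what keeps every file's content rather than only the last one
            documents.extend(loader.load())
    return documents
```

With this shape, `documents` grows by one entry per loaded file, so a directory of N supported files yields N loads rather than one.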

from privategpt.

imartinez avatar imartinez commented on August 17, 2024

Not sure, maybe you can run a couple tests with fewer docs to check if that's the case. Thanks for your help!


PierrickLozach avatar PierrickLozach commented on August 17, 2024

That's what I did. It works with 7-8 PDFs, but it gives this error when I add more.


imartinez avatar imartinez commented on August 17, 2024

Ok, great, thanks for sharing. Does it fail right away? Interesting; I'll need to look into it. Please share your findings!


PierrickLozach avatar PierrickLozach commented on August 17, 2024

It fails right at the beginning; see the output above. Let me know if you need anything else.


rkrkrediffmail avatar rkrkrediffmail commented on August 17, 2024

I pulled the latest git today and still get the same issue. I'm not even uploading multiple files; I'm just testing this on Colab with state of the union.txt.

Loading documents from source_documents
Loaded 0 documents from source_documents
Split into 0 chunks of text (max. 500 tokens each)
llama.cpp: loading model from llm/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1000
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size = 1000.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
Traceback (most recent call last):
  File "/content/privateGPT/ingest.py", line 96, in <module>
    main()
  File "/content/privateGPT/ingest.py", line 90, in main
    db = Chroma.from_documents(texts, llama, persist_directory=persist_directory, client_settings=CHROMA_SETTINGS)
  File "/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py", line 413, in from_documents
    return cls.from_texts(
  File "/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py", line 381, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py", line 159, in add_texts
    self._collection.add(
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py", line 97, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py", line 340, in _validate_embedding_set
    ids = validate_ids(maybe_cast_one_to_many(ids))
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/types.py", line 75, in maybe_cast_one_to_many
    if isinstance(target[0], (int, float)):
IndexError: list index out of range

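The "Loaded 0 documents" line above is the real culprit: an empty list of texts reaches `Chroma.from_documents`, and chromadb's `maybe_cast_one_to_many` then indexes `target[0]` on an empty list, producing the `IndexError`. A small guard before building the store would fail with a clear message instead; `ensure_documents` is a hypothetical helper name, not part of the ingest script.

```python
def ensure_documents(texts):
    """Fail early with a readable message instead of letting an empty
    list reach the vector store, where chromadb raises a bare IndexError."""
    if not texts:
        raise SystemExit(
            "No documents were loaded from source_documents - "
            "check the directory path and the supported file extensions."
        )
    return texts
```

Called as `texts = ensure_documents(texts)` just before `Chroma.from_documents`, this turns the cryptic traceback into an actionable error when the loader finds nothing.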
