
Comments (6)

andreakiro avatar andreakiro commented on August 17, 2024 3

@imartinez I just realized, while looking into this issue, that the ingest script currently loads only a single file, regardless of how many documents are in the source_documents directory. The loop reassigns the document loader for each file but only calls load() once, after the loop, so only the very last file's content is loaded.

# Load document and split in chunks
for root, dirs, files in os.walk("source_documents"):
    for file in files:
        if file.endswith(".txt"):
            loader = TextLoader(os.path.join(root, file), encoding="utf8")
        elif file.endswith(".pdf"):
            loader = PDFMinerLoader(os.path.join(root, file))
        elif file.endswith(".csv"):
            loader = CSVLoader(os.path.join(root, file))
documents = loader.load() # loads only the last file content! 

I am working on a fix right now.
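A minimal sketch of the kind of fix described above: accumulate the results of every loader inside the loop instead of overwriting `loader` and calling `load()` once at the end. The extension-to-loader mapping and the `.load()` interface mirror the snippet above (in privateGPT these are langchain's `TextLoader`, `PDFMinerLoader`, and `CSVLoader`); the generic `loaders` parameter here is an illustrative assumption, not the actual patch.

```python
import os

def load_documents(source_dir, loaders):
    """Walk source_dir and accumulate documents from every matching file.

    `loaders` maps a file extension to a factory that takes a path and
    returns an object with a .load() method (the interface langchain's
    document loaders expose); this parameterization is for illustration.
    """
    documents = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            ext = os.path.splitext(name)[1]
            factory = loaders.get(ext)
            if factory is None:
                continue  # skip unsupported file types
            loader = factory(os.path.join(root, name))
            # extend() inside the loop, instead of a single load() after it,
            # is what keeps every file's content rather than only the last one
            documents.extend(loader.load())
    return documents
```

With this shape, `documents` grows by one entry per loaded file, so a directory of N supported files yields N loads rather than one.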

from privategpt.

imartinez avatar imartinez commented on August 17, 2024

Not sure, maybe you can run a couple tests with fewer docs to check if that's the case. Thanks for your help!


PierrickLozach avatar PierrickLozach commented on August 17, 2024

That's what I did. It works with 7-8 PDFs, but it gives this error when I add more.


imartinez avatar imartinez commented on August 17, 2024

Ok, great, thanks for sharing. Does it fail right away? Interesting; I'll need to look into it. Please share your findings!


PierrickLozach avatar PierrickLozach commented on August 17, 2024

It fails right at the beginning; see the output above. Let me know if you need anything else.


rkrkrediffmail avatar rkrkrediffmail commented on August 17, 2024

I pulled the latest git today and still get the same issue. I'm not even uploading multiple files; I'm just testing this on Colab with state of the union.txt.

Loading documents from source_documents
Loaded 0 documents from source_documents
Split into 0 chunks of text (max. 500 tokens each)
llama.cpp: loading model from llm/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1000
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size = 1000.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
Traceback (most recent call last):
  File "/content/privateGPT/ingest.py", line 96, in <module>
    main()
  File "/content/privateGPT/ingest.py", line 90, in main
    db = Chroma.from_documents(texts, llama, persist_directory=persist_directory, client_settings=CHROMA_SETTINGS)
  File "/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py", line 413, in from_documents
    return cls.from_texts(
  File "/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py", line 381, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py", line 159, in add_texts
    self._collection.add(
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py", line 97, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py", line 340, in _validate_embedding_set
    ids = validate_ids(maybe_cast_one_to_many(ids))
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/types.py", line 75, in maybe_cast_one_to_many
    if isinstance(target[0], (int, float)):
IndexError: list index out of range

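The "Loaded 0 documents" line above is the real culprit: an empty list of texts reaches `Chroma.from_documents`, and chromadb's `maybe_cast_one_to_many` then indexes `target[0]` on an empty list, producing the `IndexError`. A small guard before building the store would fail with a clear message instead; `ensure_documents` is a hypothetical helper name, not part of the ingest script.

```python
def ensure_documents(texts):
    """Fail early with a readable message instead of letting an empty
    list reach the vector store, where chromadb raises a bare IndexError."""
    if not texts:
        raise SystemExit(
            "No documents were loaded from source_documents - "
            "check the directory path and the supported file extensions."
        )
    return texts
```

Called as `texts = ensure_documents(texts)` just before `Chroma.from_documents`, this turns the cryptic traceback into an actionable error when the loader finds nothing.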
