amikos-tech / chromadb-data-pipes
ChromaDB Data Pipes - The easiest way to get data into and out of ChromaDB
Home Page: https://datapipes.chromadb.dev/
License: MIT License
Ability to call any arbitrary Python function and adapt its input to the target format
Detailed info about import command
Ability to load code from remote sources. We can start with Chroma as a source of these functions.
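A possible sketch of the loading half, assuming remote functions arrive as plain Python source text (fetching, caching, and trust/signing are out of scope here; the module name is arbitrary):

```python
import importlib.util

def load_function(source: str, name: str):
    # Build an empty in-memory module and exec the fetched source into it.
    # Only do this for trusted sources (e.g. Chroma's own function registry).
    spec = importlib.util.spec_from_loader("cdp_remote", loader=None)
    module = importlib.util.module_from_spec(spec)
    exec(compile(source, "<remote>", "exec"), module.__dict__)
    return getattr(module, name)

# e.g. source fetched over HTTP from a known-good registry
double = load_function("def double(x):\n    return x * 2", "double")
```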
Use python-magic to detect MIME types and route files to the right LangChain loader automatically. LangChain has no built-in MIME router, so a plain mapping does the job (the loader set below is just an example):
import magic
from langchain_community.document_loaders import TextLoader, PyPDFLoader
# Map MIME types to loader classes
LOADERS = {
    'text/plain': TextLoader,
    'application/pdf': PyPDFLoader,
}
# Specify the path to the file
file_path = '/path/to/file.pdf'
# Detect the MIME type of the file
mime_type = magic.from_file(file_path, mime=True)
# Instantiate the matching loader and load the file
loader = LOADERS[mime_type](file_path)
documents = loader.load()
# Process the loaded documents as needed
for doc in documents:
    content = doc.page_content
# ...
Tracking issue for:
A simple primer on how to call an API to fetch data. Keep this simple: remote fetch, rudimentary pagination support, and basic JSONPath support (yes, we only support JSON-based REST APIs).
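A minimal sketch of the fetch + pagination + path-extraction core. The fetch callable and the dot-path syntax are assumptions standing in for a real HTTP client and full JSONPath:

```python
from typing import Callable, Iterator

def extract(obj, path: str):
    # Rudimentary dot-path lookup ("data.items") standing in for full JSONPath.
    for key in path.split("."):
        obj = obj[key]
    return obj

def fetch_all(fetch_page: Callable[[int], dict], items_path: str) -> Iterator[dict]:
    # fetch_page(page) returns one decoded JSON page;
    # stop at the first page whose item list is empty.
    page = 0
    while True:
        items = extract(fetch_page(page), items_path)
        if not items:
            return
        yield from items
        page += 1
```

In a real pipeline fetch_page would wrap something like requests.get(f"{url}?page={page}").json(); keeping it as a parameter keeps the pagination logic testable offline.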
Command prefixes like tx and imp may put unnecessary cognitive load on users. Evaluate dropping the prefix, e.g.:
cdp tx embed -> cdp embed
cdp tx chunk -> cdp chunk
OTEL Observability pattern to annotate structures.
We want to produce:
import sys
from contextlib import contextmanager

@contextmanager
def smart_open(filename=None):
    if filename:
        fh = open(filename, 'w')
    else:
        fh = sys.stdout
    try:
        yield fh
    finally:
        if filename:
            fh.close()
Initial thoughts: a reusable flow of cdp commands (like a recipe) with predefined inputs that the user either needs to define or can leave at their defaults.
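A sketch of what such a recipe could look like as data. The step strings and input names are illustrative, not final CLI syntax:

```python
# A recipe: ordered cdp steps plus inputs with defaults (None => user must supply).
RECIPE = {
    "inputs": {"url": None, "collection": None, "chunk_size": 512},
    "steps": [
        "cdp imp url {url}",
        "cdp tx chunk --size {chunk_size}",
        "cdp tx embed",
        "cdp imp chroma {collection}",
    ],
}

def render(recipe, **overrides):
    # Merge user overrides over the defaults and fail fast on missing inputs.
    params = {**recipe["inputs"], **overrides}
    missing = [k for k, v in params.items() if v is None]
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    return [step.format(**params) for step in recipe["steps"]]
```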
The URI should not be an option but the first and only positional argument for these commands.
Clean emojis from text and metadata
To keep it simple, let's use the same syntax as Chroma's where filter, e.g. --where '{"id":1}'
Need an ADR for saving temporary state for each step in case of failure.
Add chunk separator character
Support the following strategies:
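One separator-based strategy, sketched under assumptions (the separator parameter and the greedy character-count packing are illustrations, not the decided design):

```python
def chunk_text(text: str, max_len: int, separator: str = "\n") -> list:
    # Split on the separator, then greedily pack pieces into chunks
    # no longer than max_len characters (oversized pieces pass through whole).
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```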
Ability to select any of the available tools and expose it as a FastAPI endpoint, in a generic way.
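The FastAPI wiring itself is omitted here; a sketch of the generic part, deriving a request schema from any tool function's signature with the stdlib (the chunk tool is a hypothetical stand-in):

```python
import inspect
from typing import Any, Callable, Dict, Tuple

def tool_schema(fn: Callable) -> Dict[str, Tuple[Any, bool]]:
    # Map each parameter to (annotation, has_default); an API layer such as
    # FastAPI could turn this into a request model for a generic endpoint.
    sig = inspect.signature(fn)
    return {
        name: (p.annotation, p.default is not inspect.Parameter.empty)
        for name, p in sig.parameters.items()
    }

def chunk(text: str, size: int = 512) -> list:
    # Stand-in for a real cdp tool.
    return [text[i:i + size] for i in range(0, len(text), size)]
```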
We need a URI scheme as a shortcut for targeting a Chroma instance with the respective tenant/DB and collection:
http(s)://<basic_user or __auth_token__ or __x_chroma_token__>:<basic_password or token>@<host>:<port>/<database>/<collection>?tenant=<tenant>&batch_size=<batch_size>&limit=<limit>&offset=<offset>
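Parsing that scheme needs nothing beyond the stdlib; a sketch (field names follow the draft above, which is not final):

```python
from urllib.parse import urlparse, parse_qs

def parse_chroma_uri(uri: str) -> dict:
    u = urlparse(uri)
    # Path carries /<database>/<collection>; the rest are standard URL parts.
    database, _, collection = u.path.lstrip("/").partition("/")
    query = {k: v[0] for k, v in parse_qs(u.query).items()}
    return {
        "scheme": u.scheme,
        "user": u.username,        # basic user, __auth_token__ or __x_chroma_token__
        "credential": u.password,  # basic password or token
        "host": u.hostname,
        "port": u.port,
        "database": database,
        "collection": collection,
        **query,                   # tenant, batch_size, limit, offset
    }
```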
Push docs with each release
Ability to inject KV pairs of metadata
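A minimal sketch, assuming documents are plain dicts with an optional metadata key; letting existing keys win over injected ones is a design choice, not a settled behavior:

```python
def inject_metadata(docs, extra: dict):
    # Merge extra KV pairs into each document's metadata.
    for doc in docs:
        doc["metadata"] = {**extra, **doc.get("metadata", {})}
        yield doc
```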
Add example usages at the beginning of the docs so that users can quickly evaluate if this is the right tool for the job.
Read either a single text file or a directory of text files.
Provide an input audio file and get it transcribed using Whisper on-device or via the OpenAI API.
This is only applicable when creating a collection
Support via URI, e.g. file://path/to/persist_dir
Exports data directly from the SQLite DB without needing a running Chroma instance.
Useful for backups
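A generic sketch of streaming one table out of the SQLite file. Chroma's internal table names (e.g. "embeddings", "embedding_metadata") are version-dependent internals, so they are assumptions to verify against the actual schema:

```python
import sqlite3

def export_table(db_path: str, table: str):
    # Stream rows from one table as dicts. Inspect sqlite_master first
    # before relying on Chroma's internal table names.
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        for row in conn.execute(f"SELECT * FROM {table}"):  # table from trusted config
            yield dict(row)
    finally:
        conn.close()
```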
Read data from pinecone and output to stdout/file:
https://community.pinecone.io/t/how-to-retrieve-list-of-ids-in-an-index/380/11
import numpy as np

# `index` is assumed to be an already-initialized pinecone Index
def get_ids_from_query(index, input_vector):
    print("searching pinecone...")
    results = index.query(vector=input_vector, top_k=10000, include_values=False)
    ids = set()
    for result in results["matches"]:
        ids.add(result["id"])
    return ids

def get_all_ids_from_index(index, num_dimensions, namespace=""):
    num_vectors = index.describe_index_stats()["namespaces"][namespace]["vector_count"]
    all_ids = set()
    while len(all_ids) < num_vectors:
        # Query with a random vector to sample up to 10k ids at a time
        input_vector = np.random.rand(num_dimensions).tolist()
        all_ids.update(get_ids_from_query(index, input_vector))
        print(f"Collected {len(all_ids)} ids out of {num_vectors}.")
    return all_ids

all_ids = get_all_ids_from_index(index, num_dimensions=1536, namespace="")
print(all_ids)
To read all vectors from a Qdrant index using Python, use the qdrant_client library, the official Python client for the Qdrant vector search engine. It provides a convenient API to store, search, and manage vectors with an additional payload[1].
Here is a general outline of the steps:
Install the Qdrant Python client:
pip install qdrant-client
Initialize the client with the host and port where your Qdrant service is running:
from qdrant_client import QdrantClient
client = QdrantClient(host="localhost", port=6333)
Create a collection and insert some points:
from qdrant_client.http.models import Distance, VectorParams

client.create_collection(
    collection_name="test_collection",
    vectors_config=VectorParams(size=4, distance=Distance.DOT),
)

from qdrant_client.http.models import PointStruct

operation_info = client.upsert(
    collection_name="test_collection",
    wait=True,
    points=[
        PointStruct(id=1, vector=[0.05, 0.61, 0.76, 0.74], payload={"city": "Berlin"}),
        PointStruct(id=2, vector=[0.19, 0.81, 0.75, 0.11], payload={"city": "London"}),
        # Add more points as needed...
    ],
)
Retrieve vectors: use the scroll method, which iterates over all points in a collection page by page. Pass with_vectors=True to include the stored vectors, and feed the returned offset back in to fetch the next page[4][8]:

points, next_offset = client.scroll(
    collection_name="test_collection",
    limit=100,
    with_vectors=True,
)
while True:
    for point in points:
        print(point.id, point.vector)
    if next_offset is None:
        break
    points, next_offset = client.scroll(
        collection_name="test_collection",
        limit=100,
        with_vectors=True,
        offset=next_offset,
    )

This is also how pagination is handled for large collections: each call returns one page of results plus the offset of the next page, and the loop stops when that offset is None. See the Qdrant Python client documentation for details[8][5].
Docker Setup:
docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
Citations:
[1] https://qdrant.tech/documentation/
[2] https://qdrant.tech/documentation/concepts/indexing/
[3] https://qdrant.tech/documentation/concepts/points/
[4] https://python-client.qdrant.tech/py-modindex
[5] https://github.com/qdrant/qdrant-client
[6] https://python-client.qdrant.tech/genindex
[7] https://qdrant.tech/documentation/quick-start/
[8] https://python-client.qdrant.tech
[9] https://jina.ai/news/how-to-use-every-vector-database-in-python-with-docarray/
[10] https://gpt-index.readthedocs.io/en/v0.6.11/examples/data_connectors/QdrantDemo.html
[11] https://python.langchain.com/docs/integrations/vectorstores/qdrant
[12] https://qdrant.github.io/qdrant/redoc/index.html
[13] https://cookbook.openai.com/examples/vector_databases/qdrant/using_qdrant_for_embeddings_search