amikos-tech / chromadb-data-pipes

ChromaDB Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB

Home Page: https://datapipes.chromadb.dev/

License: MIT License

Python 99.26% Dockerfile 0.28% Makefile 0.46%
ai chromadb machine-learning ml mlops pipeline

chromadb-data-pipes's People

Contributors: tazarov

Forkers: chunthebear

chromadb-data-pipes's Issues

External Fns

Ability to load code from remote sources. We can start with Chroma as a source of these functions.
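One possible mechanism, sketched with the stdlib: build a module object from source text returned by any zero-argument callable, so the fetch side (HTTP, a Chroma document lookup, etc.) stays pluggable. All names here are hypothetical.

```python
import types


def load_remote_module(name: str, fetch) -> types.ModuleType:
    """Build a module from source code returned by `fetch()`.

    `fetch` is any zero-argument callable returning Python source as a
    string -- it could wrap urllib.request.urlopen(...) for an HTTP
    source, or a Chroma lookup; the loader does not care where code lives.
    """
    source = fetch()
    module = types.ModuleType(name)
    exec(compile(source, f"<remote:{name}>", "exec"), module.__dict__)
    return module


# Faked "remote" source for illustration:
mod = load_remote_module("ext", lambda: "def double(x):\n    return 2 * x\n")
print(mod.double(21))  # -> 42
```

Executing fetched code is a trust decision; a real implementation would want signing or an allow-list of sources.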

Generic file generator

Route files by MIME type (detected with python-magic) to the appropriate LangChain loader. A sketch using loaders from langchain_community:

import magic  # python-magic
from langchain_community.document_loaders import PyPDFLoader, TextLoader

# Map MIME types to loader classes; extend as new formats are needed.
LOADERS = {
    'text/plain': TextLoader,
    'application/pdf': PyPDFLoader,
}

# Specify the path to the file
file_path = '/path/to/file.pdf'

# Detect the MIME type from file contents rather than the extension
mime_type = magic.from_file(file_path, mime=True)

# Pick the loader registered for this MIME type
loader_cls = LOADERS[mime_type]

# load() returns a list of Document objects
documents = loader_cls(file_path).load()

# Process the loaded documents as needed
content = documents[0].page_content
# ...

Remote API import

A simple primer on how to call an API to fetch data. Keep this simple: remote fetch + rudimentary pagination support + basic JSONPath support (yes, we only support JSON-based REST APIs).
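A stdlib sketch of the shape this could take. The page parameter and the dotted-path syntax are placeholders; a real implementation would likely use requests and a proper JSONPath library.

```python
import json
import urllib.request


def extract(path: str, obj):
    """Follow a dotted path like 'data.items' into nested dicts/lists."""
    for key in path.split("."):
        obj = obj[int(key)] if isinstance(obj, list) else obj[key]
    return obj


def fetch_pages(url: str, items_path: str, page_param: str = "page"):
    """Yield items page by page until a page comes back empty."""
    page = 1
    while True:
        with urllib.request.urlopen(f"{url}?{page_param}={page}") as resp:
            body = json.load(resp)
        items = extract(items_path, body)
        if not items:
            return
        yield from items
        page += 1


print(extract("data.items.0", {"data": {"items": ["a", "b"]}}))  # -> a
```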

Observability

OTEL Observability pattern to annotate structures.

We want to produce:

  • Traces
  • Metrics
  • Logs
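The annotation could take the shape of a decorator like the one below. This sketch uses stdlib stand-ins (time, logging, a counter dict) purely to show the pattern; in practice all three signals would come from the opentelemetry-api/sdk packages.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cdp")
metrics = {}  # step name -> invocation count; stand-in for an OTEL counter


def observed(name):
    """Annotate a pipeline step with a trace span, a metric, and a log line."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()               # span start
            metrics[name] = metrics.get(name, 0) + 1  # metric increment
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                logger.info("%s finished in %.4fs", name, elapsed)  # log line
        return wrapper
    return decorator


@observed("embed")
def embed(docs):
    return [len(d) for d in docs]
```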

Add smart_open to reduce code duplication

import sys
from contextlib import contextmanager


@contextmanager
def smart_open(filename=None):
    """Yield a writable handle: `filename` if given, otherwise stdout."""
    if filename:
        fh = open(filename, 'w')
    else:
        fh = sys.stdout

    try:
        yield fh
    finally:
        # Only close handles we opened; never close stdout.
        if filename:
            fh.close()

Pipelines

What is the purpose of pipelines

Initial thoughts: a reusable flow of cdp commands (like a recipe) with predefined inputs that the user either defines or leaves at their defaults.
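As a brainstorm, such a recipe might be declared as a small config file; every key and command below is hypothetical, just to make the idea concrete:

```yaml
# pipeline.yaml (hypothetical format)
name: import-pdfs
inputs:
  source_dir: ./docs      # user must supply
  collection: my-docs     # default, overridable
steps:
  - cdp imp pdf ${source_dir}
  - cdp tx chunk --size 512
  - cdp exp chroma ${collection}
```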

Where query in export

To keep it simple, let's use the same syntax as Chroma's where e.g. --where '{"id":1}'

Save state

Need an ADR for saving temporary state for each step in case of failure.

FastAPI Wrappers

Ability to select any of the available tools and expose it as a FastAPI endpoint, in a generic way.
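One possible generic shape: keep a registry of tool callables and generate one endpoint per registered tool. The sketch below shows just the registry half with plain callables (all names are hypothetical); the FastAPI half would loop over the registry and attach each dispatcher via `app.post(f"/tools/{name}")`.

```python
TOOLS = {}


def tool(name):
    """Register a callable so it can later be exposed as an endpoint."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@tool("export")
def export(collection: str, limit: int = 10):
    # Stand-in for a real cdp tool.
    return {"collection": collection, "limit": limit}


def dispatch(name, payload: dict):
    """What a generic POST handler would do with the request body."""
    return TOOLS[name](**payload)


print(dispatch("export", {"collection": "docs"}))
```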

CDP URI Scheme

We need a URI scheme as a shortcut for targeting a Chroma instance with the respective tenant/db and collection:

http(s)://<basic_user or __auth_token__ or __x_chroma_token__>:<basic_password or token>@<host>:<port>/<database>/<collection>?tenant=<tenant>&batch_size=<batch_size>&limit=<limit>&offset=<offset>
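Such URIs parse cleanly with the stdlib; a sketch against an illustrative example value:

```python
from urllib.parse import urlsplit, parse_qs

uri = ("http://__auth_token__:secret@localhost:8000"
       "/default_database/my-collection?tenant=default_tenant&batch_size=100")

parts = urlsplit(uri)
database, collection = parts.path.lstrip("/").split("/")
params = {k: v[0] for k, v in parse_qs(parts.query).items()}

target = {
    "host": parts.hostname,
    "port": parts.port,
    "auth_user": parts.username,   # basic user, __auth_token__, or __x_chroma_token__
    "auth_secret": parts.password,
    "database": database,
    "collection": collection,
    **params,                      # tenant, batch_size, limit, offset
}
print(target)
```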

Use Case Support

Add example usages at the beginning of the docs so that users can quickly evaluate if this is the right tool for the job.

Transcribe command

Provide an input audio file and get it transcribed using Whisper on-device or via the OpenAI API.

Direct WAL export

Exports data directly from the SQLite DB without needing a running Chroma instance.

Useful for backups.
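The stdlib sqlite3 module is enough for a first pass. The table and column names below reflect one Chroma schema version (documents stored in `embedding_metadata` under the key `chroma:document`) and may differ across releases, so treat them as assumptions to verify:

```python
import sqlite3


def export_documents(db_path: str):
    """Yield (embedding_id, document) pairs straight from chroma.sqlite3.

    Assumes the schema where documents live in `embedding_metadata`
    under key 'chroma:document' -- verify against your Chroma version.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            """
            SELECT e.embedding_id, m.string_value
            FROM embeddings e
            JOIN embedding_metadata m ON m.id = e.id
            WHERE m.key = 'chroma:document'
            """
        )
        yield from rows
    finally:
        con.close()
```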

Pinecone Generator

Read data from Pinecone and output to stdout/file:

https://community.pinecone.io/t/how-to-retrieve-list-of-ids-in-an-index/380/11

import numpy as np


def get_ids_from_query(index, input_vector):
    """Query with a random vector and collect the ids of the matches."""
    print("searching pinecone...")
    results = index.query(vector=input_vector, top_k=10000, include_values=False)
    return {match["id"] for match in results["matches"]}


def get_all_ids_from_index(index, num_dimensions, namespace=""):
    """Repeat random-vector queries until every id in the namespace is seen.

    Note: termination relies on random queries eventually covering all
    vectors; very large indexes may need many iterations.
    """
    num_vectors = index.describe_index_stats()["namespaces"][namespace]["vector_count"]
    all_ids = set()
    while len(all_ids) < num_vectors:
        input_vector = np.random.rand(num_dimensions).tolist()
        all_ids.update(get_ids_from_query(index, input_vector))
        print(f"Collected {len(all_ids)} ids out of {num_vectors}.")
    return all_ids


all_ids = get_all_ids_from_index(index, num_dimensions=1536, namespace="")
print(all_ids)

Qdrant Generator

To read all vectors from a Qdrant index using Python, you would typically use the qdrant_client library, which is the official Python client for interacting with the Qdrant vector search engine. The documentation for Qdrant suggests that you can interact with the service using a convenient API to store, search, and manage vectors with an additional payload[1].

Here is a general outline of steps you would follow to read vectors from a Qdrant index using Python:

  1. Install the Qdrant Python Client: If you haven't already, you need to install the qdrant-client package using pip:

    pip install qdrant-client
  2. Initialize the Client: Import the QdrantClient from the qdrant_client module and initialize it with the host and port where your Qdrant service is running:

    from qdrant_client import QdrantClient
    from qdrant_client.http.models import Distance, VectorParams, PointStruct

    client = QdrantClient(host="localhost", port=6333)

    # Optional setup: create a collection and insert a few points to read back.
    client.create_collection(
        collection_name="test_collection",
        vectors_config=VectorParams(size=4, distance=Distance.DOT),
    )
    operation_info = client.upsert(
        collection_name="test_collection",
        wait=True,
        points=[
            PointStruct(id=1, vector=[0.05, 0.61, 0.76, 0.74], payload={"city": "Berlin"}),
            PointStruct(id=2, vector=[0.19, 0.81, 0.75, 0.11], payload={"city": "London"}),
            # Add more points as needed...
        ],
    )
  3. Retrieve Vectors: the client exposes a scroll method that iterates over the points in a collection; pass with_vectors=True to include the vectors themselves in the results[4][8]. A sketch:

    points, next_offset = client.scroll(
        collection_name="test_collection",
        limit=100,
        with_payload=True,
        with_vectors=True,
    )
    for point in points:
        print(point.id, point.vector)

  4. Handle Pagination: scroll returns the next page's offset alongside the points. For large collections, keep calling it with offset=next_offset until the returned offset is None.

For details on the available retrieval methods and their parameters, consult the Qdrant Python Client Documentation[8]; the Qdrant GitHub repository also contains examples and further documentation[5].

Docker Setup:

docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant

Citations:
[1] https://qdrant.tech/documentation/
[2] https://qdrant.tech/documentation/concepts/indexing/
[3] https://qdrant.tech/documentation/concepts/points/
[4] https://python-client.qdrant.tech/py-modindex
[5] https://github.com/qdrant/qdrant-client
[6] https://python-client.qdrant.tech/genindex
[7] https://qdrant.tech/documentation/quick-start/
[8] https://python-client.qdrant.tech
[9] https://jina.ai/news/how-to-use-every-vector-database-in-python-with-docarray/
[10] https://gpt-index.readthedocs.io/en/v0.6.11/examples/data_connectors/QdrantDemo.html
[11] https://python.langchain.com/docs/integrations/vectorstores/qdrant
[12] https://qdrant.github.io/qdrant/redoc/index.html
[13] https://cookbook.openai.com/examples/vector_databases/qdrant/using_qdrant_for_embeddings_search
