amikos-tech / chromadb-data-pipes
ChromaDB Data Pipes - The easiest way to get data into and out of ChromaDB
Home Page: https://datapipes.chromadb.dev/
License: MIT License
Ability to call any arbitrary Python function and adapt its input to the target format
Detailed info about import command
Ability to load code from remote sources. We can start with Chroma as a source of these functions.
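A possible sketch of the loading half, assuming remote functions arrive as plain Python source text (fetching, caching, and trust/signing are out of scope here; the module name is arbitrary):

```python
import importlib.util

def load_function(source: str, name: str):
    # Build an empty in-memory module and exec the fetched source into it.
    # Only do this for trusted sources (e.g. Chroma's own function registry).
    spec = importlib.util.spec_from_loader("cdp_remote", loader=None)
    module = importlib.util.module_from_spec(spec)
    exec(compile(source, "<remote>", "exec"), module.__dict__)
    return getattr(module, name)

# e.g. source fetched over HTTP from a known-good registry
double = load_function("def double(x):\n    return x * 2", "double")
```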
Use python-magic to detect MIME types and route files to the right LangChain loader automatically. LangChain has no built-in MIME router, so a plain mapping does the job (the loader set below is just an example):
import magic
from langchain_community.document_loaders import TextLoader, PyPDFLoader
# Map MIME types to loader classes
LOADERS = {
    'text/plain': TextLoader,
    'application/pdf': PyPDFLoader,
}
# Specify the path to the file
file_path = '/path/to/file.pdf'
# Detect the MIME type of the file
mime_type = magic.from_file(file_path, mime=True)
# Instantiate the matching loader and load the file
loader = LOADERS[mime_type](file_path)
documents = loader.load()
# Process the loaded documents as needed
for doc in documents:
    content = doc.page_content
# ...
Tracking issue for:
A simple primer on how to call an API to fetch data. Keep this simple: remote fetch, rudimentary pagination support, and basic JSONPath support (yes, we only support JSON-based REST APIs).
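A minimal sketch of the fetch + pagination + path-extraction core. The fetch callable and the dot-path syntax are assumptions standing in for a real HTTP client and full JSONPath:

```python
from typing import Callable, Iterator

def extract(obj, path: str):
    # Rudimentary dot-path lookup ("data.items") standing in for full JSONPath.
    for key in path.split("."):
        obj = obj[key]
    return obj

def fetch_all(fetch_page: Callable[[int], dict], items_path: str) -> Iterator[dict]:
    # fetch_page(page) returns one decoded JSON page;
    # stop at the first page whose item list is empty.
    page = 0
    while True:
        items = extract(fetch_page(page), items_path)
        if not items:
            return
        yield from items
        page += 1
```

In a real pipeline fetch_page would wrap something like requests.get(f"{url}?page={page}").json(); keeping it as a parameter keeps the pagination logic testable offline.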
Command prefixes like tx and imp may put unnecessary cognitive load on users. Evaluate dropping the prefix, e.g.:
cdp tx embed -> cdp embed
cdp tx chunk -> cdp chunk
OTEL Observability pattern to annotate structures.
We want to produce:
import sys
from contextlib import contextmanager

@contextmanager
def smart_open(filename=None):
    if filename:
        fh = open(filename, 'w')
    else:
        fh = sys.stdout
    try:
        yield fh
    finally:
        if filename:
            fh.close()
Initial thoughts: a reusable flow of cdp commands (like a recipe) with predefined inputs that the user either needs to define or can leave at their defaults.
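A sketch of what such a recipe could look like as data. The step strings and input names are illustrative, not final CLI syntax:

```python
# A recipe: ordered cdp steps plus inputs with defaults (None => user must supply).
RECIPE = {
    "inputs": {"url": None, "collection": None, "chunk_size": 512},
    "steps": [
        "cdp imp url {url}",
        "cdp tx chunk --size {chunk_size}",
        "cdp tx embed",
        "cdp imp chroma {collection}",
    ],
}

def render(recipe, **overrides):
    # Merge user overrides over the defaults and fail fast on missing inputs.
    params = {**recipe["inputs"], **overrides}
    missing = [k for k, v in params.items() if v is None]
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    return [step.format(**params) for step in recipe["steps"]]
```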
The URI should not be an option but the first and only positional argument for these commands.
Clean emojis from text and metadata
To keep it simple, let's use the same syntax as Chroma's where filter, e.g. --where '{"id":1}'
Need an ADR for saving temporary state for each step in case of failure.
Add chunk separator character
Support the following strategies:
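One separator-based strategy, sketched under assumptions (the separator parameter and the greedy character-count packing are illustrations, not the decided design):

```python
def chunk_text(text: str, max_len: int, separator: str = "\n") -> list:
    # Split on the separator, then greedily pack pieces into chunks
    # no longer than max_len characters (oversized pieces pass through whole).
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```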
Ability to select any of the available tools and expose it as a FastAPI endpoint, in a generic way.
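The FastAPI wiring itself is omitted here; a sketch of the generic part, deriving a request schema from any tool function's signature with the stdlib (the chunk tool is a hypothetical stand-in):

```python
import inspect
from typing import Any, Callable, Dict, Tuple

def tool_schema(fn: Callable) -> Dict[str, Tuple[Any, bool]]:
    # Map each parameter to (annotation, has_default); an API layer such as
    # FastAPI could turn this into a request model for a generic endpoint.
    sig = inspect.signature(fn)
    return {
        name: (p.annotation, p.default is not inspect.Parameter.empty)
        for name, p in sig.parameters.items()
    }

def chunk(text: str, size: int = 512) -> list:
    # Stand-in for a real cdp tool.
    return [text[i:i + size] for i in range(0, len(text), size)]
```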
We need a URI scheme as a shortcut for targeting a Chroma instance with the respective tenant/DB and collection:
http(s)://<basic_user or __auth_token__ or __x_chroma_token__>:<basic_password or token>@<host>:<port>/<database>/<collection>?tenant=<tenant>&batch_size=<batch_size>&limit=<limit>&offset=<offset>
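Parsing that scheme needs nothing beyond the stdlib; a sketch (field names follow the draft above, which is not final):

```python
from urllib.parse import urlparse, parse_qs

def parse_chroma_uri(uri: str) -> dict:
    u = urlparse(uri)
    # Path carries /<database>/<collection>; the rest are standard URL parts.
    database, _, collection = u.path.lstrip("/").partition("/")
    query = {k: v[0] for k, v in parse_qs(u.query).items()}
    return {
        "scheme": u.scheme,
        "user": u.username,        # basic user, __auth_token__ or __x_chroma_token__
        "credential": u.password,  # basic password or token
        "host": u.hostname,
        "port": u.port,
        "database": database,
        "collection": collection,
        **query,                   # tenant, batch_size, limit, offset
    }
```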
Push docs with each release
Ability to inject KV pairs of metadata
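A minimal sketch, assuming documents are plain dicts with an optional metadata key; letting existing keys win over injected ones is a design choice, not a settled behavior:

```python
def inject_metadata(docs, extra: dict):
    # Merge extra KV pairs into each document's metadata.
    for doc in docs:
        doc["metadata"] = {**extra, **doc.get("metadata", {})}
        yield doc
```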
Add example usages at the beginning of the docs so that users can quickly evaluate if this is the right tool for the job.
Read either a single text file or a directory of text files.
Provide an input audio file and get it transcribed using Whisper on-device or via the OpenAI API.
This is only applicable when creating a collection
Support via URI, e.g. file://path/to/persist_dir
Exports data directly from the SQLite DB without needing a running Chroma instance.
Useful for backups
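A generic sketch of streaming one table out of the SQLite file. Chroma's internal table names (e.g. "embeddings", "embedding_metadata") are version-dependent internals, so they are assumptions to verify against the actual schema:

```python
import sqlite3

def export_table(db_path: str, table: str):
    # Stream rows from one table as dicts. Inspect sqlite_master first
    # before relying on Chroma's internal table names.
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        for row in conn.execute(f"SELECT * FROM {table}"):  # table from trusted config
            yield dict(row)
    finally:
        conn.close()
```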
Read data from pinecone and output to stdout/file:
https://community.pinecone.io/t/how-to-retrieve-list-of-ids-in-an-index/380/11
import numpy as np

# `index` is assumed to be an already-initialized pinecone Index
def get_ids_from_query(index, input_vector):
    print("searching pinecone...")
    results = index.query(vector=input_vector, top_k=10000, include_values=False)
    ids = set()
    for result in results["matches"]:
        ids.add(result["id"])
    return ids

def get_all_ids_from_index(index, num_dimensions, namespace=""):
    num_vectors = index.describe_index_stats()["namespaces"][namespace]["vector_count"]
    all_ids = set()
    while len(all_ids) < num_vectors:
        # Query with a random vector to sample up to 10k ids at a time
        input_vector = np.random.rand(num_dimensions).tolist()
        all_ids.update(get_ids_from_query(index, input_vector))
        print(f"Collected {len(all_ids)} ids out of {num_vectors}.")
    return all_ids

all_ids = get_all_ids_from_index(index, num_dimensions=1536, namespace="")
print(all_ids)
To read all vectors from a Qdrant index using Python, use the qdrant_client library, the official Python client for the Qdrant vector search engine. It provides a convenient API to store, search, and manage vectors with an additional payload[1].
Here is a general outline of the steps:
Install the Qdrant Python client:
pip install qdrant-client
Initialize the client with the host and port where your Qdrant service is running:
from qdrant_client import QdrantClient
client = QdrantClient(host="localhost", port=6333)
Create a collection and insert some points:
from qdrant_client.http.models import Distance, VectorParams

client.create_collection(
    collection_name="test_collection",
    vectors_config=VectorParams(size=4, distance=Distance.DOT),
)

from qdrant_client.http.models import PointStruct

operation_info = client.upsert(
    collection_name="test_collection",
    wait=True,
    points=[
        PointStruct(id=1, vector=[0.05, 0.61, 0.76, 0.74], payload={"city": "Berlin"}),
        PointStruct(id=2, vector=[0.19, 0.81, 0.75, 0.11], payload={"city": "London"}),
        # Add more points as needed...
    ],
)
Retrieve vectors: use the scroll method, which iterates over all points in a collection page by page. Pass with_vectors=True to include the stored vectors, and feed the returned offset back in to fetch the next page[4][8]:

points, next_offset = client.scroll(
    collection_name="test_collection",
    limit=100,
    with_vectors=True,
)
while True:
    for point in points:
        print(point.id, point.vector)
    if next_offset is None:
        break
    points, next_offset = client.scroll(
        collection_name="test_collection",
        limit=100,
        with_vectors=True,
        offset=next_offset,
    )

This is also how pagination is handled for large collections: each call returns one page of results plus the offset of the next page, and the loop stops when that offset is None. See the Qdrant Python client documentation for details[8][5].
Docker Setup:
docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
Citations:
[1] https://qdrant.tech/documentation/
[2] https://qdrant.tech/documentation/concepts/indexing/
[3] https://qdrant.tech/documentation/concepts/points/
[4] https://python-client.qdrant.tech/py-modindex
[5] https://github.com/qdrant/qdrant-client
[6] https://python-client.qdrant.tech/genindex
[7] https://qdrant.tech/documentation/quick-start/
[8] https://python-client.qdrant.tech
[9] https://jina.ai/news/how-to-use-every-vector-database-in-python-with-docarray/
[10] https://gpt-index.readthedocs.io/en/v0.6.11/examples/data_connectors/QdrantDemo.html
[11] https://python.langchain.com/docs/integrations/vectorstores/qdrant
[12] https://qdrant.github.io/qdrant/redoc/index.html
[13] https://cookbook.openai.com/examples/vector_databases/qdrant/using_qdrant_for_embeddings_search