
neumtry / neumai

807 stars · 9 watchers · 44 forks · 3.92 MB

Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

Home Page: https://neum.ai

License: Apache License 2.0

Languages: Python 100.00%
Topics: ai, data, embeddings, etl, llm, vector-database, chatgpt, data-engineering, database, pipeline

neumai's People

Contributors

ddematheu, kevinco26, prashantdixit0, sky-2002, sunilkumardash9



neumai's Issues

ModuleNotFoundError: No module named 'neumai_tools'

python: 3.10.12
neumai: 0.0.33

error:

    Traceback (most recent call last):
      File "/Users/xxxxxx/xxxxxx/neum_test.py", line 4, in <module>
        from neumai.Chunkers.RecursiveChunker import RecursiveChunker
      File "/Users/xxxxxx/miniforge3/lib/python3.10/site-packages/neumai/Chunkers/__init__.py", line 3, in <module>
        from .CustomChunker import CustomChunker
      File "/Users/xxxxxx/miniforge3/lib/python3.10/site-packages/neumai/Chunkers/CustomChunker.py", line 4, in <module>
        from neumai_tools.SemanticHelpers import semantic_chunking
    ModuleNotFoundError: No module named 'neumai_tools'

LanceDBSink

TypeError: LanceDBSink.search() got an unexpected keyword argument 'filters'

Unified filter condition mapping

@ddematheu @kevinco26
Currently we use a dictionary to provide filters on metadata; this approach lacks support for:

  • Range-based queries
  • "Not equal to" queries
  • "Less than" / "greater than" queries

A simple solution to start with:

  • We can expect the user to provide a string of filters, for example "field1 <= value1, field2 != value2", instead of a dictionary ({"field1": "value1", "field2": "value2"}).
  • Then create FilterCondition objects by parsing this string:

from typing import Any

class FilterCondition:
    # op is a FilterOperator (e.g. an enum over ==, !=, <, <=, >, >=)
    def __init__(self, column: str, op: "FilterOperator", value: Any):
        self.column = column
        self.op = op
        self.value = value

  • Then map these conditions onto the filter options of the corresponding databases.
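The parsing step could be sketched roughly as follows; the helper names and the exact operator set are assumptions, not an existing API:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, List

class FilterOperator(Enum):
    # Multi-character operators are listed before their single-character
    # prefixes so that "<=" is matched before "<".
    EQ = "=="
    NEQ = "!="
    LTE = "<="
    GTE = ">="
    LT = "<"
    GT = ">"

@dataclass
class FilterCondition:
    column: str
    op: FilterOperator
    value: Any

def parse_filters(filter_string: str) -> List[FilterCondition]:
    """Parse "field1 <= value1, field2 != value2" into FilterCondition objects."""
    conditions = []
    for clause in filter_string.split(","):
        clause = clause.strip()
        for op in FilterOperator:
            if op.value in clause:
                column, value = clause.split(op.value, 1)
                conditions.append(FilterCondition(column.strip(), op, value.strip()))
                break
    return conditions

conds = parse_filters("field1 <= value1, field2 != value2")
```

Each parsed condition can then be translated into the native filter syntax of the target database.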

Let me know your thoughts on this; I would like to contribute.

Support self-hosted API for embeddings

Support using embedding services through a URL and API key (or similar). This would allow Neum to be more open and less vendor-locked to the currently supported services.
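A minimal sketch of what such a connector's request could look like; the endpoint path and payload fields are assumptions, modeled loosely on the common OpenAI-style embeddings contract, and a real connector would let the user configure them per service:

```python
from typing import Dict, List

def build_embed_request(base_url: str, api_key: str, texts: List[str],
                        model: str = "default") -> Dict:
    """Build an HTTP request description for a self-hosted embedding service.

    The "/embeddings" path and payload shape are assumptions; the point is
    that the connector only needs a URL and an API key from the user.
    """
    return {
        "url": f"{base_url.rstrip('/')}/embeddings",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {"model": model, "input": texts},
    }

req = build_embed_request("http://localhost:8080/", "my-key", ["hello world"])
```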

Structured Search Pipeline

Querying requirements across RAG fall not only on unstructured data that has been embedded and added to a vector database; they also fall on structured data sources where semantic search doesn't really make sense.

Goal: Provide a pipeline interface that connects to a structured data source and generates structured queries in real time based on incoming user queries.

Implementation:

  • Pseudo-Pipeline without an embed or sink connector, just a data source.
  • The data source connector is configured, and an initial pull from the database is done to examine the available fields and their types.
  • Search generates a query using an LLM based on the fields available in the database.
  • The Pipeline can be used as part of a PipelineCollection and supported by smart_route, so the model can decide when to use it.
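The query-generation step could be sketched as a prompt built from the schema discovered in the initial pull; the function name and prompt wording are hypothetical:

```python
from typing import Dict

def build_query_prompt(schema: Dict[str, str], user_query: str) -> str:
    """Build an LLM prompt asking for a SQL query over the known fields.

    schema maps field name -> type, as discovered in the initial pull from
    the data source connector.
    """
    fields = "\n".join(f"- {name}: {dtype}" for name, dtype in schema.items())
    return (
        "You are given a table with the following fields:\n"
        f"{fields}\n"
        f"Write a SQL query that answers: {user_query}\n"
        "Return only the SQL."
    )

prompt = build_query_prompt({"price": "float", "city": "text"},
                            "average price in Seattle")
```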

Alternative implementation:

  • To reduce the latency of making 2-3 back-to-back LLM calls to generate a query and validate it, the query generation could be done pre-emptively and cached in a vector database.
  • Using an LLM, we would try to predict the top sets of queries one might expect from the database, plus their permutations. (This might limit the complexity of the queries, but could answer for 80% of use cases.)
  • At search time we would run a similarity search of the incoming query against the descriptions of the "cached" queries, then run the top-matching query against the database.
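The cached-query lookup could be sketched like this; the cache layout is a stand-in, and a real implementation would use the pipeline's embed connector and vector database rather than in-memory vectors:

```python
from typing import List, Tuple

def pick_cached_query(
    query_embedding: List[float],
    cache: List[Tuple[List[float], str]],  # (description embedding, SQL)
) -> str:
    """Return the cached SQL whose description is most similar to the query."""
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)
    best = max(cache, key=lambda item: cosine(query_embedding, item[0]))
    return best[1]

sql = pick_cached_query(
    [1.0, 0.0],
    [([0.9, 0.1], "SELECT avg(price) FROM listings"),
     ([0.0, 1.0], "SELECT count(*) FROM users")],
)
```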

Chat History Pipeline

As chat histories get longer, passing the entire history on every call is not good practice. Moreover, users expect information from several messages ago to remain available as context.

Goal: Extend the effective chat-history context window so users can reference messages that fall outside the existing window.

Solution: Leverage semantic search to index the entire chat history of a conversation and pull messages that are related to the latest message from the user.

Implementation:

  • Create a pseudo-Pipeline object that uses a custom source connector that simply passes messages written to it through to a vector database.
  • The Pipeline is declared with an Embed Connector and a Sink Connector to be used as part of the operation.
  • At search time we would run a normal search against the sink, with filters to only pull messages from the given conversation.
  • The user would then add the retrieved messages as context into the conversation, alongside the last 3-4 messages.
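A toy sketch of the retrieval step, using an in-memory list in place of the sink; the field names and the pre-computed "score" are assumptions standing in for the sink's similarity search:

```python
from typing import Dict, List

def search_history(
    store: List[Dict],          # each: {"conversation_id", "text", "score"}
    conversation_id: str,
    top_k: int = 3,
) -> List[str]:
    """Pull the most relevant stored messages, filtered to one conversation.

    "score" stands in for the similarity the sink would compute against the
    latest user message.
    """
    in_convo = [m for m in store if m["conversation_id"] == conversation_id]
    in_convo.sort(key=lambda m: m["score"], reverse=True)
    return [m["text"] for m in in_convo[:top_k]]

msgs = search_history(
    [{"conversation_id": "c1", "text": "old but relevant", "score": 0.9},
     {"conversation_id": "c2", "text": "other convo", "score": 0.95},
     {"conversation_id": "c1", "text": "less relevant", "score": 0.2}],
    "c1", top_k=1,
)
```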

Prototyped: https://github.com/NeumTry/Pensieve

Other ideas:

  • Any chat systems that are worth integrating? (Twilio?)

Self-improving vector db based on feedback

When a sink is queried using the search API, if the retrieved information is correct (based on feedback or by running results against a different model), we could re-ingest the query/result pair (query and resulting vector) back into the vector DB, using the query as the embedded value. The goal is that similar future queries reliably retrieve the information that was confirmed correct.
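The feedback loop could be sketched with a toy in-memory store; the store shape and function name are hypothetical:

```python
from typing import Dict, List

def reingest_on_positive_feedback(
    store: List[Dict],                 # toy vector store: {"embedding", "payload"}
    query_embedding: List[float],
    retrieved_payload: str,
    feedback_positive: bool,
) -> None:
    """On positive feedback, store the retrieved payload again, keyed by the
    query's own embedding, so similar future queries hit it directly."""
    if feedback_positive:
        store.append({"embedding": query_embedding, "payload": retrieved_payload})

store: List[Dict] = []
reingest_on_positive_feedback(store, [0.1, 0.2], "correct answer", True)
```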

Implement Pipeline Collection smart search

Currently we support unified (results re-ranked into a single list) and separate (results for each pipeline returned separately) searches for a collection.

Adding smart search, which will route intelligently to identify which pipelines are worth searching based on the query, matching the query against each pipeline's description.
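The routing step could be sketched like this; a real implementation would embed the descriptions and the query and compare similarities, but word overlap keeps the sketch self-contained (names are hypothetical):

```python
from typing import Dict, List

def smart_route(query: str, pipelines: Dict[str, str]) -> List[str]:
    """Pick pipelines whose description shares terms with the query.

    pipelines maps pipeline name -> description. Returns matching pipeline
    names, best match first.
    """
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(desc.lower().split())), name)
        for name, desc in pipelines.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

routes = smart_route(
    "refund policy for orders",
    {"support_docs": "customer support and refund policy documents",
     "eng_wiki": "internal engineering wiki"},
)
```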

Filtering argument issues in search method

@ddematheu

  1. Currently, the filters argument in the SinkConnector.search method expects filters: List[FilterCondition] = {}. It should instead be filters: List[dict] = {}, and we then need to convert each dict to a FilterCondition using dict_to_filter_condition, because the user would provide a dictionary, not a FilterCondition object.
  2. Also, there needs to be consistency in naming the filtering argument, because in some places it is filter and in others it is filters.

Let me know your opinion on this, and I will open a PR.
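The conversion step could look roughly like this; the FilterCondition shape and the input dict layout are assumptions mirroring the classes discussed in the filter-mapping issue, not the library's actual API:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class FilterCondition:
    column: str
    op: str      # e.g. "==" ; an enum in a real implementation
    value: Any

def dict_to_filter_conditions(filters: List[Dict[str, Any]]) -> List[FilterCondition]:
    """Convert user-supplied dicts into FilterCondition objects.

    Assumes each dict is either {"field": ..., "operator": ..., "value": ...}
    or a plain {column: value} shorthand, which maps to equality.
    """
    conditions = []
    for f in filters:
        if {"field", "operator", "value"} <= f.keys():
            conditions.append(FilterCondition(f["field"], f["operator"], f["value"]))
        else:  # shorthand: {column: value} pairs, equality implied
            for column, value in f.items():
                conditions.append(FilterCondition(column, "==", value))
    return conditions

conds = dict_to_filter_conditions([{"city": "Seattle"}])
```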

Add file_id by default to each vector

file_id is a unique identifier for each file processed by a pipeline.

file_id = pipeline_id + cloudFile_id

Necessary to be able to leverage delete, update and augment capabilities.
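The composition above could be as simple as the following; the separator choice is an assumption:

```python
def make_file_id(pipeline_id: str, cloud_file_id: str) -> str:
    """Combine pipeline and file identifiers into a vector-level file_id.

    The cloud file id is unique within a pipeline, so prefixing it with the
    pipeline_id makes the result globally unique, which the delete, update
    and augment capabilities rely on.
    """
    return f"{pipeline_id}_{cloud_file_id}"

fid = make_file_id("pipeline-123", "file-abc")
```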
