
neumtry / neumai

807 stars · 9 watchers · 44 forks · 3.92 MB

Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

Home Page: https://neum.ai

License: Apache License 2.0

Languages: Python 100.00%
Topics: ai, data, embeddings, etl, llm, vector-database, chatgpt, data-engineering, database, pipeline

neumai's People

Contributors

ddematheu, kevinco26, prashantdixit0, sky-2002, sunilkumardash9



neumai's Issues

ModuleNotFoundError: No module named 'neumai_tools'

python: 3.10.12
neumai: 0.0.33

error:

    Traceback (most recent call last):
      File "/Users/xxxxxx/xxxxxx/neum_test.py", line 4, in <module>
        from neumai.Chunkers.RecursiveChunker import RecursiveChunker
      File "/Users/xxxxxx/miniforge3/lib/python3.10/site-packages/neumai/Chunkers/__init__.py", line 3, in <module>
        from .CustomChunker import CustomChunker
      File "/Users/xxxxxx/miniforge3/lib/python3.10/site-packages/neumai/Chunkers/CustomChunker.py", line 4, in <module>
        from neumai_tools.SemanticHelpers import semantic_chunking
    ModuleNotFoundError: No module named 'neumai_tools'

LanceDBSink

TypeError: LanceDBSink.search() got an unexpected keyword argument 'filters'

Unified filter condition mapping

@ddematheu @kevinco26
Currently we use a dictionary to provide filters on metadata; this approach lacks support for:

  • Range-based queries
  • "Not equal to" queries
  • "Less than" / "greater than" queries

A simple solution to start with:

  • We can expect the user to provide a string of filters, for example "field1 <= value1, field2 != value2", instead of a dictionary ({"field1": "value1", "field2": "value2"}).
  • Then create FilterCondition objects by parsing this string:

from typing import Any

class FilterCondition:
    # op is a FilterOperator (e.g. an enum over ==, !=, <, <=, >, >=)
    def __init__(self, column: str, op: "FilterOperator", value: Any):
        self.column = column
        self.op = op
        self.value = value

  • Then map these conditions onto the filter options of the corresponding databases.
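The parsing step could be sketched roughly as follows; the helper names and the exact operator set are assumptions, not an existing API:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, List

class FilterOperator(Enum):
    # Multi-character operators are listed before their single-character
    # prefixes so that "<=" is matched before "<".
    EQ = "=="
    NEQ = "!="
    LTE = "<="
    GTE = ">="
    LT = "<"
    GT = ">"

@dataclass
class FilterCondition:
    column: str
    op: FilterOperator
    value: Any

def parse_filters(filter_string: str) -> List[FilterCondition]:
    """Parse "field1 <= value1, field2 != value2" into FilterCondition objects."""
    conditions = []
    for clause in filter_string.split(","):
        clause = clause.strip()
        for op in FilterOperator:
            if op.value in clause:
                column, value = clause.split(op.value, 1)
                conditions.append(FilterCondition(column.strip(), op, value.strip()))
                break
    return conditions

conds = parse_filters("field1 <= value1, field2 != value2")
```

Each parsed condition can then be translated into the native filter syntax of the target database.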

Let me know your thoughts on this; I would like to contribute.

Support self-hosted API for embeddings

Support using embedding services through a URL and API key (or similar). This would allow Neum to be more open and less vendor-locked to the currently supported services.
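A minimal sketch of what such a connector's request could look like; the endpoint path and payload fields are assumptions, modeled loosely on the common OpenAI-style embeddings contract, and a real connector would let the user configure them per service:

```python
from typing import Dict, List

def build_embed_request(base_url: str, api_key: str, texts: List[str],
                        model: str = "default") -> Dict:
    """Build an HTTP request description for a self-hosted embedding service.

    The "/embeddings" path and payload shape are assumptions; the point is
    that the connector only needs a URL and an API key from the user.
    """
    return {
        "url": f"{base_url.rstrip('/')}/embeddings",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {"model": model, "input": texts},
    }

req = build_embed_request("http://localhost:8080/", "my-key", ["hello world"])
```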

Structured Search Pipeline

Querying requirements across RAG fall not only on unstructured data that has been embedded and added to a vector database; they also fall on structured data sources where semantic search doesn't really make sense.

Goal: Provide a pipeline interface that connects to a structured data source and generates structured queries in real time based on incoming user queries.

Implementation:

  • Pseudo-Pipeline without an embed or sink connector, just a data source.
  • The data source connector is configured, and an initial pull from the database is done to examine the available fields and their types.
  • Search generates a query using an LLM based on the fields available in the database.
  • The Pipeline can be used as part of a PipelineCollection and supported by smart_route, so the model can decide when to use it.
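The query-generation step could be sketched as a prompt built from the schema discovered in the initial pull; the function name and prompt wording are hypothetical:

```python
from typing import Dict

def build_query_prompt(schema: Dict[str, str], user_query: str) -> str:
    """Build an LLM prompt asking for a SQL query over the known fields.

    schema maps field name -> type, as discovered in the initial pull from
    the data source connector.
    """
    fields = "\n".join(f"- {name}: {dtype}" for name, dtype in schema.items())
    return (
        "You are given a table with the following fields:\n"
        f"{fields}\n"
        f"Write a SQL query that answers: {user_query}\n"
        "Return only the SQL."
    )

prompt = build_query_prompt({"price": "float", "city": "text"},
                            "average price in Seattle")
```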

Alternative implementation:

  • To reduce the latency of making 2-3 back-to-back LLM calls to generate a query and validate it, the query generation could be done pre-emptively and cached in a vector database.
  • Using an LLM, we would try to predict the top sets of queries one might expect from the database, plus their permutations. (This might limit the complexity of the queries, but could answer for 80% of use cases.)
  • At search time we would run a similarity search of the incoming query against the descriptions of the "cached" queries, then run the top-matching query against the database.
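The cached-query lookup could be sketched like this; the cache layout is a stand-in, and a real implementation would use the pipeline's embed connector and vector database rather than in-memory vectors:

```python
from typing import List, Tuple

def pick_cached_query(
    query_embedding: List[float],
    cache: List[Tuple[List[float], str]],  # (description embedding, SQL)
) -> str:
    """Return the cached SQL whose description is most similar to the query."""
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)
    best = max(cache, key=lambda item: cosine(query_embedding, item[0]))
    return best[1]

sql = pick_cached_query(
    [1.0, 0.0],
    [([0.9, 0.1], "SELECT avg(price) FROM listings"),
     ([0.0, 1.0], "SELECT count(*) FROM users")],
)
```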

Chat History Pipeline

As chat histories get longer, passing the entire history on every call is not good practice. Moreover, users expect information from several messages ago to remain available as context.

Goal: Extend the effective chat-history context window so users can reference messages that fall outside the existing window.

Solution: Leverage semantic search to index the entire chat history of a conversation and pull messages that are related to the latest message from the user.

Implementation:

  • Create a pseudo-Pipeline object that uses a custom source connector that simply passes messages written to it through to a vector database.
  • The Pipeline is declared with an Embed Connector and a Sink Connector to be used as part of the operation.
  • At search time we would run a normal search against the sink, with filters to only pull messages from the given conversation.
  • The user would then add the retrieved messages as context into the conversation, alongside the last 3-4 messages.
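A toy sketch of the retrieval step, using an in-memory list in place of the sink; the field names and the pre-computed "score" are assumptions standing in for the sink's similarity search:

```python
from typing import Dict, List

def search_history(
    store: List[Dict],          # each: {"conversation_id", "text", "score"}
    conversation_id: str,
    top_k: int = 3,
) -> List[str]:
    """Pull the most relevant stored messages, filtered to one conversation.

    "score" stands in for the similarity the sink would compute against the
    latest user message.
    """
    in_convo = [m for m in store if m["conversation_id"] == conversation_id]
    in_convo.sort(key=lambda m: m["score"], reverse=True)
    return [m["text"] for m in in_convo[:top_k]]

msgs = search_history(
    [{"conversation_id": "c1", "text": "old but relevant", "score": 0.9},
     {"conversation_id": "c2", "text": "other convo", "score": 0.95},
     {"conversation_id": "c1", "text": "less relevant", "score": 0.2}],
    "c1", top_k=1,
)
```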

Prototyped: https://github.com/NeumTry/Pensieve

Other ideas:

  • Any chat systems that are worth integrating? (Twilio?)

Self-improving vector db based on feedback

When a sink is queried using the search API, if the retrieved information is correct (based on feedback or by running results against a different model), we could re-ingest the query/result pair (query and resulting vector) back into the vector DB, using the query as the embedded value. The goal is that similar future queries reliably retrieve the information that was confirmed correct.
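The feedback loop could be sketched with a toy in-memory store; the store shape and function name are hypothetical:

```python
from typing import Dict, List

def reingest_on_positive_feedback(
    store: List[Dict],                 # toy vector store: {"embedding", "payload"}
    query_embedding: List[float],
    retrieved_payload: str,
    feedback_positive: bool,
) -> None:
    """On positive feedback, store the retrieved payload again, keyed by the
    query's own embedding, so similar future queries hit it directly."""
    if feedback_positive:
        store.append({"embedding": query_embedding, "payload": retrieved_payload})

store: List[Dict] = []
reingest_on_positive_feedback(store, [0.1, 0.2], "correct answer", True)
```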

Implement Pipeline Collection smart search

Currently we support unified (results re-ranked into a single list) and separate (results for each pipeline returned separately) searches for a collection.

Adding smart search, which will route intelligently to identify which pipelines are worth searching based on the query, matching the query against each pipeline's description.
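The routing step could be sketched like this; a real implementation would embed the descriptions and the query and compare similarities, but word overlap keeps the sketch self-contained (names are hypothetical):

```python
from typing import Dict, List

def smart_route(query: str, pipelines: Dict[str, str]) -> List[str]:
    """Pick pipelines whose description shares terms with the query.

    pipelines maps pipeline name -> description. Returns matching pipeline
    names, best match first.
    """
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(desc.lower().split())), name)
        for name, desc in pipelines.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

routes = smart_route(
    "refund policy for orders",
    {"support_docs": "customer support and refund policy documents",
     "eng_wiki": "internal engineering wiki"},
)
```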

Filtering argument issues in search method

@ddematheu

  1. Currently, the filters argument in the SinkConnector.search method expects filters: List[FilterCondition] = {}. It should instead be filters: List[dict] = {}, and we then need to convert each dict to a FilterCondition using dict_to_filter_condition, because the user would provide a dictionary, not a FilterCondition object.
  2. Also, there needs to be consistency in naming the filtering argument, because in some places it is filter and in others it is filters.

Let me know your opinion on this, and I will open a PR.
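The conversion step could look roughly like this; the FilterCondition shape and the input dict layout are assumptions mirroring the classes discussed in the filter-mapping issue, not the library's actual API:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class FilterCondition:
    column: str
    op: str      # e.g. "==" ; an enum in a real implementation
    value: Any

def dict_to_filter_conditions(filters: List[Dict[str, Any]]) -> List[FilterCondition]:
    """Convert user-supplied dicts into FilterCondition objects.

    Assumes each dict is either {"field": ..., "operator": ..., "value": ...}
    or a plain {column: value} shorthand, which maps to equality.
    """
    conditions = []
    for f in filters:
        if {"field", "operator", "value"} <= f.keys():
            conditions.append(FilterCondition(f["field"], f["operator"], f["value"]))
        else:  # shorthand: {column: value} pairs, equality implied
            for column, value in f.items():
                conditions.append(FilterCondition(column, "==", value))
    return conditions

conds = dict_to_filter_conditions([{"city": "Seattle"}])
```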

Add file_id by default to each vector

file_id is a unique identifier for each file processed by a pipeline.

file_id = pipeline_id + cloudFile_id

Necessary to be able to leverage delete, update and augment capabilities.
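The composition above could be as simple as the following; the separator choice is an assumption:

```python
def make_file_id(pipeline_id: str, cloud_file_id: str) -> str:
    """Combine pipeline and file identifiers into a vector-level file_id.

    The cloud file id is unique within a pipeline, so prefixing it with the
    pipeline_id makes the result globally unique, which the delete, update
    and augment capabilities rely on.
    """
    return f"{pipeline_id}_{cloud_file_id}"

fid = make_file_id("pipeline-123", "file-abc")
```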
