[RFC] Data Access Object Interface for Metadata Store about opensearch HOT 19 OPEN

dbwiddis commented on September 26, 2024 2

[RFC] Data Access Object Interface for Metadata Store

from opensearch.

Comments (19)

andrross commented on September 26, 2024 3

@msfroh Thanks! I believe @Bukhtawar is saying that the current Remote Store implementation uses the Repository interface for storing metadata, and that it is a bad fit for that use case. So Remote Store is yet another feature that would be interested in using the new metadata store feature. But just as there are questions around which search features the metadata store should support, we'd need to figure out which transactional and concurrency control features it should support as well.

from opensearch.

saratvemulapalli commented on September 26, 2024 1

@fddattal is also working along similar lines[1], abstracting how plugins interface with core for state.

[1] #13274

from opensearch.

xinlamzn commented on September 26, 2024

While plugins may not need full-text search capability, some search features are nice to have and could simplify existing plugin migration. Could we define the minimal search capability that could be easily implemented across different backend storages?

from opensearch.

arjunkumargiri commented on September 26, 2024

Thanks for creating this RFC. A few follow-up questions:

- What is the proposal if a plugin needs support for search functionality? Does each plugin need to extend the base implementation to add support for custom search functionality?
- In addition to the NodeClient-based default implementation, will any other default implementation be supported?

from opensearch.

dbwiddis commented on September 26, 2024

> While plugins may not need full-text search capability, some search features are nice to have and could simplify existing plugin migration. Could we define the minimal search capability that could be easily implemented across different backend storages?

We would probably want to include a listing feature ("search all" equivalent).

If we had keyword-based fields we could probably include a limited number of them as filtering of the list.
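To make the "list plus keyword filters" idea concrete, here is a minimal sketch, assuming documents can be reduced to a flat map of keyword fields. The names `ListableStore` and `InMemoryListableStore` are hypothetical and do not exist in OpenSearch; the in-memory map only stands in for a real backend.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical minimal search contract: list everything, optionally filtered by exact keyword matches. */
interface ListableStore {
    /** Stores a document, represented here as a flat map of keyword fields, under the given id. */
    void put(String id, Map<String, String> keywordFields);

    /** Returns the sorted ids of documents matching every filter entry; an empty filter lists all documents. */
    List<String> list(Map<String, String> filters);
}

/** Illustrative in-memory backend; a real one might be a system index or a key-value store. */
class InMemoryListableStore implements ListableStore {
    private final Map<String, Map<String, String>> docs = new ConcurrentHashMap<>();

    @Override
    public void put(String id, Map<String, String> keywordFields) {
        docs.put(id, Map.copyOf(keywordFields));
    }

    @Override
    public List<String> list(Map<String, String> filters) {
        List<String> ids = new ArrayList<>();
        // Full scan of all documents: this is the cost a pure key-value backend would also pay.
        for (Map.Entry<String, Map<String, String>> doc : docs.entrySet()) {
            if (matches(doc.getValue(), filters)) {
                ids.add(doc.getKey());
            }
        }
        Collections.sort(ids);
        return ids;
    }

    /** True when every filter field is present in the document with exactly the same value. */
    private static boolean matches(Map<String, String> fields, Map<String, String> filters) {
        for (Map.Entry<String, String> filter : filters.entrySet()) {
            if (!filter.getValue().equals(fields.get(filter.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

Restricting filters to exact keyword matches keeps the contract implementable on backends without a query engine, at the price of a scan when no secondary index is available.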

> What is the proposal if plugin needs support for search functionality? Does each plugin need to extend the base implementation to add support for custom search functionality?

This somewhat relates to @xinlamzn 's comment above. We need some sort of basic search; the problem is setting performance expectations. NoSQL DBs are optimized for key-value lookup, not for searching across all the data.

> In addition to NodeClient based default implementation, will any other default implementation be supported?

Possibly! I'd think things that are included already in OpenSearch would be prime candidates. A Java Client implementation for a remote cluster would probably have wide applicability. I'm not sure if Remote Storage is the right candidate for it but it could be explored. Beyond those, I'd expect it'd be driven by community requests.

from opensearch.

peternied commented on September 26, 2024

[Triage - attendees 1 2 3 4 5 6 7]
@dbwiddis Thanks for creating this RFC, seems like a great improvement for encapsulation

from opensearch.

dbwiddis commented on September 26, 2024

Looking at #13274, we've got the same goals and would benefit from the same implementation. Currently leaning toward the interface in this comment: #13274 (comment)

Not all plugins will need every method of the interface implemented, so we will need to consider default implementations or an abstract base class with no-op implementations.
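One way the default-implementation option could look is sketched below. The interface name `MetadataClient`, its methods, and `ReadOnlyMapClient` are assumptions made for illustration, not the interface from #13274. The sketch also makes one deliberate design choice: unimplemented operations fail with `UnsupportedOperationException` rather than silently doing nothing, so a missing implementation is visible to the caller.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

/** Hypothetical client interface where every operation is optional for implementers. */
interface MetadataClient {
    default CompletionStage<String> put(String id, String json) {
        return unsupported("put");
    }

    default CompletionStage<String> get(String id) {
        return unsupported("get");
    }

    default CompletionStage<Boolean> delete(String id) {
        return unsupported("delete");
    }

    /** Returns an already-failed stage so callers see the gap through the normal async error path. */
    private static <T> CompletionStage<T> unsupported(String operation) {
        return CompletableFuture.failedFuture(
            new UnsupportedOperationException(operation + " is not supported by this store")
        );
    }
}

/** Illustrative read-only store: overrides only get and inherits the failing defaults for the rest. */
class ReadOnlyMapClient implements MetadataClient {
    private final Map<String, String> data;

    ReadOnlyMapClient(Map<String, String> data) {
        this.data = Map.copyOf(data);
    }

    @Override
    public CompletionStage<String> get(String id) {
        // Completes with null when the id is unknown.
        return CompletableFuture.completedFuture(data.get(id));
    }
}
```

An abstract base class would work the same way; default methods have the advantage that an implementation can still extend some other class.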

from opensearch.

dbwiddis commented on September 26, 2024

Implementation using Custom (which extends ToXContent) for a cluster-based put (index) request:

```java
public class XContentClient implements Client {

    private final org.opensearch.client.Client client;

    public XContentClient(org.opensearch.client.Client client) {
        this.client = client;
    }

    @Override
    public CompletionStage<PutCustomResponse> putCustom(PutCustomRequest request) {
        CompletableFuture<PutCustomResponse> future = new CompletableFuture<>();
        try (XContentBuilder sourceBuilder = XContentFactory.jsonBuilder()) {
            client.index(
                new IndexRequest(request.index()).setRefreshPolicy(IMMEDIATE)
                    .source(request.custom().toXContent(sourceBuilder, ToXContent.EMPTY_PARAMS)),
                ActionListener.wrap(
                    r -> future.complete(
                        new PutCustomResponse.Builder().id(r.getId()).created(Result.CREATED.equals(r.getResult())).build()
                    ),
                    future::completeExceptionally
                )
            );
        } catch (IOException ioe) {
            // Parsing error
            future.completeExceptionally(ioe);
        }
        return future;
    }
}
```

from opensearch.

andrross commented on September 26, 2024

@dbwiddis What is the basic idea for plugins to migrate from their existing solution to using this? Will data movement be required (i.e. read all data from cluster state or system index being used today and rewrite it into this interface), or can this be a new facade on top of the existing data?

Regarding existing plugins migrating to this and the questions around search, have you done an inventory of the current plugins that exist within the OpenSearch project and their usage of metadata storage? Basically, I would tend to agree with your statement "plugin metadata tends to be more narrowly defined and include more key-value storage or a limited number of documents primarily referenced by their document ID" but it would be great if we can back that assertion up with some real numbers.

from opensearch.

dbwiddis commented on September 26, 2024

> @dbwiddis What is the basic idea for plugins to migrate from their existing solution to using this?

Funny you should ask, as I was working on this draft PR, which I hope shows a possible migration path. Open to feedback!

opensearch-project/ml-commons#2430

> Will data movement be required (i.e. read all data from cluster state or system index being used today and rewrite it into this interface), or can this be a new facade on top of the existing data?

The above PR maintains the data in the existing system index, no movement required. However, my next trick will be to use the OpenSearch Java Client to demonstrate the ability to read/write that same data to a remote cluster, and my previous POC would allow that same data to be read/written to a NoSQL store. Options abound.

TLDR: for existing cluster implementation it's a facade.

> Regarding existing plugins migrating to this and the questions around search, have you done an inventory of the current plugins that exist within the OpenSearch project and their usage of metadata storage?

Broadly, yes, but I have not taken a detailed look; for now I'm focusing on Flow Framework and ML Commons.

> Basically, I would tend to agree with your statement "plugin metadata tends to be more narrowly defined and include more key-value storage or a limited number of documents primarily referenced by their document ID" but it would be great if we can back that assertion up with some real numbers.

I'll definitely add that to my long to-do list!

from opensearch.

arjunkumargiri commented on September 26, 2024

> Implementation using Custom (which extends ToXContent) for a cluster-based put (index)

Custom is specific to the cluster metadata store. Can we create a new entity for plugin metadata, maybe Data or Document?

from opensearch.

msfroh commented on September 26, 2024

> Plugins should be able to define other persistent storage alternatives than the cluster state or system indices. Options include remote clusters, non-relational databases (MongoDB, NoSQL, DynamoDB, Apache Cassandra, HBase, and many others), Blob Storage (see Remote Cluster State RFC #9143).

Just to clarify, should the plugins manage the persistent storage alternatives? Or should plugins just talk with the DAO interface and not care what the persistent storage is? (I hope the latter.)

Or are you saying that the DAO implementations could be plugins? (I hope so.)

from opensearch.

dbwiddis commented on September 26, 2024

> Just to clarify, should the plugins manage the persistent storage alternatives? Or should plugins just talk with the DAO interface and not care what the persistent storage is? (I hope the latter.)

"The plugins" here is overly broad. Plugins will still have a choice of storage alternative, but that will be more of a configuration-level "management".

The vision here is:

- Code meant to create, read, update, delete, and search data should be using the DAO interface, so that code will never have to change. This goal is clear.
- The actual runtime class implementing the interface could be changed by the "plugin" but here we're essentially just changing an import, or injected binding, etc. This goal is a bit fuzzier. I can envision potentially choosing different storage implementations for different types of data.
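The "different storage implementations for different types of data" idea can be sketched as a routing implementation behind the same interface. This is only an illustration under stated assumptions: `Dao`, `InMemoryDao`, and `RoutingDao` are hypothetical names, and the in-memory maps stand in for real backends such as a system index or a remote store.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical DAO interface; calling code depends only on this. */
interface Dao {
    void put(String dataType, String id, String document);

    Optional<String> get(String dataType, String id);
}

/** Illustrative backend; stands in for a system index, remote cluster, or NoSQL store. */
class InMemoryDao implements Dao {
    private final Map<String, String> docs = new ConcurrentHashMap<>();

    @Override
    public void put(String dataType, String id, String document) {
        docs.put(dataType + "/" + id, document);
    }

    @Override
    public Optional<String> get(String dataType, String id) {
        return Optional.ofNullable(docs.get(dataType + "/" + id));
    }
}

/** Chooses a backend per data type; the mapping would come from configuration. */
class RoutingDao implements Dao {
    private final Map<String, Dao> byDataType;
    private final Dao fallback;

    RoutingDao(Map<String, ? extends Dao> byDataType, Dao fallback) {
        this.byDataType = Map.copyOf(byDataType);
        this.fallback = fallback;
    }

    private Dao route(String dataType) {
        return byDataType.getOrDefault(dataType, fallback);
    }

    @Override
    public void put(String dataType, String id, String document) {
        route(dataType).put(dataType, id, document);
    }

    @Override
    public Optional<String> get(String dataType, String id) {
        return route(dataType).get(dataType, id);
    }
}
```

Because `RoutingDao` is itself a `Dao`, code written against the interface does not change when the routing table does.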

> Or are you saying that the DAO implementations could be plugins? (I hope so.)

Maybe. Probably. I'm not sure yet. That's why this RFC. But this is conceptually similar to the Repository interface which eventually finds its way down to an azure-repository plugin.

At this point, the DAO implementations are a single class, so creating a whole plugin around them feels like overkill. They probably do at least need to be in separate maven artifacts.

from opensearch.

andrross commented on September 26, 2024

@msfroh @dbwiddis I think the telemetry effort might be a good parallel here: There is a TelemetryPlugin interface that allows injecting an implementation for transmitting/storing telemetry data to whatever other system you want. The server defines interfaces for emitting metrics (e.g. Tracer, MetricsRegistry) and wires up the implementation provided via TelemetryPlugin. Core pieces of the server now emit metrics through those interfaces. And finally, a TelemetryAwarePlugin interface was introduced to expose Tracer and MetricsRegistry to plugins so that they can emit metrics themselves.

So following that pattern, a new plugin would allow injecting an implementation for storing metadata into the core. The default could be cluster state and/or system indexes, but a plugin could just as easily provide an implementation for an external store. The core would define some new interface for reading/writing this metadata (I think this would look something like your Dao class) and wire up the appropriate implementation. Core features (such as search pipelines) would use this interface for reading and writing its metadata. And finally this new interface would be exposed to plugins for their own metadata storage needs.
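A rough sketch of that wiring follows, assuming a provider hook and a consumer hook loosely modeled on the roles described for TelemetryPlugin and TelemetryAwarePlugin. All names here (`MetadataStore`, `MetadataStorePlugin`, `MetadataStoreAwarePlugin`, `MetadataStoreWiring`) are hypothetical, and the real telemetry interfaces have different signatures.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical core-side interface that core features and plugins read and write through. */
interface MetadataStore {
    void put(String key, String value);

    Optional<String> get(String key);
}

/** Hypothetical provider hook: a plugin that supplies a store implementation. */
interface MetadataStorePlugin {
    MetadataStore createMetadataStore();
}

/** Hypothetical consumer hook: a plugin that wants access to whichever store was wired up. */
interface MetadataStoreAwarePlugin {
    void onMetadataStore(MetadataStore store);
}

/** Stand-in for a default backed by cluster state or a system index. */
class DefaultMetadataStore implements MetadataStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    @Override
    public void put(String key, String value) {
        data.put(key, value);
    }

    @Override
    public Optional<String> get(String key) {
        return Optional.ofNullable(data.get(key));
    }
}

class MetadataStoreWiring {
    /** Uses the first plugin-provided store, else the default, then hands it to every consumer plugin. */
    static MetadataStore wire(List<?> plugins) {
        MetadataStore store = null;
        for (Object plugin : plugins) {
            if (store == null && plugin instanceof MetadataStorePlugin) {
                store = ((MetadataStorePlugin) plugin).createMetadataStore();
            }
        }
        if (store == null) {
            store = new DefaultMetadataStore();
        }
        for (Object plugin : plugins) {
            if (plugin instanceof MetadataStoreAwarePlugin) {
                ((MetadataStoreAwarePlugin) plugin).onMetadataStore(store);
            }
        }
        return store;
    }
}
```

Taking the first provider is an arbitrary choice made for this sketch; a real implementation would need a policy for multiple providers, such as rejecting the configuration.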

One final point, @dbwiddis is coming at this from the perspective of plugin extensibility, but I believe this work also aligns with breaking apart the monolithic cluster state which is an impediment to scaling large clusters.

from opensearch.

Bukhtawar commented on September 26, 2024

There are more use cases, such as Remote Store using a metadata store, and we really need to think about the store's capabilities, for instance optimistic concurrency control, conditional updates, and transactions, which are missing from the Repository interface. We should list the common access patterns we envision across the varied use cases and see how they work with pluggable stores across cloud providers.

from opensearch.

msfroh commented on September 26, 2024

@andrross -- Yes! That's exactly the approach that I was thinking of -- pluggable metadata storage, where the producers/consumers don't care how it works or where the metadata is stored, just that it honors some interface.

@Bukhtawar -- I don't think @dbwiddis is suggesting that the Repository interface specifically would be good for metadata (since it's probably not, being eventually consistent). It's just an example of a general "I want to store file-like stuff somewhere" interface. In this case, the hypothetical new MetadataStorage interface would probably need to promise some guarantees around transactionality and concurrency control, and it would be up to implementations to honor those guarantees in order to be valid. (So, e.g. S3 on its own probably isn't what you want -- DynamoDB or DynamoDB + S3 would be better implementations.)
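As an illustration of the kind of guarantee such an interface might promise, here is a minimal sketch of a conditional (compare-and-set) update, the building block of optimistic concurrency control. `VersionedStore` and `Versioned` are hypothetical names, and the version counter is an assumption of this sketch, not something defined in the RFC.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** A stored value together with the version that wrote it. */
final class Versioned {
    final String value;
    final long version;

    Versioned(String value, long version) {
        this.value = value;
        this.version = version;
    }
}

/** Hypothetical store offering conditional writes keyed on a per-entry version. */
class VersionedStore {
    private final Map<String, Versioned> data = new ConcurrentHashMap<>();

    /** Returns the current value and version, or null if the key is absent. */
    Versioned get(String key) {
        return data.get(key);
    }

    /**
     * Writes only if the entry is still at expectedVersion (0 means "must not exist yet").
     * Returns true when the write was applied, false when another writer got there first.
     */
    boolean putIfVersion(String key, String value, long expectedVersion) {
        boolean[] written = { false };
        // compute runs atomically per key, so the check and the write cannot interleave.
        data.compute(key, (k, current) -> {
            long currentVersion = current == null ? 0 : current.version;
            if (currentVersion != expectedVersion) {
                return current; // conflict: leave the entry untouched
            }
            written[0] = true;
            return new Versioned(value, currentVersion + 1);
        });
        return written[0];
    }
}
```

A caller that gets `false` re-reads the entry, reapplies its change, and retries. A backend with conditional writes can honor this contract directly, while an eventually consistent blob store on its own cannot, which is the distinction drawn above.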

from opensearch.

lukas-vlcek commented on September 26, 2024

Very interesting idea indeed.

But if there will be any (meta)data stored in external systems then I think we also need to think about identity management that needs to be part of the communication with external stores. Should this topic be part of this proposal as well?

The simplest example would be a basic read/write permissions configuration for an external store (assuming the external system requires user authn/authz). Will that be part of the OpenSearch configuration? Or part of the configuration of a particular metadata store implementation? Or will this be left to external systems to handle? (For example, users would need to set up a proxy server in front of the external store to handle this.)

from opensearch.

andrross commented on September 26, 2024

@lukas-vlcek I think identity management would work similarly to how the repository plugins work in OpenSearch. Each implementation for any given remote metadata storage system would be responsible for defining how its specific credentials are provided, and the operator would be responsible for giving credentials with the necessary permissions for OpenSearch to interact with this system.

from opensearch.

dbwiddis commented on September 26, 2024

@andrross @lukas-vlcek @Bukhtawar

Circling back to this after several weeks of trying to migrate a plugin to use this. I'm a bit concerned that we may be trying to over-generalize "metadata" and store arbitrary things. We actually do have arbitrary blob storage in various locations with an interface. But:

> The default could be cluster state and/or system indexes, but a plugin could just as easily provide an implementation for an external store.

These two abstractions already store very specific types of data and have very different interfaces.

System indices store documents just like the rest of OpenSearch, and plugins expect the usual CRUD-S operations to work on them just like they always do. Sure, we can put that document (conceptually a JSON string) anywhere.

Cluster state is completely different in how it operates, but it's very consistent with a different interface. It's difficult to combine them.

from opensearch.
