opensearch-project / search-processor

Search Request Processor: pipeline for transformation of queries and results inline with a search request.

License: Apache License 2.0

Java 90.61% Shell 9.39%
opensearch relevance search-algorithms search-engine

search-processor's Introduction


Search Rerankers: AWS Kendra & AWS Personalize

Welcome!

This repository hosts the code for two self-install re-rankers that integrate into Search Pipelines. User documentation for the Personalize Reranker is here. For Kendra, it is here.

Search Processors: Where Do They Go?

The current guideline for developing processors is: if your processor would introduce new dependencies into OpenSearch Core (e.g. new libraries, or making network connections outside of OpenSearch), it should live in a separate repository. Consider creating it in a standalone repository, since each processor should be thought of like a *NIX command with input and output connected by pipes (i.e. a Search Pipeline): each processor should do one thing and do it well. Otherwise, it can go into the OpenSearch repository under org.opensearch.search.pipeline.common. If you have doubts, create an issue in OpenSearch Core and, if you have one, a new PR. Maintainers will help guide you.

History

This repository has also been used for discussion and ideas around search relevance. These discussions still exist here; however, due to the relatively new standard of one repository per plugin in OpenSearch, and because our implementations are beginning to make it into the OpenSearch build, we now have two repositories. This repository will develop into a plugin that allows OpenSearch users to rewrite search queries, rerank results, and log data about those actions. The other repository, dashboards-search-relevance, is where we will build front-end tooling to help relevance engineers and business users tune results.

Project Resources

Code of Conduct

This project has adopted the Amazon Open Source Code of Conduct. For more information see the Code of Conduct FAQ, or contact [email protected] with any additional questions or comments.

License

This project is licensed under the Apache v2.0 License.

Copyright

Copyright OpenSearch Contributors. See NOTICE for details.

search-processor's People

Contributors

amazon-auto, anirudha, darshitchanpura, dblock, kevinawskendra, kulket, macohen, mahitamahesh, manuelmhtr, mend-for-github-com[bot], mingshl, mishprs, msfroh, nocharger, ps48, sejli, yang-db


search-processor's Issues

[Testing Confirmation] Confirm current testing requirements

As part of the discussion around implementing an organization-wide testing policy, I am visiting each repo to see what tests they currently perform. I am conducting this work on GitHub so that it is easy to reference.

Looking at the Search Processor repository, it appears there is:

| Repository | Unit Tests | Integration Tests | Backwards Compatibility Tests | Additional Tests | Link |
| --- | --- | --- | --- | --- | --- |
| Search Processor | | | | Certificate of Origin, Link Checker, Create Documentation Issue | #104 |

    If there are any tests I missed, or anything you think all repositories in OpenSearch should have for testing, please respond to this issue with details.

    Baseline MAINTAINERS, CODEOWNERS, and external collaborator permissions

    Follow opensearch-project/.github#125 to baseline MAINTAINERS, CODEOWNERS, and external collaborator permissions.

    Close this issue when:

    1. MAINTAINERS.md has the correct list of project maintainers.
    2. CODEOWNERS exists and has the correct list of aliases.
    3. Repo permissions only contain individual aliases as collaborators with maintain rights, admin, and triage teams.
    4. All other teams are removed from repo permissions.

    If this repo's permissions were already baselined, please confirm the above when closing this issue.

    [PROPOSAL] Search Semantic Chaining Mechanisms

    Relevancy rewriters and rankers mechanism

    The purpose of this mechanism is to allow a concise and standard way of defining search relevancy work on both the
    query rewrite side and the results ranking side.

    This proposal is the collaboration of the

    The capability of chaining multiple search relevancy rewriters and possibly results rerankers would allow the following:

    • Combining different aspects of relevancy rewriting into a single chain
    • Creating a common standard for search relevancy related plugin components
    • Easily comparing query results under different ranking solutions
    • Simplifying the integration of such plugins into the search-relevancy dashboard using a dedicated API

    Chain Components

    Chain operators
    Each chain element is an operator which transforms the query content and sends it upstream to the next operator; we
    will call these operators Transformers.

    The expectation from a transformer is to have no additional side-effects apart from the query transformation.

    Chain payload
    The chain's payload is the query itself. Each transformer is expected to transform the query in such a way that it is
    processable by the next transformer.

    Chain termination step
    The chain is terminated by a terminal step which no longer emits the query to upstream components of the chain.
    This termination step is typically an actual execution of the query against the underlying search engine.

    Chain footsteps
    While a chain is executing, each operating transformer leaves a trail in the form of transformer-specific trail info.

    Chain execution
    The chain order will be defined as part of the query extension. If no such definition is found under the query
    extension, the fallback will be the specific query's index mapping definition of the rewriter (under the mapping's
    metadata).
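
    To make these concepts concrete, here is a minimal Java sketch of the chain described above. All type and method names are invented for illustration; this is not an existing OpenSearch API.

    import java.util.List;

    /** A chain operator: transforms the query payload and emits it upstream. */
    interface Transformer {
        /** Must have no side effects apart from the query transformation. */
        String transform(String query);
    }

    /** The termination step: executes the rewritten query instead of emitting it. */
    interface TerminalStep {
        List<String> executeAndRank(String query);
    }

    /** Drives the chain: each transformer feeds the next; the terminal step ends it. */
    final class Chain {
        private final List<Transformer> transformers;
        private final TerminalStep terminal;

        Chain(List<Transformer> transformers, TerminalStep terminal) {
            this.transformers = transformers;
            this.terminal = terminal;
        }

        List<String> run(String query) {
            for (Transformer t : transformers) {
                query = t.transform(query); // the payload is the query itself
            }
            return terminal.executeAndRank(query); // execute against the engine and rank
        }
    }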

    Rewriter Transformations

    The chain mechanism is actually a composition of query interceptors. The purpose of these interceptors is to chain
    the individual query rewriter plugins to one another sequentially.

    Rankers Transformations

    The chain terminates once its termination step is called; this termination step is the ranker operator.
    The ranker operator takes the query input, performs the actual query against the search engine, and ranks the results
    according to its own internal reasoning.

    We currently don't support paging in the chain termination step, so this step does not allow paging of
    the results.

    Configuration

    Each transformation/operator may use the following levels of configuration:

    • Plugin level configuration
    • Index level configuration
    • Query level configuration

    Plugin level configuration

    This level of configuration is supported by the Plugin API of OpenSearch and may be used for static configuration of
    the component.
    An implementation of this capability can make use of the BaseRestHandler endpoint extension mechanism.

    For example, Querqy uses such an endpoint for its rewrite rules definition:

    PUT /_plugins/_querqy/rewriter/common_rules

    {
      "class": "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
      "config": {
          "rules" : "request =>\nSYNONYM: GET"
      }
    }

    Index level configuration

    This level of configuration is supported by using the index mapping _meta DSL, which is an existing part of the
    mapping DSL.
    Example usage of the index mapping configuration:

    New chain mapping DSL
    For backwards compatibility we will use the index mapping _meta field to preserve the configuration information
    related to both the rewriters and the rankers.

    The chain parts will reside under the generic concepts:

    • rankers - list of ranker plugin configurations
    • rewriters - list of rewriter plugin configurations

    Metadata under my_index/_mapping

    {
      "_meta": {
        "rankers": [
          {
            "name": "kendra",
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ]
      }
    }

    The order of the ranker/rewriter is explicit and the chain will dispatch accordingly (unless another directive appears
    under the query chain-directive).

    Query level configuration

    This level of configuration is supported by using the query extension DSL. This section will have a new chain DSL
    structure. In a similar manner to the "_meta" section of the mapping DSL, the "ext" section will contain the rankers &
    rewriters lists.

    Extension under _search

    {
      "query": {
      },
      "ext": {
        "rewriters": [
          {
            "name": "querqy",
            "properties": {
              "querqy": {
                "matching_query": {
                  "must_match": {
                    "query": "rambo"
                  },
                  "multi_match": {
                    "query": "rambo",
                    "fields": [
                      "field1",
                      "field2"
                    ]
                  }
                },
                "query_fields": [
                  "title^3.0",
                  "brand^2.1",
                  "shortSummary"
                ]
              }
            }
          }
        ],
        "rankers": [
          {
            "name": "kendra",
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ]
      }
    }

    The order of the ranker/rewriter is explicit and the chain will dispatch accordingly (unless another directive appears
    under the query chain-directive).

    This is a flow chart visualization of the chain steps:

    ############                 ############             #############           #############
    # _Search  #                 #  querqy  #             #  kendra   #           #  Results  #
    #   -query #                 #  -rewrite#             #  -execute #           #    -   1  #
    #      ... #   --------->    #     query#  ---------> #    search # --------->#    -   2  #   
    #          #                 #          #             #  -rank    #           #    -   3  #
    ############                 #          #             #   results #           #    -   4  #
                                 ############             #############           #############
                                                               /\
                                                               ||
                                                               || 
                                                               || 
                                                               || 
                                                               \/ 
                                                          ###############
                                                          # opensearch  #  
                                                          #  -run-query #   
                                                          ###############
                                                          
    

    Chain Context

    Search Relevancy Context Information
    In order for the rewriter and ranker chain to track and be informed of all the modifications each step is
    performing, an execution context is needed.

    This context will have the following fields, which can be applied to any future plugin that needs to perform rewrites
    or ranking:

    • context (information about the current execution parameters)
      • params - an input to each and every ranker and rewriter, which each may use for its own needs

        • query - the original query that is carried forward down the chain
      • execution (execution-related content that is generated throughout the pipeline)

        • id - an auto-generated unique id describing the chain instance
        • rewriters - list of rewriter plugin query configurations
        • rankers - list of ranker plugin query configurations
        • exclude - removes rewriters/rankers that appear in the default index configuration

    The execution section may have additional internal fields which relate to the execution flow itself and are
    subject to future change.


    This context will be attached to the query DSL under the ext section.

    POST my_index/_search

    {
      "query": {
        "match_all": {}
      },
      "ext": {
        "context": {
          "params": {
            "query": {
              "match_all": {}
            }
          }
        },
        "execution": {
          "id": "ABC123",
          "rewriters": [
            {
              "name": "querqy",
              "properties": {
                "querqy": {
                  "matching_query": {
                    "must_match": {
                      "query": "rambo"
                    },
                    "multi_match": {
                      "query": "rambo",
                      "fields": [
                        "field1",
                        "field2"
                      ]
                    }
                  },
                  "query_fields": [
                    "title^3.0",
                    "brand^2.1",
                    "shortSummary"
                  ]
                }
              }
            }
          ],
          "rankers": [
            {
              "name": "kendra",
              "properties": {
                "title_fields": [
                  "title"
                ],
                "body_fields": [
                  "published",
                  "description"
                ]
              }
            }
          ]
        }
      }
    }
    

    Activating query rewriters / rerankers

    During the lifetime of the index, once a query runs against the index, the following steps will occur:

    1. verify that search-relevancy is activated for the index

      1. create a chain flow control component which will drive the chain of rewriters & rerankers
      2. create the search-relevancy context information (or use an existing one if it was already created)
    2. for each rewrite step in the rewriters list:

      1. dispatch execution to the plugin
      2. plugin receives the params section as parameters
      3. plugin changes the query
      4. plugin may add additional information on its execution step under ext->context->rewriters->$name$->info
      5. returns execution to the chain flow control
    3. for each semantic-ranker step in the rankers list:

      1. dispatch execution to the plugin
      2. plugin receives the params section as parameters
      3. plugin performs the ranking logic
      4. returns newly ranked results to the caller

    In case a rewriter/ranker doesn't appear in the query ext section but does appear in the relevant index mapping
    section, the configuration details from the index mapping section will be copied into the relevant query ext section.

    To disable a rewriter/ranker from being activated on a query in cases where the index mapping indicates it is part of
    the chain, add its name to the exclude list under the execution section; a minimal sketch of this resolution logic
    follows.
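
    All names in this hedged Java sketch are invented for illustration; it is not an existing API.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Illustrative only: resolve the effective rewriter/ranker chain for a query.
    // Entries from the query's ext section come first; index-mapping entries are
    // copied in unless the query's exclude list names them.
    final class ChainFlowControl {
        static List<String> resolveChain(List<String> fromQueryExt,
                                         List<String> fromIndexMapping,
                                         Set<String> exclude) {
            List<String> effective = new ArrayList<>(fromQueryExt);
            for (String name : fromIndexMapping) {
                if (!effective.contains(name) && !exclude.contains(name)) {
                    effective.add(name); // copied from the index mapping into the query ext
                }
            }
            return effective;
        }
    }

    For example, resolveChain(["querqy"], ["querqy", "kendra"], {"kendra"}) yields just ["querqy"].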

    Example

    Configuration Stage

    Step 0: Create plugin configuration settings

    PUT /_plugins/_querqy/rewriter

    {
      "common_rules": [
        {
          "class": "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
          "config": {
            "rules": "request =>\nSYNONYM: GET"
          }
        }
      ]
    }

    PUT /_plugins/_kendra

    {
      "config": {
        "endpoint": [
          "127.0.0.1",
          "0.0.0.0"
        ]
      }
    }

    Step 1: Create mapping for index my_index

    PUT my_index/_mapping

    {
      "_meta": {
        "rankers": [
          {
            "nane":"kendra", "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ]
      }
    }

    Query Stage

    Step 2: Original request from user: “rambo”

    Step 2.1: Structured query from application coming to OpenSearch (this is done by the customer’s application)

    POST my_index/_search

    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "topic": "hobby"
              }
            }
          ],
          "filter": [
            {
              "range": {
                "dateField": {
                  "gte": "now-12d",
                  "lte": "now-10d"
                }
              }
            }
          ]
        }
      }
    }
    

    The chain flow control intercepts the index search request and dispatches it to each query rewriter in turn:

    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "topic": "hobby"
              }
            }
          ],
          "filter": [
            {
              "range": {
                "dateField": {
                  "gte": "now-12d",
                  "lte": "now-10d"
                }
              }
            }
          ]
        }
      },
      "ext": {
        "context": {
          "params": {
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "topic": "hobby"
                    }
                  }
                ],
                "filter": [
                  {
                    "range": {
                      "dateField": {
                        "gte": "now-12d",
                        "lte": "now-10d"
                      }
                    }
                  }
                ]
              }
            }
          },
          // this section is generated for the chain if not given by user 
          "execution": { 
            "id": "A1b2c", 
            "rankers": [
              {
                "name": "kendra",
                "properties": {
                  "title_fields": [
                    "title"
                  ],
                  "body_fields": [
                    "published",
                    "description"
                  ]
                }
              }
            ],
            "rewriters": [
              {
                "name": "querqy",
                "properties": {
                  "query": {
                    "querqy": {
                      "matching_query": {
                        "query": "notebook"
                      },
                      "query_fields": [
                        "title^3.0",
                        "brand^2.1",
                        "shortSummary"
                      ]
                    }
                  }
                }
              }
            ]
          }
        }
      }
    }

    Step 3: First rewriter (Querqy) is dispatched and generates the new query (query rewrite)

    {
      "query": {
        //todo - put here the query after being re-written by querqy    
      },
      "ext": {
        "context": {
          "params": {
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "topic": "hobby"
                    }
                  }
                ],
                "filter": [
                  {
                    "range": {
                      "dateField": {
                        "gte": "now-12d",
                        "lte": "now-10d"
                      }
                    }
                  }
                ]
              }
            }
          },
          "execution": {
            "id": "A1b2c",
            "rankers": [
              {
                "name": "kendra",
                "properties": {
                  "title_fields": [
                    "title"
                  ],
                  "body_fields": [
                    "published",
                    "description"
                  ]
                }
              }
            ],
            "rewriters": [
              {
                "name": "querqy",
                "properties": {
                  "query": {
                    "querqy": {
                      "matching_query": {
                        "query": "notebook"
                      },
                      "query_fields": [
                        "title^3.0",
                        "brand^2.1",
                        "shortSummary"
                      ]
                    }
                  },
                  "info" : { } // additional info that querqy may add after query rewrite
                }
              }
            ]
          }
        }
      }
    }

    Step 3.1: The chain flow control has no additional rewriters to dispatch, so it dispatches to the rankers. The first ranker in the chain reviews the context params and takes the necessary information.

    After it completes its action, the results are ranked according to its internal reasoning:

    {
      "query": {
        //todo - put here the query after being re-written by querqy    
      },
      "ext": {
        "context": {
          "params": {
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "topic": "hobby"
                    }
                  }
                ],
                "filter": [
                  {
                    "range": {
                      "dateField": {
                        "gte": "now-12d",
                        "lte": "now-10d"
                      }
                    }
                  }
                ]
              }
            }
          },
          "execution": {
            "id": "A1b2c",
            "rankers": [
              {
                "name": "kendra",
                "properties": {
                  "title_fields": [
                    "title"
                  ],
                  "body_fields": [
                    "published",
                    "description"
                  ]
                }
              }
            ],
            "rewriters": [
              {
                "name": "querqy",
                "properties": {
                  "query": {
                    "querqy": {
                      "matching_query": {
                        "query": "notebook"
                      },
                      "query_fields": [
                        "title^3.0",
                        "brand^2.1",
                        "shortSummary"
                      ]
                    }
                  },
                  "info" : { } 
                }
              }
            ]
          }
        }
      }
    }

    Response Stage


    Step 4: Reranking happens after the rewrite chain is completed, returning the results to the original calling service

    ranker search results json

    {
      "took" : 0,
      "timed_out" : false,
       "ext": {  // this ext section is suggested to be added here as part of the results.
         "context": {
           "params": {
             "query": {
               "bool": {
                 "must": [
                   {
                     "match": {
                       "topic": "hobby"
                     }
                   }
                 ],
                 "filter": [
                   {
                     "range": {
                       "dateField": {
                         "gte": "now-12d",
                         "lte": "now-10d"
                       }
                     }
                   }
                 ]
               }
             }
           },
           "execution": {
             "id": "A1b2c",
             "rankers": [
               {
                 "name": "kendra",
                 "properties": {
                   "title_fields": [
                     "title"
                   ],
                   "body_fields": [
                     "published",
                     "description"
                   ]
                 }
               }
             ],
             "rewriters": [
               {
                 "name": "querqy",
                 "properties": {
                   "query": {
                     "querqy": {
                       "matching_query": {
                         "query": "notebook"
                       },
                       "query_fields": [
                         "title^3.0",
                         "brand^2.1",
                         "shortSummary"
                       ]
                     }
                   },
                   "info" : { }
                 }
               }
             ]
           }
         }
       },
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 1.8773359,
        "hits" : [
          {
            "_index" : "employees",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 1.8773359,
            "_source" : {
              "id" : 4,
              "name" : "Alan Thomas",
              "email" : "[email protected]",
              "gender" : "male",
              "ip_address" : "200.47.210.95",
              "date_of_birth" : "11/12/1985",
              "company" : "Yamaha",
              "position" : "Resources Manager",
              "experience" : 12,
              "country" : "China",
              "phrase" : "Emulation of roots heuristic coherent systems",
              "salary" : 300000
            }
          }
        ]
      }
    }

    The response DSL doesn't currently contain such an ext part; this RFC suggests adding such a section to the results.

    [RFC] Search pipelines


    This RFC is intended to replace #12.

    Overview

    We are proposing a set of new APIs to manage composable processors to transform search requests and search responses in OpenSearch. Expected transformers include (but are not limited to):

    • Reranking final search results using an external ranking service (which would be impractical to apply when collecting results per-shard).
    • Modifying a search request by calling a query understanding service.
    • Excluding search results based on some externally-configured filtering logic (without needing to modify and deploy changes to the search application).

    The new APIs will aim to mirror the ingest APIs, which are responsible for transforming documents before they are indexed, to ensure that all documents going into the index are processed in a consistent way. The ingest API makes use of pipelines of processors. We will do the same, but for search.

    Argument over alternatives

    Everyone should just implement logic in their calling application

    The most obvious counterargument to this proposal is “this logic belongs in the search application that calls OpenSearch”. That is a valid approach and this proposal does not prevent any developer from transforming search requests and responses in their application.

    We believe that providing an API within OpenSearch will make it easier for developers to build and share components that perform common transformations, reducing duplicated effort in the calling search applications.

    Put this logic in a library that people can use from their calling applications

    In theory, we could provide a common “toolbox” of request and response processors as a library that application developers could use. That would mean building libraries for specific languages/runtimes. By including search processors in OpenSearch itself, any calling application (regardless of implementation) can benefit. In particular, it is possible to modify query processing behavior without modifying the application (by specifying a default search pipeline for the target index(es)).

    Write search plugins

    Search plugins can significantly impact how search requests are processed, both on the coordinator node and on individual shards. Each processor we can think of could be implemented as a search plugin that runs on the coordinator node. The challenges with that approach are a) writing a whole search plugin complete with parameter parsing is pretty complicated, b) the order in which search plugins run is not immediately obvious to a user, and c) without some overarching framework providing guidelines, every search plugin may have its own style of taking parameters (especially with regards to default behavior).

    Similarities with ingest pipelines

    A built-in orchestrator can call out to processors defined in plugins

    Ingest pipelines have a core orchestrator responsible for calling out to each ingest processor in the pipeline, but the processors themselves may be defined in separate ingest plugins. These plugins can implement specific transformations without needing to consider the broader pipeline execution. Similarly, search pipelines will run from the OpenSearch core, but may call out to named search processors registered via plugins.

    Processed on entry (or exit)

    Just as ingest pipelines operate before documents get routed to shards, the search pipelines operate “on top of” the index when processing a search request. That is, a SearchRequest gets transformed on the coordinator node before being sent to individual shards, and the SearchResponse gets transformed on the coordinator node after being aggregated from the shard responses.

    Processing that happens on each shard is out of scope for this proposal. The SearchPlugin API remains the appropriate extension point for per-shard processing.

    Pipelines are named entities stored in the cluster

    To use an ingest pipeline, you will generally PUT to create or update the pipeline definition using a REST API. The body of that request defines the pipeline with a description and a list of ingest processors. We will provide a similar API to define named search pipelines built from search processors.

    Can be referenced per-request or per-index

    When using the index document API or the bulk API, you can include a request parameter like ?pipeline=my-pipeline to indicate that the given request should be processed by a specific pipeline. Similarly, we will add a pipeline parameter to the search API and the multi-search API.

    Generally, we want to apply the same pipeline to every document being added to an index. To simplify that, the index API has a setting, index.default_pipeline, that designates a pipeline to use if none is specified in an index document or bulk request. Similarly, we will add a setting, index.default_search_pipeline, to apply a pipeline by default for all search or multi-search requests against the given index.

    Differences from ingest pipelines

    Processing different things in different places

    While an ingest processor only ever operates on a document, potentially modifying it, a search processor may operate on a search request, a search response, or both. We also assume that processing a search response requires information from the search request.

    To support these different cases, we will provide different interfaces for search request processors, search response processors, and request + response (“bracket”) processors. The search pipeline definition will have separate sections for request and response processors. (Bracket processors must be specified in the request processor list, but may be referenced by ID in the response processor list to explicitly order them relative to response processors.)

    The name “bracket processor” is chosen to indicate that they process things on the way in and on the way out, and must be balanced like brackets or parentheses. That is, given two bracket processors B1 and B2, we require that if B1 processes a search request before B2, then B1 processes the search response after B2.

    Pipelines can be specified inline “for real”

    The ingest API includes a “_simulate” endpoint that you can use to preview the behavior of a named pipeline or a pipeline definition included in the request body (before creating a named pipeline). This makes sense, since we wouldn't want to pollute the index with documents processed by a half-baked, untested pipeline.

    Since search requests are read-only, we don’t need a separate API to test an ad hoc search pipeline definition. Instead, we will allow anonymous search pipelines to be defined inline as part of any search or multi-search request. In practice, we don’t expect this approach to be common in production scenarios, but it’s useful for ad hoc testing when creating / modifying a search pipeline.

    API definition

    Java search processor interfaces

    package org.opensearch.search.pipeline;
    
    // Copied from [org.opensearch.ingest.Processor](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/ingest/Processor.java)
    
    interface Processor {
      /**
       * Gets the type of a processor
       */
      String getType();
    
      /**
       * Gets the tag of a processor.
       */
      String getTag();
    
      /**
       * Gets the description of a processor.
       */
      String getDescription();
    }
    
    /**
     * Processor that (potentially) modifies SearchRequests.
     */
    interface RequestProcessor extends Processor {
      SearchRequest execute(SearchRequest request);
    }
    
    /**
     * Processor that (potentially) modifies SearchResponses. Behavior may be
     * influenced by parameters from the SearchRequest.
     */
    interface ResponseProcessor extends Processor {
      SearchResponse execute(SearchRequest request, SearchResponse response);
    }
    
    /**
     * Processor that may modify the request, response, both, or neither.
     */
    interface BracketProcessor extends RequestProcessor, ResponseProcessor {
    
      /**
       * May be specified in the request pipeline and referenced in the response
       * pipeline to determine the order of response processing. 
       */
      String getId();
    }
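
    As a concrete (and purely hypothetical) illustration of these interfaces, a trivial request processor might cap oversized pages. The processor name and behavior are invented for this sketch:

    package org.opensearch.search.pipeline;

    import org.opensearch.action.search.SearchRequest;

    /** Hypothetical processor that caps the requested page size. */
    class MaxSizeRequestProcessor implements RequestProcessor {
      private final int maxSize;

      MaxSizeRequestProcessor(int maxSize) {
        this.maxSize = maxSize;
      }

      @Override
      public String getType() {
        return "max_size";
      }

      @Override
      public String getTag() {
        return null;
      }

      @Override
      public String getDescription() {
        return "Caps the request size at " + maxSize;
      }

      @Override
      public SearchRequest execute(SearchRequest request) {
        if (request.source() != null && request.source().size() > maxSize) {
          request.source().size(maxSize); // clamp the oversized page in place
        }
        return request;
      }
    }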

    REST APIs

    Search pipeline CRUD

    // Create/update a search pipeline.
    PUT /_search_processing/pipeline/my_pipeline
    {
      "description": "A pipeline to apply custom synonyms, result post-filtering, an ML ranking model",
      "request_processors" : [
        {
          "external_synonyms" : {
            "service_url" : "https://my-synonym-service/"
          }
        },
        {
          "ml_ranker_bracket" : {
            "result_oversampling" : 2, // Request 2 * size results.
            "model_id" : "doc-features-20230109",
            "id" : "ml_ranker_identifier"
          }
        }
      ],
      "response_processors" : [
        {
          "result_blocker" : {
            "service_url" : "https://result-blocklist-service/"
          }
        },
        {
          "ml_ranker_bracket" : {
            // Placed here to indicate that it should run after result_blocker.
            // If not part of response_processors, it will run before result_blocker.
            "id" : "ml_ranker_identifier"
          }
        }
      ]
    }
    
    // Return identifiers for all search pipelines.
    GET /_search_processing/pipeline
    
    // Return a single search pipeline definition.
    GET /_search_processing/pipeline/my_pipeline
    
    // Delete a search pipeline.
    DELETE /_search_processing/pipeline/my_pipeline
    
    

    Search API changes

    // Apply a search pipeline to a search request.
    POST /my-index/_search?pipeline=my_pipeline
    {
      "query" : {
        "match" : {
          "text_field" : "some search text"
        }
      }
    }
    
    // Specify an ad hoc search pipeline as part of a search request.
    POST /my-index/_search
    
    {
      "query" : {
        "match" : {
          "text_field" : "some search text"
        }
      },
      "pipeline" : {
        "request_processors" : [
          {
            "external_synonyms" : {
              "service_url" : "https://my-synonym-service/"
            }
          },
          {
            "ml_ranker_bracket" : {
              "result_oversampling" : 2, // Request 2 * size results
              "model_id" : "doc-features-20230109",
              "id" : "ml_ranker_identifier"
            }
          }
        ],
        "response_processors" : [
          {
            "result_blocker" : {
              "service_url" : "https://result-blocklist-service/"
            }
          },
          {
            "ml_ranker_bracket" : {
              // Placed here to indicate that it should run after result_blocker.
              // If not part of response_processors, it will run before result_blocker.
              "id" : "ml_ranker_identifier"
            }
          }
        ]
      }
    }
    
    

    Index settings

    // Set default search pipeline for an existing index.
    PUT /my-index/_settings
    {
      "index" : {
        "default_search_pipeline" : "my_pipeline"
      }
    }
    
    // Remove default search pipeline for an index.
    PUT /my-index/_settings
    {
      "index" : {
        "default_search_pipeline" : "_none"
      }
    }
    
    // Create a new index with a default search pipeline.
    PUT my-index
    {
      "mappings" : {
        // ...index mappings...
      },
      "settings" : {
        "index" : {
          "default_search_pipeline" : "my_pipeline",
          // ... other settings ...
        }
      }
    }
    

    Proposed integrations

    Kendra ranking

    Our first implementation (already in the search-processor repository) provides connectivity to the Amazon Kendra Intelligent Ranking service. This will need to be reworked to match the BracketProcessor interface, because it modifies the SearchRequest as well as the SearchResponse. The processor modifies the SearchRequest to a) request the top 25 search hits (if start is less than 25), and b) request document source (to ensure that the body and title fields for reranking are available). The top 25 results in the SearchResponse are preprocessed (to extract text passages) and sent to the Amazon Kendra Intelligent Ranking service, which returns a (potentially) reordered list of document IDs, which is used to rerank the top 25 results. The originally-requested range of results (by start and size) is returned.
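
    A hedged sketch of how that rework might look against the proposed BracketProcessor interface; the client type and its rerank call are stand-ins for this sketch, not a real API:

    import org.opensearch.action.search.SearchRequest;
    import org.opensearch.action.search.SearchResponse;
    import org.opensearch.search.builder.SearchSourceBuilder;

    /** Hypothetical sketch of the Kendra Intelligent Ranking processor. */
    class KendraRankingProcessor implements BracketProcessor {
      private static final int RERANK_DEPTH = 25;
      private final KendraRankingClient client; // stand-in for the real service client

      KendraRankingProcessor(KendraRankingClient client) {
        this.client = client;
      }

      @Override
      public String getType() { return "kendra_ranking"; }

      @Override
      public String getTag() { return null; }

      @Override
      public String getDescription() { return "Reranks top hits with Amazon Kendra Intelligent Ranking"; }

      @Override
      public String getId() { return "kendra_ranking_default"; }

      @Override
      public SearchRequest execute(SearchRequest request) {
        SearchSourceBuilder source = request.source();
        if (source != null && source.from() < RERANK_DEPTH) {
          source.size(Math.max(source.size(), RERANK_DEPTH)); // fetch enough hits to rerank
          source.fetchSource(true); // body/title fields must be present for reranking
        }
        return request;
      }

      @Override
      public SearchResponse execute(SearchRequest request, SearchResponse response) {
        // Extract passages from the top hits, call the ranking service, rebuild the
        // response in the returned order, then slice back to the originally
        // requested from/size range (all elided behind the stand-in client here).
        return client.rerankTopHits(request, response, RERANK_DEPTH);
      }
    }

    /** Stand-in interface for this sketch. */
    interface KendraRankingClient {
      SearchResponse rerankTopHits(SearchRequest request, SearchResponse response, int depth);
    }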

    Metarank

    To provide search results that learn from user interaction, we could implement a ResponseProcessor that calls out to Metarank.

    Note that we would need to make sure that the SearchRequest API has the ability (via the ext property?) to carry additional metadata about the request, like user and session identifiers.

    Querqy

    Search pipelines could be a convenient interface to integrate with Querqy.

    Individual Querqy rewriters could be wrapped in adapters that implement the RequestProcessor interface and added to a search pipeline.

    Script processor

    Ingest pipelines support processing documents with scripts. We could provide a similar capability to allow users to modify their search request or response with a Painless or Mustache script.
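
    For instance, a script processor might be configured like this (purely hypothetical syntax; the processor name and ctx fields are invented here):

    PUT /_search_processing/pipeline/scripted_pipeline
    {
      "description" : "Enforce a minimum page size with an inline script",
      "request_processors" : [
        {
          "script" : {
            "lang" : "painless",
            "source" : "if (ctx.size < 10) { ctx.size = 10; }"
          }
        }
      ]
    }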

    Block expensive query types

    About 10 years ago, I worked on a search hosting service (based on Apache Solr) where we added a SearchComponent to our SearchHandler that would reject potentially expensive queries (e.g. leading wildcards, regex) by default. We would lift the restrictions by request and only after discussing the risks (and usually we could explain why another option would be better). A similar RequestProcessor, installed as part of a default search pipeline for an index, could save an OpenSearch admin from the impact of users accidentally sending expensive queries.
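
    A hedged sketch of such a guard against the proposed RequestProcessor interface follows; the checks are crude string matches for illustration, while a real version would walk the parsed query tree:

    import org.opensearch.action.search.SearchRequest;

    /** Hypothetical processor that rejects potentially expensive query types. */
    class ExpensiveQueryBlocker implements RequestProcessor {
      @Override
      public String getType() { return "expensive_query_blocker"; }

      @Override
      public String getTag() { return null; }

      @Override
      public String getDescription() { return "Rejects regex and wildcard queries by default"; }

      @Override
      public SearchRequest execute(SearchRequest request) {
        String body = request.source() == null ? "" : request.source().toString();
        // Crude illustration only: inspect the serialized query for risky clauses.
        if (body.contains("\"regexp\"") || body.contains("\"wildcard\"")) {
          throw new IllegalArgumentException("Potentially expensive query rejected by search pipeline");
        }
        return request;
      }
    }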

    Proposed roadmap

    Initial release (“soon”)

    Based on feedback to this RFC, we intend to refactor the search-processor plugin to be similar to the APIs described above (with the assumption that there will be some changes required when imagination collides with reality). We should (hopefully?) be able to do this in time for the 2.6 release.

    At this point, the REST APIs would still be considered “experimental” and we may break backwards compatibility (though we would like to avoid that if possible). The Java APIs may still be subject to change.

    We would include additional processors in this repository.

    Move to core

    After getting some feedback from users of the plugin, we will move the pipeline execution logic into OpenSearch core, with individual processor implementations either in a “common” module (similar to ingest-common) or in separate plugins. Ideally, the OpenSearch SDK for Java will make it possible to implement search processors as extensions.

    Search configurations

    We’re thinking about using search pipelines as an initial model of “search configurations”, where the pipeline definition captures enough information about how a search request is processed from end-to-end to provide a reproducible configuration.

    We can make it easier for application builders to run experiments, both offline and online, by running queries through one pipeline or another. For A/B testing, you could define a default search pipeline that randomly selects a search pipeline to process each request, and then link user behavior to the pipeline used.

    More complicated search processing

    Just as ingest pipelines support conditional execution of processors and nested pipelines, we could add similar capabilities to search processors to effectively turn the pipeline into a directed acyclic graph. If that becomes the norm, we would likely want a visualization tool to view and edit a search pipeline (since nested JSON would be hard for a human to understand).

    In the “middle” of the graph, there’s a component to call into OpenSearch to turn a SearchRequest into a SearchResponse. What if we want to use something other than OpenSearch, though? For example, we could precompute result sets for known one-word queries offline and do a lookup to return those results online.

    Task list

    [BUG] Renaming Repo Broke codecov

    What is the bug?
    Renaming our repository from search-relevance to search-processor broke the link to codecov.io.

    How can one reproduce the bug?
    Steps to reproduce the behavior:

    1. Go to README.md
    2. See codecov badge (codecov is unknown)
    3. Click on codecov badge
    4. 404

    What is the expected behavior?
    Code Coverage should show up on the badge

    Support configuration via cluster settings

    Right now, we only read various settings (e.g. Kendra ranking endpoint) from opensearch.yml.

    To update these values, we need to restart every node in the cluster for the settings change to take effect.

    It would be more convenient if we copied these to cluster settings so they could be updated on a running cluster.
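
    A minimal sketch of what that could look like using OpenSearch's dynamic setting support; the setting key here is invented for illustration:

    import org.opensearch.cluster.service.ClusterService;
    import org.opensearch.common.settings.Setting;

    public class KendraEndpointSettings {
      // Dynamic: updatable through the cluster settings API without a node restart.
      public static final Setting<String> KENDRA_ENDPOINT = Setting.simpleString(
          "plugins.search_processor.kendra.endpoint",
          Setting.Property.NodeScope,
          Setting.Property.Dynamic);

      private volatile String endpoint;

      public KendraEndpointSettings(ClusterService clusterService) {
        this.endpoint = KENDRA_ENDPOINT.get(clusterService.getSettings());
        // Re-read the value whenever the cluster setting changes on a live cluster.
        clusterService.getClusterSettings()
            .addSettingsUpdateConsumer(KENDRA_ENDPOINT, value -> this.endpoint = value);
      }

      public String endpoint() {
        return endpoint;
      }
    }

    The plugin would also need to register the setting from its Plugin.getSettings() override so that updates via PUT _cluster/settings are accepted.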

    [RFC] OpenSearch Simple Schema : Search

    Is your feature request related to a problem?

    Why Use a Schema?

    A schema is a means to address the problems of managing and handling unstructured or loosely structured data. In OpenSearch we need to address the following concerns regarding the stored data:

    Integrity - when data is semi-structured or weakly structured, it is hard or even impossible to control the state and validity of the data.

    Accessibility and retrieval - without high-level structure, we lose the ability to query the data meaningfully:

    • our data structure may be too low-level to express complex queries
    • handling the complexity of such detailed, low-level queries becomes a problem
    • it's hard to express search relevancy in a structured-domain question without an explicit schema definition

    Consequently, we might be forced to ask only simple questions, or fall back to client-side reasoning functionality.

    Maintenance - with very little control over the structure of our data, it is hard to alter that structure over time as requirements change.

    What solution would you like?

    The schema defines a specific, high-level structure of data that is enforced across the datasets, providing:

    • logical integrity guarantees
    • consistency guarantees
    • simplified maintenance that can be automated
    • built-in tools for defining relevancy inside the (schema) data domain

    A well-constructed schema enables writing intuitive queries; such queries map seamlessly to how we form them as questions in our minds.

    Using a schema sets the ground for performing automated reasoning over the represented data.

    ----- Introducing opensearch simple search draft - opensearch-project/sql#763 -----

    What alternatives have you considered?

    A clear and concise description of any alternative solutions or features you've considered.

    Do you have any additional context?

    Add any other context or screenshots about the feature request here.

    Refactor metrics and logging

    This is the follow up of #38

    From @macohen: "We do want to review the full set of metrics and logging events to capture. We'll create another issue for that purpose."

    [PROPOSAL] Per index/pattern configuration to enhance search operations

    What are you proposing?

    In a few sentences, describe the feature and its core capabilities.

    How did you come up with this proposal?

    Highlight any research, proposals, requests, issues, forum posts, anecdotes that signal this is the right thing to build. Highlight opportunities for additional research.

    What is the user experience going to be?

    Describe the feature requirements and or user stories. You may include low-fidelity sketches, wireframes, APIs stubs, or other examples of how a user would use the feature. Using a bulleted list or simple diagrams to outline features is okay. e.g. As a < type of user > , I want to < achieve a goal > so that < for some reason >.

    Why should it be built? Any reason not to?

    Describe the most important user needs, pain points, and the value that this feature will bring to the OpenSearch community, as well as what impact it has if it isn't built, or new risks if it is. What is preventing you from meeting this need today?

    What will it take to execute?

    Describe what it will take to build this feature. Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?

    What are remaining open questions?

    List questions that may need to be answered before proceeding with an implementation.

    [Feature] Reranking Plugin Release

    We're planning to include the reranking plugin in the 2.5.0 release. This tracks our progress to turn the search-relevance repo into a single plugin repo.

    opensearch-2.4.0.jar: 1 vulnerability (highest severity is: 9.8)

    Vulnerable Library - opensearch-2.4.0.jar

    Path to dependency file: /build.gradle

    Path to vulnerable library: /home/wss-scanner/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.32/e80612549feb5c9191c498de628c1aa80693cf0b/snakeyaml-1.32.jar

    Vulnerabilities

    | CVE | Severity | CVSS | Dependency | Type | Fixed in (OpenSearch version) | Remediation Available |
    | --- | --- | --- | --- | --- | --- | --- |
    | CVE-2022-1471 | High | 9.8 | snakeyaml-1.32.jar | Transitive | N/A* | |

    *For some transitive vulnerabilities, there is no version of direct dependency with a fix. Check the section "Details" below to see if there is a version of transitive dependency where vulnerability is fixed.

    Details

    CVE-2022-1471

    Vulnerable Library - snakeyaml-1.32.jar

    YAML 1.1 parser and emitter for Java

    Library home page: https://bitbucket.org/snakeyaml/snakeyaml

    Path to dependency file: /build.gradle

    Path to vulnerable library: /home/wss-scanner/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.32/e80612549feb5c9191c498de628c1aa80693cf0b/snakeyaml-1.32.jar

    Dependency Hierarchy:

    • opensearch-2.4.0.jar (Root Library)
      • opensearch-x-content-2.4.0.jar
        • snakeyaml-1.32.jar (Vulnerable Library)

    Found in base branch: main

    Vulnerability Details

    SnakeYaml's Constructor() class does not restrict types which can be instantiated during deserialization. Deserializing YAML content provided by an attacker can lead to remote code execution. We recommend using SnakeYaml's SafeConstructor when parsing untrusted content to restrict deserialization.

    Publish Date: 2022-12-01

    URL: CVE-2022-1471

    CVSS 3 Score Details (9.8)

    Base Score Metrics:

    • Exploitability Metrics:
      • Attack Vector: Network
      • Attack Complexity: Low
      • Privileges Required: None
      • User Interaction: None
      • Scope: Unchanged
    • Impact Metrics:
      • Confidentiality Impact: High
      • Integrity Impact: High
      • Availability Impact: High

    For more information on CVSS3 Scores, click here.

    Suggested Fix

    Type: Upgrade version

    Origin: https://nvd.nist.gov/vuln/detail/CVE-2022-1471

    Release Date: 2022-12-01

    Fix Resolution: org.yaml:snakeyaml - 1.31

    opensearch-2.6.0.jar: 1 vulnerability (highest severity is: 9.8)

    Vulnerable Library - opensearch-2.6.0.jar

    Path to dependency file: /build.gradle

    Path to vulnerable library: /home/wss-scanner/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.33/2cd0a87ff7df953f810c344bdf2fe3340b954c69/snakeyaml-1.33.jar

    Found in HEAD commit: 70913e2052570c6070840d669a676009f1ce8bdb

    Vulnerabilities

    | CVE | Severity | CVSS | Dependency | Type | Fixed in (OpenSearch version) | Remediation Available |
    | --- | --- | --- | --- | --- | --- | --- |
    | CVE-2022-1471 | High | 9.8 | snakeyaml-1.33.jar | Transitive | N/A* | |

    *For some transitive vulnerabilities, there is no version of direct dependency with a fix. Check the section "Details" below to see if there is a version of transitive dependency where vulnerability is fixed.

    Details

    CVE-2022-1471

    Vulnerable Library - snakeyaml-1.33.jar

    YAML 1.1 parser and emitter for Java

    Library home page: https://bitbucket.org/snakeyaml/snakeyaml

    Path to dependency file: /build.gradle

    Path to vulnerable library: /home/wss-scanner/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.33/2cd0a87ff7df953f810c344bdf2fe3340b954c69/snakeyaml-1.33.jar

    Dependency Hierarchy:

    • opensearch-2.6.0.jar (Root Library)
      • opensearch-x-content-2.6.0.jar
        • snakeyaml-1.33.jar (Vulnerable Library)

    Found in HEAD commit: 70913e2052570c6070840d669a676009f1ce8bdb

    Found in base branch: main

    Vulnerability Details

    SnakeYaml's Constructor() class does not restrict types which can be instantiated during deserialization. Deserializing YAML content provided by an attacker can lead to remote code execution. We recommend using SnakeYaml's SafeConstructor when parsing untrusted content to restrict deserialization.

    Publish Date: 2022-12-01

    URL: CVE-2022-1471

    CVSS 3 Score Details (9.8)

    Base Score Metrics:

    • Exploitability Metrics:
      • Attack Vector: Network
      • Attack Complexity: Low
      • Privileges Required: None
      • User Interaction: None
      • Scope: Unchanged
    • Impact Metrics:
      • Confidentiality Impact: High
      • Integrity Impact: High
      • Availability Impact: High

    For more information on CVSS3 Scores, click here.

    Suggested Fix

    Type: Upgrade version

    Origin: https://bitbucket.org/snakeyaml/snakeyaml/issues/561/cve-2022-1471-vulnerability-in#comment-64634374

    Release Date: 2022-12-01

    Fix Resolution: org.yaml:snakeyaml:2.0

    Release Version 2.5.0

    Preparation

    • Assign this issue to a release owner - @mingshl
    • Finalize scope and feature set and update the Public Roadmap.
    • All the tasks in this issue have been reviewed by the release owner.
    • Create, update, triage and label all features and issues targeted for this release with v2.5.0.
    • Build and publish a plugin build, then update the quickstart script to point to it.
    • Cut 2.5 branch

    Pre-Release

    • Increment the version on the parent branch to the next development iteration.
    • Gather, review and publish release notes following the rules and backport them to the release branch. git-release-notes may be used to generate release notes from your commit history.
    • Confirm that all changes for 2.5.0 have been merged.

    Release Testing

    • Sanity Testing: Sanity testing and fixing of critical issues found.
    • File issues for all intermittent test failures.

    Release

    • Verify all issues labeled for this release are closed or relabeled for the next release.
    • Cut release tag after release testing

    Post Release

    • [Optional] Conduct a retrospective and publish its results.

    Create a stats handler for search-processor

    It would be good to get statistics on which search processors have run, how many times, how many failures we have seen, and what the average overhead is for each search transformer.

    Looking at some existing plugins, it looks like there's a convention of adding a _stats API endpoint where a GET returns a structured statistics object.
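
    For illustration, such an endpoint's response might look like the following; the path and field names are hypothetical:

    GET /_search_processing/stats

    {
      "processors" : {
        "kendra_ranking" : {
          "executions" : 1523,
          "failures" : 4,
          "avg_overhead_millis" : 38.2
        },
        "external_synonyms" : {
          "executions" : 1523,
          "failures" : 0,
          "avg_overhead_millis" : 11.7
        }
      }
    }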

    [RFC] OpenSearch Remote Ranker Plugin (semantic)

    What is semantic search?

    Semantic search is a data searching technique that aims to not only find keywords in documents, but to determine the intent and contextual meaning of the words a person is using for search. Essentially, semantic search is search with meaning and can provide higher quality search results.

    What is the OpenSearch Semantic Ranker?

    The OpenSearch Semantic Ranker is a plugin that will re-rank search results at search time by calling an external service with semantic search capabilities for improved accuracy and relevance. This plugin will make it easier for OpenSearch users to quickly and easily connect with a service of their choice to improve search results in their applications.

    How will the plugin work?

    The plugin will modify the OpenSearch query flow and do the following:

    1. Get top N* document results from the OpenSearch index.
    2. Preprocess document results and prepare them to be sent to an external “re-ranking” service.
    3. Call the external service that uses semantic search models to re-rank the results.

    *N will be based on requirements of the external service and customizable by the user.

    OpenSearchSemanticRanker diagram

    How will users use the plugin?

    We are considering two options for using the plugin. The first option is having the plugin configured at the OpenSearch index level, meaning users will be able to enable/disable semantic re-ranking for each index. After the Semantic Ranker plugin is enabled on an index, all queries to that index will go through the plugin and have their results re-ranked. There will be no change to the query syntax in this option.

    The second option is having the plugin configured at the query level, meaning users can enable/disable semantic re-ranking per query. This option allows for more flexibility, as users will be able to selectively choose which queries to apply semantic re-ranking intelligence to, but it will require updating the query syntax.

    Example usages for both options will be provided below.
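
    As a purely hypothetical illustration of the two options (the setting name and query syntax are invented here):

    // Option 1: enable re-ranking for a whole index via an index setting.
    PUT my_index/_settings
    {
      "index.plugins.semantic_ranker.enabled" : true
    }

    // Option 2: opt in per query via the ext section.
    POST my_index/_search
    {
      "query" : { "match" : { "article_content" : "notebook" } },
      "ext" : { "semantic_ranker" : { "enabled" : true } }
    }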

    What configuration will the plugin have?

    Field Configuration

    Since data in a user’s OpenSearch index is mostly unstructured, the plugin will need to know which fields in the user’s OpenSearch documents map to specific fields of a “document”. Here is a breakdown of the fields that the plugin will use:

    • body: the main body of text for the document. This is a required field and the main text the external service will search on and apply the semantic re-ranking intelligence to. In the plugin configuration, the user will provide a list of OpenSearch field names to map to the body. The list must have at least one field, and the fields in the list should be in order of importance. The plugin will concatenate the values for each field into the body text before applying the preprocessing logic.
    • title: the title content for the document. This is an optional field and can be provided if supported by the external service. As with the body field, the user can provide a list of OpenSearch field names to map to the title.

    The following fields are less important and also optional, but may improve the relevance of the results if the external service supports them and the user has the right inputs for them. These fields may or may not be supported in the first version of the plugin.

    • view_count: numeric field for the document view count
    • creation_date: date field for the document creation date time
    • modification_date: date field for the document latest modification date time

    In the plugin configuration, the user will provide OpenSearch field names to map to these fields.

    Here is an example: let’s say a user has the following document structure in their OpenSearch index:

    {
      "country": ...,
      "article_description": ...,
      "article_content": ...,
      "city": ...,
      "author": ...,
      "article_title": ...
      // "article_content2": ...,
    }
    

    In this example, the user may want to configure [“article_content”] as the body field and [“article_title”] as the title field in the plugin.

    As mentioned above, the configurations for body and title will be lists of OpenSearch field names in order of importance. The reason is that there may be use cases in which documents have multiple body/title fields, and/or use cases in which documents in the same index have different body/title fields.

    Using the same example as above, let’s say there is another field called article_content2. Then, the user may want to configure [“article_content”, “article_content2”] as the body fields.
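
    As a minimal sketch of that concatenation step (assuming the plugin reads each hit’s source as a map of field names to values):

    import java.util.List;
    import java.util.Map;
    import java.util.StringJoiner;

    public final class BodyFieldConcatenator {
        // Joins the configured fields in their order of importance,
        // skipping fields a given document does not have.
        public static String concatenate(Map<String, Object> source, List<String> bodyFields) {
            StringJoiner joined = new StringJoiner(" ");
            for (String field : bodyFields) {
                Object value = source.get(field);
                if (value != null) {
                    joined.add(value.toString());
                }
            }
            return joined.toString();
        }
    }

    With the example above, concatenate(source, List.of("article_content", "article_content2")) would produce the body text that feeds into the preprocessing logic.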

    External Service Configuration

    The plugin will also require configuration to connect with the external service.
    Non-sensitive inputs such as the endpoint and retry count will be provided in the opensearch.yml config file. For example:

    plugins.semantic_ranker.external_service.client.endpoint: myendpoint.com
    plugins.semantic_ranker.external_service.client.max_retries: 3 
    # other configs
    

    Credentials to connect to the service will be stored in the OpenSearch keystore. Users will be able to provide the username/password or the access/secret keys for the service.

    sudo ./bin/opensearch-keystore add plugins.semantic_ranker.external_service.username
    sudo ./bin/opensearch-keystore add plugins.semantic_ranker.external_service.password
    
    sudo ./bin/opensearch-keystore add plugins.semantic_ranker.external_service.access_key
    sudo ./bin/opensearch-keystore add plugins.semantic_ranker.external_service.secret_key
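
    For illustration, here is a hedged sketch of how these keystore entries might be exposed to the plugin as secure settings, using the SecureSetting API from the OpenSearch 2.x codebase (the holder class is hypothetical):

    import java.util.List;
    import org.opensearch.common.settings.SecureSetting;
    import org.opensearch.common.settings.SecureString;
    import org.opensearch.common.settings.Setting;

    public final class SemanticRankerSecureSettings {
        public static final Setting<SecureString> USERNAME =
            SecureSetting.secureString("plugins.semantic_ranker.external_service.username", null);
        public static final Setting<SecureString> PASSWORD =
            SecureSetting.secureString("plugins.semantic_ranker.external_service.password", null);

        // Returned from the plugin's getSettings() so OpenSearch reads the
        // values from the keystore at node startup.
        public static List<Setting<?>> all() {
            return List.of(USERNAME, PASSWORD);
        }
    }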
    

    How will the plugin modify the query response?

    The plugin will re-score and re-rank the query results from OpenSearch, but there should also be a way for users to compare results before/after applying the plugin.

    Users can execute queries with/without the plugin enabled themselves and compare the results. If the plugin is configured at the index level, the user can enable/disable the plugin in the index settings and test queries. If the plugin is configured at the query level, the user can choose to enable the plugin by providing the necessary config in the query syntax.

    Another option is to provide both the original “un-re-ranked” results and the re-ranked results in the query response. The advantage of this is that users can compare the results more easily without executing two separate queries, but it will increase the size of the response payload. In this option, re-ranked results will go under “hits” and the original results will go under a new field in the response. The reason for this is to allow quick and easy usage of the plugin without forcing users to make application code changes to point to a new field in the response.

    Example Usage

    Note: the following are examples. Actual endpoints/syntax may change on the release of the plugin.

    Option 1 (Index level configuration):

    // Create a new index
    PUT sample-index
    
    // Index some documents
    POST sample-index/_doc/1
    {
      "my_title": "My first document title",
      "my_body": "My first document body"
    }
    
    POST sample-index/_doc/2
    {
      "my_title": "My second document title",
      "my_body": "My second document body"
    }
    
    // Sample query to search for "document".
    GET sample-index/_search
    {
      "query": {
        "simple_query_string" : {
            "query": "document"
        }
      }
    }
    
    // Enable the Semantic Ranker plugin and provide the body and title fields.
    PUT sample-index/_settings
    {
      "semantic_ranker" : {
        "enabled": true,
        "title_fields": ["my_title"],
        "body_fields": ["my_body"]
      }
    }
    
    // Query again, but this time the query will go through the Semantic Ranker plugin for re-ranking.
    // Take note that the query syntax remains the same.
    GET sample-index/_search
    {
      "query": {
        "simple_query_string" : {
            "query": "document"
        }
      }
    }
    
    // Disable the Semantic Ranker plugin.
    PUT sample-index/_settings
    {
      "semantic_ranker" : {
        "enabled": false
      }
    }
    

    Option 2 (Query level configuration):

    // Create a new index
    PUT sample-index
    
    // Index some documents
    POST sample-index/_doc/1
    {
      "my_title": "My first document title",
      "my_body": "My first document body"
    }
    
    POST sample-index/_doc/2
    {
      "my_title": "My second document title",
      "my_body": "My second document body"
    }
    
    // Query to search for "document", with Semantic Ranker plugin enabled.
    GET sample-index/_search
    {
      "query": {
        "simple_query_string" : {
            "query": "document"
        }
      },
      "semantic_ranker" : {
        "title_fields": ["my_title"],
        "body_fields": ["my_body"]
      }
    }
    

    Open Questions

    1. Should the plugin be configured at the index level or the query level? Should we support both?
    2. Is there a need to provide the original “un-re-ranked” results in the query response?
    3. As mentioned above, the plugin will “pre-process” documents before sending them to the external service. What preprocessing techniques should the plugin support? For example, one technique would be to split each document into passages (ordered lists of tokens) and take the top 3 passages using BM25 (see the sketch after this list).
    4. As mentioned above, the plugin will concatenate “body” field values together if multiple body fields are provided. Should the plugin support other techniques for combining the “body” values? For example, the plugin could take the first N characters of each body field. This could be helpful if the preprocessing technique favors text at the beginning of the document.
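
    To make question 3 concrete, here is a minimal sketch of passage splitting, assuming a “passage” is simply a fixed-size window of whitespace-separated tokens; the BM25 scoring that would select the top 3 passages is intentionally left out:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public final class PassageSplitter {
        // Splits a document body into passages of up to passageSize tokens.
        public static List<String> split(String body, int passageSize) {
            if (body == null || body.isBlank()) {
                return List.of();
            }
            String[] tokens = body.trim().split("\\s+");
            List<String> passages = new ArrayList<>();
            for (int i = 0; i < tokens.length; i += passageSize) {
                int end = Math.min(i + passageSize, tokens.length);
                passages.add(String.join(" ", Arrays.copyOfRange(tokens, i, end)));
            }
            return passages;
        }
    }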

    Baseline MAINTAINERS, CODEOWNERS, and external collaborator permissions

    Follow opensearch-project/.github#125 to baseline MAINTAINERS, CODEOWNERS, and external collaborator permissions.

    Close this issue when:

    1. MAINTAINERS.md has the correct list of project maintainers.
    2. CODEOWNERS exists and has the correct list of aliases.
    3. Repo permissions only contain individual aliases as collaborators with maintain rights, admin, and triage teams.
    4. All other teams are removed from repo permissions.

    If this repo's permissions were already baselined, please confirm the above when closing this issue.

    Patch file for plugin execution

    We need to verify whether this patch file is required for the plugin to run.

    diff --git a/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/client/KendraHttpClient.java b/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/client/KendraHttpClient.java
    index d30fb27..18880f0 100644
    --- a/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/client/KendraHttpClient.java
    +++ b/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/client/KendraHttpClient.java
    @@ -12,6 +12,7 @@ import com.amazonaws.ClientConfiguration;
     import com.amazonaws.DefaultRequest;
     import com.amazonaws.Request;
     import com.amazonaws.Response;
    +import com.amazonaws.SDKGlobalConfiguration;
     import com.amazonaws.auth.AWS4Signer;
     import com.amazonaws.auth.AWSCredentials;
     import com.amazonaws.auth.AWSCredentialsProvider;
    @@ -34,6 +35,7 @@ import java.nio.charset.StandardCharsets;
     import java.security.AccessController;
     import java.security.PrivilegedAction;
     
    +import org.opensearch.common.SuppressForbidden;
     import org.opensearch.search.relevance.transformer.kendraintelligentranking.model.dto.RescoreRequest;
     import org.opensearch.search.relevance.transformer.kendraintelligentranking.model.dto.RescoreResult;
     
    @@ -52,7 +54,10 @@ public class KendraHttpClient {
       private final String executionPlanId;
       private final ObjectMapper objectMapper;
     
    +  @SuppressForbidden(reason = "Need to override property")
       public KendraHttpClient(KendraClientSettings clientSettings) {
    +    AccessController.doPrivileged((PrivilegedAction<String>) () -> System.setProperty(
    +        SDKGlobalConfiguration.DISABLE_CERT_CHECKING_SYSTEM_PROPERTY, "true"));
         amazonHttpClient = AccessController.doPrivileged((PrivilegedAction<AmazonHttpClient>) () -> new AmazonHttpClient(new ClientConfiguration()));
         errorHandler = new SimpleAwsErrorHandler();
         responseHandler = new SimpleResponseHandler();
    diff --git a/src/main/plugin-metadata/plugin-security.policy b/src/main/plugin-metadata/plugin-security.policy
    index b16dfe0..66461fe 100644
    --- a/src/main/plugin-metadata/plugin-security.policy
    +++ b/src/main/plugin-metadata/plugin-security.policy
    @@ -12,4 +12,5 @@ grant {
       
       permission java.net.SocketPermission "*", "connect,resolve";
       permission java.lang.RuntimePermission "getClassLoader";
    +  permission java.util.PropertyPermission "*", "read,write";
     };
    
    

    [META] Update github workflow - autoclosed

    Update github workflow to resolve several issues:

    • 1. Action failed in backport PR due to static DCO check - remove the extra dco.yml
    • 2. PR_stats failed due to some issue with the GitHub GraphQL API, which may be related to the permission setup
    • 3. PR Template includes link to knn plugin

    We need to resolve 2 first, since 1 and 3 depend on it.

    [PROPOSAL] Add a mechanism and chaining to enhance search requests and responses from OpenSearch

    What are you proposing?

    In a few sentences, describe the feature and its core capabilities.

    How did you come up with this proposal?

    Highlight any research, proposals, requests, issues, forum posts, anecdotes that signal this is the right thing to build. Highlight opportunities for additional research.

    What is the user experience going to be?

    Describe the feature requirements and or user stories. You may include low-fidelity sketches, wireframes, APIs stubs, or other examples of how a user would use the feature. Using a bulleted list or simple diagrams to outline features is okay. e.g. As a < type of user > , I want to < achieve a goal > so that < for some reason >.

    Why should it be built? Any reason not to?

    Describe the most important user needs, pain points, and the value that this feature will bring to the OpenSearch community, as well as what impact it has if it isn't built, or new risks if it is. What is preventing you from meeting this need today?

    What will it take to execute?

    Describe what it will take to build this feature. Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?

    What are remaining open questions?

    List questions that may need to be answered before proceeding with an implementation.

    [BUG] flowwer-dev/pull-request-stats: Resource not accessible by integration

    What is the bug?
    I have an issue where I’m trying to integrate a GitHub action (https://github.com/flowwer-dev/pull-request-stats) that should add a comment with PR stats to every PR. I’ve been trying to make it work with PRs opened against main, but it was failing due to permissions (https://github.com/opensearch-project/search-processor/actions/runs/3766987233/jobs/6404075146). I also created two branches in the repo and made a PR from one to the other. That one succeeded: (#62). It’s strange to me that a PR to main would have different permissions than a PR to a different branch, and I’m not sure where to go from there.

    How can one reproduce the bug?
    Steps to reproduce the behavior:

    1. Enable the Pull Request Stats workflow (https://github.com/opensearch-project/search-processor/actions/workflows/pr_stats.yml)
    2. Create a PR to merge to main
    3. Watch the check fail like in: https://github.com/opensearch-project/search-processor/actions/runs/3766987233

    What is the expected behavior?
    A comment should be added to the PR like #62 (comment)

    What is your host/environment?
    N/A

    Do you have any screenshots?
    No, but the logfile with error is here: logs_561.zip

    Do you have any additional context?
    Add any other context about the problem.

    opensearch-2.5.0.jar: 1 vulnerabilities (highest severity is: 9.8)

    Vulnerable Library - opensearch-2.5.0.jar

    Path to dependency file: /build.gradle

    Path to vulnerable library: /home/wss-scanner/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.32/e80612549feb5c9191c498de628c1aa80693cf0b/snakeyaml-1.32.jar

    Found in HEAD commit: a561b0007614cd237ea81d94120f931a415fdfa5

    Vulnerabilities

    • CVE: CVE-2022-1471
    • Severity: High
    • CVSS: 9.8
    • Dependency: snakeyaml-1.32.jar
    • Type: Transitive
    • Fixed in (opensearch version): N/A*

    *For some transitive vulnerabilities, there is no version of direct dependency with a fix. Check the section "Details" below to see if there is a version of transitive dependency where vulnerability is fixed.

    Details

    CVE-2022-1471

    Vulnerable Library - snakeyaml-1.32.jar

    YAML 1.1 parser and emitter for Java

    Library home page: https://bitbucket.org/snakeyaml/snakeyaml

    Path to dependency file: /build.gradle

    Path to vulnerable library: /home/wss-scanner/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.32/e80612549feb5c9191c498de628c1aa80693cf0b/snakeyaml-1.32.jar

    Dependency Hierarchy:

    • opensearch-2.5.0.jar (Root Library)
      • opensearch-x-content-2.5.0.jar
        • snakeyaml-1.32.jar (Vulnerable Library)

    Found in HEAD commit: a561b0007614cd237ea81d94120f931a415fdfa5

    Found in base branch: main

    Vulnerability Details

    SnakeYaml's Constructor() class does not restrict types which can be instantiated during deserialization. Deserializing yaml content provided by an attacker can lead to remote code execution. We recommend using SnakeYaml's SafeConstructor when parsing untrusted content to restrict deserialization.

    Publish Date: 2022-12-01

    URL: CVE-2022-1471

    CVSS 3 Score Details (9.8)

    Base Score Metrics:

    • Exploitability Metrics:
      • Attack Vector: Network
      • Attack Complexity: Low
      • Privileges Required: None
      • User Interaction: None
      • Scope: Unchanged
    • Impact Metrics:
      • Confidentiality Impact: High
      • Integrity Impact: High
      • Availability Impact: High

    For more information on CVSS3 Scores, click here.

    [META] Update config rules for auto organizing label

    Update config rules for auto organizing the tagged PRs/issues into categories:

    • update draft-release-notes-config.yml to categorize as below:

    categories:
      - title: 'Breaking Changes'
        labels:
          - 'Breaking Changes'
      - title: 'Security'
        labels:
          - 'Security'
          - 'security fix'
          - 'security vulnerability'
      - title: 'Features'
        labels:
          - 'Features'
      - title: 'Enhancements'
        labels:
          - 'Enhancements'
          - 'enhancement'
      - title: 'Bug Fixes'
        labels:
          - 'Bug Fixes'
          - 'bug'
      - title: 'Infrastructure'
        labels:
          - 'Infrastructure'
          - 'configuration error'
      - title: 'Documentation'
        labels:
          - 'Documentation'
          - 'documentation'
      - title: 'Maintenance'
        labels:
          - 'Maintenance'
          - 'meta'
          - 'release'
      - title: 'Refactoring'
        labels:
          - 'Refactoring'

    [Doc] Add API reference

    This repo should include documentation that explains all possible parameters for the available result transformers.

    As we reimplement result transformers as search pipeline processors (see e.g. opensearch-project/OpenSearch#7158, which probably belongs in this repo, not in OpenSearch core), we will need to update the documentation.

    [RFC] Additional Reranker Implementations Requested

    We want to be sure we call ourselves out as per our principles for development. Specifically, because we are releasing this plugin with AWS Kendra as the first integrated reranker, we are unfortunately violating the "A level playing field" principle. We do see the Kendra re-ranking integration as simply the first re-ranker integration and want to add more as soon as possible. In calling this out, we are also asking the community for comments here, PRs on the plugin for additional integrations if interested, as well as cooperative work to add an additional re-ranker so we can refactor the code and generalize it.

    Improve test coverage

    This project currently (November 22, 2022) has around 21% line coverage from tests.

    We should increase that significantly to get more confidence on future changes (and maybe uncover some lurking bugs along the way).

    I'm going to suggest a target of 80% before we can call this issue "done".

    [PROPOSAL] List options to choose an enhanced/modified search API for OpenSearch

    What are you proposing?

    In a few sentences, describe the feature and its core capabilities.

    How did you come up with this proposal?

    Highlight any research, proposals, requests, issues, forum posts, anecdotes that signal this is the right thing to build. Highlight opportunities for additional research.

    What is the user experience going to be?

    Describe the feature requirements and or user stories. You may include low-fidelity sketches, wireframes, APIs stubs, or other examples of how a user would use the feature. Using a bulleted list or simple diagrams to outline features is okay. e.g. As a < type of user > , I want to < achieve a goal > so that < for some reason >.

    Why should it be built? Any reason not to?

    Describe the most important user needs, pain points, and the value that this feature will bring to the OpenSearch community, as well as what impact it has if it isn't built, or new risks if it is. What is preventing you from meeting this need today?

    What will it take to execute?

    Describe what it will take to build this feature. Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?

    What are remaining open questions?

    List questions that may need to be answered before proceeding with an implementation.

    [PROPOSAL] Is this the semantic-reranker or relevance_workbench repo?

    What/Why

    What are you proposing?

    I think we’re about to hit a conflict here and could use help from the more experienced team members. With the semantic-reranker plugin here https://github.com/kevinawskendra/search-relevance and the relevancy_workbench here https://github.com/ps48/search-relevance/tree/relevancy_workbench, we probably want to set a new origin for one of them and create a new repo according to this: https://github.com/opensearch-project/opensearch-plugin-template-java. Are my assumptions sound?

    If so, then I think it makes the most sense to put relevancy_workbench here and create a new repo for the semantic-reranker based on the opensearch-project/opensearch-plugin-template-java template repo.

    [FEATURE] [META] Querqy Plugin Release

    Release the Querqy Plugin for OpenSearch

    • Comes in from a community contribution here

    • More issues to standardize the plugin before release are added here

    • NOTE: This meta will be moved to a new plugin repository soon.

    [REFACTORING] Updating search-processor for XContentType refactor

    Description

    Following the XContentType refactor, the search-processor plugin no longer builds against 3.0.0. XContent-related imports and usages in the repository should be updated to reflect the refactor changes.
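
    The fix should be mostly mechanical import rewriting. A hedged before/after sketch, using XContentParser as an example (the exact target packages should be confirmed against OpenSearch core on 3.0.0):

    // Before (2.x package layout, as in the errors below):
    // import org.opensearch.common.ParseField;
    // import org.opensearch.common.xcontent.XContentParser;
    // import org.opensearch.common.xcontent.XContentBuilder;

    // After the refactor (classes moved into the core library):
    import org.opensearch.core.ParseField;
    import org.opensearch.core.xcontent.XContentParser;
    import org.opensearch.core.xcontent.XContentBuilder;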

    Sample Output

    Building using 3.0.0 gradle build tools:

    ➜ search-processor (main-version) ✗ ./gradlew build
    
    > Configure project :
    true
    =======================================
    OpenSearch Build Hamster says Hello!
      Gradle Version        : 7.4.2
      OS Info               : Mac OS X 12.6.3 (aarch64)
      JDK Version           : 11 (Eclipse Temurin JDK)
      JAVA_HOME             : /Library/Java/JavaVirtualMachines/jdk-11.0.17+8/Contents/Home
      Random Testing Seed   : DF0676945F54ACBA
      In FIPS 140 mode      : false
    =======================================
    
    > Task :compileJava FAILED
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/ResultTransformerConfigurationFactory.java:12: error: cannot find symbol
    import org.opensearch.common.xcontent.XContentParser;
                                         ^
      symbol:   class XContentParser
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/TransformerConfiguration.java:13: error: cannot find symbol
    import org.opensearch.common.ParseField;
                                ^
      symbol:   class ParseField
      location: package org.opensearch.common
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/TransformerConfiguration.java:15: error: cannot find symbol
    import org.opensearch.common.xcontent.ToXContentObject;
                                         ^
      symbol:   class ToXContentObject
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/TransformerConfiguration.java:17: error: cannot find symbol
    public abstract class TransformerConfiguration implements Writeable, ToXContentObject {
                                                                         ^
      symbol: class ToXContentObject
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/ResultTransformerConfigurationFactory.java:32: error: cannot find symbol
        ResultTransformerConfiguration configure(XContentParser parser) throws IOException;
                                                 ^
      symbol:   class XContentParser
      location: interface ResultTransformerConfigurationFactory
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/TransformerConfiguration.java:18: error: cannot find symbol
      protected static final ParseField TRANSFORMER_ORDER = new ParseField(ORDER);
                             ^
      symbol:   class ParseField
      location: class TransformerConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/TransformerConfiguration.java:19: error: cannot find symbol
      protected static final ParseField TRANSFORMER_PROPERTIES = new ParseField(PROPERTIES);
                             ^
      symbol:   class ParseField
      location: class TransformerConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:10: error: cannot find symbol
    import org.opensearch.common.ParseField;
                                ^
      symbol:   class ParseField
      location: package org.opensearch.common
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:14: error: cannot find symbol
    import org.opensearch.common.xcontent.XContentBuilder;
                                         ^
      symbol:   class XContentBuilder
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:15: error: cannot find symbol
    import org.opensearch.common.xcontent.XContentParser;
                                         ^
      symbol:   class XContentParser
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:31: error: cannot find symbol
        private static final ParseField RESULT_TRANSFORMER = new ParseField(TransformerType.RESULT_TRANSFORMER.toString());
                             ^
      symbol:   class ParseField
      location: class SearchConfigurationExtBuilder
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:64: error: cannot find symbol
        public static SearchConfigurationExtBuilder parse(XContentParser parser,
                                                          ^
      symbol:   class XContentParser
      location: class SearchConfigurationExtBuilder
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:111: error: cannot find symbol
        public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
                                          ^
      symbol:   class XContentBuilder
      location: class SearchConfigurationExtBuilder
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:111: error: cannot find symbol
        public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
               ^
      symbol:   class XContentBuilder
      location: class SearchConfigurationExtBuilder
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfigurationFactory.java:12: error: cannot find symbol
    import org.opensearch.common.xcontent.XContentParser;
                                         ^
      symbol:   class XContentParser
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfigurationFactory.java:37: error: cannot find symbol
        public ResultTransformerConfiguration configure(XContentParser parser) throws IOException {
                                                        ^
      symbol:   class XContentParser
      location: class KendraIntelligentRankingConfigurationFactory
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:10: error: cannot find symbol
    import org.opensearch.common.ParseField;
                                ^
      symbol:   class ParseField
      location: package org.opensearch.common
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:16: error: cannot find symbol
    import org.opensearch.common.xcontent.ObjectParser;
                                         ^
      symbol:   class ObjectParser
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:17: error: cannot find symbol
    import org.opensearch.common.xcontent.ToXContentObject;
                                         ^
      symbol:   class ToXContentObject
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:18: error: cannot find symbol
    import org.opensearch.common.xcontent.XContentBuilder;
                                         ^
      symbol:   class XContentBuilder
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:19: error: cannot find symbol
    import org.opensearch.common.xcontent.XContentParser;
                                         ^
      symbol:   class XContentParser
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:36: error: cannot find symbol
      private static final ObjectParser<KendraIntelligentRankingConfiguration, Void> PARSER;
                           ^
      symbol:   class ObjectParser
      location: class KendraIntelligentRankingConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:124: error: cannot find symbol
      public static class KendraIntelligentRankingProperties implements Writeable, ToXContentObject {
                                                                                   ^
      symbol:   class ToXContentObject
      location: class KendraIntelligentRankingConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:79: error: cannot find symbol
      public static KendraIntelligentRankingConfiguration parse(XContentParser parser) throws IOException {
                                                                ^
      symbol:   class XContentParser
      location: class KendraIntelligentRankingConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:93: error: cannot find symbol
      public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
                                        ^
      symbol:   class XContentBuilder
      location: class KendraIntelligentRankingConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:93: error: cannot find symbol
      public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
                                                                 ^
      symbol:   class Params
      location: class KendraIntelligentRankingConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:93: error: cannot find symbol
      public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
             ^
      symbol:   class XContentBuilder
      location: class KendraIntelligentRankingConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:125: error: cannot find symbol
        protected static final ParseField BODY_FIELD = new ParseField(Constants.BODY_FIELD);
                               ^
      symbol:   class ParseField
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:126: error: cannot find symbol
        protected static final ParseField TITLE_FIELD = new ParseField(Constants.TITLE_FIELD);
                               ^
      symbol:   class ParseField
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:127: error: cannot find symbol
        protected static final ParseField DOC_LIMIT = new ParseField(Constants.DOC_LIMIT);
                               ^
      symbol:   class ParseField
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:129: error: cannot find symbol
        private static final ObjectParser<KendraIntelligentRankingProperties, Void> PARSER;
                             ^
      symbol:   class ObjectParser
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:168: error: cannot find symbol
        public static KendraIntelligentRankingProperties parse(XContentParser parser, Void context) throws IOException {
                                                               ^
      symbol:   class XContentParser
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:183: error: cannot find symbol
        public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
                                          ^
      symbol:   class XContentBuilder
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:183: error: cannot find symbol
        public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
                                                                   ^
      symbol:   class Params
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:183: error: cannot find symbol
        public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
               ^
      symbol:   class XContentBuilder
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/SearchRelevancePlugin.java:25: error: cannot find symbol
    import org.opensearch.common.xcontent.NamedXContentRegistry;
                                         ^
      symbol:   class NamedXContentRegistry
      location: package org.opensearch.common.xcontent
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/SearchRelevancePlugin.java:82: error: cannot find symbol
          NamedXContentRegistry xContentRegistry,
          ^
      symbol:   class NamedXContentRegistry
      location: class SearchRelevancePlugin
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/ResultTransformerConfigurationFactory.java:40: error: name clash: interface ResultTransformerConfigurationFactory has two methods with the same erasure, yet neither overrides the other
        ResultTransformerConfiguration configure(StreamInput streamInput) throws IOException;
                                       ^
      first method:  configure(XContentParser) in ResultTransformerConfigurationFactory
      second method: configure(Settings) in ResultTransformerConfigurationFactory
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/TransformerConfiguration.java:18: error: cannot find symbol
      protected static final ParseField TRANSFORMER_ORDER = new ParseField(ORDER);
                                                                ^
      symbol:   class ParseField
      location: class TransformerConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/TransformerConfiguration.java:19: error: cannot find symbol
      protected static final ParseField TRANSFORMER_PROPERTIES = new ParseField(PROPERTIES);
                                                                     ^
      symbol:   class ParseField
      location: class TransformerConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:31: error: cannot find symbol
        private static final ParseField RESULT_TRANSFORMER = new ParseField(TransformerType.RESULT_TRANSFORMER.toString());
                                                                 ^
      symbol:   class ParseField
      location: class SearchConfigurationExtBuilder
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:67: error: package XContentParser does not exist
            XContentParser.Token token = parser.currentToken();
                          ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:69: error: package XContentParser does not exist
            if (token != XContentParser.Token.START_OBJECT && (token = parser.nextToken()) != XContentParser.Token.START_OBJECT) {
                                       ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:69: error: package XContentParser does not exist
            if (token != XContentParser.Token.START_OBJECT && (token = parser.nextToken()) != XContentParser.Token.START_OBJECT) {
                                                                                                            ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:72: error: package XContentParser does not exist
                        "Expected [" + XContentParser.Token.START_OBJECT + "] but found [" + token + "]",
                                                     ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:76: error: package XContentParser does not exist
            while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
                                                                 ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:77: error: package XContentParser does not exist
                if (token == XContentParser.Token.FIELD_NAME) {
                                           ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:79: error: package XContentParser does not exist
                } else if (token == XContentParser.Token.START_OBJECT) {
                                                  ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:82: error: package XContentParser does not exist
                        while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
                                                                             ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/configuration/SearchConfigurationExtBuilder.java:83: error: package XContentParser does not exist
                            if (token == XContentParser.Token.FIELD_NAME) {
                                                       ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfigurationFactory.java:37: error: configure(XContentParser) in KendraIntelligentRankingConfigurationFactory cannot implement configure(Settings) in ResultTransformerConfigurationFactory
        public ResultTransformerConfiguration configure(XContentParser parser) throws IOException {
                                              ^
      overridden method does not throw IOException
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfigurationFactory.java:42: error: name clash: class KendraIntelligentRankingConfigurationFactory has two methods with the same erasure, yet neither overrides the other
        public ResultTransformerConfiguration configure(StreamInput streamInput) throws IOException {
                                              ^
      first method:  configure(XContentParser) in KendraIntelligentRankingConfigurationFactory
      second method: configure(Settings) in KendraIntelligentRankingConfigurationFactory
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:39: error: cannot find symbol
        PARSER = new ObjectParser<>("kendra_intelligent_ranking_configuration", KendraIntelligentRankingConfiguration::new);
                     ^
      symbol:   class ObjectParser
      location: class KendraIntelligentRankingConfiguration
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:92: error: method does not override or implement a method from a supertype
      @Override
      ^
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:125: error: cannot find symbol
        protected static final ParseField BODY_FIELD = new ParseField(Constants.BODY_FIELD);
                                                           ^
      symbol:   class ParseField
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:126: error: cannot find symbol
        protected static final ParseField TITLE_FIELD = new ParseField(Constants.TITLE_FIELD);
                                                            ^
      symbol:   class ParseField
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:127: error: cannot find symbol
        protected static final ParseField DOC_LIMIT = new ParseField(Constants.DOC_LIMIT);
                                                          ^
      symbol:   class ParseField
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:132: error: cannot find symbol
          PARSER = new ObjectParser<>("kendra_intelligent_ranking_configuration", KendraIntelligentRankingProperties::new);
                       ^
      symbol:   class ObjectParser
      location: class KendraIntelligentRankingProperties
    /Users/lnse/opensearch/search-processor/src/main/java/org/opensearch/search/relevance/transformer/kendraintelligentranking/configuration/KendraIntelligentRankingConfiguration.java:182: error: method does not override or implement a method from a supertype
        @Override
        ^
    Note: Some input files use unchecked or unsafe operations.
    Note: Recompile with -Xlint:unchecked for details.
    59 errors
    
    FAILURE: Build failed with an exception.
    
    * What went wrong:
    Execution failed for task ':compileJava'.
    > Compilation failed; see the compiler error output for details.
    
    * Try:
    > Run with --stacktrace option to get the stack trace.
    > Run with --info or --debug option to get more log output.
    > Run with --scan to get full insights.
    
    * Get more help at https://help.gradle.org
    
    Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.
    
    You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
    
    See https://docs.gradle.org/7.4.2/userguide/command_line_interface.html#sec:command_line_warnings
    
    BUILD FAILED in 703ms
    2 actionable tasks: 2 executed
    

    [Build] Jackson-core as a dependency

    Context

    #132

    Some GitHub action backend code using jackson-core 2.5.0 is failing builds because it includes an MRJAR with JDK 19 code. See, for example, this failed action.

    Action items

    • For the short term, upgrade to jackson-core 2.5.0 in the main branch to resolve the build failure. Also upgrade gradle #132 - Resolved here #134
    • For the long term, we must reach a sustainable decision on jackson-core as a dependency; see Dan's suggestions here. Can we remove all jackson-* packages? If not, we need to test the build and integration. [Non-urgent]
    • Check if the current dependabot is working well, or examine alternatives.

    Some observations:

    1. OpenSearch main branch is using jackson-core 2.5.0 https://github.com/opensearch-project/OpenSearch/tree/main/client/sniffer/licenses
    2. Security plugin removed jackson-databind and jackson-annotations opensearch-project/security#2325
    3. Observability plugin removed jackson dependencies and then added them back opensearch-project/observability#1379

    Prototype search pipelines

    I'm going to put together a scrappy first implementation of search pipelines.

    This first implementation will largely be a copy/paste from ingest pipelines.

    I think it should be a good conversation-starter about whether/how to share implementation with ingest pipelines. Depending on where I get with this task, it may be throwaway learning code or it may be the first draft of what we eventually want to merge.

    This task should be moved to the OpenSearch core project, but I'm creating it here as a placeholder.

    The goals for my prototype include:

    1. Should be able to CRUD search pipelines, with persistence in cluster state.
    2. Should be able to invoke a named search pipeline from a search request. (The named search pipeline might include a dummy "hello world" processor that e.g. adds a field to the first hit in the response or adds a filter to an incoming query.)

    Features that can come later (but before "release") include:

    1. Setting a default search pipeline for an index.
    2. Specifying an ad hoc search pipeline as part of a search request.
    3. Availability of a "standard" set of search pipeline processors (similar to ingest-common).
    4. Support for BracketProcessor (processors that modify both request and response, with state carried from request time to response time); see the interface sketch after this list.
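
    For discussion, here are illustrative-only Java shapes for those processors; the prototype’s actual interfaces may differ. The point of the bracket variant is that state computed while transforming the request is available again when transforming the response.

    import org.opensearch.action.search.SearchRequest;
    import org.opensearch.action.search.SearchResponse;

    interface SearchRequestProcessor {
        SearchRequest processRequest(SearchRequest request) throws Exception;
    }

    interface SearchResponseProcessor {
        SearchResponse processResponse(SearchRequest request, SearchResponse response) throws Exception;
    }

    interface BracketProcessor<S> {
        // Transform the request (SearchRequest is mutable) and return state
        // to be reused at response time.
        S processRequest(SearchRequest request) throws Exception;

        SearchResponse processResponse(S state, SearchResponse response) throws Exception;
    }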

    [FEATURE] CI Workflows for Semantic Search/Search Request Processor Plugin

    Is your feature request related to a problem?

    No CI currently exists for this repo.

    What solution would you like?

    • MacOS Build Passes
    • Ubuntu Build Passes
    • Windows Build Passes
    • Integration Tests Complete & Passing
    • CodeCov Reports Included
    • Backport Scripts Working
    • Automated Version Bumping

    What alternatives have you considered?

    None

    [DOC] Update README File to Reflect What This Repo Contains

    This one has been back and forth a few times, but now that Search Pipelines is in core, we should update the README to reflect that this repo contains the Kendra self-install plugin and will be a shim for the processor being built. Is this also the long-term future of this repo?

    [FEATURE] Windows Support

    Is your feature request related to a problem?

    All OpenSearch plugins must support Windows.

    What solution would you like?

    The in-repo CI build for the plugin passes on Windows.

    What alternatives have you considered?

    None.

    Do you have any additional context?

    logs_15.zip
    It looks like the AWS config either needs to be mocked or configured correctly.
    The workflow was copied from the k-NN repo.

    [BUG] Version tracking in 2.x

    Description

    The 2.x version should track 2.7.0, as well as use the SNAPSHOT version of build tools in the dependency. The naming conventions should be the same as #114.

    Additionally, this change should have been in 2.6 as well.

    [BUG] main branch should be tracking 3.0.0

    Description

    The main branch should be tracking version 3.0.0 and the 2.x branch should be tracking the current minor version.

    As an example, Observability tracks 3.0.0 along with the SNAPSHOT tag. We should include a flag for SNAPSHOT in our build.gradle, since there is currently no 3.0.0 release and the build will not compile without the SNAPSHOT qualifier.

    [BUG] Flaky test failure KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip

    Details

    As seen in the build failure in #106, the testSerializeToXContentRoundtrip test in KendraIntelligentRankingConfigurationTests fails sporadically.
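
    Based on the parse failure below, a plausible root cause is that the randomized configuration can produce a transformer order of 0, which the parser rejects. Here is a sketch of the suspected fix, assuming the test draws the order from an unconstrained random generator (randomIntBetween is inherited from OpenSearchTestCase):

    // Sketch only, not the actual test body:
    public void testSerializeToXContentRoundtrip() throws IOException {
        int order = randomIntBetween(1, 10); // keep the order >= 1 so parsing cannot reject it
        // ... build the KendraIntelligentRankingConfiguration with this order and round-trip it
    }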

    Steps to reproduce

    The following commands include the random seed to reproduce the error.

    search-processor (82e59f1) ✔ ./gradlew ':test' --tests "org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip" -Dtests.seed=C94C719B5F62ADE -Dtests.security.manager=true -Dtests.locale=vi-VN -Dtests.timezone=Etc/GMT+9                
    =======================================
    OpenSearch Build Hamster says Hello!
      Gradle Version        : 7.4.2
      OS Info               : Mac OS X 12.6.3 (aarch64)
      JDK Version           : 11 (Eclipse Temurin JDK)
      JAVA_HOME             : /Library/Java/JavaVirtualMachines/jdk-11.0.17+8/Contents/Home
      Random Testing Seed   : C94C719B5F62ADE
      In FIPS 140 mode      : false
    =======================================
    
    > Task :test FAILED
    
    
    org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests > testSerializeToXContentRoundtrip FAILED
        ParsingException[Failed to parse value [0] for Transformer order, must be >= 1]
            at __randomizedtesting.SeedInfo.seed([C94C719B5F62ADE:1195B20D26FE1F21]:0)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfiguration.parse(KendraIntelligentRankingConfiguration.java:84)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip(KendraIntelligentRankingConfigurationTests.java:36)
    REPRODUCE WITH: ./gradlew ':test' --tests "org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip" -Dtests.seed=C94C719B5F62ADE -Dtests.security.manager=true -Dtests.locale=vi-VN -Dtests.timezone=Etc/GMT+9 -Druntime.java=11
    
    Suite: Test class org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests
      1> [2023-02-27T17:08:19,270][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[C94C719B5F62ADE]]] unable to load JNA native support library, native methods will be disabled.
      1> java.lang.UnsatisfiedLinkError: Can't load library: /Users/lnse/Library/Caches/JNA/temp/jna10477499904397165298.tmp
      1>    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:2633) ~[?:?]
      1>    at java.lang.Runtime.load0(Runtime.java:768) ~[?:?]
      1>    at java.lang.System.load(System.java:1837) ~[?:?]
      1>    at com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath(Native.java:1018) ~[jna-5.5.0.jar:5.5.0 (b0)]
      1>    at com.sun.jna.Native.loadNativeDispatchLibrary(Native.java:988) ~[jna-5.5.0.jar:5.5.0 (b0)]
      1>    at com.sun.jna.Native.<clinit>(Native.java:195) ~[jna-5.5.0.jar:5.5.0 (b0)]
      1>    at java.lang.Class.forName0(Native Method) ~[?:?]
      1>    at java.lang.Class.forName(Class.java:315) ~[?:?]
      1>    at org.opensearch.bootstrap.Natives.<clinit>(Natives.java:60) [opensearch-2.5.0.jar:2.5.0]
      1>    at org.opensearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:123) [opensearch-2.5.0.jar:2.5.0]
      1>    at org.opensearch.bootstrap.BootstrapForTesting.<clinit>(BootstrapForTesting.java:105) [framework-2.5.0.jar:2.5.0]
      1>    at org.opensearch.test.OpenSearchTestCase.<clinit>(OpenSearchTestCase.java:257) [framework-2.5.0.jar:2.5.0]
      1>    at java.lang.Class.forName0(Native Method) ~[?:?]
      1>    at java.lang.Class.forName(Class.java:398) [?:?]
      1>    at com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:623) [randomizedtesting-runner-2.7.1.jar:?]
      1> [2023-02-27T17:08:19,285][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[C94C719B5F62ADE]]] cannot check if running as root because JNA is not available
      1> [2023-02-27T17:08:19,285][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[C94C719B5F62ADE]]] cannot install system call filter because JNA is not available
      1> [2023-02-27T17:08:19,286][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[C94C719B5F62ADE]]] cannot register console handler because JNA is not available
      1> [2023-02-27T17:08:19,287][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[C94C719B5F62ADE]]] cannot getrlimit RLIMIT_NPROC because JNA is not available
      1> [2023-02-27T17:08:19,287][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[C94C719B5F62ADE]]] cannot getrlimit RLIMIT_AS because JNA is not available
      1> [2023-02-27T17:08:19,287][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[C94C719B5F62ADE]]] cannot getrlimit RLIMIT_FSIZE because JNA is not available
      1> [2023-02-27T16:08:19,665][INFO ][o.o.s.r.t.k.c.KendraIntelligentRankingConfigurationTests] [testSerializeToXContentRoundtrip] before test
      1> [2023-02-27T16:08:19,868][INFO ][o.o.s.r.t.k.c.KendraIntelligentRankingConfigurationTests] [testSerializeToXContentRoundtrip] after test
      2> REPRODUCE WITH: ./gradlew ':test' --tests "org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip" -Dtests.seed=C94C719B5F62ADE -Dtests.security.manager=true -Dtests.locale=vi-VN -Dtests.timezone=Etc/GMT+9 -Druntime.java=11
      2> ParsingException[Failed to parse value [0] for Transformer order, must be >= 1]
            at __randomizedtesting.SeedInfo.seed([C94C719B5F62ADE:1195B20D26FE1F21]:0)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfiguration.parse(KendraIntelligentRankingConfiguration.java:84)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip(KendraIntelligentRankingConfigurationTests.java:36)
      2> NOTE: leaving temporary files on disk at: /Users/lnse/opensearch/search-processor/build/testrun/test/temp/org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests_C94C719B5F62ADE-002
      2> NOTE: test params are: codec=Asserting(Lucene94): {}, docValues:{}, maxPointsInLeafNode=1933, maxMBSortInHeap=6.854586443954316, sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=vi-VN, timezone=Etc/GMT+9
      2> NOTE: Mac OS X 12.6.3 aarch64/Eclipse Adoptium 11.0.17 (64-bit)/cpus=10,threads=1,free=405602816,total=536870912
      2> NOTE: All tests run in this JVM: [KendraIntelligentRankingConfigurationTests]
    
    Tests with failures:
     - org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip
    
    1 test completed, 1 failed
    
    FAILURE: Build failed with an exception.
    
    * What went wrong:
    Execution failed for task ':test'.
    > There were failing tests. See the report at: file:///Users/lnse/opensearch/search-processor/build/reports/tests/test/index.html
    
    * Try:
    > Run with --stacktrace option to get the stack trace.
    > Run with --info or --debug option to get more log output.
    > Run with --scan to get full insights.
    
    * Get more help at https://help.gradle.org
    
    Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.
    
    You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
    
    See https://docs.gradle.org/7.4.2/userguide/command_line_interface.html#sec:command_line_warnings
    
    BUILD FAILED in 2s
    7 actionable tasks: 2 executed, 5 up-to-date
    
     search-processor (main) ✔ ./gradlew ':test' --tests "org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip" -Dtests.seed=7072CFFCEB4B3DC0 -Dtests.security.manager=true -Dtests.locale=mk -Dtests.timezone=Europe/Madrid
    =======================================
    OpenSearch Build Hamster says Hello!
      Gradle Version        : 7.4.2
      OS Info               : Mac OS X 12.6.3 (aarch64)
      JDK Version           : 11 (Eclipse Temurin JDK)
      JAVA_HOME             : /Library/Java/JavaVirtualMachines/jdk-11.0.17+8/Contents/Home
      Random Testing Seed   : 7072CFFCEB4B3DC0
      In FIPS 140 mode      : false
    =======================================
    
    > Task :test FAILED
    
    
    org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests > testSerializeToXContentRoundtrip FAILED
        ParsingException[Failed to parse value [0] for Transformer order, must be >= 1]
            at __randomizedtesting.SeedInfo.seed([7072CFFCEB4B3DC0:6D73BAE87843083F]:0)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfiguration.parse(KendraIntelligentRankingConfiguration.java:84)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip(KendraIntelligentRankingConfigurationTests.java:36)
    REPRODUCE WITH: ./gradlew ':test' --tests "org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip" -Dtests.seed=7072CFFCEB4B3DC0 -Dtests.security.manager=true -Dtests.locale=mk -Dtests.timezone=Europe/Madrid -Druntime.java=11
    
    Suite: Test class org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests
      1> [2023-02-28T12:29:02,445][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[7072CFFCEB4B3DC0]]] unable to load JNA native support library, native methods will be disabled.
      1> java.lang.UnsatisfiedLinkError: Can't load library: /Users/lnse/Library/Caches/JNA/temp/jna8244722722494699276.tmp
      1>    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:2633) ~[?:?]
      1>    at java.lang.Runtime.load0(Runtime.java:768) ~[?:?]
      1>    at java.lang.System.load(System.java:1837) ~[?:?]
      1>    at com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath(Native.java:1018) ~[jna-5.5.0.jar:5.5.0 (b0)]
      1>    at com.sun.jna.Native.loadNativeDispatchLibrary(Native.java:988) ~[jna-5.5.0.jar:5.5.0 (b0)]
      1>    at com.sun.jna.Native.<clinit>(Native.java:195) ~[jna-5.5.0.jar:5.5.0 (b0)]
      1>    at java.lang.Class.forName0(Native Method) ~[?:?]
      1>    at java.lang.Class.forName(Class.java:315) ~[?:?]
      1>    at org.opensearch.bootstrap.Natives.<clinit>(Natives.java:60) [opensearch-2.5.0.jar:2.5.0]
      1>    at org.opensearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:123) [opensearch-2.5.0.jar:2.5.0]
      1>    at org.opensearch.bootstrap.BootstrapForTesting.<clinit>(BootstrapForTesting.java:105) [framework-2.5.0.jar:2.5.0]
      1>    at org.opensearch.test.OpenSearchTestCase.<clinit>(OpenSearchTestCase.java:257) [framework-2.5.0.jar:2.5.0]
      1>    at java.lang.Class.forName0(Native Method) ~[?:?]
      1>    at java.lang.Class.forName(Class.java:398) [?:?]
      1>    at com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:623) [randomizedtesting-runner-2.7.1.jar:?]
      1> [2023-02-28T12:29:02,457][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[7072CFFCEB4B3DC0]]] cannot check if running as root because JNA is not available
      1> [2023-02-28T12:29:02,458][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[7072CFFCEB4B3DC0]]] cannot install system call filter because JNA is not available
      1> [2023-02-28T12:29:02,458][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[7072CFFCEB4B3DC0]]] cannot register console handler because JNA is not available
      1> [2023-02-28T12:29:02,459][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[7072CFFCEB4B3DC0]]] cannot getrlimit RLIMIT_NPROC because JNA is not available
      1> [2023-02-28T12:29:02,459][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[7072CFFCEB4B3DC0]]] cannot getrlimit RLIMIT_AS because JNA is not available
      1> [2023-02-28T12:29:02,459][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[7072CFFCEB4B3DC0]]] cannot getrlimit RLIMIT_FSIZE because JNA is not available
      1> [2023-02-28T21:29:02,832][INFO ][o.o.s.r.t.k.c.KendraIntelligentRankingConfigurationTests] [testSerializeToXContentRoundtrip] before test
      1> [2023-02-28T21:29:03,063][INFO ][o.o.s.r.t.k.c.KendraIntelligentRankingConfigurationTests] [testSerializeToXContentRoundtrip] after test
      2> REPRODUCE WITH: ./gradlew ':test' --tests "org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip" -Dtests.seed=7072CFFCEB4B3DC0 -Dtests.security.manager=true -Dtests.locale=mk -Dtests.timezone=Europe/Madrid -Druntime.java=11
      2> ParsingException[Failed to parse value [0] for Transformer order, must be >= 1]
            at __randomizedtesting.SeedInfo.seed([7072CFFCEB4B3DC0:6D73BAE87843083F]:0)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfiguration.parse(KendraIntelligentRankingConfiguration.java:84)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip(KendraIntelligentRankingConfigurationTests.java:36)
      2> NOTE: leaving temporary files on disk at: /Users/lnse/opensearch/search-processor/build/testrun/test/temp/org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests_7072CFFCEB4B3DC0-001
      2> NOTE: test params are: codec=Asserting(Lucene94): {}, docValues:{}, maxPointsInLeafNode=863, maxMBSortInHeap=6.6242684347487675, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=mk, timezone=Europe/Madrid
      2> NOTE: Mac OS X 12.6.3 aarch64/Eclipse Adoptium 11.0.17 (64-bit)/cpus=10,threads=1,free=406309992,total=536870912
      2> NOTE: All tests run in this JVM: [KendraIntelligentRankingConfigurationTests]
    
    Tests with failures:
     - org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip
    
    1 test completed, 1 failed
    
    FAILURE: Build failed with an exception.
    
    * What went wrong:
    Execution failed for task ':test'.
    > There were failing tests. See the report at: file:///Users/lnse/opensearch/search-processor/build/reports/tests/test/index.html
    
    * Try:
    > Run with --stacktrace option to get the stack trace.
    > Run with --info or --debug option to get more log output.
    > Run with --scan to get full insights.
    
    * Get more help at https://help.gradle.org
    
    Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.
    
    You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
    
    See https://docs.gradle.org/7.4.2/userguide/command_line_interface.html#sec:command_line_warnings
    
    BUILD FAILED in 2s
    7 actionable tasks: 1 executed, 6 up-to-date
    
    search-processor (main) ✔ ./gradlew ':test' --tests "org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip" -Dtests.seed=5651DD3B7CACCD87 -Dtests.security.manager=true -Dtests.locale=no -Dtests.timezone=Etc/GMT-9
    =======================================
    OpenSearch Build Hamster says Hello!
      Gradle Version        : 7.4.2
      OS Info               : Mac OS X 12.6.3 (aarch64)
      JDK Version           : 11 (Eclipse Temurin JDK)
      JAVA_HOME             : /Library/Java/JavaVirtualMachines/jdk-11.0.17+8/Contents/Home
      Random Testing Seed   : 5651DD3B7CACCD87
      In FIPS 140 mode      : false
    =======================================
    
    > Task :test FAILED
    
    
    org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests > testSerializeToXContentRoundtrip FAILED
        ParsingException[Failed to parse value [0] for Transformer order, must be >= 1]
            at __randomizedtesting.SeedInfo.seed([5651DD3B7CACCD87:4B50A82FEFA4F878]:0)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfiguration.parse(KendraIntelligentRankingConfiguration.java:84)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip(KendraIntelligentRankingConfigurationTests.java:36)
    REPRODUCE WITH: ./gradlew ':test' --tests "org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip" -Dtests.seed=5651DD3B7CACCD87 -Dtests.security.manager=true -Dtests.locale=no -Dtests.timezone=Etc/GMT-9 -Druntime.java=11
    
    Suite: Test class org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests
      1> [2023-02-28T12:30:37,520][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[5651DD3B7CACCD87]]] unable to load JNA native support library, native methods will be disabled.
      1> java.lang.UnsatisfiedLinkError: Can't load library: /Users/lnse/Library/Caches/JNA/temp/jna6176494812616682527.tmp
      1>    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:2633) ~[?:?]
      1>    at java.lang.Runtime.load0(Runtime.java:768) ~[?:?]
      1>    at java.lang.System.load(System.java:1837) ~[?:?]
      1>    at com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath(Native.java:1018) ~[jna-5.5.0.jar:5.5.0 (b0)]
      1>    at com.sun.jna.Native.loadNativeDispatchLibrary(Native.java:988) ~[jna-5.5.0.jar:5.5.0 (b0)]
      1>    at com.sun.jna.Native.<clinit>(Native.java:195) ~[jna-5.5.0.jar:5.5.0 (b0)]
      1>    at java.lang.Class.forName0(Native Method) ~[?:?]
      1>    at java.lang.Class.forName(Class.java:315) ~[?:?]
      1>    at org.opensearch.bootstrap.Natives.<clinit>(Natives.java:60) [opensearch-2.5.0.jar:2.5.0]
      1>    at org.opensearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:123) [opensearch-2.5.0.jar:2.5.0]
      1>    at org.opensearch.bootstrap.BootstrapForTesting.<clinit>(BootstrapForTesting.java:105) [framework-2.5.0.jar:2.5.0]
      1>    at org.opensearch.test.OpenSearchTestCase.<clinit>(OpenSearchTestCase.java:257) [framework-2.5.0.jar:2.5.0]
      1>    at java.lang.Class.forName0(Native Method) ~[?:?]
      1>    at java.lang.Class.forName(Class.java:398) [?:?]
      1>    at com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:623) [randomizedtesting-runner-2.7.1.jar:?]
      1> [2023-02-28T12:30:37,532][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[5651DD3B7CACCD87]]] cannot check if running as root because JNA is not available
      1> [2023-02-28T12:30:37,532][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[5651DD3B7CACCD87]]] cannot install system call filter because JNA is not available
      1> [2023-02-28T12:30:37,533][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[5651DD3B7CACCD87]]] cannot register console handler because JNA is not available
      1> [2023-02-28T12:30:37,534][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[5651DD3B7CACCD87]]] cannot getrlimit RLIMIT_NPROC because JNA is not available
      1> [2023-02-28T12:30:37,534][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[5651DD3B7CACCD87]]] cannot getrlimit RLIMIT_AS because JNA is not available
      1> [2023-02-28T12:30:37,534][WARN ][o.o.b.Natives            ] [[SUITE-KendraIntelligentRankingConfigurationTests-seed#[5651DD3B7CACCD87]]] cannot getrlimit RLIMIT_FSIZE because JNA is not available
      1> [2023-03-01T05:30:37,887][INFO ][o.o.s.r.t.k.c.KendraIntelligentRankingConfigurationTests] [testSerializeToXContentRoundtrip] before test
      1> [2023-03-01T05:30:38,107][INFO ][o.o.s.r.t.k.c.KendraIntelligentRankingConfigurationTests] [testSerializeToXContentRoundtrip] after test
      2> REPRODUCE WITH: ./gradlew ':test' --tests "org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip" -Dtests.seed=5651DD3B7CACCD87 -Dtests.security.manager=true -Dtests.locale=no -Dtests.timezone=Etc/GMT-9 -Druntime.java=11
      2> ParsingException[Failed to parse value [0] for Transformer order, must be >= 1]
            at __randomizedtesting.SeedInfo.seed([5651DD3B7CACCD87:4B50A82FEFA4F878]:0)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfiguration.parse(KendraIntelligentRankingConfiguration.java:84)
            at app//org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip(KendraIntelligentRankingConfigurationTests.java:36)
      2> NOTE: leaving temporary files on disk at: /Users/lnse/opensearch/search-processor/build/testrun/test/temp/org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests_5651DD3B7CACCD87-001
      2> NOTE: test params are: codec=Asserting(Lucene94): {}, docValues:{}, maxPointsInLeafNode=1670, maxMBSortInHeap=5.474299831647895, sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=no, timezone=Etc/GMT-9
      2> NOTE: Mac OS X 12.6.3 aarch64/Eclipse Adoptium 11.0.17 (64-bit)/cpus=10,threads=1,free=402628568,total=536870912
      2> NOTE: All tests run in this JVM: [KendraIntelligentRankingConfigurationTests]
    
    Tests with failures:
     - org.opensearch.search.relevance.transformer.kendraintelligentranking.configuration.KendraIntelligentRankingConfigurationTests.testSerializeToXContentRoundtrip
    
    1 test completed, 1 failed
    
    FAILURE: Build failed with an exception.
    
    * What went wrong:
    Execution failed for task ':test'.
    > There were failing tests. See the report at: file:///Users/lnse/opensearch/search-processor/build/reports/tests/test/index.html
    
    * Try:
    > Run with --stacktrace option to get the stack trace.
    > Run with --info or --debug option to get more log output.
    > Run with --scan to get full insights.
    
    * Get more help at https://help.gradle.org
    
    Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.
    
    You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
    
    See https://docs.gradle.org/7.4.2/userguide/command_line_interface.html#sec:command_line_warnings
    
    BUILD FAILED in 2s
    7 actionable tasks: 1 executed, 6 up-to-date
    

    [RFC] OpenSearch Search Relevance

    OpenSearch Search Relevance

    Overview

    In a search engine, relevance is the measure of how accurately a search result matches the search query. The higher the relevance, the higher the quality of the search results, and the more relevant the content users can get. This project aims to add plugins to OpenSearch that help users make their query results more accurate, contextual and relevant.

    Relevancy and OpenSearch

    Today, OpenSearch provides results in the order of scores generated by algorithms matching the indexed document contents to the input query. The documents that are more relevant to the query are ranked higher than the ones that are less relevant. These rankings may make sense to one set of users/applications and be quite irrelevant to others. For example, relevancy for an e-commerce company can mean more similar products in the same category as the search query, while for document search, relevancy may mean matching the query across the different topics/categories present in the document store. This is why we need more ways to customize the results and their rankings to fit the needs of each user/business.

    Relevancy Engineering

    Relevancy, as a problem, can't be solved at the search layer alone. Improving relevancy should be envisioned holistically, from understanding the ingested data and usage signals to extracting features, adding rewriters and improving algorithms. Below is the architecture of OpenSearch Relevancy Engineering.

    [Initially presented at Haystack 2022 by @anirudha, @JohannesDaniel and @ps48].

    Overall, relevancy engineering can be divided into two tiers:
    1. Ingestion Tier: This tier handles getting the data from different sources into OpenSearch. This data may include:
      1. Search Data:
        1. Core search data that needs to be queried by OpenSearch.
        2. Ingestion connectors to fetch the data from different data sources and sink it into OpenSearch indices.
      2. Search Management Data:
        1. Rules and judgements added to the rewriter indices.
      3. Observability Data:
        1. Customer usage signals added to OpenSearch; these signals may include granular details like anonymized customer queries, clicks, orders and session details.
    2. Search & Relevancy Platform Tier: This tier is responsible for analytics, rewriters, model improvements and search configuration.
      1. Search Analytics & Discovery:
        1. Dashboards for analytics, metrics for search tests, search UIs and query profiling.
      2. Querqy-based Query Rewriting:
        1. Rewriters to customize queries with synonyms, word breaks, spell corrections and query relaxation.
      3. Search Back Office:
        1. Management of business rules, ontologies and manual judgments.
      4. Relevancy Workbench:
        1. Improved algorithms via automated testing, relevance model training, personalization and custom re-rankers.

    Appendix

    OpenSearch is flexible with its plugin based architecture. Each plugin interface in OpenSearch provides different options to intercept a current workflow or extend the engine with new workflow options. One of the places where relevancy plugins should focus is the “Search Plugin interface”. This interface provides plugins the functionality to intercept/add/modify:

    1. Score functions
    2. Significance Heuristics
    3. Define/modify aggregators and their functions
    4. Highlighters
    5. Suggesters
    6. Queries
    7. Sorting orders
    8. Re-scorers

    More details on each interface can be found in the OpenSearch code base and in this blog.

    [RFC] OpenSearch Querqy Plugin Design Document

    OpenSearch Querqy Plugin Design Document

    Plugin Code: https://github.com/querqy/querqy-opensearch
    Querqy Project: https://querqy.org/

    1. What is Querqy?

    A query rewriting library. It helps you tune your search results for specific search terms. It also comes with some query-independent search relevance optimisations. You can plug in your own query rewriter. Available for Solr and Elasticsearch. Querqy is a query rewriting framework for Java-based search engines. It is probably best known for its rule-based query rewriting, which applies synonyms, query-dependent filters, and boosting and demoting of documents. This rule-based query rewriting is implemented in the ‘Common Rules Rewriter’, but Querqy’s capabilities go far beyond this rewriter.

    2. How does Querqy fit in the Search Relevancy Ecosystem?

    The Querqy plugin will be a crucial part of our search relevancy project. It is the starting point for users to dive into query relevancy and optimization. Users can then add Querqy rules to deal with synonyms, query relaxations (with number units, replacements) and word breaks. More details are in the Search Relevancy RFC.

    3. What are Re-writers in Querqy?

    Rewriters manipulate the query that was entered by the user. They can change the result set by adding alternative tokens, by removing tokens or by adding filters. They can also influence the ranking by adding boosting information. A single query can be rewritten by more than one rewriter. Together they form the rewrite chain.

    4. Different rewriters available:

    • Common Rules Rewriter: The Common Rules Rewriter uses configurable rules to manipulate the matching and ranking of search results depending on the input query. In e-commerce search it is a powerful tool for merchandisers to fine-tune search results, especially for high-traffic queries.
    notebook => (laptop or notebook) 
    
    • Replace Rewriter: The Replace Rewriter is considered to be a preprocessor for other rewriters. In contrast to the Common Rules Rewriter, its main scope is to handle different variants of terms rather than enhancing the query by business logic.
    notbook; noteboo => notebook 
    
    • Word Break Rewriter: The Word Break Rewriter deals with compound words in queries. It works in two directions: it will split compound words found in queries and it will create compound words from adjacent query tokens.
    iphone => (iphone or i phone) 
    
    • Number-Unit Rewriter: The Number-Unit Rewriter takes term combinations comprising a number and a unit and rewrites these combinations to filter and boost queries. The precondition for configuring this rewriting for a certain unit is a numeric field in the index containing standardized values for the respective unit.
    Laptop 15” => laptop and screen_size:[13.5 to 16.5] 
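
    Taken together, these rewriters are managed through the rewriter endpoint shown in section 5.1 below. As a minimal sketch (the rule syntax follows the upstream Querqy documentation; the rule content here is illustrative), a Common Rules rewriter combining a synonym, a boost and a filter could be configured like this:

    PUT /_plugins/_querqy/rewriter/common_rules
    {
      "class": "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
      "config": {
          "rules": "notebook =>\n  SYNONYM: laptop\n  UP(100): * category:computers\n  FILTER: * -category:accessories"
      }
    }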
    

    5. OpenSearch Querqy Plugin

    5.1 Sample Usage with Synonym rules

    // Index sample docs
    
    POST sample-index/_doc/1
    {
      "@timestamp": "2099-11-15T13:12:00",
      "message": "GET /search HTTP/1.1 200 1070000",
      "user": {
        "id": "John"
      }
    }
    
    POST  sample-index/_doc/2
    {
      "@timestamp": "2099-11-15T13:12:00",
      "message": "POST /search HTTP/1.1 200 1070000",
      "user": {
        "id": "David"
      }
    }
    
    POST  sample-index/_doc/3
    {
      "@timestamp": "2099-11-15T13:12:00",
      "message": "PUT /search HTTP/1.1 200 1070000",
      "user": {
        "id": "Ani"
      }
    }
    
    POST  sample-index/_doc/4
    {
      "@timestamp": "2099-11-15T13:12:00",
      "message": "DELETE HTTP/1.1 200 1070000",
      "user": {
        "id": "Josh"
      }
    }
    
    POST  sample-index/_doc/5
    {
      "@timestamp": "2099-11-15T13:12:00",
      "message": "GETREQUEST HTTP/1.1 200 1070000",
      "user": {
        "id": "Jake"
      }
    }
    
    POST  sample-index/_doc/6
    {
      "@timestamp": "2099-11-15T13:12:00",
      "message": "GET REQUEST HTTP/1.1 200 1070000",
      "user": {
        "id": "Paul"
      }
    }
    
    // Add a synonym rule to the Querqy common rules
    
    PUT  /_plugins/_querqy/rewriter/common_rules
    {
      "class": "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
      "config": {
          "rules" : "request =>\nSYNONYM: GET"
      }
    }
    
    // Query the sample index with the synonym rule
    // [Search for "request"]
    
    POST sample-index/_search
    {
      "query": {
        "querqy": {
          "matching_query": {
            "query": "request"
          },
          "query_fields": [ "message" ],
          "rewriters": ["common_rules"]
        }
      }
    }
    
    // Result of the above query
    // [Results contain docs containing "GET"]
    
     {
      "took" : 8,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 1.0054247,
        "hits" : [
          {
            "_index" : "sample-index",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0054247,
            "_source" : {
              "@timestamp" : "2099-11-15T13:12:00",
              "message" : "GET /search HTTP/1.1 200 1070000",
              "user" : {
                "id" : "John"
              }
            }
          },
          {
            "_index" : "sample-index",
            "_type" : "_doc",
            "_id" : "6",
            "_score" : 1.0054247,
            "_source" : {
              "@timestamp" : "2099-11-15T13:12:00",
              "message" : "GET REQUEST HTTP/1.1 200 1070000",
              "user" : {
                "id" : "Paul"
              }
            }
          }
        ]
      }
    }
    
    // DELETE querqy rule
    
    DELETE  /_plugins/_querqy/rewriter/common_rules
    

    5.2 Architecture

    The plugin would work similarly to Querqy’s ES plugin:
    querqy-plugin

    • By René Kriegler @renekrie, Querqy Co-author & Maintainer

    5.3 Configuration Options/Settings

    The Querqy plugin needs three configuration settings:

    1. Querqy Index number of replicas
    2. Rules cache expire time after write operation
    3. Rules cache expire time after read operation

    NOTE: More details on caching in section 5.5
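
    As a sketch, these could be exposed as node settings in opensearch.yml. The setting names below are assumptions carried over from the Querqy Elasticsearch plugin and may differ in the OpenSearch port:

    # Hypothetical opensearch.yml entries (names assumed from the Querqy ES plugin)
    querqy.index.num_replicas: 1
    querqy.caches.rewriter.expire_after_write: 40s
    querqy.caches.rewriter.expire_after_read: 60s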

    5.4 Index

    Querqy index → .opensearch-querqy

    All the Querqy configurations, like rules, synonyms, word breaks, etc., will be stored in a new plugin index called .opensearch-querqy. For each rewriter, a separate doc is created inside the index. Below is a sample of common_rules stored in the plugin.

    GET /.opensearch-querqy/_doc/common_rules
    
    {
      "_index" : ".opensearch-querqy",
      "_type" : "_doc",
      "_id" : "common_rules",
      "_version" : 3,
      "_seq_no" : 2,
      "_primary_term" : 3,
      "found" : true,
      "_source" : {
        "type" : "rewriter",
        "version" : 3,
        "class" : "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
        "config_v_003" : """{"rules":"request =>\nSYNONYM: POST"}"""
      }
    }
    

    5.5 Processed rule caching

    Users often have thousands of rules in their index. Processing these rules and converting them to object factories takes a considerable amount of time, so this processing cannot be done per request. Hence, the plugin resorts to caching the processed rules. The cache is built for each rewriter on the first search request made by any user. The stored cache is reloaded with each PUT request made to the Querqy plugin, and it is cleared when a particular rewriter is deleted with a DELETE request.
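
    Using the endpoints already shown in section 5.1, the cache lifecycle looks like this:

    // First search request against a rewriter builds its cache
    POST sample-index/_search
    {
      "query": {
        "querqy": {
          "matching_query": { "query": "request" },
          "query_fields": ["message"],
          "rewriters": ["common_rules"]
        }
      }
    }

    // Updating the rewriter reloads the cached, processed rules
    PUT /_plugins/_querqy/rewriter/common_rules
    {
      "class": "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
      "config": { "rules": "request =>\nSYNONYM: GET" }
    }

    // Deleting the rewriter clears its cache entry
    DELETE /_plugins/_querqy/rewriter/common_rules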

    5.6 Security & FGAC

    5.6.1 Access Control for querying over an index:

    • Users querying over an index, with or without Querqy, should have read/search access to that index. The security plugin will block any unauthorized access based on the permissions mapped to the user’s role.

    5.6.2 Access Control for plugin and its index:

    Plugin users:
    Admin user: creates and manages rules with read/update/delete access [all CRUD]
    Search user: searches with Querqy; does not view the rules but uses them [read only]

    • Querqy Index: This stores all the rules in .opensearch-querqy
    • Querqy Query Component: This is responsible for taking the rules and applying them to index search queries made by the user
    • Approach 1:
      • Only users with read access to the Querqy index can use the plugin with the search API.
      • Only users with write permissions on the Querqy index can add/configure rules for the plugin.
      • For users without access, we deny the request outright with a 403 error response.
    • Approach 2:
      • All users can use the plugin with the search API.
      • Only users with write permissions on the Querqy index can add/configure/delete rules for the plugin.
      • For users without access, we deny the request outright with a 403 error response.
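
    As a sketch of Approach 1, this access model could be expressed with the security plugin's REST API (the role names are illustrative; "crud" and "read" are built-in action groups):

    // Admin role: full CRUD on the Querqy index
    PUT /_plugins/_security/api/roles/querqy_admin
    {
      "index_permissions": [{
        "index_patterns": [".opensearch-querqy"],
        "allowed_actions": ["crud"]
      }]
    }

    // Search role: read-only access to the Querqy index
    PUT /_plugins/_security/api/roles/querqy_search
    {
      "index_permissions": [{
        "index_patterns": [".opensearch-querqy"],
        "allowed_actions": ["read"]
      }]
    }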

    More discussion about the issue here: querqy/querqy-opensearch#14

    6. References:

    1. Querqy project docs: https://querqy.org/
    2. Querqy project github: https://github.com/querqy
    3. Querqy ElasticSearch Plugin Details: https://github.com/querqy/querqy-elasticsearch
    4. Querqy 5: https://opensourceconnections.com/blog/2021/03/05/whats-new-in-querqy-the-query-preprocessor-for-solr-and-elasticsearch/

    7. Open Questions:

    1. How would CVEs/security bugs be handled for the plugin code and upstream querqy library?
    2. Should the plugin be multi-tenant?
    3. Should the querqy plugin index be an internal index?
    4. Should the plugin code be moved to OpenSearch project? What are the pros and cons?

    [PROPOSAL] Add Sean Li as a Maintainer

    Current maintainers voted to add Sean Li (@sejli) as a maintainer on this repository. @opensearch-project/admin, can you please add him to the maintainers in the collaborators list on the repo?

    Thanks,
    Mark

    [RFC] Search Relevancy - from A Schema Perspective

    Search Relevancy - A Schema Perspective

    This document presents the key concepts behind the following concerns:

    • Gain an understanding of leading industry search relevancy patterns
    • Review and understand how Google's modern search engine utilizes schema to optimize search
    • Discuss the required building blocks for a search relevancy framework

    Introduction

    Reviewing Google Search, the world's most popular and industry-leading search engine and a part of our everyday activities, it is apparent that search is conducted with many additional notions beyond the basic 'phrase' one is querying for.

    Let us review these notions to get a better understanding of how Google makes its search both relevant and accurate:

    Entities

    Google describes an entity, or named entity, as a single, well-defined thing or concept. Google Search works in three stages, and not all pages make it through each stage.

    • Crawling: Google downloads text, images, and videos from pages it found on the internet with automated programs called crawlers.
    • Indexing: Google analyzes the text, images, and video files on the page, and stores the information in the Google index, which is a large database.
    • Serving search results: When a user searches on Google, Google returns information that's relevant to the user's query.

    When a user enters a query, Google searches the index for matching pages and returns the results that are the highest quality and most relevant to the user.

    Relevancy is determined by hundreds of factors, which could include information such as the user's location, language, and device (desktop or phone).

    For example:

    _searching for "bicycle repair shops" would show different results to a user in Paris than it would to a user in Hong Kong._
    

    Entities compose our everyday world and also our spoken language: we talk about entities and think in terms of things and entities. To reflect this, Google (and many other companies) has turned to the knowledge base domain for assistance.

    Common Entities

    One of the core mechanisms they use is entity recognition. If Google understands that a query contains the same entities as one it has seen before, with little in the way of qualifiers, that is an indication that the result sets may be identical or highly similar.

    Standardization of the domain knowledge

    - https://schema.org/docs/schemas.html

    Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.

    Schema.org vocabulary can be used with many different encodings; these vocabularies cover entities, relationships between entities and actions. It can easily be extended through a well-documented extension model. Over 10 million sites use Schema.org to mark up their web pages and email messages.

    Many applications from Google, Microsoft, Pinterest, Yandex and others already use these vocabularies to power rich, extensible experiences.

    Schema.org was founded by Google, Microsoft, Yahoo and Yandex; its vocabularies are developed through an open community process.

    Schema.org is defined as two hierarchies:

    • One for textual property values
    • One for the things that they describe.

    The main schema.org hierarchy is a collection of types (or "classes"), each of which has one or more parent types.

    - https://www.wikidata.org/wiki/Wikidata:Database_reports/EntitySchema_directory

    Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.
    Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.

    The focus of Wikidata is structured data. Structured data refers to data that has been organized and is stored in a defined way, often with the intention to encode meaning and preserve the relationships between different data points within a dataset.

    The Wikidata repository consists mainly of items, each one having a label, a description and any number of aliases. Statements describe detailed characteristics of an Item and consist of a property and a value.

    RankBrain

    RankBrain is a system developed by Google by which it can better understand the likely user intent behind a search query.
    It was rolled out in the spring of 2015.

    RankBrain, at its core, can be thought of as a pre-screening system. When a query is entered into Google, the search algorithm matches the query against the user's intent in an effort to surface the best content, in the best format(s).

    Depending on the keyword, RankBrain will increase or decrease the importance of backlinks, content freshness, content length, domain authority etc.

    Then, it looks at how Google searchers interact with the new search results. If users like the new algorithm better, it stays. If not, RankBrain rolls back to the old algorithm.

    Before RankBrain, Google would scan pages to see if they contained the exact keyword someone searched for.
    Because these keywords were sometimes brand new, Google had no clue what the searcher actually wanted. So they guessed.

    For example, let’s say you searched for “the grey console developed by Sony”. Google would look for pages that contained the terms “grey”, “console”, “developed” and “Sony”, which often would not produce the results the customer intended.

    RankBrain works by matching never-before-seen keywords to keywords that Google has seen before.
    For example, Google RankBrain may have noticed that lots of people search for “grey console developed by Nintendo”.
    And they’ve learned that people who search for “grey console developed by Nintendo” want to see a set of results about gaming consoles.

    So when someone searches for “the grey console developed by Sony”, RankBrain brings up similar results to the keyword it already knows (“grey console developed by Nintendo”).

    So it shows results about consoles. In this case, the PlayStation.

    Ranking Search Results Based on Entity Metrics

    Ranking Search Results Based On Entity Metrics is the title of a Google patent granted in 2015.
    According to the patent, the ranking of entities for search involves considering a few factors:

    • Relatedness: Relatedness is determined based on the co-occurrence of entities. In practice, if two entities are frequently referenced together on the web (for example, “Donald Trump” and “President”) you get something like: president of the united states...
      This is because they occur together frequently enough, and on authoritative enough properties, to return as a single result.

    This same process connects other entities with the term when we pluralize it: 'presidents of the united states'.
    Each of these people is an entity. These entities are associated with the entity “President” and thus, when the query is plural, we see all of them at once.

    • Notability. Google uses this metric in the following fashion: the more valuable an entity is (determined by things including links, reviews, mentions, and relevance), and the lower the value of the category or topic it is competing in, the higher its notability; this is similar to the TF/IDF concept.

    • Contribution. Contribution is determined by external signals (e.g., links, reviews) and is basically a measure of an entity’s contribution to a topic. A review from a well-established and respected food critic would add more to this metric than Dave’s rant on Yelp about the price, because the critic's entity contribution in the space is higher.

    • Prizes. The prize metric is exactly what it sounds like: a measure of the various relevant prizes an entity (a person, for that matter) has received. These could be a Nobel Prize, an Oscar, or a U.S. Search Award. The type of prize determines its weight, and the larger the prize, the higher the value attached to the entity in question.

    For example, let's assume a search for [best actresses].

    Google will run the query through these processes in this order:

    1. Determine the relatedness of other entities and assign values.
    2. Determine the notability of those entities and assign a value to each.
    3. Determine the contribution metrics of these entities and assign a value.
    4. Determine any prizes awarded to the entities and assign a value.
    5. Determine the applicable weights each should have based on the query type.
    6. Determine a final score for each possible entity.

    This chain of evaluations allows a composite scoring system to give accurate results for a large variety of use cases.
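
    As an illustrative sketch (the weighting scheme below is hypothetical, not taken from the patent text), the final score in step 6 can be pictured as a query-dependent weighted sum:

        score(e) = w_rel(q)*rel(e) + w_not(q)*not(e) + w_con(q)*con(e) + w_prize(q)*prize(e)

    where each weight w(q) is chosen according to the query type determined in step 5.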

    Question Answering Using Entity References in Unstructured Data

    Since it is equally important to be able to search unstructured data with relevance and accuracy as well, Google uses the following techniques to address that:

    1. In a document containing unstructured content, an entity extraction process takes place to predict the assumed structures in that specific data.
    2. Each extracted entity is assigned a unique identifier. Determining the most likely entity being requested by a searcher can be done by establishing which entity appears the most times in the top K results.
    3. Consult an entity database that saves processing time for top results each time a query is run. That database exists to store entities and their relations.
    4. Entities are ranked by a quality score that may include freshness, previous selections by users, incoming links, and possibly outgoing links.
      • When a query for an entity is conducted, the relevance of other entities is determined for the result.
    5. Context inference, for multiple entities with the same name. For example, there is Philadelphia the city, the cream cheese, and the movie. A “where” question refers to the city, “who acted in” would be the movie, and “what goes well with” would be the food.

    With these techniques, Google’s capabilities around learning about entities and their relationships become significantly stronger.

    Related Entities

    The 'entities database' also stores relationships for each entity. These relationships are weighted by a formula based on former search requests and their commonality in the data itself.

    This entities/relations concept allows for:

    • The ability to calculate the probability of meeting the user’s likely intent with far greater accuracy.
    • The ability to predict and evolve an entity over time using past knowledge and well-defined schematic structures.

    Google's 'entities database', AKA the Knowledge Graph, is actually a massive database of public information; it collects information considered public domain along with the properties of each entity (people with birthdays, siblings, parents, occupations, etc.).

    Using the Entity Category association

    Determining the categories a query belongs to may include generating a score based on:

    • Whether the query includes terms associated with the category
    • How many of the entities inside the query associate with the category.

    When calculating the correct search categories that best describe the search intention, a scoring metric must be selected and revised according to search result feedback.

    Once categories are selected, the centroid of each category can be used as a reference for inferring entities and links that may be relevant to the results.

    Semantic Search

    BERT

    In 2019, BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google. This AI engine focuses on further understanding intent and conversational search context.

    BERT allows users to more easily find valuable and accurate information. Semantic search has also evolved in large part due to the rise of voice search.

    --- TODO --- add relevant references for usage of BERT


    Basic Building Blocks for Better Search Results

    In OpenSearch, we are in a constant effort to improve the relevancy and accuracy of search results.
    This is especially important because a vast number of the engine's users store unstructured data for a variety of domains and use cases.

    Our goal is to bring each and every search as close as possible to the customer's intent, and doing so will require addressing the concepts mentioned in this paper.

    Support and Maintain a high Level Schema Structure

    A modern search engine must be capable of maintaining the customer's (domain-related) schema structure. It makes no difference if the data was unstructured to begin with; everyone expects the search results to be the most relevant.

    Steps for adding schema-related search relevancy capabilities:

    1. Adding the Simple Schema to OpenSearch will allow explicitly preserving the customer's domain knowledge. It will also be very valuable if, during the data ingestion phase, the engine can already perform some structure-related tasks.

      • Enable code / template / index generation from the domain-specific schema: this will give customers an additional explicit capability to write domain-related code that helps them develop their applications seamlessly
    2. Using the industry-standard GraphQL API and SDL will allow our users to easily integrate with many open-source GraphQL-compliant tools and simplify the development process.

    3. Using a schema allows developers to build the ingestion process using domain-related vocabulary and easily define business rules in their language of choice.

    4. The PPL and SQL languages used in OpenSearch today will make significant use of the schematic knowledge of the data to simplify the construction and validation of queries.

    5. Creating a domain knowledge graph containing the relationships between the domain entities will allow calculating a score based on the schema relationships that appear in the search.

      • represent the schema, with its entities and relationships, inside a unified logical layer that can be queried and can evolve according to how the data/business evolves
      • allow construction of ready-made reports and dashboards that are described purely in the business vocabulary.
      • simplify machine-learning, graph-based techniques to organize the unstructured data and give better search predictions based on relations and entities.
    6. Explainability: In recent years, AI has increasingly found its way from research labs into applications: from the recommendation systems used by online retailers to image recognition on social networks, and, most recently, the much-discussed search engine.

    As we work with AI and rely on it for more and more decision-making processes that influence our daily actions, issues around user understanding of such processes have garnered attention.

    One of our main goals at OpenSearch is for search engineers and customers to understand and trust the search algorithm (which has AI incorporated inside), increasing user satisfaction and enabling transparency of AI-related decision-making.

    We believe that search explainability is a first-class citizen that deserves our full attention. Every search should have a clear and concise way of explaining itself to the user (whether that user is a search engineer or a customer).

    This is why our goal is to integrate the notion of explainability into every part of the search engine's decision-making steps.

    It is equally important to understand how and why search results were returned for a given query, in addition to evaluating the actual results.

    Search Pipelines - Process Requests, Responses, or Both

    As a builder of functions to transform requests and responses, I want to build a single processor that can transform requests, responses, or both, to keep business logic in one place and make it clear to my users what the processor does.
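
    For reference, a sketch of how this could look with the search pipelines API: a single pipeline can declare request processors, response processors, or both (filter_query and rename_field below are the built-in processors from the OpenSearch documentation):

    PUT /_search/pipeline/my_pipeline
    {
      "request_processors": [
        {
          "filter_query": {
            "description": "Only return publicly visible documents",
            "query": { "term": { "visibility": "public" } }
          }
        }
      ],
      "response_processors": [
        {
          "rename_field": {
            "field": "message",
            "target_field": "notification"
          }
        }
      ]
    }

    // Apply the pipeline at search time
    GET /my-index/_search?search_pipeline=my_pipeline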
