Coder Social home page Coder Social logo

Comments (13)

expani avatar expani commented on September 22, 2024 1

@jainankitk Yes opened this PR in Lucene

from opensearch.

expani avatar expani commented on September 22, 2024

Thanks for filing this issue @ChapterSevenSeeds

I am able to reproduce the issue with the steps you mentioned. Will take a stab at it and update as I get time.

from opensearch.

getsaurabh02 avatar getsaurabh02 commented on September 22, 2024

Thanks @expani . Removing untriaged label.

from opensearch.

expani avatar expani commented on September 22, 2024

The match rule

"match": {
    "query": "a"
}

gets converted into an empty query ( No Match ) due to the custom analyser removing stop words.

The final IntervalsSource created for the IntervalQuery here is a DisjunctionIntervalsSource and looks like below

or( 
 MAXGAPS/30(ORDERED(b,sister)),
 MAXGAPS/30(ORDERED(no_match,sister)) 
)

OpenSearch represents the IntervalsIterator for no_match using NO_INTERVALS which represents un-positioned docId using NO_MORE_DOCS -> 2147483647 here

The intervals iterator for the rule

"match":  {
    "query": "sister"
}

is a based on PostingsEnum which represents an un-positioned docId using -1

When both these iterators are matched, they differ which leads to the error reported being thrown here

Lucene also represents an IntervalsIterator for no_match called NO_INTERVALS but it represents un-positioned docIds using -1 Reference This was fixed a few years back in Lucene by this PR

I made the same changes to the NO_INTERVALS iterator used by OpenSearch and the query mentioned above worked as expected.

The diff is

diff --git a/server/src/main/java/org/opensearch/index/query/IntervalBuilder.java b/server/src/main/java/org/opensearch/index/query/IntervalBui
lder.java
index 0e42e79f67d..fa21e6fbd82 100644
--- a/server/src/main/java/org/opensearch/index/query/IntervalBuilder.java
+++ b/server/src/main/java/org/opensearch/index/query/IntervalBuilder.java
@@ -266,6 +266,8 @@ public class IntervalBuilder {
         @Override
         public IntervalIterator intervals(String field, LeafReaderContext ctx) {
             return new IntervalIterator() {
+
+                boolean exhausted = false;
                 @Override
                 public int start() {
                     return NO_MORE_INTERVALS;
@@ -293,16 +295,18 @@ public class IntervalBuilder {

                 @Override
                 public int docID() {
-                    return NO_MORE_DOCS;
+                    return exhausted ? NO_MORE_DOCS : -1;
                 }

                 @Override
                 public int nextDoc() {
+                    exhausted = true;
                     return NO_MORE_DOCS;
                 }

                 @Override
                 public int advance(int target) {
+                    exhausted = true;
                     return NO_MORE_DOCS;

We should make the NO_INTERVALS used by Lucene public and use the same in OpenSearch instead of creating a copy to avoid any such issues in future.

from opensearch.

jainankitk avatar jainankitk commented on September 22, 2024

@expani - Thank you root causing this issue. Completely agree on reusing the NO_INTERVALS iterator in Lucene instead of duplicating the code in OpenSearch. Have you created issue/PR with lucene for addressing this?

from opensearch.

jainankitk avatar jainankitk commented on September 22, 2024

@jainankitk Yes opened this apache/lucene#13385 in Lucene

Thank you for the lucene PR!

from opensearch.

uschindler avatar uschindler commented on September 22, 2024

This problem is often seen with patent search. I opened apache/lucene#13388

There are more issues, but making that constant publicly available is wrong. There should be a method to create an empty interval with a reason.

from opensearch.

uschindler avatar uschindler commented on September 22, 2024

P.S.: I tend to revert above PR. It does not fit the Lucene API patterns.

from opensearch.

uschindler avatar uschindler commented on September 22, 2024

P.S.: The workaround that can be used already in OpenSearch code is the following factory method: Intervals.atLeast(1)

This requires at least one match for an empty list of intervals, so it rewrites to a NoMatchIntervalsSource which is a duplicate of above constant. The given issue will also merge those implementations.

from opensearch.

expani avatar expani commented on September 22, 2024

Agree on putting this behind an API.

Returning a null iterator might not be properly checked at all places of usage so having NO_INTERVALS iterator returning NO_MORE_DOCS seems more safe of a change.

from opensearch.

uschindler avatar uschindler commented on September 22, 2024

I made a comment because of that in the PR. We need more tests, I agree.

Up to now this atLeast(1) has always worked for me.

Null is always used in lucene for iterators that fo not match anything.

from opensearch.

uschindler avatar uschindler commented on September 22, 2024

See here for DoxIdSetIterator and ScorerSupplier: https://github.com/apache/lucene/blob/22d50be2eab84c9a75ea65f55fcc4356724faba1/lucene/core/src/java/org/apache/lucene/search/MatchNoDocsQuery.java#L47

from opensearch.

expani avatar expani commented on September 22, 2024

Yes returning null worked for this case in OpenSearch as well, but NO_INTERVAL is being used extensively across Lucene and not all the usage seem to be covered by UTs.

I added a comment to support both style of iterators null and NO_INTERVAL in the PR to make sure the changes are safe.

from opensearch.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.