Comments (5)

patrik-simunic-cz commented on July 28, 2024

Alright, something even fishier is going on. It seems the issue is not even related to uncommitted documents. I once again tried to index all 22k documents one by one, calling commit after every add_document. I tried it twice, once with the MergeWheneverPossiblePolicy:

use tantivy::merge_policy::{MergeCandidate, MergePolicy};
use tantivy::{SegmentId, SegmentMeta};

#[derive(Debug, Clone)]
pub struct MergeWheneverPossiblePolicy {}

impl MergeWheneverPossiblePolicy {
    pub fn new() -> Self {
        MergeWheneverPossiblePolicy {}
    }

    pub fn as_box(self) -> Box<Self> {
        Box::new(self)
    }
}

impl MergePolicy for MergeWheneverPossiblePolicy {
    fn compute_merge_candidates(&self, segment_metas: &[SegmentMeta]) -> Vec<MergeCandidate> {
        let segment_ids = segment_metas
            .iter()
            .map(|segment_meta| segment_meta.id())
            .collect::<Vec<SegmentId>>();

        if segment_ids.len() > 1 {
            return vec![MergeCandidate(segment_ids)];
        }

        vec![]
    }
}

and a second time with my own MarlinMergePolicy:

#[derive(Debug, Clone)]
pub struct MarlinMergePolicy {
    target_docs_per_segment: u32,
}

impl MarlinMergePolicy {
    pub fn new(target_docs_per_segment: u32) -> Self {
        MarlinMergePolicy{ target_docs_per_segment }
    }

    pub fn as_box(self) -> Box<Self> {
        Box::new(self)
    }
}

impl MergePolicy for MarlinMergePolicy {
    fn compute_merge_candidates(&self, segment_metas: &[SegmentMeta]) -> Vec<MergeCandidate> {
        let mut merge_candidates: Vec<(u32, Vec<SegmentId>)> = Vec::new();

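        'merge_segment_loop: for segment in segment_metas {
            let segment_id = segment.id();
            let num_docs = segment.num_docs();

            // Try to add the segment to an existing group that still has room.
            // Note: group.0 (the group's doc count) is not increased here.
            for group in merge_candidates.iter_mut() {
                if group.0 + num_docs < self.target_docs_per_segment {
                    group.1.push(segment_id);
                    continue 'merge_segment_loop;
                }
            }

            // A new group is only ever opened when none exists yet.
            if merge_candidates.len() < 1 {
                merge_candidates.push((num_docs, vec![segment_id]));
            }
        }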

        merge_candidates
            .into_iter()
            .map(|group| MergeCandidate(group.1))
            .collect::<Vec<MergeCandidate>>()
    }
}

When using the MergeWheneverPossiblePolicy, the segment_metas argument contains between 0 and 2 items. When 2 items are passed, a merge happens. It works.

But when I use the MarlinMergePolicy, the segment_metas argument always contains 0 or 1 items. There are never 2 items passed down to the MarlinMergePolicy as there are to the MergeWheneverPossiblePolicy.

I'd be inclined to believe there is something wrong with my code, but I don't really see how I could possibly have caused Tantivy to never pass more than 1 item to the compute_merge_candidates method.

It's quite perplexing and I have no idea what's going on... 😆

from tantivy.

fulmicoton commented on July 28, 2024

I totally forgot about this. Can you check whether the merge operations are executed, finish successfully, get rejected when the merge finishes, etc.?

patrik-simunic-cz commented on July 28, 2024

@fulmicoton I will tell you in the morning (mine 😅). This has caught my attention. First thing tomorrow I will create a reproducible demo testing various scenarios and share it here.

As a side note: it might be worth not calling the MergePolicy's compute_merge_candidates at all and exiting consider_merge_options early if there are fewer than 2 mergeable segments. It seems like unnecessary overhead (a couple of extra allocations and function calls). True, it is a really tiny overhead, but still.
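
The early exit suggested here can also be approximated on the caller's side, without touching Tantivy. The following is only a rough sketch: `Policy`, `SkipTrivial` and `MergeAll` are hypothetical stand-ins written for this illustration, not Tantivy types, and segments are reduced to plain `u32` ids. Note that such a guard would also skip a lone segment that does have deletes, which may not be what you want.

```rust
/// Hypothetical stand-in for a merge policy: maps segment ids to merge candidates.
trait Policy {
    fn candidates(&self, segment_ids: &[u32]) -> Vec<Vec<u32>>;
}

/// Wraps another policy and skips calling it when fewer than two
/// segments are available, since nothing could be merged anyway.
struct SkipTrivial<P: Policy>(P);

impl<P: Policy> Policy for SkipTrivial<P> {
    fn candidates(&self, segment_ids: &[u32]) -> Vec<Vec<u32>> {
        if segment_ids.len() < 2 {
            // Early exit: the inner policy is never invoked.
            return Vec::new();
        }
        self.0.candidates(segment_ids)
    }
}

/// Toy policy that proposes merging everything it is given,
/// including a single segment on its own.
struct MergeAll;

impl Policy for MergeAll {
    fn candidates(&self, segment_ids: &[u32]) -> Vec<Vec<u32>> {
        vec![segment_ids.to_vec()]
    }
}
```

With the wrapper, `SkipTrivial(MergeAll).candidates(&[7])` returns no candidates, while the unwrapped `MergeAll` would propose merging segment 7 with nothing.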

Btw. awesome work, Tantivy is amazing! I still can't believe (after merging the segments) how blazingly fast the queries are even with IO over the network (consistently double-digit ms) and how little memory the Lambda consumes. ❤️

patrik-simunic-cz commented on July 28, 2024

@fulmicoton Alright. Apologies for the late response. As far as I understand your question, I have not encountered any merge operation errors or rejections (per se), however:

First thing on Thursday, I wrote a program to test different conditions and started the execution. We then had an inter-company party and when I came back some 12h later, one of the program variations was still running. It seems an infinite loop was somehow created.

Reproducible demo

You can find the reproducible demo here: https://github.com/testuj-to/tantivy-merge-policy-demo
It contains different variations, stats from these runs, observations and some conclusions.

What I think is going on:

  • When the merge policy is "heavier" (CPU/memory) than some threshold, a race condition takes place with some internal Tantivy procedure
  • This race condition somehow causes compute_merge_candidates never to be passed more than 1 segment
  • Waiting for merging threads in combination with this race condition causes the program to be stuck in a (possibly) infinite loop

PSeitz commented on July 28, 2024

I think the issue here is that single segments are being proposed for merging even though there is nothing to do (no deletes). They get marked as in merge, while the merge does nothing. In the next iteration they may still be in the merge pipeline and so are not considered for merging.
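
To make this concrete, here is a minimal sketch of grouping logic that never proposes a single segment on its own. It is not Tantivy code: `SegmentInfo` and `group_for_merge` are hypothetical stand-ins for `SegmentMeta` and the body of `compute_merge_candidates`, and segment ids are plain `u32` values. Compared to the MarlinMergePolicy posted earlier, it keeps the running doc count up to date, opens a new group whenever no existing one has room, and drops groups with fewer than two segments.

```rust
/// Hypothetical stand-in for tantivy's SegmentMeta, reduced to what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
struct SegmentInfo {
    id: u32,
    num_docs: u32,
}

/// Greedily packs segments into groups whose total doc count stays below
/// `target_docs`, then keeps only groups that contain at least two segments.
fn group_for_merge(segments: &[SegmentInfo], target_docs: u32) -> Vec<Vec<u32>> {
    // Each entry is (running doc count, segment ids in the group).
    let mut groups: Vec<(u32, Vec<u32>)> = Vec::new();

    'segments: for segment in segments {
        for group in groups.iter_mut() {
            if group.0.saturating_add(segment.num_docs) < target_docs {
                // Keep the running total in sync with the group's contents.
                group.0 += segment.num_docs;
                group.1.push(segment.id);
                continue 'segments;
            }
        }
        // No existing group had room, so open a new one.
        groups.push((segment.num_docs, vec![segment.id]));
    }

    groups
        .into_iter()
        .map(|(_, ids)| ids)
        // A lone segment would be a no-op merge, so it is not proposed.
        .filter(|ids| ids.len() >= 2)
        .collect()
}
```

Under these assumptions, a call with a single segment yields no candidates at all, so that segment would not be marked as in merge and would stay available for the next round.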
