Comments (5)
Alright, something even fishier is going on. It seems the issue is not even related to uncommitted documents. I once again tried to index all 22k documents one by one, calling commit after every add_document. I tried it twice, once with the MergeWheneverPossiblePolicy:
#[derive(Debug, Clone)]
pub struct MergeWheneverPossiblePolicy {}

impl MergeWheneverPossiblePolicy {
    pub fn new() -> Self {
        MergeWheneverPossiblePolicy{}
    }

    pub fn as_box(self) -> Box<Self> {
        return Box::new(self);
    }
}

impl MergePolicy for MergeWheneverPossiblePolicy {
    fn compute_merge_candidates(&self, segment_metas: &[SegmentMeta]) -> Vec<MergeCandidate> {
        let segment_ids = segment_metas
            .iter()
            .map(|segment_meta| segment_meta.id())
            .collect::<Vec<SegmentId>>();
        if segment_ids.len() > 1 {
            return vec![MergeCandidate(segment_ids)];
        }
        vec![]
    }
}
and the second time with my own MarlinMergePolicy:
#[derive(Debug, Clone)]
pub struct MarlinMergePolicy {
    target_docs_per_segment: u32,
}

impl MarlinMergePolicy {
    pub fn new(target_docs_per_segment: u32) -> Self {
        MarlinMergePolicy{ target_docs_per_segment }
    }

    pub fn as_box(self) -> Box<Self> {
        return Box::new(self);
    }
}

impl MergePolicy for MarlinMergePolicy {
    fn compute_merge_candidates(&self, segment_metas: &[SegmentMeta]) -> Vec<MergeCandidate> {
        let mut merge_candidates: Vec<(u32, Vec<SegmentId>)> = Vec::new();
        'merge_segment_loop: for segment in segment_metas {
            let segment_id = segment.id();
            let num_docs = segment.num_docs();
            for group in merge_candidates.iter_mut() {
                if group.0 + num_docs < self.target_docs_per_segment {
                    group.1.push(segment_id);
                    continue 'merge_segment_loop;
                }
            }
            if merge_candidates.len() < 1 {
                merge_candidates.push((num_docs, vec![segment_id]));
            }
        }
        merge_candidates
            .into_iter()
            .map(|group| MergeCandidate(group.1))
            .collect::<Vec<MergeCandidate>>()
    }
}
When using the MergeWheneverPossiblePolicy, the segment_metas argument contains between 0 and 2 items. When 2 items are passed, a merge happens. It works.
But when I use the MarlinMergePolicy, the segment_metas argument always contains 0 or 1 item. Two items are never passed to the MarlinMergePolicy the way they are to the MergeWheneverPossiblePolicy.
I'd be inclined to believe there is something wrong with my code, but I don't really see how I could possibly have caused Tantivy to never pass more than 1 item to the compute_merge_candidates method.
It's quite perplexing and I have no idea what's going on... 😆
from tantivy.
I totally forgot about this. Can you check whether the merge operations are executed, whether they finish successfully, whether they get rejected when the merge finishes, etc.?
from tantivy.
@fulmicoton I will tell you in the morning (mine 😅). This has caught my attention. First thing tomorrow I will create a reproducible demo testing various scenarios and share it here.
As a side note: it might be worth not calling the MergePolicy's compute_merge_candidates at all and returning early from consider_merge_options if there are fewer than 2 mergeable segments. It seems like unnecessary overhead (a couple of extra allocations and function calls). True, it is a really tiny overhead, but still.
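The early exit suggested above could look roughly like the sketch below. It is a standalone model, not Tantivy's code: `Policy`, `consider_merge`, and `CountingPolicy` are hypothetical stand-ins, and segments are represented by plain `u32` ids. It also assumes that a policy never needs to see a single segment, which may not hold (for example, if a lone segment with deletes is worth rewriting).

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a merge policy: it receives the ids of the
/// mergeable segments and returns groups of ids to merge together.
trait Policy {
    fn candidates(&self, segments: &[u32]) -> Vec<Vec<u32>>;
}

/// Hypothetical stand-in for the caller of the policy. With fewer than two
/// mergeable segments it returns early and never invokes the policy.
fn consider_merge(policy: &dyn Policy, mergeable: &[u32]) -> Vec<Vec<u32>> {
    if mergeable.len() < 2 {
        return Vec::new();
    }
    policy.candidates(mergeable)
}

/// Demo policy that merges everything it is given and counts its invocations,
/// so the effect of the early return is observable.
struct CountingPolicy {
    calls: Cell<usize>,
}

impl Policy for CountingPolicy {
    fn candidates(&self, segments: &[u32]) -> Vec<Vec<u32>> {
        self.calls.set(self.calls.get() + 1);
        vec![segments.to_vec()]
    }
}
```

With this guard, passing a single segment id yields no candidates and leaves the call counter at zero, while passing two ids reaches the policy once.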
Btw. awesome work, Tantivy is amazing! I still can't believe (after merging the segments) how blazingly fast the queries are, even with IO over the network (consistently double-digit milliseconds), and how little memory the Lambda consumes. ❤️
from tantivy.
@fulmicoton Alright. Apologies for the late response. As far as I understand your question, I have not encountered any merging operation errors/rejections (per se), however:
First thing on Thursday, I wrote a program to test different conditions and started the execution. We then had an inter-company party and when I came back some 12h later, one of the program variations was still running. It seems an infinite loop was somehow created.
Reproducible demo
You can find the reproducible demo here: https://github.com/testuj-to/tantivy-merge-policy-demo
It contains different variations, stats from these runs, observations and some conclusions.
What I think is going on:
- When the merge policy is "heavier" (CPU/memory) above some threshold, a race condition occurs with some internal Tantivy procedure
- This race condition somehow causes compute_merge_candidates never to be passed more than 1 segment
- Waiting for merging threads in combination with this race condition causes the program to be stuck in a (possibly) infinite loop
from tantivy.
I think the issue here is that the policy tries to merge single segments even though there is nothing to do (no deletes). So they get marked as being in a merge, while the merge does nothing. In the next iteration they may still be in the merge pipeline and therefore not considered for merging.
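One way to avoid proposing single-segment merges is to drop any group that ends up with fewer than two segments. The sketch below models the grouping logic on its own, without the tantivy crate: `Seg` and `group_for_merge` are hypothetical stand-ins for `SegmentMeta` and the body of `compute_merge_candidates`, and plain `u32` ids replace `SegmentId`. Unlike the posted policy, it also keeps each group's running document count up to date and opens a new group when no existing one has room. Whether this resolves the behaviour reported in this thread has not been verified.

```rust
/// Hypothetical stand-in for tantivy's `SegmentMeta`: an id and a doc count.
#[derive(Debug, Clone, Copy)]
struct Seg {
    id: u32,
    num_docs: u32,
}

/// Greedily packs segments into groups whose total document count stays
/// below `target_docs`, then keeps only groups with at least two segments.
fn group_for_merge(segments: &[Seg], target_docs: u32) -> Vec<Vec<u32>> {
    // Each entry is (running doc count, segment ids in the group).
    let mut groups: Vec<(u32, Vec<u32>)> = Vec::new();
    for seg in segments {
        // First existing group that still has room for this segment, if any.
        let slot = groups
            .iter_mut()
            .find(|(total, _)| total.saturating_add(seg.num_docs) < target_docs);
        match slot {
            Some((total, ids)) => {
                // Keep the running total in sync with the group's contents.
                *total += seg.num_docs;
                ids.push(seg.id);
            }
            // No group has room: open a new one for this segment.
            None => groups.push((seg.num_docs, vec![seg.id])),
        }
    }
    groups
        .into_iter()
        .map(|(_, ids)| ids)
        // A group with a single segment is not a merge, so it is not proposed.
        .filter(|ids| ids.len() > 1)
        .collect()
}
```

In a real policy, each returned group would be wrapped in a `MergeCandidate`. With a single input segment the function returns no groups at all, so nothing gets marked as being merged.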
from tantivy.