
[RFC] Context Aware Segments · opensearch · OPEN · 18 comments

RS146BIJAY commented on September 25, 2024
[RFC] Context Aware Segments

from opensearch.

Comments (18)

RS146BIJAY commented on September 25, 2024

For HTTP logs workload and grouping by status code

Bulk Indexing Client = 8

Grouping criteria

In this setup, we implemented grouping based on status code for the http_logs workload. Here we separated logs of successful requests from those with error (4xx) and fault (5xx) status codes. Segments are initially flushed at the per-status-code level (the sub-grouping criterion). The Context Aware merge policy will start merging error and fault status segments together when their aggregated size reaches a 2 GB threshold.
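The grouping and merge-trigger logic described above can be sketched roughly as follows (class and method names are hypothetical illustrations, not the actual POC code):

```java
import java.util.List;

public class StatusCodeGrouping {
    static final long MERGE_THRESHOLD_BYTES = 2L * 1024 * 1024 * 1024; // 2 GB

    // Top-level grouping: successful requests vs error/fault (4xx/5xx) requests.
    public static String group(int statusCode) {
        return statusCode < 400 ? "success" : "error_fault";
    }

    // Sub-grouping: segments are initially flushed at the per-status-code level.
    public static String subGroup(int statusCode) {
        return Integer.toString(statusCode);
    }

    // The context-aware merge policy starts merging error/fault segments
    // together once their aggregated size reaches the 2 GB threshold.
    public static boolean shouldMerge(List<Long> errorFaultSegmentSizes) {
        long total = errorFaultSegmentSizes.stream().mapToLong(Long::longValue).sum();
        return total >= MERGE_THRESHOLD_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(group(200) + " / " + group(404) + " / " + group(503));
    }
}
```

The two-level split mirrors the setup above: flush keys come from `subGroup`, while the merge policy only compares aggregate sizes within a top-level `group`.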

Performance Benchmark Metrics

Indexing

Size of index

Since status-based grouping alone does not order data as well as the previous case of day-based grouping with hourly sub-grouping, we observe only around a 3% improvement in index size with the DataAware merge policy.

Latency

Indexing latency remains almost the same as with the Tiered and LogByteSize merge policies.

Segment Count

Since there are only two groups, successful (2xx) and failed (4xx and 5xx) status segments, the number of segments is almost the same as with the Tiered and LogByteSize merge policies.

Search

With the status code based grouping strategy, we see a considerable improvement in the performance of range, aggregation, and sort queries (ordered by timestamp) for fault/error logs within a specific time range. This efficiency is attributed to fewer iterations being needed to find fault/error logs, as they are spread across fewer segments compared to the Tiered and LogByteSize merge policies.

| Query type | % improvement over Tiered merge policy | % improvement over LogByteSize merge policy |
| --- | --- | --- |
| range (400 status code logs within a given time range) | 60 | 73.3 |
| asc_sort_timestamp (400 status code logs within a given time range sorted by timestamp) | 65 | 64.8 |
| asc_sort_with_after_timestamp (500 status code logs after a timestamp and sorted by timestamp) | 47.5 | 47.2 |
| desc_sort_timestamp (400 status code logs within a given time range sorted in desc order by timestamp) | 43.9 | 44.1 |
| desc_sort_with_after_timestamp (500 status code logs after a timestamp and sorted in desc order by timestamp) | 44.7 | 47.5 |
| hourly aggregations (day histogram of logs with status codes 400, 403 and 404 with min and max sizes of request) | 45 | 46.6 |

With DataAware, we iterate fewer times (documents + segments) across segments to locate error/fault logs:

| Query type | Iteration count in Tiered merge policy | Iteration count in LogByteSize merge policy |
| --- | --- | --- |
| range (400 status code logs within a given time range) | 3.2x more iterations than DataAware | 3.8x more iterations than DataAware |
| asc_sort_timestamp (400 status code logs within a given time range sorted by timestamp) | 15.3x more iterations than DataAware | 15.6x more iterations than DataAware |
| asc_sort_with_after_timestamp (500 status code logs after a timestamp and sorted by timestamp) | 15.3x more iterations than DataAware | 15.6x more iterations than DataAware |
| desc_sort_timestamp (400 status code logs within a given time range sorted in desc order by timestamp) | 15.3x more iterations than DataAware | 15.6x more iterations than DataAware |
| desc_sort_with_after_timestamp (500 status code logs after a timestamp and sorted in desc order by timestamp) | 15.3x more iterations than DataAware | 15.6x more iterations than DataAware |
| hourly aggregations (day histogram of logs with status codes 400, 403 and 404 with min and max sizes of request) | 40x more iterations than DataAware | 40x more iterations than DataAware |


RS146BIJAY commented on September 25, 2024

For HTTP logs workload and daily grouping

Bulk Indexing Client = 8

Grouping criteria

In this scenario, we use day-based grouping for http_logs with hour as the sub-grouping criterion. The Context Aware merge policy will start merging hourly segments into daily segments once the total size of segments for a day surpasses a threshold of 300 MB.
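A rough sketch of the day/hour grouping keys and the 300 MB elevation check (names and structure are illustrative assumptions, not the actual merge policy code):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Map;

public class DayHourGrouping {
    static final long DAY_THRESHOLD_BYTES = 300L * 1024 * 1024; // 300 MB

    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH").withZone(ZoneOffset.UTC);

    // Grouping criterion: the day a log line belongs to.
    public static String dayGroup(Instant timestamp) {
        return DAY.format(timestamp);
    }

    // Sub-grouping criterion: segments are flushed at the hour level.
    public static String hourSubGroup(Instant timestamp) {
        return HOUR.format(timestamp);
    }

    // Hourly segments are merged into a daily segment only once the
    // aggregate size for that day crosses the 300 MB threshold.
    public static boolean shouldElevate(Map<String, Long> daySizes, String day, long newSegmentBytes) {
        long total = daySizes.merge(day, newSegmentBytes, Long::sum);
        return total > DAY_THRESHOLD_BYTES;
    }
}
```

Until a day's aggregate crosses the threshold, the policy keeps merging within hourly sub-groups, which is also why the intermediate segment count stays higher (see the merge-size discussion below).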

Performance Benchmark Metrics

Indexing

Size of index

Compression works more effectively for context aware segments, resulting in an improvement of approximately 19% in index size over the Tiered and LogByteSize merge policies. This happens because data is nearly ordered within Context Aware segments.

Latency

There is a minor improvement (3% - 6%) in indexing latency when we group hourly logs together at flush time and merge them into daily segments in increasing order of hour inside the DataAware merge policy. This is because hourly segments are merged into daily segments only when the total size of segments for that particular day exceeds the 300 MB limit.

Segment Count

With the DataAware merge policy (day-based grouping) and hourly flushing, the segment count tends to increase (about 4 to 5 times) for the http_logs workload. This increase is attributed to the non-uniform distribution of data across days in the http_logs workload. However, this scenario is unlikely to occur in reality, where logs would be more uniformly distributed across days.

Search

For context aware segments, we see a significant improvement in performance for both range and ascending sort queries. This is because data is sorted in near-ascending order within Context Aware segments. On the flip side, the efficiency of descending sort queries regresses with this method. To fix this regression in desc sort queries, a potential solution is to traverse segment data in reverse order (we will create a separate Lucene issue for this and link it here).

| Query type | % improvement over Tiered merge policy | % improvement over LogByteSize merge policy |
| --- | --- | --- |
| range | 71.3 | 68.16 |
| asc_sort_timestamp | 36.8 | 61.12 |
| asc_sort_with_after_timestamp | 53.8 | 35.89 |
| desc_sort_timestamp | -1997.03 | -848.79 |
| desc_sort_with_after_timestamp | -1116.9 | -329.86 |
| hourly aggregations | -15.3 | -18.3 |

Segment merges

The intermediate segment count is higher with the DataAware merge policy than with the Tiered and LogByteSize merge policies. The merge-size metrics below (order: Tiered, LogByteSize, DataAware from top to bottom) show that while the Tiered merge policy does merge larger segments during indexing, the DataAware merge policy initially merges smaller hourly segments (keeping the segment count high); only once a day's segment size exceeds 300 MB does the DataAware merge policy shift to merging larger segments.

(merge-size chart)

CPU, JVM and GC count

CPU and JVM usage remain the same for all merge policies during the entire indexing operation. The GC count for the DataAware merge policy is slightly higher due to the fact that we allocate different DWPTs for logs with different timestamps (hours) in DocumentWriter.

(order is Tiered, LogByteSize and DataAware from top to bottom)

(CPU/JVM/GC charts)


RS146BIJAY commented on September 25, 2024

Selecting the appropriate grouping criteria is crucial. Groups that are too small can increase the number of DWPTs required in DocumentWriter, regressing indexing performance; this can also lead to many small segments. Conversely, selecting too large a group can regress query performance. Implementing a guardrail around the grouping criteria can prevent excessively small or large groupings.

@reta Yes. We will be implementing proper guardrails on grouping criteria to not allow too small or too large groups.
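One possible shape for such a guardrail, as a hedged sketch (the limits and names here are hypothetical placeholders, not the planned implementation):

```java
public class GroupingGuardrail {
    // Illustrative limits: cap the number of active groups (each group needs
    // its own DWPTs and produces its own segments) and the size of any single
    // group (an oversized group no longer helps query pruning).
    static final int MAX_GROUPS = 64;
    static final long MAX_GROUP_BYTES = 5L * 1024 * 1024 * 1024; // 5 GB

    public static boolean withinLimits(int activeGroupCount, long largestGroupBytes) {
        return activeGroupCount <= MAX_GROUPS && largestGroupBytes <= MAX_GROUP_BYTES;
    }
}
```

A cap on the active group count would also bound DWPT growth, which addresses the OOM concern raised for high-cardinality grouping criteria.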


Bukhtawar commented on September 25, 2024

Would appreciate any initial feedback from @shwetathareja, @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, @reta, @msfroh and @backslasht as well, so tagging more folks for visibility. Thanks!

Thanks @RS146BIJAY, the numbers look promising, but don't we violate the DWPT design by having a man-in-the-middle (grouping) here? To elaborate: my understanding is that DW uses DWPT to associate ingest threads with writer threads, with the intention of eliminating synchronization (this is also highlighted in the DW javadocs). With grouping, this won't be the case anymore - the grouping layer adds an indirection where multiple threads would be routed to the same DWPT. Is that right, or is my understanding incorrect? Thank you.

I had a similar concern @reta, but the lock-free model of DWPT can still be preserved by creating just enough DWPTs to write concurrently, minimising contention, or by creating more instances on demand if the lock is already acquired. So yes, there needs to be some coordination, but it shouldn't directly require synchronisation.


RS146BIJAY commented on September 25, 2024

With grouping, this won't be the case anymore - the grouping layer adds an indirection where multiple threads would be routed to the same DWPT. Is that right, or is my understanding incorrect?

Not exactly. To add to what @Bukhtawar mentioned: before this change, the DWPT pool maintains a list of free DWPTs on which no lock is held and no write is happening. In case all DWPTs are locked, a new DWPT instance is created and the write happens on it.

With our change, this free list is maintained at the individual group level inside the DWPT pool. If there is no free DWPT for a group, a new DWPT instance for that group will be created and the write will happen on it. So the active number of DWPTs in our case will be higher than before, but each thread will still be routed to a different DWPT.
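The per-group free list can be modelled roughly like this (a simplified, self-contained illustration; the real logic lives inside Lucene's DWPT pool and is considerably more involved):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class GroupedWriterPool {
    // Stand-in for a DWPT: a writer bound to one group.
    public static class Writer {
        final String group;
        Writer(String group) { this.group = group; }
    }

    private final Map<String, Deque<Writer>> freeLists = new HashMap<>();
    public int created = 0;

    // Checkout: reuse a free (unlocked) writer for this group if one exists,
    // otherwise create a new writer instance for the group.
    public synchronized Writer checkout(String group) {
        Deque<Writer> free = freeLists.computeIfAbsent(group, g -> new ArrayDeque<>());
        Writer w = free.poll();
        if (w == null) {
            w = new Writer(group);
            created++;
        }
        return w;
    }

    // Release: return the writer to its group's free list.
    public synchronized void release(Writer w) {
        freeLists.get(w.group).push(w);
    }
}
```

Each thread still gets an exclusive writer (no sharing of a DWPT between concurrent threads); the group key only decides which free list a thread draws from, so coordination is a brief checkout/release rather than per-document synchronization.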


shwetathareja commented on September 25, 2024

The details around the exact structure of the predicate will be published in a separate GitHub issue.

@RS146BIJAY: We should also explore if the predicate can be deduced from the user's frequent queries.


reta commented on September 25, 2024

So the active number of DWPTs in our case will be higher than before, but each thread will still be routed to a different DWPT.

@RS146BIJAY so it becomes a function of the group? And in that case, if the cardinality of the group is high, we could easily OOM the JVM, right? We need the guardrails in place.


josefschiefer27 commented on September 25, 2024

This is a great strategy for reducing data volume with sparse data while also enhancing query performance. In the case of highly sparse data, such as the http-error-codes example, very sparse fields could even be treated as constants within the metadata, eliminating the need to store any field values (e.g. store only the '200' metadata for segments that contain only the very common 200 status codes).


mgodwan commented on September 25, 2024

@RS146BIJAY Could you share benchmarks with the POCs you have linked?


reta commented on September 25, 2024

I think I have too many questions, but the one that probably stands out right now is regarding the grouping predicate. This is essentially an arbitrary piece of code, right? What means would a user use to supply this grouping predicate (arbitrary code) for a particular index / indices + merge policy to OpenSearch?


RS146BIJAY commented on September 25, 2024

This is essentially an arbitrary piece of code, right?

@reta The grouping predicate will be configurable for an index and won't be standalone code (the above POC is for day-based grouping criteria with threshold = 300 MB). The grouping criteria passed by the user will determine which sets of data are grouped together. The details around the exact structure of the predicate will be published in a separate GitHub issue.


peternied commented on September 25, 2024

[Triage]
@RS146BIJAY Thanks for creating this RFC


Bukhtawar commented on September 25, 2024

Let's link this with #12683


RS146BIJAY commented on September 25, 2024

Would appreciate any initial feedback from @shwetathareja, @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, @reta, @msfroh and @backslasht as well, so tagging more folks for visibility. Thanks!


RS146BIJAY commented on September 25, 2024

For HTTP logs workload and daily grouping

Bulk Indexing Client = 1
Grouping criteria = day-based grouping with a threshold of 300 MB (the point at which hourly segments are elevated to day-level grouping in the DataAware merge policy)

Performance Benchmark Metrics

Indexing

Size of index

Since the data is completely sorted by timestamp, there is no considerable improvement in index size with the DataAware merge policy.

Latency

Indexing latency remained almost the same with the DataAware merge policy as with the Tiered and LogByteSize merge policies.

Segment Count

Similar to the case when bulk_indexing_client = 8, with the DataAware merge policy (day-based grouping) and hourly flushing, the segment count tends to increase (about 4 to 5 times) for the http_logs workload.

Search

Since logs are ordered by increasing timestamp, there is no significant difference in the performance of the LogByteSize and DataAware merge policies for this scenario. The reason is that when data is organised by timestamp, the BKD skipping mechanism ensures equally optimal performance for both time range and sort queries. Consequently, the DataAware merge policy's approach of grouping more relevant data together does not influence performance, as BKD would skip the same number of documents regardless. (The LogByteSize merge policy performs slightly better than the DataAware merge policy, but with some tuning we can match this performance.)

For asc sort and desc sort queries with search-after timestamp, DataAware performs better than the LogByteSize merge policy. This is because of a bug in the skipping logic while scoring documents inside Lucene; the improvement is not specifically related to DataAware segments.

| Query type | % improvement over Tiered merge policy | % improvement over LogByteSize merge policy |
| --- | --- | --- |
| range (logs within a given time range with 1000 response size) | 4.2 | -3.3 |
| asc_sort_timestamp (logs within a given time range sorted in asc order with 1000 response size) | 53.5 | -19.9 |
| asc_sort_with_after_timestamp (logs after a timestamp, sorted in asc order with 1000 response size) | 99.4 | 97.9 |
| desc_sort_timestamp (logs within a given time range sorted in desc order with 1000 response size) | -6.1 | -7.1 |
| desc_sort_with_after_timestamp (logs after a timestamp, sorted in desc order with 1000 response size) | -7.8 | 78.8 |
| hourly aggregations | -3.9 | -2.7 |


reta commented on September 25, 2024

Would appreciate any initial feedback from @shwetathareja, @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, @reta, @msfroh and @backslasht as well, so tagging more folks for visibility. Thanks!

Thanks @RS146BIJAY, the numbers look promising, but don't we violate the DWPT design by having a man-in-the-middle (grouping) here? To elaborate: my understanding is that DW uses DWPT to associate ingest threads with writer threads, with the intention of eliminating synchronization (this is also highlighted in the DW javadocs). With grouping, this won't be the case anymore - the grouping layer adds an indirection where multiple threads would be routed to the same DWPT. Is that right, or is my understanding incorrect? Thank you.


RS146BIJAY commented on September 25, 2024

Adding a few more scenarios.

For HTTP logs workload and status grouping (bulk indexing client = 1)

Bulk Indexing Client = 1

Grouping criteria

In this setup, we implemented grouping based on status code for the http_logs workload with bulk_indexing_client = 1. Here we separated logs of successful requests from those with error (4xx) and fault (5xx) status codes. Segments are initially flushed at the per-status-code level (the sub-grouping criterion). The Context Aware merge policy will start merging error and fault status segments together when their aggregated size reaches a 2 GB threshold.

Performance Benchmark Metrics

Indexing

Size of index

Since the data is completely sorted by timestamp, there is no considerable improvement in index size with the DataAware merge policy.

Latency

Indexing latency remained almost the same with the DataAware merge policy as with the Tiered and LogByteSize merge policies.

Segment Count

Since there are only two groups, successful (2xx) and failed (4xx and 5xx) status segments, the number of segments is almost the same as with the Tiered and LogByteSize merge policies.

Search

Similar to the scenario when bulk_indexing_client > 1, with a single bulk indexing client and the status code based grouping strategy, we see considerable improvements in the performance of range, aggregation, and sort queries (ordered by timestamp) for fault/error logs within a specific time range. This efficiency is again attributed to fewer iterations being needed to find fault/error logs, as they are spread across fewer segments compared to the Tiered and LogByteSize merge policies.

| Query type | % improvement over Tiered merge policy | % improvement over LogByteSize merge policy |
| --- | --- | --- |
| range (logs within a given time range with status code 500 and 1000 response size) | 73.7 | 50 |
| asc_sort_timestamp (logs within a given time range with status code 500, sorted in asc order with 1000 response size) | 73.9 | 50 |
| asc_sort_with_after_timestamp (logs after a timestamp with status code 500, sorted in asc order with 1000 response size) | 87.5 | 82.3 |
| desc_sort_timestamp (logs within a given time range with status code 500, sorted in desc order with 1000 response size) | 85 | 77.8 |
| desc_sort_with_after_timestamp (logs after a timestamp with status code 500, sorted in desc order with 1000 response size) | 86.8 | 77.2 |
| hourly aggregations (daily agg of 400, 403 and 404 status code logs on min and max size) | 89.7 | 64.8 |


RS146BIJAY commented on September 25, 2024

We should also explore if the predicate can be deduced from the user's frequent queries.

As part of the first phase, we will ask for this as input from the user. We will eventually explore how we can automatically determine the grouping criteria based on the customer's (Cx) workload.

