Comments (15)
@dbwiddis Let me take a look at this. We did change things around RequestCache with the Tiered Caching feature in 2.12/2.13.
from opensearch.
The stack trace points to the total memory size stat becoming negative in RequestCacheStats.
The error occurs when this negative value is sent over transport. The node handling the REST request returns its own stats successfully (negative value included), while the other nodes fail to return theirs. This is what leads to the symptom of only seeing stats for the node that the REST request is directed toward.
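To illustrate why a negative stat only fails once it crosses the wire, here is a minimal sketch of a variable-length long encoder that rejects negative input. It is not the OpenSearch stream class: `VLongSketch`, `writeVLong`, and `canSerialize` are hypothetical names, and the exception message is shortened from the 'Negative longs unsupported' error reported in this thread.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical stand-in for a transport stream writer; not the OpenSearch class. */
final class VLongSketch {

    /** Writes a non-negative long in 7-bit groups, lowest bits first. */
    static void writeVLong(ByteArrayOutputStream out, long value) {
        if (value < 0) {
            // A variable-length encoding of this kind has no room for a sign,
            // so a negative stat cannot be serialized at all.
            throw new IllegalStateException("Negative longs unsupported [" + value + "]");
        }
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // set the continuation bit
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Returns true if the given stat value could be sent to another node. */
    static boolean canSerialize(long memorySizeInBytes) {
        try {
            writeVLong(new ByteArrayOutputStream(), memorySizeInBytes);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(1024L)); // true
        System.out.println(canSerialize(-60L));  // false
    }
}
```

A node that answers the REST request directly never has to pass its own stats through such an encoder, which matches the symptom described above.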
We need to further investigate what changed with RequestCacheStats (particularly the total memory field) between 2.11.1 and 2.13.0. From the change log I see several PRs associated with Tiered Caching and/or pluggable caches. I've spent a bit of time digging into them, but a lot has changed. I'm guessing something is being subtracted (onRemoval) that was never added (onCache). @msfroh, @sgup432, and @sohami do you have any ideas here?
from opensearch.
I am also seeing this error. The errors appear to occur until the affected node is restarted, after which they resume anywhere from 4 hours to 3 days later. The problem compounds as more nodes in a cluster begin presenting this error and more of the data becomes unavailable.
Given that the call stack includes a reference to org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor, I deleted all rollups. The errors resumed about 4.5 hours later. I then disabled rollups in the persistent cluster settings.
After rollups were disabled via cluster settings, the errors ceased without node restarts. I will continue monitoring for the next few days to see if they occur again with rollups disabled.
from opensearch.
@artemus717 There is an API for that, but it will not help in this scenario, where the stat has already gone negative.
For now, you can turn off the request cache for indices via the dynamic setting index.requests.cache.enable and then restart the node.
We already have the fix, which should hopefully get picked up in the next release.
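A minimal sketch of the workaround mentioned above (disabling the request cache for an index), assuming a cluster reachable at a placeholder base URL and a placeholder index name. It only builds the update-settings request with the JDK HTTP client; `DisableRequestCacheSketch`, `disableRequestCache`, `http://localhost:9200`, and `my-index` are illustrative names, not taken from this thread.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that builds the settings update; it does not send it. */
final class DisableRequestCacheSketch {

    /** Builds PUT {baseUrl}/{index}/_settings with the request cache turned off. */
    static HttpRequest disableRequestCache(String baseUrl, String index) {
        String body = "{\"index.requests.cache.enable\": false}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder endpoint and index name; adjust for a real cluster.
        HttpRequest request = disableRequestCache("http://localhost:9200", "my-index");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would be done with `java.net.http.HttpClient`, plus whatever authentication the cluster requires. As noted above, the affected node still needs a restart afterwards.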
from opensearch.
@weily2 @lyradc @artemus717 The fix for this will be available in the 2.14 release.
from opensearch.
[Triage - attendees 1 2 3 4 5 6]
@weily2 Thanks for creating this issue, this should definitely be addressed. Would you like to create a pull request to fix the issue?
from opensearch.
@weily2 I need more info/help on this while I try to reproduce it on my end.
I had a couple of hunches about where things might be going wrong. One was a race condition on removal, but in theory the code looks alright, and I also verified it by running a concurrent test locally. The other was that the size calculated when an item is cached somehow differs from the size calculated when it is removed, but that doesn't seem to be the case for now.
It seems you were upgrading the domain from 2.11 to 2.13 and saw these errors. Did you see them during the domain upgrade or after the upgrade was complete? Also, were you performing a rolling upgrade?
Can you give more precise steps to reproduce this so that I can try the same on my end?
from opensearch.
Disabling rollups didn't stop the 'Negative longs unsupported' errors.
from opensearch.
@lyradc Thanks for confirming.
But this shouldn't be related only to rollups. The stack trace just points to RollupInterceptor in the handlers list (like the others), which may not even get executed.
Do you mind confirming a few things about your setup?
- Are you running a 2.13 cluster?
- Is it possible to see the list of indices along with their layout/settings?
- Do you perform any index close/delete operations, manually or automatically?
from opensearch.
- Yes, running 2.13 clusters
- I should be able to get you indices schema/settings tomorrow.
- Not using index closing. Using ISM to handle index deletion for old indices.
from opensearch.
@lyradc It would also help if you can mention/share the steps to reproduce so that I can try on my end.
from opensearch.
Is there an API to clear the RequestCache on a node?
from opensearch.
I have raised the PR for this fix.
It turns out that the issue occurs when an indexShard is deleted and then reallocated on the same node. When stale entries from the older shard are later removed, their size is accounted against the new shard, which has the same shardId.
I was able to reproduce it in an integration test (IT) by recreating the above scenario.
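The accounting problem described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual request cache code: `ShardStatsSketch` and its methods are hypothetical, and stats are assumed to be tracked in a map keyed only by shard ID.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-shard memory accounting keyed only by shard ID. */
final class ShardStatsSketch {
    private final Map<Integer, Long> memoryBytesByShardId = new HashMap<>();

    /** Called when an entry is cached for a shard. */
    void onCached(int shardId, long bytes) {
        memoryBytesByShardId.merge(shardId, bytes, Long::sum);
    }

    /** Called when an entry is removed; it cannot tell old and new shard apart. */
    void onRemoval(int shardId, long bytes) {
        memoryBytesByShardId.merge(shardId, -bytes, Long::sum);
    }

    /** Shard deleted: its stats are dropped, but stale cache entries linger. */
    void shardDeleted(int shardId) {
        memoryBytesByShardId.remove(shardId);
    }

    long memoryBytes(int shardId) {
        return memoryBytesByShardId.getOrDefault(shardId, 0L);
    }

    public static void main(String[] args) {
        ShardStatsSketch stats = new ShardStatsSketch();
        stats.onCached(0, 100);   // old shard 0 caches 100 bytes
        stats.shardDeleted(0);    // shard 0 is deleted from the node
        stats.onCached(0, 40);    // shard 0 is reallocated and caches 40 bytes
        stats.onRemoval(0, 100);  // stale entries of the old shard are cleaned up
        System.out.println(stats.memoryBytes(0)); // -60
    }
}
```

In this sketch the removal of 100 stale bytes is charged to a shard that only ever added 40, so the total ends up at -60, which is the kind of negative value that later fails to serialize.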
from opensearch.
Note:
Please don't close this issue even after the PR is merged. I would like to wait and confirm the fix to ensure we are not missing anything.
from opensearch.
> @weily2 I need more info/help on this while I try to reproduce it on my end. I had couple of hunches where things might be going wrong. Either there was some race condition on removal but theoretically it looks alright and I also verified by running concurrent test locally. And secondly either calculation of size when item is cached is different compared to when it is removed somehow but that doesn't seem the case for now.
> It seems you were trying to upgrade domain from 2.11 to 2.13 and saw these errors. Did you see these errors during domain upgrade or after the upgrade was complete? Also were you performing a rolling upgrade? Can you give more precise steps on reproducing this so that I can try same on my end?

@sgup432 Sorry for the late reply. I did perform a rolling upgrade. The logs were normal right after the upgrade; the errors started a few hours after the upgrade was complete.
from opensearch.