Coder Social home page Coder Social logo

[NEW] Handle spamming of custom LUA error messages in the INFO ERRORSTATS section with continued tracking of non LUA errors stats about valkey HOT 17 CLOSED

KarthikSubbarao avatar KarthikSubbarao commented on August 17, 2024
[NEW] Handle spamming of custom LUA error messages in the INFO ERRORSTATS section with continued tracking of non LUA errors stats

from valkey.

Comments (17)

PingXie avatar PingXie commented on August 17, 2024 1

We could pass a flag to it to differentiate custom lua errors?

Yes. They could be emulating a normal error though.

I wouldn't consider these "emulated" errors "normal" anymore ... I think it is fine to lump them all into "custom" errors, as far as storage is concerned so we don't lose other more important errors.

from valkey.

madolson avatar madolson commented on August 17, 2024 1

I think the difficulty here is how many error codes we can allow the script to customize. The scenes I have seen so far are all misuse, maybe there is a case that user do need to use a lot error code.

I've only seen on the order of like 10-15 custom error codes, I think you're right that this is likely only a misuse. I suppose my concern is that I would like the "normal" errors to continue working for debug purposes. I'm kind of okay with the tradeoff that #500 proposes, we continue to report normal errors but we stop reporting LUA errors.

@enjoy-binbin Just to understand your PoV, are you okay with #500, I'll try to followup and try to get it ready to merge this week if that is the case.

from valkey.

enjoy-binbin avatar enjoy-binbin commented on August 17, 2024 1

I'm kind of okay with the tradeoff that #500 proposes, we continue to report normal errors but we stop reporting LUA errors.

yean, i am ok with that.

from valkey.

enjoy-binbin avatar enjoy-binbin commented on August 17, 2024

does this redis/redis#13141 can somehow solve your problem?

from valkey.

madolson avatar madolson commented on August 17, 2024

does this redis/redis#13141 can somehow solve your problem?

I really don't understand why Oran marked that as not a breaking change, it's possible someone had more than 128 errors. I feel like he was just trying to merge stuff before the license change.

I think this should solve the issue @KarthikSubbarao brought up though.

from valkey.

enjoy-binbin avatar enjoy-binbin commented on August 17, 2024

I really don't understand why Oran marked that as not a breaking change, it's possible someone had more than 128 errors.

yes, that is indeed a potential breaking change. I have thought about add a new configuration item and defaulting to infinity to avoid the breaking change, but Oran does't seem to mention it and probable doesn't want to add new a configuration item for this.

I feel like he was just trying to merge stuff before the license change.

Looking at it now, it may indeed be.

from valkey.

KarthikSubbarao avatar KarthikSubbarao commented on August 17, 2024

does this redis/redis#13141 can somehow solve your problem?

@enjoy-binbin - (Just read the top comment) Yes, this will solve the issue of spamming INFO ERRORSTATS and reduce additional data stored on error stats.

For completeness - with redis/redis#13141, ERRORSTATS will be disabled when misuse is detected. This means non LUA errors (which do not spam) will also be disabled until we use CONFIG RESETSTAT.

Whereas, the idea proposed in this issue was to have a config to disable custom LUA error messages while continuing to allow regular Error stats to function normally.

But the main issue we wanted to address is misuse (spamming), so this should work. Thank you

Whereas, the idea proposed in this issue was to have a config to disable custom LUA error messages while continuing to allow regular Error stats to function normally.

If we think this is useful, we can consider making this improvement - by adding onto the existing logic

from valkey.

srgsanky avatar srgsanky commented on August 17, 2024

Resetting all the error count will also lose visibility into other errors like MOVED etc that can help with debugging issues in the cluster.

Is it possible to record all custom errors from lua in a separate RAX (based on scriptIsRunning() and if the error originated from luaRedisErrorReplyCommand) and show the custom errors only when using INFO ERRORSTATS explicitly?

from valkey.

KarthikSubbarao avatar KarthikSubbarao commented on August 17, 2024

Resetting all the error count will also lose visibility into other errors like MOVED etc that can help with debugging issues in the cluster.

Yes, looks like with the current implementation (from redis/redis#13141):

  1. To re-enable error stats tracking, we need to use CONFIG RESETSTAT which will clear command stat counters and error stat counters. So, we will lose visibility in both of these.
  2. When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.

These issues aside, the PR addresses the concerns we have perfectly.

Is it possible to record all custom errors from lua in a separate RAX (based on scriptIsRunning() and if the error originated from luaRedisErrorReplyCommand) and show the custom errors only when using INFO ERRORSTATS explicitly?

The proposal in this issue is to prevent tracking new error messages from LUA if the number of LUA error messages is greater than 128. Instead, we will track any additional LUA error type in a new counter: errorstat_LUA and if a non-LUA error (MOVED / CLUSTERDOWN) occurs, they will continue to be tracked as usual.

This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore functionality - as normal error messages continue to be tracked.

from valkey.

madolson avatar madolson commented on August 17, 2024

I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of scriptIsRunning, since the script could generate normal error messages. I can't think of anything off the top of my head to tell if it was a user generated error message or not.

from valkey.

ranshid avatar ranshid commented on August 17, 2024

@KarthikSubbarao Thank you for driving this! I am reading through the PR and will provide technical comments there (if I have any) but wanted to ask here:
Why is that so different than the modules case? I think this will provide solution for LUA, but modules can cause the same kind of missuse right? I feel this error stats mechanism lacks some better structure than just be based on string parsing. Maybe we should consider improving that in the future?

from valkey.

KarthikSubbarao avatar KarthikSubbarao commented on August 17, 2024

@ranshid Thank you. Modules are similar in that they can result in custom error messages - but I would place Modules closer to the Valkey server in that it is the responsibility of those running the server to ensure that the modules they load do not lead to spamming the ERRORSTATS section. EVAL / EVALSHA commands, however, are completely customer driven. This is why the Issue here started off focusing only on LUA.

But (if required) we can indeed add support for preventing misuse by Modules by passing a flag in the VM_ReplyWithError and VM_ReplyWithErrorFormat APIs. I can update the PR if so

from valkey.

PingXie avatar PingXie commented on August 17, 2024

I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of scriptIsRunning, since the script could generate normal error messages. I can't think of anything off the top of my head to tell if it was a user generated error message or not.

I guess that depends on the definition of "lua errors". If they are interpreted as "custom errors" then agreed but do we see value in interpreting them as the origin of the errors? Meaning any errors, both custom and normal, generated by any scripts go to its own RAX. If we go down this path, we would have to create a dedicated errorstat section just for scripts ( errorstat_script_*?). Since this is already a breaking change, I think it would be worthwhile expanding the discussion a bit.

from valkey.

madolson avatar madolson commented on August 17, 2024

If we go down this path, we would have to create a dedicated errorstat section just for scripts ( errorstat_script_*

I don't like this approach. I don't think we should have a special section just for scripts, since we try to treat the scripts as more or less normal clients. It also makes this definitely a break changing for users that are using this feature, as opposed to just a breaking change for users that are misusing the feature.

from valkey.

PingXie avatar PingXie commented on August 17, 2024

It also makes this definitely a break changing for users that are using this feature, as opposed to just a breaking change for users that are misusing the feature.

Good point.

I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of scriptIsRunning

Looks like addReplyErrorFormatEx is what is used to record the custom error in the COB?

https://github.com/valkey-io/valkey/blob/unstable/src/script_lua.c#L620.

We could pass a flag to it to differentiate custom lua errors?

from valkey.

madolson avatar madolson commented on August 17, 2024

We could pass a flag to it to differentiate custom lua errors?

Yes. They could be emulating a normal error though.

from valkey.

enjoy-binbin avatar enjoy-binbin commented on August 17, 2024

Yes, looks like with the current implementation (from redis/redis#13141):

  1. To re-enable error stats tracking, we need to use CONFIG RESETSTAT which will clear command stat counters and error stat counters. So, we will lose visibility in both of these.
  2. When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.

Yes, I didn't realize that config resetstat is an admin command and may not be easy to call.
And, i thought about continuing to track existing errors when the limit was reached, but in the implementation at the time, we couldn't add new error codes once the number was reached, so the code does not make a big different.

I think the difficulty here is how many error codes we can allow the script to customize. The scenes I have seen so far are all misuse, maybe there is a case that user do need to use a lot error code.

One way is that we set ERROR_STATS_NUMBER as a configuration item?

btw, what PR #500 does is that when it encounters a miuse, it ensures that other normal errors can be counted instead of being disabled. I guess i am ok with that, if there is no miuse, everyone will be happy, if there is a miuse, at least you will want to check the errorstats. People still need to call CONFIG RESETSTATS to clear the miuse errorstats.

from valkey.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.