
Comments (32)

madolson commented on August 15, 2024

I think our big initial play should be a cluster overhaul. A lot of us want it, and it makes the most compelling sense as the big "next major feature".


kyle-yh-kim commented on August 15, 2024

Ignore my spam references above; I was reviewing the diff manually over the GitHub UI.

PR has now been opened; #351

This PR is the first of three upcoming PRs:

  1. CLUSTER SLOT-STATS command introduction, with key-count support --> This PR.
  2. cpu-usec support
  3. memory-bytes support


madolson commented on August 15, 2024

I've also re-added my favorite creature comfort. @placeholderkv/core-team thoughts?


zuiderkwast commented on August 15, 2024

@kyle-yh-kim Yeah, CLUSTER SLOT-STATS. We're a bit overloaded with the forking stuff, the new core team, the new project, etc., but I think we want this for our next release. There was already a lot of review done, and I think it was almost ready to merge. Do you want to bring over your PR?


zuiderkwast commented on August 15, 2024

Hi Kyle! I have two concerns:

  1. Tracking memory for each data structure seems to add considerable complexity. For dict, we'd need keySize and valueSize callbacks in dictType. For quicklist, it's just a size per quicklist I suppose, since the nodes already have a size, but what about compressed nodes? For rax and skiplist, I'd like to see a simple description of how to handle each of these to understand the complexity.
  2. Is there any performance degradation, or did you say it's behind a config? With it off, is there no performance degradation?

Memory usage is not a concern to me, since we don't need any new structures to track the memory for the single-allocation data structures (string, listpack, etc.). Modules are no great concern either, since I'd imagine it's no disaster if this metric isn't 100% accurate.

What about alternative approaches? Can we check the total memory usage before and after each command? We know which slot each command operated on.


kyle-yh-kim commented on August 15, 2024

Understood. I like the proposal, and I agree with giving more flexibility to the user. Through this, we let each user decide whether the trade-off is worth it for their specific application.

The new configuration should be immutable, since the calculation becomes inaccurate if the config is toggled repeatedly throughout a cluster's lifecycle. This also means that a disabled config will bypass both layers of aggregation. As a side effect, we are left with an unused 8 bytes from d->size when the config is disabled. This is something I'd like more feedback on from the core team.

Once we're aligned on the high-level design, I will start with the String data type and gradually move on to Hash, Set, List, SortedSet, and Stream.


hwware commented on August 15, 2024

@hwware Please review the performance impact here: redis/redis#10472 (comment). I think we have a good understanding of the performance.

You can see the worst case delta on 100% write throughput workloads.

Thanks for sharing this with me. I have already updated my comment above.


hwware commented on August 15, 2024

one experimental build option make ENABLE_FLASH=yes

@zuiderkwast We have different understandings of the term "experimental", which is why it confused you.

For me, experimental means a compile-time option like ENABLE_FLASH=yes in KeyDB: the user decides whether the feature is built into the binary or not.
For you, experimental means the feature is controlled by a config.

Then let me rephrase:

  1. Because enabling this feature has very little impact on performance, I think it could be an opt-in feature.
  2. Because it has very little performance impact but needs to compute the sizes of all keys in the database, we can
    enable this feature and make it an immutable config, OR not support it as a configurable parameter at all.


madolson commented on August 15, 2024

I fully agree!


zuiderkwast commented on August 15, 2024

Sure, metrics seem fine. I don't have strong opinions about it, only that I think fixing the cluster consistency problems is more important than metrics.


zuiderkwast commented on August 15, 2024

redis/redis#11432


kyle-yh-kim commented on August 15, 2024

Good to hear back on this thread, hope you all have been doing well.

Where we left off

In total, there were 3 proposed metrics under the CLUSTER SLOT-STATS command group:

  1. key_count
  2. cpu_usec
  3. memory_bytes

Next steps

memory_bytes is the most complex of all, but this shouldn't stop us from implementing the first two metrics to gain some momentum.

I will open two PRs for key_count and cpu_usec in the coming days. These PRs will be based on the existing PRs for key_count and cpu_usec in the Redis repository.

As for the CLUSTER SLOT-STATS command format, below is the latest version we agreed upon. The lengthy discussion and rationale can be found here and here.

CLUSTER SLOT-STATS
[SLOTSRANGE start-slot end-slot [start-slot end-slot ...]]|
[ORDERBY column [LIMIT limit] [ASC|DESC]]
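For illustration, a hypothetical session under this format might look as follows (the reply shape shown is an assumption for readability, not a finalized RESP layout):

127.0.0.1:6379> CLUSTER SLOT-STATS ORDERBY key_count LIMIT 2 DESC
1) 1) (integer) 12
   2) 1) "key_count"
      2) (integer) 1504
2) 1) (integer) 3
   2) 1) "key_count"
      2) (integer) 921

The two slots with the highest key counts come back first, each returned as a slot number paired with its metric name/value pairs.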


hwware commented on August 15, 2024

It is great, and I prefer to add this feature to CLUSTER INFO.


madolson commented on August 15, 2024

It is great, and I prefer to add this feature to CLUSTER INFO.

Why CLUSTER INFO? It's a free-form field, I guess; it could be a new INFO subsection, I suppose.


kyle-yh-kim commented on August 15, 2024

Thanks for chiming in. Personally, I'm opposed to CLUSTER INFO. We could perhaps add aggregated information under CLUSTER INFO, but not the slot-level metrics themselves.

Imagine dumping ~16384 slot-level metrics under CLUSTER INFO. This would unnecessarily bloat the info string when the user may have only wanted to check cluster_state:ok.

A new command dedicated to querying slot-level metrics, in this case CLUSTER SLOT-STATS, is more suitable. For reference, below is the latest command format we agreed on.

CLUSTER SLOT-STATS
[SLOTSRANGE start-slot end-slot [start-slot end-slot ...]]|
[ORDERBY column [LIMIT limit] [ASC|DESC]]

I'll wait for the core team to finalize this decision before opening the PRs.


kyle-yh-kim commented on August 15, 2024

Moving ahead, I would like to resume our conversation on per-slot memory metrics. I'd argue this is the most important per-slot metric of all, as it enables smoother horizontal scaling through accurate memory tracking at per-slot granularity.

Last time, we converged on the high-level strategy of "online analysis" (amortizing the memory-tracking cost across mutative commands, rather than offline RDB snapshot analysis / forking a process), as well as on its performance and memory impact. The following conclusion was drawn before the issue was put on hold by the previously open-sourced Redis core team.

Overall this data seems really good to me. There is the separate project for improving main memory efficiency of the dictionary, so if these two features are released together it might not be noticeable.

Source: redis/redis#10472 (comment)

As for module considerations, I mentioned in detail here that this feature should remain opt-in to maintain backwards compatibility. Opt-in modules will be required to accurately track their value sizes and call a newly introduced hook, RM_ModuleUpdateMemorySlotStats(), upon each mutation, signaling valkey-server to register the memory gain/loss from the module's registered write commands.
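As a rough illustration of that opt-in contract (the hook name comes from the paragraph above, but its signature and the mytype_* helpers here are assumptions, not a finalized API):

#include "redismodule.h"

/* Sketch of an opt-in module write command: perform the mutation, then
 * report the resulting memory delta so the server can update the
 * per-slot stats for the key's slot. */
int MyTypeSet_Command(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    if (argc != 3) return RedisModule_WrongArity(ctx);
    RedisModuleKey *key = RedisModule_OpenKey(ctx, argv[1], REDISMODULE_WRITE);
    long long before = (long long)mytype_value_bytes(key); /* module's own bookkeeping */
    mytype_store(key, argv[2]);                            /* the actual mutation */
    long long after = (long long)mytype_value_bytes(key);
    RM_ModuleUpdateMemorySlotStats(ctx, argv[1], after - before);
    RedisModule_CloseKey(key);
    return RedisModule_ReplyWithSimpleString(ctx, "OK");
}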

If we are still aligned on this strategy, I will start on the implementation and open incremental PRs following the merge of the above CLUSTER SLOT-STATS command PR #351.


kyle-yh-kim commented on August 15, 2024

Based on Madelyn's latest comment:

  1. Defer the decision about memory usage since it was contentious.

The memory metric is our greatest interest, since it would enable smoother horizontal scaling given accurate information about each slot's memory consumption.

Whenever possible, I'd like to understand the core team's concerns about the proposed design. Once the concerns are shared, I will evaluate alternative options.

One thing I can state with certainty is that we've put a lot of time and effort into this technical design. Ultimately, no solution comes free of charge - it all boils down to trade-off decisions (performance, memory, and maintainability).


kyle-yh-kim commented on August 15, 2024

Thanks for your prompt response. My answers are below.

Complexity in memory tracking

  1. quicklist compression
    zmalloc_size(node->entry) is called before and after compression to measure the difference. The two hook points are 1) __quicklistCompressNode() and 2) __quicklistDecompressNode(). The difference is then accumulated into quicklist->alloc_bytes, where alloc_bytes is a newly introduced size_t field that tracks the allocation bytes. (A sketch of this pattern follows this list.)

  2. zskiplist
    zskiplistNode holds two major memory allocations: 1) node->ele and 2) node->level[]. Both can be easily introspected through zmalloc_size(). As with dict and quicklist, the lowest common hook points are chosen so that the change is minimally invasive; it can be accomplished with two-line changes in 1) zslInsert() and 2) zslDeleteNode().

  3. rax
    There exists an open OSS PR which tracks the rax allocation size in its header. The change isn't complex, as we simply add or subtract zmalloc_size(raxNode) per mutation, for which there are about 20 touch points. This effort can be resumed in the Valkey project.
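A minimal sketch of the delta-accounting pattern shared by all three cases, using the quicklist compression hooks as the example (the wrapper and its name are illustrative; alloc_bytes is the field described above):

#include "quicklist.h"
#include "zmalloc.h"

/* Measure the node's allocation before and after the transform (compress
 * or decompress) and fold the difference into the list's running total. */
static void quicklistNodeTrackTransform(quicklist *ql, quicklistNode *node,
                                        void (*transform)(quicklistNode *)) {
    size_t before = zmalloc_size(node->entry);
    transform(node);
    size_t after = zmalloc_size(node->entry);
    ql->alloc_bytes += after;   /* alloc_bytes: newly introduced size_t field */
    ql->alloc_bytes -= before;
}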

Performance degradation

The configuration is based on server.cluster_enabled. If it is enabled, per-slot memory is aggregated; otherwise, the code is bypassed.

The aggregation comes in two layers:

  1. Track the accurate memory usage of each Redis key-value entry.
  2. Aggregate memory usage at the per-slot level, given that each entry's memory usage is tracked.

Right now, the proposal for CMD (cluster-mode-disabled) is to bypass only the second aggregation and retain the first. This way, both CMD and CME (cluster-mode-enabled) have O(1)-accurate MEMORY USAGE. A rough sketch of this two-layer split follows.
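This is a sketch only: the slot_stats array and the helper are assumed names, and d->size is the field discussed above.

#include "server.h"

/* Layer 1 keeps the per-entry size accurate on every mutation; layer 2
 * folds the same delta into the per-slot counter, and is the part that
 * CMD bypasses. slot_stats is an assumed structure, not existing code. */
void trackEntryMemoryDelta(dict *d, int slot, long long delta) {
    d->size += delta;                        /* layer 1: always on */
    if (server.cluster_enabled)              /* layer 2: CME only */
        server.cluster->slot_stats[slot].memory_bytes += delta;
}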

More on performance benchmarking can be found here. In the worst-case scenario for CMD, the performance degradation may reach ~1% TPS. For an average workload with an 8:2 read/write ratio, the degradation is negligible.

Alternative approaches

Can we check the total memory usage before and after each command? We know which slot each command operated on.

Yes, this was the very first design candidate we ideated. Initially, we expected it to be as simple as subtracting zmalloc_used_memory() before the command from zmalloc_used_memory() after it (see the sketch after the list below). However, it carried far greater complexity, for the following reasons:

  • Maintenance and hard-to-follow logic. At first glance, this approach seems simple to implement. However, zmalloc context switches from the customer key-space to other intents (including, but not limited to: 1. transient/temporary allocations, 2. Redis administration, 3. client input/output buffers) can occur at any depth of a mutative command's call-stack. Out of all zmalloc operations, we must isolate those relevant to the customer key-space. Thus, for every mutative Redis command, we must first completely map out these context-switching windows, and then maintain that map whenever a new zmalloc is introduced within them.
    • The 2nd candidate solves this maintenance problem by logically separating all size tracking within the internal data-structure files, such as rax.c, dict.c, quicklist.c, and so on. The size tracking does not creep into other depths of the call-stack.
    • Down the road, if a bug is introduced, the 1st candidate requires sweeping across all zmalloc operations at all depths of the call-stack. With the 2nd candidate, we can simply refer to the specific internal data-structure file.
  • Complex and invasive, since zmalloc cannot be relied upon in all cases.
    • For example, in order to get the relevant slot number, the input must first be parsed. However, parsing the input itself requires zmalloc. We run into a cyclic dependency: zmalloc needs the slot number to increment the right counter, but the slot number can only be obtained once the key has been parsed, which requires zmalloc. To mitigate this, we could temporarily save the sizes of these allocations and apply them once the slot number is parsed and the request succeeds. But then we need a way to carry this additional temporary state, either through another global variable or an extra argument threaded through the call-stack.
    • Another example: the robj value is conditionally re-created after the initial parsing (createStringObjectFromLongLongWithOptions()), so the size of the initially parsed value may or may not need to be discarded from the slot metrics array. This requires another layer of consideration.
    • After a few edge-case considerations, the implementation ends up touching multiple signatures and a growing number of global variables.
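For concreteness, the 1st candidate amounts to roughly the following (a sketch only; c->slot and slot_stats are assumed to exist for illustration):

#include "server.h"

/* 1st candidate: attribute the allocator delta around a command to the
 * slot it touched. Every zmalloc serving transient buffers, server
 * administration, or client I/O inside this window would be wrongly
 * attributed to the slot, hence the mapping effort described above. */
void callAndTrackSlotMemory(client *c) {
    size_t before = zmalloc_used_memory();
    call(c, CMD_CALL_FULL);
    size_t after = zmalloc_used_memory();
    /* c->slot is assumed to have been resolved during command dispatch. */
    server.cluster->slot_stats[c->slot].memory_bytes +=
        (long long)after - (long long)before;
}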

We’ve also investigated various “offline” approaches, such as 1) a background thread, 2) a cron job, and 3) forking, all of which were not preferred due to their unbounded scanning cost as well as recency lag.

This was discussed at length in the other threads, here and here.


zuiderkwast commented on August 15, 2024

Thanks! Yes, I have seen those threads before, but I didn't follow them carefully back then. :)

OK, so memory is tracked even in standalone mode, and it has almost 1% throughput impact for standalone and nearly 2% in cluster mode. This makes me think we should add a config for it and wrap all of this in an if, like if (server.memory_tracking) { d->size += zmalloc_size(p); }. If the config is off, CPU branch prediction will make sure these if statements cost essentially nothing to execute.

Why? I think speed is more important than metrics for some users. 1% is not that much, but it adds up.


zuiderkwast commented on August 15, 2024

Since we use listpacks for small hashes, sets, etc., the memory overhead for storing the size will only apply to larger objects. I think that's negligible.

@valkey-io/core-team Are you OK with a config (on/off) for collecting memory statistics per key and per slot? (With the config off it has no CPU overhead, presumably.)


enjoy-binbin commented on August 15, 2024

I haven't read all the comments. It is a good idea to have slot-level metrics. Memory indicators can help us migrate slots and see whether machine memory is available.

So are these metrics for normal users (not for admins)? Since the config is immutable and disabled by default, I am worried that it won't be used, because normal users tend not to care about this; they're probably more concerned with performance.


zuiderkwast commented on August 15, 2024

I believe the metric is especially useful for auto-scaling a cluster. I think it's implemented in the AWS control plane. Maybe one day we can have it natively in Valkey too.

We haven't discussed the default yet. I think there are pros and cons for both on and off:

  • For a new feature, it can be good to start with "off" by default and say it's experimental. Users (or admins, sometimes it's the same) can test it and if there are bugs, we can fix them. Maybe in Valkey 9 we can make it "on" by default.

  • Or we can make it be "on" by default. Users can turn it "off" only if they notice they need more performance. (We can run benchmarks with "off" if we want to show maximum performance. 😁)

  • Managed services with auto-scaling can make it always "on" and prevent users from changing it.


kyle-yh-kim commented on August 15, 2024

Thanks for your feedback. I see two high-level approaches here:

1. New configuration, memory-usage

For a new feature, it can be good to start with "off" by default and say it's experimental. Users (or admins, sometimes it's the same) can test it and if there are bugs, we can fix them. Maybe in Valkey 9 we can make it "on" by default.

This would be the most conservative approach. Any interested user will review the release notes and enable the memory-tracking config for their specific application needs. Through this new config, we defer the trade-off decision to the end users.

The question here is: does Redis/Valkey have any prior history of first marking a feature "experimental" behind a default-disabled config, and then making it enabled by default in a later release?

2. Follow an existing configuration, cluster-enabled

The case for coupling with the cluster-enabled config is that there is a lot to gain for CME (cluster-mode-enabled), while there's not much to gain for CMD (cluster-mode-disabled). As Binbin mentioned, slot-level memory indicators will help with slot migration, which is essential to a cluster's lifecycle. For this reason, one could argue that it makes more sense to enable memory tracking based on whether cluster mode is enabled.

In this case, we make the trade-off decision on behalf of the end users. In addition, since the flexibility is not granted upfront, this is a backwards-compatible approach: based on the OSS community's feedback, we can later introduce a memory-tracking config and pivot back to the 1st option.

My thoughts

As of writing this comment, I hold a slight preference for the 2nd option. Flexibility can be added later based on the OSS community's feedback, and we keep things simple by coupling directly with cluster-enabled. That said, I have no strong opinions on this; I could swing either way based on the core team's feedback.


madolson commented on August 15, 2024

valkey-io/core-team Are you OK with a config (on/off) for collecting memory statistics per key and per slot? (With the config off it has no CPU overhead, presumably.)

I would rather there be no config. It adds another major bifurcation in our codebase which I don't want to think about or have to make sure is tested. I also think a major reason people will prefer us is better introspection and observability compared to something like Memcached (or Redis). It also only applies to large data structures, and I still think we have a lot of room to optimize to reclaim that 1%.


zuiderkwast commented on August 15, 2024

The question here is: does Redis/Valkey have any prior history of first marking a feature "experimental" behind a default-disabled config, and then making it enabled by default in a later release?

Redis? I don't know. Valkey? It's a new project with a leadership team. IMO we can and should improve the methods.

Speaking of defaults, we should really revise the defaults for Valkey 8! I just opened #653.


madolson commented on August 15, 2024

Redis? I don't know. Valkey? It's a new project with a leadership team. IMO we can and should improve the methods.

I like the idea of opt-in features, although we need a better name than "experimental" since that implies they might get removed or are not really ready for production.


zuiderkwast commented on August 15, 2024

we need a better name than "experimental" since that implies they might get removed or are not really ready for production

Might get removed? Hopefully not, but I do think we should open up our process to deprecate and actually remove features, not least our communication about it, although we should still remain very conservative and only remove things that really cause us trouble. Features not enabled by default are easier to remove than those enabled by default.

Consider features moving slowly across this scale of feature support: Removed < Deprecated < Not enabled by default < Enabled by default < Mandatory and always on

Not really ready for production? We've discussed that we do want users to find bugs, because nobody tests the release candidates. I think this is one of the benefits, but sure, we can use another name if you can come up with one.


hwware commented on August 15, 2024

I just finished going through this discussion thread: @zuiderkwast is worried about the performance, and @madolson hopes this will be an opt-in feature in Valkey.

I know that in KeyDB there is an experimental build option, make ENABLE_FLASH=yes. And I read the link redis/redis#10472 (comment). So personally, I see the following options:

  1. The "experimental" term can be acceptable. (Revise: looks like we do not need it.)
  2. Better still, make this feature or metric configurable, like the lazyfree-lazy-eviction parameter, and make it mutable.
    (Revise: According to the benchmark results, we can make it an opt-in feature.)
  3. Or, as @madolson said, make it an opt-in feature. (I vote for this option.)

Thanks


madolson commented on August 15, 2024

@hwware Please review the performance impact here: redis/redis#10472 (comment). I think we have a good understanding of the performance.

You can see the worst case delta on 100% write throughput workloads.


zuiderkwast commented on August 15, 2024

@hwware I'm a little confused by what you mean.

When I said "experimental feature", I meant it's controlled by a config.

"Opt-in feature" means enabled by a config that's off by default. So "opt-in" and "experimental" is almost the same idea, different names.

We usually don't want compile-time options (ifdef), because not everyone can build from source. This is not what I meant by "experimental".

Making the config mutable at runtime is difficult for this feature, because we would need to compute the sizes of all keys in the database. If we decide to use a config, we can start with it immutable and make it mutable in the future without any breaking change, so let's not do that now.

I think Madelyn likes opt-in for other features, but not for this feature. This feature has very little performance impact, so she prefers that it's always on. Also, the extra complexity and "bifurcation" (TIL) of a config is a good reason to avoid it. I hadn't considered the testing needed and other differences between on and off.

I can accept always-enabled, to avoid the extra complexity and bifurcation.

Are there any other benefits of accurate memory tracking? I've seen these so far, but I imagine some users don't care about any of them:

  • Useful metric for cluster auto-scaling
  • More accurate MEMORY USAGE key
  • It can improve Bigkeys?
  • It can be used for stream trim by size (mentioned in redis/redis#10152)


PingXie commented on August 15, 2024

Experimental

I get the high-level idea, but I am still unable to articulate the process. @hwware, do you think you could propose an "experimental" process doc (maybe EXPERIMENTAL.md)? Until we have a signed-off process doc, I would assume we have NO process for admitting any changes under experimental.

default "on" or "off" for a new feature

We discussed this topic in one of the core team meetings, but I couldn't find the meeting notes. Nevertheless, my takeaway from the previous discussion is that "potentially risky or breaking changes" should default to off in the first release(s) to ensure seamless upgrades.

This discussion brings up a new class of changes: changes that are presumably compatible and safe but whose ROI is not a slam dunk. I like how @kyle-yh-kim put it: at the end of the day, it boils down to compromises/trade-offs. As far as this per-slot memory stat is concerned, I would imagine every cluster user can benefit from knowing slot memory usage, cloud provider or not. I will need to recall the detailed discussion about the overhead, but if the overhead is indeed ~1%, i.e., negligible, I would be inclined to enable it by default (again, assuming it's safe/compatible). I believe this metric justifies the ~1% overhead, and I agree there are many other, more systematic/cleaner ways to improve performance (such as the rethinking of the dictionary, better I/O threading, io_uring, RDMA, etc.).


kyle-yh-kim commented on August 15, 2024

The PR for the per-slot CPU metric has been opened: #712

The metric tracks CPU time in microseconds, sharing the same value as INFO COMMANDSTATS, aggregated per slot.

The first revision contains only the implementation changes, for initial feedback purposes, with perf testing still pending.
The integration tests are not up to date and are thus failing; this will be followed up soon.

