
Comments (17)

juliusv commented on August 17, 2024

There's a question here of whether we want to limit the total number of series per user or per ingester. Maybe both.

From a system safety perspective, we'll want to limit the number of series per ingester. But that is hard to communicate to the user. They will want to know how many series they can create overall, because they don't understand the uneven distribution of their series over ingesters (similarly to how we stumbled over Dynamo table throughput issues with uneven table shards).

Limiting the per-user series on each ingester would be technically easiest though, because the necessary state is readily available. Given that this is necessary for basic safety, we will want to have some limit here in any case, even if it's pretty high.

I'm not sure how we would track the total number of series for a user as another limit. The distributor would either have to get stats from the ingesters on every append (probably infeasible) or the distributor would have to track metric cardinality itself. Doing full cardinality tracking in the distributor would use too many resources, but maybe it could be done approximately with HyperLogLog.

from cortex.

tomwilkie commented on August 17, 2024

Limiting the per-user series on each ingester would be technically easiest though, because the necessary state is readily available. Given that this is necessary for basic safety, we will want to have some limit here in any case, even if it's pretty high.

Yes, I think this is the best place to start. We can always offer users this limit as a lower bound.

Doing full cardinality tracking in the distributor would use too many resources, but maybe it could be done approximately with HyperLogLog.

That's an interesting idea; it would need some kind of moving average as well, since cardinality over time is virtually unlimited — we close inactive series after 1h.
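
One way to bound the window, sketched here exactly rather than with sketches for clarity (a hypothetical illustration, not Cortex code): keep rotating per-slice sets and count distinct series over the last N slices, mirroring the 1h inactive-series cutoff. The same rotation scheme would apply to per-slice HLLs, merged by taking register-wise maxima.

```go
package main

import "fmt"

// windowTracker counts distinct series over the last len(buckets) time slices.
type windowTracker struct {
	buckets []map[string]struct{}
	cur     int
}

func newWindowTracker(n int) *windowTracker {
	w := &windowTracker{buckets: make([]map[string]struct{}, n)}
	for i := range w.buckets {
		w.buckets[i] = map[string]struct{}{}
	}
	return w
}

// Observe records a series in the current time slice.
func (w *windowTracker) Observe(series string) { w.buckets[w.cur][series] = struct{}{} }

// Rotate advances to the next slice, dropping the oldest one.
func (w *windowTracker) Rotate() {
	w.cur = (w.cur + 1) % len(w.buckets)
	w.buckets[w.cur] = map[string]struct{}{}
}

// Active returns the number of distinct series seen anywhere in the window.
func (w *windowTracker) Active() int {
	seen := map[string]struct{}{}
	for _, b := range w.buckets {
		for s := range b {
			seen[s] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	w := newWindowTracker(2) // current slice + one previous slice
	w.Observe("a")
	w.Observe("b")
	w.Rotate()
	w.Observe("b")
	w.Observe("c")
	fmt.Println(w.Active()) // a, b, c all still inside the window → 3
	w.Rotate()              // "a" ages out
	fmt.Println(w.Active()) // b, c → 2
}
```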


rade commented on August 17, 2024

What are we proposing to do when the limit is exceeded? Throw away the data cortex has received? How will a user know that (and why) this is happening?


juliusv commented on August 17, 2024

@rade Throw away the samples and return an error to the user. For the rate-limiting case, we do the same and return a 429 Too Many Requests, which wouldn't be fully accurate here. I don't know what the best HTTP response code would be, maybe 418 I'm a teapot :)

At first, this would require a user to meta-monitor their Prometheus scraper for failed remote writes, but eventually we could notify users automatically when we see that they are being continuously denied.


tomwilkie commented on August 17, 2024

We need to make it clear that it's the product of the label cardinalities for a given metric that's problematic for us. If we detect such a metric, we should blacklist it, but not drop the entire batch.
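
To illustrate why the product is the problem (a toy example, with made-up label counts): each label is individually harmless, but the worst-case series count is the product of their value cardinalities.

```go
package main

import "fmt"

// cardinalityProduct returns the worst-case number of series for a metric,
// given the number of distinct values observed per label.
func cardinalityProduct(valuesPerLabel map[string]int) int {
	product := 1
	for _, n := range valuesPerLabel {
		product *= n
	}
	return product
}

func main() {
	// 100 instances × 50 paths × 5 status codes — modest labels, large product.
	fmt.Println(cardinalityProduct(map[string]int{
		"instance":    100,
		"path":        50,
		"status_code": 5,
	})) // 25000
}
```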


rade commented on August 17, 2024

Getting something to show up in the Weave Cloud cortex UI would be nice. And, going totally meta... feed a metric into the instance's cortex, which the user can set an alert on :)


rade commented on August 17, 2024

This is all post MVP, obviously. Let's be safe before being nice.


juliusv commented on August 17, 2024

We could still store other samples, but should still return an HTTP error, because a 200 would not seem ok in that case. Then the user doesn't know which samples were stored, but I don't think we can do better.

Yeah, in the future we can have nice UI features and meta-metrics for this.


rade commented on August 17, 2024

Then the user doesn't know which samples were stored, but I don't think we can do better.

We can say what did/didn't happen in the response body. That won't be easy for the user to track down (would the sending prom even log it?), but it's better than nothing.


juliusv commented on August 17, 2024

The Prometheus server doesn't look at the response body at all, but yeah, theoretically a user could tcpdump it. It would not be super useful, since the series that were rejected are not special in any way; they just happened to be the first ones that were "one too many".

On a technical level, reporting these details back would require changing the gRPC response from the ingesters to include this information and the distributor to then merge it and send the failed series back to the user. Since I don't believe it'll ever be even seen by anyone, I doubt the value of that.


rade commented on August 17, 2024

The Prometheus server doesn't look at the response body at all

That seems wrong. Surely it should log any errors it gets back.

the series that were rejected are not special in any way, they just happened to be the first ones that were "one too much".

What makes them special is that cortex has thrown away some of the data. And that is of interest to users, I would have thought.

On a technical level, (it's complicated)

Fair enough. Not part of the MVP then. As I said, let's be safe before being nice.


juliusv commented on August 17, 2024

The Prometheus server doesn't look at the response body at all

That seems wrong. Surely it should log any errors it gets back.

It logs the remote write send failure based on the HTTP status code, but does not inspect the response body, or expect anything to be in it.

What makes them special is that cortex has thrown away some of the data. And that is of interest to users, I would have thought.

True. Though if they hit this situation, they will be more interested in finding out which metric of theirs is currently causing the blowup.

Fair enough. Not part of the MVP then. As I said, let's be safe before being nice.

Yup.


rade commented on August 17, 2024

(prometheus) does not inspect the (error) response body, or expect anything to be in it.

That's what I meant by "wrong" :)


juliusv commented on August 17, 2024

(prometheus) does not inspect the (error) response body, or expect anything to be in it.

That's what I meant by "wrong" :)

Well, it's not part of our generic write protocol to return anything in the response body... but anyways :)


juliusv commented on August 17, 2024

#273 implemented a total series limit per user and ingester. Tom suggested also limiting the per-metric cardinality, which I'm looking at next.


juliusv commented on August 17, 2024

@tomwilkie for checking the current number of series for a metric, the index has a nested map map[model.LabelName]map[model.LabelValue][]model.Fingerprint (https://github.com/weaveworks/cortex/blob/master/ingester/index.go#L13), but I think it'd be unwise to iterate through it and add up the number of fingerprints at the leaves for every sample we ingest. So I propose adding another seriesPerMetric map[model.LabelName]int map to the index that just tracks how many series there currently are for which metric. Sounds good?
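
The proposed bookkeeping could look something like this — a hypothetical sketch (simplified to plain strings rather than the model types, and with an invented limit field) of keeping a per-metric counter next to the index so the check is O(1) per sample instead of walking the nested map:

```go
package main

import (
	"errors"
	"fmt"
)

var errPerMetricLimit = errors.New("per-metric series limit exceeded")

// index keeps a running series count per metric alongside the inverted index,
// updated as series are created and removed, so no leaf-walking is needed.
type index struct {
	seriesPerMetric map[string]int // metric name → current number of series
	maxPerMetric    int
}

// addSeries is called when a new fingerprint is first added for a metric.
func (i *index) addSeries(metric string) error {
	if i.seriesPerMetric[metric] >= i.maxPerMetric {
		return errPerMetricLimit
	}
	i.seriesPerMetric[metric]++
	return nil
}

// removeSeries is called when a series goes away (e.g. closed as inactive),
// so the metric regains headroom under the limit.
func (i *index) removeSeries(metric string) {
	if i.seriesPerMetric[metric] > 0 {
		i.seriesPerMetric[metric]--
	}
}

func main() {
	idx := &index{seriesPerMetric: map[string]int{}, maxPerMetric: 2}
	fmt.Println(idx.addSeries("http_requests_total")) // <nil>
	fmt.Println(idx.addSeries("http_requests_total")) // <nil>
	fmt.Println(idx.addSeries("http_requests_total")) // per-metric series limit exceeded
}
```

The counter must be decremented wherever series are dropped from the index, or the limit slowly becomes permanent as inactive series are closed.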


tomwilkie commented on August 17, 2024

