
Comments (17)

juliusv commented on August 17, 2024

There's a question here of whether we want to limit the total number of series per user or per ingester. Maybe both.

From a system safety perspective, we'll want to limit the number of series per ingester. But that is hard to communicate to the user. They will want to know how many series they can create overall, because they don't understand the uneven distribution of their series over ingesters (similarly to how we stumbled over Dynamo table throughput issues with uneven table shards).

Limiting the per-user series on each ingester would be technically easiest though, because the necessary state is readily available. Given that this is necessary for basic safety, we will want to have some limit here in any case, even if it's pretty high.

I'm not sure how we would track the total number of series for a user as another limit. The distributor would either have to get stats from the ingesters on every append (probably infeasible) or the distributor would have to track metric cardinality itself. Doing full cardinality tracking in the distributor would use too many resources, but maybe it could be done approximately with HyperLogLog.

from cortex.

tomwilkie commented on August 17, 2024

Limiting the per-user series on each ingester would be technically easiest though, because the necessary state is readily available. Given that this is necessary for basic safety, we will want to have some limit here in any case, even if it's pretty high.

Yes, I think this is the best place to start. We can always offer users this limit as a lower bound.

Doing full cardinality tracking in the distributor would use too many resources, but maybe it could be done approximately with HyperLogLog.

That's an interesting idea; it would need some kind of moving average as well, since cardinality over time is virtually unlimited — we close inactive series after 1h.
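
One way to bound the window, sketched here exactly rather than with sketches for clarity (a hypothetical illustration, not Cortex code): keep rotating per-slice sets and count distinct series over the last N slices, mirroring the 1h inactive-series cutoff. The same rotation scheme would apply to per-slice HLLs, merged by taking register-wise maxima.

```go
package main

import "fmt"

// windowTracker counts distinct series over the last len(buckets) time slices.
type windowTracker struct {
	buckets []map[string]struct{}
	cur     int
}

func newWindowTracker(n int) *windowTracker {
	w := &windowTracker{buckets: make([]map[string]struct{}, n)}
	for i := range w.buckets {
		w.buckets[i] = map[string]struct{}{}
	}
	return w
}

// Observe records a series in the current time slice.
func (w *windowTracker) Observe(series string) { w.buckets[w.cur][series] = struct{}{} }

// Rotate advances to the next slice, dropping the oldest one.
func (w *windowTracker) Rotate() {
	w.cur = (w.cur + 1) % len(w.buckets)
	w.buckets[w.cur] = map[string]struct{}{}
}

// Active returns the number of distinct series seen anywhere in the window.
func (w *windowTracker) Active() int {
	seen := map[string]struct{}{}
	for _, b := range w.buckets {
		for s := range b {
			seen[s] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	w := newWindowTracker(2) // current slice + one previous slice
	w.Observe("a")
	w.Observe("b")
	w.Rotate()
	w.Observe("b")
	w.Observe("c")
	fmt.Println(w.Active()) // a, b, c all still inside the window → 3
	w.Rotate()              // "a" ages out
	fmt.Println(w.Active()) // b, c → 2
}
```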


rade commented on August 17, 2024

What are we proposing to do when the limit is exceeded? Throw away the data cortex has received? How will a user know that (and why) this is happening?


juliusv commented on August 17, 2024

@rade Throw away the samples and return an error to the user. For the rate-limiting case, we do the same and return a 429 Too Many Requests, which wouldn't be fully accurate here. I don't know what the best HTTP response code would be, maybe 418 I'm a teapot :)

At first, this would require a user to meta-monitor their Prometheus scraper for failed remote writes, but eventually we could notify users automatically when we see that they are being continuously denied.


tomwilkie commented on August 17, 2024

We need to make it clear that it's the product of the label cardinalities for a given metric that's problematic for us. If we detect such a metric, we should blacklist it, but not drop the entire batch.
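
To illustrate why the product is the problem (a toy example, with made-up label counts): each label is individually harmless, but the worst-case series count is the product of their value cardinalities.

```go
package main

import "fmt"

// cardinalityProduct returns the worst-case number of series for a metric,
// given the number of distinct values observed per label.
func cardinalityProduct(valuesPerLabel map[string]int) int {
	product := 1
	for _, n := range valuesPerLabel {
		product *= n
	}
	return product
}

func main() {
	// 100 instances × 50 paths × 5 status codes — modest labels, large product.
	fmt.Println(cardinalityProduct(map[string]int{
		"instance":    100,
		"path":        50,
		"status_code": 5,
	})) // 25000
}
```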


rade commented on August 17, 2024

Getting something to show up in the Weave Cloud cortex UI would be nice. And, going totally meta... feed a metric into the instance's cortex, which the user can set an alert on :)


rade commented on August 17, 2024

This is all post MVP, obviously. Let's be safe before being nice.


juliusv commented on August 17, 2024

We could still store other samples, but should still return an HTTP error, because a 200 would not seem ok in that case. Then the user doesn't know which samples were stored, but I don't think we can do better.

Yeah, in the future we can have nice UI features and meta-metrics for this.


rade commented on August 17, 2024

Then the user doesn't know which samples were stored, but I don't think we can do better.

We can say what did/didn't happen in the response body. That won't be easy for the user to track down (would the sending prom even log it?), but it's better than nothing.


juliusv commented on August 17, 2024

The Prometheus server doesn't look at the response body at all, but yeah, theoretically a user could tcpdump it. It would not be super useful, since the series that were rejected are not special in any way; they just happened to be the first ones that were "one too many".

On a technical level, reporting these details back would require changing the gRPC response from the ingesters to include this information and the distributor to then merge it and send the failed series back to the user. Since I don't believe it'll ever be even seen by anyone, I doubt the value of that.


rade commented on August 17, 2024

The Prometheus server doesn't look at the response body at all

That seems wrong. Surely it should log any errors it gets back.

the series that were rejected are not special in any way, they just happened to be the first ones that were "one too much".

What makes them special is that cortex has thrown away some of the data. And that is of interest to users, I would have thought.

On a technical level, (it's complicated)

Fair enough. Not part of the MVP then. As I said, let's be safe before being nice.


juliusv commented on August 17, 2024

The Prometheus server doesn't look at the response body at all

That seems wrong. Surely it should log any errors it gets back.

It logs the remote write send failure based on the HTTP status code, but does not inspect the response body, or expect anything to be in it.

What makes them special is that cortex has thrown away some of the data. And that is of interest to users, I would have thought.

True. Though if they hit this situation, they will be more interested in finding out which metric of theirs is currently causing the blowup.

Fair enough. Not part of the MVP then. As I said, let's be safe before being nice.

Yup.


rade commented on August 17, 2024

(prometheus) does not inspect the (error) response body, or expect anything to be in it.

That's what I meant by "wrong" :)


juliusv commented on August 17, 2024

(prometheus) does not inspect the (error) response body, or expect anything to be in it.

That's what I meant by "wrong" :)

Well, it's not part of our generic write protocol to return anything in the response body... but anyways :)


juliusv commented on August 17, 2024

#273 implemented a total series limit per user and ingester. Tom suggested also limiting the per-metric cardinality, which I'm looking at next.


juliusv commented on August 17, 2024

@tomwilkie for checking the current number of series for a metric, the index has a nested map map[model.LabelName]map[model.LabelValue][]model.Fingerprint (https://github.com/weaveworks/cortex/blob/master/ingester/index.go#L13), but I think it'd be unwise to iterate through it and add up the number of fingerprints at the leaves for every sample we ingest. So I propose adding another seriesPerMetric map[model.LabelName]int map to the index that just tracks how many series there currently are for which metric. Sounds good?
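
The proposed bookkeeping could look something like this — a hypothetical sketch (simplified to plain strings rather than the model types, and with an invented limit field) of keeping a per-metric counter next to the index so the check is O(1) per sample instead of walking the nested map:

```go
package main

import (
	"errors"
	"fmt"
)

var errPerMetricLimit = errors.New("per-metric series limit exceeded")

// index keeps a running series count per metric alongside the inverted index,
// updated as series are created and removed, so no leaf-walking is needed.
type index struct {
	seriesPerMetric map[string]int // metric name → current number of series
	maxPerMetric    int
}

// addSeries is called when a new fingerprint is first added for a metric.
func (i *index) addSeries(metric string) error {
	if i.seriesPerMetric[metric] >= i.maxPerMetric {
		return errPerMetricLimit
	}
	i.seriesPerMetric[metric]++
	return nil
}

// removeSeries is called when a series goes away (e.g. closed as inactive),
// so the metric regains headroom under the limit.
func (i *index) removeSeries(metric string) {
	if i.seriesPerMetric[metric] > 0 {
		i.seriesPerMetric[metric]--
	}
}

func main() {
	idx := &index{seriesPerMetric: map[string]int{}, maxPerMetric: 2}
	fmt.Println(idx.addSeries("http_requests_total")) // <nil>
	fmt.Println(idx.addSeries("http_requests_total")) // <nil>
	fmt.Println(idx.addSeries("http_requests_total")) // per-metric series limit exceeded
}
```

The counter must be decremented wherever series are dropped from the index, or the limit slowly becomes permanent as inactive series are closed.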


tomwilkie commented on August 17, 2024

