Tune chunk size (cortexproject/cortex, closed, 15 comments)

cortexproject commented on August 17, 2024
Tune chunk size

from cortex.

Comments (15)

jml commented on August 17, 2024

From @jml

Why should it be 1hr?

jml commented on August 17, 2024

From @tomwilkie

We should investigate what the right number is, but the parameters are:

  • Data batched up in the ingesters is at risk of loss in the event of machine failure; we should bound this.
  • Chunks are 1KB, and making chunks bigger doesn't necessarily make things more efficient; we should try to fill chunks as much as possible (something we are already monitoring for).

The ticket should really say "max 1hr" to bound the loss, if that gives good utilization.

jml commented on August 17, 2024

This is possibly related to the dynamo errors we are seeing in #85

juliusv commented on August 17, 2024

Oh wow yeah, the default chunk max age of 10 minutes seems way too low. I'm wondering why we're still achieving such decent chunk utilization (sum(cortex_ingester_chunk_utilization_sum) / sum(cortex_ingester_chunk_utilization_count) is around 0.43) with such a low max age. Under certain circumstances, chunks can last for hours or days, so maybe it's the frequent scraping plus noisiness of the data that makes the chunks fill up that fast. Still, I would set the max age to an hour or so (as you said, it depends a bit on our risk profile, of course).

tomwilkie commented on August 17, 2024

I suspect it can't flush chunks quickly enough, and therefore they are getting more than 10mins worth of data.

On Wednesday, 2 November 2016, Julius Volz wrote: [quote of the previous comment and email footer omitted]

juliusv commented on August 17, 2024

I suspect it can't flush chunks quickly enough, and therefore they are getting more than 10mins worth of data.

At least the failures should not have a big effect because during normal operation, only ~4% of chunk puts fail (sum(rate(cortex_ingester_chunk_store_failures_total[1m])) / sum(rate(cortex_ingester_chunk_utilization_count[1m])) -> 0.043). Maybe general latency in non-failed puts delays things somewhat, but the effect cannot be huge, as sum(cortex_ingester_memory_chunks) / sum(cortex_ingester_memory_series) shows us that there's just 1.12 chunks per series in memory at a given time (there's always at least one open head chunk for active series).

tomwilkie commented on August 17, 2024

That actually makes sense, since we're on double-delta (not varbit). So it's about 3.3 bytes per sample, which at a 15s scrape interval is about 20mins per chunk. With a 10min max age, you'd expect 50% utilisation.

juliusv commented on August 17, 2024

Hmm, how do you get to 20 mins per chunk at 15s scrape interval and 3.3 bytes per sample? 1024 / 3.3 = 310 samples per chunk, but 20 minutes of samples would only be 4 * 20 = 80 samples? So a chunk should be full after ~ 310 / 4 = 77 minutes. Or am I missing something stupid?
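
The corrected arithmetic can be sanity-checked with a quick sketch (all figures taken from this thread: 1KB chunks, ~3.3 bytes per double-delta sample, 15s scrape interval):

```python
# Back-of-envelope chunk fill time, using the rough figures from this thread.
chunk_size_bytes = 1024
bytes_per_sample = 3.3     # approximate double-delta encoding cost
scrape_interval_s = 15

samples_per_chunk = chunk_size_bytes / bytes_per_sample        # ~310 samples
fill_time_min = samples_per_chunk * scrape_interval_s / 60     # ~77 minutes
```

So with a 10min max age, chunks would be cut long before they fill naturally, which is what makes the observed utilization worth explaining.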

tomwilkie commented on August 17, 2024

Nope, I was being stupid. I did 300/15 not 300*15.

tomwilkie commented on August 17, 2024

Okay, a bit more progress: the 99th percentile chunk "age" on flush is 27mins. This could explain the higher utilisation. I just added a dashboard for it and will link to it when it's live.

http://frontend.dev.weave.works/admin/grafana/dashboard/file/cortex-chunks.json

tomwilkie commented on August 17, 2024

So, the question is why are some chunks 27mins old?

Thoughts:

  • it takes 0.8s on average to flush a single chunk (0.07s for S3 + 0.7s for DynamoDB)
  • we limit concurrent chunk flushes to 100
  • we need to flush 20k chunks every 10mins
  • which means we should be able to flush all chunks in ~3mins
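
The bullet points above imply a flush capacity along these lines (a rough estimate from the quoted figures, not a measured value):

```python
# Rough flush-capacity estimate from the numbers above:
# 0.8s avg per chunk flush, 100 concurrent flushes, 20k chunks per 10min cycle.
flush_latency_s = 0.8
max_concurrent_flushes = 100
chunks_per_cycle = 20_000

chunks_per_second = max_concurrent_flushes / flush_latency_s     # ~125 chunks/s
time_to_flush_min = chunks_per_cycle / chunks_per_second / 60    # ~2.7 minutes
```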

tomwilkie commented on August 17, 2024

Except:

  • we need to write to DynamoDB multiple times for each chunk (due to indexing)
  • we batch DynamoDB writes (currently doing about 200qps and using about 600 capacity units/s in DynamoDB, so the batch size is ~3?)

tomwilkie commented on August 17, 2024

The average number of index entries per chunk is 8.6 here.

And it's no coincidence that 8.6 * 3mins is roughly 27mins, which is the 99th-percentile chunk age...
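
Putting the last few comments together, the inferred batch size and the observed chunk age both fall out of the thread's rough figures (all inputs below are assumptions quoted in this thread, not measurements):

```python
# Inferred DynamoDB batch size and resulting effective chunk age.
write_qps = 200              # batched DynamoDB writes per second (from thread)
capacity_units_per_s = 600   # consumed write capacity per second (from thread)
batch_size = capacity_units_per_s / write_qps      # ~3 items per batch write

index_entries_per_chunk = 8.6
flush_pass_min = 3           # approximate time for one full flush pass
# ~26 minutes, close to the observed 27min 99th-percentile chunk age
effective_age_min = index_entries_per_chunk * flush_pass_min
```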

tomwilkie commented on August 17, 2024

With the latest change, we may be writing chunks more than once. Needs fixing.

tomwilkie commented on August 17, 2024

Set to 1hr and behaving as expected in #118

