Coder Social home page Coder Social logo

Comments (8)

etiennebarrie avatar etiennebarrie commented on June 12, 2024 1

Yes it's documented in the comment you've quoted. The count is only for display purposes, so being exact is not super important.

You can always override count if you want an exact count, you can access csv_content on the Task if you need to.

The count method is called once when the job starts, and we didn't want to parse huge CSV files at that point, so we went with an approximation (since once again count is only for display).

Due to the design of Active Job/StringIO, we haven't found a way to stream the CSV (we can download in chunk, but we can't feed the chunks as the input to CSV). We're downloading the file in its entirety instead and keeping it in memory, so the reason is not memory savings, but instead really just CPU perf due to parsing the CSV. For example, parsing a 1 million line CSV takes 2 seconds on my machine vs less than 2ms to simply count the newlines.

from maintenance_tasks.

adrianna-chang-shopify avatar adrianna-chang-shopify commented on June 12, 2024 1

@etiennebarrie after some thought, I think I'm actually in favour of changing CSV counting to parse the CSV as you've suggested. We already call out in the README that you can override #collection to avoid running a large query, so we could add a line there about doing the same for large CSVs (or suggest the faster implementation we've been using so far). I think we could get away with it in a minor bump as well -- it's not really a breaking change, rather an implementation detail that could have an impact for future tasks that accept large CSVs.

20s as a one-time cost on a maintenance task that's going to take a significant amount of time anyways doesn't feel too bad IMHO. I guess if it scales beyond 10mil rows, we start encroaching on the interruptibility window, but a slow query would do the same...

from maintenance_tasks.

D-system avatar D-system commented on June 12, 2024

To get the expected line number:

count = CSV.parse(task.csv_content, headers: true).count

I wonder if it has been considered but not included for performance or memory concerns.

from maintenance_tasks.

D-system avatar D-system commented on June 12, 2024

@etiennebarrie Merci Γ‰tienne pour l'explication.
I'm puzzled because at some point, the CSV needs to be parsed before to be given to the process method, isn't it?
Could Maintenance Tasks do or update the count at that time?

from maintenance_tasks.

etiennebarrie avatar etiennebarrie commented on June 12, 2024

Yes, the CSV is parsed by job-iteration at some point, but this is done as a stream, just reading enough data to emit a row. So we don't know how many rows there are until we're finished basically.

We could change the display a bit, but we do handle all the different cases:
image
These are multiple runs of UpdatePostsTask, when count is defined as returning respectively 8, nil, 12, and 10. There are always 10 Post records.

So when count is not defined in the task, and it's the default implementation for CSV, we can have an incorrect count, but then we handle it like we would do if you had provided an incorrect count.

One solution would be to set the total to the number of rows we've found at the end of a run, but then we would lose some information, namely the expected count that was defined by the task (implicitly or explicitly).

We could also provide an option to csv_collection, e.g. csv_collection exact_count: true but that's not great.

At the very least, we should add this in the README, in the section about CSV. We mention the trailing newline, but don't explain why (it's in order to have the correct count, when rows don't include newlines). We could expand and explain that the count is based on the number of lines and will be off if rows includes newlines or if the trailing newline is missing, and maybe even give a one-liner to get correct counts for those cases.

  def count = CSV.parse(csv_content, headers: true).count

Taking a step back, given that we've moved to automatically count the number of records for Active Record relations in v2.0.0, and that can also be very slow, it could be argued that it's fair to make the CSV implicit counting slower but correct by default, and leave it to task authors to override count with nil or a faster implementation (if they don't expect newlines in rows) if they really need to.

@adrianna-chang-shopify @nvasilevski WDYT? Would that be too much of a breaking change for a minor bump? I.e. a 10-million-line CSV would cause the run to take 20 seconds at the beginning of the first job to count the lines. I think I'm in favor of just documenting this in the README.

from maintenance_tasks.

D-system avatar D-system commented on June 12, 2024

Thank you for taking the time to explain. I didn't know it was streaming the CSV.
I'm happy with an update of the readme and/or update the task template with the accurate count one-liner method commented.

from maintenance_tasks.

etiennebarrie avatar etiennebarrie commented on June 12, 2024

I'm happy with an update of the readme

Please do!

update the task template

But not that part. πŸ˜„ If we feel display correctness is more important, then we should fix it, but we shouldn't just cause more code to be in all apps.


In the readme, we should mention that the count is only an approximation counting the newlines in the CSV, and that if they have newlines in their CSV and want an accurate total, they can override count. My understanding is that most maintenance tasks CSVs are full of ids instead, and won't need that, so it shouldn't be in the template.

from maintenance_tasks.

D-system avatar D-system commented on June 12, 2024

I made a PR to update the #count method: #991
I was not sure about updating for README too, so I made a different commit to revert it in the case it is not wanted.

from maintenance_tasks.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.