When a CSV has new lines inside a cell, the default count is incorrect. <p dir="au

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

To get the expected line number: <div class="highlight highlight-source-ruby notra

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

I made a PR to update the #count method: <a class="is

CSV count wrong when new line in cell about maintenance_tasks HOT 8 CLOSED

D-system commented on September 24, 2024

CSV count wrong when new line in cell

from maintenance_tasks.

Comments (8)

etiennebarrie commented on September 24, 2024 1

Yes it's documented in the comment you've quoted. The count is only for display purposes, so being exact is not super important.

You can always override count if you want an exact count, you can access csv_content on the Task if you need to.

The count method is called once when the job starts, and we didn't want to parse huge CSV files at that point, so we went with an approximation (since once again count is only for display).

Due to the design of Active Job/StringIO, we haven't found a way to stream the CSV (we can download in chunk, but we can't feed the chunks as the input to CSV). We're downloading the file in its entirety instead and keeping it in memory, so the reason is not memory savings, but instead really just CPU perf due to parsing the CSV. For example, parsing a 1 million line CSV takes 2 seconds on my machine vs less than 2ms to simply count the newlines.

from maintenance_tasks.

adrianna-chang-shopify commented on September 24, 2024 1

@etiennebarrie after some thought, I think I'm actually in favour of changing CSV counting to parse the CSV as you've suggested. We already call out in the README that you can override #collection to avoid running a large query, so we could add a line there about doing the same for large CSVs (or suggest the faster implementation we've been using so far). I think we could get away with it in a minor bump as well -- it's not really a breaking change, rather an implementation detail that could have an impact for future tasks that accept large CSVs.

20s as a one-time cost on a maintenance task that's going to take a significant amount of time anyways doesn't feel too bad IMHO. I guess if it scales beyond 10mil rows, we start encroaching on the interruptibility window, but a slow query would do the same...

from maintenance_tasks.

D-system commented on September 24, 2024

To get the expected line number:

count = CSV.parse(task.csv_content, headers: true).count

I wonder if it has been considered but not included for performance or memory concerns.

from maintenance_tasks.

D-system commented on September 24, 2024

@etiennebarrie Merci Étienne pour l'explication.
I'm puzzled because at some point, the CSV needs to be parsed before to be given to the process method, isn't it?
Could Maintenance Tasks do or update the count at that time?

from maintenance_tasks.

etiennebarrie commented on September 24, 2024

Yes, the CSV is parsed by job-iteration at some point, but this is done as a stream, just reading enough data to emit a row. So we don't know how many rows there are until we're finished basically.

We could change the display a bit, but we do handle all the different cases:

These are multiple runs of UpdatePostsTask, when count is defined as returning respectively 8, nil, 12, and 10. There are always 10 Post records.

So when count is not defined in the task, and it's the default implementation for CSV, we can have an incorrect count, but then we handle it like we would do if you had provided an incorrect count.

One solution would be to set the total to the number of rows we've found at the end of a run, but then we would lose some information, namely the expected count that was defined by the task (implicitly or explicitly).

We could also provide an option to csv_collection, e.g. csv_collection exact_count: true but that's not great.

At the very least, we should add this in the README, in the section about CSV. We mention the trailing newline, but don't explain why (it's in order to have the correct count, when rows don't include newlines). We could expand and explain that the count is based on the number of lines and will be off if rows includes newlines or if the trailing newline is missing, and maybe even give a one-liner to get correct counts for those cases.

  def count = CSV.parse(csv_content, headers: true).count

Taking a step back, given that we've moved to automatically count the number of records for Active Record relations in v2.0.0, and that can also be very slow, it could be argued that it's fair to make the CSV implicit counting slower but correct by default, and leave it to task authors to override count with nil or a faster implementation (if they don't expect newlines in rows) if they really need to.

@adrianna-chang-shopify @nvasilevski WDYT? Would that be too much of a breaking change for a minor bump? I.e. a 10-million-line CSV would cause the run to take 20 seconds at the beginning of the first job to count the lines. I think I'm in favor of just documenting this in the README.

from maintenance_tasks.

D-system commented on September 24, 2024

Thank you for taking the time to explain. I didn't know it was streaming the CSV.
I'm happy with an update of the readme and/or update the task template with the accurate count one-liner method commented.

from maintenance_tasks.

etiennebarrie commented on September 24, 2024

I'm happy with an update of the readme

Please do!

update the task template

But not that part. 😄 If we feel display correctness is more important, then we should fix it, but we shouldn't just cause more code to be in all apps.

In the readme, we should mention that the count is only an approximation counting the newlines in the CSV, and that if they have newlines in their CSV and want an accurate total, they can override count. My understanding is that most maintenance tasks CSVs are full of ids instead, and won't need that, so it shouldn't be in the template.

from maintenance_tasks.

D-system commented on September 24, 2024

I made a PR to update the #count method: #991
I was not sure about updating for README too, so I made a different commit to revert it in the case it is not wanted.

from maintenance_tasks.

CSV count wrong when new line in cell about maintenance_tasks HOT 8 CLOSED

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent