Comments (8)
Yes it's documented in the comment you've quoted. The count is only for display purposes, so being exact is not super important.
You can always override count
if you want an exact count, you can access csv_content
on the Task if you need to.
The count
method is called once when the job starts, and we didn't want to parse huge CSV files at that point, so we went with an approximation (since once again count
is only for display).
Due to the design of Active Job/StringIO, we haven't found a way to stream the CSV (we can download in chunk, but we can't feed the chunks as the input to CSV). We're downloading the file in its entirety instead and keeping it in memory, so the reason is not memory savings, but instead really just CPU perf due to parsing the CSV. For example, parsing a 1 million line CSV takes 2 seconds on my machine vs less than 2ms to simply count the newlines.
from maintenance_tasks.
@etiennebarrie after some thought, I think I'm actually in favour of changing CSV counting to parse the CSV as you've suggested. We already call out in the README that you can override #collection
to avoid running a large query, so we could add a line there about doing the same for large CSVs (or suggest the faster implementation we've been using so far). I think we could get away with it in a minor bump as well -- it's not really a breaking change, rather an implementation detail that could have an impact for future tasks that accept large CSVs.
20s as a one-time cost on a maintenance task that's going to take a significant amount of time anyways doesn't feel too bad IMHO. I guess if it scales beyond 10mil rows, we start encroaching on the interruptibility window, but a slow query would do the same...
from maintenance_tasks.
To get the expected line number:
count = CSV.parse(task.csv_content, headers: true).count
I wonder if it has been considered but not included for performance or memory concerns.
from maintenance_tasks.
@etiennebarrie Merci Γtienne pour l'explication.
I'm puzzled because at some point, the CSV needs to be parsed before to be given to the process
method, isn't it?
Could Maintenance Tasks do or update the count at that time?
from maintenance_tasks.
Yes, the CSV is parsed by job-iteration at some point, but this is done as a stream, just reading enough data to emit a row. So we don't know how many rows there are until we're finished basically.
We could change the display a bit, but we do handle all the different cases:
These are multiple runs of UpdatePostsTask, when count
is defined as returning respectively 8, nil
, 12, and 10. There are always 10 Post records.
So when count
is not defined in the task, and it's the default implementation for CSV, we can have an incorrect count, but then we handle it like we would do if you had provided an incorrect count.
One solution would be to set the total to the number of rows we've found at the end of a run, but then we would lose some information, namely the expected count that was defined by the task (implicitly or explicitly).
We could also provide an option to csv_collection
, e.g. csv_collection exact_count: true
but that's not great.
At the very least, we should add this in the README, in the section about CSV. We mention the trailing newline, but don't explain why (it's in order to have the correct count, when rows don't include newlines). We could expand and explain that the count is based on the number of lines and will be off if rows includes newlines or if the trailing newline is missing, and maybe even give a one-liner to get correct counts for those cases.
def count = CSV.parse(csv_content, headers: true).count
Taking a step back, given that we've moved to automatically count the number of records for Active Record relations in v2.0.0, and that can also be very slow, it could be argued that it's fair to make the CSV implicit counting slower but correct by default, and leave it to task authors to override count with nil
or a faster implementation (if they don't expect newlines in rows) if they really need to.
@adrianna-chang-shopify @nvasilevski WDYT? Would that be too much of a breaking change for a minor bump? I.e. a 10-million-line CSV would cause the run to take 20 seconds at the beginning of the first job to count the lines. I think I'm in favor of just documenting this in the README.
from maintenance_tasks.
Thank you for taking the time to explain. I didn't know it was streaming the CSV.
I'm happy with an update of the readme and/or update the task template with the accurate count one-liner method commented.
from maintenance_tasks.
I'm happy with an update of the readme
Please do!
update the task template
But not that part. π If we feel display correctness is more important, then we should fix it, but we shouldn't just cause more code to be in all apps.
In the readme, we should mention that the count is only an approximation counting the newlines in the CSV, and that if they have newlines in their CSV and want an accurate total, they can override count
. My understanding is that most maintenance tasks CSVs are full of ids instead, and won't need that, so it shouldn't be in the template.
from maintenance_tasks.
I made a PR to update the #count
method: #991
I was not sure about updating for README too, so I made a different commit to revert it in the case it is not wanted.
from maintenance_tasks.
Related Issues (20)
- Weekly CI run failed
- Dangerous query method (method whose arguments are used as raw SQL) called with non-attribute argument(s) HOT 2
- Weekly CI run failed HOT 1
- How do we set `limit`? HOT 8
- No callback is called after Interrupted -> Resumed HOT 5
- Release notes HOT 1
- Web UI and Rails API mode HOT 2
- CSV Task Failure: ActiveRecord::NotNullViolation Error on UUID record_id HOT 2
- please add a CHANGELOG.md file HOT 4
- Weekly CI run failed
- Weekly CI run failed HOT 1
- Weekly CI run failed HOT 1
- Weekly CI run failed HOT 1
- Weekly CI run failed HOT 1
- Sorbet typing in templates HOT 1
- Figure out slow system tests
- Ensure `Rack::MethodOverride` HOT 6
- Weekly CI run failed
- Weekly CI run failed HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from maintenance_tasks.