
Comments (4)

dvsrepo commented on June 10, 2024

Hi!

We should make it clearer that users need to use this:

https://distilabel.argilla.io/latest/technical-reference/pipeline/#dataset-checkpoints

from distilabel.

kcentric commented on June 10, 2024

Ah! Yes, I did see that the argument was available, but I didn't know exactly what it was for. That would be helpful.

Might it be possible to add a default value for dataset-checkpoints, just in case users don't use it?


plaguss commented on June 10, 2024

Hi @kcentric, the default for this argument is a DatasetCheckpoint with its default values, as can be seen in the API reference, which results in the dataset being returned at the end of the pipeline by default. The reason is that we couldn't set a "sensible" value for the checkpointing. For example, we could set the checkpoint strategy to always save the dataset every 100 records, but that number implies a different frequency for a dataset of 100, 1K or 10K records, and for a bigger dataset the default could mean that the dataset is saved too often, which could also be undesired. Thanks for your feedback, we will try to improve the default behaviour.
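To make the point about a fixed frequency concrete, here is a small plain-Python sketch. It does not use distilabel; `checkpoint_count` is a hypothetical helper that only counts how many saves a fixed "every N records" rule would trigger for different dataset sizes.

```python
def checkpoint_count(num_records: int, save_frequency: int) -> int:
    """Count the saves triggered when writing a checkpoint every
    `save_frequency` records (hypothetical helper, not a distilabel API)."""
    if save_frequency <= 0:
        raise ValueError("save_frequency must be a positive integer")
    return num_records // save_frequency


# The same fixed value of 100 behaves very differently as the dataset grows.
for size in (100, 1_000, 10_000):
    print(f"{size} records -> {checkpoint_count(size, 100)} saves")
```

With a fixed value of 100, the smallest dataset is saved once, while the largest is saved a hundred times, which is why a single hard-coded default is hard to justify.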


k-neev commented on June 10, 2024

No worries, thanks @plaguss. Maybe we could make the argument a required one (though that could add some complication), or set it to save every 1/10th of whatever dataset size we have. That might have its own implications, but personally I'd rather save more frequently than less frequently 😅. Anyway, I see now.
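A rough sketch of the "every 1/10th" idea, in plain Python. `proportional_save_frequency` and its `parts` parameter are hypothetical names used only for illustration, not part of distilabel; the result would still need to be passed to whatever checkpoint setting the library exposes.

```python
def proportional_save_frequency(num_records: int, parts: int = 10) -> int:
    """Return how many records to process between saves so that the dataset
    is saved roughly `parts` times in total (hypothetical helper)."""
    if num_records <= 0:
        raise ValueError("num_records must be a positive integer")
    if parts <= 0:
        raise ValueError("parts must be a positive integer")
    # Integer division avoids float rounding; never return less than 1 record.
    return max(1, num_records // parts)


for size in (100, 1_000, 10_000):
    print(f"{size} records -> save every {proportional_save_frequency(size)} records")
```

The frequency scales with the dataset, so every size gets about ten saves. For very small datasets the floor of 1 means a save after every record.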
