Comments (12)

kuizhiqing commented on August 17, 2024

what would be easier to understand, with_events or a verbose flag?

@andreyvelich Well, you're right, it would be better to use a verbose option without exposing the event definition.

from training-operator.

terrytangyuan commented on August 17, 2024

Good idea. +1

tenzen-y commented on August 17, 2024

It sounds good!

Also, showing the events of the FrameworkJob (e.g., TFJob) at the top might be helpful.
@andreyvelich WDYT?

johnugeorge commented on August 17, 2024

Should this be a separate API, to provide more clarity?

kuizhiqing commented on August 17, 2024

It would be helpful!

Maybe we should add a new API, get_job_events, to print the events of the job and its pods. We could also add a new argument, with_events=False, to the get_job_logs API so that all the information is available from the same API.
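A minimal sketch of what this proposal could look like, purely for illustration. It does not call the real SDK or a cluster: FakePod and its sample data are hypothetical stand-ins, and the signatures of get_job_events and get_job_logs below are assumptions, not the SDK's actual ones.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakePod:
    """Hypothetical stand-in for a pod; a real client would query the cluster."""
    name: str
    logs: str
    events: List[str] = field(default_factory=list)


def get_job_events(pods: List[FakePod]) -> Dict[str, List[str]]:
    """Return the events recorded for each pod of the job."""
    return {pod.name: list(pod.events) for pod in pods}


def get_job_logs(pods: List[FakePod], with_events: bool = False) -> Dict[str, str]:
    """Return logs per pod; with_events=True prepends each pod's events."""
    result = {}
    for pod in pods:
        text = pod.logs
        if with_events and pod.events:
            header = "\n".join(f"[event] {event}" for event in pod.events)
            text = f"{header}\n{text}"
        result[pod.name] = text
    return result


# Made-up data: the worker never started, so it has events but no logs.
pods = [
    FakePod("my-job-master-0", "epoch 1 done", ["Scheduled", "Pulled image"]),
    FakePod("my-job-worker-0", "", ["FailedScheduling: insufficient cpu"]),
]
logs_only = get_job_logs(pods)
logs_and_events = get_job_logs(pods, with_events=True)
```

With with_events=False the worker's entry is an empty string, which is exactly the case where a user has nothing to look at; with the flag set, the scheduling failure shows up in the same output.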

andreyvelich commented on August 17, 2024

Also, showing the events of the FrameworkJob (e.g., TFJob) at the top might be helpful.

@tenzen-y That sounds good. The question is how to identify which Job the user created. get_job_logs doesn't take job_type as an input argument; we just check which pods have these labels:

training.kubeflow.org/job-name=my-job
training.kubeflow.org/job-role=master

Multiple jobs of different kinds (e.g. PyTorchJob, XGBoostJob) could carry the same labels.
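To illustrate the ambiguity, here is a small self-contained sketch. The pod names and the owner_kind field are made up for the example; only the two label keys come from the thread, and the matching is done in plain Python rather than against a cluster.

```python
from typing import Dict, List

# The two labels quoted above.
labels = {
    "training.kubeflow.org/job-name": "my-job",
    "training.kubeflow.org/job-role": "master",
}


def build_label_selector(wanted: Dict[str, str]) -> str:
    """Render labels as an equality-based, comma-separated selector string."""
    return ",".join(f"{key}={value}" for key, value in wanted.items())


def select_pods(pods: List[dict], wanted: Dict[str, str]) -> List[dict]:
    """Keep only pods whose labels contain every wanted key/value pair."""
    return [
        pod for pod in pods
        if all(pod["labels"].get(key) == value for key, value in wanted.items())
    ]


# Hypothetical pods: two job kinds happen to share the same job name.
pods = [
    {"name": "pod-a", "owner_kind": "PyTorchJob", "labels": dict(labels)},
    {"name": "pod-b", "owner_kind": "XGBoostJob", "labels": dict(labels)},
    {
        "name": "pod-c",
        "owner_kind": "TFJob",
        "labels": {**labels, "training.kubeflow.org/job-name": "other-job"},
    },
]

selector = build_label_selector(labels)
matched = select_pods(pods, labels)
owner_kinds = sorted({pod["owner_kind"] for pod in matched})
```

Here the selector matches pods owned by two different job kinds, so the labels alone cannot tell which Job the user meant.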

That ties into my other question: if we introduce a mandatory job_type argument to our get_job_logs API, should we follow the same pattern for all APIs?
E.g. instead of get_pytorchjob and get_tfjob, we would have a single API called get_job, which takes job_type as a mandatory argument and returns the appropriate job based on this type.

We can do the same for all other APIs: create_job, create_job_from_func, get_job, delete_job, etc.
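A rough sketch of the single-entry-point idea, under stated assumptions: the per-framework helpers below are hypothetical placeholders that return plain dicts, and the get_job signature is only one possible shape for such an API.

```python
from typing import Callable, Dict


# Hypothetical per-framework getters, standing in for job-specific calls.
def _get_pytorchjob(name: str) -> dict:
    return {"kind": "PyTorchJob", "name": name}


def _get_tfjob(name: str) -> dict:
    return {"kind": "TFJob", "name": name}


# One table maps job_type to the framework-specific implementation.
_GETTERS: Dict[str, Callable[[str], dict]] = {
    "PyTorchJob": _get_pytorchjob,
    "TFJob": _get_tfjob,
}


def get_job(name: str, job_type: str) -> dict:
    """Fetch a job of the given type; job_type is mandatory."""
    try:
        getter = _GETTERS[job_type]
    except KeyError:
        supported = ", ".join(sorted(_GETTERS))
        raise ValueError(
            f"Unsupported job_type {job_type!r}; expected one of: {supported}"
        ) from None
    return getter(name)


job = get_job("my-job", job_type="PyTorchJob")
```

The same dispatch table could back create_job and delete_job, so supporting a new framework would mean adding one entry rather than a new set of method names.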

After refactoring our SDK (#1719), I noticed that it is very confusing for users that some CRUD operations are job-specific (e.g. create_tfjob) while others are not (e.g. get_job_pod_names, get_job_logs).
I can create a separate issue to discuss this, and I am going to provide more feedback from users soon.
cc @kubeflow/wg-training-leads

andreyvelich commented on August 17, 2024

Should this be a separate API, to provide more clarity?

I am not sure whether users who are not familiar with Kubernetes should need to know the difference between events and logs.
Usually, when Data Scientists create an ML Job, they want to check the logs from this job directly (e.g. by running the get_job_logs API).
Otherwise, we would somehow have to explain to them that if the get_job_logs API fails, they should run the get_job_events API.
WDYT @johnugeorge?

We could also add a new argument, with_events=False, to the get_job_logs API so that all the information is available from the same API.

I like the idea, @kuizhiqing. What would be easier to understand, with_events or a verbose flag?

johnugeorge commented on August 17, 2024

+1 Agree with you @andreyvelich

tenzen-y commented on August 17, 2024

Multiple jobs of different kinds (e.g. PyTorchJob, XGBoostJob) could carry the same labels.

That ties into my other question: if we introduce a mandatory job_type argument to our get_job_logs API, should we follow the same pattern for all APIs?
E.g. instead of get_pytorchjob and get_tfjob, we would have a single API called get_job, which takes job_type as a mandatory argument and returns the appropriate job based on this type.

We can do the same for all other APIs: create_job, create_job_from_func, get_job, delete_job, etc.

After refactoring our SDK (#1719), I noticed that it is very confusing for users that some CRUD operations are job-specific (e.g. create_tfjob) while others are not (e.g. get_job_pod_names, get_job_logs).
I can create a separate issue to discuss this, and I am going to provide more feedback from users soon.

@andreyvelich Thanks for the clarification.
I agree with you. Let's work on events for XXXJob in another issue.

andreyvelich commented on August 17, 2024

/assign @andreyvelich

github-actions commented on August 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented on August 17, 2024

/assign @andreyvelich
