I think it would be useful if the operator used structured logging.

Structured Logging For the operator about training-operator HOT 12 CLOSED

kubeflow commented on August 11, 2024

Structured Logging For the operator

from training-operator.

Comments (12)

jlewi commented on August 11, 2024

Lots of workarounds are discussed in sirupsen/logrus#63. Prometheus implemented it this way, so we could potentially copy them.

jlewi commented on August 11, 2024

I'm adding this to our next milestone because I think structured logging will be critical to scaling. As we scale up to more jobs and larger jobs, we will need to be able to easily filter logs by pod, job, etc. to get to the relevant logs.

gaocegege commented on August 11, 2024

Personally, I recommend glog, since most repos in the Kubernetes community use it.

ScorpioCPH commented on August 11, 2024

+1 for glog :)

jlewi commented on August 11, 2024

If we use glog, is there a way to output JSON logs with metadata such as the job and replica that a log message is associated with?

gaocegege commented on August 11, 2024

I am afraid not 🤔, since the docs do not mention any function for it: https://godoc.org/github.com/golang/glog

jlewi commented on August 11, 2024

With glog, how do we make it really easy to filter the TFJob operator logs so we can see the log messages for a particular job?

I think this will be super useful for debugging and troubleshooting.

If we use structured logging then we can add a tag corresponding to the job name. Then it should be very easy to filter the logs to find all log messages for a particular job.
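
To make the tagging idea concrete, here is a minimal sketch. It uses the Go standard library's log/slog (Go 1.21+) as a stand-in for logrus so that it runs without third-party dependencies. The job and replica names, and the helper names emitLogs and filterByJob, are made up for illustration and are not taken from the operator.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// emitLogs returns a few JSON log lines, each tagged with the job (and
// replica) it belongs to. The job and replica names are hypothetical.
func emitLogs() []byte {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))

	// Every message logged through jobA or jobB carries its "job" field.
	jobA := logger.With("job", "mnist-train")
	jobB := logger.With("job", "resnet-train")

	jobA.Info("creating replica", "replica", "worker-0")
	jobB.Info("creating replica", "replica", "ps-0")
	jobA.Warn("replica restarted", "replica", "worker-0")
	return buf.Bytes()
}

// filterByJob returns the "msg" field of every log line whose "job" field
// equals job. Lines that are not valid JSON are skipped.
func filterByJob(logs []byte, job string) []string {
	var msgs []string
	scanner := bufio.NewScanner(bytes.NewReader(logs))
	for scanner.Scan() {
		var entry map[string]any
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			continue
		}
		if entry["job"] == job {
			if msg, ok := entry["msg"].(string); ok {
				msgs = append(msgs, msg)
			}
		}
	}
	return msgs
}

func main() {
	for _, msg := range filterByJob(emitLogs(), "mnist-train") {
		fmt.Println(msg)
	}
}
```

With logrus, the equivalent tagging step would presumably go through WithFields on a logger configured with the JSON formatter; the filtering side stays the same because it only depends on the JSON output.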

jlewi commented on August 11, 2024

This solution looks promising
sirupsen/logrus#63 (comment)
sirupsen/logrus#63 (comment)

I believe this solution just uses the filename hook
https://github.com/onrik/logrus

I think we can just define a logrus logger with that hook and it will work.
https://github.com/onrik/logrus

Would be great if someone could just try it out using the example here:
https://github.com/onrik/logrus
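
As a rough illustration of what a filename hook contributes, the sketch below attaches the caller's file and line to a log entry using only the standard library. It assumes the hook's job is to record the call site; it does not reproduce the onrik/logrus implementation, and the entry type and newEntry helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path/filepath"
	"runtime"
)

// entry is a hypothetical log record with a source location field.
type entry struct {
	Msg    string `json:"msg"`
	Source string `json:"source"`
}

// newEntry builds a log entry that records the file and line of the
// function that called it.
func newEntry(msg string) entry {
	source := "unknown"
	// Caller(1) skips newEntry itself and reports its caller.
	if _, file, line, ok := runtime.Caller(1); ok {
		source = fmt.Sprintf("%s:%d", filepath.Base(file), line)
	}
	return entry{Msg: msg, Source: source}
}

func main() {
	out, err := json.Marshal(newEntry("reconciling job"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

In a real hook the number of stack frames to skip would differ, because the logging library's own frames sit between the hook and the call site.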

ankushagarwal commented on August 11, 2024

I'm looking into this. I will try the filenameHook from onrik/logrus and post the results here

gaocegege commented on August 11, 2024

We now use the flag package to support command-line flags, and glog also uses it by default. As a result, our binary has more flags than we think, even though we use logrus instead of glog:

➜  tf-operator git:(416) ✗ ./tf-operator -h               
Usage of ./tf-operator:
  -alsologtostderr
    	log to standard error as well as files
  -chaos-level int
    	DO NOT USE IN PRODUCTION - level of chaos injected into the TFJob created by the operator. (default -1)
  -controller-config-file string
    	Path to file containing the controller config.
  -gc-interval duration
    	GC interval (default 10m0s)
  -json-log-format
    	Set true to use json style log format. Set false to use plaintext style log format (default true)
  -log_backtrace_at value
    	when logging hits line file:N, emit a stack trace
  -log_dir string
    	If non-empty, write log files in this directory
  -logtostderr
    	log to standard error instead of files
  -stderrthreshold value
    	logs at or above this threshold go to stderr
  -v value
    	log level for V logs
  -version
    	Show version and quit
  -vmodule value
    	comma-separated list of pattern=N settings for file-filtered logging

There are some pros and cons:

Pro: we can support glog as used by vendored packages and the client, since we expose glog's flags.
Con: users may be confused, since the tf-operator outputs logs regardless of the -logtostderr flag.

gaocegege commented on August 11, 2024

I think we can close this issue after #416 is merged, and I will file a new issue for the extra flag problem. But it is now a big problem. We can refer to etcd/etcd-operator.

gaocegege commented on August 11, 2024

xref #424
