Comments (12)
Lots of workarounds are discussed in sirupsen/logrus#63. Prometheus implemented it this way, so we could potentially copy their approach.
from training-operator.
I'm adding this to our next milestone because I think structured logging will be critical to scaling. As we scale up to more jobs and larger jobs, we will need to be able to easily filter logs by pod, job, etc. to get to the relevant logs.
from training-operator.
Personally, I recommend glog, since most repos in the Kubernetes community use glog.
from training-operator.
+1 for glog
:)
from training-operator.
If we use glog, is there a way to output JSON logs with metadata such as the job and replica that a log message is associated with?
from training-operator.
I am afraid not 🤔, since there is no function for it in the docs: https://godoc.org/github.com/golang/glog
from training-operator.
With glog, how do we make it really easy to filter the TFJob operator logs so we can see the log messages for a particular job?
I think this will be super useful for debugging and troubleshooting.
If we use structured logging, then we can add a tag corresponding to the job name. Then it should be very easy to filter the logs to find all log messages for a particular job.
from training-operator.
This solution looks promising: sirupsen/logrus#63 (comment)
I believe this solution just uses the filename hook from https://github.com/onrik/logrus
I think we can just define a logrus logger with that hook and it will work.
Would be great if someone could just try it out using the example here: https://github.com/onrik/logrus
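For anyone unfamiliar with what a filename hook does, the sketch below shows the underlying mechanism using only the standard library: runtime.Caller looks up the file and line of the call site, which then gets attached to the message. This is not the onrik/logrus code. The helper names and the "source" field name are assumptions. A real logrus hook would do a similar lookup inside its Fire method and would need a different skip depth to step over logrus's own frames.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// source returns "file:line" for a frame above its caller.
// skip=0 means the function that called source, skip=1 its caller, and so on.
func source(skip int) string {
	_, file, line, ok := runtime.Caller(skip + 1)
	if !ok {
		return "unknown"
	}
	return fmt.Sprintf("%s:%d", filepath.Base(file), line)
}

// annotate prefixes msg with the location of the code that called annotate.
func annotate(msg string) string {
	// skip=1 steps over annotate itself so the reported line is the call site.
	return fmt.Sprintf("source=%s msg=%q", source(1), msg)
}

func main() {
	fmt.Println(annotate("reconciling TFJob"))
}
```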
from training-operator.
I'm looking into this. I will try the filenameHook from onrik/logrus and post the results here.
from training-operator.
We now use the flag package to support command line flags, and glog also uses it by default. As a result, our binary has more flags than we think, even though we use logrus instead of glog:
➜ tf-operator git:(416) ✗ ./tf-operator -h
Usage of ./tf-operator:
  -alsologtostderr
        log to standard error as well as files
  -chaos-level int
        DO NOT USE IN PRODUCTION - level of chaos injected into the TFJob created by the operator. (default -1)
  -controller-config-file string
        Path to file containing the controller config.
  -gc-interval duration
        GC interval (default 10m0s)
  -json-log-format
        Set true to use json style log format. Set false to use plaintext style log format (default true)
  -log_backtrace_at value
        when logging hits line file:N, emit a stack trace
  -log_dir string
        If non-empty, write log files in this directory
  -logtostderr
        log to standard error instead of files
  -stderrthreshold value
        logs at or above this threshold go to stderr
  -v value
        log level for V logs
  -version
        Show version and quit
  -vmodule value
        comma-separated list of pattern=N settings for file-filtered logging
There are some pros and cons:
- We can support the glog used by vendored packages and the client, since we have glog's flags
- But users may be confused, since tf-operator outputs logs regardless of the -logtostderr flag
from training-operator.
I think we could close the issue after #416 is merged, and I will file a new issue for the extra flag problem. But it is not a big problem. We can refer to etcd/etcd-operator.
from training-operator.
xref #424
from training-operator.