
logging's Introduction

MLPerf Logging

MLPerf Compliance Logging Utilities and Helper Functions

Installation

Use one of the following ways to install.

  • For development, you may download the latest version and install it from a local path:

    git clone https://github.com/mlperf/logging.git mlperf-logging
    pip install -e mlperf-logging
  • Install from GitHub at a specific tag or commit (replace TAG_OR_COMMIT_HASH with the actual tag or commit hash):

    pip install "git+https://github.com/mlperf/logging.git@TAG_OR_COMMIT_HASH"

Uninstall:

pip uninstall mlperf-logging

Packages

  • mllog: the MLPerf logging library
  • compliance checker: utility for checking the compliance of an MLPerf log
  • system_desc_checker: utility for checking the compliance of a system description JSON file
  • rcp_checker: utility for running convergence checks on submission directories
  • package_checker: top-level checker for a package; it calls the compliance checker, system description checker, and RCP checker
  • result_summarizer: utility that parses a package and prints out a result summary
  • repo_checker: utility that checks source code files for GitHub compliance
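
For orientation, a minimal, hedged sketch of how the mllog library is typically used from benchmark code (see the mllog subdirectory README for the authoritative API; the log file name and benchmark value below are illustrative):

    from mlperf_logging import mllog
    from mlperf_logging.mllog import constants

    # Write MLPerf-formatted log lines to a file (name is illustrative).
    mllog.config(filename="resnet_benchmark.log")
    mllogger = mllog.get_mllogger()

    mllogger.event(key=constants.SUBMISSION_BENCHMARK, value="resnet")
    mllogger.start(key=constants.INIT_START)
    # ... framework / model initialization ...
    mllogger.end(key=constants.INIT_STOP)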

Instructions

A submission needs to pass the package checker and run the result summarizer. For the 2.1 submission round (latest) you can do that with the script below; for previous versions use the respective verify script.

./scripts/verify_for_v2.1_training.sh <submission_directory>

If you want to run the individual utilities/checkers, please check the README files in the respective subdirectories.

logging's People

Contributors

ahmadki, ar-nowaczynski, azrael417, christ1ne, davidmochen, drcanchi, emizan76, erichan1, guschmue, hiwotadese, itayhubara, janekl, maanug-nv, matthew-frank, michal2409, mmarcinkiewicz, mwawrzos, nathanw-mlc, nv-eric-hw, nv-rborkar, petermattson, pgmpablo157321, sgpyc, shangw-nvidia, shriyapalsamudram, sparticlesteve, sub-mod, tgrel, xyhuang, yuanzhedong


logging's Issues

Calculating throughput from log

Throughput information is useful for submitters; we need to support it in MLPerf logs.
This could be implemented in the log checker, where throughput can be calculated from the number of training samples and the training time, as sketched below.
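
A hedged sketch of that calculation, assuming the standard ":::MLLOG {json}" line format and that the log contains run_start, run_stop, and train_samples events (this crude version treats train_samples as the total number of samples processed):

    import json

    def throughput_from_log(path):
        # Return samples/second derived from an MLPerf result log, or None.
        run_start_ms = run_stop_ms = train_samples = None
        with open(path) as f:
            for line in f:
                if ":::MLLOG" not in line:
                    continue
                entry = json.loads(line.split(":::MLLOG", 1)[1])
                if entry["key"] == "run_start":
                    run_start_ms = entry["time_ms"]
                elif entry["key"] == "run_stop":
                    run_stop_ms = entry["time_ms"]
                elif entry["key"] == "train_samples":
                    train_samples = entry["value"]
        if None in (run_start_ms, run_stop_ms, train_samples):
            return None
        return train_samples / ((run_stop_ms - run_start_ms) / 1000.0)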

Detect results/<system>/<benchmark>/result*.log in addition to results/<system>/<benchmark>/result*.txt

Many submitters in the v1.0 round made the mistake of naming their result logs
results/<system>/<benchmark>/result*.log
instead of
results/<system>/<benchmark>/result*.txt

But the .log files are silently ignored by the checkers and the result summarizer, making it look like the submitter didn't submit a benchmark result for that system.

Either (a) every file in results/<system>/<benchmark>/* should be treated as a result file, or (b) the checker should issue an error for every file that doesn't belong in the results/<system>/<benchmark>/* directory. At the very least, we should be more flexible about the extension of the result files to accommodate this very common difference.
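
A hedged sketch of the "more flexible about the extension" suggestion (the exact file-name pattern the checkers use may differ from this illustration):

    import glob
    import os

    def find_result_files(benchmark_dir):
        # Accept both result_*.txt and result_*.log under results/<system>/<benchmark>/.
        files = []
        for pattern in ("result_*.txt", "result_*.log"):
            files.extend(glob.glob(os.path.join(benchmark_dir, pattern)))
        return sorted(files)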

MaskRCNN

It looks like constants.py is missing several constants. I believe I need to add the ones used below:

    mllogger.event(key=mllog_const.MIN_IMAGE_SIZE, value=cfg.INPUT.MIN_SIZE_TRAIN)
    mllogger.event(key=mllog_const.MAX_IMAGE_SIZE, value=cfg.INPUT.MAX_SIZE_TRAIN)
    mllogger.event(key=mllog_const.EVAL_ACCURACY, value=uncased_score,
                   metadata={mllog_const.BBOX: bbox_map, mllog_const.SEGM: segm_map})

Please add release tags

Having a version on which we can hook benchmark implementations would be very useful, and it is more readable than referring to a commit.

Make README.md more useful to new users

The top-level README.md should describe how to run (or at least find) the package checker and result summarizer, instead of pointing to the compliance checker and system_desc_checker (which are both invoked through the package checker).

How to use compliance_checker

I'm new to this.

When I use the command below:

python -m mlperf_logging.compliance_checker --config ./1.0.0/common.yaml --ruleset ./mlp_parset/ruleset_060.py /home/gigabyte/submissions_training_v1.0/GIGABYTE/results/G492-ZD2_A100_20210517.mxnet/resnet/result_0.txt

==========================================================================
[screenshot of the checker's error output]

Can anyone tell me how to fix this?

The documentation has no example and no description of the ruleset.
Refer:
https://github.com/mlcommons/logging/tree/master/mlperf_logging/compliance_checker
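
For reference, a hedged example of the expected invocation, assuming that the --ruleset option takes an MLPerf version string such as 1.0.0 (selecting the rule files bundled with the package) rather than a path to a .py file, and that --config is optional; check the compliance_checker README for the authoritative usage:

    python3 -m mlperf_logging.compliance_checker --ruleset 1.0.0 \
        /path/to/results/<system>/<benchmark>/result_0.txt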

Should we just use -csv with result summarizer?

After #127, we MUST use -csv. Is that the assumption? Or shall I fix that and allow this to run without a CSV file?

emizan@emizan2:~/emizan_logging$ python3 -m mlperf_logging.result_summarizer ~/submission_training_1.0/Google/ training 1.0.0
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/google/home/emizan/emizan_logging/mlperf_logging/result_summarizer/__main__.py", line 3, in <module>
    result_summarizer.main()
  File "/usr/local/google/home/emizan/emizan_logging/mlperf_logging/result_summarizer/result_summarizer.py", line 376, in main
    summarize_results(args.folder, args.ruleset, csv_file)
UnboundLocalError: local variable 'csv_file' referenced before assignment
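
A hedged sketch of one possible fix, reconstructed only from the traceback above (the real result_summarizer argument names and CSV flag may differ): default csv_file to None so the summarizer also runs without a CSV output file.

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("folder")                 # submission directory
    parser.add_argument("usage")                  # e.g. "training" (illustrative)
    parser.add_argument("ruleset")                # e.g. "1.0.0"
    parser.add_argument("--csv", default=None)    # hypothetical flag name
    args = parser.parse_args()

    # Initialize csv_file unconditionally; leaving it unassigned is what
    # triggers the UnboundLocalError above.
    csv_file = open(args.csv, "w") if args.csv else None
    # summarize_results(args.folder, args.ruleset, csv_file)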

bug in BERT-specific progress reporting

When I use the master branch updated with #126 to check my logs, it complains about RCPs. The reason is that the new code parses the wrong convergence epochs (or rather, training samples). Given the log below:

... "key": "eval_accuracy", "value": 0.720, "metadata": {"file": "/home/work/run_pretrain.py", "lineno": 340, "train_samples": 123456789, "epoch_num": 20}}

A small fix makes it work well:

@@ -57,7 +57,7 @@ def get_submission_epochs(result_files, benchmark, bert_train_samples):
                         conv_epoch = json.loads(eval_accuracy_str)["metadata"]["epoch_num"]
                     if use_train_samples and "train_samples" in str:
                         eval_accuracy_str = str
-                        conv_epoch = json.loads(eval_accuracy_str)["value"]
+                        conv_epoch = json.loads(eval_accuracy_str)["metadata"]["train_samples"]
                     if "run_stop" in str:
                         # Epochs to converge is the the last epochs value on
                         # eval_accuracy line before run_stop
@@ -67,6 +67,7 @@ def get_submission_epochs(result_files, benchmark, bert_train_samples):
                         else:
                             subm_epochs.append(1e9)
                             not_converged = not_converged + 1
+                        break
     if (not_converged > 1 and benchmark != 'unet3d') or (not_converged > 4 and benchmark == 'unet3d'):
         subm_epochs = None
     return bs, subm_epochs

Add hparams and compliance checks for training and eval samples for all benchmarks

According to issue https://github.com/mlcommons/submission_training_1.0/issues/39, the number of training samples is 117266. Many submissions hardcode this value, even though the reference does not; in that specific issue the submission in question used a different value.

The decision was to add train_samples and eval_samples as hyperparameters, plus the related compliance checker rules, so we avoid such issues in the future.
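
A hedged example of logging the proposed hyperparameters with mllog; the TRAIN_SAMPLES and EVAL_SAMPLES constant names are assumed to be (or become) part of mlperf_logging.mllog.constants, and the eval value is illustrative:

    from mlperf_logging import mllog
    from mlperf_logging.mllog import constants

    mllogger = mllog.get_mllogger()
    mllogger.event(key=constants.TRAIN_SAMPLES, value=117266)  # value from the linked issue
    mllogger.event(key=constants.EVAL_SAMPLES, value=5000)     # illustrative value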

Document RCPs method of production

It would be useful if each RCP point also noted, in the RCP JSON files, the method of production used to generate it.
The intention is to have a path to reproducibility if needed, or a record for future reference, especially since at submission time we will have a few PRs from different submitters with new RCPs, probably run on different HW platforms and different framework versions.

The method of production can be noted via additional fields per RCP point, such as:

  • "hw_notes": for example tpuv3-128 or DGXA100 etc.
  • "framework": for example TF2.1.1, pytorch etc
  • "reference git commit hash": To note the reference commit that was used to generate the RCP
  • "notes": For optional details, for example if a rcp is from a submission code repo vs the reference repo for temporary reasons

This would save us all from scratching our heads post-v1.0 about different RCP points.
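
A hedged sketch of what a single RCP entry might look like with such fields added (the surrounding key names and values are illustrative, not copied from the actual rcps_*.json files):

    {
      "rcp_resnet_bs3264_example": {
        "Benchmark": "resnet",
        "BS": 3264,
        "Epochs to converge": [42, 43, 44],
        "hw_notes": "tpuv3-128",
        "framework": "TF 2.1.1",
        "reference git commit hash": "<commit hash used for the RCP runs>",
        "notes": "example entry only; field names follow the proposal above"
      }
    }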

Validate number of logging entries

MLPerf v0.7 logging was described by the MLPerf Training Logging Spec v0.7 document, which includes a definition of the number of expected logs.

The committee decided to drop support for this document, and the checkers should be the only source of logging rules.
With this decision, we lost the definition of required logs, and submitters don't know what is expected.

I propose to update the verifiers to add support for validating the number of log entries. This test should be part of a tool that knows the number of nodes/accelerators in the submission. The package checker may be the right choice.
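
A hedged sketch of such a check (the expected counts, and how they depend on the number of nodes/accelerators, are illustrative only):

    import collections
    import json

    def count_log_keys(log_lines):
        # Count how many times each MLPerf logging key appears.
        counts = collections.Counter()
        for line in log_lines:
            if ":::MLLOG" in line:
                counts[json.loads(line.split(":::MLLOG", 1)[1])["key"]] += 1
        return counts

    def validate_counts(counts, expected):
        # expected: dict mapping key -> required number of occurrences,
        # e.g. derived from the node/accelerator count in the system description.
        return [f"{key}: found {counts[key]}, expected {n}"
                for key, n in expected.items() if counts[key] != n]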

Minigo Logging

This is a port of Training Policies issue #213, which may be obsolete.

SWG Notes:

It looks like there may be some differences or confusion in the minigo logging standard. We believe the review process should be accommodating and understanding of differences in minigo logging. For v0.7 we can revisit this.

Missing MLPERF BERT training hyperparameters in mlperf-logging's constants.py

Hello, although there are multiple hyperparameters required by the MLPerf training rules for the BERT model (https://github.com/mlcommons/training_policies/blob/master/training_rules.adoc#91-hyperparameters),
we do not see bert in the list of benchmark names, and we do not see BERT's hyperparameters such as opt_learning_rate_training_steps, num_warmup_steps, start_warmup_step, opt_lamb_weight_decay_rate, etc. in the list in https://github.com/mlcommons/logging/blob/master/mlperf_logging/mllog/constants.py.
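
A hedged sketch of the constant definitions the issue asks to add to mlperf_logging/mllog/constants.py (names are taken from the issue text; the final names in the library may differ):

    BERT = "bert"
    OPT_LEARNING_RATE_TRAINING_STEPS = "opt_learning_rate_training_steps"
    NUM_WARMUP_STEPS = "num_warmup_steps"
    START_WARMUP_STEP = "start_warmup_step"
    OPT_LAMB_WEIGHT_DECAY_RATE = "opt_lamb_weight_decay_rate"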

missing model_bn_span in schema for rcp_checker resnet hyperparams

In the 1.0 submission round there are examples of various submitters having trouble reproducing various RCPs with the ResNet reference code. The problem is that the list of hyperparameters in https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/1.0.0/rcps_resnet.json is missing the hyperparameter model_bn_span, which impacts convergence. To make the RCPs reproducible, this information should be added.

Additional info:
model_bn_span is one of the hyperparameters listed in https://github.com/mlcommons/training_policies/blob/master/training_rules.adoc#91-hyperparameters.

The reproduction issues are:

  • at global batch 3264
  • at global batch 32768

Init time handling

Moving Training Policies issue #212 here; not sure if it is now obsolete.

SWG Notes:

To compute init time, we subtract the minimum (earliest) init_start from init_stop. We always have exactly one init_stop.
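
A minimal sketch of that rule, assuming the standard ":::MLLOG {json}" line format with "time_ms" and "key" fields:

    import json

    def init_time_ms(log_lines):
        init_starts, init_stop = [], None
        for line in log_lines:
            if ":::MLLOG" not in line:
                continue
            entry = json.loads(line.split(":::MLLOG", 1)[1])
            if entry["key"] == "init_start":
                init_starts.append(entry["time_ms"])
            elif entry["key"] == "init_stop":
                init_stop = entry["time_ms"]  # exactly one expected
        # init time = init_stop minus the earliest init_start
        return init_stop - min(init_starts)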

compliance checker: relax "exactly once" rules for some keys

For many logging keys, e.g. hyperparameter keys, the "exactly once" rule might be too strict. Depending on where the information is logged, the corresponding log entries might be emitted more than once.
One suggestion to relax this rule is to allow those keys to be logged "at least once": we could emit a warning for "more than once but same value" cases and an error for "more than once and different values" cases.
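
A hedged sketch of that relaxation (not the actual compliance_checker logic; messages are illustrative):

    def check_at_least_once(key, values):
        # values: all values logged for this key across the result file
        if not values:
            return f"ERROR: key '{key}' was never logged"
        unique = {str(v) for v in values}
        if len(unique) > 1:
            return f"ERROR: key '{key}' logged with different values: {sorted(unique)}"
        if len(values) > 1:
            return f"WARNING: key '{key}' logged {len(values)} times with the same value"
        return "OK"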

package import problem

The current import code:
from ..compliance_checker import mlp_compliance

If we don't install the package and instead execute the Python scripts directly, there will be an import error.

I think we should import like this:

    import os
    import sys

    sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "../compliance_checker"))
    import mlp_compliance

Seed verification

Submitters are allowed not to log the seed used for RNGs (random number generators) if the RNGs are seeded with default methods (for example, initialized with seed=None).

We propose that the package verifier validate this:

  1. If seeds are logged, the submission should be rejected if the same seed occurs more than once.
  2. If seeds are not logged, the verifier should raise a warning if the submission contains a source file with an occurrence of a seed-related word.
  3. In all other cases, the seed check passes.
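
A hedged sketch of the proposed check (the matched word and messages are illustrative; this is not the actual seed_checker code):

    def check_seeds(logged_seeds, source_file_texts):
        if logged_seeds:
            # Rule 1: reject duplicated seeds across runs.
            if len(logged_seeds) != len(set(logged_seeds)):
                return "ERROR: the same seed occurs in more than one run"
            return "OK"
        # Rule 2: seeds not logged, but source code mentions a seed word.
        if any("seed" in text for text in source_file_texts):
            return "WARNING: seeds are not logged but the source code mentions 'seed'"
        # Rule 3: otherwise the check passes.
        return "OK"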

Compliance log checker

Check whether the logs printed during a benchmark experiment comply with the logging spec requirements.

Seed checker gets confused when handling multiple implementations

There is a bug in the seed checker:

On the line with the comment "# Find all source codes for this benchmark." we see that the seed checker checks all source directories under the implementation. So if we have two implementations of the same benchmark, one in TF and one in PyT, then the following scenario is possible:

Assume the TF implementation reports the seed and the PyT one does not. Then, for the logs of the TF implementation, the PyT source code will also be checked and will produce a warning. This is rather confusing.

The solution is to check ONLY one implementation, as each log file comes from only one implementation.

fatal errors in result_summarizer if all result logs failed the mlp_compliance check

An error will happen at line 154 of result_summarizer.py when len(scores) == 0:
sum_of_scores -= min(scores)

A similar error will happen when len(scores) == 1 and that single score is dropped by line 160; in this case count == 0 and an error will happen at line 162:
return sum_of_scores * 1.0 / count

A suggested replacement is:

if len(scores) > 1:
    # Subtract off the min
    sum_of_scores -= min(scores)
    count -= 1

    # Subtract off the max, only if the max was not already dropped
    if dropped_scores == 0:
        sum_of_scores -= max(scores)
        count -= 1
    return sum_of_scores * 1.0 / count
elif len(scores) == 1:
    return sum_of_scores
else:
    return -1  # no result
