
logging's Introduction

MLPerf Logging

MLPerf Compliance Logging Utilities and Helper Functions

Installation

Use one of the following ways to install.

  • For development, you may download the latest version and install it from a local path:

    git clone https://github.com/mlperf/logging.git mlperf-logging
    pip install -e mlperf-logging
  • Install from GitHub at a specific tag or commit (replace TAG_OR_COMMIT_HASH with the actual tag or commit hash):

    pip install "git+https://github.com/mlperf/logging.git@TAG_OR_COMMIT_HASH"

Uninstall:

pip uninstall mlperf-logging

Packages

  • mllog: the MLPerf logging library
  • compliance checker: utility for checking the compliance of an MLPerf log
  • system_desc_checker: utility for checking the compliance of a system description JSON file
  • rcp_checker: utility for running convergence checks on submission directories
  • package_checker: top-level checker for a package; it calls the compliance checker, system description checker, and RCP checker
  • result_summarizer: utility that parses a package and prints out a result summary
  • repo_checker: utility that checks source code files for GitHub compliance
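
For orientation, a minimal, hedged sketch of how the mllog library is typically used from benchmark code (see the mllog subdirectory README for the authoritative API; the log file name and benchmark value below are illustrative):

    from mlperf_logging import mllog
    from mlperf_logging.mllog import constants

    # Write MLPerf-formatted log lines to a file (name is illustrative).
    mllog.config(filename="resnet_benchmark.log")
    mllogger = mllog.get_mllogger()

    mllogger.event(key=constants.SUBMISSION_BENCHMARK, value="resnet")
    mllogger.start(key=constants.INIT_START)
    # ... framework / model initialization ...
    mllogger.end(key=constants.INIT_STOP)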

Instructions

A submission needs to pass the package checker and run the result summarizer. For the 2.1 submission round (latest) you can do that with the script below; for previous versions use the respective verify script.

./scripts/verify_for_v2.1_training.sh <submission_directory>

If you want to run the individual utilities/checkers, please check the README files in the respective subdirectories.

logging's People

Contributors

ahmadki, ar-nowaczynski, azrael417, christ1ne, davidmochen, drcanchi, emizan76, erichan1, guschmue, hiwotadese, itayhubara, janekl, maanug-nv, matthew-frank, michal2409, mmarcinkiewicz, mwawrzos, nathanw-mlc, nv-eric-hw, nv-rborkar, petermattson, pgmpablo157321, sgpyc, shangw-nvidia, shriyapalsamudram, sparticlesteve, sub-mod, tgrel, xyhuang, yuanzhedong


logging's Issues

Calculating throughput from log

Throughput information is useful for submitters; we need to support it in MLPerf logs.
This could be implemented in the log checker, where throughput can be calculated from the number of training samples and the training time, as sketched below.
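
A hedged sketch of that calculation, assuming the standard ":::MLLOG {json}" line format and that the log contains run_start, run_stop, and train_samples events (this crude version treats train_samples as the total number of samples processed):

    import json

    def throughput_from_log(path):
        # Return samples/second derived from an MLPerf result log, or None.
        run_start_ms = run_stop_ms = train_samples = None
        with open(path) as f:
            for line in f:
                if ":::MLLOG" not in line:
                    continue
                entry = json.loads(line.split(":::MLLOG", 1)[1])
                if entry["key"] == "run_start":
                    run_start_ms = entry["time_ms"]
                elif entry["key"] == "run_stop":
                    run_stop_ms = entry["time_ms"]
                elif entry["key"] == "train_samples":
                    train_samples = entry["value"]
        if None in (run_start_ms, run_stop_ms, train_samples):
            return None
        return train_samples / ((run_stop_ms - run_start_ms) / 1000.0)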

Detect results/<system>/<benchmark>/result*.log in addition to results/<system>/<benchmark>/result*.txt

Many submitters in the v1.0 round made the mistake of naming their result logs
results/<system>/<benchmark>/result*.log
instead of
results/<system>/<benchmark>/result*.txt

But the .log files are silently ignored by the checkers and the result summarizer, making it look like the submitter didn't submit a benchmark result for that system.

Either (a) every file in results/<system>/<benchmark>/* should be treated as a result file, or (b) the checker should issue an error for every file that doesn't belong in the results/<system>/<benchmark>/* directory. At the very least, we should be more flexible about the extension of the result files to accommodate this very common difference.
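
A hedged sketch of the "more flexible about the extension" suggestion (the exact file-name pattern the checkers use may differ from this illustration):

    import glob
    import os

    def find_result_files(benchmark_dir):
        # Accept both result_*.txt and result_*.log under results/<system>/<benchmark>/.
        files = []
        for pattern in ("result_*.txt", "result_*.log"):
            files.extend(glob.glob(os.path.join(benchmark_dir, pattern)))
        return sorted(files)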

MaskRCNN

It looks like constants.py is missing several constants. I believe I need to add the ones used below:

    mllogger.event(key=mllog_const.MIN_IMAGE_SIZE, value=cfg.INPUT.MIN_SIZE_TRAIN)
    mllogger.event(key=mllog_const.MAX_IMAGE_SIZE, value=cfg.INPUT.MAX_SIZE_TRAIN)
    mllogger.event(key=mllog_const.EVAL_ACCURACY, value=uncased_score,
                   metadata={mllog_const.BBOX: bbox_map, mllog_const.SEGM: segm_map})

Please add release tags

Having a version on which we can hook benchmark implementations would be very useful, and it is more readable than referring to a commit.

Make README.md more useful to new users

The top-level README.md should describe how to run (or at least find) the package checker and result summarizer, instead of pointing to the compliance checker and system_desc_checker (which are both invoked through the package checker).

How to use compliance_checker

I'm new to this.

When I use the command below:

python -m mlperf_logging.compliance_checker --config ./1.0.0/common.yaml --ruleset ./mlp_parset/ruleset_060.py /home/gigabyte/submissions_training_v1.0/GIGABYTE/results/G492-ZD2_A100_20210517.mxnet/resnet/result_0.txt

==========================================================================
[screenshot of the checker's error output]

Can anyone tell me how to fix this?

The documentation has no example and no description of the ruleset.
Refer:
https://github.com/mlcommons/logging/tree/master/mlperf_logging/compliance_checker
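
For reference, a hedged example of the expected invocation, assuming that the --ruleset option takes an MLPerf version string such as 1.0.0 (selecting the rule files bundled with the package) rather than a path to a .py file, and that --config is optional; check the compliance_checker README for the authoritative usage:

    python3 -m mlperf_logging.compliance_checker --ruleset 1.0.0 \
        /path/to/results/<system>/<benchmark>/result_0.txt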

Should we just use -csv with result summarizer?

After #127, we MUST use -csv. Is that the assumption? Or shall I fix that and allow this to run without a CSV file?

emizan@emizan2:~/emizan_logging$ python3 -m mlperf_logging.result_summarizer ~/submission_training_1.0/Google/ training 1.0.0
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/google/home/emizan/emizan_logging/mlperf_logging/result_summarizer/__main__.py", line 3, in <module>
    result_summarizer.main()
  File "/usr/local/google/home/emizan/emizan_logging/mlperf_logging/result_summarizer/result_summarizer.py", line 376, in main
    summarize_results(args.folder, args.ruleset, csv_file)
UnboundLocalError: local variable 'csv_file' referenced before assignment
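
A hedged sketch of one possible fix, reconstructed only from the traceback above (the real result_summarizer argument names and CSV flag may differ): default csv_file to None so the summarizer also runs without a CSV output file.

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("folder")                 # submission directory
    parser.add_argument("usage")                  # e.g. "training" (illustrative)
    parser.add_argument("ruleset")                # e.g. "1.0.0"
    parser.add_argument("--csv", default=None)    # hypothetical flag name
    args = parser.parse_args()

    # Initialize csv_file unconditionally; leaving it unassigned is what
    # triggers the UnboundLocalError above.
    csv_file = open(args.csv, "w") if args.csv else None
    # summarize_results(args.folder, args.ruleset, csv_file)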

bug in BERT-specific progress reporting

When I use the master branch updated with #126 to check my logs, it complains about RCPs. The reason is that the new code parses the wrong convergence epochs (or rather, training samples). Given the log below:

... "key": "eval_accuracy", "value": 0.720, "metadata": {"file": "/home/work/run_pretrain.py", "lineno": 340, "train_samples": 123456789, "epoch_num": 20}}

A small fix makes it work well:

@@ -57,7 +57,7 @@ def get_submission_epochs(result_files, benchmark, bert_train_samples):
                         conv_epoch = json.loads(eval_accuracy_str)["metadata"]["epoch_num"]
                     if use_train_samples and "train_samples" in str:
                         eval_accuracy_str = str
-                        conv_epoch = json.loads(eval_accuracy_str)["value"]
+                        conv_epoch = json.loads(eval_accuracy_str)["metadata"]["train_samples"]
                     if "run_stop" in str:
                         # Epochs to converge is the the last epochs value on
                         # eval_accuracy line before run_stop
@@ -67,6 +67,7 @@ def get_submission_epochs(result_files, benchmark, bert_train_samples):
                         else:
                             subm_epochs.append(1e9)
                             not_converged = not_converged + 1
+                        break
     if (not_converged > 1 and benchmark != 'unet3d') or (not_converged > 4 and benchmark == 'unet3d'):
         subm_epochs = None
     return bs, subm_epochs

Add hparams and compliance checks for training and eval samples for all benchmarks

According to issue https://github.com/mlcommons/submission_training_1.0/issues/39, the number of training samples is 117266. Many submissions hardcode this value, even though the reference does not; in that specific issue the submission in question used a different value.

The decision was to add train_samples and eval_samples as hyperparameters, plus the related compliance checker rules, so we avoid such issues in the future.
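
A hedged example of logging the proposed hyperparameters with mllog; the TRAIN_SAMPLES and EVAL_SAMPLES constant names are assumed to be (or become) part of mlperf_logging.mllog.constants, and the eval value is illustrative:

    from mlperf_logging import mllog
    from mlperf_logging.mllog import constants

    mllogger = mllog.get_mllogger()
    mllogger.event(key=constants.TRAIN_SAMPLES, value=117266)  # value from the linked issue
    mllogger.event(key=constants.EVAL_SAMPLES, value=5000)     # illustrative value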

Document RCPs method of production

It would be useful if each RCP point also noted, in the RCP JSON files, the method of production used to generate it.
The intention is to have a path to reproducibility if needed, or a record for future reference, especially since at submission time we will have a few PRs from different submitters with new RCPs, probably run on different HW platforms and different framework versions.

The method of production can be noted via additional fields per RCP point, such as:

  • "hw_notes": for example tpuv3-128 or DGXA100 etc.
  • "framework": for example TF2.1.1, pytorch etc
  • "reference git commit hash": To note the reference commit that was used to generate the RCP
  • "notes": For optional details, for example if a rcp is from a submission code repo vs the reference repo for temporary reasons

This would save us all from scratching our heads post-v1.0 about different RCP points.
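
A hedged sketch of what a single RCP entry might look like with such fields added (the surrounding key names and values are illustrative, not copied from the actual rcps_*.json files):

    {
      "rcp_resnet_bs3264_example": {
        "Benchmark": "resnet",
        "BS": 3264,
        "Epochs to converge": [42, 43, 44],
        "hw_notes": "tpuv3-128",
        "framework": "TF 2.1.1",
        "reference git commit hash": "<commit hash used for the RCP runs>",
        "notes": "example entry only; field names follow the proposal above"
      }
    }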

Validate number of logging entries

MLPerf v0.7 logging was described by the MLPerf Training Logging Spec v0.7 document, which includes a definition of the number of expected logs.

The committee decided to drop support for this document, and the checkers should be the only source of logging rules.
With this decision, we lost the definition of required logs, and submitters don't know what is expected.

I propose to update the verifiers to add support for validating the number of log entries. This test should be part of a tool that knows the number of nodes/accelerators in the submission. The package checker may be the right choice.
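
A hedged sketch of such a check (the expected counts, and how they depend on the number of nodes/accelerators, are illustrative only):

    import collections
    import json

    def count_log_keys(log_lines):
        # Count how many times each MLPerf logging key appears.
        counts = collections.Counter()
        for line in log_lines:
            if ":::MLLOG" in line:
                counts[json.loads(line.split(":::MLLOG", 1)[1])["key"]] += 1
        return counts

    def validate_counts(counts, expected):
        # expected: dict mapping key -> required number of occurrences,
        # e.g. derived from the node/accelerator count in the system description.
        return [f"{key}: found {counts[key]}, expected {n}"
                for key, n in expected.items() if counts[key] != n]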

Minigo Logging

This is a port of Training Policies issue #213, which may be obsolete.

SWG Notes:

It looks like there may be some differences or confusion in the minigo logging standard. We believe the review process should be accommodating and understanding of differences in minigo logging. For v0.7 we can revisit this.

Missing MLPERF BERT training hyperparameters in mlperf-logging's constants.py

Hello, although there are multiple hyperparameters required by the MLPerf training rules for the BERT model (https://github.com/mlcommons/training_policies/blob/master/training_rules.adoc#91-hyperparameters),
we do not see bert in the list of benchmark names, and we do not see BERT's hyperparameters such as opt_learning_rate_training_steps, num_warmup_steps, start_warmup_step, opt_lamb_weight_decay_rate, etc. in the list in https://github.com/mlcommons/logging/blob/master/mlperf_logging/mllog/constants.py.
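
A hedged sketch of the constant definitions the issue asks to add to mlperf_logging/mllog/constants.py (names are taken from the issue text; the final names in the library may differ):

    BERT = "bert"
    OPT_LEARNING_RATE_TRAINING_STEPS = "opt_learning_rate_training_steps"
    NUM_WARMUP_STEPS = "num_warmup_steps"
    START_WARMUP_STEP = "start_warmup_step"
    OPT_LAMB_WEIGHT_DECAY_RATE = "opt_lamb_weight_decay_rate"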

missing model_bn_span in schema for rcp_checker resnet hyperparams

In the 1.0 submission round there are examples of various submitters having trouble reproducing various RCPs with the ResNet reference code. The problem is that the list of hyperparameters in https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/1.0.0/rcps_resnet.json is missing the hyperparameter model_bn_span, which impacts convergence. To make the RCPs reproducible, this information should be added.

Additional info:
model_bn_span is one of the hyperparameters listed in https://github.com/mlcommons/training_policies/blob/master/training_rules.adoc#91-hyperparameters.

The reproduction issues are:

  • at global batch 3264
  • at global batch 32768

Init time handling

Moving Training Policies issue #212 here; not sure if it is now obsolete.

SWG Notes:

To compute init time, we subtract the minimum (earliest) init_start from init_stop. We always have exactly one init_stop.
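
A minimal sketch of that rule, assuming the standard ":::MLLOG {json}" line format with "time_ms" and "key" fields:

    import json

    def init_time_ms(log_lines):
        init_starts, init_stop = [], None
        for line in log_lines:
            if ":::MLLOG" not in line:
                continue
            entry = json.loads(line.split(":::MLLOG", 1)[1])
            if entry["key"] == "init_start":
                init_starts.append(entry["time_ms"])
            elif entry["key"] == "init_stop":
                init_stop = entry["time_ms"]  # exactly one expected
        # init time = init_stop minus the earliest init_start
        return init_stop - min(init_starts)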

compliance checker: relax "exactly once" rules for some keys

For many logging keys, e.g. hyperparameter keys, the "exactly once" rule might be too strict. Depending on where the information is logged, the corresponding log entries might be emitted more than once.
One suggestion to relax this rule is to allow those keys to be logged "at least once": we could emit a warning for "more than once but same value" cases and an error for "more than once and different values" cases.
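
A hedged sketch of that relaxation (not the actual compliance_checker logic; messages are illustrative):

    def check_at_least_once(key, values):
        # values: all values logged for this key across the result file
        if not values:
            return f"ERROR: key '{key}' was never logged"
        unique = {str(v) for v in values}
        if len(unique) > 1:
            return f"ERROR: key '{key}' logged with different values: {sorted(unique)}"
        if len(values) > 1:
            return f"WARNING: key '{key}' logged {len(values)} times with the same value"
        return "OK"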

package import problem

The current import code:
from ..compliance_checker import mlp_compliance

If we don't install the package and instead execute the Python scripts directly, there will be an import error.

I think we should import like this:

    import os
    import sys

    sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "../compliance_checker"))
    import mlp_compliance

Seed verification

Submitters are allowed not to log the seed used for RNGs (random number generators) if the RNGs are seeded with default methods (for example, initialized with seed=None).

We propose that the package verifier validate this:

  1. If seeds are logged, the submission should be rejected if the same seed occurs more than once.
  2. If seeds are not logged, the verifier should raise a warning if the submission contains a source file with an occurrence of a seed-related word.
  3. In all other cases, the seed check passes.
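
A hedged sketch of the proposed check (the matched word and messages are illustrative; this is not the actual seed_checker code):

    def check_seeds(logged_seeds, source_file_texts):
        if logged_seeds:
            # Rule 1: reject duplicated seeds across runs.
            if len(logged_seeds) != len(set(logged_seeds)):
                return "ERROR: the same seed occurs in more than one run"
            return "OK"
        # Rule 2: seeds not logged, but source code mentions a seed word.
        if any("seed" in text for text in source_file_texts):
            return "WARNING: seeds are not logged but the source code mentions 'seed'"
        # Rule 3: otherwise the check passes.
        return "OK"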

Compliance log checker

Check whether the logs printed during a benchmark experiment comply with the logging spec requirements.

Seed checker gets confused when handling multiple implementations

There is a bug in the seed checker:

On the line with the comment "# Find all source codes for this benchmark." we see that the seed checker checks all source directories under the implementation. So if we have two implementations of the same benchmark, one in TF and one in PyT, then the following scenario is possible:

Assume the TF implementation reports the seed and the PyT one does not. Then, for the logs of the TF implementation, the PyT source code will also be checked and will produce a warning. This is rather confusing.

The solution is to check ONLY one implementation, as each log file comes from only one implementation.

fatal errors in result_summarizer if all result logs failed the mlp_compliance check

An error will happen at line 154 of result_summarizer.py when len(scores) == 0:
sum_of_scores -= min(scores)

A similar error will happen when len(scores) == 1 and that single score is dropped by line 160; in this case count == 0 and an error will happen at line 162:
return sum_of_scores * 1.0 / count

A suggested replacement is:

if len(scores) > 1:
    # Subtract off the min
    sum_of_scores -= min(scores)
    count -= 1

    # Subtract off the max, only if the max was not already dropped
    if dropped_scores == 0:
        sum_of_scores -= max(scores)
        count -= 1
    return sum_of_scores * 1.0 / count
elif len(scores) == 1:
    return sum_of_scores
else:
    return -1  # no result
