
fambench's Introduction

FAMBench (Family Friendly Benchmarking)

These benchmarks represent important workloads. The faster these benchmarks are, the happier the owners of those workloads are. The maintainers, updates, and rules in this benchmark suite all exist to maintain the connection between the people running these benchmarks and the people running the original workloads.

The key things to know:

  • These benchmarks are directly connected to real workloads run every day
  • The main metric is throughput, subject to constraints such as latency or maximum batch size
  • Data is often synthetic, though we have safeguards to ensure correctness
  • There are special requirements when improving these benchmarks - it's not "anything goes"
  • The suite includes benchmarks (runnable on one device, multiple devices, or clusters) as well as microbenchmarks

To get started running the benchmark suite right away on a V100:

cd benchmarks
./run_all.sh

The Suite

This suite captures benchmarks across multiple devices, across multiple precisions, and includes microbenchmarks. We organize the suite so each benchmark result is identified as:

Benchmark = Models + Implementation + Mode + Configuration
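
As a rough sketch of that identity (the field names here are illustrative assumptions, not FAMBench's actual data structures), each result row can be thought of as a record keyed by those four parts plus a batch size:

from dataclasses import dataclass

# Minimal sketch only; field names are assumptions for illustration.
@dataclass
class BenchmarkResult:
    model: str           # e.g. "Recommend: DLRM"
    implementation: str  # "OOTB", "Optimized", or "Micro"
    mode: str            # "Training", "Inference", or the microbenchmark kind
    config: str          # e.g. "A.1dev-embed32-fp32"
    batch_size: int
    score: float
    units: str           # e.g. "ex/s", "TF/s", "GB/s"

row = BenchmarkResult("Recommend: DLRM", "OOTB", "Training",
                      "A.1dev-embed32-fp32", 1024, 570.16, "ex/s")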

Models

This suite contains the following benchmarks:

  • Recommendation: DLRM
  • Speech: RNN-T
  • Text: XLM-R (TorchText)
  • Text: OSCAR
  • Text: MoE
  • Vision: CVT
  • Video transformers: BEVT

Implementation

Each benchmark comes in three different implementations:

  • Out Of The Box (OOTB): indicates the performance provided by the libraries and frameworks. Code is written the way a typical AI engineer or researcher would write it, not the way a systems/hardware specialist would.
  • Optimized: represents the best performance that can be reached; the code is tuned, re-written (and perhaps even mangled) by hardware and software experts.
  • Microbenchmarks: benchmarks that look at a specific component of a device, computer, or cluster. These are highly specialized in their purpose.

Modes

For OOTB and optimized implementations, the modes are Inference and Training. For Microbenchmarks, the mode is the specific kind of microbenchmark being run.

Configurations

Each implementation comes in multiple configurations. Each configuration looks at the benchmark in a different way, such as:

  • The model and data scaled to different numbers of devices: e.g. 1 device, 8 devices, or multiple nodes
  • Different precisions and numeric formats
  • Different variants of the models, representing the different layer counts or sizes the model might be run at.
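
For illustration, the configuration names in the Results table below (e.g. A.1dev-embed32-fp32) appear to encode a variant letter, a device scale, and options such as embedding size and precision. A hypothetical parser, assuming that naming pattern (this is not an official spec):

# Hypothetical parse of a config name like "A.1dev-embed32-fp32";
# the meaning of each part is an assumption for illustration only.
def parse_config(name: str) -> dict:
    variant, rest = name.split(".", 1)
    scale, *options = rest.split("-")
    return {"variant": variant, "scale": scale, "options": options}

print(parse_config("A.1dev-embed32-fp32"))
# {'variant': 'A', 'scale': '1dev', 'options': ['embed32', 'fp32']}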

Results

Running one or more benchmarks on a specific machine or cluster produces a results table. Below are example results you might get.

Model             Implementation   Mode           Config                 Batch Size   Score     Units
Recommend: DLRM   OOTB             Training       A.1dev-embed32-fp32    1024         570.16    ex/s
Recommend: DLRM   OOTB             Inference      A.1dev-embed4-fp32     1024         61.85*    ex/s
Recommend: DLRM   Micro            MLP/Linear     linear_A.1dev          256          7.08      TF/s
Recommend: DLRM   Micro            EmbeddingBag   emb_A.1dev             65536        537.80    GB/s

* = missed latency target

Notice the following in this table:

  • Each row is one Benchmark run with a batch size (Model + Implementation + Mode + Config at a given batch size). More on batch size in Suite Design.
  • All rows in the same table are run on the same machine. Benchmarks from different hardware must appear in different result tables.
  • Some results have a * denoting that they missed the latency target. More on latency targets in Suite Design.
  • You may report multiple batch sizes for the same benchmark; they appear as different rows in the table.
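
For reference, the microbenchmark units in the table above (TF/s for the MLP/Linear GEMM, GB/s for EmbeddingBag) are typically derived from operation counts or bytes moved divided by elapsed time. A rough sketch of that arithmetic, not the suite's actual driver code:

# Illustrative scoring helpers; assumptions, not FAMBench's actual drivers.
def gemm_tflops_per_s(m: int, n: int, k: int, elapsed_s: float) -> float:
    # A dense (M x K) @ (K x N) matmul performs roughly 2*m*n*k floating point ops.
    return (2.0 * m * n * k) / elapsed_s / 1e12

def embeddingbag_gb_per_s(num_lookups: int, embedding_dim: int,
                          bytes_per_element: int, elapsed_s: float) -> float:
    # Bandwidth approximated as bytes read from the embedding tables.
    return (num_lookups * embedding_dim * bytes_per_element) / elapsed_s / 1e9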

Results by System Scale

We look at all the results to understand the broader picture of performance.

For systems that can't run the full model: Microbenchmarks give us a picture of potential performance and early indicators of where to explore further.

For single-device systems: For training, single-device configurations and microbenchmarks can indicate trends in overall cluster performance; microbenchmarks run on the cluster, paired with single-device results, can indicate whether single-device performance is in fact the bottleneck. For inference, the work is often easily parallelizable across multiple devices, so the single-device benchmarks are a very good indicator of real performance. This has the added advantage of being quick and easy for debugging and experiments.

For multiple devices, single node: For training, multi-device configurations give good insight into how single nodes perform within a cluster; this can be combined with microbenchmarks on the cluster to predict overall performance. For inference, this is a great reflection of actual workloads. This has the added advantage of being quick and easy for debugging and experiments.

For clusters: Running these benchmarks on a cluster gives the best indication of performance for training but does not add additional information for inference. The downside, obviously, is that these runs are more costly to set up and run.

How Results are Consumed

There are two broad comparisons that can be done: system-to-system and OOTB vs. Optimized.

  • System to System: Compare two tables generated by two different systems to understand their differences
  • OOTB v. Optimized: Look at one table, one system, and understand the gap between the software (compilers, frameworks, and libraries) and what might be possible if the software was improved.

Generally, consuming results is specific to the situation. Different goals place different priorities and weights on the results, so there isn't a one-size-fits-all approach here. It's up to the people and the situation.

Suite Design

We are very specific about how these benchmarks must be run and optimized in order to maintain our goal: improvements to these benchmarks connect directly to improvements in important internal workloads. Where our methodology may seem arbitrary or cumbersome, it is in service of maintaining that connection to the source.

Ownership, Versions & Updates

Each Benchmark (Model + Implementation + Mode + Config) is connected with an actual owner of an actual workload who endorsed the benchmark. The owner is the arbiter of changes, updates, and methodology for the benchmark. It is exceptionally frustrating to see benchmarks change while you are working on them. It sucks, and we version our benchmarks to help with bookkeeping. Ultimately, our goal here is to reflect the current state of what people care about - unfortunately this means (sometimes too frequently) bumping versions to ensure we are offering the best proxy to the world.

Convergence and Accuracy

The gold standard in understanding how the system works is measuring convergence and accuracy of the model in the end-to-end context. Unfortunately, as shown by MLPerf, this is exceptionally costly, burdensome and slow. We do not place an emphasis on convergence and accuracy for the following reasons:

  • We don't allow significant changes to model code (see "Improving the Benchmark Score"), so we don't expect people to be breaking convergence
  • We limit the data types and precisions to ones we understand and are known to be viable
  • We (will) offer the ability to verify correctness, possibly through real data or through statistical analysis on synthetic data (see the sketch after this list)
  • We lean on MLPerf, which has a similar suite of models and requires its submissions to test correctness.
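
As a purely hypothetical sketch of what a statistical correctness check on synthetic data might look like (not something the suite ships today):

# Hypothetical check: compare summary statistics of a run's outputs against
# stored reference values within a relative tolerance.
import math

def outputs_look_correct(outputs, ref_mean, ref_std, tol=0.05):
    n = len(outputs)
    mean = sum(outputs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in outputs) / n)
    return (abs(mean - ref_mean) <= tol * max(abs(ref_mean), 1e-12)
            and abs(std - ref_std) <= tol * max(ref_std, 1e-12))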

Overall, we aim to allow benchmarking at the granularity which is usable by people in their projects, representative of the actual workloads, and not overly cumbersome or expensive. It's a compromise.

Data

As discussed in Convergence and Accuracy, we are not an accuracy or convergence benchmark. This frees us up to use synthetic data, which significantly improves usability and time-to-results for this suite.

We may choose to use real data, or data derived from real data, where we cannot generate proper synthetic data.
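
As an illustration of how simple synthetic data can be (assumptions only, not the suite's actual data pipeline), inputs for an embedding-lookup workload can be nothing more than randomly generated indices and offsets:

# Hypothetical synthetic-data generator for an EmbeddingBag-style workload.
import random

def synthetic_lookup_batch(batch_size, pooling_factor, num_rows, seed=0):
    rng = random.Random(seed)
    indices = [rng.randrange(num_rows)
               for _ in range(batch_size * pooling_factor)]
    offsets = list(range(0, len(indices) + 1, pooling_factor))
    return indices, offsets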

Batch Sizes

Generally speaking, the bigger the batch size, the better the throughput, but the longer the time to converge and the higher the latency. When running these benchmarks, people will want to see:

  • The benchmark run at specific known batch sizes (where the convergence is understood) to allow for predicting and modeling
  • The benchmark at the batch size which gives the best throughput, subject to either (a) a maximum batch size for which the model will converge, or (b) a latency requirement for requests.

Latency Limits

Inference benchmarks come with latency limits, and the goal is to provide the best QPS while meeting the latency limit. Some inference benchmarks reflect user-facing operations where latency is key. Others reflect background jobs where throughput is key, so the latency limit is very high in these cases.
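
A rough sketch of how the reported batch size might be picked under a latency limit (illustrative only; run_benchmark is an assumed stand-in for the actual benchmark driver):

# Hypothetical sweep: report the batch size that maximizes throughput while
# the measured latency stays under the limit.
def best_batch_size(run_benchmark, batch_sizes, latency_limit_s):
    # run_benchmark(bs) is assumed to return (throughput_ex_per_s, latency_s).
    best = None
    for bs in batch_sizes:
        throughput, latency = run_benchmark(bs)
        if latency <= latency_limit_s and (best is None or throughput > best[1]):
            best = (bs, throughput)
    return best  # None if no batch size meets the latency limit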

Improving the Benchmark Score

The bigger the score, the better - but there are limits on how to get there. The limits depend on the implementation (Out-Of-The-Box OOTB, Optimized, or Microbenchmark).

  • Out-Of-The-Box (OOTB): Improvements must come through libraries, frameworks, and new hardware. No changing the model code (special exceptions for non-optimizing changes that enable porting to new hardware).
  • Optimized: No holds barred; make the system shine. Just keep in mind that whatever you do, you're asking the people who actually run the workloads to do it too if they're going to realize that performance. You'll need to describe what changes you made, so keep track.
  • Microbenchmarks: Implement the same operation as defined, and make it as fast as possible.

License

This is released under the Apache 2.0 license. Please see the LICENSE file for more information.

fambench's People

Contributors

aaronenyeshi, amathews-amd, dependabot[bot], dllehr-amd, erichan1, jataylo, jpvillam-amd, liligwu, mindest, nrsatish, orionr, pnunna93, samiwilf, xuzhao9, zstreet87


fambench's Issues

Assert is triggered when executing './run_all.sh'

The DLRM ubench benchmark triggers an assertion when executing './run_all.sh':

Command in run_all.sh

./run_dlrm_ubench_train_embeddingbag.sh -l results -c "[(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2)]" # Config not real

Output:

=== Launching FB5 ===
Benchmark: dlrm
Implementation: ubench
Mode: train
Config: embeddingbag_[(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2)]float
Saving FB5 Logger File: results/dlrm_ubench_train_embeddingbag_[(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2)]_float.log

Running Command:
++ python dlrm/ubench/dlrm_ubench_train_driver.py --steps=100 --device=cpu '--fb5logger=results/dlrm_ubench_train_embeddingbag_[(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2)]_float.log' emb '--dataset=[(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2)]'
Measuring the performance of emb on device = cpu
Steps = 100 warmups = 10
with emb dataset [(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2)]
Traceback (most recent call last):
File "dlrm/ubench/dlrm_ubench_train_driver.py", line 83, in
assert(len(run_dataset) == 1)
AssertionError
=== Completed Run ===

If -c "[(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2)]" is used, the length of run_dataset should be 5, not 1.
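
A quick illustration of why the assertion fires, assuming the driver parses the -c string as a Python list literal (the parse shown here is illustrative, not the driver's exact code):

import ast

# The -c value is a list literal containing five tuples, so the parsed
# dataset has length 5, which trips assert(len(run_dataset) == 1) at line 83.
cfg = "[(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2),(2,2,2,2)]"
run_dataset = ast.literal_eval(cfg)
print(len(run_dataset))  # 5, not the 1 the driver asserts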

Could you take a look?

KGemm module not found

Hi guys,

I tried to run the "run_all.sh", but I get the following error:

import pytorch_gemm as kgemm
ModuleNotFoundError: No module named 'pytorch_gemm'

It seems that I don't have the kgemm module, but I can't find it anywhere. Can you point me to a location where I can download a wheel (whl) for it?

Thanks!

XLM-R failing 'omegaconf._utils' has no attribute 'is_primitive_type'

We are noticing a new failure:

Traceback (most recent call last):
  File "xlmr/ootb/xlmr.py", line 178, in <module>
    run()
  File "xlmr/ootb/xlmr.py", line 142, in run
    xlmr = get_model()
  File "xlmr/ootb/xlmr.py", line 29, in get_model
    fairseq_xlmr_large = torch.hub.load('pytorch/fairseq:main', 'xlmr.large')
  File "/opt/conda/lib/python3.7/site-packages/torch/hub.py", line 399, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/hub.py", line 428, in _load_local
    model = entry(*args, **kwargs)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/model_xlmr.py", line 44, in from_pretrained
    **kwargs,
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/hub_utils.py", line 75, in from_pretrained
    arg_overrides=kwargs,
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/checkpoint_utils.py", line 421, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/checkpoint_utils.py", line 339, in load_checkpoint_to_cpu
    state = _upgrade_state_dict(state)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/checkpoint_utils.py", line 677, in _upgrade_state_dict
    state["cfg"] = convert_namespace_to_omegaconf(state["args"])
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/dataclass/utils.py", line 405, in convert_namespace_to_omegaconf
    with omegaconf_no_object_check():
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/dataclass/utils.py", line 367, in __init__
    self.old_is_primitive = _utils.is_primitive_type
AttributeError: module 'omegaconf._utils' has no attribute 'is_primitive_type'

An error is thrown when running run_dlrm_ubench_train_allreduce.sh

When running mpirun --allow-run-as-root -np 8 -N 8 --bind-to none ./run_dlrm_ubench_train_allreduce.sh -c xxxx, an error is thrown:

Traceback (most recent call last):
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 133, in <module>
    main()
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 106, in main
    comms_main()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1208, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1161, in runBench
    backendObj.benchmark_comms()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1128, in benchTime
    self.reportBenchTime(
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 853, in reportBenchTime
    self.reportBenchTimeColl(commsParams, results, tensorList)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 860, in reportBenchTimeColl
    latencyAcrossRanks = np.array(tensorList)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 723, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
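
A minimal reproduction of the error and the fix the message itself suggests; the variable name mirrors the traceback, but the change below is an illustrative workaround, not a committed patch:

import numpy as np
import torch

if torch.cuda.is_available():
    tensorList = [torch.rand(4, device="cuda") for _ in range(8)]
    # np.array(tensorList) raises the TypeError above, because CUDA tensors
    # cannot be converted to numpy without first being copied to host memory.
    latencyAcrossRanks = np.array([t.cpu().numpy() for t in tensorList])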

DLRM ubench failing, possibly due to FBGEMM API update

Traceback (most recent call last):
  File "dlrm/ubench/dlrm_ubench_train_embeddingbag_driver.py", line 207, in <module>
    global_elap, global_bytes = run_emb(args, run_dataset)
  File "dlrm/ubench/dlrm_ubench_train_embeddingbag_driver.py", line 104, in run_emb
    requests = bench.split_table_batched_embeddings_benchmark.generate_requests(
TypeError: generate_requests() got an unexpected keyword argument 'weights_precision'

due to pytorch/FBGEMM@4c581375

XLM-R failing with ValueError: invalid literal for int() with base 10: '0a0'

ValueError: invalid literal for int() with base 10: '0a0'

Possibly related to facebookresearch/fairseq#4532

This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  **kwargs,
2022-07-07 04:27:14 | INFO | fairseq.tasks.multilingual_masked_lm | dictionary: 250001 types
Traceback (most recent call last):
  File "xlmr/ootb/xlmr.py", line 182, in <module>
    run()
  File "xlmr/ootb/xlmr.py", line 142, in run
    xlmr = get_model()
  File "xlmr/ootb/xlmr.py", line 29, in get_model
    fairseq_xlmr_large = torch.hub.load('pytorch/fairseq:main', 'xlmr.large')
  File "/opt/conda/lib/python3.7/site-packages/torch/hub.py", line 525, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/hub.py", line 554, in _load_local
    model = entry(*args, **kwargs)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/model_xlmr.py", line 44, in from_pretrained
    **kwargs,
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/hub_utils.py", line 75, in from_pretrained
    arg_overrides=kwargs,
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/checkpoint_utils.py", line 473, in load_model_ensemble_and_task
    model = task.build_model(cfg.model, from_checkpoint=True)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/tasks/fairseq_task.py", line 676, in build_model
    model = models.build_model(args, self, from_checkpoint)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/__init__.py", line 106, in build_model
    return model.build_model(cfg, task)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/model.py", line 237, in build_model
    encoder = RobertaEncoder(args, task.source_dictionary)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/model.py", line 553, in __init__
    self.sentence_encoder = self.build_encoder(args, dictionary, embed_tokens)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/model.py", line 570, in build_encoder
    encoder = TransformerEncoder(args, dictionary, embed_tokens)
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/transformer/transformer_encoder.py", line 433, in __init__
    return_fc=return_fc,
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/transformer/transformer_encoder.py", line 96, in __init__
    [self.build_encoder_layer(cfg) for i in range(cfg.encoder.layers)]
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/transformer/transformer_encoder.py", line 96, in <listcomp>
    [self.build_encoder_layer(cfg) for i in range(cfg.encoder.layers)]
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/transformer/transformer_encoder.py", line 438, in build_encoder_layer
    TransformerConfig.from_namespace(args),
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/transformer/transformer_encoder.py", line 107, in build_encoder_layer
    cfg, return_fc=self.return_fc
  File "/root/.cache/torch/hub/pytorch_fairseq_main/fairseq/modules/transformer_layer.py", line 131, in __init__
    + int(self.torch_version[2])
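
The failure comes from fairseq splitting torch.__version__ on dots and calling int() on a component that carries a pre-release suffix such as '0a0'. A quick illustration, with one possible (unofficial) workaround:

import re

# Nightly / pre-release builds report versions like "1.13.0a0+d321be6"
# (example value; real builds vary).
version = "1.13.0a0"
parts = version.split(".")
# int(parts[2]) raises ValueError: invalid literal for int() with base 10: '0a0'

# One workaround: keep only the leading digits of each component.
patch = int(re.match(r"\d+", parts[2]).group())  # 0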

rnn-t setup issue with sox not found by python interpreter

December-4-2021-RNN-T-on-AWS-installing.txt

On a fresh AWS p3 instance, I cloned FAMBench, and ran setup_rnnt.sh. sudo apt-get install sox seemed to install sox successfully, but later the following exception appeared many times:
ModuleNotFoundError: No module named 'sox'.

Searching "sox" in the attached log shows the relevant parts of the log.

Installing sox using pip3 install sox resolved the issue for me. Any thoughts on whether we should use pip3 instead of sudo apt-get for sox?
