stanford-futuredata / dawn-bench-entries

DAWNBench: An End-to-End Deep Learning Benchmark and Competition

Home Page: http://dawn.cs.stanford.edu/benchmark/

Python 100.00%
imagenet squad cifar10 inference training deeplearning

dawn-bench-entries's People

Contributors

alicloud-damo-hci, baidu-usa-gait-leopard, bearpelican, bignamehyp, bkj, brettkoonce, chuanli11, codeforfun9, codyaustun, daisyden, deepakn94, dmrd, felixgwu, iyaja, jph00, kay-tian, listenlink, lvniqi, lxgsbqylbk, lxylzxc, mzhangatge, raoxing, shaohuawu2018, sleepfin, stephenbalaban, tccccd, terminatorzwm, wang-chen, wbaek, yaroslavvb


dawn-bench-entries's Issues

Initialization / data preparation / checkpointing time included?

I assumed the hours field reported in the DAWN benchmark is the time between the start of the program and the moment each checkpoint is saved. Can we exclude initialization time before training? For example, we could load the entire CIFAR dataset into memory first. Also, saving checkpoints to disk is expensive, especially when training is very fast. Can we exclude the checkpointing time as well?

Besides that, can we report training progress saved every x epochs?
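
For concreteness, here is a minimal Python sketch of the two clocks being asked about, one including and one excluding checkpoint-save time. The train_epoch, save_checkpoint, and evaluate callables are hypothetical placeholders for a submitter's own code; nothing here reflects an official DAWNBench ruling.

```python
import time

# Hypothetical callables: train_epoch(), save_checkpoint(epoch), evaluate()
# come from the submitter's own training code, not from DAWNBench.
def timed_training(train_epoch, save_checkpoint, evaluate, num_epochs):
    # Data loading / initialization happens before this function is called,
    # so it is excluded from the clock in this sketch.
    start = time.time()
    checkpoint_overhead = 0.0
    for epoch in range(num_epochs):
        train_epoch()
        t0 = time.time()
        save_checkpoint(epoch)                    # candidate for exclusion
        checkpoint_overhead += time.time() - t0
        acc = evaluate()
        hours_incl = (time.time() - start) / 3600
        hours_excl = (time.time() - start - checkpoint_overhead) / 3600
        print(f"epoch={epoch} acc={acc:.4f} "
              f"hours_incl_ckpt={hours_incl:.4f} hours_excl_ckpt={hours_excl:.4f}")
```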

Clarification about training time for Imagenet

I had a quick question. In measuring the training cost and time, should the job be run as:

  1. A series of train and eval steps (1 epoch at a time), where the total time is measured for train and eval combined.
    OR
  2. Training with checkpoints stored at each epoch, where that alone is the training cost.
    Eval results can be generated post hoc from the checkpoints and do not contribute to the training time.

What are the guidelines on how the job should be configured here?
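
A minimal sketch of option 2 under these assumptions: only the training loop (including per-epoch checkpoint saves) is on the clock, and evaluation happens post hoc from the saved checkpoints. The train_epoch, save_checkpoint, load_checkpoint, and evaluate callables are hypothetical placeholders.

```python
import time

def train_then_eval_posthoc(train_epoch, save_checkpoint,
                            load_checkpoint, evaluate, num_epochs):
    start = time.time()
    elapsed = []
    for epoch in range(num_epochs):
        train_epoch()
        save_checkpoint(epoch)
        elapsed.append(time.time() - start)   # training clock only

    # Evaluation runs after training and would not add to the reported time.
    results = []
    for epoch, seconds in enumerate(elapsed):
        model = load_checkpoint(epoch)
        results.append((epoch, seconds / 3600, evaluate(model)))
    return results                            # (epoch, hours, accuracy) tuples
```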

Blacklisted or non-blacklisted validation set:

The ImageNet validation set consists of 50,000 images. In the 2014 devkit, there is a list of 1,762 "blacklisted" files. When we report top-5 accuracy, should we use the blacklisted or non-blacklisted version? In Google's submission, results are obtained on the full 50,000 images, including the blacklisted ones, but some submissions used the blacklist-filtered version. We just want to make sure we're comparing the same thing.
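
For illustration, a small sketch of how the two accuracy variants could be computed side by side; the names correct and blacklist are placeholders, not part of any devkit API.

```python
# `correct` maps each validation image identifier to whether its top-5
# prediction was right; `blacklist` is the set of identifiers from the
# 2014 devkit blacklist, loaded by the submitter.
def top5_accuracy(correct, blacklist=frozenset()):
    kept = [ok for image_id, ok in correct.items() if image_id not in blacklist]
    return sum(kept) / len(kept)

# Full validation set (50,000 images), as in Google's submission:
#   top5_accuracy(correct)
# Blacklist-filtered set (50,000 - 1,762 = 48,238 images):
#   top5_accuracy(correct, blacklist)
```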

Printing # of epochs and training time during training

Hi, @yaroslavvb

What script and flags are you using to get the training results for ResNet-50 on ImageNet? I am running the tf_cnn_benchmarks.py benchmark with ImageNet, and I need to print the number of epochs and the training time during training. I can't find flags that enable printing this information; do I need to modify the script to do it?

Here is the display I am getting:
Step Img/sec total_loss top_1_accuracy top_5_accuracy
1 images/sec: 451.0 +/- 0.0 (jitter = 0.0) 8.168 0.003 0.005
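
Independent of whatever flags the script exposes, epochs and elapsed time can be estimated from step-based output like the above, assuming roughly steady throughput. The example values below are hypothetical; 1,281,167 is the size of the ImageNet training set.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167

def epochs_and_hours(step, batch_size, images_per_sec):
    images_seen = step * batch_size
    epochs = images_seen / IMAGENET_TRAIN_IMAGES
    hours = images_seen / images_per_sec / 3600   # assumes steady throughput
    return epochs, hours

# e.g. step 10,000 at batch size 256 and ~451 images/sec:
print(epochs_and_hours(10_000, 256, 451.0))
```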

Questions on inference latency

For the DAWNBench latency rule, I have a question and need your confirmation:

When we calculate the latency, could we ignore image processing time? For example, could we handle image processing (including decoding, resize, and crop) offline?

Thanks
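
To make the two interpretations concrete, here is a small sketch that times each image with and without preprocessing. The preprocess and infer callables are hypothetical, and this is only an illustration of the question, not a ruling on which number counts.

```python
import time

def measure_latency(paths, preprocess, infer):
    incl, excl = [], []
    for path in paths:
        t0 = time.time()
        img = preprocess(path)   # decoding, resize, crop
        t1 = time.time()
        infer(img)               # CNN inference only
        t2 = time.time()
        incl.append(t2 - t0)     # preprocessing included
        excl.append(t2 - t1)     # preprocessing excluded
    n = len(paths)
    return sum(incl) / n, sum(excl) / n
```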

Questions on inference Latency

Hello,
I'm trying to reproduce the work of the PingAn GammaLab & PingAn Cloud team, which is No. 1 on the inference latency benchmark. That work uses this model to evaluate the inference time.

I notice that this model is not the original ResNet-50, and its network architecture is quite different from ResNet-50.

So, could you help me confirm their actual network architecture?
And I'm wondering: is it allowed to use lightweight networks like MobileNet here?

Question on inference cost

Hi,
To calculate the inference cost, is it permitted to run two inference instances in one VM?
To say it explicitly:

  1. Launch 2 inference processes (or threads), each serving 25k images from imagenet-2012-val.
  2. Get the total time and calculate the average cost per 10k images.
    The formula would be roughly: max[sum(process1_time), sum(process2_time)] / 50 * 10 * vm_cost_per_millisecond.

For inference latency, we would still use [sum(process1_time) + sum(process2_time)] / total_images to measure the per-image latency.

Thanks very much.
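
A minimal sketch of the accounting described above, assuming per-image timings in milliseconds from the two processes and a hypothetical vm_cost_per_ms; it simply restates the proposed formulas and is not an official DAWNBench rule.

```python
def dual_process_metrics(process1_times_ms, process2_times_ms, vm_cost_per_ms):
    total_images = len(process1_times_ms) + len(process2_times_ms)   # 50,000
    # Cost: the slower process bounds the wall-clock time billed for the VM.
    wall_clock_ms = max(sum(process1_times_ms), sum(process2_times_ms))
    cost_per_10k = wall_clock_ms / total_images * 10_000 * vm_cost_per_ms
    # Latency: average over all images served by either process.
    latency_ms = (sum(process1_times_ms) + sum(process2_times_ms)) / total_images
    return cost_per_10k, latency_ms
```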

Clarification about checkpoints in training on Imagenet

We're working on a DAWNBench entry that uses a slice of a TPU pod. Each epoch is processed so quickly on the pod that a significant amount of time is now being spent on saving checkpoints. Would it be possible for us to provide a submission where we only checkpoint once at the end of the training run and then run eval to validate accuracy?

As another possibility, we could provide data on two runs, one with checkpointing enabled for every epoch and the other with checkpointing disabled until the end. You could use the timing of the run without checkpointing but inspect the accuracy values along the way via the auxiliary run with checkpointing.

Please let us know if either of these paths would be acceptable.

Questions on inference latency/cost

Hello,

I am trying to understand the latency rule in DAWNBench:

• Latency: Use a model that has a top-5 validation accuracy of 93% or greater. Measure the total time needed to classify all 50,000 images in the ImageNet validation set one-at-a-time, and then divide by 50,000.

I am not sure how to interpret "one-at-a-time" here, so I have a few questions and need your confirmation:

  1. Does it allow pipelining of image processing and CNN inference?
  2. Does it allow preprocessed images (resize and crop done offline)?
  3. Does it allow dummy data?

Thanks.
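
For reference, a literal one-at-a-time measurement might look like the sketch below, with hypothetical preprocess and infer callables; whether preprocessing may be done offline or pipelined is exactly what the questions above ask.

```python
import time

def one_at_a_time_latency(paths, preprocess, infer):
    start = time.time()
    for path in paths:               # the 50,000 validation images
        infer(preprocess(path))      # batch size 1, no cross-image batching
    return (time.time() - start) / len(paths)
```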

Resubmission should not be allowed after competition deadline

I noticed that there were resubmissions from fast.ai for the ImageNet training track after the competition deadline:
https://github.com/stanford-futuredata/dawn-bench-entries/blob/master/ImageNet/train/fastai_pytorch.json
https://github.com/stanford-futuredata/dawn-bench-entries/blob/master/ImageNet/train/fastai_pytorch.tsv

Their new result comes from many code changes, including hyperparameter tuning and model changes. I don't think this is fair to the other participants. Resubmission should not be allowed after the competition deadline.

As a fair alternative, I would suggest that the organizers create a new ranking list for ImageNet without those blacklisted images. Any submission made prior to the deadline could be ranked in the respective list. This would avoid confusion and also honor the rules of the competition.

Kindly requesting some info

Hi everyone, thanks for this amazing work.
I was wondering if you could shed some light on the following questions.

I see benchmarks on different datasets and different hardware, but I am having some trouble inferring the following information:

For instance, training time is roughly halved on TPUs, but that comparison is between different models, i.e. ResNet and AmoebaNet.

I would like to know what the speed gain would be if:

  1. We run the same model on the same hardware but with a different library version, for instance TF 1.7 vs. TF 1.8. In other words, how much of that speed gain comes solely from the new software release?

  2. We test the same model on different hardware but with the same library version, so that we can understand how much of the percentage gain comes solely from the hardware.

Thanks in advance!
