bencheeorg / benchee

Easy and extensible benchmarking in Elixir providing you with lots of statistics!

License: MIT License

Elixir 99.96% Shell 0.04%
benchmarking elixir extensible graphs microbenchmarks statistics

benchee's People

Contributors

akoutmos, alco, alvinlindstam, angelikatyborska, boddhisattva, brianewell, codetriage-readme-bot, devonestes, drobnyd, eksperimental, elpikel, gomoripeti, joshnuss, kianmeng, kimshrier, lbighetti, ldr, lwalter, mad42, mayel, megaredhand, mhanberg, michalmuskala, nickneck, pablocostass, pragtob, sabiwara, tomciopp, wasnotrice, zachdaniel


benchee's Issues

Separate printing from benchmarking

Internally, a lot of printing currently happens right inside Benchee.Benchmark - overall suite information as well as warnings etc. It'd be great to move all of that into its own module that can be injected and exchanged for testing purposes. That way we'd also avoid all the awkward capture_io calls in the Benchmark tests and be more pure/side-effect free.
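A minimal sketch of what such an injectable printer could look like (module and function names here are illustrative, not the current API):

```elixir
# Hypothetical: a printer module whose functions only do IO, so tests can
# swap in a no-op implementation instead of wrapping everything in capture_io.
defmodule Benchee.Output.BenchmarkPrinter do
  def benchmarking(name), do: IO.puts("Benchmarking #{name}...")
  def fast_warning, do: IO.puts("Warning: the function is super fast, measures may be unreliable!")
end

defmodule Benchee.Test.FakePrinter do
  def benchmarking(_name), do: :ok
  def fast_warning, do: :ok
end

# Benchee.Benchmark would then receive the printer as an argument, e.g.:
# def measure(suite, printer \\ Benchee.Output.BenchmarkPrinter)
```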

Provide statistics for drawing box plot diagrams

Drawing box plot diagrams eats a lot of resources on the browser side in github.com/PragTob/benchee_html.

So it'd be nice to provide the statistics needed to draw them right away (like this) - the median we already have; quartile 1/quartile 3, the interquartile range and others are still missing :)
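For reference, a rough sketch of computing those values from the raw run times (the linear interpolation used here is just one common choice, not necessarily what benchee should adopt):

```elixir
defmodule BoxPlotStats do
  # Computes quartile 1, quartile 3 and the interquartile range.
  def box_plot_stats(run_times) do
    sorted = Enum.sort(run_times)
    q1 = percentile(sorted, 25)
    q3 = percentile(sorted, 75)
    %{quartile_1: q1, quartile_3: q3, interquartile_range: q3 - q1}
  end

  # Linear interpolation between the two closest ranks.
  defp percentile(sorted, p) do
    rank = p / 100 * (length(sorted) - 1)
    lower = trunc(rank)
    upper = min(lower + 1, length(sorted) - 1)
    weight = rank - lower
    Enum.at(sorted, lower) * (1 - weight) + Enum.at(sorted, upper) * weight
  end
end
```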

Optionally show additional statistics in console output

An option for the console formatter like extended_statistics or something would be nice, to show statistics that we already collect but don't print yet. As of now this would be:

  • minimum
  • maximum
  • sample size

These values can be interesting for different reasons, e.g. what the worst-case performance is or how many results the statistics are based on. I don't really want to add them to the standard console formatter output as it would probably get too wide.

The new/extra statistics should probably be displayed underneath the normal statistics in a similar fashion, meaning in the same order and in a table-like format that goes:

name - minimum - maximum - sample size
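Usage could then look roughly like this (the option name and its nesting follow the proposal above and are not an existing API):

```elixir
Benchee.run(
  %{"flat_map" => fn -> Enum.flat_map(1..100, &[&1, &1]) end},
  formatter_options: [console: [extended_statistics: true]]
)
```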

Configure multiple formatters to be used for one run

In my recent benchmarks I noticed that it was sort of counterproductive to first record results and print them to the console, and then generate the CSV in a separate run to make the graph. Of course, the results are slightly different between runs. One could use the more verbose API, save the statistics and then prepare both outputs, but that's not ideal. Hence I want:

  • a configuration option formatters: that takes a list of remote functions like formatters: [&Benchee.Formatters.Console.format/1, &Benchee.Formatters.CSV.format/1] and then runs all of them (see the sketch below) - or maybe just the module names, when they all use predefined function names
  • a new function in the formatters that also takes care of the output (IO.puts for the console, writing a file for CSV), which will then be the remote function to use
  • the initial config needs to be available even after the statistics step so it can carry configuration options for the formatters (a file name/path for the CSV plugin to save to, for instance)
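A sketch of what that configuration could look like (output/1 follows the second bullet; the exact names are still up for discussion):

```elixir
Benchee.run(
  %{"my_function" => fn -> Enum.sort(Enum.shuffle(1..1_000)) end},
  formatters: [
    &Benchee.Formatters.Console.output/1,
    &Benchee.Formatters.CSV.output/1
  ]
)
```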

Deal with overly long benchmark names

Right now overly long names are just cut off. There should either be a warning, or it should just work by continuing the name on a second line (splitting it onto n lines).

tobi@happy ~/github/benchee $ cat samples/long_name.exs 
Benchee.run [{"some very long name that doesn't fit in the space", fn -> :timer.sleep(100) end}]
tobi@happy ~/github/benchee $ mix run samples/long_name.exs 
Benchmarking some very long name that doesn't fit in the space...

Name                                    ips        average    deviation         median
some very long name that doesn         9.90    100994.16μs     (±0.01%)    100994.00μs

Parallelize generation of formatters

Follow up for #55 - depends on #53

As correctly noted in #55 we can't just parallelize the formatters' output/1, as console output might come from multiple formatters (warnings and such) and they might get in each other's way. But there is already the convention of having a format/1 function that is pure and just creates the structure, which is then written out in output/1 - and the writing out can be done sequentially.

Gather more system data

It'd be great to have a platform-independent way to show more basic data about the machine the benchmark is being run on, such as:

  • Operating System
  • CPU clock speed
  • Number of cores
  • available memory
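Some calls that already exist in Erlang/Elixir and could feed into this (note that :erlang.memory(:total) reports memory allocated by the VM rather than the machine's total memory, System.schedulers_online/0 is only a proxy for the core count, and CPU clock speed would likely need OS-specific commands):

```elixir
%{
  elixir_version: System.version(),
  erlang_version: :erlang.system_info(:otp_release) |> to_string(),
  operating_system: :os.type(),
  # number of online schedulers, usually equal to the number of logical cores
  num_cores: System.schedulers_online(),
  vm_memory_bytes: :erlang.memory(:total)
}
```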

Check standard deviation if only one sample was taken.

I would expect the standard deviation to be 0 when we only have one sample or all samples are the same, but someone still got a non-zero standard deviation in a case where only one sample could/should have been taken:

Erlang/OTP 19 [erts-8.0] [64-bit] [smp:12:12] [async-threads:10]
Elixir 1.3.4
Benchmark suite executing with the following configuration:
warmup: 10.0s
time: 10.0s
parallel: 1
Estimated total run time: 120.0s
 
Benchmarking map TCO no reverse...
Benchmarking map simple without TCO...
Benchmarking map tail-recursive with ++...
Benchmarking map with TCO new arg order...
Benchmarking map with TCO reverse...
Benchmarking stdlib map...
 
Name                                 ips        average  deviation         median
map TCO no reverse                  0.33         3.06 s    ±23.57%         3.11 s
map with TCO reverse                0.26         3.84 s    ±28.88%         3.84 s
map with TCO new arg order          0.26         3.91 s    ±18.79%         3.91 s
map tail-recursive with ++        0.0918        10.90 s    ±12.83%        10.90 s
stdlib map                        0.0910        10.99 s    ±11.87%        10.99 s
map simple without TCO            0.0899        11.13 s    ±13.20%        11.13 s
 
Comparison:
map TCO no reverse                  0.33
map with TCO reverse                0.26 - 1.26x slower
map with TCO new arg order          0.26 - 1.28x slower
map tail-recursive with ++        0.0918 - 3.56x slower
stdlib map                        0.0910 - 3.59x slower
map simple without TCO            0.0899 - 3.63x slower

(time was 10 seconds and the execution of a few of them apparently took over 10 seconds on average; the only way it could get a standard deviation is if the first run was faster than 10 seconds, afaik)

Warn for potentially performance degrading settings

It'd be great if benchee could warn the user if the benchmarks are run with settings in the Elixir runtime that potentially hamper performance.

I don't have a full list in mind yet but it includes:

  • Protocol consolidation disabled
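For the protocol consolidation case, a check like this might be enough (picking Enumerable as the representative protocol is an assumption):

```elixir
unless Protocol.consolidated?(Enumerable) do
  IO.puts("Warning: protocol consolidation seems to be disabled - " <>
            "benchmarks may run slower than they would in production!")
end
```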

Flatten module hierarchy?

For discussion: when removing Benchee.Time and replacing it with Benchee.Unit.Duration, I was reminded that Benchee.Duration might be nicer, and similarly Benchee.Count. Not sure how you feel about the hierarchy, @PragTob, and how you'd want to organize the files if we did flatten it out (whether you want the directories to mirror the modules).

Make the comparison report optional?

The comparison report makes sense when you're benchmarking one thing vs another thing, but it doesn't make sense in a land where you're just providing benchmarks for a project.

For example, if I build a library which has function1, function2, function3, etc., I might want to provide a single benchmark.exs which outputs the stats for all of them (i.e. benchmark the entire library). At this point the comparison doesn't make sense, as it's comparing completely different functions, so it'd be nice to have a flag to turn it off :)
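Hypothetically, such a flag could be a console formatter option (the name and nesting are invented here for illustration, not an existing option):

```elixir
Benchee.run(
  %{
    "function1" => fn -> Enum.sum(1..1_000) end,
    "function2" => fn -> Enum.count(1..1_000) end
  },
  console: %{comparison: false}
)
```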

Auto Scale units

It'd be nice to, for instance, show the average time in milliseconds if a benchmark is slower, or write something to the effect of "80.9 Million" iterations per second in the console output for a fast benchmark.

It's important to me that this auto scaling takes all results into account. I always find it harder to compare when results are reported in different units, so either the values of (for instance) all averages should be scaled to the same unit/magnitude or none should be.

Ensure usability for macro benchmarks

Test and make sure that benchee is also usable for more macro benchmarks, e.g. benchmarks where an individual execution takes seconds or even minutes. By usable I mean that the display still works then and that it is not hard to configure.

Probably depends on #2 (and might also be solved by it)

Provide option to run a function after/before every benchmark invocation (teardown/setup) + before/after a scenario

Sometimes I wish I could run a function after every invocation of the benchmarking function, of course outside of the measurement of the benchmarked function. Why would I want that?

  • run assertions on the return value of the function to see that it really does the right thing all the time. When benchmarking n things that should all do the same, we could have one assertion that checks that they really did. Of course, that's what tests are for, but when trying out which function to use those tests are often rather duplicated.
  • reset some global state (for instance cache busting) if we don't wanna benchmark the caching, as doing the same call 1000s of times is rather atypical.

Based on these use cases, the after function would need access to:

  • return value of the function
  • input or input name currently benchmarked
  • maybe name of the job currently benchmarked

A good name (at least judging by benchfella) seems to be teardown, and as teardown alone would be lonely we can add a setup sibling that behaves similarly.

Progress

  • before_each/after_each hooks
  • before/after_scenario hooks
  • passing values from before hooks to before hook
  • passing results to after hooks
  • README documentation

edit: updated as this blog post also calls for some setup/teardown

edit2: also before/after a scenario sounds sensible to do (specifically I think I need it for benchmarking Hound as I need to start hound for our specific PID)
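A rough sketch of how usage could look, following the naming above (MyCache is a hypothetical module and the exact option names/semantics are still open):

```elixir
Benchee.run(
  %{"cached lookup" => fn -> MyCache.get(:some_key) end},
  before_scenario: fn -> MyCache.start() end,
  before_each: fn -> MyCache.clear() end,                              # e.g. cache busting
  after_each: fn return_value -> true = not is_nil(return_value) end,  # assert on the result
  after_scenario: fn -> MyCache.stop() end
)
```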

Define a behaviour for formatters

Right now formatters are specified as functions: if you use a given plugin you've got to give it the whole Benchee.Formatters.FooBar.output/1. Instead it'd be nice if there were a Formatter behaviour that formatters could simply implement, and then in the list one could specify either functions or module names like Benchee.Formatters.FooBar :)

The functions of course would be format/1 and output/1, as they already are :)
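A minimal sketch of what that behaviour could look like:

```elixir
defmodule Benchee.Formatter do
  @moduledoc "Behaviour for formatter plugins."

  # Pure: turn the suite into whatever the formatter wants to write out.
  @callback format(map) :: any

  # Side effects: write the formatted data to the console, a file, etc.
  @callback output(map) :: any
end
```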

Add type-checking

Having all of benchee's public interface type specced and the typespecs checked in the CI would be great.

As to how to do it, my experience with typespecs is limited - I've used dialyxir with varying degrees of success. There's also dialyze, but it doesn't seem to be as actively maintained anymore.

Also, with the behaviours in the unit system we added quite a few typespecs - it would be great to complement those.

Option to benchmark until confidence value is reached

I'm not sure exactly how this works, but there are measures in statistics for being confident in your results (standard deviation, which we have, is I think one of them). Given this, we could add an option that says "benchmark until this confidence level is reached" - with some timeout, though, so it doesn't run forever if results naturally vary too much.

Allow for adjustable input(sizes)

Rerun the benchmark with different input sizes (aka 10 elements, 100 elements and 1000 elements) or something of the like to get reports on all different sizes with one benchmark run.

Noticed this in elixir-lang/elixir#5082 where it would have been nice to have multiple input sizes in one benchmark.
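A sketch of what that could look like from the user's perspective (the inputs option name is an assumption for illustration):

```elixir
Benchee.run(
  %{"flat_map" => fn input -> Enum.flat_map(input, &[&1, &1]) end},
  inputs: %{
    "10 elements"    => Enum.to_list(1..10),
    "100 elements"   => Enum.to_list(1..100),
    "1_000 elements" => Enum.to_list(1..1_000)
  }
)
```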

Check usage from Erlang

It'd be amazing to be a good BEAM citizen and let other languages, specifically Erlang, also use benchee.

It would be great to check if/how benchee can be used from Erlang with rebar3 as the package manager, and then write down a sample benchmark for using benchee from Erlang in the README.

Of course, also get rid of all incompatibilities. E.g. I'm not sure, but right now benchee might crash when there is no Elixir version present.

Fix non deterministic test

When running tests, some tests fail non-deterministically. They sometimes fail and sometimes pass, without any changes to the code.

Example test failure:

  1) test variance does not skyrocket on very fast functions (Benchee.BenchmarkTest)
     test/benchee/benchmark_test.exs:60
     Assertion with < failed
     code: std_dev_ratio < 1.2
     lhs:  1.7548733092872468
     rhs:  1.2
     stacktrace:
       (elixir) lib/enum.ex:651: Enum."-each/2-lists^foreach/1-0-"/2
       (elixir) lib/enum.ex:651: Enum.each/2
       (ex_unit) lib/ex_unit/capture_io.ex:146: ExUnit.CaptureIO.do_capture_io/2
       (ex_unit) lib/ex_unit/capture_io.ex:119: ExUnit.CaptureIO.do_capture_io/3
       test/benchee/benchmark_test.exs:61: (test)

Parallelize generation of statistics and formatters

Right now statistics are computed sequentially, just as formatters are executed sequentially. There's no real reason for this and it should be "stupidly easily parallelizable", as there are no dependencies between them - easily doable via Task.async/1 and Task.await/1.

There's even some good sense behind it, as statistics generation takes an increasing amount of time the more samples there are - sorting a million elements can take ~0.3s, and if we let a fast benchmark run even for a little while that number is not hard to reach; more benchmarks take even more time.

Formatters are rather fast, but could also take longer and always have some IO going on.

Of course, probably statistics and formatters should be two different PRs :)
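A naive sketch of the statistics part using Task.async/await (the run_times data and compute_statistics are stand-ins for the suite's per-job measurements and the existing per-job statistics calculation):

```elixir
run_times = %{"job a" => [100.0, 110.0, 90.0], "job b" => [200.0, 210.0]}          # stand-in data
compute_statistics = fn times -> %{average: Enum.sum(times) / length(times)} end  # placeholder

statistics =
  run_times
  |> Enum.map(fn {job_name, times} ->
    {job_name, Task.async(fn -> compute_statistics.(times) end)}
  end)
  |> Enum.map(fn {job_name, task} -> {job_name, Task.await(task, :infinity)} end)
  |> Map.new()
```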

Use deep merge for deep config merging

The current config implementation has to do special handling of the print option, as whatever is configured needs to be merged into the defaults. It'd be nice to just have a deep_merge implementation to do this, which Elixir doesn't provide at the moment to the best of my knowledge.
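A small recursive deep_merge along these lines (just a sketch, ignoring structs and other edge cases):

```elixir
defmodule DeepMerge do
  def deep_merge(left, right) when is_map(left) and is_map(right) do
    Map.merge(left, right, fn
      _key, %{} = l, %{} = r -> deep_merge(l, r)  # both maps: recurse
      _key, _l, r -> r                            # otherwise the right side wins
    end)
  end
end

DeepMerge.deep_merge(%{print: %{comparison: true, fast_warning: true}},
                     %{print: %{comparison: false}})
# => %{print: %{comparison: false, fast_warning: true}}
```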

Provide nth percentile statistics

Provide a statistic in the statistics module that shows the nth percentile of run times. E.g. for 99 it should show the time within which 99% of all samples finished. This should help remove outliers caused by garbage collection & friends.

For now I don't think the nth percentile needs to be adjustable. I'd settle for a predefined value - the 99th seems fine - and will have to try with some real-world examples to see how it performs:

  • add a 99th_percentile key to the statistics map after statistics computation
  • display the 99th percentile in the console formatter

Missing instructions for CSV plugin

The README needs to mention updating the mix.exs file:

Add benchee_csv to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [ {:benchee, "~> 0.3", only: :dev},
    {:benchee_csv, "~> 0.3"},
  ]
end
```

Afterwards, run `mix deps.get` to install it.

Measure memory consumption

Performance isn't only about execution speed but also about memory consumption. There seem to be some Erlang APIs to get memory consumption - the question is, how reliable are they?

  • is this viable to do? If so, when/how can one measure? (for every run?)
  • how will this data be presented? (I'm guessing a new output line after execution times, otherwise it gets too hard to read)
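One candidate approach worth evaluating for the reliability question - measuring the memory delta of the process running the job (benchmark_function stands in for the function under test; whether this captures what we actually want, e.g. across GC runs, is exactly the open question):

```elixir
benchmark_function = fn -> Enum.map(1..10_000, &(&1 * 2)) end  # stand-in job

{:memory, memory_before} = Process.info(self(), :memory)
benchmark_function.()
{:memory, memory_after} = Process.info(self(), :memory)

memory_used_bytes = memory_after - memory_before
```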

Add option for displaying long labels for scaled units

For example, display 1.23 Million instead of 1.23M.

This should be an option unit_label with the possible values :short and :long, defaulting to :short. It can then be picked up by the different formatters; in the first version just the console formatter would display long or short names.

edit (@PragTob) : changed to be a general option and not just for the console

Mention compiled vs. non compiled code in the README

As pointed out by @michalmuskala on elixirforum, it is problematic that the standard way to run benchmarks is portrayed as code living outside of a module, which gets problematic when more complex code sits in the individual anonymous functions:

That said, one thing that bothers me each time I look at the README is that the example benchmark is not inside a module. Code that is not inside modules, is not compiled, but interpreted. This gives vastly different performance characteristics and makes benchmarks pretty much useless.
This is not a huge problem in the example, since the functions immediately call a module (so only the initial, anonymous function call is interpreted), but can lead to false results with more complex things.

Ideally the README should point this out and recommend either only benchmarking functions that are defined in their own modules or, in case of doubt, writing a module containing the benchmark and then calling it in the run script:

defmodule MyBenchmark do
  def benchmark do
    Benchee.run(...)
  end
end

MyBenchmark.benchmark()

Code in modules seems to always be compiled, so that'd solve the problem mentioned above :)

Provide "meta" statistics

For different use cases, like bencheeorg/benchee_html#10, it'd be great to have statistics about statistics - what I'd call "meta statistics" - although there's probably a better real statistics name for this :)

What should be in there (that I know of so far):

  • job size (how many jobs are in there)
  • minimum of run times over all jobs
  • maximum of run times over all jobs

This should be added as a new key to the benchmarking suite (:meta_statistics) - it could be added within the statistics module but might be better off as a separate MetaStatistics step that is computed after general statistics have been computed.

Make unit_scaling a top level configuration option

Right now unit_scaling is nested underneath the console formatter, where it can and should also remain for now (we could warn that it moved to the top level).

However, unit_scaling makes sense with more formatters than just the console formatter (everyone that displays units of some kind). Right now that doesn't make as much sense for CSV for instance as there is no unit field, but would definitely make sense for the HTML report.

This way different formatters can support the unit_scaling option and it would be consistent among all formatters used.

edit: As there was a misconception: what is meant by a top-level configuration option is the following:

unit_scaling: :best,
formatter_options: whatever

i.e. it's not attached to any specific formatter, but every formatter can look it up in the general configuration

Drop Elixir 1.2 support

Wait for release of Elixir 1.4

  • Elixir 1.4 has been released
  • Update TODO in Benchee.Formatters.Console.units/1
  • Use keyword type declaration in Benchee.Formatter.Unit
  • Use String.trim/1 instead of String.strip/1
  • use describe blocks in tests as needed

Remove Time and replace occurrences with Unit.Duration

This is relatively easy: once upon a time there was Benchee.Time to convert from seconds to microseconds and the other way around (not sure if that direction is used at all). We now have a new module, Unit.Duration, where similar functionality lives, plus more - so the idea is to replace calls to Time with equivalent calls to Duration :)

How to disable warnings?

Hello @PragTob,
thanks for creating Benchee,
I am playing with it.
Just wondering: since I am benchmarking a lot of functions that execute really fast, is there a way to disable this warning?

Warning: The function you are trying to benchmark is super fast, making time measures unreliable!

If not, I think having an option to disable it would be very useful.

cheers

Utility function to join run_times and statistics

Right now, run_times and statistics are 2 separate entries in the benchee suite, which makes sense given that statistics needs run_times as an input and that afterwards we often don't need to worry about the run times anymore.

For some plugins (like json and html) the run_times are convenient, and it's a hassle to always grab the run times matching the statistics you currently want to display (or vice versa) - having a function that joins them together under something like measurements would be greatly beneficial.

-->

run_times: %{"My input" => [...]},
statistics: %{"My input" => %{...}}

adds another key as:

measurements: %{
  "My input" => %{
    run_times: [...],
    statistics: %{...}
  }
}

Support keyword lists as option arguments

So far benchee uses maps for the configuration options; in Elixir it is more common to use keyword lists, though. I detailed some reasons/thoughts in this thread - people pointed out that it is probably still better to stick with the main convention of the language, and Jose suggested to just convert the keyword list to a map for internal use.

That seems like a great idea.

However, if I recall correctly, options are usually used as the last argument (currently they are the first, following the mantra of "first I configure the benchmark, then I define the benchmarks"), which would be a rather big API change that we could probably cleverly get out of with some pattern matching.

Here is a little wish list:

  • accept keyword list options
  • still accept map as an argument
  • see what happens if we reverse the order of arguments to Benchee.run - how does it look/feel
  • if we reverse the order of arguments, can we still work with the old order (I'm thinking of checking for option keys or the like) and print a warning?

Input/ideas welcome ( @wasnotrice ? :) )
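Jose's conversion suggestion is pleasantly small - a sketch:

```elixir
defmodule ConfigSketch do
  # Accept the new keyword-list style...
  def normalize(config) when is_list(config), do: Map.new(config)
  # ...while still accepting the old map style.
  def normalize(config) when is_map(config), do: config
end

ConfigSketch.normalize(time: 10, warmup: 5)
# => %{time: 10, warmup: 5}
```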

Percent Unit?

In the upcoming plotly.js/HTML formatter I already happily reuse our newly created unit formatting functions (well, I don't scale to microseconds because Erlang doesn't like that, as it apparently can't handle the UTF-8 μ, but we'll see about that) and then noticed that it'd be great to have the same formatting for percent units (e.g. the ratio of the standard deviation) as in the console formatter.

Right now it's a private function on the console formatter; we could make it public, but that feels wrong. My idea was to create a Percent or Ratio unit (or something like that) that encapsulates the formatting of this unit, much like Count and Duration do. The weird thing is that I don't see any scaling happening there - we'd just always scale to percent (for now).

I still think it sort of fits into the concept of a unit... @wasnotrice what do you think? Would love to get your input here!

Reduce the effects of Garbage Collection

Especially micro benchmarking can be affected by garbage collection, as single runs will be much slower than the others, leading to a skyrocketing standard deviation and unreliable measurements. Sadly, to the best of my knowledge, one can't turn off GC on the BEAM.

The best breadcrumb for achieving anything like this so far:

You can try to use erlang:spawn_opt http://erlang.org/doc/man/erlang.html#spawn_opt-2 setting fullsweep_after and min_heap_size to high values to reduce chances of garbage collection.

This would then go into a new configuration option like: avoid_gc: true/false.

Would also need testing with existing benchmarks to see effect on standard deviation etc. - likely a large-ish operation :)
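The breadcrumb above expressed in Elixir, as a starting point for experiments (the concrete numbers are placeholders that would need tuning and measurement):

```elixir
run_benchmark = fn -> Enum.sort(Enum.shuffle(1..100_000)) end  # stand-in for the real job

pid =
  :erlang.spawn_opt(run_benchmark, [
    {:fullsweep_after, 100_000},  # high value => fewer full-sweep GC runs
    {:min_heap_size, 1_000_000}   # in words; a large initial heap avoids growth GCs
  ])
```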

Proposal: Update README to point to a resource that has more info on what exactly mix.exs is for a beginner to Elixir

Background

As a beginner to Elixir, I wasn't quite sure what exactly the mix.exs file was and couldn't immediately figure out that it is a file generally generated by the mix build tool. I knew for sure that I wanted to set up a benchmarking tool like benchee (I have previous experience setting up similar tools for Ruby) as soon as I got started with some small exercises to better understand Elixir, but I wasn't immediately able to leverage the insights benchee could give me because I didn't know what the mix.exs file was in the first place.


Proposed Change

Update the Installation section of the current README file -

From

Add benchee to your list of dependencies in mix.exs:

def deps do
  [{:benchee, "~> 0.6", only: :dev}]
end

To

Add benchee to your list of dependencies in mix.exs as shown below. In case you're new to Elixir and don't know what the mix.exs file is, you can read more about it here.

defp deps do
  [{:benchee, "~> 0.6", only: :dev}]
end

Kindly note: it looks like deps is a private function and not a public one, from what I could make out from here and a few other places. I'm also proposing to correct this (i.e., use defp instead of def) in the PR.

Please let me know your thoughts on the above and I could accordingly submit a PR for the same.

Thank you.

Change internal structure to be a map

Right now the internal structure for benchmarks is a list of tuples of benchmark name and function. I did this initially so that benchmark names could be duplicated. However, that doesn't really make much sense and should rather trigger a warning (otherwise you can't tell them apart in the output either).

So a structure like:

%{"benchmark name" => function, ...}

seems to be better suited.

To avoid breaking existing benchmarks I'd like to preserve the old [{name, fun}, {name, fun}] way for now. The function should just convert it to a map then.
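A sketch of that conversion, including the duplicate-name warning:

```elixir
defmodule JobsSketch do
  # New style: already a map, nothing to do.
  def to_map(jobs) when is_map(jobs), do: jobs

  # Old style: list of {name, function} tuples - warn about duplicates, then convert.
  def to_map(jobs) when is_list(jobs) do
    names = Enum.map(jobs, fn {name, _function} -> name end)

    if Enum.uniq(names) != names do
      IO.puts("Warning: duplicate benchmark names - later entries will override earlier ones!")
    end

    Map.new(jobs)
  end
end
```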

Allow configuration of unit auto scaling options

Right now the auto scaling strategies are best, largest and smallest. So far we always use best (afaik), but it'd be nice if users could also choose smallest/largest. Furthermore, if someone doesn't want any unit scaling to happen, that should also be possible:

  • make unit scaling configurable for the console formatter with console: %{unit_scaling: option}
  • allow disabling unit scaling through a :none option

Scale estimated run time duration

Right now when you run a slightly more thorough benchmark with increased times or multiple inputs, the estimated time shows up as many seconds:

tobi@speedy ~/github/elixir_playground $ mix run bench/tco_blog_post.exs 
Erlang/OTP 19 [erts-8.1] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
Elixir 1.4.0
Benchmark suite executing with the following configuration:
warmup: 10.0s
time: 10.0s
parallel: 1
inputs: none specified
Estimated total run time: 120.0s

It'd be nice to use the automatic unit scaling to show 2 min or something similar in cases like this (of course even more so with cases like 360s).

The scaling can live in Benchee.Conversion.Duration, while the printing is part of Benchee.Output.BenchmarkPrinter :)

Add configuration options to override Benchee default values

Benchee is getting a lot of configuration options - which is great. But the more configuration options and formatters there are, the more people will probably prefer to configure some of them globally for their project.

Something like:

config :benchee, :options, %{} # fancy map or keyword list overriding default options

These options should then represent the new default options. So merge order would be something like: default_config <- app_config <- benchmark_config (<- meaning right overrides left)
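A sketch of that merge order, assuming the app-level options live under the :benchee application environment as proposed (the concrete default values below are placeholders):

```elixir
default_config = %{time: 5, warmup: 2, print: %{comparison: true}}  # stand-in built-in defaults
benchmark_config = %{time: 10}                                      # options passed to Benchee.run

app_config = Application.get_env(:benchee, :options, %{}) |> Map.new()

config =
  default_config
  |> Map.merge(app_config)
  |> Map.merge(benchmark_config)
```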
