
Precompile nested functions · yadism (closed)

alecandido commented on August 22, 2024
Precompile nested functions

Comments (17)

alecandido commented on August 22, 2024

Probably we should take a look at numba; if so, #27 may be enough to solve this one as well.

alecandido commented on August 22, 2024

Numba is doing more damage than good...

@scarrazza @scarlehoff do you have any advice?

Fri Apr 17 17:32:26 2020    apfel_w_numbify.prof

         169133690 function calls (160232623 primitive calls) in 183.925 seconds

   Ordered by: cumulative time
   List reduced from 8601 to 11 due to restriction <11>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1382/1    0.013    0.000  184.000  184.000 {built-in method builtins.exec}
        1    0.000    0.000  184.000  184.000 benchmark_against_apfel.py:4(<module>)
        7    0.001    0.000  182.688   26.098 benchmark_against_apfel.py:55(run_against_apfel)
        7    0.000    0.000  181.581   25.940 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:190(apply)
        7    0.018    0.003  181.581   25.940 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:140(__call__)
        7    0.000    0.000  181.116   25.874 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:130(get_output)
        7    0.000    0.000  181.116   25.874 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/__init__.py:49(get_output)
      126    0.000    0.000  181.116    1.437 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:65(get_output)
      126    0.002    0.000  181.115    1.437 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:35(_compute)
      252    0.065    0.000  181.114    0.719 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:48(_compute_component)
    14868    0.361    0.000  181.019    0.012 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/convolution.py:139(convolution)

Fri Apr 17 17:29:04 2020    apfel_wo_numbify.prof

         61928349 function calls (61900659 primitive calls) in 61.382 seconds

   Ordered by: cumulative time
   List reduced from 8070 to 11 due to restriction <11>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1382/1    0.015    0.000   61.399   61.399 {built-in method builtins.exec}
        1    0.000    0.000   61.399   61.399 benchmark_against_apfel.py:4(<module>)
        7    0.001    0.000   60.217    8.602 benchmark_against_apfel.py:55(run_against_apfel)
        7    0.000    0.000   59.400    8.486 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:190(apply)
        7    0.018    0.003   59.400    8.486 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:140(__call__)
        7    0.000    0.000   58.927    8.418 /home/alessandro/projects/N3PDF/dis/src/yadism/runner.py:130(get_output)
        7    0.000    0.000   58.927    8.418 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/__init__.py:49(get_output)
      126    0.001    0.000   58.926    0.468 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:65(get_output)
      126    0.002    0.000   58.926    0.468 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:35(_compute)
      252    0.058    0.000   58.924    0.234 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/EvaluatedStructureFunction.py:48(_compute_component)
    14868    0.292    0.000   58.839    0.004 /home/alessandro/projects/N3PDF/dis/src/yadism/structure_functions/convolution.py:139(convolution)

felixhekhorn commented on August 22, 2024

see also

scarlehoff commented on August 22, 2024

Without knowing which functions you are actually trying to compile:
numba doing worse than no-numba usually means you are compiling things you shouldn't, or running through numba functions that do not need it.

As a general rule, try to have one function that takes as many things as possible as arguments, compile that one, and pass everything in as arguments.
I.e., ensure that you are not, for instance, recompiling the call for each different value of the momentum.

Another possible pitfall is having intermediate objects that prevent numba from fully compiling (basically, complicated Python objects, where "complicated" I think starts at dictionaries, though I can't remember right now).

For more information I guess I'd have to read the code you are trying to numba.
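
For concreteness, a minimal sketch of the suggested pattern (illustrative only, not yadism/eko code): one jitted kernel that receives everything it needs as plain arrays and scalars, so it compiles once per type signature instead of once per value or per object:

```python
import numba as nb
import numpy as np

@nb.njit
def convolve_kernel(xgrid, weights, q2):
    # everything the kernel needs comes in as plain arrays/scalars,
    # so this compiles once for (float64[:], float64[:], float64)
    total = 0.0
    for x, w in zip(xgrid, weights):
        total += w * np.log(x) * q2
    return total

xgrid = np.linspace(0.1, 1.0, 50)
weights = np.ones_like(xgrid)
convolve_kernel(xgrid, weights, 10.0)   # first call triggers compilation
convolve_kernel(xgrid, weights, 100.0)  # same types: no recompilation
```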

alecandido commented on August 22, 2024

Thanks, we thought this could be the issue, and we will try to address it in some way.

The main problem is that inside yadism we are not really calling numba ourselves: we are calling methods from eko that are numbified (BasisFunction in particular), and as you said this generates the issue.

scarlehoff commented on August 22, 2024

Ah, right: there, the structure/organization of what is compiled and when was designed with the specific case of eko in mind, which might be highly suboptimal here.

felixhekhorn commented on August 22, 2024

Well, I think not really: we're dealing with BasisFunctions here and there, and they should compile once and for all (for a given configuration). I don't know whether we (in eko) really do this, though, and how can we actually check it?

scarlehoff commented on August 22, 2024

A quick way is checking where nb.njit is called and printing "compiling!" just before.

(when used as a decorator it wouldn't be much of a problem because I think we only use it like that in a few functions that are very basic, nothing fancy there)
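
A minimal sketch of that check (assuming the njit calls are ordinary numba dispatchers; logged_njit and basis_function below are just placeholders):

```python
import numba as nb

def logged_njit(*args, **kwargs):
    # drop-in replacement for nb.njit at the call sites we want to monitor:
    # prints once when the dispatcher is created
    print("compiling!")
    return nb.njit(*args, **kwargs)

@logged_njit
def basis_function(x):
    return x * x

basis_function(0.5)
# the dispatcher also records every signature it actually compiled (numba is
# lazy, so this only fills up at call time), which gives a second check:
print(basis_function.signatures)  # e.g. [(float64,)]
```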

alecandido commented on August 22, 2024

@scarlehoff we gave it a try and discovered that we are indeed compiling only once per BasisFunction; the problem is that, for the limited size of the required output, it is not worth the effort. At this stage we are wondering what to do:

  • implement some kind of threshold (purely on our side, exposed to the user, not something numba provides) to decide whether to compile or not
  • parallelize compilation; but for this we need eager compilation (numba is lazy by default) and a way to tell Python to continue while the compilation threads run on their own, until it hits a line where it needs a function numba is still compiling on another thread (see the sketch after this list)
    • or we can just add a compilation block for the basis functions when instantiating the InterpolatorDispatcher, or right after, compiling them in parallel and waiting for the last one to finish
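
For reference, a minimal sketch of the lazy vs eager behaviour mentioned above (illustrative, not the actual eko interface): giving numba an explicit signature makes it compile at decoration time, so the cost could be paid up front, e.g. when the InterpolatorDispatcher is built. Note that, as far as I know, numba holds a global compiler lock, so compiling from several threads may not help as much as hoped.

```python
import numba as nb
import numpy as np

# Lazy (default): compilation happens at the first call, per type signature.
@nb.njit
def lazy_basis(x):
    return np.log(x) * x

# Eager: an explicit signature triggers compilation at decoration time,
# so it can be scheduled right after building the interpolation basis.
@nb.njit("float64(float64)")
def eager_basis(x):
    return np.log(x) * x
```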

scarlehoff commented on August 22, 2024

I think I put a flag to avoid compilation? Or maybe I just thought about putting it... I did it for vegasflow for sure :_

In any case, if compilation is a problem, parallelizing just creates the problem in every thread; the right thing is not to compile.

(maybe I'm failing to understand the issue, but looking at the benchmark you linked before, it looks like what you want is to avoid compiling; lowering the time from 1 minute will create more problems than it will solve)
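
A minimal sketch of such a flag (illustrative, not the actual eko/vegasflow implementation; the environment variable name is made up). Numba also honours the NUMBA_DISABLE_JIT setting, which disables compilation globally:

```python
import os
import numba as nb

# hypothetical switch: export YADISM_DISABLE_NUMBA=1 to skip compilation
DISABLE_NUMBA = bool(int(os.environ.get("YADISM_DISABLE_NUMBA", "0")))

def maybe_njit(func):
    # apply nb.njit unless compilation has been disabled
    return func if DISABLE_NUMBA else nb.njit(func)

@maybe_njit
def basis_function(x):
    return x * x
```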

felixhekhorn commented on August 22, 2024

relevant timings are:

  • /wo numba: 1.5s = 27s/17 = time spent in log_evaluate_x / number of observables (=: n_obs)
  • /w numba: 2.5s = 145s/59 = time spent in _compile_for_args / number of basis functions (=: n_bf)

so our current setup is

  • n_obs = 17 and n_bf = 59, and we are losing about 120s = 2.5*59 - 1.5*17
  • if we had e.g. 180 observables, we would instead gain about 120s

or, more generally: it is worth running numba if and only if n_obs/n_bf > 5/3.

The common number of data points in a DIS-only PDF fit is ~3000.
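
Spelling out the break-even logic behind these numbers (with t_c the per-basis-function compile time, t_e the per-observable evaluation time saved without numba, and the compiled evaluation time assumed negligible):

```latex
T_\text{numba} \approx t_c \, n_\text{bf}, \qquad
T_\text{plain} \approx t_e \, n_\text{obs}
\quad\Longrightarrow\quad
\text{numba pays off} \iff
\frac{n_\text{obs}}{n_\text{bf}} > \frac{t_c}{t_e} = \frac{2.5}{1.5} = \frac{5}{3}.
```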

scarlehoff commented on August 22, 2024

Ok, so you are "losing" time in the trial benchmark, but "in real life" you would prefer to compile. Is that it?

If that's the case I would just avoid compilation during development (but keep it in tests to ensure it works).

alecandido commented on August 22, 2024

Ok, if all those points are going to be included in the relevant fits, our threshold on the xgrid size is about 2000, so we can safely say that if you are going to use an xgrid with fewer than 1000 points it is worth having numba compile the BasisFunctions.

alecandido commented on August 22, 2024

> relevant timings are:
>
> • /wo numba: 1.5s = 27s/17 = time spent in `log_evaluate_x` / number of observables (=: n_obs)
> • /w numba: 2.5s = 145s/59 = time spent in `_compile_for_args` / number of basis functions (=: n_bf)
>
> so our current setup is
>
> • n_obs = 17 and n_bf = 59, and we are losing about 120s = 2.5*59 - 1.5*17
> • if we had e.g. 180 observables, we would instead gain about 120s
>
> or, more generally: it is worth running numba if and only if n_obs/n_bf > 5/3.
>
> The common number of data points in a DIS-only PDF fit is ~3000.

We updated the convolution function, using some clever caching to speed it up. Because of this the trade-off value also moved, mainly because we optimized the number of calls to log_evaluate_x and to the coefficient functions.

The updated trade-off report:

  • /wo numba: 0.6s = 10s/17 = time spent in log_evaluate_x / number of observables (=: n_obs)
  • /w numba: 2.6s = 157s/59 = time spent in _compile_for_args / number of basis functions (=: n_bf)

So the new trade-off is: n_obs/n_bf > 5

Conclusion: still convenient, but exactly 3 times less so...

alecandido commented on August 22, 2024

Ok, compiling eko with numba has been sufficiently investigated, and we also looked into the possibility of compiling the coefficient functions, again with numba.

We concluded that, with the current structure of the code, it is not possible to compile the coefficient functions inside Python, because they depend on external objects (quark_0 is a function, to be summed with quark_1 multiplied by eko's alpha, and these are three different functions to be joined), so we will keep plain Python evaluation for the time being.

@scarlehoff if you have any idea about compiling sums and products of functions with numba that would be perfect, but from our experiments it really does not like functional programming :)
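
For concreteness, a minimal sketch of the pattern that gives numba trouble and the kind of manual inlining that works instead (quark_0/quark_1 here are just illustrative stand-ins, not the actual yadism code):

```python
import numba as nb

# Pattern numba struggles with: composing plain Python callables at runtime.
def quark_0(z):
    return 1.0 - z

def quark_1(z):
    return (1.0 - z) ** 2

def coefficient(alpha):
    # returns a closure over Python functions: hard to njit as-is
    return lambda z: quark_0(z) + alpha * quark_1(z)

# Workaround in the spirit of the advice above: inline everything into a
# single jitted function and pass alpha in as a plain argument.
@nb.njit
def coefficient_inlined(z, alpha):
    return (1.0 - z) + alpha * (1.0 - z) ** 2
```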

scarlehoff commented on August 22, 2024

> it really does not like functional programming :)

Because it's Python, not JS!

In any case, as I said, if timing is not a problem, over-optimizing is not worth it.
But if at some point you need better speed, it is better to keep in mind the limitations of numba and slowly move away from passing lambdas around. In particular because eventually I want to work on making eko run on GPUs*, as the people from numba seem to be working heavily on GPU support now. This should be very good for eko.**

*personally I'm ignoring it for now, but making eko run on GPU is in the "cool things to do" section of my TODO list (cc @scarrazza, they claim to be compatible with AMD cards)

** @felixhekhorn (I'm using github as a slack chat at this point) about eko: maybe we need to have a chat at some point to decide priorities, whether we want to keep it on the side for now or whether we want to start pushing forward to have it ready for the summer meeting (I'll be busy for a few weeks with n3fit-related stuff but will be much freer afterwards)

alecandido commented on August 22, 2024

Probably, if in the end speed does turn out to be an issue, the solution to inlining lambdas is working out a proper computational graph with a suitable library.

Does this mean TensorFlow?
