gamma-speed's Introduction

Gammapy

A Python Package for Gamma-ray Astronomy.

Gammapy is an open-source Python package for gamma-ray astronomy built on Numpy, Scipy and Astropy. It is used as the core library for the Science Analysis tools of the Cherenkov Telescope Array (CTA), is recommended by the H.E.S.S. collaboration for science publications, and is already widely used in the analysis of data from existing gamma-ray instruments such as MAGIC, VERITAS and HAWC.

Contributing Code, Documentation, or Feedback

The Gammapy Project is made both by and for its users, so we welcome and encourage contributions of many kinds. Our goal is to keep this a positive, inclusive, successful, and growing community by abiding with the Gammapy Community Code of Conduct.

The Gammapy project uses a mechanism known as a Developer Certificate of Origin (DCO). The DCO is a binding statement asserting that you are the creator of your contribution and that you wish to allow Gammapy to use your work, which will cite you as a contributor. More detailed information on contributing to the project or submitting feedback can be found on the Contributing page.

Licence

Gammapy is licensed under a 3-clause BSD style license - see the LICENSE.rst file.

Supporting the project

The Gammapy project is not sponsored; development is carried out by the staff of the supporting institutes during their research time. Contributions of any kind, whether occasional or regular, are therefore encouraged.

Status shields

(mostly useful for developers)

  • Codacy
  • GitHub actions CI

gamma-speed's People

Contributors: cdeil, ignatndr

gamma-speed's Issues

Pipeline issues

  1. The single-core pipeline takes much longer to execute than expected. Why?
  2. Why does the pipeline sometimes crash?

(plot: save_speed_up)

Check out basic monitoring tools

For a given process we want to measure time series of CPU, memory and disk usage, e.g. log these quantities to an ASCII file every second.

Please read the manuals or tutorials for these tools and see what kind of measurements they provide:

  • top and variants like atop and htop
  • iostat
  • ...

Things to pay attention to:

  • Is it possible to log to file? (If not it's pretty much useless for us.)
  • Is it possible to only monitor one process? How do you tell the tool which process you want to monitor?

Make a wiki page to document what you find out.
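A minimal sketch of such a per-process logger, assuming the third-party psutil package is available (the import is guarded so the helpers below still load without it; names are illustrative, not the repo's monitor.py):

```python
import time

try:
    import psutil  # third-party; assumed installed via pip
except ImportError:
    psutil = None

def format_sample(t, cpu_percent, rss_bytes):
    # One log line: elapsed seconds, CPU %, resident memory in MiB.
    return "%.1f %.1f %.1f" % (t, cpu_percent, rss_bytes / 1024.0 / 1024.0)

def monitor(pid, logfile, interval=1.0):
    # Log CPU and memory usage of one process to an ASCII file,
    # one sample per `interval` seconds, until the process exits.
    proc = psutil.Process(pid)
    start = time.time()
    with open(logfile, "w") as f:
        while proc.is_running():
            cpu = proc.cpu_percent(interval=interval)
            rss = proc.memory_info().rss
            f.write(format_sample(time.time() - start, cpu, rss) + "\n")
```

Unlike top or iostat, this trivially answers both questions above: it logs to file, and it watches exactly one process selected by PID.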

Some random links I found with Google that might be helpful:

Model and measure ctlike runtime

There should be simple approximate models for the ctlike runtime.

  • Unbinned: t = t(n_threads, n_obs, n_events)
  • Binned: t = t(n_threads, n_obs, n_bins)

E.g. a model for the unbinned case could be

t = A + B * (n_obs / n_threads) + C * (n_events / n_threads)

... or not ... we noticed that the unbinned ctlike runtime was the same in this case:

  • n_threads = 3, n_obs = 100, n_events = 1700k
  • n_threads = 3, n_obs = 100, n_events = 36k
    i.e. a factor of 50 in the number of events didn't matter.

In detail the runtime will of course also depend e.g. on the model and model parameter start values, but there should be regimes with simple runtime scaling behaviours.
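The unbinned model above can be sketched as a simple function; the coefficient values here are made-up placeholders that would have to be determined by fitting measured runtimes:

```python
def predict_runtime(n_threads, n_obs, n_events, A=1.0, B=0.05, C=1e-5):
    # Unbinned ctlike runtime model: t = A + B*(n_obs/n_threads) + C*(n_events/n_threads)
    # A, B, C are hypothetical coefficients, to be fit to measurements.
    return A + B * (n_obs / n_threads) + C * (n_events / n_threads)
```

The n_threads = 3, n_obs = 100 observation above would then suggest that the C term is negligible in that regime.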

Make monitor work on Mac

$ ./monitor.py "../scripts/use_cpu"
Traceback (most recent call last):
  File "./monitor.py", line 143, in <module>
    main()
  File "./monitor.py", line 140, in main
    cpuinterval=args.timeinterval)
  File "./monitor.py", line 46, in monitor
    self.process.get_io_counters()[0],
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/psutil/__init__.py", line 881, in __getattribute__
    %(self.__class__.__name__, name))
AttributeError: Popen instance has no attribute 'get_io_counters'
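The error is consistent with io_counters not being implemented on macOS in psutil (and with the get_io_counters → io_counters rename in psutil 2.0). A version- and platform-tolerant accessor could look like this (a sketch, not the project's actual fix):

```python
def get_io(proc):
    # Try the new psutil name first, then the old one; psutil's wrappers
    # raise AttributeError where a counter is unsupported (e.g. on macOS),
    # so fall back to None in that case.
    for name in ("io_counters", "get_io_counters"):
        fn = getattr(proc, name, None)
        if fn is not None:
            return fn()
    return None
```

monitor.py would then skip the disk I/O columns when get_io() returns None instead of crashing.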

Write Python prototype parallel likelihood function

Some parts of the evaluation of the likelihood function can be parallelised: e.g. as described on slides 13 to 15 here, the model has to be evaluated for a large number of bins, and the fit statistic then computed and summed over those bins.

We should create an IPython notebook that implements certain steps in parallel using the Python multiprocessing module (an example is here), so that we can do some prototyping and timing to find for which data sizes or model-function evaluation costs the splitting gives good speedups.
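As a sketch of the bin-splitting idea (using the Cash statistic as a stand-in fit statistic; function names are illustrative):

```python
import math
from multiprocessing import Pool

def cash_chunk(args):
    # Cash fit statistic for one chunk of bins: 2 * sum(m - n*ln(m)),
    # with n = observed counts and m = model prediction per bin.
    counts, model = args
    return 2.0 * sum(m - n * math.log(m) for n, m in zip(counts, model))

def parallel_cash(counts, model, n_workers=4):
    # Split the bins into chunks, evaluate each chunk in a worker
    # process, then sum the partial fit statistics.
    size = max(1, len(counts) // n_workers)
    chunks = [(counts[i:i + size], model[i:i + size])
              for i in range(0, len(counts), size)]
    with Pool(n_workers) as pool:
        return sum(pool.map(cash_chunk, chunks))
```

Timing parallel_cash against a serial sum for different bin counts would show where process startup and pickling overhead outweighs the speedup.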

install_ctools.py deadlocks

The script manages to download the github repositories but when it has to execute ./configure it deadlocks. Below is the output I get:

$ ./install_ctools.py -log=True
INFO - Creating installation: extralog_install
remote: Counting objects: 27847, done.
remote: Compressing objects: 100% (6177/6177), done.
remote: Total 27847 (delta 21970), reused 27469 (delta 21595)
Receiving objects: 100% (27847/27847), 82.90 MiB | 1.33 MiB/s, done.
Resolving deltas: 100% (21970/21970), done.
remote: Counting objects: 2953, done.
remote: Compressing objects: 100% (1331/1331), done.
remote: Total 2953 (delta 1626), reused 2927 (delta 1600)
Receiving objects: 100% (2953/2953), 3.47 MiB | 450 KiB/s, done.
Resolving deltas: 100% (1626/1626), done.
INFO - software successfuly downloaded
INFO - Entered GAMMALIB install
Switched to a new branch 'gammaspeed_extra_log'
configure.ac:74: installing `./config.guess'
configure.ac:74: installing `./config.sub'
configure.ac:35: installing `./install-sh'
configure.ac:35: installing `./missing'
configure: WARNING: Python wrapper(s) missing. Requires swig for wrapper generation.
config.status: WARNING:  'src/gammalib-setup.in' seems to ignore the --datarootdir setting
libtool: link: warning: `-version-info/-version-number' is ignored for convenience libraries
libtool: link: warning: `-version-info/-version-number' is ignored for convenience libraries
libtool: link: warning: `-version-info/-version-number' is ignored for convenience libraries
^CERROR - GAMMALIB install failed
make[3]: *** [GCOMSupport.lo] Error 1
make[2]: *** [all-recursive] Interrupt
make[1]: *** [all-recursive] Interrupt
make: *** [all] Interrupt

I killed the installer with ^C after it stopped making progress. Usage of ./install_ctools.py:

./install_ctools.py -log=True for the extra-logging version
./install_ctools.py -gen=True for the normal version

My best guesses are:

  1. Popen is trying to execute configure, make and make install at the same time, but I am not sure about that.
  2. Somehow, proc.wait() is sending the whole thing into a deadlock.
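A common cause of such hangs is a child process whose stdout/stderr PIPE fills up while the parent blocks in proc.wait(). One safe pattern (a sketch with illustrative names, not the script's actual code) is to redirect output to a log file, or to use communicate() instead of wait():

```python
import subprocess

def run_step(cmd, logfile):
    # Send child output to a file so no PIPE buffer can fill up and
    # block the child while the parent sits in proc.wait().
    with open(logfile, "a") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        return proc.wait()
```

Running configure, make and make install as three sequential run_step() calls, each checked for a zero return code, also rules out guess 1.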

Efficiency issues

The parallel parts of the ctools code do not achieve perfect efficiency. In theory the speedup should be linear, i.e. speedup = number of cores; in practice the efficiency is less than perfect.

Below are the efficiency and speedup of ctobssim for the parallel part of the code.
(plot: ctobssim_amdahl_paralle_speed_up)
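The ideal-vs-measured comparison follows Amdahl's law; a small helper to compute the expected speedup and efficiency for a given parallel fraction p (p itself has to be estimated from the measurements):

```python
def amdahl_speedup(p, n):
    # Amdahl's law: p is the parallelisable fraction of the runtime,
    # n is the number of cores.
    return 1.0 / ((1.0 - p) + p / n)

def efficiency(p, n):
    # Efficiency = speedup / cores; equals 1.0 only when p == 1.
    return amdahl_speedup(p, n) / n
```

E.g. even with 90% of the runtime parallelised, 4 cores give only about a 3.1x speedup (77% efficiency), which is the kind of curve the plot above shows.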

Measure FITS I/O performance

I'd like to measure some FITS I/O (raw, and how it's used in gammalib) to see whether the performance is good, and compare it to ROOT and HDF5.

I started some docs (they only contain some links at the moment).
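A minimal timing harness for such benchmarks (the astropy import is an assumption and is guarded; time_call itself works for any I/O call):

```python
import time

try:
    from astropy.io import fits  # third-party; assumed installed
except ImportError:
    fits = None

def time_call(func, *args, **kwargs):
    # Return (result, elapsed wall-clock seconds) for a single call.
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - t0

# Example (requires astropy and a test file):
#   hdus, dt = time_call(fits.open, "events.fits")
```

The same wrapper can time ROOT or HDF5 reads, so the three formats get compared with identical methodology.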

Check out profiling tools

This is related to issue #6; some tools do monitoring and/or profiling.

By "monitoring" CPU / memory / disk I/O usage I mean looking at a process as a whole.
By "profiling" I mean, in addition to looking at the total process, also looking at where in the code the CPU spends time (the main focus), allocates memory, and does disk I/O.

Here are some profiling tools:
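For the pure-Python parts of the pipeline, the standard library's cProfile already gives the per-function breakdown described above; a small wrapper:

```python
import cProfile
import io
import pstats

def profile(func, *args):
    # Run func under cProfile and return the top of the report,
    # sorted by cumulative time, as a string.
    pr = cProfile.Profile()
    pr.enable()
    func(*args)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
    return buf.getvalue()
```

For the C++ code in gammalib/ctools, OS-level profilers (e.g. gprof, perf, or Instruments on Mac) are needed instead.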

Understand optimizer algorithm

@ignatndr Before attempting to profile ctlike we should understand the optimizer method used, i.e. the Levenberg–Marquardt algorithm:
https://cta-redmine.irap.omp.eu/projects/gammalib/wiki/GOptimizerLM

The Wikipedia article seems like a good starting point, with tons of references:
http://en.wikipedia.org/wiki/Levenberg–Marquardt_algorithm
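To build intuition before profiling, here is a toy one-parameter Levenberg–Marquardt fit; it is a sketch of the damping idea only, not GOptimizerLM's implementation:

```python
def levenberg_marquardt_1d(xs, ys, a=0.0, lam=1e-3, n_iter=50):
    # Fit y = a*x by least squares, illustrating LM damping:
    # take a damped Gauss-Newton step, accept it only if chi2 improves,
    # and adapt the damping factor lam accordingly.
    def chi2(a_):
        return sum((y - a_ * x) ** 2 for x, y in zip(xs, ys))
    for _ in range(n_iter):
        g = sum(-x * (y - a * x) for x, y in zip(xs, ys))  # half of d(chi2)/da
        H = sum(x * x for x in xs)                         # Gauss-Newton Hessian
        a_new = a - g / (H * (1.0 + lam))                  # damped step
        if chi2(a_new) < chi2(a):
            a, lam = a_new, lam / 10.0   # accept step, reduce damping
        else:
            lam *= 10.0                  # reject step, increase damping
    return a
```

The accept/reject loop is the part that matters for runtime: each iteration needs a full model and fit-statistic evaluation, which is exactly the work ctlike parallelises.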

Some links:

Find/build tool for file access log

Since disk I/O is proving to be more difficult to monitor than was originally foreseen, it would probably be useful to find a tool that can show the order in which different files have been accessed (for read or write) by a certain process.
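At the OS level this is what strace (Linux) or dtruss (Mac) provide. For the Python side of a pipeline, CPython's audit hooks can record file opens in order within the current process; a sketch (requires Python 3.8+, and the hook cannot be removed once installed):

```python
import sys

def install_open_logger(log):
    # Append the path of every file opened by this Python process,
    # in access order, to the `log` list.
    def hook(event, args):
        if event == "open":
            log.append(args[0])
    sys.addaudithook(hook)
```

This only sees opens made by the Python interpreter itself, so for gammalib/ctools binaries the strace route is still needed.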
