nnpdf / pinefarm Goto Github PK

View Code? Open in Web Editor NEW

1.0 1.0 0.0 1.27 MB

Generate PineAPPL grids from PineCards

Home Page: https://pinefarm.readthedocs.io

License: GNU General Public License v3.0

Python 100.00%

hep physics

pinefarm's Introduction

NNPDF: An open-source machine learning framework for global analyses of parton distributions

The NNPDF collaboration determines the structure of the proton using Machine Learning methods. This is the main repository of the fitting and analysis frameworks. In particular it contains all the necessary tools to reproduce the NNPDF4.0 PDF determinations.

Documentation

The documentation is available at https://docs.nnpdf.science/

Install

See the NNPDF installation guide for the conda package, and how to build from source.

Please note that the conda based workflow described in the documentation is the only supported one. While it may be possible to set up the code in different ways, we won't be able to provide any assistance.

We follow a rolling development model where the tip of the master branch is expected to be stable, tested and correct. For more information see our releases and compatibility policy.

Cite

This code is described in the following paper:

@article{NNPDF:2021uiq,
    author = "Ball, Richard D. and others",
    collaboration = "NNPDF",
    title = "{An open-source machine learning framework for global analyses of parton distributions}",
    eprint = "2109.02671",
    archivePrefix = "arXiv",
    primaryClass = "hep-ph",
    reportNumber = "Edinburgh 2021/13, Nikhef-2021-020, TIF-UNIMI-2021-12",
    doi = "10.1140/epjc/s10052-021-09747-9",
    journal = "Eur. Phys. J. C",
    volume = "81",
    number = "10",
    pages = "958",
    year = "2021"
}

If you use the code to produce new results in a scientific publication, please follow the Citation Policy, particularly in regards to the papers relevant for QCD NNLO and EW NLO calculations incorporated in the NNPDF dataset.

Contribute

We welcome bug reports or feature requests sent to the issue tracker. You may use the issue tracker for help and questions as well.

If you would like contribute to the code, please follow the Contribution Guidelines.

pinefarm's People

Contributors

Stargazers

Watchers

pinefarm's Issues

Polarized integrability

I'm reopening this PR as I think we should impose integrability on \Delta f(x) and not x \Delta f(x) as currently implemented.
See #58

Originally posted by @giacomomagni in #57 (comment)

Allow integration of different distributions during the same mg5_aMC run

Something that isn't possible with our toolchain is to generate two separate distributions, because all grids are merged together in the end. For the CDF W-boson mass grids that would be beneficial and also for some of our top-pair production grids. We could distinguish distributions that need to be merged together and those which are separate by looking at the names of each histogram: merge grids together if their histograms have the same name, leave them alone if their names differ. Another concern is the metadata, for which we'll need as many files as there are different histograms.

Improve theory handling

Starting with the Python implementation of the runner we have to specify a theory whenever we want to generate a grid, for instance

./rr run TEST_RUN_SH theories/theory_200.yaml

I think we should reflect this in the filename of the generated grid, so in this instance we should generate

TEST_RUN_SH_T200.pineappl.lz4, which tells us the grid was generated with theory 200.
- alternatively use folders
define a unified theory card format, that will include all parameters of all the generators
- some of them can have a default value, if they are not (or not yet) in NNPDF theory db
- it should be possible to generate the unified theory card from an entry of the theory db
add a theory converter for each external
- given a unified theory card, it should extract the minimal theory required for the given generator, by filtering and rearranging

Furthermore, we need to discuss

which parameters should be in the theory database (we definitely need an overcomplete set of parameters),
what the parameter means,
where it is typically used (and sometimes it is important where it isn't: we need to talk about DIS vs. hadron collider observables),

Further steps breakout:

use the theory provided variables (replace the hardcoded ones)
change name to the final grid to include theory dependency
apply a coherent scheme for CLI arguments

Issue with pinefarm cli, theories and pinecards not listed

Hi @cschwan , @felixhekhorn. After installing pinefarm with pip (pip install pinefarm) in a fresh conda environment I try to get the list of pinecards and theories. In both cases the command fails:

$ pinefarm list theories
^[[1;2D/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/pinefarm/cli/_base.py:26: UserWarning: No configuration file detected.
  warnings.warn("No configuration file detected.")
/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/pinefarm/configs.py:82: UserWarning: Using default minimal configuration ('root = $PWD').
  warnings.warn("Using default minimal configuration ('root = $PWD').")

Issues with nnlojet runcard autogeneration

At the moment all the WP / WM datasets are split into a WP and WM runcard. This should only be done if there's only one observable in the set, otherwise assume that the WP and WM channels are separated.
Update the conda package to include the nnlojet python package
Add multi-histogram option (this is needed for more complicated cuts)
Automatize a bit more the options-per-experiment.

Please comment with other problems you find.

Update README

The README isn't up-to-date and in particular it doesn't answer the important question: how do I run a process?

pinefarm requires pinecards to be in a repo

@niclaurenti got hit by the following bug:

in https://github.com/NNPDF/runcards/blob/7f11afce4242791acad47d4c7be393e629b5121d/pinefarm/external/interface.py#L114 we require that we are inside an actual repo (as would be in development mode, so when cloning this repo) - not being so, as can happen e.g. upon pip install pineline[full] and adding some stand alone cards, results in an exception (@niclaurenti if you still have - please paste the full error here)

I guess we should catch that error and eventually leave the field empty

AttributeError: 'PosixPath' object has no attribute 'rstrip'

(env) cschwan@montblanc ~ $ pinefarm run ATLAS_1JET_8TEV_R06 pinefarm/extras/theories/theory_200_1.yaml 
/home/cschwan/runcards/env/lib/python3.8/site-packages/pinefarm/cli/_base.py:24: UserWarning: No configuration file detected.
  warnings.warn("No configuration file detected.")
/home/cschwan/runcards/env/lib/python3.8/site-packages/pinefarm/configs.py:81: UserWarning: Using default minimal configuration ('root = $PWD').
  warnings.warn("Using default minimal configuration ('root = $PWD').")
ATLAS_1JET_8TEV_R06
Computing ATLAS_1JET_8TEV_R06...
✓ Found pineappl
Installing...
Traceback (most recent call last):
  File "/home/cschwan/runcards/env/bin/pinefarm", line 8, in <module>
    sys.exit(command())
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/pinefarm/cli/run.py", line 30, in subcommand
    main(dataset, theory_card, pdf)
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/pinefarm/cli/run.py", line 65, in main
    install_reqs(runner, pdf)
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/pinefarm/cli/run.py", line 84, in install_reqs
    runner.install()
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/pinefarm/external/mg5/__init__.py", line 47, in install
    install.mg5amc()
  File "/home/cschwan/runcards/env/lib/python3.8/site-packages/pinefarm/install.py", line 75, in mg5amc
    shutil.move(el, dest)
  File "/usr/lib/python3.8/shutil.py", line 787, in move
    real_dst = os.path.join(dst, _basename(src))
  File "/usr/lib/python3.8/shutil.py", line 750, in _basename
    return os.path.basename(path.rstrip(sep))
AttributeError: 'PosixPath' object has no attribute 'rstrip'

Add support for PDFs other than protons

Anything other than protons yields a wrong convolution:

https://github.com/NNPDF/runcards/blob/master/runcardsrunner/table.py#L33-L35

This should read out the metadata initial_state_1 and initial_state_2 and call Grid::convolute_with_two if the initial states are different.

Add conversion backend

As we have different MC available as back-end (at the moment mg5 and yadism), we should add a conversion back-end powered by pineappl conversion scripts.

Indeed, we are not able to produce all of the grids needed (and we won't be for quite some time), as some of them are the result of MC runs, with some non-publicly available MC.
In these cases we're gently gifted the runcards, so we should download them from somewhere else (or have the user running rr downloading them), and then convert to pineappl.

Replace clone with download

It seems that to clone a repository from GitHub now you always need to set credentials, whatever URL you are using.

So, I'm proposing to replace repositories (that in general are also an overkill) with the master/main branch content, i.e. instead of cloning with Git we will just download the zip content of one branch from GitHub.
E.g. for this repository the corresponding URL is https://github.com/NNPDF/runcards/archive/refs/heads/master.zip

The other option is to use the latest release, e.g. for PineAPPL the URL to the zip of the last release can be found with a request to:
https://api.github.com/repos/n3pdf/pineappl/releases/latest
(the key in the response is actually zipball_url). Once we have the URL, one GET more and we'll have the zip as well (for tarball replace zip with tar in the key.

Which one do you prefer?

master/main: always updated
latest release: more stable

TODO:

fix version
replace repositories with tarballs
NNPDF/pinecards#127 (comment)
purge unused (and thus unmaintained) Git related code
- if possible, drop the whole pygit2 dependency, if not used for anything left
download MadGraph from its new source https://github.com/mg5amcnlo/mg5amcnlo (branch 3.4.1, see NNPDF/pinecards#142)
purge Breeze if not any longer used

Do we need lz4 v3?

https://github.com/NNPDF/runcards/blob/9c1f06fb6c2e5a439e54b7e8af586d9c12d7ad6f/pyproject.toml#L30

This is connected to this issue NNPDF/pineline#15
I can install lz4 v4.0.2 perfectly fine, but version 3. errors out (I think because it tries to look for Python.h in the wrong place, so clearly their fault (also because they fixed it, even if I could've helped it a bit). But since I needed to install python3.10 somewhere else it doesn't work.

FileNotFoundError when running Madgraph trough pinefarm

Hi @cschwan, I am trying to reproduce one of the pinecards using pinefarm. For ATLAS_TTB_13TEV_TOT (as well as for others) I get this error that seems related to pineappl

INFO:  
INFO: Checking test output: 
INFO: P0_gg_ttx 
INFO:  Result for test_ME: 
Command "launch auto " interrupted with error:
FileNotFoundError : [Errno 2] No such file or directory: '/store/DAMTP/mnc33/Projects_store/PhD/nnpdf40_pheno/pinefarm_runs/results/200-ATLAS_TTB_13TEV_TOT--20231113103915/ATLAS_TTB_13TEV_TOT/SubProcesses/P0_gg_ttx/test_ME.log'
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/store/DAMTP/mnc33/Projects_store/PhD/nnpdf40_pheno/pinefarm_runs/results/200-ATLAS_TTB_13TEV_TOT--20231113103915/ATLAS_TTB_13TEV_TOT/run_01_tag_1_debug.log'.
Please attach this file to your report.
INFO:  
quit
INFO:  
quit
quit
Error calling StartServiceByName for org.freedesktop.Notifications: Timeout was reached
Traceback (most recent call last):
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/bin/pinefarm", line 8, in <module>
    sys.exit(command())
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/pinefarm/cli/run.py", line 31, in subcommand
    main(dataset, theory_card, pdf)
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/pinefarm/cli/run.py", line 67, in main
    run_dataset(runner)
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/pinefarm/cli/run.py", line 122, in run_dataset
    runner.generate_pineappl()
  File "/store/DAMTP/mnc33/miniconda3/envs/pinefarm/lib/python3.9/site-packages/pinefarm/external/mg5/__init__.py", line 184, in generate_pineappl
    grid = pineappl.grid.Grid.read(mg5_grids[0])
IndexError: list index out of range
Thanks for using LHAPDF 6.4.0. Please make sure to cite the paper:
  Eur.Phys.J. C75 (2015) 3, 132  (http://arxiv.org/abs/1412.7420)

Log instead of print

At the moment, in runcardsrunner I'm printing (w/ or w/o rich, even inconsistently...) so I should move to logging (using rich as handler, it will be automatically consistent).

Most likely log.py will have to be updated accordingly (e.g. Tee).

Thanks @scarlehoff for noticing

Explicit Installation Dependencies

I'm not fully happy about the "dependencies": they are a bit hidden inside the functions; nothing we need to deal now, but maybe I'd implement a very basic "dependency manager": a dictionary (containing mutual dependencies), and when start installing things we populate a list, and iterate over it to install stuffs sorted (it should be simple enough, but more explicit)

Originally posted by @alecandido in NNPDF/pinecards#152 (comment)

Improve `pinecard` metadata

See the pending review on https://github.com/NNPDF/pinefarm/pull/23/files/177c81e1ac686f608970ba1cb79cad93e42f8520.

Store pinecard in the grid

As we decided with @felixhekhorn @cschwan and @scarlehoff, we will stop storing the pinecard version in the grid (or maybe make it optional?), and we will write the full tar-gzipped pinecard (the folder) in the metadata, encoded in a string with base64, that is a rather common encoding.

PineAPPL will provide support for extracting the tarball from the metadata (i.e. decode base64 to bytes, a redirect should do the rest of the job, I guess).

Standardize CKM matrix

The CKM matrix (should be) a list of numbers, however it is declared as a string

https://github.com/NNPDF/theories_slim/blob/04a3716991015717bb442038adbe25dec57d2650/data/theory_cards/700.yaml#L3

Personally I'd put a list everywhere. Is there a reason why a string would be preferred?

No valid pineappl installation found when running pinefarm

I am running pinefarm (installed with pip) as following:

pinefarm run runcards/ATLAS_TTB_13TEV_TOT theory_200_1.yaml

The command is executed in a fresh conda environment in which I have only installed lhapdf (but no pineappl.).

The run crashes in the following way, so it seems that pinefarm is not installing PineAPPL on the fly as said in (https://pinefarm.readthedocs.io/en/latest/run.html)

Error detected in "launch auto "
write debug file /store/DAMTP/mnc33/Projects_store/PhD/nnpdf40_pheno/pinefarm_runs/results/200-ATLAS_TTB_13TEV_TOT--20231115093751/ATLAS_TTB_13TEV_TOT/run_01_tag_1_debug.log 
If you need help with this issue, please, contact us on https://answers.launchpad.net/mg5amcnlo
str : No valid pineappl installation found. 
	Please set the path to pineappl-config by using 
	MG5_aMC> set <absolute-path-to-pineappl>/bin/pineappl-config

Moreover, the .prefix/bin folder generated in the same folder in which I run pinefarm is empty

Container architecture

I thought a bit about this idea, and I'm coming with a proposal.

I want to split the pinefarm package in two different ones (but distributed together, as eko and ekobox), to put a boundary between the two.
One it will be the current UI, with the CLI and all the tools (installation, configs, ...).
The other will contain mostly run.py and external/, and it will contain whatever is strictly related to the computation itself.

Of course with a new package I will need a new name, the best I came with is pinefarmer. Alternatives are welcome.

So, pinefarm will do everything it is doing, plus managing the container as well.
pinefarmer instead will be installed inside the container, and it will accept a minimal input from the outside, and perform the actual grid computation.

Then, there is the problem of tooling for containers.

Managing containers

There are two main container engines for our purposes: Docker (by Docker) and Podman (by RedHat). Docker is more or less the first and most popular one, while Podman arrived later on.
There are a few more engines, and in general more complications of many kinds (orchestration, runtimes, ...), mainly because cloud computing is a big market. We are not really interested in cloud computing at the moment, we just want to take a tool from there, but in case you struggle with the vocabulary, RedHat provides a good summary.

Initialize disclaimer: I dislike Docker, at this point also for historical reasons, a few of which still applies, but not all of them. I might be biased.

The main difference between Docker and Podman (besides the companies behind them) is that the first requires a daemon to run (dockerd), and the second one is daemon-less. In the old times (i.e. a couple years ago, at most) dockerd required to run under root user, now they also provide a root-less option (but, if I understood correctly, it is not the default one).
More details on this RedHat page.

This reason for me was sufficient to choose Podman: I could simply do apt install podman, and then use the CLI:

podman pull <container-image>
podman run <contianer-image>
podman ps  # show active containers
...

nothing more.

Now, I'd like not to rely on the CLI availability, and if possible also not relying on anything else to be installed. I would prefer that everything is installed by the pinefarm package installation, as a Python dependency, but I'm not sure I can do.

Docker would require at least the daemon installation, Podman maybe would require nothing.

On the other hand, I'd like to have a ready-to-use Python package, and Docker has it -> docker-py. While Docker is not traditionally open source (even if it released and even "donated" some codes after some time), this is. Podman also has a python package -> podman-py, but it is much less popular and maintained. However, it mostly contains bindings for Podman REST API, so we could skip the package and directly go for requests to the API. But this would require a service to run (and so also the Python package requires it), so not much different from root-less Docker in the end...

They both have a library that is directly accessible, to make use of the same functions of the CLI, so (at least for Podman) it wouldn't require a service to run. But, of course, they are Go libraries:

The Docker one is most likely the corresponding of the Python package (that I expect to be bindings to it), but it might require the daemon anyhow. I also expect you do not want to move pinefarm to Go...

Fix datasets with jets

In Madgraph5_aMC@NLO v3.3.1 the cutting routine was changed, so that pjet isn't calculated in the place where we use it anymore. We either have to 1) run the jet algorithm ourselves or 2) move jet-related custom cuts from passcuts_user to passcuts_jets in cuts.f.

Missing `tomlkit`

At least on dom on the DIS-more branch I get:

Error processing line 1 of /home/cschwan/.local/lib/python3.10/site-packages/zzz_poetry_dynamic_versioning.pth:

  Traceback (most recent call last):
    File "/usr/lib/python3.10/site.py", line 186, in addpackage
      exec(line)
    File "<string>", line 1, in <module>
    File "/home/cschwan/.local/lib/python3.10/site-packages/poetry_dynamic_versioning/__init__.py", line 13, in <module>
      import tomlkit
  ModuleNotFoundError: No module named 'tomlkit'