
reportengine's Introduction


Reportengine

Reportengine is a framework for developing scientific applications. It focuses on supporting declarative input (YAML), enforcing initialization-time ("compile time") constraints, and enabling iteration within the declarative input.

It includes support for figures, tables (pandas) and HTML reports.

The documentation of the NNPDF-specific implementation can be found here:

https://data.nnpdf.science/validphys-docs/guide.html

An example application can be found in the example directory.

Install

It is recommended to work with the package using conda.

For linux or Mac, you can install a precompiled package by running

conda install reportengine -c https://packages.nnpdf.science/conda

Alternatively the package can be installed from pip:

pip install reportengine

Note that it additionally requires pandoc to work.

Development

Install in development mode:

pip install -e .

Running the tests

Easiest way is:

pytest

reportengine's People

Contributors

alecandido, comane, roystegeman, scarlehoff, scarrazza, voisey, wilsonmr, zaharid


reportengine's Issues

Allow multiple config classes

Users should be able to define a Config class in each provider module, with the effective application Config generated as a mixin of them all, in the same order in which the providers are declared.
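A minimal sketch of how such a mixin could be assembled with type(); the class names here (BaseConfig, PlotsConfig, TablesConfig) are illustrative placeholders, not reportengine API:

```python
class BaseConfig:
    """Stand-in for the framework's base Config class."""


class PlotsConfig(BaseConfig):
    def parse_style(self, value):
        return str(value)


class TablesConfig(BaseConfig):
    def parse_precision(self, value):
        return int(value)


def make_app_config(provider_configs):
    """Combine per-provider Config classes into one application Config.

    The MRO follows provider declaration order, so earlier providers
    take precedence on name clashes."""
    return type("AppConfig", tuple(provider_configs) + (BaseConfig,), {})


AppConfig = make_app_config([PlotsConfig, TablesConfig])
```

The same idea extends to however many providers the application declares, as long as the resulting MRO is consistent.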

Node identity should depend on inputs, not nsspec

Right now two nodes from provider functions are considered equal if their namespaces match. Equality should instead explicitly consider the inputs of the nodes (which, by assumption, ultimately stem from the user's config). This would avoid duplicating computations that are known to have the same result.

One problem is that (nested) YAML lists and mappings can be used as inputs to the providers, and we would need a way to freeze them to make them hashable.
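One possible shape for such a freeze function, sketched here under the assumption that inputs are plain YAML-style lists, mappings, and scalars:

```python
def freeze(obj):
    """Recursively convert nested YAML-style lists and mappings into
    hashable tuples, so node inputs can be compared and used as cache
    keys. Mapping items are sorted by key so that two equal mappings
    freeze to the same value regardless of insertion order."""
    if isinstance(obj, dict):
        return tuple(sorted((k, freeze(v)) for k, v in obj.items()))
    if isinstance(obj, (list, tuple)):
        return tuple(freeze(x) for x in obj)
    return obj
```

This loses the distinction between lists and tuples and assumes string keys, which matches what a YAML runcard can express.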

Problem with lockfiles

I'm getting the following error using the latest version of reportengine

Traceback (most recent call last):
  File "/media/storageSSD/Academic_Workspace/NNPDF/source/nnpdf/n3fit/src/n3fit/n3fit.py", line 193, in run
    super().run()
  File "/media/storageSSD/Academic_Workspace/NNPDF/source/nnpdf/validphys2/src/validphys/app.py", line 144, in run
    super().run()
  File "/media/storageSSD/Academic_Workspace/NNPDF/source/reportengine/src/reportengine/app.py", line 353, in run
    c.dump_lockfile()
  File "/media/storageSSD/Academic_Workspace/NNPDF/source/reportengine/src/reportengine/configparser.py", line 803, in dump_lockfile
    with open(self.environment.input_folder/"lockfile.yaml", "w+") as f:
AttributeError: 'N3FitEnvironment' object has no attribute 'input_folder'

For now I'm staying at 901aa52, which was working fine.

Things that might or might not be relevant:

  • I am not using conda (in conda it seems to work fine, not sure whether the conda version of reportengine is up to date)
  • I've done a fresh installation of nnpdf (and all dependencies)
  • I'm using python 3.8.2

(not sure whether I should've changed something in n3fit to accommodate the changes)

Conditional execution

I have been thinking for a while on having actions of the form

name::space action | condition_action

and similarly

{@with name::space | condition_action@}

where the new thing in both cases is the | condition_action bit. The semantics are that condition_action gets executed for each possibility in the namespace, but action (or the thing inside the with block) only gets executed for the namespaces in which condition_action is true. From the point of view of the checks, we always assume that everything is going to be executed, and the checks must pass in every namespace.

We currently cannot express this in reportengine. The use cases include things like "Make a bunch of plots if the chi2 is too high and show them on top of the report with scary letters in red".

This is not so trivial, because we would have to modify the execution graph at runtime in a non-trivial way. We would like that, if action is something like:

def action(expensive_thing): ...

then expensive_thing is not computed if condition_action is False and action is the only node that requires it. One simple solution that doesn't work is to make every dependency of action depend on condition_action, sacrificing some parallelism (because we would need to compute condition_action before any of the dependencies). This doesn't work because condition_action could itself depend on e.g. expensive_thing, which would create a cycle.

Instead we would need some careful bookkeeping on what depends on which condition, and the ability to prune nodes at runtime.

This can only happen once the code has been rewritten.
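The easy half of the bookkeeping can be illustrated with a toy scheduler; this is a sketch, not reportengine's actual execution model. It assumes each gated node lists its condition among its dependencies, so the condition is always computed first. The genuinely hard part, also skipping inputs that only pruned nodes need, is deliberately not shown:

```python
def topo_sort(graph):
    """Dependency-first ordering of {node: [deps]} (assumes a DAG)."""
    order, seen = [], set()

    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for d in graph[n]:
            visit(d)
        order.append(n)

    for n in graph:
        visit(n)
    return order


def run(graph, gates, compute):
    """Evaluate nodes in dependency order, pruning each gated node
    whose condition came out falsy, and anything downstream of a
    pruned node. `gates` maps a node to its condition node."""
    results, pruned = {}, set()
    for node in topo_sort(graph):
        cond = gates.get(node)
        if cond is not None and not results.get(cond):
            pruned.add(node)
            continue
        if any(d in pruned for d in graph[node]):
            pruned.add(node)
            continue
        results[node] = compute(node, [results[d] for d in graph[node]])
    return results
```

Note that in this naive version the dependencies of a pruned node may already have been computed by the time the condition is known, which is precisely the inefficiency the issue wants to avoid.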

Add table formats

This should work similarly to figure formats.

Ideally it should be possible to save tables in latex and parquet as well as html.

Namespace production rules give wrong results for implicit keys

The following validphys runcard

theory:
    from_: fit

theoryid:
    from_: theory

use_cuts: "fromfit"


pdf:
    from_: fit


experiments:
    from_: fit

dataspecs:
  - fit: NNPDF31_nlo_as_0118_1000

  - fit: NNPDF31_nnlo_as_0118_1000

actions_:
    - matched_datasets_from_dataspecs::dataspecs plot_fancy_dataspecs

gives the wrong results in that the datasets incorrectly resolve to the first (NLO) value. It works fine if

experiments:
    from_: fit

is inside each dataspec. This has to do with the fact that even though we specify write=False in the validphys production rules, the parameter is not propagated to resolve_signature_params. This change:

diff --git a/src/reportengine/configparser.py b/src/reportengine/configparser.py
index 26138aa..3d79657 100644
--- a/src/reportengine/configparser.py
+++ b/src/reportengine/configparser.py
@@ -294,7 +294,7 @@ class Config(metaclass=ConfigMetaClass):
                                                ns,
                                                input_params= input_params,
                                                max_index=max_index,
-                                               parents=parents)
+                                               parents=parents, write=False)
             except KeyError:
                 if param.default is not sig.empty:
                     pval = param.default

appears to fix the problem, but then some other test fails. Have to check if correctly propagating the argument works. Or rewrite the whole thing in a way that is not crazy.

Set CPU affinity

Python libraries (like tensorflow or numpy) have a tendency to open threads and processes without restraint. As a result, running several tasks in parallel can be inefficient, as there is unneeded competition for resources.

A solution to this is setting the CPU affinity of the resource-hungry program to ensure it doesn't spill over the rest of the computer. This can easily be done with a system call to taskset (with something like a force_single_core: true flag in the runcard).

Pros:

  • You can abstract the call to taskset a bit. I think setting the affinity once the program is running is better than launching the program with taskset.
  • You can even set a different number of cores depending on the load of the system

Cons:

  • (many) To begin with, I am not sure this is within the scope of reportengine.
  • It will not work on every computer.
  • Manually choosing which CPUs to run on is not always advisable and sometimes a really bad idea. The OS will usually do a better job than you; this is only useful to rein in very resource-hungry libraries.

@Zaharid, if you are not against it I can have a look at implementing it at the level of reportengine in a more-or-less robust way. Right now, when I need it, I just hack in the two lines I need:

import os
import subprocess

# Pin the current process (and its threads) to CPU 1
pid = os.getpid()
subprocess.run(["taskset", "-pc", "1", str(pid)], check=True)

let me know what you think (or if you know of a better way of doing this). I didn't want to do a PR directly because I am really not sure whether the pros are "pro enough" to compensate for the cons.
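For what it's worth, the standard library exposes the same Linux affinity syscalls directly, which avoids shelling out to taskset altogether; this is a sketch of that alternative (Linux-only):

```python
import os


def restrict_to_cpus(cpus):
    """Pin the current process to the given CPU indices and return
    the resulting affinity set. Wraps the Linux sched_setaffinity
    syscall; pid 0 means the calling process."""
    os.sched_setaffinity(0, set(cpus))
    return os.sched_getaffinity(0)
```

A runcard flag could then translate into one call to restrict_to_cpus at startup, with no dependency on the taskset binary being installed.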

checks should have an apply method

... Especially argchecks.

When the make_argcheck decorator is applied, the result is a function that takes another function as an argument and instructs reportengine to apply the corresponding check. However, we may be interested in simply using the checking function outside the framework. A possible solution is to have make_argcheck add an apply method to the checks, with the same effect as the original function.
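A sketch of the proposal; this is a simplified stand-in for reportengine's actual make_argcheck, kept only to show where the apply attribute would be attached:

```python
def make_argcheck(check_func):
    """Simplified stand-in: wrap a checking function as a decorator
    that registers the check on a provider, while also keeping the
    original function reachable as `apply`."""

    def decorator(provider):
        checks = getattr(provider, "_checks", [])
        provider._checks = checks + [check_func]
        return provider

    # The proposed addition: call the check directly, outside the
    # framework's checking machinery.
    decorator.apply = check_func
    return decorator
```

With this, code that only wants the validation logic can call the_check.apply(...) without going through a provider at all.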

Document dynamic provider dispatch in reportengine

The idea is that you can use the explicit_node decorator in the config parser to return functions, which then have the role of providers.

One can do something like:

# In a provider module:
def action1(some, required, resources):
    ...

def action2(totally, different, resources):
    ...

def generic_table(dispatch_action):
    # A table that works for both the result of action1 and action2
    ...

def generic_plot(generic_table):
    # A plot that works with the output of generic_table and might not
    # care about action1 and action2
    ...


# In the config module:
class Config(...):
    @configparser.explicit_node
    def produce_dispatch_action(self, dispatch_value: str):
        if dispatch_value == "action1":
            return action1
        elif dispatch_value == "action2":
            return action2
        raise ConfigError(...)

Then one could do:

#runcard.yaml
dispatch_value: "action1"
some: ...
required: ...
input_for_resource_called_resources: ...
actions_:
  - generic_plot

This has the crucial advantage that one does not have to redo the whole pipeline (i.e. there is only one generic_table). Runtime dispatch (as in one big substitute of generic_table that takes the inputs of both action1 and action2) will not work well if the various actions have completely different inputs.
One disadvantage is that it obfuscates the help.

This should all be written in the guide somehow.

Version available in pypi doesn't work correctly in Mac OS

I haven't pinpointed the problem, but the collection over experiments in NNPDF was not working correctly (and so I was getting a mismatch of arrays).

Installing (with pip install .) from this repository directly fixed the problem.

If I have time I'll try to debug it a bit more (as far as I can see no package was updated and the version of reportengine is the same). For the time being I'm leaving this issue here as information for other users.

Use the report-style action specifications in the config

The meaning of actions_ is incredibly cumbersome, and in fact I don't think I ever documented it explicitly anywhere. There is no good way of representing actions in YAML. Instead, we should use the action specification of the report, so that actions is a list of strings of the form

actions:
    - "name::space action"

We would require a key different from actions_ to preserve backwards compatibility. Perhaps it could be actions without the underscore.

Using key: {from_: <production rule>} overwrites value for key in collect

This is linked to NNPDF/nnpdf#1008 but I have found another example:

fit: 031120-mw-001
pdf: {from_: fitpdf} # take key from production rule

pdfs:
 - NNPDF31_nnlo_as_0118

actions_:
 - report(main=True)

template_text: |
 {@with pdfs@}
 {@pdf@}
 {@endwith@}

outputs 031120-mw-001. However:

fit: 031120-mw-001
pdf: {from_: fit} # take key from a resource with `as_input` method

pdfs:
 - NNPDF31_nnlo_as_0118

actions_:
 - report(main=True)

template_text: |
 {@with pdfs@}
 {@pdf@}
 {@endwith@}

does not, it outputs NNPDF31_nnlo_as_0118. I personally think the second behaviour is correct but maybe I have a bad opinion!

If we want to change this, then I guess there is something special about how {from_: <production rule>} interacts with collect? I guess it doesn't get resolved and somehow takes precedence while unresolved, whereas {from_: } gets resolved sooner?

Warn about unused keys

The semantics are trickier than it seems. We should detect whether the parent mapping is used as a namespace and if yes track every key inside. E.g. the root level is always a namespace but may contain a dict where we use the literal value.

However the identity comparison should be by input key rather than by nsspec.

This probably requires a much needed rewrite of the resolve_key machinery.
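For the flat case, the tracking itself is simple; this is a minimal sketch of a mapping that records reads, with the caveat the issue raises: the hard part is deciding which nested mappings count as namespaces at all, which this does not address:

```python
class TrackingNamespace(dict):
    """Minimal sketch: a mapping that records which keys were read,
    so unread input keys can be reported at the end of a run.
    Only direct __getitem__ access is tracked; nested namespaces
    and methods like .get() are not handled here."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.used = set()

    def __getitem__(self, key):
        self.used.add(key)
        return super().__getitem__(key)

    def unused_keys(self):
        return set(self) - self.used
```

A warning at the end of a run would then just report unused_keys() for each mapping that was actually used as a namespace.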

Improve CI infrastructure

CIs have improved a lot since we set up GitLab for reportengine. Moreover, we don't need a complicated setup for a simple Python package. Also, reportengine is public, which means that a lot of tools are freely available.

The setup in https://github.com/Zaharid/validobj/ is how it would look if done today.

Improve workflow for papers

The experience of working on a paper with a large vp_runcards folder, from which figures are produced, could be better when it comes to naming files. At the moment we have something like

.
├── paper.tex
├── plots
│   └── sec_somename_plot_pdfs_g.pdf
└── vp_runcards
    ├── output
    │   └── figures
    │       └── plot_pdfs_g.pdf
    └── plot_some_pdfs.yaml

where typically output is not committed to the repository and instead the results are copied and renamed manually into the plots directory. The problem with that approach is that redoing the plots (for example to improve the style) is quite a chore: one needs to identify the relevant files in the paper, then find the runcard that produces them, make the changes, run validphys, hope that output is clean and doesn't lead to some kind of confusion, and rename the files manually as appropriate (which may or may not be a task involving regular expressions). This is error prone and somewhat time consuming when one has to do it many times.

An easy solution would be to change the way we work, e.g. to use subfolders for each invocation of vp and commit the results of running it. The problem with this is that arXiv doesn't like subfolders, so instead the filenames of the plots should be sufficiently unique. For that I think life would be simpler if we had a flag like --folder-prefix (or a better name) that:

  • Defaulted the output folder name to the stem of the runcard filename.
  • Prepended the runcard filename as prefix to all figures and tables.

Then if we had something like, say, pdfs_40vs31.yaml, we would get pdfs_40vs31-plot_pdfs_g.pdf, which is an acceptable enough name to refer to in the tex, avoids conflicts with other plot_pdfs_g files from different runcards, and can easily be copy-pasted (or preferably symlinked).

cc @enocera
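The naming scheme proposed above can be sketched in a few lines; the flag does not exist yet and the function name here is purely illustrative:

```python
from pathlib import Path


def output_paths(runcard, figure_names):
    """Sketch of the proposed --folder-prefix behaviour: derive both
    the output folder and a per-figure filename prefix from the stem
    of the runcard filename."""
    stem = Path(runcard).stem
    outdir = Path(stem)
    return [outdir / f"{stem}-{name}" for name in figure_names]
```

For the example in the issue, a runcard pdfs_40vs31.yaml producing plot_pdfs_g.pdf would yield pdfs_40vs31/pdfs_40vs31-plot_pdfs_g.pdf.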

Provider modules should be able to install Configs

It makes little sense to use e.g. reportengine.report without also adding the corresponding Config class. Moreover, when one forgets it, strange things happen, like templates not being parsed but shipped verbatim to pandoc, which makes them hard to debug.
