pd4ml's Issues
Preprocessing for non-graph inputs
Thinking a bit more about this, I see now what the technical issue is. We have custom preprocessing for the common graph model, done by the load_graph method of each dataset. I abused this in the Belle dataset to also do (in addition to creating the adjacency matrix) some additional preprocessing of the PDG ids (one-hot encoding).
Now, this is not done for the inputs to the common FCN network, which would make the comparison a bit unfair. I can't apply this preprocessing beforehand, because to train the reference model I actually need the (non-one-hot-encoded) PDG ids as inputs, since it uses an embedding layer.
How to resolve this? Some options I could think of:
- Don't do it and accept that the comparison between the FCN and the graph network is a bit unfair.
- Also drop the one-hot encoding for the graph network. This way the comparison between the common models is fair, but now the comparison to the reference model would be a bit unfair ...
- Add e.g. a load_flat method to the datasets, which is used for the FCN.
- Allow a model-specific callback in the datasets that does some additional preprocessing that fits the particular model.
I think I would tend towards the load_flat option. How is the situation for the other datasets? What do others think?
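The load_flat option could be sketched roughly as below. All names here (BelleDataset, the attribute layout, the number of PDG classes) are hypothetical illustrations, not the actual pd4ml API; the adjacency-matrix construction is omitted for brevity:

```python
import numpy as np

class BelleDataset:
    """Illustrative dataset wrapper; names and shapes are hypothetical."""

    def __init__(self, features, pdg_ids, n_pdg_classes=50):
        self.features = features        # shape (n_events, n_particles, n_features)
        self.pdg_ids = pdg_ids          # integer PDG ids, shape (n_events, n_particles)
        self.n_pdg_classes = n_pdg_classes

    def load_graph(self):
        # graph-model input: one-hot encoded PDG ids appended to the features
        # (adjacency matrix construction omitted in this sketch)
        one_hot = np.eye(self.n_pdg_classes)[self.pdg_ids]
        return np.concatenate([self.features, one_hot], axis=-1)

    def load_flat(self):
        # FCN input: the same one-hot encoding, then flattened per event,
        # so both common models see identically preprocessed features
        one_hot = np.eye(self.n_pdg_classes)[self.pdg_ids]
        x = np.concatenate([self.features, one_hot], axis=-1)
        return x.reshape(len(x), -1)
```

The point of the sketch is that load_graph and load_flat share the one-hot step, so the FCN/graph comparison stays fair, while the raw integer ids remain available for the embedding-based reference model.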
Unable to download datasets?
When attempting to load a dataset (e.g., TopTagging.load('train', path = './pd4ml/datasets')) I am getting an error message regarding an "invalid load key, '<'". The cause appears to be that the downloaded 1_top_tagging_2M.npz file is an html error message. Indeed, if I try to directly access the file using a web browser (https://desycloud.desy.de/index.php/s/aZqyNSg4B7nn8qQ/download) I get the same error message that "The document could not be found on the server. Maybe the share was deleted or has expired?".
Are these files still available?
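The "invalid load key, '<'" error arises because numpy tries to unpickle the HTML error page. A small guard like the following (a sketch, not part of the repository) would turn that into a clearer failure, using the fact that valid .npz files are zip archives starting with the 'PK' magic bytes:

```python
import numpy as np

def load_npz_checked(path):
    """Load an .npz file, but fail with a clear message if the download
    actually returned an HTML error page (a common cause of
    "invalid load key" errors)."""
    with open(path, "rb") as f:
        head = f.read(4)
    # valid .npz files are zip archives and start with the b"PK" magic;
    # an HTML error page typically starts with b"<"
    if not head.startswith(b"PK"):
        raise IOError(
            f"{path} does not look like a valid .npz file "
            f"(starts with {head!r}); the download link may have expired."
        )
    return np.load(path)
```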
Pretrained models
Hi @erikbuh @WilliamKorcari ,
So far the repository supports the reference models in terms of code.
I suppose some models have huge training requirements and long runtimes. What do you think of providing pretrained models for the reference models? Maybe they could be hosted in the same way as the datasets and loaded on demand, if one does not have the resources or time to retrain from scratch.
This is just an idea which came to my mind earlier, let me know what you think :)
(Another thing: due to non-deterministic initialisation, the training result will be slightly different every time. Therefore, when quoting the performance of a model publicly, it would be safer and more correct to have pretrained models, which always behave the same.)
Best, Peter
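"Hosted in the same way as the datasets and loaded on demand" could look roughly like the cache-then-load pattern below. The URL, weight-file names, and model-building call are all placeholders, not an existing pd4ml interface:

```python
import os
import urllib.request

# Placeholder URL; pretrained weights are NOT actually hosted here.
WEIGHTS_URL = "https://example.org/pd4ml_pretrained/{name}.h5"

def get_pretrained_weights(name, cache_dir="./pd4ml/pretrained"):
    """Download the named weight file once and cache it locally,
    mirroring how the datasets are fetched on first use."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{name}.h5")
    if not os.path.exists(path):
        urllib.request.urlretrieve(WEIGHTS_URL.format(name=name), path)
    return path

# usage, assuming a Keras-style reference model:
# model = build_reference_model(...)          # hypothetical builder
# model.load_weights(get_pretrained_weights("belle_graph_net"))
```

Shipping fixed weights would also address the reproducibility point above: everyone quoting a model's performance evaluates exactly the same parameters.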
Base Model Template
Dear developers,
as far as I understood, there is a predefined Network template here: https://github.com/erikbuh/erum_data_data/blob/main/template_model_implamentation/template.py
In order to enforce a common Network structure, it might make sense to transition this template (currently meant for copy & paste) into an ABC base class from which one should inherit.
This will ensure/allow (at a minimum):
- a common Network structure/API
- common functions, which can be forwarded through inheritance
The code-wise changes are rather small and will not affect the already integrated Network definitions.
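A minimal sketch of what such an ABC could look like, assuming the template's preprocessing/model interface (method names here are illustrative, not the repository's final API):

```python
from abc import ABC, abstractmethod

class NetworkBase(ABC):
    """Sketch of an ABC for the common Network template (a proposal
    illustration, not actual repository code)."""

    @property
    @abstractmethod
    def model_name(self):
        """Short identifier used e.g. for logging and saved files."""

    @abstractmethod
    def preprocessing(self, in_data):
        """Model-specific preprocessing, applied before training."""

    @abstractmethod
    def model(self, ds, shape):
        """Build and return the compiled model for dataset `ds`."""

    # shared helpers live here once and are inherited by every network
    def summary(self):
        return f"network '{self.model_name}'"

class MyNet(NetworkBase):
    model_name = "my_net"

    def preprocessing(self, in_data):
        return in_data  # no-op for the example

    def model(self, ds, shape):
        return None     # would build the real model here
```

Forgetting to implement one of the abstract methods then fails loudly at instantiation time (TypeError) instead of at some later point during training, which is the main practical benefit over copy & paste.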
What do you think about this proposal?
If you want I can prepare a PR regarding this and you can have a more hands-on look how this change might feel/look like.
Cheers, Peter
Additional dataset to graph transformations
Dear collaborators,
We'd like to add functions to the dataset class that transform the belle2 and airshower datasets into a graph (features & adjacency matrix). So far this is done for the image-based datasets (Spinodal, EOS) and the top dataset.
For the graph transformation of the images we've added a definition such as this here:
https://github.com/erikbuh/erum_data_data/blob/main/erum_data_data/graphs.py#L31
That definition is loaded in the dataset class here:
https://github.com/erikbuh/erum_data_data/blob/main/erum_data_data/erum_data_data.py#L82
So here one could add the airshower and belle2 dataset to graph transformations as well.
A few remarks:
- the Top dataset adjacency matrix is quite computationally expensive and takes up 16 GB of memory for a matrix of shape [1.6M, 100, 100] (not really an issue for our 250 GB RAM machines, but something to keep in mind)
- this might also be the case for the airshower and belle2 adjacency matrices. I still think it's worth trying to implement the dataset-to-graph transformation before the training, as this way we have documented code on how to turn different datasets into graphs.
- but of course, if this becomes unreasonably complicated, we can change course and provide the graph implementation of the datasets directly as a download
- currently the simple_graph_net in graph_models, based on the belle2 reference network, is not performing very well for the images; we probably have to optimise it further or change the graph calculation for images
- if you'd like to run the simple_graph_net for the TopTagging dataset, please let us know. There are a few package dependencies for the graph calculation we have not yet sorted out as a package requirement.
- as always: if you have a good idea how to simplify or expand our API / setup, feel free to let us know or to make a pull request :)
Let us know here if you have any issues with the current setup.
Thanks & cheers,
Erik
Python 3.8 is not supported
Support for multi input datasets in common training script
The erum_data_data.load function returns multi-input datasets as a list. Therefore I believe we need a slightly different procedure in the common training script, where currently only the first element of that list is passed further on:
Maybe something like
x_train = nn.preprocessing(X_train)
x_test = nn.preprocessing(X_test)
and
model = nn.model(ds, shape = [x.shape[1:] for x in x_train])
Also, this line
https://github.com/erikbuh/erum_data_data/blob/c65f60e0c45a6f907b6990d841ac89240b2ec035/template_model_implamentation/main.py#L44
should probably use x_test (with lowercase x).
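Putting the pieces of the suggestion together, the common training script could pass every element's shape along, roughly as below (nn and ds stand for the template's network object and dataset name; the prepare helper is hypothetical):

```python
def prepare(nn, ds, X_train, X_test):
    """Sketch of the suggested multi-input handling: preprocessing gets
    the full (possibly list-valued) input, and the model is built with
    one input shape per element instead of only the first."""
    x_train = nn.preprocessing(X_train)
    x_test = nn.preprocessing(X_test)
    model = nn.model(ds, shape=[x.shape[1:] for x in x_train])
    return model, x_train, x_test
```

A convention that preprocessing always returns a list (wrapping single-input datasets in a one-element list) would let single- and multi-input datasets share this code path with no special cases.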