pd4ml's People

Contributors

erikbuh, ewencedr, gkasieczka, lbenato, nikoladze, pfackeldey, williamkorcari

pd4ml's Issues

Preprocessing for non-graph inputs

Thinking a bit more about this, I now see what the technical issue is. We have custom preprocessing for the common graph model, done by the load_graph method of each dataset. I abused this in the Belle dataset to do some additional preprocessing of the PDG ids (one-hot encoding) on top of creating the adjacency matrix.

This preprocessing is not applied to the inputs of the common FCN network, which makes the comparison a bit unfair. I can't apply it beforehand, because the reference model needs the raw (non-one-hot-encoded) PDG ids as inputs, since it uses an embedding layer.

How should we resolve this? Some options I could think of:

  • Don't do it and accept that the comparison between the FCN and graph network is a bit unfair.
  • Also don't do the one-hot encoding for the graph network. This way the comparison between the common models is fair, but now the comparison to the reference model would be a bit unfair ...
  • Have the possibility for e.g. a load_flat method of datasets which is used for the FCN.
  • Have the possibility for a model-specific callback in the datasets that does some additional preprocessing that fits the particular model.

I would tend towards the load_flat option. What is the situation for the other datasets? What do others think?
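A minimal sketch of what the load_flat option could look like, next to load_graph. The class name ToyDataset, the helper one_hot_pdg, the PDG vocabulary, and the fully connected adjacency are all assumptions made for illustration; none of this is the actual pd4ml code.

```python
class ToyDataset:
    """Hypothetical stand-in for a pd4ml dataset class (not the real one)."""

    # Assumed PDG vocabulary, for illustration only: e, mu, gamma, pi.
    pdg_vocabulary = [11, 13, 22, 211]

    @classmethod
    def one_hot_pdg(cls, pdg_ids):
        """One-hot encode PDG ids; the sign (antiparticle) is ignored here."""
        index = {pdg: i for i, pdg in enumerate(cls.pdg_vocabulary)}
        encoded = []
        for pdg in pdg_ids:
            row = [0] * len(cls.pdg_vocabulary)
            row[index[abs(pdg)]] = 1
            encoded.append(row)
        return encoded

    @classmethod
    def load_flat(cls, events):
        """Preprocessing for the FCN: one-hot encode, then flatten each event."""
        return [[v for row in cls.one_hot_pdg(ev) for v in row] for ev in events]

    @classmethod
    def load_graph(cls, events):
        """Preprocessing for the graph net: same features plus an adjacency."""
        features = [cls.one_hot_pdg(ev) for ev in events]
        # Placeholder: fully connected adjacency, not the real construction.
        adjacency = [[[1] * len(ev) for _ in ev] for ev in events]
        return features, adjacency


events = [[11, -13]]  # one event with an electron and an antimuon
print(ToyDataset.load_flat(events))  # [[1, 0, 0, 0, 0, 1, 0, 0]]
```

Because both methods share the same encoding helper, the FCN and the graph network would see identically encoded PDG ids, while the reference model could still load the raw ids.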

Unable to download datasets?

When attempting to load a dataset (e.g., TopTagging.load('train', path = './pd4ml/datasets')), I get an error message about an "invalid load key, '<'". The cause appears to be that the downloaded 1_top_tagging_2M.npz file is actually an HTML error page. Indeed, if I try to access the file directly in a web browser (https://desycloud.desy.de/index.php/s/aZqyNSg4B7nn8qQ/download), I get the message "The document could not be found on the server. Maybe the share was deleted or has expired?".

Are these files still available?
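For anyone hitting the same error, here is a small sketch for checking whether a downloaded file is a real archive or an HTML page. It relies only on the fact that .npz files are zip archives; the function name describe_download is made up for this example and is not part of pd4ml.

```python
import zipfile


def describe_download(path):
    """Classify a downloaded file by inspecting its content, not its name."""
    # Every .npz file is a zip archive, so this check should pass for real data.
    if zipfile.is_zipfile(path):
        return "zip archive (plausible .npz)"
    with open(path, "rb") as f:
        head = f.read(64).lstrip()
    # An HTML or XML error page starts with '<', matching the reported load key.
    if head.startswith(b"<"):
        return "HTML or XML page (probably a server error message)"
    return "unknown format"
```

Calling describe_download on the cached 1_top_tagging_2M.npz should tell you whether the cached copy needs to be deleted and downloaded again once a working link is available.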

Pretrained models

Hi @erikbuh @WilliamKorcari ,

So far the repository provides the reference models only as code.
I suppose some models have huge training requirements and long runtimes. What do you think of providing pretrained weights for the reference models? Maybe they could be hosted in the same way as the datasets and loaded on demand, for anyone who does not have the resources or time to retrain from scratch.
This is just an idea that came to my mind earlier, let me know what you think :)

(Another thing: due to non-deterministic initialisation, the training result differs slightly from run to run. Therefore, when quoting the performance of a model publicly, it would be safer and more correct to have pretrained models, which always behave the same.)
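To illustrate the initialisation point with a toy example: a fixed seed makes the initial weights reproducible, while different seeds give different starting points. The function init_weights is hypothetical and uses only the standard library. Note that in a real framework a fixed seed alone may not make a full training run bit-identical, which is part of why pretrained weights are attractive.

```python
import random


def init_weights(n, seed=None):
    """Draw n toy weights; with a fixed seed the draw is reproducible."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


same = init_weights(5, seed=0) == init_weights(5, seed=0)
different = init_weights(5, seed=0) != init_weights(5, seed=1)
print(same, different)  # True True
```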

Best, Peter

Base Model Template

Dear developers,

as far as I understood there is a predefined Network template here: https://github.com/erikbuh/erum_data_data/blob/main/template_model_implamentation/template.py

To enforce a common Network structure, it might make sense to turn this template (currently meant for copy & paste) into an abstract base class (ABC) from which every Network inherits.

This will ensure/allow (at minimum):

  1. common Network structure/API
  2. common functions, can be forwarded through inheritance

The code changes are rather small and will not affect the Network definitions that are already integrated.
What do you think about this proposal?
If you want, I can prepare a PR so you can get a hands-on impression of how this change might look and feel.
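A minimal sketch of the idea, assuming the template exposes the two methods that the common training script calls (nn.preprocessing and nn.model, as quoted in the multi-input issue on this page). The names BaseNetwork, describe, and IdentityNetwork are invented for this example and are not taken from template.py.

```python
from abc import ABC, abstractmethod


class BaseNetwork(ABC):
    """Hypothetical base class that every Network would inherit from."""

    @abstractmethod
    def preprocessing(self, X):
        """Return the model-specific preprocessed inputs."""

    @abstractmethod
    def model(self, ds, shape):
        """Build and return the model for dataset ds and input shape."""

    def describe(self):
        """Shared helper, available to all subclasses through inheritance."""
        return f"{type(self).__name__} (pd4ml network)"


class IdentityNetwork(BaseNetwork):
    """Toy subclass: implements both required methods trivially."""

    def preprocessing(self, X):
        return X

    def model(self, ds, shape):
        return {"dataset": ds, "input_shape": shape}


print(IdentityNetwork().describe())  # IdentityNetwork (pd4ml network)
```

A subclass that forgets to implement preprocessing or model cannot be instantiated (Python raises a TypeError), which is how the common API would be enforced.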

Cheers, Peter

Additional dataset to graph transformations

Dear collaborators,

We'd like to add functions to the dataset class that transform the belle2 and airshower datasets into graphs (features & adjacency matrix). So far this is done for the image-based datasets (Spinodal, EOS) and the top datasets.

For the graph transformation of the images we've added a definition such as this here:
https://github.com/erikbuh/erum_data_data/blob/main/erum_data_data/graphs.py#L31
That definition is loaded in the dataset class here:
https://github.com/erikbuh/erum_data_data/blob/main/erum_data_data/erum_data_data.py#L82

So here one could add the airshower and belle2 dataset to graph transformations as well.

A few remarks:

  • The Top dataset adjacency matrix is computationally quite expensive and takes up 16 GB of memory for a matrix of shape [1.6M, 100, 100] (not really an issue for our 250 GB RAM machines, but something to keep in mind).
  • This might also be the case for the airshower and belle2 adjacency matrices. I still think it's worth trying to implement the dataset-to-graph transformation before the training, as this way we have documented code showing how to turn the different datasets into graphs.
  • But of course, if this becomes unreasonably complicated, we can change course and provide the graph version of the datasets directly as a download.
  • Currently the simple_graph_net in graph_models, which is based on the belle2 reference network, is not performing very well for the images. We probably have to optimise it further or change the graph calculation for images.
  • If you'd like to run the simple_graph_net for the TopTagging dataset, please let us know. There are a few package dependencies for the graph calculation that we have not yet sorted out as package requirements.
  • As always: If you have a good idea how to simplify or expand our API / setup, feel free to let us know or to make a pull request :)
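As a quick sanity check of the memory figure above, here is a back-of-the-envelope sketch. It assumes a dense array; the quoted 16 GB matches one byte per entry (e.g. bool or int8), which is an inference from the numbers, not something checked against the code.

```python
def adjacency_memory_gb(n_events, n_nodes, bytes_per_entry=1):
    """Size in GB (10**9 bytes) of a dense [n_events, n_nodes, n_nodes] array."""
    return n_events * n_nodes * n_nodes * bytes_per_entry / 1e9


for name, size in [("1 byte (bool/int8)", 1), ("float32", 4), ("float64", 8)]:
    print(f"{name:>18}: {adjacency_memory_gb(1_600_000, 100, size):.0f} GB")
# 1 byte (bool/int8): 16 GB
#            float32: 64 GB
#            float64: 128 GB
```

So storing the same matrix as float32 would already need 64 GB, which is worth keeping in mind for the airshower and belle2 graphs.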

Let us know here if you have any issues with the current setup.

Thanks & cheers,
Erik

Support for multi input datasets in common training script

The erum_data_data.load function returns a multi-input dataset as a list. I therefore believe we need a slightly different procedure in the common training script, where currently only the first element of that list is passed on:

https://github.com/erikbuh/erum_data_data/blob/c65f60e0c45a6f907b6990d841ac89240b2ec035/template_model_implamentation/main.py#L29-L30

https://github.com/erikbuh/erum_data_data/blob/c65f60e0c45a6f907b6990d841ac89240b2ec035/template_model_implamentation/main.py#L33

Maybe something like

x_train = nn.preprocessing(X_train) 
x_test = nn.preprocessing(X_test) 

and

model = nn.model(ds, shape = [x.shape[1:] for x in x_train])

Also this line
https://github.com/erikbuh/erum_data_data/blob/c65f60e0c45a6f907b6990d841ac89240b2ec035/template_model_implamentation/main.py#L44

should probably have x_test (with lowercase x)
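A small sketch of how the shape handling could cover both cases, single-input and multi-input. The helpers as_input_list and input_shapes are made up for this example, and SimpleNamespace objects stand in for arrays so the snippet runs without dependencies; with real numpy arrays the .shape access works the same way.

```python
from types import SimpleNamespace


def as_input_list(x):
    """Wrap a single array in a list; leave lists and tuples of arrays alone."""
    return list(x) if isinstance(x, (list, tuple)) else [x]


def input_shapes(x):
    """Per-sample shape of every input, dropping the leading batch axis."""
    return [tuple(a.shape[1:]) for a in as_input_list(x)]


# Stand-ins for arrays; the shapes are illustrative only.
single = SimpleNamespace(shape=(1000, 40, 40))
multi = [SimpleNamespace(shape=(1000, 100, 9)),
         SimpleNamespace(shape=(1000, 100, 100))]

print(input_shapes(single))  # [(40, 40)]
print(input_shapes(multi))   # [(100, 9), (100, 100)]
```

The result could then be passed as the shape argument of nn.model, so the training script would not need a special case for single-input datasets.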
