pd4ml's Issues
Preprocessing for non-graph inputs
Thinking a bit more about this, I see now what the technical issue is. We have custom preprocessing for the common graph model, done by the load_graph method of each dataset. I abused this in the Belle dataset to also do (in addition to creating the adjacency matrix) some additional preprocessing of the PDG ids (one-hot encoding).
Now, this is not done for the inputs to the common FCN network, which would make the comparison a bit unfair. I can't apply this preprocessing beforehand, because to train the reference model I actually need the (non-one-hot-encoded) PDG ids as inputs, since it uses an embedding layer.
How to resolve this? Some options I could think of:
- Don't do it and accept that the comparison between the FCN and the graph network is a bit unfair.
- Also drop the one-hot encoding for the graph network. This way the comparison between the common models is fair, but now the comparison to the reference model would be a bit unfair ...
- Add e.g. a load_flat method to the datasets, which is used for the FCN.
- Allow a model-specific callback in the datasets that does some additional preprocessing that fits the particular model.
I think I would tend towards the load_flat option. How is the situation for the other datasets? What do others think?
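The load_flat option could be sketched roughly as below. All names here (BelleDataset, the attribute layout, the number of PDG classes) are hypothetical illustrations, not the actual pd4ml API; the adjacency-matrix construction is omitted for brevity:

```python
import numpy as np

class BelleDataset:
    """Illustrative dataset wrapper; names and shapes are hypothetical."""

    def __init__(self, features, pdg_ids, n_pdg_classes=50):
        self.features = features        # shape (n_events, n_particles, n_features)
        self.pdg_ids = pdg_ids          # integer PDG ids, shape (n_events, n_particles)
        self.n_pdg_classes = n_pdg_classes

    def load_graph(self):
        # graph-model input: one-hot encoded PDG ids appended to the features
        # (adjacency matrix construction omitted in this sketch)
        one_hot = np.eye(self.n_pdg_classes)[self.pdg_ids]
        return np.concatenate([self.features, one_hot], axis=-1)

    def load_flat(self):
        # FCN input: the same one-hot encoding, then flattened per event,
        # so both common models see identically preprocessed features
        one_hot = np.eye(self.n_pdg_classes)[self.pdg_ids]
        x = np.concatenate([self.features, one_hot], axis=-1)
        return x.reshape(len(x), -1)
```

The point of the sketch is that load_graph and load_flat share the one-hot step, so the FCN/graph comparison stays fair, while the raw integer ids remain available for the embedding-based reference model.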
Unable to download datasets?
When attempting to load a dataset (e.g., TopTagging.load('train', path = './pd4ml/datasets')) I am getting an error message regarding an "invalid load key, '<'". The cause appears to be that the downloaded 1_top_tagging_2M.npz file is an html error message. Indeed, if I try to directly access the file using a web browser (https://desycloud.desy.de/index.php/s/aZqyNSg4B7nn8qQ/download) I get the same error message that "The document could not be found on the server. Maybe the share was deleted or has expired?".
Are these files still available?
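The "invalid load key, '<'" error arises because numpy tries to unpickle the HTML error page. A small guard like the following (a sketch, not part of the repository) would turn that into a clearer failure, using the fact that valid .npz files are zip archives starting with the 'PK' magic bytes:

```python
import numpy as np

def load_npz_checked(path):
    """Load an .npz file, but fail with a clear message if the download
    actually returned an HTML error page (a common cause of
    "invalid load key" errors)."""
    with open(path, "rb") as f:
        head = f.read(4)
    # valid .npz files are zip archives and start with the b"PK" magic;
    # an HTML error page typically starts with b"<"
    if not head.startswith(b"PK"):
        raise IOError(
            f"{path} does not look like a valid .npz file "
            f"(starts with {head!r}); the download link may have expired."
        )
    return np.load(path)
```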
Pretrained models
Hi @erikbuh @WilliamKorcari ,
So far the repository supports the reference models in terms of code.
I suppose some models have huge training requirements and long runtimes. What do you think of providing pretrained models for the reference models? Maybe they could be hosted in the same way as the datasets and loaded on demand, if one does not have the resources or time to retrain from scratch.
This is just an idea which came to my mind earlier, let me know what you think :)
(Another thing: due to non-deterministic initialisation, the training result will be slightly different every time. Therefore, when quoting the performance of a model publicly, it would be safer and more correct to have pretrained models, which always behave the same.)
Best, Peter
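"Hosted in the same way as the datasets and loaded on demand" could look roughly like the cache-then-load pattern below. The URL, weight-file names, and model-building call are all placeholders, not an existing pd4ml interface:

```python
import os
import urllib.request

# Placeholder URL; pretrained weights are NOT actually hosted here.
WEIGHTS_URL = "https://example.org/pd4ml_pretrained/{name}.h5"

def get_pretrained_weights(name, cache_dir="./pd4ml/pretrained"):
    """Download the named weight file once and cache it locally,
    mirroring how the datasets are fetched on first use."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{name}.h5")
    if not os.path.exists(path):
        urllib.request.urlretrieve(WEIGHTS_URL.format(name=name), path)
    return path

# usage, assuming a Keras-style reference model:
# model = build_reference_model(...)          # hypothetical builder
# model.load_weights(get_pretrained_weights("belle_graph_net"))
```

Shipping fixed weights would also address the reproducibility point above: everyone quoting a model's performance evaluates exactly the same parameters.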
Base Model Template
Dear developers,
as far as I understood, there is a predefined Network template here: https://github.com/erikbuh/erum_data_data/blob/main/template_model_implamentation/template.py
In order to enforce a common Network structure, it might make sense to transition this template (currently meant for copy & paste) into an ABC base class from which one should inherit.
This will ensure/allow (at a minimum):
- a common Network structure/API
- common functions, which can be forwarded through inheritance
The code-wise changes are rather small and will not affect the already integrated Network definitions.
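A minimal sketch of what such an ABC could look like, assuming the template's preprocessing/model interface (method names here are illustrative, not the repository's final API):

```python
from abc import ABC, abstractmethod

class NetworkBase(ABC):
    """Sketch of an ABC for the common Network template (a proposal
    illustration, not actual repository code)."""

    @property
    @abstractmethod
    def model_name(self):
        """Short identifier used e.g. for logging and saved files."""

    @abstractmethod
    def preprocessing(self, in_data):
        """Model-specific preprocessing, applied before training."""

    @abstractmethod
    def model(self, ds, shape):
        """Build and return the compiled model for dataset `ds`."""

    # shared helpers live here once and are inherited by every network
    def summary(self):
        return f"network '{self.model_name}'"

class MyNet(NetworkBase):
    model_name = "my_net"

    def preprocessing(self, in_data):
        return in_data  # no-op for the example

    def model(self, ds, shape):
        return None     # would build the real model here
```

Forgetting to implement one of the abstract methods then fails loudly at instantiation time (TypeError) instead of at some later point during training, which is the main practical benefit over copy & paste.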
What do you think about this proposal?
If you want I can prepare a PR regarding this and you can have a more hands-on look how this change might feel/look like.
Cheers, Peter
Additional dataset to graph transformations
Dear collaborators,
We'd like to add functions to the dataset class that transform the belle2 and airshower datasets into a graph (features & adjacency matrix). So far this is done for the image-based datasets (Spinodal, EOS) and the top dataset.
For the graph transformation of the images we've added a definition such as this here:
https://github.com/erikbuh/erum_data_data/blob/main/erum_data_data/graphs.py#L31
That definition is loaded in the dataset class here:
https://github.com/erikbuh/erum_data_data/blob/main/erum_data_data/erum_data_data.py#L82
So here one could add the airshower and belle2 dataset to graph transformations as well.
A few remarks:
- the Top dataset adjacency matrix is quite computationally expensive and takes up 16 GB of memory for a matrix of shape [1.6M, 100, 100] (not really an issue for our 250 GB RAM machines, but something to keep in mind)
- this might also be the case for the airshower and belle2 adjacency matrices. I still think it's worth trying to implement the dataset-to-graph transformation before the training, as this way we have documented code on how to turn different datasets into graphs.
- but of course, if this becomes unreasonably complicated, we can change course and provide the graph implementation of the datasets directly as a download
- currently the simple_graph_net in graph_models, based on the belle2 reference network, is not performing very well for the images; we probably have to optimise it further or change the graph calculation for images
- if you'd like to run the simple_graph_net for the TopTagging dataset, please let us know. There are a few package dependencies for the graph calculation we have not yet sorted out as a package requirement.
- as always: if you have a good idea how to simplify or expand our API / setup, feel free to let us know or to make a pull request :)
Let us know here if you have any issues with the current setup.
Thanks & cheers,
Erik
Python 3.8 is not supported
Support for multi input datasets in common training script
The erum_data_data.load function returns multi-input datasets as a list. Therefore I believe we need a slightly different procedure in the common training script, where currently only the first element of that list is passed further on:
Maybe something like
x_train = nn.preprocessing(X_train)
x_test = nn.preprocessing(X_test)
and
model = nn.model(ds, shape = [x.shape[1:] for x in x_train])
Also, this line
https://github.com/erikbuh/erum_data_data/blob/c65f60e0c45a6f907b6990d841ac89240b2ec035/template_model_implamentation/main.py#L44
should probably use x_test (with lowercase x).
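Putting the pieces of the suggestion together, the common training script could pass every element's shape along, roughly as below (nn and ds stand for the template's network object and dataset name; the prepare helper is hypothetical):

```python
def prepare(nn, ds, X_train, X_test):
    """Sketch of the suggested multi-input handling: preprocessing gets
    the full (possibly list-valued) input, and the model is built with
    one input shape per element instead of only the first."""
    x_train = nn.preprocessing(X_train)
    x_test = nn.preprocessing(X_test)
    model = nn.model(ds, shape=[x.shape[1:] for x in x_train])
    return model, x_train, x_test
```

A convention that preprocessing always returns a list (wrapping single-input datasets in a one-element list) would let single- and multi-input datasets share this code path with no special cases.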