
lartpc_mlreco3d's Introduction

Build Status

Documentation Status

A Machine Learning Pipeline for LArTPC Data

This repository contains code used for training and running machine learning models on LArTPC data.

Full chain

Installation

We recommend using a Singularity or Docker container pulled from deeplearnphysics/larcv2: https://hub.docker.com/r/deeplearnphysics/larcv2. It needs to have at least

  • MinkowskiEngine,
  • larcv2,
  • pytorch_geometric,
  • PyTorch,
  • standard Python scientific libraries.

Then git clone this repository and have fun!

Usage

Basic example:

# assume that lartpc_mlreco3d folder is on python path
from mlreco.main_funcs import process_config, train
import yaml
# Load configuration file
with open('lartpc_mlreco3d/config/train_uresnet.cfg', 'r') as f:
    cfg = yaml.load(f, Loader=yaml.Loader)
process_config(cfg)
# train a model based on configuration
train(cfg)

Example Configuration Files

For your inspiration, the following standalone configurations are available in the config folder:

Configuration name            Model
train_uresnet.cfg             UResNet alone
train_uresnet_ppn.cfg         UResNet + PPN
train_graph_spice.cfg         GraphSpice
train_grappa_shower.cfg       GrapPA for shower fragment clustering (particle fragments -> particle clusters)
train_grappa_interaction.cfg  GrapPA for interaction clustering (particle clusters -> interactions)

Switching from train to test mode is as simple as setting trainval.train: False for all models. The only exception at the moment is GraphSpice, for which an example test configuration is provided (test_graph_spice.cfg).
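
For example, after loading a configuration as a Python dictionary (see the Usage example above), the flag can be flipped before calling process_config; a minimal sketch, assuming the standard trainval block is present:

cfg['trainval']['train'] = False
# Optionally point to the checkpoint to evaluate (path is a placeholder):
# cfg['trainval']['model_path'] = 'weights/snapshot-9999.ckpt'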

Typically, the first things you may want to edit in a configuration file are (see the example excerpt after this list):

  • batch_size (in 2 places)
  • weight_prefix (where to save the model checkpoints)
  • log_dir (where to save the logs)
  • iterations
  • model_path (checkpoint to load, optional)
  • train (boolean)
  • gpus (leave empty '' if you want to run on CPU)
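
As an illustration only (key names follow the list above, values are placeholders, and the exact nesting varies between the example files in config/), a typical excerpt might look like:

iotool:
  batch_size: 8             # batch size appears in two places; keep them consistent
trainval:
  train: True               # switch to False for test mode
  gpus: ''                  # leave empty to run on CPU
  iterations: 10000
  weight_prefix: weights/snapshot
  log_dir: logs
  model_path: ''            # optional checkpoint to load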

If you want more information stored, such as network output tensors and post-processing outcomes, you can use analysis (scripts) and outputs (formatters) to store them in CSV format and run your custom analysis scripts (see folder analysis).

This section has described how to use the contents of this repository to train variations of what has already been implemented. To add your own models and analysis, you will want to know how to contribute to the mlreco module.

Running A Configuration File

The most basic usage is via the run script. From the lartpc_mlreco3d folder:

nohup python3 bin/run.py train_gnn.cfg >> log_gnn.txt &

This will train the GNN specified in config/train_gnn.cfg, save checkpoints and logs to the directories specified in the configuration, and redirect stdout and stderr to log_gnn.txt.

You can generally load a configuration file into a python dictionary using

import yaml
# Load configuration file
with open('lartpc_mlreco3d/config/train_uresnet.cfg', 'r') as f:
    cfg = yaml.load(f, Loader=yaml.Loader)

Reading a Log

A quick example of how to read a training log and plot something:

import pandas as pd
import matplotlib.pyplot as plt
fname = 'path/to/log.csv'
df = pd.read_csv(fname)

# plot moving average of accuracy over 10 iterations
df.accuracy.rolling(10, min_periods=1).mean().plot()
plt.ylabel("accuracy")
plt.xlabel("iteration")
plt.title("moving average of accuracy")
plt.show()

# list all column names
print(df.columns.values)

Recording network output or running analysis

We use LArTPC MLReco3D Analysis Tools for all inference and high-level analysis related work.

Repository Structure

  • bin: simple scripts that run the training/inference functions.
  • config: various example configuration files.
  • docs: documentation (in progress).
  • mlreco: the main code lives here!
  • test: tests using Pytest.
  • analysis: LArTPC MLReco3D Analysis Tools, a pure Python interface for inference, high-level analysis, and visualization using the full chain.

Please consult the README in each folder for more information.

Contributing

Before you start contributing to the code, please see the contribution guidelines.

Adding a new model

You may be able to re-use a fair amount of code, but here is what would be necessary to do everything from scratch:

  1. Make sure you can load the data you need.

Parsers already exist for a variety of sparse tensor outputs as well as particle outputs.

The most likely place you would need to add something is to mlreco/iotools/parsers.py.

If the data you need is fundamentally different from data currently used, you may also need to add a collation function to mlreco/iotools/collates.py

  2. Include your model

You should put your model in a new file in the mlreco/models folder.

Add your model to the dictionary in mlreco/models/factories.py so it can be found by the configuration parsers.

At this point, you should be able to train your model using a configuration file.

lartpc_mlreco3d's People

Contributors

bnels, codingkazu, dkoh0207, drinkingkazu, francois-drielsma, justinjmueller, lkashur, mcfatelin, temigo, zhulcher


lartpc_mlreco3d's Issues

Unify the network output format and input data

data_blob and the network return (the forward return) currently have different formats; the unwrap functions I put in can a) unify the format and b) settle on one sensible format among the options.
However, it is currently an option that is run only for output_formatters.

I'm wondering: shall we make this an option to run within the trainval::forward call? The upside is that we can expect uniformly formatted data representations everywhere downstream; the downside is that it may spend some time shuffling data into that format (i.e. CPU time spent).
The reason I ask is that we almost always want a unified, easy-to-interpret format, I think, and it's worth taking a small hit in processing time. (Plus, having downstream processing code like output_formatters and analysis written for two different formatting conventions spreads a mess.)

Log epoch calculation appears to be wrong

Relevant part of config file:

  dataset:
    name: LArCVDataset
    data_dirs:
      - /gpfs/slac/staas/fs1/g/neutrino/kterao/data/dlprod_ppn_v10/combined
    data_key: train_512px
    limit_num_files: 10

If I run for 1000 iterations with batch size 8, epoch is logged as 0.8. However, I have only seen 8000 events out of about 80k, so epoch should be 0.1.
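
For reference, the expected value follows directly from the numbers quoted above:

# 1000 iterations x batch size 8 = 8000 events seen;
# 8000 / ~80000 events in the dataset = 0.1 epochs, not the 0.8 reported.
print(1000 * 8 / 80000)  # 0.1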

import statements

In almost all scripts, we use the full mlreco path to import. While this is fine, I think it is less fragile to use relative imports (.) where possible.

Understand why DataLoader gets killed

Especially on V100, training UResNet (uresnet_lonely from Temigo/lartpc_mlreco3d, branch temigo) with batch size 64 and spatial size 768px.

Traceback (most recent call last):
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 33, in <module>
    main()
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 28, in main
    train(cfg)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 36, in train
    train_loop(cfg, handlers)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 236, in train_loop
    res = handlers.trainer.train_step(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 66, in train_step
    res_combined = self.forward(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 83, in forward
    res = self._forward(blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 127, in _forward
    result = self._net(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/models/uresnet_lonely.py", line 140, in forward
    x = self.input((coords, features))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 63, in forward
    self.mode
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 184, in forward
    mode
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 93813) is killed by signal: Killed.

interaction_id bug for "older" larcv files

For a v4 file (/sdf/data/neutrino/generic/mpvmpr_2020_01_v04/train.root), interaction_id is not filled properly by the lines below
(np.unique returns [-1, 65535]).
When the lines are commented out, the interaction_id seems sensible.

inter_ids = np.array([p.interaction_id() for p in particles], dtype=np.int32)
if np.any(inter_ids != INVAL_ID):
    inter_ids[~valid_mask] = -1
    return inter_ids

Cluster Assignment

A possible solution to many of the cluster assignment ambiguities:

We seem to have cluster assignment conflicts in delta/michel clusters, because lowest group id is (arbitrarily) prioritized, and the lower group id often belongs to a track.

Since we've already decided on a 5-types label, it seems like the better solution is to favor the group id that has the same type as the 5-types label.

There would still be ambiguities if two particles of the same type (which agrees with the 5-types label) go through the same voxel, but this will likely help greatly with michel/delta clusters.
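
A minimal sketch of the proposed tie-breaking rule, purely illustrative (group_ids, group_types and semantic_label are hypothetical names for the candidate group IDs at a voxel, their particle types, and the voxel's 5-types label):

import numpy as np

def resolve_group(group_ids, group_types, semantic_label):
    """Prefer the candidate group whose type matches the 5-types label;
    fall back to the lowest group id if none (or several) match."""
    group_ids = np.asarray(group_ids)
    matching = np.asarray(group_types) == semantic_label
    candidates = group_ids[matching] if np.any(matching) else group_ids
    return int(candidates.min())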

Unit Tests + Coverage

We should implement some basic unit tests that ensure that basic examples run, and that important functions behave in an appropriate way.

Options are:

  • unittest - pretty easy to get started with
  • pytest - Maybe preferable, but some more up-front investment
  • ???

Once this is done, we should set up continuous integration so that things are demonstrably not broken with each commit. We can integrate with GitHub so that tests have to pass before a pull request can be merged.

After CI is set up, we can add code coverage reporting. We can also check coverage manually.

Thorough testing

The current testing scheme is a big Work In Progress. Notably, we are still lacking many tests in

  • test_parsers
  • test_models_forward (currently skips models that do not specify input types)
  • test_models_full (same)

Will update this issue with a more concrete TODO list.

Graph-SPICE objective

As mentioned in a meeting with Kazu a while back but not yet addressed, it would be useful to change the objective of Graph-SPICE to account for gaps in GT tracks. Currently, Graph-SPICE tries to cluster GT tracks regardless of their completeness, even though that is not what Graph-SPICE is intended to do. Please do the following:

  • Run DBSCAN on individual GT track instances. If there is a break, label each resulting track fragment as a separate GT track. This way, only a fully connected set of points can be classified as a single track instance (a sketch follows this list);
  • GrapPA-Track will pick up the slack by putting together track chunks which are collinear (as designed).
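
A hedged sketch of the first bullet, using scikit-learn's DBSCAN to split one GT track instance into spatially connected fragments (the eps and min_samples values are illustrative; the real implementation would live in the label-making step):

import numpy as np
from sklearn.cluster import DBSCAN

def split_track_instance(voxels, eps=1.99, min_samples=1):
    """Relabel a single GT track: each spatially connected chunk of voxels
    becomes its own fragment id (with min_samples=1 no voxel is left out)."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit(voxels).labels_

# Usage: voxels is an (N, 3) array of the track's voxel coordinates;
# the returned array holds one fragment id per voxel.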

iotool data return should be dict instead of list

Currently the dataloader returns a list of the requested data, and a user needs to know which index each data product corresponds to (this info is given as a return of the loader factory construction method). It would be better to just return a dict whose keys are the data names from the config file.
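
A minimal sketch of the suggested interface change, assuming the loader currently returns a list and that the corresponding names (data_keys below) are what the loader factory already hands back:

# data_list: what the DataLoader currently returns for one iteration
# data_keys: the list of requested data names from the loader factory
data_dict = dict(zip(data_keys, data_list))
# Downstream code can then do data_dict['input_data'] (key name illustrative)
# instead of remembering which index input_data sits at.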

Batch size + logging

We currently divide any metrics such as accuracy, (or other logged variables) by the batch size.
This is due to the way that UResNets compute accuracy (by summing over the batch).

However, this makes computing metrics for GNNs troublesome, because accuracy is often computed over the batch graph (not a bunch of individual graphs). Currently, we have to multiply all metrics by batch size, only to have it divided out later.

The problem is that somehow the GNN needs to know about batch size. If there are events that have no em clusters (sometimes happens in Pi0 data), then this can't be determined only from the input and needs to be passed in somehow. Unfortunately, the model/loss construction doesn't have access to the I/O config, so the batch size has to be listed again and passed explicitly to the model.

I think there are two solutions:

  1. Have UResNet divide by batch size before returning, so we don't have to divide by batch size elsewhere
  2. Make it easy to access batch size in I/O tools from a model

I think option 1 above is preferable.
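
A minimal sketch of option 1, with pred as per-voxel logits and label as per-voxel truth (names illustrative): the model returns a batch-averaged value, so downstream logging never needs to divide (or multiply) by the batch size.

import torch

def batch_accuracy(pred, label):
    # Average over the whole batch inside the model/loss,
    # instead of returning a sum that the logger must divide later.
    return (pred.argmax(dim=1) == label).float().mean().item()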

review output_formatter design

output_formatter may need a review of its design.

  • Minor: it currently requires import X for X.py under the module. This can (and should) be avoided.
  • It assumes the sparseconvnet format in general. This should not be the case. A better implementation would have a separate parsing function to break up the scn-formatted data, called by the model-specific X.py (since whether data is scn-formatted or not depends on the model architecture).
  • Further, since parsing data is a part of the model as well as of the analysis, shouldn't these live in the model architecture file? Why are they in separate X.py files under output_formatters, which also adds unnecessary bookkeeping elsewhere in main_funcs.py and trainval.py? Seems like bad design?

[ME library] CPU mode bug

The full chain ME code is currently unreliable in CPU mode. We rely on the coordinate ordering to perform many operations, e.g. ghost masking or semantic segmentation masking. The crucial assumption that the coordinate ordering is conserved throughout the full chain operations is verified on GPU, but breaks down on CPU.

Specifically, it starts breaking down in PPN (specifically models/layers/common/ppnplus.py in the class AttentionMask) and strongly suspected in GraphSpice as well.

How to reproduce the "bug"

This is the shortest minimal example that I could come up with.

import numpy as np
import MinkowskiEngine as ME
import torch

# Parameters
N = 10
device = 'cuda:0' # change this to 'cpu' to see the difference

# Create x
feats = torch.rand(N, 1).to(device)
coords = torch.cat([torch.zeros((N, 1)), torch.rand(N, 3) * 100], dim=1).to(device)
x = ME.SparseTensor(features=feats, coordinates=coords )

# Create mask
mask = (torch.rand(N, 6) > 0.5).float().to(device)
mask = ME.SparseTensor(
    coordinates=x.C,
    features=mask,
    coordinate_manager=x.coordinate_manager,
    tensor_stride=x.tensor_stride,
)

# Create x0
x0 = ME.SparseTensor(
    coordinates=x.C,
    features=torch.zeros(x.F.shape[0], mask.F.shape[1]).to(device),
    coordinate_manager=x.coordinate_manager,
    tensor_stride=x.tensor_stride
)

Now you can compare the coordinate tensors obtained through the .C attribute; the order will change after the addition x0 + mask:

print(x.C, mask.C, x0.C ) # These are all identical
# No a priori reason but this set of coordinates is ordered differently on CPU, and identical to the previous one on GPU
print((mask + x0).C)

What does MinkowskiEngine say?

Well, they do not guarantee the coordinate ordering. See
https://github.com/NVIDIA/MinkowskiEngine/blob/master/MinkowskiEngine/MinkowskiTensor.py#L291

The order of coordinates is non-deterministic within each batch.
Use :attr:decomposed_coordinates_and_features to retrieve
both coordinates features with the same order. To retrieve the
order the decomposed coordinates is generated, use :attr:decomposition_permutations.

(I have to say, it is not 100% clear to me what decomposition_permutations is for. But it definitely does not allow retrieving the original coordinate ordering. It would still be cumbersome to have to correct this every now and then in the code.)

Learning Rate Schedules

It will likely be useful to have a way to schedule learning rates
e.g. in *.cfg

training:
   learning_rate: [0.01, 0.001, 0.0001]
   iterations: [1000, 1000, 1000]

would do 1k iterations with a learning rate of 0.01, then 1k iterations at 0.001, and so on.
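
A hedged sketch of how such a schedule could be consumed by the training loop, assuming a standard PyTorch optimizer (the two lists mirror the proposed configuration keys):

import torch

learning_rates   = [0.01, 0.001, 0.0001]
iteration_blocks = [1000, 1000, 1000]

model = torch.nn.Linear(4, 2)                       # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rates[0])

for lr, n_iter in zip(learning_rates, iteration_blocks):
    for g in optimizer.param_groups:                # switch to this block's rate
        g['lr'] = lr
    for _ in range(n_iter):
        pass                                        # one training step would go here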

Changing the nomenclature

  • Function names
    parse_particle => parse_particle_points
  • Configuration parameters
    training => trainval (?)

Rename `Chain` to `UResNetPPN` (in uresnet_ppn_chain.py)

Chain is an ambiguous name that should not be used anywhere (the same applies for cluster_chain_gnn.py). Changing this probably causes a problem for loading network weights(?) without hacking the variable names upon loading. So I open an issue here to make the change at the right time...

Clustering in hyperspace

As an extension of our past clustering attempts (dynamic-gcnn here and there, as well as Dae Hyun's recent attempt based on this), here's another architecture proposal: apply a clustering loss at every spatial resolution level in U-ResNet(+PPN).

  • Starting from the most spatially contracted tensor (the bottom of the U), apply N convolutions, then interpret each pixel's feature array as hyperspace coordinates and apply the "clustering loss". The cluster pixel mask needs to be down-sampled appropriately with an scn operation. For pixels where clusters overlap, I think taking the "minimum distance to any of the overlapping clusters" is OK in the loss.
  • The N convolutions can apply more than 3 filters and we can still visualize (i.e. no need to stay in 2D or 3D) using t-SNE or something similar.
  • The output of the N convolutions should be up-sampled and concatenated to the next block in the decoding path, to effectively propagate clustering information to higher spatial resolutions. It should not hurt the segmentation task, and will hopefully help if anything.

Any volunteers? :)

Fix `num_strides` in `uresnet_lonely` (and other models potentially)

We agreed that num_strides should describe the number of downsampling operations (i.e. the number of strides). This is not currently the case, at least not in uresnet_lonely: for example, a spatial size of 768 leads to 6 feature maps of size [768, 384, 192, 96, 48, 24], which according to the above should be described as num_strides: 5 (it currently results from num_strides: 6). We should update the implementation to match the description above.
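
For concreteness, under the proposed convention the number of feature maps would be num_strides + 1:

spatial_size = 768
num_strides = 5   # proposed meaning: number of downsampling operations
print([spatial_size // 2**i for i in range(num_strides + 1)])
# [768, 384, 192, 96, 48, 24]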

Policy on a breaking change?

We should have some way of communicating breaking changes (till we have a good set of unit tests :) ). Maybe in the pull request comments, at a minimum. This should motivate us to organize development paths (i.e. a breaking change in a separate branch and a separate pull request... otherwise a pull request can take long to merge).

A particular occasion I encountered was parse_particles being changed to parse_particle_points. This breaks config/test_uresnet_ppn.cfg, which I have been using for testing, even if it's trivial to notice. I think this needs more than unit tests, unfortunately, and needs communication at the pull request anyway, because it can break people's notebooks (recall we have notebooks from the workshop where the config is hardcoded; they will all break), which cannot be covered by a unit test.

`analysis_keys` and the reduction of network output in `trainval.py::forward` seems not ideal

trainval.py::forward should be one of the most useful functions for a user to "run the net, get the output" unaltered (i.e. without calling a reduction like sum or mean). That should be the default. However, currently, the reduction is skipped only if one specifies analysis_keys in the config for ALL network output tensor keys.

The only reason I can see why this is in trainval.py::forward is so that it can be reported in the main.py::log function. It should be the other way around: the main.py::log function, as a consumer, should be computing the mean/count.

Both the "output" and "analysis" code already have the capability to parse the full tensor. So it seems we should remove those if statements in trainval.py::forward.

Further, though maybe unrelated... "output" and "analysis" are two different functionalities and should not be sharing a parameter, analysis_keys. In fact, why are they separate? Storing the net output as-is sounds like the dumbest type of "analysis" and can be part of it.

Seed Behavior

An observation:
If I take the same model and run with the same seed twice, I will see the same events and the same training behavior.
If I take different models and run with the same seed twice, I will see different events.

This is likely due to the seed being set once, then

  1. using the random number generator for model creation
  2. using the random number generator for determining event ordering

It seems like we may wish to have separate seeds for model creation and training data ordering, so we can know whether we're seeing the same events in the same order across models (or when we change the model definition).

i.e.

sampler:
    seed: 0
    name: RandomSequenceSampler
    batch_size: 8

for the sampler

and

model:
    seed: 0

for model creation
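
A minimal sketch of how the two seeds could be consumed, assuming the configuration blocks sketched above (the key paths are illustrative):

import numpy as np
import torch

torch.manual_seed(cfg['model']['seed'])      # controls model weight initialization
np.random.seed(cfg['sampler']['seed'])       # controls event ordering in the sampler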

Graceful error handling in the full chain (and elsewhere)

The full chain (and elsewhere) should handle errors gracefully when assumptions are made, e.g. (a sketch follows this list):

  • If UResNet assumes non-empty images, check for it, skip
  • If Graph-SPICE assumes > X voxels that are non-ghosts or non-LE, check for it, skip
  • If GrapPA assumes > 0 clusters, check for it, skip
  • Etc.
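
A minimal sketch of the kind of guard meant here (the empty-output convention and the wrapped call are illustrative, not the chain's actual API):

def run_grappa(grappa, clusters):
    # GrapPA assumes > 0 clusters: check for it and skip gracefully
    # instead of letting the forward pass raise somewhere downstream.
    if len(clusters) == 0:
        return {'node_pred': [], 'edge_pred': [], 'edge_index': []}
    return grappa(clusters)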

parser to support multi-channel sparse tensors

Would be good for the tensor3d parsers (sparse and dense) to support multi-channel input data. Configuration-wise, it's just a matter of concatenating an arbitrary number of sparse3d data products; there should also be a sanity check that the meta are compatible (Voxel3DMeta implements ==).

  1. parser to be modified
  2. code that references a single feature channel to be modified to reference a multi-channel slice

running error

Although larcv3 is installed, I'm getting the following error when trying to import larcv:

No module named 'larcv.pylarcv'

Thanks and regards!

Record Event IDs in Log

It may be useful to record the first event ID of a batch in the logs.

For the RandomSequenceSampler, this would allow one to figure out the exact batch used if a log reveals some funny-looking behavior.

Unwrapper handling of empty entries within a batch

Two things I observed:

  • With batch_size=1 and an empty particle_graph, the unwrapper crashes when building the unwrap_map
  • With batch_size=2 and a particle_graph with only one batch_id, the unwrapper returns a single entry for that data product

It looks to me that if a data product, say particle_graph, has no corresponding row for a particular batch ID, the number of np.arrays returned by the unwrapper may differ from that of other data products that do have rows, which is undesirable. If the batch_size is 16, we should get 16 objects per data product, even if some are empty. One way to do this is to use batch_idx_max to resize outputs instead of appending them (see the sketch below). Correct me if I am misreading the code.
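
A hedged sketch of that suggestion: pre-allocate one slot per batch entry and fill by batch ID, rather than appending only the IDs that happen to be present (batch_ids, values and batch_size are illustrative names):

import numpy as np

def unwrap_by_batch(batch_ids, values, batch_size):
    """Always return batch_size entries, even for batch IDs with no rows."""
    out = [values[:0] for _ in range(batch_size)]   # empty slices keep dtype and shape
    for b in range(batch_size):
        mask = batch_ids == b
        if np.any(mask):
            out[b] = values[mask]
    return out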

Multi-GPU support for GNNs

GNN edge models seem to run into problems when running on multiple GPUs:

RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:255

Predicted to true matching throws exception when encountering interaction with zero reconstructed particles

Running on the BNB numu sample I get the following stack trace:

~/lartpc_mlreco3d/v02_05_00/lartpc_mlreco3d/analysis/classes/ui.py in match_interactions(self, entry, mode, drop_nonprimary_particles, match_particles, return_counts, **kwargs)
1104 return_counts=False, **kwargs):
1105 if mode == 'pred_to_true':
-> 1106 ints_from = self.get_interactions(entry, drop_nonprimary_particles=drop_nonprimary_particles)
1107 ints_to = self.get_true_interactions(entry, drop_nonprimary_particles=drop_nonprimary_particles)
1108 elif mode == 'true_to_pred':

~/lartpc_mlreco3d/v02_05_00/lartpc_mlreco3d/analysis/classes/ui.py in get_interactions(self, entry, drop_nonprimary_particles)
645 - out: List of instances (see particle.Interaction).
646 '''
--> 647 particles = self.get_particles(entry, only_primaries=drop_nonprimary_particles)
648 out = group_particles_to_interactions_fn(particles)
649 for ia in out:

~/lartpc_mlreco3d/v02_05_00/lartpc_mlreco3d/analysis/classes/ui.py in get_particles(self, entry, only_primaries, min_particle_voxel_count, attaching_threshold)
544 particles_seg = self.result['particles_seg'][entry]
545
--> 546 type_logits = self.result['node_pred_type'][entry]
547 input_node_features = [None] * type_logits.shape[0]
548 if 'particle_node_features' in self.result:

KeyError: 'node_pred_type'

This was traced back to an event (entry==10 in the numu sample) with 0 reconstructed particles. This can be caught explicitly downstream, but should really be handled upstream.

`forward time` report is always 0

Simple, but a bug; it is useful to have so one gets reminded how much time is spent on pre/post-processing steps (that are outside the forward call).

import error

First, modify test/test_loader.py to actually run the run_test() function... but then
python3 test/test_loader.py config/test_loader.cfg
fails with the following error:

  File "test/test_loader.py", line 71, in <module>
    test_loader()
  File "test/test_loader.py", line 37, in test_loader
    loader,data_keys = loader_factory(cfg)
  File "/home/drinkingkazu/sw/git/lartpc_mlreco3d/mlreco/iotools/factories.py", line 46, in loader_factory
    ds = dataset_factory(cfg)
  File "/home/drinkingkazu/sw/git/lartpc_mlreco3d/mlreco/iotools/factories.py", line 34, in dataset_factory
    import mlreco.iotools.datasets
  File "/home/drinkingkazu/sw/git/lartpc_mlreco3d/mlreco/iotools/datasets.py", line 6, in <module>
    import mlreco.iotools.parsers
  File "/home/drinkingkazu/sw/git/lartpc_mlreco3d/mlreco/iotools/parsers.py", line 7, in <module>
    from mlreco.utils.gnn.primary import get_em_primary_info
  File "/usr/local/root/lib/ROOT.py", line 463, in _importhook
    return _orig_ihook( name, *args, **kwds )
  File "/home/drinkingkazu/sw/git/lartpc_mlreco3d/mlreco/utils/gnn/primary.py", line 10, in <module>
    from mlreco.utils.gnn.cluster import get_cluster_label, get_cluster_batch
  File "/usr/local/root/lib/ROOT.py", line 463, in _importhook
    return _orig_ihook( name, *args, **kwds )
  File "/home/drinkingkazu/sw/git/lartpc_mlreco3d/mlreco/utils/gnn/cluster.py", line 3, in <module>
    from mlreco.models.layers.dbscan import DBScanClusts
  File "/usr/local/root/lib/ROOT.py", line 463, in _importhook
    return _orig_ihook( name, *args, **kwds )
  File "/home/drinkingkazu/sw/git/lartpc_mlreco3d/mlreco/models/__init__.py", line 7, in <module>
    from mlreco.models import attention_gnn
  File "/usr/local/root/lib/ROOT.py", line 463, in _importhook
    return _orig_ihook( name, *args, **kwds )
  File "/home/drinkingkazu/sw/git/lartpc_mlreco3d/mlreco/models/attention_gnn.py", line 8, in <module>
    from mlreco.utils.gnn.cluster import form_clusters, get_cluster_batch, get_cluster_label, form_clusters_new
ImportError: cannot import name 'form_clusters'

Circular import?
iotools.parsers needs utils.gnn.primary, which needs utils.gnn.cluster, which needs models.layers.dbscan, which needs models.attention_gnn, which needs utils.gnn.cluster.

Full Chain Grappa Track breaks when a batch is too large

If the batch is too large, training breaks:

The complete graph is too large, must skip batch
Traceback (most recent call last):
...
  File "/sdf/home/b/bearc/lartpc_mlreco3d_fd/mlreco/models/layers/common/gnn_full_chain.py", line 107, in run_gnn
    gnn_output['edge_index'][0][b],
KeyError: 'edge_index'

The error is just caused by skipping the graph and landing in a code path it's not supposed to reach. The temporary workaround is to train on smaller batches. In the standalone module it will just skip the batch.

multi-gpu error

Trying to train or run inference using multiple GPUs fails in a strange way. The issue was identified as running some torch methods before setting the device id and os.environ['CUDA_VISIBLE_DEVICES'].

PPN Output Parser

There should be an easy way to turn PPN output into points. This is currently not the case in uresnet_ppn. I see two options:

  1. Output points directly from forward function in PPN model
  2. Have a parser function available that will give the output

The goal is to use this in a chain where we can feed the PPN positions into a GNN, which will then use the positions for primary identification.
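
A hedged sketch of option 2, a standalone parser-like helper; the layout of the raw PPN output (per-pixel offsets plus a point score) is an assumption here, not taken from the repository:

import numpy as np

def ppn_to_points(coords, offsets, scores, threshold=0.5):
    """Keep pixels whose point score passes the threshold and shift them
    by the predicted offset to obtain candidate point positions."""
    mask = scores > threshold
    return coords[mask] + offsets[mask]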

Documentation

We should have some better documentation in this repository on:

  1. How to set up configuration files (and options)
  2. High level functions (and models)
  3. How to add/extend to the repository

I think (1) and (3) are known by some, but not well documented. Maybe this is best done in README.md.
(2) really needs to be done (when the code is first written) in a docstring.

Minimally, the docstrings should include:

  1. A brief description of what the function does
  2. A description of the inputs
  3. A description of the outputs
  4. any assumptions the function makes (optional)
    Better yet, we should have a standardized template for this (a sketch follows below).

Yes, this should be done for the models and important functions. No, this shouldn't be required of short helper functions (but it's always nice).
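
A possible standardized template (a sketch only; the exact section names are up for discussion):

def example_function(input_data, threshold=0.5):
    """One-line description of what the function does.

    Parameters
    ----------
    input_data : np.ndarray
        Description of the input, including the expected shape.
    threshold : float, optional
        Description of the parameter and its default.

    Returns
    -------
    np.ndarray
        Description of the output.

    Notes
    -----
    Any assumptions the function makes (optional).
    """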
