
Open MatSci ML Toolkit is a framework for prototyping and scaling out deep learning models for materials discovery, supporting widely used materials science datasets, and built on top of PyTorch Lightning, the Deep Graph Library, and PyTorch Geometric.

License: MIT License


matsciml's Introduction

Open MatSci ML Toolkit : A Broad, Multi-Task Benchmark for Solid-State Materials Modeling


This is the implementation of the MatSci ML benchmark, which includes ~1.5 million ground-state materials collected from various datasets as well as integration of the OpenCatalyst dataset, supporting diverse data formats (point clouds, DGL graphs, PyG graphs), learning methods (single-task, multi-task, multi-data), and deep learning models. Primary project contributors include: Santiago Miret (Intel Labs), Kin Long Kelvin Lee (Intel AXG), Carmelo Gonzales (Intel Labs), Mikhail Galkin (Intel Labs), Marcel Nassar (Intel Labs), and Matthew Spellings (Vector Institute).

News

  • [2023/09/27] Release of pre-packaged lmdb-based datasets from v1.0.0 via Zenodo.
  • [2023/08/31] Initial release of the MatSci ML Benchmark with integration of ~1.5 million ground state materials.
  • [2023/07/31] The Open MatSci ML Toolkit : A Flexible Framework for Deep Learning on the OpenCatalyst Dataset paper is accepted into TMLR. See previous version for code related to the benchmark.

Introduction

The MatSci ML Benchmark contains diverse sets of tasks (energy prediction, force prediction, property prediction) across a broad range of datasets (OpenCatalyst Project [1], Materials Project [2], LiPS [3], OQMD [4], NOMAD [5], Carolina Materials Database [6]). Most of the data relates to the energy prediction task, since energy is the most commonly tracked property for materials systems in the literature. The codebase supports single-task learning, as well as multi-task (training one model for multiple tasks within a dataset) and multi-data (training a model across multiple datasets with a common property) learning. Additionally, we provide a generative materials pipeline that applies diffusion models (CDVAE [7]) to generate new unit cells.

The package follows the original design principles of the Open MatSci ML Toolkit, including:

  • Ease of use for new ML researchers and practitioners who want to get started with the OpenCatalyst dataset.
  • Scalable computation of experiments leveraging PyTorch Lightning across different computational capabilities (laptop, server, cluster) and hardware platforms (CPU, GPU, XPU) without sacrificing performance in compute or modeling.
  • Integrated support for DGL and PyTorch Geometric for rapid GNN development.

The examples outlined in the next section show how to get started with Open MatSci ML Toolkit using simple Python scripts, Jupyter notebooks, or the PyTorch Lightning CLI for simple training on a portable subset of the original dataset (dev-set) that can be run on a laptop. Subsequently, we scale our example Python script to large compute systems, including distributed data parallel training (multiple GPUs on a single node) and multi-node training (multiple GPUs across multiple nodes) in a computing cluster. Leveraging both PyTorch Lightning and DGL capabilities, we can scale compute and experiments with minimal additional complexity.

Installation

  • Docker: We provide a Dockerfile inside the docker directory that can be used to build a container image using standard docker commands.
  • Conda: We have included a conda specification that provides a complete installation including XPU support for PyTorch. Run conda env create -n matsciml --file conda.yml, and in the newly created environment, run pip install './[all]' to install all of the dependencies.
  • pip: In some cases, you might want to install matsciml to an existing environment. Due to how DGL distributes wheels, you will need to add an extra index URL when installing via pip. As an example: pip install -f https://data.dgl.ai/wheels/repo.html './[all]' will install all the matsciml dependencies, in addition to telling pip where to look for CPU-only DGL wheels for your particular platform and Python version. Please consult the DGL documentation for additional help.

Additionally, for a development install, one can specify the extra packages like black and pytest with pip install './[dev]'. These can be added to the commit workflow by running pre-commit install to generate git hooks.

Intel XPU capabilities

There are currently extra requirements for getting a complete software environment to run on Intel XPUs, namely runtime libraries that can't (yet) be packaged cohesively together. While conda.yml provides all of the high-performance Python requirements (i.e. PyTorch and IPEX), we assume you have downloaded and sourced the oneAPI base toolkit (==2024.0.0). On managed clusters, sysadmins will usually provide modules (i.e. module avail/module load oneapi); on free clusters or workstations, please refer to the instructions found here with the appropriate version (currently 2.1.0). Specific requirements are MKL==2024.0 and oneCCL==2021.11.0 with the current IPEX (2.1.10+xpu) and oneccl_bind_pt (2.1.100+xpu). At the time of writing, MKL>=2024.1 is incompatible with this IPEX version.

The module matsciml.lightning.xpu implements interfaces that connect Intel XPU support to Lightning abstractions, including the XPUAccelerator and two strategies for deployment (single XPU/tile and distributed data parallel). Because we use PyTorch Lightning, there are few marked differences between running on Intel XPUs and GPUs from other vendors. The abstractions we mentioned are registered in the various Lightning registries, and should be accessible simply through pl.Trainer arguments, e.g.:

trainer = pl.Trainer(accelerator='xpu')

The one major difference is for distributed data parallelism: Intel XPUs use the oneCCL communication backend, which replaces nccl, gloo, or other backends typically passed to torch.distributed. Please see examples/devices for single XPU/tile and DDP use cases.
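
If you ever need to interact with torch.distributed directly rather than through the Lightning strategy, a minimal sketch of selecting the oneCCL backend might look like the following; this is an illustration rather than the matsciml strategy itself, and it assumes the IPEX and oneccl_bind_pt versions listed above are installed and that a launcher has set the usual rendezvous environment variables.

import torch.distributed as dist

import intel_extension_for_pytorch  # noqa: F401  registers the "xpu" device with PyTorch
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend

# assumes RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are set by your launcher
dist.init_process_group(backend="ccl", init_method="env://")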

NOTE: Currently there is a hard-coded torch.cuda.stream context in PyTorch Lightning's DDPStrategy. This issue has been created to see if the maintainers would be happy to patch it so that the cuda.Stream context is only used if a CUDA device is being used. If you encounter a RuntimeError: Tried to instantiate dummy base class Stream, please just set ctx = nullcontext() in the line of code that raises the exception.

Examples

The examples folder contains small, self-contained scripts that demonstrate how to use the pipeline in specific ways:

Get started with different datasets with "devsets"
# Materials project
python examples/datasets/materials_project/single_task_devset.py

# Carolina materials database
python examples/datasets/carolina_db/single_task_devset.py

# NOMAD
python examples/datasets/nomad/single_task_devset.py

# OQMD
python examples/datasets/oqmd/single_task_devset.py
Representation learning with symmetry pretraining
# uses the devset for synthetic point group point clouds
python examples/tasks/symmetry/single_symmetry_example.py
Example notebook-based development and testing
jupyter notebook examples/devel-example.ipynb

For more advanced use cases:

Checkout materials generation with CDVAE

CDVAE [7] is a latent diffusion model that trains a VAE on the reconstruction objective, adds Gaussian noise to the latent variable, and learns to predict the noise. The noised and generated features include lattice parameters, atom composition, and atom coordinates. The generation process is based on annealed Langevin dynamics.

CDVAE is implemented in the GenerationTask and we provide a custom data split from the Materials Project bounded by 25 atoms per structure. The process is split into 3 parts with 3 respective scripts found in examples/model_demos/cdvae/.

  1. Training CDVAE on the reconstruction and denoising objectives: cdvae.py
  2. Sampling the structures (from scratch or reconstruct the test set): cdvae_inference.py
  3. Evaluating the sampled structures: cdvae_metrics.py

The sampling procedure takes some time (about 5-8 hours for 10,000 structures, depending on the hardware) due to the Langevin dynamics. The default hyperparameters of the CDVAE components correspond to those from the original paper and can be found in cdvae_configs.py.

# training
python examples/model_demos/cdvae/cdvae.py --data_path <path/to/splits>

# sampling 10,000 structures from scratch
python examples/model_demos/cdvae/cdvae_inference.py --model_path <path/to/checkpoint> --data_path <path/to/splits> --tasks gen

# evaluating the sampled structures
python examples/model_demos/cdvae/cdvae_metrics.py --root_path <path/to/generated_samples> --data_path <path/to/splits> --tasks gen
Multiple tasks trained using the same dataset
# this script requires modification as you'll need to download the materials
# project dataset, and point L24 to the folder where it was saved
python examples/tasks/multitask/single_data_multitask_example.py

Utilizes Materials Project data to train property regression and material classification jointly

Multiple tasks trained using multiple datasets
python examples/tasks/multitask/three_datasets.py

Train regression tasks against IS2RE, S2EF, and LiPS datasets jointly

Data Pipeline

In the scripts folder you will find the two scripts needed to download and preprocess datasets: download_datasets.py can be used to obtain the Carolina DB, Materials Project, NOMAD, and OQMD datasets, while download_ocp_data.py preserves the original Open Catalyst script.

In the current release, we have implemented interfaces to a number of large scale materials science datasets. Under the hood, the data structures pulled from each dataset have been homogenized, and the only real interaction layer for users is through the MatSciMLDataModule, a subclass of LightningDataModule.

from matsciml.lightning.data_utils import MatSciMLDataModule

# no configuration needed, although one can specify the batch size and number of workers
devset_module = MatSciMLDataModule.from_devset(dataset="MaterialsProjectDataset")

This will let you springboard into development without needing to worry about how to wrangle with the datasets; just grab a batch and go! With the exception of Open Catalyst, datasets will typically return point cloud representations; we provide a flexible transform interface to interconvert between representations and frameworks:

From point clouds to DGL graphs
from matsciml.datasets.transforms import PointCloudToGraphTransform

# make the materials project dataset emit DGL graphs, based on an atom-atom distance cutoff of 10
devset = MatSciMLDataModule.from_devset(
    dataset="MaterialsProjectDataset",
    dset_kwargs={"transforms": [PointCloudToGraphTransform(backend="dgl", cutoff_dist=10.)]}
)
But I want to use PyG?
from matsciml.datasets.transforms import PointCloudToGraphTransform

# change the backend argument to obtain PyG graphs
devset = MatSciMLDataModule.from_devset(
    dataset="MaterialsProjectDataset",
    dset_kwargs={"transforms": [PointCloudToGraphTransform(backend="pyg", cutoff_dist=10.)]}
)
What else can I configure with `MatSciMLDataModule`?

Datasets beyond devsets can be configured through class arguments:

devset = MatSciMLDataModule(
    dataset="MaterialsProjectDataset",
    train_path="/path/to/training/lmdb/folder",
    batch_size=64,
    num_workers=4,     # configure data loader instances
    dset_kwargs={"transforms": [PointCloudToGraphTransform(backend="pyg", cutoff_dist=10.)]},
    val_split="/path/to/val/lmdb/folder"
)

In particular, val_split and test_split can point to their respective LMDB folders, or be a float in [0,1] to do quick, uniform splits. The rest, including distributed sampling, will be taken care of for you under the hood.
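
For example, a fractional split can be requested instead of a dedicated validation folder (a small sketch reusing the arguments shown above):

from matsciml.lightning.data_utils import MatSciMLDataModule

# carve a 10% uniform validation split out of the training data
datamodule = MatSciMLDataModule(
    dataset="MaterialsProjectDataset",
    train_path="/path/to/training/lmdb/folder",
    batch_size=64,
    num_workers=4,
    val_split=0.1,
)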

How do I compose multiple datasets?

Given the amount of configuration involved, composing multiple datasets takes a little more work but we have tried to make it as seamless as possible. The main difference from the single dataset case is replacing MatSciMLDataModule with MultiDataModule from matsciml.lightning.data_utils, configuring each dataset manually, and passing them collectively into the data module:

from matsciml.datasets import MaterialsProjectDataset, OQMDDataset, MultiDataset
from matsciml.lightning.data_utils import MultiDataModule

# configure training only here, but same logic extends to validation/test splits
train_dset = MultiDataset(
  [
    MaterialsProjectDataset("/path/to/train/materialsproject"),
    OQMDDataset("/path/to/train/oqmd")
  ]
)

# this configures the actual data module passed into Lightning
datamodule = MultiDataModule(
  batch_size=32,
  num_workers=4,
  train_dataset=train_dset
)

While it does require a bit of extra work, this was to ensure flexibility in how you can compose datasets. We welcome feedback on the user experience! 😃

Task abstraction

In Open MatSci ML Toolkit, tasks effectively form learning objectives: at a high level, a task takes an encoding model/backbone that ingests a structure to predict one or several properties, or to classify a material. In the single-task case, there may be multiple targets and the neural network architecture may be fluid, but there is only one optimizer. Under this definition, multi-task learning comprises multiple tasks and optimizers operating jointly through a single embedding.
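
To make the abstraction concrete, here is a small conceptual sketch in plain PyTorch Lightning (not the actual matsciml task API): an encoder produces an embedding, each target gets its own output head, and a single optimizer covers the whole module.

import torch
from torch import nn
import pytorch_lightning as pl

class ToyRegressionTask(pl.LightningModule):
    """Conceptual single-task example: one encoder, one head per target, one optimizer."""

    def __init__(self, encoder: nn.Module, embed_dim: int, targets: list[str]):
        super().__init__()
        self.encoder = encoder
        # every target shares the same embedding but has its own projection head
        self.heads = nn.ModuleDict({name: nn.Linear(embed_dim, 1) for name in targets})

    def training_step(self, batch, batch_idx):
        embedding = self.encoder(batch["inputs"])
        losses = [
            nn.functional.mse_loss(head(embedding).squeeze(-1), batch[name])
            for name, head in self.heads.items()
        ]
        # single-task case: all targets contribute to one loss, stepped by one optimizer
        return torch.stack(losses).sum()

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)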

References

  • [1] Chanussot, L., Das, A., Goyal, S., Lavril, T., Shuaibi, M., Riviere, M., Tran, K., Heras-Domingo, J., Ho, C., Hu, W. and Palizhati, A., 2021. Open catalyst 2020 (OC20) dataset and community challenges. Acs Catalysis, 11(10), pp.6059-6072.
  • [2] Jain, A., Ong, S.P., Hautier, G., Chen, W., Richards, W.D., Dacek, S., Cholia, S., Gunter, D., Skinner, D., Ceder, G. and Persson, K.A., 2013. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL materials, 1(1).
  • [3] Batzner, S., Musaelian, A., Sun, L., Geiger, M., Mailoa, J.P., Kornbluth, M., Molinari, N., Smidt, T.E. and Kozinsky, B., 2022. E (3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature communications, 13(1), p.2453.
  • [4] Kirklin, S., Saal, J.E., Meredig, B., Thompson, A., Doak, J.W., Aykol, M., Rühl, S. and Wolverton, C., 2015. The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies. npj Computational Materials, 1(1), pp.1-15.
  • [5] Draxl, C. and Scheffler, M., 2019. The NOMAD laboratory: from data sharing to artificial intelligence. Journal of Physics: Materials, 2(3), p.036001.
  • [6] Zhao, Y., Al‐Fahdi, M., Hu, M., Siriwardane, E.M., Song, Y., Nasiri, A. and Hu, J., 2021. High‐throughput discovery of novel cubic crystal materials using deep generative neural networks. Advanced Science, 8(20), p.2100566.
  • [7] Xie, T., Fu, X., Ganea, O.E., Barzilay, R. and Jaakkola, T.S., 2021, October. Crystal Diffusion Variational Autoencoder for Periodic Material Generation. In International Conference on Learning Representations.

Citations

If you use Open MatSci ML Toolkit in your technical work or publication, we would appreciate it if you cite the Open MatSci ML Toolkit paper in TMLR:

Miret, S.; Lee, K. L. K.; Gonzales, C.; Nassar, M.; Spellings, M. The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science. Transactions on Machine Learning Research, 2023.
@article{openmatscimltoolkit,
  title = {The Open {{MatSci ML}} Toolkit: {{A}} Flexible Framework for Machine Learning in Materials Science},
  author = {Miret, Santiago and Lee, Kin Long Kelvin and Gonzales, Carmelo and Nassar, Marcel and Spellings, Matthew},
  year = {2023},
  journal = {Transactions on Machine Learning Research},
  issn = {2835-8856}
}

If you use v1.0.0, please cite our paper:

Lee, K. L. K., Gonzales, C., Nassar, M., Spellings, M., Galkin, M., & Miret, S. (2023). MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling. arXiv preprint arXiv:2309.05934.
@article{lee2023matsciml,
  title={MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling},
  author={Lee, Kin Long Kelvin and Gonzales, Carmelo and Nassar, Marcel and Spellings, Matthew and Galkin, Mikhail and Miret, Santiago},
  journal={arXiv preprint arXiv:2309.05934},
  year={2023}
}

Please cite datasets used in your work as well. You can find additional descriptions and details regarding each dataset here.

matsciml's People

Contributors

dependabot[bot], jonathanschmidt1, laserkelvin, melo-gonzo, michaelbeale-il, migalkin, smiret-intel, vaibhav525


matsciml's Issues

IS2RE devset not being included in package install

Currently, IS2REDGLDataModule.from_devset throws an error when the framework is installed without develop mode, as the minimal devset is not included in site-packages.

MANIFEST.in needs to be updated for this to happen.

[Feature request]: Support using intermediate embeddings

Feature/behavior summary

Refactor Embeddings data structure, and their use by OutputBlocks to allow the use of intermediate embeddings for modeling. In some models such as MACE, the output of the model is given as some reduction over projections of intermediate layers, i.e. $E_f = E_0 + E_1 + \ldots + E_l$ for $l$ layers.

The current implementation hinders this mode of usage a little, as we need to be able to store intermediate embeddings and then use them correctly after every layer is computed. That would be fine if the intermediate embeddings were the same shape and could be concatenated along a single (new) dimension, with the output blocks simply broadcasting. In the case of MACE and other equivariance-preserving models, however, the intermediate layers may have different shapes and can't be concatenated.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

#72 is the main issue, but #83 is a related WIP PR

Solution description

One way would be to refactor Embeddings to allow intermediate embeddings; i.e. for node/system levels, we expect either a Tensor or a list[Tensor]. This would then need the logic of OutputBlock to be modified so that we use different output heads per intermediate embedding, then reduce them for the final output.

For now I don't think this will be backwards breaking.
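
A rough sketch of what per-layer output heads might look like (the class and argument names here are hypothetical, not the existing OutputBlock API): one projection per intermediate embedding, reduced by summation as in the MACE-style readout above.

import torch
from torch import nn

class PerLayerOutputHead(nn.Module):
    """Hypothetical sketch: project each intermediate embedding, then sum the results."""

    def __init__(self, layer_dims: list[int], out_dim: int = 1):
        super().__init__()
        # one head per layer, so intermediate embeddings may have different widths
        self.heads = nn.ModuleList([nn.Linear(dim, out_dim) for dim in layer_dims])

    def forward(self, embeddings: list[torch.Tensor]) -> torch.Tensor:
        # E_f = E_0 + E_1 + ... + E_l, with each term projected by its own head
        outputs = [head(emb) for head, emb in zip(self.heads, embeddings)]
        return torch.stack(outputs, dim=0).sum(dim=0)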

Additional notes

No response

[Feature request]: Remove obsolete `encoder_only` arguments from model definitions

Feature/behavior summary

With #67, model forward methods are expected to output Embeddings objects, which doesn't leave room for ambiguity on what the model outputs are since the output heads from tasks are meant to do the final projections.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

#67 is where this was brought up by @melo-gonzo, based on the model output refactors.

Solution description

Remove encoder_only arguments from models that have them, and any associated projection layers in models that won't be used.

Additional notes

No response

[Bug]: `matsciml` installation fails because of `dgl` distribution

Expected behavior

Running pip install -e . should install matsciml in a developer installation with the minimum set of dependencies.

Actual behavior

Since #98, pip no longer resolves dgl>1.1.3. According to the DGL instructions, you need to provide --extra-index-url to a separate repository in order to obtain wheels even for a CPU-only build.

Steps to reproduce the problem

Run pip install on main, or simply run pip install dgl==2.0.0, and it will raise the following error:

 $ pip install dgl==2.0.0
ERROR: Could not find a version that satisfies the requirement dgl==2.0.0 (from versions: 0.1.0, 0.1.2, 0.1.3, 0.6.0, 0.6.0.post1, 0.6.1, 0.9.0, 0.9.1, 1.0.0, 1.0.1, 1.0.4, 1.1.0, 1.1.1, 1.1.2, 1.1.3)
ERROR: No matching distribution found for dgl==2.0.0

Specifications

N/A

[Bug]: Incorrect Key In OutputHead

Expected behavior

The final block should have residual=False.

Actual behavior

The final block uses the default residual=True.

Steps to reproduce the problem

Run any of the example scripts.

Specifications

v1.0.0

[Feature request]: Refactor MACE to be compatible with task abstractions

Feature/behavior summary

The current implementation of MACE is not fully compatible with the task abstraction we have designed: for example, force computation is hardcoded into the model, similar to how the original OCP models were implemented. This means it is not plug-and-play like the other models, which can be freely composed into an end-to-end pipeline.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

N/A

Solution description

There are a number of things that need to be done in order to have MACE comply with the rest of the pipeline. To preserve backwards compatibility, it might be a better option to refactor the existing architecture as a "vanilla" implementation, and have a second compliant architecture. A "seamless" solution would be to have the MACE class act as a wrapper, which subsequently comprises both vanilla and conforming versions of the architecture.

A quasi-ordered task list (that will get updated) that should get converted into individual issues/PRs where applicable:

  • #116
  • Isolate/modularize unneeded utility functions, particularly those in support of independent training
  • Duplicate and rewrite MatSciML conforming MACE to use IrrepOutputBlock
  • #83
  • Register MACE in model registry

Additional notes

N/A

[Bug]: remove_self_loops() function is missing

Expected behavior

In matsciml/common/utils.py, the remove_self_loops() function is missing.

Actual behavior

When you open the utils.py, you can see line 60: edge_index, edge_attr = remove_self_loops(edge_index, edge_attr), where "remove_self_loops" is not defined.

Steps to reproduce the problem

Line 60: edge_index, edge_attr = remove_self_loops(edge_index, edge_attr)

Specifications

matsciml as of [9568e18]

[Bug]: MACE fails to JIT compile with Python 3.10 union type hints

Expected behavior

The MACE model uses @torch.jit.script for performance, and it would be good to maintain that functionality.

Actual behavior

The following error gets raised as I was writing example unit tests:

    fn = torch._C._jit_script_compile(
E   RuntimeError: 
E   Expression of type | cannot be used in a type expression:
E     File "/home/kinlongk/Repos/matsciml/matsciml/models/pyg/mace/tools/scatter.py", line 39
E       index: torch.Tensor,
E       dim: int = -1,
E       out: torch.Tensor | None = None,
E            ~~~~~~~~~~~~~~~~~~~ <--- HERE
E       dim_size: int | None = None,
E       reduce: str = "sum",

According to this PR, @torch.jit.script is in maintenance mode and does not support new Python features out of the box. The referenced PR is completed and merged, but we will have to wait until PyTorch 2.2.0 for it to become available.

The short-term solution would be to annotate those lines to stop formatters from rewriting the type hints as Python 3.10 unions.
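
A short sketch of what that annotation could look like: keep typing.Optional so torch.jit.script can parse the hint, and add a lint pragma so tooling does not rewrite it (the exact pragma depends on your formatter/linter; the body below is a placeholder, not the real scatter implementation).

from typing import Optional

import torch

@torch.jit.script
def scatter_stub(
    src: torch.Tensor,
    index: torch.Tensor,
    dim: int = -1,
    out: Optional[torch.Tensor] = None,  # noqa  keep Optional[...] so TorchScript can parse it
    dim_size: Optional[int] = None,  # noqa
) -> torch.Tensor:
    # placeholder body; the real implementation lives in mace/tools/scatter.py
    if out is None:
        return src
    return out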

Steps to reproduce the problem

Related PyTorch issue: pytorch/pytorch#114755

Specifications

absl-py==2.1.0
aiohttp==3.9.1
aiosignal==1.3.1
annotated-types==0.6.0
anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1688651106312/work/dist
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1692818318753/work
argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_1695386548994/work
ase==3.22.1
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work
async-timeout==4.0.3
attrs==23.1.0
bandit==1.7.5
beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1705564648255/work
black==23.11.0
bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1696630167146/work
Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1695989787169/work
cachetools==5.3.2
certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1707022139797/work/certifi
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1696001742886/work
cfgv==3.4.0
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1698833585322/work
click==8.1.7
cloudpickle==3.0.0
colorama @ file:///home/conda/feedstock_root/build_artifacts/colorama_1666700638685/work
comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1704278392174/work
contextlib2==0.5.5
cycler @ file:///home/conda/feedstock_root/build_artifacts/cycler_1696677705766/work
Cython @ file:///home/sat_bot/base/conda-bld/cython_1695764990393/work
debugpy @ file:///home/conda/feedstock_root/build_artifacts/debugpy_1695534305529/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
defusedxml @ file:///home/conda/feedstock_root/build_artifacts/defusedxml_1615232257335/work
Deprecated @ file:///home/conda/feedstock_root/build_artifacts/deprecated_1685233314779/work
dgl==0.9.1
dgllife==0.3.2
distlib==0.3.7
docstring-parser==0.15
dpctl==0.15.0+42.g2a2c98f1f
dpnp==0.13.0+170.gf6d175b0f
e3nn==0.5.1
einops==0.7.0
emmet-core==0.64.0
entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1698579936712/work
fastjsonschema @ file:///home/conda/feedstock_root/build_artifacts/python-fastjsonschema_1703780968325/work/dist
filelock==3.13.1
flake8==6.1.0
flake8-bandit==4.1.1
flake8-black==0.3.6
Flake8-pyproject==1.2.3
fonttools==4.25.0
frozenlist==1.4.0
fsspec==2023.10.0
future==0.18.3
geometric-algebra-attention==0.5.1
gitdb==4.0.11
GitPython==3.1.40
google-auth==2.26.2
google-auth-oauthlib==1.2.0
greenlet==3.0.3
grpcio==1.60.0
hyperopt==0.2.7
identify==2.5.32
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1663625384323/work
importlib-metadata==7.0.1
importlib-resources @ file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1699364556997/work
iniconfig==2.0.0
intel-extension-for-pytorch @ file:///home/gta/workspace/LKG/build_conda_package_pytorch/binaries/intel_extension_for_pytorch-2.0.100%252Bcpu-cp39-cp39-linux_x86_64.whl#sha256=901fd4612125a1962621c7a5c9fced997723662761297987d9ab22a371208a11
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1707182759703/work
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1701831663892/work
ipython-genutils==0.2.0
ipywidgets @ file:///home/conda/feedstock_root/build_artifacts/ipywidgets_1694607144474/work
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
Jinja2 @ file:///tmp/build/80754af9/jinja2_1624781299557/work
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1691577114857/work
jsonargparse==4.27.1
jsonschema @ file:///home/conda/feedstock_root/build_artifacts/jsonschema-meta_1705707496704/work
jsonschema-specifications @ file:///tmp/tmpkv1z7p57/src
jupyter @ file:///home/conda/feedstock_root/build_artifacts/jupyter_1696255489086/work
jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1654730843242/work
jupyter-console @ file:///home/conda/feedstock_root/build_artifacts/jupyter_console_1678118109161/work
jupyter-server @ file:///home/conda/feedstock_root/build_artifacts/jupyter_server_1693923066986/work
jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1704727023078/work
jupyterlab-widgets @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_widgets_1694598704522/work
jupyterlab_pygments @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_pygments_1707149102966/work
kiwisolver @ file:///home/conda/feedstock_root/build_artifacts/kiwisolver_1695379916629/work
latexcodec==2.0.1
lightning-utilities==0.10.0
llvmlite==0.41.1
lmdb==1.3.0
Markdown==3.5.2
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matgl==0.9.2
matplotlib @ file:///tmp/build/80754af9/matplotlib-suite_1634667019719/work
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work
-e git+ssh://[email protected]/laserkelvin/matsciml.git@bbac579af862787ec635c98e2d18d596f2c0023c#egg=matsciml
mccabe==0.7.0
mdurl==0.1.2
mendeleev==0.14.0
mistune @ file:///home/conda/feedstock_root/build_artifacts/mistune_1698947099619/work
mkl-fft==1.3.6
mkl-random==1.2.2
mkl-service==2.4.0
mkl-umath==0.1.1
monty==2023.11.3
mp-api==0.33.3
mpmath @ file:///home/conda/feedstock_root/build_artifacts/mpmath_1678228039184/work
msgpack==1.0.7
multidict==6.0.4
munch==2.5.0
munkres==1.1.4
mypy-extensions==1.0.0
nbclassic @ file:///home/conda/feedstock_root/build_artifacts/nbclassic_1683202081046/work
nbclient @ file:///home/conda/feedstock_root/build_artifacts/nbclient_1684790896106/work
nbconvert @ file:///home/conda/feedstock_root/build_artifacts/nbconvert-meta_1707182911809/work
nbformat @ file:///home/conda/feedstock_root/build_artifacts/nbformat_1690814868471/work
nest_asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1705850609492/work
networkx @ file:///tmp/build/80754af9/networkx_1627459939258/work
neural-compressor @ file:///root/w0/workspace/lpot-nightly-release-wheel-build/lpot-models/dist/neural_compressor-2.3.1-py3-none-any.whl#sha256=d94ba25aad77289c24d475a1967889832a9712ba685afa03093a06d93b84d1d0
nodeenv==1.8.0
notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1695225629675/work
notebook_shim @ file:///home/conda/feedstock_root/build_artifacts/notebook-shim_1682360583588/work
numba==0.58.1
numpy==1.26.2
oauthlib==3.2.2
oneccl-bind-pt @ file:///home/gta/workspace/LKG/build_conda_package_pytorch/binaries/oneccl_bind_pt-2.0.0%252Bcpu-cp39-cp39-linux_x86_64.whl#sha256=165c2d5525cf18390f8636fb18ee174b74ad7e785de5a000a9c63008e50f2182
opencv-python-headless @ file:///workspace/jsasswis/build_opecv/build_conda/opencv_python_headless-4.8.0.74-cp39-cp39-linux_x86_64.whl#sha256=aa6b111915aa44338293aa56605dd3d6e71998ba18b780d4907f0029b46a8fbb
opt-einsum==3.3.0
opt-einsum-fx==0.1.4
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1696202382185/work
palettable==3.3.3
pandas==1.5.3
pandocfilters @ file:///home/conda/feedstock_root/build_artifacts/pandocfilters_1631603243851/work
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathspec==0.11.2
pbr==6.0.0
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
Pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1695247205741/work
pkgutil_resolve_name @ file:///home/conda/feedstock_root/build_artifacts/pkgutil-resolve-name_1694617248815/work
platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1696272223550/work
plotly==5.18.0
pluggy==1.3.0
pooch @ file:///home/conda/feedstock_root/build_artifacts/pooch_1698245576425/work
pre-commit==3.5.0
prettytable==0.7.2
prometheus-client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1700579315247/work
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1702399386289/work
protobuf==4.23.4
psutil @ file:///tmp/build/80754af9/psutil_1612297992929/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
py-cpuinfo @ file:///tmp/build/80754af9/py-cpuinfo_1618437357364/work
py4j==0.10.9.7
pyasn1==0.5.1
pyasn1-modules==0.3.0
pybtex==0.24.0
pycocotools @ file:///home/conda/feedstock_root/build_artifacts/pycocotools_1626785515988/work
pycodestyle==2.11.1
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pydantic==1.10.12
pydantic-settings==2.1.0
pydantic_core==2.14.5
pyfiglet==0.8.post1
pyflakes==3.1.0
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1700607939962/work
pymatgen==2023.7.20
pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1690737849915/work
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
pytest==7.4.3
python-dateutil==2.8.2
python-dotenv==1.0.0
pytorch-lightning==1.8.6
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1693930252784/work
PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1695373447169/work
pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1666828545060/work
qtconsole @ file:///home/conda/feedstock_root/build_artifacts/qtconsole-base_1700168901209/work
QtPy @ file:///home/conda/feedstock_root/build_artifacts/qtpy_1698112029416/work
rdkit==2023.3.1
referencing @ file:///home/conda/feedstock_root/build_artifacts/referencing_1706711412823/work
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1684774241324/work
requests-oauthlib==1.3.1
rich==13.7.0
rpds-py @ file:///home/conda/feedstock_root/build_artifacts/rpds-py_1705159800573/work
rsa==4.9
ruamel.yaml==0.18.5
ruamel.yaml.clib==0.2.8
ruff==0.2.1
schema==0.7.5
scikit-learn @ file:///home/conda/feedstock_root/build_artifacts/scikit-learn_1696574864087/work
scipy==1.10.1
Send2Trash @ file:///home/conda/feedstock_root/build_artifacts/send2trash_1682601222253/work
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.1
sniffio @ file:///home/conda/feedstock_root/build_artifacts/sniffio_1662051266223/work
soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1693929250441/work
spglib==2.1.0
SQLAlchemy==2.0.25
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work
stevedore==5.1.0
sympy @ file:///home/conda/feedstock_root/build_artifacts/sympy_1684180539862/work
tabulate==0.9.0
TBB==0.2
tenacity==8.2.3
tensorboard==2.15.1
tensorboard-data-server==0.7.2
tensorboardX==2.6.2.2
terminado @ file:///home/conda/feedstock_root/build_artifacts/terminado_1699810101464/work
threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1689261241048/work
tinycss2 @ file:///home/conda/feedstock_root/build_artifacts/tinycss2_1666100256010/work
tomli==2.0.1
torch @ file:///home/gta/workspace/LKG/build_conda_package_pytorch/binaries/torch-2.0.1%252Bcpu-cp39-cp39-linux_x86_64.whl#sha256=73482a223d577407c45685fde9d2a74ba42f0d8d9f6e1e95c08071dc55c47d7b
torch-scatter==2.1.2
torch-sparse==0.6.18
torch_geometric==2.4.0
torchmetrics==1.2.1
torchvision @ file:///home/gta/workspace/LKG/build_conda_package_pytorch/binaries/torchvision-0.15.2a0%2Bfa99a53-cp39-cp39-linux_x86_64.whl#sha256=9051b9e66fca6dcc8ef5118adb9ddebb28dab1fe966e9baffefdb2eba183be52
tornado @ file:///tmp/build/80754af9/tornado_1606942317143/work
tqdm==4.66.1
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1704212992681/work
typeshed-client==2.4.0
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1695040754690/work
uncertainties==3.1.7
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1697720414277/work
virtualenv==20.25.0
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work
webencodings @ file:///home/conda/feedstock_root/build_artifacts/webencodings_1694681268211/work
websocket-client @ file:///home/conda/feedstock_root/build_artifacts/websocket-client_1701630677416/work
Werkzeug==3.0.1
widgetsnbextension @ file:///home/conda/feedstock_root/build_artifacts/widgetsnbextension_1694598693908/work
wrapt @ file:///tmp/build/80754af9/wrapt_1638433857881/work
yarl==1.9.3
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1695255097490/work

Point cloud pipeline breaks abstractions

Right now point cloud representations are implemented as their own "special" Dataset class, which means breaking off the mainstream of pipeline abstractions (e.g. different data modules, model treatment, etc.).

For better reusability, point cloud representations could be refactored through the transform interface: we intercept data grabbed from DGL modules, and perform the point cloud extraction that way.

[Feature request]: Dockerfile Update

Feature/behavior summary

The docker file in the repo is outdated.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

No response

Solution description

An updated docker file should be created which mirrors the requirements from pyproject.toml.

Additional notes

No response

[Feature request]: Standardized data structure for datasets

Feature/behavior summary

A consistent, standardized data structure would make new datasets significantly easier to implement and maintain, as well as easier for model and task development by setting reasonable expectations of attribute names, etc.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

#89 was where some of these discussions were had, and originated from #85

Solution description

There are two possible ways of implementing this: a flat DataSample structure which may comprise a graph or point cloud, leaving things a little ambiguous; or a base AbstractDataSample class with PointCloudSample and GraphSample subclasses.

Not 100% sure how batching will look yet, but perhaps a Batch structure should also be introduced.
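
A minimal sketch of the second option, just to illustrate the shape of the hierarchy (the field names are hypothetical):

from abc import ABC
from dataclasses import dataclass, field
from typing import Any, Optional

import torch

@dataclass
class AbstractDataSample(ABC):
    # targets keyed by name, shared by both representations
    targets: dict[str, torch.Tensor] = field(default_factory=dict)

@dataclass
class PointCloudSample(AbstractDataSample):
    positions: Optional[torch.Tensor] = None       # (num_atoms, 3) coordinates
    atomic_numbers: Optional[torch.Tensor] = None  # (num_atoms,) species

@dataclass
class GraphSample(AbstractDataSample):
    graph: Optional[Any] = None  # DGL or PyG graph object, depending on the backend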

Additional notes

No response

[Bug]: SAM callback does not work for `ForceRegressionTask`

Expected behavior

Callbacks should be compatible with task abstractions.

Actual behavior

The pytest output:

/home/kinlongk/miniforge3/envs/matsciml-pymatgen-upgrade/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:208: in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
../callbacks.py:764: in on_before_optimizer_step
    step_output = pl_module.training_step(self.batch, self.batch_idx)
../../models/base.py:1735: in training_step
    opt.step()
/home/kinlongk/miniforge3/envs/matsciml-pymatgen-upgrade/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py:152: in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
E   RecursionError: maximum recursion depth exceeded in comparison
!!! Recursion detected (same locals & position)

At face value, it seems that the interaction between pl_module.training_step and callbacks results in recursion.

Steps to reproduce the problem

Run pytest matsciml/lightning/tests/test_sam.py

Specifications

N/A

[Feature request]: Flexibility in label transformations

Feature/behavior summary

Given that properties from different datasets can span large dynamic ranges, and/or are very non-Gaussian, we should design a framework for modifying and transforming labels ideally just before loss calculations. As part of this, it may be advantageous to calculate dataset-wide statistics on the fly with caching.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

#75 pertains to an issue with normalization not being applied; this solution would supersede it.

Solution description

One solution would be to implement this as a subclass of transform, which mutates data in-place:

class AbstractLabelTransform(AbstractTransform):
    def apply(self, *args, **kwargs):
         ...

    def cache_statistic(self, key, value):
        ...

    def save(self, path):
        ...

On-the-fly statistics could be calculated using a moving-average or something, which is then cached to disk based on the dataset class, and the dataset path. The only issue with this is synchronization: for DDP scenarios, we'd want to make sure statistics are the same across each data loader worker. Could probably do some reduction call, etc.

We can then implement concrete versions of the transforms:

class NormalTransform(AbstractLabelTransform):
     # rescales based on mean/std

class MinMaxTransform(AbstractLabelTransform):
    # rescales to [min, max] of specified value, or dataset

class LambdaTransform(AbstractLabelTransform):
     # this is a bit dicey, but apply an arbitrary function to a key

class ExponentialTransform(AbstractLabelTransform):
     # many properties have long-tailed distributions

The idea would be that you could freely compose these such that different labels can be transformed in different ways.
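
Building on the AbstractLabelTransform stub above, a rough sketch of one concrete transform might look like the following; it assumes scalar targets stored under a "targets" dict in each sample, which is an assumption about the sample layout rather than the current data structure.

class NormalTransform(AbstractLabelTransform):
    """Standardize one target key with running (Welford) mean/variance estimates."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def apply(self, sample: dict) -> dict:
        value = float(sample["targets"][self.key])
        # update the running statistics; these are what cache_statistic/save would persist
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5 or 1.0
        sample["targets"][self.key] = (value - self.mean) / std
        return sample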

Alternatively:

  • As a pl.Callback; since it has access to discrete after/before_x_step regions, which could be helpful in getting access to batch data.
  • We could take the existing normalization steps that are being used in _compute_losses. However, caching and whatnot isn't as flexible.

Additional notes

A task list based on the transform-based solution (convert to issues/PRs for tracking):

  • #162
  • Implement concrete label transformations based on common use cases

[Feature request]: PyG installation instructions (esp. for XPUs)

Feature/behavior summary

I'm trying to get PyG to install and work well with Intel XPUs, and was hoping to use this repository as a reference. At present, I see that PyG is never installed by default, nor are any instructions for setting it up with XPUs available.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

No response

Solution description

Unknown.

Additional notes

At present, working with a different repository (https://github.com/a-r-j/ProteinWorkshop), I've been trying to integrate your code for the XPU as a new accelerator in PyTorch Lightning: https://github.com/IntelLabs/matsciml/blob/main/matsciml/lightning/xpu.py.

So far, I'm able to get my trainer to identify the XPU as a device, but it seems like some torch_cluster operations are not compatible with tensors stored on XPUs. I would like to perform torch_cluster operations such as knn graph creation on XPU tensors so that I can do data processing in a batched manner or on-the-fly, as opposed to on the CPU.

Here is a minimal example which fails:

import torch
import intel_extension_for_pytorch as ipex
from torch_geometric.nn import knn_graph

device = torch.device('xpu:0' if torch.xpu.is_available() else 'cpu')

x = torch.tensor([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]).to(device)
batch = torch.tensor([0, 0, 0, 0]).to(device)
edge_index = knn_graph(x, k=2, batch=batch, loop=False)

The resulting error is RuntimeError: x.device().is_cpu() INTERNAL ASSERT FAILED at "csrc/cpu/knn_cpu.cpp":12, please report a bug to PyTorch. x must be CPU tensor.

And here's a longer trace from the ProteinWorkshop codebase, which probably won't make any sense to MatSciML maintainers.

File "/home/ckj24/rds/hpc-work/envs/proteinworkshop/lib/python3.10/site-packages/torch_geometric/nn/pool/__init__.py", line 171, in knn_graph
    return torch_cluster.knn_graph(x, k, batch, loop, flow, cosine,
  File "/home/ckj24/rds/hpc-work/envs/proteinworkshop/lib/python3.10/site-packages/torch_cluster/knn.py", line 132, in knn_graph
    edge_index = knn(x, x, k if loop else k + 1, batch, batch, cosine,
  File "/home/ckj24/rds/hpc-work/envs/proteinworkshop/lib/python3.10/site-packages/torch_cluster/knn.py", line 81, in knn
    return torch.ops.torch_cluster.knn(x, y, ptr_x, ptr_y, k, cosine,
  File "/home/ckj24/rds/hpc-work/envs/proteinworkshop/lib/python3.10/site-packages/torch/_ops.py", line 692, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: x.device().is_cpu() INTERNAL ASSERT FAILED at "csrc/cpu/knn_cpu.cpp":12, please report a bug to PyTorch. x must be CPU tensor

[Feature request]: `ase` Trajectory logging to standard output

Feature/behavior summary

Would be convenient to write a custom class for outputting the state of a trajectory to standard output.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

One of Carmelo's points raised in #215 review.

Solution description

I believe the trajectory writer should take arbitrary streams to write to, in which case we could potentially just use standard output like any other stream, and add some lip gloss to make it pretty.

Additional notes

No response

[Bug]: m3gnet_dgl example fails due to AttributeError: 'Tensor' object has no attribute 'system_embedding'

Expected behavior

m3gnet_dgl example runs without error

Actual behavior

the example crashes during the first epoch

/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0)/charset_normalizer (3.3.2) doesn't match a supported version!
  warnings.warn(
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 10 batch(es). Logging and checkpointing is suppressed.
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py:411: UserWarning: A layer with UninitializedParameter was found. Thus, the total number of parameters detected may be inaccurate.
  warning_cache.warn(

  | Name         | Type       | Params
--------------------------------------------
0 | encoder      | M3GNet     | 273 K 
1 | loss_func    | MSELoss    | 0     
2 | output_heads | ModuleDict | 0     
--------------------------------------------
273 K     Trainable params
0         Non-trainable params
273 K     Total params
1.093     Total estimated model params size (MB)
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1595: PossibleUserWarning: The number of training batches (10) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 175.32it/s]
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/matgl/graph/data.py:286: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:261.)
  state_attrs = torch.tensor(state_attrs)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 364.18it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 501.59it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 590.58it/s]
/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/dgl/backend/pytorch/tensor.py:352: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert input.numel() == input.storage().size(), "Cannot convert view " \
Traceback (most recent call last):
  File "/home/sjonathan/Downloads/matsciml_alexandria/examples/model_demos/m3gnet_dgl.py", line 23, in <module>
    trainer.fit(task, datamodule=dm)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 121, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/optim/adamw.py", line 161, in step
    loss = closure()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 107, in _wrap_closure
    closure_result = closure()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 378, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/models/base.py", line 933, in training_step
    loss_dict = self._compute_losses(batch)
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/models/base.py", line 897, in _compute_losses
    predictions = self(batch)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sjonathan/anaconda3/envs/fireworks_forcefield/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/models/base.py", line 795, in forward
    outputs = self.process_embedding(embedding)
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/models/base.py", line 816, in process_embedding
    output = head(embeddings.system_embedding)
AttributeError: 'Tensor' object has no attribute 'system_embedding'
Epoch 0:   0%|          | 0/20 [00:00<?, ?it/s]
```

### Steps to reproduce the problem

run: python m3gnet_dgl.py

### Specifications

absl-py==1.4.0
aiohttp==3.8.5
aioitertools==0.11.0
aiosignal==1.3.1
alabaster==0.7.13
annotated-types==0.5.0
anyio==3.7.0
argon2-cffi @ file:///opt/conda/conda-bld/argon2-cffi_1645000214183/work
argon2-cffi-bindings @ file:///tmp/build/80754af9/argon2-cffi-bindings_1644553347904/work
arrow==1.2.3
ase==3.22.1
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.3
atomate==1.0.3
atomate2 @ file:///home/sjonathan/Downloads/atomate2_other_forcefields
attrs==23.1.0
Babel @ file:///croot/babel_1671781930836/work
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
bandit==1.7.6
bcrypt==4.0.1
beautifulsoup4==4.12.2
biopython==1.81
black==23.12.1
bleach==6.0.0
boto3==1.28.4
botocore==1.31.4
bracex==2.3.post1
brotlipy==0.7.0
cachelib==0.9.0
cachetools==5.3.0
castepxbin==0.2.0
cclib==1.8
CellConstructor==1.3.2
certifi @ file:///croot/certifi_1671487769961/work/certifi
cffi @ file:///croot/cffi_1670423208954/work
cfgv==3.4.0
chardet==5.2.0
charset-normalizer==3.3.2
chemview==0.6
chgnet==0.2.0
click==8.1.7
cloudpickle==3.0.0
colorama==0.4.6
colormath==3.0.0
comm==0.1.3
contextlib2==21.6.0
contourpy==1.0.7
cryptography @ file:///croot/cryptography_1677533068310/work
crystal-toolkit==2023.6.1
crystaltoolkit-extension==0.6.0
custodian==2023.7.22
cycler==0.11.0
Cython==3.0.2
dash==2.10.2
dash-core-components==2.0.0
dash-html-components==2.0.0
dash-mp-components==0.4.34
dash-table==5.0.0
debugpy==1.6.7
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
defusedxml @ file:///tmp/build/80754af9/defusedxml_1615228127516/work
dgl==0.9.1
dgllife==0.3.2
distlib==0.3.8
dnspython==2.3.0
docstring-parser==0.15
docutils==0.20.1
dpdata==0.2.15
dscribe==2.1.0
e3nn==0.5.1
einops==0.7.0
email-validator==2.1.0.post1
emmet-core==0.64.0
entrypoints @ file:///tmp/build/80754af9/entrypoints_1649908313000/work
exceptiongroup==1.1.1
executing==1.2.0
f90wrap==0.2.13
fabric==3.1.0
fastapi==0.100.0
fastcore==1.5.29
fasteners==0.18
fastjsonschema==2.17.1
fforces==0.1
filelock==3.12.2
FireWorks==2.0.3
flake8==7.0.0
flake8-bandit==4.1.1
flake8-black==0.3.6
Flake8-pyproject==1.2.3
Flask==2.2.5
Flask-Caching==2.0.2
flask-paginate==2022.1.8
flatbuffers==23.3.3
flit_core @ file:///opt/conda/conda-bld/flit-core_1644941570762/work/source/flit_core
fonttools==4.39.0
fqdn==1.5.1
frozenlist==1.4.0
fsspec==2023.9.2
future==0.18.3
gast==0.4.0
gdown==4.7.1
geometric-algebra-attention==0.5.1
gitdb==4.0.11
GitPython==3.1.41
google-auth==2.17.1
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
greenlet==3.0.3
GridDataFormats==1.0.1
grpcio==1.53.0
gunicorn==20.1.0
h11==0.14.0
h5py==3.8.0
hiphive==1.1
httpcore==1.0.2
httpx==0.26.0
hyperopt==0.2.7
identify==2.5.33
idna @ file:///croot/idna_1666125576474/work
imageio==2.31.0
imagesize==1.4.1
importlib-metadata==6.8.0
importlib-resources==6.1.1
inflect==6.0.4
iniconfig==2.0.0
invoke==2.1.3
ipykernel==6.23.2
ipython==8.14.0
ipython-genutils @ file:///tmp/build/80754af9/ipython_genutils_1606773439826/work
ipywidgets==7.7.5
isoduration==20.11.0
itsdangerous==2.1.2
jax==0.4.8
jedi==0.18.2
Jinja2 @ file:///croot/jinja2_1666908132255/work
jmespath==1.0.1
jobflow==0.1.13
joblib==1.2.0
json5 @ file:///tmp/build/80754af9/json5_1624432770122/work
jsonargparse==4.27.1
jsonpointer==2.3
jsonschema @ file:///croot/jsonschema_1676558650973/work
julia==0.6.1
jupyter @ file:///tmp/abs_33h4eoipez/croots/recipe/jupyter_1659349046347/work
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter_client==8.2.0
jupyter_core==5.3.1
jupyter_server==2.6.0
jupyter_server_terminals==0.4.4
jupyterlab @ file:///croot/jupyterlab_1675354114448/work
jupyterlab-pygments==0.2.2
jupyterlab-widgets==1.1.4
jupyterlab_server @ file:///croot/jupyterlab_server_1677143054853/work
kaleido==0.2.1
keras==2.12.0
kiwisolver==1.4.4
lark==1.1.8
latexcodec==2.0.1
lazy_loader==0.2
libclang==16.0.0
lightning-utilities==0.9.0
llvmlite==0.39.1
lmdb==1.3.0
lobsterpy==0.3.0
lovely-numpy==0.2.8
lxml @ file:///opt/conda/conda-bld/lxml_1657545139709/work
mace @ file:///home/sjonathan/Downloads/mace
maggma==0.56.0
Markdown==3.4.3
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matgl==0.8.5
matminer==0.8.0
matplotlib==3.7.1
matplotlib-inline @ file:///opt/conda/conda-bld/matplotlib-inline_1662014470464/work
-e git+https://github.com/JonathanSchmidt1/matsciml_alexandria.git@9568e18f0d3546cbd565a87655a88bff9bf45d42#egg=matsciml
matscipy==0.8.0
mccabe==0.7.0
MDAnalysis==2.6.0
mdtraj==1.9.7
mdurl==0.1.2
mendeleev==0.14.0
mistune==2.0.5
ml-dtypes==0.0.4
mmtf-python==1.1.3
mongogrant==0.3.3
mongomock==4.1.2
monty==2023.9.5
mp-api==0.33.3
mpi4py==3.1.4
mpmath==1.3.0
mrcfile==1.4.3
msgpack==1.0.5
multidict==6.0.4
munch==2.5.0
mypy-extensions==1.0.0
nbclassic==1.0.0
nbclient==0.8.0
nbconvert==7.5.0
nbformat==5.9.0
nequip @ file:///home/sjonathan/fusessh/dgx3/nequip2/nequip3/nequip
nest-asyncio @ file:///croot/nest-asyncio_1672387112409/work
networkx==3.0
nglview==3.0.5
nodeenv==1.8.0
notebook==6.5.4
notebook_shim==0.2.3
numba==0.56.4
numpy==1.23.5
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py3==7.352.0
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
openai==0.28.1
opt-einsum==3.3.0
opt-einsum-fx==0.1.4
optimade==1.0.1
orjson==3.9.2
overrides==7.3.1
packaging==23.1
palettable==3.3.0
pandas==1.5.3
pandocfilters @ file:///opt/conda/conda-bld/pandocfilters_1643405455980/work
paramiko==3.2.0
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
pathspec==0.12.1
pbr==6.0.0
pdyna @ file:///home/sjonathan/Downloads/PDynA
periodictable==1.6.1
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
phonopy==2.20.0
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==9.4.0
platformdirs==4.1.0
plotly==5.13.1
pluggy==1.0.0
ply==3.11
pooch==1.7.0
pre-commit==3.6.0
prettytable==3.7.0
prometheus-client==0.17.0
prompt-toolkit==3.0.38
protobuf==4.22.1
psutil==5.9.5
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
PubChemPy==1.0.4
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
py4j==0.10.9.7
py4vasp==0.7.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.11.1
pybtex==0.24.0
pycodestyle==2.11.1
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==1.10.12
pydantic-settings==2.0.3
pydantic_core==2.14.6
pydash==7.0.5
pyfiglet==0.8.post1
pyflakes==3.2.0
Pygments==2.15.1
pymatgen==2023.7.20
pymatgen-analysis-defects==2023.8.22
pymatgen-analysis-diffusion==2022.7.21
pymatgen-db==2023.2.23
pymongo==4.3.3
PyNaCl==1.5.0
pynndescent==0.5.8
pyOpenSSL @ file:///croot/pyopenssl_1677607685877/work
pyparsing==3.0.9
PyProcar==6.0.0
PyQt5-sip==12.11.0
pyrsistent==0.19.3
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
pysr==0.12.0
pytest==7.3.2
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
python-dotenv==1.0.0
python-json-logger==2.0.7
python-sscha==1.3.2.1
pytorch-lightning==1.8.6
pytz==2022.7.1
pyvista==0.40.1
PyWavelets==1.4.1
PyYAML==6.0.1
pyzmq==24.0.1
qtconsole==5.4.3
QtPy==2.3.1
quippy-ase==0.9.14
rdkit==2023.3.1
requests==2.28.2
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.0
robocrys==0.2.8
rowan==1.3.0.post1
rsa==4.9
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.7
s3transfer==0.6.1
schema==0.7.5
scikit-image==0.21.0
scikit-learn==1.0
scipy==1.10.1
scooby==0.7.2
seaborn==0.12.2
seekpath==2.1.0
Send2Trash==1.8.2
sentinels==1.0.0
sentry-sdk==1.26.0
shakenbreak==3.0.0
shapely==2.0.1
sip @ file:///tmp/abs_44cd77b_pu/croots/recipe/sip_1659012365470/work
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.1
sniffio==1.3.0
snowballstemmer==2.2.0
soupsieve==2.4.1
sparse==0.14.0
spglib==2.0.2
Sphinx==7.2.6
sphinx-argparse==0.4.0
sphinx-pdj-theme==0.2.1
sphinxcontrib-applehelp==1.0.7
sphinxcontrib-devhelp==1.0.5
sphinxcontrib-htmlhelp==2.0.4
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.6
sphinxcontrib-serializinghtml==1.1.9
SQLAlchemy==2.0.25
sshtunnel==0.4.0
stack-data==0.6.2
starlette==0.27.0
stevedore==5.1.0
sumo==2.3.5
sympy==1.11.1
tabulate==0.9.0
tdscha==1.0.1
tenacity==8.2.2
tensorboard==2.12.1
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tensorboardX==2.6.2.2
tensorflow==2.12.0
tensorflow-estimator==2.12.0
tensorflow-io-gcs-filesystem==0.32.0
termcolor==2.2.0
terminado @ file:///croot/terminado_1671751832461/work
threadpoolctl==3.1.0
tifffile==2023.4.12
tinycss2 @ file:///croot/tinycss2_1668168815555/work
toml @ file:///tmp/build/80754af9/toml_1616166611790/work
tomli @ file:///opt/conda/conda-bld/tomli_1657175507142/work
torch==2.1.2
torch-cluster==1.6.3
torch-ema==0.3
torch-runstats==0.2.0
torch-scatter==2.1.2
torch-sparse==0.6.18
torchmetrics==1.2.0
tornado==6.3.2
tqdm==4.65.0
trainstation==1.0
traitlets==5.9.0
trimesh==3.22.3
triton==2.1.0
typeshed-client==2.4.0
typing==3.7.4.3
typing_extensions==4.8.0
umap-learn==0.5.3
uncertainties==3.1.7
uri-template==1.2.0
urllib3==2.1.0
uvicorn==0.23.2
Vapory==0.1.2
virtualenv==20.25.0
vtk==9.2.6
wcmatch==8.4.1
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.3
Werkzeug==2.2.3
widgetsnbextension==3.6.4
wrapt==1.14.1
yarl==1.9.2
zipp==3.16.2

Refactor ownership of `_get_inputs`

Currently, each task LitModule owns its own _get_inputs method, which means that developing a new architecture can also require subclassing the tasks. This is unintuitive; a better alternative would be for the default method to belong to AbstractEnergyModel, allowing concrete model implementations to override it when necessary (a rough sketch follows the considerations below).

Considerations:

  • Tasks have a say in what data is required (e.g. force regression)
  • Models may need to override inputs (e.g. input features)
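
A minimal sketch of what that ownership shift could look like (class and attribute names other than _get_inputs are illustrative, not the actual matsciml API):

from abc import ABC
from typing import Any


class AbstractEnergyModel(ABC):
    def _get_inputs(self, batch: dict[str, Any]) -> dict[str, Any]:
        # Sensible default owned by the abstract model: hand the graph straight through.
        return {"graph": batch["graph"]}


class MyCustomArchitecture(AbstractEnergyModel):
    def _get_inputs(self, batch: dict[str, Any]) -> dict[str, Any]:
        # Concrete models override only when they need different or extra inputs,
        # so tasks no longer have to be subclassed per architecture.
        inputs = super()._get_inputs(batch)
        inputs["edge_feats"] = batch["edge_feats"]
        return inputs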

Migrate to pyproject.toml

Currently, we use both setup.py and setup.cfg jointly to work around backwards-compatibility issues with older versions of setuptools. The better practice is to use pyproject.toml, which is supported by newer versions of setuptools and provides a more comprehensive packaging and developer experience, since formatter configurations, dependencies, and build tools can all be defined in the same file.

[Bug]: Both the schnet_dgl and the egnn_pyg examples fail with DDP

Expected behavior

The examples should run.

Actual behavior

Running with 4 nodes and 4 workers results in the stack trace posted below.
My guess would be that this is caused by lazy modules not being initialized in time.
It seems to be related to the discussion at the end of this issue: Lightning-AI/pytorch-lightning#13764
Adding a setup hook in the base model class fixes training with num_workers=0:

def setup(self, stage):
    # Run a dummy forward pass for the current stage so that all lazy modules are
    # materialized before DDP wraps the model.
    match stage:
        case 'fit':
            dataloader = self.trainer.datamodule.train_dataloader()
        case 'validate':
            dataloader = self.trainer.datamodule.val_dataloader()
        case 'test':
            dataloader = self.trainer.datamodule.test_dataloader()
        case 'predict':
            dataloader = self.trainer.datamodule.predict_dataloader()
        case _:
            return
    dummy_batch = next(iter(dataloader))
    self.forward(dummy_batch)

However, I had a few more issues with the test. The IS2RE devset seems to contain pickled graphs that are not deserialized properly by newer dgl versions, so I had to use e.g. the NOMAD dataset to test it.
With higher numbers of workers I was also not able to get it to work yet.

/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Running in `fast_dev_run` mode: will run the requested loop using 1000 batch(es). Logging and checkpointing is suppressed.
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:15837 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [nid04102]:15837 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [nid04102]:15837 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [nid04102]:15837 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [nid04102]:15837 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [nid04102]:15837 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [nid04102]:15837 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [nid04102]:15837 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [nid04102]:15837 (errno: 97 - Address family not supported by protocol).
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

libibverbs: Could not locate libibgni (/usr/lib64/libibgni.so.1: undefined symbol: verbs_uninit_context)
libibverbs: Warning: couldn't open config directory '/opt/cray/rdma-core/27.1-7.0.3.1_4.6__g4beae6eb.ari/etc/libibverbs.d'.
libibverbs: Could not locate libibgni (/usr/lib64/libibgni.so.1: undefined symbol: verbs_uninit_context)
libibverbs: Warning: couldn't open config directory '/opt/cray/rdma-core/27.1-7.0.3.1_4.6__g4beae6eb.ari/etc/libibverbs.d'.
libibverbs: Could not locate libibgni (/usr/lib64/libibgni.so.1: undefined symbol: verbs_uninit_context)
libibverbs: Could not locate libibgni (/usr/lib64/libibgni.so.1: undefined symbol: verbs_uninit_context)
libibverbs: Warning: couldn't open config directory '/opt/cray/rdma-core/27.1-7.0.3.1_4.6__g4beae6eb.ari/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/opt/cray/rdma-core/27.1-7.0.3.1_4.6__g4beae6eb.ari/etc/libibverbs.d'.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
  File "/scratch/snx3000/jschmidt/mirror/schnet/schnet_dgl.py", line 32, in <module>
Traceback (most recent call last):
  File "/scratch/snx3000/jschmidt/mirror/schnet/schnet_dgl.py", line 32, in <module>
Traceback (most recent call last):
  File "/scratch/snx3000/jschmidt/mirror/schnet/schnet_dgl.py", line 32, in <module>
    trainer.fit(task, datamodule=dm)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
Traceback (most recent call last):
  File "/scratch/snx3000/jschmidt/mirror/schnet/schnet_dgl.py", line 32, in <module>
    trainer.fit(task, datamodule=dm)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    trainer.fit(task, datamodule=dm)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    trainer.fit(task, datamodule=dm)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    call._call_and_handle_interrupt(
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    call._call_and_handle_interrupt(
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    call._call_and_handle_interrupt(
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
    return function(*args, **kwargs)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 965, in _run
    self._run(model, ckpt_path=ckpt_path)
    self._run(model, ckpt_path=ckpt_path)
    self._run(model, ckpt_path=ckpt_path)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 965, in _run
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 965, in _run
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 965, in _run
    self.strategy.setup(self)
    self.strategy.setup(self)
    self.strategy.setup(self)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 169, in setup
    self.strategy.setup(self)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 169, in setup
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 169, in setup
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 169, in setup
    self.configure_ddp()
    self.configure_ddp()
    self.configure_ddp()
    self.configure_ddp()
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in configure_ddp
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in configure_ddp
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in configure_ddp
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in configure_ddp
    self.model = self._setup_model(self.model)
    self.model = self._setup_model(self.model)
    self.model = self._setup_model(self.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in _setup_model
    self.model = self._setup_model(self.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in _setup_model
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in _setup_model
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in _setup_model
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 656, in __init__
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 656, in __init__
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 656, in __init__
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 656, in __init__
    self._log_and_throw(
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 769, in _log_and_throw
    self._log_and_throw(
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 769, in _log_and_throw
    self._log_and_throw(
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 769, in _log_and_throw
    self._log_and_throw(
    raise err_type(err_msg)
RuntimeError: Modules with uninitialized parameters can't be used with `DistributedDataParallel`. Run a dummy forward pass to correctly initialize the modules
  File "/scratch/snx3000/jschmidt/envs/dgl_113/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 769, in _log_and_throw
    raise err_type(err_msg)
RuntimeError: Modules with uninitialized parameters can't be used with `DistributedDataParallel`. Run a dummy forward pass to correctly initialize the modules
    raise err_type(err_msg)
RuntimeError: Modules with uninitialized parameters can't be used with `DistributedDataParallel`. Run a dummy forward pass to correctly initialize the modules
    raise err_type(err_msg)
RuntimeError: Modules with uninitialized parameters can't be used with `DistributedDataParallel`. Run a dummy forward pass to correctly initialize the modules
srun: error: nid04103: task 1: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=51280837.0
srun: error: nid04104: task 2: Exited with exit code 1
srun: error: nid04105: task 3: Exited with exit code 1
srun: error: nid04102: task 0: Exited with exit code 1

Steps to reproduce the problem

run the schnet_dgl or egnn_pyg example on multiple nodes

Specifications

torch.__version__
'2.0.0+cu118'
dgl.__version__
'1.1.3+cu118'
matsciml.__version__
'1.1.0'
torch_geometric.__version__
'2.3.1'

[Feature request]: Homogenization of data structures and physical representations

Feature/behavior summary

To ensure consistency in modeling, each dataset in the Open MatSciML Toolkit should provide uniform (or near-uniform) kinds of data: for example, whether the coordinates are fractional or Cartesian, and whether every dataset carries sufficient information to represent each data sample in a physically meaningful way, such as periodic boundary conditions (for use in e.g. shift vectors).

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

No response

Solution description

A good place to start would be to make sure each devset, and subsequently any serialized datasets we have, conform to the following:

  1. Check whether the coordinates are fractional or Cartesian (values outside the range [0, 1] likely mean Cartesian), as sketched below
  2. Check that we have enough information to create a Lattice object; this can be just a cell key, or the lattice parameters as in Materials Project
  3. Print and list out the keys in each sample and construct a table of them, so that we can help contribute to #97

We should also check other projects, like Colabfit, to see to what extent we can conform to community standards, too.
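
A rough sketch of the audit in points 1-3 (key names like "pos" and "cell" are assumptions about the sample layout, not a confirmed schema):

import numpy as np


def looks_fractional(coords: np.ndarray, tol: float = 1e-6) -> bool:
    # Point 1: values outside [0, 1] most likely mean Cartesian coordinates.
    return bool(((coords >= -tol) & (coords <= 1.0 + tol)).all())


def audit_sample(sample: dict) -> dict:
    report = {"keys": sorted(sample.keys())}  # point 3: tabulate the available keys
    if "pos" in sample:
        report["fractional_coords"] = looks_fractional(np.asarray(sample["pos"]))
    # Point 2: enough information to build a Lattice, either a cell matrix or parameters.
    report["has_lattice_info"] = any(
        k in sample for k in ("cell", "lattice_params", "lattice_features")
    )
    return report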

Additional notes

Can't assign Bin yet, but it would be good for Bin to aggregate this information, and for him and @melo-gonzo to help craft PRs to address things after the survey is done.

[Bug]: CI tests do not always use the latest code version

Expected behavior

When the CI testing triggers, I'd expect that the latest version of the codebase is being used to run the tests.

When testing happens, we should also emit a pip freeze and/or mamba env export to indicate what packages are part of the environment.

Actual behavior

While working on #154, it became apparent that although the CI was using the PR codebase at the beginning of the submission, it fell out of date after a few commits. I was trying to address libze_loader.so not being found by adding exception handling, but the CI runs didn't seem to pick up the change.

Even though the installation was done in developer mode, code changes somehow escaped it; I worked around it by inserting a pip install . in the middle of the action so that the code is actually updated every time.

Steps to reproduce the problem

See action log for #154.

Specifications

N/A

[Bug]: Normalization keys mismatch fails silently

Expected behavior

When normalization parameters are passed and none of the keys match, we should throw an error or warning message to notify the user that nothing is being normalized.

Actual behavior

Currently, there is no "validation" per se of the normalization keys passed, so if they don't match any of the targets, nothing happens.
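
A minimal sketch of the kind of check that could be added when the task sets up its normalizers (names here are illustrative, not the current matsciml API):

import warnings


def validate_normalization_keys(norm_kwargs: dict, target_keys: set[str]) -> None:
    # Warn loudly when none of the supplied normalization keys match a known target,
    # instead of silently normalizing nothing.
    matched = set(norm_kwargs) & target_keys
    if not matched:
        warnings.warn(
            f"None of the normalization keys {sorted(norm_kwargs)} match the task "
            f"targets {sorted(target_keys)}; no values will be normalized.",
            UserWarning,
        )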

Steps to reproduce the problem

Pass normalization kwargs, and have none of the keys match available targets.

Specifications

matsciml as of 46dd595

[Bug]: matgl generator/num_worker/gpu issue

Expected behavior

I made this issue to separate the discussion from the previous issue #92.
I had to change the lines in the torch code discussed in materialsvirtuallab/matgl#195 to make GPU-supported training in the example work, even after changing the matgl code as @melo-gonzo mentioned and using the corrected example from his fork. Working with a preprocessed dataset, the changes @melo-gonzo suggested in matgl did not do anything for my case; with a non-preprocessed dataset they actually stopped it from working.
@melo-gonzo, could you please specify which versions and training script worked for you with the changed matgl version? Maybe that would already solve my issues.

To make the single-GPU version work I additionally had to set the generator in
/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/torch/utils/data/sampler.py
to device cuda.
The line in /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/torch/utils/data/dataset.py I only had to adjust when going beyond the _from_devset example.
With the changes in torch it worked with multiple GPUs; however, I never managed to make it work with multiple workers.

Actual behavior

This is the stack trace with my changed torch version for a run on single gpu with multiple workers:

Sanity Checking: |          | 0/? [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
    self._run_sanity_check()
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1062, in _run_sanity_check
    val_loop.run()
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 134, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 365, in _evaluation_step
    batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=dataloader_idx)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 269, in batch_to_device
    return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 334, in _apply_batch_transfer_handler
    batch = self._call_batch_hook("transfer_batch_to_device", batch, device, dataloader_idx)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 323, in _call_batch_hook
    return trainer_method(trainer, hook_name, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/core/hooks.py", line 583, in transfer_batch_to_device
    return move_data_to_device(batch, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/lightning_fabric/utilities/apply_func.py", line 102, in move_data_to_device
    return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 72, in apply_to_collection
    return _apply_to_collection_slow(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 104, in _apply_to_collection_slow
    v = _apply_to_collection_slow(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 125, in _apply_to_collection_slow
    v = _apply_to_collection_slow(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 96, in _apply_to_collection_slow
    return function(data, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/lightning_fabric/utilities/apply_func.py", line 96, in batch_to
    data_output = data.to(device, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/heterograph.py", line 5709, in to
    ret._graph = self._graph.copy_to(utils.to_dgl_context(device))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/heterograph_index.py", line 255, in copy_to
    return _CAPI_DGLHeteroCopyTo(self, ctx.device_type, ctx.device_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [19:29:26] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:343: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: unspecified launch failure
Stack trace:
  [bt] (0) /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/libdgl.so(+0x8b0b95) [0x15542dd05b95]
  [bt] (1) /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::CopyDataFromTo(void const*, unsigned long, void*, unsigned long, unsigned long, DGLContext, DGLContext, DGLDataType)+0x82) [0x15542dd07ff2]
  [bt] (2) /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::CopyFromTo(DGLArray*, DGLArray*)+0x10d) [0x15542db7e4cd]
  [bt] (3) /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::CopyTo(DGLContext const&) const+0x103) [0x15542dbba033]
  [bt] (4) /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/libdgl.so(dgl::UnitGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DGLContext const&)+0x3ff) [0x15542dcc802f]
  [bt] (5) /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/libdgl.so(dgl::HeteroGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DGLContext const&)+0xf6) [0x15542dbc6876]
  [bt] (6) /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/libdgl.so(+0x7802b6) [0x15542dbd52b6]
  [bt] (7) /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x15542db63558]
  [bt] (8) /project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/dgl/_ffi/_cy3/core.cpython-311-x86_64-linux-gnu.so(+0x1a446) [0x155422826446]



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/snx3000/jschmidt/mirror/schnet/m3gnet_dgl.py", line 45, in <module>
    trainer.fit(task, datamodule=dm)
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 68, in _call_and_handle_interrupt
    trainer._teardown()
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1012, in _teardown
    self.strategy.teardown()
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 528, in teardown
    self.lightning_module.cpu()
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 79, in cpu
    return super().cpu()
           ^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 954, in cpu
    return self._apply(lambda t: t.cpu())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 844, in _apply
    self._buffers[key] = fn(buf)
                         ^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 954, in <lambda>
    return self._apply(lambda t: t.cpu())
                                 ^^^^^^^
  File "/project/s1128/jschmidt/envs/schnet/lib/python3.11/site-packages/torch/utils/_device.py", line 62, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Steps to reproduce the problem

Experiment with the m3gnet_dgl example

Specifications

dgl.__version__
'1.1.3+cu118'
import torch
torch.__version__
'2.0.1+cu118'
import matgl
matgl.__version__
'0.9.1'

[Feature request]: Reconciling multi-task models with `ase` `Calculator` interface.

Feature/behavior summary

Per #215, MultiTaskLitModules are a supported type of task to pass into the calculator, but the calculate step doesn't treat them any differently, even when different handling is actually required.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

No response

Solution description

We need a way to:

  1. Identify when inference results come from a MultiTaskLitModule
  2. In the case where only one subtask is relevant, figure out how to extract those results and map them into the Calculator
  3. In the case where multiple subtasks are relevant, figure out how to aggregate the result and report a scalar as expected by ase (a rough sketch follows this list)
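
A rough sketch of the branching the calculate step could gain; the shape of predictions (a dict keyed by subtask) and the key names below are assumptions for illustration, not the actual MultiTaskLitModule output format:

import torch


def extract_energy(predictions: dict, is_multitask: bool) -> float:
    if not is_multitask:
        # Single-task case: outputs already map property names to tensors.
        return float(predictions["energy"])
    # Point 2: only one subtask produced an energy, so simply extract it.
    energies = [out["energy"] for out in predictions.values() if "energy" in out]
    if len(energies) == 1:
        return float(energies[0])
    # Point 3: several subtasks are relevant; aggregate (here a simple mean)
    # into the single scalar ase expects.
    return float(torch.stack([torch.as_tensor(e) for e in energies]).mean())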

Additional notes

No response

[Bug]: Models missing from registry

Expected behavior

Model architectures should be consistently reference-able in the global registry, by decorating the class definition with @registry.register_model.

Actual behavior

The dependency graph of the registry (image attached to the original issue) flags FaeNet as the only model module co-dependency.

Steps to reproduce the problem

N/A

Specifications

The codebase on main branch as of 88da915

[Bug]: unused parameters in m3gnet

Expected behavior

I would expect the model not to have any unused parameters during the backward process.

Actual behavior

Running DDP without the find_unused_parameters strategy option crashes.
My theory is that in the following:

class M3GNet(AbstractDGLModel):
    def __init__(
        self, element_types: list[str], return_all_layer_output: bool, *args, **kwargs
    ):
        super().__init__(atom_embedding_dim=len(element_types))
        self.elemenet_types = element_types
        self.all_embeddings = return_all_layer_output
        self.model = matgl_m3gnet(element_types, *args, **kwargs)
        self.atomic_embedding = self.model.embedding

self.atomic_embedding should be self.atom_embedding, as that is the name expected in AbstractDglTask.
Since the m3gnet embedding block is more than just node embeddings, it should probably just be:
self.atom_embedding = self.model.embedding.layer_node_embedding
This fixed it for me (I also removed the final m3gnet layer, so maybe that also has to be done).

Steps to reproduce the problem

Run m3gnet with ddp.

Specifications

the latest matsciml version with dgl 1.1.3

[Feature request]: Preserving equivariance in model outputs

Feature/behavior summary

In the current workflow, every model is expected to emit a pair of embedding tensors representing points/nodes and systems/graphs. With models that rely on tensor products, and/or those that learn multiple types or components of embeddings per hierarchical level, reducing prematurely just to get embeddings that conform to those forms will lose information, even if we flatten.

Implementation-wise, this is the difference:

...
# model architecture
...
return {"node": node_embeddings, "graph": graph_embeddings}

We want:

...
# tensor product architecture
...
return {
    "node": {"irrep_0": irrep_0_node, "irrep_1": irrep_1_node},
     "graph": {"irrep_0": irrep_0_graph, "irrep_1": irrep_1_graph},
}

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

No response

Solution description

Currently, tasks simply use OutputBlock as projection layers that operate on the returned embeddings. The two things we need to do to support this (a structural sketch follows the list) are:

  1. Implement an OutputBlock class that supports a hierarchy of embeddings, e.g. e3nn Irreps
  2. Refactor tasks to support the usage of this new projection set
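
A structural sketch of point 1 (dimensions and key names are placeholders; an actual equivariant version would use e3nn's o3.Linear rather than nn.Linear so that irreps are not mixed):

import torch
from torch import nn


class HierarchicalOutputBlock(nn.Module):
    def __init__(self, dims: dict[str, int], out_dim: int) -> None:
        super().__init__()
        # One projection head per embedding component, e.g. {"irrep_0": 64, "irrep_1": 32}.
        self.heads = nn.ModuleDict({key: nn.Linear(dim, out_dim) for key, dim in dims.items()})

    def forward(self, embeddings: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Preserve the hierarchy: project each component separately instead of flattening.
        return {key: self.heads[key](value) for key, value in embeddings.items()}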

Additional notes

No response

Hyperparameter configurations not saved with weights

The current logic for using save_hyperparameters does not properly capture the actual regressor model hyperparameters, only those of the task LightningModule. While you're still able to reload the model, it isn't without pain and does require additional bookkeeping on the user side.

Suggest refactoring to make sure that the hparams.yaml in the logging output also includes this information, and probably modifying the load_from_checkpoint method for pipelines to pipe these parameters back in.
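
A minimal sketch of the direction (ExampleTask and its arguments are hypothetical; the point is that save_hyperparameters() records the encoder class and kwargs in hparams.yaml so load_from_checkpoint can rebuild everything):

import pytorch_lightning as pl


class ExampleTask(pl.LightningModule):
    def __init__(self, encoder_class: type, encoder_kwargs: dict, lr: float = 1e-3) -> None:
        super().__init__()
        # Captures encoder_class, encoder_kwargs, and lr into self.hparams and hparams.yaml.
        self.save_hyperparameters()
        self.encoder = encoder_class(**self.hparams.encoder_kwargs)

# Reloading then needs no external bookkeeping of the regressor configuration:
# task = ExampleTask.load_from_checkpoint("last.ckpt")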

[Bug]: SAM callback fails to process losses in multi-data and multi-task pipelines.

Expected behavior

SAM would be able to process a single loss per optimizer.

Actual behavior

The losses returned from _compute_losses() are a nested dictionary with dataset- and task-specific mappings, e.g.:

>>> loss = task._compute_losses(self.batch)
>>> loss
{'IS2REDataset': {'regression': {'loss': tensor(58.0430, grad_fn=<AddBackward0>), 'log': {'energy_init': tensor(32.1578, grad_fn=<MseLossBackward0>), 'energy_relaxed': tensor(25.8853, grad_fn=<MseLossBackward0>)}}}, 'S2EFDataset': {'regression': {'loss': tensor(265.4653, grad_fn=<AddBackward0>), 'log': {'force': tensor(0.1937, grad_fn=<MeanBackward0>), 'energy': tensor(265.2716, grad_fn=<MeanBackward0>)}}}}
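
One possible way for the callback to collapse that structure into the single scalar SAM expects per optimizer, sketched against the nesting shown above (summation is just one choice of aggregation):

import torch


def collapse_losses(loss_dict: dict) -> torch.Tensor:
    # Recursively sum every leaf tensor stored under a "loss" key, skipping the
    # "log" sub-dicts so component losses are not double counted.
    total = torch.tensor(0.0)
    for key, value in loss_dict.items():
        if key == "loss" and isinstance(value, torch.Tensor):
            total = total + value
        elif isinstance(value, dict) and key != "log":
            total = total + collapse_losses(value)
    return total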

Steps to reproduce the problem

Add SAM callback to any of the examples in ./examples/tasks/multitask

Specifications

Latest

[Bug]: Improving on-boarding experience

Expected behavior

There should be a clear "landing zone" for new people looking at the repository: what exact script should I run to try out a specific model and dataset?

The next step up from this would likely be: how do I train the same model on a bigger dataset?

This will help ease users into ramping up on the project and lead them into development from laptop to workstation to cluster.

Actual behavior

N/A

Steps to reproduce the problem

Scan the README; right now it can feel a little like drinking from a fire hose, with a fair amount of sensory overload.

Specifications

N/A

Hyperparameter saving for data modules

Currently, hyperparameters are only being saved for models (and based on #5, not perfectly either), whereas it's important to know which transforms are being used, the batch size, and so on for proper experiment tracking.

[Bug]: Missing `README` for materials project and LiPS datasets

Expected behavior

Other datasets (e.g. NOMAD, OQMD) have markdown files in matsciml/datasets/<dataset> that provide details on the contents of the LMDB, expected keys, etc.

The same documentation is missing for materials project and LiPS datasets.

Actual behavior

N/A

Steps to reproduce the problem

Compare file structures for OQMD/NOMAD with LiPS/Materials Project in matsciml/datasets

Specifications

N/A

[Feature request]: Run pre-commit on entire package to ensure consistent formatting

Feature/behavior summary

Run the pre-commit hooks on every file in one big PR so future PRs don't flood the change log with formatting updates.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

No response

Solution description

Run pre-commit hooks on all files.

Additional notes

No response

[Bug]: the arguments of Union are not correct

Expected behavior

In matsciml/common/types.py, the arguments to Union are not correct; type checkers report "Union requires two or more type arguments".

Actual behavior

Line 28 and 29:
DataType = Union[ModelingTypes]
AbstractGraph = Union[GraphTypes]

Steps to reproduce the problem

Run a type checker on matsciml/common/types.py; it reports "Union requires two or more type arguments".

Specifications

matsciml as of [9568e18]

Example on how to create your own lmdb dataset for training.

Feature/behavior summary

We have some larger materials datasets we would like to train on, and your repo is the best I could find in terms of support for large-scale training.
It would be great if you could provide an example of how to create your own lmdb dataset, similar to the existing ones, from a list of structures and properties, and how to use it for training. I am sure it's quite simple, but I have to admit I am getting a bit lost in all the different dataset classes.
Or maybe you can point me in the right direction, i.e. what key and dictionary structure would be correct to use as input to write_lmdb_data(key: Any, data: Any, target_lmdb: lmdb.Environment) to be consistent with the existing datasets, e.g. the SinglePointLmdbDataset.

Thank you so much for your help.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

No response

Solution description

Add an example to the tutorials on how to create your own lmdb dataset from a list of structures and properties, and how to use it for training.
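
A rough sketch of what such a tutorial snippet might look like. The per-sample dict layout (pos, atomic_numbers, cell, energy) is a guess, and documenting the real schema is exactly what this issue asks for; it also assumes write_lmdb_data is importable from matsciml.datasets.utils:

import lmdb

from matsciml.datasets.utils import write_lmdb_data


def structures_to_lmdb(structures, energies, path: str = "my_dataset.lmdb") -> None:
    # Serialize a list of pymatgen Structures and target energies into an LMDB file.
    env = lmdb.open(path, map_size=int(1e9), subdir=False)
    for index, (structure, energy) in enumerate(zip(structures, energies)):
        sample = {
            "pos": structure.cart_coords,
            "atomic_numbers": [site.specie.Z for site in structure],
            "cell": structure.lattice.matrix,
            "energy": energy,
        }
        write_lmdb_data(index, sample, env)
    env.close()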

Additional notes

No response

Use of an embedding table in EGNN and MegNet

Currently, neither the EGNN nor the MegNet implementation uses an embedding lookup for atom representations; they just transform the atomic numbers via MLPs. In order to carry out ablations and the like for the PhAST transforms, we need to refactor these models to use embedding tables.
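
A minimal sketch of the change (module and dimension names are illustrative): replace the MLP acting on raw atomic numbers with an nn.Embedding lookup, which is what PhAST-style ablations on the atom embedding need.

import torch
from torch import nn


class AtomEncoder(nn.Module):
    def __init__(self, num_atom_types: int = 100, embedding_dim: int = 128) -> None:
        super().__init__()
        # Before: something like nn.Sequential(nn.Linear(1, embedding_dim), nn.SiLU(), ...)
        # acting directly on the atomic number; after: a learnable lookup table.
        self.atom_embedding = nn.Embedding(num_atom_types, embedding_dim)

    def forward(self, atomic_numbers: torch.Tensor) -> torch.Tensor:
        # atomic_numbers: LongTensor of shape [num_nodes]
        return self.atom_embedding(atomic_numbers)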

[Feature request]: Interface to atomic properties

Feature/behavior summary

Instead of hardcoded numbers littered around the codebase without units or traceability, we should provide a standardized interface for reference atomic/chemical property values by introducing maintained dependencies like mendeleev and periodictable.

As examples, the MACE and M3gnet implementations have brought in dictionaries that provide utility, but they are literally hardcoded mappings like matsciml.datasets.utils.atomic_number_map.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

No response

Solution description

The core idea would be to provide a consistent, "vectorized" or broadcastable interface to retrieve reference atomic and chemical properties, such as ionization energies and plain atomic symbol-number mappings. I don't know if we need a class-based implementation, but at the very least, functions that look like this:

import mendeleev

def symbols_to_elements(atomic_symbols: list[str]) -> list[mendeleev.Element]:
    return [mendeleev.element(symbol) for symbol in atomic_symbols]


def retrieve_atomic_numbers(elements: list[mendeleev.Element]) -> list[int]:
    return [element.atomic_number for element in elements]

We can choose to abstract things out more, but some things might get a bit more tricky (e.g. fully stripped ions).

Additional notes

No response

[Feature request]: Allow user specified reductions for embeddings

Feature/behavior summary

In certain instances it is desirable to have some flexibility in the choice of embedding/output reduction. Currently, mean reduction is hardcoded in process_embeddings.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

The model definition in #72 uses a summation over intermediate layer energies/embeddings.

Solution description

The proposal is to provide an hparams argument for setting the reduction:

from einops import reduce

reduce_method = self.hparams.get("embedding_reduction", "mean")
# einops reduce collapses the intermediate axes with the user-selected reduction.
output = reduce(output, "b ... d -> b d", reduction=reduce_method)

Additional notes

@Vaibhav525 this will need to be merged to get the "proper" MACE behavior. This doesn't prevent #72 from being merged now since the changes are modular.

[Bug]: pytorch_geometric import when not installed in datasets.transformation.frame_averaging

Expected behavior

The schnet_dgl.py example should run in an environment without torch_geometric.

Actual behavior

Running the schnet_dgl example in an environment without PyTorch Geometric crashes with:

  File "/home/sjonathan/Downloads/matsciml_alexandria/examples/model_demos/schnet_dgl.py", line 6, in <module>
    from matsciml.datasets.transforms import DistancesTransform
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/datasets/transforms/__init__.py", line 5, in <module>
    from matsciml.datasets.transforms.frame_averaging import FrameAveraging
  File "/home/sjonathan/Downloads/matsciml_alexandria/matsciml/datasets/transforms/frame_averaging.py", line 13, in <module>
    import torch_geometric
ModuleNotFoundError: No module named 'torch_geometric'

because frame_averaging.py contains two torch_geometric import statements before the registry check:

import torch_geometric
from torch_geometric.transforms import LinearTransformation
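One possible guard, sketched here under the assumption that the import can simply be deferred until torch_geometric is known to be available; the actual registry check in matsciml may look different:

# Sketch only: defer the torch_geometric import so DGL-only environments can
# still import the transforms package.
import importlib.util

_HAS_PYG = importlib.util.find_spec("torch_geometric") is not None

if _HAS_PYG:
    import torch_geometric
    from torch_geometric.transforms import LinearTransformation
else:
    # Raise a clear error later, only if a PyG-dependent transform is instantiated.
    torch_geometric = None
    LinearTransformation = None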

Steps to reproduce the problem

Run schnet_dgl.py in an environment without torch_geometric.

Specifications

absl-py==1.4.0
aiohttp==3.8.5
aioitertools==0.11.0
aiosignal==1.3.1
alabaster==0.7.13
annotated-types==0.5.0
anyio==3.7.0
argon2-cffi @ file:///opt/conda/conda-bld/argon2-cffi_1645000214183/work
argon2-cffi-bindings @ file:///tmp/build/80754af9/argon2-cffi-bindings_1644553347904/work
arrow==1.2.3
ase==3.22.1
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.3
atomate==1.0.3
atomate2 @ file:///home/sjonathan/Downloads/atomate2_other_forcefields
attrs==23.1.0
Babel @ file:///croot/babel_1671781930836/work
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
bandit==1.7.6
bcrypt==4.0.1
beautifulsoup4==4.12.2
biopython==1.81
black==23.12.1
bleach==6.0.0
boto3==1.28.4
botocore==1.31.4
bracex==2.3.post1
brotlipy==0.7.0
cachelib==0.9.0
cachetools==5.3.0
castepxbin==0.2.0
cclib==1.8
CellConstructor==1.3.2
certifi @ file:///croot/certifi_1671487769961/work/certifi
cffi @ file:///croot/cffi_1670423208954/work
cfgv==3.4.0
chardet==5.2.0
charset-normalizer==3.3.2
chemview==0.6
chgnet==0.2.0
click==8.1.7
cloudpickle==3.0.0
colorama==0.4.6
colormath==3.0.0
comm==0.1.3
contextlib2==21.6.0
contourpy==1.0.7
cryptography @ file:///croot/cryptography_1677533068310/work
crystal-toolkit==2023.6.1
crystaltoolkit-extension==0.6.0
custodian==2023.7.22
cycler==0.11.0
Cython==3.0.2
dash==2.10.2
dash-core-components==2.0.0
dash-html-components==2.0.0
dash-mp-components==0.4.34
dash-table==5.0.0
debugpy==1.6.7
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
defusedxml @ file:///tmp/build/80754af9/defusedxml_1615228127516/work
dgl==0.9.1
dgllife==0.3.2
distlib==0.3.8
dnspython==2.3.0
docstring-parser==0.15
docutils==0.20.1
dpdata==0.2.15
dscribe==2.1.0
e3nn==0.5.1
einops==0.7.0
email-validator==2.1.0.post1
emmet-core==0.64.0
entrypoints @ file:///tmp/build/80754af9/entrypoints_1649908313000/work
exceptiongroup==1.1.1
executing==1.2.0
f90wrap==0.2.13
fabric==3.1.0
fastapi==0.100.0
fastcore==1.5.29
fasteners==0.18
fastjsonschema==2.17.1
fforces==0.1
filelock==3.12.2
FireWorks==2.0.3
flake8==7.0.0
flake8-bandit==4.1.1
flake8-black==0.3.6
Flake8-pyproject==1.2.3
Flask==2.2.5
Flask-Caching==2.0.2
flask-paginate==2022.1.8
flatbuffers==23.3.3
flit_core @ file:///opt/conda/conda-bld/flit-core_1644941570762/work/source/flit_core
fonttools==4.39.0
fqdn==1.5.1
frozenlist==1.4.0
fsspec==2023.9.2
future==0.18.3
gast==0.4.0
gdown==4.7.1
geometric-algebra-attention==0.5.1
gitdb==4.0.11
GitPython==3.1.41
google-auth==2.17.1
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
greenlet==3.0.3
GridDataFormats==1.0.1
grpcio==1.53.0
gunicorn==20.1.0
h11==0.14.0
h5py==3.8.0
hiphive==1.1
httpcore==1.0.2
httpx==0.26.0
hyperopt==0.2.7
identify==2.5.33
idna @ file:///croot/idna_1666125576474/work
imageio==2.31.0
imagesize==1.4.1
importlib-metadata==6.8.0
importlib-resources==6.1.1
inflect==6.0.4
iniconfig==2.0.0
invoke==2.1.3
ipykernel==6.23.2
ipython==8.14.0
ipython-genutils @ file:///tmp/build/80754af9/ipython_genutils_1606773439826/work
ipywidgets==7.7.5
isoduration==20.11.0
itsdangerous==2.1.2
jax==0.4.8
jedi==0.18.2
Jinja2 @ file:///croot/jinja2_1666908132255/work
jmespath==1.0.1
jobflow==0.1.13
joblib==1.2.0
json5 @ file:///tmp/build/80754af9/json5_1624432770122/work
jsonargparse==4.27.1
jsonpointer==2.3
jsonschema @ file:///croot/jsonschema_1676558650973/work
julia==0.6.1
jupyter @ file:///tmp/abs_33h4eoipez/croots/recipe/jupyter_1659349046347/work
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter_client==8.2.0
jupyter_core==5.3.1
jupyter_server==2.6.0
jupyter_server_terminals==0.4.4
jupyterlab @ file:///croot/jupyterlab_1675354114448/work
jupyterlab-pygments==0.2.2
jupyterlab-widgets==1.1.4
jupyterlab_server @ file:///croot/jupyterlab_server_1677143054853/work
kaleido==0.2.1
keras==2.12.0
kiwisolver==1.4.4
lark==1.1.8
latexcodec==2.0.1
lazy_loader==0.2
libclang==16.0.0
lightning-utilities==0.9.0
llvmlite==0.39.1
lmdb==1.3.0
lobsterpy==0.3.0
lovely-numpy==0.2.8
lxml @ file:///opt/conda/conda-bld/lxml_1657545139709/work
mace @ file:///home/sjonathan/Downloads/mace
maggma==0.56.0
Markdown==3.4.3
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matgl==0.8.5
matminer==0.8.0
matplotlib==3.7.1
matplotlib-inline @ file:///opt/conda/conda-bld/matplotlib-inline_1662014470464/work
-e git+https://github.com/JonathanSchmidt1/matsciml_alexandria.git@9568e18f0d3546cbd565a87655a88bff9bf45d42#egg=matsciml
matscipy==0.8.0
mccabe==0.7.0
MDAnalysis==2.6.0
mdtraj==1.9.7
mdurl==0.1.2
mendeleev==0.14.0
mistune==2.0.5
ml-dtypes==0.0.4
mmtf-python==1.1.3
mongogrant==0.3.3
mongomock==4.1.2
monty==2023.9.5
mp-api==0.33.3
mpi4py==3.1.4
mpmath==1.3.0
mrcfile==1.4.3
msgpack==1.0.5
multidict==6.0.4
munch==2.5.0
mypy-extensions==1.0.0
nbclassic==1.0.0
nbclient==0.8.0
nbconvert==7.5.0
nbformat==5.9.0
nequip @ file:///home/sjonathan/fusessh/dgx3/nequip2/nequip3/nequip
nest-asyncio @ file:///croot/nest-asyncio_1672387112409/work
networkx==3.0
nglview==3.0.5
nodeenv==1.8.0
notebook==6.5.4
notebook_shim==0.2.3
numba==0.56.4
numpy==1.23.5
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py3==7.352.0
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
openai==0.28.1
opt-einsum==3.3.0
opt-einsum-fx==0.1.4
optimade==1.0.1
orjson==3.9.2
overrides==7.3.1
packaging==23.1
palettable==3.3.0
pandas==1.5.3
pandocfilters @ file:///opt/conda/conda-bld/pandocfilters_1643405455980/work
paramiko==3.2.0
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
pathspec==0.12.1
pbr==6.0.0
pdyna @ file:///home/sjonathan/Downloads/PDynA
periodictable==1.6.1
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
phonopy==2.20.0
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==9.4.0
platformdirs==4.1.0
plotly==5.13.1
pluggy==1.0.0
ply==3.11
pooch==1.7.0
pre-commit==3.6.0
prettytable==3.7.0
prometheus-client==0.17.0
prompt-toolkit==3.0.38
protobuf==4.22.1
psutil==5.9.5
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
PubChemPy==1.0.4
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
py4j==0.10.9.7
py4vasp==0.7.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.11.1
pybtex==0.24.0
pycodestyle==2.11.1
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==1.10.12
pydantic-settings==2.0.3
pydantic_core==2.14.6
pydash==7.0.5
pyfiglet==0.8.post1
pyflakes==3.2.0
Pygments==2.15.1
pymatgen==2023.7.20
pymatgen-analysis-defects==2023.8.22
pymatgen-analysis-diffusion==2022.7.21
pymatgen-db==2023.2.23
pymongo==4.3.3
PyNaCl==1.5.0
pynndescent==0.5.8
pyOpenSSL @ file:///croot/pyopenssl_1677607685877/work
pyparsing==3.0.9
PyProcar==6.0.0
PyQt5-sip==12.11.0
pyrsistent==0.19.3
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
pysr==0.12.0
pytest==7.3.2
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
python-dotenv==1.0.0
python-json-logger==2.0.7
python-sscha==1.3.2.1
pytorch-lightning==1.8.6
pytz==2022.7.1
pyvista==0.40.1
PyWavelets==1.4.1
PyYAML==6.0.1
pyzmq==24.0.1
qtconsole==5.4.3
QtPy==2.3.1
quippy-ase==0.9.14
rdkit==2023.3.1
requests==2.28.2
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.0
robocrys==0.2.8
rowan==1.3.0.post1
rsa==4.9
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.7
s3transfer==0.6.1
schema==0.7.5
scikit-image==0.21.0
scikit-learn==1.0
scipy==1.10.1
scooby==0.7.2
seaborn==0.12.2
seekpath==2.1.0
Send2Trash==1.8.2
sentinels==1.0.0
sentry-sdk==1.26.0
shakenbreak==3.0.0
shapely==2.0.1
sip @ file:///tmp/abs_44cd77b_pu/croots/recipe/sip_1659012365470/work
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.1
sniffio==1.3.0
snowballstemmer==2.2.0
soupsieve==2.4.1
sparse==0.14.0
spglib==2.0.2
Sphinx==7.2.6
sphinx-argparse==0.4.0
sphinx-pdj-theme==0.2.1
sphinxcontrib-applehelp==1.0.7
sphinxcontrib-devhelp==1.0.5
sphinxcontrib-htmlhelp==2.0.4
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.6
sphinxcontrib-serializinghtml==1.1.9
SQLAlchemy==2.0.25
sshtunnel==0.4.0
stack-data==0.6.2
starlette==0.27.0
stevedore==5.1.0
sumo==2.3.5
sympy==1.11.1
tabulate==0.9.0
tdscha==1.0.1
tenacity==8.2.2
tensorboard==2.12.1
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tensorboardX==2.6.2.2
tensorflow==2.12.0
tensorflow-estimator==2.12.0
tensorflow-io-gcs-filesystem==0.32.0
termcolor==2.2.0
terminado @ file:///croot/terminado_1671751832461/work
threadpoolctl==3.1.0
tifffile==2023.4.12
tinycss2 @ file:///croot/tinycss2_1668168815555/work
toml @ file:///tmp/build/80754af9/toml_1616166611790/work
tomli @ file:///opt/conda/conda-bld/tomli_1657175507142/work
torch==2.1.2
torch-cluster==1.6.3
torch-ema==0.3
torch-runstats==0.2.0
torch-scatter==2.1.2
torch-sparse==0.6.18
torchmetrics==1.2.0
tornado==6.3.2
tqdm==4.65.0
trainstation==1.0
traitlets==5.9.0
trimesh==3.22.3
triton==2.1.0
typeshed-client==2.4.0
typing==3.7.4.3
typing_extensions==4.8.0
umap-learn==0.5.3
uncertainties==3.1.7
uri-template==1.2.0
urllib3==2.1.0
uvicorn==0.23.2
Vapory==0.1.2
virtualenv==20.25.0
vtk==9.2.6
wcmatch==8.4.1
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.3
Werkzeug==2.2.3
widgetsnbextension==3.6.4
wrapt==1.14.1
yarl==1.9.2
zipp==3.16.2

[Bug]: SAM Callback Fails in MultiData MultiTask Setting With All `None` Param Gradients

Expected behavior

I would either expect some of the optimizer parameter gradients to not be None, or for the SAM callback to handle the case when all of the parameter gradients are None.

Actual behavior

When all of the parameter gradients are None, this line fails because the resulting list is empty.

Condensed Stack Trace:

  File "/Users/carmelog/Projects/GitRepos/open-catalyst/public-repo/matsciml-fork/matsciml/lightning/callbacks.py", line 779, in on_before_optimizer_step
    org_weights = self._first_step(optimizer)
  File "/Users/carmelog/Projects/GitRepos/open-catalyst/public-repo/matsciml-fork/matsciml/lightning/callbacks.py", line 807, in _first_step
    scale = self.rho / (self._grad_norm(optimizer) + 1e-5)
  File "/Users/carmelog/Projects/GitRepos/open-catalyst/public-repo/matsciml-fork/matsciml/lightning/callbacks.py", line 794, in _grad_norm
    param_norms = torch.stack(
RuntimeError: stack expects a non-empty TensorList

The function _grad_norm() could be updated to return a zero norm when the list of parameter gradients is empty, but I am unsure whether this is meaningful or whether it would produce unintended results from the SAM algorithm:

def _grad_norm(self, optimizer: Optimizer) -> torch.Tensor:
    # Collect the norm of each parameter gradient that actually exists.
    grad_norm_list = [
        (self._norm_weights(p) * p.grad).norm()
        for p in self._get_params(optimizer)
        if isinstance(p.grad, torch.Tensor)
    ]
    if grad_norm_list:
        param_norms = torch.stack(grad_norm_list)
    else:
        # No gradients at all: fall back to a zero norm instead of crashing.
        param_norms = torch.zeros(1)
    return param_norms.norm()

Steps to reproduce the problem

Run the example: ./examples/tasks/multitask/three_datasets.py with this updated trainer:

trainer = pl.Trainer(
    fast_dev_run=100,
    logger=False,
    enable_checkpointing=False,
    callbacks=[cb.GradientCheckCallback(), cb.SAM()],
)

Specifications

Latest and greatest.

[Bug]: Errors in Gathering Space Group Info For Materials Project

Expected behavior

Calling get_space_group_info() should not result in errors.

Actual behavior

The get_space_group_info() method called here gives various errors:

spglib: ssm_get_exact_positions failed (attempt=1). (line 115, /project/src/site_symmetry.c).
spglib: ssm_get_exact_positions failed (attempt=2). (line 115, /project/src/site_symmetry.c).
spglib: ssm_get_exact_positions failed (attempt=3). (line 115, /project/src/site_symmetry.c).
###
spglib: Attempt 0 tolerance = 1.000000e-02 failed(line 800, /project/src/spacegroup.c).
spglib: No point group was found (line 405, /project/src/pointgroup.c).
spglib: Attempt 1 tolerance = 9.500000e-03 failed(line 800, /project/src/spacegroup.c).
###
spglib: Too many lattice symmetries was found.
        Reduce angle tolerance to 4.286875
        (line 990, /project/src/symmetry.c).

Steps to reproduce the problem

Install the repo and run this script with the full Materials Project dataset:

python examples/datasets/materials_project/single_task_symmetry.py

Specifications

Using v1.0.0; the error occurs on both CPU and GPU.

Pip Freeze:
aiohttp==3.8.5
aiosignal==1.3.1
ase==3.22.1
async-timeout==4.0.3
attrs==23.1.0
certifi==2023.7.22
charset-normalizer==3.2.0
cloudpickle==2.2.1
contextlib2==21.6.0
contourpy==1.1.0
cycler==0.11.0
dgl==0.9.1
dgllife==0.3.2
docstring-parser==0.15
emmet-core==0.68.0
exceptiongroup==1.1.3
filelock==3.12.4
fonttools==4.42.1
frozenlist==1.4.0
fsspec==2023.9.0
future==0.18.3
geometric-algebra-attention==0.4.0
hyperopt==0.2.7
idna==3.4
importlib-resources==6.0.1
iniconfig==2.0.0
Jinja2==3.1.2
joblib==1.3.2
jsonargparse==4.24.1
kiwisolver==1.4.5
latexcodec==2.0.1
lightning-utilities==0.9.0
llvmlite==0.40.1
lmdb==1.3.0
MarkupSafe==2.1.3
matplotlib==3.7.3
-e git+https://github.com/melo-gonzo/matsciml.git@db46be635795d4ec556dc00222058a61909d0cf7#egg=matsciml
monty==2023.9.5
mp-api==0.34.3
mpmath==1.3.0
msgpack==1.0.5
multidict==6.0.4
munch==2.5.0
networkx==3.1
numba==0.57.1
numpy==1.24.4
packaging==23.1
palettable==3.3.3
pandas==2.1.0
Pillow==10.0.0
plotly==5.16.1
pluggy==1.3.0
protobuf==4.24.3
psutil==5.9.5
py4j==0.10.9.7
pybtex==0.24.0
pydantic==1.10.12
pymatgen==2023.9.10
pyparsing==3.1.1
pytest==7.4.2
python-dateutil==2.8.2
pytorch-lightning==1.8.6
pytz==2023.3.post1
PyYAML==6.0.1
rdkit==2023.3.1
requests==2.31.0
ruamel.yaml==0.17.32
ruamel.yaml.clib==0.2.7
schema==0.7.5
scikit-learn==1.3.0
scipy==1.11.2
six==1.16.0
spglib==2.1.0
sympy==1.12
tabulate==0.9.0
tenacity==8.2.3
tensorboardX==2.6.2.2
threadpoolctl==3.2.0
tomli==2.0.1
torch==2.0.1
torchmetrics==1.1.2
tqdm==4.66.1
typeshed-client==2.3.0
typing_extensions==4.7.1
tzdata==2023.3
uncertainties==3.1.7
urllib3==2.0.4
yarl==1.9.2
zipp==3.16.2
