Comments (3)

Linux-cpp-lisp commented on June 22, 2024

Hi @gshs12051 ,

Thanks for your interest in our work!

What is your PyTorch version? This looks like a familiar PyTorch bug that should be resolved by updating to the latest supported stable version (1.11).

from pair_allegro.

gshs12051 commented on June 22, 2024

Thanks. I was using PyTorch 1.10, and updating to 1.11 solved the problem.
I have two more questions. The first concerns MD simulation with MPI: LAMMPS did not proceed past this stage when launched with the following command.
mpirun -np 8 lmp -sf omp -pk omp 4 -in in.lammps

run 10
No /omp style for force computation currently active

By contrast, it works with mpirun -np 4 lmp -sf omp -pk omp 8 -in in.lammps, as shown below.
Is there a specific limit on the MPI processor grid size? Also, the MD simulation sometimes ends with the error below.

  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
Per MPI rank memory allocation (min/avg/max) = 11.64 | 11.64 | 11.64 Mbytes
Step Temp TotEng PotEng Press Volume S/CPU CPULeft 
       0         1000   -502.17886   -517.56082    4733.1234    3471.2258            0            0 
      10    1037.6838   -502.17892   -518.14053    4911.4855    3471.2258   0.62056358    96670.196 
      20    1149.0517   -502.18269   -519.85735    5438.6038    3471.2258    3.3797967    57200.356 
      30    1366.1239   -502.20265   -523.21631    6466.0332    3471.2258    3.3869016    44029.363 
      40    1706.0198   -502.27363   -528.51555    8074.8025    3471.2258    3.3691646     37465.69 
      50      2092.37   -502.46846    -534.6532    9903.4456    3471.2258      0.80146    44927.751 
      60    2388.6437   -502.88855   -539.63056    11305.746    3471.2258   0.49348786    57677.206 
      70    2591.1369   -503.60771   -543.46446    12264.171    3471.2258   0.49194526    66832.571 
      80    2867.5918   -504.70262   -548.81179    13572.666    3471.2258   0.54046821    72327.097 
      90    3162.0488   -506.21135   -554.84985    14966.367    3471.2258   0.47370461    78332.381 
     100    3463.3768   -508.07882   -561.35234     16392.59    3471.2258   0.43357856    84302.633 
     110    3783.0973   -510.20537   -568.39681    17905.867    3471.2258    0.4624285    88399.774 
     120    4040.5194   -512.46371   -574.61481    19124.277    3471.2258    0.4751138    91522.343 
     130    3916.8145   -468.80556   -529.05384    18538.766    3471.2258   0.56733556    92585.622 
     140    4160.4834   -471.28922    -535.2856    19692.081    3471.2258   0.48887372    94704.054 
     150    5138.8348   -472.80773   -551.85307    24322.739    3471.2258   0.47169693    96834.505 
     160    5735.4544   -477.15941   -565.38192    27146.614    3471.2258   0.47489075    98642.676 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 39872 RUNNING AT n020
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
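For readers scanning a log like the one above: TotEng changes by roughly 43.7 (eV, given the metal unit style) between steps 120 and 130, far more than between any other pair of printed steps. The sketch below is one way to locate such a jump automatically. The helper names parse_thermo and largest_energy_jump are hypothetical, the column order is assumed to match the header shown above (Step, Temp, TotEng, ...), and the result only points at where the energy changes most; it is not a diagnosis of the crash.

```python
# Minimal sketch: find the largest step-to-step change in TotEng
# in LAMMPS thermo output. Helper names are hypothetical, and the column
# layout (Step Temp TotEng ...) is assumed to match the log in this thread.

THERMO = """\
Step Temp TotEng PotEng Press Volume S/CPU CPULeft
     110    3783.0973   -510.20537   -568.39681    17905.867    3471.2258    0.4624285    88399.774
     120    4040.5194   -512.46371   -574.61481    19124.277    3471.2258    0.4751138    91522.343
     130    3916.8145   -468.80556   -529.05384    18538.766    3471.2258   0.56733556    92585.622
     140    4160.4834   -471.28922    -535.2856    19692.081    3471.2258   0.48887372    94704.054
"""


def parse_thermo(text):
    """Return a list of (step, tot_eng) tuples, skipping non-numeric lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            step = int(parts[0])
            tot_eng = float(parts[2])  # third column is TotEng in this layout
        except ValueError:
            continue  # header or other non-data line
        rows.append((step, tot_eng))
    return rows


def largest_energy_jump(rows):
    """Return (step_before, step_after, delta) for the largest |delta|, or None."""
    best = None
    for (s0, e0), (s1, e1) in zip(rows, rows[1:]):
        delta = e1 - e0
        if best is None or abs(delta) > abs(best[2]):
            best = (s0, s1, delta)
    return best


rows = parse_thermo(THERMO)
jump = largest_energy_jump(rows)
print(jump)
```

Only four rows of the log are pasted here to keep the example short; the full thermo block can be passed to parse_thermo in the same way.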


My second question concerns training. I tried to use a training set with multiple cell sizes (for example, some structures with 120 atoms and some with 60 atoms), and training ended with the errors below.

instantiate NpzDataset
   optional_args :                                         key_mapping
   optional_args :                                npz_fixed_field_keys
   optional_args :                                                root
   optional_args :                                  extra_fixed_fields <-                         dataset_extra_fixed_fields
   optional_args :                                           file_name <-                                  dataset_file_name
...NpzDataset_param = dict(
...   optional_args = {'key_mapping': {'z': 'atomic_numbers', 'E': 'total_energy', 'F': 'forces', 'R': 'pos'}, 'include_keys': [], 'npz_fixed_field_keys': ['atomic_numbers'], 'file_name': './train_set.npz', 'url': None, 'force_fixed_keys': [], 'extra_fixed_fields': {'r_max': 4.0}, 'include_frames': None, 'root': 'results/GeSe2'},
...   positional_args = {'type_mapper': <nequip.data.transforms.TypeMapper object at 0x2b9f505d7490>})
Traceback (most recent call last):
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 232, in instantiate
    instance = builder(**positional_args, **final_optional_args)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 681, in __init__
    super().__init__(
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 123, in __init__
    super().__init__(root=root, transform=type_mapper)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 90, in __init__
    self._process()
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 175, in _process
    self.process()
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 269, in process
    data_list = [
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 270, in <listcomp>
    constructor(**{**{f: v[i] for f, v in fields.items()}, **fixed_fields})
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 326, in from_points
    return cls(edge_index=edge_index, pos=torch.as_tensor(pos), **kwargs)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 221, in __init__
    _process_dict(kwargs)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 163, in _process_dict
    raise ValueError(
ValueError: atomic_numbers is a node field but has the wrong dimension torch.Size([72, 1])

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gshs12051/anaconda3/envs/pytorch/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 74, in main
    trainer = fresh_start(config)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 177, in fresh_start
    dataset = dataset_from_config(config, prefix="dataset")
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/_build.py", line 78, in dataset_from_config
    instance, _ = instantiate(
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 234, in instantiate
    raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `NpzDataset`
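The ValueError above reports a shape of [72, 1] for atomic_numbers, so a first step can be to check what shapes the .npz file actually stores. The sketch below uses only NumPy; summarize_npz is a hypothetical helper, and the demo file is synthetic data written to a temporary directory, with the keys z, E, F and R taken from the key_mapping in the log. It only reports shapes and makes no claim about which layout nequip expects.

```python
# Minimal sketch: list the shape of every array in an .npz archive.
# summarize_npz is a hypothetical helper; the demo data below is synthetic.
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {key: shape} for every array stored in an .npz archive."""
    # allow_pickle=True is only needed if the file holds object arrays
    # (e.g. frames of different sizes); use it only on files you trust.
    with np.load(path, allow_pickle=True) as data:
        return {key: data[key].shape for key in data.files}


# Demo on a synthetic file: 5 frames of 72 atoms, with z stored as a column.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_set.npz")
    np.savez(
        demo_path,
        z=np.ones((72, 1), dtype=np.int64),  # same shape as in the error message
        E=np.zeros(5),
        F=np.zeros((5, 72, 3)),
        R=np.zeros((5, 72, 3)),
    )
    shapes = summarize_npz(demo_path)

for key, shape in shapes.items():
    print(key, shape)
```

To inspect the real data, call summarize_npz("./train_set.npz") with the file_name shown in the log.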

Linux-cpp-lisp commented on June 22, 2024

Hi @gshs12051 ,

Great, glad it resolved your issue!

Could you please open a new issue on pair_allegro (this repo) for the MPI question, and a separate issue on the nequip repo for the training issue? This helps keep information searchable and organized for future users.

Thanks!
