Comments (14)
Regarding OOM: that looks like preprocessing, yes. You can preprocess your dataset on a normal CPU node (where hopefully you can allocate more RAM) using nequip-benchmark on the same YAML config; see mir-group/nequip#293, for example, for more discussion.
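For concreteness, that two-step workflow could look like the following sketch (the config filename is hypothetical, and this assumes a cluster where you can land the first step on a CPU node):

```shell
# Step 1: on a CPU node with more RAM, run nequip-benchmark on the same
# YAML config used for training; this triggers dataset preprocessing and
# caches the processed dataset on disk.
nequip-benchmark config.yaml

# Step 2: train as usual (e.g. on the GPU node); the cached,
# preprocessed dataset is picked up instead of being rebuilt.
nequip-train config.yaml
```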
from pair_allegro.
Hi @nukenadal,
Thanks for your interest in our code and for trying this new feature!
The stress branch is built so that its defaults are compatible with the current develop branch of nequip, soon to be released as 0.6.0. As a result, if your model is from an earlier version of nequip, you'll need to specify pair_style allegro3232 on the stress branch, instead of pair_style allegro, to get the compatible version.
Hopefully this resolves the issue, and please let me know if you observe any issues or suspicious results when using the stress branch!
Thank you so much for your rapid response!
The above problem is resolved with pair_style allegro3232. I then tested an NpT MD run, but found that the stresses are not behaving as I expected. I used the normal npt ensemble from LAMMPS as if I were using a classical potential, with the command: fix 1 all npt temp 200 200 $(100*dt) tri 1.0 1.0 $(50*dt). The aim here is to simulate the structure under ambient pressure (1 bar, as I use metal units). Below is an example of my training dataset header:
Lattice="12.002029185229819 0.004064393052959 -0.051196621492764 -0.000000000004469 11.711580380388035 -0.008099807301418 0.000000000001293 -0.000000000000898 12.039917140248585"
Properties=species:S:1:pos:R:3:forces:R:3 energy=-1141.403858931822
stress="0.004046574721405 -0.001809377097797 0.000241330359958 -0.001809377097797 0.000898277454365 -0.001171032253305 0.000241330359958 -0.001171032253305 0.000988749363509"
free_energy=-1141.403858931822 pbc="T T T"
Here the stresses are saved in units of eV/Å³. Then in the MD production, I had a printout like this:
Step CPU Time Temp Press Volume TotEng
0 0.00 0.000 200.000 50652.238 47241.633 -29903.285
50 58.76 0.025 266.807 1806.511 62650.361 -30039.377
100 97.48 0.050 421.660 5194.959 106511.009 -30309.355
150 120.55 0.075 385.755 1111.530 156182.512 -30445.874
200 136.21 0.100 316.264 851.025 203929.625 -30503.778
250 148.06 0.125 259.946 552.863 267116.349 -30550.031
300 157.38 0.150 217.007 369.768 342194.282 -30582.112
350 164.71 0.175 183.242 217.801 420573.110 -30603.389
400 170.68 0.200 167.577 125.765 497445.470 -30614.017
450 175.88 0.225 165.960 75.794 573206.403 -30616.751
500 180.61 0.250 172.086 64.869 652145.704 -30615.140
550 185.00 0.275 178.895 58.633 741854.160 -30612.851
600 189.05 0.300 184.556 49.415 842463.753 -30610.630
650 192.79 0.325 190.739 40.879 958897.820 -30608.741
700 196.31 0.350 191.660 37.834 1086485.734 -30607.029
750 199.63 0.375 196.898 30.968 1229353.201 -30605.731
800 202.77 0.400 199.433 25.713 1388276.031 -30604.834
850 205.78 0.425 200.112 20.940 1559556.374 -30604.274
900 208.66 0.450 200.412 19.134 1743025.867 -30603.748
950 211.45 0.475 200.821 16.344 1946113.740 -30603.529
1000 214.18 0.500 202.782 11.614 2161252.664 -30603.517
1050 216.84 0.525 200.039 10.945 2394540.916 -30603.602
1100 219.44 0.550 198.836 9.568 2640828.010 -30603.645
1150 221.99 0.575 199.692 8.230 2906593.920 -30603.578
The pressure was initially quite high and then lowered towards 1 bar. I was wondering if there is any additional setting required to reconcile the stresses from the FF with how LAMMPS handles the pressure of the cell? I used the main branches of nequip 0.5.6 and allegro 0.2.0, if that is relevant.
Thank you!
Hi @nukenadal, you can try the para_stress branch of nequip together with the stress branch of pair_allegro, and train with a config.yaml that includes ParaStressForceOutput instead of StressForceOutput for your network. Details of the changes can be seen in PR_pair_allegro and PR_nequip, with more discussion in this issue. It works well on my side; hope it helps.
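For illustration, that swap would land in the model_builders list of the training config. A minimal sketch, assuming the usual nequip builder stack around it (the surrounding entries are standard defaults, not taken from this thread):

```yaml
model_builders:
  - SimpleIrrepsConfig
  - EnergyModel
  - PerSpeciesRescale
  # was: StressForceOutput
  - ParaStressForceOutput
  - RescaleEnergyEtc
```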
@nukenadal: have you verified with a test set / validation set, such as with nequip-evaluate, that your model's stress predictions are indeed correct?
It's hard for us to say without more details on the system, training data, and LAMMPS input file...
Sorry, I found that the above MD calculation had a mismatched FF and input structure. That is probably why the structure disintegrated so rapidly. I retrained the FF with more data and epochs; it gives the following test results from nequip-evaluate. The MAE of the stresses corresponds to approx. 384 bar. Given that the stresses at this system size are about 50 kbar, I wonder if this level of accuracy can be considered sufficiently good?
f_mae = 0.010270
f_rmse = 0.014187
stress_mae = 0.000240
stress_rmse = 0.000424
e_mae = 0.032459
e/N_mae = 0.000338
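As a quick sanity check on the 384 bar figure (a sketch; the factor is just the eV/Å³ → bar unit conversion applied to stress_mae above):

```python
# 1 eV/Å^3 = 1.602176634e-19 J / 1e-30 m^3 = 1.602176634e11 Pa;
# with 1 bar = 1e5 Pa, this gives ~1.602e6 bar per eV/Å^3.
EV_PER_A3_TO_BAR = 1.602176634e-19 / 1e-30 / 1e5

stress_mae = 0.000240  # eV/Å^3, from the nequip-evaluate output above
print(f"stress MAE ≈ {stress_mae * EV_PER_A3_TO_BAR:.0f} bar")  # ≈ 385 bar
```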
Moreover, I tried re-compiling LAMMPS because I accidentally wiped out the build folder. With the same modules and package versions (and with allegro3232 enabled as before), the MD production gives the following errors:
[cx3-11-8:2563622:0:2563622] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:2563622) ====
0 0x0000000000012c20 .annobin_sigaction.c() sigaction.c:0
1 0x0000000004c3b0c4 torch::jit::InterpreterStateImpl::callstack() ???:0
2 0x0000000004c3cbf1 torch::jit::InterpreterStateImpl::handleError() ???:0
3 0x0000000004c4ab85 torch::jit::InterpreterStateImpl::runImpl() ???:0
4 0x0000000004c3604f torch::jit::InterpreterState::run() ???:0
5 0x0000000004c264da torch::jit::GraphExecutorImplBase::run() ???:0
6 0x00000000048adf8e torch::jit::Method::operator()() ???:0
7 0x00000000007179ac torch::jit::Module::forward() /home/libtorch-new/include/torch/csrc/jit/api/module.h:114
8 0x0000000000720ee1 LAMMPS_NS::PairAllegro<(Precision)0>::compute() /home/lammps-stable_29Sep2021/src/pair_allegro.cpp:442
9 0x000000000055b366 LAMMPS_NS::Verlet::run() /home/lammps-stable_29Sep2021/src/verlet.cpp:312
10 0x00000000005163fb LAMMPS_NS::Run::command() /home/lammps-stable_29Sep2021/src/run.cpp:180
11 0x0000000000468e4a LAMMPS_NS::Input::execute_command() /home/lammps-stable_29Sep2021/src/input.cpp:794
12 0x0000000000469553 LAMMPS_NS::Input::file() /home/lammps-stable_29Sep2021/src/input.cpp:273
13 0x0000000000444b28 main() /home/lammps-stable_29Sep2021/src/main.cpp:98
14 0x0000000000023493 __libc_start_main() ???:0
15 0x000000000044502e _start() ???:0
=================================
/var/spool/PBS/mom_priv/jobs/7031777.pbs.SC: line 11: 2563622 Segmentation fault /home/lammps-stable_29Sep2021/build/lmp -in ./npt_org_perov.in > logfile_$PBS_JOBID
I tested multiple combinations of packages but this issue remains. I was wondering if you have any suggestions on this type of error? Thanks so much!
Sorry, I found that the above MD calculation had a mismatched FF and input structure.
👍
The MAE of the stresses corresponds to approx. 384 bar. Given that the stresses at this system size are about 50 kbar, I wonder if this level of accuracy can be considered sufficiently good?
This is an error of less than 1%, which sounds excellent; in an absolute sense it also looks good to me... @simonbatzner ?
Regarding the segfault, in general we've seen this either in certain cases if you run out of memory, or if there is an empty simulation domain in a parallel LAMMPS simulation... do either of these sound applicable?
With a few tests, I found that this is actually an out-of-memory error. I lowered the FF training settings (from 4 layers to 2), and the production now runs without hitting the memory limit and also gives meaningful trajectories. Thank you so much for the valuable advice!
Some extra points, though not quite relevant:
- The MD production with 1 GPU is actually faster than with 2 GPUs. I am not sure if this is related to my network settings, making it not very scalable, or if it is somehow a feature of this experimental version.
- I also tried this workflow on my local workstation. I got the following error when trying to load the dataset for training:
Traceback (most recent call last):
File "/home/.local/lib/python3.9/site-packages/torch/serialization.py", line 380, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/home/.local/lib/python3.9/site-packages/torch/serialization.py", line 604, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device
I wonder if this is telling me that I don't have enough storage space for some temporary data, or if it is a memory error like the above?
Hi @nukenadal ,
With a few tests, I found that this is actually an out-of-memory error. I lowered the FF training settings (from 4 layers to 2), and the production now runs without hitting the memory limit and also gives meaningful trajectories.
Great! This should also be possible to resolve, then, by running on more GPUs with the same model.
The MD production with 1 GPU is actually faster than with 2 GPUs. I am not sure if this is related to my network settings, making it not very scalable, or if it is somehow a feature of this experimental version.
You still have ~2500 atoms like your original post? And you mean two GPUs, and not two GPU nodes? You are measuring this through the LAMMPS performance summary printed at the end, and if so, how does the time percentage spent in "Comm" change for you from one to two GPUs?
I also tried this workflow on my local workstation. I got the following error when trying to load the dataset for training: ... OSError
This happens after it prints "Processing..." but before "Loaded Dataset"? If so, then yes, it is failing to write out the preprocessed data file, which is in an efficient binary format but can be quite large due to the inclusion of full neighborlist information, especially for a big cutoff in a dense system.
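To make "quite large" concrete, a back-of-the-envelope sketch using the system size reported elsewhere in this thread (2592 atoms, ~47.5 neighbors per atom); the per-edge cost and the frame count are assumptions for illustration only:

```python
# Rough size of the edge-index (neighborlist) part of a preprocessed dataset.
# Assumes each edge stores two int64 atom indices (16 bytes); the real file
# also holds positions, forces, energies, etc. per frame.
n_atoms = 2592             # from the LAMMPS logs in this thread
neighbors_per_atom = 47.5  # average neighbor count from the same logs
n_frames = 1000            # hypothetical number of training frames

edges_per_frame = int(n_atoms * neighbors_per_atom)
total_gb = edges_per_frame * 16 * n_frames / 1e9
print(f"~{total_gb:.1f} GB for edge indices alone")
```

A larger cutoff increases neighbors_per_atom roughly with the cube of the radius, so the file grows quickly in dense systems.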
Hi,
You still have ~2500 atoms like your original post? And you mean two GPUs, and not two GPU nodes? You are measuring this through the LAMMPS performance summary printed at the end, and if so, how does the time percentage spent in "Comm" change for you from one to two GPUs?
I used about 2600 atoms again. I was comparing 1 GPU on 1 node against 2 GPUs on 1 node, and I compared the speed through the time taken for the same 2000 MD steps. I got the following results:
for 1 GPU:
Loop time of 460.862 on 1 procs for 2000 steps with 2592 atoms
Performance: 0.187 ns/day, 128.017 hours/ns, 4.340 timesteps/s
99.5% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 458.13 | 458.13 | 458.13 | 0.0 | 99.41
Neigh | 2.0769 | 2.0769 | 2.0769 | 0.0 | 0.45
Comm | 0.047307 | 0.047307 | 0.047307 | 0.0 | 0.01
Output | 0.39902 | 0.39902 | 0.39902 | 0.0 | 0.09
Modify | 0.19997 | 0.19997 | 0.19997 | 0.0 | 0.04
Other | | 0.01258 | | | 0.00
Nlocal: 2592.00 ave 2592 max 2592 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 3591.00 ave 3591 max 3591 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0.00000 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs: 123132.0 ave 123132 max 123132 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 123132
Ave neighs/atom = 47.504630
Neighbor list builds = 174
Dangerous builds = 96
Total wall time: 0:07:58
And for 2 GPUs:
Loop time of 459.87 on 1 procs for 2000 steps with 2592 atoms
Performance: 0.188 ns/day, 127.742 hours/ns, 4.349 timesteps/s
20.3% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 457.31 | 457.31 | 457.31 | 0.0 | 99.44
Neigh | 2.0862 | 2.0862 | 2.0862 | 0.0 | 0.45
Comm | 0.051038 | 0.051038 | 0.051038 | 0.0 | 0.01
Output | 0.19311 | 0.19311 | 0.19311 | 0.0 | 0.04
Modify | 0.21184 | 0.21184 | 0.21184 | 0.0 | 0.05
Other | | 0.01453 | | | 0.00
Nlocal: 2592.00 ave 2592 max 2592 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 3592.00 ave 3592 max 3592 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0.00000 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs: 123404.0 ave 123404 max 123404 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 123404
Ave neighs/atom = 47.609568
Neighbor list builds = 175
Dangerous builds = 97
Total wall time: 0:08:05
I wonder if this allows us to find out why the scaling is abnormal, given that the percentage of time spent in Comm was similar.
This happens after it prints "Processing..." but before "Loaded Dataset"? If so, then yes, it is failing to write out the preprocessed data file, which is in an efficient binary format but can be quite large due to the inclusion of full neighborlist information, especially for a big cutoff in a dense system.
This happened before the line "Loaded Dataset", as the first few lines in the log are:
Torch device: cuda
Processing dataset...
terminate called after throwing an instance of 'c10::Error'
20.3% CPU use with 1 MPI tasks x 1 OpenMP threads
^ from the second log
@nukenadal, are you sure it is using two GPUs? Have you checked nvtop while it's running, for example?
are you sure it is using two GPUs? have you checked nvtop while it's running, for example?
I found that I may not have properly set up the multi-GPU process. I wonder what command I should use to call, say, 2 GPUs on the same node? Is it just the same as mpirun -n x lmp -in lmp.in?
Yes (or possibly a launcher like srun, depending on your cluster).
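For concreteness, a minimal launch sketch (the input filename is hypothetical; with pair_allegro, each MPI rank drives its own GPU, so two ranks on one node use two GPUs):

```shell
# Two MPI ranks on one node -> two GPUs with pair_allegro.
mpirun -np 2 lmp -in lmp.in

# Or, on a SLURM cluster:
# srun -n 2 lmp -in lmp.in
```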
Thank you so much for your help! The issues of multi-processing and OOM errors are both resolved.