
SCR_CNTL_BASE variable · scr · CLOSED

llnl commented on July 24, 2024
SCR_CNTL_BASE variable

from scr.

Comments (25)

adammoody commented on July 24, 2024

Sorry for the slow reply on this. The current behavior is at best misleading. We'll work to improve this in a future release. As it stands, the control directory is locked down to be set as either a compile time constant or through the system configuration file. The system configuration file defaults to /etc/scr.conf. Having said that, let me test things to verify whether that works.


adammoody commented on July 24, 2024

Well, unfortunately there is also a mismatch between the documentation and the code.

You should be able to get this to work by specifying the control base path you want to use as a cmake option, compiling in a default other than /tmp:

cmake -DSCR_CNTL_BASE=/tmp/username/cache ...

Then there is a bug in that the user guide says you should use the STORE key in a CKPT descriptor while the code actually looks for BASE. As a workaround, change STORE to BASE in your CKPT line:

SCR_COPY_TYPE=FILE
STORE=/tmp/username/cache GROUP=NODE COUNT=1
CKPT=0 INTERVAL=1 GROUP=NODE BASE=/tmp/username/cache TYPE=XOR SET_SIZE=8

Ultimately, we'll want to change the code to use STORE, but this should work for now.

Please let me know if that helps.


adammoody commented on July 24, 2024

Pushed a commit to master to look for STORE key name as documented in user guide: f48f457


gongotar commented on July 24, 2024

I have tested the most recent scr (after the commit) and I still get the same error:

SCR v1.2.0 ABORT: rank 0 on : Failed to create store descriptor for control directory [/tmp] @ <SCR_DIRECTORY>/src/scr_storedesc.c:351

Is there something else I should do? My scr.conf file is as follows:

SCR_COPY_TYPE=FILE
STORE=/tmp/username/cache          GROUP=NODE   COUNT=1
CKPT=0 INTERVAL=1 GROUP=NODE   STORE=/tmp/username/cache TYPE=XOR     SET_SIZE=8

I have also tried replacing the STORE key with BASE (which I believe should no longer be needed), and I get the following error:

SCR v1.2.0 ERROR: rank 0 on : Missing one or more files for dataset 1 @ <SCR_DIRECTORY>/src/scr_flush.c:528


adammoody commented on July 24, 2024

This config file should place the cache directory at /tmp/username/cache. However, it requires a cmake option to move the control directory. I think it should work if you rerun your cmake step with the following option and rebuild SCR:

cmake -DSCR_CNTL_BASE=/tmp/username/cache ...


gongotar commented on July 24, 2024

Hi, sorry for the late answer. I have installed SCR with the given option cmake -DSCR_CNTL_BASE=/tmp/username/cache, and I get the same error as before. Nothing has changed.


gongotar commented on July 24, 2024

Any news on this? I noticed that running SCR with the default store /tmp gives the same error too.
I installed SCR as instructed by the documentation, but running make test results in the following output:

The following tests FAILED:
1 - serial_test_api_start (Failed)
2 - serial_test_api_restart (Failed)
4 - parallel_test_api_start (Failed)
5 - parallel_test_api_restart (Failed)
7 - serial_test_api_multiple_start (Failed)
8 - serial_test_api_multiple_restart (Failed)
10 - parallel_test_api_multiple_start (Failed)
11 - parallel_test_api_multiple_restart (Failed)
13 - serial_test_interpose_start (Failed)
14 - serial_test_interpose_restart (Failed)
16 - parallel_test_interpose_start (Failed)
17 - parallel_test_interpose_restart (Failed)
19 - serial_test_interpose_multiple_start (Failed)
20 - serial_test_interpose_multiple_restart (Failed)
22 - parallel_test_interpose_multiple_start (Failed)
23 - parallel_test_interpose_multiple_restart (Failed)
25 - serial_test_ckpt_C_start (Failed)
26 - serial_test_ckpt_C_restart (Failed)
28 - parallel_test_ckpt_C_start (Failed)
29 - parallel_test_ckpt_C_restart (Failed)
31 - serial_test_ckpt_F_start (Failed)
32 - serial_test_ckpt_F_restart (Failed)
34 - parallel_test_ckpt_F_start (Failed)
35 - parallel_test_ckpt_F_restart (Failed)
Errors while running CTest

However, I can call SCR functions in my code without getting library errors. I then have a sample code just like the example in the tutorial. scr.conf contains the following default lines:

SCR_CNTL_BASE=/tmp
CACHEDIR=/tmp BYTES=12GB
#SCR_CACHE_BASE=/tmp      # commented scr cache base

SCR_DB_ENABLE=0

SCR_DB_HOST=host45
SCR_DB_USER=scr_insert
SCR_DB_PASS=12345
SCR_DB_NAME=scr

STORE=/tmp GROUP=NODE COUNT=2

The slurm script has a line pointing to the scr.conf:
export SCR_CONF_FILE=<path to scr conf>/scr.conf

Everything seems to be correct as instructed by the tutorial, but I still get the same error as before.


adammoody commented on July 24, 2024

I see you refer to two different errors above. When you say the same error, do you mean you're currently seeing the "Failed to create store descriptor for control directory" or the "Missing one or more files for dataset 1" error?

The second error is further along than the first, meaning it created the control directory store descriptor and the directory itself.


gongotar commented on July 24, 2024

The following is the error I see in the output file:

SCR v1.2.0 ERROR: rank 0 on <node name>: Missing one or more files for dataset 1 @ <SCR_DIRECTORY>/src/scr_flush.c:528
SCR v1.2.0 ERROR: rank 0 on <node name>: One or more processes are missing files for dataset 1 @ <SCR_DIRECTORY>/src/scr_flush.c:535
SCR v1.2.0 ABORT: rank 0 on <node name>: Failed to identify data for flush of dataset 1 @ <SCR_DIRECTORY>/src/scr_flush.c:742

I have inspected my code and found that this error is thrown by SCR_Complete_checkpoint(valid); after writing the checkpoint. I debugged the code from the line of the reported error (scr_flush.c:528), and this is the call stack that results in failure:

1: function int scr_flush_identify in line scr_flush.c:526 calls int scr_cache_check_files
2: function int scr_cache_check_files in line scr_cache.c:502 calls int scr_meta_is_complete
3: function int scr_meta_is_complete in line scr_meta.c:266 calls int scr_hash_util_get_int
4: function int scr_hash_util_get_int in line scr_hash_util.c:164 calls int scr_hash_elem_get_first_val
5: function int scr_hash_elem_get_first_val in line scr_hash.c:870 calls char* scr_hash_elem_key

The last function returns the value of a given key from the hash. In my case the given key is COMPLETE, and the delivered value is 0, which apparently should be 1 for SCR_Complete_checkpoint(valid) to complete successfully. The returned value (0) is checked in scr_meta_is_complete at step 3 above: if it is 1, SCR_SUCCESS is returned, else SCR_FAILURE. So I think it's somehow related to a wrong state of the COMPLETE key in the hash when calling SCR_Complete_checkpoint(valid). Here is the part of my test code for checkpointing, as instructed by the tutorial:

SCR_Start_checkpoint();
char checkpoint_file[256];
sprintf(checkpoint_file, "%s/rank_%d.ckpt", checkpoint_dir, rank);
char scr_file[SCR_MAX_FILENAME];
SCR_Route_file(checkpoint_file, scr_file);
FILE* fs = fopen(scr_file, "w");
if (fs != NULL) {
    int rc = fwrite(buf, 1, size, fs);
    fclose(fs);
}
int valid;
SCR_Complete_checkpoint(valid);      // This line produces error
int should_exit;
SCR_Should_exit(&should_exit);
if (should_exit) {
    exit(0);
}


adammoody commented on July 24, 2024

Ah, ok. So it looks like valid is not set here when calling SCR_Complete_checkpoint(). You'll want to set this value to be 1 if the process wrote its file successfully, or 0 otherwise. That flag indicates to SCR that the checkpoint file was written without errors. Since it's not set explicitly, it's likely assuming the "random" value of 0, which SCR_Complete_checkpoint takes to mean that the process failed in writing its file. Try setting valid=1 and see if that helps.


gongotar commented on July 24, 2024

Oh, now it's working, thanks.
But I have noticed a huge performance degradation in the case of an MPI job running on more than one node. In my case the checkpoint size is 2 GB, and I run the following code for checkpointing (same as the tutorial):

  double start_cp0 = MPI_Wtime();      // start measuring delta0
  SCR_Start_checkpoint();
  sprintf(checkpoint_file, "%s/rank_%d.ckpt", checkpoint_dir, rank);
  char scr_file[SCR_MAX_FILENAME];
  SCR_Route_file(checkpoint_file, scr_file);
  double start_cp1 = MPI_Wtime();     // start measuring delta1
  FILE* fs = fopen(scr_file, "w");
  int valid = 0;
  if (fs != NULL) {
    int rc = fwrite(buf, 1, (size_t)sizet, fs);
    if (rc >0){
      valid = 1;
    }
    fclose(fs);
  }
  double end_cp1 = MPI_Wtime();     // end measuring delta1
  double delta1 = end_cp1 - start_cp1;
  printf("rank %d inner bandwidth: %7.2f\n", rank, sizet/(delta1*GB_UNIT));
  SCR_Complete_checkpoint(valid);
  double end_cp0 = MPI_Wtime();     // end measuring delta0
  double delta0 = end_cp0 - start_cp0;
  printf("rank %d outer bandwidth: %7.2f\n", rank, sizet/(delta0*GB_UNIT));

The printed output of the code for a job having 2 ranks running on two nodes is as follows:

rank 0 inner bandwidth: 2.10
rank 1 inner bandwidth: 2.10
rank 0 outer bandwidth: 0.16
rank 1 outer bandwidth: 0.16

I have noticed that most of the performance degradation is in the SCR_Complete_checkpoint function call. Changing the size of the checkpoint doesn't change the outer/inner bandwidth values much, which means the performance degradation is not a constant extra time spent on communication but depends linearly on the checkpoint size.
In my scr.conf file I haven't used any redundancy schemes (XOR, PARTNER, ...). The only store used is the node's local storage. And I have set export SCR_FLUSH=-1 in the slurm wrapper script.
However, this is not the case if the job uses only one node; there the inner and outer bandwidth have similar values. Do you know why I experience such performance degradation in the case of multiple nodes?
The problem with scr_run mentioned here is still there, however.


adammoody commented on July 24, 2024

Ok, good. We're making progress.

To disable the flush, you'll want to set SCR_FLUSH=0 instead of -1. I would expect this setting to affect both the single node and two-node runs, though. I recommend this change, but I don't know that it will help your performance here.

SCR uses XOR encoding by default if there is more than one node, but it falls back to SINGLE if the job is using a single node. When going from one compute node to two, there will be some additional overhead for the XOR encoding vs SINGLE. With XOR, SCR must read the entire checkpoint file and compute and store redundancy data. The amount of redundancy data depends on the XOR set size, which in this case would be just two since you have two nodes. The major costs involved are:

SINGLE: write file
PARTNER: write file, read file back, network transfer of file, write file again (partner copy)
XOR: write file, read file back, XOR computation + network transfer of redundancy data, write redundancy data (where data size = file size when using 2 nodes)

Assuming the XOR computation and network costs are below the I/O costs, you could expect XOR bandwidth to be about 33% of the SINGLE bandwidth.

Do you see any difference with two ranks/two nodes using PARTNER vs XOR?

If so, that would suggest the XOR compute is slow. If not, then perhaps the network is slowing things down.

How does performance change with XOR when using 3 ranks on 3 nodes, and 4 ranks on 4 nodes?

Increasing the node count reduces the size of the redundancy data.


gongotar commented on July 24, 2024

Thanks for the explanation. I tested the job (checkpoint size of 2 GB) with more nodes (connected to each other over 10 Gbit Ethernet) and the performance got better (though not by much):

nodes = 2: checkpoint duration = 17s
nodes =16: checkpoint duration = 13s

Knowing that both read and write bandwidth are almost 2 GB/s (1 s write, 1 s read, 1/(n-1) s XOR segment write), I assume that for two nodes the network + XOR computation takes 14 s, and for 16 nodes almost 11 s.
Now I assume there are two possible scenarios of how XOR is computed:

  1. Each rank locally computes only its own XOR parity segment of size 2/(n-1) GB by receiving a 1/(n-1) portion of each rank's checkpoint, computing the XOR data of size 2/(n-1) GB locally, and then writing the segment locally.
  2. All ranks send their checkpoints to a single rank (possibly rank 0), the whole XOR computation is done on this single rank, and then the parity segments of size 2/(n-1) GB are partitioned and distributed among all ranks.

In the first scenario the duration of the network transfer of the checkpoint portion (size 2/(n-1) GB) changes with the XOR set size (which I assume is the number of nodes). The XOR computation duration also depends on the set size. To my understanding, this scenario would have delivered better performance for 16 nodes.
In the second scenario the checkpoint is transferred completely and the XOR computation is done over the whole checkpoints, so the duration here is not a function of the XOR set size. The only variable part in this case is the segment transfer/write duration.
If I'm getting it correctly, comparing the cases of 2 nodes (14 s) and 16 nodes (11 s) suggests the second scenario is used in SCR (possibly to avoid network congestion), which would justify the results I get. Am I correct in this?

I have also tested PARTNER by adding the following lines to the scr.conf file:

CKPT=0 INTERVAL=1 STORE=/tmp GROUP=NODE COUNT=1
CKPT=1 INTERVAL=2 STORE=/tmp GROUP=NODE COUNT=1 TYPE=PARTNER

The observed durations are almost the same as XOR, and I don't see changes in duration every 2 checkpoints (INTERVAL=2). This is confusing; am I doing it correctly?
And one more thing: can I disable the default XOR scheme for more than one node?


adammoody commented on July 24, 2024

Ok, this information is helpful. I agree with your assessment. This shows there is some cost savings in reducing the XOR size, but it doesn't appear to be the dominant cost. That suggests the rest of the cost is in the XOR computation or networking.

Each process computes data for its own XOR segment. It's closer to the first method you described. SCR uses a ring-based reduce-scatter algorithm described in this paper:

“Providing Efficient I/O Redundancy in MPI Environments”, William Gropp, Robert Ross, and Neill Miller, Lecture Notes in Computer Science, 3241:77-86, September 2004. 11th European PVM/MPI Users Group Meeting, 2004, http://www.mcs.anl.gov/papers/P1178.pdf.

To switch to PARTNER, you could change your configuration to just:

CKPT=0 INTERVAL=1 STORE=/tmp GROUP=NODE TYPE=PARTNER

You can also change the default redundancy method from XOR to PARTNER with:

SCR_COPY_TYPE=PARTNER

Another good reference for more info:
http://scr.readthedocs.io/en/latest/users/config.html

It's still worth doing the PARTNER test, as it may shed more light.

Also given the data so far, as another test, you probably will see a boost in aggregate bandwidth by increasing the number of processes per node. Generally, one process per node does not max out the network/cpu/disk on that node. The existing algorithms in SCR do not attempt to parallelize or overlap these operations. We assume the common case is that applications run multiple MPI ranks per node, which provides some natural overlap.

How does performance change if you fix the node count to say 16 and then increase the number of tasks per node from 1, 2, 4, etc?


adammoody commented on July 24, 2024

Also, my math was off in my previous statement regarding XOR cost, which I should correct. The real cost should be:

XOR: write file, read file back, XOR computation + network transfer of the file, write redundancy data (where data size = file size when using 2 nodes)

Although XOR writes redundancy data whose size shrinks as the XOR set grows, it always computes and transfers an amount of data on the order of the original file size.


gongotar commented on July 24, 2024

I tested the XOR scheme with 1 and 2 tasks per node. However, testing more tasks per node would be complicated because of the slurm max memory limit set by the administrator. The results of the XOR tests are as follows:

2 nodes, 1 task per node: duration 17s
2 nodes, 2 tasks per node: duration 32s
16 nodes, 1 task per node: duration 13s
16 nodes, 2 tasks per node: duration 22s

Then I tested the Partner scheme as well by setting SCR_COPY_TYPE=PARTNER (instead of XOR, only using Partner). The performance was as follows:

2 nodes, 1 task per node: duration 20s
2 nodes, 2 tasks per node: duration 35s
16 nodes, 1 task per node: duration 17s
16 nodes, 2 tasks per node: duration 30s

However, I tested the speed of transmitting 2 GB of checkpoints manually with MPI_Send and MPI_Recv. The whole data was transferred to the other node in 1.8 s, which is far shorter than the extra overhead I experience with the PARTNER scheme. The duration I get (20 s) does not match the following estimate:

PARTNER duration = write file + read file back + network transfer of file + write file again
= 1s + 1s + 1.8s + 1s = 4.8s

And here I believe we don't have computations as in the XOR scheme.

Do you have an idea why the PARTNER scheme (and possibly XOR) suffers such performance degradation? It is also notable that during the tests my job was the only one running on the whole cluster.


adammoody commented on July 24, 2024

Yes, this shows a big jump going from 1 process per node to 2 procs/node. We also have some internal timers in SCR that could shed some light. You should get some timings for applying the redundancy schemes with the following:

SCR_DEBUG=1

Can you run with that and let me know what it shows?


gongotar commented on July 24, 2024

Below you find the debug output for running two ranks (each 1 GB checkpoint) on two nodes with following scr.conf:

SCR_COPY_TYPE=FILE
CKPT=0 INTERVAL=1 GROUP=NODE   STORE=/tmp TYPE=SINGLE
CKPT=1 INTERVAL=2 GROUP=NODE   STORE=/tmp TYPE=PARTNER

As you can see, after every SINGLE checkpoint the next one will be PARTNER (SINGLE interval = 1, PARTNER interval = 2). This is the debug output:

SCR v1.2.0: rank 0 on cumu01-00: scr_fetch_files: return code 1, 0.000268 secs
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.1
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.003037 secs, 1.937768e+09 bytes, 608518.680197 MB/s, 304259.340099 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 1 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.2
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 6.709496 secs, 1.937768e+09 bytes, 275.430538 MB/s, 137.715269 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 2 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.3
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.002570 secs, 1.937768e+09 bytes, 718946.433969 MB/s, 359473.216984 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 3 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.4
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 6.425101 secs, 1.937768e+09 bytes, 287.621932 MB/s, 143.810966 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 4 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.5
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.003041 secs, 1.937768e+09 bytes, 607615.733548 MB/s, 303807.866774 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 5 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.6
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 6.422245 secs, 1.937768e+09 bytes, 287.749852 MB/s, 143.874926 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 6 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.7
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.002381 secs, 1.937768e+09 bytes, 776124.303901 MB/s, 388062.151950 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 7 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.8
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 6.502498 secs, 1.937768e+09 bytes, 284.198482 MB/s, 142.099241 MB/s per proc

Below is the debug output for running 4 ranks on two nodes (1 GB checkpoint per rank, 2 ranks per node) with the same config:

SCR v1.2.0: rank 0 on cumu01-00: scr_fetch_files: return code 1, 0.000181 secs
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.1
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.004853 secs, 3.875537e+09 bytes, 761524.691282 MB/s, 190381.172821 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 1 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.2
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 11.135049 secs, 3.875537e+09 bytes, 331.924913 MB/s, 82.981228 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 2 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.3
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.004522 secs, 3.875537e+09 bytes, 817257.057600 MB/s, 204314.264400 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 3 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.4
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 10.765684 secs, 3.875537e+09 bytes, 343.313079 MB/s, 85.828270 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 4 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.5
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.004356 secs, 3.875537e+09 bytes, 848433.270862 MB/s, 212108.317715 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 5 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.6
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 10.763343 secs, 3.875537e+09 bytes, 343.387727 MB/s, 85.846932 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 6 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.7
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.004521 secs, 3.875537e+09 bytes, 817572.321941 MB/s, 204393.080485 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 7 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.8
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 10.737765 secs, 3.875537e+09 bytes, 344.205717 MB/s, 86.051429 MB/s per proc


adammoody commented on July 24, 2024

Hmm, something strange is going on, but the problem is still not clear. Adding a second rank per node nearly doubles your cost, suggesting that something is slow and is effectively serializing the work.

For comparison, here are results I see writing with SINGLE to ramdisk, PARTNER to ramdisk, and PARTNER to an SSD:

SCR_DEBUG=1
SCR_COPY_TYPE=FILE
STORE=/tmp   GROUP=NODE   COUNT=1
STORE=/l/ssd GROUP=NODE   COUNT=1
CKPT=0 INTERVAL=1 GROUP=NODE   STORE=/tmp   TYPE=SINGLE
CKPT=1 INTERVAL=2 GROUP=NODE   STORE=/tmp   TYPE=PARTNER
CKPT=2 INTERVAL=3 GROUP=NODE   STORE=/l/ssd TYPE=PARTNER
srun -n2 -N2 ./test_api -s 1GB
Init: Min 0.011031 s    Max 0.011060 s  Avg 0.011045 s
No checkpoint to restart from
At least one rank (perhaps all) did not find its checkpoint
SCR v1.2.0: rank 0 on catalyst322: Starting dataset ckpt.1
SCR v1.2.0: rank 0 on catalyst322: scr_reddesc_apply: 0.000039 secs, 2.147484e+09 bytes, 52698985.595092 MB/s, 26349492.797546 MB/s per proc
Completed checkpoint 1.
SCR v1.2.0: rank 0 on catalyst322: Deleting dataset 1 from cache
SCR v1.2.0: rank 0 on catalyst322: Starting dataset ckpt.2
SCR v1.2.0: rank 0 on catalyst322: scr_reddesc_apply: 1.380486 secs, 2.147484e+09 bytes, 1483.535250 MB/s, 741.767625 MB/s per proc
Completed checkpoint 2.
SCR v1.2.0: rank 0 on catalyst322: Starting dataset ckpt.3
SCR v1.2.0: rank 0 on catalyst322: scr_reddesc_apply: 2.556553 secs, 2.147484e+09 bytes, 801.078681 MB/s, 400.539340 MB/s per proc
Completed checkpoint 3.

Then when I run 2 ranks per node, the PARTNER cost to ramdisk is about the same at 1.4 secs, and the PARTNER cost to SSD increases from 2.5 to 3.5 seconds.

srun -n4 -N2 ./test_api -s 1GB
Init: Min 0.013269 s    Max 0.013282 s  Avg 0.013274 s
No checkpoint to restart from
At least one rank (perhaps all) did not find its checkpoint
SCR v1.2.0: rank 0 on catalyst322: Starting dataset ckpt.1
SCR v1.2.0: rank 0 on catalyst322: scr_reddesc_apply: 0.000060 secs, 4.294967e+09 bytes, 68719477.280000 MB/s, 17179869.320000 MB/s per proc
Completed checkpoint 1.
SCR v1.2.0: rank 0 on catalyst322: Deleting dataset 1 from cache
SCR v1.2.0: rank 0 on catalyst322: Starting dataset ckpt.2
SCR v1.2.0: rank 0 on catalyst322: scr_reddesc_apply: 1.401434 secs, 4.294967e+09 bytes, 2922.719729 MB/s, 730.679932 MB/s per proc
Completed checkpoint 2.
SCR v1.2.0: rank 0 on catalyst322: Starting dataset ckpt.3
SCR v1.2.0: rank 0 on catalyst322: scr_reddesc_apply: 3.538541 secs, 4.294967e+09 bytes, 1157.539129 MB/s, 289.384782 MB/s per proc
Completed checkpoint 3.

By default, SCR operates on file chunks of 128KB. We can try different sizes here to see whether that helps in either accessing your storage or sending via the network. What do you see with the following:

SCR_MPI_BUF_SIZE=1MB
or
SCR_MPI_BUF_SIZE=8MB


gongotar commented on July 24, 2024

For SCR_MPI_BUF_SIZE=1MB the following is the debug output for 4 ranks and 2 nodes (1 GB checkpoint for each rank):

SCR v1.2.0: rank 0 on cumu01-00: scr_fetch_files: return code 1, 0.000197 secs
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.1
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.004813 secs, 3.875537e+09 bytes, 767978.161727 MB/s, 191994.540432 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 1 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.2
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 10.398966 secs, 3.875537e+09 bytes, 355.419939 MB/s, 88.854985 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 2 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.3
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.004317 secs, 3.875537e+09 bytes, 856185.795936 MB/s, 214046.448984 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 3 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.4
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 9.737706 secs, 3.875537e+09 bytes, 379.555516 MB/s, 94.888879 MB/s per proc

And for 2 ranks and 2 nodes:

SCR v1.2.0: rank 0 on cumu01-00: scr_fetch_files: return code 1, 0.000205 secs
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.1
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.002839 secs, 1.937768e+09 bytes, 650825.941805 MB/s, 325412.970902 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 1 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.2
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 5.480171 secs, 1.937768e+09 bytes, 337.215761 MB/s, 168.607880 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 2 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.3
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.002584 secs, 1.937768e+09 bytes, 715221.226238 MB/s, 357610.613119 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 3 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.4
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 5.936323 secs, 1.937768e+09 bytes, 311.303826 MB/s, 155.651913 MB/s per proc

Now for SCR_MPI_BUF_SIZE=8MB, 4 ranks 2 nodes:

SCR v1.2.0: rank 0 on cumu01-00: scr_fetch_files: return code 1, 0.000176 secs
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.1
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.004576 secs, 3.875537e+09 bytes, 807654.355968 MB/s, 201913.588992 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 1 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.2
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 10.846292 secs, 3.875537e+09 bytes, 340.761607 MB/s, 85.190402 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 2 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.3
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.004633 secs, 3.875537e+09 bytes, 797811.029030 MB/s, 199452.757257 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 3 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.4
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 10.578323 secs, 3.875537e+09 bytes, 349.393771 MB/s, 87.348443 MB/s per proc

SCR_MPI_BUF_SIZE=8MB, 2 ranks 2 nodes:

SCR v1.2.0: rank 0 on cumu01-00: scr_fetch_files: return code 1, 0.000202 secs
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.1
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.002934 secs, 1.937768e+09 bytes, 629960.388645 MB/s, 314980.194323 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 1 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.2
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 5.711462 secs, 1.937768e+09 bytes, 323.559908 MB/s, 161.779954 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 2 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.3
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 0.002567 secs, 1.937768e+09 bytes, 719981.416255 MB/s, 359990.708127 MB/s per proc
SCR v1.2.0: rank 0 on cumu01-00: Deleting dataset 3 from cache
SCR v1.2.0: rank 0 on cumu01-00: Starting dataset scr.dataset.4
SCR v1.2.0: rank 0 on cumu01-00: scr_reddesc_apply: 5.853945 secs, 1.937768e+09 bytes, 315.684574 MB/s, 157.842287 MB/s per proc

Clearly I see no big difference with either buffer size. Maybe I can do a deeper analysis of the SCR code to find out where the bottleneck is. Or are there perhaps deeper debug levels in SCR that provide more detailed information I could use?


adammoody commented on July 24, 2024

There is a higher debug level of SCR_DEBUG=2. I don't think that will provide much more info in this case.

It would be a useful test to run using ramdisk, e.g., set your cache to use /dev/shm. That should provide very high disk bandwidth. If the slowdown is still there, that would point more to the network. If it goes away, that would point more to the storage.

You could also try some tests in which you comment out portions of the swap_file code. For example, you could comment out the write(), read(), or send/recv calls in the while loop of this function:

while (sending || receiving) {

This is the routine that partner copy uses to copy a file from one node to another. Commenting these lines out would lead to incorrect code, but it could help in isolating the cause of the slowdown.


gongotar commented on July 24, 2024

Sorry for the late answer on this! I changed the scr.conf file to use /dev/shm as follows:

SCR_COPY_TYPE=FILE
CKPT=0 INTERVAL=1 GROUP=NODE   STORE=/dev/shm TYPE=SINGLE

This gives me following warnings and causes the job to be killed:

SCR v1.2.0 WARNING: rank 1 on cumu01-01: Failed to find store descriptor named /dev/shm @ <SCR_DIRECTORY>/src/scr_reddesc.c:504
SCR v1.2.0 WARNING: rank 0 on cumu01-00: Failed to find store descriptor named /dev/shm @ <SCR_DIRECTORY>/src/scr_reddesc.c:504
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: cumu01-00: task 0: Exited with exit code 255
srun: error: cumu01-01: task 1: Killed

However, the /dev/shm path is available on every node. Is there something I should do so that SCR is able to find the store descriptor?

EDIT: I have successfully tested the shared memory /dev/shm by setting the cache directory in the slurm wrapper using export SCR_CACHE_BASE=/dev/shm. However, it apparently is not possible to set the cache directory to /dev/shm from the scr.conf file.
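For reference, the workaround described above can be sketched as a small job-wrapper snippet (the application name is a placeholder, and the launch line will depend on your scheduler):

```shell
# Workaround sketch: select ramdisk as the SCR cache via the environment,
# since setting it in scr.conf did not take effect here.
export SCR_CACHE_BASE=/dev/shm
mkdir -p "$SCR_CACHE_BASE"
# then launch the job as usual, e.g.:
# srun -n 2 ./my_scr_app   # my_scr_app is a placeholder
```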

The performance I get using shared memory /dev/shm is more reasonable with respect to the duration of the PARTNER scheme. It approximately matches the estimates, as follows (for a job with two ranks on two nodes):

/dev/shm:

SINGLE: 0.33 seconds
PARTNER: 1.9 seconds

in contrast to /tmp:

SINGLE: 0.4 seconds
PARTNER: 6.5 seconds

It appears that the problem is related to the nodes' local storage. I will investigate the problem a little more to get a better understanding of how it happens. Do you have any idea what the problem could be?

I have also commented out the while loop at scr/src/scr_cache_rebuild.c:163 as suggested. Running the code gives the following errors after the modification:

SCR v1.2.0 ERROR: rank 1 on cumu01-01: Received file does not match expected size <SCR_CACHE_BASE>/scr.201/scr.dataset.2/rank_0.ckpt @ <SCR_DIRECTORY>/src/scr_cache_rebuild.c:462
SCR v1.2.0 ERROR: rank 1 on cumu01-01: scr_copy_files failed with return code 1 @ <SCR_DIRECTORY>/src/scr_reddesc_apply.c:445
SCR v1.2.0 ERROR: rank 0 on cumu01-00: Received file does not match expected size <SCR_CACHE_BASE>/scr.201/scr.dataset.2/rank_1.ckpt @ <SCR_DIRECTORY>/src/scr_cache_rebuild.c:462
SCR v1.2.0 ERROR: rank 0 on cumu01-00: scr_copy_files failed with return code 1 @ <SCR_DIRECTORY>/src/scr_reddesc_apply.c:445

However, I wrote some code in the scr/src/scr_cache_rebuild.c file to print out the duration of the while loop. It turned out that the loop is not the problem, because its duration is approximately as expected (1.5 seconds when using /tmp, which is the transfer duration plus the write at the destination). It is almost the same when using shared memory /dev/shm (1.4 seconds). SCR probably hangs somewhere else.

Could you maybe point me to some critical code blocks in the SCR code so that I can measure their duration and isolate the problem?

from scr.

gongotar avatar gongotar commented on July 24, 2024

I found the bottleneck. It is in fact the scr_close calls right after the while loop; they take all the extra time not included in the estimates. The lengthy operation in scr_close is the fsync call. The same holds for the XOR scheme; it appears that no fsync is called for the SINGLE scheme.
I should investigate the behavior of fsync to find out why it takes so long.

from scr.

adammoody avatar adammoody commented on July 24, 2024

You could probably also create a STORE=/dev/shm entry in your configuration file as another way to clear up those store descriptor errors.
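A configuration file along those lines might look like the following. This is a sketch modeled on the scr.conf excerpts earlier in this thread; the GROUP, COUNT, and TYPE values depend on your setup.

```
SCR_COPY_TYPE=FILE
STORE=/dev/shm  GROUP=NODE  COUNT=1
CKPT=0 INTERVAL=1 GROUP=NODE  STORE=/dev/shm  TYPE=SINGLE
```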

It's good to see that ramdisk provides better behavior. We generally recommend storing checkpoint data in ramdisk if you can. Not all applications have enough spare memory to do that, but if you can, it's very fast.

Nice work in tracking down fsync as the underlying problem. Let me know if you find a way to improve that. I don't have ideas off hand, but it has to do with the OS and file system flushing data from memory to disk.

To gain some performance, you could consider dropping the fsync call and rebuilding SCR without it. However, for correctness, it is functionally important to keep the fsync.

SCR uses this fsync to be sure the data is on disk before it considers the dataset to be "complete". If you drop the call to fsync and the job were killed before the OS / file system had really written the data to disk, then SCR would think the dataset is valid when it really is not. If you're not too worried about that, you could drop the fsync and just be on the lookout for this failure case occurring.

from scr.

adammoody avatar adammoody commented on July 24, 2024

FYI, very recent updates have resolved the cause for the original reason this ticket was opened.

  1. It is now possible to set SCR_CNTL_BASE at runtime through an environment variable, whereas that previously was not possible.
  2. SCR will now create the default store descriptor, even when there is no explicit entry for it in the config file.
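With the updated behavior described above, selecting a per-user control directory at runtime can be sketched as follows (the paths and launch command are illustrative):

```shell
# Sketch: point SCR's control directory at a per-user location at runtime.
export SCR_CNTL_BASE=/tmp/$USER/scr_cntl
mkdir -p "$SCR_CNTL_BASE"
# then launch the job as usual, e.g.:
# srun -n 2 ./my_scr_app   # my_scr_app is a placeholder
```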

I'll go ahead and close this issue, since it's been a while. Feel free to reopen if needed.

from scr.
