Execution error about schism (open)

schism-dev avatar schism-dev commented on June 27, 2024
Execution error

from schism.

Comments (24)

josephzhang8 avatar josephzhang8 commented on June 27, 2024

josephzhang8 avatar josephzhang8 commented on June 27, 2024

josephzhang8 avatar josephzhang8 commented on June 27, 2024

josephzhang8 avatar josephzhang8 commented on June 27, 2024

brey avatar brey commented on June 27, 2024

Thanks @josephzhang8 . Two questions.

Is the mesh loaded in its entirety on one node before partitioning? If so, how much memory might we need for such a big mesh? I suppose this puts a high memory demand on the master node, no?

Can you point me to the documentation on how to use static partitioning?

pmav99 avatar pmav99 commented on June 27, 2024

@josephzhang8 Thank you! The VMs we use have 448GB RAM and 120 cores. But we can use fewer cores, e.g. 96 per node which would give us something like 4.5GB/core. So this amount of RAM per core should be doable.

Are the instructions the same for schism 5.9? That is the version we currently use.

josephzhang8 avatar josephzhang8 commented on June 27, 2024

pmav99 avatar pmav99 commented on June 27, 2024

I tried to follow the instructions for static partitioning, but the METIS preparation step (i.e. step 2) fails with a segmentation fault. The problem is that:

  1. We have a global mesh with full-resolution coastlines. This translates to 180491 land boundaries, the smallest consisting of 3 nodes and the largest (Eurasia+Africa) of 1181108 nodes.
  2. The code tries to allocate a single 2D array for all the land boundaries, so it needs enough RAM for 8 * 180491 * 1181108 bytes, i.e. about 1705GB, and this obviously fails.
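
The arithmetic in item 2 can be reproduced with a few lines of Fortran. This is only a sketch: the program name is made up, and the 8 bytes per entry is the figure assumed in the estimate above (with 4-byte integers the total would be half). The product is computed in 64-bit integers because it overflows a default 32-bit integer.

```fortran
program ilnd_memory_estimate
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none

  ! Counts quoted in the list above
  integer(int64), parameter :: nland = 180491_int64    ! number of land boundaries
  integer(int64), parameter :: mnlnd = 1181108_int64   ! nodes in the longest boundary
  ! Assumption taken from the estimate above: 8 bytes per array entry
  integer(int64), parameter :: bytes_per_entry = 8_int64

  integer(int64) :: nbytes

  ! A dense (nland, mnlnd) array stores nland*mnlnd entries,
  ! no matter how short most of the boundaries are.
  nbytes = bytes_per_entry * nland * mnlnd

  print '(a,i0)', 'bytes: ', nbytes
  print '(a,f0.1,a)', 'approx. ', real(nbytes, real64) / 1.0e9_real64, ' GB'
end program ilnd_memory_estimate
```

It should build with any recent Fortran compiler, e.g. `gfortran ilnd_memory_estimate.f90 && ./a.out`.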

The good news is that, from what I understood, the METIS preparation script does not really use the ilnd table. If we comment out the lines referencing ilnd, the script runs and produces the graphinfo file. The relevant block of code is:

    allocate(ilnd(nland,mnlnd),stat=stat)
    ! Aquire global land boundary segments and nodes
    rewind(14); read(14,*); read(14,*);
    do i=1,np; read(14,*); enddo;
    do i=1,ne; read(14,*); enddo;
    read(14,*); read(14,*);
    do k=1,nope; read(14,*) nn; do i=1,nn; read(14,*); enddo; enddo;
    read(14,*); read(14,*);
    nlnd=0; ilnd=0;
    do k=1,nland
      read(14,*) nn
      do i=1,nn
        read(14,*) ip
        nlnd(k)=nlnd(k)+1
        ilnd(k,nlnd(k))=ip
        if(isbnd(ip)==0) isbnd(ip)=-1 !overlap of open bnd
      enddo !i
    enddo !k

@josephzhang8 Can you confirm that ilnd is indeed not needed for the metis preparation?

BTW, the segmentation fault happens when we first try to assign a value to ilnd (i.e. line 287). Checking stat after the allocation (line 272) would make it a bit easier to figure out what is going on.
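
For illustration, the kind of check meant here could look like the sketch below. It is not the actual patch: the program name, the tiny placeholder sizes and the message text are invented, and a standalone utility is assumed, so it stops with `error stop` instead of the `parallel_abort` call used in the main code. The point is that when `stat=` is present a failed `allocate` does not stop the program, so without the test the first write to the array is what crashes.

```fortran
program alloc_check_sketch
  implicit none
  integer, allocatable :: ilnd(:,:)
  integer :: nland, mnlnd, istat
  character(len=200) :: errmsg

  ! Placeholder sizes for the sketch; the real values are read from the grid file
  nland = 3
  mnlnd = 5

  errmsg = ''
  allocate(ilnd(nland, mnlnd), stat=istat, errmsg=errmsg)
  if (istat /= 0) then
    ! Report the failure where it happens instead of segfaulting later
    write(*, '(a,i0,2a)') 'ilnd allocation failed, stat=', istat, ': ', trim(errmsg)
    error stop 1
  end if

  ilnd = 0
  write(*, '(a,i0,a)') 'allocated ilnd with ', size(ilnd), ' entries'
end program alloc_check_sketch
```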

If you want, I can make a PR to remove ilnd or add a check after the allocation. No problem if you'd rather fix it on your end, too.

All that being said, I think that our main problem remains. If I understand the code correctly (and I should mention that my Fortran knowledge is nothing to speak of), the main schism code also tries to do the exact same allocation. The relevant lines are:

    allocate(ilnd_global(nland_global,mnlnd_global),stat=stat);
    if(stat/=0) call parallel_abort('AQUIRE_HGRID: ilnd_global allocation failure')

If this is true, then for the grid in question we do need 1705GB per process, which unfortunately is not really feasible...

josephzhang8 avatar josephzhang8 commented on June 27, 2024

pmav99 avatar pmav99 commented on June 27, 2024

Thank you @josephzhang8
We did test the division of the land boundaries on a smaller model and indeed it seems to be working fine. We will let you know how it goes after testing it on the global model, too.
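
A rough sketch of why dividing the land boundaries helps: a dense 2D table needs (number of boundaries) x (longest boundary) entries, so capping the length of the longest boundary shrinks the table even though the boundary count goes up. The boundary lengths and the cap below are hypothetical, not taken from our mesh, and the sketch ignores whether neighbouring segments would share an end node.

```fortran
program split_boundaries_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none

  ! Hypothetical boundary lengths (nodes per land boundary)
  integer, parameter :: lengths(4) = [3, 40, 1000, 12]
  ! Hypothetical maximum number of nodes per boundary after splitting
  integer, parameter :: max_len = 100

  integer :: k, nland_new, mnlnd_new
  integer(int64) :: entries_before, entries_after

  ! Dense table size before splitting: count * longest
  entries_before = int(size(lengths), int64) * int(maxval(lengths), int64)

  nland_new = 0
  mnlnd_new = 0
  do k = 1, size(lengths)
    ! Ceiling division: number of pieces boundary k is cut into
    nland_new = nland_new + (lengths(k) + max_len - 1) / max_len
    ! No piece is longer than max_len
    mnlnd_new = max(mnlnd_new, min(lengths(k), max_len))
  end do
  entries_after = int(nland_new, int64) * int(mnlnd_new, int64)

  print '(a,i0)', 'table entries before split: ', entries_before
  print '(a,i0)', 'table entries after split:  ', entries_after
end program split_boundaries_sketch
```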

brey avatar brey commented on June 27, 2024

Dear @josephzhang8, we have split the boundaries on the big mesh and, although the sanity check seems to work, we were unable to run it effectively on Azure. You can find the model here. Hopefully, you can use it as a test case for possible modifications in SCHISM. If you manage to make it work on your end, we would be interested to try it out. In the meantime we'll try something simpler. Thanks.

josephzhang8 avatar josephzhang8 commented on June 27, 2024

josephzhang8 avatar josephzhang8 commented on June 27, 2024

pmav99 avatar pmav99 commented on June 27, 2024

Thank you for looking into this Joseph. Try this: https://ppwdevarchivesa.blob.core.windows.net/seareport/sflux_sample?sp=r&st=2023-06-12T20:57:49Z&se=2023-07-12T04:57:49Z&spr=https&sv=2022-11-02&sr=d&sig=5FZSchXoh1xv1ylZytrxit92%2FN7zBz5xTRnTcikU0mA%3D&sdd=1

josephzhang8 avatar josephzhang8 commented on June 27, 2024

josephzhang8 avatar josephzhang8 commented on June 27, 2024

josephzhang8 avatar josephzhang8 commented on June 27, 2024

brey avatar brey commented on June 27, 2024

Great news!

I know that by forcing it to follow such a convoluted coastline I am asking for trouble. I will try some ways to make it more manageable and let you know. We'll also try pre-partitioning and, with your estimate of RAM/core, we'll give it another try.

Based on your experience, could such a mesh work? I know that the mesh is not balanced, and I wonder whether the skewness of the elements might also cause problems in terms of both stability and accuracy.

pmav99 avatar pmav99 commented on June 27, 2024

@josephzhang8 Try with this: https://ppwdevarchivesa.blob.core.windows.net/seareport/sflux_sample/sflux_air_1.0001.nc?sp=r&st=2023-06-13T06:26:59Z&se=2023-06-17T14:26:59Z&spr=https&sv=2022-11-02&sr=b&sig=YvGUDz5EzKWbLUw3YOt%2BSFzajJTV7txLCszAHcXWHqQ%3D

josephzhang8 avatar josephzhang8 commented on June 27, 2024

pmav99 avatar pmav99 commented on June 27, 2024

@josephzhang8 The NetCDF file is indeed 24GB, but it is uncompressed. Does schism support reading compressed/deflated NetCDF files?

josephzhang8 avatar josephzhang8 commented on June 27, 2024

brey avatar brey commented on June 27, 2024

Joseph, indeed the meteo forcing is every hour. Does that mean that wtiminc should be 3600?

josephzhang8 avatar josephzhang8 commented on June 27, 2024
