Hi, With the SOAP vectors working perfectly, I turned to a "real" sy

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

With 250 000 atoms, each frame takes 1.5 hours. <p dir=

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Non-linear (quadratic?) scaling of SOAP calculation about dscribe HOT 10 CLOSED

singroup commented on June 5, 2024

Non-linear (quadratic?) scaling of SOAP calculation

from dscribe.

Comments (10)

lauri-codes commented on June 5, 2024

Hi @anjohan!

Wow, 250 k atoms :) Are you actually working with systems of this magnitude?

The linear scaling in classical MD is largely possible if there is no need to calculate neighbour lists. This is typically the case with fixed bonds. As soon as you have to start figuring out which atoms are within the radial cutoff, there is a need to calculate the pairwise distances between atoms.

SOAP is quite interesting in this case. In theory, there is no need to ever calculate pairwise distances. One could just form a gaussian atomic density based on all atoms. This would make the algorithm scale linearly. In practise however, this linear calculation will become quite expensive in terms of time and memory if the atomic density consists of thousands of atoms.

So instead, in our current implementation, only atoms within the cutoff are included in the atomic density when calculating the expansion on individual sites. This requires that we calculate the pairwise distances once. With the current implementation this results in the O(n^2) scaling you here demonstrate. The current distance calculation is quite naive and I can improve it to scale as O(n log n). This faster distance calculation is actually already implemented for many other descriptors in the package. It was not implemented for SOAP as with typical system sizes this distance calculation is negligible, and only starts to dominate the SOAP calculation at very large system sizes.

So in summary, I will implement the faster pairwise distance calculation, which will make it scale O(n log n). I will also see if the fully linear calculation is feasible, but I fear it has such a big constant prefactor that it will not be usable in practice.

from dscribe.

lauri-codes commented on June 5, 2024

With 250 000 atoms, each frame takes 1.5 hours.

With what SOAP settings do you get 1.5 h/configuration?

from dscribe.

anjohan commented on June 5, 2024

Thanks for the quick response!

Wow, 250 k atoms :) Are you actually working with systems of this magnitude?

Yes (unfortunately). I'm using the SOAP vectors to visualise phase transitions for fairly large systems. This may not have been your intended purpose, but it's working superbly!

The linear scaling in classical MD is largely possible if there is no need to calculate neighbour lists. This is typically the case with fixed bonds. As soon as you have to start figuring out which atoms are within the radial cutoff, there is a need to calculate the pairwise distances between atoms.

SOAP is quite interesting in this case. In theory, there is no need to ever calculate pairwise distances. One could just form a gaussian atomic density based on all atoms. This would make the algorithm scale linearly. In practise however, this linear calculation will become quite expensive in terms of time and memory if the atomic density consists of thousands of atoms.

So instead, in our current implementation, only atoms within the cutoff are included in the atomic density when calculating the expansion on individual sites. This requires that we calculate the pairwise distances once. With the current implementation this results in the O(n^2) scaling you here demonstrate. The current distance calculation is quite naive and I can improve it to scale as O(n log n). This faster distance calculation is actually already implemented for many other descriptors in the package. It was not implemented for SOAP as with typical system sizes this distance calculation is negligible, and only starts to dominate the SOAP calculation at very large system sizes.

So in summary, I will implement the faster pairwise distance calculation, which will make it scale O(n log n). I will also see if the fully linear calculation is feasible, but I fear it has such a big constant prefactor that it will not be usable in practice.

Hm. Thanks to dscribe, I haven't had to dive too deeply into the descriptor calculations, but I was hoping that it should be linearly scaling due to the cutoff. From what I understand, SOAP vectors rely on summing Gaussians centred at each neighbour and then projecting this onto some basis. This is conceptually similar to computing short-range forces in MD, so I was hoping the spatial binning trick would work, i.e.

Divide space into regions of size rcut x rcut x rcut.
For each atom, its neighbours are found either inside the same region or inside the nearest 26 regions.

The time thus ends up being proportional to (number of atoms) x (number of neighbours), where the number of neighbours is approximately rho x (27rcut^3), rho being the density. While the number of neighbours is a pretty large prefactor, it's still overall linear, and with N=250k, the number of neighbours per atom is very small compared to the total number of atoms.

Would this not work for SOAP vectors?

I did look around inside the libsoap directory, but the code is (understandably) quite involved.

From what I gather, the getFilteredPos function is called inside to soap function to collect the neighbours of a given atom (or position) by looping over all the atoms and determining the ones which are inside the cutoff sphere.

If you create a spatial binning of the atoms, the getFilteredPos function would only need to loop over a small subset of the atoms to find the neighbours.

Sorry if my explanation is rambly, here's some pseudo-code:

def soap(...):
    atomlist = way of storing atoms
    L = system size (assuming cubic)
    rcut = cutoff
    Nbins = ceil(L/rcut)
    binsize = L/Nbins
    all_atoms = atomlist with all the atoms
    soap_positions = where SOAPs should be calculated

    atomlists = atomlist[Nbins][Nbins][Nbins] # array of empty atom lists

    # do spatial binning, fill atomlists
    for atom in all_atoms:
        i = atom.x/binsize
        j = atom.y/binsize
        k = atom.z/binsize

        atomlists[i][j][k].append(atom)

    for position in soap_positions:
        neighbours = getFilteredPos(position, atomlists)
        calculate SOAPs based on neighbours

def getFilteredPos(position, atomlists):
    # Determine which bin
    this_i = position.x/binsize
    this_j = position.y/binsize
    this_k = position.z/binsize

    neighbours = []

    # Loop over neighbouring bins
    for i in this_i+[-1,0,1]:
        for j in this_j+[-1,0,1]:
            for k in this_j+[-1,0,1]:
                for atom in atomlists[i][j][k]:
                    if norm(atom.position - position) < rcut:
                        neighbours.append(atom)

    return neighbours

With what SOAP settings do you get 1.5 h/configuration?

rcut = 6.0, nmax=lmax=3. I'm essentially running the system in my example above with 50x50x50 unit cells. With 18x18x18 unit cells (11664 atoms), each frame took 10-20 seconds. These runs are on fairly standard intel nodes on the cluster. Since I have to calculate SOAP vectors for 100-1000 of these frames, I've parallelised it with mpi4py.

from dscribe.

lauri-codes commented on June 5, 2024

Thanks for the detailed answer! You're application sounds very interesting and I would be very interested to see what comes out of such visualization.

I think you are right and the binning might be the best solution since there is only one global cutoff. Should have thought about that earlier, thanks for pointing it out!

I will try my hand at implementing and testing it. Unfortunately I cannot promise any schedule, as I don't right now have too much time to spend on this.

from dscribe.

anjohan commented on June 5, 2024

No worries, this isn't urgent. The SOAP vector calculation still takes much less time than the MD simulation, since I'm only calculating descriptors for a small subset of the frames and using mpi4py.

I should also be able to add a naive (but linearly scaling) binning myself, now that I understand more of libsoap. I may have some time later this week.

from dscribe.

anjohan commented on June 5, 2024

I've started working on an implementation.

from dscribe.

lauri-codes commented on June 5, 2024

That's great! External contributors are very welcome.

Will you be doing this on the python side or in C/C++? For now this doesn't really matter, but eventually, I would want this functionality in the C/C++ side and available to all the other descriptors as well.

from dscribe.

anjohan commented on June 5, 2024

In C. As a result, I'm currently enjoying some segfaults.

from dscribe.

lauri-codes commented on June 5, 2024

Have fun :>

from dscribe.

lauri-codes commented on June 5, 2024

Thanks @anjohan for contributing the spatial binning algorithm! It is now fully tested and integrated into SOAP in version 0.3.2.

from dscribe.

Non-linear (quadratic?) scaling of SOAP calculation about dscribe HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent