Hi, I was running a spatial dataset of size 2,419 sampling units, 17


Spatial Model running extremely slow about hmsc HOT 6 OPEN

LamuelCH commented on August 16, 2024

Spatial Model running extremely slow

from hmsc.

Comments (6)

gtikhonov commented on August 16, 2024

The runtimes you have observed look too long. Given that your number of species is small, I suspect that the spatial component is not performing well. Based on our internal performance comparisons across various data sizes, I would dare to say that your runs are roughly 10 times slower than most equivalent tasks on my laptop, which is no HPC at all.

There are two potential reasons that I have in my mind right now:

1. If you are running chains in parallel (which you are), some R distribution and OS combinations are known to get bogged down by the interplay between cross-chain parallelization (intended) and the default within-chain parallelization of the linear algebra routines being called (not really intended). Can you rerun your short runs with nChains=1 or with nParallel=1 and report whether sec/iteration differs significantly?
2. The NNGP approximation algorithm is sensitive to the order of spatial units. E.g., if you rename your spatial units in a way that perturbs their order, the approximation will be different. The rule of thumb is that the order should keep neighbours from ending up far apart in the order. If the order is random, then technically NNGP can be as slow as a full-covariance GP. HMSC does not handle this aspect automatically. Could you please check whether your spatial units are ordered (in terms of factor/string values) in a reasonable way, e.g. along the longest axis of your study area?

The nParallel in predictions can be different from its value in the sampling phase.
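
To follow up on the first point, here is a minimal sketch for checking what the linear algebra layer is doing on a given machine, independently of Hmsc. It prints the BLAS/LAPACK libraries that R is linked against and times a dense Cholesky factorisation as a rough proxy. The helper `time_chol`, the matrix size, and the optional use of the separate CRAN package RhpcBLASctl are assumptions made for illustration, not something suggested in this thread.

```r
# Report which BLAS/LAPACK libraries this R session is linked against.
cat("BLAS:  ", extSoftVersion()[["BLAS"]], "\n")
cat("LAPACK:", La_library(), "\n")

# Hypothetical helper: average time of a dense Cholesky factorisation.
# n and reps are arbitrary illustration values.
time_chol <- function(n = 400, reps = 3) {
  A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)  # symmetric positive definite
  elapsed <- system.time(for (i in seq_len(reps)) chol(A))[["elapsed"]]
  elapsed / reps
}

# Optional: if RhpcBLASctl happens to be installed, force single-threaded BLAS
# so that within-chain threading cannot compete with cross-chain parallelism.
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  cat("BLAS threads before:", RhpcBLASctl::blas_get_num_procs(), "\n")
  RhpcBLASctl::blas_set_num_threads(1)
}

sec_per_chol <- time_chol()
cat("seconds per chol():", sec_per_chol, "\n")
```

Running the same script on both machines gives a baseline: if the plain `chol()` timing already differs by a large factor, the slowdown is probably in the BLAS setup rather than in the model.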


gtikhonov commented on August 16, 2024

@MartinStjernman you need to rename the rows of sData so that their lexicographic order matches the desired one. Personally I typically add some numerical prefix, like 0001_first_site_original_name, 0002_second_site_original_name. Also, you would need to update the corresponding column of studyDesign accordingly.

I am quite sceptical whether TSP is best suited for this problem. First of all, you do not need to return to the origin in the NNGP scenario. Next, it is not the total distance that we are worried about, but that neighbours are not too far apart in the resulting order. My guess is that you can simply order along the lon/lat in many cases. Preferably, you should project onto the leading eigenvector (first principal component) of your sites' coordinates.

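# Simulate N sites in a 2 x 1 rectangle
N = 100
X = cbind(2*runif(N), runif(N))
plot(X[,1], X[,2])
# Project the coordinates onto the first principal component
pc <- prcomp(X)
proj = X %*% pc$rotation[,1]
# The rank of each site along that axis is its position in the order
optOrder = rank(proj)
# Plot each site labelled with its position in the order
plot(X[,1], X[,2], type="n")
text(X[,1], X[,2], optOrder)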

Of course, there are exceptions - if you are studying some coastal communities, then the best way would be to order along the coast.
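
The renaming step described at the top of this comment could look roughly like the sketch below. It assumes that sData is a coordinate matrix with site names as row names and that studyDesign has a factor column called `site`; the site names, the column name, and the four-digit prefix width are hypothetical and only serve the illustration.

```r
set.seed(1)
N <- 6
# Hypothetical coordinates with arbitrary site names (stand-in for a real sData)
sData <- cbind(lon = 2 * runif(N), lat = runif(N))
rownames(sData) <- c("kappa", "alpha", "zeta", "beta", "omega", "delta")
# Hypothetical study design: two visits per site, rows in arbitrary order
studyDesign <- data.frame(site = factor(sample(rep(rownames(sData), 2))))

# Position of each site along the first principal component
pc <- prcomp(sData)
ord <- rank(pc$x[, 1], ties.method = "first")

# Zero-padded numerical prefix, so lexicographic order equals the desired order
newNames <- sprintf("%04d_%s", ord, rownames(sData))
oldToNew <- setNames(newNames, rownames(sData))

# Apply the same renaming to sData and to the matching studyDesign column
rownames(sData) <- newNames
studyDesign$site <- factor(unname(oldToNew[as.character(studyDesign$site)]))

# Sorting the new names now reproduces the order along the first PC
sDataSorted <- sData[sort(rownames(sData)), ]
print(sDataSorted)
```

Keeping the original name after the prefix makes it easy to map results back to the original sites later.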


LamuelCH commented on August 16, 2024

Thanks heaps!!!

After spatially sorting the observations by nearest neighbour, the speed gain is huge! Now with thin = 500 I can finish within 5 hours!! Thanks for keeping my PhD alive :D

But this only works on my personal Mac Studio. When I tried to deploy it on an HPC with a Linux system, the spatial sorting did not seem to help. Would you happen to have any ideas? I might need to increase the number of sampling units and species later on, at which point my personal Mac may become a bottleneck.


LamuelCH commented on August 16, 2024

Using parallel = 1, nChains = 1, samples = 250, thin = 1 and transient = 50,
the running time on my own Mac Studio is
[1] "MODEL START: Mon May 27 23:45:53 2024"
[1] "MODEL END: Mon May 27 23:45:58 2024"

5 seconds

but the same thing on HPC
[1] "MODEL START: Mon May 27 23:42:39 2024"
[1] "MODEL END: Mon May 27 23:48:22 2024"

nearly 6 minutes

below is my observation row order in my data
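
One way to check whether a given row order is reasonable in the sense of the rule of thumb discussed in this thread is to measure, for each site, how far away in the row order its nearest spatial neighbours sit. The sketch below is a heuristic diagnostic on simulated coordinates, not part of Hmsc; the function name `max_neighbour_gap` and the choice of k = 5 neighbours are assumptions.

```r
# For each site, find its k nearest neighbours and return the largest gap,
# in row positions, between the site and any of those neighbours.
max_neighbour_gap <- function(coords, k = 5) {
  D <- as.matrix(dist(coords))
  n <- nrow(D)
  sapply(seq_len(n), function(i) {
    nn <- setdiff(order(D[i, ]), i)[seq_len(k)]  # k closest sites, excluding i itself
    max(abs(nn - i))
  })
}

set.seed(1)
N <- 200
X <- cbind(2 * runif(N), runif(N))    # simulated sites, rows in random order
pcOrder <- order(prcomp(X)$x[, 1])    # row indices sorted along the first PC

gapRandom <- max_neighbour_gap(X)
gapSorted <- max_neighbour_gap(X[pcOrder, ])

print(summary(gapRandom))
print(summary(gapSorted))
```

Smaller gaps mean that spatial neighbours are also close in the row order. Applying the function to the real coordinates, in the order implied by the sorted site names, shows whether the current naming is already adequate.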


MartinStjernman commented on August 16, 2024

Hi,
If I may chime in on this, as I also have problems with long running times and am looking for anything that can speed things up, I wonder about the sorting of observations suggested as one solution.

What exactly is sorted, is it the XData/studyDesign objects or the object provided as sData when constructing the random level object or both?
It seems the improvement LamuelCH achieved by nearest-neighbour sorting used a Travelling Salesman Problem (TSP) "algorithm", and I wonder a) is this a good method for satisfying the NNGP algorithm's requirements and b) what package/function was used to get the ordering according to TSP?

Any help is highly appreciated!

Thanks!


MartinStjernman commented on August 16, 2024

Thanks a lot Gleb!

I take it the reason I need to have names of my sites (i.e. rownames in sData), such that its lexicographic order matches the desired one, is that sData is sorted "under the hood" when constructing the random level object using HmscRandomLevel() (i.e. the step: rL$pi = as.factor(sort(rownames(sData))) ).
I will try this out although I think that my sites are already quite well sorted (site names are "sort of" coordinates).
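
A quick way to test whether a set of names already survives sorting is sketched below. It assumes, as described above, that the row names are passed through `sort()` when the random level is built; the helper `order_survives_sort` and the site labels are hypothetical.

```r
# TRUE if sorting the names leaves them in the same (intended) order
order_survives_sort <- function(nm) identical(sort(nm), nm)

# Hypothetical labels listed in the intended spatial order
desired <- paste0("site", 1:12)
print(order_survives_sort(desired))  # "site10" sorts before "site2"

# Zero padding makes lexicographic and numeric order agree
padded <- sprintf("site%02d", 1:12)
print(order_survives_sort(padded))
```

For real data, `nm` would be the site names listed in the intended spatial order.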
I have, if I may, one additional question. My sites are aggregated in small clusters (cluster is also included as a non-spatial/unstructured random effect) and I have adjusted the alphapw prior for the site random effect to the scale of sites within clusters. With such a "local" prior, is it still of benefit (for speed) to spatially sort the clusters or is it enough for the sites to be spatially sorted within clusters?

Thanks again for the excellent package and help!

