Spatial Model running extremely slow (about hmsc, open)

LamuelCH commented on August 16, 2024
Spatial Model running extremely slow

from hmsc.

Comments (6)

gtikhonov commented on August 16, 2024

The runtimes you have observed look somewhat too large. Given that your number of species is small, I suspect that the spatial component is not performing efficiently. Based on our internal performance comparisons across various data sizes, I would dare to say that your runs are approximately 10 times slower than the most comparable tasks on my laptop, which is no HPC at all.

There are two potential reasons that I have in my mind right now:

  1. If you are running chains in parallel (which you do), some combinations of R distribution and OS are known to get tangled up due to the interplay between cross-chain parallelization (intended) and the default within-chain parallelization of the called linear algebra routines (not very much intended). Can you check your short runs with nChains=1 or with nParallel=1 and report whether they differ significantly in terms of sec/iteration?
  2. The NNGP approximation algorithm is sensitive to the order of spatial units. E.g. if you rename your spatial units so that their order is perturbed, the approximation will be different. The rule of thumb is that the order should be such that spatial neighbours are never far apart in the order. If the order is random, then technically NNGP can be as slow as the full covariance GP. HMSC does not handle this aspect automatically. Could you please check whether your spatial units are ordered (in terms of factor/string values) in a somewhat reasonable way, e.g. along the longest axis of your study area?
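
To get a feeling for the rule of thumb in point 2, here is a rough base-R sketch. It is only a crude diagnostic under my own assumptions, not how Hmsc builds its neighbour sets: the helper meanMaxGap, the choice k = 10 and the simulated coordinates are all made up for illustration. For each site it asks how far away in the order its k nearest spatial neighbours sit, and compares a random order with an order along the first principal axis.

```r
set.seed(1)
N <- 200
k <- 10                              # illustrative number of nearest neighbours
X <- cbind(2 * runif(N), runif(N))   # simulated sites in a 2 x 1 rectangle
D <- as.matrix(dist(X))              # pairwise spatial distances

# Hypothetical helper: for a given ordering (pos[i] = position of site i),
# take the largest position gap between each site and its k nearest
# spatial neighbours, then average over all sites.
meanMaxGap <- function(pos) {
  gaps <- sapply(seq_len(N), function(i) {
    nn <- order(D[i, ])[2:(k + 1)]   # drop the first entry, the site itself
    max(abs(pos[nn] - pos[i]))
  })
  mean(gaps)
}

posRandom <- sample(N)               # random order of the spatial units
posAxis <- rank(prcomp(X)$x[, 1])    # order along the first principal axis

gapRandom <- meanMaxGap(posRandom)
gapAxis <- meanMaxGap(posAxis)
cat("random order:", gapRandom, " axis order:", gapAxis, "\n")
```

A much smaller value for the axis order than for the random order is what one would hope to see; running the same check on real site coordinates and their current name order may show whether reordering is worth trying.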

The nParallel in predictions can be different from its value in the sampling phase.

gtikhonov commented on August 16, 2024

@MartinStjernman you need to sort the names of sData rows, so that its lexicographic order matches the desired one. Personally I typically add some numerical prefix, like 0001_first_site_original_name, 0002_second_site_original_name. Also, you would need to update the corresponding column of studyDesign accordingly.
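
A minimal base-R sketch of this renaming, under stated assumptions: the toy sData, the site names and the studyDesign column name "site" are hypothetical, and the target order is taken along the first principal component purely for illustration.

```r
set.seed(1)
# Toy coordinates with site names as row names (hypothetical data)
sData <- cbind(x = 2 * runif(5), y = runif(5))
rownames(sData) <- c("delta", "alpha", "echo", "bravo", "charlie")
studyDesign <- data.frame(
  site = factor(c("alpha", "delta", "delta", "echo", "bravo", "charlie"))
)

# Desired position of each site, here along the first principal component
pos <- rank(prcomp(sData)$x[, 1], ties.method = "first")

# Lookup from old names to new names with a zero-padded position prefix
newNames <- setNames(sprintf("%04d_%s", pos, rownames(sData)), rownames(sData))

# Rename the coordinate rows, then sort them by the new names
rownames(sData) <- unname(newNames[rownames(sData)])
sData <- sData[order(rownames(sData)), , drop = FALSE]

# Apply the same lookup to the matching studyDesign column
studyDesign$site <- factor(unname(newNames[as.character(studyDesign$site)]))

print(rownames(sData))
print(levels(studyDesign$site))
```

The zero padding matters: it keeps the lexicographic order of the names identical to the numeric order of the positions.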

I am quite sceptical whether TSP is best suited for this problem. First of all, you do not need to return to the origin in the NNGP scenario. Next, it is not the distance that we are worried about, but that the neighbours are not too far apart in the resulting order. My guess is that you can simply order along the lon/lat in many cases. Preferably, you should project onto the leading eigenvector (first principal component) of your sites' coordinates.

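# Simulate N sites in a 2 x 1 rectangle
N <- 100
X <- cbind(2*runif(N), runif(N))
plot(X[,1], X[,2])
# Project the coordinates onto the first principal axis
# (no centring here, which does not change the ranks)
pc <- prcomp(X)
proj <- X %*% pc$rotation[,1]
# Position of each site along that axis
optOrder <- rank(proj)
# Plot each site labelled with its position in the order
plot(X[,1], X[,2], type="n")
text(X[,1], X[,2], optOrder)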

Of course, there are exceptions - if you are studying some coastal communities, then the best way would be to order along the coast.

LamuelCH commented on August 16, 2024

Thanks heaps!!!

After spatially sorting the observations by nearest neighbour, the gain in processing speed is huge! Now with thin = 500 I can finish within 5 hours !! Thanks for keeping my PhD alive :D

But this only works on my personal Mac Studio. If I try to deploy it on an HPC with a Linux system, spatial sorting does not seem to help. Would you happen to have any ideas? I might need to increase the number of sampling units and species later on, at which point my personal Mac may become a bottleneck.

LamuelCH commented on August 16, 2024

using parallel = 1, nChains = 1, samples = 250, thin = 1 and transient = 50
when I use my own Mac studio the running time is
[1] "MODEL START: Mon May 27 23:45:53 2024"
[1] "MODEL END: Mon May 27 23:45:58 2024"

5 seconds

but the same thing on HPC
[1] "MODEL START: Mon May 27 23:42:39 2024"
[1] "MODEL END: Mon May 27 23:48:22 2024"

nearly 6 minutes

below is my observation row order in my data

image
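
The thread does not establish why the same short run differs so much between the two machines. One thing that may be worth comparing, in light of the remark above about the linear algebra routines, is which BLAS/LAPACK each R installation is linked against and how fast a dense factorisation runs. The sketch below uses only base R; the matrix size is an arbitrary choice of mine, and a single timing like this is only a rough indication.

```r
# Report which BLAS/LAPACK libraries this R session is linked against
# (the BLAS entry can be an empty string on some platforms)
blasPath <- extSoftVersion()[["BLAS"]]
lapackPath <- La_library()
cat("BLAS:  ", blasPath, "\n")
cat("LAPACK:", lapackPath, "\n")

# Time a Cholesky factorisation of a dense positive definite matrix
set.seed(1)
n <- 500                                             # arbitrary test size
A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)
elapsed <- system.time(R <- chol(A))[["elapsed"]]
cat("chol of a", n, "x", n, "matrix took", elapsed, "s\n")
```

Running this on both machines and comparing the reported libraries and timings is a cheap first check before digging into the model itself.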

MartinStjernman commented on August 16, 2024

Hi,
If I may chime in on this, as I also have problems with long running times and am looking for anything that can speed things up: I have some questions about the sorting of observations suggested as one solution.

  1. What exactly is sorted, is it the XData/studyDesign objects or the object provided as sData when constructing the random level object or both?
  2. It seems the improvement LamuelCH achieved by sorting with nearest neighbour relied on a Travelling Salesman Problem (TSP) "algorithm", and I wonder a) is this a good method to satisfy the requirements of the NNGP algorithm and b) what package/function was used to get the ordering according to TSP?

Any help is highly appreciated!

Thanks!

MartinStjernman commented on August 16, 2024

Thanks a lot Gleb!

I take it the reason I need to have names of my sites (i.e. rownames in sData), such that its lexicographic order matches the desired one, is that sData is sorted "under the hood" when constructing the random level object using HmscRandomLevel() (i.e. the step: rL$pi = as.factor(sort(rownames(sData))) ).
I will try this out although I think that my sites are already quite well sorted (site names are "sort of" coordinates).
I have, if I may, one additional question. My sites are aggregated in small clusters (cluster is also included as a non-spatial/unstructured random effect) and I have adjusted the alphapw prior for the site random effect to the scale of sites within clusters. With such a "local" prior, is it still of benefit (for speed) to spatially sort the clusters or is it enough for the sites to be spatially sorted within clusters?
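
As a side note on the sort(rownames(sData)) step mentioned above, a tiny base-R illustration of why the lexicographic order of names can differ from the intended numeric order, and how zero padding avoids it. The site names here are made up.

```r
# Site names with unpadded numbers sort as strings, not as numbers
plainNames <- c("site2", "site10", "site1")
sortedPlain <- sort(plainNames)
print(sortedPlain)    # "site10" comes before "site2"

# Zero-padded numbers give a string order that matches the numeric order
paddedNames <- sprintf("site%04d", c(2, 10, 1))
sortedPadded <- sort(paddedNames)
print(sortedPadded)
```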

Thanks again for the excellent package and help!
