
Comments (6)

gabrieltseng avatar gabrieltseng commented on September 18, 2024

Hi @kvantricht,

Yes - I agree this is a confusing part of the codebase (apologies). The motivation behind this is that the latlon token never gets masked, and the SRTM token does - I wanted to group all continuous values which would get masked into x.

The trade-off is that the SRTM values must be duplicated - if another static variable were to be added, it would make sense to split this into dynamic and static inputs. If you feel this would improve the clarity of the model, I'd be happy to give this a shot now.
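To make the duplication concrete, here is a minimal numpy sketch of what "duplicating the SRTM values" along the temporal dimension looks like. The shapes and channel counts are illustrative assumptions, not Presto's actual configuration:

```python
import numpy as np

# Illustrative shapes only; the real channel counts come from Presto's config.
batch, timesteps, n_dynamic = 4, 12, 17

x_dynamic = np.random.rand(batch, timesteps, n_dynamic)  # time-varying bands
srtm = np.random.rand(batch, 1)                          # one static value per pixel

# Duplicate the static SRTM value at every timestep so it can be
# concatenated with the dynamic channels into a single input x.
srtm_repeated = np.repeat(srtm[:, np.newaxis, :], timesteps, axis=1)
x = np.concatenate([x_dynamic, srtm_repeated], axis=-1)
print(x.shape)  # (4, 12, 18)
```

The static channel then carries the same value at every timestep, which is what allows it to be masked like any other continuous value in x.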

from presto.

kvantricht avatar kvantricht commented on September 18, 2024

@gabrieltseng
as we discussed, it could be useful at some point to try out mono-temporal land cover as a static input, or any other scalar value (soil properties, for example). It should work if all of these inputs are likewise replicated along the temporal dimension to make them fit x, but I would assume that splitting dynamic and static inputs in the model, so that the user does not have to include a (matching) temporal dimension for static inputs, would be less confusing.

By the way, as for latlons, does this mean that location information is compulsory and we wouldn't be able to compute embeddings for location-unaware inputs?


gabrieltseng avatar gabrieltseng commented on September 18, 2024

Yes; I agree. If we add another static-in-time input, I will rewrite the inputs to be x_dynamic and x_static. I'll leave this issue open until then - one reason I haven't done this yet is that it's a bit quicker (in terms of training time) to mask the SRTM values than to split them from the dynamic data.
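A rough sketch of what that split might look like from the caller's side. The x_dynamic/x_static names come from this discussion, but the function signature is hypothetical, not Presto's actual API:

```python
import numpy as np

# Hypothetical interface for the proposed split; names are from the
# discussion above, but this is a sketch, not Presto's actual API.
def combine_inputs(x_dynamic: np.ndarray, x_static: np.ndarray) -> np.ndarray:
    """x_dynamic: (batch, timesteps, n_dynamic); x_static: (batch, n_static)."""
    batch, timesteps, _ = x_dynamic.shape
    # The model can still broadcast static inputs along time internally,
    # so the caller never supplies a (matching) temporal dimension.
    static_repeated = np.broadcast_to(
        x_static[:, None, :], (batch, timesteps, x_static.shape[-1])
    )
    return np.concatenate([x_dynamic, static_repeated], axis=-1)

x = combine_inputs(np.zeros((2, 12, 17)), np.ones((2, 1)))
print(x.shape)  # (2, 12, 18)
```

Broadcasting inside the model keeps the user-facing inputs clean while leaving room to retain whatever internal layout is fastest for training.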

And yes, the model doesn't natively handle missing location-unaware inputs. Do you have a use case where you would want to pass location-unaware inputs to the model?


kvantricht avatar kvantricht commented on September 18, 2024

I understand that training time is an important factor to take into account! So if that is a strong argument for keeping them together, it could definitely make sense. Just for my understanding: isn't this something you could still do under the hood, even if, from the user's perspective, the inputs were separated?

As for the location-unaware inputs: I'm mostly thinking of situations where data privacy could result in training data being shared with the inputs intact but stripped of location information. It might be a borderline case, though.

On the other hand: from your experience, is the location's impact on the embeddings such that geographical bias in the training data could result in classification artefacts in downstream tasks? I could imagine a case where you want to train a classifier on Presto embeddings that are completely free of location awareness. But maybe that case is also too borderline.


gabrieltseng avatar gabrieltseng commented on September 18, 2024

isn't this something you could still do under the hood

Yes, in principle. However, since SRTM is exported alongside all the other data, doing this cleanly requires separating it out - that is what incurs the computational cost. So this penalty applies only to SRTM, and it's an artifact of how we export the data. This is why the introduction of any new dataset would definitely justify the rewrite.

Do you think the location impact on the embeddings is of such nature that geographical bias in training data could result in classification artefacts in downstream tasks?

We did a few (not thorough) experiments removing the latlon token and saw a performance decrease. We didn't pursue it at the time because we couldn't see a scenario where the latlon token would be unused during training - the privacy concern is fair, but I haven't yet encountered it (especially when using S2-scale data). If this is a concern, we could definitely investigate it further, though.
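One way such an investigation could start is by neutralising the latlon token at inference time rather than retraining. The sketch below is purely illustrative - the token layout, index, and mask choice are assumptions, not Presto's internals:

```python
import numpy as np

# Hypothetical illustration: produce a "location-unaware" embedding by
# replacing the latlon token with a fixed mask vector before encoding.
# Token layout, index, and mask choice are assumptions, not Presto internals.
d_model = 8
tokens = np.random.rand(5, d_model)  # e.g. 4 data tokens + 1 latlon token
latlon_index = 4
mask_token = np.zeros(d_model)       # in practice, a learned mask embedding

tokens_no_loc = tokens.copy()
tokens_no_loc[latlon_index] = mask_token
# The encoder would then attend over tokens_no_loc instead of tokens,
# yielding embeddings with no explicit location signal.
```

Whether such masked embeddings stay useful downstream is exactly what the (not thorough) ablations above began to probe.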


kvantricht avatar kvantricht commented on September 18, 2024

We did a few (not thorough) experiments removing the latlon token, and saw a performance decrease. We didn't pursue it at the time because we couldn't see a scenario where the latlon token would be unused during training

Wanted to get back to this, as we were discussing it internally (mostly triggered by this new preprint). Suppose that in a crop mapping task you have training data on two particular crops in one country, but on only one of these crops in another country, and assume these crops look spectrally identical in both countries. Your downstream classifier would likely learn a strong predictive power of location with respect to the country lacking training data on one of the crops, and would therefore never predict the other crop there, even though it is actually present. Isn't this a case where the mandatory location information in Presto could do harm, and where you might benefit from location-unaware embeddings? What's your view on that?

