Hi, I was recently doing a study comparing several deors for s


Issue with the Coulomb matrix descriptor (dscribe, closed)

pablos-p94 commented on August 25, 2024


from dscribe.

Comments (4)

lauri-codes commented on August 25, 2024

Hi @pablos-p94!

Thank you for your message. This is a very good point that I can try to address.

Generally, it is a bad sign if your ability to train a model depends on the physical units that have been used. In this particular case I would expect that this is caused by the combination of two things:

1. The Coulomb matrix (like many of the other descriptors) has features that do not scale uniformly when you change your distance/angle units. In the Coulomb matrix, the diagonal terms depend only on the elements, while the off-diagonal terms depend on the interatomic distances, and thus also on the units.
2. Some machine learning models cannot automatically tune the weighting of different features within an input. Take KRR with a Gaussian kernel, for instance: it relies only on a simple Euclidean distance between feature vectors and does not care whether that distance makes any sense. The only thing you can tune is the sensitivity, by optimizing the kernel scaling. If you use this kind of ML model, you are responsible for the feature engineering: weighting the individual terms so that the metric produces good results.
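To make the first point concrete, here is a minimal NumPy sketch, assuming the commonly quoted Coulomb matrix definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j/|R_i − R_j|). The helper `coulomb_matrix` and the water-like coordinates are made up for illustration; this is not DScribe's implementation.

```python
import numpy as np

ANGSTROM_TO_BOHR = 1.8897261246  # length of 1 ångström expressed in Bohr


def coulomb_matrix(charges, positions):
    """Hypothetical helper: plain Coulomb matrix, no sorting or padding."""
    charges = np.asarray(charges, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # Pairwise distances between all atoms.
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder, avoids division by zero
    cm = np.outer(charges, charges) / dist
    # Diagonal terms depend only on the nuclear charge, not on geometry.
    np.fill_diagonal(cm, 0.5 * charges ** 2.4)
    return cm


# Illustrative water-like geometry, coordinates in ångström (assumed values).
charges = [8, 1, 1]
pos_angstrom = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])

cm_angstrom = coulomb_matrix(charges, pos_angstrom)
cm_bohr = coulomb_matrix(charges, pos_angstrom * ANGSTROM_TO_BOHR)

off_diagonal = ~np.eye(len(charges), dtype=bool)
diag_unchanged = np.allclose(np.diag(cm_angstrom), np.diag(cm_bohr))
off_ratio = cm_bohr[off_diagonal] / cm_angstrom[off_diagonal]

print("diagonal unchanged:", diag_unchanged)
print("off-diagonal ratio (Bohr / ångström):", off_ratio)
```

The diagonal is identical in both unit systems, while every off-diagonal entry is multiplied by the same factor 1/1.8897, so the relative weight of the two kinds of feature changes with the choice of length unit.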

In terms of machine learning, the problem boils down to the fact that you don't know a priori which features carry important information and which can be considered noise. In your particular application, I would expect that making the off-diagonal terms less important by scaling them down (changing from ångström to Bohr does this) is beneficial. But in general, this is not the "ground truth": some predictions may depend more on the geometry of the samples and less on the chemistry, in which case even dropping the diagonal terms completely would produce better results.
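As a rough sketch of what such a reweighting does to a Gaussian kernel, consider the toy example below. The helpers `scale_off_diagonal` and `gaussian_kernel`, the 2x2 matrices, and the weights are all hypothetical choices for illustration, not part of DScribe or of any particular KRR library.

```python
import numpy as np


def scale_off_diagonal(cm, weight):
    """Hypothetical helper: multiply the off-diagonal terms by `weight`."""
    cm = np.array(cm, dtype=float)  # copy, so the input is left untouched
    off_diagonal = ~np.eye(cm.shape[0], dtype=bool)
    cm[off_diagonal] *= weight
    return cm


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel on the Euclidean distance between flattened inputs."""
    d2 = np.sum((np.ravel(x) - np.ravel(y)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


# Two toy Coulomb-like matrices: same elements (same diagonal),
# different geometry (different off-diagonal terms).
a = np.array([[0.5, 1.0], [1.0, 0.5]])
b = np.array([[0.5, 2.0], [2.0, 0.5]])

k_full = gaussian_kernel(a, b, sigma=1.0)
k_half = gaussian_kernel(scale_off_diagonal(a, 0.5), scale_off_diagonal(b, 0.5), sigma=1.0)
k_zero = gaussian_kernel(scale_off_diagonal(a, 0.0), scale_off_diagonal(b, 0.0), sigma=1.0)

print(k_full, k_half, k_zero)
```

With the off-diagonal terms scaled down, the two samples look more similar to the kernel, and with weight 0 they become indistinguishable. Which weight is best depends on whether the target property is driven more by geometry or by composition, so it is something to tune (for example by cross-validation) rather than fix in advance.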

When it comes to DScribe: in general we do not make any strong assumptions about which machine learning model you use, and we cannot know which weighting of the features is optimal for your application. Thus we cannot really claim that the initial sizes of the features produced by DScribe are optimal for any particular application. Because of this, I'm a bit sceptical of changing from ångström to Bohr, since it does not really solve the root of the problem: every application and ML model is different, and you as a user have a responsibility to tune your model.

Does this make sense?


pablos-p94 commented on August 25, 2024

Hi @lauri-codes,

First of all, thank you for your response and the useful explanation. Apologies for the late reply, especially since I was the one who opened the issue.

The idea of my previous message was not to make you adjust your amazing tool to the particular requirements of our project. Featurization of the descriptors is something that we are already doing, and it is easier thanks in part to all the tools that you added to the dscribe library. However, in this case I think it is an issue of consistency with the definition of the descriptor, and because of that I think it should be fixed.

You said that the diagonal elements of the Coulomb matrix (CM) only depend on the elements, but honestly I think that's not completely true. The CM describes an electrostatic interaction, and that is certainly measured in energy terms. To be consistent with this, the scaling constant of the diagonal elements is inferred from the atomic energies of the different elements, which are given in Hartree energy units. The Hartree is in turn defined through the Bohr radius and is consequently tied to Bohr distance units. So as soon as you change to ångström units, that scaling factor should change accordingly. After that, featurization certainly plays its role, but I guess that's a later step that comes after a correct description of the CM. Of course, that's only my point of view.
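For reference, this is the form in which the CM is usually quoted in the literature, written out here from memory as an assumption rather than copied from the original article, together with the unit argument above:

```latex
% Coulomb matrix as commonly quoted (Z_i: nuclear charge, R_i: position of atom i)
M_{ij} =
\begin{cases}
  0.5\, Z_i^{2.4}                                   & i = j, \\[4pt]
  \dfrac{Z_i Z_j}{\lvert \mathbf{R}_i - \mathbf{R}_j \rvert} & i \neq j .
\end{cases}

% With distances in Bohr (a_0), the off-diagonal terms are in Hartree (E_h):
%   Z_i Z_j / |R_i - R_j|  has units  e^2 / a_0  ->  E_h.
% With distances in angstrom, using 1 \AA \approx 1.8897\, a_0:
M_{ij}^{(\text{\AA})} \approx 1.8897\; M_{ij}^{(a_0)} \quad (i \neq j),
\qquad
M_{ii}^{(\text{\AA})} = M_{ii}^{(a_0)} .
```

In other words, if the diagonal fit is in Hartree, the off-diagonal terms are only on the same energy scale when the distances are given in Bohr.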

Once again thanks for your fantastic work and your efforts with the dscribe package.

Pablo.


lauri-codes commented on August 25, 2024

Hi @pablos-p94,

I agree that our baseline implementation should ideally follow the original definition, but fine-tuning the weighting of diagonal vs. off-diagonal elements is still up to the end user.

Could you point out the exact location where the units are discussed by the authors? Do they e.g. mention atomic units or Bohr explicitly somewhere in the original article?


lauri-codes commented on August 25, 2024

Closing for now as there was no reply. Feel free to re-open if required.
