torchmd-exp's People

Contributors

11carlesnavarro, eloisanchez

torchmd-exp's Issues

Set up extensive logging system to file

Summary

Currently, we do not log anything beyond the losses and model checkpoints. I think it would be nice to set up a logger that could be configured to write everything to a file in the logdir. This includes, but is not limited to:

  • The epoch and batch information for each batch.
  • Which module or function the code is in at each moment.
  • Intermediate values of interest, or checks that indicate everything is working fine (e.g. gradient values, weight values, energies for the reweighting...)

TODO

The idea would be to have a logging object, set up once and accessible from all modules, that allows calls like log.info('some info') and records where each call is made from along with the corresponding information.
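
A minimal sketch of such a setup, using Python's standard logging module (the logger name "torchmdexp" and the file name train.log are placeholders, not existing code):

    import logging
    import os

    def setup_logger(log_dir, level=logging.INFO):
        """Configure a shared logger that writes to a file in log_dir.

        The format records where each call is made from (module, function
        and line number) along with the message itself.
        """
        logger = logging.getLogger("torchmdexp")
        logger.setLevel(level)
        handler = logging.FileHandler(os.path.join(log_dir, "train.log"))
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(module)s.%(funcName)s:%(lineno)d %(message)s"
        ))
        logger.addHandler(handler)
        return logger

Any module could then do log = logging.getLogger("torchmdexp") and call log.info('some info'); the module, function and line number of the call would be recorded automatically by the formatter.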

Embeddings not properly passed to compute gradients

Bug found when training with several molecules and sim_batch_size > 1. The first epoch works fine, but after that there is a chance that the states of one molecule are sent together with the embeddings of another in the call to self.local_we_worker.compute_gradients(...).

I am currently trying to find the error. My guess is that there is some random sampling at some point that affects the states but not the embeddings, so the embeddings are not updated to match the new ordering before being sent; see the sketch below.
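
If that is the cause, one fix would be to draw a single permutation and apply it to both tensors. A minimal sketch, where states and embeddings are stand-ins for whatever the worker actually batches:

    import torch

    states = torch.randn(4, 8)    # stand-in for per-molecule states
    embeddings = torch.arange(4)  # stand-in for per-molecule embeddings

    perm = torch.randperm(states.shape[0])  # ONE shared permutation
    states = states[perm]
    embeddings = embeddings[perm]  # same perm, so states[i] keeps its embeddings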

Struggle with replicating the training curve

Hi! Thank you for the amazing paper and for open-sourcing the code!

I was trying to reproduce the training process and encountered a few difficulties.
For some reason, the given datasets, when loaded, have all the bead masses equal to zero:
[Screenshots: the loaded datasets show all masses equal to zero]
This obviously breaks the program as soon as anything is divided by the masses.

I tried setting the masses to None for each molecule, since in that case torchmd guesses them during Parameters building. However, it sets them all to 12 (which, I guess, corresponds to C-alpha atoms). This is probably not how it is supposed to be, since it implies the wrong physics, and the training loss gets stuck around 2.8:
[Screenshot: training loss stuck around 2.8]

After that, I mapped the resnames to the known amino-acid masses. This indeed improved things: the training loss started from 2 and decreased to 1 (and was still slowly going down):
[Screenshot: training loss decreasing from about 2 to 1]
However, the training curve looks nothing like the one in the example notebook, where it starts from around 5 and drops to almost zero. The mass mapping I used is sketched below.
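
For reference, the mapping is essentially the standard average amino-acid residue masses; a sketch (values in Da; my exact numbers may have differed slightly):

    # Standard average residue masses in Da, keyed by resname.
    AA_MASSES = {
        "GLY": 57.05, "ALA": 71.08, "SER": 87.08, "PRO": 97.12, "VAL": 99.13,
        "THR": 101.10, "CYS": 103.14, "LEU": 113.16, "ILE": 113.16,
        "ASN": 114.10, "ASP": 115.09, "GLN": 128.13, "LYS": 128.17,
        "GLU": 129.12, "MET": 131.19, "HIS": 137.14, "PHE": 147.18,
        "ARG": 156.19, "TYR": 163.18, "TRP": 186.21,
    }

    def assign_masses(resnames):
        """One mass per CG bead, looked up from its residue name."""
        return [AA_MASSES[r] for r in resnames]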

I am using train_ff.yaml with only "log_dir" and "device" modified.

Do you have an idea of what might be wrong?
From what I can see, the input.yaml of the newly trained model differs from the one in data/models/fastfolders, particularly in fields such as max_num_neighbors and some others, so my next step would be to try using the same values, I guess.

I would much appreciate it if you could help me replicate the results. I am eager to use the trajectory reweighting method with other CG potentials and slightly extended CG systems, and I really hope your implementation will help me a lot with that.

P.S. In order to launch the training I also had to rebuild the environment (the provided environment.yaml does not work out of the box: it is missing certain packages, and some packages conflict with each other) and add a "timestep" key to the logger. I can make a PR with the environment.yaml that worked for me.

Edit: The run with the mapped AA masses eventually approached zero after 5k steps. It would still be great to make the optimisation faster, as in the example notebook.

Modify Learner/Logger so that keys that appear in results_dict but were not entered don't raise an error

Right now, creating a Learner instance looks something like this:

learner = Learner(scheme, steps, output_period, train_names=train_names, log_dir=args.log_dir,
                  keys=('epoch', 'level', 'steps', 'train_loss', 'val_loss',
                        'loss_1', 'loss_2', 'val_loss_1', 'val_loss_2'))

The problem is that the keys argument requires some of these keys to be present, otherwise there will be an error when the logger writes to file. Also, all the given keys will be written even though they may not be used in that specific training.

The idea would be to pass to the Learner only the keys that the user really wants to write, e.g. keys=('epoch', 'level', 'steps', 'train_loss', 'val_loss'), and, even though the results dict has other keys, the logger should only write the ones passed (plus the train_names losses, if given) without throwing an error; see the sketch below.
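
A minimal sketch of the desired behavior (the function and the per-name loss key format are assumptions, not the actual Learner internals):

    def filter_results(results_dict, keys, train_names=()):
        # Assumed naming scheme "<name>_loss" for the per-train_name losses.
        wanted = list(keys) + [f"{name}_loss" for name in train_names]
        # Write only the requested keys; silently skip anything missing
        # instead of raising.
        return {k: results_dict[k] for k in wanted if k in results_dict}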

Reproducibility of torchmdexp

Currently torchmdexp is not reproducible because several components use RNGs that we are not seeding.

Torch: Torch is seeded manually at the beginning of the train scripts, so it should not be a problem.

Python random: The standard random library produces the randomness in ProteinDataset.shuffle(), used in both folding and docking. At least in the case of docking, this was a source of non-reproducibility, since the batching order depended on it. Also, when the number of molecules is not divisible by the batch size, a random sample of the molecules is used to fill the last batch, which is also determined by the standard random library.

This is fixed by adding random.seed(args.seed) to the train scripts.

Numpy random: In the functions that add noise, both for folding and for docking, we use np.random.normal(...). This is another source of non-reproducibility, because the initial structures for the training will differ between runs.

This is fixed by adding np.random.seed(args.seed) to the train scripts.

TorchMD randomness: Simulations sample velocities randomly with torch.randn(...). Therefore, I think that if we add seeding for the standard Python random and for numpy, everything should already be reproducible, because we already seed torch.


The questions are:

  1. Are there other sources of randomness that I have not considered here?
  2. Do we want to allow for 100% reproducibility?
  3. If so, we could make something like torchmdexp.utils.init(args.seed) that initializes and seeds everything (sketched below). Or we could replace the Python and numpy randomness and use only torch.
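
A minimal sketch of what option 3 could look like (torchmdexp.utils.init does not exist yet; the function below is the proposal):

    import random

    import numpy as np
    import torch

    def init(seed: int) -> None:
        """Seed every RNG used during training from a single entry point."""
        random.seed(seed)        # ProteinDataset.shuffle() and last-batch filling
        np.random.seed(seed)     # noise functions using np.random.normal(...)
        torch.manual_seed(seed)  # model init and torch.randn velocity sampling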
