mir-group / flare
An open-source Python package for creating fast and accurate interatomic potentials.
Home Page: https://mir-group.github.io/flare
License: MIT License
#27 snuck a call to set_L_alpha() into the update_db method of gp. This leads to inefficiencies when constructing large GP models from hundreds of structures, since the covariance matrix must be rebuilt from scratch every time a new structure is added to the training set. Removing the call to set_L_alpha resolves the issue.
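A toy sketch of why the incremental approach matters (the function names and the scalar kernel here are illustrative only, not flare's actual API): extending an existing covariance matrix by one row and column costs O(n) kernel evaluations per added point, while rebuilding from scratch costs O(n^2).

```python
import math

def build_cov(points, kernel):
    """Recompute the full covariance matrix (the set_L_alpha-style path)."""
    n = len(points)
    return [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]

def grow_cov(cov, points, new_point, kernel):
    """Extend an existing covariance matrix by one row/column
    (the update-style path): O(n) kernel calls instead of O(n^2)."""
    new_row = [kernel(new_point, p) for p in points]
    for row, k in zip(cov, new_row):
        row.append(k)
    cov.append(new_row + [kernel(new_point, new_point)])
    return cov

# Toy squared-exponential kernel on scalars.
kernel = lambda x, y: math.exp(-0.5 * (x - y) ** 2)
pts = [0.0, 1.0]
cov = build_cov(pts, kernel)
cov = grow_cov(cov, pts, 2.0, kernel)
pts.append(2.0)
assert cov == build_cov(pts, kernel)  # same matrix either way
```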
Docstrings annotating arguments and outputs to various methods would improve user readability.
To test that it works for the current version of otf.py (and its output conventions), the test should be based on a current output file, e.g. the ones generated by test_OTF.py.
Would be a three-liner in util.py to have a rough version:

_Z_to_element = {z: elt for elt, z in _element_to_Z.items()}

def Z_to_element(Z):
    return _Z_to_element[Z]
Would be useful for mapping 'coded species' integers to species names, as @nw13slx mentioned in the comments for issue 28. I'll add this later once I learn how to use branches (or somebody else is welcome to add it, no need to wait for me :) )
Write a flare.io module, which is an input/output module specifically for flare data structures. It should take md_trajectory_to/from_file from vasp_util.py and move it here.
to "mgp" (Mapped Gaussian Process) (credit to Jon)
Travis or CircleCI would be useful to look into for continuous integration, so we can eliminate friction with our pull requests and development.
Methods for serializing certain objects which are passed between models (e.g. atomic environments, structures, etc.), or even models themselves, would be useful. The advantage of this over pickled objects is that they can be more human-readable (and I understand that pickled objects have some security risks associated with them).
One example application is that JSON objects are easily storable in certain database architectures. This might be relevant for e.g. FOOGA in the near future if we want to automate the process of training GP models for different datasets, as this would let us store them more easily.
In my development branch I've done this for the AtomicEnvironment object. There are ways we could standardize this or easily implement it across our codebase (e.g. by using Monty, which has an object type which allows for effortless JSON serialization of different Python objects).
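A minimal sketch of the Monty-style as_dict/from_dict pattern described above. The class and field names here are illustrative assumptions, not flare's actual AtomicEnvironment API:

```python
import json

class AtomicEnvironmentSketch:
    """Illustrative stand-in for a serializable flare object."""
    def __init__(self, species, positions, cutoff):
        self.species = species        # e.g. ["H", "O"]
        self.positions = positions    # list of [x, y, z]
        self.cutoff = cutoff

    def as_dict(self):
        # Record the class name so a generic loader can dispatch on it.
        return {"@class": type(self).__name__,
                "species": self.species,
                "positions": self.positions,
                "cutoff": self.cutoff}

    @classmethod
    def from_dict(cls, d):
        return cls(d["species"], d["positions"], d["cutoff"])

env = AtomicEnvironmentSketch(["H", "O"], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]], 5.0)
blob = json.dumps(env.as_dict())  # human-readable, database-friendly
restored = AtomicEnvironmentSketch.from_dict(json.loads(blob))
assert restored.species == env.species
```

Monty's MSONable base class provides this pattern (plus to_json helpers) automatically, which is the "effortless" route mentioned above.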
set_L_alpha gives the option to compute the covariance matrix in parallel. It would be helpful to have the same option for update_L_alpha, especially for large GPs
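The parallel structure is the same either way: map a row-computing worker over the environments. A dependency-free sketch (names are assumptions; a ThreadPoolExecutor is used here only to keep the example self-contained, whereas CPU-bound kernel evaluations would want a ProcessPoolExecutor or multiprocessing pool):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def cov_row(args):
    """Compute one row of the covariance matrix."""
    x, points, kernel = args
    return [kernel(x, y) for y in points]

def parallel_cov(points, kernel, workers=4):
    """Map the row worker over all points; the same pattern applies to
    only the new rows an update_L_alpha-style method has to add."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cov_row, ((x, points, kernel) for x in points)))

kernel = lambda x, y: math.exp(-0.5 * (x - y) ** 2)
pts = [0.0, 0.5, 1.0]
cov = parallel_cov(pts, kernel)
assert cov[0][1] == cov[1][0]  # symmetric kernel gives a symmetric matrix
```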
The training process can overload compute nodes when both the GP and the predictions are parallelized. This should not happen, because the parallelized functions are never called in the same step, but it could be caused by using multithreading and concurrency at the same time.
As is, the tests which involve calling Espresso fail if the PWSCF command is not found.
It would be nice if, when the PWSCF command is not detected, the unit tests that call QE were skipped rather than failed. This may be more informative to the user, since there are other reasons a call to QE could fail that would require debugging.
We could also print a message encouraging the user to fix their environment variable.
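The skip-instead-of-fail behavior could be sketched with the standard library's unittest (a pytest suite could do the same with pytest.mark.skipif); the executable name pw.x is an assumption here:

```python
import shutil
import unittest

# Skip QE-dependent tests when PWSCF is absent, instead of failing them,
# and point the user at their environment setup in the skip message.
needs_qe = unittest.skipUnless(
    shutil.which("pw.x") is not None,
    "PWSCF (pw.x) not found on PATH; check your QE environment variable")

class TestQE(unittest.TestCase):
    @needs_qe
    def test_qe_call(self):
        pass  # the real test would shell out to Quantum ESPRESSO here
```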
Before officially releasing the first version of the code, it would be great to include detailed tutorials in the documentation explaining how to use key features (especially otf, mff, and the mff pairstyle in LAMMPS).
Currently, the likelihood stops being printed after the hyperparameter training phase.
Would improve flexibility for the user, as Pymatgen structures have a lot of terrific methods and enjoy a wide user base.
One feature that would help to accelerate the pipeline of GP from AIMD workflows would be helper functions which parse DFT outputs (like VASP) and turn them into a file of serialized structures decorated with force information. A second helper function could generate atomic environments from a .json file. The use of functions like pymatgen for parsing would be extremely welcome here, as they have very high quality and externally maintained parsers for VASP. These wrappers would be simple to implement and useful.
Relevant to #19.
Before trying to make any kernel optimizations, I thought it would be good to make a suite of benchmarks so we can easily and consistently measure any performance boosts.
I am thinking a two phase benchmark would work well.
There is some redundancy in certain methods setting the `like` (or `likelihood`) variable in different places.
The tests should check that the resulting matrices agree with set_L_alpha.
Hat tip to Lixin and Yu, who discovered the following problem after laborious debugging: apparently parallelization in Python fails when multiple processes operate on the same instance of the same class object.
For instance, the module I'm developing of gp_from_aimd has great cause to use the predict functions, and to avoid duplicating code, having them be in a different file allows them to be called without an OTF instance. I currently have implemented this in my development branch, for reviving gp_from_aimd. Lixin has done the same in hers, so one of us will push it eventually.
Occasionally OTF can be interrupted (either by a bug, or because the user wants to interrupt the training and change the conditions) and needs to restart from the middle. It would be good to have a restart module, or a method inside the otf module.
I have a script that did this while I was training my stanene system. If you guys have written a better wrapped module, that would be great.
We need to refactor the OTF class to interface with
option to freeze certain hyperparameters
different hyperparameters for different species
Flare code can hang in SLURM jobs on Odyssey. This could be related to the memory setup; specifying the memory for SLURM and raising the stack size limit can help:
`#SBATCH --mem-per-cpu=6000`
`ulimit -s unlimited`
But we should look at memory profiling for the code at some point.
The output file in output.py should not be repeatedly opened and closed. We also need to allow multiple output files.
Relates to issue #19 ; generate a file demonstrating how to set up / run / parse VASP files.
A setup.py file contains lots of useful info, among other things a list of all required Python packages. Running it can install everything that is needed, which would be handy for new users.
One such guide to doing so is here:
https://the-hitchhikers-guide-to-packaging.readthedocs.io/en/latest/quickstart.html
We can include this on the wishlist for V 1.0.
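A hypothetical minimal setup.py sketch along the lines of that guide; the version number and the dependency list are placeholders, not flare's confirmed requirements:

```python
# Placeholder setup.py sketch; adjust version and install_requires to match
# the actual dependency list before using.
from setuptools import setup, find_packages

setup(
    name="flare",
    version="0.1.0",
    description="Fast and accurate interatomic potentials from Gaussian processes",
    url="https://mir-group.github.io/flare",
    license="MIT",
    packages=find_packages(),
    install_requires=["numpy", "scipy", "numba"],  # placeholder dependencies
)
```

New users could then run `pip install .` from the repository root to pull in everything at once.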
open a new branch, tasks:
Would be helpful to (short term) provide built-in methods to parse VASP files for model training, and (long term) support a VASP interface for OTF runs.
IMPI does not accept `mpirun exec < input` as OpenMPI does. It should be `mpirun -np 1 exec < input`.
We should make as many output files as possible in a simple column format, which can easily be read by numpy.loadtxt.
@YuuuuXie @jonpvandermause
Could you please put together a list of outputs that can be formatted this way? E.g. hyperparameters, MAE/likelihood at each step, ...
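A stdlib-only sketch of such a column format (the file name and field names are assumptions): a leading `#` comment line is skipped by numpy.loadtxt, and fixed-width scientific notation keeps columns aligned.

```python
def write_history(path, rows):
    """Write (step, mae, likelihood) rows in plain columns that
    numpy.loadtxt can read back directly."""
    with open(path, "w") as f:
        f.write("# step  mae  likelihood\n")  # comment line, skipped by loadtxt
        for step, mae, like in rows:
            f.write(f"{step:8d} {mae:14.6e} {like:14.6e}\n")

write_history("train_history.dat", [(0, 0.12, -35.2), (1, 0.09, -30.1)])
# Read back with: numpy.loadtxt("train_history.dat")
```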
need a module to
It would be helpful to have the option to parallelize efficiently over multiple compute nodes (i.e. with the -npool flag) for large, expensive otf simulations. Should be an easy fix -- just have to give the option in otf.py to use "run_dft_npool" instead of "run_dft_par".
In line 50 of md_run.py, the function call to output.write_header is missing the positional argument std_tolerance.
Would help to make managing the results of runs easier.
This will require some design choices and discussion, but it would help make our abstraction for the OTF and Trajectory Trainer make more sense, and let them share methods where that makes sense (e.g. prediction or std-in-bound methods).
I'm making this for my own development process
NumPy arrays are behaving strangely in the Sphinx documentation when used as type hints; this has something to do with their 'mock import' in the configuration. I will look into whether this can be fixed so that they render correctly in Sphinx.
@YuuuuXie may have already done this, but
Both of these would enhance user pre-processing flexibility.
Related to issue 14: #14
Because the is_std_in_bound() function is a convenience method for OTF or the (in-development) TrajectoryTrainer, it would eliminate redundancy to have it exist outside of the OTF class. If we end up splitting the predict functions into a separate file, I think that this function would be a natural fit there (given that it diagnoses the result of a prediction on a structure).
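A hedged sketch of what the standalone version could look like (the signature and threshold rule here are assumptions, not flare's exact implementation): flag the atoms whose predicted force standard deviation exceeds |std_tolerance| times the noise hyperparameter, so the caller can trigger a DFT call.

```python
def is_std_in_bound(std_tolerance, noise, stds):
    """stds: per-atom [sx, sy, sz] predicted standard deviations.
    Returns (True, []) if all atoms are in bound, else (False, bad_atoms)
    so the caller can add the flagged atoms to the training set."""
    threshold = abs(std_tolerance) * abs(noise)
    bad_atoms = [i for i, s in enumerate(stds) if max(s) > threshold]
    return (len(bad_atoms) == 0, bad_atoms)

ok, targets = is_std_in_bound(1.0, 0.05, [[0.01, 0.02, 0.01], [0.2, 0.1, 0.1]])
assert not ok and targets == [1]
```

Being a free function, it needs no OTF instance, so a TrajectoryTrainer (or anything else that predicts on structures) can call it directly.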
This idea is courtesy of @dmclark17. To keep branches organized and easy to navigate, I propose we adhere to the following convention for naming branches:
type of change/branch owner/description of change
Some examples:
bug/jon/gp-hotfix
feature/jon/cool-new-kernel
docs/jon/env-docstrings
etc.
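The convention above could even be checked mechanically, e.g. in a pre-push hook. A small sketch (the allowed type list is inferred from the examples and is an assumption):

```python
import re

# type-of-change / branch-owner / description-of-change, all lowercase,
# hyphen-separated; the type list here is an assumption from the examples.
BRANCH_RE = re.compile(r"^(bug|feature|docs)/[a-z0-9-]+/[a-z0-9-]+$")

def is_valid_branch(name):
    return BRANCH_RE.fullmatch(name) is not None

assert is_valid_branch("bug/jon/gp-hotfix")
assert not is_valid_branch("my-random-branch")
```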
check both against underlying gp model
Missing code of conduct, issue template, pull request template, etc.
Unit tests for the ASE interface.
We currently use mpirun, but we should allow the user to define their own MPI environment, such as mpirun, mpiexec, or srun.
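One way to support this is a launcher-to-template lookup; a sketch with placeholder names (the flag conventions shown are the common ones for each launcher, and note the IMPI issue above about needing `mpirun -np 1 exec < input`):

```python
def mpi_command(launcher, n_procs, exe, infile):
    """Build the shell command for a user-chosen MPI launcher."""
    templates = {
        "mpirun":  "mpirun -np {n} {exe} < {inp}",
        "mpiexec": "mpiexec -n {n} {exe} < {inp}",
        "srun":    "srun -n {n} {exe} < {inp}",
    }
    if launcher not in templates:
        raise ValueError(f"unknown MPI launcher: {launcher}")
    return templates[launcher].format(n=n_procs, exe=exe, inp=infile)

assert mpi_command("srun", 4, "pw.x", "scf.in") == "srun -n 4 pw.x < scf.in"
```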