Coder Social home page Coder Social logo

benchmarkff's Introduction

BenchmarkFF

Language grade: Python DOI

README last updated: Nov 09 2020

About

Objective: Compare optimized geometries and energies from various force fields with respect to a QM reference.

This repository comprises code to extract molecule datasets from QCArchive, run energy minimizations with various force fields, and analyze the resulting geometries and energies with respect to QM reference data from QCArchive.

See our work in this preprint: Lim et. al.; Benchmark Assessment of Molecular Geometries and Energies from Small Molecule Force Fields. 2020.

Contents

Directories in this repo:

  • 01_setup: Extract molecules from QCArchive, convert to OpenEye mols, and standardize conformers and titles.
  • 02_calc: Run energy minimizations for various force fields.
  • 03_analysis: Analyze output energies and geometries.
  • examples: See this directory for example results and plots.
  • molecules: The molecule sets used in our benchmark analyses.
  • tools: A handful of helpful scripts (align structures for PDF output, find specific moieties, extract conformers by SD tag value).

File descriptions:

directory file description
01_setup extract_qcarchive_dataset.ipynb write out molecules from a QCArchive database which have completed QM calculations
01_setup combine_conformers.ipynb of the molecules from extract_qcarchive_dataset.ipynb, combine conformers that are labeled as different molecules
02_calc minimize_ffs.py minimize all molecules in an input SDF file with a specified force field
03_analysis color_by_moiety.py generate ddE vs TFD (or RMSD) scatter plots highglighting specific moieties by color
03_analysis compare_ffs.py compare FF-minimized molecules on their geometries and energies (no conformer matching)
03_analysis match_minima.py similar to compare_ffs of comparing geometries and energies but analyzing RMSD-matched structures
03_analysis probe_parameter.py find all molecules in a set that use certain specified parameter(s)
03_analysis reader.py reader for molecule sets and text input files called by the other analysis scripts
03_analysis tailed_parameters.py identify parameters that may be overrepresented in high RMSD/TFD tails for FFXML force fields

Python setup

Package dependencies

  • numpy, matplotlib, seaborn
  • OpenEye
  • RDKit (solely for TFD calculations)
  • OpenMM
  • OpenForceField
  • QCFractal, QCPortal

Conda setup

conda create -n parsley python=3.6 matplotlib numpy seaborn
conda activate parsley
conda install -c openeye -c conda-forge -c omnia rdkit openeye-toolkits qcfractal qcportal openforcefield cmiles openmm

The packages in VTL's conda environment is documented in this repo as parsley.yml.

Brief overview of usage

Setup

  1. Write out from a QCArchive database which have completed QM calculations, using extract_qcarchive_dataset.ipynb.
  2. Reorganize up the molecule set for conformers which are labeled as different molecules:
    1. Group all the same conformers together since they are separated by intervening molecules, using combine_conformers.ipynb.
    2. Of the full set, write out the good molecules that don't need conformer reorganization, using molextract.py* from OEChem.
    3. Combine the results of steps 2.1 and 2.2: cat whole_02_good.sdf whole_03_redosort.sdf > whole_04_combine.sdf
    4. Read in the whole set with proper conformers, and rename titles for numeric order, using combine_conformers.ipynb.
    5. (opt) Write out associated SMILES: awk '/SMILES/{getline; print}' whole_05_renew.sdf > whole_05_renew.smi
    6. (opt) Generate molecular structures in PDF format, using mols2pdf.py.

FF calculations

  1. (opt) Break up the full set into smaller chunks for managable computations, using molchunk.py.*
  2. Run the minimizations, using minimize_ffs.py.
  3. Remove the molecules that were unable to minimize (e.g., due to missing parameters) from all files, using cat_mols.py.*
  4. (opt) If the full set was broken up, concatenate all the constituent files back together, using cat or cat_mols.py.*

Analysis

  1. Analyze geometries and relative energies with compare_ffs.py.
    1. (opt) If some conformers (not full molecules) are outliers, can remove using get_by_tag.py.
    2. Mark certain moieties of interest in energy v. geometry scatter plots using color_by_moiety.py.
  2. Analyze energies of structurally similar conformers using match_minima.py.
  3. Explore parameters overrepresented in high TFD/RMSD regions using tailed_parameters.py and probe_parameter.py.

*The OEChem scripts referred to above are located here.

  • molextract.py
  • molchunk.py -- VTL modified to use OEAbsCanonicalConfTest

Note: Some of the analysis can take a long time for multiple force fields and many molecules (e.g. up to 2 hours on compare_ffs.py or 30-45 min on tailed_parameters.py). To explore the analyzed data, adjust plots, etc. without re-analyzing data, you can input the pickle file written out from the previously run analysis.

Contributors

  • Authors: Victoria T. Lim, David F. Hahn
  • Advising: David Mobley, Gary Tresadern, Chris Bayly
  • Code review: Jeffrey Wagner, Daniel Smith
  • Discussions: Jessica Maat, Caitlin Bannan, Hyesu Jang, Lee-Ping Wang

Big picture wish list / to do tasks

See more focused issues in the issue tracker.

  • Format code with YAPF/Black
  • Use logging module instead of print statements
  • Look into automatically serializable representations (e.g., Pydantic) instead of pickle
  • Use type hints for functions
  • Allow user to pass in dict for plotting parameters (i.e., talk or paper font sizes)
  • Generate plots with Plotly

benchmarkff's People

Contributors

davidlmobley avatar dfhahn avatar vtlim avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

benchmarkff's Issues

Plot FF relative energy vs. QM relative energy

It could be interesting to see on a scatter plot specific conformers and their energies. Based on the histogram skews, there may be more points below y=x (perfect agreement) as the FF energies have skewed on the lower side (ddE<0) compared to QM.

handling outlier molecules in `match_minima.py`

On the full benchmark set, while generating plots with match_minima.py using data read in from pickle file, the memory grows exceedingly high (observed > 60 Gb) and is eventually killed.

I am working on identifying high memory use areas using the memory_profiler package with
PYTHONPATH=../ python -m memory_profiler ../match_minima.py -i match.in --cutoff 1.0 --plot --readpickle

Explore possibility of shallow potential energy surfaces for high TFD (low ddE) molecules

From Alberto Gobbi:

One way to test the shallow surface hypothesis would be to compare the energy trace during the geometry optimization.
Or you could use constraint minimizations. We are computing strain energy using a combination of conformational sampling and constraint minimization that allow to see how shallow the surface is around the starting point:

Lee, Man-Ling, Ignacio Aliagas, Jianwen A. Feng, Thomas Gabriel, T. J. O’Donnell, Benjamin D. Sellers, Bernd Wiswedel, and Alberto Gobbi. “Chemalot and Chemalot_knime: Command Line Programs as Workflow Tools for Drug Discovery.” Journal of Cheminformatics 9 (June 12, 2017): 38. https://doi.org/10.1186/s13321-017-0228-9.

I think the original work is by Bostroem:
Boström J, Norrby PO, Liljefors T (1998) Conformational energy penalties
of protein-bound ligands. J Comput Aided Mol Des 12(4):383–396

KDE smoothing of RMSD/TFD

These metrics have a minimum of zero, but with smoothing, the curves go below zero. Other than plotting histograms, a potential solution is to (1) mirror data to negative side, (2) apply KDE, (3) take the distribution for x>=0 only.

Molecules of N-N, azetidine, and octahydrotetracene-like compounds

Hi,
I have read the preprint article. At the end of the paper, it shows three kinds of chemical structures that have outlying energies or geometries, i.e. N-N single bond (3824 structures), azetidine ring (543 structures), and octahydrotetracene (50 structures). I'm wondering if you can provide me those structures with coordnates and connectivity. I'd like to see the improvements of our force filed on these molecules. Thank you!

Explore subset of TFD bins

From Mike Gilson:

I'm wondering whether you would in fact see an increase in the standard deviation of ddE in "TFD slices" of the data that move to the right. That is, maybe bin the data by TFD (0-0.05, 0.05-0.10, 0.10-0.15, etc), and compute the standard deviation of ddE within each slice.

Also make a histogram of the ddE values within different TFD ranges.

Generalize some variables for plotting/analysis

  • It might better to have an argument for draw_density2d() in compare_ffs.py to generalize the x, y, z, ranges.

  • color_by_moiety.py could be generalized to support more moiety groups. The script currently has three encoded color codes (see subset_colors) which was chosen for the manuscript figures to best distinguish between the three groups presented.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.