mobleylab / freesolv

Experimental and calculated small molecule hydration free energies

Home Page: http://www.escholarship.org/uc/item/6sd403pz

freesolv's Introduction

FreeSolv: Experimental and Calculated Small Molecule Hydration Free Energies

This repository provides an issue tracker and revision control for the FreeSolv database, initially described in JCAMD (10): http://dx.doi.org/10.1007/s10822-014-9747-x. If you find any issues, please raise an issue in the issue tracker or file a pull request!

Releases are automatically assigned unique DOIs via Zenodo.

Abstract:

This work provides a curated database of experimental and calculated hydration free energies for small molecules in water, along with experimental values and input files. Experimental values are taken from prior literature and will continue to be curated, with updated experimental references and data added as it becomes available. Calculated values are based on the GAFF small molecule force field in TIP3P water with AM1-BCC charges, as in the provided parameter files. Values were calculated using the GROMACS simulation package, with full details given in references cited within the database itself. This database builds on previous work from the Mobley lab and others, and extends the prior database. With deposition in eScholarship, the database is now versioned, allowing citation of specific versions of the database, and easier updates.

Background:

This page provides an update of David Mobley's hydration free energy database. The current goal is to provide curated calculated and experimental values for every molecule which the Mobley group has studied at any point, and to allow these to be updated in a versioned manner as issues are found, better experimental data is tracked down or obtained, and so on.

The prior database gives calculated and experimental values for a set of 504 molecules, variously called the "504 molecule set", "the Mobley set", or similar. The explicit solvent study on this set was published in (1) and the implicit solvent version in (2), and the full database is in the supporting information. The "504 molecule set" built on earlier sets, notably that from Rizzo (3) and earlier hydration studies by David Mobley and collaborators.

The current set and format is motivated by several factors:

  • There were several problems with specific molecules and/or experimental values in the 504 molecule set which needed correcting
  • We have studied many additional molecules since then and these need adding to the set
  • We need a way to continue sharing and expanding our set, providing both experimental data with references and calculated values (with parameters) as these are used as inputs to test other methods
  • We want to be able to update the set in a versioned manner without having to write a new paper for every update, which necessitates migrating away from journal supporting information.

What we provide:

The database consists of a .tar.gz file containing:

  • database.txt: A semicolon-delimited text file containing compound IDs, SMILES, IUPAC names or similar, experimental values and uncertainties, calculated values, DOIs for references, and notes. The format is described in the file header.
  • database.pickle: Python pickle file containing the same database, with some extra fields, including 'groups' (functional groups for the compounds as assigned by checkmol), PubChem compound IDs, calculated enthalpies of hydration, some experimental enthalpies of hydration (from ORCHYD), and components of the enthalpy of hydration and hydration free energy (as described in our forthcoming paper, to be linked here soon).
  • groups.txt: Functional groups for compounds as assigned by checkmol. Semicolon delimited. First field is compound ID, second field is compound name, and subsequent fields are functional groups.
  • iupac_to_cid.pickle, smiles_to_cid.pickle: Python pickle files containing conversion of IUPAC name to compound id and SMILES string to compound id, stored in dictionaries
  • Structure files:
    • mol2files_sybyl.tar.gz: mol2 files with partial charges as written by OEChem in Sybyl format/Sybyl atom types
    • mol2files_gaff.tar.gz: mol2 files with partial charges as used for our hydration free energy calculations (AMBER GAFF atom types)
    • sdffiles.tar.gz: sdf files with partial charges as written by OEChem
    • gromacs_original.tar.gz: GROMACS format topology and coordinate files as used for our AM1-BCC GAFF hydration free energy calculations. Technical note: There may be some variation as to whether water molecules are or are not included in these files; these are intended to be used for the small molecule parameters only.

(See the Manifest below for a more complete list of all available files.)
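As a minimal sketch of reading the semicolon-delimited database.txt described above: the column order used here (ID; SMILES; name; experimental value; experimental uncertainty; calculated value) and the use of '#' for header lines are assumptions that should be checked against the header of the actual file. The rows below are placeholders for illustration, not real FreeSolv entries.

```python
import csv
import io

# Placeholder rows for illustration only; these are NOT real FreeSolv entries.
# Assumed layout: '#' header lines, then semicolon-delimited records whose
# first six fields are ID; SMILES; name; expt; expt uncertainty; calc.
SAMPLE = """\
# example header line describing the columns
example_0001; CCO; placeholder A; -1.00; 0.20; -1.50; extra fields ignored
example_0002; c1ccccc1; placeholder B; -2.00; 0.30; -2.25; extra fields ignored
"""


def read_database(handle):
    """Return {compound_id: record} from a semicolon-delimited text stream."""
    records = {}
    # QUOTE_NONE keeps quote characters inside names or notes as literal text.
    for row in csv.reader(handle, delimiter=";", quoting=csv.QUOTE_NONE):
        # Skip blank lines and header/comment lines.
        if not row or row[0].lstrip().startswith("#"):
            continue
        fields = [field.strip() for field in row]
        records[fields[0]] = {
            "smiles": fields[1],
            "name": fields[2],
            "expt": float(fields[3]),
            "d_expt": float(fields[4]),
            "calc": float(fields[5]),
        }
    return records


database = read_database(io.StringIO(SAMPLE))
# Signed error (calculated minus experimental) per compound, in kcal/mol.
errors = {cid: rec["calc"] - rec["expt"] for cid, rec in database.items()}
```

To use this on the real file, pass an open file handle instead of the io.StringIO wrapper, after confirming the column order in the file's own header.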

The future:

The database is maintained on the cite-able eScholarship repository of the University of California. It is currently available on that site at www.escholarship.org/uc/item/6sd403pz. Updated versions will be maintained there, mirroring point releases provided via this GitHub site.

Please cite:

Mobley, David L. (2013). Experimental and Calculated Small Molecule Hydration Free Energies. UC Irvine: Department of Pharmaceutical Sciences, UCI. Retrieved from: http://www.escholarship.org/uc/item/6sd403pz

Manifest

  • gromacs_analysis: Contains plots resulting from GROMACS analysis of some of the data in FreeSolv.
  • gromacs_energies: Contains XVG files associated with the most recent (2017) update of FreeSolv calculated values; these files are large and are only available in the archived version of the database and not on GitHub.
  • gromacs_mdpfiles: Contains GROMACS run (.mdp) files used for the calculations connected with the most recent (2017) update of the calculated hydration free energies and enthalpies reported here.
  • mol2files_gaff.tar.gz: contains mol2 files for all compounds with AM1-BCC charges and GAFF atom types
  • mol2files_sybyl.tar.gz: contains mol2 files for all compounds with AM1-BCC charges and SYBYL atom types
  • primary-data: Primary data from which the contents of this database can be re-generated; obtained from full database via scripts/extract-primary-data.py
  • scripts: Scripts pertaining to the material deposited here
  • sdffiles.tar.gz: SDF-format files for all of the molecules deposited here (as in mol2files_gaff and mol2files_sybyl)
  • amber.tar.gz: AMBER format parameter, coordinate, and frcmod files corresponding to the systems we ultimately simulated in GROMACS.
  • gromacs_original.tar.gz: GROMACS format topology and coordinate files for the calculations associated with the computed values in FreeSolv, for calculations in gas phase. These were generated from AMBER files via acpype, prior to our more recent migration to ParmEd.
  • gromacs_solvated.tar.gz: GROMACS format topology and coordinate files for the calculations associated with the computed values in FreeSolv, for calculations in solution, again generated from AMBER files via acpype.
  • lammps.tar.gz: LAMMPS format topology and coordinate files for the calculations associated with the computed values in FreeSolv, automatically converted using InterMol from the AMBER files
  • charmm.tar.gz: CHARMM format topology and coordinate files for the calculations associated with the computed values in FreeSolv, automatically converted using ParmEd (via InterMol) from the AMBER files
  • gromacs.tar.gz: GROMACS format topology and coordinate files for the calculations associated with the computed values in FreeSolv, automatically converted using ParmEd (via InterMol) from the AMBER files
  • desmond.tar.gz: DESMOND format topology and coordinate files for the calculations associated with the computed values in FreeSolv, automatically converted using InterMol from the AMBER files
  • simulation_comparison_input/: directory containing input files used for the validation of the input conversion files by comparing energy files, description of automated conversion process, and the energy comparisons. See simulation_comparison_input/README.md for more details.
  • README.md: This file
  • database.pickle: Python pickle file of the FreeSolv database
  • database.json: JSON format version of the FreeSolv database also stored in database.pickle
  • database.txt: Text format version of some of the fields from the database
  • groups.txt: Functional groups assigned to the different compounds in the database
  • iupac_to_cid.pickle and .json: Python pickle file and JSON file containing a dictionary for converting IUPAC names to FreeSolv compound IDs
  • smiles_to_cid.pickle and .json: Python pickle and JSON file containing a dictionary for converting SMILES strings to FreeSolv compound IDs
  • notebooks/OrionDB.ipynb: iPython notebook providing an example of concatenating molecules and associating generic data.
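The JSON files listed above can be combined to go from a SMILES string to a database entry. The sketch below uses small in-memory stand-ins rather than the real files; the compound IDs, the field names "expt" and "calc", and the values are all assumptions made for illustration.

```python
import json

# Hypothetical stand-ins for smiles_to_cid.json and database.json. The IDs,
# field names ("expt", "calc") and values are assumptions, not real entries.
smiles_to_cid_text = '{"CCO": "example_0001", "c1ccccc1": "example_0002"}'
database_text = json.dumps({
    "example_0001": {"expt": -1.00, "calc": -1.50},
    "example_0002": {"expt": -2.00, "calc": -2.25},
})

smiles_to_cid = json.loads(smiles_to_cid_text)
database = json.loads(database_text)


def lookup(smiles):
    """Return (compound ID, entry) for an exact SMILES string, or None."""
    cid = smiles_to_cid.get(smiles)
    if cid is None:
        return None
    return cid, database[cid]


hit = lookup("CCO")    # found: the key matches character for character
miss = lookup("OCC")   # same molecule, different string, so no match
```

Because the dictionary is keyed on the literal string, a SMILES written differently from the stored form will not match; canonicalizing the query with the same toolkit that produced the stored strings would be needed for a robust lookup. With the real files, json.load on an open file handle replaces the json.loads calls.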

Rebuilding FreeSolv

The input files deposited here can be rebuilt (from SMILES strings) using the script scripts/rebuild_freesolv.py, which requires the Chodera lab's openmoltools package and the Mobley Lab's SolvationToolkit, both of which are conda installable from the omnia channel.

Change log/version history:

This dataset started by taking all of the compounds we have studied previously with hydration free energies (references 1, 2, 4-9) including those from SAMPL4 and compiling them all into one big set, removing any redundancies and providing data, references, etc. for all of them. Details of changes for specific versions are found below.

On 12/20/2013 this database was moved to the eScholarship site of the University of California, at http://www.escholarship.org/uc/item/6sd403pz.

Version 0.1:

  • We corrected the following problems from the 504 molecule set (1-2):
    • Removal of 504/triacetyl glycerol, which was not the intended molecule (and the intended molecule, glycerol triacetate, is present in v0.1 anyway as it comes in via reference (5))
    • Correction of the experimental value for hexafluoropropene, which had (via (3)) incorrectly been the value for hexafluoro-propan-2-ol
    • Removed several duplicates within the set:
      • 2-methylbut-2-ene under two names
      • 3-methylbut-1-ene
      • benzonitrile vs cyanobenzene
    • Removed a "duplicate" butanal which had an incorrect experimental value
  • We also corrected issues from other sets:
    • The molecule labeled pentan-2-one in the set of (4) was pentan-3-one; the corresponding experimental value was corrected from -3.52 kcal/mol to -3.41 kcal/mol.
    • The molecule "lindane" was removed from the set of reference (6) because the 3D structure has the incorrect stereoisomer and thus the calculations were wrong; this issue seems to have originated with the Guthrie 2009 experimental paper providing the source data.
    • We removed 'prometryn' (set of reference 6) because the chemical structure (3D/2D) does not match the name -- an ethyl where there should be a dimethyl. Again, this seems to have originated from the Guthrie 2009 experimental paper with the source data.
    • We removed 'ethylene glycol diacetate' from the set of reference (5) because the 3D structure does not match the 2D structure as indicated in the paper. [See v0.2 notes -- this revision was actually a mistake, and in fact this was the correct compound, though the tools we were using did not properly parse the alternative name, "ethylene glycol diacetate".]
    • Sulfonyl urea compounds with questionable vapor pressure were removed from the set of SAMPL1 (6) after consultation with J. Peter Guthrie, who had concerns about the quality of this data.
  • Based on a cross-comparison with data from J. Peter Guthrie's dataset (in preparation), we updated several experimental values. Details of how these were changed and why are provided in the 'notes' field within the database itself. The compounds affected were:
    • 4-propylphenol
    • 4-bromophenol
    • 3-hydroxybenzaldehyde
    • 2-methoxyethanol
    • dimethyl sulfoxide (methanesulfinylmethane)
  • Notes were added in a few other cases, especially for formaldehyde, and a number of IUPAC names were standardized

Currently this set contains 642 molecules. Full details will be provided in a paper reporting this database. Please also note that some discrepancies between experimental values here and values in J. Peter Guthrie's database are still being investigated, so we expect that a new version will be released relatively shortly which will update some subset of the experimental values (less than 60, but more than zero).

Version 0.2:

  • Corrected the experimental references from one of our earlier papers (10.1021/jp0667442) which incorrectly reported the data as having come from the Rizzo set, but it instead came from Abraham et al. 1990 (10.1039/P29900000291). Updated experimental uncertainty estimates for this set to match the Abraham et al. "suggestion" of 0.2 kcal/mol
  • Corrected the experimental value for 1,3-butadiene (and the experimental reference), as pointed out by Christopher Bayly (OpenEye Software). Specifically, the Hine and Mookerjee paper (JOC (1975) 40:292) reports two experimental values for 1,3-butadiene: -log(cg)=1.39 and -log(cw)=1.87. From these, the paper derives a value of -0.41 for the former minus the latter, which leads to a transfer free energy of 0.56 kcal/mol. The correct difference is -0.48, not -0.41, which leads to a transfer free energy of 0.65 kcal/mol. This applies to compound mobley_511661, IUPAC 'butadiene'. The prior value was listed as 0.6 kcal/mol in this set (0.56 kcal/mol rounded). The citation was updated as well to point to this original experimental data.
  • Updating 2,6-dichlorosyringaldehyde (mobley_6195751) and 3,5-dichloro-2,6-methoxyphenol (mobley_6688723) with improved values from J. Peter Guthrie's SAMPL4 writeup which were NOT used in the SAMPL4 challenge, as he didn't make final changes until many people's manuscripts were submitted. These took the values for 2,6-dichlorosyringaldehyde from -8.24+/-0.76 to -8.68+/-0.76, and 3,5-dichloro-2,6-methoxyphenol from -6.24+/-0.38 kcal/mol to -6.44+/-0.38 kcal/mol
  • Updated (2E)-hex-2-enal, mobley_2792521, with detailed experimental references and a slight update to the hydration free energy (-3.60 kcal/mol, vs previous value of -3.68 kcal/mol) based on a weighted average of the available experimental data.
  • Re-added "ethylene glycol diacetate" (which was removed under v0.1) from reference (5) as this was in fact the correct compound, and had been removed because of issues relating to handling of the name. This has been assigned the more standard name, "2-acetoxyethyl acetate".
  • Updated uncertainty estimates for experimental values in the set of reference (5) to 0.2 rather than 0.6 kcal/mol, to match the estimate given in reference (5).
  • Experimental references were updated/corrected, typically by drilling down (for example, in v0.1, the experimental citation for the 504 molecule set was listed as reference (2); now, references point to reference (3), the reference for the Bordner set, and to original source data, depending on the compound). Much more could be done here, but as substantial manual intervention is needed it is unlikely to happen soon.
  • In v0.1, IUPAC names for various compounds were supposed to have been modified to make them easier to parse (essentially, standardization of various nonstandard names) and this was reflected in the notes field for these compounds. However, the IUPAC names themselves were never updated. These have now been corrected.
  • In preparation for adding PubChem compound IDs, we detected several IUPAC name/SMILES string pairs which did not lead to a compound on PubChem. Alternate IUPAC names were assigned as follows:
    • mobley_2636578, formerly 1,3-bis-(nitrooxy)propane, renamed as 3-nitrooxypropyl nitrate
    • mobley_819018, formerly trans-3,7-Dimethylocta-2,6-dien-1-ol, renamed as (2E)-3,7-dimethylocta-2,6-dien-1-ol
  • PubChemIDs for all compounds were added automatically using PubChemPy by looking up compounds via IUPAC name, with a fallback to SMILES string. In several cases (mobley_6843802, [(1R)-1,2,2-trifluoroethoxy]benzene; mobley_7869158, [(2S)-butan-2-yl] nitrate; and mobley_9741965, 1,3-bis-(nitrooxy)butane) the PubChem ID was assigned manually because of issues with PubChem's name for the compound and/or issues relating to PubChem not specifying stereochemistry for a chiral center.
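The arithmetic behind the 1,3-butadiene correction in the list above can be checked with a few lines. This sketch assumes the transfer free energy is obtained as -RT ln(10) times the difference in log concentrations, evaluated at 298.15 K; the entry itself does not state the temperature or constants used, but this assumption reproduces both quoted values.

```python
import math

R_KCAL = 1.987204e-3   # gas constant, kcal/(mol K)
T_KELVIN = 298.15      # assumed temperature; not stated in the entry above


def free_energy_from_log_ratio(log_ratio):
    """Transfer free energy (kcal/mol) from a base-10 log concentration ratio."""
    return -R_KCAL * T_KELVIN * math.log(10) * log_ratio


# -log(cg) minus -log(cw), using the two values quoted from the paper.
difference = 1.39 - 1.87

corrected = free_energy_from_log_ratio(difference)   # from the correct -0.48
misprinted = free_energy_from_log_ratio(-0.41)       # from the misprinted -0.41

print(f"difference = {difference:.2f}")
print(f"corrected  = {corrected:.2f} kcal/mol")
print(f"misprinted = {misprinted:.2f} kcal/mol")
```

Under these assumptions the misprinted difference gives the previously listed 0.56 kcal/mol and the corrected difference gives 0.65 kcal/mol.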

Version 0.21:

  • The structure files for 2-acetoxyethyl acetate, mobley_4689084, SMILES CC(=O)OCCOC(=O)C, contained multiple conformations of the molecule. This was corrected. Additionally, the .sdf file for this molecule had been written in mol2 format.
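Both problems in this entry can be screened for with a simple text check. The sketch below counts "@<TRIPOS>MOLECULE" record headers, on the assumption that each conformer in a multi-conformer mol2 file is written as its own record, and flags text that carries Tripos record tags (which a genuine .sdf file should not). The function names and the sample text are hypothetical.

```python
def count_mol2_records(text):
    """Count '@<TRIPOS>MOLECULE' record headers in mol2-format text."""
    return sum(
        1 for line in text.splitlines()
        if line.strip() == "@<TRIPOS>MOLECULE"
    )


def looks_like_mol2(text):
    """Return True if the text contains any Tripos record type indicator."""
    return "@<TRIPOS>" in text


# Skeleton of a two-record file, for illustration only (no real coordinates).
SAMPLE = "\n".join([
    "@<TRIPOS>MOLECULE", "conformer 1", "@<TRIPOS>ATOM",
    "@<TRIPOS>MOLECULE", "conformer 2", "@<TRIPOS>ATOM",
])

n_records = count_mol2_records(SAMPLE)   # more than 1 suggests extra conformers
is_mol2 = looks_like_mol2(SAMPLE)        # True for a file with a .sdf name is suspect
```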

Version 0.3 (Feb. 4, 2014):

  • Due to bug(s) in Checkmol and issues with its handling of the .mol2 file format, functional groups assigned to some molecules were incorrect (for example, around eight molecules were incorrectly labeled as cations, with no other groups correctly assigned). After correspondence with the authors, we switched to running checkmol on the associated .sdf files, which are better supported by the program, eliminating these problems. All functional groups were re-computed and re-stored.

Version 0.31 (Sept. 25, 2014):

  • Repaired partial charges in some .mol2/.sdf files: Due to a human error in retrieving old files, the .mol2 and .sdf files for the compounds from the Dumont set (calculated value reference key 10.1021/jp0667442) contained partial charges which were inconsistent with those used for the calculated values. In six cases, the partial charges in the distributed files were zero, whereas in the remainder of cases they were only slightly different due to use of an apparently different charge calculation procedure. The six compounds with zero charges were mobley_186894, mobley_2005792, mobley_3738859, mobley_5157661, mobley_5449201, and mobley_9055303, while the full list of affected compounds was IDs mobley_1323538, mobley_5449201, mobley_3053621, mobley_3738859, mobley_8427539, mobley_1873346, mobley_5157661, mobley_9979854, mobley_2005792, mobley_9055303, mobley_1923244, mobley_3727287, mobley_20524, mobley_2068538, mobley_1875719, mobley_186894, mobley_2049967, mobley_511661, mobley_2972906, mobley_4035953, mobley_525934, mobley_1728386, mobley_2178600.
  • Some .mol2 files had residue names listed as <0>, which can cause problems for some codes. All .mol2 files were standardized to use the residue name “MOL”.
  • Corrected expt_reference field for 423 molecules to correctly point to the Rizzo et al. work (10.1021/ct050097l) rather than the Mobley et al. 504 molecule study. Corrected expt_reference field for mobley_8809274 as it had been in error in the Rizzo work (personal correspondence, RC Rizzo)
  • Minor details:
    • Updated database.txt to have correct release date and version, and to list units of free energies in the headers.
    • Added citation detail for FreeSolv to the References section below.
    • Corrected IUPAC name of ‘biphenyle’ to ‘biphenyl’; the notes already said this had been done, but the name had not been updated.

Version 0.32 (Sept. 29, 2015):

  • Corrected SMILES strings (and other files) for nitro-containing compounds mobley_3802803 and mobley_9741965. Due to some type of earlier error, the GAFF and SYBYL .mol2 files for these contained incorrect bonding in the nitro group(s), which resulted in generation of incorrect SMILES when generating FreeSolv. These SMILES strings have now been corrected, as has the bond type in the .mol2 files. Partial charges in the topology files and .mol2 files were retained as used for the calculations reported here, and will be updated in a subsequent release when the calculations are repeated. Checkmol groups for these compounds were also updated. Thanks to Christopher Bayly for noticing these issues. (9/29/15)
  • Added (temporarily?) unique, short nicknames to all compounds in database.txt and database.pickle; these consist of IUPAC names when short, or common/other names which are unique and lead to useful hits when used as search terms. (Approx. 10/21/14)
  • Removed mobley_4689084, which duplicates mobley_352111 (same experimental value and source data, but the calculated value of the former is older, and its topology/coordinate files were less well curated). (10/24/14)

Version 0.320:

Same as the above, but initiates Zenodo DOIs. DOI http://dx.doi.org/10.5281/zenodo.159499

Version 0.5 (Jan. 26, 2017) (10.5281/zenodo.264280):

  • Re-generates all input files (.mol2, .sdf, GROMACS and AMBER format files, etc.) from primary data (SMILES strings)
  • Deposits scripts used for re-generating the database in the scripts directory
  • Re-calculates all calculated values (in conjunction with forthcoming paper)
  • Adds calculated enthalpies of hydration and components of enthalpy
  • Adds charge and non-polar components of hydration free energy
  • Adds a few experimental enthalpies of hydration obtained from the ORCHYD dataset
  • Adds README.md files in some of the sub-directories better indicating their contents
  • Corrects tripos_mol2 back to mol2files_sybyl for consistency with mol2files_gaff (as in a prior version, but we had lost this change)
  • Provides JSON versions of database files

Version 0.51 (April 5, 2017) (10.5281/zenodo.495235):

  • Introduced automatically-generated input files for CHARMM, DESMOND, and LAMMPS, and alternate GROMACS files generated via ParmEd rather than acpype
  • Reorganizes naming convention of simulation structure files
  • Provides energy comparison of all automatically generated files in simulation_comparison_input
  • Addition of notebooks directory

The changes made in the Version 0.5 and 0.51 updates are described in our recent FreeSolv update/mini-review paper in the Journal of Chemical and Engineering Data.

Version 0.52 (Jan. 26, 2018, but changes made in Aug. 2017) (10.5281/zenodo.1161245):

  • Remove redundant molecule mobley_4689084 (which duplicates mobley_352111: it had the same experimental value, and a calculated value within uncertainty)
  • Add utility functionality to easily check for duplicates; rebuild database after removing above duplicate and checking for others
  • Update reference for calculated values to refer to the J Chem Engr Data reference.
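A duplicate check of the kind mentioned above can be sketched by grouping compound IDs on a shared key. This is an illustration only and not the utility shipped in the repository: the function name, the "smiles" field name, and the entries are hypothetical, and plain string comparison only catches duplicates whose SMILES were written identically.

```python
from collections import defaultdict


def find_duplicates(database, key="smiles"):
    """Group compound IDs that share the same value for `key`."""
    groups = defaultdict(list)
    for cid, entry in database.items():
        groups[entry[key]].append(cid)
    # Keep only values that are shared by more than one compound ID.
    return {
        value: sorted(cids)
        for value, cids in groups.items()
        if len(cids) > 1
    }


# Hypothetical entries for illustration; not real FreeSolv records.
database = {
    "example_0001": {"smiles": "CC(=O)OCCOC(=O)C"},
    "example_0002": {"smiles": "CCO"},
    "example_0003": {"smiles": "CC(=O)OCCOC(=O)C"},
}

duplicates = find_duplicates(database)
```

Here duplicates maps the shared SMILES string to the two IDs that carry it. Canonicalizing the SMILES with a cheminformatics toolkit before grouping would be needed to catch the same molecule written in different ways.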

Changes not yet in a release:

Contributors

(Please let us know if your name should be on this list but isn't)

  • David L. Mobley (UC Irvine)
  • J. Peter Guthrie (University of Western Ontario)
  • The many people who contributed to the SAMPL challenges over the years and our early studies on hydration free energies, prior to construction of this database.
  • Guilherme Duarte Ramos Matos (UC Irvine)
  • Daisy Y. Kyu (UC Irvine)
  • Caitlin Bannan (UC Irvine)
  • John D. Chodera (MSKCC)
  • Michael R. Shirts (Colorado)
  • Hannes H. Loeffler (STFC Daresbury)
  • Nathan M. Lim (UC Irvine)

References

  • (1) Mobley, D. L., Bayly, C. I., Cooper, M. D., Shirts, M. R., & Dill, K. A. (2009). Small Molecule Hydration Free Energies in Explicit Solvent: An Extensive Test of Fixed-Charge Atomistic Simulations. Journal of Chemical Theory and Computation, 5(2), 350–358.
  • (2) Mobley, D. L., Dill, K., & Chodera, J. D. (2008). Treating entropy and conformational changes in implicit solvent simulations of small molecules. The Journal of Physical Chemistry B, 112(3), 938.
  • (3) Rizzo, R. C., Aynechi, T., Case, D. A., & Kuntz, I. D. (2006). Estimation of Absolute Free Energies of Hydration Using Continuum Methods: Accuracy of Partial Charge Models and Optimization of Nonpolar Contributions. Journal of Chemical Theory and Computation, 2(1), 128–139. doi:10.1021/ct050097l
  • (4) Mobley, D. L., Dumont, É., Chodera, J. D., & Dill, K. (2007). Comparison of charge models for fixed-charge force fields: Small-molecule hydration free energies in explicit solvent. The Journal of Physical Chemistry B, 111(9), 2242–2254.
  • (5) Nicholls, A., Mobley, D. L., Guthrie, J. P., Chodera, J. D., Bayly, C. I., Cooper, M. D., & Pande, V. S. (2008). Predicting small-molecule solvation free energies: an informal blind test for computational chemistry. Journal of Medicinal Chemistry, 51(4), 769–779. doi:10.1021/jm070549+
  • (6) Mobley, D. L., Bayly, C. I., Cooper, M. D., & Dill, K. A. (2009). Predictions of hydration free energies from all-atom molecular dynamics simulations. The Journal of Physical Chemistry B, 113(14), 4533–4537. doi:10.1021/jp806838b
  • (7) Klimovich, P., & Mobley, D. L. (2010). Predicting hydration free energies using all-atom molecular dynamics simulations and multiple starting conformations. Journal of Computer-Aided Molecular Design, 24(4), 307–316.
  • (8) Mobley, D. L., Liu, S., Cerutti, D. S., Swope, W. C., & Rice, J. E. (2012). Alchemical prediction of hydration free energies for SAMPL. Journal of Computer-Aided Molecular Design, 26(5), 551–562. doi:10.1007/s10822-011-9528-8
  • (9) Mobley, D. L., Wymer, K. L., Lim, N. M., & Guthrie, J. P. (2014). Blind prediction of solvation free energies from the SAMPL4 challenge. Journal of Computer-Aided Molecular Design, 28, 135–150.
  • (10) Mobley, D. L., & Guthrie, J. P. (2014). FreeSolv: A database of experimental and calculated hydration free energies, with input files. Journal of Computer-Aided Molecular Design, 28(7), 711–720.
  • (11) Duarte Ramos Matos, G. et al. (2017). Approaches for calculating solvation free energies and enthalpies demonstrated with an update of the FreeSolv database. Journal of Chemical and Engineering Data, 62(5), 1559–1569. doi:10.1021/acs.jced.7b00104

freesolv's People

Contributors

davidlmobley, jchodera, kyleabeauchamp, leeping, mrshirts

freesolv's Issues

GAFF version

I've searched around, but I can't seem to find which version of GAFF was used. The difference between GAFF v1.7 and v1.8 could be noticeable (or perhaps an even older version was used?). If the exact GAFF version cannot be determined, the Amber release should give me a pretty good idea.

mobley_3323117 (sulfolane) has non-standard SMILES

Molecule mobley_3323117 (sulfolane) is written with the non-standard SMILES C1CC[S+2](C1)([O-])[O-], rather than the more standard C1CCS(=O)(=O)C1.

Despite being equivalent in total charge, these forms are inequivalent due to the provided formal charges (+2 for S, -1 for O) vs the standard SMILES (all atoms have 0 formal charge), which are rendered inequivalent in molecular representations in the OpenFF toolkit (with the OpenEye backend):

>>> from openff.toolkit.topology import Molecule
>>> freesolv_molecule = Molecule.from_smiles('C1CC[S+2](C1)([O-])[O-]')
>>> standard_molecule = Molecule.from_smiles('C1CCS(=O)(=O)C1')
>>> freesolv_molecule.generate_unique_atom_names()
>>> standard_molecule.generate_unique_atom_names()
>>> [(atom.name, atom.formal_charge.m) for atom in freesolv_molecule.atoms]
[('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 2), ('C4x', 0), ('O1x', -1), ('O2x', -1), ('H1x', 0), ('H2x', 0), ('H3x', 0), ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]
>>> [(atom.name, atom.formal_charge.m) for atom in standard_molecule.atoms]
[('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 0), ('O1x', 0), ('O2x', 0), ('C4x', 0), ('H1x', 0), ('H2x', 0), ('H3x', 0), ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]

Would it be reasonable to correct the non-standard SMILES string and re-generate the database?
Or are there ways to automatically standardize the formal charges?
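The difference between the two strings can be made explicit without a cheminformatics toolkit by reading the charges written on bracket atoms. This is only a text-level sketch for illustration (the function name is hypothetical, and it does not standardize anything); it relies on the SMILES convention that charges appear only inside square brackets.

```python
import re

BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")   # contents of each [...] atom
CHARGE = re.compile(r"(\++|-+)(\d*)")        # e.g. "+", "-", "+2", "--"


def formal_charges(smiles):
    """List the formal charge written on each bracket atom of a SMILES string."""
    charges = []
    for content in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(content)
        if match is None:
            charges.append(0)   # bracket atom with no charge specified
            continue
        signs, digits = match.groups()
        magnitude = int(digits) if digits else len(signs)
        charges.append(magnitude if signs[0] == "+" else -magnitude)
    return charges


freesolv_charges = formal_charges("C1CC[S+2](C1)([O-])[O-]")
standard_charges = formal_charges("C1CCS(=O)(=O)C1")
```

For the FreeSolv string the bracket atoms carry +2, -1 and -1, which sum to zero, while the standard string has no bracket atoms and therefore no explicit formal charges. Actually converting one form to the other would require a normalization step in a cheminformatics toolkit.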

Problems with processing some SMILES - omega returned error code 0

Hello!

When using the FreeSolv workflow for converting SMILES to .mol2 files I am encountering an error
RuntimeError: omega returned error code 0
for the SMILES copied below.
I would be grateful for suggestions how to overcome this. I am also attaching a copy of my conda environment.

Best wishes,
Kadi

OCc1cc(ccc1O)C(O)CNC(C)(C)C
CC(C)NCC(COc1ccc(cc1)CC(=O)N)O
O=S(c1ccc(cc1)\C=C3/c2ccc(F)cc2\C(=C3C)CC(=O)O)C
Fc2cc(ccc2c1ccccc1)C(C(=O)O)C
O=S(=O)(N=C(N)CCSCc1nc(/N=C(/N)N)sc1)N
CC(C)Cc1ccc(cc1)C(C)C(=O)O
CC(C)NCC(COc1cccc2c1cccc2)O
O=C(O)C(c1ccc(cc1)N3C(=O)c2ccccc2C3)C
CC(C)NCC(O)COc1cccc2[nH]ccc12
OC(CNC(C)(C)C)COc1cccc2c1CC@HC@HC2
c1ccc2c(c1)C(=O)N(C2=O)C3CCC(=O)NC3=O
CN3[C@H]1CC[C@@h]3CC@@HOC(=O)C(CO)c2ccccc2
O=S(=O)(c1ccc(cc1)C)NC(=O)NN3CC2CCCC2C3
OC(c1ccccc1)(CCN2CCCCC2)C3CCCCC3

environment.yml.txt

Re-construct database files from primary data

This is implied by several other issues, but ought to exist as its own issue.

To complete this issue, we must first resolve Issues #14, #13, and #19.

This issue must be completed before we can resolve Issues #15 and #16.

Resolving this issue will also provide the best resolution of the issue with water molecules in topology files (# to be inserted here).

Delete ANTECHAMBER.AC file in top directory?

I uploaded all the files from the v0.31 release as the first commit. However, it seems like there were some stray antechamber files living in the top directory of the tarball. Can we delete those, @davidlmobley? If so, I'll file a pull request.

Decide any other supporting files/data which ought to be captured when database is re-constructed from primary data

There are other supporting files/information aside from those we currently provide which have been requested:

Migrate issues and close this repo?

@jchodera - do you have any objection to closing this repo, or at least updating the README.md to reflect that the official repo is elsewhere, if I migrate the issues over to github.com/mobleylab/freesolv?

Now that my group is up on GitHub, it would probably be best for me to continue to maintain this on our site rather than yours, especially since we have some updates coming soon.

Cannot generate GAFF mol2 from Tripos mol2 file

Processing the provided Tripos mol2 file for this molecule by antechamber leads to failure:

@<TRIPOS>MOLECULE
1,2,3,4,5-pentachloro-6-nitro-benzene
   14    14     1     0     0
SMALL
No Charge or Current Charge


@<TRIPOS>ATOM
      1 C1          1.8850   -1.0360   -0.1120 ca        1 MOL     -0.153400
      2 C2          2.9210   -1.6310    0.6080 ca        1 MOL      0.071500
      3 C3          2.8780   -1.6520    2.0020 ca        1 MOL      0.005700
      4 C4          1.8000   -1.0790    2.6760 ca        1 MOL      0.054600
      5 C5          0.7640   -0.4840    1.9550 ca        1 MOL      0.005700
      6 C6          0.8070   -0.4630    0.5610 ca        1 MOL      0.071500
      7 Cl1        -0.4790    0.2740   -0.3570 cl        1 MOL      0.005500
      8 Cl2        -0.5850    0.2340    2.7930 cl        1 MOL     -0.012400
      9 Cl3         1.7470   -1.1050    4.4180 cl        1 MOL     -0.009400
     10 Cl4         4.1750   -2.3960    2.8980 cl        1 MOL     -0.012400
     11 Cl5         4.2590   -2.3420   -0.2520 cl        1 MOL      0.005500
     12 N1          1.9280   -1.0150   -1.5410 no        1 MOL      0.316800
     13 O1          0.9850   -0.4760   -2.1600 o         1 MOL     -0.174500
     14 O2          2.9070   -1.5360   -2.1180 o         1 MOL     -0.174500
@<TRIPOS>BOND
     1    1    6 ar  
     2    1    2 ar  
     3    2    3 ar  
     4    3    4 ar  
     5    4    5 ar  
     6    5    6 ar  
     7    6    7 1   
     8    5    8 1   
     9    4    9 1   
    10    3   10 1   
    11    2   11 1   
    12    1   12 1   
    13   12   13 1   
    14   12   14 1   
@<TRIPOS>SUBSTRUCTURE
     1 MOL         1 TEMP              0 ****  ****    0 ROOT

To reproduce (using AmberTools 18.0):

$ antechamber -i in.mol2 -fi mol2 -o out.mol2 -fo mol2 -s 2 -at gaff2 -c bcc 

Welcome to antechamber 17.3: molecular input file processor.

acdoctor mode is on: check and diagnosis problems in the input file.
-- Check Format for mol2 File --
   Status: pass
Info: Finished reading file (in.mol2).
-- Check Unusual Elements --
   Status: pass
-- Check Open Valences --
   Status: pass
-- Check Geometry --
      for those bonded   
      for those not bonded   
   Status: pass
-- Check Weird Bonds --
/Users/choderaj/miniconda/bin/to_be_dispatched/antechamber: Fatal Error!
Weird atomic valence (5) for atom (ID: 12, Name: N1).
       Please check atomic connectivity.

Even running without requesting charges leads to the same failure:

$ antechamber -i in.mol2 -fi mol2 -o out.mol2 -fo mol2

Welcome to antechamber 17.3: molecular input file processor.

acdoctor mode is on: check and diagnosis problems in the input file.
-- Check Format for mol2 File --
   Status: pass
-- Check Unusual Elements --
   Status: pass
-- Check Open Valences --
   Status: pass
-- Check Geometry --
      for those bonded   
      for those not bonded   
   Status: pass
-- Check Weird Bonds --
/Users/choderaj/miniconda/bin/to_be_dispatched/antechamber: Fatal Error!
Weird atomic valence (5) for atom (ID: 12, Name: N1).
       Please check atomic connectivity.

I could not locate the script in this repo used to generate the GAFF mol2 files, so I could not check (1) which AmberTools version, and (2) which options were used.
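
One way to look at the connectivity that antechamber objects to, independently of antechamber itself, is to tally the bond orders listed for each atom in the @<TRIPOS>BOND record. The following is only a diagnostic sketch: `tally_bond_orders` and `BOND_ORDER` are hypothetical names, counting "ar" as 1.5 is an assumption, and this does not reproduce antechamber's own valence check.

```python
from collections import defaultdict

# Nominal bond orders for Tripos bond-type codes.
# Assumption: aromatic ("ar") bonds are counted as 1.5.
BOND_ORDER = {"1": 1.0, "2": 2.0, "3": 3.0, "ar": 1.5, "am": 1.0}


def tally_bond_orders(mol2_text):
    """Return {atom_id: summed bond order} from the @<TRIPOS>BOND record."""
    totals = defaultdict(float)
    in_bonds = False
    for line in mol2_text.splitlines():
        if line.startswith("@<TRIPOS>"):
            # Only lines inside the BOND record are tallied.
            in_bonds = line.strip() == "@<TRIPOS>BOND"
            continue
        fields = line.split()
        if in_bonds and len(fields) >= 4:
            _, first, second, bond_type = fields[:4]
            order = BOND_ORDER.get(bond_type, 0.0)
            totals[int(first)] += order
            totals[int(second)] += order
    return dict(totals)


# The three bonds involving N1 (atom 12), copied from the file above.
example = """@<TRIPOS>BOND
    12    1   12 1
    13   12   13 1
    14   12   14 1
@<TRIPOS>SUBSTRUCTURE
     1 MOL         1 TEMP              0 ****  ****    0 ROOT
"""
totals = tally_bond_orders(example)
print(totals[12])
```

For this file the tally shows that N1 is written with three single bonds (to C1, O1 and O2), which may help when comparing against whatever bond assignment the original generation script produced.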

Re-compute hydration free energies for all compounds

  • Compute explicit solvent hydration free energies for all compounds
    • Store both charging and non-polar (vdW) components separately - these are useful to implicit solvent developers
  • Compute implicit solvent hydration free energies for all compounds
    • Again, polar and non-polar components should both be stored

If this is handled prior to trajectory regeneration (#15) it should be done at the same time.

Set up Travis-CI testing

I want to enable Travis-CI testing. At the very least it should probably:

  • Check that the info in the database is complete
  • Check for duplicate molecules
  • Check that reading in the mol2/sdf files into OEMols results in the same isomeric SMILES as those corresponding to the "source data"
  • Re-make the plots and re-compute the statistics (and check that statistics haven't changed?)

More ambitiously, it could also:

  • Rebuild input files (all of them is probably too slow, though this could be flagged as a "slow" test and skipped by Travis-CI; perhaps a test could rebuild some of the input files)
  • ...?
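
The first two checks could be sketched roughly as below. This is a minimal illustration, not the project's test suite: the record layout (a list of dicts with `id`, `smiles` and `iupac` keys) and the function names are assumptions, and the empty `iupac` field in the third record is blanked deliberately to trigger the completeness check.

```python
from collections import defaultdict


def find_incomplete(records, required_fields):
    """Return IDs of records with a missing or empty required field."""
    incomplete = []
    for record in records:
        if any(not record.get(field) for field in required_fields):
            incomplete.append(record["id"])
    return incomplete


def find_duplicates(records, key="smiles"):
    """Return {key value: [ids]} for key values shared by more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["id"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


# IDs, names and SMILES taken from the duplicate report in this tracker;
# the empty "iupac" entry is deliberately blanked for illustration.
records = [
    {"id": "mobley_4689084", "smiles": "CC(=O)OCCOC(=O)C", "iupac": "2-acetoxyethyl acetate"},
    {"id": "mobley_352111", "smiles": "CC(=O)OCCOC(=O)C", "iupac": "2-acetoxyethyl acetate"},
    {"id": "mobley_9913368", "smiles": "C(=CCl)Cl", "iupac": ""},
]

incomplete = find_incomplete(records, required_fields=("smiles", "iupac"))
duplicates = find_duplicates(records)
print(incomplete)
print(duplicates)
```

In a CI run, each function's result would be asserted to be empty, so that any incomplete or duplicated entry fails the build.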

Update column names to be more informative

Now that we're distributing FreeSolv in multiple forms (SDF files, Orion datasets), it would be useful if the column names (that become SD tags) were more informative and self-documenting about things like units. For example, the "expt" label is problematic since (1) it doesn't describe which of the many properties this is the experimental value for, and (2) it doesn't give the units. More informative names would very much help!
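
As a rough illustration of what self-documenting names could look like, the sketch below renames columns through a lookup table. The new names, the `d_expt` label, and the helper `rename_columns` are all hypothetical, the units are assumed to be kcal/mol, and the numeric values are placeholders rather than data.

```python
# Hypothetical mapping from short labels to names that state property and units.
COLUMN_RENAMES = {
    "expt": "experimental_hydration_free_energy_kcal_per_mol",
    "d_expt": "experimental_uncertainty_kcal_per_mol",
}


def rename_columns(row, renames=COLUMN_RENAMES):
    """Return a copy of row with known labels replaced; others pass through."""
    return {renames.get(name, name): value for name, value in row.items()}


# Placeholder values only; not real FreeSolv data.
row = {"smiles": "CCO", "expt": 0.0, "d_expt": 0.0}
renamed = rename_columns(row)
print(list(renamed))
```

Keeping the mapping in one table means the SDF tags and any other export format can share the same names.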

Re-generate trajectories for all compounds - implicit solvent, explicit solvent, and vacuum

I've had some requests for trajectory files associated with the calculated hydration free energies - especially, for endpoint trajectories. I would like to see all of these re-generated via a consistent, reproducible protocol from the latest version of input files.

This could be part of an effort to re-calculate all of the hydration free energies with a consistent protocol (which has been discussed elsewhere) or simply a re-generation of endpoint trajectories separately from that.

Relatedly, there is also some interest in having calculated hydration free energies in implicit solvent, along with associated endpoint trajectories.

Thoughts?

Sanitize SDF files

The current SDF files contain about 40 molecules that are non-neutral. Here's a script that regenerates correct ones.

import csv
from rdkit import Chem
from rdkit.Chem import AllChem


def is_neutral(mol):
    """Return True if the formal charges on all atoms sum to zero."""
    return sum(a.GetFormalCharge() for a in mol.GetAtoms()) == 0


mols = []
mmff_fail_count = 0

with open('database.txt', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    for line, row in enumerate(reader):
        # Skip the first three lines of database.txt
        if line > 2:
            name = row[0]
            smiles = row[1]

            # Build the molecule from SMILES and add explicit hydrogens
            mol = Chem.MolFromSmiles(smiles)
            mol = Chem.AddHs(mol)

            print(smiles)
            # Generate 3D coordinates; EmbedMolecule returns 0 on success
            res = AllChem.EmbedMolecule(mol)
            assert res == 0

            # A non-zero return means the MMFF optimization did not converge
            # or could not be set up
            res = AllChem.MMFFOptimizeMolecule(mol)
            if res != 0:
                mmff_fail_count += 1

            # Experimental value and its uncertainty
            exp_dG = float(row[3])
            exp_dG_err = float(row[4])

            mol.SetProp('_Name', name)
            mol.SetProp('dG', str(exp_dG))
            mol.SetProp('dG_err', str(exp_dG_err))

            assert is_neutral(mol)

            mols.append(mol)

print("mm_fail", mmff_fail_count)

w = Chem.SDWriter('freesolv.sdf')
for m in mols:
    w.write(m)
w.close()

print("wrote", len(mols), "mols")

rebuild_freesolv.py script

Hello!

I have been trying to run the rebuild_freesolv.py script, but the molecules do not get correct coordinates: all atoms are written with coordinates of zero, unlike the files already available in the mol2files_sybyl folder. Do you know what the reason for this may be?

Best wishes,
Kadi

Strip water molecules from all topology/coordinate files in current database

Because of prior manual curation of files, not all topology and coordinate files contain water molecules. Additionally, I just found out (from Sereina Riniker - e-mail excerpt below) that some of these contain TIP4P-Ew water molecules rather than TIP3P. Again, this is a result of manually gathering the topology/coordinate files for these (in some cases by students). The best long-term solution is to re-generate all topology/coordinate files from original source data (Issue #20), but an interim solution is simply to strip all water molecules from the existing topology/coordinate files.

Riniker's e-mail said this, in part:
"Regarding the [input files] I noticed two things which I thought you might like to know if you do not already. In the most recent version v0.31, I encountered 78 molecules where the GROMACS coordinate file .gro does not contain the solvent coordinates. In addition, there are 23 molecules where the solvent model in the coordinate file is not TIP3P (it contains 4 coordinates per solvent molecule). I attach the list of molecule numbers in case you would like to have a look at them."

The compound ID numbers for setups with TIP4P are:
1323538
1728386
186894
1873346
1875719
1923244
2005792
2049967
20524
2068538
2178600
2972906
3053621
3727287
3738859
4035953
511661
5157661
525934
5449201
8427539
9055303
9979854

And those for setups with no water are:
1034539
1160109
1469079
172879
1893815
1905088
1944394
2126135
2316618
242480
2484519
2492140
2613240
2636578
2659552
2844990
2845466
2850833
2960202
2972345
3040612
3083321
3211679
3265457
3269819
3359593
3515580
3686115
3802803
3976574
4149784
4371692
4479135
4587267
4603202
4613090
4678740
4689084
486214
4936555
5003962
5006685
5282042
5371840
5456566
5510474
5538249
5561855
5616693
5917842
6102880
6190089
6195751
6198745
628951
6359156
667278
6688723
6935906
7239499
7417968
7676709
7913234
8052240
819018
8208692
8311303
8337722
8823527
8827942
8883511
9257453
9510785
9653690
9717937
9741965
9821936
9897248
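
For the coordinate files, the interim fix could be sketched along the following lines. This is a minimal illustration that relies on the fixed-width .gro layout; the set of water residue names is an assumption, all function names are hypothetical, topology files are not handled, and atom numbers are left as they were rather than renumbered. The site count per water molecule is reported as well, since 4 sites per molecule is what the e-mail above describes for the non-TIP3P setups.

```python
from itertools import groupby

# Assumption: residue names commonly used for water in GROMACS files.
WATER_RESNAMES = {"SOL", "HOH", "WAT"}


def parse_gro(text):
    """Split .gro text into (title, atom_lines, box_line)."""
    lines = text.splitlines()
    n_atoms = int(lines[1])
    return lines[0], lines[2:2 + n_atoms], lines[2 + n_atoms]


def residue_key(atom_line):
    """Fixed-width .gro columns: residue number (0-5), residue name (5-10)."""
    return atom_line[0:5].strip(), atom_line[5:10].strip()


def water_site_counts(atom_lines):
    """Return the set of atoms-per-molecule values found among water residues."""
    counts = set()
    # Group consecutive lines belonging to the same residue.
    for (_, resname), group in groupby(atom_lines, key=residue_key):
        if resname in WATER_RESNAMES:
            counts.add(sum(1 for _ in group))
    return counts


def strip_water(text):
    """Return .gro text with water residues removed and the atom count updated."""
    title, atoms, box = parse_gro(text)
    kept = [a for a in atoms if residue_key(a)[1] not in WATER_RESNAMES]
    return "\n".join([title, f"{len(kept):5d}", *kept, box]) + "\n"


def gro_atom(resnum, resname, atomname, atomnum, x, y, z):
    """Format one atom line in the fixed-width .gro layout."""
    return f"{resnum:5d}{resname:<5s}{atomname:>5s}{atomnum:5d}{x:8.3f}{y:8.3f}{z:8.3f}"


# Toy system with made-up coordinates: one solute atom plus two 3-site waters.
atoms = [gro_atom(1, "MOL", "C1", 1, 0.0, 0.0, 0.0)]
for i, res in enumerate((2, 3)):
    for j, name in enumerate(("OW", "HW1", "HW2")):
        atoms.append(gro_atom(res, "SOL", name, 2 + 3 * i + j, 1.0 + i, 0.1 * j, 0.0))
toy = "\n".join(["toy system", f"{len(atoms):5d}", *atoms,
                 "   3.00000   3.00000   3.00000"]) + "\n"

print(water_site_counts(parse_gro(toy)[1]))  # 3 sites per water in this toy file
stripped = strip_water(toy)
print(stripped)
```

An empty set from `water_site_counts` would flag a file with no water, and a value of 4 would flag a four-site water model.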

Potential duplicate molecules in FreeSolv Set

While typing FreeSolv molecules with smirnoff99Frosst, I found 4 pairs of molecules that are potentially duplicated in the FreeSolv set. Below is the code snippet I used to find the duplicates:

import glob
from openforcefield.utils import read_molecules
from openeye import oechem

# Path to the untarred mol2files_sybyl.tar.gz
DBpath = "/FreeSolv/mol2files_sybyl/*.mol2"

# Molecules and file names seen so far, keyed by isomeric SMILES
isosmiles_to_mol = {}
smi_to_file = {}

for file in glob.glob(DBpath):
    mol = read_molecules(file, verbose=False)[0]
    f = file.split('/')[-1]
    c_mol = oechem.OEMol(mol)
    oechem.OEAddExplicitHydrogens(c_mol)
    smi = oechem.OECreateIsoSmiString(mol)
    # Report any molecule whose SMILES matches one already seen
    if smi in isosmiles_to_mol:
        print("File:   %35s %35s" % (f, smi_to_file[smi]))
        print("Title:  %35s %35s" % (c_mol.GetTitle(), isosmiles_to_mol[smi].GetTitle()))
        print("SMILES: %35s %35s" % (smi, oechem.OECreateIsoSmiString(isosmiles_to_mol[smi])))
        print('\n')

    isosmiles_to_mol[smi] = c_mol
    smi_to_file[smi] = f

# OUTPUT: 

#File:                   mobley_4689084.mol2                  mobley_352111.mol2
#Title:               2-acetoxyethyl acetate              2-acetoxyethyl acetate
#SMILES:                    CC(=O)OCCOC(=O)C                    CC(=O)OCCOC(=O)C
#
#
#File:                   mobley_9897248.mol2                  mobley_819018.mol2
#Title:  (2Z)-3,7-dimethylocta-2,6-dien-1-ol (2E)-3,7-dimethylocta-2,6-dien-1-ol
#SMILES:                   CC(=CCCC(=CCO)C)C                   CC(=CCCC(=CCO)C)C
#
#
#File:                   mobley_9913368.mol2                 mobley_4465023.mol2
#Title:             (E)-1,2-dichloroethylene            (Z)-1,2-dichloroethylene
#SMILES:                           C(=CCl)Cl                           C(=CCl)Cl
#
#
#File:                   mobley_9979854.mol2                  mobley_628086.mol2
#Title:      (2R)-1,1,1-trifluoropropan-2-ol     (2S)-1,1,1-trifluoropropan-2-ol
#SMILES:                       CC(C(F)(F)F)O                       CC(C(F)(F)F)O

mobley_352111 and mobley_4689084 are the same molecule

These two molecules have the same IUPAC name, SMILES string, identical experimental values and nearly identical calculated values. I think these are the only two entries in the set where the molecules are the same by the above criteria.

A few things differ, such as the uncertainty estimate and the citation. Should we remove one of the entries?

Which database format should store primary data?

Currently, there is a Python pickle file that stores both primary data and derived data. This is very convenient for Python, but less convenient for anything that is not Python.

I wonder if we want to keep just the primary data in a nice, portable, small file from which everything (including convenient Python pickles) is derived. But what format should this be?

  • Python pickle (still not super convenient)
  • JSON?
  • XML?
  • SQLite?

As a reminder, we decided the primary data consisted of the following:

  • canonical isomeric SMILES
  • experimental data:
    • experimental value
    • experimental uncertainty
    • citation for experimental data
  • notes field

Eventually, it would be great if there was also more provenance data for the experimental value (e.g. if Peter Guthrie had computed it from combining data from multiple publications and applying a conversion) but this is a more advanced topic.
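
If JSON were chosen, the primary-data file might look roughly like the sketch below, with everything else derived from it. The key names, the compound ID, and all field values are placeholders invented for illustration; only the list of primary fields comes from the discussion above.

```python
import json

# Hypothetical key names for the primary fields listed above.
PRIMARY_FIELDS = ("smiles", "expt", "expt_uncertainty", "expt_reference", "notes")


def extract_primary(entry):
    """Keep only the primary fields of one database entry."""
    return {field: entry.get(field) for field in PRIMARY_FIELDS}


def dumps_primary(database):
    """Serialize the primary data of every entry as sorted, indented JSON."""
    primary = {cid: extract_primary(entry) for cid, entry in database.items()}
    return json.dumps(primary, indent=2, sort_keys=True)


# Placeholder entry; the ID and values are not real FreeSolv data.
database = {
    "mobley_0000000": {
        "smiles": "CCO",
        "expt": 0.0,
        "expt_uncertainty": 0.0,
        "expt_reference": "placeholder citation",
        "notes": "placeholder note",
        "calc": 0.0,  # derived data, dropped from the primary file
    }
}

text = dumps_primary(database)
restored = json.loads(text)
print(text)
```

Sorted keys and fixed indentation keep the file diff-friendly under revision control, which is one argument for a text format over a pickle or SQLite file.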

Should SDF files contain other fields from database?

.sdf files can contain multiple key-value pairs, meaning we could store other fields from the database in these files.

  • Should we do this?
  • If so, which fields should we store? All of them, or a subset?
  • If a subset, which fields?
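
For reference, SD data items are plain text: a header line of the form `> <TAG>`, the value, and a blank line. The sketch below formats a chosen subset of fields that way; `format_sd_data` is a hypothetical helper, and the example tags are illustrative rather than a proposal for which fields to store.

```python
def format_sd_data(fields):
    """Format a dict as SD data items: '> <TAG>', the value, then a blank line."""
    items = []
    for tag, value in fields.items():
        text = str(value)
        # A blank line ends a data item, so values must not contain one.
        if "" in text.split("\n"):
            raise ValueError(f"value for {tag!r} is empty or contains a blank line")
        items.append(f"> <{tag}>\n{text}\n\n")
    return "".join(items)


# Name and SMILES taken from the duplicate report in this tracker.
block = format_sd_data({
    "smiles": "CC(=O)OCCOC(=O)C",
    "iupac": "2-acetoxyethyl acetate",
})
print(block)
```

The resulting text would sit between the molfile's `M  END` line and the `$$$$` record terminator.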
