mcs07 / molvs
Molecule Validation and Standardization
Home Page: https://molvs.readthedocs.io/
License: MIT License
Hi, I have been trying your module to standardize molecules. However, it gets stuck on molecule ZINC000100026244. Judging by the stack trace I get after a keyboard interrupt, it gets stuck at the reionize step. Is there a way to set a timeout so that, if it gets stuck like this, I can skip the molecule and continue?
The example that reproduces the bug and the stack trace are below:
from rdkit import Chem
import molvs
smi="CCOC(=O)C(=O)[CH-]C#N"
s = molvs.Standardizer()
mol=Chem.MolFromSmiles(smi)
mol = s.standardize(mol)
KeyboardInterrupt Traceback (most recent call last)
in ()
2 s = molvs.Standardizer()
3 mol=Chem.MolFromSmiles(smi)
----> 4 mol = s.standardize(mol)
5 print(Chem.MolToSmiles(mol, True))
C:\Software\Miniconda\lib\site-packages\molvs\standardize.py in standardize(self, mol)
97 mol = self.disconnect_metals(mol)
98 mol = self.normalize(mol)
---> 99 mol = self.reionize(mol)
100 Chem.AssignStereochemistry(mol, force=True, cleanIt=True)
101 # TODO: Check this removes symmetric stereocenters
C:\Software\Miniconda\lib\site-packages\molvs\charge.py in __call__(self, mol)
152 def __call__(self, mol):
153 """Calling a Reionizer instance like a function is the same as calling its reionize(mol) method."""
--> 154 return self.reionize(mol)
155
156 def reionize(self, mol):
C:\Software\Miniconda\lib\site-packages\molvs\charge.py in reionize(self, mol)
195
196 while True:
--> 197 ppos, poccur = self._strongest_protonated(mol)
198 ipos, ioccur = self._weakest_ionized(mol)
199 if ioccur and poccur and ppos < ipos:
C:\Software\Miniconda\lib\site-packages\molvs\charge.py in _strongest_protonated(self, mol)
211 def _strongest_protonated(self, mol):
212 for position, pair in enumerate(self.acid_base_pairs):
--> 213 for occurrence in mol.GetSubstructMatches(pair.acid):
214 return position, occurrence
215 return None, None
KeyboardInterrupt:
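One possible workaround for a hang like this, sketched under assumptions rather than taken from MolVS itself, is to run each standardization in a child Python process and abandon it after a timeout. The helper name `standardize_with_timeout`, the `CHILD_CODE` script, and the 10-second default are all illustrative choices, not part of the library.

```python
import subprocess
import sys

# Script executed in the child process. It assumes rdkit and molvs are
# importable there; the calls mirror the reproduction in this report.
CHILD_CODE = """
import sys
from rdkit import Chem
import molvs
mol = Chem.MolFromSmiles(sys.argv[1])
mol = molvs.Standardizer().standardize(mol)
print(Chem.MolToSmiles(mol, True))
"""


def standardize_with_timeout(smiles, timeout=10.0, child_code=CHILD_CODE):
    """Return the child's output, or None on timeout or failure.

    Hypothetical helper: the child process is killed if it exceeds `timeout`.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, smiles],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        # The child raised (for example, the SMILES could not be parsed).
        return None
    return proc.stdout.strip()
```

With this in place, a call such as `standardize_with_timeout("CCOC(=O)C(=O)[CH-]C#N")` would return `None` instead of blocking if the reionize step never finishes. Starting a new interpreter per molecule is slow, so for large sets a pool of long-lived worker processes may be a better fit.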
Many thanks for making the very useful MolVS library freely available. I have been using it in a script that detects duplicate structures across different databases when only structural information is provided (no name, no CAS number), by looping through all possible tautomeric forms of structures that have the same molecular formula. It works great and I am happy to share the code if you are interested. There is one limitation that I would like to address: ring-chain tautomerism. As an example, the drug warfarin can exist in an open form O=C(OC1=CC=CC=C12)C(C(CC(C)=O)C3=CC=CC=C3)=C2O or a closed form O=C(OC1=CC=CC=C12)C3=C2OC(O)(C)C(C(C)=O)C3C4=CC=CC=C4. The same applies to many sugars. I have an exhaustive list of these additional ring-chain tautomeric transformations and I would like to add them to your default Tautomer_transforms knowledge base. Could you please tell me how I can add SMARTS patterns to cover the missing tautomer transforms, and how I can use this updated dictionary of tautomer_transforms from my script?
Many thanks and regards,
Alexis Parenty
Hi Matt, a bug report for something I've noticed in the latest release.
The standardize_smiles, Standardizer().tautomer_parent and Standardizer().standardize functions all fail because of line 133. These all work okay in 0.0.9
>>> mol = Chem.MolFromSmiles('[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1')
>>> Standardizer().standardize(mol)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/standardize.py", line 98, in standardize
mol = self.normalize(mol)
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 105, in __call__
return self.normalize(mol)
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 124, in normalize
fragments.append(self._normalize_fragment(fragment))
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 133, in _normalize_fragment
for n in six.moves.range(self.max_restarts):
NameError: name 'six' is not defined
I had a look at the file and it seems that the 'import six' line is missing. When this is added, I get another error, but I think this may be a Py2 vs Py3 difference:
>>> Standardizer().standardize(mol)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/standardize.py", line 98, in standardize
mol = self.normalize(mol)
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 106, in __call__
return self.normalize(mol)
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 125, in normalize
fragments.append(self._normalize_fragment(fragment))
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 137, in _normalize_fragment
product = self._apply_transform(mol, normalization.transform)
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/utils.py", line 28, in fget_memoized
setattr(self, attr_name, fget(self))
File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 42, in transform
return AllChem.ReactionFromSmarts(self.transform_str.encode('utf8'))
Boost.Python.ArgumentError: Python argument types in
rdkit.Chem.rdChemReactions.ReactionFromSmarts(bytes)
did not match C++ signature:
ReactionFromSmarts(char const* SMARTS, boost::python::dict replacements={}, bool useSmiles=False)
When I remove ".encode('utf8')" from line 42 it works as expected in Py3, but I think this would probably produce an error on Py2.
Hope this helps, thanks for working on this, it's been really useful!
EDIT: I see that these changes are already reflected in the file on GitHub, it seems that the PyPI version is just a bit out of date
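One version-neutral way to handle the ".encode('utf8')" difference, offered only as a sketch, is to coerce the SMARTS to whatever the interpreter's native `str` type is before handing it to RDKit. `to_native_str` is a hypothetical helper, not part of MolVS or RDKit.

```python
def to_native_str(value, encoding="utf8"):
    """Return `value` as the interpreter's native str type.

    On Python 3 this decodes bytes to text; on Python 2 it encodes
    unicode to a byte string. Native str values pass through unchanged.
    """
    if isinstance(value, str):
        # Already native: text on Python 3, bytes on Python 2.
        return value
    if isinstance(value, bytes):
        # Only reachable on Python 3, where bytes and str differ.
        return value.decode(encoding)
    # Only reachable on Python 2 for unicode objects.
    return value.encode(encoding)
```

A call such as `AllChem.ReactionFromSmarts(to_native_str(self.transform_str))` would then receive a type that the binding accepts on either interpreter, assuming the binding expects the native string type as the error message suggests.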
Or is there an option to trigger it?
Thanks,
F.
I would like to kindly ask you to add your BibTex citation to the project documentation, so you can take credit for the work.
Have you made plans for reconciling forms of sugars? For example, glucose can exist in an open and closed form. I recently integrated a large number of chemical databases (before knowing about MolVS) and am going through the well-documented woes you describe in this package.
Here is an example for glucose: https://en.wikipedia.org/wiki/Glucose#Open-chain_form
I'm excited to switch my fragmented code for cleaning structures over to your MolVS protocols. The canonical tautomer is something really significant for me. Thanks for putting this together.
Hi Matt,
I am evaluating MolVS for normalizing molecules prior to performing matched pair analysis. My former colleagues at Genentech open sourced their OEChem-based code for normalizing molecules. Normalization is part of the registration workflow. Genentech's business rules for normalization start at line 167 of this file on GitHub: https://github.com/chemalot/chemalot/blob/master/src/com/genentech/struchk/oeStruchk/Struchk.xml. Perhaps you can take a look to see if there are any SMIRKS you could use.
Best,
JW
Using the default settings, this ChEMBL molecule did not standardise overnight:
OP(=O)(O)[O-].OP(=O)([O-])[O-].[O-]S(=O)(=O)[O-].[Na+].[Na+].[Na+].[Mg+2].[Cl-].[Cl-].[K+].[K+] 2104840
Hi Matt,
A disubstituted 1,2,4-triazole has three possible tautomers. I ran molvs standardize on three different SMILES and got three different answers. I don't think tautomer.py covers canonicalization of triazoles.
Here are test cases to reproduce the problem:
molvs standardize -: "CC1=NN=C(CC)N1"
output: CCc1nnc(C)[nH]1
molvs standardize -: "CC1=NC(CC)=NN1"
output: CCc1n[nH]c(C)n1
molvs standardize -: "CC1=NNC(CC)=N1"
output: CCc1nc(C)n[nH]1
Note that I got three different outputs when I expected identical outputs.
Here are structures and their respective smiles for the three different inputs.
Thanks,
JW
Hi,
When the number of tautomers exceeds max_tautomers, we get a log warning. However, as a caller, I would like to report the molecules for which such warnings occurred.
As far as I can see, the number of tautomers is not known before entering TautomerEnumerator.enumerate, which complicates access to this information.
We somehow need to propagate this information to callers like TautomerCanonicalizer (as a complex return value? But then we would be changing the API...).
Thanks and cheers,
christian
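Until the API reports this directly, a caller could intercept the warning through Python's logging module. The sketch below assumes the warning is emitted via the standard `logging` machinery under a logger whose name starts with `molvs`; that logger name, the `capture_warnings` helper, and the `_ListHandler` class are all assumptions made for illustration.

```python
import logging
from contextlib import contextmanager


class _ListHandler(logging.Handler):
    """Collect log records of level WARNING or above in a list."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.records = []

    def emit(self, record):
        self.records.append(record)


@contextmanager
def capture_warnings(logger_name="molvs"):
    """Yield a list that fills with warnings logged under `logger_name`.

    Records from child loggers (e.g. "molvs.tautomer") reach this handler
    through normal propagation. The default name is an assumption.
    """
    logger = logging.getLogger(logger_name)
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        yield handler.records
    finally:
        logger.removeHandler(handler)


# Intended use (requires molvs, so shown as a comment only):
#   with capture_warnings() as records:
#       canonical = TautomerCanonicalizer().canonicalize(mol)
#   if records:
#       ...flag this molecule for reporting...
```

This leaves the return value of `canonicalize` untouched, at the cost of depending on log output rather than on an explicit flag.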
Hello,
I am wondering if the documentation of this project should reflect the fact that much of this API has been integrated into newer versions of the RDKit?
or at least, there should be an option to trigger it
Dear Matt,
installing molvs with pip runs successfully on my Ubuntu desktop.
I find this package great and very useful!
However, when I try to print the help for the tautomer module in python console, I get the attached error:
molvs_tautomer_error.txt
I can fix this on my git clone by removing one {!r} from the __repr__ method in the molvs.tautomer.TautomerTransform class (line 63):
return 'TautomerTransform({!r}, {!r}, {!r}, {!r}, {!r})'.format(self.name, self.tautomer_str, self.bonds, self.charges)
Cheers,
Jose Manuel
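The failure mode is easy to reproduce outside MolVS: `str.format` raises `IndexError` when a template has more positional placeholders than arguments. The sketch below uses a stand-in class (`DemoTransform` is hypothetical and carries only the four attributes named in the quoted line) to show the mismatch next to a four-placeholder version.

```python
class DemoTransform(object):
    """Stand-in for illustration only; not the MolVS class."""

    def __init__(self, name, tautomer_str, bonds=(), charges=()):
        self.name = name
        self.tautomer_str = tautomer_str
        self.bonds = bonds
        self.charges = charges

    def broken_repr(self):
        # Five placeholders but only four arguments: raises IndexError.
        return 'DemoTransform({!r}, {!r}, {!r}, {!r}, {!r})'.format(
            self.name, self.tautomer_str, self.bonds, self.charges)

    def __repr__(self):
        # Placeholder count matches the argument count.
        return 'DemoTransform({!r}, {!r}, {!r}, {!r})'.format(
            self.name, self.tautomer_str, self.bonds, self.charges)
```

Because `help()` renders module-level objects through `repr`, a mismatch like the one in `broken_repr` is enough to break the help output for a whole module.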
Dear Matt,
sorry for spamming this page!
I have an issue when computing the canon tautomer of ibuprofen, see below the test code.
The chirality definition is lost upon processing.
Could you kindly indicate if this is a bug, a feature or simply if I'm doing something wrong?
This code actually worked for several other tests, so I'm not sure what is going on.
Thanks in advance for your help.
Cheers,
Jose Manuel
from rdkit import Chem
from molvs import tautomer
# Stereocenter written as [C@H](C); the brackets were lost in the page rendering
smiles = 'CC(C)C1=CC=C([C@H](C)C(=O)[O-])C=C1'
m = Chem.MolFromSmiles(smiles)
canonicalizer = tautomer.TautomerCanonicalizer()
m_canon = canonicalizer.canonicalize(m)
print(Chem.MolToSmiles(m, isomericSmiles=True))
# output: 'CC(C)C1=CC=C([C@H](C)C(=O)[O-])C=C1'
print(Chem.MolToSmiles(m_canon, isomericSmiles=True))
# output: CC(C)c1ccc(C(C)C(=O)[O-])cc1
Hello,
I was using molvs standardizer on PubChem molecules and found out several molecules that cannot be standardized:
Link: https://pubchem.ncbi.nlm.nih.gov/compound/59827358
Link: https://pubchem.ncbi.nlm.nih.gov/compound/59827349
Code to reproduce:
from rdkit import Chem
from molvs import Standardizer
smiles = "CC1=CC=C(C=C1)C(S(=O)CC2=CC=C(C=C2)C(S(=O)CC3=CC=C(C=C3)C(S(=O)CC4=CC=C(C=C4)C(S(=O)C5=CC=C(C=C5)Br)S(=O)C6=CC=C(C=C6)Br)S(=O)CC7=CC=C(C=C7)C(S(=O)C8=CC=C(C=C8)Br)S(=O)C9=CC=C(C=C9)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br"
mol = Chem.MolFromSmiles(smiles)
res = Standardizer().standardize(mol)
It seems that the flow goes into an infinite loop in the function _apply_transform() (normalize.py). After 10 minutes there was still no result.
Thanks,
Vladislav
Hi,
I have noticed some erratic behaviour in the tautomer canonicalization procedure. In this specific example, double bonds jump from one ring to the next, erasing stereocenters as they go.
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole, MolsToGridImage
import molvs
#MWE failing molecule
mol = Chem.MolFromSmiles("NC(=O)c1[nH]nc(c1O)[C@@H]2O[C@H](CO)[C@@H](O)[C@H]2O")
smi_std = molvs.Standardizer()
standardized_tautomer = smi_std.tautomer_parent(mol)
mols = [mol, standardized_tautomer]
MolsToGridImage(mols)
Output attached for reference.
Cheers,
Dries
Hi,
I am trying to enumerate tautomers using tautomer.TautomerEnumerator(). The documentation does not make clear whether the returned list is sorted by score, so that the best-scoring tautomer is at index zero. If it is, can I get just the canonical tautomer by setting the maximum-tautomers parameter to 1?
Thanks
Hi, thank you for this super useful library as well as the others (PubChemPy, ChemSpiPy, CIRpy).
I have a Python script for a chemical inventory program (Open Enventory) that scrapes mol files from various websites. The scraper does a pretty good job, but a small percentage of the mol files (mostly from PubChem) have explicit hydrogens and I want the mol files to have implicit hydrogens. I have found that RDKit and Open Babel can clean the mol file (convert explicit hydrogens to implicit hydrogens). Your molvs is exactly what I am looking for and more (the standardize module). However, I want my scraping code to be more portable and usable for non-technical people, so I am looking for a small library or piece of code that can do the "standardize" function without installing RDKit or Open Babel.
I was wondering if you might know of such library or maybe you could direct me to some direction.
Thank you very much!
Best regards,
Khoi Van
According to
Line 33 in 26179ca
Hi,
I implemented a part of MolVS in C++.
Nearly everything works fine (checked against the examples in the unit tests).
Only the tautomer enumeration gives different (extended) results.
(The canonicalize function returns the right results as well.)
For the example c1(ccccc1)CC(=O)C I get the following results:
C=C(O)C=C1C=CC=CC1
C=C(O)C=C1C=CCC=C1
C=C(O)Cc1ccccc1
CC(=O)C=C1C=CC=CC1
CC(=O)C=C1C=CCC=C1
CC(=O)Cc1ccccc1
CC(O)=Cc1ccccc1
This seems reasonable to me, but your unit test only checks against:
C=C(O)Cc1ccccc1
CC(=O)Cc1ccccc1
CC(O)=Cc1ccccc1
From my side it seems that you get only the aromatic results.
Now my questions:
Have you validated the tests?
Which version of the RDKit was used? (I currently use 2017_09_3.)
Any suggestions as to what could be wrong? (I will gladly post my code if someone wants to help me!)
(Maybe I made a stupid mistake, but after days of searching I couldn't find it.)
Thank you in advance!
Hello!
I have stumbled upon a possible issue. I am trying to limit the output printed to the Jupyter notebook and, if I understand correctly, this can be done with the stdout argument of Validator, like this:
molvs.validate.Validator(stdout=False, raw=False, log_format='%(validation)s')
But all found validations are still printed explicitly to the Jupyter notebook; I would like to suppress them all.
Thanks!
Dear Matt,
I like your function canonicalize in the TautomerCanonicalizer class.
By reading the code, I noticed that you actually enumerate all tautomers, score each of them and finally return the one with the highest score only.
Would it be possible to add a function that simply gives a score to a given tautomer?
Do you think this could be used by the TautomerEnumerator, for instance to filter tautomers with bad scores on the fly?
Thanks!
Cheers,
Jose Manuel
Thanks,
Using the following SMILES string, I encountered the following error:
USED SMILES:
standardize_smiles('[Na].[Na].O[Se](O)=O')
OUTPUT:
log.info('Ionizing %s to balance previous charge corrections', self.acid_base_pairs[ppos].name)
TypeError: tuple indices must be integers or slices, not NoneType
The command works with this SMILES
O=[Se]([O-])[O-].[Na+].[Na+]
Thank you in advance
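The traceback is consistent with a lookup that returned `None` as the position and was then used as a tuple index. The sketch below reproduces that pattern with stand-in data and shows a guard; `find_first`, `describe`, and `PAIR_NAMES` are hypothetical, and this is not the MolVS reionizer code.

```python
# Stand-in for an ordered tuple of named patterns.
PAIR_NAMES = ("pair-a", "pair-b", "pair-c")


def find_first(names, present):
    """Return (position, name) of the first name found, else (None, None)."""
    for position, name in enumerate(names):
        if name in present:
            return position, name
    return None, None


def describe(names, present):
    """Index into `names` only after checking that a match was found."""
    ppos, _ = find_first(names, present)
    if ppos is None:
        # Without this guard, names[None] raises the TypeError reported above.
        return "no match"
    return "matched %s" % names[ppos]
```

Whether skipping, logging, or raising a clearer error is the right reaction when nothing matches depends on the surrounding algorithm, which this sketch does not model.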
Thank you for providing this good package. However, I encountered an error when running the Standardize example:
rdkit.Chem.rdmolfiles.MolFromSmiles(unicode)
did not match C++ signature:
MolFromSmiles(std::string SMILES, bool sanitize=True, boost::python::dict replacements={})
This happens because "Chem.MolFromSmarts('CX3[OX2H1]')" needs a native string, while "from __future__ import unicode_literals" was added at the top of the file.
If I remove the "from __future__ import unicode_literals" line, everything works.
Could you please explain how to avoid this error? Or is "from __future__ import unicode_literals" needed by other code further down? Can I just remove it?
I'm converting all the molecules in my database to canonical tautomers and noticed that things like NADH looked weird. You can see it most plainly for phosphoric acid. I didn't expect the hydrogen on the phosphorus. Is this the correct/expected behavior?
from rdkit import Chem
from rdkit.Chem import Draw
from molvs.tautomer import TautomerCanonicalizer
original_smiles = 'OP(=O)(O)O'
original_mol = Chem.MolFromSmiles(original_smiles)
tautomerized_mol = TautomerCanonicalizer().canonicalize(original_mol)
Draw.MolsToGridImage([original_mol,tautomerized_mol],
molsPerRow=3,subImgSize=(200,200),
legends=['original','tautomer'])
Hi,
I found that molvs hangs on some molecules, e.g.
from molvs import standardize_smiles, __version__
print(__version__)
standardize_smiles("[Na].C(C=C)C1(C(NC(NC1=O)=S)=O)CC1=CC=CC=C1")
This prints 0.0.9 and then falls into an infinite loop.
The patterns in tautomer.py should be referenced to the original paper by Oellien, et al. The Impact of Tautomer Forms on Pharmacophore-Based Virtual Screening J. Chem. Inf. Model, 2006, 46, 2342-2354.
Hello,
molvs standardize -i smi input.smi -o smi -O output.smi
The input file has several molecules, the output a single one.
Is this the expected behavior?