mcs07 / molvs

Molecule Validation and Standardization

Home Page: https://molvs.readthedocs.io/

License: MIT License

Python 100.00%

rdkit python chemistry cheminformatics standardization validation

molvs's Issues

Standardizer gets stuck on a molecule

Hi, I have been trying your module to standardize molecules. However, it gets stuck on molecule ZINC000100026244. Judging by the stack trace I get after a keyboard interrupt, it seems to get stuck at the reionize step. Is there a way to set a timeout so that, if it gets stuck like this, I can just skip the molecule and continue?
The example that reproduces the bug and the stack trace are below:

from rdkit import Chem
import molvs
smi="CCOC(=O)C(=O)[CH-]C#N"
s = molvs.Standardizer()
mol=Chem.MolFromSmiles(smi)
mol = s.standardize(mol)

KeyboardInterrupt Traceback (most recent call last)
in ()
2 s = molvs.Standardizer()
3 mol=Chem.MolFromSmiles(smi)
----> 4 mol = s.standardize(mol)
5 print(Chem.MolToSmiles(mol, True))

C:\Software\Miniconda\lib\site-packages\molvs\standardize.py in standardize(self, mol)
97 mol = self.disconnect_metals(mol)
98 mol = self.normalize(mol)
---> 99 mol = self.reionize(mol)
100 Chem.AssignStereochemistry(mol, force=True, cleanIt=True)
101 # TODO: Check this removes symmetric stereocenters

C:\Software\Miniconda\lib\site-packages\molvs\charge.py in __call__(self, mol)
152 def __call__(self, mol):
153 """Calling a Reionizer instance like a function is the same as calling its reionize(mol) method."""
--> 154 return self.reionize(mol)
155
156 def reionize(self, mol):

C:\Software\Miniconda\lib\site-packages\molvs\charge.py in reionize(self, mol)
195
196 while True:
--> 197 ppos, poccur = self._strongest_protonated(mol)
198 ipos, ioccur = self._weakest_ionized(mol)
199 if ioccur and poccur and ppos < ipos:

C:\Software\Miniconda\lib\site-packages\molvs\charge.py in _strongest_protonated(self, mol)
211 def _strongest_protonated(self, mol):
212 for position, pair in enumerate(self.acid_base_pairs):
--> 213 for occurrence in mol.GetSubstructMatches(pair.acid):
214 return position, occurrence
215 return None, None

KeyboardInterrupt:
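
One possible workaround for the timeout question, sketched here as an assumption rather than a MolVS feature: run each standardization in a child Python process and give up after a fixed number of seconds. The helper names run_python_with_timeout and standardize_with_timeout are made up for illustration, and the sketch assumes molvs is importable by the same interpreter.

```python
import subprocess
import sys

# Script run in the child process; it reads one SMILES from argv.
# standardize_smiles is the convenience function used elsewhere in these issues.
STANDARDIZE_SCRIPT = (
    "import sys\n"
    "from molvs import standardize_smiles\n"
    "print(standardize_smiles(sys.argv[1]))\n"
)


def run_python_with_timeout(code, args=(), timeout=10.0):
    """Run `code` in a fresh interpreter; return its stdout, or None on timeout.

    A non-zero exit status of the child raises subprocess.CalledProcessError.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code, *args],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout.strip()


def standardize_with_timeout(smiles, timeout=10.0):
    """Hypothetical wrapper: standardized SMILES, or None if the child hung."""
    return run_python_with_timeout(STANDARDIZE_SCRIPT, [smiles], timeout)
```

Starting one interpreter per molecule is slow, so for large files it may be better to process batches in the child and retry a batch molecule by molecule only when it times out.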

new Tautomeric transform (ring-chain tautomerism)

Many thanks for making the very useful MolVS library freely available. I have been using it in a script to detect duplicate structures across different databases when only structural information is provided (no name, no CAS number), by looping through all possible tautomeric forms of structures that share the same molecular formula. It works great and I am happy to share the code if you are interested.

There is one limitation that I would like to address: ring-chain tautomerism. As an example, the drug warfarin can exist in an open form O=C(OC1=CC=CC=C12)C(C(CC(C)=O)C3=CC=CC=C3)=C2O or a closed form O=C(OC1=CC=CC=C12)C3=C2OC(O)(C)C(C(C)=O)C3C4=CC=CC=C4. The same applies to many sugars. I have an exhaustive list of these additional ring-chain tautomeric transformations and I would like to add them to your default Tautomer_transforms knowledge base. Could you please tell me how to add SMARTS that cover the missing tautomer transforms, and how to use the updated set of tautomer transforms from my script?
Many thanks and regards,
Alexis Parenty

MolVS 0.1.0 Standardization fails on Python 3

Hi Matt, a bug report for something I've noticed in the latest release.

The standardize_smiles, Standardizer().tautomer_parent and Standardizer().standardize functions all fail because of line 133. These all work okay in 0.0.9

>>> mol = Chem.MolFromSmiles('[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1')
>>> Standardizer().standardize(mol)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/standardize.py", line 98, in standardize
    mol = self.normalize(mol)
  File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 105, in __call__
    return self.normalize(mol)
  File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 124, in normalize
    fragments.append(self._normalize_fragment(fragment))
  File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 133, in _normalize_fragment
    for n in six.moves.range(self.max_restarts):
NameError: name 'six' is not defined

I had a look at the file and it seems that the 'import six' line is missing. When this is added, I get another error, but I think this may be a Py2 vs Py3 difference:

>>> Standardizer().standardize(mol)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/standardize.py", line 98, in standardize
   mol = self.normalize(mol)
 File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 106, in __call__
   return self.normalize(mol)
 File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 125, in normalize
   fragments.append(self._normalize_fragment(fragment))
 File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 137, in _normalize_fragment
   product = self._apply_transform(mol, normalization.transform)
 File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/utils.py", line 28, in fget_memoized
   setattr(self, attr_name, fget(self))
 File "/home/travis/anaconda3/envs/rdkit/lib/python3.6/site-packages/molvs/normalize.py", line 42, in transform
   return AllChem.ReactionFromSmarts(self.transform_str.encode('utf8'))
Boost.Python.ArgumentError: Python argument types in
   rdkit.Chem.rdChemReactions.ReactionFromSmarts(bytes)
did not match C++ signature:
   ReactionFromSmarts(char const* SMARTS, boost::python::dict replacements={}, bool useSmiles=False)

When I remove ".encode('utf8')" from line 42 it works as expected in Py3, but I think this would probably produce an error on Py2.

Hope this helps, thanks for working on this, it's been really useful!

EDIT: I see that these changes are already reflected in the file on GitHub, it seems that the PyPI version is just a bit out of date
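
For readers hitting the same bytes-versus-str mismatch, here is a minimal sketch of a compatibility helper that always hands a native str to a C++-backed function. The name to_native_str is hypothetical and is not part of MolVS or RDKit; it only illustrates why an unconditional .encode('utf8') breaks on Python 3.

```python
import sys


def to_native_str(value, encoding="utf8"):
    """Return `value` as the interpreter's native str type.

    Python 3: bytes are decoded, str is returned unchanged.
    Python 2: unicode is encoded to a byte str, str is returned unchanged.
    """
    if sys.version_info[0] >= 3:
        if isinstance(value, bytes):
            return value.decode(encoding)
        return value
    # Python 2 branch; the name `unicode` only exists there.
    if isinstance(value, unicode):  # noqa: F821
        return value.encode(encoding)
    return value


# On Python 3, .encode() yields bytes, which Boost.Python rejects
# for a `char const*` or `std::string` argument.
as_bytes = "[C:1]>>[C:1]".encode("utf8")
as_text = to_native_str(as_bytes)
```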

Is there salt removal implemented in there?

Or is there an option to trigger it?
Thanks,
F.

Missing preferred citation format in the documentation.

I would like to kindly ask you to add your BibTeX citation to the project documentation, so you can take credit for the work.

Canonical carbohydrates?

Have you made plans for reconciling forms of sugars? For example, glucose can exist in an open and closed form. I recently integrated a large number of chemical databases (before knowing about MolVS) and am going through the well-documented woes you describe in this package.

Here is an example for glucose: https://en.wikipedia.org/wiki/Glucose#Open-chain_form

I'm excited to switch my fragmented code for cleaning structures over to your MolVS protocols. The canonical tautomer is something really significant for me. Thanks for putting this together.

Additional normalization SMIRKS

Hi Matt,

I am evaluating MolVS for normalizing molecules prior to performing matched pair analysis. My former colleagues at Genentech open-sourced their OEChem-based code for normalizing molecules. Normalization is part of the registration workflow. Starting at line 167, this file on GitHub contains Genentech's business rules for normalization: https://github.com/chemalot/chemalot/blob/master/src/com/genentech/struchk/oeStruchk/Struchk.xml. Perhaps you can take a look to see if there are any SMIRKS you could use.

Best,

Molecule did not standardise overnight

Using the default settings, this ChEMBL molecule did not standardise overnight:

OP(=O)(O)[O-].OP(=O)([O-])[O-].[O-]S(=O)(=O)[O-].[Na+].[Na+].[Na+].[Mg+2].[Cl-].[Cl-].[K+].[K+] 2104840

Triazole tautomers not normalized

Hi Matt,

A disubstituted 1,2,4-triazole has three possible tautomers. I ran molvs standardize on three different SMILES and got three different answers. I don't think tautomer.py covers canonicalization of triazoles.

Here are test cases to reproduce the problem:
molvs standardize -: "CC1=NN=C(CC)N1"
output: CCc1nnc(C)[nH]1
molvs standardize -: "CC1=NC(CC)=NN1"
output: CCc1n[nH]c(C)n1
molvs standardize -: "CC1=NNC(CC)=N1"
output: CCc1nc(C)n[nH]1

Note that I got three different outputs when I expected identical outputs.

Here are structures and their respective smiles for the three different inputs.

Thanks,

Propagate the warning 'Tautomer enumeration stopped at maximum... '

Hi,

When the number of tautomers exceeds max_tautomers, we get a log warning. However, as a caller, I would like to report the molecules for which such warnings occurred.

As far as I can see, the number of tautomers is not known before entering TautomerEnumerator.enumerate. This complicates access to this information.

We somehow need to propagate this information to callers like TautomerCanonicalizer (as a complex return value? But then we would be changing the API...)

Thanks and cheers,

christian
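
Until the API offers something better, a caller could intercept the log record itself. The sketch below uses only the standard logging module. It assumes (not confirmed here) that MolVS logs through a module-level logger named "molvs.tautomer" and that the message starts with "Tautomer enumeration stopped"; collect_warnings and hit_tautomer_limit are hypothetical helper names.

```python
import logging
from contextlib import contextmanager


class _ListHandler(logging.Handler):
    """Logging handler that stores WARNING-or-higher records in a list."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.records = []

    def emit(self, record):
        self.records.append(record)


@contextmanager
def collect_warnings(logger_name):
    """Temporarily attach a list handler to the named logger."""
    logger = logging.getLogger(logger_name)
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        yield handler.records
    finally:
        logger.removeHandler(handler)


def hit_tautomer_limit(records):
    """True if any collected record looks like the max_tautomers warning."""
    return any(
        "Tautomer enumeration stopped" in record.getMessage()
        for record in records
    )
```

Usage would look like `with collect_warnings("molvs.tautomer") as records:` around the canonicalize call, followed by `hit_tautomer_limit(records)` to decide whether to flag the molecule.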

RDKit Integration

Hello,

I am wondering if the documentation of this project should reflect the fact that much of this API has been integrated into newer versions of the RDKit?

standardize should unsalt

or at least, there should be an option to trigger it

Problem with repr method of TautomerTransform class

Dear Matt,
installing molvs with pip runs successfully on my Ubuntu desktop.
I find this package great and very useful!
However, when I try to print the help for the tautomer module in python console, I get the attached error:
molvs_tautomer_error.txt

I can fix this on my git clone by removing one {!r} from the repr method in the molvs.tautomer.TautomerTransform class (line 63) :
return 'TautomerTransform({!r}, {!r}, {!r}, {!r}, {!r})'.format(self.name, self.tautomer_str, self.bonds, self.charges)

Cheers,
Jose Manuel
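
The failure mode described here can be reproduced without MolVS: str.format raises IndexError when the template has more positional fields than arguments. The Transform class below is a hypothetical stand-in for illustration, not the MolVS class.

```python
class Transform:
    """Hypothetical stand-in with the same four attributes as in the report."""

    def __init__(self, name, tautomer_str, bonds=(), charges=()):
        self.name = name
        self.tautomer_str = tautomer_str
        self.bonds = bonds
        self.charges = charges

    def __repr__(self):
        # Four fields for four arguments: the counts match.
        return 'Transform({!r}, {!r}, {!r}, {!r})'.format(
            self.name, self.tautomer_str, self.bonds, self.charges)


def broken_repr(t):
    # Five fields but only four arguments: format() raises IndexError.
    return 'Transform({!r}, {!r}, {!r}, {!r}, {!r})'.format(
        t.name, t.tautomer_str, t.bonds, t.charges)
```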

Loss of chirality when converting to canon tautomer

Dear Matt,

sorry for spamming this page!

I have an issue when computing the canon tautomer of ibuprofen, see below the test code.

The chirality definition is lost upon processing.

Could you kindly indicate if this is a bug, a feature or simply if I'm doing something wrong?

This code actually worked for several other tests, so I'm not sure what is going on.

Thanks in advance for your help.

Cheers,
Jose Manuel

Test code

from rdkit import Chem
from molvs import tautomer

smiles = 'CC(C)C1=CC=C([C@H](C)C(=O)[O-])C=C1'
m = Chem.MolFromSmiles(smiles)
canonicalizer = tautomer.TautomerCanonicalizer()
m_canon = canonicalizer.canonicalize(m)

initial mol object:

print Chem.MolToSmiles(m, isomericSmiles=True)
'CC(C)C1=CC=C([C@H](C)C(=O)[O-])C=C1'

canonicalized object:

print Chem.MolToSmiles(m_canon, isomericSmiles=True)
CC(C)c1ccc(C(C)C(=O)[O-])cc1

Unable to standardize some PubChem molecules

Hello,

I was using the molvs Standardizer on PubChem molecules and found several molecules that cannot be standardized:

SMILES: CC(S(=O)CC1=CC=C(C=C1)C(S(=O)CC2=CC=C(C=C2)C(S(=O)CC3=CC=C(C=C3)C(S(=O)C4=CC=C(C=C4)Br)S(=O)C5=CC=C(C=C5)Br)S(=O)CC6=CC=C(C=C6)C(S(=O)C7=CC=C(C=C7)Br)S(=O)C8=CC=C(C=C8)Br)S(=O)CC9=CC=C(C=C9)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br

Link: https://pubchem.ncbi.nlm.nih.gov/compound/59827358

SMILES: CC1=CC=C(C=C1)C(S(=O)CC2=CC=C(C=C2)C(S(=O)CC3=CC=C(C=C3)C(S(=O)CC4=CC=C(C=C4)C(S(=O)C5=CC=C(C=C5)Br)S(=O)C6=CC=C(C=C6)Br)S(=O)CC7=CC=C(C=C7)C(S(=O)C8=CC=C(C=C8)Br)S(=O)C9=CC=C(C=C9)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br

Link: https://pubchem.ncbi.nlm.nih.gov/compound/59827349

Code to reproduce:

from rdkit import Chem
from molvs import Standardizer

smiles = "CC1=CC=C(C=C1)C(S(=O)CC2=CC=C(C=C2)C(S(=O)CC3=CC=C(C=C3)C(S(=O)CC4=CC=C(C=C4)C(S(=O)C5=CC=C(C=C5)Br)S(=O)C6=CC=C(C=C6)Br)S(=O)CC7=CC=C(C=C7)C(S(=O)C8=CC=C(C=C8)Br)S(=O)C9=CC=C(C=C9)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br)S(=O)CC1=CC=C(C=C1)C(S(=O)C1=CC=C(C=C1)Br)S(=O)C1=CC=C(C=C1)Br"
mol = Chem.MolFromSmiles(smiles)
res = Standardizer().standardize(mol)

It seems that the flow goes into an infinite loop in the function _apply_transform() (normalize.py). After 10 minutes of transformation there was still no result.

Thanks,
Vladislav

Tautomer canonicalization bug

Hi,

I have noticed some erratic behaviour in the tautomer canonicalization procedure. In this specific example, double bonds jump from one ring to the next, erasing stereocenters as they go.

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole, MolsToGridImage
import molvs


#MWE failing molecule
mol = Chem.MolFromSmiles("NC(=O)c1[nH]nc(c1O)[C@@H]2O[C@H](CO)[C@@H](O)[C@H]2O")

smi_std = molvs.Standardizer()
standardized_tautomer = smi_std.tautomer_parent(mol)
mols = [mol, standardized_tautomer]

MolsToGridImage(mols)

Output attached for reference.

Cheers,

Dries

tautomer.TautomerEnumerator() function

Hi,
I am trying to enumerate tautomers using tautomer.TautomerEnumerator(). Is the returned list sorted by score, so that the best-scoring tautomer is at index zero? This is not clear from the documentation. If so, can I limit the output to the most canonical tautomer by setting the maximum tautomer parameter to 1?

Thanks

standardize script of mol file without rdkit or openbabel

Hi, thank you for this super useful library as well as the others (PubChemPy, ChemSpiPy, CIRpy).

I have a Python script for a chemical inventory program (Open Enventory) that scrapes mol files from various websites. The scraper does a pretty good job, but a small percentage of the mol files (mostly from PubChem) have explicit hydrogens, and I want the mol files to have implicit hydrogens. I have found that RDKit and Open Babel can clean the mol file (convert explicit hydrogens to implicit hydrogens). Your molvs is exactly what I am looking for and more (the standardize module). However, I want my scraping code to be more portable and usable by non-technical people, so I am looking for a small library or piece of code that can do the "standardize" function without installing RDKit or Open Babel.

I was wondering if you might know of such library or maybe you could direct me to some direction.
Thank you very much!

Best regards,

Khoi Van

Fluorine considered metal

According to

MolVS/molvs/metal.py

Line 33 in 26179ca

    
           self._metal_nof = Chem.MolFromSmarts('[Li,Na,K,Rb,Cs,F,Be,Mg,Ca,Sr,Ba,Ra,Sc,Ti,V,Cr,Mn,Fe,Co,Ni,Cu,Zn,Al,Ga,Y,Zr,Nb,Mo,Tc,Ru,Rh,Pd,Ag,Cd,In,Sn,Hf,Ta,W,Re,Os,Ir,Pt,Au,Hg,Tl,Pb,Bi]~[N,O,F]')

Fluorine is a metal. Ouch!
It seems that 'r' is missing, and that's actually Francium, not Fluorine.
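
A quick way to catch this kind of typo is to scan the first atom list of the SMARTS for symbols that are known non-metals. A minimal sketch with the standard library only; the NON_METALS set and both helper names are assumptions made for illustration.

```python
import re

# The pattern quoted above from molvs/metal.py.
METAL_NOF = (
    '[Li,Na,K,Rb,Cs,F,Be,Mg,Ca,Sr,Ba,Ra,Sc,Ti,V,Cr,Mn,Fe,Co,Ni,Cu,Zn,Al,Ga,'
    'Y,Zr,Nb,Mo,Tc,Ru,Rh,Pd,Ag,Cd,In,Sn,Hf,Ta,W,Re,Os,Ir,Pt,Au,Hg,Tl,Pb,Bi]'
    '~[N,O,F]'
)

# Non-metal element symbols (illustrative set, noble gases included).
NON_METALS = {
    'H', 'He', 'C', 'N', 'O', 'F', 'Ne', 'P', 'S', 'Cl', 'Ar',
    'Se', 'Br', 'Kr', 'I', 'Xe', 'At', 'Rn',
}


def first_atom_list(smarts):
    """Return the comma-separated symbols of the first bracket atom."""
    match = re.match(r"\[([^\]]+)\]", smarts)
    return match.group(1).split(",")


def suspicious_symbols(smarts):
    """Symbols in the first atom list that are not metals."""
    return [s for s in first_atom_list(smarts) if s in NON_METALS]
```

With the pattern as quoted, suspicious_symbols(METAL_NOF) flags only F, and replacing that entry with Fr leaves nothing flagged.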

test_tautomer.py extended results in C++ implementation

Hi,

I implemented a part of MolVS in C++.
Almost everything works (checked against the examples in the unit tests).
Only the tautomer enumeration gives different (extended) results.
(The canonicalize function returns the right results as well.)

For the example c1(ccccc1)CC(=O)C I get the following results:
C=C(O)C=C1C=CC=CC1
C=C(O)C=C1C=CCC=C1
C=C(O)Cc1ccccc1
CC(=O)C=C1C=CC=CC1
CC(=O)C=C1C=CCC=C1
CC(=O)Cc1ccccc1
CC(O)=Cc1ccccc1

This seems reasonable to me, but your unit test only checks against:
C=C(O)Cc1ccccc1
CC(=O)Cc1ccccc1
CC(O)=Cc1ccccc1

From my side it seems that you only get the aromatic results.

Now my questions:
Have you validated the tests?
Which version of the RDKit was used? (I currently use 2017_09_3.)
Any suggestions as to what could be wrong? (I will gladly post my code if someone wants to help me!)
(Maybe I made a stupid mistake, but after days of searching I couldn't find it.)

Thank you in advance!

stdout=False - output is still printed

Hello!
I stumbled upon a possible issue. I am trying to limit the output printed to the Jupyter notebook and, if I understand correctly, this can be done with the stdout argument of Validator, like this:
molvs.validate.Validator(stdout=False, raw=False, log_format='%(validation)s')
However, all validation messages found are still printed to the Jupyter notebook. I would like to suppress them all.
Thanks!

Not a bug but feature request: function to score a single tautomer

Dear Matt,

I like your function canonicalize in the TautomerCanonicalizer class.

By reading the code, I noticed that you actually enumerate all tautomers, score each of them and finally return the one with the highest score only.

Would it be possible to add a function that simply gives a score to a given tautomer?

Do you think this could be used by the TautomerEnumerator, for instance to filter tautomers with bad scores on the fly?

Thanks!

Cheers,
Jose Manuel

error standardizing ionization

Using the following SMILES string, I encountered the following error:

USED SMILES:

standardize_smiles('[Na].[Na].O[Se](O)=O')

OUTPUT:
log.info('Ionizing %s to balance previous charge corrections', self.acid_base_pairs[ppos].name)
TypeError: tuple indices must be integers or slices, not NoneType

The command works with this SMILES

O=[Se]([O-])[O-].[Na+].[Na+]

Thank you in advance
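
The TypeError suggests that the position passed to acid_base_pairs[...] was None, which is what the _strongest_protonated method in the earlier traceback returns when nothing matches. A minimal sketch of the failure and of a guard, using made-up pair names and a hypothetical helper describe_strongest rather than the actual MolVS code:

```python
from collections import namedtuple

# Hypothetical stand-ins for the acid/base pair table.
AcidBasePair = namedtuple("AcidBasePair", ["name"])
ACID_BASE_PAIRS = (AcidBasePair("pair-0"), AcidBasePair("pair-1"))


def describe_strongest(ppos, pairs=ACID_BASE_PAIRS):
    """Return the pair name for position `ppos`, or None if nothing matched.

    Indexing a tuple with None raises
    "TypeError: tuple indices must be integers or slices, not NoneType",
    so the None case is handled before the lookup.
    """
    if ppos is None:
        return None
    return pairs[ppos].name
```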

An error when running the example

Thank you for providing this good package. However, I encountered an error when running the Standardize example:

rdkit.Chem.rdmolfiles.MolFromSmiles(unicode)
did not match C++ signature:
MolFromSmiles(std::string SMILES, bool sanitize=True, boost::python::dict replacements={})

This happens because "Chem.MolFromSmarts('[CX3](=O)[OX2H1]')" needs a native string, while "from __future__ import unicode_literals" was added at the top of the file.

If I remove the "from __future__ import unicode_literals" line, everything works.

Could you please explain how to avoid this error? Is "from __future__ import unicode_literals" needed by later code, or can I just remove it?

TautomerCanonicalizer gives unexpected/forbidden form of phosphoric acid

I'm converting all the molecules in my database to canonical tautomers and noticed that things like NADH looked weird. You can see it most plainly for phosphoric acid. I didn't expect the hydrogen on the phosphorus. Is this the correct/expected behavior?

from rdkit import Chem
from rdkit.Chem import Draw
from molvs.tautomer import TautomerCanonicalizer

original_smiles = 'OP(=O)(O)O'

original_mol = Chem.MolFromSmiles(original_smiles)
tautomerized_mol = TautomerCanonicalizer().canonicalize(original_mol)

Draw.MolsToGridImage([original_mol,tautomerized_mol],
                     molsPerRow=3,subImgSize=(200,200),
                     legends=['original','tautomer'])

molvs hangs on some molecules

Hi,
I found that molvs hangs on some molecules, e.g.:

from molvs import standardize_smiles, __version__
print(__version__)
standardize_smiles("[Na].C(C=C)C1(C(NC(NC1=O)=S)=O)CC1=CC=CC=C1")

This prints 0.0.9 and then falls into an infinite loop.

Tautomer patterns should be referenced to their original research paper.

The patterns in tautomer.py should reference the original paper: Oellien et al., The Impact of Tautomer Forms on Pharmacophore-Based Virtual Screening, J. Chem. Inf. Model., 2006, 46, 2342-2354.

Only one molecule from the input SMILES file is processed...

Hello,

molvs standardize -i smi input.smi  -o smi -O output.smi

The input file has several molecules, the output a single one.
Is this the expected behavior?
