pckroon / pysmiles

A lightweight python-only library for reading and writing SMILES strings

License: Apache License 2.0

Python 100.00%

smiles-strings writing-smiles smiles cheminformatics python hacktoberfest

pysmiles's Issues

Inconsistent writing and reading of mono-atomic smiles for Se and As

If we have a graph featuring a single atom outside the organic subset, such as Se, the SMILES writer outputs 'Se'. According to the SMILES rules, I would have expected '[Se]'. If we then read the SMILES 'Se' back in with the SMILES reader, it does not round-trip: the result is a graph containing 'S', because the reader relies on the brackets to recognise elements outside the organic subset.
I don't know whether this is an issue with the SMILES reader or the writer, but the two are inconsistent.
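
The bracket rule the issue refers to can be sketched as follows. This is a minimal illustration, not the pysmiles writer: `atom_token` is a hypothetical helper, and it only looks at the element symbol. A real writer would also need brackets for charges, isotopes, and non-default hydrogen counts.

```python
# Elements that may be written without brackets (the SMILES "organic subset").
ORGANIC_SUBSET = {'B', 'C', 'N', 'O', 'P', 'S', 'F', 'Cl', 'Br', 'I'}


def atom_token(element):
    """Return the SMILES token for a plain, neutral atom of `element`.

    Hypothetical helper for illustration: anything outside the organic
    subset is wrapped in square brackets so a reader cannot mistake
    'Se' for 'S' followed by something else.
    """
    if element in ORGANIC_SUBSET:
        return element
    return '[{}]'.format(element)


print(atom_token('S'), atom_token('Se'), atom_token('As'))
```

With this rule the writer's output for a lone selenium atom is a token the reader can recognise again.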

Molecular Formula and Molecular Weight

Hello! I found your library very helpful in parsing SMILES.

Would you be interested in adding MF and MW as additional attributes?

Something along these lines:

from collections import defaultdict
from pysmiles import read_smiles

AW = {
    'C': 12.0107,
    'H': 1.00794,
    # etc.
}

class MolecularFormula:
    def __init__(self, smiles: str):
        self.smiles = smiles
        self.mf = defaultdict(lambda: 0)

        try:
            mol = read_smiles(
                smiles,
                explicit_hydrogen=False,
                reinterpret_aromatic=False,
            )
            nodes = mol.nodes()

            for i in range(mol.number_of_nodes()):
                self.mf[nodes[i]['element']] += 1
                self.mf['H'] += nodes[i]['hcount']

            self.mw = 0
            for k, v in self.mf.items():
                self.mw += AW[k] * v

            self.mw = round(self.mw, 2)
        except Exception as e:
            # log or raise
            self.mw = 0

    def __repr__(self):
        return ''.join([str(k)+str(v) for k,v in self.mf.items()])

could you not PRINT warnings?

Maybe it would be better to write to stderr, or at least to offer an option to switch off warnings such as "I can't deal with stereo yet...".
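
If the warnings are emitted through Python's standard `logging` module, they can be silenced from user code without any change to the library. The sketch below assumes that the library's loggers live under the `pysmiles` namespace (the usual `logging.getLogger(__name__)` convention); `set_package_log_level` is a hypothetical helper name.

```python
import logging


def set_package_log_level(package, level):
    """Set the log level on a package's top-level logger.

    Child loggers such as '<package>.read_smiles' inherit this level
    as long as they have no level of their own.
    """
    logging.getLogger(package).setLevel(level)


# Assumption: pysmiles logs under the 'pysmiles' logger namespace.
set_package_log_level('pysmiles', logging.ERROR)
```

Note that messages sent through `logging` without any configured handler end up on stderr rather than stdout.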

`reinterpret_aromatic` erases aromatic rings containing a `N+`

Example: OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N or CCc(c1)ccc2[n+]1ccc3c2[nH]c4c3cccc4

Bug: simplified SMILES of 1 atom

In the current implementation the SMILES "O" yields just an O atom with an hcount of 0. Shouldn't it yield an hcount of 2 and be the short form of water? Perhaps I'm missing some convention here.

fill_valence fails for charged molecules

This little example should work in my opinion but does not:

import pysmiles
g=pysmiles.read_smiles("CCC[O-]")
pysmiles.smiles_helper.fill_valence(g, respect_hcount=False)
assert g.nodes[3]['hcount'] == 0

Origin of the problem is that fill_valence does not account for charges.
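
A charge-aware hydrogen count could look roughly like the sketch below. It is deliberately limited to N and O, where a formal charge shifts the number of bonds by the same amount; the table and the function name `missing_hydrogens` are assumptions for illustration, not the pysmiles implementation.

```python
# Assumption: default valences of the neutral atoms, N and O only.
NEUTRAL_VALENCE = {'N': 3, 'O': 2}


def missing_hydrogens(element, bond_order_sum, charge=0):
    """Number of hydrogens needed to fill the valence of a charged atom.

    For N and O a positive charge allows one extra bond and a negative
    charge one bond fewer, so the target valence is shifted by the charge.
    """
    target = NEUTRAL_VALENCE[element] + charge
    return max(0, target - bond_order_sum)


# The alkoxide oxygen of CCC[O-] has one bond and charge -1.
print(missing_hydrogens('O', 1, charge=-1))
```

The same shift does not hold for every element (boron and carbon behave differently), so a general fix would need a per-element rule.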

R/S chirality

Right, so in openSMILES chirality is very elaborate [1]. I suggest supporting only tetrahedral chirality for now, i.e. the @ and @@ specifications, and E/Z isomerism later on. This issue is just about tetrahedral R/S chirality.

The default case is fairly easy to cover since pysmiles guarantees that the node indices in the final molecule are in the same order as the atom entries in the input SMILES. openSMILES says to look along the axis from the "first" atom to the chiral center, and see whether the other atoms go clockwise (@@) or counter clockwise (@).
The complication arises from rings: according to openSMILES, it is the order of the bonds around the central atom that matters. In this case the node indices in the resulting graph do not necessarily correspond to the chiral order. Consider the following cases, where x is the chiral center with 4 different substituents (a, b, c, d), and .. means "any number of different atoms in between".

a[x@](b)(c)d  # Default case 0
b1..a[x@]1(c)d  # case 1
a[x@]1(c)(d)..b1  # case 2
a[x@]21(d)..c1..b2  # case 3
[x@]12cd..a1..b2  # case 4

Finally, implicit hydrogens are considered the "first" atom, but we can shove this under the carpet for now.

This results in the following

chiral atoms which are not involved with ring bonds do not need special treatment, since the node indices in the resulting graph are enough (case 0).
chiral atoms which are involved with ring bonds may need special treatment (cases 2, 3 and 4), but we can only do this once all atoms bonded to the chiral atom have been parsed (and have been given a node index).
The order of bonds around an atom will always be: implicit hydrogens, preceding atoms, ring bonds, other atoms.

I suggest lumping cases 1-4 together for sake of simplicity. This means when parsing we'll need to keep track of all atoms which are involved with ring bonds so we can post-process them. In addition, we'll need to track in which order (chiral) atoms have ring bonds, and once these edges are added to the graph, which node indices they correspond to (because ring indices can (and should) be reused).

So, concretely:

When "opening" a ring bond (first occurrence of a ring index): find the parent atom, store the ring indices in a list (this is effectively the inverse of ring_nums [2]), as well as the number of edges it already has (needed to distinguish cases 3 and 4).
When "closing" a ring bond, translate the previously stored ring index to an atom index.
After parsing the entire molecule, for every atom involved in ring bonds, see if it's chiral, and if so, construct the order in which nodes were bonded to it. Once done, set the atom's stereo attribute correctly (see below).

An open question is still how to represent the chirality at the graph level. I suggest sticking to the openSMILES @ / @@ idea, with the exception that the order of the node indices of the neighbours is decisive (i.e. the neighbour with the lowest node index comes first, the one with the highest last). We can then provide helper functions to translate that to R/S (and the other way around). I think this is the most natural way of exposing it:

If you want to invert the chirality, switch @ for @@ or vice-versa.
If you want to change a CH3 group for a halogen, this could change your chirality even when you don't want to change the spatial orientation. Following our definition, this substitution is as simple as changing the 'element' attribute. If we store R/S you'd also have to recalculate the desired chirality.
This requires a little bit of attention when adding/removing explicit hydrogens, but we can cover this in the corresponding helper functions.
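
Translating a specifier from the SMILES bond order to the proposed node-index order comes down to permutation parity: every swap of two neighbours inverts the sense of rotation. The following is a minimal sketch under that assumption; the function names are hypothetical and implicit hydrogens are ignored.

```python
def permutation_parity(sequence):
    """Return 0 for an even and 1 for an odd number of inversions."""
    inversions = sum(
        1
        for i in range(len(sequence))
        for j in range(i + 1, len(sequence))
        if sequence[i] > sequence[j]
    )
    return inversions % 2


def to_index_order(specifier, neighbours_in_smiles_order):
    """Re-express '@' or '@@' relative to neighbours sorted by node index.

    `neighbours_in_smiles_order` lists the node indices of the four
    neighbours in the order their bonds appear around the chiral atom.
    An odd permutation is needed to sort them exactly when the
    specifier has to be flipped.
    """
    flipped = {'@': '@@', '@@': '@'}
    if permutation_parity(neighbours_in_smiles_order):
        return flipped[specifier]
    return specifier


# Case 0: neighbours already in index order, nothing changes.
print(to_index_order('@', [0, 2, 3, 4]))
# Case 1: ring-bonded neighbour b (index 0) comes second in bond order.
print(to_index_order('@', [3, 0, 5, 6]))
```

Inverting a center stays a matter of swapping @ for @@, as described in the list above.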

Open question is also how strict we want to be here: is it an error to provide a chiral specifier on a non-tetrahedral atom? What about symmetric substituents? In other words, what if I provide a chiral specifier for a non-chiral atom?

Lastly, careful thought is required for the SMILES writer, but I think we can push that back a little bit.

@Eljee thoughts?

[1] http://opensmiles.org/opensmiles.html#chirality
[2] https://github.com/pckroon/pysmiles/blob/master/pysmiles/read_smiles.py#L128

Stereochemistry

I was trying to get the nodes and adjacency matrix of a drug-like molecule. While doing this, I encountered the following message:

I don't quite know how to handle stereo yet...

However, the code still returns the node vector and adjacency matrix that I wanted. I was wondering whether the returned values are valid or whether they should be discarded.

Will pysmiles generate a unique SMILES?

Will two identical structures have different SMILES? Could I use pysmiles to distinguish isomers?

Adjacency matrix

Hello,

I was trying to use this code to find a graph representation for a given SMILES string. I tried

from pysmiles import read_smiles
import networkx as nx

smiles = 'N=CN(C=O)CC=O' 
mol = read_smiles(smiles)

# atom vector (C only)
print(mol.nodes(data='element'))
# adjacency matrix (nx.to_numpy_matrix was removed in NetworkX 3.0)
print(nx.to_numpy_array(mol))

Apparently, this does not differ between single/double/triple bonds. Is there a way to enforce that? Thanks!
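
One way to keep the bond orders is to write them into the matrix instead of a plain 0/1 entry. The sketch below uses plain Python lists and a hand-written bond list for the molecule from the question, so it runs without any dependencies; `adjacency_with_orders` is a hypothetical helper.

```python
def adjacency_with_orders(n_nodes, bonds):
    """Build a symmetric matrix whose entries are bond orders.

    `bonds` is an iterable of (u, v, order) triples; unbonded pairs stay 0.
    """
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for u, v, order in bonds:
        matrix[u][v] = order
        matrix[v][u] = order
    return matrix


# N=CN(C=O)CC=O, atoms numbered in the order they appear in the SMILES.
bonds = [(0, 1, 2), (1, 2, 1), (2, 3, 1), (3, 4, 2),
         (2, 5, 1), (5, 6, 1), (6, 7, 2)]
for row in adjacency_with_orders(8, bonds):
    print(row)
```

Assuming the graph stores bond orders in an edge attribute named `order`, `mol.edges(data='order')` yields triples of this shape, and `nx.to_numpy_array(mol, weight='order')` should give the same matrix as a NumPy array.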

please note occasional divergence of pysmiles' hcount vs. RDKit

Dear developers of pysmiles,

Please note a recent post on mattermodelling.se which compares the hydrogen count per non-hydrogen atom assigned by pysmiles and by RDKit for an N-alkylated dihydrobenzimidazole. Apparently, the present version 1.0.1 of pysmiles gets this example wrong: it still assigns a hydrogen to the nitrogen, as if neutral nitrogen were tetravalent.

Since the two answers to the OP share MWEs and detailed results, the parent non-alkylated structure (c12ccccc1NCN2) can be checked in the same way; in this case, the error is not observed.

NetworkX v3

To keep compatibility with vermouth / polyply and some other packages, could the networkx dependency of pysmiles also be raised to allow v3?

Writing smiles silently breaks on graphs with multiple fragments

When writing smiles from a self-defined graph, it can happen that this graph has multiple unconnected fragments. I would have expected this to be recognized as zero-order bonds automatically. However, if encountering such a graph with multiple fragments, write_smiles() will only return the smiles string for one of the fragments and not even throw an error or warning. It assumes that zero-order bonds are explicitly mentioned as bonds with order zero in the graph, but this is an assumption that will probably fail in most use cases (like mine).

I propose to fix this issue by introducing zero-order bonds between unconnected fragments in the graph:

fragments_connectors = [list(frag)[0] for frag in nx.connected_components(graph)]
central_node = fragments_connectors.pop(0)
for idx in fragments_connectors:
    graph.add_edge(central_node, idx, order=0)

I actually don't know how to contribute to this in the best way by editing the code, but I'd be willing to give it a try if you could tell me how.
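
As an alternative to adding zero-order bonds automatically, a writer could at least refuse to drop fragments silently. The sketch below is a dependency-free illustration of that guard, not the fix proposed above: `connected_components` and `check_single_fragment` are hypothetical helpers working on plain node and edge lists.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return the connected components as sorted lists of nodes."""
    neighbours = {node: set() for node in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:  # breadth-first walk over one fragment
            node = queue.popleft()
            component.append(node)
            for neighbour in neighbours[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


def check_single_fragment(nodes, edges):
    """Raise instead of silently writing only one fragment."""
    components = connected_components(nodes, edges)
    if len(components) > 1:
        raise ValueError(
            'graph has {} unconnected fragments'.format(len(components))
        )


print(connected_components([0, 1, 2, 3], [(0, 1), (2, 3)]))
```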

write_smiles can create invalid SMILES when provided with chemically invalid graphs

For some reason my graph is returning SMILES for aromatic groups that use aromatic bond symbols, e.g. NC:1:N:N:C:[N]1N.

RDKit does not recognize these symbols and it removes all the aromaticity to produce NC1NNCN1N, and openbabel produces the same result.

Some have speculated that it's a SMARTS string

https://mattermodeling.stackexchange.com/questions/4981/how-to-canonicalize-smiles-written-with-aromatic-bond-symbols

others just say it's wrong.

openbabel/openbabel#2368

Do you know what is going on?

Thanks for your help!

Bug: the number of hydrogen atoms is converted incorrectly when converting between SMILES and graph

Code:
smiles = 'CCc1cn2c3c(cc(C(=O)NC(Cc4ccccc4)C(O)C[NH2+]Cc4cccc(OC)c4)cc13)N(C)S(=O)(=O)CC2'
graph = pysmiles.read_smiles(smiles)
pysmiles.write_smiles(graph)
Out: 'O=C(NC(C(C[NH2+]Cc:1:c:c:c:c(:c1)OC)O)Cc:1:c:c:c:c:c1)c:1:c:c2:c:3NH[CH]CCC'

Question on rings

The bug / question
Most likely this question is due to my lack of understanding of the SMILES syntax, but I encountered an odd behavior on rings. The SMILES string [CH3](c1ccccc1)[CH2] correctly generates a graph of ethylbenzene, whereas the SMILES string [CH3]c1ccccc1[CH2] generates a graph of dimethylbenzene in which one of the methyl groups lacks a hydrogen. My understanding is that the two SMILES are the same, but the second one is sloppier as it lacks the parentheses. Should this perhaps raise an error?

Code to reproduce this behavior

import sys
import matplotlib.pyplot as plt
import networkx as nx
import pysmiles

mol = pysmiles.read_smiles(sys.argv[1], explicit_hydrogen=True)

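# Label every node with its element symbol (labeldict was undefined in the original snippet).
labeldict = nx.get_node_attributes(mol, 'element')
nx.draw(mol, labels=labeldict, with_labels=True, pos=nx.kamada_kawai_layout(mol))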
plt.show()

I tested with networkx version 2.8.1 and 3.1. The behavior is the same.

`[se]` and `[as]` are not properly recognized

se and as can appear as aromatic atoms if they are placed inside square brackets, but it seems this library fails to parse them.

Add pysmiles to conda-forge

Hey, are there any plans to add pysmiles to conda-forge? I am using pysmiles in a python package I am soon gonna push to conda-forge myself, but it seems to be impossible to use a pip package as requirement for a conda package, understandably.

I have also seen that it is apparently quite simple to publish to conda if you're already on the PyPI index, as pysmiles is. There are tools like conda skeleton or grayskull that seem to make this quite simple. If you don't currently have the time to look at that, I could also try to find some time to look into it. I'm also doing all of this for the first time, though.

Misinterpretation of the ring closure bonds

Dear developers

In some cases, the ring-closure marker is placed after the branch rather than before it, for example C(CC)1OC1 versus C1(CC)OC1. OpenBabel accepts both and yields the same structure, which can be verified on the command line with obabel -ismi -:'C1(CC)OC1' -osmi and obabel -ismi -:'C(CC)1OC1' -osmi; in both cases the converted SMILES is C1(CC)OC1.

However, pysmiles behaves differently:

>>> import pysmiles
>>> number_first = pysmiles.read_smiles('C1(CC)OC1')
>>> number_last = pysmiles.read_smiles('C(CC)1OC1')
>>> number_first.edges
EdgeView([(0, 1), (0, 3), (0, 4), (1, 2), (3, 4)])
>>> number_last.edges
EdgeView([(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)])
>>> number_inner = pysmiles.read_smiles('C(CC1)OC1')
>>> number_inner.edges
EdgeView([(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)])

The number-last expression is somehow misinterpreted. Could pysmiles be made permissive about this case?
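
The permissive reading amounts to binding a ring-closure digit to whichever atom the next bond would attach to, which after a closing parenthesis is the branch anchor rather than the last atom parsed. The toy parser below illustrates this on a tiny SMILES subset (single-letter atoms, parentheses, single-digit ring closures); it is a sketch of the idea, not the pysmiles parser.

```python
def parse_edges(smiles):
    """Return the sorted edge list of a SMILES string from a tiny subset."""
    edges = set()
    previous = None   # atom that the next atom or ring bond attaches to
    anchors = []      # stack of branch anchors
    open_rings = {}   # ring digit -> atom that opened it
    n_atoms = 0
    for char in smiles:
        if char == '(':
            anchors.append(previous)
        elif char == ')':
            # After a branch, bonds attach to the branch anchor again.
            previous = anchors.pop()
        elif char.isdigit():
            if char in open_rings:
                other = open_rings.pop(char)
                edges.add((min(other, previous), max(other, previous)))
            else:
                open_rings[char] = previous
        else:  # an atom
            if previous is not None:
                edges.add((previous, n_atoms))
            previous = n_atoms
            n_atoms += 1
    return sorted(edges)


for smiles in ('C1(CC)OC1', 'C(CC)1OC1', 'C(CC1)OC1'):
    print(smiles, parse_edges(smiles))
```

Under this rule the first two strings give the same three-membered ring, while the third keeps the ring bond inside the branch.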

thiophene + adding hydrogen leads to incorrect count

Hi @pckroon,

When parsing the SMILES string describing thiophene (c1ccsc1), an incorrect hcount is assigned to the sulfur. In thiophene there is no hydrogen attached to the sulfur atom, but the hcount is 1 and consequently one hydrogen is added. I've traced the accounting problem back to the fill_valence function in the pysmiles.smiles_helper module. For this case the sum of the bond orders on sulfur is 3 (i.e. 1.5 x 2); according to the listed valences, this means sulfur gets a valence of 4. That is the problem, because technically speaking I think the code works as written.

I guess we need to check for the sulfur case if it is aromatic or not?
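
The arithmetic described above can be reproduced in a few lines. The valence table and `naive_hcount` below are assumptions written in the spirit of the description, not a copy of the library code; they only show why sulfur is affected while carbon, nitrogen, and oxygen are not.

```python
# Assumption: allowed valences per element, lowest first.
VALENCES = {'C': (4,), 'N': (3, 5), 'O': (2,), 'S': (2, 4, 6)}


def naive_hcount(element, bond_order_sum):
    """Fill up to the lowest valence that is not below the bond order sum."""
    for valence in VALENCES[element]:
        if valence >= bond_order_sum:
            return int(valence - bond_order_sum)
    return 0


aromatic_sum = 1.5 * 2  # two aromatic bonds
for element in ('C', 'N', 'O', 'S'):
    print(element, naive_hcount(element, aromatic_sum))
```

Carbon correctly gets one hydrogen and nitrogen none, but sulfur jumps to valence 4 and picks up a spurious hydrogen, which is why some aromaticity-aware check seems necessary.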

Treasure Hunt

Smiles that are aromatic and worked before the latest aromatic update:

dideoxyaraan1oxide2fluoro C1C(F)C(n2c3ncn(=O)c(N)c3nc2)OC1CO
2,5-dime-3,6-dichloropyrazine-1-oxide Cc1nc(Cl)c(C)n(=O)c1Cl
2,5-dime-6-chloropyrazine-1-oxide Cc1n(=O)c(Cl)c(C)nc1
2,3-di(4-cl-ph)-5-cl-pyrazine-1-oxide c1c(c3nc(Cl)cn(=O)c3c2ccc(Cl)cc2)ccc(Cl)c1
5-(1-oxo-1lambda~5~-pyridin-3-yl)pyrrolidin-2-one O=C1CCC(N1)c2cccn(=O)c2

pysmiles 'Unmatched ring indices [0]'

Given the String I get this error. Do you have any suggestions?

----> 5     return pysmiles.read_smiles(smiles_str,explicit_hydrogen=True)
      6 
      7 def graph_from_gnm(n,m):

~/anaconda3/envs/deepnv/lib/python3.6/site-packages/pysmiles/read_smiles.py in read_smiles(smiles, explicit_hydrogen, zero_order_bonds, reinterpret_aromatic)
    180             LOGGER.warning('E/Z stereochemical information, which is specified by "%s", will be discarded', token)
    181     if ring_nums:
--> 182         raise KeyError('Unmatched ring indices {}'.format(list(ring_nums.keys())))
    183 
    184     # Time to deal with aromaticity. This is a mess, because it's not super

KeyError: 'Unmatched ring indices [0]'
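
The message means that a ring-bond number was opened but never closed. A quick pre-check on the input string can point at the offending index before parsing. The following is a rough sketch with a hypothetical helper, `unmatched_ring_indices`; it ignores digits inside bracket atoms and understands the two-digit `%nn` form, but it is not a full SMILES tokenizer.

```python
import re


def unmatched_ring_indices(smiles):
    """Return the ring indices that occur an odd number of times."""
    # Digits inside [...] are isotopes, hydrogen counts or charges.
    stripped = re.sub(r'\[[^\]]*\]', '*', smiles)
    still_open = set()
    for match in re.finditer(r'%(\d\d)|(\d)', stripped):
        index = int(match.group(1) or match.group(2))
        still_open ^= {index}  # open on first sight, close on the next
    return sorted(still_open)


print(unmatched_ring_indices('c1ccccc1'))
print(unmatched_ring_indices('C0CC'))
```

A SMILES string containing a single stray 0 would be reported with the same index as in the error above.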
