Coder Social home page Coder Social logo

pckroon / pysmiles Goto Github PK

View Code? Open in Web Editor NEW
140.0 4.0 20.0 218 KB

A lightweight python-only library for reading and writing SMILES strings

License: Apache License 2.0

Python 100.00%
smiles-strings writing-smiles smiles cheminformatics python hacktoberfest

pysmiles's Introduction

Build Status Coverage Status

pysmiles: The lightweight and pure-python SMILES reader and writer

This is a small project I started because I couldn't find any SMILES reader or writer that was easy to install (read: Python only). Currently, the writer is extremely basic, and although it should produce valid SMILES they won't be pretty, but see also issue #17. The reader is in a better state, and should be usable.

SMILES strings are assumed to be as specified by the OpenSmiles standard.

Molecules

Molecules are depicted as Networkx graphs. Atoms are the nodes of the graph, and bonds are the edges. Nodes can have the following attributes:

  • element: str. This describes the element of the atom. Defaults to '*' meaning unknown.
  • aromatic: bool. Whether the atom is part of an (anti)-aromatic system. Defaults to False.
  • isotope: float. The mass of the atom. Defaults to unknown.
  • hcount: int. The number of implicit hydrogens attached to this atom. Defaults to 0.
  • charge: int. The charge of this atom. Defaults to 0.
  • class: int. The "class" of this atom. Defaults to 0.

Edges have the following attributes:

  • order: Number. The bond order. 1.5 is used for aromatic bonds. Defaults to 1.

There is currently no way of specifying stereo chemical information, and this is discarded upon reading. Somewhere in the future this will probably be stored in the "stereo" attribute of nodes.

Reading SMILES

The function read_smiles(smiles, explicit_hydrogen=False, zero_order_bonds=True, reinterpret_aromatic=True) can be used to parse a SMILES string. It should not be used to validate whether a string is a valid SMILES string --- the function does very little validation whether your SMILES string makes chemical sense. Edges in the created molecule will always have an 'order' attribute. Nodes will have the relevant attributes in so far they are specified. Atoms for which the element is not known (*) will not have an element attribute.

  • explicit_hydrogen determines whether hydrogen atoms should be represented as explicit nodes in the created molecule, or implicit in the 'hcount' attribute.
  • zero_order_bonds determines whether zero-order bonds (.) in the SMILES string should result in edges in the produced molecule.
  • reinterpret_aromatic determines whether aromaticity should be reinterpreted, and determined from the constructed molecule, or whether the aromaticity specifications from the SMILES string (lower case elements) should be taken as leading. If True, will also set bond orders to 1 for bonds that are not part of an aromatic ring and have a bond order of 1.5. If False, will create a molecule using only the information in the SMILES string.

Stereochemical information

Currently the library cannot handle stereochemical information, neither E/Z nor R/S. Any stereochemical information that was in the SMILES string will be discarded upon parsing. This means there will be no difference between parsing e.g. N[C@](Br)(O)C, N[C@@](Br)(O)C and NC(Br)(O)C. Parsing these will result in the same molecule. The same holds for e.g. F/C=C/F and FC=CF. These will result in the same molecule.

Whenever stereochemical information is being discarded a warning will be logged using the built-in logging module. If you want to disable all the messages logged by pysmiles you can add the following snippet to your code, without interfering with any logging by your own code:

import logging
logging.getLogger('pysmiles').setLevel(logging.CRITICAL)  # Anything higher than warning

Writing SMILES

The function write_smiles(molecule, default_element='*', start=None) can be used to write SMILES strings from a molecule. The function does not check whether your molecule makes chemical sense. Instead, it writes a SMILES representation of the molecule you provided, and nothing else.

  • default_element is the element to use for nodes that do not have an 'element' attribute.
  • start is the key of the node where the depth first traversal should be started. Something clever is done if not specified.

Additional functions

In addition to these two core functions, four more functions are exposed that can help in creating chemically relevant molecules with minimal work.

  • fill_valence(mol, respect_hcount=True, respect_bond_order=True, max_bond_order=3) This function will fill the valence of all atoms in your molecule by incrementing the 'hcount' and, if specified, bond orders. Note that it does not use 'charge' attribute to find the correct valence.
    • repect_hcount: bool. Whether existing hcounts can be overwritten.
    • respect_bond_order: bool. Whether bond orders can be changed
    • max_bond_order: int. The maximum bond order that will be set.
  • add_explicit_hydrogens(mol) This function transforms implicit hydrogens, specified by 'hcount' attributes, to explicit nodes.
  • remove_explicit_hydrogens(mol) This function does the inverse of add_explicit_hydrogens: it will remove explicit hydrogen nodes and add them to the relevant 'hcount' attributes.
  • correct_aromatic_rings(mol) This function marks all (anti)-aromatic atoms in your molecule, and sets all bonds between (anti)-aromatic atoms to order 1.5. It fills the valence of all atoms (see also fill_valence) before trying to figure our which atoms are aromatic. It works by first finding all atoms that are in a ring. Next, for every atom in every ring it is checked whether the atoms are sp2 hybridized (note that this is a vague term. Strictly speaking we check whether their element is something that could be aromatic, and whether they have 2 or 3 bonds.). Finally, the number of electrons per ring is counted, and if this is even, the atoms in the ring are said to be aromatic. This function is the most fragile in the whole library, and I expect it to produce wrong answers in some cases. In particular for fused (aromatic) ring systems (such as indole) and rings with extracyclic heteroatoms (O=C1C=CC=C1). Buyer beware.

Examples

Reading

from pysmiles import read_smiles

smiles = 'C1CC[13CH2]CC1C1CCCCC1'
mol = read_smiles(smiles)

print(mol.nodes(data='element'))
# [(0, 'C'),
#  (1, 'C'),
#  (2, 'C'),
#  (3, 'C'),
#  (4, 'C'),
#  (5, 'C'),
#  (6, 'C'),
#  (7, 'C'),
#  (8, 'C'),
#  (9, 'C'),
#  (10, 'C'),
#  (11, 'C')]
print(mol.nodes(data='hcount'))
# [(0, 2),
#  (1, 2),
#  (2, 2),
#  (3, 2),
#  (4, 2),
#  (5, 1),
#  (6, 1),
#  (7, 2),
#  (8, 2),
#  (9, 2),
#  (10, 2),
#  (11, 2)]

mol_with_H = read_smiles(smiles, explicit_hydrogen=True)
print(mol_with_H.nodes(data='element'))
# [(0, 'C'),
#  (1, 'C'),
#  (2, 'C'),
#  (3, 'C'),
#  (4, 'C'),
#  (5, 'C'),
#  (6, 'C'),
#  (7, 'C'),
#  (8, 'C'),
#  (9, 'C'),
#  (10, 'C'),
#  (11, 'C'),
#  (12, 'H'),
#  (13, 'H'),
#  (14, 'H'),
#  (15, 'H'),
#  (16, 'H'),
#  (17, 'H'),
#  (18, 'H'),
#  (19, 'H'),
#  (20, 'H'),
#  (21, 'H'),
#  (22, 'H'),
#  (23, 'H'),
#  (24, 'H'),
#  (25, 'H'),
#  (26, 'H'),
#  (27, 'H'),
#  (28, 'H'),
#  (29, 'H'),
#  (30, 'H'),
#  (31, 'H'),
#  (32, 'H'),
# (33, 'H')]

Writing

import networkx as nx
from pysmiles import write_smiles, fill_valence

mol = nx.Graph()
mol.add_edges_from([(0, 1), (1, 2), (1, 3), (3, 4), (1, 5), (3, 6)])
for idx, ele in enumerate('CCCCOCO'):
    mol.nodes[idx]['element'] = ele
mol.nodes[4]['charge'] = -1
mol.nodes[4]['hcount'] = 0
mol.edges[3, 6]['order'] = 2

print(write_smiles(mol))
# [O-]C(=O)C([C])([C])[C]
fill_valence(mol, respect_hcount=True)
print(write_smiles(mol))
# [O-]C(=O)C(C)(C)C

Limitations

  • The writer produces non-recommended SMILES strings (as per OpenSmiles).
  • The writer is better described as a "serializer": if the graph provided doesn't make chemical sense the produced "SMILES" string will be an exact representation of that graph. Because of this, the SMILES string will be invalid though.
  • fill_valence does not use 'charge' to find the correct valence.
  • correct_aromatic_rings is fragile.
  • There is currently no way of specifying stereo chemical information. The parser can deal with it, but it will be discarded.
  • It only processes SMILES. This might later be extended to e.g. InChi, SLN, SMARTS, etc.

Requirements

Similar projects

There are more python projects that deal with SMILES, and I try to list at least some of them here. If yours is missing, feel free to open up a PR.

  • PySMILE: A similar named project, capable of encoding/decoding SMILE format objects. Doesn't deal with SMILES.
  • RDKit: A collection of cheminformatics and machine-learning software, capable of reading and writing SMILES, InChi, and others.
  • OpenEye Chem toolkit: The OpenEye chemistry toolkit is a programming library for chemistry and cheminformatics. It is capable of dealing with (canonical) SMILES and InChi.

License

PySmiles is distributed under the Apache 2.0 license. Copyright 2018 Peter C Kroon

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

pysmiles's People

Contributors

fgrunewald avatar pckroon avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

pysmiles's Issues

fill valance fails for charged molecules

This little example should work in my opinion but does not:

import pysmiles
g=pysmiles.read_smiles("CCC[O-]")
pysmiles.smiles_helper.fill_valence(g, respect_hcount=False)
assert g.nodes[3]['hcount'] == 0

Origin of the problem is that fill_valence does not account for charges.

please note occasional divergence of pysmiles' hcount vs. RDKit

Dear developers of pysmiles,

please note a recent post on mattermodelling.se which eventually compares the hydrogen count per non-H atom by pysmiles and RDKit for an N-alkylated dihydrobenzimidazole. Apparently, the present version 1.0.1 of pysmiles errs about this example, and wrongly assigns (still) a hydrogen present as if neuter nitrogen were tetravalent.

Since the two answers to the OP share MWEs and detailed results, it is possible to equally check the parent non-alkylated structure (c12ccccc1NCN2); in this case, the error is not observed.

write_smiles can create invalid SMILES when provided with chemically invalid graphs

For some reason my graph is returning SMILES for aromatic groups that uses aromatic bond symbols e.g. NC:1:N:N:C:[N]1N.

RDKit does not recognize these symbols and it removes all the aromaticity to produce NC1NNCN1N, and openbabel produces the same result.

Some have speculated that its a smarts string

https://mattermodeling.stackexchange.com/questions/4981/how-to-canonicalize-smiles-written-with-aromatic-bond-symbols

others just say it's wrong.

openbabel/openbabel#2368

Do you know what is going on?

Thanks for your help!

could you not PRINT warnings?

maybe it is better to write to stderr or at least offer an option to shut off the warnings like "I can't deal with stereo yet..."

Bug Simpliefied Smiles of 1 atom

In the current implementation the smile "O" translates yields just O with an hcount of 0. Shouldn't it yield an hcount of 1 and be the short form of water? Perhaps I'm missing some convention here.

Inconsistent writing and reading of mono-atomic smiles for Se and As

If we have a graph featuring a single non-organic atom like Se, this will be outputted by the smiles writer as 'Se'. According to smiles rules, I would have expected '[Se]'. And then, if we again read in the smile 'Se' with the smiles reader, it will fail and output a graph featuring 'S', because it relies on the brackets to recognise non-organic elements.
I don't know if this is an issue with the smiles reader or writer, but they are inconsistent.

Question on rings

The bug / question
Most likely this question is due to my lack of understanding the smile syntax, but I encountered an odd behavior on rings. The following smile string [CH3](c1ccccc1)[CH2] correctly generates a graph of ethylbenzene, whereas this smile string [CH3]c1ccccc1[CH2] generates a graph of dimethylbenzene but one of the methyl groups lacks a hydrogen. My understanding is that the two smiles are the same but the second one is more sloppy as it lacks the braces. Should this perhaps raise an error?

Code to reproduce this behavior

import sys
import matplotlib.pyplot as plt
import networkx as nx
import pysmiles

mol = pysmiles.read_smiles(sys.argv[1], explicit_hydrogen=True)

nx.draw(mol, labels=labeldict, with_labels=True,  pos=nx.kamada_kawai_layout(mol) )
plt.show()

I tested with networkx version 2.8.1 and 3.1. The behavior is the same.

Writing smiles silently breaks on graphs with multiple fragments

When writing smiles from a self-defined graph, it can happen that this graph has multiple unconnected fragments. I would have expected this to be recognized as zero-order bonds automatically. However, if encountering such a graph with multiple fragments, write_smiles() will only return the smiles string for one of the fragments and not even throw an error or warning. It assumes that zero-order bonds are explicitly mentioned as bonds with order zero in the graph, but this is an assumption that will probably fail in most use cases (like mine).

I propose to fix this issue by introducing zero-order bonds between unconnected fragments in the graph:

fragments_connectors = [list(frag)[0] for frag in nx.connected_components(graph)]
central_node = fragments_connectors.pop(0)
for idx in fragments_connectors:
    graph.add_edge(central_node, idx, order=0)

I actually don't know how to contribute to this in the best way by editing the code, but I'd be willing to give it a try if you could tell me how.

Stereochemistry

I was trying to get the nodes and adjacency matrix of a drug-like molecule. While doing this, I encountered the following message:

I don't quite know how to handle stereo yet...

However, code returns nodes vector and adjacency matrix that I wanted. I was wondering if the returned values are valid or they should be discarded.

NetworkX v3

To keep compatibility with vermouth / polyply and some other packages can pysmiles networkx dependency also be increased to v3?

thiophene + adding hydrogen leads to incorrect count

Hi @pckroon,

When parsing the SMILES string describing thiophene (c1ccsc1) an incorrect hcount is assigned for the sulfur. For thiophene there is no hydrogen attached to the sulfur atom but the hcount is 1, consequently, one hydrogen is added. I've traced the accounting problem back to the fill_valence function in pysmiles.smiles_helper module. For this case the number of bonds attached to sulfur is 3 (i.e. 1.5 x 2), however, according to the listed valances, this would mean sulfur gets a valence of 4. And that is the problem because technically speaking I think the code works correctly.

I guess we need to check for the sulfur case if it is aromatic or not?

Misinterpretation of the ring closure bonds

Dear developers

In some cases, the marker of ring closure bonds will locate after the marker of branching, for example C(CC)1OC1 and C1(CC)OC1. OpenBabel can accept them and yield the same structure, which can be validated by command line obabel -ismi -:'C1(CC)OC1' -osmi and obabel -ismi -:'C(CC)1OC1' -osmi, the converted SMILES expressions are C1(CC)OC1.

However pysmiles acting differently as following:

>>> import pysmiles
>>> number_first = pysmiles.read_smiles('C1(CC)OC1')
>>> number_last = pysmiles.read_smiles('C(CC)1OC1')
>>> number_first.edges
EdgeView([(0, 1), (0, 3), (0, 4), (1, 2), (3, 4)])
>>> number_last.edges
EdgeView([(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)])
>>> number_inner = pysmiles.read_smiles('C(CC1)OC1')
>>> number_inner.edges
EdgeView([(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)])

the number last expression is somehow been misinterpreted. would the pysmiles be permissive to this condition?

R/S chirality

RIght, so in openSMILES chirality is very elaborate [1]. I suggest supporting only tetrahedral chiralities, i.e. @ and @@ specifications, and later on E/Z chirality. This issue will be just about the tetrahedral R/S chirality.

The default case is fairly easy to cover since pysmiles guarantees that the node indices in the final molecule are in the same order as the atom entries in the input SMILES. openSMILES says to look along the axis from the "first" atom to the chiral center, and see whether the other atoms go clockwise (@@) or counter clockwise (@).
The complication arises from rings: the order of the bonds around the central atom are what matters according to openSMILES. In this case the node indices in the resulting graph do not necessarily correspond to the chiral order. Consider the following cases, where x is the chiral center with 4 different substituents (a, b, c, d), and .. means "any number of different atoms in between".

a[x@](b)(c)d  # Default case 0
b1..a[x@]1(c)d  # case 1
a[x@]1(c)(d)..b1  # case 2
a[x@]21(d)..c1..b2  # case 3
[x@]12cd..a1..b2  # case 4

Finally, implicit hydrogens are considered the "first" atom, but we can shove this under the carpet for now.

This results in the following

  • chiral atoms which are not involved with ring bonds do not need special treatment, since the node indices in the resulting graph are enough (case 0).
  • chiral atoms which are involved with ring bonds may need special treatment (cases 2, 3 and 4), but we can only do this once all atoms bonded to the chiral atom have been parsed (and have been given a node index).
  • The order of bonds around an atom will always be: implicit hydrogens, preceding atoms, ring bonds, other atoms.

I suggest lumping cases 1-4 together for sake of simplicity. This means when parsing we'll need to keep track of all atoms which are involved with ring bonds so we can post-process them. In addition, we'll need to track in which order (chiral) atoms have ring bonds, and once these edges are added to the graph, which node indices they correspond to (because ring indices can (and should) be reused).

So, concretely:

  • When "opening" a ring bond (first occurrence of a ring index): find the parent atom, store the ring indices in a list (this is effectively the inverse of ring_nums [2]), as well as the number of edges it already has (needed to distinguish cases 3 and 4).
  • When "closing" a ring bond, translate the previously stored ring index to an atom index.
  • After parsing the entire molecule, for every atom involved in ring bonds, see if it's chiral, and if so, construct the order in which nodes were bonded to it. Once done, set the atom's stereo attribute correctly (see below).

An open question is still how to represent the chirality at the graph level. I suggest sticking to the opensmiles @ @@ idea, which the exception that the order of the node inidices of the neighbours is determining (i.e. the neighbour with the lowest node index is the first, the one with highest last). We can then provide helper functions to translate that to R/S (and the other way). I think this is the most natural way of exposing it:

  • If you want to invert the chirality, switch @ for @@ or vice-versa.
  • If you want to change a CH3 group for a halogen, this could change your chirality even when you don't want to change the spatial orientation. Following our definition, this substitution is as simple as changing the 'element' attribute. If we store R/S you'd also have to recalculate the desired chirality.
  • This requires a little bit of attention when adding/removing explicit hydrogens, but we can cover this in the corresponding helper functions.

Open question is also how strict we want to be here: is it an error to provide a chiral specifier on a non-tetrahedral atom? What about symmetric substituents? In other words, what if I provide a chiral specifier for a non-chiral atom?

Lastly, careful thought is required for the SMILES writer, but I think we can push that back a little bit.

@Eljee thoughts?

[1] http://opensmiles.org/opensmiles.html#chirality
[2] https://github.com/pckroon/pysmiles/blob/master/pysmiles/read_smiles.py#L128

Molecular Formula and Molecular Weight

Hello! I found your library very helpful in parsing SMILES.

Would you be interested in adding MF and MW as additional attributes?

Something along these lines:

from collections import default_dict
from pysmiles import read_smiles

AW = {
    'C': 12.0107,
    'H': 1.00794,
    # etc.
}

class MolecularFormula:
    def __init__(self, smiles: str):
        self.smiles = smiles
        self.mf = defaultdict(lambda: 0)

        try:
            mol = read_smiles(
                smiles,
                explicit_hydrogen=False,
                reinterpret_aromatic=False,
            )
            nodes = mol.nodes()

            for i in range(mol.number_of_nodes()):
                self.mf[nodes[i]['element']] += 1
                self.mf['H'] += nodes[i]['hcount']

            self.mw = 0
            for k, v in self.mf.items():
                self.mw += AW[k] * v

            self.mw = round(self.mw, 2)
        except Exception as e:
            # log or raise
            self.mw = 0

    def __repr__(self):
        return ''.join([str(k)+str(v) for k,v in self.mf.items()])

pysmiles 'Unmatched ring indices [0]'

Given the String I get this error. Do you have any suggestions?

----> 5     return pysmiles.read_smiles(smiles_str,explicit_hydrogen=True)
      6 
      7 def graph_from_gnm(n,m):

~/anaconda3/envs/deepnv/lib/python3.6/site-packages/pysmiles/read_smiles.py in read_smiles(smiles, explicit_hydrogen, zero_order_bonds, reinterpret_aromatic)
    180             LOGGER.warning('E/Z stereochemical information, which is specified by "%s", will be discarded', token)
    181     if ring_nums:
--> 182         raise KeyError('Unmatched ring indices {}'.format(list(ring_nums.keys())))
    183 
    184     # Time to deal with aromaticity. This is a mess, because it's not super

KeyError: 'Unmatched ring indices [0]'

Adjacency matrix

Hello,

I was trying to use this code to find a graph representation for a given SMILE string. I tried

from pysmiles import read_smiles
import networkx as nx

smiles = 'N=CN(C=O)CC=O' 
mol = read_smiles(smiles)

# atom vector (C only)
print(mol.nodes(data='element'))
# adjacency matrix
print(nx.to_numpy_matrix(mol))

Apparently, this does not differ between single/double/triple bonds. Is there a way to enforce that? Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.