
deepsmiles's Introduction

DeepSMILES

This Python module can convert well-formed SMILES (that is, as written by a cheminformatics toolkit) to DeepSMILES. It also does the reverse conversion.

Install the latest version with:

pip install --upgrade deepsmiles

DeepSMILES is a SMILES-like syntax suited to machine learning. Rings are indicated using a single symbol instead of two, while branches do not use matching parentheses but rather use a right parenthesis as a 'pop' operator.

For example, benzene is c1ccccc1 in SMILES but cccccc6 in DeepSMILES (where the 6 indicates the ring size). As a branch example, the SMILES C(Br)(OC)I can be converted to the DeepSMILES CBr)OC))I. For more information, please see the corresponding preprint (https://doi.org/10.26434/chemrxiv.7097960.v1) or the lightning talk at https://www.slideshare.net/NextMoveSoftware/deepsmiles.
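As a rough illustration of the branch rule described above, the following standalone sketch rewrites SMILES branches as 'pop' operators. It is not the library's implementation: the name encode_branches_sketch and the simple tokenizer are assumptions made for this example, and ring-closure digits are passed through unchanged.

```python
import re

# Assumed tokenizer: bracket atoms, Br/Cl, single-letter atoms, or any other character.
TOKEN = re.compile(r"\[[^\]]*\]|Br|Cl|[A-Za-z]|.")


def encode_branches_sketch(smiles):
    """Drop '(' and replace each ')' with one pop per atom on that branch's path."""
    out = []
    path_lengths = []  # atoms seen so far on the path of each open branch
    for tok in TOKEN.findall(smiles):
        if tok == "(":
            path_lengths.append(0)
        elif tok == ")":
            # Atoms of nested branches were already popped, so only
            # the atoms on this branch's own path remain to be popped.
            out.append(")" * path_lengths.pop())
        else:
            is_atom = tok[0] == "[" or tok.isalpha()
            if is_atom and path_lengths:
                path_lengths[-1] += 1
            out.append(tok)
    return "".join(out)


print(encode_branches_sketch("C(Br)(OC)I"))  # CBr)OC))I
```

The branch OC contains two atoms, so two pops are needed to get back to the first carbon before I is attached.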

The library is used as follows:

import deepsmiles
print("DeepSMILES version: %s" % deepsmiles.__version__)
converter = deepsmiles.Converter(rings=True, branches=True)
print(converter) # record the options used

encoded = converter.encode("c1cccc(C(=O)Cl)c1")
print("Encoded: %s" % encoded)

try:
    decoded = converter.decode(encoded)
except deepsmiles.DecodeError as e:
    decoded = None
    print("DecodeError! Error message was '%s'" % e.message)

if decoded:
    print("Decoded: %s" % decoded)


deepsmiles's Issues

Missing colon in bondchars

While working with SMILES in which all bonds are explicit, I noticed that the colon is missing from the bondchars variable. As a result, the ring size is computed incorrectly:

>>> smiles_all_bonds_explicit = '[O]=[C](-[OH])-[c]1:[cH]:[c](-[OH]):[c](-[OH]):[c](-[OH]):[cH]:1'
>>> converter = deepsmiles.Converter(branches=True, rings=True)
>>> converter.encode(smiles_all_bonds_explicit)
'[O]=[C]-[OH])-[c]:[cH]:[c]-[OH]):[c]-[OH]):[c]-[OH]):[cH]:%12'  # 12-membered ring

After adding a colon to bondchars on line 11 of encode.py, the output is as expected:

'[O]=[C]-[OH])-[c]:[cH]:[c]-[OH]):[c]-[OH]):[c]-[OH]):[cH]:6'  # 6-membered ring

Decoding still works correctly after this change.

non-canonical smiles using deepsmiles

A molecule can be written as several non-canonical SMILES that all correspond to one canonical SMILES.
Here are three non-canonical SMILES that RDKit converts to the same canonical SMILES:

Chem.MolToSmiles(Chem.MolFromSmiles('C1=CC=CN=C1'))
'c1ccncc1'
Chem.MolToSmiles(Chem.MolFromSmiles('c1cccnc1'))
'c1ccncc1'
Chem.MolToSmiles(Chem.MolFromSmiles('n1ccccc1'))
'c1ccncc1'

For deepsmiles,

converter.encode('n1ccccc1')
'nccccc6'
converter.encode('C1=CC=CN=C1')
'C=CC=CN=C6'
converter.encode('c1cccnc1')
'ccccnc6'

This gives three different answers. Are they all valid DeepSMILES? Is there a way to tell that 'nccccc6' == 'C=CC=CN=C6' == 'ccccnc6' (that is, a way to canonicalize them)? Thank you in advance!

Shift closure values

The closure values "0" and "1" will never be seen in the current DeepSMILES. C0 is meaningless, and C1 has a loop to itself.

Proposal 1: Shift the closure numbers so that "CC0" corresponds to what is currently "CC2".

The closure value "2" can only be seen with dot disconnections, as for example C.C2. Otherwise, a 2 always links to the previous atom, as CC2 or CN)C2. If #6 is implemented, such that closures cannot cross a dot disconnection, then the closure value "2" will never exist in a valid DeepSMILES.

Proposal 2: Shift the closure numbers so that "CCC0" corresponds to what is currently "CCC3".

This would make the closure values 0, 1, and 2 be useful.

disallow closures after branches

I think that closures should not be allowed after ')' terms.

To start, I think there's a bug in how the deepsmiles implementation handles this case. Consider COPNBCl))=3.

I expected the )) to transform COPNBCl)) to COPN with a branch on the N, then followed by the =3 to give the final SMILES CO1PN=1(BCl).

But the code does the following:

>>> decode = deepsmiles.Converter(True, True).decode
>>> decode("COPNBCl))=3")
'COPN1BCl=1'
>>> decode("COPN=3")
'CO1PN=1'

The =3 acts on the Cl, linking it to N, even though the )) "should" have popped those two atoms.

This can be a problem because the closures allow multiple branch-pops in a sequence. My code expected that each atom would have at most a single close ')' after it. As a result, I get

>>> cdeepsmiles.decode("COPNB)=3)N")
'CO1P(N=1(B)N\x00'

when I think I "should" have gotten:

'CO1P(N=1(B))N'

Closures after branch-pops are unneeded, and SMILES doesn't support them.

Rather than trying to handle this case correctly (and the fact that two different code bases got it wrong suggests that it isn't easy to get right), how about making it not be allowed?
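A minimal sketch of such a check, assuming it is enough to look for a ring closure (a digit or '%') that directly follows a ')', optionally with a bond symbol in between. The function name and the bond-symbol set are assumptions for illustration and are not part of deepsmiles.

```python
import re

# '%(NNN)' closures contain a ')' of their own; replace them first so that
# this parenthesis is not mistaken for a branch pop.
_PAREN_CLOSURE = re.compile(r"%\(\d+\)")
# A pop, an optional bond symbol, then the start of a ring closure.
_CLOSURE_AFTER_POP = re.compile(r"\)[-=#$:/\\]?(?:\d|%)")


def has_closure_after_pop(deepsmiles):
    """Return True if a ring closure immediately follows a branch pop."""
    simplified = _PAREN_CLOSURE.sub("%00", deepsmiles)
    return _CLOSURE_AFTER_POP.search(simplified) is not None


print(has_closure_after_pop("COPNBCl))=3"))  # True
print(has_closure_after_pop("COPN=3"))       # False
```

A decoder could run a check like this up front and raise an error instead of trying to interpret the string.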

Consider compressing parentheses

It has been mentioned that replacing the multiple close parentheses by a number plus a single parenthesis would be a good compression strategy. This of course is true. What I don't know is whether it would make it easier for a ML method to use/learn/generate the string. But I guess I can add an option to control this.

In the meantime, maybe I can provide a piece of Python code that does the transformation for anyone interested.
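One possible shape for that transformation is sketched below. A bare count in front of the parenthesis could be confused with a ring-size digit, so this sketch wraps the count in braces: a run of n >= 2 close parentheses becomes {n}). That notation and both function names are assumptions made for this example, not an agreed DeepSMILES syntax.

```python
import re


def compress_parens(deepsmiles):
    """Replace each run of two or more ')' with '{n})' (hypothetical notation)."""
    return re.sub(r"\){2,}", lambda m: "{%d})" % len(m.group(0)), deepsmiles)


def expand_parens(compressed):
    """Inverse of compress_parens: turn '{n})' back into n close parentheses."""
    return re.sub(r"\{(\d+)\}\)", lambda m: ")" * int(m.group(1)), compressed)


example = "cccccC=O)Cl))c6"
packed = compress_parens(example)
print(packed)                            # cccccC=O)Cl{2})c6
print(expand_parens(packed) == example)  # True
```

Single close parentheses are left alone, so strings without long runs are unchanged.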

incorrect closure decoding when N>=100

The following meaningless DeepSMILES:

Bbbbbb2522222522534b52522534bbb25222225225342522534b52b2522222522534b52522534bbb25222225225342522534b5252b52bbbb2522222522534b5252b5b6

converts to:

Bb28%11b%13%14%16%19b%12%21b1345679%10%20b123456789%10%11%12%13%15%17%18%23%29%32%36%39b%14%15%16%17%18%19%20%21%34%41%42b%33%40%45%51%54b%22%24%25%26%27%28%30%31%35%37%38%56%57%59%62b%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%43%55%64b%42%43%44%46%47%48%49%50%52%53%63b%44%45%46%47%48%49%50%51%52%53%54%55%56%58%60%61%66%72%75%79%82b%57%58%59%60%61%62%63%64%77%84%85%87b%76%83%89b%65%67%68%69%70%71%73%74%78%80%81b%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%86%88b%85%86%87%88%90b%89%90%92%98%101b%103%104%106b%102%108%109b%91%93%94%95%96%97%99%100b%91%92%93%94%95%96%97%98%99%100%101%102%103%105%107b%104%105%106%107b%108b%109

This shows that the decoder doesn't handle %(NNN) closures correctly when NNN >= 100. For example, the last four characters, "%109", should be "%(109)".

This is likely because of this line in decode.py:

           smi_bcsymbol = "%d" % digit if digit < 10 else "%%%d" % digit

which should likely be something more like:

           if digit < 10:
             smi_bcsymbol = str(digit)
           elif digit < 100:
             smi_bcsymbol = "%" + str(digit)
           else:
             smi_bcsymbol = "%(" + str(digit) + ")"

SMILES output by the decoder are not canonical

Maybe not a bug, but annoying.
If you do a round trip (SMILES -> DeepSMILES -> SMILES), you expect the original input and the final output to be the same.
For this to hold, you have to build the molecule from the decoded SMILES with RDKit and then let RDKit write the output SMILES. That output equals the input SMILES if the input was also written by RDKit.

Rings across branches.

If I'm not mistaken, the current version of DeepSMILES cannot directly handle the ring in a SMILES such as CCCC(CC1)C(CC1)CCCC. (The example SMILES is valid, although not canonical.)
If I pass the example SMILES to the current code, the output is CCCCCC))CCC2))CCCC, which makes no sense at all.

The reason for this may lie in the DFS constraint you mentioned in #6.

I am wondering whether this issue is worth fixing and whether this pattern will appear in canonical SMILES.

dot disconnection combined with branches and closures

I think that the dot disconnection "." should reset the system state so that closures and branches cannot cross it.

Here's an example of using a branch to cross a dot:

>>> import deepsmiles
>>> conv = deepsmiles.Converter(True, True)
>>> conv.decode("CN.OP))S")
'CN.(OP)S'
>>> conv.decode("CN.OP)))S")
'CN(.OP)S'

In both cases, the result is not a valid SMILES.

Here is an example of using a closure to cross a dot:

>>> import deepsmiles
>>> conv = deepsmiles.Converter(True, True)
>>> conv.decode("C.C2")
'C.1C1'

You'll note that the result here is also not a valid SMILES.

While it is possible to fix the code to support these use cases, I think it's better to disallow dot disconnections to be used this way.

We both know that dot disconnections like this are useful in SMILES. However, the uses I can think of, such as simple combinatorial generation, only work with closures and depend on being able to label both sides of the disconnection. That's not possible with DeepSMILES.

Indexing off in decode when DeepSMILES string has >= 100 rings

This seems similar to #5, but the changes made in the resulting PR don't seem to address the issues that I'm running into.

I've been using the deepsmiles format and conversion code to build up SMILES strings for polymers from the string of just the monomer. The process has been working great until my resulting polymer deepsmiles string is for a molecule with 100 or more rings.

It looks like the issue is the assumption that there are always 2 digits following a % sign, and therefore the indexing skips ahead by 2.

The relevant code is in decode_branches() in decode.py:

elif x == '%':
    if i == 0:
        raise exceptions.DecodeError(deepsmiles, i, "'%' not allowed as first character")
    bondchar = deepsmiles[i-1] if i > 0 and deepsmiles[i-1] in bondchars else ""
    if deepsmiles[i+1] == '(':
        closebracket = deepsmiles.find(')', i+2)
        if closebracket == -1:
            raise exceptions.DecodeError(deepsmiles, i, "'%(' is missing the corresponding close parenthesis")
        digit = int(deepsmiles[i+2:closebracket])
        i = closebracket
    else:
        try:
            digit = int(deepsmiles[i+1:i+3])
        except ValueError:
            raise exceptions.DecodeError(deepsmiles, i, "'%' should be followed by two digits")
        i += 2

The smiles string that is returned in this case is incorrect.
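To make the two closure forms in the quoted code easier to reason about, here is a small standalone reader that returns both the parsed value and the index just past the closure. It is only a sketch: read_percent_closure is a hypothetical helper, not a function in deepsmiles, and it raises ValueError rather than the library's DecodeError.

```python
def read_percent_closure(text, i):
    """Parse a '%NN' or '%(NNN)' closure at text[i]; return (value, next_index)."""
    if text[i] != "%":
        raise ValueError("expected '%%' at index %d" % i)
    if text.startswith("(", i + 1):
        # Parenthesized form: any number of digits up to the matching ')'.
        close = text.find(")", i + 2)
        if close == -1:
            raise ValueError("'%(' is missing the corresponding close parenthesis")
        return int(text[i + 2:close]), close + 1
    # Bare form: exactly two digits.
    digits = text[i + 1:i + 3]
    if len(digits) != 2 or not digits.isdigit():
        raise ValueError("'%' should be followed by two digits")
    return int(digits), i + 3


print(read_percent_closure("C%12C", 1))     # (12, 4)
print(read_percent_closure("C%(109)C", 1))  # (109, 7)
```

Returning the next index explicitly avoids relying on a fixed skip of two characters, which only holds for the bare two-digit form.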

typo in decode.decode

The decode method for the default Converter gives an error:

>>> deepsmiles.Converter().decode("C")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "deepsmiles/converter.py", line 41, in decode
    return decode.decode(deepsmiles, rings=self.rings, branches=self.branches)
  File "deepsmiles/decode.py", line 242, in decode
    return smi
NameError: name 'smi' is not defined

This is because of a typo in decode.py:decode():

def decode(deepsmiles, rings=False, branches=False):
    ....
    if not rings and not branches:
        return smi

That last quoted line should be return deepsmiles.
