baoilleach / deepsmiles

DeepSMILES - A variant of SMILES for use in machine-learning

License: MIT License

Python 100.00%

generative-models smiles neural-networks machine-learning

deepsmiles's Introduction

DeepSMILES

This Python module can convert well-formed SMILES (that is, as written by a cheminformatics toolkit) to DeepSMILES. It also does the reverse conversion.

Install the latest version with:

pip install --upgrade deepsmiles

DeepSMILES is a SMILES-like syntax suited to machine learning. Rings are indicated using a single symbol instead of two, while branches do not use matching parentheses but rather use a right parenthesis as a 'pop' operator.

For example, benzene is c1ccccc1 in SMILES but cccccc6 in DeepSMILES (where the 6 indicates the ring size). As a branch example, the SMILES C(Br)(OC)I can be converted to the DeepSMILES CBr)OC))I. For more information, please see the corresponding preprint (https://doi.org/10.26434/chemrxiv.7097960.v1) or the lightning talk at https://www.slideshare.net/NextMoveSoftware/deepsmiles.
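The branch rule above can be illustrated with a small, self-contained sketch. This is not the library's implementation: the function name `encode_branches` and the simplified atom tokenizer are assumptions made for illustration, and ring closures are passed through unchanged. It drops each opening parenthesis and, at each closing parenthesis, emits one ')' per atom that has to be popped to get back to the branching atom.

```python
import re

# Simplified atom tokens (an assumption of this sketch): bracket atoms,
# two-letter halogens, then one-letter organic-subset atoms.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]")

def encode_branches(smiles):
    """Rewrite SMILES branch parentheses as DeepSMILES-style pops."""
    out = []
    depth = 0    # number of atoms on the current path
    saved = []   # path depth recorded at each '('
    i = 0
    while i < len(smiles):
        m = ATOM.match(smiles, i)
        if m:
            depth += 1
            out.append(m.group())
            i = m.end()
            continue
        ch = smiles[i]
        if ch == "(":
            saved.append(depth)          # remember the branching atom
        elif ch == ")":
            target = saved.pop()
            out.append(")" * (depth - target))  # pop back to that atom
            depth = target
        else:
            out.append(ch)               # bonds, ring digits: unchanged
        i += 1
    return "".join(out)

print(encode_branches("C(Br)(OC)I"))  # CBr)OC))I
```

The Br branch holds one atom and so needs one pop, while the OC branch holds two atoms and needs two.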

The library is used as follows:

import deepsmiles
print("DeepSMILES version: %s" % deepsmiles.__version__)
converter = deepsmiles.Converter(rings=True, branches=True)
print(converter) # record the options used

encoded = converter.encode("c1cccc(C(=O)Cl)c1")
print("Encoded: %s" % encoded)

try:
    decoded = converter.decode(encoded)
except deepsmiles.DecodeError as e:
    decoded = None
    print("DecodeError! Error message was '%s'" % e.message)

if decoded:
    print("Decoded: %s" % decoded)

deepsmiles's Issues

Missing colon in bondchars

While working with SMILES that have all their bonds explicit, I noticed that the bondchars variable lacks a colon. As a result, something like this happens:

>>> smiles_all_bonds_explicit = '[O]=[C](-[OH])-[c]1:[cH]:[c](-[OH]):[c](-[OH]):[c](-[OH]):[cH]:1'
>>> converter = deepsmiles.Converter(branches=True, rings=True)
>>> converter.encode(smiles_all_bonds_explicit)
'[O]=[C]-[OH])-[c]:[cH]:[c]-[OH]):[c]-[OH]):[c]-[OH]):[cH]:%12'  # 12-membered ring

After adding a colon to line 11 of encode.py, the output is as expected:

'[O]=[C]-[OH])-[c]:[cH]:[c]-[OH]):[c]-[OH]):[c]-[OH]):[cH]:6'  # 6-membered ring

Decoding still works correctly after this change.

non-canonical smiles using deepsmiles

SMILES strings can be canonical or non-canonical.
Here are three non-canonical SMILES that map to the same canonical SMILES:

Chem.MolToSmiles(Chem.MolFromSmiles('C1=CC=CN=C1'))
'c1ccncc1'
Chem.MolToSmiles(Chem.MolFromSmiles('c1cccnc1'))
'c1ccncc1'
Chem.MolToSmiles(Chem.MolFromSmiles('n1ccccc1'))
'c1ccncc1'

With DeepSMILES:

converter.encode('n1ccccc1')
'nccccc6'
converter.encode('C1=CC=CN=C1')
'C=CC=CN=C6'
converter.encode('c1cccnc1')
'ccccnc6'

It gives three different answers. Are they all valid DeepSMILES? Is there a way to tell that 'nccccc6' == 'C=CC=CN=C6' == 'ccccnc6' (that is, a way to canonicalize them)? Thank you in advance!

Shift closure values

The closure values "0" and "1" will never be seen in the current DeepSMILES. C0 is meaningless, and C1 has a loop to itself.

Proposal 1: Shift the closure numbers so that "CC0" corresponds to what is currently "CC2".

The closure value "2" can only be seen with dot disconnections, as for example C.C2. Otherwise, a 2 always links to the previous atom, as CC2 or CN)C2. If #6 is implemented, such that closures cannot cross a dot disconnection, then the closure value "2" will never exist in a valid DeepSMILES.

Proposal 2: Shift the closure numbers so that "CCC0" corresponds to what is currently "CCC3".

This would make the closure values 0, 1, and 2 be useful.
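To make the two proposals concrete, here is a minimal sketch of the renumbering applied to an existing DeepSMILES string. The helper names `shift_closures` and `format_closure` are hypothetical and not part of the deepsmiles package. An offset of 2 corresponds to Proposal 1 and an offset of 3 to Proposal 2. Bracket atoms are skipped so that isotope and charge digits are not mistaken for closures.

```python
import re

# Bracket atoms are matched first so the digits inside them are left alone.
TOKEN = re.compile(r"(\[[^\]]*\])|%\((\d+)\)|%(\d\d)|(\d)")

def format_closure(n):
    """Write a closure value as a digit, %NN or %(NNN)."""
    if n < 10:
        return str(n)
    if n < 100:
        return "%" + str(n)
    return "%(" + str(n) + ")"

def shift_closures(deepsmiles, offset):
    """Subtract `offset` from every ring-closure value (hypothetical helper)."""
    def repl(m):
        if m.group(1):                      # bracket atom: keep unchanged
            return m.group(1)
        value = int(m.group(2) or m.group(3) or m.group(4))
        if value < offset:
            raise ValueError("closure %d is smaller than the offset" % value)
        return format_closure(value - offset)
    return TOKEN.sub(repl, deepsmiles)

print(shift_closures("CC2", 2))      # CC0  (Proposal 1)
print(shift_closures("cccccc6", 2))  # cccccc4
print(shift_closures("CCC3", 3))     # CCC0 (Proposal 2)
```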

disallow closures after branches

I think that closures should not be allowed after ')' terms.

To start, I think there's a bug in how the deepsmiles implementation handles this case. Consider COPNBCl))=3.

I expected the )) to transform COPNBCl)) to COPN with a branch on the N, then followed by the =3 to give the final SMILES CO1PN=1(BCl).

But the code does the following:

>>> decode = deepsmiles.Converter(True, True).decode
>>> decode("COPNBCl))=3")
'COPN1BCl=1'
>>> decode("COPN=3")
'CO1PN=1'

The =3 acts on the Cl, linking it to N, even though the )) "should" have popped those two atoms.

This can be a problem because the closures allow multiple branch-pops in a sequence. My code expected that each atom would have at most a single close ')' after it. As a result, I get

>>> cdeepsmiles.decode("COPNB)=3)N")
'CO1P(N=1(B)N\x00'

when I think I "should" have gotten:

'CO1P(N=1(B))N'

Closures after branch-pops are unneeded, and SMILES doesn't support them.

Rather than trying to handle this case correctly (and the fact that two different code bases got it wrong suggests that it isn't easy to get right), how about making it not be allowed?
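If the rule were adopted, rejecting such strings up front could be a simple scan. The following is only a sketch of that check, not code from either implementation: the function name `closure_follows_pop` is hypothetical, and the set of bond characters is an assumption.

```python
import re

# Tokens: bracket atom, %(NNN), %NN, single digit, ')', or any other character.
TOKEN = re.compile(r"\[[^\]]*\]|%\(\d+\)|%\d\d|\d|\)|.")
BONDS = set("-=#$:/\\")  # assumed bond symbols that may precede a closure

def closure_follows_pop(deepsmiles):
    """Return True if any ring closure comes directly after a ')' pop."""
    prev = None
    for tok in TOKEN.findall(deepsmiles):
        if tok in BONDS:
            continue                 # a bond symbol does not separate them
        is_closure = tok[0] == "%" or tok.isdigit()
        if is_closure and prev == ")":
            return True
        prev = tok
    return False

print(closure_follows_pop("COPNBCl))=3"))  # True
print(closure_follows_pop("COPN=3"))       # False
```

Because `%(NNN)` is read as a single token, its closing parenthesis is not mistaken for a branch pop.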

Consider compressing parentheses

It has been mentioned that replacing multiple close parentheses with a number plus a single parenthesis would be a good compression strategy. This is of course true. What I don't know is whether it would make it easier for an ML method to use, learn, or generate the string. But I guess I can add an option to control this.

In the meantime, maybe I can provide a piece of Python code that does the transformation for anyone interested.
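One possible shape for such a transformation is sketched below. The issue does not fix a concrete syntax, so the `){N}` form used here is purely an assumption, chosen because a bare number next to ')' could be confused with a ring closure. The function names `compress_pops` and `expand_pops` are likewise hypothetical.

```python
import re

# %(NNN) closures are matched first so their ')' is never counted as a pop.
PACK = re.compile(r"(%\(\d+\))|(\){2,})")
UNPACK = re.compile(r"(%\(\d+\))|\)\{(\d+)\}")

def compress_pops(deepsmiles):
    """Replace each run of two or more ')' with the hypothetical '){N}'."""
    def repl(m):
        if m.group(1):
            return m.group(1)
        return "){%d}" % len(m.group(2))
    return PACK.sub(repl, deepsmiles)

def expand_pops(packed):
    """Invert compress_pops."""
    def repl(m):
        if m.group(1):
            return m.group(1)
        return ")" * int(m.group(2))
    return UNPACK.sub(repl, packed)

packed = compress_pops("CBr)OC))I")
print(packed)               # CBr)OC){2}I
print(expand_pops(packed))  # CBr)OC))I
```

Whether the count should be a single token for the model, and whether runs of length two are worth compressing at all, are open choices that this sketch does not settle.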

incorrect closure decoding when N>=100

The following meaningless DeepSMILES:

Bbbbbb2522222522534b52522534bbb25222225225342522534b52b2522222522534b52522534bbb25222225225342522534b5252b52bbbb2522222522534b5252b5b6

converts to:

Bb28%11b%13%14%16%19b%12%21b1345679%10%20b123456789%10%11%12%13%15%17%18%23%29%32%36%39b%14%15%16%17%18%19%20%21%34%41%42b%33%40%45%51%54b%22%24%25%26%27%28%30%31%35%37%38%56%57%59%62b%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%43%55%64b%42%43%44%46%47%48%49%50%52%53%63b%44%45%46%47%48%49%50%51%52%53%54%55%56%58%60%61%66%72%75%79%82b%57%58%59%60%61%62%63%64%77%84%85%87b%76%83%89b%65%67%68%69%70%71%73%74%78%80%81b%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%86%88b%85%86%87%88%90b%89%90%92%98%101b%103%104%106b%102%108%109b%91%93%94%95%96%97%99%100b%91%92%93%94%95%96%97%98%99%100%101%102%103%105%107b%104%105%106%107b%108b%109

This shows that the decoder doesn't handle %(NNN) closures correctly when NNN>=100. For example, the last four characters, "%109", should be "%(109)".

This is likely because of this line in decode.py:

    smi_bcsymbol = "%d" % digit if digit < 10 else "%%%d" % digit

which should likely be something more like:

    if digit < 10:
        smi_bcsymbol = str(digit)
    elif digit < 100:
        smi_bcsymbol = "%" + str(digit)
    else:
        smi_bcsymbol = "%(" + str(digit) + ")"

Preprint missing citations

http://advances.sciencemag.org/content/4/7/eaap7885 is relevant but missing. By @isayev et al.

Did you try to integrate DeepSMILES into chemical_vae by Rafa Gómez-Bombarelli et al.?

Thank you for your great library.
It is a great idea to be able to insert this converter before encoding and after decoding. Have you actually integrated it into any VAE model, such as chemical_vae by Rafa Gómez-Bombarelli et al.? Many thanks.

SMILES output by the decoder are not canonical

Not a bug, maybe, but annoying.
If you do a round trip (SMILES -> DeepSMILES -> SMILES), you expect the original input file and the final output file to be the same.
For this to be true, it is necessary to build the molecule from the decoded SMILES with RDKit, then let RDKit write the output SMILES (which equals the input SMILES if the input was also written by RDKit).

Rings across branches.

If I'm not mistaken, the current version of DeepSMILES cannot directly handle ring closures that span branches, as in the SMILES CCCC(CC1)C(CC1)CCCC. (The example SMILES is valid, although not canonical.)
If I pass the example SMILES to the current code, the output is CCCCCC))CCC2))CCCC, which makes no sense at all.

The reason for this may lie in the DFS constraint you mentioned in #6.

I am wondering whether this issue is worth fixing and whether this pattern can appear in canonical SMILES.

dot disconnection combined with branches and closures

I think that the dot disconnection "." should reset the system state so that closures and branches cannot cross it.

Here's an example of using a branch to cross a dot:

>>> import deepsmiles
>>> conv = deepsmiles.Converter(True, True)
>>> conv.decode("CN.OP))S")
'CN.(OP)S'
>>> conv.decode("CN.OP)))S")
'CN(.OP)S'

In both cases, the result is not a valid SMILES.

Here is an example of using a closure to cross a dot:

>>> import deepsmiles
>>> conv = deepsmiles.Converter(True, True)
>>> conv.decode("C.C2")
'C.1C1'

You'll note that the result here is also not a valid SMILES.

While it is possible to fix the code to support these use cases, I think it's better to disallow using dot disconnections this way.

We both know that dot disconnections like this are useful in SMILES. However, the uses I can think of, like simple combinatorial generation, only work with closures, and depend on being able to label both sides of the disconnection. That's not possible with DeepSMILES.
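A check for the proposed restriction could look roughly like the sketch below. It assumes a stack-based reading of DeepSMILES in which each atom is pushed, each ')' pops one atom, and a closure of size N needs at least N atoms on the stack. The function name `escapes_component` and the simplified atom tokenizer are assumptions, and the deepsmiles package itself may count differently. The stack depth is reset at each dot, so any pop or closure that reaches past the dot is reported (as is a pop past the start of the string).

```python
import re

TOKEN = re.compile(
    r"(?P<atom>\[[^\]]*\]|Br|Cl|[BCNOPSFIbcnops])"
    r"|%\((?P<big>\d+)\)|%(?P<two>\d\d)|(?P<one>\d)"
    r"|(?P<pop>\))|(?P<dot>\.)|(?P<other>.)"
)

def escapes_component(deepsmiles):
    """Return True if a pop or closure reaches outside its dot-separated part."""
    depth = 0  # atoms currently on the stack within this component
    for m in TOKEN.finditer(deepsmiles):
        if m.group("atom"):
            depth += 1
        elif m.group("dot"):
            depth = 0                 # reset state at the disconnection
        elif m.group("pop"):
            if depth < 2:             # the pop would leave no atom to attach to
                return True
            depth -= 1
        else:
            size = m.group("big") or m.group("two") or m.group("one")
            if size is not None and int(size) > depth:
                return True           # closure points before the component
    return False

print(escapes_component("CN.OP))S"))  # True
print(escapes_component("C.C2"))      # True
print(escapes_component("CN.OP)S"))   # False
```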

Indexing off in decode when DeepSMILES string has >= 100 rings

This seems similar to #5, but the changes made in the resulting PR don't seem to address the issues that I'm running into.

I've been using the deepsmiles format and conversion code to build up SMILES strings for polymers from the string of just the monomer. The process has been working great until my resulting polymer deepsmiles string is for a molecule with 100 or more rings.

It looks like the issue is the assumption that there are always 2 digits following a % sign, and therefore the indexing skips ahead by 2.

In decode_branches() in decode.py:

elif x == '%':
    if i == 0:
        raise exceptions.DecodeError(deepsmiles, i, "'%' not allowed as first character")
    bondchar = deepsmiles[i-1] if i > 0 and deepsmiles[i-1] in bondchars else ""
    if deepsmiles[i+1] == '(':
        closebracket = deepsmiles.find(')', i+2)
        if closebracket == -1:
            raise exceptions.DecodeError(deepsmiles, i, "'%(' is missing the corresponding close parenthesis")
        digit = int(deepsmiles[i+2:closebracket])
        i = closebracket
    else:
        try:
            digit = int(deepsmiles[i+1:i+3])
        except ValueError:
            raise exceptions.DecodeError(deepsmiles, i, "'%' should be followed by two digits")
        i += 2

The SMILES string returned in this case is incorrect.
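One way to avoid hand-maintained index arithmetic when reading closures is to let a regular expression report where the token ends. The sketch below is not the library's code, and `read_closure` is a hypothetical helper. It accepts a single digit, `%NN`, or `%(NNN)` and returns both the ring size and the index of the next unread character.

```python
import re

CLOSURE = re.compile(r"%\((\d+)\)|%(\d\d)|(\d)")

def read_closure(deepsmiles, i):
    """Return (ring_size, next_index) for the closure token starting at i."""
    m = CLOSURE.match(deepsmiles, i)
    if m is None:
        raise ValueError("no ring-closure token at index %d" % i)
    digits = m.group(1) or m.group(2) or m.group(3)
    return int(digits), m.end()

print(read_closure("c6", 1))        # (6, 2)
print(read_closure("c%12c", 1))     # (12, 4)
print(read_closure("c%(109)c", 1))  # (109, 7)
```

The caller continues from the returned index regardless of how many digits the token contained.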

Conda package

Hello,
I'd like to have DeepSmiles available as a conda package and have submitted a recipe to conda-forge. Feel free to comment on the pull request here: conda-forge/staged-recipes#9434

typo in decode.decode

The decode method for the default Converter gives an error:

>>> deepsmiles.Converter().decode("C")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "deepsmiles/converter.py", line 41, in decode
    return decode.decode(deepsmiles, rings=self.rings, branches=self.branches)
  File "deepsmiles/decode.py", line 242, in decode
    return smi
NameError: name 'smi' is not defined

This is because of a typo in decode.py:decode():

def decode(deepsmiles, rings=False, branches=False):
    ....
    if not rings and not branches:
        return smi

That last quoted line should be return deepsmiles.
