kohulan / smiles-to-iupac-translator

Transformer based SMILES to IUPAC Translator

License: MIT License

Python 92.23% Jupyter Notebook 7.77%

iupac-translator stout neural-machine-translation smiles

smiles-to-iupac-translator's Introduction

V2.0

Smiles TO iUpac Translator: Advanced Chemical Nomenclature Translation

Key Features • Installation • How To Use • Acknowledgements • Citation

Key Features

🧪 Translate SMILES to IUPAC names
🔬 Convert IUPAC names back to valid SMILES strings
🤖 Powered by advanced transformer models
💻 Cross-platform support (Linux, macOS, Windows via Ubuntu shell)
🚀 High-performance chemical nomenclature translation

Installation

Choose your preferred installation method:

📦 PyPI Installation

pip install STOUT-pypi

🐍 Conda Environment Setup

conda create --name STOUT python=3.10 
conda activate STOUT
conda install -c decimer stout-pypi

📥 Direct Repository Installation

pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git

How To Use

from STOUT import translate_forward, translate_reverse

# SMILES to IUPAC name translation
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print(f"🧪 IUPAC name of {SMILES} is: {IUPAC_name}")

# IUPAC name to SMILES translation
IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print(f"🔬 SMILES of {IUPAC_name} is: {SMILES}")
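
A round-trip check is one way to sanity-test a translation: translate a SMILES string to a name, translate the name back, and compare the two structures. The sketch below is a minimal illustration and not part of STOUT. `round_trip`, `RoundTrip`, and the `canonicalize` parameter are hypothetical names; the two translators are passed in as callables (in practice `translate_forward` and `translate_reverse` from the example above). The default `canonicalize` only strips whitespace, so a meaningful comparison would need a real canonicalizer, for example one built on RDKit.

```python
from typing import Callable, NamedTuple


class RoundTrip(NamedTuple):
    smiles: str
    name: str
    back: str
    consistent: bool


def round_trip(
    smiles: str,
    forward: Callable[[str], str],
    reverse: Callable[[str], str],
    canonicalize: Callable[[str], str] = str.strip,
) -> RoundTrip:
    """Translate SMILES -> name -> SMILES and report whether both ends agree."""
    name = forward(smiles)
    back = reverse(name)
    # Raw SMILES strings can differ for the same molecule, so compare
    # canonical forms. The default only strips whitespace (an assumption
    # made to keep this sketch dependency-free).
    consistent = canonicalize(smiles) == canonicalize(back)
    return RoundTrip(smiles, name, back, consistent)
```

With STOUT installed, `round_trip(SMILES, translate_forward, translate_reverse)` would run both models on the caffeine example.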

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC)

Part of the DECIMER Project

Citation

Rajan, K., Zielesny, A. & Steinbeck, C. STOUT: SMILES to IUPAC names using neural machine translation. J Cheminform 13, 34 (2021). https://doi.org/10.1186/s13321-021-00512-4

Model Card

Rajan, K., Steinbeck, C., & Zielesny, A. (2024). STOUT V2 - Model library. Zenodo. https://doi.org/10.5281/zenodo.13318286

Made with ❤️ by the Steinbeck Group

smiles-to-iupac-translator's Issues

Dockerfile

I have been trying to install the package in numerous environments but I keep running into dependency issues. Is there any possibility you would be willing to create a Dockerfile to help simplify the install process?

Thanks for your help and work!

PyPi - tensorflow 2.15.0

Hello, would it be possible to update the PyPI requirements to work with tensorflow 2.15.0? When I call pip install STOUT-pypi==2.0.5, I get: ERROR: Could not find a version that satisfies the requirement tensorflow==2.10.1

I'm using Windows 10 and python 3.11.4.
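
One likely cause, offered as an assumption rather than a confirmed diagnosis, is the Python version: tensorflow 2.10.x wheels appear to have been published only for Python 3.7 to 3.10, so pip on Python 3.11 would find no matching distribution. This would be consistent with the Conda setup in the README, which creates the environment with python=3.10. The sketch below checks the interpreter against that assumed range; `python_in_range`, `MIN_PY`, and `MAX_PY` are hypothetical names.

```python
import sys
from typing import Tuple

# Assumed supported range for the pinned tensorflow==2.10.1 (not taken from
# the STOUT repository; adjust if the pin or the available wheels change).
MIN_PY: Tuple[int, int] = (3, 7)
MAX_PY: Tuple[int, int] = (3, 10)


def python_in_range(version, low=MIN_PY, high=MAX_PY) -> bool:
    """Return True if (major, minor) lies within the inclusive range."""
    return low <= tuple(version[:2]) <= high


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if python_in_range((major, minor)):
        print(f"Python {major}.{minor} is inside the assumed range.")
    else:
        print(f"Python {major}.{minor} is outside the assumed range; "
              "a Python 3.10 environment may be needed.")
```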

UnsatisfiableError: The following specifications were found to be incompatible with each other

Hello @Kohulan,
I'm trying to install STOUT on my system (WSL2, Ubuntu 20.04), but I got an error while running "conda install -c decimer stout-pypi".
it gave me

Any idea how to resolve it? From the error, it seems that STOUT requires a different version of glibc, as it shows a conflict with the installed version. Can you specify which version STOUT requires?

Requirements are a little strict? And IUPAC name of O=Cc1ccccc1 is: styrene

Hi,

Awesome package. Just a comment that I wanted to install stout into an existing environment I'm using for Chemoinformatics. This environment has rdkit installed through conda, and tensorflow 2.6.2 (not tensorflow-gpu). I installed pystow manually and then ignored the requirements for stout, and it works fine. Just wondering if the requirement for tensorflow-gpu and the pypi version of rdkit is a bit strict?

So anyway all working nicely. I ran a bunch of test molecules, which mostly look great.
However this one caught my eye.
IUPAC name of O=Cc1ccccc1 is: styrene. Should be benzaldehyde.
Just thought I'd flag this as a test case, as it's a relatively simple molecule. Not meant as a criticism or anything.

Keep up the great work.

Kind regards,
Will
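
Reports like this one can be kept as a small regression list. The sketch below is an assumption about how such a check could look, not an existing STOUT test. `check_known_names` and `KNOWN_NAMES` are hypothetical names; the helper takes any SMILES-to-name callable (for example `translate_forward`) and returns the cases where the output differs from the expected name. Only the benzaldehyde case comes from this issue.

```python
from typing import Callable, Dict, List, Tuple

# Expected names for simple molecules; the single entry is the case
# reported in this issue.
KNOWN_NAMES: Dict[str, str] = {
    "O=Cc1ccccc1": "benzaldehyde",
}


def check_known_names(
    translate: Callable[[str], str],
    cases: Dict[str, str] = KNOWN_NAMES,
) -> List[Tuple[str, str, str]]:
    """Return (smiles, expected, actual) for every mismatching case."""
    failures = []
    for smiles, expected in cases.items():
        actual = translate(smiles)
        # Compare case-insensitively so "Benzaldehyde" still counts as correct.
        if actual.strip().lower() != expected.lower():
            failures.append((smiles, expected, actual))
    return failures
```

An empty return value means every listed molecule received its expected name.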

Not working for the example of intended use on MacOS.

Hi,

After following the steps that you recommend to install the package, I got the following errors:

Downloading trained model to Trained_models/60/forward ...
... done downloading trained model!
Archive:  STOUT_trained_models_v2.1.zip
   creating: Trained_models/
   creating: Trained_models/60/
   creating: Trained_models/60/forward/
  inflating: Trained_models/60/forward/ckpt-1.data-00000-of-00001
  inflating: Trained_models/60/forward/ckpt-1.index
  inflating: Trained_models/60/forward/checkpoint
   creating: Trained_models/60/reverse/
  inflating: Trained_models/60/reverse/ckpt-1.data-00000-of-00001
  inflating: Trained_models/60/reverse/ckpt-1.index
  inflating: Trained_models/60/reverse/checkpoint
   creating: Trained_models/30/
   creating: Trained_models/30/forward/
  inflating: Trained_models/30/forward/ckpt-1.data-00000-of-00001
  inflating: Trained_models/30/forward/ckpt-1.index
  inflating: Trained_models/30/forward/checkpoint
   creating: Trained_models/30/reverse/
  inflating: Trained_models/30/reverse/ckpt-1.data-00000-of-00001
  inflating: Trained_models/30/reverse/ckpt-1.index
  inflating: Trained_models/30/reverse/checkpoint
Traceback (most recent call last):
  File "STOUT_V_2.1.py", line 303, in <module>
    main()
  File "STOUT_V_2.1.py", line 50, in main
    iupac_name = translate(selfies.encoder(canonical_smiles.decode('utf-8').strip()).replace("][","] ["))
  File "STOUT_V_2.1.py", line 167, in translate
    result, sentence = evaluate(sentence)
  File "STOUT_V_2.1.py", line 133, in evaluate
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
  File "STOUT_V_2.1.py", line 133, in <listcomp>
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
KeyError: '[Branch1]'
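
The `KeyError` is raised because the list comprehension looks up every space-separated token in `inp_lang.word_index`, and `'[Branch1]'` is not a key there. One possible explanation, stated as an assumption, is that the installed `selfies` version emits symbols that differ from the ones the model's vocabulary was built with. The sketch below does not fix that; it only shows how unknown tokens could be reported before the lookup. `find_unknown_tokens` is a hypothetical helper, and `word_index` stands in for the tokenizer's mapping as a plain dict.

```python
from typing import List, Mapping


def find_unknown_tokens(sentence: str, word_index: Mapping[str, int]) -> List[str]:
    """Return tokens of a space-separated sentence missing from word_index.

    Order of first appearance is kept and duplicates are reported once.
    """
    unknown: List[str] = []
    for token in sentence.split():
        if token not in word_index and token not in unknown:
            unknown.append(token)
    return unknown


# Toy vocabulary for illustration only; the real one ships with the model.
vocab = {"[C]": 1, "[N]": 2, "[Branch1_1]": 3}
missing = find_unknown_tokens("[C] [Branch1] [N] [Branch1]", vocab)
print(missing)  # ['[Branch1]']
```

Reporting the missing tokens makes a vocabulary mismatch visible immediately instead of surfacing as a bare `KeyError` deep inside `evaluate`.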

Include stats on performance

Include proper performance statistics in the README.

Ammonia

Hi,

Today, when I tried to generate the SMILES string for 'ammonia', I got '[NH2+]' back, which is certainly wrong.
>>> STOUT.translate_reverse('ammonia') '[NH2+]'

When I tried to convert 'Ammonia', I got back a mess of weird strings.

>>> STOUT.translate_reverse('Ammonia') '[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr].[Pr'

I also tried the systematic name.
>>> STOUT.translate_reverse('azane') 'N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.N.'

>>> STOUT.translate_reverse('Azane') '[15NH3]'

I'm not sure if this is intended and I guess the error is on my side, but could you please have a look? :)

In the other direction, it works well:
>>> STOUT.translate_forward('N') 'azane'

>>> STOUT.translate_forward('[NH2+]') 'azanium'

Thank you
Philipp
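
The `[Pr]` and `N.` outputs above look like a decoder repeating one fragment until it reaches its length limit, and the final `[Pr` is cut off mid-bracket. A caller could screen for such results before using them. The sketch below is a heuristic under stated assumptions, not STOUT behaviour: `looks_degenerate` and the `max_repeats` threshold are hypothetical, and legitimate SMILES with many identical components would need a higher threshold.

```python
from collections import Counter


def looks_degenerate(smiles: str, max_repeats: int = 10) -> bool:
    """Heuristically flag SMILES that look like runaway decoder output."""
    # Unbalanced brackets or parentheses suggest the string was truncated.
    if smiles.count("[") != smiles.count("]"):
        return True
    if smiles.count("(") != smiles.count(")"):
        return True
    # A leading, trailing, or doubled dot leaves an empty component.
    fragments = smiles.split(".")
    if any(fragment == "" for fragment in fragments):
        return True
    # The same disconnected fragment repeated many times is suspicious.
    most_common_count = Counter(fragments).most_common(1)[0][1]
    return most_common_count > max_repeats
```

This check would not catch the `[NH2+]` answer for 'ammonia', which is well-formed but chemically wrong.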

Wrong comment sign?

In the line
python3 STOUT_V_2.0.py 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C' -> SMILES to IUPAC
I assume that everything from "->" onward is a comment.
If so, please use a comment sign there.

how to train ?

Greetings. Could you please provide some instructions for re-training the model with my own data?
Thank you in advance.

Conda Package

Is it possible to get a conda package? This would make it easier to create the conda environment and then install STOUT into it.

Rename Trained_models.zip to something more descriptive

I suggest renaming Trained_models.zip to something more descriptive, with a version number and an application name, such as STOUT_trained_models_v2.11.zip.

Step by step guide not working

When I try to run
pip install tensorflow-gpu==2.3.0 selfies matplotlib unicodedata2
I am getting ERROR: Could not find a version that satisfies the requirement tensorflow-gpu==2.3.0
ERROR: No matching distribution found for tensorflow-gpu==2.3.0

Where can I find the model?

SELFIES version

Thanks for this great model! Is the SELFIES version mentioned in the paper also available?

Can we do a SMILES to Common Chemical Name Translator?

Hello,

Is it possible to translate SMILES to common chemical names as well as to IUPAC names? I already have data available that maps names directly to SMILES.

I will definitely build a connector to this. I have been playing with it a bit as well. I really like what you have done; this is so awesome.

Possibility to change the input of models to accept a batch as a collection of SMILES/names?

Dear @Kohulan,

I'm wondering if it is possible to adjust the model so that it can accept multiple inputs. For example, input might be a batch of smiles / names, therefore increasing the performance of the model.

Right now the only way for multiple inputs is just passing them one by one in a for-loop:

# SMILES to IUPAC name translation
from STOUT import translate_forward

smiles_list = [
    'CC(=O)OC(CC(=O)O)C[N+](C)(C)C',
    'CC(CN)O',
    'C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl',
    'CCN1C=NC2=C(N=CN=C21)N',
]

# Translate one SMILES string per call
for smiles in smiles_list:
    iupac_name = translate_forward(smiles)
    print(f"IUPAC name of {smiles} is: {iupac_name}")

Do you think it would be possible to accept a batch (a collection of SMILES/names) as input, so that something like this works:

# SMILES to IUPAC name translation
smiles_list = [
    'CC(=O)OC(CC(=O)O)C[N+](C)(C)C',
    'CC(CN)O',
    'C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl',
    'CCN1C=NC2=C(N=CN=C21)N',
]

IUPAC_names = translate_forward(smiles_list)

Is this feasible by changing the input shapes and doing the necessary preprocessing of the input data, or is re-training/fine-tuning the model the only way?
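
Until the model accepts batches natively, the loop can at least be hidden behind a list-in, list-out helper. The sketch below is not a batched model call and gives no GPU-level speed-up; it only avoids translating duplicate inputs twice. `translate_many` is a hypothetical name, and `translate` is any single-input callable such as `translate_forward`.

```python
from typing import Callable, Dict, Iterable, List


def translate_many(inputs: Iterable[str], translate: Callable[[str], str]) -> List[str]:
    """Translate each distinct input once and return results in input order."""
    cache: Dict[str, str] = {}
    results: List[str] = []
    for item in inputs:
        if item not in cache:
            # One model call per distinct input; repeats reuse the cached result.
            cache[item] = translate(item)
        results.append(cache[item])
    return results
```

With STOUT installed, `translate_many(smiles_list, translate_forward)` would return one name per entry of `smiles_list`, in the same order.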