Macformer

This repo implements macrocyclization of linear molecules to generate macrocycles with chemical diversity and structural novelty.

Setup

Create and activate the Macformer conda environment from the Macformer_env.yaml file:

conda env create -f Macformer_env.yaml  
conda activate Macformer  

Quick Start

1. Data processing

Acyclic-macrocyclic SMILES pairs extracted from the ChEMBL and ZINC databases can be found in the data/ folder. Alternatively, researchers can process their own macrocyclic compounds from scratch using the scripts in the utils/ folder:

fragmentation.py:  generate unique acyclic-macrocyclic SMILES pairs  
data_split.py:  split the acyclic-macrocyclic SMILES pairs into train, validation, and test datasets 
data_augmentation.py:  implement substructure-aligned data augmentation  
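As a rough illustration of the splitting step, the sketch below shuffles a list of (acyclic, macrocyclic) SMILES pairs and divides it into train, validation, and test sets. It is not the code in data_split.py: the function name `split_pairs`, the 80/10/10 ratio, and the fixed seed are all assumptions chosen for the example.

```python
import random


def split_pairs(pairs, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (acyclic, macrocyclic) SMILES pairs and split them three ways.

    The fractions and seed are illustrative defaults, not the repo's settings.
    """
    # Drop exact duplicate pairs while keeping first-seen order,
    # so the result is reproducible for a given seed.
    unique_pairs = list(dict.fromkeys(pairs))
    rng = random.Random(seed)
    rng.shuffle(unique_pairs)

    n_total = len(unique_pairs)
    n_valid = int(n_total * valid_frac)
    n_test = int(n_total * test_frac)

    valid = unique_pairs[:n_valid]
    test = unique_pairs[n_valid:n_valid + n_test]
    train = unique_pairs[n_valid + n_test:]
    return train, valid, test


# Toy placeholder pairs, not taken from the ChEMBL or ZINC data.
toy_pairs = [(f"src_{i}", f"tgt_{i}") for i in range(20)]
train, valid, test = split_pairs(toy_pairs)
print(len(train), len(valid), len(test))
```

When several acyclic fragments derive from the same macrocycle, grouping pairs by macrocycle before splitting avoids the same target appearing in both the training and test sets.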

2. Input files generation

The preprocessing.sh script generates the following input files needed to train the model:

*.train.pt: serialized PyTorch file containing training data  
*.valid.pt: serialized PyTorch file containing validation data  
*.vocab.pt: serialized PyTorch file containing vocabulary data  
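Before starting training, it can help to confirm that preprocessing produced all three files. The sketch below only checks file names against the suffixes listed above. The helper `missing_inputs` is hypothetical and not part of the repo, and the temporary directory stands in for wherever preprocessing.sh writes its output.

```python
import tempfile
from pathlib import Path

# Suffixes as listed in this README.
EXPECTED_SUFFIXES = (".train.pt", ".valid.pt", ".vocab.pt")


def missing_inputs(data_dir):
    """Return the expected suffixes for which no file exists in data_dir."""
    names = [p.name for p in Path(data_dir).iterdir() if p.is_file()]
    return [
        suffix
        for suffix in EXPECTED_SUFFIXES
        if not any(name.endswith(suffix) for name in names)
    ]


# Demo on a throwaway directory with the validation file deliberately absent.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.train.pt").touch()
    (Path(tmp) / "demo.vocab.pt").touch()
    demo_missing = missing_inputs(tmp)
print(demo_missing)
```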

3. Model training

Run the training.sh script to start model training.
The saved checkpoints can be averaged by running the average_models.sh script.
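Checkpoint averaging generally means taking the element-wise mean of each parameter across several saved checkpoints. The sketch below shows that idea on plain Python dictionaries. It is a simplified stand-in, not the logic of average_models.sh: real checkpoints hold tensors, and the parameter names used here are made up.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")

    n = len(checkpoints)
    return {
        # zip(*...) lines up the i-th value of this parameter in every checkpoint.
        name: [sum(values) / n for values in zip(*(c[name] for c in checkpoints))]
        for name in names
    }


# Hypothetical parameter names and values, for illustration only.
ckpt_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
ckpt_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
averaged = average_checkpoints([ckpt_a, ckpt_b])
print(averaged)
```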

4. Model evaluation

Run the testing_beam_search.sh script to obtain predicted molecules.
The utils/model_evaluation.py script can be used to calculate the evaluation metrics, including recovery, validity, uniqueness, novelty, and macrocyclization.
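To make three of these metrics concrete, here is a minimal sketch operating on already-canonicalized SMILES strings. The definitions are assumptions for illustration and may differ from those in utils/model_evaluation.py: recovery is taken as the share of sources whose target appears among the predictions, uniqueness as distinct predictions over all predictions, and novelty as distinct predictions absent from the training set. Validity and macrocyclization need a cheminformatics toolkit to parse molecules and inspect ring sizes, so they are left out.

```python
def evaluate(predictions, targets, training_set):
    """Compute simplified recovery, uniqueness, and novelty percentages.

    predictions: list of lists of canonical SMILES, one list per source molecule.
    targets: list of canonical target SMILES, aligned with predictions.
    training_set: set of canonical SMILES seen during training.
    """
    all_preds = [smi for preds in predictions for smi in preds]
    if not targets or not all_preds:
        raise ValueError("need at least one target and one prediction")

    # A source counts as recovered if its target is among its predictions.
    recovered = sum(t in preds for preds, t in zip(predictions, targets))
    unique = set(all_preds)
    novel = unique - set(training_set)

    return {
        "recovery": 100.0 * recovered / len(targets),
        "uniqueness": 100.0 * len(unique) / len(all_preds),
        "novelty": 100.0 * len(novel) / len(unique),
    }


# Placeholder strings stand in for canonical SMILES.
toy_predictions = [["A", "B", "A"], ["C", "D", "D"]]
toy_targets = ["A", "E"]
toy_training = {"B", "C"}
metrics = evaluate(toy_predictions, toy_targets, toy_training)
print(metrics)
```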

To compare our model with previously reported non-deep-learning approaches, we propose a pipeline, termed MacLS, that constructs macrocycles from the three-dimensional (3D) structures of linear compounds through linker database searching. The script can be found in MacLS.py in the utils/ folder. For the internal ChEMBL and external ZINC test datasets, the conformations of the linear structures were obtained in two ways: generated de novo from the SMILES strings (termed MacLS_self), or extracted from the 3D structures of the corresponding target macrocycles (termed MacLS_extra).

5. Pre-trained models and results reproduction

The models pre-trained on the ChEMBL dataset can be found in the models/ folder.
The metrics can be reproduced with the pre-trained models using the internal ChEMBL test dataset (data/ChEMBL/a10/src-testa10) and the external ZINC test dataset (data/ZINC/src-external-zinc-a10).

Table 1. Comparison of Macformer with different augmentation numbers and MacLS on the ChEMBL test dataset.

| Method | Training data augmentation | Recovery (%) | Validity (%) | Uniqueness (%) | Novelty (mol, %) | Novelty (linker, %) | Macrocyclization (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Macformer | None | 54.85±14.28 | 66.74±2.29 | 63.18±6.38 | 89.30±1.94 | 40.56±2.33 | 95.00±0.74 |
| Macformer | ×2 | 96.09±0.61 | 80.34±1.38 | 64.43±0.23 | 91.58±0.15 | 58.91±0.36 | 98.62±0.17 |
| Macformer | ×5 | 97.54±0.16 | 81.94±1.42 | 65.36±0.13 | 91.79±0.16 | 62.11±0.65 | 98.80±0.11 |
| Macformer | ×10 | 97.02±0.05 | 82.59±1.57 | 64.44±0.46 | 91.76±0.22 | 60.27±0.96 | 98.46±0.04 |
| MacLS_self | / | 0.01±0.01 | 17.05±0.29 | 95.33±0.01 | 100±0.00 | 0.00±0.00 | 100±0.00 |
| MacLS_extra | / | 4.16±0.20 | 89.65±0.03 | 96.32±0.06 | 99.65±0.02 | 0.00±0.00 | 100±0.00 |

Table 2. Comparison of Macformer with different augmentation numbers and MacLS on the ZINC test dataset.

| Method | Training data augmentation | Recovery (%) | Validity (%) | Uniqueness (%) | Novelty (mol, %) | Novelty (linker, %) | Macrocyclization (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Macformer | None | 2.70±1.31 | 72.91±2.05 | 47.74±8.98 | 96.10±0.81 | 44.24±2.05 | 96.39±0.71 |
| Macformer | ×2 | 76.37±3.23 | 81.97±1.20 | 44.99±5.37 | 99.31±0.19 | 53.03±0.65 | 99.48±0.08 |
| Macformer | ×5 | 81.86±0.756 | 84.73±1.01 | 45.14±4.60 | 99.39±0.09 | 53.98±1.00 | 99.53±0.05 |
| Macformer | ×10 | 84.25±0.845 | 85.35±1.33 | 45.26±0.46 | 99.43±0.09 | 50.00±0.95 | 99.27±0.07 |
| MacLS_self | / | 0.00±0.00 | 13.02±0.79 | 83.68±0.74 | 100±0.00 | 0.00±0.00 | 100±0.00 |
| MacLS_extra | / | 4.52±0.20 | 89.67±0.07 | 95.04±0.14 | 99.99±0.00 | 0.00±0.00 | 100±0.00 |
