synnet's People

Contributors

chrulm, degraff-roivant, tkramer-motion, wenhao-gao


synnet's Issues

Errors in creating env and running unit tests.

Hi,

I'd like to use SynNet in my work. I have followed the instructions in the README to set up my environment.

  1. In the environment.yml file the environment name is rdkit, not synthenv. As a result, source activate synthenv as instructed in the README does not work. You may want to take a look at this.

  2. When I run the unit tests, I get a few errors, which I think originate from incorrect path specifications. One of the errors: FileNotFoundError: [Errno 2] No such file or directory: '/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_gin.npy'. I noticed there are multiple hardcoded paths like this, which might make it difficult to use the code in future computations without having to change each and every one of them.

Will you be able to help me with these? Thanks!

No output in Synthesis Planning

I have the trained models and data ready, and I followed steps 0-2 from INSTRUCTIONS.md.

python /data/users/xx/projects/SynNet/src/00-extract-smiles-from-sdf.py \
    --input-file="/data/users/xx/data/enamine_us/Enamine_Rush-Delivery_Building_Blocks-US_222337cmpd_20230801.sdf" \
    --output-file="/data/users/xx/projects/SynNet/data/assets/building-blocks/enamine-us-smiles.csv.gz"
python /data/users/xx/projects/SynNet/src/01-filter-building-blocks.py \
    --building-blocks-file "/data/users/xx/projects/SynNet/data/assets/building-blocks/enamine-us-smiles.csv.gz" \
    --rxn-templates-file "/data/users/xx/projects/SynNet/data/assets/reaction-templates/hb.txt" \
    --output-bblock-file "/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --output-rxns-collection-file "/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz" --verbose
python /data/users/xx/projects/SynNet/src/02-compute-embeddings.py \
    --building-blocks-file "/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --output-file "/data/users/xx/projects/SynNet/data/pre-process/embeddings/hb-enamine-embeddings.npy" \
    --featurization-fct "fp_256"

but after running the synthesis-planning script

BUILDING_BLOCKS_FILE=/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz
RXN_COLLECTION_FILE=/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz
EMBEDDINGS_KNN_FILE=/data/users/xx/projects/SynNet/data/pre-process/embeddings/hb-enamine-embeddings.npy
python /data/users/xx/projects/SynNet/src/20-predict-targets.py \
    --building-blocks-file $BUILDING_BLOCKS_FILE \
    --rxns-collection-file $RXN_COLLECTION_FILE \
    --embeddings-knn-file $EMBEDDINGS_KNN_FILE \
    --data "/data/users/xx/projects/SynNet/data/assets/molecules/sample-targets.txt" \
    --ckpt-dir "/data/users/xx/projects/SynNet/checkpoints/" \
    --output-dir "/data/users/xx/projects/SynNet/results/demo-inference/"

I get the following result:

targets,decoded,similarity
COc1cc(Cn2c(C)c(Cc3ccccc3)c3c2CCCC3)ccc1OCC(=O)N(C)C,,0
CCC1CCCC(Nc2cc(C(F)(F)F)c(Cl)cc2SC)CC1,,0
Clc1cc(Cl)c(C2=NC(c3cccc4c(Br)cccc34)=NN2)nn1,,0
COc1ccc(S(=O)(=O)c2ccc(-c3nc(-c4cc(B(O)O)ccc4O)no3)cn2)cc1,,0
CNS(=O)(=O)c1ccc(-c2cc3c4c(ccc3[nH]2)CCCN4C(N)=O)cc1,,0
CC(NC(=O)C1Cn2c(O)nnc2CN1)c1cc(F)ccc1N1CCC(n2nnn(-c3ccc(Br)cc3)c2=S)CC1,,0
COc1cc(-c2nc(-c3ccccc3)c(-c3ccccc3)s2)ccn1,,0
CCCn1c(C)nnc1CC(C)(O)C(=C(C)C)c1nccnc1S(=O)(=O)F,,0
CN(c1ccccc1)c1ccc(-c2nc3ncccc3s2)cn1,,0
COc1cc(-c2nc(-c3ccc(F)cc3)c(-c3ccc(F)cc3)n2c2cc(Cl)ccc2Cl)ccc1Oc1ccc(S(=O)(=O)N2CCCCC2)cc1[N+](=O)[O-],,0

The decoded column is empty and the similarity is 0 for every target, so it seems there is no output.
Can you help me with this? Thank you!

Bug: Missing argument `_tree`

It seems like the argument _tree is missing for the nn_search function.

Missing argument here:

dist, ind = nn_search(z_mol1)

And function signature here:

def nn_search(_e, _tree, _k=1):

As far as I can see, there is no _tree in the local scope. This will throw an error, but it is silently caught in scripts/_mp_predict_multireactant.py by a try ... except clause.
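A minimal sketch of the fix, assuming the prebuilt BallTree over the building-block embeddings is available at the call site (the variable names here are illustrative, not the repo's actual ones):

```python
import numpy as np
from sklearn.neighbors import BallTree

def nn_search(_e, _tree, _k=1):
    # query the k nearest neighbours of embedding _e in the prebuilt tree
    dist, ind = _tree.query(np.atleast_2d(_e), k=_k)
    return dist[0], ind[0]

# build the tree once from the embedding matrix, then pass it explicitly
# at every call site instead of relying on a name in local scope
bb_emb = np.random.rand(100, 256)
tree = BallTree(bb_emb, metric="euclidean")
dist, ind = nn_search(bb_emb[0], tree)  # nearest neighbour of row 0 is itself
```

Passing the tree explicitly also makes the function straightforward to unit-test in isolation.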

hardcoded paths in training `validation_step`

hello wenhao & rocio,

The unit tests are great and give a good overview of how the different modules should be run. However, I saw that in these lines, the paths to the building-block embeddings are hardcoded to paths on the HPC cluster.

if out_feat == 'gin':
    bb_emb_gin = np.load('/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_gin.npy')
    kdtree = BallTree(bb_emb_gin, metric='euclidean')
elif out_feat == 'fp_4096':
    bb_emb_fp_4096 = np.load('/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_fp_4096.npy')
    kdtree = BallTree(bb_emb_fp_4096, metric='euclidean')
elif out_feat == 'fp_256':
    bb_emb_fp_256 = np.load('/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_fp_256.npy')
    kdtree = BallTree(bb_emb_fp_256, metric=cosine_distance)
elif out_feat == 'rdkit2d':
    bb_emb_rdkit2d = np.load('/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_rdkit2d.npy')
    kdtree = BallTree(bb_emb_rdkit2d, metric='euclidean')

So, I am unable to make pytest pass; specifically:

FAILED tests/test_Training.py::TestTraining::test_reactant1_network - UnboundLocalError: local variable 'kdtree' referenced before assignment
FAILED tests/test_Training.py::TestTraining::test_reactant2_network - UnboundLocalError: local variable 'kdtree' referenced before assignment

At least for the unit tests, what should the correct path be? And would it be possible to make these paths user-passable arguments?


path_to_data = f'/pool001/whgao/data/synth_net/st_{args.rxn_template}/st_{args.data}.json.gz'

There's similar hardcoding in this line, so I suppose we'll have to generate the .json.gz file ourselves.
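One way to make these paths user-passable is a thin argparse layer replacing the hardcoded literals; the flag names below are assumptions, not the repo's actual CLI:

```python
import argparse
from pathlib import Path

# sketch: replace the hardcoded /pool001/... paths with CLI arguments;
# the flag names here are hypothetical
parser = argparse.ArgumentParser()
parser.add_argument("--embeddings-file", type=Path, required=True,
                    help="precomputed building-block embeddings (.npy)")
parser.add_argument("--out-feat", default="fp_256",
                    choices=["gin", "fp_4096", "fp_256", "rdkit2d"])

# parse a demo argument vector instead of sys.argv for illustration
args = parser.parse_args(["--embeddings-file", "data/emb_gin.npy",
                          "--out-feat", "gin"])
print(args.out_feat, args.embeddings_file)
```

The embedding file would then be loaded from args.embeddings_file instead of a cluster-specific path, which also lets the tests point at small fixture files.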

Running optimize_ga.py

I'm trying to verify that everything works in my setup by running

python optimize_ga.py --radius 2 --nbits 4096 --num_population 128 --num_offspring 512 --num_gen 200 --ncpu 48

It seems to run forever with the following output

Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Starting with 128 fps with 4096 bits
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
...
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
Initial: 0.000 +/- 0.000
Scores: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
Top-3 Smiles: [None, None, None]

How long should this run and is this output normal?

No mol_fp module in _mp_decode.py

I ran optimize_ga.py for molecule optimization, but I got an error because there is no mol_fp attribute in _mp_decode.py.

Traceback (most recent call last):
  File "/home/sejeong/codes/SynNet/scripts/optimize_ga.py", line 207, in <module>
    [decode.mol_fp(smi, args.radius, args.nbits) for smi in starting_smiles]
  File "/home/sejeong/codes/SynNet/scripts/optimize_ga.py", line 207, in <listcomp>
    [decode.mol_fp(smi, args.radius, args.nbits) for smi in starting_smiles]
AttributeError: module 'scripts._mp_decode' has no attribute 'mol_fp'

So, I changed the call to use the mol_fp function from predict_utils.py:

from syn_net.utils.predict_utils import mol_fp

population = np.array(
    [mol_fp(smi, args.radius, args.nbits) for smi in starting_smiles]
)

Then, I got the error below.

Traceback (most recent call last):
  File "/home/sejeong/codes/SynNet/scripts/optimize_ga.py", line 210, in <module>
    population = population.reshape((population.shape[0], population.shape[2]))
IndexError: tuple index out of range

Can you help me with this error?
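The IndexError suggests that mol_fp from predict_utils returns a 1-D array of length nbits, so np.array over the list is already 2-D and has no shape[2]. A guarded version of the reshape, under that assumption (the mol_fp below is a random stand-in, not the real featurizer):

```python
import numpy as np

def mol_fp(smi, radius=2, nbits=4096):
    # random stand-in for the real fingerprint function; assumed 1-D output
    rng = np.random.default_rng(abs(hash(smi)) % (2**32))
    return (rng.random(nbits) > 0.5).astype(np.float32)

starting_smiles = ["CCO", "c1ccccc1"]
population = np.array([mol_fp(smi) for smi in starting_smiles])

# only flatten the middle axis when fingerprints come back with an extra
# dimension, e.g. shape (n, 1, nbits); a 2-D array passes through untouched
if population.ndim == 3:
    population = population.reshape(population.shape[0], population.shape[2])
print(population.shape)  # (2, 4096)
```

In other words, the original reshape was only needed for a 3-D return shape; checking ndim makes the code work with either convention.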

Error computing embeddings

When running compute_embedding.py, I get this error:

Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Total data:  172988
  0%|                                                                                                                                                                                                            | 0/172988 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ec2-user/SynNet/scripts/compute_embedding.py", line 143, in <module>
    embeddings.append(model(smi))
  File "/home/ec2-user/miniconda3/envs/rdkit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'categorical_node_feats' and 'categorical_edge_feats'

When trying to run compute_embedding_mp.py, I get the following error:

Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Total data:  172988
Traceback (most recent call last):
  File "/home/ec2-user/SynNet/scripts/compute_embedding_mp.py", line 29, in <module>
    embeddings = pool.map(gin_embedding, data)
NameError: name 'gin_embedding' is not defined

I think this can be resolved by changing gin_embedding to model but that then results in the above error.

predict_multireactant_mp.py error

I ran the code with my own data (more than 2000 SMILES), and the message below was printed.
Can you tell me why the error occurs?
I don't know which list object is causing it.

list index out of range

Decoder appears to decode random initial fingerprints to the same SMILES

I've recently been interested in running SynNet with the most recent version of the US Stock Enamine BBs. I ran steps 0-2 to preprocess the data and wanted to try reward-guided molecule generation using GA per the instructions in the readme. However, I notice that even with the initial randomly generated fingerprints, 70-80 of the initial 100 are decoded to the same SMILES string:

CC(C)(C)OC(=O)N1CC2NCCN(S(=O)(=O)CC(=O)c3ccccc3)C2C1

This causes the GA population update to hang forever, as insufficient unique new molecules are found to add to the pool and increment parent_idx to num_population in each step of the algorithm.

Could this be the result of the difference in the Enamine stock between the time of publication and now? Any help is appreciated!

Thank you,
Andrei
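One defensive workaround for the hang, independent of the root cause: cap the number of attempts when filling the next population, so a run of duplicate decodes terminates the loop instead of spinning forever. A minimal sketch with hypothetical names, not the repo's actual GA code:

```python
def take_unique(candidate_smiles, num_population, max_attempts=1000):
    # collect up to num_population unique, non-None SMILES, but stop after
    # max_attempts so a decoder that keeps emitting duplicates cannot hang
    unique = []
    for attempts, smi in enumerate(candidate_smiles, start=1):
        if smi is not None and smi not in unique:
            unique.append(smi)
        if len(unique) == num_population or attempts >= max_attempts:
            break
    return unique

pool = take_unique(["CCO", "CCO", None, "c1ccccc1", "CCN"], num_population=2)
print(pool)  # ['CCO', 'c1ccccc1']
```

With a cap like this, the generation would finish with a smaller population and the duplicate-decode symptom would at least surface in the logs instead of hanging.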

Unit Test Files

First off great work!

The unit tests reference files that are ignored in .gitignore:

'./data/states_0_train.npz'
'./data/st_hb_test.json.gz'
'./data/building_blocks_matched.csv.gz'

Can we add these to the repo so the unit tests can be run?

Issue running optimize_ga.py

When I try to run optimize_ga.py, I get

Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Starting with 128 fps with 4096 bits
Traceback (most recent call last):
  File "/home/ec2-user/SynNet/scripts/optimize_ga.py", line 205, in <module>
    scores, mols, trees = fitness(embs=population,
TypeError: fitness() got an unexpected keyword argument 'pool'

This is the command I am using

python optimize_ga.py --radius 2 --nbits 4096 --num_population 128 --num_offspring 512 --num_gen 200 --ncpu 48 --objective logp
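The traceback means the call site passes a pool keyword that the fitness signature does not accept; aligning the two fixes it. A toy sketch of the pattern (placeholder scoring, not the repo's actual fitness function):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, fine for a demo

def fitness(embs, pool=None):
    # accept the pool the call site passes; fall back to serial evaluation
    def score(emb):
        return float(sum(emb))  # placeholder objective
    if pool is not None:
        return pool.map(score, embs)
    return [score(emb) for emb in embs]

with Pool(2) as pool:
    scores = fitness(embs=[[0.5, 0.5], [1.0]], pool=pool)
print(scores)  # [1.0, 1.0]
```

The alternative fix is to drop the pool keyword at the call site, but keeping it lets the fitness evaluation stay parallel.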

add docs on `compute_embedding.py` needed for inference

hello (again),

sorry that I am raising multiple issues. just want to make it easier for everyone else to start using this awesome work.

I didn't find a note about how one could compute molecular fingerprints / GNN embeddings for a dataset. Only after some Ctrl+F did I find that scripts/compute_embedding.py does it:
https://github.com/wenhao-gao/SynNet/blob/master/scripts/compute_embedding.py

So, it would be a good idea to add this to the README. I believe this step is required before running any inference.

ZINC csv used by publication

hello wenhao & rocio,

I see that we have to provide path/to/zinc.csv to run the genetic algorithm (to replicate how it was done in the paper)
https://github.com/wenhao-gao/SynNet#synthesizable-molecular-design-1

optimize_ga.py -i path/to/zinc.csv --radius 2 --nbits 4096 --num_population 128 --num_offspring 512 --num_gen 200 --ncpu 32 --objective gsk

Is it possible to provide the exact zinc.csv that was used in the publication?

Seeds are randomly sampled from the ZINC database (Sterling & Irwin, 2015)
