synnet's People

Contributors

chrulm, degraff-roivant, tkramer-motion, wenhao-gao


synnet's Issues

Errors in creating env and running unit tests.

Hi,

I'd like to use SynNet in my work. I have followed the instructions in the README to set up my environment.

  1. In the environment.yml file the environment name is rdkit, not synthenv. As a result, source activate synthenv as instructed in the README does not work. You may want to take a look at this.

  2. When I run the unit tests, I get a few errors, which I think originate from incorrect path specifications. One of the errors: FileNotFoundError: [Errno 2] No such file or directory: '/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_gin.npy'. I noticed there are multiple hardcoded paths like this, which might make it difficult to use the code in future computations without having to change each and every one of them.

Will you be able to help me with these? Thanks!

No output in Synthesis Planning

I have the trained models and data ready, and I followed steps 0-2 from INSTRUCTIONS.md.

python /data/users/xx/projects/SynNet/src/00-extract-smiles-from-sdf.py \
    --input-file="/data/users/xx/data/enamine_us/Enamine_Rush-Delivery_Building_Blocks-US_222337cmpd_20230801.sdf" \
    --output-file="/data/users/xx/projects/SynNet/data/assets/building-blocks/enamine-us-smiles.csv.gz"
python /data/users/xx/projects/SynNet/src/01-filter-building-blocks.py \
    --building-blocks-file "/data/users/xx/projects/SynNet/data/assets/building-blocks/enamine-us-smiles.csv.gz" \
    --rxn-templates-file "/data/users/xx/projects/SynNet/data/assets/reaction-templates/hb.txt" \
    --output-bblock-file "/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --output-rxns-collection-file "/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz" --verbose
python /data/users/xx/projects/SynNet/src/02-compute-embeddings.py \
    --building-blocks-file "/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --output-file "/data/users/xx/projects/SynNet/data/pre-process/embeddings/hb-enamine-embeddings.npy" \
    --featurization-fct "fp_256"

but after running the synthesis-planning script

BUILDING_BLOCKS_FILE=/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz
RXN_COLLECTION_FILE=/data/users/xx/projects/SynNet/data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz
EMBEDDINGS_KNN_FILE=/data/users/xx/projects/SynNet/data/pre-process/embeddings/hb-enamine-embeddings.npy
python /data/users/xx/projects/SynNet/src/20-predict-targets.py \
    --building-blocks-file $BUILDING_BLOCKS_FILE \
    --rxns-collection-file $RXN_COLLECTION_FILE \
    --embeddings-knn-file $EMBEDDINGS_KNN_FILE \
    --data "/data/users/xx/projects/SynNet/data/assets/molecules/sample-targets.txt" \
    --ckpt-dir "/data/users/xx/projects/SynNet/checkpoints/" \
    --output-dir "/data/users/xx/projects/SynNet/results/demo-inference/"

I get the following result:

targets,decoded,similarity
COc1cc(Cn2c(C)c(Cc3ccccc3)c3c2CCCC3)ccc1OCC(=O)N(C)C,,0
CCC1CCCC(Nc2cc(C(F)(F)F)c(Cl)cc2SC)CC1,,0
Clc1cc(Cl)c(C2=NC(c3cccc4c(Br)cccc34)=NN2)nn1,,0
COc1ccc(S(=O)(=O)c2ccc(-c3nc(-c4cc(B(O)O)ccc4O)no3)cn2)cc1,,0
CNS(=O)(=O)c1ccc(-c2cc3c4c(ccc3[nH]2)CCCN4C(N)=O)cc1,,0
CC(NC(=O)C1Cn2c(O)nnc2CN1)c1cc(F)ccc1N1CCC(n2nnn(-c3ccc(Br)cc3)c2=S)CC1,,0
COc1cc(-c2nc(-c3ccccc3)c(-c3ccccc3)s2)ccn1,,0
CCCn1c(C)nnc1CC(C)(O)C(=C(C)C)c1nccnc1S(=O)(=O)F,,0
CN(c1ccccc1)c1ccc(-c2nc3ncccc3s2)cn1,,0
COc1cc(-c2nc(-c3ccc(F)cc3)c(-c3ccc(F)cc3)n2c2cc(Cl)ccc2Cl)ccc1Oc1ccc(S(=O)(=O)N2CCCCC2)cc1[N+](=O)[O-],,0

The decoded column is empty and the similarity is 0 for every target, so it seems there is no output.
Can you help me with this? Thank you!

Bug: Missing argument `_tree`

It seems like the argument _tree is missing for the nn_search function.

Missing argument here:

dist, ind = nn_search(z_mol1)

And function signature here:

def nn_search(_e, _tree, _k=1):

As far as I can see, there is no _tree in the local scope. This will throw an error, but it is silently caught in scripts/_mp_predict_multireactant.py by a try ... except clause.
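A minimal sketch of the fix, assuming the prebuilt BallTree over the building-block embeddings is available at the call site (the variable names here are illustrative, not the repo's actual ones):

```python
import numpy as np
from sklearn.neighbors import BallTree

def nn_search(_e, _tree, _k=1):
    # query the k nearest neighbours of embedding _e in the prebuilt tree
    dist, ind = _tree.query(np.atleast_2d(_e), k=_k)
    return dist[0], ind[0]

# build the tree once from the embedding matrix, then pass it explicitly
# at every call site instead of relying on a name in local scope
bb_emb = np.random.rand(100, 256)
tree = BallTree(bb_emb, metric="euclidean")
dist, ind = nn_search(bb_emb[0], tree)  # nearest neighbour of row 0 is itself
```

Passing the tree explicitly also makes the function straightforward to unit-test in isolation.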

hardcoded paths in training `validation_step`

hello wenhao & rocio,

The unit tests are great and give a good overview of how the different modules should be run. However, I saw that in these lines, the paths to the building-block embeddings are hardcoded to paths on the HPC cluster.

if out_feat == 'gin':
    bb_emb_gin = np.load('/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_gin.npy')
    kdtree = BallTree(bb_emb_gin, metric='euclidean')
elif out_feat == 'fp_4096':
    bb_emb_fp_4096 = np.load('/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_fp_4096.npy')
    kdtree = BallTree(bb_emb_fp_4096, metric='euclidean')
elif out_feat == 'fp_256':
    bb_emb_fp_256 = np.load('/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_fp_256.npy')
    kdtree = BallTree(bb_emb_fp_256, metric=cosine_distance)
elif out_feat == 'rdkit2d':
    bb_emb_rdkit2d = np.load('/pool001/whgao/data/synth_net/st_hb/enamine_us_emb_rdkit2d.npy')
    kdtree = BallTree(bb_emb_rdkit2d, metric='euclidean')

So, I am unable to make pytest pass; specifically:

FAILED tests/test_Training.py::TestTraining::test_reactant1_network - UnboundLocalError: local variable 'kdtree' referenced before assignment
FAILED tests/test_Training.py::TestTraining::test_reactant2_network - UnboundLocalError: local variable 'kdtree' referenced before assignment

At least for the unit tests, what should the correct path be? And would it be possible to make these paths user-passable arguments?


path_to_data = f'/pool001/whgao/data/synth_net/st_{args.rxn_template}/st_{args.data}.json.gz'

There's similar hardcoding in this line, so I suppose we'll have to generate the .json.gz file ourselves.
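One way to make these paths user-passable is a thin argparse layer replacing the hardcoded literals; the flag names below are assumptions, not the repo's actual CLI:

```python
import argparse
from pathlib import Path

# sketch: replace the hardcoded /pool001/... paths with CLI arguments;
# the flag names here are hypothetical
parser = argparse.ArgumentParser()
parser.add_argument("--embeddings-file", type=Path, required=True,
                    help="precomputed building-block embeddings (.npy)")
parser.add_argument("--out-feat", default="fp_256",
                    choices=["gin", "fp_4096", "fp_256", "rdkit2d"])

# parse a demo argument vector instead of sys.argv for illustration
args = parser.parse_args(["--embeddings-file", "data/emb_gin.npy",
                          "--out-feat", "gin"])
print(args.out_feat, args.embeddings_file)
```

The embedding file would then be loaded from args.embeddings_file instead of a cluster-specific path, which also lets the tests point at small fixture files.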

Running optimize_ga.py

I'm trying to verify that everything works in my setup by running

python optimize_ga.py --radius 2 --nbits 4096 --num_population 128 --num_offspring 512 --num_gen 200 --ncpu 48

It seems to run forever with the following output

Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Starting with 128 fps with 4096 bits
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
...
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
mat1 and mat2 shapes cannot be multiplied (1x12292 and 12288x1200)
Initial: 0.000 +/- 0.000
Scores: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
Top-3 Smiles: [None, None, None]

How long should this run and is this output normal?

No mol_fp module in _mp_decode.py

I ran optimize_ga.py for molecule optimization, but I got an error because there is no mol_fp attribute in _mp_decode.py.

Traceback (most recent call last):
  File "/home/sejeong/codes/SynNet/scripts/optimize_ga.py", line 207, in <module>
    [decode.mol_fp(smi, args.radius, args.nbits) for smi in starting_smiles]
  File "/home/sejeong/codes/SynNet/scripts/optimize_ga.py", line 207, in <listcomp>
    [decode.mol_fp(smi, args.radius, args.nbits) for smi in starting_smiles]
AttributeError: module 'scripts._mp_decode' has no attribute 'mol_fp'

So, I changed the call to use the mol_fp function from predict_utils.py:

from syn_net.utils.predict_utils import mol_fp

population = np.array(
    [mol_fp(smi, args.radius, args.nbits) for smi in starting_smiles]
)

Then, I got the error below.

Traceback (most recent call last):
  File "/home/sejeong/codes/SynNet/scripts/optimize_ga.py", line 210, in <module>
    population = population.reshape((population.shape[0], population.shape[2]))
IndexError: tuple index out of range

Can you help me with this error?
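The IndexError suggests that mol_fp from predict_utils returns a 1-D array of length nbits, so np.array over the list is already 2-D and has no shape[2]. A guarded version of the reshape, under that assumption (the mol_fp below is a random stand-in, not the real featurizer):

```python
import numpy as np

def mol_fp(smi, radius=2, nbits=4096):
    # random stand-in for the real fingerprint function; assumed 1-D output
    rng = np.random.default_rng(abs(hash(smi)) % (2**32))
    return (rng.random(nbits) > 0.5).astype(np.float32)

starting_smiles = ["CCO", "c1ccccc1"]
population = np.array([mol_fp(smi) for smi in starting_smiles])

# only flatten the middle axis when fingerprints come back with an extra
# dimension, e.g. shape (n, 1, nbits); a 2-D array passes through untouched
if population.ndim == 3:
    population = population.reshape(population.shape[0], population.shape[2])
print(population.shape)  # (2, 4096)
```

In other words, the original reshape was only needed for a 3-D return shape; checking ndim makes the code work with either convention.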

Error computing embeddings

When running compute_embedding.py, I get this error:

Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Total data:  172988
  0%|                                                                                                                                                                                                            | 0/172988 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ec2-user/SynNet/scripts/compute_embedding.py", line 143, in <module>
    embeddings.append(model(smi))
  File "/home/ec2-user/miniconda3/envs/rdkit/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'categorical_node_feats' and 'categorical_edge_feats'

When trying to run compute_embedding_mp.py, I get the following error:

Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Total data:  172988
Traceback (most recent call last):
  File "/home/ec2-user/SynNet/scripts/compute_embedding_mp.py", line 29, in <module>
    embeddings = pool.map(gin_embedding, data)
NameError: name 'gin_embedding' is not defined

I think this can be resolved by changing gin_embedding to model but that then results in the above error.

predict_multireactant_mp.py error

I ran the code with my own data (more than 2000 SMILES), and the message below was printed.
Can you tell me why the error occurs?
I don't know which list object is causing it.

list index out of range

Decoder appears to decode random initial fingerprints to the same SMILES

I've recently been interested in running SynNet with the most recent version of the US Stock Enamine BBs. I ran steps 0-2 to preprocess the data and wanted to try reward-guided molecule generation using GA per the instructions in the readme. However, I notice that even with the initial randomly generated fingerprints, 70-80 of the initial 100 are decoded to the same SMILES string:

CC(C)(C)OC(=O)N1CC2NCCN(S(=O)(=O)CC(=O)c3ccccc3)C2C1

This causes the GA population update to hang forever, as insufficient unique new molecules are found to add to the pool and increment parent_idx to num_population in each step of the algorithm.

Could this be the result of the difference in the Enamine stock between the time of publication and now? Any help is appreciated!

Thank you,
Andrei
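One defensive workaround for the hang, independent of the root cause: cap the number of attempts when filling the next population, so a run of duplicate decodes terminates the loop instead of spinning forever. A minimal sketch with hypothetical names, not the repo's actual GA code:

```python
def take_unique(candidate_smiles, num_population, max_attempts=1000):
    # collect up to num_population unique, non-None SMILES, but stop after
    # max_attempts so a decoder that keeps emitting duplicates cannot hang
    unique = []
    for attempts, smi in enumerate(candidate_smiles, start=1):
        if smi is not None and smi not in unique:
            unique.append(smi)
        if len(unique) == num_population or attempts >= max_attempts:
            break
    return unique

pool = take_unique(["CCO", "CCO", None, "c1ccccc1", "CCN"], num_population=2)
print(pool)  # ['CCO', 'c1ccccc1']
```

With a cap like this, the generation would finish with a smaller population and the duplicate-decode symptom would at least surface in the logs instead of hanging.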

Unit Test Files

First off great work!

The unit tests reference files that are ignored in .gitignore:

'./data/states_0_train.npz'
'./data/st_hb_test.json.gz'
'./data/building_blocks_matched.csv.gz'

Can we add these to the repo so the unit tests can be run?

Issue running optimize_ga.py

When I try to run optimize_ga.py, I get

Using backend: pytorch
Downloading gin_supervised_contextpred_pre_trained.pth from https://data.dgl.ai/dgllife/pre_trained/gin_supervised_contextpred.pth...
Pretrained model loaded
Starting with 128 fps with 4096 bits
Traceback (most recent call last):
  File "/home/ec2-user/SynNet/scripts/optimize_ga.py", line 205, in <module>
    scores, mols, trees = fitness(embs=population,
TypeError: fitness() got an unexpected keyword argument 'pool'

This is the command I am using

python optimize_ga.py --radius 2 --nbits 4096 --num_population 128 --num_offspring 512 --num_gen 200 --ncpu 48 --objective logp
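The traceback means the call site passes a pool keyword that the fitness signature does not accept; aligning the two fixes it. A toy sketch of the pattern (placeholder scoring, not the repo's actual fitness function):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, fine for a demo

def fitness(embs, pool=None):
    # accept the pool the call site passes; fall back to serial evaluation
    def score(emb):
        return float(sum(emb))  # placeholder objective
    if pool is not None:
        return pool.map(score, embs)
    return [score(emb) for emb in embs]

with Pool(2) as pool:
    scores = fitness(embs=[[0.5, 0.5], [1.0]], pool=pool)
print(scores)  # [1.0, 1.0]
```

The alternative fix is to drop the pool keyword at the call site, but keeping it lets the fitness evaluation stay parallel.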

add docs on `compute_embedding.py` needed for inference

hello (again),

sorry that I am raising multiple issues. just want to make it easier for everyone else to start using this awesome work.

I didn't find a note about how one could compute molecular fingerprints / GNN embeddings for a dataset. Only after some Ctrl+F did I find that scripts/compute_embedding.py does it:
https://github.com/wenhao-gao/SynNet/blob/master/scripts/compute_embedding.py

So, it would be a good idea to add this to the README. I believe this step is required before running any inference.

ZINC csv used by publication

hello wenhao & rocio,

I see that we have to provide path/to/zinc.csv to run the genetic algorithm (to replicate how it was done in the paper)
https://github.com/wenhao-gao/SynNet#synthesizable-molecular-design-1

optimize_ga.py -i path/to/zinc.csv --radius 2 --nbits 4096 --num_population 128 --num_offspring 512 --num_gen 200 --ncpu 32 --objective gsk

Is it possible to provide the exact zinc.csv that was used in the publication?

Seeds are randomly sampled from the ZINC database (Sterling & Irwin, 2015)
