
awslabs / dgl-lifesci
690 stars · 17 watchers · 144 forks · 986 KB

Python package for graph neural networks in chemistry and biology

License: Apache License 2.0

Shell 0.28% Python 99.72%
deep-learning graph-neural-networks dgl cheminformatics bioinformatics geometric-deep-learning drug-discovery molecule

dgl-lifesci's Introduction

DGL-LifeSci

Documentation | Discussion Forum

We also have a slack channel for real-time discussion. If you want to join the channel, contact [email protected].

Table of Contents

Introduction

Deep learning on graphs has been a rising trend in the past few years. There are many graphs in life science, such as molecular graphs and biological networks, making it an important area for applying deep learning on graphs. DGL-LifeSci is a DGL-based package for various applications in life science with graph neural networks.

We provide various functionalities, including but not limited to methods for graph construction, featurization and evaluation; model architectures; training scripts; and pre-trained models.

For a list of community contributors, see here.

Installation

Requirements

DGL-LifeSci should work on

  • all Linux distributions no earlier than Ubuntu 16.04
  • macOS X
  • Windows 10

It is recommended to create a conda environment for DGL-LifeSci, for example with

conda create -n dgllife python=3.6

DGL-LifeSci requires python 3.6+, DGL 0.7.0+ and PyTorch 1.5.0+.

Install pytorch

Install dgl

Additionally, we require RDKit. The easiest way to install RDKit is

pip install rdkit

If you need to work on the example of JTVAE, then you need RDKit 2018.09.3. We recommend installing it with

conda install -c rdkit rdkit==2018.09.3

For other installation recipes for RDKit, see the official documentation.

Pip installation for DGL-LifeSci

pip install dgllife

Installation from source

If you want to try experimental features, you can install from source as follows:

git clone https://github.com/awslabs/dgl-lifesci.git
cd dgl-lifesci/python
python setup.py install

Verifying successful installation

Once you have installed the package, you can verify the success of installation with

import dgllife

print(dgllife.__version__)
# 0.3.2

Command Line Interface

DGL-LifeSci provides command line interfaces that allow users to perform modeling without any background in programming and deep learning. You will need to first clone the github repo.
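For example (a sketch; the training script name regression_train.py is an assumption, though regression_inference.py in the same directory appears in a traceback later on this page):

git clone https://github.com/awslabs/dgl-lifesci.git
cd dgl-lifesci/examples/property_prediction/csv_data_configuration
python regression_train.py --help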

Examples

For a full list of work implemented in DGL-LifeSci, see here.

Cite

If you use DGL-LifeSci in a scientific publication, we would appreciate citations to the following paper:

@article{dgllife,
    title={DGL-LifeSci: An Open-Source Toolkit for Deep Learning on Graphs in Life Science},
    author={Mufei Li and Jinjing Zhou and Jiajing Hu and Wenxuan Fan and Yangkang Zhang and Yaxin Gu and George Karypis},
    year={2021},
    journal = {ACS Omega}
}

dgl-lifesci's People

Contributors

amazon-auto, autodataming, chajath, cyhflight, ekorman, flying-sheep, joshuameyers, mar-volk, marcossilva, mufeili, padr31, rgasper, sooheon, vigneshinzone, vovallen, wenx00, xnuohz, xuzijian629, yangkzz, yuezhong-bio


dgl-lifesci's Issues

GPU slower than CPU

Tried to use the dgllife model_zoo to extract molecule features and found that the running speed on GPU was much slower than on CPU. It's hard to train a model on a 3090.

Device    GPU        CPU
T4        0.3637     0.0535
3090      93.5422    0.01018

My environment is as follows:

python 3.7
torch 1.7.0
dgl-cu101 0.6.1
dgllife 0.2.8

JTVAE's `pretrain` script raises an error due to mismatched dtypes

Running examples/generative_models/jtvae/pretrain.py without any arguments (which should pretrain on ZINC) raises an Error:

/home/simon/miniconda3/envs/jtvae_dgl/lib/python3.7/site-packages/dgl/base.py:45: DGLWarning: The input graph for the user-defined edge function does not contain valid edges
  return warnings.warn(message, category=category, stacklevel=1)
Traceback (most recent call last):
  File "/home/simon/Documents/ETH/Masters_thesis/chemical_CPA/embeddings/jtvae/pretrain.py", line 192, in <module>
    main(args)
  File "/home/simon/Documents/ETH/Masters_thesis/chemical_CPA/embeddings/jtvae/pretrain.py", line 86, in main
    beta=0,
  File "/home/simon/miniconda3/envs/jtvae_dgl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/simon/miniconda3/envs/jtvae_dgl/lib/python3.7/site-packages/dgllife/model/model_zoo/jtvae.py", line 664, in forward
    word_loss, topo_loss, word_acc, topo_acc = self.decoder(batch_tree_graphs, tree_vec)
  File "/home/simon/miniconda3/envs/jtvae_dgl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/simon/miniconda3/envs/jtvae_dgl/lib/python3.7/site-packages/dgllife/model/model_zoo/jtvae.py", line 278, in forward
    reduce_func=fn.sum('h_nei', 'sum_h'))
  File "/home/simon/miniconda3/envs/jtvae_dgl/lib/python3.7/site-packages/dgl/heterograph.py", line 4653, in pull
    v = utils.prepare_tensor(self, v, 'v')
  File "/home/simon/miniconda3/envs/jtvae_dgl/lib/python3.7/site-packages/dgl/utils/checks.py", line 35, in prepare_tensor
    name, g.idtype, g.device, F.dtype(data), F.context(data)))
dgl._ffi.base.DGLError: Expect argument "v" to have data type torch.int32 and device context cuda:0. But got torch.int64 and cuda:0.

Output of conda list:

# packages in environment at /home/simon/miniconda3/envs/jtvae_dgl:
#
# Name                    Version          Build            Channel

_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_llvm conda-forge
argcomplete 1.12.3 pyhd8ed1ab_2 conda-forge
argon2-cffi 21.1.0 py37h5e8e339_2 conda-forge
arrow-cpp 2.0.0 py37hc02b082_15_cpu conda-forge
async_generator 1.10 py_0 conda-forge
attrs 21.2.0 pyhd8ed1ab_0 conda-forge
aws-c-common 0.4.59 h36c2ea0_1 conda-forge
aws-c-event-stream 0.1.6 had2084c_6 conda-forge
aws-checksums 0.1.10 h4e93380_0 conda-forge
aws-sdk-cpp 1.8.70 h57dc084_1 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 py_2 conda-forge
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
blas 2.112 mkl conda-forge
blas-devel 3.9.0 12_linux64_mkl conda-forge
bleach 4.1.0 pyhd8ed1ab_0 conda-forge
boost 1.68.0 py37h8619c78_1001 conda-forge
boost-cpp 1.68.0 h11c811c_1000 conda-forge
brotli 1.0.9 h7f98852_6 conda-forge
brotli-bin 1.0.9 h7f98852_6 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2021.10.8 ha878542_0 conda-forge
cairo 1.16.0 h18b612c_1001 conda-forge
certifi 2021.10.8 py37h89c1867_1 conda-forge
cffi 1.15.0 py37h036bc23_0 conda-forge
charset-normalizer 2.0.7 pypi_0 pypi
cloudpickle 2.0.0 pypi_0 pypi
colorama 0.4.4 pyh9f0ad1d_0 conda-forge
cudatoolkit 10.2.89 h8f6ccaa_9 conda-forge
cycler 0.11.0 pyhd8ed1ab_0 conda-forge
dbus 1.13.6 h48d8840_2 conda-forge
debugpy 1.5.1 py37hcd2ae1e_0 conda-forge
decorator 5.1.0 pyhd8ed1ab_0 conda-forge
defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge
dgl-cuda10.2 0.7.2 py37_0 dglteam
dgllife 0.2.8 pypi_0 pypi
entrypoints 0.3 py37hc8dfbb8_1002 conda-forge
expat 2.4.1 h9c3ff4c_0 conda-forge
fontconfig 2.13.1 he4413a7_1000 conda-forge
freetype 2.10.4 h0708190_1 conda-forge
future 0.18.2 pypi_0 pypi
gettext 0.19.8.1 h73d1719_1008 conda-forge
gflags 2.2.2 he1b5a44_1004 conda-forge
glib 2.70.0 h780b84a_1 conda-forge
glib-tools 2.70.0 h780b84a_1 conda-forge
glog 0.4.0 h49b9bf7_3 conda-forge
grpc-cpp 1.34.1 h2157cd5_4
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 h28cd5cc_2
hyperopt 0.2.6 pypi_0 pypi
icu 58.2 hf484d3e_1000 conda-forge
idna 3.3 pypi_0 pypi
importlib-metadata 4.8.2 py37h89c1867_0 conda-forge
importlib_metadata 4.8.2 hd8ed1ab_0 conda-forge
importlib_resources 5.4.0 pyhd8ed1ab_0 conda-forge
ipykernel 6.5.0 py37h6531663_1 conda-forge
ipython 7.29.0 py37h6531663_2 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
ipywidgets 7.6.5 pyhd8ed1ab_0 conda-forge
jedi 0.18.0 py37h89c1867_3 conda-forge
jinja2 3.0.3 pyhd8ed1ab_0 conda-forge
joblib 1.1.0 pypi_0 pypi
jpeg 9d h36c2ea0_0 conda-forge
jsonschema 4.2.1 pyhd8ed1ab_0 conda-forge
jupyter 1.0.0 py37h89c1867_7 conda-forge
jupyter_client 6.1.12 pyhd8ed1ab_0 conda-forge
jupyter_console 6.4.0 pyhd8ed1ab_1 conda-forge
jupyter_core 4.9.1 py37h89c1867_1 conda-forge
jupyterlab_pygments 0.1.2 pyh9f0ad1d_0 conda-forge
jupyterlab_widgets 1.0.2 pyhd8ed1ab_0 conda-forge
kiwisolver 1.3.2 py37h2527ec5_1 conda-forge
krb5 1.19.2 h48eae69_3 conda-forge
lcms2 2.12 hddcbb42_0 conda-forge
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libblas 3.9.0 12_linux64_mkl conda-forge
libbrotlicommon 1.0.9 h7f98852_6 conda-forge
libbrotlidec 1.0.9 h7f98852_6 conda-forge
libbrotlienc 1.0.9 h7f98852_6 conda-forge
libcblas 3.9.0 12_linux64_mkl conda-forge
libcurl 7.80.0 h494985f_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libevent 2.1.10 h28343ad_4 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 11.2.0 h1d223b6_11 conda-forge
libgfortran-ng 11.2.0 h69a702a_11 conda-forge
libgfortran5 11.2.0 h5c6108e_11 conda-forge
libglib 2.70.0 h174f98d_1 conda-forge
libiconv 1.16 h516909a_0 conda-forge
liblapack 3.9.0 12_linux64_mkl conda-forge
liblapacke 3.9.0 12_linux64_mkl conda-forge
libnghttp2 1.43.0 ha19adfc_1 conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libpng 1.6.37 h21135ba_2 conda-forge
libprotobuf 3.14.0 h780b84a_0 conda-forge
libsodium 1.0.18 h36c2ea0_1 conda-forge
libssh2 1.10.0 ha35d2d1_2 conda-forge
libstdcxx-ng 11.2.0 he4da1e4_11 conda-forge
libthrift 0.13.0 hfb8234f_6
libtiff 4.2.0 hbd63e13_2 conda-forge
libutf8proc 2.6.1 h7f98852_0 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libuv 1.42.0 h7f98852_0 conda-forge
libwebp-base 1.2.1 h7f98852_0 conda-forge
libxcb 1.13 h7f98852_1004 conda-forge
libxml2 2.9.9 h13577e0_2 conda-forge
libzlib 1.2.11 h36c2ea0_1013 conda-forge
llvm-openmp 12.0.1 h4bd325d_1 conda-forge
lz4-c 1.9.3 h9c3ff4c_1 conda-forge
markupsafe 2.0.1 py37h5e8e339_1 conda-forge
matplotlib-base 3.4.3 py37h1058ff1_2 conda-forge
matplotlib-inline 0.1.3 pyhd8ed1ab_0 conda-forge
mistune 0.8.4 py37h5e8e339_1005 conda-forge
mkl 2021.4.0 h8d4b97c_729 conda-forge
mkl-devel 2021.4.0 ha770c72_730 conda-forge
mkl-include 2021.4.0 h8d4b97c_729 conda-forge
nbclient 0.5.8 pyhd8ed1ab_0 conda-forge
nbconvert 6.3.0 py37h89c1867_1 conda-forge
nbformat 5.1.3 pyhd8ed1ab_0 conda-forge
ncurses 6.2 h58526e2_4 conda-forge
nest-asyncio 1.5.1 pyhd8ed1ab_0 conda-forge
networkx 2.6.3 pyhd8ed1ab_1 conda-forge
notebook 6.4.5 pyha770c72_0 conda-forge
numpy 1.21.4 py37h31617e3_0 conda-forge
olefile 0.46 pyh9f0ad1d_1 conda-forge
openjpeg 2.4.0 hb52868f_1 conda-forge
openssl 3.0.0 h7f98852_2 conda-forge
orc 1.6.6 h7950760_1 conda-forge
packaging 21.0 pyhd8ed1ab_0 conda-forge
pandas 1.3.4 py37he8f5f7f_1 conda-forge
pandoc 2.16.1 h7f98852_0 conda-forge
pandocfilters 1.5.0 pyhd8ed1ab_0 conda-forge
parquet-cpp 1.5.1 1 conda-forge
parso 0.8.2 pyhd8ed1ab_0 conda-forge
pcre 8.45 h9c3ff4c_0 conda-forge
pexpect 4.8.0 py37hc8dfbb8_1 conda-forge
pickleshare 0.7.5 py37hc8dfbb8_1002 conda-forge
pillow 8.2.0 py37h4600e1f_1 conda-forge
pip 21.3.1 pyhd8ed1ab_0 conda-forge
pixman 0.38.0 h516909a_1003 conda-forge
prometheus_client 0.12.0 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.22 pyha770c72_0 conda-forge
prompt_toolkit 3.0.22 hd8ed1ab_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pyarrow 2.0.0 py37h9425694_15_cpu conda-forge
pycairo 1.20.1 py37hfff247e_1 conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pygments 2.10.0 pyhd8ed1ab_0 conda-forge
pyparsing 3.0.6 pyhd8ed1ab_0 conda-forge
pyqt 5.6.0 py37h13b7fb3_1008 conda-forge
pyrsistent 0.18.0 py37h5e8e339_0 conda-forge
python 3.7.12 hf930737_100_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python_abi 3.7 2_cp37m conda-forge
pytorch 1.10.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2021.3 pyhd8ed1ab_0 conda-forge
pyzmq 22.3.0 py37h336d617_1 conda-forge
qt 5.6.3 h8bf5577_3
qtconsole 5.2.0 pyhd8ed1ab_0 conda-forge
qtpy 1.11.2 pyhd8ed1ab_0 conda-forge
rdkit 2018.09.3 py37h9c20d5c_0 conda-forge
re2 2020.11.01 h58526e2_0 conda-forge
readline 8.1 h46c0cb4_0 conda-forge
requests 2.26.0 pypi_0 pypi
scikit-learn 1.0.1 pypi_0 pypi
scipy 1.7.2 py37hf2a6cf1_0 conda-forge
send2trash 1.8.0 pyhd8ed1ab_0 conda-forge
setuptools 59.1.1 py37h89c1867_0 conda-forge
sip 4.18.1 py37hf484d3e_1000 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
snappy 1.1.8 he1b5a44_3 conda-forge
sqlite 3.36.0 h9cd32fc_2 conda-forge
tbb 2021.4.0 h4bd325d_1 conda-forge
terminado 0.12.1 py37h89c1867_1 conda-forge
testpath 0.5.0 pyhd8ed1ab_0 conda-forge
threadpoolctl 3.0.0 pypi_0 pypi
tk 8.6.11 h27826a3_1 conda-forge
tornado 6.1 py37h5e8e339_2 conda-forge
tqdm 4.62.3 pyhd8ed1ab_0 conda-forge
traitlets 5.1.1 pyhd8ed1ab_0 conda-forge
typing_extensions 3.10.0.2 pyha770c72_0 conda-forge
urllib3 1.26.7 pypi_0 pypi
wcwidth 0.2.5 pyh9f0ad1d_2 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.37.0 pyhd8ed1ab_1 conda-forge
widgetsnbextension 3.5.2 py37h89c1867_0 conda-forge
xorg-kbproto 1.0.7 h7f98852_1002 conda-forge
xorg-libice 1.0.10 h7f98852_0 conda-forge
xorg-libsm 1.2.3 hd9c2040_1000 conda-forge
xorg-libx11 1.7.2 h7f98852_0 conda-forge
xorg-libxau 1.0.9 h7f98852_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xorg-libxext 1.3.4 h7f98852_1 conda-forge
xorg-libxrender 0.9.10 h7f98852_1003 conda-forge
xorg-renderproto 0.11.1 h7f98852_1002 conda-forge
xorg-xextproto 7.3.0 h7f98852_1002 conda-forge
xorg-xproto 7.0.31 h7f98852_1007 conda-forge
xz 5.2.5 h516909a_1 conda-forge
zeromq 4.3.4 h9c3ff4c_1 conda-forge
zipp 3.6.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.11 h36c2ea0_1013 conda-forge
zstd 1.4.9 ha95c52a_0 conda-forge

Loading preprocessed datasets fails due to missing attribute 'valid_ids'

from functools import partial
from dgllife.data import Lipophilicity
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer, CanonicalBondFeaturizer

ds = Lipophilicity(partial(smiles_to_bigraph, add_self_loop=True, num_virtual_nodes=1),
                   CanonicalAtomFeaturizer(),
                   CanonicalBondFeaturizer(self_loop=True),
                   load=True)
Loading previously saved dgl graphs...
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-32-e4139ee1c1b4>", line 17, in <module>
    load=True)
  File "/opt/conda/lib/python3.7/site-packages/dgllife/data/lipophilicity.py", line 120, in __init__
    self.chembl_ids = [self.chembl_ids[i] for i in self.valid_ids]
AttributeError: 'Lipophilicity' object has no attribute 'valid_ids'

Not sure why this fails, because valid_ids is created in _pre_process, which is called in __init__.

Error in AttentiveFPBondFeaturizer

When self_loop in AttentiveFPBondFeaturizer is True, I got dgl._ffi.base.DGLError: Expect number of features to match number of edges. Got 87 and 60 instead.

Code:

import dgl

from dgllife.utils import smiles_to_bigraph, AttentiveFPAtomFeaturizer, AttentiveFPBondFeaturizer
from dgllife.model.model_zoo.attentivefp_predictor import AttentiveFPPredictor

config = {
    'node_feat': AttentiveFPAtomFeaturizer(),
    'edge_feat': AttentiveFPBondFeaturizer(self_loop=True)
}

smiles_lst = [
    'CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N',
    'CC(C)(C)C1=CC(=NO1)NC(=O)NC2=CC=C(C=C2)C3=CN4C5=C(C=C(C=C5)OCCN6CCOCC6)SC4=N3',
    'CCN1CCN(CC1)CC2=C(C=C(C=C2)NC(=O)NC3=CC=C(C=C3)OC4=NC=NC(=C4)NC)C(F)(F)F'
]

gs = [smiles_to_bigraph(smiles,
                        node_featurizer=config['node_feat'],
                        edge_featurizer=config['edge_feat']) for smiles in smiles_lst]

gs = dgl.batch(gs)

model = AttentiveFPPredictor(node_feat_size=config['node_feat'].feat_size(),
                             edge_feat_size=config['edge_feat'].feat_size())

node_feats = gs.ndata.pop('h')
edge_feats = gs.edata.pop('e')

res = model(gs, node_feats, edge_feats)

print(res.size())
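The likely fix (an assumption based on the error message, not a confirmed answer): when the bond featurizer is constructed with self_loop=True, the graph itself also needs self-loops, which smiles_to_bigraph supports via its add_self_loop argument. Reusing smiles_lst and config from the snippet above:

gs = [smiles_to_bigraph(smiles,
                        add_self_loop=True,
                        node_featurizer=config['node_feat'],
                        edge_featurizer=config['edge_feat']) for smiles in smiles_lst]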

Wrong Import at csv_dataset.py

To reproduce: execute dgllife/data/csv_dataset.py

    from dgl import save_graphs, load_graphs

This can be fixed by importing save_graphs and load_graphs from dgl.data.utils instead of dgl, i.e.:
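from dgl.data.utils import save_graphs, load_graphs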

binding_affinity_prediction import error

Hey guys, nice library! I've been trying to run the binding_affinity_prediction example. I built the GPU Docker image and installed dgl and dgllifesci, but I am unable to run python main.py --help. Any ideas? Cheers in advance.

(pytorch-ci) root@dgllifesci:/dgllife/dgl-lifesci/examples/binding_affinity_prediction# python main.py --help
DGL backend not selected or invalid.  Assuming PyTorch for now.
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)
Using backend: pytorch
RDKit is not installed, which is required for utils related to cheminformatics
Traceback (most recent call last):
  File "main.py", line 9, in <module>
    from dgllife.utils.eval import Meter
  File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/dgllife/utils/__init__.py", line 6, in <module>
    from .analysis import *
  File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/dgllife/utils/analysis.py", line 14, in <module>
    from rdkit import Chem
  File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/rdkit/Chem/__init__.py", line 18, in <module>
    from rdkit import DataStructs
  File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/rdkit/DataStructs/__init__.py", line 13, in <module>
    from rdkit.DataStructs import cDataStructs
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found (required by /opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/rdkit/DataStructs/../../../../libRDKitDataStructs.so.1)

(pytorch-ci) root@dgllifesci:/dgllife/dgl-lifesci/examples/binding_affinity_prediction# python
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdkit.DataStructs import cDataStructs
[works fine]

Given that the latter import works okay, it seems like an environment issue. The setup is done in the Dockerfile and I assumed that pytorch-ci was the conda env to install into?

[Rexgen] IndexError: list index out of range

@sparklytopaz I opened a new issue since the issue you commented on is for a different model. The issues are:

  1. IndexError: list index out of range encountered when training a WLN on custom reaction data for reaction center prediction.
  2. Is there an open source software that converts smiles to reaction smiles/smirks ?

Pretrained models' in_feat dimension error

When I try to make predictions on the ESOL dataset with the pretrained models 'Weave_canonical_ESOL', 'Weave_attentivefp_ESOL', 'MPNN_canonical_ESOL' and 'MPNN_attentivefp_ESOL', the feat_size of CanonicalBondFeaturizer and AttentiveFPBondFeaturizer are 12 and 10, but the pretrained models' in_feats are 13 and 11.

Question on the pre-trained model name and code

Hi, I want to use the pre-trained models, and when I loaded them I was confused about how they were pre-trained.

Four pre-trained models are included, as follows:
[image]

The paper of Hu et al., 2019 includes four self-supervised methods at the node level (for example: infomax, edge prediction, attribute masking, context prediction) and one supervised method for graph-level pre-training. In addition, I noticed the datasets were different: ~2M molecules (ZINC) for node-level and ~450k (ChEMBL) for graph-level pre-training.

I am confused that the pre-trained model names in dgl-lifesci are "gin_supervised_contextpred", "gin_supervised_infomax" and so on. Were they pre-trained with the self-supervised methods at the node level (using the 2M dataset)? If true, I think a name like "self_supervised_contextpred" might be better.

In addition, could you provide the pre-training code (from scratch; I did not find it in the source code of dgl-lifesci)? Furthermore, it would be very kind if you could provide the time consumption of the pre-training process, which I want to explore further.

Thanks with best regards.

update

I think I found the self supervised mode of "attr masking" here:
https://github.com/awslabs/dgl-lifesci/tree/master/examples/property_prediction/pretrain_gnns/chem

error g.ndata.pop ('h')

Hi,
I do not understand why the command g.ndata.pop('h') gives the following error when it is run again.

KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>()
      5 for epoch in range(num_epochs):
      6     for i, (smiles, g, label, mask) in enumerate(sider_train):
----> 7         inputs = g.ndata.pop('h')
      8         inputs = inputs.to(device)
      9         targets = label.to(device)

2 frames
/usr/lib/python3.7/_collections_abc.py in pop(self, key, default)
    793         '''
    794         try:
--> 795             value = self[key]
    796         except KeyError:
    797             if default is self.__marker:

/usr/local/lib/python3.7/dist-packages/dgl/view.py in __getitem__(self, key)
     64             return ret
     65         else:
---> 66             return self._graph._get_n_repr(self._ntid, self._nodes)[key]
     67
     68     def __setitem__(self, key, val):

/usr/local/lib/python3.7/dist-packages/dgl/frame.py in __getitem__(self, name)
    391             Column data.
    392         """
--> 393         return self._columns[name].data
    394
    395     def __setitem__(self, name, data):

KeyError: 'h'
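A plausible cause (my reading, not a confirmed answer): pop removes 'h' from g.ndata the first time the loop body runs, so the key no longer exists when the same graph is visited again. Reading the features without deleting them avoids this:

inputs = g.ndata['h']  # index instead of pop, so 'h' survives for later epochs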

ACNN not working with dgl batch

To recreate

from dgllife.model.model_zoo.acnn import ACNN
import dgl
from rdkit import Chem
from rdkit.Chem import AllChem
import torch
from dgllife.utils import ACNN_graph_construction_and_featurization
model = ACNN()

protein = Chem.MolFromPDBFile("./6LU7.pdb")
protein_pos = torch.Tensor(protein.GetConformer().GetPositions())

ligand1 = Chem.MolFromSmiles("O=C(CC(c1ccccc1)c1ccccc1)N1CCN(S(=O)(=O)c2ccccc2[N+](=O)[O-])CC1")
AllChem.EmbedMolecule(ligand1)

ligand2 = Chem.MolFromSmiles("Cc1cc(C(=O)Nc2ccc(OCC(N)=O)cc2)c(C)n1C1CC1")
AllChem.EmbedMolecule(ligand2)

pos1 = torch.Tensor(ligand1.GetConformer().GetPositions())
pos2 = torch.Tensor(ligand2.GetConformer().GetPositions())

g1 = ACNN_graph_construction_and_featurization(ligand1, protein, pos1, protein_pos)
g2 = ACNN_graph_construction_and_featurization(ligand2, protein, pos2, protein_pos)
print(g1, g2)
batch = dgl.graph([g1, g2])

This throws the following error

Traceback (most recent call last):
  File "ACNN.py", line 24, in <module>
    batch = dgl.graph([g1, g2])
  File "/home/manan/miniconda3/lib/python3.8/site-packages/dgl/convert.py", line 151, in graph
    u, v, urange, vrange = utils.graphdata2tensors(data, idtype)
  File "/home/manan/miniconda3/lib/python3.8/site-packages/dgl/utils/data.py", line 169, in graphdata2tensors
    src, dst = elist2tensor(data, idtype)
  File "/home/manan/miniconda3/lib/python3.8/site-packages/dgl/utils/data.py", line 28, in elist2tensor
    u, v = zip(*elist)
  File "/home/manan/miniconda3/lib/python3.8/site-packages/dgl/heterograph.py", line 1968, in __getitem__
    raise DGLError('Invalid key "{}". Must be one of the edge types.'.format(orig_key))
dgl._ffi.base.DGLError: Invalid key "0". Must be one of the edge types.

Train rexgen on GPU and evaluate on CPU

I have trained the rexgen model from the examples on a CUDA-supporting machine, and now I want to evaluate the model on a non-CUDA-supporting machine. That is, I trained the model on an AWS EC2 instance, and now I want to evaluate a few reactions on my laptop.

Running candidate_ranking_eval.py causes a deserialization error when the center model is loaded, i.e. at line 531 of utils.py.

I propose loading the model as is done at line 43 of find_reaction_center_eval.py or line 42 of candidate_ranking_eval.py.

I am using commit 89be5a3

Traceback (most recent call last):
  File "candidate_ranking_eval.py", line 82, in <module>
    path_to_candidate_bonds = prepare_reaction_center(args, reaction_center_config)
  File "/home/mvolk/PyCharmProjects/dgl-lifesci/examples/reaction_prediction/rexgen_direct/utils.py", line 531, in prepare_reaction_center
    torch.load(args['center_model_path'])['model_state_dict'])
  File "/home/mvolk/anaconda3/envs/dgl_lifesci/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/mvolk/anaconda3/envs/dgl_lifesci/lib/python3.7/site-packages/torch/serialization.py", line 853, in _load
    result = unpickler.load()
  File "/home/mvolk/anaconda3/envs/dgl_lifesci/lib/python3.7/site-packages/torch/serialization.py", line 845, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/home/mvolk/anaconda3/envs/dgl_lifesci/lib/python3.7/site-packages/torch/serialization.py", line 834, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/home/mvolk/anaconda3/envs/dgl_lifesci/lib/python3.7/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/home/mvolk/anaconda3/envs/dgl_lifesci/lib/python3.7/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/mvolk/anaconda3/envs/dgl_lifesci/lib/python3.7/site-packages/torch/serialization.py", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
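For reference, the fix the error message itself suggests is to remap storages at load time; a minimal sketch:

import torch

# Load a checkpoint that was saved on a CUDA machine onto a CPU-only machine.
checkpoint = torch.load('model_final.pkl', map_location=torch.device('cpu'))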

Pretrain GIN models using UnlabeledSmiles

Hi,

I know that one can use pre-trained GIN models with masked and contextpred modes etc. for downstream regression or classification. But I'm wondering if there are any training scripts or resources in dgl-lifesci to pretrain GIN models with different node/edge features. If not, I'm planning to implement it myself and I can open a PR here.

Thanks!!

The default PretrainAtomFeaturizer does not work for the ClinTox dataset.

Hi,

I was trying the script in dgl-lifesci/examples/property_prediction/moleculenet for molecular property prediction. I got the following error when running command python classification.py -d ClinTox -mo gin_supervised_masking

Using backend: pytorch
Directory classification_results already exists.
Processing dgl graphs from scratch...
Traceback (most recent call last):
  File "classification.py", line 186, in <module>
    n_jobs=1 if args['num_workers'] == 0 else args['num_workers'])
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/data/clintox.py", line 109, in __init__
    n_jobs=n_jobs)
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/data/csv_dataset.py", line 78, in __init__
    load, log_every, init_mask, n_jobs, error_log)
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/data/csv_dataset.py", line 139, in _pre_process
    edge_featurizer=edge_featurizer))
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py", line 375, in smiles_to_bigraph
    canonical_atom_order, explicit_hydrogens, num_virtual_nodes)
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py", line 276, in mol_to_bigraph
    canonical_atom_order, explicit_hydrogens, num_virtual_nodes)
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py", line 90, in mol_to_graph
    g.ndata.update(node_featurizer(mol))
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/utils/featurizers.py", line 1293, in __call__
    self._atomic_number_types.index(atom.GetAtomicNum()),
ValueError: 0 is not in list

It seems that there exist atoms in the ClinTox dataset that return 0 when calling GetAtomicNum(), which is outside the default atomic_number_types of PretrainAtomFeaturizer. The problem can be resolved by passing node_featurizer=PretrainAtomFeaturizer(atomic_number_types=list(range(119))) when constructing the ClinTox dataset, but I am not sure what a 0 atomic number means.
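For concreteness, the workaround described above amounts to (a sketch; pass the featurizer to the ClinTox constructor as in the report):

from dgllife.utils import PretrainAtomFeaturizer

# Cover every atomic number, including the unexpected 0, so .index() cannot fail.
node_featurizer = PretrainAtomFeaturizer(atomic_number_types=list(range(119)))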

Error when using mol_to_bigraph with node featurizer

When I execute the following block of code:

from rdkit.Chem import MolFromSmiles
from dgllife.utils import mol_to_bigraph, mol_to_graph
from dgllife.utils.featurizers import CanonicalAtomFeaturizer, CanonicalBondFeaturizer

mol = MolFromSmiles('CCO') 
mol_to_bigraph(mol,
               node_featurizer=CanonicalAtomFeaturizer,
               edge_featurizer=CanonicalBondFeaturizer)

I got the following error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-72-f95cc3af6dcb> in <module>()
      6 mol_to_bigraph(mol,
      7                node_featurizer=CanonicalAtomFeaturizer,
----> 8                edge_featurizer=CanonicalBondFeaturizer)

2 frames

/usr/local/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py in mol_to_bigraph(mol, add_self_loop, node_featurizer, edge_featurizer, canonical_atom_order, explicit_hydrogens, num_virtual_nodes)
    269     return mol_to_graph(mol, partial(construct_bigraph_from_mol, add_self_loop=add_self_loop),
    270                         node_featurizer, edge_featurizer,
--> 271                         canonical_atom_order, explicit_hydrogens, num_virtual_nodes)
    272 
    273 def smiles_to_bigraph(smiles, add_self_loop=False,

/usr/local/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py in mol_to_graph(mol, graph_constructor, node_featurizer, edge_featurizer, canonical_atom_order, explicit_hydrogens, num_virtual_nodes)
     83 
     84     if node_featurizer is not None:
---> 85         g.ndata.update(node_featurizer(mol))
     86 
     87     if edge_featurizer is not None:

/usr/lib/python3.7/_collections_abc.py in update(*args, **kwds)
    844                     self[key] = other[key]
    845             else:
--> 846                 for key, value in other:
    847                     self[key] = value
    848         for key, value in kwds.items():

TypeError: 'CanonicalAtomFeaturizer' object is not iterable

Environment:
python=3.7
rdkit=2020.09.02
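The traceback points at the cause: the featurizer classes are passed where instances are expected (other snippets on this page call them with parentheses). Reusing the imports from the snippet above, a corrected call would be:

mol = MolFromSmiles('CCO')
mol_to_bigraph(mol,
               node_featurizer=CanonicalAtomFeaturizer(),
               edge_featurizer=CanonicalBondFeaturizer())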

JTNN example does not work

Hi,

I am trying to run the JTNN example in dgl-lifesci/examples/generative_models/jtvae. When I run python train.py, I get this error

"(rdkit-env) root@nucar-nice:/home/trinayan/dgl-lifesci/examples/generative_models/jtvae# python3 train.py
Using backend: pytorch
Traceback (most recent call last):
File "train.py", line 133, in
main(args)
File "train.py", line 31, in main
depth=args.depth)
File "/root/anaconda3/envs/rdkit-env/lib/python3.6/site-packages/dgllife-0.2.5-py3.6.egg/dgllife/model/model_zoo/jtnn/jtnn_vae.py", line 53, in init
FileNotFoundError: [Errno 2] No such file or directory: '/jtnn/vocab.txt'"

I am not sure if I need to download anything, since the GitHub page says the datasets will be downloaded automatically for ZINC.

Any help will be appreciated

Thanks

Target size (torch.Size([128, 12])) must be the same as input size (torch.Size([1, 12]))

I am trying to perform binary classification on Tox21 data using the dgllife GATPredictor. The code link is attached below, and it only runs with 'batch_size': 1.
Whenever I use 'batch_size': 128 (or any value > 1), I get the error 'Target size (torch.Size([128, 12])) must be the same as input size (torch.Size([1, 12]))'. This happens even when I define the batch_size in the DataLoader, which uses a collate function to batch the data according to the defined batch size.
How and where can I change the input size (or target size) so that this discrepancy does not arise?

Code link: https://github.com/rajarshiche/GNNs/blob/main/GAT_trial1.py

Question regarding molecular graphs with Hydrogen atoms

Hello! Thank you for sharing this library!

I have a question concerning graph generation from CSV.

I want to create graphs with custom features, which can later be used with SchNet for a property prediction task. I am using the MoleculeCSVDataset class with custom node and edge featurizers partially taken from TencentAlchemyDataset (alchemy_nodes and alchemy_edges). How can I create graphs from a custom CSV so that the hydrogen atoms are present as nodes, as that information is needed by SchNet?

Many thanks in advance and apologies for the trivial question.
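Not an authoritative answer, but the mol_to_bigraph signature quoted in a traceback elsewhere on this page includes an explicit_hydrogens parameter, which appears to control exactly this; a minimal sketch under that assumption:

from rdkit.Chem import MolFromSmiles
from dgllife.utils import mol_to_bigraph
from dgllife.utils.featurizers import CanonicalAtomFeaturizer

mol = MolFromSmiles('CCO')
# explicit_hydrogens=True should add hydrogens before graph construction,
# so they appear as nodes in the resulting graph.
g = mol_to_bigraph(mol,
                   node_featurizer=CanonicalAtomFeaturizer(),
                   explicit_hydrogens=True)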

ranking of candidate products

For the pretrained model, how is the ranking of the candidate products done?
Is there any way I can print it as a probability or confidence score of some kind?
[image]
I understand that this function ranks the candidate products, but can we convert the tensor to a better ranking metric, like a confidence score from 0 to 1 or a probability in percent?
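Not part of the library's API, but if the ranking tensor holds unnormalized scores, a softmax is one common way to map them to values in [0, 1] that sum to 1 (a sketch under that assumption):

import torch

scores = torch.tensor([2.3, 0.7, -1.2])  # hypothetical candidate scores
probs = torch.softmax(scores, dim=0)     # probability-like values summing to 1
print(probs)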

Utils that do not use RDKit

Some utils do not use RDKit - but it seems as though the requirement is there anyway.
Many of us deal with proteins and not small molecules - and on large clusters, RDKit installation can be a pain.

Is there any way to make the RDKit requirement case-by-case for utils? Two that are widely used and independent are:

  • Meter
  • EarlyStopping

These clearly do not need RDKit, but it needs to be installed anyway.

pretrain models best architecture vs target

Thanks for all the pretrained models, but it's difficult to be sure which one to select, because we don't have the splits and cannot compute the real CV RMSE or ACC/ROC for a given dataset on the test set.

Can you provide a rating of your best architectures for regression targets like ESOL, FreeSolv, etc. (RMSE) and classification (ACC, ROC)?

thanks

No module named 'dgl.nn.functional'

Hi, I installed dgl==0.5.2 and dgllife in the latest docker image, but "import dgllife" gives me the error below. Could you look into it? Thanks!

>>> import dgllife
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/dgllife-0.2.8-py3.6.egg/dgllife/__init__.py", line 9, in <module>
  File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/dgllife-0.2.8-py3.6.egg/dgllife/model/__init__.py", line 6, in <module>
  File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/dgllife-0.2.8-py3.6.egg/dgllife/model/gnn/__init__.py", line 20, in <module>
  File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/dgllife-0.2.8-py3.6.egg/dgllife/model/gnn/pagtn.py", line 11, in <module>
ModuleNotFoundError: No module named 'dgl.nn.functional'

rxn map strict

[H]C([H])([H])Oc1ccc(CCNC=O)cc1OC([H])([H])[H]>>[H]C([H])([H])Oc1cc2c(cc1OC([H])([H])[H])CCN=C2
[CH2:1]([CH2:2][NH:10][CH:11]=[O:21])[c:8]1[cH:7][cH:6][c:5]([O:4][CH3:3])[c:12]([cH:9]1)[O:13][CH3:14]>>[CH:11]1=[N:10][CH2:2][CH2:1][c:8]2[cH:9][c:12]([O:13][CH3:14])[c:5]([cH:6][c:7]12)[O:4][CH3:3]


COc1ccc(CCNC=O)cc1OC>>COc1cc2c(cc1OC)CCN=C2
[O:15]=[CH:1][NH:2][CH2:3][CH2:4][c:5]1[cH:6][cH:7][c:8]([O:9][CH3:10])[c:11]([O:12][CH3:13])[cH:14]1>>[CH:1]1=[N:2][CH2:3][CH2:4][c:5]2[cH:14][c:11]([O:12][CH3:13])[c:8]([O:9][CH3:10])[cH:7][c:6]12

I forgot to hide the explicit hydrogen atoms. The mapped rxn is


[CH2:1]([CH2:2][NH:10][CH:11]=[O:21])[c:8]1[cH:7][cH:6][c:5]([O:4][CH3:3])[c:12]([cH:9]1)[O:13][CH3:14]>>[CH:11]1=[N:10][CH2:2][CH2:1][c:8]2[cH:9][c:12]([O:13][CH3:14])[c:5]([cH:6][c:7]12)[O:4][CH3:3]

Running the command

python find_reaction_center_eval.py --test-path  2.rxns -np 1

reports the following error:

Traceback (most recent call last):
  File "find_reaction_center_eval.py", line 83, in <module>
    main(args)
  File "find_reaction_center_eval.py", line 29, in main
    load=args['load'])
  File "/home/zgong/nfs/program/anaconda2/envs/py36dgllifesci/lib/python3.6/site-packages/dgllife/data/uspto.py", line 395, in __init__
    self.load_reaction_data(path_to_reaction_file, num_processes)
  File "/home/zgong/nfs/program/anaconda2/envs/py36dgllifesci/lib/python3.6/site-packages/dgllife/data/uspto.py", line 452, in load_reaction_data
    mol, reaction, graph_edits = load_one_reaction(li)
  File "/home/zgong/nfs/program/anaconda2/envs/py36dgllifesci/lib/python3.6/site-packages/dgllife/data/uspto.py", line 331, in load_one_reaction
    atom_map_order[atom.GetIntProp('molAtomMapNumber') - 1] = j
IndexError: list assignment index out of range

After canonicalizing the rxn, the problem is solved:

[O:15]=[CH:1][NH:2][CH2:3][CH2:4][c:5]1[cH:6][cH:7][c:8]([O:9][CH3:10])[c:11]([O:12][CH3:13])[cH:14]1>>[CH:1]1=[N:2][CH2:3][CH2:4][c:5]2[cH:14][c:11]([O:12][CH3:13])[c:8]([O:9][CH3:10])[cH:7][c:6]12

dgllife.utils.early_stop write out epoch in checkpoint

After running a hyperparameter search using early stopping with hold-out validation, I would like to retrain the model on the training+validation datasets. For this I require the epoch of the Early Stopping checkpoint. To my knowledge, this is currently not being saved.
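For illustration, a minimal sketch of the requested behavior (hypothetical, not the current dgllife.utils API): store the epoch alongside the weights when checkpointing.

import torch
import torch.nn as nn

model = nn.Linear(4, 1)  # stand-in model
best_epoch = 17          # hypothetical epoch at which early stopping triggered
torch.save({'model_state_dict': model.state_dict(), 'epoch': best_epoch},
           'early_stop.pth')

checkpoint = torch.load('early_stop.pth')
print(checkpoint['epoch'])  # the epoch is now recoverable for retraining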

MGCN Edge embedding

Hi,

I was looking at mgcn.py and I'm wondering if, in get_edge_types of EdgeEmbedding, it should be:

(torch.abs(node_type1 - node_type2) - 1) ** 2 // 4

instead of

(torch.abs(node_type1 - node_type2) - 1) ** 2 / 4

otherwise the edge types are not integers, and torch doesn't really like that.
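A quick standalone check of the difference (plain PyTorch, not the library code; the node type values are made up):

import torch

node_type1 = torch.tensor([6])
node_type2 = torch.tensor([1])

print((torch.abs(node_type1 - node_type2) - 1) ** 2 / 4)   # tensor([4.]) -- float
print((torch.abs(node_type1 - node_type2) - 1) ** 2 // 4)  # tensor([4])  -- integer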

Thanks for all the work you've done on the library !

Virtual node support for mol_to_graph

I see this is on the roadmap in #18. I was looking to dig into this topic, and would prefer to build on top of an established project rather than reinvent the wheel.

The default way I'd approach it is to port the logic from here, such that BaseFeaturizer can take a boolean arg to add a dummy node, which would replace the 0-index node with a fully connected dummy node with a unique feature vector.

We would need a way to derive the dummy vector. The way MAT does it is by prepending a 0th index to the one-hot vector (incrementing the number of classes). We could handle this in each featurizing function (i.e. atom_type_one_hot has an add_dummy kwarg), or handle it at the level of the ConcatFeaturizer.

Conda package

The conda package for osx is still at version 0.2.2. See https://anaconda.org/dglteam/dgllife

Could you upload version 0.2.4?

Also, could you consider moving all the DGL packages to conda-forge? I can provide guidance and/or help with this.

Which featurizers should work with smiles_to_bigraph?

Hello, thanks for the great project! I had a quick question. I have noticed that when using smiles_to_bigraph, some featurizers work and others don't. For example, these work:

first = smiles_to_bigraph('CCO', node_featurizer=AttentiveFPAtomFeaturizer(), edge_featurizer=AttentiveFPBondFeaturizer())

second = smiles_to_bigraph('CCO', node_featurizer=CanonicalAtomFeaturizer(), edge_featurizer=CanonicalBondFeaturizer())

But these don't, with errors about mismatched edges to nodes:

third = smiles_to_bigraph('CCO', node_featurizer=PAGTNAtomFeaturizer(), edge_featurizer=PAGTNEdgeFeaturizer(max_length=1))

forth = smiles_to_bigraph('CCO', node_featurizer=WeaveAtomFeaturizer(), edge_featurizer=WeaveEdgeFeaturizer())

I have noticed that the featurizers named "EdgeFeaturizer" instead of "BondFeaturizer" fail in these situations. I am guessing this is by design, but I am a little lost on the best way to create node/edge features for the other featurizers?

Thanks,
Derek

TF backend support

I'm using TensorFlow as the DGL backend on my local MacBook since Apple released its Metal-accelerated version of TensorFlow. But it seems that dgl-lifesci does not support the TensorFlow backend when I import dgllife.

In [1]: import dgllife
Using backend: tensorflow
2021-01-05 11:13:59.073796: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-68d2941e7384> in <module>
----> 1 import dgllife

~/anaconda3/envs/main/lib/python3.8/site-packages/dgllife/__init__.py in <module>
      7
      8 from .libinfo import __version__
----> 9 from . import model
     10
     11 try:

~/anaconda3/envs/main/lib/python3.8/site-packages/dgllife/model/__init__.py in <module>
      4 # SPDX-License-Identifier: Apache-2.0
      5
----> 6 from .gnn import *
      7 from .readout import *
      8 from .model_zoo import *

~/anaconda3/envs/main/lib/python3.8/site-packages/dgllife/model/gnn/__init__.py in <module>
      6 # Graph neural networks for updating node representations
      7
----> 8 from .attentivefp import *
      9 from .gat import *
     10 from .gcn import *

~/anaconda3/envs/main/lib/python3.8/site-packages/dgllife/model/gnn/attentivefp.py in <module>
     12 import torch.nn.functional as F
     13
---> 14 from dgl.nn.pytorch import edge_softmax
     15
     16 __all__ = ['AttentiveFPGNN']

ModuleNotFoundError: No module named 'dgl.nn.pytorch'

So dgllife currently only supports PyTorch as the backend engine, and TensorFlow support is not planned for the 0.3 roadmap?

Masks in Meter.update()

Hi,
I'm trying to use masks for multi-task learning.
The documentation in eval.py about the use of masks is not clear to me.

mask : None or float32 tensor
            Binary mask indicating the existence of ground truth labels with
            shape ``(B, T)``. If None, we assume that all labels exist and create
            a one-tensor for placeholder.

If a mask entry is set to 1, it could mean either a) the label will be masked out, or b) the label is present. Which one is it?

If label = [5, None, None],
should I set mask = [1, 0, 0] or [0, 1, 1]?

What's the convention if I wanted to exclude the "None" labels from the loss calculation?

Thanks a lot for clarifying!
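My reading of the quoted docstring ("indicating the existence of ground truth labels") is that 1 marks a label that is present, which would make mask = [1, 0, 0] the right choice; a sketch under that interpretation:

import torch

labels = torch.tensor([[5.0, 0.0, 0.0]])  # placeholder zeros where labels are missing
mask = torch.tensor([[1.0, 0.0, 0.0]])    # 1 = label exists, 0 = exclude from the loss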

PotentialNet edge dimension error

Hi,
I want to train a model using PotentialNet. The code fails due to an error with the edge dimensions:

Traceback (most recent call last):
  File "contextlib.py", line 130, in __exit__
python-BaseException
    self.gen.throw(type, value, traceback)
  File "dgl/heterograph.py", line 5614, in local_scope
    yield
  File "dgllife/model/model_zoo/potentialnet.py", line 238, in forward
    eids = graph.edata['e'][:,i].nonzero(as_tuple=False).view(-1).type(graph.idtype)
IndexError: index 8 is out of bounds for dimension 1 with size 8

There seems to be a mismatch with the constructed graphs.
According to a comment in the code,
n_etypes=n_etypes, # num_distance_bins + 5 covalent types
n_etypes should be 9, but it is 8.

>>>distance_bins
[1.5, 2.5, 3.5, 4.5]
>>>d_one_hot.shape
(1384, 3)
>>>complex_knn_graph.edata['e'].shape
torch.Size([2068, 8])

I'd really appreciate your help!

PAGTN implementation

I recently came across this paper PAGTN and code. I am planning on implementing this paper in dgl-lifesci. As per the contribution guidelines, I felt that I should get the repo maintainers' opinion before making a PR.

I have finished reproducing some of the results from this paper on MoleculeNet (ESOL, BACE, BBBP); they can be found in this Colab notebook here.

Could you please let me know if adding this model will be fruitful to the dgl-lifesci community?

Here are a few short notes about the model to save time -

  1. The model uses a complete graph for the molecule, where each node is connected to every other node. For the edge feature between any two nodes, they find the shortest path along bonds and build features from it. Ex: the shortest path between node 8 and node 5 is 8 -> 7 -> 6 -> 5.

  2. The node features are a concatenation of one-hot-encoding vectors of Atom_type, formal_charge, valency, etc.

  3. The edge features are a concatenation of bond type in the shortest path between two nodes and aromatic ring type.

  4. About the neural network: it is very similar to the GAT model, but with many residual connections and a slightly different message-passing strategy, which give it a lot of benefits.

Please let me know if I should proceed with this model. I will clean up the existing code, add docs, and make a few optimisations.

Rexgen example errors when reactants or products have no bonds

When the following reaction is included in a dataset of the rexgen example, the training and evaluation scripts error.
[C:1].[O:2].[O:3]>>[C:1][O:2][O:3]

The problem is that the code of the rexgen example and parts of the dgl-lifesci package assume that all graphs have at least one edge. The reaction is not detected as invalid because it can be imported with RDKit properly. Similar errors can occur when, during inference, a bond change combination is evaluated that corresponds to breaking all bonds.

commit: 367c79b
Example:

(dgl-lifesci) martin@martin-ThinkPad:~/PyCharmProjects/dgl-lifesci/examples/reaction_prediction/rexgen_direct$ python candidate_ranking_eval.py --model-path candidate_res/model_final.pkl --result-path candidate_res/ --test-path ../../../data/input_eval/example_problem_2.txt -cmp center_train/model_final.pkl -rcb 1 -np 1 -nw 1
Using backend: pytorch
Directory candidate_res/ already exists.
Stage 1/2: loading reaction data...
100%|██████████| 1/1 [00:00<00:00, 219.16it/s]
Stage 2/2: loading candidate bond changes...
100%|██████████| 1/1 [00:00<00:00, 32263.88it/s]
Traceback (most recent call last):
  File "candidate_ranking_eval.py", line 83, in <module>
    main(args, path_to_candidate_bonds)
  File "candidate_ranking_eval.py", line 46, in main
    prediction_summary = candidate_ranking_eval(args, model, test_loader)
  File "/home/martin/PyCharmProjects/dgl-lifesci/examples/reaction_prediction/rexgen_direct/utils.py", line 1150, in candidate_ranking_eval
    for batch_id, batch_data in enumerate(data_loader):
  File "/home/martin/anaconda3/envs/dgl-lifesci/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/martin/anaconda3/envs/dgl-lifesci/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/martin/anaconda3/envs/dgl-lifesci/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/martin/anaconda3/envs/dgl-lifesci/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/martin/anaconda3/envs/dgl-lifesci/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/martin/anaconda3/envs/dgl-lifesci/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/martin/anaconda3/envs/dgl-lifesci/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/martin/PyCharmProjects/dgl-lifesci/python/dgllife/data/uspto.py", line 1549, in __getitem__
    self.edge_featurizer)
  File "/home/martin/PyCharmProjects/dgl-lifesci/python/dgllife/data/uspto.py", line 1316, in construct_graphs_rank
    combo_edge_feats = torch.stack(combo_edge_feats, dim=0)
RuntimeError: stack expects a non-empty TensorList

Scaffold splitter functionality unexpected behavior?

Hello dear authors,

I wanted to confirm my understanding of the scaffold splitter functionality. When using the default initialization, I noticed that the indices returned are contiguous, which I find suspicious since the documentation mentions:

Group molecules so that all molecules in a group have a same scaffold (see reference).
The dataset is then split at the level of groups.

To reproduce:

dataset = dgllife.data.Tox21(smiles_to_bigraph, CanonicalAtomFeaturizer())
split = ScaffoldSplitter()
train, test, val = split.train_val_test_split(dataset)
np.max(train.indices), np.min(train.indices), np.max(test.indices), np.min(test.indices), np.max(val.indices), np.min(val.indices)

Note that this does not happen when using the smiles scaffold_func.

Reason for occurrence

scaffolds = defaultdict(list)

I think this occurs because the object returned by AllChem.MurckoDecompose seems to always be unique (which I would not expect given the description of the function)

In [56]: AllChem.MurckoDecompose(mol)
Out[56]: <rdkit.Chem.rdchem.Mol at 0x7fc076dbe530>

In [57]: AllChem.MurckoDecompose(mol)
Out[57]: <rdkit.Chem.rdchem.Mol at 0x7fc076d198a0>

In [58]: obj1 = AllChem.MurckoDecompose(mol)

In [59]: obj2 = AllChem.MurckoDecompose(mol)

In [60]: obj1 == obj2
Out[60]: False

Thus when using the returned object as a key for the dictionary the key will always be unique.

scaffolds = defaultdict(list)

for i, mol in enumerate(molecules):
    # For mols that have not been sanitized, we need to compute their ring information
    FastFindRings(mol)
    if scaffold_func == 'decompose':
        mol_scaffold = AllChem.MurckoDecompose(mol)
    if scaffold_func == 'smiles':
        mol_scaffold = MurckoScaffold.MurckoScaffoldSmiles(
            mol=mol, includeChirality=False)
    # Group molecules that have the same scaffold
    scaffolds[mol_scaffold].append(i)
In [54]: all([len(x[1]) == 1 for x in scaffolds.items()] )
Out[54]: True
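A minimal sketch of a possible fix (my suggestion, not the library's current code): use the scaffold's canonical SMILES as the dictionary key, so identical scaffolds collide as intended:

from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import AllChem

molecules = [Chem.MolFromSmiles(s) for s in ['CCc1ccccc1', 'NCCc1ccccc1']]
scaffolds = defaultdict(list)
for i, mol in enumerate(molecules):
    # A canonical SMILES string is a stable key, unlike a Mol object,
    # which hashes by identity and is distinct on every call.
    mol_scaffold = Chem.MolToSmiles(AllChem.MurckoDecompose(mol))
    scaffolds[mol_scaffold].append(i)

print(scaffolds)  # both molecules map to the same benzene scaffold key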

Smiles scaffold func

When I run a similar experiment with the smiles scaffold function the output as expected has multiple molecules belonging to a single scaffold

In [67]: pd.value_counts([len(x[1]) for x in scaffolds.items()])
Out[67]:
1       1773
2        261
3        112
4         48
5         32
6         19
7         11
11        10
8          8
12         5
9          5
10         4
13         4
14         3

I hope that this isn't a wild goose chase and will prove to be a useful investigation in improving functionality of this great library. Many thanks
-Phil

how to use pre-trained model correctly?

Hi, I am trying to use pre-trained model on ESOL dataset.

from tqdm import tqdm
import dgl
from dgllife.data import ESOL
from dgllife.model import load_pretrained
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer, AttentiveFPAtomFeaturizer, CanonicalBondFeaturizer, AttentiveFPBondFeaturizer

dataset_canonical = ESOL(smiles_to_bigraph, CanonicalAtomFeaturizer(),CanonicalBondFeaturizer())

model = load_pretrained('Weave_canonical_ESOL') # Pretrained model loaded
model.eval()

for smiles, g, label in tqdm(dataset_canonical):
    nfeats = g.ndata['h']
    efeats = g.edata['e']
    label_pred = model(g, nfeats, efeats)
    print(label_pred)
    print(label)

This throws the following error

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_184688/2242391364.py in <module>
      7     nfeats = g.ndata['h']
      8     efeats = g.edata['e']
----> 9     label_pred = model(g, nfeats, efeats)
     10     print(label_pred)
     11     print(label)

~/miniconda3/envs/dgl/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/miniconda3/envs/dgl/lib/python3.9/site-packages/dgllife/model/model_zoo/weave_predictor.py in forward(self, g, node_feats, edge_feats)
    103             Prediction for the graphs in the batch. G for the number of graphs.
    104         """
--> 105         node_feats = self.gnn(g, node_feats, edge_feats, node_only=True)
    106         node_feats = self.node_to_graph(node_feats)
    107         g_feats = self.readout(g, node_feats)

~/miniconda3/envs/dgl/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/miniconda3/envs/dgl/lib/python3.9/site-packages/dgllife/model/gnn/weave.py in forward(self, g, node_feats, edge_feats, node_only)
    208         """
    209         for i in range(len(self.gnn_layers) - 1):
--> 210             node_feats, edge_feats = self.gnn_layers[i](g, node_feats, edge_feats)
    211         return self.gnn_layers[-1](g, node_feats, edge_feats, node_only)

~/miniconda3/envs/dgl/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/miniconda3/envs/dgl/lib/python3.9/site-packages/dgllife/model/gnn/weave.py in forward(self, g, node_feats, edge_feats, node_only)
    107         # Update node features
    108         node_node_feats = self.activation(self.node_to_node(node_feats))
--> 109         g.edata['e2n'] = self.activation(self.edge_to_node(edge_feats))
    110         g.update_all(fn.copy_edge('e2n', 'm'), fn.sum('m', 'e2n'))
    111         edge_node_feats = g.ndata.pop('e2n')

~/miniconda3/envs/dgl/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/miniconda3/envs/dgl/lib/python3.9/site-packages/torch/nn/modules/linear.py in forward(self, input)
    101 
    102     def forward(self, input: Tensor) -> Tensor:
--> 103         return F.linear(input, self.weight, self.bias)
    104 
    105     def extra_repr(self) -> str:

~/miniconda3/envs/dgl/lib/python3.9/site-packages/torch/nn/functional.py in linear(input, weight, bias)
   1846     if has_torch_function_variadic(input, weight, bias):
   1847         return handle_torch_function(linear, (input, weight, bias), input, weight, bias=bias)
-> 1848     return torch._C._nn.linear(input, weight, bias)
   1849 
   1850 

RuntimeError: mat1 and mat2 shapes cannot be multiplied (68x12 and 13x256)

I checked the shape of the graph's edge features against the construction of WeavePredictor and found that they do not match:

>>> smiles, g, label = dataset_canonical[0]
>>> print(g.edata['e'].shape)
torch.Size([68, 12])
>>> print(model)
WeavePredictor(
  (gnn): WeaveGNN(
    (gnn_layers): ModuleList(
      (0): WeaveLayer(
        (node_to_node): Linear(in_features=74, out_features=256, bias=True)
        (edge_to_node): Linear(in_features=13, out_features=256, bias=True)
        (update_node): Linear(in_features=512, out_features=256, bias=True)
        (left_node_to_edge): Linear(in_features=74, out_features=256, bias=True)
        (right_node_to_edge): Linear(in_features=74, out_features=256, bias=True)
        (edge_to_edge): Linear(in_features=13, out_features=256, bias=True)
        (update_edge): Linear(in_features=768, out_features=256, bias=True)
      )
      ...

How can I solve this error? Thanks a lot for your help!
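The shapes in the error (68x12 edge features vs. a 13x256 edge_to_node weight) suggest the graphs carry 12-dimensional bond features (CanonicalBondFeaturizer without self-loops) while the model was built for 13-dimensional ones (the size with self-loop features). A minimal sketch of keeping the two consistent, assuming WeavePredictor takes node_in_feats/edge_in_feats/n_tasks as in the model zoo; the SMILES and the task count are placeholders:

from dgllife.utils import (smiles_to_bigraph, CanonicalAtomFeaturizer,
                           CanonicalBondFeaturizer)
from dgllife.model import WeavePredictor

atom_featurizer = CanonicalAtomFeaturizer()
# self_loop=True yields 13-dim bond features (12 + a self-loop indicator)
bond_featurizer = CanonicalBondFeaturizer(self_loop=True)

# The graphs must then be built with self-loops as well
g = smiles_to_bigraph('CCO', add_self_loop=True,
                      node_featurizer=atom_featurizer,
                      edge_featurizer=bond_featurizer)

# Size the model from the featurizers instead of hard-coding 74 and 13
model = WeavePredictor(node_in_feats=atom_featurizer.feat_size(),
                       edge_in_feats=bond_featurizer.feat_size(),
                       n_tasks=1)  # placeholder task count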

load_pretrained

I want to take a pre-trained model, for example GCN_attentivefp_SIDER, replace its last layer, and change the output size from 27 to 100.
Also, how can I access the implementations of these models?
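A minimal sketch, assuming the pre-trained model is a GCNPredictor whose output MLP is exposed as model.predict with a final nn.Linear module (print the model to confirm the exact attribute names; the implementations live under dgllife.model.model_zoo in the source tree):

import torch.nn as nn
from dgllife.model import load_pretrained

model = load_pretrained('GCN_attentivefp_SIDER')
print(model)  # inspect the architecture before swapping layers

# Assumption: the output head is an nn.Sequential whose last module is
# nn.Linear; replace it so the model predicts 100 tasks instead of 27
old_head = model.predict.predict[-1]
model.predict.predict[-1] = nn.Linear(old_head.in_features, 100)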

UnlabeledSMILES fails with pre_train node/edge featurizer

Hi,

I noticed that the UnlabeledSMILES class in regression_inference.py fails if one uses checkpoints from any of the GIN models. The specific error I got is:

Error Traceback.

Traceback (most recent call last):
  File "/home/ubuntu/Code/dgl-lifesci/examples/property_prediction/csv_data_configuration/regression_inference.py", line 104, in <module>
    main(args)
  File "/home/ubuntu/Code/dgl-lifesci/examples/property_prediction/csv_data_configuration/regression_inference.py", line 19, in main
    edge_featurizer=args['edge_featurizer'])
  File "/home/ubuntu/excape/lib/python3.7/site-packages/dgllife/data/smiles_inference.py", line 55, in __init__
    edge_featurizer=edge_featurizer))
  File "/home/ubuntu/excape/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py", line 225, in mol_to_bigraph
    canonical_atom_order, explicit_hydrogens)
  File "/home/ubuntu/excape/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py", line 79, in mol_to_graph
    g.edata.update(edge_featurizer(mol))
  File "/home/ubuntu/excape/lib/python3.7/_collections_abc.py", line 841, in update
    self[key] = other[key]
  File "/home/ubuntu/excape/lib/python3.7/site-packages/dgl/view.py", line 133, in __setitem__
    self._graph.set_e_repr({key : val}, self._edges)
  File "/home/ubuntu/excape/lib/python3.7/site-packages/dgl/graph.py", line 2373, in set_e_repr
    ' Got %d and %d instead.' % (nfeats, num_edges))
dgl._ffi.base.DGLError: Expect number of features to match number of edges. Got 87 and 60 instead.

This can be solved by passing add_self_loop=True to the mol_to_graph call in the UnlabeledSMILES class. I can open a PR with the fix. Thanks for this awesome tool.
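Until that lands, a workaround is to pass a graph constructor with self-loops enabled, since the pre-train featurizers assign features to self-loop edges. A sketch, with a placeholder SMILES list:

from functools import partial

from dgllife.data import UnlabeledSMILES
from dgllife.utils import (mol_to_bigraph, PretrainAtomFeaturizer,
                           PretrainBondFeaturizer)

smiles = ['CCO', 'c1ccccc1']  # placeholder inputs
dataset = UnlabeledSMILES(
    smiles,
    mol_to_graph=partial(mol_to_bigraph, add_self_loop=True),
    node_featurizer=PretrainAtomFeaturizer(),
    edge_featurizer=PretrainBondFeaturizer())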

[Roadmap] Release Plan for 0.3

This post lists the development plan for the next release. Feel free to leave comments if you have any requirements.

  1. Support average precision metric
  2. Pre-trained models on benchmarks like MoleculeNet, Alchemy, QM9, etc.
  3. Better support for attention visualization
  4. Visualization for learned molecular representations
  5. Adjust learning rate and add gradient clipping for ogbl-ppa
  6. Add better support for feature selection

Rexgen example on CPU

I tried out the rexgen example and ran into an error when running find_reaction_center_eval.py on a machine without CUDA support. The error comes from line 21 of find_reaction_center_eval.py, where torch.cuda.set_device is called on a CPU device. It can be reproduced in a nutshell by running:

import torch
device = torch.device('cpu')
torch.cuda.set_device(device)

The problem can be solved by calling torch.cuda.set_device(args['device']) only when CUDA is available.
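A sketch of the suggested guard (args comes from the script's argument parsing):

import torch

# Only set a CUDA device when one is actually available; on CPU-only
# machines the call raises ValueError, as in the traceback below
if torch.cuda.is_available():
    torch.cuda.set_device(args['device'])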

I used commit: 3f07d1f
It occurs on PyTorch 1.6 and 1.7.

Traceback (most recent call last):
  File "find_reaction_center_eval.py", line 84, in <module>
    main(args)
  File "find_reaction_center_eval.py", line 21, in main
    torch.cuda.set_device(args['device'])
  File "/home/mvolk/anaconda3/envs/dgl-rexgen/lib/python3.7/site-packages/torch/cuda/__init__.py", line 261, in set_device
    device = _get_device_index(device)
  File "/home/mvolk/anaconda3/envs/dgl-rexgen/lib/python3.7/site-packages/torch/cuda/_utils.py", line 31, in _get_device_index
    raise ValueError('Expected a cuda device, but got: {}'.format(device))
ValueError: Expected a cuda device, but got: cpu

Parallel _pre_process in MoleculeCSVDataset

As SMILES counts get to 100k+, parallelization of graph construction and featurization becomes a necessity.

The following is a helper I use for this locally:

from joblib import Parallel, delayed, cpu_count


def pmap(pickleable_fn, data, n_jobs=cpu_count() - 1, verbose=1, **kwargs):
    """
    Parallel map using joblib.

    :param pickleable_fn: Picklable function to map over data.
    :param data: Iterable of inputs to map over.
    :param n_jobs: Number of worker processes (defaults to all cores but one).
    :param verbose: Verbosity level passed through to joblib.
    :param kwargs: Additional keyword arguments for pickleable_fn.
    :return: List of mapped outputs.
    """
    return Parallel(n_jobs=n_jobs, verbose=verbose)(
        delayed(pickleable_fn)(d, **kwargs) for d in data
    )


# usage
# pmap(smiles_to_graph, smiles_strings, node_featurizer=nf, edge_featurizer=ef) -> [graph]

This is pretty plug-and-play, but it will not work with the sequential log_every logger (joblib has its own logger whose verbosity you can control) and it adds another dependency. If you're okay with these downsides, I can go ahead and make a PR for this.
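For concreteness, a runnable version of that usage with DGL-LifeSci's graph constructor (the featurizer choices are illustrative):

from dgllife.utils import (smiles_to_bigraph, CanonicalAtomFeaturizer,
                           CanonicalBondFeaturizer)

smiles_strings = ['CCO', 'c1ccccc1', 'CC(=O)O']
graphs = pmap(smiles_to_bigraph,
              smiles_strings,
              node_featurizer=CanonicalAtomFeaturizer(),
              edge_featurizer=CanonicalBondFeaturizer())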

IndexError: list index out of range

When running python find_reaction_center_train.py --train-path x --val-path y:

[screenshot of the error traceback]

I could not reopen the closed issue, as it wasn't started by me. The error still persists, even after I created a new environment just to train on my own dataset.

DGLError: Expect number of features to match number of edges (MoleculeCSVDataset)

I'm trying to load my molecules into DGL using the MoleculeCSVDataset class and get the error mentioned in the title:

DGLError: Expect number of features to match number of edges. Got 26 and 38 instead.

My code:

dataset = MoleculeCSVDataset(df, partial(smiles_to_bigraph, add_self_loop=True), CanonicalAtomFeaturizer(), 
        CanonicalBondFeaturizer(), 'SMILES', 'dglgraph.bin', n_jobs=6)

Here df is a pandas DataFrame with a SMILES column and multiple label columns (for multi-label classification).

When I set the bond featurizer to None, this works fine. The molecules in the dataset have been cleaned, so there is nothing unusual that could explain the issue (all of them are valid RDKit molecules).

How can I solve this problem?
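The likely cause: add_self_loop=True adds self-loop edges, but CanonicalBondFeaturizer only featurizes real bonds by default, so the feature count (26, from 13 bonds) falls short of the edge count (38, including 12 self-loops). A sketch of the fix, enabling self-loop features on the bond featurizer to match the graph construction (df is the DataFrame from above):

from functools import partial

from dgllife.data import MoleculeCSVDataset
from dgllife.utils import (smiles_to_bigraph, CanonicalAtomFeaturizer,
                           CanonicalBondFeaturizer)

# self_loop=True makes the featurizer emit a feature row for every
# self-loop edge, matching add_self_loop=True in the graph construction
dataset = MoleculeCSVDataset(df, partial(smiles_to_bigraph, add_self_loop=True),
                             CanonicalAtomFeaturizer(),
                             CanonicalBondFeaturizer(self_loop=True),
                             'SMILES', 'dglgraph.bin', n_jobs=6)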

Dataset download is not working with multiple jobs and load set to true

Dataset downloads have gotten very slow and are not working when multiple jobs are given.

I'm running all my experiments on Colab, and the issue can be reproduced here (code in the 4th block; sorry for the long code, but I'm implementing a paper for contribution).

It was working perfectly a few days ago, and I was getting the following results (you can clearly see that the download finishes in about a minute):

[Screenshot from 2021-02-18 03:01:02]

This was today (it wasn't downloading until I removed the load and n_jobs parameters):

The code had been running for 10 minutes with still no results.

[Screenshot from 2021-02-18 03:09:33]

Murcko scaffolds don't work for certain SMILES

Hi,

I noticed weird behavior with ScaffoldSplitter for certain datasets (specifically, this code). The output from MurckoScaffold.MurckoScaffoldSmiles is blank, so all the SMILES/data points are added to test instead of being split between train/val/test. This can be resolved by using AllChem.MurckoDecompose, as suggested in the official RDKit issue tracker. I can open a PR with the solution if you'd like, along with some checks to make sure train indices don't cross over to val/test and vice versa.

Thanks!!
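For reference, a sketch of the proposed substitution (the example SMILES is a placeholder; the molecules that actually trigger the blank output depend on the dataset):

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold

mol = Chem.MolFromSmiles('CC(C)CC1CCC(C)CC1O')  # placeholder molecule

# Current approach: reportedly returns an empty string for some molecules
scaffold_a = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)

# Proposed approach: decompose to the scaffold Mol, then write its SMILES
scaffold_b = Chem.MolToSmiles(AllChem.MurckoDecompose(mol))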
