
paddlepaddle / paddlehelix


Bio-Computing Platform Featuring Large-Scale Representation Learning and Multi-Task Deep Learning “螺旋桨”生物计算工具集

License: Apache License 2.0

CMake 0.01% Python 69.79% Shell 1.09% C++ 11.43% C 0.05% Jupyter Notebook 17.64%
biocomputing machine-learning deeplearning rna-structure-prediction dti representation-learning graph-networks protein-structure-prediction self-supervised-learning ppi

paddlehelix's Introduction

English | 简体中文



Latest News

2022.12.08 Paper "HelixMO: Sample-Efficient Molecular Optimization in Scene-Sensitive Latent Space" is accepted by BIBM 2022. Please refer to link1 or link2 for more details. We also deployed the drug design service on the PaddleHelix website.

2022.08.11 PaddleHelix released the code of HelixGEM-2, a novel molecular property prediction network that models full-range many-body interactions. It ranked 1st on the OGB PCQM4Mv2 leaderboard. Please refer to paper and codes for more details.

2022.07.29 PaddleHelix released the codes of HelixFold-Single, an MSA-free protein structure prediction pipeline relying on only the primary sequences, which can predict the protein structures within seconds. Please refer to paper and codes for more details. Welcome to PaddleHelix website to try out the structure prediction online service.

2022.07.18 PaddleHelix fully released HelixFold, including the training and inference pipelines. The complete training time has been reduced from 11 days to 5.12 days. Prediction of ultra-long monomer proteins (around 6600 AA) is now supported. Please refer to paper and codes for more details.

2022.07.07 Paper "BatchDTA: implicit batch alignment enhances deep learning-based drug–target affinity estimation" is published in Briefings in Bioinformatics. Please refer to paper and codes for more details.

2022.05.24 Paper "HelixADMET: a robust and endpoint extensible ADMET system incorporating self-supervised knowledge transfer" is published in Bioinformatics. Refer to paper for more information.

2022.02.07 Paper "Geometry-enhanced molecular representation learning for property prediction" is published in Nature Machine Intelligence. Please refer to paper and codes to explore the algorithm.

More news ...

2022.01.07 PaddleHelix released the reproduction of AlphaFold 2 inference pipeline using PaddlePaddle in HelixFold.

2021.11.23 Paper "Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction" is accepted by MLCB 2021. Please refer to paper and code for more details.

2021.10.25 Paper "Docking-based Virtual Screening with Multi-Task Learning" is accepted by BIBM 2021.

2021.09.29 Paper "Property-Aware Relation Networks for Few-shot Molecular Property Prediction" is accepted by NeurIPS 2021 as a Spotlight Paper. Please refer to PAR for more details.

2021.07.29 PaddleHelix released a novel geometry-level molecular pre-training model, taking advantage of the 3D spatial structures of the molecules. Please refer to GEM for more details.

2021.06.17 PaddleHelix team won 2nd place in the PCQM4M-LSC track of the OGB-LSC KDD Cup 2021, predicting the DFT-calculated HOMO-LUMO energy gap of molecules. Please refer to the solution for more details.

2021.05.20 PaddleHelix v1.0 released. 1) Update from static framework to dynamic framework; 2) Add new applications: molecular generation and drug-drug synergy.

2021.05.18 Paper "Structure-aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Affinity" is accepted by KDD 2021. The code is available here.

2021.03.15 PaddleHelix team ranks 1st in the ogbg-molhiv and ogbg-molpcba of OGB, predicting the molecular properties.


Introduction

PaddleHelix is a bio-computing tool that uses machine learning approaches, especially deep neural networks, to facilitate development in the following areas:

  • Drug Discovery. Provide 1) Large-scale pre-training models: compounds and proteins; 2) Various applications: molecular property prediction, drug-target affinity prediction, and molecular generation.
  • Vaccine Design. Provide RNA design algorithms, including LinearFold and LinearPartition.
  • Precision Medicine. Provide application of drug-drug synergy.

Resources

Application Platform

PaddleHelix platform provides the AI + biochemistry abilities for the scenarios of drug discovery, vaccine design and precision medicine.

Installation Guide

PaddleHelix is a bio-computing repository based on PaddlePaddle, a high-performance Parallelized Deep Learning Platform. The installation prerequisites and guide can be found here.

Tutorials

We provide abundant tutorials to help you navigate the repository and start quickly.

Examples

We also provide examples that implement various algorithms and show how to run them:

Competition Solutions

PaddleHelix team participated in multiple competitions related to bio-computing. The solutions can be found here.

Guide for Developers

  • To develop new functions based on the source code of PaddleHelix, please refer to guide for developers.
  • For more details of the APIs, please refer to the documents.

Welcome to Join Us

We are looking for machine learning researchers / engineers or bioinformatics / computational chemistry researchers interested in AI-driven drug design. We are based in Shenzhen and Shanghai, China. Please send your resume to [email protected] or [email protected].

paddlehelix's People

Contributors

agave233, fairly, guoxiawang, jameslim-sy, jieegao, jinghu23, joanna2019, kanz76, lihangliu, linxd5, luohongyu, nickyoungforu, noisyntrain, quqxui, raindrops2sea, ruikangsun, superxiang, tata1661, vickii-z, worldeditors, xiaoyao4573, xreki, xueeinstein, xymyeah, zhangbopd, zj-liu


paddlehelix's Issues

ModuleNotFoundError: No module named 'pahelix.featurizers.gem_featurizer'

workspace /dfs/data/PaddleHelix/apps/pretrained_compound/ChemRL/GEM sh scripts/pretrain.sh
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Traceback (most recent call last):
File "pretrain.py", line 34, in
from pahelix.featurizers.gem_featurizer import GeoPredTransformFn, GeoPredCollateFn
ModuleNotFoundError: No module named 'pahelix.featurizers.gem_featurizer'
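
A quick way to narrow down an error like the one above is to check which level of the package path is missing in the active environment. The following is a minimal sketch using only the standard library; the helper name `module_available` is hypothetical, and the module names are simply the ones from the traceback. It only reports what the current interpreter can locate and does not say why a module is absent (for example, an older installed `pahelix` release shadowing the repository checkout).

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the interpreter can locate `dotted_name`.

    Note: for a dotted name, find_spec imports the parent packages,
    but it does not import the final module itself.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised (as ModuleNotFoundError) when a parent package is missing,
        # or when importing a parent package fails.
        return False


if __name__ == "__main__":
    for name in (
        "pahelix",
        "pahelix.featurizers",
        "pahelix.featurizers.gem_featurizer",
    ):
        status = "found" if module_available(name) else "MISSING"
        print(f"{name}: {status}")
```

If the first two names are found but the last one is missing, the installed package is probably a different version from the source tree that `pretrain.py` expects.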

error occurred when running pretrain.py

I ran the graph_dta training script, but it hangs. The output is as follows:

graph_dta ❯ ./scripts/train.sh davis model_configs/fix_prot_len_gin_config.json
2021-06-11:22:51:55 [train.py:143] Load data ...
2021-06-11:22:51:55 [train.py:167] Data loaded.
2021-06-11:22:51:55 [train.py:182] ========== Epoch 0 ==========
/Users/apple/opt/anaconda3/envs/graph_dta/lib/python3.7/site-packages/paddle/nn/layer/norm.py:641: UserWarning: When training, we now always track global mean and variance.
"When training, we now always track global mean and variance.")

Question about the Branch Parallelism in Evoformer

Hi

I noticed that you introduce branch parallelism (BP) in your arXiv paper. I wonder whether the model structure implemented with BP is identical to the one in the AlphaFold2 paper. It appears to me that the computations are sequential in that paper.

Thanks!

SE(3) transformer implementation

Hi this might be a silly question, I noticed that SE(3) message passing is mentioned in the implemented networks but I couldn't find it. Can someone point me to the implementation?

Also, a more general issue: I noticed that most of the network implementations have only a name, with no description of the detailed architecture, no use cases, and no documentation, so I'm having a bit of difficulty getting started. Am I looking in the wrong places?

Model weights

Where can I download the model weights? I could not find the file scripts/download_all_data.sh.
Thanks!

About using XPU

Hello!
I am currently using a Baidu Kunlun R200 XPU server, and I would like to know whether HelixFold supports training and inference on XPU. Thanks!

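How can the 3D structure information for the valid and test sets of pcqm4m-v2 be obtained?

Hello, I see that the input features of GEM-2 include information such as atom-pair distances and bond angles. These features presumably have to be extracted from the 3D structure of the input molecule, but as far as I can see OGB only provides 3D structures for the training set. How are the features for the valid and test sets extracted?
It would be great if you could point out roughly where the feature extraction code (distance and bond-angle calculation) is located :)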

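【PaddlePaddle Hackathon】PaddleHelix task collection

Hi everyone, we are very happy to announce that the first PaddlePaddle Hackathon has started. PaddlePaddle Hackathon is a deep learning programming event for developers worldwide that encourages them to learn about and contribute to PaddlePaddle. This edition covers four tracks (PaddlePaddle, Paddle Family, Paddle Friends, Paddle Anything) with 100 tasks in total for everyone to complete. For details, see the PaddlePaddle Hackathon description. Can't wait to get started?

This issue collects the PaddleHelix tasks in the Paddle Family track. The task list is as follows:

No.  Difficulty  Task issue
83   ⭐️⭐️        【PaddlePaddle Hackathon】83 Predicting the dissociation constant (pKa) of compounds
84   ⭐️          【PaddlePaddle Hackathon】84 Designing a virtual drug screening system

To claim a task from this event, please go to the PaddlePaddle Hackathon Pinned ISSUE to register and claim the task.

Official event website: PaddlePaddle Hackathon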

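Could the PDBbind dataset be made available on AIStudio?

While studying the SIGN-Paddle example, I read that the dataset has to be processed first: convert the PDB-format files into MOL2-format files for feature extraction

A processed dataset is provided, but it is hosted on Dropbox, which is inconvenient to use.

Questions:

1 Could the processed dataset be published as a public dataset on AIStudio?

2 Could the dataset processing, in particular the use of the UCSF Chimera tool, be explained in more detail? Can openbabel also be used for the format conversion?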
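
On the second question, Open Babel's `obabel` command-line tool can convert PDB files to MOL2 in general. The sketch below wraps that call from Python. It assumes `obabel` is installed and on the `PATH`, the helper names and file paths are hypothetical, and nothing in this thread confirms that MOL2 files produced this way yield the same features as the UCSF Chimera output the SIGN example was built on.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_obabel_command(pdb_path: str, mol2_path: str) -> List[str]:
    """Build the Open Babel command line for a PDB -> MOL2 conversion."""
    # -ipdb / -omol2 select the input and output formats; -O names the output file.
    return ["obabel", "-ipdb", pdb_path, "-omol2", "-O", mol2_path]


def convert_pdb_to_mol2(pdb_path: str, mol2_path: str) -> bool:
    """Run the conversion; return False if obabel is unavailable or fails."""
    if shutil.which("obabel") is None:
        return False  # Open Babel is not installed in this environment.
    result = subprocess.run(build_obabel_command(pdb_path, mol2_path))
    return result.returncode == 0 and Path(mol2_path).exists()


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_obabel_command("1a1e_pocket.pdb", "1a1e_pocket.mol2"))
```

Before relying on such files, it would be worth comparing a few converted structures against the preprocessed dataset that the authors provide.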

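Abnormal training and errors when the input sequence length exceeds 2K while training a transformer encoder

Version and environment information:
1)PaddlePaddle version: 2.2.2, GPU build
3)GPU: Tesla V100 PCIE 32 G, cuda: 10.1, cudnn: 7.6.5
4)System environment: Linux, python=3.7

Training information
1)The problem occurs both on a single GPU and on multiple GPUs of a single machine
2)32 GB of GPU memory, no GPU memory leak, and host memory is also normal

Problem description: Following the tape pre-training model in PaddleHelix, I train on sequence data. Training works normally when the maximum sequence length is set to 1024~2048. The problems appear after raising the maximum sequence length limit:

① Set to 3000: the computed loss becomes nan after a certain number of iterations

② Set to 4096: training fails with an illegal memory access error. Where the error appears varies: sometimes it is raised after a few iterations, sometimes immediately. The error message is as follows: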

W0221 18:31:56.758687 28071 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0221 18:31:56.769076 28071 device_context.cc:465] device: 0, cuDNN Version: 7.6.
/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:253: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.bool, the right dtype will convert to paddle.float32
format(lhs_dtype, rhs_dtype, lhs_dtype))
epoch:1, step:1/163376, Loss:2.86357
epoch:1, step:2/163376, Loss:2.26879
epoch:1, step:3/163376, Loss:2.03603
epoch:1, step:4/163376, Loss:1.92704
epoch:1, step:5/163376, Loss:1.96022
epoch:1, step:6/163376, Loss:1.70171
epoch:1, step:7/163376, Loss:1.69766
epoch:1, step:8/163376, Loss:1.61913
epoch:1, step:9/163376, Loss:1.66900
epoch:1, step:10/163376, Loss:1.73385
(External) CUDA error(700), an illegal memory access was encountered.
[Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/fluid/platform/gpu_info.cc:441)

Traceback (most recent call last):
File "train.py", line 127, in
main(args)
File "train.py", line 100, in main
pred = model(text, pos)
File "/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in call
return self._dygraph_call_func(*inputs, **kwargs)
File "/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/zhouhao/mRNA/pretrain/rna_sequence_model.py", line 118, in forward
output = self.model(input, pos)
File "/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in call
return self._dygraph_call_func(*inputs, **kwargs)
File "/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/zhouhao/mRNA/pretrain/rna_sequence_model.py", line 102, in forward
encoder_out = self.encoder_module(input, pos)
File "/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in call
return self._dygraph_call_func(*inputs, **kwargs)
File "/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/zhouhao/mRNA/pretrain/rna_sequence_model.py", line 84, in forward
encoder_output = self.encoder_model(input, pos)
File "/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in call
return self._dygraph_call_func(*inputs, **kwargs)
File "/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/zhouhao/mRNA/pretrain/rna_sequence_model.py", line 38, in forward
(pos == self.padding_idx).astype('float32')*1e-9, axis=[1,2]
File "/home/zhouhao/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py", line 264, in impl
return math_op(self, other_var, 'axis', axis)
OSError: (External) CUDA error(700), an illegal memory access was encountered.
[Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/fluid/platform/gpu_info.cc:441)
[operator < equal > error]
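
As background for the sequence-length settings in this issue, the attention score matrix of a standard transformer encoder grows quadratically with sequence length. The sketch below estimates the size of that one tensor. The head count and batch size are hypothetical, since the issue does not state the model configuration, and the estimate does not by itself explain the nan loss or the CUDA error(700).

```python
def attention_scores_bytes(seq_len: int, num_heads: int,
                           batch_size: int = 1, bytes_per_elem: int = 4) -> int:
    """Bytes needed for one [batch, heads, seq_len, seq_len] float32 score tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem


if __name__ == "__main__":
    # num_heads=8 is an assumed value, not taken from the issue.
    for seq_len in (1024, 2048, 3000, 4096):
        gib = attention_scores_bytes(seq_len, num_heads=8) / 2**30
        print(f"seq_len={seq_len}: {gib:.3f} GiB per layer per sample")
```

Doubling the sequence length from 2048 to 4096 quadruples this tensor, and the same tensor is typically kept for every layer during training.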

Module not found

Hello, many thanks for the toolset. However, I ran into problems with pre-training: some modules cannot be found in the repository. For example, 'from pahelix.model_zoo.pretrain_gnns_model import PretrainGNNModel, AttrmaskModel' cannot find 'AttrmaskModel', and 'from pahelix.featurizers.pretrain_gnn_featurizer import AttrmaskTransformFn, AttrmaskCollateFn' cannot find 'AttrmaskTransformFn, AttrmaskCollateFn'. I am new to this kind of training. How can I solve this problem? Best wishes.

helixfold-single issue

I installed the package and requirements following the instructions at
https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single
(BTW, is the required cudnn version 8.10.1 correct? It seems cudnn only has 8.3.x.)

When I run the inference script, I get the following error:

Error: Can not import avx core while this file exists: /xxx/bin/python/miniconda3/envs/helixfold-single/lib/python3.7/site-packages/paddle/fluid/core_avx.so
..........
..........
from . import core_avx
ImportError: libssl.so.1.1: cannot open shared object file: No such file or directory

I checked my CPU with cat /proc/cpuinfo | grep -i avx, and it seems my CPU supports AVX. Could you help check what the problem is? Thank you.
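
The last line of the log above reports that `libssl.so.1.1` cannot be opened, so the AVX message may be a consequence of that missing shared library rather than a CPU limitation. A minimal sketch for checking this from the same Python environment is shown below. It uses only the standard `ctypes` module, and the helper name `can_load_shared_library` is hypothetical.

```python
import ctypes
import ctypes.util


def can_load_shared_library(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Raised when the library is not found or cannot be loaded.
        return False


if __name__ == "__main__":
    print("libssl.so.1.1 loadable:", can_load_shared_library("libssl.so.1.1"))
    # Shows which ssl library, if any, the system resolves by default.
    print("default ssl library:", ctypes.util.find_library("ssl"))
```

If the first line prints `False`, installing an OpenSSL 1.1 runtime for the system or the conda environment is a reasonable next step to try before looking further into AVX support.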

`Model` object has no attribute decode in HelixFold

Stepping through this issue, I found that <simtk.openmm.app.internal.pdbstructure.Model object at 0x7f424e6f6750> is passed to a method within openmm that expects a file object. This may be an openmm issue, but it is currently blocking my use of HelixFold. I have verified that I have the latest versions of both openmm and pdbfixer, and I have recently pulled the updated setup_env file that changed how openmm is linked into simtk.
Traceback (most recent call last):
  File "run_helixfold.py", line 375, in <module>
    main(args)
  File "run_helixfold.py", line 280, in main
    random_seed=random_seed)
  File "run_helixfold.py", line 160, in predict_structure
    output_dir, 0, timings)
  File "/home/common/proj/FoldingBenchMarks/HelixFold/apps/protein_folding/helixfold/alphafold_paddle/model/model.py", line 283, in postprocess
    relaxed_pdb_str = relaxer.process(prot=prot)[0]
  File "/home/common/proj/FoldingBenchMarks/HelixFold/apps/protein_folding/helixfold/alphafold_paddle/relax/relax.py", line 63, in process
    max_outer_iterations=self._max_outer_iterations)
  File "/home/common/proj/FoldingBenchMarks/HelixFold/apps/protein_folding/helixfold/alphafold_paddle/relax/amber_minimize.py", line 939, in run_pipeline
    pdb_string = clean_protein(prot, checks=checks)
  File "/home/common/proj/FoldingBenchMarks/HelixFold/apps/protein_folding/helixfold/alphafold_paddle/relax/amber_minimize.py", line 187, in clean_protein
    as_file = openmm_app.PDBFile(pdb_structure)
  File "/home/grads/bernardm/.conda/envs/helixfold/lib/python3.7/site-packages/simtk/openmm/app/pdbfile.py", line 96, in __init__
    pdb = PdbStructure(inputfile, load_all_models=True, extraParticleIdentifier=extraParticleIdentifier)
  File "/home/grads/bernardm/.conda/envs/helixfold/lib/python3.7/site-packages/openmm/app/internal/pdbstructure.py", line 153, in __init__
    self._load(input_stream)
  File "/home/grads/bernardm/.conda/envs/helixfold/lib/python3.7/site-packages/openmm/app/internal/pdbstructure.py", line 161, in _load
    if not isinstance(pdb_line, str):
AttributeError: 'Model' object has no attribute 'decode'

Is there a script file for model inference in GEM?

I am looking through the code of "apps/pretrained_compound/ChemRL/GEM". I see there are pre-training and fine-tuning scripts for training the model, but I have not found code for running inference. Is there a script available for inference if I want to do a simple test (load the model and a dataset, then run inference)? Thanks a lot!

How does GEM-2 use 3D information?

Hi,
I recently read your paper "GEM-2: Next Generation Molecular Property Prediction Network by Modeling Full-range Many-body Interactions" and I am quite impressed by the performance of GEM-2 on PCQM4Mv2.
However, I have some difficulty understanding its implementation.
In the method call of class OptimusTransformerFn from https://github.com/PaddlePaddle/PaddleHelix/blob/dev/apps/pretrained_compound/ChemRL/GEM-2/src/featurizer.py
I see two methods for computing 3D coordinates for each molecule: (1) raw3d and (2) rdkit3d.
The first method seems to load the 3D information provided by the PCQM4Mv2 dataset, so it only applies to the training set.
The second method seems to use a built-in RDKit algorithm to compute 3D information, so it can be applied to the training, valid, and test sets alike.
So here are my questions:
(1) For the results reported in your paper, which method did you use to compute 3D information: raw3d or rdkit3d?
(2) Does GEM-2 use 3D information during inference on the valid and test sets, or is 3D information simply turned off?
(3) If possible, could you provide the pretrained weights for the GEM-2 model reported in the paper?
Thank you!

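Data preprocessing errors in the SIGN algorithm

  1. setxor error: for example, the input setxor(a=[1, 0], b=[0, 2]) yields [0, 1, 0, 2], []. In fact, given how bond_graph_base is generated, only a[0] and b[1] need to be taken

    bodyxor, link = setxor(body1, body2)

  2. The atoms returned here use the atom's dimension in the feature matrix, which does not match the atom_type used later. The provided preprocessed data is fine (https://www.dropbox.com/sh/68vc7j5cvqo4p39/AAB_96TpzJWXw6N0zxHdsppEa)

    return lig_size, coords, feas, atoms

Points I am unsure about:
3. If edge a is [0, 1] and edge b is [1, 0], then edge c is [0, 0]. If dist_mat[0, 0] is used, the length of edge c is inf and the computed angle is 180 degrees (encoded as 5), but following the way the angles of the other edges are constructed, the angle should be 0 degrees (encoded as 0)

c = dist_mat[bodyxor[0], bodyxor[1]]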
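
The behaviour described in item 3 can be reproduced with the law of cosines. The sketch below assumes the angle between edges a and b is derived from the three side lengths a, b, c and then clipped to a valid cosine range. That is an assumption about the preprocessing code based on the issue text, not something verified against the repository, and the function name is hypothetical.

```python
import math


def angle_from_sides(a: float, b: float, c: float) -> float:
    """Angle in degrees between sides a and b, where c is the opposite side."""
    cos_theta = (a * a + b * b - c * c) / (2.0 * a * b)
    # Clip to [-1, 1] so that rounding errors or c = inf do not break acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    # Edge c taken from an inf diagonal entry: the angle collapses to 180 degrees.
    print(angle_from_sides(1.0, 1.0, math.inf))
    # Edge c with length 0 (both ends are the same atom): the angle is 0 degrees.
    print(angle_from_sides(1.0, 1.0, 0.0))
```

Under this assumption, an inf on the diagonal of dist_mat drives the cosine to the lower clip bound, which matches the 180-degree result reported above, while a zero self-distance would give the 0 degrees that the issue argues for.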


Data processing problem in the PaddleHelix/apps/drug_target_interaction/sign/ project

When running the data processing command from the code of the KDD 2021 paper "Structure-aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Affinity":
python preprocess_pdbbind.py --data_path_core YOUR_DATASET_PATH --data_path_refined YOUR_DATASET_PATH --dataset_name pdbbind2016 --output_path YOUR_OUTPUT_PATH --cutoff 5
the error shown in the image below occurred. I do not know how to resolve it and would appreciate any help.
image

The versions used are:
Python 3.8.13
paddlepaddle-gpu 2.3.1.post112

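How can the vector representations of compounds and proteins be obtained in these two examples?

https://github.com/PaddlePaddle/PaddleHelix/blob/dev/tutorials/compound_property_prediction_tutorial_cn.ipynb

https://github.com/PaddlePaddle/PaddleHelix/blob/dev/tutorials/protein_pretrain_and_property_prediction_tutorial_cn.ipynb

Looking at the example code, both tutorials seem to end with only the property predictions for compounds and proteins. How to obtain the intermediate compound and protein representations is not explained. Could you tell me how to obtain them?

image
The dimensions of the final result do not look like a vector representation of the protein
image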

Molecular descriptor/fingerprint?

Hello, thanks for a wonderful repository.
I wonder if it is possible to extract, given a pre-trained network, a "fingerprint" or "descriptor" for each input molecule.
Thanks!
M

Error while processing the refined set

Hi,
When I executed the following command for the Python code under this path (https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/drug_target_interaction/sign)

python preprocess_pdbbind.py --data_path_refined refined-set --dataset_name pdbbind2016 --output_path out --cutoff 5

the command ended with the following error.

Traceback (most recent call last):
File "preprocess_pdbbind.py", line 362, in
process_dataset(args.data_path_core, args.data_path_refined, args.dataset_name, args.output_path, args.cutoff)
File "preprocess_pdbbind.py", line 308, in process_dataset
processed_dict[name] = gen_feature(path, name, featurizer)
File "preprocess_pdbbind.py", line 80, in gen_feature
assert x == 8
AssertionError

Looking for your support.

Thanks

Run error in GEM

An error is raised when I run GEM.
I am using a single GPU, a GeForce RTX 2080 Ti with 11 GB of memory.
My code is the same as that on GitHub:
`### build model
init_model = '/home/outdo/PaddleHelix/apps/pretrained_compound/ChemRL/GEM/pretrain_models-chemrl_gem/regr.pdparams'

compound_encoder = GeoGNNModel(compound_encoder_config)
model = DownstreamModel(model_config, compound_encoder)
if metric == 'square':
    criterion = nn.MSELoss()
else:
    criterion = nn.L1Loss()
encoder_params = compound_encoder.parameters()
head_params = exempt_parameters(model.parameters(), encoder_params)
encoder_opt = paddle.optimizer.Adam(args.encoder_lr, parameters=encoder_params)
head_opt = paddle.optimizer.Adam(args.head_lr, parameters=head_params)
print('Total param num: %s' % (len(model.parameters())))
print('Encoder param num: %s' % (len(encoder_params)))
print('Head param num: %s' % (len(head_params)))
for i, param in enumerate(model.named_parameters()):
    print(i, param[0], param[1].name)

if not init_model is None and not args.init_model == "":
    compound_encoder.set_state_dict(paddle.load(args.init_model))
    print('Load state_dict from %s' % args.init_model)`

error information:
`---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Input In [25], in <cell line: 5>()
4 import paddle.fluid as fluid
5 with fluid.device_guard("cpu"):
----> 6 compound_encoder = GeoGNNModel(compound_encoder_config)
7 model = DownstreamModel(model_config, compound_encoder)
8 if metric == 'square':

File ~/PaddleHelix/pahelix/model_zoo/gem_model.py:81, in GeoGNNModel.init(self, model_config)
78 self.bond_float_names = model_config['bond_float_names']
79 self.bond_angle_float_names = model_config['bond_angle_float_names']
---> 81 self.init_atom_embedding = AtomEmbedding(self.atom_names, self.embed_dim)
82 self.init_bond_embedding = BondEmbedding(self.bond_names, self.embed_dim)
83 self.init_bond_float_rbf = BondFloatRBF(self.bond_float_names, self.embed_dim)

File ~/PaddleHelix/pahelix/networks/compound_encoder.py:38, in AtomEmbedding.init(self, atom_names, embed_dim)
36 self.embed_list = nn.LayerList()
37 for name in self.atom_names:
---> 38 embed = nn.Embedding(
39 CompoundKit.get_atom_feature_size(name) + 5,
40 embed_dim,
41 weight_attr=nn.initializer.XavierUniform())
42 self.embed_list.append(embed)

File ~/anaconda3/envs/paddlehelix/lib/python3.8/site-packages/paddle/nn/layer/common.py:1453, in Embedding.init(self, num_embeddings, embedding_dim, padding_idx, sparse, weight_attr, name)
1451 self._remote_prefetch = False
1452 self._name = name
-> 1453 self.weight = self.create_parameter(
1454 attr=self._weight_attr,
1455 shape=self._size,
1456 dtype=self._dtype,
1457 is_bias=False)
1459 if in_dynamic_mode() and padding_idx != -1:
1460 with paddle.no_grad():

File ~/anaconda3/envs/paddlehelix/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py:423, in Layer.create_parameter(self, shape, attr, dtype, is_bias, default_initializer)
421 if isinstance(temp_attr, six.string_types) and temp_attr == "":
422 temp_attr = None
--> 423 return self._helper.create_parameter(temp_attr, shape, dtype, is_bias,
424 default_initializer)

File ~/anaconda3/envs/paddlehelix/lib/python3.8/site-packages/paddle/fluid/layer_helper_base.py:376, in LayerHelperBase.create_parameter(self, attr, shape, dtype, is_bias, default_initializer, stop_gradient, type)
370 if is_used:
371 raise ValueError(
372 "parameter name [{}] have be been used. "
373 "In dygraph mode, the name of parameter can't be same."
374 "Please check the parameter attr value passed to self.create_parameter or "
375 "constructor of dygraph Layers".format(attr.name))
--> 376 return self.main_program.global_block().create_parameter(
377 dtype=dtype,
378 shape=shape,
379 type=type,
380 stop_gradient=stop_gradient,
381 **attr._to_kwargs(with_initializer=True))
382 else:
383 self.startup_program.global_block().create_parameter(
384 dtype=dtype,
385 shape=shape,
386 type=type,
387 **attr._to_kwargs(with_initializer=True))

File ~/anaconda3/envs/paddlehelix/lib/python3.8/site-packages/paddle/fluid/framework.py:3572, in Block.create_parameter(self, *args, **kwargs)
3570 pass
3571 else:
-> 3572 initializer(param, self)
3573 return param

File ~/anaconda3/envs/paddlehelix/lib/python3.8/site-packages/paddle/fluid/initializer.py:605, in XavierInitializer.call(self, var, block)
603 if self._uniform:
604 limit = np.sqrt(6.0 / float(fan_in + fan_out))
--> 605 out_var = _C_ops.uniform_random('shape', out_var.shape, 'min',
606 -limit, 'max', limit, 'seed',
607 self._seed, 'dtype', out_dtype)
608 else:
609 std = math.sqrt(2.0 / float(fan_in + fan_out))

OSError: [operator < uniform_random > error]`

Could anyone help me? Thank you so much!

S2F code

Hi,

I am very interested in the paper "Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction" and wonder when the code will be available. Thanks.

ValueError when running helixfold-single

I am using Paddle 2.3 with CUDA 11.2 in a Linux Docker container.
I resolved the dependencies according to the README
and downloaded the official init model. But when I run the code, I get a ValueError.
The code is PaddleHelix/apps/protein_folding/helixfold-single/helixfold_single_inference.py

As robin said, what is the problem?

2022-09-26 09:24:25.062647: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2500000000 Hz
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:3623: DeprecationWarning: Op `slice` is executed through `append_op` under the dynamic mode, the corresponding API implementation needs to be upgraded to using `_C_ops` method.
  "using `_C_ops` method." % type, DeprecationWarning)
Traceback (most recent call last):
  File "helixfold_single_inference.py", line 121, in <module>
    main(args)
  File "helixfold_single_inference.py", line 106, in main
    results = model(batch, compute_loss=False)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 929, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 914, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/tmp/helix/utils/model_tape.py", line 115, in forward
    batch = self._forward_tape(batch)
  File "/tmp/helix/utils/model_tape.py", line 98, in _forward_tape
    return_representations=True, return_last_n_weight=self.model_config.last_n_weight)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 929, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 914, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/tmp/helix/tape/others/protein_sequence_model_dynamic.py", line 218, in forward
    return_last_n_weight=return_last_n_weight)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 929, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 914, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/tmp/helix/tape/others/transformer_block.py", line 530, in forward
    is_recompute=self.training)
  File "/tmp/helix/tape/others/transformer_block.py", line 26, in recompute_wrapper
    return func(*args)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 929, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 914, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/tmp/helix/tape/others/transformer_block.py", line 480, in forward
    attn_results = self.self_attn(src, src, src, src_mask, relative_pos, rel_embeddings)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 929, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 914, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/tmp/helix/tape/others/transformer_block.py", line 398, in forward
    rel_att = self.disentangled_attention_bias(query_layer, key_layer, relative_pos, rel_embeddings, scale_factor)
  File "/tmp/helix/tape/others/transformer_block.py", line 367, in disentangled_attention_bias
    c2p_att = self.gather_4d(c2p_att, index=c2p_gather_idx)
  File "/tmp/helix/tape/others/transformer_block.py", line 343, in gather_4d
    stack_0 = paddle.tile(paddle.arange(start=0, end=a, step=1, dtype="float32").reshape([a, 1]), [b * c * d]).reshape([a, b, c, d]).cast(index.dtype)
  File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/manipulation.py", line 3243, in reshape
    out, _ = _C_ops.reshape2(x, None, 'shape', shape)
ValueError: (InvalidArgument) The 'shape' in ReshapeOp is invalid. The input tensor X'size must be equal to the capacity of 'shape'. But received X's shape = [1, 1067237297], X's size = 1067237297, 'shape' is [1, 16, 2, 2], the capacity of 'shape' is 64.
  [Hint: Expected capacity == in_size, but received capacity:64 != in_size:1067237297.] (at /root/paddlejob/workspace/env_run/Paddle/paddle/fluid/operators/reshape_op.cc:204)
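
The ValueError above comes down to a simple rule: a tensor can only be reshaped if the product of the target dimensions equals its number of elements. The sketch below restates that check in plain Python using the numbers from the log. The helper names are hypothetical, and it does not explain why the tiled tensor ended up with 1067237297 elements instead of 64 in this environment.

```python
import math
from typing import Sequence


def capacity(shape: Sequence[int]) -> int:
    """Number of elements implied by a shape."""
    return math.prod(shape)


def can_reshape(num_elements: int, shape: Sequence[int]) -> bool:
    """A reshape is valid only if the element counts match exactly."""
    return num_elements == capacity(shape)


if __name__ == "__main__":
    target = [1, 16, 2, 2]          # [a, b, c, d] from the error message
    a, b, c, d = target
    expected_tiled = a * (b * c * d)  # arange(a) reshaped to [a, 1], tiled b*c*d times
    print("expected elements:", expected_tiled, "valid:", can_reshape(expected_tiled, target))
    print("received elements:", 1067237297, "valid:", can_reshape(1067237297, target))
```

Since the expected element count is 64, the repeat count passed to `paddle.tile` in `gather_4d` appears to have been evaluated differently from `b * c * d` at runtime, which may point to a Paddle version mismatch worth checking against the version listed in the README.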

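【PaddlePaddle Hackathon】84 Designing a virtual drug screening system

(This issue is a task issue of the PaddlePaddle Hackathon event; for more details, see PaddlePaddle Hackathon)

PaddleHelix (螺旋桨) is a bio-computing toolkit that uses machine learning approaches, especially deep neural networks, to advance fields such as drug discovery, vaccine design, and precision medicine.

【Task description】

  • Task title: Designing a virtual drug screening system

  • Technical tags: deep learning, C++, python, bio-computing

  • Difficulty: easy

  • Details: Virtual screening is an important part of drug development. Its goal is to select compounds with high biological activity from a large pool of compounds, which are then validated in biological activity experiments. In the past, drugs were usually screened directly with high-throughput screening, which is costly and time-consuming. In recent years, with the development of machine learning and deep learning, virtual screening that incorporates AI algorithms has greatly improved screening efficiency and is increasingly favored. In this task, you need to use the compound activity prediction (also known as drug-target interaction, DTI) models provided by PaddleHelix to design a virtual drug screening system and its features, to better assist drug researchers in screening drugs.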
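
The core step of the screening flow described above, ranking a compound library by a predicted activity score and keeping the best candidates, can be sketched as follows. The scores are made-up placeholder values and the compound IDs are hypothetical. In a real system they would come from a DTI model, whose API is not shown here.

```python
from typing import Dict, List, Tuple


def top_k_candidates(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k compounds with the highest predicted activity score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Placeholder predictions: compound ID -> predicted affinity (higher is better).
    predicted = {"cmpd_001": 5.2, "cmpd_002": 7.9, "cmpd_003": 6.4, "cmpd_004": 4.1}
    for compound_id, score in top_k_candidates(predicted, k=2):
        print(compound_id, score)
```

Whether a higher score means stronger binding depends on the label the model was trained on, so the sort direction is a design choice to confirm per dataset.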

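【Deliverables】

  1. A project PR to PaddleHelix
  2. Related technical documentation
  3. Unit test files for the project

【Technical requirements】

  • Understand the virtual drug screening workflow
  • Understand the technical background of Paddle
  • Produce a virtual drug screening system
  • Extra credit if the system includes a visualization module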

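How to control GPU memory when running the helixfold model

I ran the run_helixfold.py model following helixfold's README_inference.md and hit an out-of-memory error on the GPU. I am using a single 3080Ti with 12GB of memory. I tried reducing the batch size, but from the code the batch appears to be the feature file of the protein FASTA file to be predicted. Is there a good way to reduce the GPU memory the model uses, or any other suggestions for running this model within 12GB of GPU memory? Thank you very much for your help!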

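[PaddlePaddle Hackathon] 83: Predicting the Acid Dissociation Constant (pKa) of Compounds

(This issue is a task issue for the PaddlePaddle Hackathon; for more details, see PaddlePaddle Hackathon.)

PaddleHelix (螺旋桨) is a bio-computing toolkit that uses machine learning, especially deep neural networks, to advance fields such as drug discovery, vaccine design, and precision medicine.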

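[Task description]

  • Task title: Predicting the acid dissociation constant (pKa) of compounds

  • Technical tags: deep learning, C++, Python, bio-computing

  • Difficulty: Medium

  • Details: For weakly acidic or weakly basic compounds, the acid dissociation constant (pKa) is one of the most important physicochemical properties. It determines properties such as solubility, lipophilicity, bioaccumulation, toxicity, absorption, distribution, metabolism, and excretion. pKa has mainly been measured experimentally, which is time-consuming, labor-intensive, and inefficient. With advances in AI, more and more methods use AI to predict compound pKa. In this task, you use the deep learning toolkit and models provided by PaddleHelix to design a tool component that predicts the pKa of compounds, and deliver a pKa prediction model.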
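As background on why pKa matters for properties such as solubility and absorption, the short sketch below shows the standard definition pKa = -log10(Ka) and the Henderson-Hasselbalch relation for a monoprotic weak acid, which gives the fraction of molecules that are deprotonated at a given pH. The function names are illustrative and are not part of PaddleHelix; the sketch does not predict pKa, it only shows how a known pKa is used.

```python
import math


def pka_from_ka(ka: float) -> float:
    """Convert an acid dissociation constant Ka into pKa = -log10(Ka)."""
    return -math.log10(ka)


def fraction_deprotonated(pka: float, ph: float) -> float:
    """Fraction of a monoprotic weak acid present as A- at the given pH.

    Follows from Henderson-Hasselbalch: pH = pKa + log10([A-]/[HA]).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# When pH equals pKa, half of the molecules are deprotonated.
print(fraction_deprotonated(5.0, 5.0))  # 0.5

# Two pH units above the pKa, roughly 99% are deprotonated.
print(round(fraction_deprotonated(5.0, 7.0), 4))  # 0.9901
```

Because the ionized fraction changes by a factor of ten per pH unit around the pKa, even a modest error in a predicted pKa can noticeably shift the estimated charge state at physiological pH.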

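[Deliverables]

  1. A project PR to PaddleHelix
  2. Related technical documentation
  3. Test datasets and test results

[Technical requirements]

  • Be proficient in deep learning methods
  • Understand the meaning of the acid dissociation constant (pKa) of a compound
  • Understand the relevant Paddle technical background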

[Task guide]

  1. Paddle AI Studio PaddleHelix tutorial series
    https://aistudio.baidu.com/aistudio/projectdetail/1293361

  2. PaddleHelix open-source IPython Notebook tutorials in Chinese and English
    https://github.com/PaddlePaddle/PaddleHelix/tree/dev/tutorials

[Reference pKa datasets]

  • We have collected a small amount of experimental pKa data for compounds; please refer to the following articles and links:
  1. Baltruschat, M. & Czodrowski, P. Machine learning meets pKa. F1000Res 9, Chem Inf Sci-113 (2020).
    Data link: https://github.com/czodrowskilab/Machine-learning-meets-pKa/tree/master/datasets
  2. Liao, C. & Nicklaus, M. C. Comparison of Nine Programs Predicting pKa Values of Pharmaceutical Substances. J. Chem. Inf. Model. 49, 2801–2812 (2009).
    Data link: https://pubs.acs.org/doi/10.1021/ci900289x#si1

Installation Issue

  • Issue name:

Could not find a version that satisfies the requirement pgl >= 1.2.0 (from paddlehelix)

  • URL:

Your contact information: [email protected]

  • Expected result:

Download and install successfully

  • Actual result:

image

  • Action taken:
  1. Downgraded to Python 3.6 or 3.7
  2. Created a Conda sandbox environment with Python 3.6/3.7

About GraphPool

GraphPool appears in many places in the code, but I could not find where it is actually implemented. Where is GraphPool defined?

Implementing the RosettaFold model in PaddleHelix

Hi, I'm interested in reimplementing the RosettaFold 3D protein structure prediction model in PaddleHelix (the model was originally written in PyTorch), but I'm still quite new to PaddleHelix, so I thought I should ask around a bit before I actually start.

  • Are you guys interested in implementing it yourselves? (Perhaps you have already implemented it somewhere...)
  • If not, are you interested in accepting a PR for it?
  • I feel that most of the components necessary for implementing the model are already present in PaddleHelix; perhaps someone can point me to the right places?
  • Are there any anticipated difficulties in implementing the model?

Basically I want to reimplement the model in PaddleHelix so I think I should discuss it here for some general suggestions. Comments are highly appreciated!

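How to use the protein pretrained models

I have downloaded the three pretrained protein models: Transformer, LSTM, and ResNet. How can I use them in code? Could you provide some demos, for example showing how to use a pretrained model to obtain the pretrained representation of a given protein?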

Misspecified keyword in the protein pretraining and prediction tutorial

I'm looking at the TAPE protein pretraining tutorial. The model config dict contains a keyword layer_num, but none of the TAPE models accept such a keyword argument; they only accept n_layers. So whatever value we set for layer_num, the models look up the number of hidden layers under n_layers, and since model_config has no such key, they always fall back to the default value of 8. Am I reading this right?

TLDR: I think the layer_num keyword in model configs should be n_layers.
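The failure mode described here can be reproduced with a small stand-alone sketch. `ToyEncoder` and `check_config_keys` are hypothetical stand-ins, not the actual TAPE model classes; the sketch only assumes, as the issue reports, that the model reads `n_layers` from the config with a default of 8.

```python
class ToyEncoder:
    """Hypothetical stand-in for a model that reads its depth from a config dict."""

    DEFAULT_N_LAYERS = 8  # default value reported in the issue

    def __init__(self, model_config):
        # A misspelled key is silently ignored and the default is used instead.
        self.n_layers = model_config.get("n_layers", self.DEFAULT_N_LAYERS)


def check_config_keys(model_config, allowed_keys):
    """Raise early if the config contains keys the model would ignore."""
    unknown = sorted(set(model_config) - set(allowed_keys))
    if unknown:
        raise KeyError(f"Unrecognized config keys: {unknown}")


misnamed = ToyEncoder({"layer_num": 12})
correct = ToyEncoder({"n_layers": 12})
print(misnamed.n_layers, correct.n_layers)  # 8 12
```

A strict key check such as `check_config_keys` is one way to surface this kind of mismatch at construction time instead of training a model with an unintended depth.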

Use newer OpenMM

The setup_env script pins OpenMM to 7.5.1, which is an old release that isn't supported anymore. Could that be updated to the current release, or alternatively could the pin just be removed? As far as I can tell nothing in the code requires the old version.
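If the pin is relaxed, one detail worth verifying is the import path: OpenMM 7.6 and later install the package under the top-level name `openmm`, while 7.5.x and earlier expose it as `simtk.openmm`. A minimal sketch of an import that tolerates both layouts follows. Whether the project's code needs it is an assumption, since the issue does not show the relevant imports.

```python
# Prefer the package name used by OpenMM 7.6+, and fall back to the
# older simtk namespace used by 7.5.x and earlier.
try:
    import openmm
except ImportError:
    try:
        from simtk import openmm
    except ImportError:
        openmm = None  # OpenMM is not installed in this environment

if openmm is not None:
    # Platform.getOpenMMVersion() reports the installed library version.
    print("OpenMM version:", openmm.Platform.getOpenMMVersion())
else:
    print("OpenMM is not installed")
```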
