ProtFound-PD

The official code repository of "Hunting for peptide binders of specific targets with data-centric generative language models".

Overview

The increasing frequency of emerging viral infections calls for more efficient and lower-cost drug design methods. Peptide binders are strong contenders for curbing such outbreaks owing to their efficacy, safety, and specificity. Here, we propose a customizable, low-cost pipeline that combines a model auditing strategy with a data-centric methodology for controllable peptide generation. A generative protein language model, pretrained on approximately 140 million protein sequences, is directionally fine-tuned to generate peptides with the desired properties and binding specificity. Subsequent multi-level structure screening progressively shrinks the synthetic distribution space of peptide candidates to identify authentic high-quality samples, i.e., potential peptide binders, at the in silico stage. Paired with molecular dynamics simulations, the number of candidates that need to be verified in wet-lab experiments is quickly reduced from more than 2.2 million to 16. These potential binders are characterized by enhanced yeast display to determine their expression levels and binding affinity to the target. The results show that only a dozen candidates need to be characterized to obtain a peptide binder with the desired binding strength and specificity. Overall, this work achieves efficient, low-cost peptide design based on a generative language model, increasing the speed of de novo protein design to an unprecedented level. The proposed pipeline is customizable; with only minor modifications, it can be applied to the rapid design of multiple protein families.

Our pipeline

Requirements

mindspore >= 1.5

!! NOTE: Pretraining and fine-tuning of our generative protein language model, as well as peptide sequence generation, are conducted on Ascend-910 (32GB) with MindSpore. If you use a GPU, follow the instructions on MindSpore's official website to install the GPU-adapted version.

Instructions for Use

Data Preparation

The pretraining dataset can be obtained from BFD, Pfam, and UniProt. In this work, the miniprotein scaffold libraries used for the first-round fine-tuning can be obtained from previous work. The peptide sequences carrying targeting information used for the second-round fine-tuning can be generated with the relevant Rosetta suites.

After data filtering and processing, convert the sequences to plain-text (.txt) files with one sequence per line (a minimal sketch follows the directory layout below).

Structure your data directory as follows:

your_data_dir/
  ├─01.txt
  ...
  └─09.txt
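
As a minimal sketch (the file name and sequences below are placeholders, not files shipped with this repository), each .txt file simply lists one sequence per line:

# Minimal sketch: write filtered sequences to one of the plain-text files
# in the layout above, one sequence per line. Sequences are placeholders.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GSSGSSGKLVFFAEDVGSN"]

with open("your_data_dir/01.txt", "w") as f:
    for seq in sequences:
        f.write(seq + "\n")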

Convert the dataset to MindRecord format using:

python prepare_data.py --data_url your_data_dir --save_dir your_save_dir

Model Pretraining

Initiate pretraining with:

python train.py --train_url your_data_dir

Note: This code is tested on Pengcheng CloudBrain II (Ascend-910). Modifications may be needed for other environments.

Fine-tuning on Your Data

To fine-tune the pretrained model on your dataset:

python train.py --train_url your_data_dir --load_ckpt_path pretrained_ckpt_path --finetune

Peptide Generation

Generate peptides by specifying your template and the checkpoint path:

python generate_peptide.py --head_txt your_template_peptide --load_ckpt_path pretrained_ckpt_path
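
Here the --head_txt template acts as the starting fragment (head) of each generated peptide; assuming prefix-conditioned autoregressive sampling, the fine-tuned model extends it residue by residue. A purely conceptual sketch of that sampling loop (not the repository's MindSpore implementation; the uniform next-residue distribution and the template "GSSG" are placeholders):

import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def toy_next_residue_probs(prefix):
    # Placeholder for the fine-tuned language model's next-residue distribution.
    return {aa: 1.0 / len(AA) for aa in AA}

def generate_from_template(template, max_len=30, seed=0):
    random.seed(seed)
    seq = template
    while len(seq) < max_len:
        probs = toy_next_residue_probs(seq)
        seq += random.choices(list(probs), weights=list(probs.values()), k=1)[0]
    return seq

print(generate_from_template("GSSG"))  # placeholder template head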

Visualization of the reduced distribution space

To see whether the structure-based calculations shrink the synthetic distribution space in the desired direction, we extract protein descriptors of the candidates at different screening stages and use factor analysis for dimensionality-reduction visualization. Combining multiple protein descriptors gives a more comprehensive and detailed picture of the underlying patterns in the protein sequences. Here, we employ four length-independent descriptors ($k$-mer spectra; the Normalized Moreau-Broto Autocorrelation descriptor; the Quasi-sequence-order descriptor; and the Composition, Transition, and Distribution descriptor) so that sequences of different lengths yield descriptor vectors of equal length. All descriptors are concatenated for the final analysis (a minimal sketch of this workflow follows the file list below).

  • get_descriptor.py # Script to extract descriptors from protein sequences
  • vis.py # Script for visualizing descriptors
  • example.sh # Example shell script to demonstrate descriptor extraction and visualization
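
A minimal sketch of the descriptor-to-factor-analysis workflow, assuming NumPy and scikit-learn are available (get_descriptor.py and vis.py compute the full descriptor set; here only a normalized k-mer spectrum with k = 2 is shown, and the candidate sequences are placeholders):

from itertools import product
import numpy as np
from sklearn.decomposition import FactorAnalysis

AA = "ACDEFGHIKLMNPQRSTVWY"
KMERS = ["".join(p) for p in product(AA, repeat=2)]  # 400 dipeptides

def kmer_spectrum(seq, k=2):
    # Length-independent: counts are normalized by the number of k-mers in the sequence.
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:
            counts[km] += 1
    total = max(len(seq) - k + 1, 1)
    return np.array([counts[km] / total for km in KMERS])

# Placeholder candidates from different screening stages.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GSSGSSGKLVFFAEDVGSN", "ACDEFGHIKLMNPQRSTVWY"]
X = np.stack([kmer_spectrum(s) for s in sequences])

# Reduce the (concatenated) descriptor vectors to two factors for plotting.
coords = FactorAnalysis(n_components=2, random_state=0).fit_transform(X)
print(coords.shape)  # (number of sequences, 2)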

Pretrained and Finetuned Models Availability

The pretrained model can be downloaded here. The fine-tuned model can be downloaded here.

Acknowledgments

This repository builds upon the PanGu-Alpha codebase. For comprehensive details, see here.

License

This project is covered under the Apache 2.0 License.
