
clean's Introduction

CLEAN: Enzyme Function Prediction using Contrastive Learning


CLEAN has been converted into an easy-to-use web server and is freely accessible at MMLI tools.

This is the official repository for the paper Enzyme Function Prediction using Contrastive Learning. CLEAN (Contrastive Learning enabled Enzyme ANnotation) is a machine learning algorithm that assigns Enzyme Commission (EC) numbers with better accuracy, reliability, and sensitivity than all existing computational tools. We also offer a web server for CLEAN as part of the MMLI AlphaSynthesis tools. Please note that, as an initial release, the CLEAN web server uses default parameters to generate results; in the future, we will allow user-customized parameters. In the meantime, to reproduce the results in our manuscript, please follow the guide below.

To use CLEAN to infer the EC number for any amino acid sequence, we include pretrained weights for both the 70% and 100% identity clustering splits of SwissProt (the expert-reviewed portion of UniProt, ~220k training sequences in total). Users can follow the instructions below to install CLEAN and run inference with it. We also provide full training scripts.


If you find CLEAN helpful in your research, please consider citing us:

@article{doi:10.1126/science.adf2465,
  author = {Tianhao Yu  and Haiyang Cui  and Jianan Canal Li  and Yunan Luo  and Guangde Jiang  and Huimin Zhao },
  title = {Enzyme function prediction using contrastive learning},
  journal = {Science},
  volume = {379},
  number = {6639},
  pages = {1358-1363},
  year = {2023},
  doi = {10.1126/science.adf2465},
  URL = {https://www.science.org/doi/abs/10.1126/science.adf2465}
}  

1. Install

1.1 Requirements

Python >= 3.6; PyTorch >= 1.11.0; CUDA >= 10.1. The manuscript results were obtained using Python 3.10.4, PyTorch 1.11.0, CUDA 11.3, and fair-esm 1.0.2.

1.2 Quickstart


cd CLEAN/app/
conda create -n clean python==3.10.4 -y
conda activate clean
pip install -r requirements.txt

PyTorch installation differs between operating systems; please refer to the PyTorch installation guide. For Linux, use one of the following commands.


conda install pytorch==1.11.0 cpuonly -c pytorch (CPU)
conda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch (GPU)

The input FASTA file should be placed in the data/inputs directory. Two sample input files are already provided in the codebase: price.fasta and init.fasta. We have also included pretrained weights for the 70% and 100% splits, along with pre-computed embeddings for each EC cluster center for fastest inference. Download and unzip these files, then move the contents to data/pretrained.
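
For orientation, after unzipping, the data/pretrained directory is expected to contain files roughly like the listing below. This sketch is inferred from the paths the code loads (for example ./data/pretrained/split100.pth and ./data/pretrained/100.pt); the exact file names in the download may differ, and the 70% split names are assumed by analogy.

data/pretrained/
    split100.pth   (model weights, 100% identity split)
    split70.pth    (model weights, 70% identity split; assumed name)
    100.pt         (pre-computed training-set embeddings, 100% split)
    70.pt          (pre-computed training-set embeddings, 70% split)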

Due to frequent requests, we are also releasing the train/test sets (5 folds for each of split10/30/50/70/100) used in the paper; please download them from the following link. As described in Supplementary Text 1 (ML model development and evaluation), we report results averaged across the 5 folds. We did not create additional validation sets, but interested users can create validation sets following a procedure similar to the one used for the train/test sets. Note that these cross-validation splits date from the development of CLEAN and are not exactly identical to the full split10/30/50/70/100 released in this repo.


python build.py install
git clone https://github.com/facebookresearch/esm.git
mkdir data/esm_data
python CLEAN_infer_fasta.py --fasta_data price

The result will be generated as results/inputs/price_maxsep.csv.

1.2.1 Running in Docker (CPU version)

  1. Pull the Docker image for an AMD64-architecture Ubuntu machine from moleculemaker/clean-image-amd64

docker pull moleculemaker/clean-image-amd64

Note: our current experiments have only been successful on Docker containers running with more than 12 GB of memory.

  2. Running this library requires downloading large weight files (around 7.3 GB), so it is better to pre-download the weight files and mount them when running the Docker container. You can download them with:

curl -o esm1b_t33_650M_UR50S-contact-regression.pt https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt

curl -o esm1b_t33_650M_UR50S.pt https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt

  3. From the directory containing these weight files, we are now ready to run the Docker image. During this run, we mount the downloaded weights into the Docker container, start the container, and run the CLEAN library on the file price.fasta (which is already packaged in the image). If you wish to run this on your own FASTA file, copy it into the /app/data/inputs directory.

sudo docker run -it -v ./:/root/.cache/torch/hub/checkpoints moleculemaker/clean-image-amd64 /bin/bash -c 'echo Starting Execution && python $(pwd)/CLEAN_infer_fasta.py --fasta_data price'

The output file will be generated under the results/inputs directory with the same name as the input file.

1.3 Procedures

Install requirements and build CLEAN

pip install -r requirements.txt
git clone https://github.com/facebookresearch/esm.git
python build.py install

Next, esm-1b embeddings need to be pre-computed from a FASTA file. There are two options:

  1. Retrieve embeddings for all SwissProt sequences (slow, but required for training)

  2. Retrieve only the embeddings for the enzymes to be inferred (fast)

For option 1, run the following commands in Python:

python

>>> from CLEAN.utils import *

>>> ensure_dirs("data/esm_data")

>>> ensure_dirs("data/pretrained")

>>> csv_to_fasta("data/split100.csv", "data/split100.fasta")

>>> retrive_esm1b_embedding("split100")

For option 2, move the FASTA file to be inferred (for example, test.fasta) to /data, and run the following commands:

python

>>> from CLEAN.utils import *

>>> ensure_dirs("data/esm_data")

>>> ensure_dirs("data/pretrained")

>>> retrive_esm1b_embedding("test")

2. Inference

2.1 Preparation

We offer two EC-calling inference algorithms: max-separation and p-value. max-separation consistently gives better precision and recall, but results from p-value can be controlled by adjusting p_value as a hyperparameter.

Before inference, the AA sequences to be inferred are stored in a CSV file with the same format as split100.csv. The EC number field in the CSV file can be any EC number if the true one is unknown, but please ignore the printed evaluation metrics in that case. The esm-1b embeddings of the sequences to be inferred need to be pre-computed with the following commands (using new.csv as an example):

python

>>> from CLEAN.utils import *

>>> csv_to_fasta("data/new.csv", "data/new.fasta")

>>> retrive_esm1b_embedding("new")

2.2.1 Inference with p-value

For inference using p-value, there are two hyperparameters: nk_random and p_value. nk_random is the number of randomly chosen enzymes (in thousands) from the training set used to calculate background distances (distances to incorrect EC numbers) for each EC number. p_value is the threshold for an EC number to be considered significant relative to the background distances. The following commands show how to get EC prediction results with p-value:

python

>>> from CLEAN.infer import infer_pvalue

>>> train_data = "split100"

>>> test_data = "new"

>>> infer_pvalue(train_data, test_data, p_value=1e-5, nk_random=20,
...              report_metrics=True, pretrained=True)

This should produce similar results (depending on the version of ESM-1b weights):


The embedding sizes for train and test: torch.Size([241025, 128]) torch.Size([392, 128])

Calculating eval distance map, between 392 test ids and 5242 train EC cluster centers

############ EC calling results using random chosen 20k samples ############

---------------------------------------------------------------------------

>>> total samples: 392 | total ec: 177

>>> precision: 0.558 | recall: 0.477 | F1: 0.482 | AUC: 0.737

---------------------------------------------------------------------------

2.2.2 Inference with max-separation

For inference using max-separation, there are no hyperparameters to tune: it is a greedy approach that prioritizes the EC numbers with the maximum separation from other EC numbers in terms of pairwise distance to the query sequence. max-separation gives a deterministic prediction and usually outperforms p-value in terms of precision and recall. Because this algorithm does not need to sample from the training set, it is also much faster than p-value. The following commands show how to get EC prediction results with max-separation:

python

>>> from CLEAN.infer import infer_maxsep

>>> train_data = "split100"

>>> test_data = "new"

>>> infer_maxsep(train_data, test_data, report_metrics=True, pretrained=True)

This should produce similar results (depending on the version of ESM-1b weights):


The embedding sizes for train and test: torch.Size([241025, 128]) torch.Size([392, 128])

Calculating eval distance map, between 392 test ids and 5242 train EC cluster centers

############ EC calling results using maximum separation ############

---------------------------------------------------------------------------

>>> total samples: 392 | total ec: 177

>>> precision: 0.596 | recall: 0.479 | F1: 0.497 | AUC: 0.739

---------------------------------------------------------------------------

2.2.3 Interpreting prediction result csv file

The prediction results are stored in the folder results/ with file name = test_data + infer_algo (for example, new_maxsep.csv). An example output would be:


Q9RYA6,EC:5.1.1.20/7.4553

O24527,EC:2.7.11.1/5.8561

Q5TZ07,EC:3.6.1.43/8.0610,EC:3.1.3.4/8.0627,EC:3.1.3.27/8.0728

The first column (Q9RYA6) is the ID of the enzyme; the second column (EC:5.1.1.20/7.4553) is the predicted EC number together with the pairwise distance between the cluster center of EC 5.1.1.20 and Q9RYA6. Note that in the case of enzyme Q5TZ07, three enzyme functions are predicted.
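
If you want to post-process these predictions programmatically, here is a minimal sketch of a parser for this layout. It assumes, as in the example above, a comma-separated file with no header, where the first field is the enzyme ID and each remaining field has the form EC:number/distance; the function name is ours, not part of the CLEAN package:

import csv

def parse_maxsep(path):
    # Parse a CLEAN max-separation result file into
    # {enzyme_id: [(ec_number, distance), ...]}
    predictions = {}
    with open(path) as f:
        for row in csv.reader(f):
            if not row:
                continue
            entry_id, *calls = row
            parsed = []
            for call in calls:
                ec, dist = call.split("/")
                parsed.append((ec.replace("EC:", ""), float(dist)))
            predictions[entry_id] = parsed
    return predictions

# Example: parse_maxsep("results/new_maxsep.csv")["Q5TZ07"]
# -> [('3.6.1.43', 8.061), ('3.1.3.4', 8.0627), ('3.1.3.27', 8.0728)]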

2.3 Inference with newly trained model

In addition to inference with the pretrained weights for the 70% and 100% splits, we also support inference with user-trained models. For example, if a model has been trained and saved as data/model/split10_triplet.pth, run max-separation inference with:

from CLEAN.infer import *

infer_maxsep("split10", "new", report_metrics=True,

pretrained=False, model_name="split10_triplet")

2.4 Inference on a single FASTA file

In addition to running inference on a CSV file with Entry, EC number, and Sequence columns, we also allow inference on a single FASTA file. For example, consider the simple FASTA file data/query.fasta:


>WP_063462990

LIDCNIDMTQLFAPSSSSTDATGAPQGLAKFPSLQGRAVFVTGGGSGIGAAIVAAFAE

QGARVAFVDVAREASEALAQHIADAGLPRPWWRVCDVRDVQALQACMADAAAELGSDF

AVLVNNVASDDRHTLESVTPEYYDERMAINERPAFFAIQAVVPGMRRLGAGSVINLGS

TGWQGKGTGYPCYAIAKSSVNGLTRGLAKTLGQDRIRINTVSPGWVMTERQIKLWLDA

EGEKELARNQCLPDKLRPHDIARMVLFLASDDAAMCTAQEFKVDAGWV

>WP_012434361

MSSPANANVRLADSAFARYPSLVDRTVLITGGATGIGASFVEHFAAQGARVAFFDIDA

SAGEALADELGDSKHKPLFLSCDLTDIDALQKAIADVKAALGPIQVLVNNAANDKRHT

IGEVTRESFDAGIAVNIRHQFFAAQAVMEDMKAANSGSIINLGSISWMLKNGGYPVYV

MSKSAVQGLTRGLARDLGHFNIRVNTLVPGWVMTEKQKRLWLDDAGRRSIKEGQCIDA

ELEPADLARMALFLAADDSRMITAQDIVVDGGWA

Run inference with the following command in the terminal:


python CLEAN_infer_fasta.py --fasta_data query

And the max-separation prediction will be stored in results/query_maxsep.csv.

3. Training

We provide training scripts for CLEAN models with both triplet margin and SupCon-Hard losses. SupCon-Hard loss samples multiple positives and negatives and performs better than triplet margin loss on small training datasets; however, it takes longer to train.

The triplet margin loss is given as:

$$ \mathcal{L}^{TM} = ||z_a - z_p||_2 - ||z_a - z_n||_2 + \alpha ,$$

where $z_a$ is the anchor, $z_p$ is the positive, $z_n$ is the hard-mined negative, and $\alpha$ is the margin. The SupCon-Hard loss is given as:

$$\mathcal{L}^{sup} = \sum_{e\in E} \frac{-1}{|P(e)|}\sum_{z_p \in P(e)}\log \frac{\exp (z_e \cdot z_p / \tau)}{\sum_{z_a \in A(e)} \exp (z_e \cdot z_a / \tau) } $$

where a fixed number of positives are sampled from the same EC class as the anchor, and a fixed number of negatives are hard-mined.
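
To make the formula concrete, here is a minimal, self-contained sketch of the per-anchor SupCon-Hard term, written directly from the equation above. This is an illustration only, not the repository's implementation in train-supconH.py, and it assumes A(e) is the union of the sampled positives and hard-mined negatives:

import torch

def supcon_hard_term(anchor, positives, negatives, temperature=0.1):
    # anchor:    (d,)        embedding z_e
    # positives: (n_pos, d)  embeddings P(e) sampled from the anchor's EC class
    # negatives: (n_neg, d)  hard-mined embeddings from other EC classes
    # A(e) is taken here as positives + negatives (an assumption of this sketch)
    contrast = torch.cat([positives, negatives], dim=0)
    log_denominator = torch.logsumexp(contrast @ anchor / temperature, dim=0)
    log_prob_pos = positives @ anchor / temperature - log_denominator
    # -1/|P(e)| * sum over positives; the full loss sums this term over anchors e in E
    return -log_prob_pos.mean()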

Before training, a required step is to mutate sequences with an 'orphan' EC number ('orphan' meaning the EC number has only one sequence). Since we need to sample positive sequences other than the anchor sequence, we mutate the anchor sequence and use the mutated sequences as positives. This only needs to be done ONCE for each training file. Run the following commands:

python

>>> from CLEAN.utils import mutate_single_seq_ECs, retrive_esm1b_embedding

>>> train_file = "split10"

>>> train_fasta_file = mutate_single_seq_ECs(train_file)

>>> retrive_esm1b_embedding(train_fasta_file)

Next, to speed up training, the pairwise distance matrix and embedding matrix need to be pre-computed. This also only needs to be done ONCE for each training file. Run the following commands:

python

>>> from CLEAN.utils import compute_esm_distance

>>> train_file = "split10"

>>> compute_esm_distance(train_file)

This will save the two matrices (split10.pkl and split10_esm.pkl) in the folder data/distance_map/.

3.1 Train a CLEAN model with triplet margin loss

To train a CLEAN model with triplet margin loss, taking the 10% split as an example, simply run:


python ./train-triplet.py --training_data split10 --model_name split10_triplet --epoch 2500

The model weights are saved as data/model/split10_triplet.pth; to run inference with them, see section 2.3.

We recommend different epoch numbers for training the different splits:

  • 10% split: epoch = 2000

  • 30% split: epoch = 2500

  • 50% split: epoch = 3500

  • 70% split: epoch = 5000

  • 100% split: epoch = 7000

3.2 Train a CLEAN model with SupCon-Hard loss

To train a CLEAN model with SupCon-Hard loss, taking the 10% split as an example, run:


python ./train-supconH.py --training_data split10 --model_name split10_supconH --epoch 1500 --n_pos 9 --n_neg 30 -T 0.1

We fixed the number of positives to 9, the number of negatives to 30, and the temperature to 0.1 in all of our experiments. We recommend using about 25% fewer epochs than with triplet margin loss on the same training data.

Also note that the output embedding dimension for SupCon-Hard is out_dim=256, while for triplet margin it is out_dim=128. To infer with a CLEAN-supconH model, see the notes in src/CLEAN/infer.py about rebuilding CLEAN; a rough sketch follows.
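
As a rough sketch of what rebuilding looks like (not the exact code in src/CLEAN/infer.py), assuming the LayerNormNet(hidden_dim, out_dim, device, dtype) constructor used elsewhere on this page, the same hidden dimension of 512, and the model path convention ./data/model/<model_name>.pth:

import torch
from CLEAN.model import LayerNormNet

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float32

# SupCon-Hard models output 256-dimensional embeddings instead of 128
model = LayerNormNet(512, 256, device, dtype)
# "split10_supconH" is the model_name from the training example above
checkpoint = torch.load('./data/model/split10_supconH.pth', map_location=device)
model.load_state_dict(checkpoint)
model.eval()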

4. Confidence estimate using GMM

To train an ensemble of GMMs:

python gmm.py

clean's People

Contributors

bodom0015, canallee, samarthgupta1011, tttianhao, zhoubay


clean's Issues

offline tool issue with the training

Hello,
I am currently trying to run your tool with the docker container and I am facing an issue at the training step.
Here is the command line that I use:
/shared/projects/seabioz/softwares/CLEAN/clean-1.0.1.sif python ./scripts/train-supconH.py --training_data split100 --model_name split100_supconH --epoch 4100 --n_pos 9 --n_neg 30 -T 0.1

And here is the error I get:
Traceback (most recent call last):
  File "/shared/projects/seabioz/softwares/CLEAN/./scripts/train-supconH.py", line 139, in <module>
    main()
  File "/shared/projects/seabioz/softwares/CLEAN/./scripts/train-supconH.py", line 118, in main
    train_loss = train(model, args, epoch, train_loader,
  File "/shared/projects/seabioz/softwares/CLEAN/./scripts/train-supconH.py", line 50, in train
    for batch, data in enumerate(train_loader):
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/dataloader.py", line 105, in __getitem__
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 791, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 271, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 252, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/P32143_6.pt'

It seems like a .pt file is missing, but I downloaded them with your scripts, so I don't understand what could be missing?

a bug on `random_positive` in `dataloader.py`

There is a bug in the random_positive function in dataloader.py. It occurs when there is only one protein ID in an EC class, e.g. EC:3.4.22.54, which contains only ['Q9TTH8'].

Note: I use the split10 dataset, so this EC class may be missing some sequences, but I don't think that matters for this bug.
In this case the function returns Q9TTH8_x (where x comes from random.randint(0, 9)) as the positive sequence. However, since this sequence does not exist in the data, the dataloader later fails when it tries to read it.

So I think the issue is here:

def random_positive(id, id_ec, ec_id):
    pos_ec = random.choice(id_ec[id])
    pos = id
    if len(ec_id[pos_ec]) == 1:  # this is where the error comes from, and I think if there is only one sequence in the EC then using this is a better idea
        return pos + '_' + str(random.randint(0, 9))
    while pos == id:  # also, this could be changed to a new list which doesn't contain the anchor (the id variable in this function); a single random.choice would then be more efficient
        pos = random.choice(ec_id[pos_ec])
    return pos

Everything in the comments is my opinion, and I will raise a PR soon.

problem with job at web server

Thanks for this excellent tool for structure-based function prediction. I submitted a job at the web server and received an email with the result; however, the web page cannot be refreshed. Can you help solve this problem?
Thanks a lot.

The cross-validation process details.

Hi,

Thanks for contributing the code! I was wondering about the details of your cross-validation process.

I conducted a standard K-fold split on split10.csv for cross-validation. Then I replicated the training process and trained a CLEAN model with triplet loss on the training set (in CV). The final F1 score on the validation set (in CV) is around 0.5.

Therefore, I am curious how to obtain the results shown in Figure S1. Did you use a standard random cross-validation split, or some tricks (e.g., ensuring that the EC numbers in the CV validation set have at least one sample in the CV training set)?

Meanwhile, I am also curious how you split the "understudied validation dataset" shown in Figure S2. How did you maintain the "no more than 5 times" constraint? Are there some samples used neither in training nor in inference (to get the result in Figure S2)?

Thanks in advance for your answers; I look forward to your response!

Error when inference with p-value in CPU only mode

Prompts

"RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU."

when I tried to run infer_pvalue as suggested in the 2.2.1 section of README with a CPU only installation.

Fixed by changing

checkpoint = torch.load('./data/pretrained/'+ train_data +'.pth')

to
checkpoint = torch.load('./data/pretrained/'+ train_data +'.pth', map_location=device)

The same change may also need to be applied to the lines listed below:

checkpoint = torch.load('./data/model/'+ model_name +'.pth')

emb_train = torch.load('./data/pretrained/70.pt')

emb_train = torch.load('./data/pretrained/100.pt')

checkpoint = torch.load('./data/model/'+ model_name +'.pth')

Error:

The online website does not work, and an error is reported during local training, which always says: No such file or directory: './data/esm_data/P76077_4.pt'.

Traceback (most recent call last):
  File "/home/yangshihui/CLEAN/./train-triplet.py", line 138, in <module>
    main()
  File "/home/yangshihui/CLEAN/./train-triplet.py", line 117, in main
    train_loss = train(model, args, epoch, train_loader,
  File "/home/yangshihui/CLEAN/./train-triplet.py", line 44, in train
    for batch, data in enumerate(train_loader):
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/dataloader.py", line 76, in __getitem__
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/P76077_4.pt'
Traceback (most recent call last):
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/infer.py", line 90, in infer_maxsep
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './data/pretrained/split100.pth'

training & validation details

Hello! How did you train the model with SupCon-Hard loss? Epoch numbers are suggested in the README; how did you arrive at those numbers? I couldn't find your setup for validation. Thanks!

How to set parameters?

Hello, I have a few questions about parameter settings. I would like to know how your parameters were set, such as n_pos, n_neg, and batch_size. Do I need to consider the per-class sample sizes in the dataset when setting these parameters? If my dataset has 60 categories, with the largest category containing 50,000 samples and the smallest containing 20 samples, how should n_pos, n_neg, and batch_size be set for such a dataset?

No such file or directory: './gmm_test/GMM_100_500_0.pkl'

I encountered this error when running gmm.py. I noticed that the model needs to be retrained with split100 beforehand to generate the necessary distance map. I would like to generate confidence levels for each CLEAN inference. What else is necessary to do this after running gmm.py? Thanks

on single sequence query: ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

Hello,

We are attempting to determine the lowest P value range of a single protein sequence using the conda CLEAN install. I.e., the input CSV is one sequence with an EC number and identifier. When we run this using infer_pvalue with default parameters, it calculates results but gives the following error/warning and does not print the model fit statistics (recall, precision etc):

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

It seems that the AUC cannot be calculated on only one input sequence. Does this affect the P value cutoff or model interpretation? Does the infer_pvalue function depend on multiple queries in an input file? If multiple sequences are required, how should we interpret the predictions for the one query sequence of interest?

Thanks so much for your help.

Details about data split

Hi,

Thanks for your great work and nice code.

I'm interested in your data splits, e.g. 'split10.csv' and 'split100.csv'. There are few details about how the split data were obtained in either your paper or code. I guess you preprocessed them by comparing data from SwissProt with data from your two test sets.

I'd appreciate it if you could give more details about data split either in description or code.

No such file or directory: './data/esm_data/WP_063460136.pt'

I've followed the installation steps but when running the python CLEAN_infer_fasta.py --fasta_data price command I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/WP_063460136.pt'

From what I checked WP_063460136 is the name of the first sequence in the fasta file.

What should I do?

FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/P24665_0.pt'

When running train-triplet.py or train-supconH.py I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/P24665_0.pt'

I've already retrieved all the ESM embeddings; './data/esm_data/P24665.pt' exists, but './data/esm_data/P24665_0.pt' doesn't.

I checked the code and I found this

def random_positive(id, id_ec, ec_id):
    pos_ec = random.choice(id_ec[id])
    pos = id
    if len(ec_id[pos_ec]) == 1:
        return pos + '_' + str(random.randint(0, 9))
    while pos == id:
        pos = random.choice(ec_id[pos_ec])
    return pos

If I understand correctly, the bug comes from return pos + '_' + str(random.randint(0, 9)); I do not understand why the + '_' + str(random.randint(0, 9)) is needed. It seems to just create a filename that doesn't exist. Could you explain?

Cannot reproduce results in README

Hi,

Thanks so much for this work, and for making the repo super nice and straightforward!

Before evaluating the model on a separate use case I have, I wanted to make sure I didn't do anything wrong when setting up the project, so I've been trying to evaluate the model per the README to ensure I get consistent results. However, I obtain very poor performance when calling inference.py on the test sets provided (price, new), so it would be great to understand what I've done wrong.

#### results on new

The embedding sizes for train and test: torch.Size([241025, 128]) torch.Size([392, 128])
100%|███████████████████████████████████████████████| 5242/5242 [00:00<00:00, 7713.09it/s]
Calculating eval distance map, between 392 test ids and 5242 train EC cluster centers
392it [00:04, 93.33it/s] 
############ EC calling results using maximum separation ############
---------------------------------------------------------------------------
>>> total samples: 392 | total ec: 177 
>>> precision: 0.0135 | recall: 0.0139| F1: 0.0136 | AUC: 0.507 
---------------------------------------------------------------------------



#### results on price

The embedding sizes for train and test: torch.Size([241025, 128]) torch.Size([149, 128])
100%|███████████████████████████████████████████████| 5242/5242 [00:00<00:00, 8758.51it/s]
Calculating eval distance map, between 149 test ids and 5242 train EC cluster centers
149it [00:00, 155.21it/s]
############ EC calling results using maximum separation ############
---------------------------------------------------------------------------
>>> total samples: 149 | total ec: 56 
>>> precision: 0.0 | recall: 0.0| F1: 0.0 | AUC: 0.5 
---------------------------------------------------------------------------

From my attempts to debug, this doesn't seem to be an issue of models/data processing. For instance, I compared the embedding of the first cluster (EC 2.7.10.2) from data/pretrained/100.pt with ones I manually recomputed, and obtained the same values (up to some numerical error). Specifically, I made a FASTA file using the sequences that are in EC 2.7.10.2, extracted their embeddings, then passed them through the pretrained model (data/pretrained/split100.pth). I compared these with what we get from calling get_cluster_center on the precomputed tensor. These appeared to be consistent. So, if the embeddings are calculated in a consistent manner, I'm not sure why the predictions are turning out to be wrong.

Python 3.10.4 
#### recalculate the first EC cluster embeddings

>>> import os, torch
>>> from CLEAN.utils import * 
>>> from CLEAN.distance_map import *
>>> from CLEAN.model import LayerNormNet 

>>> train_data = "split100"
>>> train_csv = pandas.read_csv('data/split100.csv', delimiter='\t')
>>> id_ec_train, ec_id_dict_train = get_ec_id_dict('data/split100.csv')
>>> list(ec_id_dict_train.keys())[0]
2.7.10.2

#### make fasta of sequences in 2.7.10.2
>>> with open("data/ec_2.7.10.2.fasta", "w") as f:           
>>>    for u in ec_id_dict_train['2.7.10.2']:
>>>        sequence = train_csv[train_csv['Entry'] == u].iloc[0].Sequence                                                                                                                                                            
>>>        f.write(f">{u}\n")                                                                                                                                                                                        
>>>        f.write(f"{sequence}\n")

#### calculate ESM embeddings
>>> retrive_esm1b_embedding('ec_2.7.10.2')                 

#### load the split100 model weights
>>> device = torch.device("cpu")
>>> dtype = torch.float32 
>>> model = LayerNormNet(512, 128, device, dtype)
>>> checkpoint = torch.load('./data/pretrained/'+ train_data +'.pth', map_location="cpu")                                                                                                                                
>>> model.load_state_dict(checkpoint) 
<All keys matched successfully>
>>> model.eval()   

#### calculate model embeddings
>>> esm_to_cat = [load_esm(id) for id in ec_id_dict_train['2.7.10.2']]
>>> esm_emb = torch.cat(esm_to_cat)
>>> model_emb = model(esm_emb)
>>> model_emb.mean(0)
tensor([ 0.5685, -0.2730,  1.3413, -0.0456,  0.5519, -0.5602,  0.4451,  0.3555,
        -0.3991,  0.8149, -0.7487,  0.8769, -0.0774, -1.2195, -0.3510,  0.3407,
         0.6934, -0.4897, -0.6785,  0.4822, -0.4403,  0.1503,  0.6215, -0.2650,
         1.0949,  0.4402,  0.4229,  1.4833,  0.2911, -2.0526, -1.0108,  0.8270,
         0.0103, -0.4964,  0.4265,  0.6308, -0.5499, -1.2762, -0.9738,  0.3144,
        -0.9146,  0.4415,  0.2395,  0.2096,  0.0948, -0.6719,  0.1269, -0.6432,
         1.3322,  0.8958,  0.2907,  1.5833,  1.6047,  0.0428, -0.1019, -0.1428,
         0.6814, -0.9868,  0.4500,  0.1788, -0.3415,  1.0227,  0.2723,  0.2320,
         0.5672, -0.8140, -0.4842,  0.3829, -1.4036, -0.3750, -2.0640, -0.9057,
        -1.1886,  0.3434, -1.0756, -1.4245,  1.1374, -0.1440, -0.1107, -2.4469,
         0.1129, -0.2940,  0.3541,  0.9514, -0.1509, -1.1097, -0.3776,  0.0645,
         0.1615, -0.3648,  0.8489, -0.1049,  0.1044, -0.9301,  0.1868,  0.8924,
         0.1700, -1.5468,  0.9586, -1.1084,  1.4576,  1.4288,  0.3229,  0.3504,
        -0.1556, -0.0749,  0.1157,  0.2287, -0.2752,  1.2659,  0.7747,  0.2845,
         0.5852, -0.9135,  1.0046,  1.1457, -0.8711,  0.5439,  0.4540,  0.0190,
        -0.2778,  1.8937, -1.7569, -1.3366, -0.5689, -1.9689,  0.2271, -0.3354],
       device='cuda:0', grad_fn=<MeanBackward1>)

#### compare with precomputed embeddings
>>> emb_train = torch.load('./data/pretrained/100.pt', map_location=device)
>>> cluster_center_model = get_cluster_center(emb_train, ec_id_dict_train)
>>> cluster_center_model['2.7.10.2']
tensor([ 0.5684, -0.2730,  1.3413, -0.0455,  0.5519, -0.5602,  0.4452,  0.3555,
        -0.3990,  0.8149, -0.7487,  0.8769, -0.0774, -1.2195, -0.3510,  0.3407,
         0.6935, -0.4897, -0.6786,  0.4822, -0.4402,  0.1503,  0.6215, -0.2650,
         1.0949,  0.4402,  0.4229,  1.4833,  0.2911, -2.0526, -1.0108,  0.8270,
         0.0103, -0.4963,  0.4265,  0.6308, -0.5499, -1.2763, -0.9737,  0.3145,
        -0.9146,  0.4415,  0.2394,  0.2096,  0.0948, -0.6719,  0.1269, -0.6432,
         1.3322,  0.8958,  0.2907,  1.5833,  1.6047,  0.0427, -0.1019, -0.1428,
         0.6814, -0.9868,  0.4500,  0.1787, -0.3415,  1.0227,  0.2722,  0.2320,
         0.5672, -0.8140, -0.4843,  0.3830, -1.4036, -0.3750, -2.0639, -0.9057,
        -1.1886,  0.3434, -1.0756, -1.4245,  1.1373, -0.1440, -0.1108, -2.4469,
         0.1129, -0.2940,  0.3541,  0.9514, -0.1508, -1.1097, -0.3776,  0.0645,
         0.1616, -0.3648,  0.8489, -0.1049,  0.1043, -0.9301,  0.1868,  0.8923,
         0.1699, -1.5468,  0.9585, -1.1083,  1.4576,  1.4288,  0.3229,  0.3504,
        -0.1556, -0.0749,  0.1157,  0.2288, -0.2752,  1.2659,  0.7747,  0.2845,
         0.5853, -0.9134,  1.0046,  1.1458, -0.8711,  0.5439,  0.4540,  0.0190,
        -0.2778,  1.8937, -1.7569, -1.3366, -0.5689, -1.9689,  0.2270, -0.3354])

Would greatly appreciate any help wherever you think I made a mistake.

Thank you!!

Annotating functions of proteins having more than 1022 amino acid residues

Hi,
I tried using the web version (https://clean.platform.moleculemaker.org/configuration) for predicting EC numbers from protein sequences, but it throws an error whenever the protein sequence length is greater than 1022. I am planning to install the GitHub package and am wondering whether it can predict the function of proteins with sequence length greater than 1022.

Please let me know,

Thanks,
Sourav Dutta

bad gateway in webserver

Hi, I was trying to use the web server but it gives a bad gateway error.
Do you know if it is down for good, or will this problem be fixed in the meantime? Thank you very much.

Just one EC

Thank you for providing this excellent tool. Could you guide us on how to specifically obtain annotations for halogenases from metagenomes using it?

Question to protein language model

Hi, congratulations! Protein language models have played a key role in protein feature extraction. Since none of the methods you compared against use a language model, is that the main reason your method performs well? I think supervised contrastive learning should not be so different from supervised learning. Our lab has also developed a lightweight protein language model (ProtFlash), and we would like to collaborate to test this approach if given the opportunity.

Set Up Questions

Reading through this protocol, I am extremely excited about this tool and am working on getting it set up. First off, I am looking for some clarification on the required steps in Part 1 of the README. If we are doing quickstart 1.2, is section 1.3 also required? Some of the code overlaps, so I assume these are independent of each other and it is an either/or.

Additionally, I need this to be fully set up on a Linux system in conda with cpuonly PyTorch, but I can't determine from the protocol whether local installation is required for some steps or whether the functionality can exist entirely in a conda environment.

Finally, Part 2 is inference and Part 3 is training. If all we are looking to do is inference, similar to what is available on the web-server version of CLEAN, is Part 3 required to continue using the inference tools? Since Part 3 comes after inference and I only need the functionality of what is available on the web server, I just need to scale up to a large dataset.

Thank you very much in advance; I'm very excited to get this set up and appreciate any help you can offer.

Using local clean for large-scale data prediction

Hello, thank you for your excellent work. I now want to predict EC numbers for large-scale data locally, with approximately 100 million sequences. How can I improve the speed? I used it on a cluster and tested it with 10,000 sequences, which took about one hour on a single CPU. I tried splitting the file into 200,000 lines per file, but except for the first file, the prediction speed of the remaining files slowed down significantly.

Best,
SJY

There is a problem when we run the command "python CLEAN_infer_fasta.py --fasta_data price"

Traceback (most recent call last):
  File "/share/database/CLEAN/CLEAN_infer_fasta.py", line 30, in <module>
    main()
  File "/share/database/CLEAN/CLEAN_infer_fasta.py", line 24, in main
    infer_maxsep(train_data, test_data, report_metrics=False, pretrained=True)
  File "/share/soft/miniconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/infer.py", line 92, in infer_maxsep
Exception: No pretrained weights for this training data

We could not find the pretrained dataset; can you provide a link to download it?

Duplicated amino acid sequece in datasets

There exist duplicated amino acid sequences in split100.csv and new.csv.


The feature extraction script of ESM provided by Facebook does not allow duplicated sequences. However, the authors of this repository use the ESM feature extraction script, which cannot work properly when duplicates exist. Please provide a detailed procedure for how the features of the sequences in the NEW-392 and Split-100 datasets were obtained.

There are almost 30k duplicates in split100.csv, but no duplicates in split70.csv, which is also strange. If the sequences were duplicated intentionally for contrastive learning, why are there no duplicates in split70.csv?

Some confusion about the creation of positive and negative sample pairs

Hi! As you described in the article, the anchor sequence and the positive sequence are derived from the same EC number. However, when the torch.utils.data.Dataset class in your code creates the dataset, the positive sequence is randomly selected from one of the EC numbers that belong to the anchor sequence, which may lead to inconsistent EC numbers between the anchor sequence and the positive sequence (due to the function random_positive()). I wonder if there is a problem with my understanding?

class Triplet_dataset_with_mine_EC(torch.utils.data.Dataset):

    def __init__(self, id_ec, ec_id, mine_neg):
        self.id_ec = id_ec
        self.ec_id = ec_id
        self.full_list = []
        self.mine_neg = mine_neg
        for ec in ec_id.keys():
            if '-' not in ec:
                self.full_list.append(ec)

    def __len__(self):
        return len(self.full_list)

    def __getitem__(self, index):
        anchor_ec = self.full_list[index]
        anchor = random.choice(self.ec_id[anchor_ec])
        pos = random_positive(anchor, self.id_ec, self.ec_id)
        neg = mine_negative(anchor, self.id_ec, self.ec_id, self.mine_neg)
        a = torch.load('./data/esm_data/' + anchor + '.pt')
        p = torch.load('./data/esm_data/' + pos + '.pt')
        n = torch.load('./data/esm_data/' + neg + '.pt')
        return format_esm(a), format_esm(p), format_esm(n)

def random_positive(id, id_ec, ec_id):
    pos_ec = random.choice(id_ec[id])
    pos = id
    if len(ec_id[pos_ec]) == 1:
        return pos + '_' + str(random.randint(0, 9))
    while pos == id:
        pos = random.choice(ec_id[pos_ec])
    return pos

Request to clarify license

Hi,
Could you please clarify the license for this project eg if it’s available for commercial use? Thank you for sharing this fascinating research

FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/D4AXL1.pt'

Thank you for your work! It has really helped me a lot. I tried to train the model following the manual, but I encountered an error, as described below.
When I execute compute_esm_distance(train_file), the error "FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/D4AXL1.pt'" is raised, presumably because only the mutated *.pt files of the orphan sequences are generated in esm_data, and the *.pt files of the non-orphan sequences are missing. How can I solve this problem? Thank you very much.

data clustering and split

  1. I clustered 'split100.csv' with MMseqs using a 0.5 identity condition, and there were 32,283 clusters,
    whereas 'split50.csv' resulted in 29,942, showing a difference.
    (mmseqs cluster --min-seq-id 0.5)
    Was there any additional data removal after the clustering?
    If the clustering conditions were different, please let me know how it was done.

  2. Can you provide specific details on how you split the data for cross-validation?
    If you did a random split, could you inform me about the random state used?

  3. What is the reason for not selecting models using a separate validation set or
    utilizing models from cross-validation for benchmark testing?

Questions about querying unclassified proteins

Hi - I have a few questions about inferring EC numbers for unclassified proteins or those not found in UniProt/SwissProt. Can I pass in all protein FASTAs from my data, or can I only pass in enzymes? In other words, has CLEAN been designed not to give results for proteins that are not enzymes? I also noticed that the "Enzyme IDs" in split100.csv all correspond to accession numbers. Can I make up a unique ID for each uncategorized protein, or does CLEAN only work on proteins found in UniProt? If so, would passing in KEGG accessions (or those from other databases) as the enzyme IDs work?

Attempting to deserialize object on a CUDA

Hello, I have recently come across the following message when using the infer_pvalue option:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/infer.py", line 28, in infer_pvalue
  File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 1046, in _load
    result = unpickler.load()
  File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 1016, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 1001, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 176, in default_restore_location
    result = fn(storage, location)
  File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 152, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 136, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I have been trying to fix this, but so far I have not been able to. Could you please help with this?
Cheers

Changes in dataset

Hi, thanks to your work, we are also conducting research on EC-number prediction using the dataset you provided in the data folder.

However, we've noticed that a commit made in January changed the datasets to an older version.

Can you explain what changes were made to the dataset in that commit, and why?

Cheers,
Doyeong Hwang

ValueError: The number of weights does not match the population

Hi! Thanks for the great work. When I train the CLEAN model using the code provided, I get this error:

File "/Users/Zachary/opt/anaconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/dataloader.py", line 39, in mine_negative
File "/Users/Zachary/opt/anaconda3/envs/clean/lib/python3.10/random.py", line 537, in choices
    raise ValueError('Total of weights must be finite')

So I added the following snippet to the mine_hard_negative function in dataloader.py:

valid_freq = []
for value in freq:
    if math.isfinite(value) and not math.isnan(value):
        valid_freq.append(value)
    else:
        # Replace invalid value with a very small value to minimize its effect
        valid_freq.append(1e-8)

normalized_freq = [i/sum(valid_freq) for i in valid_freq]

to replace the original line

normalized_freq = [i/sum(freq) for i in freq]

I just want to ask: is this choice reasonable? Many thanks.

Some confusion about Fig 2.F in the paper

Hello,

I have been studying the original paper and came across Fig 2. F, which presents the comparison of CLEAN results with other tools, including BLASTp and ProtInfer. I found it intriguing that the prediction accuracy of BLASTp for EC X.X.X.X is higher than that of EC X.X.X.-, and a similar pattern is observed for ProtInfer. Intuitively, one might expect that if the tool can predict EC X.X.X.X correctly, it should also be able to predict EC X.X.X.- accurately.

Could you please provide some insights or clarification on this observation?

Testing 1.2 Quickstart

  1. There is no data/pretrained directory. I recommend creating the pretrained folder and just dropping a .gitkeep in it.
    • Maybe this is unnecessary if point 2 is clarified.
  2. When you download the data, you get a folder CLEAN_pretrained which then needs to be renamed to pretrained. I think changing CLEAN_pretrained to pretrained would be the easiest fix, since this is what the src is referencing.

No such file or directory: 'results/inputs/init_maxsep.csv'

Hi,
Interesting work!
I am trying to install and use this software. I have created all the embeddings, and I am testing it with some sample data that came with the software (init).
However, I do get this error (please refer to the screenshot).
Could you please help me with this?
