nstrodt / udsmprot

Protein sequence classification with self-supervised pretraining

License: Other

Shell 8.26% Python 91.74%

machine-learning deep-learning language-modeling protein-classification

udsmprot's Issues

Recreate same EC dataset

Hello,
I am trying to replicate the EC prediction datasets (EC40 and EC50) from your UDSMProt paper, but I have run into some difficulties.

First, in your script code/create_datasets.sh, line 27 reads:
python proteomics_preprocessing.py clas_ec --drop_ec7=True --working_folder=datasets/clas_ec/clas_ec_ec50_level1 --pretrained_folder=datasets/lm/lm_sprot_uniref --level=2 --include_NoEC=False --dataset="uniprot" --sampling_method_train=1 --sampling_method_valtest=3 --ignore_pretrained_clusters=True --sampling_ratio=[.8,.1,.1] --save_prev_ids=True
I think it should be:
python proteomics_preprocessing.py clas_ec --drop_ec7=True --working_folder=datasets/clas_ec/clas_ec_ec50_level2 --pretrained_folder=datasets/lm/lm_sprot_uniref --level=2 --include_NoEC=False --dataset="uniprot" --sampling_method_train=1 --sampling_method_valtest=3 --ignore_pretrained_clusters=True --sampling_ratio=[.8,.1,.1] --save_prev_ids=True
The working folder should be named clas_ec_ec50_level2, not clas_ec_ec50_level1.

Secondly, when I run the script create_datasets.sh, I get an error that says:
../tmp_data/cdhit04_uniprot_sprot_2017_03.pkl not found.
I think this may be because the link you provide contains two files built from two different versions of Swiss-Prot: the file "cdhit04_uniprot_sprot_2016_07.pkl", which uses the 07/2016 version, and the file "uniref50_2017_03_uniprot_sprot_2017_03.pkl", which uses the 03/2017 version.

So I am a little confused: I don't know whether I need to download the 03/2017 or the 07/2016 Swiss-Prot release to use with the files in your link in order to replicate exactly the same dataset as in your UDSMProt paper. And even if I have the correct cdhit04 file, will I get exactly the same test dataset? If not, it would be kind of you to provide a link to download the exact train/dev/test datasets for ECpred.

Thanks a lot in advance
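
For anyone checking the remaining lines of create_datasets.sh for the same kind of naming slip, here is a minimal sketch that compares the level in the --working_folder name with the --level flag. The helper name folder_level_mismatch is hypothetical and not part of the repository, and it assumes the folder name ends in "level" followed by a number, as in the two commands quoted above.

```python
import re


def folder_level_mismatch(command: str):
    """Return (folder_level, flag_level) if the two disagree, else None.

    Hypothetical helper: it only understands commands that carry both a
    --working_folder=...levelN argument and a --level=N argument.
    """
    folder = re.search(r"--working_folder=\S*?level(\d+)", command)
    flag = re.search(r"--level=(\d+)", command)
    if folder is None or flag is None:
        return None  # nothing to compare on this line
    folder_level = int(folder.group(1))
    flag_level = int(flag.group(1))
    if folder_level == flag_level:
        return None
    return (folder_level, flag_level)


# Shortened versions of the two commands from the issue above.
reported = ("python proteomics_preprocessing.py clas_ec "
            "--working_folder=datasets/clas_ec/clas_ec_ec50_level1 --level=2")
proposed = ("python proteomics_preprocessing.py clas_ec "
            "--working_folder=datasets/clas_ec/clas_ec_ec50_level2 --level=2")

print(folder_level_mismatch(reported))
print(folder_level_mismatch(proposed))
```

Running it over every line of the script would flag only the commands where the folder name and the flag disagree; whether such a mismatch is a real bug or just a naming choice is for the authors to confirm.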

BPE does not help to improve performance?

Very detailed analysis!

It seems that you have tried BPE encoding (I saw the BPE part in your data-processing scripts). I am just curious: was it helpful?

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions.

Hi,

I tried to replicate the data using ./create_datasets.sh but I received the following error:

File "/UDSMProt/code/utils/dataset_utils.py", line 268, in prepare_dataset
tok_num = np.array([[tok_stoi[o] for o in p] for p in tok])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (556825,) + inhomogeneous part.

Could you please help me fix it? Thanks.
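
A note on the likely cause: since NumPy 1.24, calling np.array on a list of lists with different lengths raises this ValueError instead of silently building an object array, and protein sequences naturally have different lengths. The sketch below shows one possible workaround, building the object array explicitly. The helper name to_ragged_array and the toy tok and tok_stoi values are assumptions for illustration; whether the rest of prepare_dataset accepts the result has not been verified here. Pinning an older NumPy release is another option.

```python
import numpy as np


def to_ragged_array(rows):
    """Pack variable-length rows into a 1-D object array, one list per row.

    Hypothetical helper: it mimics what np.array(rows) produced for ragged
    input before NumPy 1.24, without relying on the removed behaviour.
    """
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = list(row)  # each element holds a whole token-id list
    return out


# Toy stand-ins for the tokenised sequences and the token-to-index map.
tok = ["MKV", "MKVLA"]
tok_stoi = {"M": 0, "K": 1, "V": 2, "L": 3, "A": 4}

tok_num = to_ragged_array([[tok_stoi[o] for o in p] for p in tok])
print(tok_num.shape, tok_num.dtype)
```

Filling the array row by row also keeps the result one-dimensional when all sequences happen to have the same length, which np.array(..., dtype=object) would not.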

Using new dataset for classification

Hi, thank you for the amazing work!
I am just curious: if I have my own dataset and would like to use your model architecture for protein classification, what would be the best way to do that?

Also, I am not able to find, anywhere in the repository, the enzyme classification datasets that are referenced in the code:
path_ec_knna = git_data_path/"suppa.txt"
path_ec_knnb = git_data_path/"suppb.txt"

Thanks.

Error when trying to run the benchmarks

Hello, when trying to run the benchmarks from the Jupyter notebook, which for various reasons I had to port to a local script, I received the following error message:
2023-12-23 18:19:26.336247: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 18:19:26.359764: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-23 18:19:26.359816: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-23 18:19:26.360542: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-23 18:19:26.364383: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 18:19:26.364542: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-23 18:19:28.150936: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
14945 training set records, 1661 validation set records, 4152 test set records.
Traceback (most recent call last):
File "/home/gradwan/protein_bert/sally1.py", line 37, in <module>
pretrained_model_generator, input_encoder = load_pretrained_model()
File "/home/gradwan/protein_bert/proteinbert/existing_model_loading.py", line 54, in load_pretrained_model
return load_pretrained_model_from_dump(dump_file_path, create_model_function, create_model_kwargs = create_model_kwargs, optimizer_class = optimizer_class, lr = lr,
File "/home/gradwan/protein_bert/proteinbert/model_generation.py", line 159, in load_pretrained_model_from_dump
n_annotations, model_weights, optimizer_weights = pickle.load(f)
_pickle.UnpicklingError: invalid load key, 'v'.

Could you please help? Many thanks!
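
A note on this error: the traceback paths point to the proteinbert package rather than to UDSMProt code. "invalid load key, 'v'" means the file handed to pickle.load starts with the character "v", so it is not a binary pickle. One common cause, though only a guess here, is that the dump file is a Git LFS pointer (a small text file beginning with "version https://git-lfs...") rather than the downloaded weights. The sketch below checks the first bytes of a file; the function name describe_dump is hypothetical, and the demo writes its own temporary files instead of touching a real model dump.

```python
import os
import pickle
import tempfile

# First line of a Git LFS pointer file.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def describe_dump(path):
    """Guess what kind of file `path` is from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(LFS_PREFIX):
        return "git-lfs pointer"
    if head[:1] == b"\x80":  # pickle protocol 2 and later start with 0x80
        return "binary pickle"
    return "unknown"


# Demo on two temporary files: a fake LFS pointer and a real pickle.
with tempfile.TemporaryDirectory() as tmp:
    pointer_path = os.path.join(tmp, "pointer.pkl")
    with open(pointer_path, "wb") as f:
        f.write(LFS_PREFIX + b"\noid sha256:0000\nsize 1\n")

    pickle_path = os.path.join(tmp, "real.pkl")
    with open(pickle_path, "wb") as f:
        pickle.dump({"weights": [1, 2, 3]}, f)

    pointer_kind = describe_dump(pointer_path)
    pickle_kind = describe_dump(pickle_path)

print(pointer_kind, pickle_kind)
```

If the real dump file turns out to be a pointer, fetching the actual file (for example with git lfs pull, or by downloading it again) would be the next thing to try.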

incompatible packages in the proteomics.yml

Hi,
I'm trying to install UDSMProt via conda env create -f proteomics.yml, but conda reports incompatible packages. Could you provide an updated proteomics.yml file?
Thanks!
