nstrodt / udsmprot
Protein sequence classification with self-supervised pretraining
License: Other
Hello,
I am trying to replicate the EC prediction datasets (EC40 and EC50) from your UDSMProt paper, but I have run into some difficulties.
First, line 27 of your script code/create_datasets.sh reads:
python proteomics_preprocessing.py clas_ec --drop_ec7=True --working_folder=datasets/clas_ec/clas_ec_ec50_level1 --pretrained_folder=datasets/lm/lm_sprot_uniref --level=2 --include_NoEC=False --dataset="uniprot" --sampling_method_train=1 --sampling_method_valtest=3 --ignore_pretrained_clusters=True --sampling_ratio=[.8,.1,.1] --save_prev_ids=True
I think it should be:
python proteomics_preprocessing.py clas_ec --drop_ec7=True --working_folder=datasets/clas_ec/clas_ec_ec50_level2 --pretrained_folder=datasets/lm/lm_sprot_uniref --level=2 --include_NoEC=False --dataset="uniprot" --sampling_method_train=1 --sampling_method_valtest=3 --ignore_pretrained_clusters=True --sampling_ratio=[.8,.1,.1] --save_prev_ids=True
The working folder should be named clas_ec_ec50_level2, not clas_ec_ec50_level1, because the command uses --level=2.
Secondly, when I run the script create_datasets.sh, I get this error:
../tmp_data/cdhit04_uniprot_sprot_2017_03.pkl not found.
I think this may be because the link you provide contains two files built from two different Swiss-Prot releases: "cdhit04_uniprot_sprot_2016_07.pkl", which uses the 07/2016 release, and "uniref50_2017_03_uniprot_sprot_2017_03.pkl", which uses the 03/2017 release.
So I am a little confused: should I download the 03/2017 or the 07/2016 Swiss-Prot release to use with the files from your link, in order to replicate exactly the dataset from your UDSMProt paper? And even with the correct cdhit04 file, will I get exactly the same test set? If not, it would be kind of you to provide a link to the exact train/dev/test split for EC prediction.
Thanks a lot in advance
Very detailed analysis!
It seems that you have tried BPE encoding (I saw a BPE part in your data-processing scripts). I am curious: was it helpful?
Hi,
I tried to replicate the data using ./create_datasets.sh but I received the following error:
File "/UDSMProt/code/utils/dataset_utils.py", line 268, in prepare_dataset
tok_num = np.array([[tok_stoi[o] for o in p] for p in tok])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (556825,) + inhomogeneous part.
Could you please help me fix it? Thanks.
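For context, this is the error NumPy 1.24 and later raise when `np.array` is given sequences of different lengths; older releases only emitted a deprecation warning and built an object array. The following is a minimal sketch, not the repository's own fix, of building the ragged token array explicitly. The helper name `to_ragged_array` and the toy `tok_stoi` and `tok` values are made up for illustration; only the expression on line 268 comes from the traceback.

```python
import numpy as np

def to_ragged_array(sequences):
    """Pack variable-length sequences into a 1-D object array, one entry per sequence."""
    out = np.empty(len(sequences), dtype=object)
    for i, seq in enumerate(sequences):
        # Each slot holds its own integer array, so lengths are free to differ.
        out[i] = np.array(seq, dtype=np.int64)
    return out

# Toy stand-ins for the real vocabulary and tokenized proteins (hypothetical values).
tok_stoi = {"A": 0, "C": 1, "D": 2}
tok = [["A", "C"], ["D"], ["A", "C", "D"]]

# Same comprehension as in the traceback, wrapped in the helper instead of np.array.
tok_num = to_ragged_array([[tok_stoi[o] for o in p] for p in tok])
print(tok_num.shape)  # (3,)
```

Filling a preallocated object array keeps the result one-dimensional even when all sequences happen to share a length, which `np.array(..., dtype=object)` alone does not guarantee. Pinning NumPy to a release older than 1.24 may also avoid the error, at the cost of staying on an old version.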
Hi, thank you for the amazing work!
If I have my own dataset and would like to use your model architecture for protein classification, what would be the best way to do that?
Also, I cannot find in the repository the enzyme classification datasets that are referenced in the code:
path_ec_knna = git_data_path/"suppa.txt"
path_ec_knnb = git_data_path/"suppb.txt"
Thanks.
Hello, when trying to run the benchmarks from the Jupyter notebook (which, for various reasons, I had to port to a local script), I received the following error message:
2023-12-23 18:19:26.336247: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 18:19:26.359764: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-23 18:19:26.359816: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-23 18:19:26.360542: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-23 18:19:26.364383: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-23 18:19:26.364542: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-23 18:19:28.150936: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
14945 training set records, 1661 validation set records, 4152 test set records.
Traceback (most recent call last):
File "/home/gradwan/protein_bert/sally1.py", line 37, in <module>
pretrained_model_generator, input_encoder = load_pretrained_model()
File "/home/gradwan/protein_bert/proteinbert/existing_model_loading.py", line 54, in load_pretrained_model
return load_pretrained_model_from_dump(dump_file_path, create_model_function, create_model_kwargs = create_model_kwargs, optimizer_class = optimizer_class, lr = lr,
File "/home/gradwan/protein_bert/proteinbert/model_generation.py", line 159, in load_pretrained_model_from_dump
n_annotations, model_weights, optimizer_weights = pickle.load(f)
_pickle.UnpicklingError: invalid load key, 'v'.
Could you please help? Many thanks!
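`invalid load key, 'v'` means the first byte of the file is the letter `v`, so the file being loaded is not a pickle. One common cause, offered here as a guess rather than a diagnosis, is that the download left a small text file in place of the model dump, for example a Git LFS pointer, which begins with `version https://git-lfs`. The sketch below inspects the first bytes of a file before unpickling. The function name `inspect_dump` and the temporary demo files are hypothetical and not part of any library.

```python
import pickle
import tempfile
from pathlib import Path

LFS_PREFIX = b"version https://git-lfs"

def inspect_dump(path):
    """Return a short label describing what the first bytes of the file look like."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_PREFIX))
    if head.startswith(LFS_PREFIX):
        return "git-lfs-pointer"
    if head[:1] == b"\x80":  # pickle protocol 2 and later start with this opcode
        return "pickle"
    return "unknown"

# Self-contained demonstration with two temporary files (contents are placeholders).
tmp = Path(tempfile.mkdtemp())
pointer = tmp / "pointer.pkl"
pointer.write_bytes(b"version https://git-lfs.github.com/spec/v1\nsize 123\n")
real = tmp / "real.pkl"
real.write_bytes(pickle.dumps({"weights": [1, 2, 3]}))

print(inspect_dump(pointer))  # git-lfs-pointer
print(inspect_dump(real))     # pickle
```

Running `inspect_dump` on the dump file that `load_pretrained_model` tries to open would show whether it is a real pickle. If the label is `git-lfs-pointer` or `unknown`, deleting the file and downloading the model dump again is a reasonable next step.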
Hi,
I'm trying to install UDSMProt with conda env create -f proteomics.yml, but incompatible packages were found. Could you provide an updated proteomics.yml file?
Thanks!