
embeddings_reproduction's Introduction

Code to reproduce the paper Learned Protein Embeddings for Machine Learning.

Installation

embeddings_reproduction can be installed with pip from the command line using the following command:

$ pip install git+https://github.com/fhalab/embeddings_reproduction.git

It can also be installed in editable mode (-e) from the source with:

$ git clone https://github.com/fhalab/embeddings_reproduction.git
$ cd embeddings_reproduction
$ pip install -e .

The second option may be necessary depending on how your computer handles Git LFS: some of the files are large, so the connection can time out during a direct pip install.

Computing Environment

This was originally developed using Anaconda Python 3.5 and the following packages and versions:

gensim==1.0.1
numpy==1.13.1
pandas==0.20.3
scipy==0.19.1
scikit-learn==0.19.0
matplotlib==2.0.2
seaborn==0.8.1

File structure

The repository is divided into code, inputs, and outputs.

  • inputs contains all the unlabeled sequences used to build docvec models, the labeled sequences used to build Gaussian process regression models, and the AAIndex, ProFET, and one-hot encodings of the labeled sequences.
  • code contains Python implementations of Gaussian process regression and the mismatch string kernel, in addition to Jupyter notebooks that reproduce the analyses in the paper.
  • outputs contains all the embeddings produced during the analysis, along with CSVs storing the results of the cross-validation over embedding hyperparameters, the negative controls, and the results of varying the embedding dimension or the number of unlabeled sequences.

Note that while code to train docvec models is provided, the actual docvec models produced by gensim are not included in the repository because they are too large. They are freely available at http://cheme.caltech.edu/~kkyang/.

Inferring embeddings using a pretrained model

To infer embeddings, you need a model and all its associated files, plus an iterable of sequences. For example, to infer embeddings using original_5_7 (no randomization, k=5, w=7):

  1. Download original_5_7.pkl, original_5_7.pkl.docvecs.doctag_syn0.npy, original_5_7.pkl.syn1neg.npy, and original_5_7.pkl.wv.syn0.npy. Make sure they are all in the same directory.
  2. After installing the embeddings_reproduction package, and assuming the models are in the working directory:

from embeddings_reproduction import embedding_tools

seqs = ['MKTAYIAKQR', 'MLSDEDFKAV']  # placeholder sequences; any iterable of strings works
embeds = embedding_tools.get_embeddings_new('original_5_7.pkl', seqs, k=5, overlap=False)

The choice of pretrained model should be treated as a hyperparameter and chosen using validation.
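As a rough illustration, that selection can be an ordinary cross-validation loop. The following is a minimal sketch, not part of the package: the candidate file names are hypothetical, and seqs plus a matching label vector y are assumed to be defined already.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from embeddings_reproduction import embedding_tools

# Hypothetical candidates; k must match the k each model was trained with.
candidates = [('original_5_7.pkl', 5), ('original_3_7.pkl', 3)]
best_model, best_score = None, -np.inf
for model_file, k in candidates:
    # Embed the labeled sequences with this candidate model.
    X = np.asarray(embedding_tools.get_embeddings_new(model_file, seqs, k=k, overlap=False))
    score = cross_val_score(Ridge(), X, y, cv=5).mean()
    if score > best_score:
        best_model, best_score = model_file, score
print('Selected:', best_model)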

embeddings_reproduction's People

Contributors

cthoyt, yangkky


embeddings_reproduction's Issues

An error in visualize page

Hi,
when I run the whole script I get this error message:
in plot_ChRs()
4 df = pd.read_csv('../inputs/localization.txt')
5 with open('../inputs/localization_seq.pkl', 'rb') as f:
----> 6 X_1, terms = pickle.load(f)
7 X_p = pd.read_csv('../inputs/localization_profet.tsv', delimiter='\t')
8 X_p.index = X_p['name']

UnpicklingError: invalid load key, 'v'

Error in train_docvec_models.ipynb

I tried to recreate the original doc2vec models in train_docvec_models.ipynb, but ran into the following error at "model.build_vocab(documents)" when using "merge=True" in the kmer_hypers:

TypeError: unhashable type: 'list'

Do you have any suggestions? Thanks!

'Doc2Vec' object has no attribute 'running_training_loss'

from embeddings_reproduction import embedding_tools
embeds = embedding_tools.get_embeddings_new(['ABCFFFFFFFFFFFF','EFGHQWERRTTUIIO'], seqs, k=5, overlap=False)
I get the following error:
'Doc2Vec' object has no attribute 'running_training_loss'

UnpicklingError: invalid load key, 'v'.

Hello,
many thanks for the repository.

When I run test_predictions, I get the following errors:

UnpicklingError Traceback (most recent call last)
in
1 with open('../inputs/X_aaindex_64_cosine.pkl', 'rb') as f:
----> 2 X_aa = pickle.load(f)

UnpicklingError: invalid load key, 'v'.
also,
UnpicklingError Traceback (most recent call last)
in
13 # Sequence and structure
14 with open('../inputs/T50_seq_struct.pkl', 'rb') as f:
---> 15 X, _ = pickle.load(f)
16 evals, mu = evaluate(df_train, df_test, X, y_col, 'seq_struc', guesses=(1, 100))
17 res = pd.concat((res, evals), ignore_index=True)

UnpicklingError: invalid load key, 'v'.

My versions:

print(np.__version__)
1.18.5
print(pd.__version__)
1.1.0

Or are the pkl files corrupted?

thanks,
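A likely culprit, given the Git LFS caveat in the Installation section above: the .pkl files may have been cloned as Git LFS pointer stubs rather than the actual pickles. A stub is a small text file beginning with "version https://git-lfs.github.com/spec/v1", so pickle sees 'v' as the first byte and raises exactly this error. A quick check (a sketch using a path from the traceback above):

with open('../inputs/X_aaindex_64_cosine.pkl', 'rb') as f:
    head = f.read(16)
# A real pickle typically starts with b'\x80'; an LFS stub starts with b'version'.
print(head)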

User assistance

Hi, I was really interested in your paper, but this repository isn't very user-friendly. It would be wonderful to add a setup.py so it can be installed with pip, along with some documentation on how to access the embeddings.

I would be happy to send a PR for the setup.py, and we could discuss further there.

Examples in README

As a user, it would be nice to have some examples directly in the README showing how I could use this library. Two simple scenarios would personally benefit me:

  1. Load the embeddings and generate a hierarchical clustering (show a plot, perhaps?). See the sketch after this list.
  2. Load the embeddings and train a simple model. One example would be a target prioritization approach; @ozlemmuslu would be able to help with the example from her master's thesis if we can see exactly how to make a dataframe out of the embeddings!
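For scenario 1, a minimal sketch (my own illustrative code, not part of the repository; the embedding matrix here is a random placeholder standing in for real inferred embeddings):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.rand(20, 64)  # placeholder: 20 sequences embedded in 64 dimensions
Z = linkage(X, method='average', metric='cosine')  # agglomerative clustering
dendrogram(Z)
plt.show()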

Which model to use for computing new embeddings?

As some users have noted in other issues, it is unclear how to use the final model to generate embeddings for a new set of protein sequences. I have identified the files located at http://cheme.caltech.edu/~kkyang/models/ and found the script embedding_tools.py, from which I suppose the function get_embeddings_new() is the relevant one. But which doc2vec_file should I use to compute embeddings for my set of sequences? Which one is the "final" one?

As previously noted, if a minimal example of this were included in the main README file, I am sure it would enable many more users to benefit from your work.

Separate code from data

I'm 7GB+ into trying to clone this repo and my computer is incredibly upset. I would suggest making a separate repository to house all of the data so the code can be downloaded and used independently.
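In the meantime, a possible stopgap (assuming a reasonably recent Git LFS installation) is to skip the large LFS payloads during the clone, which leaves small pointer stubs in place of the data files:

$ GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/fhalab/embeddings_reproduction.git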

Did you compare your results with BioVec by EhsaneddinAsgari?

Thank you so much for your great work.

I read a paper called "DeepPrime2Sec: Deep Learning for Protein Secondary Structure Prediction from the Primary Sequences" by Asgari, E., Poerner, N., McHardy, A., & Mofrad, M. (https://github.com/ehsanasgari/DeepPrime2Sec)
In the paper, the authors use five kinds of features to predict protein secondary structure from the primary sequence. These five features are:

  1. One-hot vector representation (length: 21) --- onehot: a vector indicating which amino acid exists at each specific position, where each index in the vector marks the presence or absence of that amino acid (see the sketch after this list).
  2. ProtVec embedding (length: 50) --- protvec: a representation trained using a skip-gram neural network on protein amino acid sequences (ProtVec). The only difference is character-level training instead of n-gram-based training.
  3. Contextualized embedding (length: 300) --- elmo: the contextualized embedding of amino acids trained in the course of language modeling, known as ELMo, used as a new feature for the secondary structure task. A contextualized embedding is the concatenation of the hidden states of a deep bidirectional language model. The main difference from ProtVec is that the ProtVec embedding for a given amino acid or k-mer is fixed and identical across sequences, whereas a contextualized embedding, as its name implies, changes with the surrounding context. They train ELMo embeddings of amino acids on the UniRef50 dataset with a dimension of 300.
  4. Position-Specific Scoring Matrix (PSSM) features (length: 21) --- pssm: amino acid substitution scores calculated from a multiple sequence alignment of homologous sequences for each given position in the protein sequence.
  5. Biophysical features (length: 16) --- biophysical: for each amino acid, a normalized vector of its biophysical properties, e.g., flexibility, instability, surface accessibility, kd-hydrophobicity, hydrophilicity, etc.
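For concreteness, feature 1 can be written in a few lines. This is my own illustrative sketch; the 21-letter alphabet (20 standard amino acids plus an unknown symbol) is an assumption about the paper's encoding.

import numpy as np

ALPHABET = 'ACDEFGHIKLMNPQRSTVWYX'  # 21 symbols; 'X' for unknown residues (an assumption)

def one_hot(seq):
    # One row per residue; a single 1 marks which amino acid is present.
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    encoding = np.zeros((len(seq), len(ALPHABET)))
    for pos, aa in enumerate(seq):
        encoding[pos, index.get(aa, index['X'])] = 1
    return encoding

print(one_hot('MKTAYI').shape)  # (6, 21)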

However, the paper doesn't show how to do this feature extraction. I am not sure whether you compared your embeddings to this work.

By the way,
in my ML project I want to embed a protein as a vector and then use DL models to do drug-protein interaction prediction. Do you have an example showing how to use it, similar to RDKit, e.g.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=512)?
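A rough analogue with this package might look as follows. This is a sketch only: it assumes original_5_7.pkl and its companion .npy files are in the working directory, that get_embeddings_new returns one vector per input sequence as the README example suggests, and the sequence is a placeholder.

from embeddings_reproduction import embedding_tools

# One protein sequence in, one fixed-length vector out.
vec = embedding_tools.get_embeddings_new('original_5_7.pkl', ['MKTAYIAKQR'], k=5, overlap=False)[0]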

Many thanks!
