- Data from DNA or RNA.
- 4 letters: A, C, G, T (RNA uses U in place of T)
- Data from Proteins.
- IUPAC: The 20 standard amino acids: A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V.
- Extended IUPAC: 26 letters. In addition to the standard 20 amino acids, this includes: B, X, Z, J, U, O.
- X is for an unknown or "other" amino acid. The rest: B = Asn/Asp, Z = Gln/Glu, J = Leu/Ile, U = selenocysteine, O = pyrrolysine.
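As a quick sketch of the alphabets above, the following pure-Python check validates a sequence against the standard or extended IUPAC amino acid sets (the sets are taken from this list; function and variable names are my own):

```python
# Sketch: validating protein sequences against the IUPAC alphabets listed
# above. Names here are illustrative, not from any particular library.

STANDARD_AA = set("ARNDCQEGHILKMFPSTWYV")   # the 20 standard amino acids
EXTENDED_AA = STANDARD_AA | set("BXZJUO")   # 26-letter extended alphabet

def is_valid_protein(seq: str, extended: bool = False) -> bool:
    """Return True if every residue belongs to the chosen alphabet."""
    alphabet = EXTENDED_AA if extended else STANDARD_AA
    return all(residue in alphabet for residue in seq.upper())

print(is_valid_protein("MKTAYIAKQR"))           # True: only standard residues
print(is_valid_protein("MKXB"))                 # False under the standard alphabet
print(is_valid_protein("MKXB", extended=True))  # True: X and B are extended codes
```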
- Domains are sequences that are evolutionarily conserved, and as such have a well-defined fold and function.
- Pfam is a database of 32,207,059 protein domains used extensively in bioinformatics.
- Sequences in Pfam are clustered into evolutionarily-related groups called families.
Read about Ramachandran plot
- GO
- Uniprot
Protein interaction graph.
Basic Local Alignment Search Tool. An algorithm that measures the similarity between two sequences. It can be used:
- At the nucleotide-sequence level (Nucleotide BLAST)
- At the amino-acid-sequence level (Protein BLAST)
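BLAST itself is a fast heuristic, but the similarity idea it approximates is local alignment. A minimal Smith-Waterman score (the exact dynamic-programming relative of BLAST) can be sketched in a few lines; the match/mismatch/gap values below are illustrative, not BLAST's defaults:

```python
# Minimal Smith-Waterman local alignment score: the exact dynamic-programming
# counterpart of BLAST's heuristic search. Scoring parameters are toy values.

def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]     # DP matrix, clamped at zero
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])         # best local alignment anywhere
    return best

print(smith_waterman_score("ACGT", "ACGT"))   # 8: four matches at +2 each
print(smith_waterman_score("AAAA", "GGGG"))   # 0: no local similarity
```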
- UniProt: amino acid sequences.
  - UniProtKB: UniProt Knowledgebase
    - Swiss-Prot: manually annotated and reviewed (561,911)
    - TrEMBL: automatically annotated and not reviewed (177,754,527)
  - UniRef: UniProt Reference Clusters
    - UniRef50: 50% (39,178,216)
    - UniRef90: 90% (107,153,647)
    - UniRef100: 100% (216,491,817)
  - UniParc: UniProt Archive
- Pfam: amino acid sequences + domains.
- Protein Data Bank: far fewer entries than UniProt. Besides the sequence, each entry contains metadata and a 3D structure representation.
- On Kaggle: 400,000-protein data set (146 MB)
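The UniRef50/90/100 levels above are clusters of sequences at a given identity threshold. A hedged toy sketch of the idea (UniProt actually builds these with dedicated clustering tools; this naive greedy version only compares equal-length strings position by position):

```python
# Sketch of UniRef-style clustering: greedily group sequences whose identity
# to a cluster representative meets a threshold. This is a toy illustration,
# not UniProt's real pipeline, and only handles equal-length sequences.

def identity(a: str, b: str) -> float:
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    clusters = []  # each cluster: [representative, members...]
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])   # no match: s starts a new cluster
    return clusters

seqs = ["MKTAY", "MKTAF", "GGGGG", "MKTAY"]
print(len(greedy_cluster(seqs, 1.0)))  # 3: only exact duplicates merge ("UniRef100")
print(len(greedy_cluster(seqs, 0.8)))  # 2: MKTAY/MKTAF merge at 80% identity
```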
Unsupervised learning is a pretraining task that lets the neural net learn from large amounts of unlabeled data.
- Language Model (LM): predict the next amino acid (used in RNNs).
- Masked Language Model (MLM): predict hidden amino acids (introduced in the BERT paper).
- Replaced Token Detection: is this amino acid real or fake? (introduced in ELECTRA)
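To make the MLM objective concrete, here is a hedged sketch of how training pairs could be prepared from a protein sequence: hide roughly 15% of residues (BERT's masking rate) and keep the originals as prediction targets. The `#` mask symbol is my own placeholder; BERT uses a dedicated `[MASK]` token.

```python
import random

# Sketch of masked-language-model (MLM) input preparation for a protein
# sequence: hide ~15% of residues and record the originals as targets.

MASK = "#"  # placeholder mask symbol (assumption; BERT uses [MASK])

def mask_sequence(seq: str, rate: float = 0.15, rng=None):
    rng = rng or random.Random(0)      # fixed seed for reproducibility
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < rate:
            masked.append(MASK)
            targets[i] = aa            # the model must predict these residues
        else:
            masked.append(aa)
    return "".join(masked), targets

masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(masked)    # original sequence with some residues replaced by "#"
print(targets)   # position -> hidden residue
```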
- Secondary structure prediction
- Contact prediction: predict whether a pair of amino acids are within 8 angstroms of each other.
- Remote homology detection.
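The contact-prediction target above can be derived from known 3D structures: two residues are "in contact" when their atoms lie within the 8-angstrom cutoff. A minimal sketch with toy coordinates (not real PDB data):

```python
import math

# Sketch: build a binary contact map from 3D coordinates, using the
# 8-angstrom cutoff mentioned above. Coordinates here are toy values.

CUTOFF = 8.0  # angstroms

def contact_map(coords):
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            cmap[i][j] = math.dist(coords[i], coords[j]) < CUTOFF
    return cmap

coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (12.0, 0.0, 0.0)]
cmap = contact_map(coords)
print(cmap[0][1])  # True: 3.8 A apart, within the cutoff
print(cmap[0][2])  # False: 12 A apart
```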
- Count vectorizers with n-grams
- TF-IDF
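These two baselines can be written out by hand for protein sequences: count character n-grams, then weight them by TF-IDF. The sketch below uses one common idf variant (log with +1 smoothing); libraries such as scikit-learn implement this more robustly and with slightly different formulas.

```python
import math
from collections import Counter

# Minimal baseline featurization: character n-gram counts with TF-IDF
# weighting, written by hand. Exact idf formulas vary between libraries.

def ngrams(seq: str, n: int = 3):
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def tfidf(corpus, n=3):
    counts = [Counter(ngrams(s, n)) for s in corpus]
    docs = len(corpus)
    df = Counter()                         # document frequency per n-gram
    for c in counts:
        df.update(c.keys())
    idf = {g: math.log(docs / df[g]) + 1.0 for g in df}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]

vectors = tfidf(["MKTAYIAK", "MKTAYQRS", "GGGGSGGG"])
# "IAK" occurs only in the first sequence, so it outweighs the shared "MKT":
print(vectors[0]["IAK"] > vectors[0]["MKT"])  # True
```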
- Ultra-Deep Learning Model (2016)
- AlphaFold: from a sequence, predict the 3D structure
- Paper in Nature (Jan 2020)
- Paper in Proteins (Sep 2019)
- DeepDom (January 2019)
  - LSTM
- UniRep (March 2019)
  - mLSTM
  - Unsupervised
- UDSMProt (September 2019)
  - AWD-LSTM
  - Unsupervised
- Kaggle: LSTM with keras
- Character-Level LSTM in PyTorch
- Protein-Structure-Prediction (weights and biases)
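In the spirit of the character-level LSTM tutorials above, here is a hedged minimal next-residue language model in PyTorch. All sizes are illustrative assumptions, and the model is untrained; the point is only the input/target shift and the shapes.

```python
import torch
import torch.nn as nn

# Minimal character-level LSTM for next-residue prediction on protein
# sequences. Embedding/hidden sizes are illustrative, not tuned values.

VOCAB = "ACDEFGHIKLMNPQRSTVWY"            # 20 standard amino acids
stoi = {aa: i for i, aa in enumerate(VOCAB)}

class ProteinLM(nn.Module):
    def __init__(self, vocab=len(VOCAB), emb=16, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)   # logits over the next residue

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h)

seq = "MKTAYIAK"
x = torch.tensor([[stoi[aa] for aa in seq[:-1]]])  # input: all but last residue
y = torch.tensor([[stoi[aa] for aa in seq[1:]]])   # target: shifted by one
model = ProteinLM()
logits = model(x)
loss = nn.functional.cross_entropy(logits.reshape(-1, len(VOCAB)), y.reshape(-1))
print(logits.shape)   # torch.Size([1, 7, 20])
```

A training loop would repeat the forward pass and step an optimizer on this cross-entropy loss.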
- Read Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences: after training a Transformer on amino acid sequences, the researchers examined the embeddings the model had learned and found that the network had built a rich representation of the input sequences, one that reflects biological properties such as activity, stability, structure, and binding. In other words, the deep learning algorithm learned important biochemical properties of amino acids and proteins entirely on its own, without any supervision.
- Illustrating the Reformer
- ELECTRA: another way to do unsupervised pretraining
- Advice for training transformers: Train Large, Then Compress (tweet, blog, paper)
- Kaggle winner solution to Google’s QUEST Q&A Labeling: BERT, RoBERTa, BART
- Andrés Solution to Predicting Molecular Properties
- Choose one from Hugging face
- Train with fastai
- 3D protein