bioinformatics's Introduction

📚 Bioinformatics Theory

Level 1: Nucleotide sequence

  • Data from DNA or RNA.
  • 4 letters: A, C, G, T (RNA uses U in place of T)
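
As a minimal sketch of working with the 4-letter alphabet above, the helpers below validate a DNA string, compute its reverse complement, and transcribe it to RNA. The function names are hypothetical and chosen only for illustration; ambiguity codes such as N are deliberately not handled.

```python
# Hypothetical helpers for the 4-letter DNA alphabet (illustration only).
DNA_ALPHABET = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_valid_dna(seq: str) -> bool:
    """Return True if every character is one of A, C, G, T."""
    return set(seq.upper()) <= DNA_ALPHABET


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def transcribe(seq: str) -> str:
    """Rewrite a DNA coding strand as RNA by replacing T with U."""
    return seq.upper().replace("T", "U")


if __name__ == "__main__":
    print(is_valid_dna("ATGC"))
    print(reverse_complement("ATGC"))
    print(transcribe("ATGC"))
```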

Level 2: Amino acid sequence

  • Data from proteins.
  • IUPAC: The 20 standard amino acids: A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V.
  • Extended IUPAC: 26 letters. In addition to the standard 20 amino acids, this includes: B, X, Z, J, U, O.
    • X is for Unknown or "other" amino acid.
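
Before a model can consume a protein sequence, each letter is usually mapped to an integer. The sketch below builds a vocabulary from the 26 extended IUPAC letters listed above and maps any unrecognised character to X. The ordering of the vocabulary and the name `encode` are assumptions made for this example, not a standard.

```python
# Vocabulary built from the letters listed above; the index order is arbitrary.
STANDARD = "ARNDCQEGHILKMFPSTWYV"   # 20 standard amino acids
EXTENDED = STANDARD + "BXZJUO"      # 26 extended IUPAC letters
TOKEN_TO_ID = {aa: i for i, aa in enumerate(EXTENDED)}


def encode(seq: str) -> list[int]:
    """Map each residue to its integer id; unknown symbols fall back to X."""
    unknown = TOKEN_TO_ID["X"]
    return [TOKEN_TO_ID.get(aa, unknown) for aa in seq.upper()]


if __name__ == "__main__":
    print(encode("MKV"))
```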

Level 3: Domains of proteins

  • Domains are sequences that are evolutionarily conserved, and as such have a well-defined fold and function.
  • Pfam is a database of 32,207,059 protein domains used extensively in bioinformatics.
  • Sequences in Pfam are clustered into evolutionarily-related groups called families.

Level 4: 3D shape of proteins

Read about the Ramachandran plot.

Level 5: Gene Ontology

  • GO
  • UniProt

Reactome

Protein interaction graph.

BLAST

Basic Local Alignment Search Tool. An algorithm that determines the level of similarity between two sequences. It can be used:

  • On nucleotide sequences (Nucleotide BLAST)
  • On amino acid sequences (Protein BLAST)
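
To make "local alignment" concrete, here is a minimal sketch of the Smith-Waterman algorithm, the exact dynamic-programming method for scoring the best local alignment of two sequences. This is not BLAST itself: BLAST is a faster heuristic that approximates this kind of search. The scoring values (match, mismatch, gap) are assumptions for the example; real protein searches use substitution matrices such as BLOSUM62.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Return the best local alignment score between a and b.

    Uses a linear gap penalty and keeps only two rows of the DP table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

The same function works on nucleotide or amino acid strings, since it only compares characters.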

🗄 Datasets

  • UniProt: Amino acid sequences.
    • UniProtKB: UniProt Knowledgebase
      • Swiss-Prot: Manually annotated and reviewed. (561,911)
      • TrEMBL: Automatically annotated and not reviewed. (177,754,527)
    • UniRef: UniProt Reference Clusters
      • UniRef50: clustered at 50% sequence identity (39,178,216)
      • UniRef90: clustered at 90% sequence identity (107,153,647)
      • UniRef100: clustered at 100% sequence identity (216,491,817)
    • UniParc: UniProt Archive
  • Pfam: Amino acid sequences + domains.
  • Protein Data Bank: Far fewer entries than UniProt. Besides the sequence, it contains metadata and 3D representations.

🧠 Deep Learning

Unsupervised learning for sequences

Unsupervised learning serves as a pretraining task that lets the neural net learn from large amounts of unlabeled data.

  • Language Model (LM): Predict the next amino acid. (used in RNNs)
  • Masked Language Model (MLM): Predict hidden amino acids. (introduced in the BERT paper)
  • Replaced Token Detection: Is this amino acid real or fake? (introduced in ELECTRA)
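
The masked language model objective can be sketched as a data-preparation step: hide some residues and keep the originals as prediction targets. The example below is a simplified illustration, assuming a hypothetical mask symbol `#` and a 15% masking rate; it omits the extra replacement rules used in the BERT paper.

```python
import random

MASK = "#"  # hypothetical mask symbol, not part of the amino acid alphabet


def mask_sequence(seq: str, mask_prob: float = 0.15, seed: int = 0):
    """Hide residues at random and record the originals as targets.

    Returns the masked string and a dict {position: original residue}.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = aa  # the model would be trained to predict this
        else:
            masked.append(aa)
    return "".join(masked), targets


if __name__ == "__main__":
    masked, targets = mask_sequence("MKVLAAGIVGLLLAQ", mask_prob=0.3)
    print(masked)
    print(targets)
```

A next-token language model would instead pair each prefix of the sequence with the residue that follows it.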

Supervised tasks

  • Secondary structure prediction
  • Contact prediction: Predict whether a pair of amino acids are within 8 angstroms of each other.
  • Remote homology detection.
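
The labels for contact prediction can be derived from a known 3D structure. The sketch below builds a boolean contact map from one (x, y, z) coordinate per residue using the 8 angstrom threshold mentioned above. Which atom represents each residue (for example the alpha or beta carbon) is an assumption left to the caller, and the coordinates in the usage example are made up.

```python
import math


def contact_map(coords, threshold: float = 8.0):
    """Return an n x n matrix of booleans: True where two residues are
    within `threshold` angstroms of each other.

    `coords` holds one (x, y, z) tuple per residue.
    """
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) <= threshold for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Made-up coordinates for three residues along the x axis.
    toy = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    for row in contact_map(toy):
        print(row)
```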

Traditional NLP methods

  • Count vectorizers with n-grams
  • TF-IDF
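
Both methods above treat a sequence as a bag of short overlapping substrings (n-grams, often called k-mers in bioinformatics). The sketch below counts n-grams and computes a plain, unsmoothed TF-IDF weighting in pure Python. The function names are hypothetical, and library implementations commonly apply smoothing and normalisation that are omitted here.

```python
import math
from collections import Counter


def ngrams(seq: str, n: int = 3) -> list[str]:
    """Return all overlapping substrings of length n."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def count_vector(seq: str, n: int = 3) -> Counter:
    """Count how often each n-gram occurs in the sequence."""
    return Counter(ngrams(seq, n))


def tf_idf(seqs: list[str], n: int = 3) -> list[dict]:
    """Weight each n-gram by term frequency times log(N / document frequency)."""
    counts = [count_vector(s, n) for s in seqs]
    # Document frequency: number of sequences containing each n-gram.
    df = Counter(g for c in counts for g in c)
    total_docs = len(seqs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {g: (k / total) * math.log(total_docs / df[g]) for g, k in c.items()}
        )
    return vectors


if __name__ == "__main__":
    for vec in tf_idf(["MKVL", "MKVA"]):
        print(vec)
```

An n-gram shared by every sequence gets weight zero, which is the point of the IDF term: it down-weights substrings that do not help tell sequences apart.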

1D convolutions

Recurrent Neural Nets (RNN, LSTM, ...)

Step 3: Transformers

Step 4: 3D Proteins

  • 3D protein

Python packages

Learning Resources
