
Baseline Needs Even More Love: A Simple Word-Embedding-Based Model for Genetic Engineering Attribution

Introduction

This repository contains my submission for the Genetic Engineering Attribution Challenge. The goal of the challenge was to create an algorithm that identified the most likely lab-of-origin for genetically engineered DNA. The challenge is described in more detail here, and the competition dataset is analysed here.

I approached the challenge using natural language processing: I byte-pair encoded the genetic sequences and implemented a variant of SWEM-max in TensorFlow to classify the encoded sequences. SWEM-max was particularly well suited to the task given its strong performance on small datasets and long texts. The approach proposed in my paper was judged "particularly promising" and "quite distinctive from the other submissions", and ranked 5th in the Innovation Track (judged on the real-world merits of the model). It also ranked 21st in the Prediction Track (judged on model top-10 accuracy).
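Byte-pair encoding repeatedly merges the most frequent adjacent pair of tokens, so frequently recurring subsequences end up as single tokens (motifs). A toy illustration of the idea in plain Python follows; this is not the trained vocabulary or tokenizer from Alley et al., just a minimal sketch of the algorithm:

```python
from collections import Counter

def train_bpe(sequence, num_merges):
    """Toy byte-pair encoding over a DNA string.

    Starts from single bases and greedily merges the most frequent
    adjacent token pair, num_merges times.
    """
    tokens = list(sequence)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])  # apply the merge
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

# Repeated "ATGC" motifs collapse into single tokens after a few merges.
tokens, merges = train_bpe("ATGCGATGCGATGC", 3)
```

In practice a BPE vocabulary is trained once over the whole corpus of sequences and then applied to each sample, so every sequence is represented as a list of motif IDs of manageable length.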

Abstract

Shen et al. (2018) first demonstrated that Simple Word-Embedding-Based Models (SWEMs) outperform convolutional neural networks (CNNs) and recurrent neural networks (RNNs) in many natural language processing (NLP) tasks. We apply SWEMs to the task of genetic engineering attribution. We encode genetic sequences using BPE as proposed by Alley et al. (2020), which separates the sequence into motifs (distinct sequences of DNA). Our model uses a max-pooling SWEM to extract a feature vector from the organism’s motifs, and a simple neural network to extract a feature vector from the organism’s phenotypes (observed characteristics). These two feature vectors are concatenated and then used to predict the lab of origin. Our model achieves 90.24% top-10 accuracy on the private test set, outperforming RNNs (Alley et al., 2020) and CNNs (Nielsen & Voigt, 2018). The simplicity of our model makes it interpretable, and we discuss how domain experts may approach interpreting the model.
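The forward pass the abstract describes can be sketched in NumPy. All dimensions, the phenotype layer, and the parameter names below are illustrative placeholders, not the values or architecture details from the report:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 1000, 32          # illustrative sizes
n_phenotypes, hidden, n_labs = 6, 16, 10  # illustrative sizes

# Trainable parameters (randomly initialised here for the sketch).
embeddings = rng.normal(size=(vocab_size, embed_dim))
W_pheno = rng.normal(size=(n_phenotypes, hidden))
W_out = rng.normal(size=(embed_dim + hidden, n_labs))

def forward(motif_ids, phenotypes):
    """SWEM-max sketch: embed the motifs, max-pool each embedding
    dimension over the sequence, concatenate with a phenotype feature
    vector, then classify over labs."""
    motif_vecs = embeddings[motif_ids]                  # (n_motifs, embed_dim)
    seq_feat = motif_vecs.max(axis=0)                   # max-pooling over motifs
    pheno_feat = np.maximum(phenotypes @ W_pheno, 0.0)  # simple ReLU layer
    features = np.concatenate([seq_feat, pheno_feat])
    logits = features @ W_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                          # softmax over labs

p = forward(np.array([3, 17, 3, 42]), np.ones(n_phenotypes))
```

Because max-pooling selects one motif per embedding dimension, each dimension of the sequence feature vector can be traced back to the motif that produced it, which is the basis of the interpretability techniques discussed in the report.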

Read the full report here.


BibTeX

If you would like to cite this report, please cite it as:

@article{GeneticSWEM,
  author = {Kieran Litschel},
  title = {Baseline Needs Even More Love: A Simple Word-Embedding-Based Model for Genetic Engineering Attribution},
  URL = {https://github.com/KieranLitschel/GeneticSWEM},
  year = {2020},
}

Code

Development directory

During the competition I made a submission every time I made a significant improvement to my model. I saved the notebooks corresponding to each submission and have included them in this repository in the development directory. If you are curious about how the development of my model progressed through the competition, take a look.

Submission

The Train, Infer, and Build vectors and metadata for TensorFlow notebooks were my final submission for the Innovation Track. The Train notebook loads and pre-processes the training data, and uses it to train the model. The Infer notebook loads and pre-processes the test data, and makes predictions for each sample using the model trained in the Train notebook. The Build vectors and metadata for TensorFlow notebook is used to extract the word embeddings from the trained model in the format projector.tensorflow.org accepts as input.

Note that the execution output is not saved for any of the submitted notebooks. If you want to see the execution output, take a look at development notebook GE_8_36, as this notebook trains and evaluates the same model.
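The projector format itself is simple: one TSV file with a tab-separated embedding vector per line, and a parallel metadata TSV with one label per line. A minimal sketch of the export step follows; the function name and file names are mine, not taken from the notebooks:

```python
def export_for_projector(embeddings, vocab,
                         vec_path="vectors.tsv", meta_path="metadata.tsv"):
    """Write embeddings in the two-TSV format that
    projector.tensorflow.org accepts as input.

    embeddings: iterable of equal-length vectors of floats.
    vocab: the token (motif) label for each vector, in the same order.
    """
    with open(vec_path, "w") as f:
        for vec in embeddings:
            # One vector per line, dimensions separated by tabs.
            f.write("\t".join(str(x) for x in vec) + "\n")
    with open(meta_path, "w") as f:
        for token in vocab:
            # One label per line, aligned with vectors.tsv.
            f.write(token + "\n")
```

Uploading the two files to projector.tensorflow.org then lets you explore the learned motif embeddings with PCA, t-SNE, or UMAP.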

Requirements for Running Notebooks

These notebooks were run on Ubuntu 18.04 using Python 3.6.9. A requirements.txt listing the package versions I used is included. You will likely have success running these notebooks with a different operating system and different versions of Python and the required packages, but I have not tested this.

XSWEM Package

I am currently working on an open source implementation of SWEM-max (max-pooling SWEM) in TensorFlow, with an emphasis on the interpretability techniques discussed in the paper. I have had some ideas for how to make the model even more interpretable, which I plan to include in the implementation. Development is at an early stage, and the implementation is not yet flexible enough to completely replicate the model proposed here, but I plan to improve its flexibility to support this. You can check it out here.

Future Plans

I have had a few ideas on how to improve my approach since the competition. I plan to explore these ideas, and then develop the report into a full pre-print paper.
