gappredict's Introduction

GapPredict

About

GapPredict is an LSTM character-level language model that can be used for filling gaps in de novo assemblies. GapPredict can predict the bases of a single gap using reads that map to the known flanking sequences of the gap.

In its current implementation, GapPredict predicts a user-defined number of bases after both the left flank and the right flank of the gap. If the user chooses a sufficiently long prediction length, then GapPredict may be able to predict both the gap and the reciprocal flank. A downstream local alignment tool (eg. Exonerate [1]) can then be used to align, say, the first 100 bases of the reciprocal flank to the prediction. If these 100 bases align with high % identity and % coverage (eg. >90%), then it is likely that the preceding bases are a good prediction of the actual gap.
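
As a rough illustration of this check, the sketch below slides the first 100 bases of the reciprocal flank along the predicted sequence and reports the best ungapped percent identity. It is a simplified stand-in for a proper local aligner such as Exonerate (the function name and threshold are only illustrative):

def best_ungapped_identity(prediction, reciprocal_flank, probe_length=100):
    # Take the first probe_length bases of the reciprocal flank as the query
    probe = reciprocal_flank[:probe_length]
    best = 0.0
    # Slide the probe along the prediction and score each ungapped window
    for offset in range(len(prediction) - len(probe) + 1):
        window = prediction[offset:offset + len(probe)]
        identity = sum(a == b for a, b in zip(probe, window)) / len(probe)
        best = max(best, identity)
    return best

# e.g. if best_ungapped_identity(prediction, reciprocal_flank) > 0.9, the bases
# preceding the matching region are likely a good prediction of the gap.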

Installing GapPredict

Please ensure you are using Python3.7. Dependencies were last tested with Python3.7.9. Refer to https://www.tensorflow.org/install/gpu to hook up your GPU if you don't have another set-up in place. In the event that v1.1 doesn't work, try using v1.0.2 with Python3.6 and file an issue. v1.0.2 contains the intended configuration with Tensorflow 1.5.

You can install GapPredict by cloning or downloading the .zip file directly from GitHub.

git clone git@github.com:bcgsc/GapPredict.git

GapPredict uses Python3.7 (Python3.6 for v1.0.2) and the packages outlined in requirements.txt (https://github.com/bcgsc/GapPredict/blob/master/requirements.txt). These packages can be quickly installed by running:

pip install -r requirements.txt

or

python -m pip install -r requirements.txt

In order to train models and predict efficiently, a GPU is mandatory. Steps to install CUDA and cuDNN are available in the TensorFlow GPU installation guide linked above.

Input Data Preparation

GapPredict requires two input files - a FASTA file containing the sequences of the left and right flanks of your gap, and a FASTQ file containing reads mapping to your gap flanks. The length of the flanks is arbitrary; in our tests we used uniform lengths of 500 bp.

We used BioBloomTools' BioBloomMIMaker followed by BioBloomMICategorizer [2] to extract, from the full set of reads used in the de novo assembly, only the reads that map to our gap's flanks.

Gap Prediction With GapPredict

To run GapPredict, navigate to the lib directory and call:

python GapPredict.py -o <output directory> -fa <FASTA path> -fq <FASTQ path>

For help, call:

python GapPredict.py --help

We've provided sample FASTA and FASTQ files in lib/data/real_gaps/sealer_filled and lib/data/real_gaps/sealer_unfilled. Gaps in lib/data/real_gaps/sealer_filled have been filled by Sealer [3], a state-of-the-art gap-filling tool, so we've also included Sealer's output for the actual gap sequence to use as a reference. For gaps in lib/data/real_gaps/sealer_unfilled (and also for gaps in lib/data/real_gaps/sealer_filled), a reference sequence must be obtained from the human reference genome (HG38).

eg. python GapPredict.py -o <output directory> -fa .../lib/data/real_gaps/sealer_filled/7391826_358-1408.fasta -fq .../lib/data/real_gaps/sealer_filled/7391826_358-1408.fastq

Where ... is the absolute path to the lib directory.

GapPredict Outputs

GapPredict outputs the following directory structure:

Root directory (<gap ID>R<replicate number>)

  • beam_search - contains results from predicting the flanks and gaps using beam search
    • predict_gap - contains results from predicting the gap from both the forward and reverse complement direction using beam search with a user specified beam length
      • inner directories specifying which direction the gap was predicted from
        • beam_search_predicted_probabilities.npy - vector of length B (beam length) of log-sum probabilities for each predicted gap, in descending order
        • beam_search_predictions.fasta - file of the top B gap predictions for the gap from the given direction
    • regenerate_seq - contains results from predicting the left flank and the right flank from both the forward and reverse complement direction using beam search with a user specified beam length
      • inner directories specifying the left/right flank and which direction the flank was predicted from
        • beam_search_predicted_probabilities.npy - vector of length B (beam length) of log-sum probabilities for each predicted flank, in descending order
        • beam_search_predictions.fasta - file of the top B predictions for the given flank from the given direction
  • predict_gap - contains results from predicting the gap from both the forward and reverse complement direction using the greedy algorithm (beam search with a beam length of 1)
    • inner directories specifying which direction the gap was predicted from
      • predicted_probabilities.npy - P x 4 matrix (where P is the length predicted) containing the probability vector output by the GapPredict model for each predicted base
  • regenerate_seq - contains results from predicting the left flank and the right flank from both the forward and reverse complement direction using the greedy algorithm (beam search with a beam length of 1)
    • flank_predict.fasta - contains the left flank and right flank predicted from both the forward and reverse complement directions (4 sequences total)
    • inner directories specifying the left/right flank and which direction the flank was predicted from
      • greedy_predicted_probabilities.npy - contains the log-sum probability for the greedy prediction
      • predicted_probabilities.npy - P x 4 matrix (where P is the length predicted) containing the probability vector output by the GapPredict model for each predicted base
      • random_predicted_probabilities.npy - vector of length P with the probability for each randomly chosen base
      • teacher_force_predicted_probabilities.npy - vector of length P with the probability of each base chosen to match the actual reference sequence
  • BS_<batch size>ED<embedding dimensions>LD<LSTM cells>E<epochs>R<replicate> - contains model training results
    • contains graphs for training loss, training accuracy, validation loss, and validation accuracy in addition to the matrix containing these metrics at each epoch
      • validation loss and validation accuracy are each stored as a V x E matrix, where E is the number of epochs and V is the number of sequences in the validation set, containing the respective metric for each of the V sequences
    • lengths.npy - vector of lengths for the validation set for weighted sums, where sequences are in the same order as the validation loss and accuracy matrices
  • gap_predict_align.fa - contains the sequences for the greedy prediction of the gap from both the left and right flanks (including the flank seeds), and the sequences from the input FASTA file
  • my_model_weights.h5 - contains GapPredict model parameters and can be loaded into a GapPredict model
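
As a minimal sketch of how these outputs might be inspected (the directory names below are hypothetical placeholders for the direction-specific inner directories), the probability files can be loaded with numpy:

import numpy as np

# P x 4 matrix: one probability vector per greedily predicted base (A, C, G, T order assumed here)
greedy_probs = np.load("predict_gap/forward/predicted_probabilities.npy")
print(greedy_probs.shape)

# length-B vector of log-sum probabilities for the beam search gap predictions, in descending order;
# entry i corresponds to the i-th sequence in beam_search_predictions.fasta
beam_scores = np.load("beam_search/predict_gap/forward/beam_search_predicted_probabilities.npy")
print(beam_scores)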

Pipeline Reproduction Steps

Refer to this link.

Citations

  1. G. S. C. Slater and E. Birney, “Automated generation of heuristics for biological sequence comparison”, BMC Bioinform., vol. 6, no. 31, Feb. 2005.
  2. J. Chu, H. Mohamadi, E. Erhan, J. Tse, R. Chiu, S. Yeo, and I. Birol, “Improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index Bloom filters”, bioRxiv:434795, Oct. 2018.
  3. D. Paulino, R. L. Warren, B. P. Vandervalk, A. Raymond, S. D. Jackman, and I. Birol, “Sealer: a scalable gap closing application for finishing draft genomes”, BMC Bioinform., vol. 16, no. 230, Jul. 2015.
  4. E. Chen, J. Chu, J. Zhang, R. L. Warren, and I. Birol, “GapPredict - A Language Model for Resolving Gaps in Draft Genome Assemblies”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, doi:10.1109/TCBB.2021.3109557.


gappredict's Issues

Winter Break TODO

  1. Hook up the GPU code and see how much faster it is
  • right now we take about 50 minutes to handle 5 million 26-mers with the Vanilla DNN
  • try to increase batch size
  • can we use a conv net to do word embedding on our kmer, then shrink our RNN hidden layer size, since the embedded kmer simplifies things a bit?
  2. We want graphs for 1 kb, 10 kb, 100 kb, 1 Mb
  • for training time
  • for training accuracy
  • for validation accuracy
  • benchmarks with respect to coverage (1X, 10X, 20X, 30X, etc... just take a random sample), k-mer length, model architecture (ie. hyperparameters)

https://arxiv.org/abs/1611.02683
Unsupervised training?

fix bidirectional align

messed it up and it gets overwritten by the second flank... should just put a single bidirectional align in each flank folder not a single one on the outside that can get overwritten

Training Set

Currently our "training set" for training accuracy per epoch is just the last batch of the epoch

Should we maybe store every batch run in an epoch and do a final "review" over all of these batches at the end of the epoch? (review set changes each epoch)

TODO

  1. Extend to predict d > 1 bases and see how it performs
  • "Done"... the statistics should now be pretty good for d > 1 base but I need to hook this up with the Keras model and test it
  2. Try making the training set an actual chromosome (or even just a tiny chunk of a contiguous region so our kmers actually have some context)
  • "Done"... we have a shell script to extract all reads mapping to a contiguous region of an actual bacterial chromosome now
  3. Train on an entire genome, start at some random kmer and estimate the next d bases
  • plot the base correctness against how many bases down you're predicting and see when we get to 25%
  • Not really done... I've tried this on a contiguous stretch of reads and got very good accuracy but this is only for a 1 Kb stretch (which isn't very long)... also this is only next 1 base prediction and haven't expanded this to d bases yet
  4. Take an entire genome, delete some chunks out of it, select some seeds at the edge of those chunks and see if you can predict the next d unknown kmers
  • we will train on the deleted chunks
  • Done, or at least the skeleton to predict d unknown bases given a model trained on any number of bases

Shannon entropy?
ntHits?
cBF - tries to keep the kmers that are well covered (and thus not likely to be incorrect) but also not too redundant (and thus likely have a poor mapping)

  • use this to filter out reads

Nov 7
First, should make sure that my "hole" in the dataset is actually a valid hole.
Then probably want to look into hooking up the model saving functionality so we can hook up an application that predicts an arbitrary number of bases

  • Note: It seems to make a bit less sense to be predicting say 30 bases if we only trained on the next 1 base... although it would be interesting to see how much accuracy falls off; however we can compare training on the next 1 base and the next 30 bases

Scalability - make this run from the terminal (to see if we can use the stronger hardware), add more technology; can we use C++?
How does increasing the k of k-mers affect the prediction and training time?
Get a deeper understanding of the architecture of this model and whether it's appropriate for our task?

  • Does a CNN make sense for this problem?

Perhaps e-mail the MachineLearningMastery guy and ask him if there are ways to optimize the model
Use the French-English prediction as a benchmark
Look into Google machine learning (eg. their autocomplete, their sentence completion models too)?

Maybe just other sentence completion models as well
See if we can find benchmark data for other RNN/LSTM models (and also with different frameworks) - make sure it's working on big data as well

https://github.com/tensorflow/nmt

Also, hook up the command line stuff, run on a 10k bp contig, get the runtime stats


Nov 21
Parameter sweep

  • check how different hyperparameters affect runtime and accuracy

Consider:

  1. A vanilla neural network
  • Encode your read (eg. one hot), predict the next base
  • flatten it (4xk vector)
  • try both onehot encoding and "w-mer" embedding (w << k)
    - consider a conv net (take say 4x w windows with w << k, sliding window)
  • take in a matrix and output a base
  • one hot encoding
  2. Kmer embeddings (not just 1 hot encodings)
  3. Run ntHits and filter out kmers (or the count based filtering)
  • gets rid of low confidence kmers
  4. See how coverage affects model quality:
  • eg. sample 10% to simulate 6X coverage from a 60X coverage dataset
  5. Get a set of contigs and a set of reads not represented in the assembly (this set of reads might be small enough)
  • train on the reads and see if we can patch the contigs/scaffold
  6. See if we can increase the number of threads in HPCE

TODO:

Check backwards prediction (we always do forwards prediction right now) Done

Make a script that compares the filled gaps with the human genome assembly REFERENCE

Having both the flanks is important because if we do a bidirectional prediction:

  1. We expect to overlap at some point
  2. We expect each prediction direction to reach the other flank eventually
    We can kind of estimate the length of the gap so if we go past it we know that something bad happened
  • We need to use an assembly from the same read set that we're training on
  1. First find an Assembly from Illumina 150-PE reads

  2. Find the PE reads themselves

  3. Redo Justin's pipeline on them to extract gaps + gap reads
    Done... but the pipeline seems to have given weird results this time

  4. Write new code to make models on multiple of those gaps^ while we wait for reference sequence

  5. Try the early stopping approach with the flanks (and don't forget predicting in the other direction)
    Early stopping done, parallelism not so much

Make the metric a bit less pessimistic (just use % match rather than % predicted... we can still do early stopping on 100% though but we can now pick a set of weights that might perform well given a polymorphism for example... humans are diploid)
Done

Implement a rolling encoding

Because consecutive k-mers overlap, most of the encoding work for one kmer can be reused for the next.

OR

See if we can make an encoded vector/matrix for each read and just make each kmer's one hot encoding point to a specific offset + length.
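
A minimal sketch of the second option, assuming one-hot encoding over A/C/G/T (the function names are illustrative): encode each read once as an L x 4 matrix and expose every k-mer as a slice into it rather than re-encoding.

import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode_read(read):
    # Encode the whole read once as an L x 4 one-hot matrix (one row per base)
    matrix = np.zeros((len(read), 4), dtype=np.float32)
    for i, base in enumerate(read):
        matrix[i, BASE_INDEX[base]] = 1.0
    return matrix

def kmer_views(encoded_read, k):
    # Each kmer's encoding is just an offset + length slice of the read's matrix (no copying)
    for offset in range(encoded_read.shape[0] - k + 1):
        yield encoded_read[offset:offset + k]

encoded = one_hot_encode_read("ACGTACGTAC")
windows = list(kmer_views(encoded, 5))  # 6 overlapping 5-mers, all views into one matrix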

IGV Alignment

We do not care about flank predictions (especially after we revise how to get the flank metrics)
Skip them in the alignment

Jan 9 TODO

A-R-B-R-C
R is a repeat longer than 26 but shorter than the read length l. If R > l then who knows what will happen, but we can revisit this later. If it does work in that case, then it shows that the LSTM may be able to hold memory longer than the read length.

One base extensions are expected to choke at this point. If the LSTM works then we expect that retaining region A will allow us to fill in R

  • Try to incorporate attention
  • Search up language models (not translation; more like predicting the next word)

Start with some static length n
Split dataset into n predict next base; n+1 predict next base; n+2 predict next base

Read!

Word2vec implementation

https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
https://www.nature.com/articles/s41598-018-33321-1
https://datasciencehongkong.files.wordpress.com/2018/02/dna2vec.pdf
https://arxiv.org/pdf/1701.06279.pdf
(the gist of this seems to be that if you fragment a DNA sequence into kmers in a smart way then train a word2vec model, you can get meaningful embeddings which may be better than one hot encodings)

We also may want to try GRUs or bidirectionality in our models?

https://medium.com/swlh/playing-with-word-vectors-308ab2faa519
(If we ever do things with the actual vectors, a hack is to discard query terms from the results... read the article to see what I mean)

https://arxiv.org/abs/1301.3781
Maybe read the entire journal too for word2vec

https://www.biorxiv.org/content/biorxiv/early/2018/05/31/335943.full.pdf

I think we should think of a good way of determining whether our embeddings show some semantic relationships, especially given that we'll eventually train on the full read set
word2vec itself with the Wikipedia dataset isn't perfect but it can do things like king - man + woman = queen. Similarly dna2vec can show that say ATA + AGG = ATAAGG (not possible with our embedding right now though because I'm using static length)

New Approach

Needs: Reference, read data
The reference ideally has the gaps filled

  1. Do an assembly
  2. Find a region with a gap
  3. Get the reads mapping to the borders of that gap
  4. Get reads that don't map (and thus some of them map to the gap)
  5. Training: Same as before
  6. Validation: Just try to reproduce the sequence using a seed before the gap. Use % accuracy or length before first error as a metric

Note: Perhaps we do this with 2 gaps because this will give us 2 gaps with an ambiguity

TODO

  1. Reverse complement not reverse Done
  2. Rerun sealer on the gap and the same data Done, indirectly, in the sense that we only use sealer gaps now
  3. Bigger font size for the plot and fatter lines Done
  4. Look at gaps that Sealer closed. Get the reads. See if they map to the gaps that Sealer closed to see coverage. Done

Take the gap that sealer closed and align the gap with the context sequence against the reference human genome. See if Sealer actually filled it correctly. If so then use the gap because we consider it to be "easy".
  5. We don't actually care what the model wants to predict for the flanking region Done? But still need to put into practice

  • for training we do what we do
  • for validation/early stopping we will see how well we predict the flank
  • for prediction itself only focus on the gap and seed in the flanking region
  1. "Teacher Forcing" on the validation sequence Done
  • predict base, see if it's correct, regardless of it's correct feed in the correct next base... this is not actually teacher forcing because it doesn't really do anything to training
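
A minimal sketch of this evaluation, where predict_next_probs is an assumed stand-in for the trained model's next-base distribution: record the probability assigned to each correct base, but always feed the correct base back in regardless of what was predicted.

import numpy as np

BASES = "ACGT"

def teacher_forced_probabilities(predict_next_probs, seed, reference):
    # Probability the model gives to each correct base along the reference sequence
    context = seed
    probs = []
    for true_base in reference:
        distribution = predict_next_probs(context)  # length-4 vector over A, C, G, T
        probs.append(distribution[BASES.index(true_base)])
        context = context + true_base  # feed in the correct base, not the predicted one
    return np.array(probs)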

Visualization of sequence reproduction

  1. 1 significant figure
  2. Sliding window plot of average correctness (8, 20)...
    2.5) Also plot # of matches within the window
  3. Individual base prediction plot for each base
  4. Flag the seed
  5. Actual on top
  6. Annotate matches and errors along the alignment

Fix full pipeline

Save the probabilities for gap prediction (we only do it for flanks)

TODO

  1. Do Sealer -> miBF
  2. send Justin the draft genome + read set
  3. run 10000 epochs

Make a Keras model for word2vec

Start by googling existing models

We probably want to see if we can find a way to train it to handle long distance relationships since I feel like the training set would only be limited to the length of a single read

Implement some controls

https://cs224d.stanford.edu/reports/jessesz.pdf

As specified here two useful controls seem to be:

  1. A random genome
  2. A genome consisting of a repeat of a single substring

The point of these controls is to help pick hyperparameters and models
Models that perform poorly on these 2 controls are not considered as these 2 controls have rather "obvious" predictions

Tweak to consider

Make a stateful RNN that uses batches of size say 104 (the number of kmers per read) and disables shuffling, since all of these kmers are connected to each other (a minimal sketch follows the list below)

Problem:

  • Small batch size = long runtime
  • Need to ensure kmer length is the same
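
A minimal sketch of this tweak with tf.keras (the layer sizes and the 26-base kmer length are illustrative assumptions): fix the batch size to 104, mark the LSTM as stateful, and disable shuffling.

from tensorflow import keras

# Stateful LSTM with a fixed batch size of 104 (the number of kmers per read);
# stateful=True carries hidden state from one batch to the next, so shuffling must stay off.
model = keras.Sequential([
    keras.layers.LSTM(64, stateful=True, batch_input_shape=(104, 26, 4)),
    keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# model.fit(x, y, batch_size=104, shuffle=False, epochs=1)
# model.reset_states()  # clear the carried-over state before an unrelated set of reads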

Jan 2 week TODO

  1. Train on the length 1000 contig, seed with say the first 100 bases or so (or maybe training length to be safe) and predict the next base until the end of the contig

Evaluate how accurate each base prediction is (0 or 1)
Find the average accuracy

  2. Hook up the word embedding
  3. Investigate how to exploit the GPU better
  4. Order of approximation? Loss function for model?

Make a "Smart" dummy model

This model will not be a machine learning model. This will model a memorization task.

It will simply hash every input kmer and output kmer combo and map every input kmer to the most frequent output kmer. This is meant to serve as a baseline since this is probably the best "function" we'd get from our data. However, if we get an input kmer we've never seen before we'll output garbage because we don't know how to answer something we never memorized.
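
A minimal sketch of this baseline (the class and method names are only illustrative): count every observed (input kmer, output) pair and answer queries with the most frequent output, returning nothing for unseen kmers.

from collections import Counter, defaultdict

class MemorizationBaseline:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, pairs):
        # pairs: iterable of (input_kmer, observed_output) tuples from the training data
        for kmer, output in pairs:
            self.counts[kmer][output] += 1

    def predict(self, kmer):
        # Most frequent output seen for this kmer; None (i.e. garbage) for unseen kmers
        if kmer not in self.counts:
            return None
        return self.counts[kmer].most_common(1)[0][0]

baseline = MemorizationBaseline()
baseline.train([("ACGTA", "C"), ("ACGTA", "C"), ("ACGTA", "G")])
print(baseline.predict("ACGTA"))  # C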

TODO

We seed then estimate
eg.
SSSSSS-------X-XXX-
QQQQ
Can we train an LSTM to predict the 10th base over? (input/output will be slightly different)
Done

Try to predict starting from say 300-1000 (skip over the part that fails) to see if "memory" is consistent Done

Make the sliding window average for the #1 choice, not of the actual binary correctness Done
For "beam search": Done
Predict the top 2 bases
Seed with next base and the top #2 base and see which one gives a top #1 base with higher probability - choose that one.
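
A minimal sketch of this step, where predict_next_probs is an assumed stand-in for the model's next-base distribution: take the top 2 candidate bases, extend each by its best follow-up base, and keep whichever candidate's follow-up has the higher probability. (Full beam search would instead score whole paths by summed log probabilities, as in the beam_search outputs.)

import numpy as np

BASES = "ACGT"

def extend_with_top_two(predict_next_probs, seed):
    distribution = predict_next_probs(seed)        # length-4 vector over A, C, G, T
    top_two = np.argsort(distribution)[-2:]        # indices of the top #1 and #2 bases
    best_extension, best_follow_up_prob = None, -1.0
    for base_index in top_two:
        candidate = seed + BASES[base_index]
        follow_up = predict_next_probs(candidate)  # distribution after adding this base
        follow_up_prob = float(np.max(follow_up))
        if follow_up_prob > best_follow_up_prob:
            best_extension = candidate + BASES[int(np.argmax(follow_up))]
            best_follow_up_prob = follow_up_prob
    return best_extension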

Experiment with other gap filling software (eg. Sealer) and get a benchmark. Then see how well we do.

TODO

  1. Make font size even bigger Done?
  2. Keep a text file of all the validation metrics and training metrics from now on so we can merge them arbitrarily Done, np file rather than text
  3. Try to fill the Sealer gaps Attempted
  4. Use alignment and reference genome to extract candidate gaps for Sealer unknown problems and try to fill them
  5. Play with the toy gap to see if using flanks as a proxy for the gaps themselves is reasonable
  6. Try using CLUSTAL (Clustal is slow, use MUSCLE instead) Ready
  7. Make an all in one that trains model and predicts Done

Some resources

Regarding quality scores for output

Does it make sense to not have them? Not like I'm particularly keen on adding them, but it feels like by not having them we're treating the output sequences as being "correct" whereas input sequences have some degree of uncertainty to them, although the output sequences technically may also be wrong at certain parts.

todo

  1. Try to recursively regenerate the sequence where if you make a screwup you backtrack and get a new seed right before the incorrectly predicted base (or maybe not right before... go a bit further back)
  2. Augment the data generator to make a text file of all the input-output mappings
  3. When we reach a base that we predict incorrectly, we generate a new LSTM that's seeded with the exact same k-1mer but we use the 2nd best base.

New Implementation

  1. Create a generator that randomly samples some reads, picks a random k between say 25 and min(l), integer encodes, and passes into embedding layer... output probably shouldn't be longer than what the minimum read length supports
  • we will keep doing this until early stopping makes us stop (at which point we've probably sampled enough)
  2. Probably don't need the encoder decoder paradigm but just keep it in for now
  • actually we probably need to get rid of this... if we want feeding embeddings to our encoding layer to be equivalent to just feeding in a longer sequence then we probably want a single RNN (or something similar) rather than feeding into a decoder RNN
  3. Use an embedding layer (the Keras tutorial tells you how) - as a result we probably don't need 1 hot encoding
  4. Fix the bug where instead of passing in the probability vector you pass in the one-hot encoding

In the end we'll probably seed with the minimum of some arbitrary length and the length of the known region before the gap.

https://github.com/farizrahman4u/seq2seq
