gappredict's Introduction

GapPredict

About

GapPredict is an LSTM character-level language model that can be used for filling gaps in de novo assemblies. GapPredict can predict the bases of a single gap using reads that map to the known flanking sequences of the gap.

In its current implementation, GapPredict predicts a user-defined number of bases after both the left flank and the right flank of the gap. If the user chooses a sufficiently long prediction length, then GapPredict may be able to predict both the gap and the reciprocal flank. A downstream local alignment tool (eg. Exonerate [1]) can then be used to align, say, the first 100 bases of the reciprocal flank to the prediction. If these 100 bases align with high % identity and % coverage (eg. >90%), then it is likely that the preceding bases are a good prediction of the actual gap.
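
As a rough illustration of this check, the sketch below slides the first 100 bases of the reciprocal flank along the predicted sequence and reports the best ungapped percent identity. It is a simplified stand-in for a proper local aligner such as Exonerate (the function name and threshold are only illustrative):

def best_ungapped_identity(prediction, reciprocal_flank, probe_length=100):
    # Take the first probe_length bases of the reciprocal flank as the query
    probe = reciprocal_flank[:probe_length]
    best = 0.0
    # Slide the probe along the prediction and score each ungapped window
    for offset in range(len(prediction) - len(probe) + 1):
        window = prediction[offset:offset + len(probe)]
        identity = sum(a == b for a, b in zip(probe, window)) / len(probe)
        best = max(best, identity)
    return best

# e.g. if best_ungapped_identity(prediction, reciprocal_flank) > 0.9, the bases
# preceding the matching region are likely a good prediction of the gap.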

Installing GapPredict

Please ensure you are using Python3.7. Dependencies were last tested with Python3.7.9. Refer to https://www.tensorflow.org/install/gpu to hook up your GPU if you don't have another set-up in place. In the event that v1.1 doesn't work, try using v1.0.2 with Python3.6 and file an issue. v1.0.2 contains the intended configuration with Tensorflow 1.5.

You can install GapPredict by cloning or downloading the .zip file directly from GitHub.

git clone git@github.com:bcgsc/GapPredict.git

GapPredict uses Python3.7 (Python3.6 for v1.0.2) and the packages outlined in requirements.txt (https://github.com/bcgsc/GapPredict/blob/master/requirements.txt). These packages can be quickly installed by running:

pip install -r requirements.txt

or

python -m pip install -r requirements.txt

In order to train models and predict efficiently, a GPU is mandatory. Steps to install CUDA and cuDNN are available in the TensorFlow GPU installation guide linked above.

Input Data Preparation

GapPredict requires two input files - a FASTA file containing the sequences of the left and right flanks of your gap, and a FASTQ file containing reads mapping to your gap flanks. The length of the flanks is arbitrary; in our tests we used uniform lengths of 500 bp.

We used BioBloomTools' BioBloomMIMaker followed by BioBloomMICategorizer [2] to extract, from the full set of reads used in the de novo assembly, only the reads that map to our gap's flanks.

Gap Prediction With GapPredict

To run GapPredict, navigate to the lib directory and call:

python GapPredict.py -o <output directory> -fa <FASTA path> -fq <FASTQ path>

For help, call:

python GapPredict.py --help

We've provided sample FASTA and FASTQ files in lib/data/real_gaps/sealer_filled and lib/data/real_gaps/sealer_unfilled. Gaps in lib/data/real_gaps/sealer_filled have been filled by Sealer [3], a state-of-the-art gap-filling tool, so we've also included Sealer's output for the actual gap sequence to use as a reference. For gaps in lib/data/real_gaps/sealer_unfilled (and also for gaps in lib/data/real_gaps/sealer_filled), a reference sequence must be obtained from the human reference genome (HG38).

eg. python GapPredict.py -o <output directory> -fa .../lib/data/real_gaps/sealer_filled/7391826_358-1408.fasta -fq .../lib/data/real_gaps/sealer_filled/7391826_358-1408.fastq

Where ... is the absolute path to the lib directory.

GapPredict Outputs

GapPredict outputs the following directory structure:

Root directory (<gap ID>R<replicate number>)

  • beam_search - contains results from predicting the flanks and gaps using beam search
    • predict_gap - contains results from predicting the gap from both the forward and reverse complement direction using beam search with a user specified beam length
      • inner directories specifying which direction the gap was predicted from
        • beam_search_predicted_probabilities.npy - vector of length B (beam length) of log-sum probabilities for each predicted gap, in descending order
        • beam_search_predictions.fasta - file of the top B gap predictions for the gap from the given direction
    • regenerate_seq - contains results from predicting the left flank and the right flank from both the forward and reverse complement direction using beam search with a user specified beam length
      • inner directories specifying the left/right flank and which direction the flank was predicted from
        • beam_search_predicted_probabilities.npy - vector of length B (beam length) of log-sum probabilities for each predicted flank, in descending order
        • beam_search_predictions.fasta - file of the top B predictions for the given flank from the given direction
  • predict_gap - contains results from predicting the gap from both the forward and reverse complement direction using the greedy algorithm (beam search with a beam length of 1)
    • inner directories specifying which direction the gap was predicted from
      • predicted_probabilities.npy - P x 4 matrix (where P is the length predicted) containing the probability vector output by the GapPredict model for each predicted base
  • regenerate_seq - contains results from predicting the left flank and the right flank from both the forward and reverse complement direction using the greedy algorithm (beam search with a beam length of 1)
    • flank_predict.fasta - contains the left flank and right flank predicted from both the forward and reverse complement directions (4 sequences total)
    • inner directories specifying the left/right flank and which direction the flank was predicted from
      • greedy_predicted_probabilities.npy - contains the log-sum probability for the greedy prediction
      • predicted_probabilities.npy - P x 4 matrix (where P is the length predicted) containing the probability vector output by the GapPredict model for each predicted base
      • random_predicted_probabilities.npy - vector of length P with the probability for each randomly chosen base
      • teacher_force_predicted_probabilities.npy - vector of length P with the probability of each base chosen to match the actual reference sequence
  • BS_<batch size>ED<embedding dimensions>LD<LSTM cells>E<epochs>R<replicate> - contains model training results
    • contains graphs for training loss, training accuracy, validation loss, and validation accuracy in addition to the matrix containing these metrics at each epoch
      • validation loss and validation accuracy are each stored as a V x E matrix, where E is the number of epochs and V is the number of sequences in the validation set, containing the respective metric for each of the V sequences
    • lengths.npy - vector of lengths for the validation set for weighted sums, where sequences are in the same order as the validation loss and accuracy matrices
  • gap_predict_align.fa - contains the sequences for the greedy prediction of the gap from both the left and right flanks (including the flank seeds), and the sequences from the input FASTA file
  • my_model_weights.h5 - contains GapPredict model parameters and can be loaded into a GapPredict model
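
As a minimal sketch of how these outputs might be inspected (the directory names below are hypothetical placeholders for the direction-specific inner directories), the probability files can be loaded with numpy:

import numpy as np

# P x 4 matrix: one probability vector per greedily predicted base (A, C, G, T order assumed here)
greedy_probs = np.load("predict_gap/forward/predicted_probabilities.npy")
print(greedy_probs.shape)

# length-B vector of log-sum probabilities for the beam search gap predictions, in descending order;
# entry i corresponds to the i-th sequence in beam_search_predictions.fasta
beam_scores = np.load("beam_search/predict_gap/forward/beam_search_predicted_probabilities.npy")
print(beam_scores)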

Pipeline Reproduction Steps

Refer to this link.

Citations

  1. G. S. C. Slater and E. Birney, “Automated generation of heuristics for biological sequence comparison”, BMC Bioinform., vol. 6, no. 31, Feb. 2005.
  2. J. Chu, H. Mohamadi, E. Erhan, J. Tse, R. Chiu, S. Yeo, and I. Birol, “Improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index Bloom filters”, bioRxiv:434795, Oct. 2018.
  3. D. Paulino, R. L. Warren, B. P. Vandervalk, A. Raymond, S. D. Jackman, and I. Birol, “Sealer: a scalable gap closing application for finishing draft genomes”, BMC Bioinform., vol. 16, no. 230, Jul. 2015.
  4. E. Chen, J. Chu, J. Zhang, R. L. Warren, and I. Birol, “GapPredict - A Language Model for Resolving Gaps in Draft Genome Assemblies”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, doi:10.1109/TCBB.2021.3109557.


gappredict's Issues

Winter Break TODO

  1. Hook up the GPU code and see how much faster it is
  • right now we take about 50 minutes to handle 5 million 26-mers with the Vanilla DNN
  • try to increase batch size
  • can we use a conv net to do word embedding on our kmer, then shrink our RNN hidden layer size, since the embedded kmer simplifies things a bit?
  2. We want graphs for 1 kb, 10 kb, 100 kb, 1 Mb
  • for training time
  • for training accuracy
  • for validation accuracy
  • benchmarks with respect to coverage (1X, 10X, 20X, 30X, etc... just take a random sample), k-mer length, model architecture (ie. hyperparameters)

https://arxiv.org/abs/1611.02683
Unsupervised training?

fix bidirectional align

messed it up and it gets overwritten by the second flank... should just put a single bidirectional align in each flank folder not a single one on the outside that can get overwritten

Training Set

Currently our "training set" for training accuracy per epoch is just the last batch of the epoch

Should we maybe store every batch run in an epoch and do a final "review" over all of these batches at the end of the epoch? (review set changes each epoch)

TODO

  1. Extend to predict d > 1 bases and see how it performs
  • "Done"... the statistics should now be pretty good for d > 1 base but I need to hook this up with the Keras model and test it
  2. Try making the training set an actual chromosome (or even just a tiny chunk of a contiguous region so our kmers actually have some context)
  • "Done"... we have a shell script to extract all reads mapping to a contiguous region of an actual bacterial chromosome now
  3. Train on an entire genome, start at some random kmer and estimate the next d bases
  • plot the base correctness against how many bases down you're predicting and see when we get to 25%
  • Not really done... I've tried this on a contiguous stretch of reads and got very good accuracy but this is only for a 1 Kb stretch (which isn't very long)... also this is only next 1 base prediction and haven't expanded this to d bases yet
  4. Take an entire genome, delete some chunks out of it, select some seeds at the edge of those chunks and see if you can predict the next d unknown kmers
  • we will train on the deleted chunks
  • Done, or at least the skeleton to predict d unknown bases given a model trained on any number of bases

Shannon entropy?
ntHits?
cBF - tries to keep the kmers that are well covered (and thus not likely to be incorrect) but also not too redundant (and thus likely have a poor mapping)

  • use this to filter out reads

Nov 7
First, should make sure that my "hole" in the dataset is actually a valid hole.
Then probably want to look into hooking up the model saving functionality so we can hook up an application that predicts an arbitrary number of bases

  • Note: It seems to make a bit less sense to be predicting say 30 bases if we only trained on the next 1 base... although it would be interesting to see how much accuracy falls off; however we can compare training on the next 1 base and the next 30 bases

Scalability - make this run from the terminal (to see if we can use the stronger hardware), add more technology; can we use C++?
How does increasing the k of k-mers affect the prediction and training time?
Get a deeper understanding of the architecture of this model and whether it's appropriate for our task?

  • Does a CNN make sense for this problem?

Perhaps e-mail the MachineLearningMastery guy and ask him if there are ways to optimize the model
Use the French-English prediction as a benchmark
Look into Google machine learning (eg. their autocomplete, their sentence completion models too)?

Maybe just other sentence completion models as well
See if we can find benchmark data for other RNN/LSTM models (and also with different frameworks) - make sure it's working on big data as well

https://github.com/tensorflow/nmt

Also, hook up the command line stuff, run on a 10k bp contig, get the runtime stats


Nov 21
Parameter sweep

  • check how different hyperparameters affect runtime and accuracy

Consider:

  1. A vanilla neural network
  • Encode your read (eg. one hot), predict the next base
  • flatten it (4xk vector)
  • try both onehot encoding and "w-mer" embedding (w << k)
    - consider a conv net (take say 4x w windows with w << k, sliding window)
  • take in a matrix and output a base
  • one hot encoding
  2. Kmer embeddings (not just 1 hot encodings)
  3. Run ntHits and filter out kmers (or the count based filtering)
  • gets rid of low confidence kmers
  4. See how coverage affects model quality:
  • eg. sample 10% to simulate 6X coverage from a 60X coverage dataset
  5. Get a set of contigs and a set of reads not represented in the assembly (this set of reads might be small enough)
  • train on the reads and see if we can patch the contigs/scaffold
  6. See if we can increase the number of threads in HPCE

TODO:

Check backwards prediction (we always do forwards prediction right now) Done

Make a script that compares the filled gaps with the human genome assembly REFERENCE

Having both the flanks is important because if we do a bidirectional prediction:

  1. We expect to overlap at some point
  2. We expect each prediction direction to reach the other flank eventually
    We can kind of estimate the length of the gap so if we go past it we know that something bad happened
  • We need to use an assembly from the same read set that we're training on
  1. First find an Assembly from Illumina 150-PE reads

  2. Find the PE reads themselves

  3. Redo Justin's pipeline on them to extract gaps + gap reads
    Done... but the pipeline seems to have given weird results this time

  4. Write new code to make models on multiple of those gaps^ while we wait for reference sequence

  5. Try the early stopping approach with the flanks (and don't forget predicting in the other direction)
    Early stopping done, parallelism not so much

Make the metric a bit less pessimistic (just use % match rather than % predicted... we can still do early stopping on 100% though but we can now pick a set of weights that might perform well given a polymorphism for example... humans are diploid)
Done

Implement a rolling encoding

Because consecutive k-mers overlap, most of the encoding work for one kmer can be reused for the next.

OR

See if we can make an encoded vector/matrix for each read and just make each kmer's one hot encoding point to a specific offset + length.
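
A minimal sketch of the second option, assuming one-hot encoding over A/C/G/T (the function names are illustrative): encode each read once as an L x 4 matrix and expose every k-mer as a slice into it rather than re-encoding.

import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode_read(read):
    # Encode the whole read once as an L x 4 one-hot matrix (one row per base)
    matrix = np.zeros((len(read), 4), dtype=np.float32)
    for i, base in enumerate(read):
        matrix[i, BASE_INDEX[base]] = 1.0
    return matrix

def kmer_views(encoded_read, k):
    # Each kmer's encoding is just an offset + length slice of the read's matrix (no copying)
    for offset in range(encoded_read.shape[0] - k + 1):
        yield encoded_read[offset:offset + k]

encoded = one_hot_encode_read("ACGTACGTAC")
windows = list(kmer_views(encoded, 5))  # 6 overlapping 5-mers, all views into one matrix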

IGV Alignment

We do not care about flank predictions (especially after we revise how to get the flank metrics)
Skip them in the alignment

Jan 9 TODO

A-R-B-R-C
R is a repeat longer than 26 but shorter than the read length l. If R > l then who knows what will happen, but we can revisit this later. If it does work in that case, then it shows that the LSTM may be able to hold memory longer than the read length.

One base extensions are expected to choke at this point. If the LSTM works then we expect that retaining region A will allow us to fill in R

  • Try to incorporate attention
  • Search up language models (not translation; more like predicting the next word)

Start with some static length n
Split dataset into n predict next base; n+1 predict next base; n+2 predict next base

Read!

Word2vec implementation

https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
https://www.nature.com/articles/s41598-018-33321-1
https://datasciencehongkong.files.wordpress.com/2018/02/dna2vec.pdf
https://arxiv.org/pdf/1701.06279.pdf
(the gist of this seems to be that if you fragment a DNA sequence into kmers in a smart way then train a word2vec model, you can get meaningful embeddings which may be better than one hot encodings)

We also may want to try GRUs or bidirectionality in our models?

https://medium.com/swlh/playing-with-word-vectors-308ab2faa519
(If we ever do things with the actual vectors, a hack is to discard query terms from the results... read the article to see what I mean)

https://arxiv.org/abs/1301.3781
Maybe read the entire journal too for word2vec

https://www.biorxiv.org/content/biorxiv/early/2018/05/31/335943.full.pdf

I think we should think of a good way of determining whether our embeddings show some semantic relationships, especially given that we'll eventually train on the full read set
word2vec itself with the Wikipedia dataset isn't perfect but it can do things like king - man + woman = queen. Similarly dna2vec can show that say ATA + AGG = ATAAGG (not possible with our embedding right now though because I'm using static length)

New Approach

Needs: Reference, read data
The reference ideally has the gaps filled

  1. Do an assembly
  2. Find a region with a gap
  3. Get the reads mapping to the borders of that gap
  4. Get reads that don't map (and thus some of them map to the gap)
  5. Training: Same as before
  6. Validation: Just try to reproduce the sequence using a seed before the gap. Use % accuracy or length before first error as a metric

Note: Perhaps we do this with 2 gaps because this will give us 2 gaps with an ambiguity

TODO

  1. Reverse complement not reverse Done
  2. Rerun sealer on the gap and the same data Done, indirectly, in the sense that we only use sealer gaps now
  3. Bigger font size for the plot and fatter lines Done
  4. Look at gaps that Sealer closed. Get the reads. See if they map to the gaps that Sealer closed to see coverage. Done

Take the gap that sealer closed and align the gap with the context sequence against the reference human genome. See if Sealer actually filled it correctly. If so then use the gap because we consider it to be "easy".
  5. We don't actually care what the model wants to predict for the flanking region Done? But still need to put into practice

  • for training we do what we do
  • for validation/early stopping we will see how well we predict the flank
  • for prediction itself only focus on the gap and seed in the flanking region
  1. "Teacher Forcing" on the validation sequence Done
  • predict base, see if it's correct, regardless of it's correct feed in the correct next base... this is not actually teacher forcing because it doesn't really do anything to training
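
A minimal sketch of this evaluation, where predict_next_probs is an assumed stand-in for the trained model's next-base distribution: record the probability assigned to each correct base, but always feed the correct base back in regardless of what was predicted.

import numpy as np

BASES = "ACGT"

def teacher_forced_probabilities(predict_next_probs, seed, reference):
    # Probability the model gives to each correct base along the reference sequence
    context = seed
    probs = []
    for true_base in reference:
        distribution = predict_next_probs(context)  # length-4 vector over A, C, G, T
        probs.append(distribution[BASES.index(true_base)])
        context = context + true_base  # feed in the correct base, not the predicted one
    return np.array(probs)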

Visualization of sequence reproduction

  1. 1 significant figure
  2. Sliding window plot of average correctness (8, 20)...
    2.5) Also plot # of matches within the window
  3. Individual base prediction plot for each base
  4. Flag the seed
  5. Actual on top
  6. Annotate matches and errors along the alignment

Fix full pipeline

Save the probabilities for gap prediction (we only do it for flanks)

TODO

  1. Do Sealer -> miBF
  2. send Justin the draft genome + read set
  3. run 10000 epochs

Make a Keras model for word2vec

Start by googling existing models

We probably want to see if we can find a way to train it to handle long distance relationships since I feel like the training set would only be limited to the length of a single read

Implement some controls

https://cs224d.stanford.edu/reports/jessesz.pdf

As specified here two useful controls seem to be:

  1. A random genome
  2. A genome consisting of a repeat of a single substring

The point of these controls is to help pick hyperparameters and models
Models that perform poorly on these 2 controls are not considered as these 2 controls have rather "obvious" predictions

Tweak to consider

Make a stateful RNN that uses batches of size say 104 (the number of kmers per read) and disables shuffling, since all of these kmers are connected to each other (a minimal sketch follows the list below)

Problem:

  • Small batch size = long runtime
  • Need to ensure kmer length is the same
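
A minimal sketch of this tweak with tf.keras (the layer sizes and the 26-base kmer length are illustrative assumptions): fix the batch size to 104, mark the LSTM as stateful, and disable shuffling.

from tensorflow import keras

# Stateful LSTM with a fixed batch size of 104 (the number of kmers per read);
# stateful=True carries hidden state from one batch to the next, so shuffling must stay off.
model = keras.Sequential([
    keras.layers.LSTM(64, stateful=True, batch_input_shape=(104, 26, 4)),
    keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# model.fit(x, y, batch_size=104, shuffle=False, epochs=1)
# model.reset_states()  # clear the carried-over state before an unrelated set of reads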

Jan 2 week TODO

  1. Train on the length 1000 contig, seed with say the first 100 bases or so (or maybe training length to be safe) and predict the next base until the end of the contig

Evaluate how accurate each base prediction is (0 or 1)
Find the average accuracy

  2. Hook up the word embedding
  3. Investigate how to exploit the GPU better
  4. Order of approximation? Loss function for model?

Make a "Smart" dummy model

This model will not be a machine learning model. This will model a memorization task.

It will simply hash every input kmer and output kmer combo and map every input kmer to the most frequent output kmer. This is meant to serve as a baseline since this is probably the best "function" we'd get from our data. However, if we get an input kmer we've never seen before we'll output garbage because we don't know how to answer something we never memorized.
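
A minimal sketch of this baseline (the class and method names are only illustrative): count every observed (input kmer, output) pair and answer queries with the most frequent output, returning nothing for unseen kmers.

from collections import Counter, defaultdict

class MemorizationBaseline:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, pairs):
        # pairs: iterable of (input_kmer, observed_output) tuples from the training data
        for kmer, output in pairs:
            self.counts[kmer][output] += 1

    def predict(self, kmer):
        # Most frequent output seen for this kmer; None (i.e. garbage) for unseen kmers
        if kmer not in self.counts:
            return None
        return self.counts[kmer].most_common(1)[0][0]

baseline = MemorizationBaseline()
baseline.train([("ACGTA", "C"), ("ACGTA", "C"), ("ACGTA", "G")])
print(baseline.predict("ACGTA"))  # C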

TODO

We seed then estimate
eg.
SSSSSS-------X-XXX-
QQQQ
Can we train an LSTM to predict the 10th base over? (input/output will be slightly different)
Done

Try to predict starting from say 300-1000 (skip over the part that fails) to see if "memory" is consistent Done

Make the sliding window average for the #1 choice, not of the actual binary correctness Done
For "beam search": Done
Predict the top 2 bases
Seed with next base and the top #2 base and see which one gives a top #1 base with higher probability - choose that one.
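
A minimal sketch of this step, where predict_next_probs is an assumed stand-in for the model's next-base distribution: take the top 2 candidate bases, extend each by its best follow-up base, and keep whichever candidate's follow-up has the higher probability. (Full beam search would instead score whole paths by summed log probabilities, as in the beam_search outputs.)

import numpy as np

BASES = "ACGT"

def extend_with_top_two(predict_next_probs, seed):
    distribution = predict_next_probs(seed)        # length-4 vector over A, C, G, T
    top_two = np.argsort(distribution)[-2:]        # indices of the top #1 and #2 bases
    best_extension, best_follow_up_prob = None, -1.0
    for base_index in top_two:
        candidate = seed + BASES[base_index]
        follow_up = predict_next_probs(candidate)  # distribution after adding this base
        follow_up_prob = float(np.max(follow_up))
        if follow_up_prob > best_follow_up_prob:
            best_extension = candidate + BASES[int(np.argmax(follow_up))]
            best_follow_up_prob = follow_up_prob
    return best_extension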

Experiment with other gap filling software (eg. Sealer) and get a benchmark. Then see how well we do.

TODO

  1. Make font size even bigger Done?
  2. Keep a text file of all the validation metrics and training metrics from now on so we can merge them arbitrarily Done, np file rather than text
  3. Try to fill the Sealer gaps Attempted
  4. Use alignment and reference genome to extract candidate gaps for Sealer unknown problems and try to fill them
  5. Play with the toy gap to see if using flanks as a proxy for the gaps themselves is reasonable
  6. Try using CLUSTAL (Clustal is slow, use MUSCLE instead) Ready
  7. Make an all in one that trains model and predicts Done

Some resources

Regarding quality scores for output

Does it make sense to not have them? Not like I'm particularly keen on adding them, but it feels like by not having them we're treating the output sequences as being "correct" whereas input sequences have some degree of uncertainty to them, although the output sequences technically may also be wrong at certain parts.

todo

  1. Try to recursively regenerate the sequence where if you make a screwup you backtrack and get a new seed right before the incorrectly predicted base (or maybe not right before... go a bit further back)
  2. Augment the data generator to make a text file of all the input-output mappings
  3. When we reach a base that we predict incorrectly, we generate a new LSTM that's seeded with the exact same k-1mer but we use the 2nd best base.

New Implementation

  1. Create a generator that randomly samples some reads, picks a random k between say 25 and min(l), integer encodes, and passes into embedding layer... output probably shouldn't be longer than what the minimum read length supports
  • we will keep doing this until early stopping makes us stop (at which point we've probably sampled enough)
  2. Probably don't need the encoder decoder paradigm but just keep it in for now
  • actually we probably need to get rid of this... if we want feeding embeddings to our encoding layer to be equivalent to just feeding in a longer sequence then we probably want a single RNN (or something similar) rather than feeding into a decoder RNN
  3. Use an embedding layer (the Keras tutorial tells you how) - as a result we probably don't need 1 hot encoding
  4. Fix the bug where instead of passing in the probability vector you pass in the one-hot encoding

In the end we'll probably seed with the minimum of some arbitrary length and the length of the known region before the gap.

https://github.com/farizrahman4u/seq2seq
