
mohawk's People

Contributors

gwarmstrong

mohawk's Issues

rework cli for training and classifying

Small to-do for classify CLI:

  • change model to model_path

Write train CLI:

  • options for seed, number of reads to simulate, train_ratio, batch_size, gpu, summary_interval, epochs, summarize, learning_rate, log_dir, concise=True, and a config file (see the sketch after this list)
  • essentially implement what is in the trial scripts
  • move trial_ids and validation ids to files
  • find way to test
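
As a rough starting point, the train command could be wired up with click. This is a minimal sketch covering a subset of the options only: the option names, defaults, the hypothetical train() body, and the JSON config handling are assumptions, not mohawk's actual interface.

# Sketch of a possible `mohawk train` command using click.
# Option names mirror the to-do list above; defaults and the config-file
# handling are illustrative assumptions.
import json

import click


@click.command()
@click.option('--seed', type=int, default=None)
@click.option('--n-reads', type=int, default=10000,
              help='Number of reads to simulate.')
@click.option('--train-ratio', type=float, default=0.8)
@click.option('--batch-size', type=int, default=64)
@click.option('--gpu/--no-gpu', default=False)
@click.option('--summary-interval', type=int, default=10)
@click.option('--epochs', type=int, default=100)
@click.option('--learning-rate', type=float, default=1e-3)
@click.option('--log-dir', type=click.Path(), default='logs')
@click.option('--config', type=click.Path(exists=True), default=None,
              help='Optional config file; values override the flags above.')
def train(seed, n_reads, train_ratio, batch_size, gpu, summary_interval,
          epochs, learning_rate, log_dir, config):
    """Train a mohawk model (sketch only)."""
    params = dict(seed=seed, n_reads=n_reads, train_ratio=train_ratio,
                  batch_size=batch_size, gpu=gpu,
                  summary_interval=summary_interval, epochs=epochs,
                  learning_rate=learning_rate, log_dir=log_dir)
    if config is not None:
        with open(config) as fh:
            params.update(json.load(fh))  # config file wins over CLI flags
    click.echo(f'training with: {params}')  # placeholder for the real trainer


if __name__ == '__main__':
    train()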

save numeric hyperparameters as metrics

Based on my experience with tensorboard, this will help for two reasons:

  • Hyperparameters are stored as strings, so numeric hyperparameters can sort in unexpected order
  • Metrics are much easier to filter on in the tensorboard interface, so models could be subset for analysis based on their hyperparameters (a sketch follows below)
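
A minimal sketch of one way to do this with pytorch's tensorboard writer: log the final metrics with add_hparams and re-log the numeric hyperparameters as scalars so they stay numeric. The hyperparameter and metric names below are illustrative, not mohawk's logging schema.

from torch.utils.tensorboard import SummaryWriter

# Illustrative values only
hparams = {'learning_rate': 1e-3, 'batch_size': 64, 'n_reads': 10000}
final_metrics = {'metrics/val_accuracy': 0.92, 'metrics/val_loss': 0.31}

with SummaryWriter(log_dir='logs/trial_0') as writer:
    # add_hparams records hyperparameters alongside summary metrics for the
    # HParams tab; re-logging the numeric hyperparameters as scalars keeps
    # them sortable/filterable rather than string-typed.
    writer.add_hparams(hparams, final_metrics)
    for name, value in hparams.items():
        writer.add_scalar(f'hparams_as_metrics/{name}', value, 0)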

Use google-nucleus for parsing sequencing data

It could be nice to take advantage of Google's nucleus for parsing sequencing data.

In this Colab notebook it looks pleasant to work with, and could potentially save time and memory on parsing and sampling data.

It may be a bit of a pain to integrate with pytorch and will need some benchmarking if implemented.
See these links for some starting info on this front:
https://discuss.pytorch.org/t/read-dataset-from-tfrecord-format/16409/7
https://github.com/pgmmpk/tfrecord
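
One way this might plug into pytorch, whether nucleus or a TFRecord reader ends up being used, is to wrap the record iterator in an IterableDataset. The read_records helper below is a hypothetical placeholder for that reader; only the torch pieces are real APIs.

import torch
from torch.utils.data import DataLoader, IterableDataset

BASE_TO_INT = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'N': 4}


def read_records(path):
    """Placeholder: yield sequence strings from `path`, e.g. via a nucleus
    FastqReader or a TFRecord parser (not implemented here)."""
    yield from ('ACGT', 'GGCA')  # stand-in data so the sketch runs


class SequenceRecordDataset(IterableDataset):
    """Stream records from a reader into integer-encoded tensors."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        for seq in read_records(self.path):
            # encode each base as an integer for downstream embedding/one-hot
            yield torch.tensor([BASE_TO_INT.get(b, 4) for b in seq])


# batch_size=None disables automatic batching for the streaming dataset
loader = DataLoader(SequenceRecordDataset('reads.fastq'), batch_size=None)
for encoded in loader:
    print(encoded)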

improve model training flexibility

The goal here is to be able to construct multiple types of models that accomplish the same task, but may have drastically different ways of preparing and treating data, using the same user-facing API/CLI.

Details:
Want to be able to train models with a command that looks something like:

mohawk train --model-type [model_type] \
    --sequence-directory /path/to/seq/dir \
    --labels /path/to/seq/labels.txt \
    --depth [number_of_reads_to_generate] \
    --length [length_of_sequences_to_produce] \
    --random-seed [seed_for_rng]

Internally, the Python code will need to be written so that different methods for data preparation, training, etc. can be dispatched for different model types.
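
A sketch of one possible internal layout, assuming a registry keyed on --model-type; the class names and method signatures here are made up for illustration, not mohawk's actual internals.

from abc import ABC, abstractmethod

MODEL_REGISTRY = {}


def register_model(name):
    """Class decorator that maps a --model-type string to a class."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap


class TrainableModel(ABC):
    @abstractmethod
    def prepare_data(self, sequence_directory, labels, depth, length,
                     random_seed):
        """Simulate/encode reads however this model type requires."""

    @abstractmethod
    def fit(self, dataset):
        """Train this model type on the prepared dataset."""


@register_model('small-cnn')
class SmallCNN(TrainableModel):
    def prepare_data(self, sequence_directory, labels, depth, length,
                     random_seed):
        return []  # placeholder

    def fit(self, dataset):
        pass  # placeholder


def train(model_type, **kwargs):
    model = MODEL_REGISTRY[model_type]()    # look up the requested type
    dataset = model.prepare_data(**kwargs)  # type-specific data preparation
    model.fit(dataset)                      # type-specific training loop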

pass in hyperparameters from the command line

It would be nice to have a way to pass custom hyperparameters in from the command line, so that factors considered upstream of the model hyperparameters can also be incorporated.
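
One option is a repeatable --hparam NAME=VALUE flag; the flag name and the type-coercion rules below are assumptions, and a config file (see the next issue) may be preferable.

import argparse


def _coerce(value):
    """Best-effort conversion of a string to int or float, else keep it."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


parser = argparse.ArgumentParser()
parser.add_argument('--hparam', action='append', default=[],
                    metavar='NAME=VALUE',
                    help='Extra hyperparameter; may be given multiple times.')
args = parser.parse_args(['--hparam', 'dropout=0.2', '--hparam', 'n_layers=3'])

pairs = (item.split('=', 1) for item in args.hparam)
hparams = {name: _coerce(value) for name, value in pairs}
print(hparams)  # {'dropout': 0.2, 'n_layers': 3}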

trial scripts for training should be transitioned to CLI with config files

Currently there are some Python scripts I have been using to train models; they contain both the parameters and the code that kicks off training. This should be transitioned to a mohawk train command in the CLI, with parameters passed in as some sort of config file.
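
A minimal sketch of the config-file side, assuming a flat JSON file whose keys mirror the trial-script parameters; the defaults and key names are illustrative, not an agreed-upon schema.

import json

# Illustrative defaults; a trial config file only needs to list the values
# that differ from these.
DEFAULTS = {
    'seed': 0,
    'n_reads': 10000,
    'train_ratio': 0.8,
    'batch_size': 64,
    'epochs': 100,
    'learning_rate': 1e-3,
    'log_dir': 'logs',
}


def load_config(path):
    """Merge a JSON config file over the defaults, rejecting unknown keys."""
    with open(path) as fh:
        user_config = json.load(fh)
    unknown = set(user_config) - set(DEFAULTS)
    if unknown:
        raise ValueError(f'unknown config keys: {sorted(unknown)}')
    return {**DEFAULTS, **user_config}

A trial config would then reduce to something like {"seed": 42, "epochs": 50}, with mohawk train --config trial.json filling in the rest.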

train on data with errors

Currently, model training is performed on reads taken directly from genomes; it would be useful to incorporate some reads with errors into the training regime.
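
A minimal sketch of the simplest version, a uniform per-base substitution model; the error rate and the substitution-only assumption are placeholders until a more realistic error profile (#1) exists.

import random

BASES = 'ACGT'


def add_substitution_errors(read, error_rate=0.01, rng=None):
    """Return a copy of `read` with each base substituted with prob `error_rate`."""
    rng = rng or random.Random()
    out = []
    for base in read:
        if base in BASES and rng.random() < error_rate:
            # replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return ''.join(out)


print(add_substitution_errors('ACGTACGTACGT', error_rate=0.25,
                              rng=random.Random(0)))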

mock ete toolkit for unit tests

The main problem is downloading the taxdump, etc., with ete on Travis. This could be fixed by mocking the parts of the ete toolkit responsible for downloading, and reworking the tests that require this functionality so that the unit tests don't actually need it.
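
A sketch of the mocking side with unittest.mock, assuming mohawk wraps ete3's NCBITaxa in a small helper; get_lineage_names below is a hypothetical stand-in for that helper, and only ete3.NCBITaxa, get_lineage, and get_taxid_translator are real ete APIs.

import unittest
from unittest import mock


def get_lineage_names(taxid):
    """Hypothetical mohawk-style helper that wraps ete3."""
    from ete3 import NCBITaxa
    ncbi = NCBITaxa()  # normally triggers the taxdump download
    lineage = ncbi.get_lineage(taxid)
    return ncbi.get_taxid_translator(lineage)


class TestLineage(unittest.TestCase):
    @mock.patch('ete3.NCBITaxa')
    def test_lineage_without_download(self, mock_ncbi):
        # The mocked class returns canned data, so no taxdump is downloaded.
        instance = mock_ncbi.return_value
        instance.get_lineage.return_value = [1, 2]
        instance.get_taxid_translator.return_value = {1: 'root', 2: 'Bacteria'}

        names = get_lineage_names(2)

        self.assertEqual(names[2], 'Bacteria')
        instance.get_lineage.assert_called_once_with(2)


if __name__ == '__main__':
    unittest.main()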

Improve Training data simulation

Some ideas of things that can/should be done

  • ensure that reads are sampled from the same location only once #27 (see the sketch after this list)
  • Add simulated errors to training data (could involve neural network) #1
  • Add error correction module to fix low-quality sequences (could use some real data with alignments to genome?) #28
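
For the first item, a minimal sketch of sampling start positions without replacement so no two simulated reads come from exactly the same location; the function name and numpy-based approach are just one option.

import numpy as np


def sample_read_starts(genome_length, read_length, n_reads, seed=None):
    """Return unique start positions for `n_reads` reads of `read_length`."""
    rng = np.random.default_rng(seed)
    n_positions = genome_length - read_length + 1
    if n_reads > n_positions:
        raise ValueError('cannot sample more unique positions than exist')
    # replace=False guarantees each start position is used at most once
    return rng.choice(n_positions, size=n_reads, replace=False)


starts = sample_read_starts(genome_length=5000, read_length=150, n_reads=10,
                            seed=42)
print(sorted(starts.tolist()))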
