
mohawk's People

Contributors

gwarmstrong

mohawk's Issues

rework cli for training and classifying

Small to-do for classify CLI:

  • change model to model_path

Write train CLI:

  • options for seed, number of reads to simulate, train_ratio, batch_size, gpu, summary_interval, epochs, summarize, learning_rate, log_dir, concise=True, and a config file (see the sketch after this list)
  • essentially implement what is in the trial scripts
  • move trial_ids and validation ids to files
  • find way to test
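
As a rough starting point, the train command could be wired up with click. This is a minimal sketch covering a subset of the options only: the option names, defaults, the hypothetical train() body, and the JSON config handling are assumptions, not mohawk's actual interface.

# Sketch of a possible `mohawk train` command using click.
# Option names mirror the to-do list above; defaults and the config-file
# handling are illustrative assumptions.
import json

import click


@click.command()
@click.option('--seed', type=int, default=None)
@click.option('--n-reads', type=int, default=10000,
              help='Number of reads to simulate.')
@click.option('--train-ratio', type=float, default=0.8)
@click.option('--batch-size', type=int, default=64)
@click.option('--gpu/--no-gpu', default=False)
@click.option('--summary-interval', type=int, default=10)
@click.option('--epochs', type=int, default=100)
@click.option('--learning-rate', type=float, default=1e-3)
@click.option('--log-dir', type=click.Path(), default='logs')
@click.option('--config', type=click.Path(exists=True), default=None,
              help='Optional config file; values override the flags above.')
def train(seed, n_reads, train_ratio, batch_size, gpu, summary_interval,
          epochs, learning_rate, log_dir, config):
    """Train a mohawk model (sketch only)."""
    params = dict(seed=seed, n_reads=n_reads, train_ratio=train_ratio,
                  batch_size=batch_size, gpu=gpu,
                  summary_interval=summary_interval, epochs=epochs,
                  learning_rate=learning_rate, log_dir=log_dir)
    if config is not None:
        with open(config) as fh:
            params.update(json.load(fh))  # config file wins over CLI flags
    click.echo(f'training with: {params}')  # placeholder for the real trainer


if __name__ == '__main__':
    train()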

save numeric hyperparameters as metrics

Based on my experience with tensorboard, this will help for two reasons:

  • Hyperparameters are stored as strings, so numeric hyperparameters can sort in unexpected order
  • Metrics are much easier to filter on in the tensorboard interface, so models could be subset for analysis based on their hyperparameters (a sketch follows below)
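
A minimal sketch of one way to do this with pytorch's tensorboard writer: log the final metrics with add_hparams and re-log the numeric hyperparameters as scalars so they stay numeric. The hyperparameter and metric names below are illustrative, not mohawk's logging schema.

from torch.utils.tensorboard import SummaryWriter

# Illustrative values only
hparams = {'learning_rate': 1e-3, 'batch_size': 64, 'n_reads': 10000}
final_metrics = {'metrics/val_accuracy': 0.92, 'metrics/val_loss': 0.31}

with SummaryWriter(log_dir='logs/trial_0') as writer:
    # add_hparams records hyperparameters alongside summary metrics for the
    # HParams tab; re-logging the numeric hyperparameters as scalars keeps
    # them sortable/filterable rather than string-typed.
    writer.add_hparams(hparams, final_metrics)
    for name, value in hparams.items():
        writer.add_scalar(f'hparams_as_metrics/{name}', value, 0)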

Use google-nucleus for parsing sequencing data

It could be nice to take advantage of Google's nucleus for parsing sequencing data.

In this Colab notebook it looks pleasant to work with, and could potentially save time and memory on parsing and sampling data.

It may be a bit of a pain to integrate with pytorch and will need some benchmarking if implemented.
See these links for some starting info on this front:
https://discuss.pytorch.org/t/read-dataset-from-tfrecord-format/16409/7
https://github.com/pgmmpk/tfrecord
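
One way this might plug into pytorch, whether nucleus or a TFRecord reader ends up being used, is to wrap the record iterator in an IterableDataset. The read_records helper below is a hypothetical placeholder for that reader; only the torch pieces are real APIs.

import torch
from torch.utils.data import DataLoader, IterableDataset

BASE_TO_INT = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'N': 4}


def read_records(path):
    """Placeholder: yield sequence strings from `path`, e.g. via a nucleus
    FastqReader or a TFRecord parser (not implemented here)."""
    yield from ('ACGT', 'GGCA')  # stand-in data so the sketch runs


class SequenceRecordDataset(IterableDataset):
    """Stream records from a reader into integer-encoded tensors."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        for seq in read_records(self.path):
            # encode each base as an integer for downstream embedding/one-hot
            yield torch.tensor([BASE_TO_INT.get(b, 4) for b in seq])


# batch_size=None disables automatic batching for the streaming dataset
loader = DataLoader(SequenceRecordDataset('reads.fastq'), batch_size=None)
for encoded in loader:
    print(encoded)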

improve model training flexibility

The goal here is to be able to construct multiple types of models that accomplish the same task, but may have drastically different ways of preparing and treating data, using the same user-facing API/CLI.

Details:
Want to be able to train models with a command that looks something like:

mohawk train --model-type [model_type] \
    --sequence-directory /path/to/seq/dir \
    --labels /path/to/seq/labels.txt \
    --depth [number_of_reads_to_generate] \
    --length [length_of_sequences_to_produce] \
    --random-seed [seed_for_rng]

Internally, the Python code will need to be written so that different methods for data preparation, training, etc. can be dispatched for different model types.
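
A sketch of one possible internal layout, assuming a registry keyed on --model-type; the class names and method signatures here are made up for illustration, not mohawk's actual internals.

from abc import ABC, abstractmethod

MODEL_REGISTRY = {}


def register_model(name):
    """Class decorator that maps a --model-type string to a class."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap


class TrainableModel(ABC):
    @abstractmethod
    def prepare_data(self, sequence_directory, labels, depth, length,
                     random_seed):
        """Simulate/encode reads however this model type requires."""

    @abstractmethod
    def fit(self, dataset):
        """Train this model type on the prepared dataset."""


@register_model('small-cnn')
class SmallCNN(TrainableModel):
    def prepare_data(self, sequence_directory, labels, depth, length,
                     random_seed):
        return []  # placeholder

    def fit(self, dataset):
        pass  # placeholder


def train(model_type, **kwargs):
    model = MODEL_REGISTRY[model_type]()    # look up the requested type
    dataset = model.prepare_data(**kwargs)  # type-specific data preparation
    model.fit(dataset)                      # type-specific training loop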

pass in hyperparameters from the command line

It would be nice to have a way to pass custom hyperparameters in from the command line, so that factors considered upstream of the model hyperparameters can also be incorporated.
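
One option is a repeatable --hparam NAME=VALUE flag; the flag name and the type-coercion rules below are assumptions, and a config file (see the next issue) may be preferable.

import argparse


def _coerce(value):
    """Best-effort conversion of a string to int or float, else keep it."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


parser = argparse.ArgumentParser()
parser.add_argument('--hparam', action='append', default=[],
                    metavar='NAME=VALUE',
                    help='Extra hyperparameter; may be given multiple times.')
args = parser.parse_args(['--hparam', 'dropout=0.2', '--hparam', 'n_layers=3'])

pairs = (item.split('=', 1) for item in args.hparam)
hparams = {name: _coerce(value) for name, value in pairs}
print(hparams)  # {'dropout': 0.2, 'n_layers': 3}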

trial scripts for training should be transitioned to CLI with config files

Currently there are some Python scripts I have been using to train models; they contain both the parameters and the code that kicks off training. This should be transitioned to a mohawk train command in the CLI, with parameters passed in as some sort of config file.
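
A minimal sketch of the config-file side, assuming a flat JSON file whose keys mirror the trial-script parameters; the defaults and key names are illustrative, not an agreed-upon schema.

import json

# Illustrative defaults; a trial config file only needs to list the values
# that differ from these.
DEFAULTS = {
    'seed': 0,
    'n_reads': 10000,
    'train_ratio': 0.8,
    'batch_size': 64,
    'epochs': 100,
    'learning_rate': 1e-3,
    'log_dir': 'logs',
}


def load_config(path):
    """Merge a JSON config file over the defaults, rejecting unknown keys."""
    with open(path) as fh:
        user_config = json.load(fh)
    unknown = set(user_config) - set(DEFAULTS)
    if unknown:
        raise ValueError(f'unknown config keys: {sorted(unknown)}')
    return {**DEFAULTS, **user_config}

A trial config would then reduce to something like {"seed": 42, "epochs": 50}, with mohawk train --config trial.json filling in the rest.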

train on data with errors

Currently, model training is performed on reads taken directly from genomes; it would be useful to incorporate some reads with errors into the training regime.
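
A minimal sketch of the simplest version, a uniform per-base substitution model; the error rate and the substitution-only assumption are placeholders until a more realistic error profile (#1) exists.

import random

BASES = 'ACGT'


def add_substitution_errors(read, error_rate=0.01, rng=None):
    """Return a copy of `read` with each base substituted with prob `error_rate`."""
    rng = rng or random.Random()
    out = []
    for base in read:
        if base in BASES and rng.random() < error_rate:
            # replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return ''.join(out)


print(add_substitution_errors('ACGTACGTACGT', error_rate=0.25,
                              rng=random.Random(0)))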

mock ete toolkit for unit tests

The main problem is downloading the taxdump, etc., with ete on Travis. This could be fixed by mocking the parts of the ete toolkit responsible for downloading, and reworking the tests that require this functionality so that the unit tests don't actually need it.
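
A sketch of the mocking side with unittest.mock, assuming mohawk wraps ete3's NCBITaxa in a small helper; get_lineage_names below is a hypothetical stand-in for that helper, and only ete3.NCBITaxa, get_lineage, and get_taxid_translator are real ete APIs.

import unittest
from unittest import mock


def get_lineage_names(taxid):
    """Hypothetical mohawk-style helper that wraps ete3."""
    from ete3 import NCBITaxa
    ncbi = NCBITaxa()  # normally triggers the taxdump download
    lineage = ncbi.get_lineage(taxid)
    return ncbi.get_taxid_translator(lineage)


class TestLineage(unittest.TestCase):
    @mock.patch('ete3.NCBITaxa')
    def test_lineage_without_download(self, mock_ncbi):
        # The mocked class returns canned data, so no taxdump is downloaded.
        instance = mock_ncbi.return_value
        instance.get_lineage.return_value = [1, 2]
        instance.get_taxid_translator.return_value = {1: 'root', 2: 'Bacteria'}

        names = get_lineage_names(2)

        self.assertEqual(names[2], 'Bacteria')
        instance.get_lineage.assert_called_once_with(2)


if __name__ == '__main__':
    unittest.main()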

Improve Training data simulation

Some ideas of things that can/should be done

  • ensure that reads are sampled from the same location only once #27 (see the sketch after this list)
  • Add simulated errors to training data (could involve neural network) #1
  • Add error correction module to fix low-quality sequences (could use some real data with alignments to genome?) #28
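
For the first item, a minimal sketch of sampling start positions without replacement so no two simulated reads come from exactly the same location; the function name and numpy-based approach are just one option.

import numpy as np


def sample_read_starts(genome_length, read_length, n_reads, seed=None):
    """Return unique start positions for `n_reads` reads of `read_length`."""
    rng = np.random.default_rng(seed)
    n_positions = genome_length - read_length + 1
    if n_reads > n_positions:
        raise ValueError('cannot sample more unique positions than exist')
    # replace=False guarantees each start position is used at most once
    return rng.choice(n_positions, size=n_reads, replace=False)


starts = sample_read_starts(genome_length=5000, read_length=150, n_reads=10,
                            seed=42)
print(sorted(starts.tolist()))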
