The mycelia from cjprybol

test issue for zapier

Replace evidence named tuples with nested dictionaries

Evidence = Dict(sequence_identifier => (index: Int, orientation: Bool) for kmer in dataset)
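
A minimal sketch of what the nested-dictionary layout could look like, assuming the outer key is the kmer and the inner key is the sequence identifier. The names `Observation`, `Evidence`, and `add_evidence!`, the `String` key types, and the meaning of `orientation` are all hypothetical; the pseudocode above only fixes the `(index, orientation)` shape.

```julia
# Hypothetical type aliases; the source only sketches the shape.
const Observation = NamedTuple{(:index, :orientation), Tuple{Int, Bool}}
const Evidence = Dict{String, Dict{String, Observation}}

# Record that `kmer` was seen in `sequence_identifier` at position `index`.
# Assumption: orientation == true means forward, false means reverse-complement.
function add_evidence!(evidence::Evidence, kmer::String, sequence_identifier::String,
                       index::Int, orientation::Bool)
    # fetch the inner dictionary for this kmer, creating it on first use
    per_sequence = get!(evidence, kmer) do
        Dict{String, Observation}()
    end
    per_sequence[sequence_identifier] = (index = index, orientation = orientation)
    return evidence
end

evidence = Evidence()
add_evidence!(evidence, "ACG", "seq1", 1, true)
add_evidence!(evidence, "ACG", "seq2", 7, false)
```

This layout keeps one observation per (kmer, sequence) pair; a kmer that repeats within one sequence would need a vector of observations as the inner value instead.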

https://julialang.github.io/PackageCompiler.jl/dev/index.html

Use community detection for separating out protein communities, genome communities

https://www.youtube.com/watch?v=F4RVBAGJcFY

https://juliagraphs.org/Graphs.jl/dev/centrality/#Graphs.betweenness_centrality

Modularity is a measure of the structure of networks or graphs which measures the strength of division of a network into modules (also called groups, clusters or communities). Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is often used in optimization methods for detecting community structure in networks. However, it has been shown that modularity suffers a resolution limit and, therefore, it is unable to detect small communities. Biological networks, including animal brains, exhibit a high degree of modularity.
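
To make the definition concrete, here is a hand-rolled sketch of the modularity score for an undirected, unweighted graph, using Base Julia only. The function name `partition_modularity` and the edge-list input are assumptions made for illustration; this is not the Graphs.jl `modularity` function linked in this issue.

```julia
# Modularity Q = sum over communities c of [ e_c / m - (d_c / 2m)^2 ],
# where m is the number of edges, e_c the number of edges inside c,
# and d_c the summed degree of the nodes in c.
function partition_modularity(edges::Vector{Tuple{Int, Int}}, membership::Vector{Int})
    m = length(edges)
    n_communities = maximum(membership)
    internal_edges = zeros(Int, n_communities)
    degree_sum = zeros(Int, n_communities)
    for (u, v) in edges
        cu, cv = membership[u], membership[v]
        # each edge adds one to the degree of both endpoints
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv
            internal_edges[cu] += 1
        end
    end
    return sum(internal_edges[c] / m - (degree_sum[c] / (2m))^2 for c in 1:n_communities)
end

# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
q_split = partition_modularity(edges, [1, 1, 1, 2, 2, 2])   # one community per triangle
q_single = partition_modularity(edges, [1, 1, 1, 1, 1, 1])  # everything in one community
```

Splitting the graph at the bridge scores 5/14, while lumping every node into one community scores 0, which matches the intuition of dense connections within modules and sparse connections between them.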

https://juliagraphs.org/Graphs.jl/dev/community/#Graphs.modularity-Tuple{AbstractGraph,%20AbstractVector{%3C:Integer}}

https://en.wikipedia.org/wiki/Louvain_method

Use degree >= 3 nodes for pre-indexing targets

If we pick starting nodes for traversing the graph at random, even with a weighted sample, we could still accidentally grab a node that sits on a very low-frequency or erroneous offshoot path.

By taking all nodes with degree >= 3, we always have the option to take the better of the two or more adjacent paths

Possibly consider taking a weighted subset of these hub nodes, rather than taking all of them. But by taking all of them, we get a good way of indexing the graph completely

can explore by:

weighted walk forward from A until we find a hub node or find B
weighted walk forward from B until we find a hub node or find A
use the pre-calculated shortest path from A to B
can either stop there for a possible, but not guaranteed, shortest path
can continue shortest path searches from A and B up until they are equal to or longer than the precalculated route from hub to hub, then we'll know for sure which is the shortest
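
The first two exploration steps can be sketched as follows, assuming an undirected graph stored as a weighted adjacency list. The type `WeightedAdjacency` and the functions `hub_nodes` and `weighted_walk` are hypothetical names, and the precalculated hub-to-hub shortest paths are not shown.

```julia
using Random

# Adjacency list with edge weights: node => [(neighbor, weight), ...].
# Hypothetical structure; the notes do not define one.
const WeightedAdjacency = Dict{Int, Vector{Tuple{Int, Float64}}}

# Hub nodes are nodes with three or more neighbors.
hub_nodes(graph::WeightedAdjacency) =
    Set{Int}(node for (node, nbrs) in graph if length(nbrs) >= 3)

# Walk from `start`, choosing each next node with probability proportional to
# edge weight and never revisiting a node. Stops at the first hub, at `target`,
# or at a dead end, and returns the path taken.
function weighted_walk(graph::WeightedAdjacency, start::Int, target::Int,
                       hubs::Set{Int}; rng = Random.default_rng())
    path = [start]
    visited = Set(path)
    current = start
    while true
        candidates = [(n, w) for (n, w) in graph[current] if !(n in visited)]
        isempty(candidates) && return path
        threshold = rand(rng) * sum(w for (_, w) in candidates)
        cumulative = 0.0
        chosen = first(last(candidates))  # fallback guards against rounding
        for (n, w) in candidates
            cumulative += w
            if threshold < cumulative
                chosen = n
                break
            end
        end
        push!(path, chosen)
        push!(visited, chosen)
        current = chosen
        (current == target || current in hubs) && return path
    end
end

# A chain 1 - 2 - 3 where node 3 also connects to 4 and 5, making it a hub.
graph = WeightedAdjacency(
    1 => [(2, 1.0)],
    2 => [(1, 1.0), (3, 1.0)],
    3 => [(2, 1.0), (4, 5.0), (5, 1.0)],
    4 => [(3, 5.0)],
    5 => [(3, 1.0)],
)
hubs = hub_nodes(graph)
path_from_a = weighted_walk(graph, 1, 5, hubs)
```

In this toy graph the walk from node 1 has only one unvisited neighbor at each step, so it reaches the hub deterministically; on a real graph the route depends on the random draws.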

Finish gutting README and moving notes into documentation

CAMI datasets to develop against

https://www.microbiome-cosi.org/cami/cami/cami2
https://edwards.flinders.edu.au/cami-challenge-datasets/
https://data.cami-challenge.org/participate
https://github.com/CAMI-challenge/data
https://zenodo.org/communities/cami?q=&l=list&p=1&s=10&sort=newest
https://www.microbiome-cosi.org/cami/resources

Pull in rough notes from other repositories, blog, private note locations

Use TSV for reading into & out of neo4j

Switch to using static vectors for kmers instead of kmers.jl

Need to handle N's and proteins

Fill in placeholder pages in documentation

Convert documentation structure to use Divio

https://documentation.divio.com/

Consider switch from MetaGraphs to MetaGraphsNext.jl

When generating all possible AAmers, drop AAmers that start with stop codon
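
A small sketch of that filter, assuming the stop codon is represented by `'*'` in the amino-acid alphabet. The constant `AMINO_ACIDS` and the function name `aamers_without_leading_stop` are hypothetical.

```julia
# 20 standard amino acids plus '*' for a translated stop codon (an assumption).
const AMINO_ACIDS = collect("ACDEFGHIKLMNPQRSTVWY*")

# Generate every amino-acid kmer of length k, skipping those that begin with '*'.
function aamers_without_leading_stop(k::Int)
    aamers = String[]
    for letters in Iterators.product(ntuple(_ -> AMINO_ACIDS, k)...)
        first(letters) == '*' && continue
        push!(aamers, join(letters))
    end
    return aamers
end

dimers = aamers_without_leading_stop(2)
```

With 21 symbols there are 441 possible 2-mers; dropping the 21 that start with `'*'` leaves 420. AAmers that contain a stop elsewhere, such as `"A*"`, are kept.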

finish out core genome finding algorithm for linear, circular, and multi-component graphs

Move tutorial docstrings into ArgParse.jl

Revise Monday.com plan according to new timelines

Move back onto cjprybol.github.io domain

Try dib-lab classification tools

Grist would be good to try for long read and/or contig classification

https://dib-lab.github.io/genome-grist/output-guide/

I'll need to build my own refseq database, but that shouldn't be too bad - I have the scripts for doing that download

Can also use the underlying commands sourmash gather and sourmash tax

Have tutorial pages call argparse's function help rather than be hardcoded

Move development notebooks into docs for web rendering

Get jupyterlab templates working

Finish PoC joint-probability assembler + variant caller by passing my polished graphs into pggb/odgi+vg variant deconvolution flow

algorithm:

deconvolve reads into kmer graph
iterative correction, starting at first k-length with sparsity, ending at first k-length with no updates (possibly better terminating conditions, should look into that)
write graph out as GFA (try compacted and raw)
determine primary contigs using shortest path with L = 1 / total bases (where total bases = length * average depth, or the actual lossless alignment calculation); or, to start with, at the cost of external dependencies, try metaflye or https://github.com/lh3/minigraph
use primary contigs as reference for generating variant calls using ODGI -> VG flow, or possibly VG directly
ensure that VG variant calling can use coverage information - if not, update VCF files with depth-of-coverage information
https://odgi.readthedocs.io/en/latest/rst/commands/odgi_build.html
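
The "write graph out as GFA" step could look roughly like this for a raw (uncompacted) kmer graph. This is a sketch under assumptions: the function name `write_gfa` and the input layout are hypothetical, every link is written in forward orientation only, and segments are named by their index.

```julia
# Write a kmer graph as GFA 1.0: one S (segment) line per kmer and one
# L (link) line per edge, with adjacent kmers overlapping by k - 1 bases.
function write_gfa(io::IO, kmers::Vector{String}, edges::Vector{Tuple{Int, Int}})
    k = length(first(kmers))
    println(io, "H\tVN:Z:1.0")
    for (i, kmer) in enumerate(kmers)
        println(io, "S\t", i, "\t", kmer)
    end
    for (from, to) in edges
        println(io, "L\t", from, "\t+\t", to, "\t+\t", k - 1, "M")
    end
end

buffer = IOBuffer()
write_gfa(buffer, ["ACG", "CGT"], [(1, 2)])
gfa_text = String(take!(buffer))
```

A real kmer graph would also need reverse-orientation links and, if depth is to be carried through to variant calling, optional tags on the segment lines.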

add jellyfish counting function

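"""
    jellyfish_count(; fasta::String, k::Int, directory::String, jellyfish_path="jellyfish")

Count canonical kmers in `fasta` with Jellyfish and write a tab-separated counts file to `directory`.
Returns the path to the counts file.
"""
function jellyfish_count(; fasta::String, k::Int, directory::String, jellyfish_path="jellyfish")
    # derive an identifier from the file name, dropping everything after the first '.'
    id = first(split(basename(fasta), '.'))
    counts_file = joinpath(directory, "$id.$k.counts")
    # skip the work if the counts file already exists
    if !isfile(counts_file)
        mkpath(directory)
        jf_file = joinpath(directory, "$id.$k.jf")
        run(`$(jellyfish_path) count -m $k -s 100M --canonical -o $jf_file $fasta`)
        run(`$(jellyfish_path) dump -ct -o $counts_file $jf_file`)
        # the binary Jellyfish database is only an intermediate
        rm(jf_file)
    end
    return counts_file
end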

Stop putting kmer_counts datastructure into the metagraph

When we create an induced subgraph or update the graph, the kmer_counts datastructure is no longer correct and needs to be rebuilt. Should just rebuild on the fly each time using the metadata attached to the node weights

Implement bi-directional A* to improve runtime of pathfinding if we're willing to create indexed distance matrix beforehand

get codespace env working again

last working build on master was somewhere between

working parent: 00ee14b
possibly working child: 3e30308

Load host information into neo4j taxonomy db to improve viral filtering

https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide

Implement bidirectional Dijkstra's as my re-routing solution

rough algorithm:

build graph
find all nodes with in-degree > 1 and all nodes with out-degree > 1, these are our branch points
when looking for routes, accept all non-branch paths but run bidirectional Dijkstra's on branch points
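
The rough algorithm above can be sketched in Base Julia as follows. The edge-list representation and the names `WeightedEdge`, `branch_points`, `adjacency`, `settle!`, and `bidirectional_dijkstra` are hypothetical. For brevity the frontier minimum is found by a linear scan rather than a heap, and only the distance is returned, not the path.

```julia
# Directed weighted edges as (source, destination, weight).
const WeightedEdge = Tuple{Int, Int, Float64}

# Branch points: nodes with in-degree > 1 or out-degree > 1.
function branch_points(edges::Vector{WeightedEdge})
    indegree, outdegree = Dict{Int, Int}(), Dict{Int, Int}()
    for (u, v, _) in edges
        outdegree[u] = get(outdegree, u, 0) + 1
        indegree[v] = get(indegree, v, 0) + 1
    end
    return Set{Int}(n for d in (indegree, outdegree) for (n, c) in d if c > 1)
end

# Build a node => [(neighbor, weight)] lookup; reversed = true flips edge direction.
function adjacency(edges::Vector{WeightedEdge}; reversed::Bool = false)
    adj = Dict{Int, Vector{Tuple{Int, Float64}}}()
    for (u, v, w) in edges
        from, to = reversed ? (v, u) : (u, v)
        push!(get!(adj, from, Tuple{Int, Float64}[]), (to, w))
    end
    return adj
end

# Settle the closest unsettled node of one search and relax its outgoing edges.
# Returns the settled node's distance, or Inf if the frontier is empty.
function settle!(dist, settled, adj)
    best_node, best_dist = 0, Inf
    for (n, d) in dist
        if !(n in settled) && d < best_dist
            best_node, best_dist = n, d
        end
    end
    best_dist == Inf && return Inf
    push!(settled, best_node)
    for (nbr, w) in get(adj, best_node, Tuple{Int, Float64}[])
        if best_dist + w < get(dist, nbr, Inf)
            dist[nbr] = best_dist + w
        end
    end
    return best_dist
end

# Alternate a forward search from `source` and a backward search from `target`;
# stop once the two frontiers together cannot beat the best meeting distance.
# Returns the shortest distance, or Inf if `target` is unreachable.
function bidirectional_dijkstra(edges::Vector{WeightedEdge}, source::Int, target::Int)
    forward, backward = adjacency(edges), adjacency(edges; reversed = true)
    dist_f, dist_b = Dict(source => 0.0), Dict(target => 0.0)
    settled_f, settled_b = Set{Int}(), Set{Int}()
    best = Inf
    while true
        top_f = settle!(dist_f, settled_f, forward)
        top_b = settle!(dist_b, settled_b, backward)
        # any node reached by both searches gives a candidate route
        for (n, d) in dist_f
            haskey(dist_b, n) && (best = min(best, d + dist_b[n]))
        end
        top_f + top_b >= best && return best
    end
end

edges = WeightedEdge[(1, 2, 1.0), (2, 3, 1.0), (3, 6, 1.0), (2, 4, 5.0),
                     (4, 6, 1.0), (1, 5, 2.0), (5, 6, 5.0)]
branches = branch_points(edges)
shortest = bidirectional_dijkstra(edges, 1, 6)
```

In this example nodes 1 and 2 branch outward and node 6 has three incoming edges, so those three are the branch points, and the shortest route from 1 to 6 runs through 2 and 3.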

Move neo4j learning notes out of readme and into a different note

Increase storage efficiency of genomes with Aeron's skipmer approach

cjprybol/sars-cov2-pangenome-analysis#2

Need a simpler kmer graph data structure for assembly and error correction

When working with observational data (fastq files) rather than reference data (fasta files), use a simplified kmer graph that only records # of supporting pieces of evidence as an Int rather than recording each piece of evidence (e.g. record identifier, index, orientation) individually
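
A minimal sketch of the counts-only idea, assuming reads are plain strings over A, C, G, T and that kmers are stored in canonical form. The names `COMPLEMENT`, `reverse_complement`, `canonical`, and `count_kmers` are hypothetical, and N's are not handled.

```julia
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

# Reverse complement of a DNA string (A, C, G, T only).
reverse_complement(seq::AbstractString) = join(COMPLEMENT[c] for c in reverse(seq))

# Canonical form: the lexicographically smaller of a kmer and its reverse complement.
canonical(kmer::AbstractString) = min(String(kmer), reverse_complement(kmer))

# Record only the number of supporting observations for each canonical kmer.
function count_kmers(sequences::Vector{String}, k::Int)
    counts = Dict{String, Int}()
    for sequence in sequences
        for i in 1:(length(sequence) - k + 1)
            kmer = canonical(sequence[i:i + k - 1])
            counts[kmer] = get(counts, kmer, 0) + 1
        end
    end
    return counts
end

counts = count_kmers(["ACGT", "ACGA"], 3)
```

Each kmer costs one `Int` regardless of how many reads support it, at the price of no longer knowing which read, index, or orientation each observation came from.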

add basic docstrings for all functions
make docs available online
add real examples to docstrings to replace placeholders
