csbiology / biofsharp

Open source bioinformatics and computational biology toolbox written in F#.

Home Page: https://csbiology.github.io/BioFSharp/

License: MIT License

Batchfile 0.01% F# 94.82% Shell 0.01% Perl 4.50% Dockerfile 0.19% PowerShell 0.03% Jupyter Notebook 0.45%
bioinformatics biostatistics datascience biology dataprocessing fsharp amino-acids nucleotides sequence-analysis docker

biofsharp's Introduction

BioFSharp is an open source bioinformatics and computational biology toolbox written in F#. https://csbiology.github.io/BioFSharp/

Core functionality

In its core namespace, BioFSharp contains the basic data structures for common biological objects and their modification. Our type modeling starts at chemical elements, abstracts those to form formulas, and finally molecules of high biological relevance such as amino acids and nucleotides. Sequences of these molecules are modelled by BioCollections, which provide extensive functionality for investigating their real life counterparts.

Data model

Additionally, core algorithms for biological sequences such as alignments and pattern matching algorithms are implemented.

Besides the core functionality, BioFSharp has several namespaces as sub-projects with different scopes:

IO functionality

The IO namespace aims to make data available and ease further processing. It contains read/write functions for a diverse set of biological file formats such as FastA, FastQ, GenBank or GFF, as well as helper functions for searching or transforming the input data. Wrappers for commonly used command line tools like NCBI's Blast ensure interoperability with an array of existing bioinformatic workflows.

BioDB functionality

The BioDB namespace offers API access to popular databases like GEO and EBI (including SwissProt/Expasy). We additionally provide API access to FATool, a web service by our workgroup for querying functional annotations of proteins.

This project is netframework only and has a new home here: https://github.com/CSBiology/BioFSharp.BioDB

BioContainers functionality

The BioContainers namespace is our newest BioFSharp project and we are very excited about it! It is all about making common bioinformatics tools programmatically accessible from F#. This is realized by making the containerized tool accessible via the Docker daemon. We wrap some functionality from Docker.DotNet to communicate with the Docker API while providing extensive, type-safe bindings for 9 tools so far, including Blast, ClustalO, and TMHMM.

ML functionality

Make your workflow ML ready with BioFSharp.ML. It currently contains helper functions for CNTK and a pre-trained model we used in our publication about predicting peptide observability.

Stats functionality

The Stats namespace contains statistical functions with a clear biological focus such as functions for calculating Gene Ontology Enrichments.

Documentation

Functions, types and classes contained in BioFSharp come with short explanatory descriptions, which can be found in the API Reference.

More in-depth explanations, tutorials and general information about the project can be found here.

The documentation and tutorials for this library are automatically generated (using the F# Formatting) from *.fsx and *.md files in the docs folder. If you find a typo, please submit a pull request!

Contributing

Please refer to the Contribution guidelines

Community/Social

Want to get in touch with us? We recently joined the twitter crowd:

Twitter Follow

biofsharp's People

Contributors

benjaminsaljooghi, bvenn, caroott, dawedawe, goedels, graemevissers, hlweil, kmutagene, kristinakeuper, mikhayn, muehlhaus, pmenges, scheidto, wieczoreke, zimmerd

biofsharp's Issues

Implementation of the KEGG API

Description

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

An implementation of this API in the BioDB project should enable querying of pathways via the KEGG API and parsing of KEGG records.

Starting point resources

Refactoring and inclusion of our next generation sequencing pipeline

Description

Several tools for the analysis of next generation sequencing data have been developed in the workgroup during the last months. These need to be refactored and added to the project.
This includes:

  • modeling of the GeneOmnibus FTP server structure
  • SOFT reader
  • SRAToolkit wrapper
  • Hera wrapper
  • FastP wrapper

[Feature Request]Rework BioCollections

BioSeq, BioArray, and BioList have inconsistent function naming and missing functionality:

different functionality:

  • (Naming) transcribeCodingStrand/transcribeCodeingStrand (BioSeq/BioArray)
  • (function signature) Digestion.[CollectionName].digest

missing functionality:

  • conversions as in the standard collections: BioSeq.ofBioArray etc.

  • create cDNA from mRNA

  • find first start codon

  • find gene (first start codon –> first stop codon)

  • all functions with the signature kind collection -> .. -> collection need to be abstracted, e.g. BioArray.map

  • (Debatable, Breaking): put the types in the top-level namespace and make the functions static members. This has the advantage of less namespace clutter (BioFSharp.BioSeq.BioSeq<#IBioItem> becomes BioFSharp.BioSeq<#IBioItem>)

[Docs] Add ImgP docs

BioFSharp.ImgP (https://github.com/CSBiology/BioFSharp/tree/master/src/BioFSharp.ImgP) needs documentation. I'll avoid posting a list here @bvenn, since I guess you know best what to cover.

[BUG] FastA-Writer appends to file instead of recreating it

Description
The BioFSharp.IO.FastA.write function appends the given lines to the dataset specified by the filepath.

Expected behavior
The write function is expected to recreate the specified file instead of appending lines to it.

Solution
The write function should be renamed to writeOrAppend and an additional write function should be introduced.

linked to #47
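
The proposed split can be sketched as follows. This is a minimal illustration that uses only System.IO; the names write and writeOrAppend follow the suggestion above, while the toLine parameter and the signatures are assumptions made for this sketch and not the actual BioFSharp.IO.FastA API.

```fsharp
open System.IO

/// Recreates the file at `path`: any existing content is replaced.
/// `toLine` is a hypothetical converter from an item to one output line.
let write (toLine: 'a -> string) (path: string) (items: seq<'a>) =
    File.WriteAllLines(path, items |> Seq.map toLine)

/// Keeps existing content and adds the new lines at the end of the file.
let writeOrAppend (toLine: 'a -> string) (path: string) (items: seq<'a>) =
    File.AppendAllLines(path, items |> Seq.map toLine)
```

With this split, calling write twice on the same path leaves only the lines of the second call in the file, whereas writeOrAppend accumulates lines across calls.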

[BUG] BioContainer STDOut is not attached to FSI, only gets returned when container tasks are finished/aborted

Describe the bug
BioContainer STDOut is not attached to FSI, only gets returned when container tasks are finished/aborted. This leads to no output until the containerized process is completely finished.

To Reproduce

Here is an example where I wanted to see the progress of a fasterq-dump task in SRATools:


open SRATools

let sraImage = Docker.ImageId "quay.io/biocontainers/sra-tools:2.10.3--pl526haddd2b5_0"

let sraContext = 
    BioContainer.initBcContextWithMountAsync client sraImage  @"C:\Users\kevin\Downloads\CsbScaffold-master\MetaIndexing_New\data"
    |> Async.RunSynchronously

let FQDOptions =
    [
        FasterQDumpParams.OutDirectory @"C:\Users\kevin\Downloads\CsbScaffold-master\MetaIndexing_New\data\Testerino"
        FasterQDumpParams.TempDirectory @"C:\Users\kevin\Downloads\CsbScaffold-master\MetaIndexing_New\data\Testerino\tmp"
        FasterQDumpParams.Split SplitOptions.SplitFiles
        FasterQDumpParams.PrintDetails
        FasterQDumpParams.ShowProgress
    ]

runFasterQDumpOfAccession sraContext FQDOptions "SRR000001"

Output in FSI only gets printed after the whole task is finished:

Expected behavior
STDOut of the container should appear in FSI as it is produced during the task.

To see how it should work, take a look at the output using the docker CLI:


docker run -it -v C:/Users/kevin/Downloads/CsbScaffold-master/MetaIndexing_New/data:/data f9876632ae1e

/ # cd data
/data # fasterq-dump -O ./Testerino -t ./Testerino/tmp --split-files -x -p SRR000001

OS and framework information (please complete the following information):

  • OS: Windows 10
  • OS Version 1909
  • .Net core SDK version 3.1.102
  • Docker Desktop Community 2.2.0.4 (43472) stable

[BUG]OBO parser skips all relationships but the last

Describe the bug
The OBO type seems to differ from the specification in some respects.

To Reproduce
Parse this file: http://www.obofoundry.org/ontology/ms.html

Expected behavior
Here is an example:

[Term]
id: MS:1000014
name: accuracy
def: "Accuracy is the degree of conformity of a measured mass to its actual value." [PSI:MS]
xref: value-type:xsd:float "The allowed value-type for this CV term."
is_a: MS:1000480 ! mass analyzer attribute
relationship: has_units MS:1000040 ! m/z
relationship: has_units UO:0000169

should contain both relationship lines.

Newest obo format specifications can be found here: https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html
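
One way to keep repeated tags is to collect the values of each tag in a list instead of overwriting a single field. The following is a minimal sketch under that assumption; tryParseTagValue, addValue and collectTags are hypothetical helper names and not part of the BioFSharp OBO parser, and the sketch ignores trailing modifiers and comments.

```fsharp
/// Splits an OBO line of the form "tag: value" at the first ": ".
/// Lines without that separator (e.g. "[Term]") yield None.
let tryParseTagValue (line: string) =
    match line.IndexOf(": ") with
    | -1 -> None
    | i -> Some (line.Substring(0, i), line.Substring(i + 2))

/// Appends a value to the list stored for its tag.
let addValue (acc: Map<string, string list>) (tag: string, value: string) =
    let existing = acc |> Map.tryFind tag |> Option.defaultValue []
    Map.add tag (existing @ [value]) acc

/// Collects every value of every tag in a stanza, preserving file order.
let collectTags (stanzaLines: seq<string>) : Map<string, string list> =
    stanzaLines
    |> Seq.choose tryParseTagValue
    |> Seq.fold addValue Map.empty
```

Applied to the stanza above, the "relationship" entry of the resulting map holds both has_units values rather than only the last one.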

[Docs] Add BioTools docs

BioFSharp.BioTools needs documentation. The following topics must be covered:

  • General introduction to (Bio)Containers
  • Setup of Docker for windows
  • Example usage (I'm thinking about something like a simple BLAST here)
  • Guidelines on how to wrap a new container API

[BUG]SOFT Parser skips entities

Describe the bug
The SOFT parser skips entities of the same type that are listed directly beneath each other, e.g. the list
^Sample1
..
^Sample2
..
^Sample3

gets parsed as [Sample1;Sample3]

To Reproduce
Parse a SOFT family file with multiple sample entries

Expected behavior
all samples should be parsed (this is in fact absolutely necessary)

Additional context
Fix is underway
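
For illustration, a grouping step that does not lose adjacent entities could look like the sketch below. It assumes only what the example above shows, namely that every line starting with '^' opens a new entity; splitEntities is a hypothetical name, and the actual cause of the bug in the parser is not known from this report.

```fsharp
/// Groups lines into entities; every line starting with '^' opens a new entity.
/// Lines that appear before the first '^' line are ignored.
let splitEntities (lines: seq<string>) : string list list =
    let folder (finished, current) (line: string) =
        if line.StartsWith("^") then
            match current with
            | [] -> finished, [line]
            | _ -> List.rev current :: finished, [line]
        else
            match current with
            | [] -> finished, []
            | _ -> finished, line :: current
    let finished, current = Seq.fold folder ([], []) lines
    let all = if List.isEmpty current then finished else List.rev current :: finished
    List.rev all
```

Because the previous entity is closed whenever a new '^' line is seen, an entity without any attribute lines is still returned as its own group.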

[BUG] Paket.exe missing when building on a clean(with all necessary frameworks and tools) machine

Describe the bug
I get the paket.exe is missing exception at the 'Nuget' target in the fake build script on a clean machine.

To Reproduce
Reset your PC (LOL)
build BioFSharp with fake build

Expected behavior
Builds

Screenshots

image

OS and framework information (please complete the following information):

  • OS: Windows 10
  • OS Version 1903
  • .Net core SDK version 2.2.401

Additional context
I have a fix for this in the pipeline, just wanted to track progress here

Add pretty printers for common biological file formats

As analyzing biological data with BioFSharp involves a lot of scripting, optional pretty printers help a lot when looking at results directly in the fsi window. This should be especially easy to add for parsing libraries that can already write the correct file format. A basic pretty printer for BioCollections is already implemented.
Here are some formats off the top of my head, feel free to add more:

  • BioCollections (BioList,BioArray,BioSeq)
  • Clustal
  • GFF
  • FastA/Q
  • GenBank
  • Alignments

[Feature Request]Extend OBO parser to fully support the obo flat file 1.4 specifications

Is your feature request related to a problem? Please describe.
The obo module is incomplete, we have a makeshift parser for terms and that's it.
This document contains the formal full specs.

Describe the solution you'd like
Extend the parser to parse at least all stanzas, maybe parse the full specs (low priority)
Add writers for round trip testing(low priority)

Describe alternatives you've considered
As always we can just not do it, the parser is enough for our uses right now.

[BUG] Include all possible bin sizes within GSEA (0 and 1)

Describe the bug

For gene set enrichment analysis (GSEA), Fisher's exact test is applied to identify over- or underrepresented groups. The method is based on multiple hypergeometric distribution tests.

In BioFSharp.Stats.OntologyEnrichment.CalcHyperGeoPvalue, two cases are treated separately: when the number of differentially expressed genes in a bin is 0 or 1, the p value is reported as nan. While these cases might not be of interest, a true p value can be calculated.

For further analysis, a multiple testing correction can be performed. The Benjamini-Hochberg method calculates false discovery rates (FDR) for every p value. nans cannot be processed, so they get filtered out. Filtering out p values that could have been calculated distorts the FDR calculation and keeps p values flatter than expected.

Bins with fewer than 5 members are often not of interest and are rejected anyway. However, this filtering should be controlled by the operator and has to be performed after the enrichment analysis and prior to multiple testing correction.

The current filter within the GSEA leads to results that cannot be easily interpreted.

Solution

  • Remove the if expression within CalcHyperGeoPvalue

  • add additional context for filtering procedures to the documentation

  • consider renaming the functions to lower case
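
To illustrate that the cases 0 and 1 have well-defined p values, here is a small self-contained sketch of the upper-tail hypergeometric probability P(X >= k). It is not the CalcHyperGeoPvalue implementation; the function names and the parameter order are assumptions made for this example, and only the over-representation tail is shown.

```fsharp
/// Natural log of the binomial coefficient C(n, k), computed as a sum of logs.
let logChoose (n: int) (k: int) : float =
    if k < 0 || k > n then -infinity
    else
        Seq.init k (fun i -> log (float (n - k + i + 1)) - log (float (i + 1)))
        |> Seq.sum

/// Hypergeometric point probability P(X = k).
let hyperGeoPmf (populationSize: int) (successesInPopulation: int) (draws: int) (k: int) : float =
    let logSuccess = logChoose successesInPopulation k
    let logFailure = logChoose (populationSize - successesInPopulation) (draws - k)
    let logTotal = logChoose populationSize draws
    exp (logSuccess + logFailure - logTotal)

/// Upper-tail p value P(X >= k); defined for every k, including 0 and 1.
let hyperGeoUpperTail (populationSize: int) (successesInPopulation: int) (draws: int) (k: int) : float =
    let kMax = min draws successesInPopulation
    seq { for i in k .. kMax -> hyperGeoPmf populationSize successesInPopulation draws i }
    |> Seq.sum
```

For a population of 10 items with 5 successes and 2 draws, the tail for k = 0 is 1 and the tail for k = 1 is 35/45, so neither case needs to be reported as nan.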

[Feature Request] Automatically take care if Image exists with correct tag for biocontainers

Is your feature request related to a problem? Please describe.
It is possible that the container API wrappers will not work with newer versions of the container/contained software than the ones they were developed for (added commands, changed commands, etc.).

Describe the solution you'd like
Our biocontainer API wrappers should be pinned to the specific version of the containers they were developed for. I imagine the following solution:

  • check if there is an image available with the correct tag
  • if not, pull the correct image

Describe alternatives you've considered
There are no alternatives.

[Feature Request] Complete Digestion module

Is your feature request related to a problem? Please describe.
I'm always frustrated by the superior functionality of the Array module compared to the incomplete Seq and the missing List module.

Describe the solution you'd like
Homogeneous implementations for all frequently used collections

[BUG] OboParser neglects first occurrences of alt_ids

Description

OboTerms can consist of several alt_id items. The OboParser just takes the last occurrence of the alt_id keyword and discards previous ones.

Repro steps

Try to parse the following item (downloaded from http://geneontology.org/page/download-ontology):

[Term]
id: GO:0004748
name: ribonucleoside-diphosphate reductase activity, thioredoxin disulfide as acceptor
namespace: molecular_function
alt_id: GO:0016959
alt_id: GO:0016960
alt_id: GO:0016961
def: "Catalysis…
comment: When thioredoxin…
synonym: "2'-deoxyri…
synonym: "2'-deoxyri…
xref: EC:1.17.4.1
xref: MetaCyc:RIBONUCLEOSIDE-DIP-REDUCTI-RXN
xref: RHEA:23252
is_a: GO:0061731 ! ribonucleoside-diphosphate reductase activity

Expected behavior

All alt_ids should be stored in the resulting OboTerm

Actual behavior

Just GO:0016961 is stored

Add most important BioTools docker APIs for 1.0.0

Only the tools deemed essential from #40 for the 1.0.0 release, as all of them would be a little bit too much:

  • BLAST | YES
  • IntaRNA | YES
  • TargetP | NO
  • TMHMM | NO
  • SRAToolkit | YES
  • HMMER | YES
    • hmmbuild
    • hmmalign
    • hmmsearch
    • hmmscan
    • hmmpress
    • phmmer
    • jackhmmer
    • nhmmer
    • nhmmscan
    • hmmfetch
    • hmmstat
    • hmmemit
    • hmmlogo
    • hmmconvert
    • hmmpgmd
    • makehmmerdb
    • hmmsim
    • alimask
  • ClustalO | YES

[Feature Request]Add Entrez query builder

Is your feature request related to a problem? Please describe.
As you can pass Entrez query strings to the Entrez cgi DSLs, it would be helpful to have functions to create them.

Describe the solution you'd like
A DSL for entrez query strings

Describe alternatives you've considered
This is the only valid timeline.

[BUG] Fasta reader can't read from gzip file

Describe the bug
Trying to read a fasta file from a gzip archive results in an error

To Reproduce
Steps to reproduce the behavior:

  1. Open BioFSharp.IO
  2. Use FastA.fromGzipFile to read docsrc/content/Chalmy_Cp.fasta.gz

Expected behavior
A sequence of FastaItems is created

Screenshots
image

OS and framework information (please complete the following information):

  • OS: Windows 10
  • OS 1809
  • .Net core SDK version : 2.1.500

Additional context
The function uses the readFileGZip function from FSharpAux.IO. The relevant code is:

/// Reads a gZip file line by line without creating a tempory file
/// Alternatively use FileEnumerator
let readFileGZip (filePath:string) =
    seq {use reader     = File.OpenRead(filePath)
         use unzip      = new GZipStream(reader, CompressionMode.Decompress, true)
         unzip.Seek(0L,SeekOrigin.Begin) |> ignore
         use textReader = new StreamReader(unzip, Encoding.Default)
         while not textReader.EndOfStream do
            yield textReader.ReadLine()}
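
One plausible cause is the unzip.Seek call: GZipStream does not support seeking and throws NotSupportedException when Seek is invoked. The sketch below shows a reader without that call. readGZipLines is a hypothetical name and not the FSharpAux function, and UTF-8 is an assumed encoding (the original uses Encoding.Default).

```fsharp
open System.IO
open System.IO.Compression
open System.Text

/// Reads a gzip-compressed text file lazily, line by line, without calling Seek.
/// Assumption: the content is UTF-8 encoded text.
let readGZipLines (filePath: string) : seq<string> =
    seq {
        use file = File.OpenRead(filePath)
        use unzip = new GZipStream(file, CompressionMode.Decompress)
        use reader = new StreamReader(unzip, Encoding.UTF8)
        while not reader.EndOfStream do
            yield reader.ReadLine()
    }
```

The streams are opened when the sequence is enumerated and disposed when enumeration ends, so no temporary file is created.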

[Feature Request] Increase verbosity of Sailent.compute

Is your feature request related to a problem? Please describe.
I really like Sailent. Since it can take a while to compute, I would like a verbose option that tracks the bootstrapping loop so one can infer how much time it will take to run.

Git nuget packages use wrong frameworks

Description

The different framework versions in the nuget packages referenced via git (FSharpAux/IO and FSharp.Stats) are not used correctly in the build process (see images) either on mono or windows

Repro steps

fake build the repo

Expected behavior

Correct framework versions are used for every target framework

Actual behavior

Incorrect framework versions are used for every target framework
Windows:
image

Mono (Travis):
image

[Feature Request] Add high-level functionality for Entrez cgi DSLs

Low-level functionality that requires knowledge of HttpFs and Hopac has been implemented since #84.
Further abstraction would make it easier to use.

Additionally, most Entrez cgis provide the functionality of using a history server. When abstracting the current DSL, one should do it in such a way that the returned results themselves are pipe-able, meaning they can be used as further input for cgi queries. Here is an example from Entrez direct, where the result of esearch is the input for efetch:

esearch -db pubmed -query "lycopene cyclase" |
efetch -format abstract

[BUG] Nucleotides.antiparallel delivers unexpected results

Describe the bug
The antiparallel function in the BioFSharp.Nucleotides module delivers unexpected results. It works on single nucleotides instead of a nucleotide sequence.

To Reproduce

  • I did not actually test it.
  • I checked the source code only.

Expected behavior
I expect the antiparallel function to deliver the sequence of a complementary DNA strand in 5' to 3' direction.

A double stranded DNA:
5'-GAAATGTTCTTGCAGTTAA-3'
3'-CTTTACAAGAACGTCAATT-5'

The sense strand with the original sequence is the upper strand in this schema. The antiparallel DNA strand is the lower strand. The sequence of the antiparallel DNA strand (reverse-complement) in 5' to 3' direction is:
5'-TTAACTGCAAGAACATTTC-3'

The function is expected to work on a sequence of nucleotides, e.g. Nucleotide list. The algorithm would look like:
myDnaSequence |> List.map (fun nuc -> Nucleotides.complement nuc) |> List.rev

For a simple online tool see:
https://www.bioinformatics.org/sms/rev_comp.html

Suggestion

  • Remove Nucleotides.antiparallel function
  • (Remove the Nucleotides.inverse function. Compare with #65)
  • Create a module for manipulating nucleotide sequences with functions:
    • reverse: Nucleotide list -> Nucleotide list
    • complement: Nucleotide list -> Nucleotide list
    • reverseComplement: Nucleotide list -> Nucleotide list
      Where list could (also) be BioList, BioSeq, BioArray.
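
The suggested module could look roughly like the following sketch. The Nucleotide type here is a simplified stand-in with four cases and is not the BioFSharp.Nucleotides type; the module name NucleotideSeq and the helpers ofString and toString are assumptions made for this example.

```fsharp
/// Simplified stand-in for a nucleotide type (only the four DNA bases).
type Nucleotide = A | T | G | C

/// Complement of a single nucleotide.
let complementNucleotide (nuc: Nucleotide) : Nucleotide =
    match nuc with
    | A -> T
    | T -> A
    | G -> C
    | C -> G

module NucleotideSeq =
    /// Reads the sequence from the back.
    let reverse (s: Nucleotide list) : Nucleotide list = List.rev s

    /// Complements every position and keeps the order.
    let complement (s: Nucleotide list) : Nucleotide list = List.map complementNucleotide s

    /// Complementary strand, read in 5' to 3' direction.
    let reverseComplement (s: Nucleotide list) : Nucleotide list =
        s |> complement |> reverse

    let ofString (s: string) : Nucleotide list =
        s
        |> Seq.map (fun c ->
            match c with
            | 'A' -> A
            | 'T' -> T
            | 'G' -> G
            | 'C' -> C
            | other -> failwithf "unexpected character %c" other)
        |> List.ofSeq

    let toString (s: Nucleotide list) : string =
        s |> List.map (sprintf "%A") |> String.concat ""
```

Applied to the sense strand from the example above, reverseComplement yields the antiparallel strand in 5' to 3' direction.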

FastA to GFF3 conversion

Description

Protein data analysis tools make use of information about the gene loci of the proteins in order to consider relation. This kind of information can nicely be stored in and read from a GFF3 file.
Some tools require a GFF3 file. To handle the case where only a FastA file is at hand, an automatic converter would be useful.

Implementation Steps

  • Refurbish FastA Header parser

An old implementation which automatically parses UniProt style FastA header is present in BioFSharp.BioID. It needs to be refurbished.

  • add FastA to GFF3 parser

Given the info from the parsed FastA header this is just a small function getting the protein names from the header, grouping them by the gene loci and creating GFF3 items of them accordingly.

Should 'release' build target require confirmation?

Description

When writing documentation it is useful to have the following commands at hand:

fake build -t releaselocal for a local generation of the docs
fake build -t releasedocs for pushing the docs to gh-pages

When accidentally typing fake build -t release local or other variants, the release target is called and all committed and uncommitted changes are pushed, thereby bumping the version and updating the nuget package.

It would be nice to have an additional security confirmation if fake build -t release xxx is entered
This affects all CSBiology projects.

Cannot run NUnit tests from Visual Studio

FAKE will execute the tests via the NUnit.Runners nunit-console.exe but Visual Studio won't pick up tests? I've tried referencing all the NUnit paket dll's to no avail.

[Docs] Clarify Nucleotides.inverse function

BioFSharp.Nucleotides.inverse
The inverse function in the BioFSharp.Nucleotides module needs clarification. It is unclear what "inversion" means on the nucleotide level of a nucleotide sequence.

The given example states that the sequence "ATGC" is converted to "CGTA". This is usually understood as reading the string from the back and thus reversing the nucleotide sequence. Reversing the nucleotide sequence cannot be done on the nucleotide level.

Please give a reference with the biological relevance and explanation of the conversion performed by this function.

[BUG] Wrong versions in nuget packages built with Paket.pack

Describe the bug
There are still some issues with the nuget packages we build. Some of the packages have incorrect versions. Here is a screenshot of the 0.1.1 package:

image

This propagates to the deps.json file in the netstandard2.0 package:

image

Strangely, while still using the pretty old paket.template with type file, the versions for FSharpAux seem to be fine:

image

To Reproduce
Look at your packages and take note of the version

Expected behavior
Version should be the same as in the releasenotes. I don't get why it is not:

in build.fsx:

the "Nuget" target uses the release version:


Target.create "NuGet" (fun _ ->
    Paket.pack(fun p ->
        { p with
            ToolPath=".paket/paket.exe"
            OutputPath = "bin"
            Version = release.NugetVersion
            ReleaseNotes = String.toLines release.Notes})
)

which is bound here:

let release = ReleaseNotes.load "RELEASE_NOTES.md"

There is definitely something wrong with either the way we build packages or the way we push the prerelease packages.

Documentation cleanup and extension

Description

Documentation enhancements due for first release

  • Several docs which are marked as 'coming soon' need to be added (example)
  • Extension of the EBI docs for Entrez/SwissProt databases
  • general typo and markdown typo fixing
  • Complete summary for all functions/data types/modules for the API reference
  • Fix warnings in the projects (eg incomplete pattern matches, type constraints)

[BUG] GFF3-Writer appends to file instead of recreating it

Description
The BioFSharp.IO.GFF3.write function appends the given lines to the dataset specified by the filepath.

Expected behavior
The write function is expected to recreate the specified file instead of appending lines to it.

Solution
The write function should be renamed to writeOrAppend and an additional write function should be introduced.

[Feature Request] Rework ModificationInfo's implementation of IBioItem

Is your feature request related to a problem? Please describe.
IBioItem fields Symbol and Formula implementations in ModificationInfo are either placeholders or incorrect in some cases

Describe the solution you'd like
Symbol creation following some kind of convention (also helpful for prettyPrinting) and a more clever way of representing formulas than applying the modification to an empty formula (what happens on 'loss' modifications? Isotopic modifications?)

[Feature Request] Rework alignment

The Problem

Using the Pairwise alignment in BioFSharp.Algorithms works fine but the only implemented way to write out this alignment in a correct format is in the BioFSharp.IO.Clustal module. Although both generally use the same BioFSharp.Alignment.Alignment type, the conversion can be quite cumbersome.

Solution

Remodel BioFSharp.Algorithms.Pairwise Alignment and BioFSharp.IO.Clustal

  • Add ConservationInfo module to BioFSharp.IO.Clustal or BioFSharp.Alignment

  • Let Clustal functions use BioSeqs instead of Strings

  • Let BioFSharp.Algorithms.PairwiseAlignment functions use BioSeqs as output instead of Nucleotides

  • Add create function to Alignment Type in BioFSharp.Alignment

These changes should make using the different alignment functions of different namespaces together easier.

Example of unnecessary conversions

Output type of alignment
 Alignment.Alignment<Nucleotides.Nucleotide list, Algorithm.PairwiseAlignment.Score>
Expected input of clustal write function
 Alignment.Alignment<BioID.TaggedSequence<string,char>,Clustal.AlignmentInfo>
Needed Conversion
 let mappedData = 
     alignment.AlignedSequences
     |> List.mapi (fun i (ns:Nucleotides.Nucleotide list) -> 
         Seq.map (BioItem.symbol) ns
         |> BioID.createTaggedSequence (sprintf "seq%i" i)
     )

 let conservationInfo = String.init firstGeneSeq.Length (fun _ -> "*")

 let newHeader = {Header = "Decoy";ConservationInfo = conservationInfo}

 let newAlignment = {MetaData = newHeader;AlignedSequences = mappedData}

which is very cumbersome

[Feature Request]Replace deprecated FAKE calls in the build script

Is your feature request related to a problem? Please describe.
Not a real problem yet but may be in the future. The build script uses deprecated functions from FAKE. We should update the script accordingly.

Describe the solution you'd like
Use the suggested FAKE functions:
image

Describe alternatives you've considered
Let future us take care of it

[Feature Request] Documentation for Ontology Enrichment

Is your feature request related to a problem? Please describe.
I wanted to use the OntologyEnrichment module but there is close to no documentation on how to use it.

Describe the solution you'd like
Replace the filler documentation here with an actual documentation.

[Docs] Add BioFSharp.ML docs

BioFSharp.ML needs documentation. The following topics must be covered:

  • How to use the CNTK load script
  • How to extract features from biological sequences for ML
  • A full example from raw data over feature extraction, training and prediction

[Feature Request] Add installation procedure to Readme

Is your feature request related to a problem? Please describe.
The installation procedure is explained in the docs, but not in the readme.

Describe the solution you'd like
I suggest we also place an explanation in the readme.

Describe alternatives you've considered

  • Placing explanation in readme.
  • Not placing explanation in readme.

Additional context
I like basics explanations placed in the readme. ;)

Blastwrapper documentation spams into test file

Each time when building the documentation, the line
BlastWrapper(ncbiPath).blastP inputFile queryFastaPath outputPath ([customOutputFormat;] |> seq<BlastParams>) located in BlastWrapper.fsx adds lines to the example file /data/blastTestOutput.csv.
As a result, the test file grows with every documentation build.

Addition of Unit Testing for all projects

Description

None of the projects currently contain any unit tests. There are placeholder tests under the BioFSharp.Tests.NetCore project, which use Expecto.

How to design tests:
Basically, you can do what you want, as long as the test is as granular and self-contained as possible. Some inspiration can be taken from the current tests or this blog post:
https://stackify.com/unit-testing-basics-best-practices/ (ignore the C# code and setup instructions and focus on the concepts)

How to add tests:

How to run tests(from project root):

  • Windows: ./build.cmd -t runTestsAll
  • Linux: ./build.sh -t runTestsDotnet

Starting point resources

Expecto docs
This well written blog post

Addition of tools for statistical analysis of biological sequences

Description

The following features are added to BioFSharp :

  • A frequency vector to represent the count of elements in biological sequences, e.g. how many occurrences of Adenine or Thymine are there in the sequence. The probability vector contains a probability for each element to be picked by chance.
  • The position frequency matrix contains the count of elements at specific positions, e.g. how many Adenine are at the first position when comparing different sequences.
    The position probability matrix contains the probability of a specific element to be at that position.
  • A position weight matrix is typically a position probability matrix normalized by a probability vector of the background of the chosen sequence.
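
The first two concepts can be illustrated on plain strings. This is a sketch only: the function names are hypothetical, sequences are represented as strings rather than BioCollections, and the position frequency matrix assumes that all sequences have the same length.

```fsharp
/// Frequency vector: how often each symbol occurs in a sequence.
let frequencyVector (sequence: string) : Map<char, int> =
    sequence |> Seq.countBy id |> Map.ofSeq

/// Probability vector: each count divided by the total number of symbols.
let probabilityVector (frequencies: Map<char, int>) : Map<char, float> =
    let total = frequencies |> Map.toSeq |> Seq.sumBy snd |> float
    frequencies |> Map.map (fun _ count -> float count / total)

/// Position frequency matrix: one frequency vector per sequence position.
/// Assumption: the list is non-empty and all sequences are equally long.
let positionFrequencyMatrix (sequences: string list) : Map<char, int> [] =
    let length = sequences |> List.head |> String.length
    let column (pos: int) =
        sequences
        |> List.map (fun s -> s.[pos])
        |> Seq.countBy id
        |> Map.ofSeq
    Array.init length column
```

For the sequences "AT" and "AG", the first column of the position frequency matrix counts Adenine twice, and the second column counts Thymine and Guanine once each.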

Implementation Steps

  • Add frequency and probability vector
  • Add position frequency, position probability and position weight matrix

Related information

Description position weight matrix

[Feature Request] Improve Formula parsing/printing consistency

Is your feature request related to a problem? Please describe.
the input of Formula.parseFormulaString and the output of Formula.toString are inconsistent:

let CO2 = Formula.parseFormulaString "CO2" // Number is an integer and no space between the elements

Formula.toString CO2 // val it : string = "C1.00 O2.00 " Number is a float and there's a space between the elements

While the output of Formula.toString is valid input for Formula.parseFormulaString, I don't understand why we need the spaces and floats.

Additionally, (CO)2 is not parsed correctly (this case is marked as To-Do in the source code):

let C2O2 = Formula.parseFormulaString "(CO)2" 

Formula.toString C2O2 // val it : string = "C1.00 O2.00" , expected "C2.00 O2.00"

Describe the solution you'd like
Solve the (CO)2 problem, change number output of Formula.toString to int and remove spaces

Describe alternatives you've considered
Formula.toString output is not wrong per se, so we could leave it as-is, but the (CO)2 problem needs fixing
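
A small recursive-descent parser shows how bracketed groups such as (CO)2 could be handled. This is a sketch on a plain Map<string, int> and not the BioFSharp Formula type; all names are hypothetical, and the parser performs no validation of element symbols.

```fsharp
open System

/// Adds `count` atoms of `element` to a formula.
let addElement (element: string) (count: int) (formula: Map<string, int>) =
    let current = formula |> Map.tryFind element |> Option.defaultValue 0
    Map.add element (current + count) formula

/// Reads an optional integer multiplier at index i; returns (count, next index).
let readCount (s: string) (i: int) : int * int =
    let mutable j = i
    while j < s.Length && Char.IsDigit(s.[j]) do
        j <- j + 1
    if j = i then 1, i else int (s.Substring(i, j - i)), j

/// Parses from index i until ')' or the end; returns (formula, stop index).
let rec parseGroup (s: string) (i: int) : Map<string, int> * int =
    let mutable formula : Map<string, int> = Map.empty
    let mutable pos = i
    while pos < s.Length && s.[pos] <> ')' do
        if s.[pos] = '(' then
            // parse the bracketed group, then apply its multiplier
            let inner, closing = parseGroup s (pos + 1)
            let count, next = readCount s (closing + 1)
            formula <- Map.fold (fun acc el n -> addElement el (n * count) acc) formula inner
            pos <- next
        else
            // element symbol: one character followed by lower-case letters
            let mutable j = pos + 1
            while j < s.Length && Char.IsLower(s.[j]) do
                j <- j + 1
            let element = s.Substring(pos, j - pos)
            let count, next = readCount s j
            formula <- addElement element count formula
            pos <- next
    formula, pos

let parseFormula (s: string) : Map<string, int> = fst (parseGroup s 0)

/// Prints integer counts without spaces, as requested in the issue.
let toCompactString (formula: Map<string, int>) : string =
    formula |> Map.toSeq |> Seq.map (fun (el, n) -> sprintf "%s%i" el n) |> String.concat ""
```

With this approach the multiplier after a closing bracket is applied to every element inside the group, so "(CO)2" and "C2O2" parse to the same formula.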

[Feature Request]Implement Entrez query construction and databases queries

[Feature Request] Optimize Sailent to increase computation speed

Is your feature request related to a problem? Please describe.
Sailent is pretty slow, as it samples bootstrap distributions millions of times depending on the iteration count. The current implementation uses an array shuffle function and takes the first n entries after creation of the shuffled array. This can be greatly improved upon.

[Feature Request] Missing GAF Parser

Description

In BioFSharp.IO, a parser for GAF files (GO Annotation File format) is missing. A format description can be found here.
Consider backwards compatibility (GAF2.0 -> GAF1.0).

  • add readFile
  • add writeFile
