
bioinf-commons's Introduction

Bioinf-commons

Bioinformatics library in Kotlin.

Contents

  • org.jetbrains.bio.dataframe - pandas-like dataframe
  • org.jetbrains.bio.experiment - Named computation - experiment and resources configuration
  • org.jetbrains.bio.genome - APIs for working with Genome, Sequence, Genes, Ontologies etc.
    • containers - Genome location sets API
    • coverage - Genome coverage API, paired/single end, fragment size estimation
    • data - Describe any kind of dataset resources including replicates
    • format - Bam (including Bisulfite sequencing), Bed, Fasta, Fastq, 2bit formats support
    • methylome - API to work with methylomes - filtration, statistics, aggregations etc.
    • query - Named functions - queries with caching capabilities
    • sampling - API for genomic sampling - sequences, locations, etc.
    • sequence - Genome sequence API
      Also: Biomart, Ensembl, UCSC support, Genomes and Genes annotations
  • org.jetbrains.bio.statistics - Statistics utilities including distribution mixtures, HMMs and hypothesis testing
  • org.jetbrains.bio.util - Cancellable computations, progress reporters, logging utilities, and other utils

Tests

$ ./gradlew clean test --no-daemon --max-workers 1

Usages

Used in the following projects:

  • SPAN - Semi-supervised Peak Analyzer
  • JBR - JBR Genome Browser
  • FARM - hierarchical association rule mining and visualization method

bioinf-commons's People

Contributors

dievsky, iromeo, olegs, serge-p7v, xewar313

bioinf-commons's Issues

BED parsing is sometimes too relaxed

Today I learned that the line chr1\t0\t100 will happily parse as bed12. The missing fields will just be filled with default values.

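val bedFormat = BedFormat.from("bed12")
// Only three of the twelve declared fields are present.
val content = "chr1\t0\t100"
withBedFile(content) { path ->
    bedFormat.parse(path) { parser ->
        val entries = parser.map { entry -> entry.unpack(bedFormat) }
        // The truncated line is accepted and not counted as a parse failure.
        assertEquals(1, entries.size)
        assertEquals(0, parser.linesFailedToParse)
        // The missing score field is silently filled with its default value.
        assertEquals(0, entries.first().score)
    }
}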

No exceptions thrown, no errors logged, just total peace and harmony. Looks wrong.

This also interferes to some extent with our current approach to loading BED tracks in JBR, and the underlying problem lies more in the BedEntry.unpack method being unusually forgiving. Because of that, I'm not sure where to file the issue: big (home of unpack), bioinf-commons (home of BedFormat) or epigenome (home of JBR).
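
One possible direction is to check the field count before unpacking. The following is only a minimal sketch under stated assumptions: splitBedLineStrict is a hypothetical helper that exists in neither big nor bioinf-commons, and it assumes tab-separated fields and a known expected field count (12 for bed12).

```kotlin
// Hypothetical helper, not part of big or bioinf-commons.
// Splits a BED line on tabs and fails if it has fewer fields
// than the declared format requires.
fun splitBedLineStrict(line: String, expectedFields: Int): List<String> {
    require(expectedFields >= 3) { "BED needs at least chrom, start and end" }
    val fields = line.split('\t')
    if (fields.size < expectedFields) {
        throw IllegalArgumentException(
            "Expected at least $expectedFields fields, got ${fields.size}: $line"
        )
    }
    return fields
}
```

With such a check, splitBedLineStrict("chr1\t0\t100", 3) returns the three fields, while the same line with expectedFields = 12 throws instead of being padded with defaults.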

Parallelism approach seems inconsistent

Some parts of bioinf-commons use the common ForkJoinPool, while other parts create a newFixedThreadPool.

For example, MoreStreams.kt uses ForkJoinPool in its chunked and forking extension methods, but ExecutorExtensions.kt methods (like await) prefer a separate pool. Span uses both approaches: FJP is used for the HMM EM expect and refill steps, and GenomeMap parallel operations (e.g. read coverage or compute peaks) work through await and use a special thread pool.

It might be better to use just one approach (the common FJP, most likely). This would at least prevent several thread pools from being created inadvertently in parallel. (And while we're at it, we might want to switch to coroutines.)

Discussion is most welcome.
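
To make the "common FJP only" option concrete, here is a rough sketch of an await-style helper that runs on the common pool instead of creating its own. The name awaitAllOnCommonPool is hypothetical and this is not the current ExecutorExtensions.kt code; it relies only on the standard java.util.concurrent API.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ForkJoinPool

// Hypothetical helper: runs all tasks on the JVM-wide common ForkJoinPool
// and returns their results in submission order.
fun <T> awaitAllOnCommonPool(tasks: List<Callable<T>>): List<T> =
    ForkJoinPool.commonPool()
        .invokeAll(tasks)   // blocks until every task has completed
        .map { it.get() }   // a failed task surfaces here as ExecutionException
```

Because no pool is created, there is nothing to shut down afterwards. The trade-off is that long blocking tasks would compete with every other user of the common pool.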

Simplify genome configuration

At the moment there are at least three ways to initialise a Genome object: from a local data folder, via GenomeAnnotationsConfig, and a separate one for tests. Modifying genome information from the JBR Genome Browser (https://github.com/JetBrains-Research/jbr) is quite complicated.

We should consider unifying the Genome initialisation scheme.

Adding repeats track fails for hs1 genome reference

[Jul 7, 2023 11:11:29] ERROR PathExtensions Repeats: processing /Users/Oleg.Shpynov/.jbr_browser/genomes/hs1/rmsk.txt.gz: [FAILED] after 1.406 ms
Caused by: ERROR NullPointerException
org.jetbrains.bio.genome.Repeats$repeatsPath$1.invoke(Annotations.kt:72)
org.jetbrains.bio.genome.Repeats$repeatsPath$1.invoke(Annotations.kt:60)
org.jetbrains.bio.util.PathExtensionsKt$checkOrRecalculate$1$1$block$1.invoke(PathExtensions.kt:348)
org.jetbrains.bio.util.PathExtensionsKt$checkOrRecalculate$1$1$block$1.invoke(PathExtensions.kt:347)
org.jetbrains.bio.util.PathExtensionsKt$checkOrRecalculate$1$1.invoke(PathExtensions.kt:490)
org.jetbrains.bio.util.PathExtensionsKt$checkOrRecalculate$1$1.invoke(PathExtensions.kt:345)
org.jetbrains.bio.util.LoggerExtensionsKt.time(LoggerExtensions.kt:20)
org.jetbrains.bio.util.PathExtensionsKt.checkOrRecalculate(PathExtensions.kt:345)
org.jetbrains.bio.util.PathExtensionsKt.checkOrRecalculate$default(PathExtensions.kt:320)
org.jetbrains.bio.genome.Repeats.repeatsPath(Annotations.kt:60)
org.jetbrains.bio.genome.Repeats.all$lambda-1(Annotations.kt:85)
com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4876)
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
com.google.common.cache.LocalCache.get(LocalCache.java:3951)
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4871)
org.jetbrains.bio.genome.Repeats.all$bioinf_commons(Annotations.kt:84)
org.jetbrains.bio.genome.Chromosome.getRepeats(Genome.kt:557)
org.jetbrains.bio.browser.tracks.base.RepeatsTrackView$preprocess$repeatClasses$1.invoke(RepeatsTrackView.kt:37)
org.jetbrains.bio.browser.tracks.base.RepeatsTrackView$preprocess$repeatClasses$1.invoke(RepeatsTrackView.kt:35)
kotlin.sequences.FlatteningSequence$iterator$1.ensureItemIterator(Sequences.kt:315)
kotlin.sequences.FlatteningSequence$iterator$1.hasNext(Sequences.kt:303)
kotlin.sequences.SequencesKt___SequencesKt.toCollection(_Sequences.kt:786)
kotlin.sequences.SequencesKt___SequencesKt.toSet(_Sequences.kt:827)
org.jetbrains.bio.browser.tracks.base.RepeatsTrackView.preprocess(RepeatsTrackView.kt:38)
org.jetbrains.bio.browser.GenomeBrowser$preprocess$2.invoke(GenomeBrowser.kt:279)
org.jetbrains.bio.browser.GenomeBrowser$preprocess$2.invoke(GenomeBrowser.kt:277)
org.jetbrains.bio.util.LoggerExtensionsKt.time(LoggerExtensions.kt:20)
org.jetbrains.bio.browser.GenomeBrowser.preprocess(GenomeBrowser.kt:277)
org.jetbrains.bio.browser.GenomeBrowser.access$preprocess(GenomeBrowser.kt:32)
org.jetbrains.bio.browser.GenomeBrowser$preprocess$asTasks$1.invoke$lambda-1$lambda-0(GenomeBrowser.kt:232)
java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1428)
java.base/java.util.concurrent.ForkJoinTask.doExec$$$capture(ForkJoinTask.java:373)
java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java)
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
[Jul 7, 2023 11:11:29] ERROR GenomeBrowser Failed to preprocess track Repeats
Caused by: ERROR UncheckedExecutionException java.lang.NullPointerException
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
com.google.common.cache.LocalCache.get(LocalCache.java:3951)
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4871)
org.jetbrains.bio.genome.Repeats.all$bioinf_commons(Annotations.kt:84)
org.jetbrains.bio.genome.Chromosome.getRepeats(Genome.kt:557)
org.jetbrains.bio.browser.tracks.base.RepeatsTrackView$preprocess$repeatClasses$1.invoke(RepeatsTrackView.kt:37)
org.jetbrains.bio.browser.tracks.base.RepeatsTrackView$preprocess$repeatClasses$1.invoke(RepeatsTrackView.kt:35)
kotlin.sequences.FlatteningSequence$iterator$1.ensureItemIterator(Sequences.kt:315)
kotlin.sequences.FlatteningSequence$iterator$1.hasNext(Sequences.kt:303)
kotlin.sequences.SequencesKt___SequencesKt.toCollection(_Sequences.kt:786)
kotlin.sequences.SequencesKt___SequencesKt.toSet(_Sequences.kt:827)
org.jetbrains.bio.browser.tracks.base.RepeatsTrackView.preprocess(RepeatsTrackView.kt:38)
org.jetbrains.bio.browser.GenomeBrowser$preprocess$2.invoke(GenomeBrowser.kt:279)
org.jetbrains.bio.browser.GenomeBrowser$preprocess$2.invoke(GenomeBrowser.kt:277)
org.jetbrains.bio.util.LoggerExtensionsKt.time(LoggerExtensions.kt:20)
org.jetbrains.bio.browser.GenomeBrowser.preprocess(GenomeBrowser.kt:277)
org.jetbrains.bio.browser.GenomeBrowser.access$preprocess(GenomeBrowser.kt:32)
org.jetbrains.bio.browser.GenomeBrowser$preprocess$asTasks$1.invoke$lambda-1$lambda-0(GenomeBrowser.kt:232)
java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1428)
java.base/java.util.concurrent.ForkJoinTask.doExec$$$capture(ForkJoinTask.java:373)
java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java)
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)

ReadsQuery caches and fragment size override

(Note: this is a copy of epigenome issue #1423. I recreated it here because the code responsible for the bug has since been extracted to bioinf-commons.)

A bug was discovered.
Steps to reproduce:

  • call a ReadsQuery for paired-end reads without the fragment option. A coverage cache file will be generated for paired-end coverage data;
  • call another ReadsQuery for the same input file, but this time with the fragment option.

Expected result: the new query treats coverage as single-end.
Actual result: the new query loads the old cache file, which holds paired-end data, and thus ignores the fragment option.

This happens because our cache files are fragment-agnostic.
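
One way to make the cache fragment-aware is to encode the option in the cache file name, so that the two queries above can never share a file. This is a sketch only: coverageCacheName, the "auto" marker and the ".cache" extension are all hypothetical and do not reflect the actual ReadsQuery naming.

```kotlin
// Hypothetical cache-file naming, not the actual ReadsQuery code.
// fragment == null stands for "no fragment option given".
fun coverageCacheName(inputStem: String, fragment: Int?): String {
    val suffix = if (fragment == null) "auto" else "fragment$fragment"
    return "coverage_${inputStem}_${suffix}.cache"
}
```

Under this scheme a query without the option and a query with fragment 150 resolve to different files for the same input, so the second query has to compute its own coverage.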

Bed format detection problems

A file consisting of exactly one line will always be recognized as a valid BED file by BedFormat.auto, because the algorithm skips one malformed line on the grounds that it might be a header. The same is true for any file that has exactly one line that doesn't match NON_DATA_LINE_PATTERN. This is probably not the intended behaviour.
Example: chr1\t2 is recognized as a valid bed3 file.
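
A stricter rule could keep the one-line header tolerance but require at least one line that actually parses as data. The sketch below is not the BedFormat.auto algorithm: both functions are hypothetical, it only checks bed3, it ignores NON_DATA_LINE_PATTERN, and it tolerates the malformed line anywhere in the file, not only at the top.

```kotlin
// Hypothetical check: a line counts as bed3-like data if it has at least
// three tab-separated fields with integer start and end.
fun isBed3DataLine(line: String): Boolean {
    val fields = line.split('\t')
    return fields.size >= 3 &&
        fields[1].toIntOrNull() != null &&
        fields[2].toIntOrNull() != null
}

// Tolerates at most one malformed line (a possible header), but only
// when at least one other line parses as data.
fun looksLikeBed3(lines: List<String>): Boolean {
    val candidates = lines.filter { it.isNotBlank() }
    val valid = candidates.count(::isBed3DataLine)
    val malformed = candidates.size - valid
    return valid >= 1 && malformed <= 1
}
```

With this rule the single line chr1\t2 is rejected, while a file with one header line followed by chr1\t0\t100 is still accepted.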
