abdenlab / oxbow

Read specialized NGS formats as data frames in R, Python, and more.

Home Page: https://lifeinbytes.substack.com/p/breaking-out-of-bioinformatic-data-silos

License: Apache License 2.0

Rust 93.91% R 0.52% C 0.16% Python 5.41%
apache-arrow bioinformatics data-science dataframe fair-data genomics multiomics ngs pandas polars

oxbow's Introduction

oxbow

Read specialized bioinformatic file formats as data frames in R, Python, and more.

File formats create a lot of friction for computational biologists. Oxbow is a data unification layer that aims to improve data accessibility and ease of high-performance analytics.

Data I/O is handled in Rust with features exposed to Python and R via Apache Arrow.

Learn more in our recent blog post.

Docs

Read the latest Python and Rust API documentation.

Contributing

Want to contribute? Join us!

Development

The oxbow project is split into separate Rust, Python, and R packages. You can download sample data by following these instructions.

oxbow's People

Contributors

garrettng, ghuls, jackh726, manzt, nvictus, pkerpedjiev

oxbow's Issues

Feature request: Return iterator of RecordBatches (in Python)

Oxbow's Python functions (read_bam, etc.) currently return a bytes object. It would be great if they returned an iterator of pa.RecordBatch objects instead. The goal is to allow reading files in chunks (instead of loading the whole file into memory) and to return PyArrow objects (which can be turned into pa.Table objects, polars/pandas data frames, etc.) rather than bare bytes objects. The desired chunk size (in number of rows? in bytes?) would ideally be exposed as a kwarg.
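The chunked interface requested here can be sketched without Arrow at all. This is only an illustration of the calling convention, not oxbow's API: the names `iter_batches` and `chunk_size` are hypothetical, and plain Python lists stand in for pa.RecordBatch objects. The chunk size is counted in rows, which is one of the two options the request leaves open.

```python
from itertools import islice


def iter_batches(records, chunk_size=1000):
    """Yield lists of at most `chunk_size` records from any iterable.

    Hypothetical stand-in for a reader that would yield pa.RecordBatch
    objects; only one batch is held in memory at a time.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, chunk_size))
        if not batch:
            return  # source exhausted
        yield batch


# Usage: five records read in chunks of two.
for batch in iter_batches(range(5), chunk_size=2):
    print(batch)
```

Because the function is a generator, a caller that stops early never pays for the rest of the file.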

r-oxbow installation fails

R-Version: 4.3.0
OS: Pop!_OS (Ubuntu variant)
rust-Version: rustc 1.69.0 (84c898d65 2023-04-16)

I've cloned the repo to make sure I can build the local r-oxbow.

cd r-oxbow/src/rust
cargo build

This works fine: cargo detects the oxbow crate in the directory above and builds it.

remotes::install_local("r-oxbow")

Generates this output:

── R CMD build ─────────────────────────────────────────────────────────────────
✔  checking for file ‘/tmp/RtmpjHZ4Kt/file34ed022cec682/r-oxbow/DESCRIPTION’ ...
─  preparing ‘oxbow’: (2.1s)
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts (425ms)
─  checking for empty or unneeded directories
─  building ‘oxbow_0.0.0.9000.tar.gz’
   
* installing *source* package ‘oxbow’ ...
** using staged installation
** libs
using C compiler: ‘gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0’
rm -Rf oxbow.so ./rust/target/release/liboxbow.a entrypoint.o
gcc -I"/rmflight_stuff/software/R-4.3.0/include" -DNDEBUG   -I/usr/local/include    -fpic  -g -O2  -c entrypoint.c -o entrypoint.o
# In some environments, ~/.cargo/bin might not be included in PATH, so we need
# to set it here to ensure cargo can be invoked. It is appended to PATH and
# therefore is only used if cargo is absent from the user's PATH.
if [ "" != "true" ]; then \
	export CARGO_HOME=/tmp/RtmpxvOvG8/R.INSTALL34f23e7aef79/oxbow/src/.cargo; \
fi && \
	export PATH="/opt/TinyTeX/bin/x86_64-linux/:/opt/TinyTeX/bin/x86_64-linux/:/home/rmflight/.cargo/bin:/home/rmflight/anaconda3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:/home/rmflight/.local/bin:/home/rmflight/bin:/bin/java/:/software/julia-1.0.5/bin:/home/rmflight/.cargo/bin:/opt/TinyTeX/bin/x86_64-linux/:/home/rmflight/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/home/rmflight/.cargo/bin" && \
	cargo build --lib --release --manifest-path=./rust/Cargo.toml --target-dir ./rust/target
error: failed to get `oxbow` as a dependency of package `r-oxbow v0.1.0 (/tmp/RtmpxvOvG8/R.INSTALL34f23e7aef79/oxbow/src/rust)`

Caused by:
  failed to load source for dependency `oxbow`

Caused by:
  Unable to update /tmp/RtmpxvOvG8/R.INSTALL34f23e7aef79/oxbow

Caused by:
  failed to read `/tmp/RtmpxvOvG8/R.INSTALL34f23e7aef79/oxbow/Cargo.toml`

Caused by:
  No such file or directory (os error 2)
make: *** [Makevars:16: rust/target/release/liboxbow.a] Error 101
ERROR: compilation failed for package ‘oxbow’
* removing ‘/rmflight_stuff/software/R-4.3.0/library/oxbow’
Warning message:
In i.p(...) :
  installation of package ‘/tmp/RtmpjHZ4Kt/file34ed0a9f21f9/oxbow_0.0.0.9000.tar.gz’ had non-zero exit status

Same error if I use:

remotes::install_github("abdenlab/oxbow", subdir="r-oxbow")
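One reading of the log above is that the Rust crate inside r-oxbow refers to the core oxbow crate through a relative path, and that path stops pointing at anything once R copies only the r-oxbow directory into a temporary staging area. The sketch below resolves such a path in both locations. The value `../../../oxbow` is an assumption inferred from the error message, not taken from the repository's Cargo.toml, and `/repo` is a placeholder for wherever the clone lives.

```python
import posixpath

# Assumption: the path dependency declared by r-oxbow/src/rust/Cargo.toml.
ASSUMED_DEPENDENCY_PATH = "../../../oxbow"


def resolve_dependency(manifest_dir, relative_path=ASSUMED_DEPENDENCY_PATH):
    """Return the directory in which cargo would look for the dependency."""
    return posixpath.normpath(posixpath.join(manifest_dir, relative_path))


# In a full clone, the path lands on the sibling oxbow crate.
print(resolve_dependency("/repo/r-oxbow/src/rust"))

# In the staging directory from the log, it lands on the R package itself,
# which has no Cargo.toml at its top level.
print(resolve_dependency("/tmp/RtmpxvOvG8/R.INSTALL34f23e7aef79/oxbow/src/rust"))
```

Under that assumption the second result is exactly the directory named in the "failed to read" line of the log, which would explain why a plain cargo build in the clone succeeds while both remotes install routes fail.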

Reading VCF files?

Hello,

I wanted to try oxbow and was testing on the following VCF file, but it doesn't seem to work. I'm not sure if this should be supported already, or if I am missing something.

I built the oxbow wheel from current main:

$ git rev-parse HEAD
e3d2a1751901430a16438134b87bc16f21d90269

files

$ wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz 
$ wget https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz.tbi
$ ls -l GCF_000001405.40.* 
-rw-r--r--  1 andreaspoehlmann  staff  26611209012 Oct 16 10:51 GCF_000001405.40.gz
-rw-r--r--  1 andreaspoehlmann  staff      3118040 Oct 16 10:58 GCF_000001405.40.gz.tbi
$ md5sum GCF_000001405.40.*
a1082ca70e15eb63301dfc33b19d0ae7  GCF_000001405.40.gz
76959b1691e8e62cd650664b00b7ea02  GCF_000001405.40.gz.tbi

code

# read_vcf.py                                                                       
import importlib.metadata
import oxbow as ox
import polars as pl

print("oxbow.__version__", importlib.metadata.version("oxbow"))

ipc = ox.read_vcf("GCF_000001405.40.gz", index="GCF_000001405.40.gz.tbi")
df = pl.read_ipc(ipc)
print(df)

error

$ python read_vcf.py
oxbow.__version__ 0.2.0
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: ExternalError(Custom { kind: InvalidData, error: InvalidInfo(InvalidField(InvalidValue(Other(Other("RS")), InvalidInteger(ParseIntError { kind: PosOverflow })))) })', src/lib.rs:117:49
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/andreaspoehlmann/development/oxbow-test/read_vcf.py", line 8, in <module>
    ipc = ox.read_vcf("GCF_000001405.40.gz", index="GCF_000001405.40.gz.tbi")
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ExternalError(Custom { kind: InvalidData, error: InvalidInfo(InvalidField(InvalidValue(Other(Other("RS")), InvalidInteger(ParseIntError { kind: PosOverflow })))) })

system info

$ python -VV
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:41:52) [Clang 15.0.7 ]
$ uname -a
Darwin F2WR4P9QNH 23.0.0 Darwin Kernel Version 23.0.0: Fri Sep 15 14:43:05 PDT 2023; root:xnu-10002.1.13~1/RELEASE_ARM64_T6020 arm64
$ system_profiler SPHardwareDataType | grep -e "Model\|Memory\|Cores"
      Model Name: MacBook Pro
      Model Identifier: Mac14,5
      Total Number of Cores: 12 (8 performance and 4 efficiency)
      Memory: 64 GB

Cheers,
Andreas 😃
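The panic above names the INFO key RS and a ParseIntError of kind PosOverflow, which is what Rust reports when a number is too large for the target integer type. The VCF specification defines the Integer type as 32-bit signed, so a plausible cause is an RS value above 2147483647. That the parser targets a 32-bit integer is an assumption here, and the records in the sketch below are made up for illustration, not taken from the dbSNP file.

```python
I32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def overflowing_rs_values(vcf_lines):
    """Yield (line_number, value) for RS= INFO entries larger than I32_MAX."""
    for line_number, line in enumerate(vcf_lines, start=1):
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete record
        for entry in fields[7].split(";"):
            key, _, value = entry.partition("=")
            if key == "RS" and value.isdigit() and int(value) > I32_MAX:
                yield line_number, int(value)


# Made-up records: the first RS value fits in 32 bits, the second does not.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "NC_000001.11\t10001\trs2147483647\tT\tA\t.\t.\tRS=2147483647;VC=SNV\n",
    "NC_000001.11\t10002\trs2147483648\tA\tC\t.\t.\tRS=2147483648;VC=SNV\n",
]
for line_number, value in overflowing_rs_values(example):
    print(line_number, value)
```

To check the real file, the same function can be fed the lines of `gzip.open("GCF_000001405.40.gz", "rt")`, since a bgzip-compressed VCF is readable as ordinary gzip.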

Needs a logo

Gotta have a cool logo for powerpoints and such

Range query API

Partially implemented (some may already be implemented)

  • alignments:
    • sam.gz:
      • tbi
      • csi
    • bam: bai/csi
      • bai
      • tbi
      • csi
    • cram
      • tbi
      • csi
  • variants:
    • vcf
      • tbi
      • csi
    • bcf
      • tbi
      • csi
  • sequences: fasta, fastq
    • fasta.gz
      • fai (fasta, uncompressed)
      • gzi (compressed)
    • fastq.gz
      • fai (fastq, uncompressed)
      • gzi (compressed)
  • interval feature
    • bed.gz
    • bedGraph.gz
    • gtf.gz/gff.gz
    • Arbitrary TSV files with tabix

InvalidReferenceSequenceName: Exception due to brackets in reference name

Hello,

I have been using oxbow quite a lot since that initial blog post and have just run into a new issue that I hope there is a workaround for. It occurs when I try to read a BAM aligned to the SacCer genome with STAR, because the tRNA reference names contain brackets.

E.g., the Ensembl annotation has:

tI(AAU)I1_tRNA-E1

I assume these brackets are the root of the issue. Is there any chance that this could be handled within oxbow? Or is it a limitation imposed by arrow?

Thanks in advance.
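A quick way to test the hypothesis in this report is to check the name against the reference sequence name pattern of the SAM specification, which excludes brackets among other characters. The regular expression below is reproduced from the specification as best recalled and should be compared against the current version before relying on it. That the exception comes from this kind of name validation in the parser, rather than from Arrow, is an assumption. The function names are hypothetical.

```python
import re

# Reference sequence name pattern as given in the SAM specification
# (assumption: transcribed from memory, verify against the current spec).
RNAME_PATTERN = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_reference_name(name):
    """Return True if the whole name matches the pattern."""
    return RNAME_PATTERN.fullmatch(name) is not None


def offending_characters(name):
    """Return the sorted set of characters the pattern never allows."""
    # Prefix a known-good character so each one is tested in a non-first position.
    return sorted({ch for ch in name if not RNAME_PATTERN.fullmatch("A" + ch)})


name = "tI(AAU)I1_tRNA-E1"
print(is_valid_reference_name(name))
print(offending_characters(name))
```

If the check fails only on the parentheses, renaming the affected contigs before alignment is one possible workaround, though whether that is acceptable depends on the downstream annotation.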

read_bam shouldn't require BAM index

It would be great if read_bam worked without a BAM index (provided you're not doing a region query). My use case: Oxford Nanopore's Dorado basecaller outputs basecalled reads in BAM format (where all the alignment fields are null in the BAM), so I want to read them, but there's no point in indexing them.
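Reading a BAM front to back does not need an index in principle, because BGZF blocks are ordinary gzip members and records are length-prefixed. The sketch below uses only the Python standard library and follows the BAM record layout from the SAM/BAM specification. It is not how oxbow reads BAM files, and the names `iter_read_names` and `tiny_unaligned_bam` are hypothetical. The fixture builds a minimal in-memory file with unmapped, sequence-less reads purely so the example can run on its own.

```python
import gzip
import io
import struct


def iter_read_names(fileobj):
    """Yield read names from a BAM stream by scanning it sequentially."""
    with gzip.open(fileobj, "rb") as bam:
        if bam.read(4) != b"BAM\x01":
            raise ValueError("not a BAM stream")
        (l_text,) = struct.unpack("<i", bam.read(4))
        bam.read(l_text)  # skip the SAM header text
        (n_ref,) = struct.unpack("<i", bam.read(4))
        for _ in range(n_ref):
            (l_name,) = struct.unpack("<i", bam.read(4))
            bam.read(l_name + 4)  # skip the name and the int32 length
        while True:
            size_bytes = bam.read(4)
            if len(size_bytes) < 4:
                break  # end of stream
            (block_size,) = struct.unpack("<i", size_bytes)
            block = bam.read(block_size)
            l_read_name = block[8]  # includes the trailing NUL
            yield block[32:32 + l_read_name - 1].decode("ascii")


def tiny_unaligned_bam(names):
    """Build a minimal BAM in memory (demo fixture, no references)."""
    data = b"BAM\x01" + struct.pack("<i", 0) + struct.pack("<i", 0)
    for name in names:
        raw_name = name.encode("ascii") + b"\x00"
        # refID=-1, pos=-1, mapq=0, bin=4680, no CIGAR, flag=4 (unmapped), l_seq=0
        core = struct.pack("<iiBBHHHiiii", -1, -1, len(raw_name), 0, 4680, 0, 4, 0, -1, -1, 0)
        record = core + raw_name
        data += struct.pack("<i", len(record)) + record
    return gzip.compress(data)


for read_name in iter_read_names(io.BytesIO(tiny_unaligned_bam(["read1", "read2"]))):
    print(read_name)
```

An index only becomes necessary when seeking to a genomic region, which is consistent with the request to require one for region queries alone.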
