alejandrogzi / gtfsort

A chr/pos/feature GTF sorter that uses a lexicographically-based index ordering algorithm.

License: MIT License

Rust 98.70% Dockerfile 1.30%
algorithms bioinformatics gtf sorting sorting-algorithms

gtfsort's Introduction

rare case of a vet turned bioinformatician interested in multi-omics, cancer and open-source software.

gtfsort's People

Contributors

alejandrogzi

Forkers

pre-mrna

gtfsort's Issues

Error

I am getting this error with gtfsort. Any thoughts on how to fix it?
thread '<unnamed>' panicked at /Users/apewoksu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /Users/apewoksu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
thread '<unnamed>' panicked at /Users/apewoksu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
thread '<unnamed>' panicked at /Users/apewoksu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
thread '<unnamed>' panicked at /Users/apewoksu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
thread '<unnamed>' panicked at /Users/apewoksu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
thread '<unnamed>' panicked at /Users/apewoksu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
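
The panic comes from an `unwrap()` in the attribute-parsing code (`attr.rs`), so something in the attribute column of the input is probably not what the parser expects. One way to find the offending lines before sorting is a small pre-check like the sketch below. This is not gtfsort's code: `LineProblem`, `check_line` and `find_problems` are hypothetical names, and the two checks (nine tab-separated columns, a `gene_id "..."` attribute) are only assumptions about what a GTF parser typically requires.

```rust
/// Reasons a line fails this sketch's checks (hypothetical categories,
/// not gtfsort's own error type).
#[derive(Debug, PartialEq)]
enum LineProblem {
    WrongColumnCount(usize),
    MissingGeneId,
}

/// Check one record: nine tab-separated columns and a `gene_id "..."` attribute.
fn check_line(line: &str) -> Result<(), LineProblem> {
    let cols: Vec<&str> = line.split('\t').collect();
    if cols.len() != 9 {
        return Err(LineProblem::WrongColumnCount(cols.len()));
    }
    if !cols[8].contains("gene_id \"") {
        return Err(LineProblem::MissingGeneId);
    }
    Ok(())
}

/// Return (1-based line number, problem) for every record that fails.
/// Comment lines starting with '#' and empty lines are skipped.
fn find_problems(text: &str) -> Vec<(usize, LineProblem)> {
    text.lines()
        .enumerate()
        .filter(|(_, l)| !l.starts_with('#') && !l.trim().is_empty())
        .filter_map(|(i, l)| check_line(l).err().map(|p| (i + 1, p)))
        .collect()
}

fn main() {
    // Made-up sample: one valid GTF record, one GFF3-style record,
    // and one record separated by spaces instead of tabs.
    let sample = "#!genome-build test\n\
chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id \"G1\";\n\
chr1\tsrc\texon\t1\t50\t.\t+\t.\tID=exon1;Parent=T1\n\
chr1 src exon 1 50 . + . gene_id \"G1\";\n";
    for (n, p) in find_problems(sample) {
        println!("line {}: {:?}", n, p);
    }
}
```

Returning a `Result` per line instead of calling `unwrap()` lets the caller report which line is malformed rather than aborting the thread.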

No results; the error shows thread '<unnamed>' panicked at ...

Hi Alejandro,

I have tried running the tool in both multi-core and single-core mode, and both runs fail with the errors shown below.
I have also tried different ways of installing the tool, e.g. via cargo, via Conda, or using the container image on CentOS 7/8.
How do I deal with this?

$ gtfsort -t 1 -i TAIR10_GFF3_genes_transposons.gff -o sorted.gff 2>stderr.log


##### GTFSORT #####
A rapid chr/pos/feature gtf sorter in Rust.
Repo: github.com/alejandrogzi/gtfsort

2024-03-01T03:15:25.131Z INFO  [gtfsort] Using 1 threads
$ head stderr.log

thread '<unnamed>' panicked at /home/public_tools/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /home/public_tools/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf.rs:34:67:
called `Result::unwrap()` on an `Err` value: Invalid
thread '<unnamed>' panicked at /home/public_tools/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
thread '<unnamed>' panicked at /home/public_tools/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
called `Result::unwrap()` on an `Err` value: Invalid
thread '<unnamed>' panicked at /home/public_tools/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gtfsort-0.2.1/src/gtf/attr.rs:77:10:
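
The input file here is named TAIR10_GFF3_genes_transposons.gff, which suggests GFF3 rather than GTF. The two formats write column 9 differently: GTF uses `key "value";` pairs, while GFF3 uses `key=value` pairs separated by `;`. A parser that only understands the GTF style could plausibly reject GFF3 attributes, which would be consistent with the `Err` value `Invalid` above, although that is an assumption and not confirmed from gtfsort's source. A minimal sketch for telling the two styles apart (`AttrStyle` and `attr_style` are hypothetical names, and the sample attribute strings are illustrative):

```rust
/// Attribute-column style guessed from the first attribute (hypothetical type).
#[derive(Debug, PartialEq)]
enum AttrStyle {
    Gtf,
    Gff3,
    Unknown,
}

/// Guess whether column 9 is GTF-style (`key "value";`) or GFF3-style (`key=value;`).
fn attr_style(attr: &str) -> AttrStyle {
    // Only the first `;`-separated attribute is inspected.
    let first = attr.split(';').next().unwrap_or("").trim();

    // GTF: a key, whitespace, then a double-quoted value.
    if let Some((key, value)) = first.split_once(char::is_whitespace) {
        let value = value.trim();
        if !key.contains('=')
            && value.len() >= 2
            && value.starts_with('"')
            && value.ends_with('"')
        {
            return AttrStyle::Gtf;
        }
    }

    // GFF3: a key without whitespace, then `=`, then the value.
    if let Some((key, _)) = first.split_once('=') {
        if !key.is_empty() && !key.contains(char::is_whitespace) {
            return AttrStyle::Gff3;
        }
    }

    AttrStyle::Unknown
}

fn main() {
    // Illustrative attribute strings, not taken from the reporter's file.
    let samples = [
        "gene_id \"AT1G01010\"; transcript_id \"AT1G01010.1\";",
        "ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010",
        ".",
    ];
    for s in samples {
        println!("{:?} <- {}", attr_style(s), s);
    }
}
```

If a check like this reports the GFF3 style for the input, the file may need converting to GTF first, or the tool may need explicit GFF3 support.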

By the way, installing with Conda only worked after I added the "conda-forge" channel, which provides libgcc-ng.

$ conda create -n gtfsort -c bioconda gtfsort
WARNING: A conda environment already exists at '/home/tcman/miniconda3/envs/gtfsort'
Remove existing environment (y/[n])? y

Channels:
 - bioconda
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - nothing provides libgcc-ng >=12 needed by gtfsort-0.2.1-h4ac6f70_0

Could not solve for environment specs
The following package could not be installed
└─ gtfsort is not installable because it requires
   └─ libgcc-ng >=12 , which does not exist (perhaps a missing channel).
$ conda create -n gtfsort -c bioconda gtfsort -c conda-forge
Channels:
 - bioconda
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/tcman/miniconda3/envs/gtfsort

  added / updated specs:
    - gtfsort


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
  gtfsort            bioconda/linux-64::gtfsort-0.2.1-h4ac6f70_0
  libgcc-ng          conda-forge/linux-64::libgcc-ng-13.2.0-h807b86a_5
  libgomp            conda-forge/linux-64::libgomp-13.2.0-h807b86a_5
  libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-13.2.0-h7e041cc_5


Proceed ([y]/n)? y


Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate gtfsort
#
# To deactivate an active environment, use
#
#     $ conda deactivate

After sorting, some CDS entries are missing

Hi Alejandro,

I've been testing the gtfsort tool on the latest axolotl GTF version (AmexT_v47-AmexG_v6.0-DD.gtf) downloaded from Axolotl-omics. However, the sorted output contains fewer rows than the input.

Issue Description:

Input File: AmexT_v47-AmexG_v6.0-DD.gtf (1,977,265 rows).
Output File: reference_gtf_after_gtfsort.gtf (1,425,753 rows).

Before Sort: [image]
After Sort: [image]

Below is my mentor's analysis of which CDS entries are missing:

The gtfsort program is only printing out one CDS entry (in column 3) per transcript. Here's an example for two transcripts of gene AMEX60DD000031. The first capture is from the original file, the second from the sorted file. There should be four CDS entries for the first transcript and five for the second, but only the final one is included in each case. (It seems that all entries may be written into one location, so only the final one persists.)
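
The hypothesis above (every entry written to the same slot, so only the last survives) can be illustrated with a small sketch. This is not gtfsort's code, just a demonstration of the suspected failure mode: `keep_last` and `keep_all` are hypothetical names, "tx1" stands in for the truncated transcript ID, and the start coordinates are the four CDS starts of the first transcript in the capture.

```rust
use std::collections::HashMap;

/// Suspected failure mode: one value per (transcript, feature) key,
/// so each insert overwrites the previous entry.
fn keep_last<'a>(records: &[(&'a str, &'a str, u64)]) -> HashMap<(&'a str, &'a str), u64> {
    let mut map = HashMap::new();
    for &(tx, feature, start) in records {
        map.insert((tx, feature), start);
    }
    map
}

/// Accumulating variant: a Vec per key, so every entry is preserved.
fn keep_all<'a>(records: &[(&'a str, &'a str, u64)]) -> HashMap<(&'a str, &'a str), Vec<u64>> {
    let mut map: HashMap<(&str, &str), Vec<u64>> = HashMap::new();
    for &(tx, feature, start) in records {
        map.entry((tx, feature)).or_default().push(start);
    }
    map
}

fn main() {
    // "tx1" is a placeholder transcript ID; starts are from the original file.
    let records = [
        ("tx1", "CDS", 10306466u64),
        ("tx1", "CDS", 10403547),
        ("tx1", "CDS", 10404202),
        ("tx1", "CDS", 10502174),
    ];
    println!("overwrite:  {:?}", keep_last(&records)[&("tx1", "CDS")]);
    println!("accumulate: {:?}", keep_all(&records)[&("tx1", "CDS")]);
}
```

With the overwriting map only the last CDS start (10502174) remains, which matches what the sorted file shows; the accumulating map keeps all four.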

[jhgraber@random testGtfsort]$ grep AMEX60DD000031 AmexT_v47-AmexG_v6.0-DD.gtf | cut -b 1-100
chr10p ambMex60DD gene 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; gene_name "LOC102279365
chr10p ambMex60DD transcript 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC
chr10p ambMex60DD exon 10258638 10258703 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD CDS 10306466 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD transcript 10284305 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC
chr10p ambMex60DD exon 10284305 10284358 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD CDS 10284317 10284358 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD CDS 10502174 10502224 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793

[jhgraber@random testGtfsort]$ grep AMEX60DD000031 reference_gtf_after_gtfsort.gtf | cut -b 1-100
chr10p ambMex60DD gene 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; gene_name "LOC102279365
chr10p ambMex60DD transcript 10258638 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC
chr10p ambMex60DD exon 10258638 10258703 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD CDS 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
chr10p ambMex60DD transcript 10284305 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC
chr10p ambMex60DD exon 10284305 10284358 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10306400 10306498 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10403547 10403667 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10404202 10404245 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD exon 10502174 10502225 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC102279
chr10p ambMex60DD CDS 10502174 10502224 1000 - . gene_id "AMEX60DD000031"; transcript_id "LOC1022793
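
To check whether the loss affects other transcripts as well, the per-transcript feature counts of the input and the sorted output can be compared. The sketch below is an independent check, not part of gtfsort; `transcript_id`, `feature_counts` and `lost` are hypothetical helper names, and it assumes tab-separated GTF with a `transcript_id "..."` attribute in column 9.

```rust
use std::collections::BTreeMap;

/// Counts keyed by (transcript_id, feature type).
type Counts = BTreeMap<(String, String), usize>;

/// Extract the value of `transcript_id "..."` from the attribute column, if present.
fn transcript_id(attr: &str) -> Option<&str> {
    let rest = attr.split("transcript_id \"").nth(1)?;
    rest.split('"').next()
}

/// Count records per (transcript_id, feature). Lines without nine
/// tab-separated columns or without a transcript_id are ignored.
fn feature_counts(text: &str) -> Counts {
    let mut counts = Counts::new();
    for line in text.lines().filter(|l| !l.starts_with('#')) {
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() != 9 {
            continue;
        }
        if let Some(tx) = transcript_id(cols[8]) {
            *counts
                .entry((tx.to_string(), cols[2].to_string()))
                .or_insert(0) += 1;
        }
    }
    counts
}

/// List every key whose count dropped, as (key, count before, count after).
fn lost(before: &Counts, after: &Counts) -> Vec<((String, String), usize, usize)> {
    before
        .iter()
        .filter_map(|(k, &b)| {
            let a = after.get(k).copied().unwrap_or(0);
            if a < b {
                Some((k.clone(), b, a))
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    // Tiny made-up example: two CDS records before sorting, one after.
    let before = "c\ts\tCDS\t1\t2\t.\t-\t.\tgene_id \"g\"; transcript_id \"t1\";\n\
c\ts\tCDS\t5\t6\t.\t-\t.\tgene_id \"g\"; transcript_id \"t1\";\n";
    let after = "c\ts\tCDS\t5\t6\t.\t-\t.\tgene_id \"g\"; transcript_id \"t1\";\n";
    for ((tx, feature), b, a) in lost(&feature_counts(before), &feature_counts(after)) {
        println!("{tx} {feature}: {b} -> {a}");
    }
}
```

For real files the two strings would come from reading the original and sorted GTF, e.g. with `std::fs::read_to_string`.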
