dieterich-lab / circtools


circtools: a modular, python-based framework for circRNA-related tools that unifies several functionalities in a single, command line driven software.

Home Page: http://circ.tools

License: GNU General Public License v3.0

Python 67.93% R 29.31% Shell 1.81% Perl 0.94%
bioinformatics circular-rna toolbox primer-design computational-rna-biology dcc rbp bioconductor fuchs

circtools's Introduction

circtools

a one-stop software solution for circular RNA research

This is an older version of circtools (version 1.2.2). The latest version (1.3.1) can be obtained from https://github.com/jakobilab/circtools.

Introduction

Circular RNAs (circRNAs) originate through back-splicing events from linear primary transcripts, are resistant to exonucleases, are typically not polyadenylated, and have been shown to be highly specific for cell type and developmental stage. Although a few circular RNA molecules have been shown to function as miRNA sponges, the function of the vast majority of circRNAs is yet to be determined.

The prediction of circular RNAs is a multi-stage bioinformatics process starting with raw sequencing data and usually ending with a list of potential circRNA candidates which, depending on tissue and condition, may contain hundreds to thousands of potential circRNAs. While a number of tools already exist for the prediction process (e.g. DCC and CircTest), publicly available downstream analysis tools are rare.

We developed circtools, a modular, Python3-based framework for circRNA-related tools that unifies several functionalities in a single command line driven software package. The command line follows the circtools subcommand convention that is also employed by samtools and bedtools. Currently, circtools includes modules for detecting and reconstructing circRNAs, a quick check of circRNA mapping results, RBP enrichment screenings, circRNA primer design, statistical testing, and an exon usage module.
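The subcommand convention can be illustrated with a minimal sketch. The traceback quoted in the issues further down shows circtools dispatching via getattr(self, args.command)(); everything else here (the ToolBox class, the two toy subcommands, the return values) is hypothetical and only meant to show the pattern, not the actual circtools code.

```python
import argparse


class ToolBox:
    """Toy dispatcher: the first positional argument selects a method."""

    def __init__(self, argv):
        parser = argparse.ArgumentParser(prog="toolbox")
        parser.add_argument("command", choices=["detect", "primer"])
        # Parse only the subcommand; the remaining arguments are left
        # untouched so that the selected module can interpret them.
        args = parser.parse_args(argv[:1])
        self.remaining = argv[1:]
        self.result = getattr(self, args.command)()

    def detect(self):
        # Hypothetical module: just report what it would receive.
        return ("detect", self.remaining)

    def primer(self):
        return ("primer", self.remaining)
```

Calling ToolBox(["detect", "-x", "1"]) selects the detect method and leaves ["-x", "1"] for the module, which mirrors how a call such as circtools detect passes its remaining parameters on.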

Documentation

The complete documentation is available on Read the Docs.

Installation

The circtools package is written in Python3 (>=3.4). Two modules, namely detect and reconstruct, also require a working Python 2 installation (>=2.7). The package requires only a small number of external dependencies, namely standard bioinformatics tools:

Installation is managed through python3 setup.py install. No sudo access is required if the installation is executed with --user which will install the package in a user-writeable folder. The binaries should be installed to /home/$user/.local/bin/ in case of Debian-based systems.
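To check where a --user installation places its scripts on a given machine, a small sketch using only the Python standard library may help. The helper names user_bin_dir and on_path are made up for this example, and the "bin" subdirectory applies to POSIX systems only.

```python
import os
import site


def user_bin_dir():
    """Return the script directory used by `--user` installs on POSIX."""
    # site.getuserbase() is typically ~/.local unless PYTHONUSERBASE is set.
    return os.path.join(site.getuserbase(), "bin")


def on_path(directory, path_value=None):
    """Return True if `directory` is listed in PATH (or in `path_value`)."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return directory in path_value.split(os.pathsep)
```

If on_path(user_bin_dir()) returns False, the circtools executable will be installed but not found by the shell until that directory is added to PATH.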

circtools was developed and tested on Debian Jessie but should also run on any other distribution.

The installation requires running python on the command line:

git clone https://github.com/dieterich-lab/circtools.git
cd circtools
python3 setup.py install --verbose --user

The installation procedure will automatically install two dependencies: DCC and FUCHS. The primer-design module as well as the exon analysis and circRNA testing module require a working installation of R with BioConductor. All R packages required are automatically installed during the setup. Please see the "Installing circtools" chapter of the main circtools documentation for more detailed installation instructions.

Modules

Circtools currently offers seven modules:

The detect command is an interface to DCC, also developed at the Dieterich Lab. The module detects circRNAs from RNA sequencing data and is the foundation of all other steps in the circtools work flow. All parameters supplied to circtools will be directly passed to DCC.

The quickcheck module of circtools is an easy way to check the results of a DCC run for problems and to quickly assess the number of circRNAs in a given experiment. The module needs the mapping log files produced by STAR as well as the directory with the DCC results. The module then generates a series of figures in PDF format to assess the results.

The reconstruct command is an interface to FUCHS. FUCHS employs DCC-generated data to reconstruct circRNA structures. All parameters supplied to circtools will be directly passed to FUCHS.

The circtest command is an interface to CircTest. The module is a convenient way to apply statistical testing to circRNA candidates generated with DCC without having to write an R script for each new experiment. For detailed information on the implementation itself, take a look at the CircTest documentation. In essence, the module allows dynamic grouping of the columns (samples) in the DCC data.

The exon module of circtools employs the ballgown R package to combine data generated with DCC and circtest with ballgown-compatible stringtie output or cufflinks output converted via tablemaker in order to gain deeper insights into differential exon usage within circRNA candidates.

The enrichment module may be used to identify circRNAs enriched for specific RNA binding proteins (RBP) based on DCC-identified circRNAs and processed eCLIP data. For the K562 and HepG2 cell lines, plenty of this data is available through the ENCODE project.

The primer command is used to design and visualize primers required for follow-up wet lab experiments to verify circRNA candidates.

circtools's People

Contributors

alexey0308, cdieterich, haraldwilhelmi, shubhada-kulkarni, tjakobi

circtools's Issues

How do we optimally integrate DCC and FUCHS?

The easiest way would probably be to interface directly with the corresponding main class and thereby circumvent the need to call another Python instance from within Python. We need something like an API for DCC and FUCHS that is open to the outside.

Quantile plot for reconstruction module

As an addition to the - arbitrary - categorization of circRNAs in groups based on their length, we should add a quantile plot to the visualization routines.
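As a starting point, the quantile cut points could be computed from the circRNA lengths before plotting. This is only a sketch: length_quantiles is a hypothetical helper, the lengths are assumed to be available as a plain list of numbers, and statistics.quantiles requires Python 3.8 or newer.

```python
import statistics


def length_quantiles(lengths, n=4):
    """Return the n-1 cut points dividing `lengths` into n quantile groups."""
    if len(lengths) < 2:
        raise ValueError("need at least two lengths to compute quantiles")
    # The default 'exclusive' method of statistics.quantiles is used.
    return statistics.quantiles(sorted(lengths), n=n)
```

The returned cut points could then replace the fixed length categories as group boundaries in the visualization.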

BED track coordinates are incorrect

In the script exon_usage_circtools_wrapper.R, the output files exon_analysis_dcc_bsj_enriched_track.bed and exon_analysis_dcc_predictions_track.bed show off-by-one errors.
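Off-by-one errors in BED tracks commonly stem from mixing coordinate systems: BED uses 0-based, half-open intervals, whereas formats such as GTF use 1-based, inclusive coordinates. The sketch below shows the conversion, assuming the input coordinates are 1-based and inclusive; whether that is the actual cause in the R script is not established here, and both function names are made up for this example.

```python
def to_bed_interval(start_1based, end_1based):
    """Convert a 1-based, inclusive interval to BED's 0-based, half-open form."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the inclusive end equals the exclusive BED end.
    return start_1based - 1, end_1based


def bed_line(chrom, start_1based, end_1based, name=".", score=0, strand="."):
    """Format a six-column BED line from 1-based, inclusive coordinates."""
    start, end = to_bed_interval(start_1based, end_1based)
    return "\t".join(str(field) for field in (chrom, start, end, name, score, strand))
```

With this convention, the interval length is simply end minus start, which is a quick sanity check against the original annotation.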

Manual "primer module" missing

First of all, thank you for making these scripts available. This is not really an issue but rather a request.
The full documentation for the primer module is missing. Could you add it? Thank you.
Cheers.

Unify all enrichment plotting code in one file

Right now the code is split across two files, one for single samples and another one (_twin.R) for direct one-to-one comparisons. It would make sense to merge everything into one file and to infer from the user input whether we have one or two samples.

CircSkipping visualization

Circle skipping events detected by DCC should be available in a BED-like format in order to be used as track in IGV or similar viewers.

Visualization option for enrichment module

It would be nice to have some kind of auto-generated diagrams visualizing the enrichment results. However, it should not be a function that just draws 2,000 plots; instead, something more high-level would be better.

R install script fails on fresh R 3.4.0 install

Normally the lib dir is not writable, and the installation neither checks for this nor sets an alternative path. Therefore the install fails:

Rscript install.R
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning in install.packages("BiocInstaller", repos = a["BioCsoft", "URL"]) :
  'lib = "/usr/local/lib/R/site-library"' is not writable
Error in install.packages("BiocInstaller", repos = a["BioCsoft", "URL"]) :
  unable to install packages
Calls: source ... eval.parent -> eval -> eval -> eval -> eval -> install.packages
Execution halted
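Since the R installer is started from the Python setup, one option would be to check for a writable library directory on the Python side before calling Rscript. The following is a sketch under that assumption; pick_library_dir is a hypothetical helper and the candidate paths would have to be supplied by the caller.

```python
import os


def pick_library_dir(candidates):
    """Return the first existing, writable directory from `candidates`, or None."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return None
```

The selected directory could then be handed to R, for example through the R_LIBS_USER environment variable, so that install.packages does not fall back to the non-writable site library.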

Remove (py)bedtools dependency

Benchmarks show that the IO-heavy work flow of bedtools has a severe impact on performance. It would be reasonable to use Brandon's replacement functions instead.

Add option to include / exclude specific genomic features during shuffling

We need the possibility to select specific regions of the genome for shuffling.

The current mode is to shuffle over the complete genome.

New modes:

  • all (that's the current mode and will probably stay default)
  • CDS the coding sequence
  • any combination of
    • Exon
    • Intron
    • UTR: untranslated regions

Whereas all exons + introns + UTR(s) should sum up to the hosting "gene" feature.
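The proposed modes could be validated roughly as follows. This is a sketch only: the mode names are lower-cased versions of those in the list above, resolve_features is a hypothetical helper, and how the selection maps onto command line flags is left open.

```python
EXCLUSIVE_MODES = {"all", "cds"}          # may only be selected on their own
COMBINABLE = {"exon", "intron", "utr"}    # may be combined freely


def resolve_features(selected):
    """Validate a feature selection and return it as a sorted list."""
    chosen = {name.lower() for name in selected}
    if not chosen:
        # No selection keeps the current behaviour: shuffle over everything.
        return ["all"]
    if len(chosen) == 1 and chosen <= EXCLUSIVE_MODES:
        return sorted(chosen)
    if chosen <= COMBINABLE:
        return sorted(chosen)
    raise ValueError("unsupported feature combination: " + ", ".join(sorted(chosen)))
```

For example, resolve_features(["Exon", "UTR"]) yields ["exon", "utr"], while mixing "all" with a specific feature is rejected.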

Enrichment module produces external bedtools error when using -I *_utr

The feature selection for exon and gene works without any errors. However, the exact same work flow fails after quite some time (~30 minutes) with a bedtools error:

/biosw/bedtools/git_unstable/bin/bedtools
Parsing annotation...

Processed 133938 entries
Done parsing annotation
Parsing annotation...

Processed 142387 entries
Done parsing annotation
Parsing BED input file...
Done parsing BED input file:
=> 294976 peaks, 45 nt average width
Parsing annotation...

Processed 58051 entries
Done parsing annotation
Parsing circular RNA input file...
Done parsing circular RNA input file:
=> 1956 circular RNAs, 11218 nt average (theoretical unspliced) length
Starting random shuffling of input peaks
Starting data acquisition from samplings
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/enrichment/enrichment_check.py", line 761, in random_sample_step
    circular_intersect = self.do_intersection(shuffled_peaks[iteration], circ_rna_bed)
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/enrichment/enrichment_check.py", line 480, in do_intersection
    intersect_return = base_bed.intersect(query_bed, c=True)
  File "/home/tjakobi/.local/lib/python3.5/site-packages/pybedtools/bedtool.py", line 806, in decorated
    result = method(self, *args, **kwargs)
  File "/home/tjakobi/.local/lib/python3.5/site-packages/pybedtools/bedtool.py", line 337, in wrapped
    decode_output=decode_output,
  File "/home/tjakobi/.local/lib/python3.5/site-packages/pybedtools/helpers.py", line 356, in call_bedtools
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError:
Command was:

        bedtools intersect -c -b /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp -a /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.m5g3e86b.tmp

Error message was:
Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tjakobi//.local/bin/circtools", line 18, in <module>
    import circtools
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/__init__.py", line 2, in <module>
    main()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/circtools.py", line 30, in main
    CircTools()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/circtools.py", line 66, in __init__
    getattr(self, args.command)()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/circtools.py", line 187, in enrich
    enrich.run_module()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/enrichment/enrichment_check.py", line 180, in run_module
    ), range(self.cli_params.num_iterations + 1))
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
pybedtools.helpers.BEDToolsError:
Command was:

        bedtools intersect -c -b /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp -a /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.m5g3e86b.tmp

Error message was:
Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.

Command exited with non-zero status 1
        Command being timed: "circtools enrich -c /home/tjakobi/work/projects/circRNA/encode3/dcc_k562_hepg2_total_vs_cytosol//circTest/circRNA_HepG2_RNaseR_P_signif_1percFDR.csv -b /home/tjakobi/work/data/circtools/encode_hg38_clip_peaks/HepG2/combined/RBFOX2_HepG2_combined.bed -a /home/tjakobi/work/data/circtools/input/GRCh38.85.gtf -g /home/tjakobi/work/data/circtools/input/hg38.chrom.sizes -i 10000 -I three_prime_utr -I five_prime_utr -p 40 -P 1 -T 1 -o /home/tjakobi/work/data/circtools/output/encode/hepg2/utr// -F RBFOX2_HepG2_combined -t /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/"
        User time (seconds): 56499.98
        System time (seconds): 2763.36
        Percent of CPU this job got: 3341%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 29:33.64
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 148304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3629391
        Minor (reclaiming a frame) page faults: 352230918
        Voluntary context switches: 96145491
        Involuntary context switches: 4247851
        Swaps: 0
        File system inputs: 553612394
        File system outputs: 294311389
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1

Interestingly, while the first temporary BED file exists, the second one cannot be recovered after the crash. Also, the number of columns differs between runs:

tjakobi@porta:[run]{0}> grep "Error: line number" slurm-90*
slurm-90457.out:Error: line number 210067 of file /scratch/tjakobi/circtools/EIF3D_HepG2_combined/pybedtools.5fffa2o0.tmp has 4 fields, but 6 were expected.
slurm-90457.out:Error: line number 210067 of file /scratch/tjakobi/circtools/EIF3D_HepG2_combined/pybedtools.5fffa2o0.tmp has 4 fields, but 6 were expected.
slurm-90458.out:Error: line number 216897 of file /scratch/tjakobi/circtools/HNRNPC_HepG2_combined/pybedtools.0f_wvxui.tmp has 4 fields, but 6 were expected.
slurm-90458.out:Error: line number 216897 of file /scratch/tjakobi/circtools/HNRNPC_HepG2_combined/pybedtools.0f_wvxui.tmp has 4 fields, but 6 were expected.
slurm-90459.out:Error: line number 280584 of file /scratch/tjakobi/circtools/QKI_HepG2_combined/pybedtools.1j18rnyf.tmp has 5 fields, but 6 were expected.
slurm-90459.out:Error: line number 280584 of file /scratch/tjakobi/circtools/QKI_HepG2_combined/pybedtools.1j18rnyf.tmp has 5 fields, but 6 were expected.
slurm-90460.out:Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.
slurm-90460.out:Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.
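To inspect a surviving temporary file, a small checker for short lines could be used. This is a diagnostic sketch, not part of circtools; find_short_lines is a hypothetical helper and it assumes tab-separated BED input with six expected fields, as in the error message. A varying field count at a late line would be consistent with a file that was read while still being written, but that remains a guess.

```python
def find_short_lines(lines, expected_fields=6):
    """Yield (line_number, field_count) for lines with too few tab-separated fields."""
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        # Skip blank lines and the usual BED header lines.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        count = len(stripped.split("\t"))
        if count < expected_fields:
            yield number, count
```

Usage would be along the lines of opening the temporary file and printing list(find_short_lines(handle)) to see whether only the last line is affected.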

Correct handling of the number of iterations

For n=2000 iterations we quite often found only 1980 entries in the column for the linear raw count. Given that the sample was run with -i 2000 and -p 20, it seems that the last iterations may be skipped.
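The missing 20 entries equal the number of processes, which hints at one iteration being lost per worker; this is an assumption, not a confirmed diagnosis. A sketch of a split that provably keeps every iteration is shown below, with split_iterations being a hypothetical helper rather than existing circtools code.

```python
def split_iterations(total, workers):
    """Split `total` iterations into `workers` chunk sizes that sum to `total`."""
    if workers < 1:
        raise ValueError("need at least one worker")
    base, extra = divmod(total, workers)
    # The first `extra` workers take one additional iteration each.
    return [base + (1 if index < extra else 0) for index in range(workers)]
```

A unit test asserting sum(split_iterations(2000, 20)) == 2000 would catch this class of error regardless of how the work is distributed.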

Module to design siRNAs

A module to design siRNA sequences for use with circular RNA was suggested at the SPP1738 conference and seems like a natural extension of circtools' functionality.

installation error

Hi Sir,
I tried to install circtools on my PC (Ubuntu 16.04) and got the following error message. Could you help with this? Thank you

$ python3 setup.py install --verbose --user
running install
Requirement already satisfied: statsmodels in /home/cd/miniconda3/lib/python3.6/site-packages (0.9.0)
Requirement already satisfied: patsy in /home/cd/miniconda3/lib/python3.6/site-packages (from statsmodels) (0.5.0)
Requirement already satisfied: pandas in /home/cd/miniconda3/lib/python3.6/site-packages (from statsmodels) (0.23.0)
Requirement already satisfied: six in /home/cd/miniconda3/lib/python3.6/site-packages (from patsy->statsmodels) (1.11.0)
Requirement already satisfied: numpy>=1.4 in /home/cd/miniconda3/lib/python3.6/site-packages (from patsy->statsmodels) (1.14.3)
Requirement already satisfied: python-dateutil>=2.5.0 in /home/cd/miniconda3/lib/python3.6/site-packages (from pandas->statsmodels) (2.7.3)
Requirement already satisfied: pytz>=2011k in /home/cd/miniconda3/lib/python3.6/site-packages (from pandas->statsmodels) (2018.4)
Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
A new version of Bioconductor is available after installing the most recent
version of R; see http://bioconductor.org/install
BioC_mirror: https://bioconductor.statistik.tu-dortmund.de
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.4 (2018-03-15).
installation path not writeable, unable to update packages: DBI, RMySQL,
codetools, foreign, lattice, spatial
Skipping install of 'CircTest' from a github remote, the SHA1 (2fd16602) has not changed since last install.
Use force = TRUE to force installation
Skipping install of 'primex' from a github remote, the SHA1 (f715f111) has not changed since last install.
Use force = TRUE to force installation
Cloning into 'DCC'...
remote: Counting objects: 876, done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 876 (delta 6), reused 17 (delta 4), pack-reused 857
Receiving objects: 100% (876/876), 226.95 KiB | 239.00 KiB/s, done.
Resolving deltas: 100% (599/599), done.
Checking connectivity... done.
Traceback (most recent call last):
  File "setup.py", line 11, in <module>
    from setuptools import setup
ImportError: No module named setuptools
Cloning into 'FUCHS'...
remote: Counting objects: 1928, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 1928 (delta 2), reused 7 (delta 1), pack-reused 1920
Receiving objects: 100% (1928/1928), 36.20 MiB | 4.58 MiB/s, done.
Resolving deltas: 100% (1024/1024), done.
Checking connectivity... done.
Traceback (most recent call last):
  File "setup.py", line 11, in <module>
    from setuptools import setup
ImportError: No module named setuptools
Traceback (most recent call last):
  File "setup.py", line 234, in <module>
    'Documentation': 'http://docs.circ.tools'
  File "/home/cd/miniconda3/lib/python3.6/site-packages/setuptools/__init__.py", line 129, in setup
    return distutils.core.setup(**attrs)
  File "/home/cd/miniconda3/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/home/cd/miniconda3/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/home/cd/miniconda3/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "setup.py", line 71, in run
    subprocess.check_call(["bash", "scripts/install_external.sh"])
  File "/home/cd/miniconda3/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['bash', 'scripts/install_external.sh']' returned non-zero exit status 1.

Installation instructions for OligoArrayAux software

As far as I can see, the source page does not provide any instructions on how to install the software. Therefore, we have to provide a walk-through for the user. However, fewer external dependencies would be even better, of course. Example: samtools only requires htslib, which is more or less okay if a compiler and gzip are installed.

primer design R installation script stuck in loop

When running the primer design install script, I get stuck in a loop and can't continue the installation:

The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
^CThe current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
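The output above suggests that the prompt is repeated whenever no valid answer is read. The installer itself is an R script, so the following Python sketch only illustrates the general idea of a bounded prompt that gives up on end-of-input; ask_yes_no and its parameters are hypothetical.

```python
def ask_yes_no(read_line, max_attempts=3):
    """Return True/False for a y/n answer; give up (False) on EOF or too many tries."""
    for _ in range(max_attempts):
        try:
            answer = read_line().strip().lower()
        except EOFError:
            # Non-interactive session: there will never be an answer.
            return False
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
    return False
```

In an interactive script this would be called as ask_yes_no(input); in a non-interactive run it returns after at most max_attempts reads instead of looping forever.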

Fresh install fails because of statsmodels

Currently the installation fails on fresh systems when numpy is not installed BEFORE statsmodels is installed. The order of entries in requirements.txt or setup.py does not have any influence on the build order.
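One workaround would be to install the packages one by one in an explicit order before running the main setup. The sketch below only builds the pip command lines; pip_commands is a hypothetical helper and the --user flag is assumed to match the installation mode described in the README.

```python
import sys


def pip_commands(packages, python=sys.executable):
    """Build one `pip install --user` command per package, preserving order."""
    return [[python, "-m", "pip", "install", "--user", package] for package in packages]
```

Running each command from pip_commands(["numpy", "statsmodels"]) in turn with subprocess.check_call ensures that numpy is fully installed before the statsmodels build starts.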

Provide a sample data set

We need a sample data set. This will help to

  • A) provide a solid foundation for unit tests
  • B) provide a well documented user manual

Add per exon p-value and log2 fold changes to exon Excel output

Right now the Excel file could provide more detailed information that is scattered throughout different files. This information should be unified in the produced .XLS file.

Bonus points: can we also incorporate newly detected exons from the reconstruct/FUCHS module?

Length calculation when used in feature mode

When run in feature mode, in order to compute the peaks / length the length is taken from the annotation.

This behavior will cause problems when the circRNA is located in an exon-rich region of the gene while the remaining linear part may span several kb of intron space (or vice versa). It would make more sense to only account for the accumulated feature length instead.
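The accumulated feature length can be computed by merging overlapping features and summing the merged lengths. The sketch below assumes half-open (start, end) intervals on a single chromosome and strand; accumulated_length is a hypothetical helper.

```python
def accumulated_length(intervals):
    """Total number of bases covered by half-open (start, end) intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent feature: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total
```

For two overlapping exons (0, 100) and (50, 150) plus a distant exon (1000, 1100), the accumulated length is 250 bases, whereas the annotated span from 0 to 1100 would suggest 1100.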

Integration of CircTest

This issue includes

  • a wrapper script from Python to R
  • the interface between the R script and the Python framework
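A wrapper of this kind could be as small as the following sketch, which calls Rscript through subprocess. The function names and the script name used in the example are hypothetical; only Rscript itself and the standard library calls are real.

```python
import shutil
import subprocess


def build_rscript_command(script, args, rscript="Rscript"):
    """Build the argument list for running an R script with positional arguments."""
    return [rscript, script] + [str(arg) for arg in args]


def run_r_script(script, args):
    """Run the R script and raise if Rscript is missing or exits non-zero."""
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript not found on PATH")
    return subprocess.run(build_rscript_command(script, args), check=True)
```

Passing all parameters as positional strings keeps the interface between the Python framework and the R script simple, at the cost of having to keep the argument order in sync on both sides.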
