dieterich-lab / circtools

circtools: a modular, python-based framework for circRNA-related tools that unifies several functionalities in a single, command line driven software.

Home Page: http://circ.tools

License: GNU General Public License v3.0

Python 67.93% R 29.31% Shell 1.81% Perl 0.94%

bioinformatics circular-rna toolbox primer-design computational-rna-biology dcc rbp bioconductor fuchs

circtools

a one-stop software solution for circular RNA research

This is an older version of circtools (version 1.2.2). The latest version (1.3.1) can be obtained from https://github.com/jakobilab/circtools.

Introduction

Circular RNAs (circRNAs) originate through back-splicing events from linear primary transcripts, are resistant to exonucleases, are typically not polyadenylated, and have been shown to be highly specific for cell type and developmental stage. Although a few circular RNA molecules have been shown to act as miRNA sponges, the function of the vast majority of circRNAs is yet to be determined.

The prediction of circular RNAs is a multi-stage bioinformatics process that starts with raw sequencing data and usually ends with a list of potential circRNA candidates which, depending on tissue and condition, may contain hundreds to thousands of entries. While a number of tools already exist for the prediction process (e.g. DCC and CircTest), publicly available downstream analysis tools are rare.

We developed circtools, a modular, Python 3-based framework for circRNA-related tools that unifies several functionalities in a single command-line-driven software. The command line follows the "circtools subcommand" convention that is also employed by samtools and bedtools. Currently, circtools includes modules for detecting and reconstructing circRNAs, a quick check of circRNA mapping results, RBP enrichment screenings, circRNA primer design, statistical testing, and an exon usage module.
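
The subcommand convention can be sketched with argparse. This is a minimal illustration under stated assumptions, not the actual circtools entry point: the class name ToolBox, the program name, and the returned strings are hypothetical. Only the idea of dispatching to a method named after the subcommand is taken from the `getattr(self, args.command)()` call visible in the enrich traceback quoted in the issues on this page.

```python
import argparse


class ToolBox:
    """Hypothetical dispatcher: each public method is one subcommand."""

    def __init__(self, argv):
        parser = argparse.ArgumentParser(prog="toolbox")
        parser.add_argument("command", choices=["detect", "quickcheck"])
        # Parse only the subcommand name; everything else belongs to the module.
        args = parser.parse_args(argv[:1])
        self.rest = argv[1:]
        # Look up the method with the same name as the subcommand and call it.
        self.result = getattr(self, args.command)()

    def detect(self):
        return "detect called with {}".format(self.rest)

    def quickcheck(self):
        return "quickcheck called with {}".format(self.rest)
```

Keeping one method per subcommand means a new module only needs a new method and a new entry in the list of choices.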

Documentation

The complete documentation is available on Read the Docs (http://docs.circ.tools).

Installation

The circtools package is written in Python 3 (>= 3.4); two modules, namely detect and reconstruct, also require a working Python 2 installation (>= 2.7). The package requires only a small number of external dependencies, namely standard bioinformatics tools:

bedtools (>= 2.27.1) [RBP enrichment module, installed automatically]
R (>= 3.3) [Data visualization and data processing]

Installation is managed through python3 setup.py install. No sudo access is required if the installation is executed with --user, which installs the package in a user-writable folder. On Debian-based systems, the binaries should be installed to /home/$user/.local/bin/.

circtools was developed and tested on Debian Jessie but should also run on any other distribution.

The installation requires running python on the command line:

git clone https://github.com/dieterich-lab/circtools.git
cd circtools
python3 setup.py install --verbose --user

The installation procedure will automatically install two dependencies: DCC and FUCHS. The primer-design module as well as the exon analysis and circRNA testing modules require a working installation of R with Bioconductor. All required R packages are automatically installed during the setup. Please see the "Installing circtools" chapter of the main circtools documentation for more detailed installation instructions.

Modules

Circtools currently offers seven modules:

detect (detailed documentation)

The detect command is an interface to DCC, also developed at the Dieterich Lab. The module detects circRNAs from RNA sequencing data and is the foundation of all other steps in the circtools workflow. All parameters supplied to circtools will be directly passed to DCC.

quickcheck (detailed documentation)

The quickcheck module of circtools is an easy way to check the results of a DCC run for problems and to quickly assess the number of circRNAs in a given experiment. The module needs the mapping log files produced by STAR as well as the directory with the DCC results. The module then generates a series of figures in PDF format to assess the results.

reconstruct (detailed documentation)

The reconstruct command is an interface to FUCHS. FUCHS employs DCC-generated data to reconstruct circRNA structures. All parameters supplied to circtools will be directly passed to FUCHS.

circtest (detailed documentation)

The circtest command is an interface to CircTest. The module is a convenient way to apply statistical testing to circRNA candidates generated with DCC without having to write an R script for each new experiment. For detailed information on the implementation itself, take a look at the CircTest documentation. In essence, the module allows dynamic grouping of the columns (samples) in the DCC data.

exon (detailed documentation)

The exon module of circtools employs the ballgown R package to combine data generated with DCC and circtest with ballgown-compatible stringtie output, or cufflinks output converted via tablemaker, in order to gain deeper insights into differential exon usage within circRNA candidates.

enrich (detailed documentation)

The enrichment module may be used to identify circRNAs enriched for specific RNA binding proteins (RBP) based on DCC-identified circRNAs and processed eCLIP data. For the K562 and HepG2 cell lines, plenty of such data is available through the ENCODE project.

primer (detailed documentation)

The primer command is used to design and visualize primers required for follow up wet lab experiments to verify circRNA candidates.

circtools's Issues

How do we optimally integrate DCC and FUCHS?

The easiest way would probably be to interface directly with the corresponding main class and thereby avoid calling another Python instance from within Python. We need something like an API for DCC and FUCHS that is open to the outside.

Investigate what 'unknown breakpoint' events are

Those events can be found in the data files and are plotted.
However, what do they mean?
The BSJ is neither covered by one nor by two mates, so how exactly is it covered?

Integration of the alternative exon usage script

Quantile plot for reconstruction module

In addition to the (arbitrary) categorization of circRNAs into groups based on their length, we should add a quantile plot to the visualization routines.
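
The data behind such a plot could be computed as sketched below. This is only an illustration in Python using the standard library; the function name and the choice of the "inclusive" method are assumptions, and the actual plotting routine of the reconstruct module is not shown here.

```python
import statistics


def length_quantiles(lengths, n=4):
    """Return the n-1 cut points that split `lengths` into n groups.

    The "inclusive" method treats the data as the full population,
    so the cut points never fall outside the observed range.
    """
    return statistics.quantiles(lengths, n=n, method="inclusive")
```

With n=4 the function returns the three quartile boundaries of the circRNA lengths, which avoids hand-picked length categories.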

BED track coordinates are incorrect

In the script exon_usage_circtools_wrapper.R, the output files

exon_analysis_dcc_bsj_enriched_track.bed
and
exon_analysis_dcc_predictions_track.bed

show off-by-one errors.
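
The issue does not state where the shift comes from. One common source of off-by-one errors in BED tracks is writing 1-based, inclusive coordinates (as used in GTF files) into a format that expects 0-based, half-open intervals. The conversion is sketched below in Python; the function name is hypothetical and this is not the code of the R wrapper.

```python
def to_bed_interval(chrom, start_1based, end_1based):
    """Convert a 1-based, inclusive interval to BED (0-based, half-open).

    Only the start shifts by one; the end keeps its numeric value because
    an inclusive 1-based end equals an exclusive 0-based end.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("invalid 1-based interval")
    return (chrom, start_1based - 1, end_1based)
```

The interval length is preserved: 100..200 inclusive covers 101 bases, and so does the BED interval 99..200.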

Manual "primer module" missing

First of all, thank you for making these scripts available. This is not really an issue; it is more of a request.
The full documentation for the primer module is missing. Could you solve this? Thank you
Cheers.

Unify all enrichment plotting code in one file

Right now the code is split into two files, one for single samples and another one (_twin.R) for direct one-to-one comparisons. It would make sense to merge everything into one file and infer from the user input whether we have one or two samples.

Add support for zipped input files

Support for zipped files is a de facto standard and would also help reduce the disk space used.
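
A minimal sketch of transparent decompression, assuming that "zipped" refers to gzip-compressed input. The helper name is hypothetical; detection relies on the two-byte gzip magic number rather than on the file extension.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, transparently decompressing gzip input."""
    # Every gzip stream starts with the bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")
```

Callers can then iterate over the returned handle in the same way for compressed and uncompressed files.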

CircSkipping visualization

Circle skipping events detected by DCC should be available in a BED-like format so that they can be used as a track in IGV or similar viewers.

Visualization option for enrichment module

It would be nice to have some kind of auto-generated diagrams visualizing the enrichment results. However, it should not be a function that just draws 2,000 plots; something more high-level would be better.

R install script fails on fresh R 3.4.0 install

Normally the lib dir is not writable, the installation does not check for this (or sets an alternative path). Therefore the install fails:

Rscript install.R
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning in install.packages("BiocInstaller", repos = a["BioCsoft", "URL"]) :
  'lib = "/usr/local/lib/R/site-library"' is not writable
Error in install.packages("BiocInstaller", repos = a["BioCsoft", "URL"]) :
  unable to install packages
Calls: source ... eval.parent -> eval -> eval -> eval -> eval -> install.packages
Execution halted

X labels of the PDF output are alphabetically ordered

The X-axis labels do not correspond to the sorting order of the columns on page three of the generated PDF file.

Use consistent Python 3 versions in the documentation

Currently we refer partly to 3.4 and partly to 3.5. We should change this to "Works with Python >= 3.4".

Remove (py)bedtools dependency

Benchmarks show that the IO-heavy workflow of bedtools has a severe impact on performance. It would be reasonable to use Brandon's replacement functions instead.

Add option to include / exclude specific genomic features during shuffling

We need the possibility to select specific regions of the genome for shuffling.

The current mode is to shuffle over the complete genome.

New modes:

all (the current mode, which will probably stay the default)
CDS: the coding sequence
any combination of
- Exon
- Intron
- UTR: untranslated regions

All exons, introns, and UTRs together should sum up to the hosting "gene" feature.
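
Selecting the annotation records for such a mode could look like the sketch below. It assumes a GTF annotation, where the feature type is stored in the third tab-separated column; the function name is hypothetical and this is not the enrich module's implementation. Intron records are typically absent from GTF files and would have to be derived from gene and exon coordinates, which is not shown.

```python
def filter_gtf_features(lines, include):
    """Yield GTF lines whose feature type (column 3) is in `include`.

    An empty `include` keeps every record, mimicking the "all" mode.
    """
    wanted = set(include)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        if not wanted or fields[2] in wanted:
            yield line
```

The regions returned by the filter would then define where peaks may be placed during shuffling.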

Fix broken pysam dependency of HTseq for DCC

DCC requires HTSeq, which does not install with the most recent pysam version. We have to manually pin the pysam version to 0.13.0.

Enrichment module produces external bedtools error when using -I *_utr

The feature selection for exon and gene works without any errors. However, the exact same work flow fails after quite some time (~30 minutes) with a bedtools error:

/biosw/bedtools/git_unstable/bin/bedtools
Parsing annotation...

Processed 133938 entries
Done parsing annotation
Parsing annotation...

Processed 142387 entries
Done parsing annotation
Parsing BED input file...
Done parsing BED input file:
=> 294976 peaks, 45 nt average width
Parsing annotation...

Processed 58051 entries
Done parsing annotation
Parsing circular RNA input file...
Done parsing circular RNA input file:
=> 1956 circular RNAs, 11218 nt average (theoretical unspliced) length
Starting random shuffling of input peaks
Starting data acquisition from samplings
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/enrichment/enrichment_check.py", line 761, in random_sample_step
    circular_intersect = self.do_intersection(shuffled_peaks[iteration], circ_rna_bed)
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/enrichment/enrichment_check.py", line 480, in do_intersection
    intersect_return = base_bed.intersect(query_bed, c=True)
  File "/home/tjakobi/.local/lib/python3.5/site-packages/pybedtools/bedtool.py", line 806, in decorated
    result = method(self, *args, **kwargs)
  File "/home/tjakobi/.local/lib/python3.5/site-packages/pybedtools/bedtool.py", line 337, in wrapped
    decode_output=decode_output,
  File "/home/tjakobi/.local/lib/python3.5/site-packages/pybedtools/helpers.py", line 356, in call_bedtools
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError:
Command was:

        bedtools intersect -c -b /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp -a /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.m5g3e86b.tmp

Error message was:
Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tjakobi//.local/bin/circtools", line 18, in <module>
    import circtools
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/__init__.py", line 2, in <module>
    main()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/circtools.py", line 30, in main
    CircTools()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/circtools.py", line 66, in __init__
    getattr(self, args.command)()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/circtools.py", line 187, in enrich
    enrich.run_module()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/enrichment/enrichment_check.py", line 180, in run_module
    ), range(self.cli_params.num_iterations + 1))
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
pybedtools.helpers.BEDToolsError:
Command was:

        bedtools intersect -c -b /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp -a /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.m5g3e86b.tmp

Error message was:
Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.

Command exited with non-zero status 1
        Command being timed: "circtools enrich -c /home/tjakobi/work/projects/circRNA/encode3/dcc_k562_hepg2_total_vs_cytosol//circTest/circRNA_HepG2_RNaseR_P_signif_1percFDR.csv -b /home/tjakobi/work/data/circtools/encode_hg38_clip_peaks/HepG2/combined/RBFOX2_HepG2_combined.bed -a /home/tjakobi/work/data/circtools/input/GRCh38.85.gtf -g /home/tjakobi/work/data/circtools/input/hg38.chrom.sizes -i 10000 -I three_prime_utr -I five_prime_utr -p 40 -P 1 -T 1 -o /home/tjakobi/work/data/circtools/output/encode/hepg2/utr// -F RBFOX2_HepG2_combined -t /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/"
        User time (seconds): 56499.98
        System time (seconds): 2763.36
        Percent of CPU this job got: 3341%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 29:33.64
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 148304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3629391
        Minor (reclaiming a frame) page faults: 352230918
        Voluntary context switches: 96145491
        Involuntary context switches: 4247851
        Swaps: 0
        File system inputs: 553612394
        File system outputs: 294311389
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1

Interestingly, while the first temporary BED file exists, the second one cannot be recovered after the crash. Also, the number of columns reported differs between runs:

tjakobi@porta:[run]{0}> grep "Error: line number" slurm-90*
slurm-90457.out:Error: line number 210067 of file /scratch/tjakobi/circtools/EIF3D_HepG2_combined/pybedtools.5fffa2o0.tmp has 4 fields, but 6 were expected.
slurm-90457.out:Error: line number 210067 of file /scratch/tjakobi/circtools/EIF3D_HepG2_combined/pybedtools.5fffa2o0.tmp has 4 fields, but 6 were expected.
slurm-90458.out:Error: line number 216897 of file /scratch/tjakobi/circtools/HNRNPC_HepG2_combined/pybedtools.0f_wvxui.tmp has 4 fields, but 6 were expected.
slurm-90458.out:Error: line number 216897 of file /scratch/tjakobi/circtools/HNRNPC_HepG2_combined/pybedtools.0f_wvxui.tmp has 4 fields, but 6 were expected.
slurm-90459.out:Error: line number 280584 of file /scratch/tjakobi/circtools/QKI_HepG2_combined/pybedtools.1j18rnyf.tmp has 5 fields, but 6 were expected.
slurm-90459.out:Error: line number 280584 of file /scratch/tjakobi/circtools/QKI_HepG2_combined/pybedtools.1j18rnyf.tmp has 5 fields, but 6 were expected.
slurm-90460.out:Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.
slurm-90460.out:Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.
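
To inspect a surviving temporary file, a small diagnostic such as the following could list the offending lines. This is a sketch only: the function name is hypothetical, tab-separated input is assumed, and it reports truncated lines without explaining why they were written.

```python
def find_short_bed_lines(path, expected_fields=6):
    """Return (line_number, field_count) for lines with too few columns."""
    problems = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            count = len(line.rstrip("\n").split("\t"))
            if count < expected_fields:
                problems.append((number, count))
    return problems
```

Line numbers are 1-based so that they can be compared directly with the numbers in the bedtools error message.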

Empty lines in the samplesheet file confuse DCC

Empty lines in the samplesheet file should be ignored.
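
A minimal sketch of the requested behavior, assuming the samplesheet is a plain text file with one entry per line. The function name is hypothetical and this is not DCC's own parser.

```python
def read_samplesheet(path):
    """Return the non-empty lines of a samplesheet, stripped of whitespace."""
    with open(path) as handle:
        # Lines that contain only whitespace are treated as empty.
        return [line.strip() for line in handle if line.strip()]
```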

BED file produced by exon usage script misses chromosome column

The back splice junction enriched BED file does not seem to include the first BED column for the chromosome.

error in DECIPHER call to RSQLite

This commit in RSQLite raises an error when dbWriteTable is called.

With RSQLite <= 1.1-15, no error is reported.

Correct handling of the number of iterations

For n=2000 iterations, we quite often found 1980 entries in the column for the linear raw count. Given that the sample was run with -i 2000 and -p 20, it seems that the last iterations may be skipped.
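
The 20 missing entries match the number of processes, but the issue does not establish the cause. Whatever the scheduling looks like internally, one invariant is easy to test: the work assigned to all processes must add up to the requested number of iterations. The sketch below shows one way to split the work that satisfies this; the function name is hypothetical and this is not how the enrich module distributes its samplings.

```python
def split_iterations(total, workers):
    """Split `total` iterations into `workers` contiguous ranges covering all of them."""
    base, extra = divmod(total, workers)
    ranges = []
    start = 0
    for worker in range(workers):
        # The first `extra` workers take one additional iteration each.
        size = base + (1 if worker < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

A unit test can then assert that the range lengths sum to the value passed with -i.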

Include option to generate black and white or grey scale graphs for non-color journals

Should be easy to implement using the specific ggplot2 theme.

Integration of FUCHS

Check the processing of single exon genes in terms of circRNA RBP enrichment

How are single exon genes / circRNAs treated when computing the enrichment in contrast to the linear genes?

Module to analyse conservation of circRNAs across species

We require a module that is able to assess / analyse conservation at the gene level as well as at the back splice junction level across different species.

Module to design siRNAs

A module to design siRNA sequences for use with circular RNA was suggested at the SPP1738 conference and seems like a natural extension of circtools' functionality.

Finish implementation of permutation testing

The current proof-of-concept test for the significance of RBP enrichment has to be improved by employing a permutation testing approach.
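
The core of a permutation test is an empirical p-value: the share of shuffled data sets in which the statistic is at least as extreme as the observed one. The sketch below uses the common "+1" estimator; the function name and the one-sided formulation are assumptions, and this is not the test implemented in circtools.

```python
def empirical_p_value(observed, permuted):
    """One-sided empirical p-value: share of permuted values >= observed.

    The +1 in numerator and denominator counts the observed value itself,
    so the result is never exactly zero.
    """
    at_least = sum(1 for value in permuted if value >= observed)
    return (at_least + 1) / (len(permuted) + 1)
```

Here `observed` would be the number of RBP peaks overlapping a circRNA and `permuted` the corresponding counts from the shuffled peak sets.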

Missing python module in install routine

Some Python modules do not seem to be installed automatically.

Examples:

reporttools
statsmodels

Fix -c option for enrich module

The -c option refers to the circRNACount file while it should refer to the Coordinates file.

installation error

Hi Sir
I tried to install circtools on my PC(ubuntu 16.04) and I got the following error message. Could you help with this? Thank you

$ python3 setup.py install --verbose --user
running install
Requirement already satisfied: statsmodels in /home/cd/miniconda3/lib/python3.6/site-packages (0.9.0)
Requirement already satisfied: patsy in /home/cd/miniconda3/lib/python3.6/site-packages (from statsmodels) (0.5.0)
Requirement already satisfied: pandas in /home/cd/miniconda3/lib/python3.6/site-packages (from statsmodels) (0.23.0)
Requirement already satisfied: six in /home/cd/miniconda3/lib/python3.6/site-packages (from patsy->statsmodels) (1.11.0)
Requirement already satisfied: numpy>=1.4 in /home/cd/miniconda3/lib/python3.6/site-packages (from patsy->statsmodels) (1.14.3)
Requirement already satisfied: python-dateutil>=2.5.0 in /home/cd/miniconda3/lib/python3.6/site-packages (from pandas->statsmodels) (2.7.3)
Requirement already satisfied: pytz>=2011k in /home/cd/miniconda3/lib/python3.6/site-packages (from pandas->statsmodels) (2018.4)
Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
A new version of Bioconductor is available after installing the most recent
version of R; see http://bioconductor.org/install
BioC_mirror: https://bioconductor.statistik.tu-dortmund.de
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.4 (2018-03-15).
installation path not writeable, unable to update packages: DBI, RMySQL,
codetools, foreign, lattice, spatial
Skipping install of 'CircTest' from a github remote, the SHA1 (2fd16602) has not changed since last install.
Use force = TRUE to force installation
Skipping install of 'primex' from a github remote, the SHA1 (f715f111) has not changed since last install.
Use force = TRUE to force installation
Cloning into 'DCC'...
remote: Counting objects: 876, done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 876 (delta 6), reused 17 (delta 4), pack-reused 857
Receiving objects: 100% (876/876), 226.95 KiB | 239.00 KiB/s, done.
Resolving deltas: 100% (599/599), done.
Checking connectivity... done.
Traceback (most recent call last):
File "setup.py", line 11, in
from setuptools import setup
ImportError: No module named setuptools
Cloning into 'FUCHS'...
remote: Counting objects: 1928, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 1928 (delta 2), reused 7 (delta 1), pack-reused 1920
Receiving objects: 100% (1928/1928), 36.20 MiB | 4.58 MiB/s, done.
Resolving deltas: 100% (1024/1024), done.
Checking connectivity... done.
Traceback (most recent call last):
File "setup.py", line 11, in
from setuptools import setup
ImportError: No module named setuptools
Traceback (most recent call last):
File "setup.py", line 234, in
'Documentation': 'http://docs.circ.tools'
File "/home/cd/miniconda3/lib/python3.6/site-packages/setuptools/init.py", line 129, in setup
return distutils.core.setup(**attrs)
File "/home/cd/miniconda3/lib/python3.6/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/cd/miniconda3/lib/python3.6/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/home/cd/miniconda3/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "setup.py", line 71, in run
subprocess.check_call(["bash", "scripts/install_external.sh"])
File "/home/cd/miniconda3/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['bash', 'scripts/install_external.sh']' returned non-zero exit status 1.
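
The two `ImportError: No module named setuptools` messages appear while the DCC and FUCHS setup scripts run, which suggests that the interpreter used for them cannot import setuptools. A quick way to check a given interpreter is sketched below; the function name is hypothetical, and the interpreter name to pass (for example python2) depends on the system.

```python
import subprocess


def can_import(interpreter, module):
    """Return True if `interpreter -c "import <module>"` exits successfully."""
    try:
        result = subprocess.run(
            [interpreter, "-c", "import {}".format(module)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # interpreter not found or not executable
    return result.returncode == 0
```

If `can_import("python2", "setuptools")` returns False, installing setuptools for that interpreter before re-running the setup is the likely next step.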

Installation instructions for OligoArrayAux software

As far as I can see, the source page does not provide any instructions on how to install the software. Therefore, we have to provide a walkthrough for the user. However, fewer external dependencies would be even better, of course. Example: samtools only requires htslib, which is more or less okay if a compiler and gzip are installed.

State clearly that FUCHS and DCC require Python 2

Currently it is not clear that DCC and FUCHS also require Python 2. This should be stated in the documentation.

Long term solution: port to Python 3.

primer design R installation script stuck in loop

When running the primer design install script, I get stuck in a loop and can't continue the installation:

The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
The current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]
^CThe current version of circtools can work only with RSQLite version <= 1.1.5
Your version is 2.0
Would you like to install the 1.1.15 one? [y/n]

Fresh install fails because of statsmodels

Currently the installation fails on fresh systems when numpy is not installed BEFORE statsmodels is installed. The order of entries in requirements.txt or setup.py does not have any influence on the build order.

Provide a sample data set

We need a sample data set. This will help to

A) provide a solid foundation for unit tests
B) provide a well documented user manual

Provide additional documentation for the enrich module

The statistics and internal processes of the enrich module should be described more clearly in the documentation.

Add per exon p-value and log2 fold changes to exon Excel output

Right now the Excel file could provide more detailed information that is scattered throughout different files. This information should be unified in the produced .XLS file.

Bonus points: can we also incorporate newly detected exons from the reconstruct/FUCHS module?

Integration of DCC

Implement a primer visualization and design wrapper R script

This script can then be called by the circtools platform. The infrastructure for calling R scripts has already been developed. We need to make a decision where the R scripts will be deployed during installation.

Normalize #circRNA isoforms in visualization plots

Currently, visualization is done using raw #circRNA isoforms/gene counts. It would be more helpful to normalize the counts in an RPKM/FPKM-like manner, e.g. isoforms per kilobase.
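
The proposed normalization is a simple ratio. The sketch below assumes that the host gene length in base pairs is the denominator, which the issue does not specify; the function name is hypothetical.

```python
def isoforms_per_kilobase(isoform_count, gene_length_bp):
    """Normalize an isoform count by the gene length in kilobases."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be positive")
    return isoform_count / (gene_length_bp / 1000.0)
```

A gene with 4 isoforms over 2,000 bp and a gene with 20 isoforms over 10,000 bp would then both score 2 isoforms per kilobase.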

Establish a stable version of the circtools platform

This issue includes the definition of a clear structure of circtools classes as well as how and where data and scripts are stored.

Add framework for R installation to setup script

We need a way to automatically install and deploy R packages from this repository as well as from external repositories, e.g. CircTest.

Length calculation when used in feature mode

When run in feature mode, the length used to compute the peaks / length ratio is taken from the annotation.

This behavior will cause problems when the circRNA is located in an exon-rich region of the gene while the remaining linear part may span several KB of intron space (or vice-versa). It would make more sense to only account for accumulated feature length instead.
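
The accumulated feature length can be computed by merging overlapping features and summing what remains. The sketch below assumes half-open (start, end) coordinates on a single chromosome; the function name is hypothetical and this is not the enrich module's code.

```python
def accumulated_length(intervals):
    """Total bases covered by half-open (start, end) intervals, overlaps counted once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # A gap: close the previous merged block and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total
```

Two exons of 100 bp separated by a 10 kb intron then contribute 200 bp instead of the full genomic span.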

Integration of CircTest

This issue includes

a wrapper script from Python to R
the interface between the R script and the Python framework
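
Such a wrapper could be as small as the sketch below, which calls Rscript through subprocess. The function names and the script and argument names in the usage note are hypothetical; only Rscript itself and subprocess.check_call are taken from the logs quoted in other issues on this page.

```python
import subprocess


def build_rscript_command(script_path, *args):
    """Build the argument list for calling an R script with Rscript."""
    return ["Rscript", script_path] + [str(arg) for arg in args]


def run_r_script(script_path, *args):
    """Run the R script and raise CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_rscript_command(script_path, *args))
```

For example, `run_r_script("circtest_wrapper.R", "dcc_output/", 0.05)` would hand the DCC directory and a threshold to the R side as positional arguments, which the R script can read with commandArgs.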