This repository contains scripts accompanying a recent biorxiv preprint.
The required data to run the scripts is available here:
- ALKBH5: SRA accession SRR9646144 (use
fasterq-dump SRR9646144
in directorydata
to get FASTQ files).fasterq-dump
is a tool available in the SRA toolkit. - ChAdOx1 HEK293: SRA accession SRR13320597 (use
fasterq-dump SRR13320597
in directorydata
to get FASTQ files) - S protein constructs: download here (data coming soon)
To generate the BAM files for downstream analysis, run the following:
The following will run minimap2 and convert the SAM to a BAM file (requires minimap2, samtools):
./ALKBH5.sh
Outputs will be in the outputs/ALKBH5
directory. The script uses -t 8
by default, edit the file to use a different number of processors.
Then, to run the downstream analysis, you need to run
Rscript ALKBH5.R
from an environment that contains R & packages GenomicRanges
, Biostrings
, Rsamtools
, BSgenome
, and stringr
.
If you need this environment you can get a Singularity image from this repository (coming soon). If you pull the Singularity image, you can then run (after installing Singularity):
singularity exec image_name.sif Rscript ALKBH5.R
where you replace image_name.sif
with the filename of the container.
Similar to above, first run this script to run minimap2 & samtools:
./ChAdOx1.sh
Outputs will be in the outputs/ChAdOx1
directory. The script uses -t 8
by default, edit the file to use a different number of processors.
Then, to run the downstream analysis, run
Rscript ChAdOx1.R
from the same environment described above, the Singularity environment will work here too.
This uses a small Nextflow pipeline. Download and install Nextflow using the instructions on their website. Then, run the following in an environment that contains minimap2, samtools:
nextflow S_protein.nf
If you want to run it locally, remove the nextflow.config
file, or edit the nextflow.config
file accordingly. If you want to run in manually, check the contents of the processes in S_protein.nf
- there should be enough information in that to run minimap2 and samtools from the command line.
Then, to run the downstream analysis, run
Rscript S_protein.R
from the same environment described above, the Singularity environment will work here too.
The scripts in deletion_utils.R
are made to identify small deletions/introns using the CIGAR strings of the alignment BAM file.
Usually, this requires Oxford Nanopore direct RNA or cDNA reads aligned using e.g. minimap2 as such:
minimap2 -ax splice -t8 --cs=long --sam-hit-only template.fa reads.fastq > mapped.sam
The steps for using the functions are usually as follows:
- Read BAM file using R package Rsamtools and convert to GRanges using function
BAM_to_granges()
- Identify deletions in reads using
get_deletion_df_no_overlaps()
. - Optionally, retain only deletions whose sequences are diverse by calculating sequence diversity with
seq_diverse()
(this filters out many deletions that might be due to poor mapping) - Optionally, get information about read span on template using
get_spans()
.
Briefly, this method works by cycling through aligned reads, considering all CIGAR string positions that are not I, S or H (no insertions, no soft/hard clipping) so that you only look at positions mapped to the template (mapped & introns & deletions). It will assess all CIGAR strings that are D or N, and check if these events are 1) not the first or last event in what's left of the CIGAR string and 2) above a specified minimum length (very small deletions are likely noise)
There is a bit of a different workflow if you only want to consider deletions that are contained within a specified BED file (e.g. deletions within exons), for this you use the following function:
Briefly, this method works like the other function, but will only consider reads that overlap other regions (e.g. exons) (calculated using the findOverlaps()
function from R package GenomicRanges
), and it will make sure there are other-region-mapping reads flanking the deletion (on the same 'other region').
stringr
, Biostrings
, GenomicRanges
, Rsamtools
, optionally also e.g. BSgenome.Hsapiens.UCSC.hg38
if you aligned to hg38
.