schulzlab / aeron

Alignment, quantification and fusion prediction from long RNA reads

License: MIT License

Python 100.00%

aeron's Introduction

About

Recent sequencing technologies such as PacBio and Oxford Nanopore have made significant progress in sequencing accuracy and output per run. There is therefore a need for methods that use long-read RNA sequencing data for tasks where long reads outperform short reads, such as transcript quantification and gene-fusion detection. AERON is an alignment-based pipeline for transcript quantification and detection of gene-fusion events using only long RNA reads. It uses a state-of-the-art sequence-to-graph aligner to align reads generated by long-read sequencing technologies to a reference transcriptome. Reads are assigned to transcripts in a novel way, based on the position of the read's mapping on the transcript and the fraction of the read contained in the transcript. AERON also introduces the first long-read-specific gene-fusion detection algorithm. AERON was tested on datasets of varying length and coverage and was found to provide accurate transcript quantification. It also detected experimentally validated fusion events as well as novel fusion events that could be validated further.

Prerequisites

snakemake (The experiments were run using snakemake version 5.0.0)
Python (version >= 2.7)
vg toolkit (variation graph tools, also known as vg tools). It can be installed from https://github.com/vgteam/vg (experiments were run using vgtools v1.5.0-499-ge8a9bcb)

Download

The software can be downloaded by using the following command

	git clone https://github.com/SchulzLab/Aeron.git

Pipeline

The downloaded folder should contain a "snakemake-pipeline" folder which contains the following files and folders:

AeronScripts: Folder containing the scripts that generate the graph file (GraphBuilder.py and ParseGTF.py) and additional scripts required by the pipeline
Binaries: Folder containing all the binaries required by the pipeline
input: Folder containing a sample graph file (in .gfa format)
config.yaml: Sample config file containing all the parameters required by AERON
Snakefile: Pipeline required to run the quantification step of AERON
Snakefile_fusion: Pipeline required to run the fusion-detection step of AERON

Version

0.01 - first complete version with quantification and fusion-gene detection ability

0.02 - updated readme and bugfixes. Thanks for the user feedback. (24.3.2020)

Running

Overview

Aeron currently consists of 3 steps: Graph building, Quantification, Fusion-gene detection.

Graph building - A transcriptome graph collection is created using known transcripts as annotation, and is saved as an index (gfa file). In this collection, each graph describes the exons of one annotated gene and contains the path information of the transcripts of that gene. For any given dataset, this index can be used to run step 2.
Quantification - Using a set of long reads, Aeron aligns the reads against the graph index. The alignments are then processed and reads are assigned to transcripts based on alignment statistics. The output of this step is the read counts per transcript and the files necessary for step 3.
Fusion-gene detection - Unmapped reads are considered to be fusion read candidates. Based on partial alignments of reads to different genes, a second type of graph, the fusion graph, is constructed. Reads are then aligned to all fusion candidates represented in the fusion graph and to the complete transcriptome. For each candidate fusion a fusion score is computed, and filters are applied to derive a final list of putative fusion transcripts.

Details on how to execute each of these steps are given below.

Graph building

To generate a graph file from a reference sequence, run the following command from the AeronScripts folder:

	python GraphBuilder.py -e Path_to_the_genome_sequence -g Path_to_the_gtf_file -o Output_File

The above command will generate a "gfa" file, which must be used for the transcript quantification and gene-fusion detection steps. The graph file generated during this step can be used for multiple datasets.

A sample graph file generated from annotated transcripts of human (ENSEMBL v92, hg38) is provided in the input folder.

Things to remember:

The genome sequence file should be in fasta format with each sequence representing a chromosome.
The number of chromosomes in the sequence fasta file should match the number of chromosomes in the gtf file.
The chromosome ids in the sequence fasta file should match the chromosome ids in the gtf file.
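
The ID-matching requirements above can be checked before running GraphBuilder.py. The following is a minimal sketch and not part of AERON: the function names are hypothetical, and it assumes an uncompressed FASTA file whose sequence ID is the first word of each header line and a tab-separated GTF file whose first column is the chromosome ID.

```python
def fasta_ids(path):
    """Return the set of sequence IDs (first word after '>') in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def gtf_ids(path):
    """Return the set of values in the first column of a GTF file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            # Skip header/comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            ids.add(line.split("\t", 1)[0])
    return ids


def compare_ids(fasta_path, gtf_path):
    """Return (only_in_fasta, only_in_gtf); two empty lists mean the IDs agree."""
    in_fasta, in_gtf = fasta_ids(fasta_path), gtf_ids(gtf_path)
    return sorted(in_fasta - in_gtf), sorted(in_gtf - in_fasta)
```

Calling compare_ids with the genome fasta and the gtf file returns the IDs found in only one of the two files. A mismatch such as "2" in the fasta against "chr2" in the gtf shows up in both lists.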

Quantification and gene-fusion event detection

In the folder Aeron, make a directory named input.
Copy the input files to the "input" folder.
Input files should include:
- The input read file(s) in fasta or fastq format.
- A graph file in .gfa format.
- Annotation file of the species in gtf format.
- Transcript sequences (not genome) in fasta or fastq format.
Make sure that there are no underscores in the file names of the graph (.gfa), the reads (.fq) and the transcripts (.fa).
Edit config.yaml and add the input file names. An example config file is provided in the repository. Below we explain some of the parameters to be set in config.yaml in more detail.

parameter	default	explanation
graph	-	Relative or absolute path of the input graph generated with the graph building script
transcripts	-	Relative or absolute path of the reference transcripts in fasta/q format
reads	-	Name of the input long read fasta/q file(s). Multiple files can be included by placing each of them on its own line. The files should be in the input folder. Do not include "input/".
gtffile	-	Relative or absolute path of the gtf file used for building the graph
vgpath	-	Relative or absolute path of the vg toolkit binary (https://github.com/vgteam/vg)
fusion_max_error_rate	0.2	Maximum allowed error rate for a read to support a fusion. If a read aligns to a predicted fusion transcript with a higher error rate, it is considered to not support the fusion.
fusion_min_score_difference	200	Minimum score difference for a read to support a fusion. If a read aligns to both a reference transcript and a predicted fusion transcript, and the score difference is less than the parameter, it is considered to not support the fusion. Higher values lead to a higher precision (higher fraction of predicted fusions are real) at the cost of lower sensitivity (smaller number of real fusions are detected).
seedsize	17	Minimum size of an exact match between a read and a transcript to be used for alignment in the quantification pipeline. Higher values lead to faster runtime for quantification but potentially lower accuracy.
maxseeds	20	Number of exact matches used for alignment in the quantification pipeline. Higher values lead to a slower runtime for quantification but potentially higher accuracy.
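
The two fusion parameters in the table can be read as a per-read filter. The sketch below only illustrates that reading and is not taken from the AERON code. The function supports_fusion and its argument names are hypothetical, and it assumes that a higher alignment score is better, that the score difference is the fusion score minus the reference score, and that a read without a reference alignment skips the score test.

```python
def supports_fusion(fusion_error_rate, fusion_score, reference_score=None,
                    max_error_rate=0.2, min_score_difference=200):
    """Decide whether one read supports one predicted fusion transcript.

    Hypothetical helper; defaults mirror fusion_max_error_rate and
    fusion_min_score_difference from the parameter table.
    """
    # A read aligned with a higher error rate than allowed does not count.
    if fusion_error_rate > max_error_rate:
        return False
    # If the read also aligns to a reference transcript, the fusion alignment
    # must be better by at least the minimum score difference (assumption:
    # difference = fusion score - reference score).
    if reference_score is not None:
        if fusion_score - reference_score < min_score_difference:
            return False
    return True


def count_supporting_reads(alignments, **thresholds):
    """Count reads passing the filter; each item is (error_rate, fusion_score, reference_score)."""
    return sum(1 for a in alignments if supports_fusion(*a, **thresholds))
```

Raising min_score_difference in this sketch removes reads whose fusion alignment is only slightly better than their reference alignment, which matches the precision versus sensitivity trade-off described in the table.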

For quantification:

Run the following command

snakemake --cores=no_of_cores all  (experiments were run using 10 cores)

The quantification results will be in a folder named output

For gene-fusion detection

First run the quantification (see above)
Then run the following command (adjusting the number of cores as needed)

snakemake --cores 40 all -s Snakefile_fusion

The fusion detection results will be in a folder named fusionoutput

The predicted fusion gene candidates need to be analyzed carefully. In our preprint we describe additional steps that should be considered to validate and analyze predicted fusion genes.

Citation (and more details)

AERON: Transcript quantification and gene-fusion detection using long reads

Mikko Rautiainen$, Dilip A Durai$, Ying Chen, Lixia Xin, Hwee Meng Low, Jonathan Göke, Tobias Marschall**, Marcel H. Schulz** (link to preprint)

$ joint first authors, ** corresponding authors

Things to remember:

The graph file given with the repository has been generated using annotated transcripts of human (ENSEMBL v92, hg38). Hence, the file can be used as an input. There is no need to generate a new graph file.
You only need to create the input folder once. Multiple files can be included in the input folder. Also, multiple snakemake runs can be executed using the same input folder.
An AERON run will create a folder called tmp to store temporary files. The files are named according to the input file given to the program. For instance, the file containing the node-name-to-integer mapping for a graph file MyGraph.gfa will be named MyGraph_nodemapping.txt. If another run uses a graph file with the same name (MyGraph), AERON won't create a new file; it will reuse the old mapping file. The tmp folder can be removed after every snakemake run.
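
A complete run as described above (quantification, then fusion detection, then optional removal of tmp) could be scripted as follows. This is a sketch under stated assumptions rather than a supported interface: the function names are hypothetical, snakemake is assumed to be on the PATH, and workdir is assumed to be the folder that contains Snakefile, Snakefile_fusion and config.yaml.

```python
import os
import shutil
import subprocess


def aeron_commands(cores, fusion=True):
    """Build the two snakemake command lines from this README as argument lists."""
    commands = [["snakemake", "--cores", str(cores), "all"]]
    if fusion:
        # Fusion detection must run after quantification.
        commands.append(["snakemake", "--cores", str(cores), "all",
                         "-s", "Snakefile_fusion"])
    return commands


def run_aeron(cores, fusion=True, workdir=".", remove_tmp=False, dry_run=False):
    """Run the steps in order; with dry_run=True only print the commands."""
    commands = aeron_commands(cores, fusion)
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failing step.
            subprocess.run(command, cwd=workdir, check=True)
    if remove_tmp and not dry_run:
        # The tmp folder holds cached files that are reused between runs.
        shutil.rmtree(os.path.join(workdir, "tmp"), ignore_errors=True)
    return commands
```

Keeping tmp between runs saves time when the same graph is reused. Removing it avoids picking up a stale mapping file when a different graph file happens to have the same name.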

aeron's Issues

Missing default value for fusion detection

I noticed that fusion detection reads the following parameters, but they are not set in the config file:
FUSION_MAX_ERROR_RATE = config["fusion_max_error_rate"]
FUSION_MIN_SCORE_DIFFERENCE = config["fusion_min_score_difference"]

Error in rule postprocess

Keep getting error in rule postprocess for only some read files, not sure what the issue is. There is no provided detail about the error. I have added a /tmp directory as well. The errors say:

Error in rule postprocess:
jobid: 99
output: .....
shell: .....

Is there some known issue with the postprocess binary?

Issues with output from quantification step

Hi, we built the graph using the gtf and the genome fasta for GRCh38. However, when we run the quantification, the count matrix looks like we are only getting results for the odd chromosomes at the end of the reference:
Transcript Count
KI270412 1
KI270382 1
KI270539 1
KI270311 4
KI270334 1
KI270522 10
KI270381 32
KI270580 2
KI270340 1
KI270429 4
KI270418 1
KI270389 3
KI270364 25
KI270515 2
KI270375 1
KI270420 3

The documentation is slightly confusing. For graph building we need the genome fasta (1 sequence per chromosome) and the gtf. Then, for quantification, is the transcripts fasta in the config file the same genome fasta file, or do you need to give the transcripts as separate fasta sequences?

Binaries/Postprocess: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by Binaries/Postprocess)

Hi,
Thanks for this useful software for fusion detection. It helps me a lot. However, the system version of my server is too old, so I cannot upgrade glibc from 2.12 to 2.17; when I tried the update, my system crashed. Does this software have to rely on glibc 2.17?
Actually, I hope it can be supported with glibc 2.14 or a lower version. Could you help me?
Look forward to your reply.
Thank you very much.

.gfa missing, difficulties with graphbuilder

We encountered some problems when trying the software, as the .gfa that is provided (input/hg38.gfa) does not seem to be valid. We have been trying to build a .gfa of our own using GraphBuilder.py and gencode v33 files. Unfortunately, this does not seem to work (see below).

python GraphBuilder.py -e ../input/GRCh38.primary_assembly.genome.fa -g ../input/hg38.gencode.v33.annotation.gtf -o gencode.hg38.gfa
Reading Sequences
Done reading sequences
Reading and processing gtf
Traceback (most recent call last):
  File "GraphBuilder.py", line 183, in <module>
    pg = ParseGTF(fn)
  File "/media/data/erst/Aeron/GraphBuilder/ParseGTF.py", line 36, in __init__
    en[exn[0]].append(enu[0])
IndexError: list index out of range

list index out of range when running GraphBuilder.py

Hello!
Trying to generate a custom .gfa file but hitting this snag:

$ python /home/apps/Aeron/AeronScripts/GraphBuilder.py -e /home/refs/human/dna/hg38_wControl.fa -g /home/refs/human/rna/gencode.v35wControl.annotation.gtf -o aeron_out
Reading Sequences
Done reading sequences
Reading and processing gtf
Traceback (most recent call last):
  File "/home/apps/Aeron/AeronScripts/GraphBuilder.py", line 183, in <module>
    pg = ParseGTF(fn)
  File "/home/apps/Aeron/AeronScripts/ParseGTF.py", line 36, in __init__
    en[exn[0]].append(enu[0])
IndexError: list index out of range

This looks like a formatting issue with the latest gencode gtf, which doesn't have double quotes around the exon_number values. Removing the double quotes on line 36 of ParseGTF.py fixed it.
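
The quoting difference described above can be handled by a parser that accepts both forms. The following sketch is not the code in ParseGTF.py; parse_gtf_attributes is a hypothetical helper that assumes the standard GTF layout of the ninth column, where each attribute is a key followed by a value and terminated by a semicolon.

```python
import re

# A key, whitespace, then either a double-quoted value or a bare token,
# terminated by ';' or the end of the string.
_ATTRIBUTE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";]+?))\s*(?:;|$)')


def parse_gtf_attributes(column9):
    """Parse the ninth GTF column into a dict, accepting quoted or bare values."""
    attributes = {}
    for match in _ATTRIBUTE.finditer(column9):
        key, quoted, bare = match.groups()
        value = quoted if quoted is not None else bare
        # Keys such as "tag" may repeat; this sketch keeps the first value.
        attributes.setdefault(key, value)
    return attributes


# Both spellings of exon_number yield the same string value.
bare = parse_gtf_attributes('gene_id "G1"; transcript_id "T1"; exon_number 1;')
quoted = parse_gtf_attributes('gene_id "G1"; transcript_id "T1"; exon_number "1";')
print(bare["exon_number"] == quoted["exon_number"])  # prints True
```

Matching the value with one pattern that covers both cases avoids depending on which annotation release produced the file.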

Aeron cannot be run on simulation dataset

Hi Aeron developers,

I tried to run your snakemake pipeline on the JAFFAL-simulated datasets. It failed in several scripts; I corrected all of them (master...qinqian:Aeron:master) and was then able to run it. However, it did not report any gene fusions. Are you still maintaining the software? Thanks for your feedback.

Best,
Alvin

error running Aeron software

Hi,

I am interested in using Aeron software for detecting fusions from nanopore data.
I am trying to run Aeron on the example files which are provided, however, I am getting following error:

SyntaxError:
Not all output, log and benchmark files of rule assign_reads_to_transcripts contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
File "/sfs/lustre/bahamut/scratch/ss7mh/softwares/AERON/aeron/snakemake_pipeline/Snakefile", line 99, in

I have downloaded from both the sources (i.e. https://github.com/SchulzLab/Aeron as well as https://bitbucket.org/dilipdurai/aeron) and from both the sources I got the same error.

Can you please help me to run the software successfully?

Thanks,
Sandeep

Inconsistencies in fusion read table

Hello,

I've run AERON to completion, but I noticed something odd in the fusionread_table when I try to use the GTF to add gene names to the output file (based on transcript ID). The gene_id (col4) sometimes matches the gene for the transcript_id in col7 and other times the gene for the transcript_id in col9.

When I count the number of times gene-id1 doesn't match transcript_id1 and the number of times gene-id2 doesn't match transcript_id2, there's a discrepancy, so it's not even a 1-1 error.

Any idea why this isn't consistent? I've attached some examples.

Thank you,
Melissa

Error_examples_fusionread_table_GM24143_GRCh38cdna_GRCh38.txt
Error_examples_fusionread_table_GM24143_GRCh38cdna_GRCh38.xlsx

Remaining adapter sequences

Hello,

I was wondering whether the pipeline can handle input reads with adapter sequences between the fused transcripts without too many problems?

Thanks,
Chang

Error while calling fusions

Hi, I'm trying to run Aeron for fusion finding and I'm running into an issue I can't resolve. The command:
snakemake --cores 4 all -s Snakefile
finished successfully, but I get the following error when I run:
snakemake --cores 4 all -s Snakefile_fusion

Building DAG of jobs...
MissingInputException in line 129 of /home/unimelb.edu.au/nadiamd/work_area/ideas_grant/Aeron/Snakefile_fusion:
Missing input files for rule sam_to_bam:
fusiontmp/reads_tofusions_onlyfusion_x50_Homo-sapiens_hg38.sam

My config.yaml is pasted below.

I'm also interested to know if the data from your fusion simulation (in your paper) is available somewhere for downloading?

Many thanks,
Nadia.

config.yaml:

#input files at top: check them!

# all input files must be in the folder ./input/
# use the full file name, including file ending

# input splice graph
# Should be in the input folder
# format must be .vg
graph: hg38.gfa

# reference transcripts
# format can be either fasta/fastq, gzipped or not
# Should be in the input folder

transcripts: Homo-sapiens.GRCh38.cdna.all.fa

# sequenced reads
# Should be in the input folder
# format can be either fasta/fastq, gzipped or not
# for more files, add them in new lines starting with "- "
# NOTE: the file names without ending must be unique! You cannot have eg. reads.fq and reads.fa
reads:
- x100.fixed.fastq.gz
- x50.fixed.fastq.gz
- x10.fixed.fastq.gz
- x2.fixed.fastq.gz
- x1.fixed.fastq.gz

# Needed for expression quantification
# Should be in the input folder
gtffile: Homo-sapiens.GRCh38.100.gtf

# needed to convert between alignment formats
# https://github.com/vgteam/vg
vgpath: /home/unimelb.edu.au/nadiamd/work_area/ideas_grant/Aeron/vg


#optional parameters below: default values will probably work

fusion_max_error_rate: 0.2
fusion_min_score_difference: 200

#size of the seed hits. Fewer means more accurate but slower alignments.
seedsize: 17
#max number of seeds. Fewer means faster but more inaccurate alignment
maxseeds: 20

# No need to change these

aligner_bandwidth: 35
alignment_selection: --greedy-length
alignment_E_cutoff: 1

scripts: AeronScripts
binaries: Binaries

Error:/usr/bin/bash: tmp/aligner_stdout.txt: No such file or directory

(base) dhwani@dhwani:/DATA1/RNASeq/Aeron$ snakemake --cores=10 all
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 10
Rules claiming more threads will be scaled down.
Job counts:
count jobs
2 align
1 all
1 assign_reads_to_transcripts
1 generateCountMatrix
1 output_assignment_statistics
2 postprocess
8

[Thu Jul 13 18:15:26 2023]
rule align:
input: input/hg38.gfa, input/324848.fastq
output: output/aln_324848_hg38_all.gam
jobid: 1
benchmark: benchmark/aln_324848_hg38_all.txt
wildcards: reads=324848, graph=hg38
threads: 10

/usr/bin/bash: tmp/aligner_stdout.txt: No such file or directory
[Thu Jul 13 18:15:26 2023]
Error in rule align:
jobid: 1
output: output/aln_324848_hg38_all.gam
shell:
/usr/bin/time -v /DATA1/RNASeq/Aeron/Binaries/GraphAligner -g input/hg38.gfa -f input/324848.fastq --try-all-seeds --seeds-mxm-length 17 --seeds-mem-count 15 --seeds-mxm-cache-prefix tmp/seedcache -a output/aln_324848_hg38_all.gam -t 10 -b 35 --greedy-length 1> tmp/aligner_stdout.txt 2> tmp/aligner_stderr.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /DATA1/RNASeq/Aeron/.snakemake/log/2023-07-13T181526.504064.snakemake.log
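
The first error line in this log is printed by bash rather than by GraphAligner: a redirection such as `1> tmp/aligner_stdout.txt` fails when the tmp directory does not exist. A minimal sketch that creates the directories named in the log before snakemake is started. The function name is hypothetical, and creating these folders by hand is an assumption about a workaround, not a documented step of the pipeline.

```python
import os


def ensure_run_directories(workdir=".", names=("tmp", "output", "benchmark")):
    """Create the run directories that are missing and return the created paths."""
    created = []
    for name in names:
        path = os.path.join(workdir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created
```

Calling ensure_run_directories() from the Aeron folder is harmless when the directories already exist, because existing directories are left untouched.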

WARNING: The graph has an edge between non-existant node(s)

I used GraphBuilder.py to generate hg38.gfa. Then, when I ran the step "align_with_secondaries" in the fusion snakemake pipeline, it always failed with the message: WARNING: The graph has an edge between non-existant node(s). Is there something wrong in GraphBuilder?

Errors when downloading hg38.gfa

Hey Aeron Group,

When I tried to download hg38.gfa, I ran into the problem below. Could you help me solve it, or tell me how I can download hg38.gfa? Thanks.

$ git lfs fetch hg38.gfa
fetch: Fetching reference refs/heads/master
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/SchulzLab/Aeron.git/info/lfs'

Memory requirements

Hi,

I am running Aeron on our institute's LSF cluster. When I run the fusion snakefile, the loose_gene_fusion file has >157K putative fusions. With 60 cores, the fusionfinder rule processes only 500 putative fusions consuming 100G memory and >7 hours of wall time. Is this normal behaviour? Since we have a strict memory reservation policy on the cluster, could you suggest possible memory requirements for this process? Also suggestions on how I could speed up the pipeline?

Example output files?

Hi,

I would like to know if the output data that Aeron produces can be used with https://github.com/stianlagstad/chimeraviz. Do you have any example output files that you can share?

Thank you!

Error with FusionFinder: Memory and time requirement

Hi,

When I run the fusion snakefile, the loose_gene_fusion file has >600K putative fusions. Then, the fusionfinder rule processes only 50 putative fusions consuming 500G memory and >20 hours of wall time.
Is this normal behaviour?
Since we have a strict memory reservation policy on the cluster, could you give me some suggestions on how I could speed up the pipeline, e.g. by setting stricter parameters for the loose_fusions rule to cut down the number of putative fusions in the loose_gene_fusion file?

quantification issue (Error in rule align JobID 7)

Hey

I am having a problem similar to Bodoko except when running the quantification step.

/Aeron$ snakemake --cores=all
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 24
Rules claiming more threads will be scaled down.
Job counts:
count jobs
2 align
1 all
1 assign_reads_to_transcripts
1 generateCountMatrix
1 output_assignment_statistics
2 postprocess
8

[Fri Apr 16 15:43:57 2021]
rule align:
input: input/hg19.gfa, input/hg19.fa
output: output/aln_hg19_hg19_all.gam
jobid: 7
benchmark: benchmark/aln_hg19_hg19_all.txt
wildcards: reads=hg19, graph=hg19
threads: 15

/usr/bin/bash: tmp/aligner_stdout.txt: No such file or directory
[Fri Apr 16 15:43:57 2021]
Error in rule align:
jobid: 7
output: output/aln_hg19_hg19_all.gam
shell:
/usr/bin/time -v Binaries/GraphAligner -g input/hg19.gfa -f input/hg19.fa --try-all-seeds --seeds-mxm-length 17 --seeds-mem-count 15 --seeds-mxm-cache-prefix tmp/seedcache -a output/aln_hg19_hg19_all.gam -t 15 -b 35 --greedy-length 1> tmp/aligner_stdout.txt 2> tmp/aligner_stderr.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/nickk/Aeron/.snakemake/log/2021-04-16T154357.466981.snakemake.log

Could someone help me with this issue? I also don't see the log files (.snakemake) in my Aeron folder.

My config file is as follows:

graph: hg19.gfa
transcripts: hg19.fa

seedsize: 17
maxseeds: 15
fusion_max_error_rate: 0.2
fusion_min_score_difference: 200
alignment_selection: --greedy-length
alignment_E_cutoff: 1

#bandwidth for the aligner. Higher means more accurate but slower alignment.
aligner_bandwidth: 35
gtffile: home/nickk/Aeron/ensembl75.gtf

# https://bitbucket.org/dilipdurai/aeron/
scripts: AeronScripts

# https://github.com/maickrau/GraphAligner
binaries: Binaries

# needed to convert mummer seeds to .gam seeds
vgpath: home/nickk/vg

Problem with the latest commits

Hi,

I think c978944 breaks the graphbuilding script.

ERROR in rule align

Hi,
Thanks for your useful software, but I had problems in the quantification part. I tried my own data (including a graph file in .gfa format) and the test data from https://bitbucket.org/dilipdurai/aeron/src/master/snakemake_pipeline/input/input.tar, and both gave the same error messages:

Error in rule align:
        jobid: 12
        output: output/aln_ReferenceTranscriptFastaFile_HumanUpdated38V5_all.gam
        log: tmp/aligner_stdout_ReferenceTranscriptFastaFile_HumanUpdated38V5.txt, tmp/aligner_stderr_ReferenceTranscriptFastaFile_HumanUpdated38V5.txt

RuleException:
CalledProcessError in line 54 of /software/Aeron/Snakefile:
Command ' set -euo pipefail;  /usr/bin/time -v Binaries/GraphAligner -g input/HumanUpdated38V5.gfa -f input/ReferenceTranscriptFastaFile.fa --try-all-seeds --seeds-mxm-length 17 --seeds-mem-count 20 --seeds-mxm-cache-prefix tmp/seedcache -a output/aln_ReferenceTranscriptFastaFile_HumanUpdated38V5_all.gam -t 10 -b 35 --greedy-length --E-cutoff 1 1> tmp/aligner_stdout_ReferenceTranscriptFastaFile_HumanUpdated38V5.txt 2> tmp/aligner_stderr_ReferenceTranscriptFastaFile_HumanUpdated38V5.txt ' returned non-zero exit status 1.
  File "/software/Aeron/Snakefile", line 54, in __rule_align
  File "/software/python3.6/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /software/Aeron/.snakemake/log/2020-03-25T112637.955041.snakemake.log

My command is "snakemake --cores 10 all" and my experiments were run using snakemake_5.0.0, vg_1.5.0(pre-built release version), python_3.6, and latest Aeron_f02a963.

loose fusions rule error

I am getting the following error when running fusion snakemake.

rule loose_fusions:                                                                                                                       
    input: fusiontmp/exactparsematrix_sim21_22_Homo-sapiens-GRCh38-cdna-all_GRCh38-97.txt                                                 
    output: fusiontmp/loose_gene_fusion_sim21_22_Homo-sapiens-GRCh38-cdna-all_GRCh38-97.txt                                               
    jobid: 12                                                                                                                             
    wildcards: reads=sim21_22, transcripts=Homo-sapiens-GRCh38-cdna-all, graph=GRCh38-97                                                  
                                                                                                                                          
AeronScripts/pairmatrix_get_genes.py < fusiontmp/exactparsematrix_sim21_22_Homo-sapiens-GRCh38-cdna-all_GRCh38-97.txt > fusiontmp/loose_getxt
Traceback (most recent call last):
  File "AeronScripts/pairmatrix_get_genes.py", line 13, in <module>
    lgene = generegex.search(parts[0]).group(1)
AttributeError: 'NoneType' object has no attribute 'group'

MissingInputException in line 59 of Snakefile_fusion: input files for rule pair_assignments

In the file Snakefile_fusion, line 61 reads: transcriptaln = "output/aln_{transcripts}_{graph}_full_length.gam". Is there a step that generates this file?

The error is: MissingInputException in line 59 of Snakefile_fusion: input files for rule pair_assignments. I suppose this step is missing.

Error in rule align_with_secondaries:

Hello,

I am using Aeron to call fusion genes from Nanopore RNA sequencing data of human tissues.

I generated a graph file, and then, I ran the quantification without errors.

However, the following error showed up when I ran "Snakefile_fusion".

snakemake --cores 20 all -s Snakefile_fusion

Error in rule align_with_secondaries:
    jobid: 17
    output: output/aln_test_20201012hg38GraphBuilder_secondary.gam
    log: tmp/aligner_stdout_test_20201012hg38GraphBuilder.txt, tmp/aligner_stderr_test_20201012hg38GraphBuilder.txt (check log file(s) for error message)
    shell:
        /usr/bin/time -v Binaries/Aligner --all-alignments -g input/20201012hg38GraphBuilder.gfa -f input/test.fastq --try-all-seeds --seeds-mxm-length 17 --seeds-mem-count 20 --seeds-mxm-cache-prefix tmp/seeds_20201012hg38GraphBuilder_index -a output/aln_test_20201012hg38GraphBuilder_secondary.gam -t 1 -b 35 --E-cutoff 1 1> tmp/aligner_stdout_test_20201012hg38GraphBuilder.txt 2> tmp/aligner_stderr_test_20201012hg38GraphBuilder.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Could you comment on what would be the cause of this issue?

config.yaml:

graph: 20201012hg38GraphBuilder.gfa
transcripts: hg38.fa
reads: test.fastq
gtffile: HomoSapiens.GRCh38.93.gtf
vgpath: ~/.conda/envs/mamba/envs/snakemake/bin/vg

fusion_max_error_rate: 0.2
fusion_min_score_difference: 200
seedsize: 17
maxseeds: 20
aligner_bandwidth: 35
alignment_selection: --greedy-length
alignment_E_cutoff: 1

scripts: AeronScripts
binaries: Binaries

job counts:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       align_reads_to_fusions
        1       align_with_secondaries
        1       all
        1       exact_pair_assignments
        1       filter_fusions
        1       fusion_support_sam
        1       fusion_transcripts
        1       fusionfinder
        2       index_bam
        1       loose_fusions
        1       merge_ref_and_fusions
        1       pair_assignments
        1       parse_matrix
        1       partial_pairs
        1       reheader_sam
        2       sam_to_bam
        18

stderr (tmp/aligner_stderr_test_20201012hg38GraphBuilder.txt):

GraphAligner Branch AlignmentSelection commit a38bd5b318f01ace50d03d36a5bed9234e9b3dda 2019-01-30 13:16:01 +0100
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Command terminated by signal 6
        Command being timed: "Binaries/Aligner --all-alignments -g input/20201012hg38GraphBuilder.gfa -f input/test.fastq --try-all-seeds --seeds-mxm-length 17 --seeds-mem-count 20 --seeds-mxm-cache-prefix tmp/seeds_20201012hg38GraphBuilder_index -a output/aln_test_20201012hg38GraphBuilder_secondary.gam -t 20 -b 35 --E-cutoff 1"
        User time (seconds): 16.08
        System time (seconds): 0.90
        Percent of CPU this job got: 3%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 7:11.08
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1834744
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 24755
        Voluntary context switches: 850
        Involuntary context switches: 22
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

beginner error

Hi folks,

I'm just trying to start my first run, so this is almost certainly user error. Can you spot my error?

config.yaml:

#input files at top: check them!

# all input files must be in the folder ./input/
# use the full file name, including file ending

# input splice graph
# Should be in the input folder
# format must be .vg
graph: hg38.gfa

# reference transcripts
# format can be either fasta/fastq, gzipped or not
# Should be in the input folder

transcripts: GRCh38_latest_rna.fna

# sequenced reads
# Should be in the input folder
# format can be either fasta/fastq, gzipped or not
# for more files, add them in new lines starting with "- "
# NOTE: the file names without ending must be unique! You cannot have eg. reads.fq and reads.fa
reads: 
- PAF25672_pass_concat.fastq.gz

# Needed for expression quantification
# Should be in the input folder
gtffile: hg38.ncbiRefSeq.gtf

# needed to convert between alignment formats
# https://github.com/vgteam/vg
vgpath: /home/rcorbett/bin/vg


#optional parameters below: default values will probably work

fusion_max_error_rate: 0.2
fusion_min_score_difference: 200

#size of the seed hits. Fewer means more accurate but slower alignments.
seedsize: 17
#max number of seeds. Fewer means faster but more inaccurate alignment
maxseeds: 20

# No need to change these

aligner_bandwidth: 35
alignment_selection: --greedy-length
alignment_E_cutoff: 1

scripts: AeronScripts
binaries: Binaries

Contents of my input folder:

ls -1 input/
GRCh38_latest_rna.fna
hg38.gfa
hg38.ncbiRefSeq.gtf
PAF25672_pass_concat.fastq.gz

My command and the output:

snakemake --cores=48 all
Building DAG of jobs...
InputFunctionException in line 91 of /projects/rcorbettprj2/eydoux/Aeron/Snakefile:
AssertionError: 
Wildcards:
reads=PAF25672_pass_concat_GRCh38_latest
transcripts=rna
graph=hg38

Can you see anything I should be changing?
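
The wildcard values in this error are consistent with the README's rule that file names must not contain underscores. The sketch below illustrates why, under the assumption that the pipeline joins the read, transcript and graph names with underscores and that each wildcard matches greedily. The pattern is a hypothetical stand-in and is not copied from the Snakefile.

```python
import re

# Hypothetical stand-in for a name of the form {reads}_{transcripts}_{graph},
# with every wildcard modelled as a greedy ".+".
PATTERN = re.compile(r"^(?P<reads>.+)_(?P<transcripts>.+)_(?P<graph>.+)$")


def split_name(name):
    """Split a combined name into three wildcard values, or return None."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None


def names_with_underscores(file_names):
    """Return the file names whose part before the first dot contains '_'."""
    return [name for name in file_names if "_" in name.split(".", 1)[0]]


# Names from this issue, joined as reads_transcripts_graph.
combined = "PAF25672_pass_concat" + "_" + "GRCh38_latest_rna" + "_" + "hg38"
print(split_name(combined))
# {'reads': 'PAF25672_pass_concat_GRCh38_latest', 'transcripts': 'rna', 'graph': 'hg38'}

print(names_with_underscores(["GRCh38_latest_rna.fna", "hg38.gfa",
                              "PAF25672_pass_concat.fastq.gz"]))
# ['GRCh38_latest_rna.fna', 'PAF25672_pass_concat.fastq.gz']
```

With underscores inside the names, the boundary between the wildcards is ambiguous, and the greedy split assigns part of the transcript file name to the reads. Names without underscores give the split only one possible answer.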

FusionFinder: src/FusionFinder.cpp:41: std::cxx11::string geneFromTranscript(std::cxx11::string): Assertion `!match.empty()' failed.

Hi,
I used the new version of Aeron, but when I run the step "rule fusionfinder" in Snakefile_fusion, I always get this error.
The log file is:
Fusion finder Branch develop commit f9a9e1703e1abcf99fcf0a3ea699cf41f4d8c0d4 2019-07-09 10:39:49 +0200
load graph
load putative fusions
load reads
load partial assignments
FusionFinder: src/FusionFinder.cpp:41: std::__cxx11::string geneFromTranscript(std::__cxx11::string): Assertion `!match.empty()' failed.
Aborted (core dumped).

The format of the input file I generated is shown in the attached file.
format.txt

Error in generateCountMatrix

Hi Aeron Group,

I was using Aeron to detect fusion transcripts in Nanopore transcriptome data. After the graph building, I ran the command "snakemake --cores 10 all", but encountered the error below in generateCountMatrix:

rule generateCountMatrix:
input: output/matrix_CPD1906270733_transcripts_Genome_all.txt
output: output/CountMatrix_CPD1906270733_transcripts_Genome.txt
jobid: 7
benchmark: benchmark/generateCount_CPD1906270733_transcripts_Genome.txt
wildcards: reads=CPD1906270733, transcripts=transcripts, graph=Genome

Traceback (most recent call last):
File "/share/apps/biosoft/fushion/Aeron/AeronScripts/ThreePrime.py", line 24, in <module>
if(float(ent[-2])>0.2):
ValueError: could not convert string to float: 'ENST00000391753.6'

First, I opened the file "output/matrix_CPD1906270733_transcripts_Genome_all.txt", its format was as follows:

408:1122|554f1097-d17d-4c0d-9fd9-c6bd9fe9ede1 ENST00000391753.6 0.949153
408:1122|554f1097-d17d-4c0d-9fd9-c6bd9fe9ede1 ENST00000302907.9 0.983051
408:1122|554f1097-d17d-4c0d-9fd9-c6bd9fe9ede1 ENST00000610644.4 0.960452

There were three columns: the first was the Nanopore read ID, the second was the Ensembl transcript ID, and was the third an identity score? That is just my guess.

Then I opened the script "ThreePrime.py":

for line in m:
    ent=line.rstrip().split("\t")
    if(float(ent[-2])>0.2):
        if(ent[0] != nam):
            max=0
            mt[ent[0]]=""
        if(float(ent[-2])>max):
            mt[ent[0]]=[]
            tran=ent[1].split(".")[0]
            mt[ent[0]] = tran
            mt1[ent[0]] = float(ent[-1])
            nam=ent[0]
            max=float(ent[-2])
        elif(float(ent[-1])==max):
            tran=ent[1].split(".")[0]
            if(float(ent[-1])<float(mt1[ent[0]])):
                mt[ent[0]] = tran
                mt1[ent[0]] = float(ent[-1])
        else:
            tmp="Hello"

My questions are why [-2] and [-1] are used in the script, and whether the file "output/matrix_CPD1906270733_transcripts_Genome_all.txt" is correct.

Please help me solve this problem, thanks.
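
The ValueError can be reproduced from one of the matrix lines quoted above. This is a minimal sketch, assuming the columns are tab-separated as the split("\t") call in the script implies: with only three columns, ent[-2] is the transcript ID, so float() fails on it. Whether the file is supposed to carry an additional numeric column is not stated anywhere in this report. The helper numeric_columns is hypothetical and only reports which fields parse as numbers.

```python
def numeric_columns(line):
    """Return the indices of tab-separated fields that parse as floats (hypothetical helper)."""
    fields = line.rstrip().split("\t")
    numeric = []
    for index, field in enumerate(fields):
        try:
            float(field)
        except ValueError:
            continue
        numeric.append(index)
    return numeric


# One line from the matrix file quoted above, with tabs between the columns.
line = "408:1122|554f1097-d17d-4c0d-9fd9-c6bd9fe9ede1\tENST00000391753.6\t0.949153\n"
ent = line.rstrip().split("\t")

print(len(ent))               # number of columns in the line
print(numeric_columns(line))  # indices of the columns that are numeric

try:
    float(ent[-2])            # the expression from line 24 of ThreePrime.py
except ValueError as err:
    print("fails as in the traceback:", err)
```

Running the same check on a few lines of the real matrix file would show whether every line has the same number of columns and how many of them are numeric.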

MissingInputException in line 127 of ./Aeron/Snakefile_fusion: Missing input files for rule sam_to_bam:

Dear Schulz Lab,

thank you for providing such an interesting tool.

Whether I run the fusion detection or the quantification pipeline with references downloaded from Ensembl, I get the error messages below. This looks similar to one of the previously opened issues. I would be very grateful for any hint on how to solve this. My commands and the error messages follow.

snakemake --cores 40 all -s Snakefile_fusion
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
wildcard constraints in inputs are ignored
MissingInputException in line 127 of ./Aeron/Snakefile_fusion:
Missing input files for rule sam_to_bam:
fusiontmp/reads_tofusions_P_HomouusapiensuuGRCh38uucdnauuall_ens92uuhg38.sam

snakemake --cores=10
MissingInputException in line 41 of ./Aeron/Snakefile:
Missing input files for rule align:
input/r

Thank you!

AttributeError: 'ParseGTF' object has no attribute 'getTranscriptPosition'

Dear,

When I run the command "snakemake --core=10 all", the snakemake environment gives me the following error:
File "AeronScripts//ThreePrime.py", line 79, in <module>
transtart, transend = gf.getTranscriptPosition(i)
AttributeError: 'ParseGTF' object has no attribute 'getTranscriptPosition'

Indeed, the GraphBuilder/ParseGTF.py script does not define a 'getTranscriptPosition' attribute.

Do you have any suggestions?

Thank you.

Missing input files for rule pair_assignments: "output/aln_{transcripts}_{graph}_full_length.gam"

Dear,

When I run Snakefile_fusion, the snakemake environment gives me the following error:
"MissingInputException in line 59 of Snakefile_fusion: Missing input files for rule pair_assignments".

Indeed, on line 61 the rule requires the file "output/aln_{transcripts}_{graph}_full_length.gam" as input, but looking through Snakefile_fusion, there seems to be no rule that produces this file.

I'm confused.
Is there something wrong?
Should I add a new rule in the Snakefile_fusion script?

Best Regards.

GraphBuilder outputs empty file

Hello,

I am trying to get started using Aeron on some long read data generated from murine macrophages. I am attempting to build the graph using the following command, but get the output below and the gta file is empty.

$ python GraphBuilder.py -e mm10.fa -g ref-transcripts.gtf -o mm10.gta
Reading Sequences
Done reading sequences
Reading and processing gtf
Done reading gtf
Collecting all the genes
Building graph for:
Warning: No sequence information added
Warning: No connection information added
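
One thing that could be checked, without knowing whether it applies here, is whether the sequence names in the FASTA file match the first column of the GTF (for example chr1 versus 1). Such a mismatch commonly leads to empty results in annotation-driven tools. The sketch below only shows how the two sets of names could be compared. The function names and sample data are hypothetical and are not part of Aeron.

```python
def fasta_names(lines):
    """Collect sequence names: the first word after '>' on each FASTA header line."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def gtf_seqnames(lines):
    """Collect the first (seqname) column of every non-comment GTF line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


# Made-up sample data; in practice pass open("mm10.fa") and open("ref-transcripts.gtf").
fasta = [">chr1 some description\n", "ACGT\n", ">chr2\n", "GGCC\n"]
gtf = [
    "#!genome-build example\n",
    '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    '2\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n',
]

shared = fasta_names(fasta) & gtf_seqnames(gtf)
print("names in both files:", sorted(shared))
print("only in GTF:", sorted(gtf_seqnames(gtf) - fasta_names(fasta)))
```

An empty set of shared names, as in this made-up example, would mean that no annotated feature can be placed on any sequence.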

