opengene / genefuse Goto Github PK

View Code? Open in Web Editor NEW

114.0 14.0 62.0 404 KB

Gene fusion detection and visualization

License: MIT License

C++ 46.29% Makefile 0.17% C 52.69% Python 0.85%

fusion bioinformatics gene alk eml4 ros1 ret cancer cosmic

genefuse's Issues

gen_fusion_file.jl

Hi,

the same issue with gen_fusion_file.jl for csv file preparation while using Julia 0.6 version.

ERROR: LoadError: UndefVarError: readstring not defined in detect at /home/ubuntu/julia/gen_fusion_file.jl:33 in main at /home/ubuntu/julia/gen_fusion_file.jl:22 while loading /home/ubuntu/julia/gen_fusion_file.jl, in expression starting on line 119

Could you please generate result file .

My genes are attached.
genes (2).txt

Many thanks for your responce

primary transcripts used in fusion.csv

Hi, Developers
Could you explain the exact meaing of "primary transcript" in fusion.csv.
Does it mean the longest transcript or pre-mature transcript ?

many thanks
zhengxc

Fusion-making .py script does not take into account gene orientation

Thanks for creating this tool! I have noticed that the scripts/make_fusion_genes.py reports wrong order of exons for genes in '-' orientation, which results in considering some fusions as untranscribed and showing wrong fusion structure.

E.g. NTRK3 in cancer.hg38.csv:

NTRK3_ENST00000394480.6,chr15:87859751-88256768
1,88256644,88256747
2,88256284,88256486
3,88255906,88256168
4,88184225,88184299
...

Versus NTRK3 from the fusions.csv file generated by the script:

NTRK3_NM_002530,chr15:87859750-88256739
1,87859750,87877120
2,87880269,87880428
3,87929190,87929434
4,87933011,87933184
...

Here is a quick fix I introduced to make_fusion_genes.py:

Line 33:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
Line 44:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
And insert these after line 50:
if strand == '-':
exons = exons[::-1]

This returns the exons in the right order.

How to use GeneFuse to detect fusion for nanopore sequencing fastq?

Hi sfchen,
I have installed genefuse and worked for the test data successfully. However, it did not work for nanopore fastq. Could you help me? Thank you!
My command line as follow:
genefuse -r ~/database/genome/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa -f cosmic-fusion-list/cancer.GRCh38.csv -t 8 -1 ***.fastq -h ***.fusion.html > ***.log

GOPC_ENST00000052569.6 location

hi，I recently find GOPC location is not consist with its location in UCSU.

Can not get the fusions with the testdata

Can not detect the fusions with the test data,is anyone can help me? all the testdata is from the git and
http://opengene.org/dataset.html

Here is my shell command:
./genefuse -r Homo_sapiens_assembly19.fasta -f genes/cancer.hg19.csv -1 ./testdata/genefuse.R1.fq.gz -2 ./testdata/genefuse.R2.fq.gz -h report.html > result
12:17:45 start with 4 threads
12:18:1 mapper indexing done
12:18:6 sequence number before filtering: 0
12:18:6 removeByComplexity: 0
12:18:6 removeByDistance: 0
12:18:6 removeIndels: 0
12:18:53 matcher indexing done
12:18:53 removeAlignables: 0
12:18:53 found 0 fusions
12:18:53 done

What is the minimum required read length?

I conducted a simulation. When my read length is 89bp, it outputs expected fusions. However when I use read length of 85bp. There are no results.

The main change is the LOG information: "sequence number before filtering: XXX". Shorter read length result in very few (or even zero) sequence number before filtering.

How do I adjust the source code to satisfy more shorter reads?

calculate genefuse fusion freq

solved

how to filter fusion result

Would param --unique=2 cause false postive if depth was high(eg:2000x)?

Maybe a typo in the source code, `abs(left.startGP.position) < abs(left.startGP.position)`?

In indexer.cpp:347

...
        // or smaller positive if contig is the same
        if(left.startGP.contig == right.startGP.contig && abs(left.startGP.position) < abs(left.startGP.position))
            return true;
        else
            return false;
...

Your intention was abs(left.startGP.position) < abs("right".startGP.position), wasn't it?

gen_fusion_file.jl is out of date

Hi,

I notice that the scripts/gen_fusion_file.jl is out of date and can't be used anymore.
So I wrote a simple python script to replace its function, and I would like to share it with others. However any pushes to this repo is banned, would you mind to add this script manually?

"""
Generate a user-defined fusion file 

Input file should be a text file containing 
1. gene name and 
2. transcript name (optional)

For example:
gene1   transcript1
gene2
gene3   transcript3
"""
__author__ = "Kai"
__date__ = "20/05/2020"

import argparse


def read_genelist(inputf):
    with open(inputf, "r") as fh:
        for line in fh:
            yield line.rstrip("\n").split()


def make_fusion_gene(gene, fw, refflat):
    # no transcript specified --> use the longest transcript
    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                if gene[0] not in line:
                    continue
                _, transcript, chrom, _, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                transcripts[transcript] = (chrom, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, start, end, exonstart, exonend  = transcripts[transcript]

    # use user-specified transcript
    elif len(gene) == 2:
        with open(refflat, "r") as fh:
            for line in fh:
                if gene[1] not in line:
                    continue
                _, transcript, chrom, _, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                break
    
    # write to a file
    header = f">{gene[0]}_{transcript},{chrom}:{start}-{end}\n"
    fw.write(header)
    exons = list(zip(exonstart.split(","), exonend.split(",")))[:-1]
    for index, each_exon in enumerate(exons, start=1):
        fw.write(f'{index},{each_exon[0]},{each_exon[1]}\n')
    fw.write("\n")


def get_longest_transcript(transcripts, refflat):
    longest_length = 0
    longest_transcript = ""
    with open(refflat, "r") as fh:
        for line in fh:
            line = line.strip().split()
            if line[1] in transcripts:
                length = int(line[5]) - int(line[4])
                if length > longest_length:
                    longest_length = length
                    longest_transcript = line[1]
    return longest_transcript


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("input", help="Input filename")
    parser.add_argument("-r", "--refflat", required=True, help="Path to the refFlat.txt file, need to be downloaded from UCSC in advance")
    parser.add_argument("-o", "--output", required=True, help="The output filename")
    args = parser.parse_args()
    
    with open(args.output, "w") as fw:
        for gene in read_genelist(args.input):
            make_fusion_gene(gene, fw, args.refflat)

the gen_fusion_file.jl script does not work because OpenGene can not be installed

Hi !

I would like to run the julia script on a list of genes to get the file fusion.csv but I have not managed to install OpenGene, here is the error returned


julia> Pkg.add("OpenGene")
   Cloning default registries into /root/.julia/registries
   Cloning registry General from "https://github.com/JuliaRegistries/General.git"
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package OpenGene [63bcc5cb]:
 OpenGene [63bcc5cb] log:
 ├─possible versions are: 0.1.0-0.1.11 or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions 0.1.0-0.1.11
 └─restricted by julia compatibility requirements to versions: uninstalled — no versions left
Stacktrace:
 [1] #propagate_constraints!#61(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/GraphType.jl:1005
 [2] propagate_constraints! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/GraphType.jl:946 [inlined]
 [3] #simplify_graph!#121(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/GraphType.jl:1460
 [4] simplify_graph! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/GraphType.jl:1460 [inlined]
 [5] macro expansion at ./logging.jl:319 [inlined]
 [6] resolve_versions!(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/Operations.jl:338
 [7] #add_or_develop#58(::Array{Base.UUID,1}, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/Operations.jl:1164
 [8] #add_or_develop at ./none:0 [inlined]
 [9] #add_or_develop#13(::Symbol, ::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:64
 [10] #add_or_develop at ./none:0 [inlined]
 [11] #add_or_develop#12 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:29 [inlined]
 [12] #add_or_develop at ./none:0 [inlined]
 [13] #add_or_develop#11 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:28 [inlined]
 [14] #add_or_develop at ./none:0 [inlined]
 [15] #add_or_develop#10 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:27 [inlined]
 [16] #add_or_develop at ./none:0 [inlined]
 [17] #add#18 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:69 [inlined]
 [18] add(::String) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:69 
 [19] top-level scope at none:0

here my geneList (iI working with hg19), I would be happy if you can help me ()

CAMTA1
CCNB3
CIC
EPC1
EWSR1
FOXO1
FUS
GLI1
HMGA2
JAZF1
MEAF6
MKL2
NCOA2
NTRK3
PDGFB
PLAG1
ROS1
SS18
STAT6
TAF15
TCF12
TFE3
TFG
USP6
YWHAE

Thank you

fusion only option

I got a lot of deletions when I ran our real data using illumina targeted RNA fusion panel.
This panel is sequenced by 2x75 bp. Would you add an option for fusion only?

k-mer hit will miss when the fusion mates are in different strands

For example:

EML4-ALK fusion, one is forward and the other is reverse, the k-mer counting will be failed for one of them if only we only index one direction.

No fusion found using tests data

Hi Shifu,
no fusion found using tests data
./genefuse -r Homo_sapiens_assembly19.fasta -f druggable.hg19.csv -1 R1.fq.gz -2 R2.fq.gz -h r1r2n.html
15:51:11 start with 4 threads
15:51:50 mapper indexing done
15:52:20 sequence number before filtering: 0
15:52:20 removeByComplexity: 0
15:52:20 removeByDistance: 0
15:52:20 removeIndels: 0
15:54:5 matcher indexing done
15:54:5 removeAlignables: 0
15:54:5 found 0 fusions

./genefuse -r Homo_sapiens_assembly19.fasta -f druggable.hg19.csv -1 genefuse.R1.fq.gz -2 genefuse.R2.fq.gz -h genefuser1r2n.html
15:55:45 start with 4 threads
15:56:25 mapper indexing done
15:56:36 sequence number before filtering: 0
15:56:36 removeByComplexity: 0
15:56:36 removeByDistance: 0
15:56:36 removeIndels: 0
15:58:25 matcher indexing done
15:58:25 removeAlignables: 0
15:58:25 found 0 fusions

Dataset was downloaded from: http://opengene.org/dataset.html

Thanks.

No fusion generated using demo fastq files

No fusion detected with the demo data
genefuse -r hg19.fasta -f genes/druggable.hg19.csv -1 R1.fq -2 R2.fq -h report.html > result

std::bad_alloc

Hi:
I got a problem:
19:31:16 start with 12 threads
19:34:24 mapper indexing done
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

I am wondering what caused this problem?

gz is not yet supported for reference file <-r>

tmprss-erg

GeneFuse/genes/cancer.hg38.csv

Lines 128 to 154 in aa281c0

    
           >TMPRSS2_ENST00000332149.9,chr21:41464551-41531116 
        
           1,41508081,41508159 
        
           2,41498119,41498189 
        
           3,41494356,41494578 
        
           4,41489507,41489593 
        
           5,41488394,41488513 
        
           6,41480476,41480602 
        
           7,41479172,41479282 
        
           8,41476577,41476620 
        
           9,41473325,41473496 
        
           10,41471806,41471981 
        
           11,41470648,41470743 
        
           12,41468396,41468538 
        
           13,41467734,41467886 
        
           14,41464551,41466153 
        
           >ERG_ENST00000288319.11,chr21:38380027-38661780 
        
           1,38498363,38498483 
        
           2,38445404,38445621 
        
           3,38423410,38423561 
        
           4,38403506,38403709 
        
           5,38402557,38402637 
        
           6,38400574,38400645 
        
           7,38392376,38392444 
        
           8,38391659,38391715 
        
           9,38390995,38391042 
        
           10,38380027,38383923

Whether the software supports fastq data from both DNAseq and RNAseq or not?

Can I use this soft to call fusion gene from RNAseq data?
Could anyone give me some advice?

thanks so much !

different gene set gives different results for same fusion

Hi thank-you for a great tool. I ran genefuse on my WES data from a sample with a known ETV6--NTRK3 fusion. I ran with druggable.hg38.csv and cancer.hg38.csv which both have ETV6 and NTRK3 gene, the exons and coordinates are identical for these 2 genes in the 2 .csv files. But I only detect the ETV6--NTRK3 fusion if I use the smaller druggable.hg38.csv gene set, and not if I use cancer.hg38.csv. I was wondering if having more genes in cancer.hg38.csv potentially means reads with better alignment to other genes are aligning to those genes rather than ETV6 or NTRK3?

Sort the reported fusions by importance

1, COSMIC curated fusions
2, can be transcribed or not
3, supporting reads number

Looking for "Encompassing Reads" in the output.

I have a question. Does GeneFuse report "Encompassing Reads" in the output? Thank you.

Note:
Encompassing reads are pairs of reads with each in the pair aligned to a different gene, thereby surrounding the fusion junctions or genomic breakpoints, and spanning reads are reads partially aligned to both genes corresponding to a fusion junction.

Support cross pair

@name1
AGGGTTACCTGAGGATCGAATGAATTGAAATGTGTAAATTGCCGAGCACGTAGTAACCATGCAACAAGTGTTAGCTCCTATTATCCTGTCCCTTTGAGGGATGGCACCATATGGGGACACAGTGTGTGCTGCCATCTCCCTTCTACCGGC
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

@name1
TGGGCCCTCCCTCCCAAGCACTGGAGGAGGTACTCGTTCTGTGGGCCGGGGCCCCTCCCTCCTGAGCACTGGAGGAGGCACTTGTTCTGTGGGCCACGGCCCCTCCCTCCCCAGCACTGGAGGAGGCACTGGTTGTGTGGGCCCTGGCTC
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

GeneFuse ignores fusions in TCF3

Hello,
Thanks for developing this tool! I was testing it on some hematological samples with TCF3 gene fusions (confirmed by FISH), and it seems like GeneFuse is not picking them up. I have checked that the gene is correctly reflected in the reference, and the reads connecting TCF3 and its partners are uniquely mapped to its location in the corresponding BAM files. I have two samples with different fusions, and in both of them, the fusion was not predicted even as untranscribed. Can you suggest what else can I try to fix this problem?
Thanks in advance!
Cheers,
Nadezda

how to see the version of genefuse?

Hi, I got the genefuse by git clone on the command line, and make, finally, I got the binary genefuse, but I can't find the version of it. How to check which version of the software, thanks!

get stuck if no read is presented in the fastq file

GeneFuse on RNA-seq

Dear sir,
I tried to apply genefuse to our RNA-seq with known fusion gene.
Somehow we couldn't find any fusion gene.
command I used:
genefuse -r hg19.fa -f druggable.hg19.csv -1 A2076_FFPE_RNA_S1_L001_R1_001.fastq.gz" -2 A2076_FFPE_RNA_S1_L001_R2_001.fastq.gz" -h A2076_FFPE_RNA.html > A2076_FFPE_RNA.result
14:39:27 start with 4 threads
14:39:51 mapper indexing done
14:39:55 sequence number before filtering: 45
14:39:55 removeByComplexity: 0
14:39:55 removeByDistance: 22
14:39:55 removeIndels: 2
14:41:14 matcher indexing done
14:41:14 removeAlignables: 3
14:41:14 found 0 fusions
14:41:16 done
What your suggestion to apply genefuse to RNA-seq ?

Support BAM input

IGH-BCL2 is not detected but bam is supported

I have a sample with BCL2-IGH verified by FISH. There are lots of reads to support viewed by IGV. the pictuers is as follows. I am running with default parameters，and BCL2,IGH is included in csv file

BCL2_NM_000633,chr18:60790579-60987361
1,60986406,60986613
2,60985314,60986185
3,60790578,60795992

IGH,chr14:106052774-107288051

I would like to know whether or not relevant parameters need to be adjusted. Thanks a lot

可否用gtf文件产生融合基因列表的csv文件？

如题，大概看了一下制作融合基因列表的jl脚本和文件。脚本仅仅支持hg19或hg38。我理解OpenGene的julia包里这俩人类基因组的外显子信息应该是有来源的，例如gtf文件。那么，我考虑可否通过gtf文件直接来制造候选的融合基因列表呢？

make: *** No rule to make target 'install'. Stop.

Need help to compile from source code.
Got "make: *** No rule to make target 'install'. Stop."
when sudo make install, in ubuntu 16.04 LTS

Running binary without any error but found no fusions when run with dataset for testing.
Is there any library must have before installation, Thanks.

No fusion found in tests data

amar@amar-Virtual-Machine:~/GeneFuse$ ./genefuse -1 testdata/R1.fq -2 testdata/R2.fq -r Homo_sapiens_assembly19.fasta -f genes/cancer.hg19.csv
22:53:5 start with 4 threads
22:53:9 mapper indexing done
22:53:9 sequence number before filtering: 0
22:53:9 removeByComplexity: 0
22:53:9 removeByDistance: 0
22:53:9 removeIndels: 0
22:53:25 matcher indexing done
22:53:25 removeAlignables: 0
22:53:25 found 0 fusions

./genefuse -1 testdata/R1.fq -2 testdata/R2.fq -r Homo_sapiens_assembly19.fasta -f genes/cancer.hg19.csv

genefuse v0.3.0, time used: 20 seconds

22:53:25 done
No fusion gene found in test data.Kindly reply....

JSON output corrupted due to a not escaped double-quotes character

I'm using genefuse JSON output to handle fusions information in order to prioritize them, so I'm using a Python script to read the JSON file. Sometimes the JSON output is corrupted because in the "qual" value of each read it could be used the double-quotes symbol. For example:
"reads":[ { "break":33, "strand":"reversed", "seq":"TTTTTTATAGGATTTGGGAAGGTAATGGAAAATTCCAGTCAAAGGGGGTTGTTCTCTGGTGGGCAGGGGCGGGGGTCACAAGGTGCTCAGTGGGGGAG", "qual":"9.B>?<5:=B94ACABA;BA@79B>B?6@A?@BB"@=<@:A=9@ABB>BB>B>?B.@A6BB?@>?@@B?7AAB>=<B=AA@A7B@:?<A9AA@@?=??" },

When genefuse writes the JSON file the double-quotes is not escaped, so trying to read the JSON file cause an error.
At the moment I'm avoiding the problem using a regex pattern to manipulate the JSON file and make it correct. I hope there's a way to update the construction of the JSON output in order to add the escape slash before double-quotes in "qual" value.

error when creating fusion.csv file

Hi,

I am trying to create the fusion.csv file using the script that you provide gen_fusion_file.jl but I get the error:

loading gencode grch38
ERROR: LoadError: KeyError: key "localfile" not found

And this is what I run:
julia scripts/gen_fusion_file.jl -r GRch38 -g genes_for_csv.txt -f fusion.csv

Thanks a lot,

Erica

while running Stuck

hi,
Sometimes it gets stuck and does not run. The final state is as follows. and the task is suspended. Why is it? Please help me, thanks!
23:11:23 start with 4 threads
23:11:27 mapper indexing done

How to get VAF info from genefuse results?

Hi there:
#Fusion: EML4_ENST00000318522.5:intron:18|-2:42544153___ALK_ENST00000389048.3:intron:19|+2:29446769 (total: 155, unique:13)
155 refers to the number of supporting reads and 13 is the unique reads. Can we calculate VAF by 13/155*100 = 8.4% ?
Or where else can we get VAF info from the result file? Thanks!

Multiple runs show different results.

Multiple runs show different unique read count.

vec<Match> is sorted differently a bit per run.

Some Matchs have the same read_break and seq.length().

This was found with hg19.fa, druggable.hg18.csv, test genefuse fastq files.

the data size of raw data effected fusion result

Hi,I cut off my raw data fastq to 300M(1000X),and use gene fuse to detect. The result was negative, no fusion detected. Using genefuse to detect 2G of my raw data , genefuse gets a lot of fusion results. Maybe the size of data effected genefuse.

	>TMPRSS2_ENST00000332149.9,chr21:41464551-41531116
	1,41508081,41508159
	2,41498119,41498189
	3,41494356,41494578
	4,41489507,41489593
	5,41488394,41488513
	6,41480476,41480602
	7,41479172,41479282
	8,41476577,41476620
	9,41473325,41473496
	10,41471806,41471981
	11,41470648,41470743
	12,41468396,41468538
	13,41467734,41467886
	14,41464551,41466153

	>ERG_ENST00000288319.11,chr21:38380027-38661780
	1,38498363,38498483
	2,38445404,38445621
	3,38423410,38423561
	4,38403506,38403709
	5,38402557,38402637
	6,38400574,38400645
	7,38392376,38392444
	8,38391659,38391715
	9,38390995,38391042
	10,38380027,38383923

opengene / genefuse Goto Github PK

genefuse's Issues

./genefuse -1 testdata/R1.fq -2 testdata/R2.fq -r Homo_sapiens_assembly19.fasta -f genes/cancer.hg19.csv

genefuse v0.3.0, time used: 20 seconds

Recommend Projects

Recommend Topics

Recommend Org