opengene / genefuse Goto Github PK
View Code? Open in Web Editor NEWGene fusion detection and visualization
License: MIT License
Gene fusion detection and visualization
License: MIT License
Can I use this soft to call fusion gene from RNAseq data?
Could anyone give me some advice?
thanks so much !
Hi,
the same issue with gen_fusion_file.jl for csv file preparation while using Julia 0.6 version.
ERROR: LoadError: UndefVarError: readstring not defined in detect at /home/ubuntu/julia/gen_fusion_file.jl:33 in main at /home/ubuntu/julia/gen_fusion_file.jl:22 while loading /home/ubuntu/julia/gen_fusion_file.jl, in expression starting on line 119
Could you please generate result file .
My genes are attached.
genes (2).txt
Many thanks for your responce
Hi,
I notice that the scripts/gen_fusion_file.jl is out of date and can't be used anymore.
So I wrote a simple python script to replace its function, and I would like to share it with others. However any pushes to this repo is banned, would you mind to add this script manually?
"""
Generate a user-defined fusion file
Input file should be a text file containing
1. gene name and
2. transcript name (optional)
For example:
gene1 transcript1
gene2
gene3 transcript3
"""
__author__ = "Kai"
__date__ = "20/05/2020"
import argparse
def read_genelist(inputf):
with open(inputf, "r") as fh:
for line in fh:
yield line.rstrip("\n").split()
def make_fusion_gene(gene, fw, refflat):
# no transcript specified --> use the longest transcript
if len(gene) == 1:
transcripts = {}
with open(refflat, "r") as fh:
for line in fh:
if gene[0] not in line:
continue
_, transcript, chrom, _, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
transcripts[transcript] = (chrom, start, end, exonstart, exonend)
transcript = get_longest_transcript(transcripts.keys(), refflat)
chrom, start, end, exonstart, exonend = transcripts[transcript]
# use user-specified transcript
elif len(gene) == 2:
with open(refflat, "r") as fh:
for line in fh:
if gene[1] not in line:
continue
_, transcript, chrom, _, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
break
# write to a file
header = f">{gene[0]}_{transcript},{chrom}:{start}-{end}\n"
fw.write(header)
exons = list(zip(exonstart.split(","), exonend.split(",")))[:-1]
for index, each_exon in enumerate(exons, start=1):
fw.write(f'{index},{each_exon[0]},{each_exon[1]}\n')
fw.write("\n")
def get_longest_transcript(transcripts, refflat):
longest_length = 0
longest_transcript = ""
with open(refflat, "r") as fh:
for line in fh:
line = line.strip().split()
if line[1] in transcripts:
length = int(line[5]) - int(line[4])
if length > longest_length:
longest_length = length
longest_transcript = line[1]
return longest_transcript
if __name__ == "__main__":
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("input", help="Input filename")
parser.add_argument("-r", "--refflat", required=True, help="Path to the refFlat.txt file, need to be downloaded from UCSC in advance")
parser.add_argument("-o", "--output", required=True, help="The output filename")
args = parser.parse_args()
with open(args.output, "w") as fw:
for gene in read_genelist(args.input):
make_fusion_gene(gene, fw, args.refflat)
Hi there:
#Fusion: EML4_ENST00000318522.5:intron:18|-2:42544153___ALK_ENST00000389048.3:intron:19|+2:29446769 (total: 155, unique:13)
155 refers to the number of supporting reads and 13 is the unique reads. Can we calculate VAF by 13/155*100 = 8.4% ?
Or where else can we get VAF info from the result file? Thanks!
Hi, Developers
Could you explain the exact meaing of "primary transcript" in fusion.csv.
Does it mean the longest transcript
or pre-mature transcript
?
many thanks
zhengxc
solved
I conducted a simulation. When my read length is 89bp, it outputs expected fusions. However when I use read length of 85bp. There are no results.
The main change is the LOG information: "sequence number before filtering: XXX". Shorter read length result in very few (or even zero) sequence number before filtering.
How do I adjust the source code to satisfy more shorter reads?
Hi,I cut off my raw data fastq to 300M(1000X),and use gene fuse to detect. The result was negative, no fusion detected. Using genefuse to detect 2G of my raw data , genefuse gets a lot of fusion results. Maybe the size of data effected genefuse.
In indexer.cpp:347
...
// or smaller positive if contig is the same
if(left.startGP.contig == right.startGP.contig && abs(left.startGP.position) < abs(left.startGP.position))
return true;
else
return false;
...
Your intention was abs(left.startGP.position) < abs("right".startGP.position)
, wasn't it?
GeneFuse/genes/cancer.hg38.csv
Lines 128 to 154 in aa281c0
Hi Shifu,
no fusion found using tests data
./genefuse -r Homo_sapiens_assembly19.fasta -f druggable.hg19.csv -1 R1.fq.gz -2 R2.fq.gz -h r1r2n.html
15:51:11 start with 4 threads
15:51:50 mapper indexing done
15:52:20 sequence number before filtering: 0
15:52:20 removeByComplexity: 0
15:52:20 removeByDistance: 0
15:52:20 removeIndels: 0
15:54:5 matcher indexing done
15:54:5 removeAlignables: 0
15:54:5 found 0 fusions
./genefuse -r Homo_sapiens_assembly19.fasta -f druggable.hg19.csv -1 genefuse.R1.fq.gz -2 genefuse.R2.fq.gz -h genefuser1r2n.html
15:55:45 start with 4 threads
15:56:25 mapper indexing done
15:56:36 sequence number before filtering: 0
15:56:36 removeByComplexity: 0
15:56:36 removeByDistance: 0
15:56:36 removeIndels: 0
15:58:25 matcher indexing done
15:58:25 removeAlignables: 0
15:58:25 found 0 fusions
Dataset was downloaded from: http://opengene.org/dataset.html
Thanks.
Hi,
I am trying to create the fusion.csv file using the script that you provide gen_fusion_file.jl but I get the error:
loading gencode grch38
ERROR: LoadError: KeyError: key "localfile" not found
And this is what I run:
julia scripts/gen_fusion_file.jl -r GRch38 -g genes_for_csv.txt -f fusion.csv
Thanks a lot,
Erica
Hi sfchen,
I have installed genefuse and worked for the test data successfully. However, it did not work for nanopore fastq. Could you help me? Thank you!
My command line as follow:
genefuse -r ~/database/genome/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa -f cosmic-fusion-list/cancer.GRCh38.csv -t 8 -1 ***.fastq -h ***.fusion.html > ***.log
Need help to compile from source code.
Got "make: *** No rule to make target 'install'. Stop."
when sudo make install, in ubuntu 16.04 LTS
Running binary without any error but found no fusions when run with dataset for testing.
Is there any library must have before installation, Thanks.
1, COSMIC curated fusions
2, can be transcribed or not
3, supporting reads number
Would param --unique=2 cause false postive if depth was high(eg:2000x)?
如题,大概看了一下制作融合基因列表的jl脚本和文件。脚本仅仅支持hg19或hg38。我理解OpenGene的julia包里这俩人类基因组的外显子信息应该是有来源的,例如gtf文件。那么,我考虑可否通过gtf文件直接来制造候选的融合基因列表呢?
Hi, I got the genefuse by git clone on the command line, and make, finally, I got the binary genefuse, but I can't find the version of it. How to check which version of the software, thanks!
I have a sample with BCL2-IGH verified by FISH. There are lots of reads to support viewed by IGV. the pictuers is as follows. I am running with default parameters,and BCL2,IGH is included in csv file
BCL2_NM_000633,chr18:60790579-60987361
1,60986406,60986613
2,60985314,60986185
3,60790578,60795992
IGH,chr14:106052774-107288051
I would like to know whether or not relevant parameters need to be adjusted. Thanks a lot
For example:
EML4-ALK fusion, one is forward and the other is reverse, the k-mer counting will be failed for one of them if only we only index one direction.
Read1
@name1
AGGGTTACCTGAGGATCGAATGAATTGAAATGTGTAAATTGCCGAGCACGTAGTAACCATGCAACAAGTGTTAGCTCCTATTATCCTGTCCCTTTGAGGGATGGCACCATATGGGGACACAGTGTGTGCTGCCATCTCCCTTCTACCGGC
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
Read2
@name1
TGGGCCCTCCCTCCCAAGCACTGGAGGAGGTACTCGTTCTGTGGGCCGGGGCCCCTCCCTCCTGAGCACTGGAGGAGGCACTTGTTCTGTGGGCCACGGCCCCTCCCTCCCCAGCACTGGAGGAGGCACTGGTTGTGTGGGCCCTGGCTC
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
Multiple runs show different unique read count.
vec<Match>
is sorted differently a bit per run.
Some Match
s have the same read_break
and seq.length()
.
This was found with hg19.fa, druggable.hg18.csv, test genefuse fastq files.
Dear sir,
I tried to apply genefuse to our RNA-seq with known fusion gene.
Somehow we couldn't find any fusion gene.
command I used:
genefuse -r hg19.fa -f druggable.hg19.csv -1 A2076_FFPE_RNA_S1_L001_R1_001.fastq.gz" -2 A2076_FFPE_RNA_S1_L001_R2_001.fastq.gz" -h A2076_FFPE_RNA.html > A2076_FFPE_RNA.result
14:39:27 start with 4 threads
14:39:51 mapper indexing done
14:39:55 sequence number before filtering: 45
14:39:55 removeByComplexity: 0
14:39:55 removeByDistance: 22
14:39:55 removeIndels: 2
14:41:14 matcher indexing done
14:41:14 removeAlignables: 3
14:41:14 found 0 fusions
14:41:16 done
What your suggestion to apply genefuse to RNA-seq ?
Can not detect the fusions with the test data,is anyone can help me? all the testdata is from the git and
http://opengene.org/dataset.html
Here is my shell command:
./genefuse -r Homo_sapiens_assembly19.fasta -f genes/cancer.hg19.csv -1 ./testdata/genefuse.R1.fq.gz -2 ./testdata/genefuse.R2.fq.gz -h report.html > result
12:17:45 start with 4 threads
12:18:1 mapper indexing done
12:18:6 sequence number before filtering: 0
12:18:6 removeByComplexity: 0
12:18:6 removeByDistance: 0
12:18:6 removeIndels: 0
12:18:53 matcher indexing done
12:18:53 removeAlignables: 0
12:18:53 found 0 fusions
12:18:53 done
Thanks for creating this tool! I have noticed that the scripts/make_fusion_genes.py
reports wrong order of exons for genes in '-' orientation, which results in considering some fusions as untranscribed and showing wrong fusion structure.
E.g. NTRK3 in cancer.hg38.csv:
NTRK3_ENST00000394480.6,chr15:87859751-88256768
1,88256644,88256747
2,88256284,88256486
3,88255906,88256168
4,88184225,88184299
...
Versus NTRK3 from the fusions.csv file generated by the script:
NTRK3_NM_002530,chr15:87859750-88256739
1,87859750,87877120
2,87880269,87880428
3,87929190,87929434
4,87933011,87933184
...
Here is a quick fix I introduced to make_fusion_genes.py
:
Line 33:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
Line 44:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
And insert these after line 50:
if strand == '-':
exons = exons[::-1]
This returns the exons in the right order.
hi,
Sometimes it gets stuck and does not run. The final state is as follows. and the task is suspended. Why is it? Please help me, thanks!
23:11:23 start with 4 threads
23:11:27 mapper indexing done
I got a lot of deletions when I ran our real data using illumina targeted RNA fusion panel.
This panel is sequenced by 2x75 bp. Would you add an option for fusion only?
Hi thank-you for a great tool. I ran genefuse on my WES data from a sample with a known ETV6--NTRK3 fusion. I ran with druggable.hg38.csv and cancer.hg38.csv which both have ETV6 and NTRK3 gene, the exons and coordinates are identical for these 2 genes in the 2 .csv files. But I only detect the ETV6--NTRK3 fusion if I use the smaller druggable.hg38.csv gene set, and not if I use cancer.hg38.csv. I was wondering if having more genes in cancer.hg38.csv potentially means reads with better alignment to other genes are aligning to those genes rather than ETV6 or NTRK3?
I'm using genefuse JSON output to handle fusions information in order to prioritize them, so I'm using a Python script to read the JSON file. Sometimes the JSON output is corrupted because in the "qual" value of each read it could be used the double-quotes symbol. For example:
"reads":[ { "break":33, "strand":"reversed", "seq":"TTTTTTATAGGATTTGGGAAGGTAATGGAAAATTCCAGTCAAAGGGGGTTGTTCTCTGGTGGGCAGGGGCGGGGGTCACAAGGTGCTCAGTGGGGGAG", "qual":"9.B>?<5:=B94ACABA;BA@79B>B?6@A?@BB"@=<@:A=9@ABB>BB>B>?B.@A6BB?@>?@@B?7AAB>=<B=AA@A7B@:?<A9AA@@?=??" },
When genefuse writes the JSON file the double-quotes is not escaped, so trying to read the JSON file cause an error.
At the moment I'm avoiding the problem using a regex pattern to manipulate the JSON file and make it correct. I hope there's a way to update the construction of the JSON output in order to add the escape slash before double-quotes in "qual" value.
Hi:
I got a problem:
19:31:16 start with 12 threads
19:34:24 mapper indexing done
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
I am wondering what caused this problem?
amar@amar-Virtual-Machine:~/GeneFuse$ ./genefuse -1 testdata/R1.fq -2 testdata/R2.fq -r Homo_sapiens_assembly19.fasta -f genes/cancer.hg19.csv
22:53:5 start with 4 threads
22:53:9 mapper indexing done
22:53:9 sequence number before filtering: 0
22:53:9 removeByComplexity: 0
22:53:9 removeByDistance: 0
22:53:9 removeIndels: 0
22:53:25 matcher indexing done
22:53:25 removeAlignables: 0
22:53:25 found 0 fusions
22:53:25 done
No fusion gene found in test data.Kindly reply....
Hi !
I would like to run the julia script on a list of genes to get the file fusion.csv but I have not managed to install OpenGene, here is the error returned
julia> Pkg.add("OpenGene")
Cloning default registries into /root/.julia/registries
Cloning registry General from "https://github.com/JuliaRegistries/General.git"
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package OpenGene [63bcc5cb]:
OpenGene [63bcc5cb] log:
├─possible versions are: 0.1.0-0.1.11 or uninstalled
├─restricted to versions * by an explicit requirement, leaving only versions 0.1.0-0.1.11
└─restricted by julia compatibility requirements to versions: uninstalled — no versions left
Stacktrace:
[1] #propagate_constraints!#61(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/GraphType.jl:1005
[2] propagate_constraints! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/GraphType.jl:946 [inlined]
[3] #simplify_graph!#121(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int64}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/GraphType.jl:1460
[4] simplify_graph! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/GraphType.jl:1460 [inlined]
[5] macro expansion at ./logging.jl:319 [inlined]
[6] resolve_versions!(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/Operations.jl:338
[7] #add_or_develop#58(::Array{Base.UUID,1}, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/Operations.jl:1164
[8] #add_or_develop at ./none:0 [inlined]
[9] #add_or_develop#13(::Symbol, ::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:64
[10] #add_or_develop at ./none:0 [inlined]
[11] #add_or_develop#12 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:29 [inlined]
[12] #add_or_develop at ./none:0 [inlined]
[13] #add_or_develop#11 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:28 [inlined]
[14] #add_or_develop at ./none:0 [inlined]
[15] #add_or_develop#10 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:27 [inlined]
[16] #add_or_develop at ./none:0 [inlined]
[17] #add#18 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:69 [inlined]
[18] add(::String) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:69
[19] top-level scope at none:0
here my geneList (iI working with hg19), I would be happy if you can help me ()
CAMTA1
CCNB3
CIC
EPC1
EWSR1
FOXO1
FUS
GLI1
HMGA2
JAZF1
MEAF6
MKL2
NCOA2
NTRK3
PDGFB
PLAG1
ROS1
SS18
STAT6
TAF15
TCF12
TFE3
TFG
USP6
YWHAE
Thank you
No fusion detected with the demo data
genefuse -r hg19.fasta -f genes/druggable.hg19.csv -1 R1.fq -2 R2.fq -h report.html > result
Hello,
Thanks for developing this tool! I was testing it on some hematological samples with TCF3 gene fusions (confirmed by FISH), and it seems like GeneFuse is not picking them up. I have checked that the gene is correctly reflected in the reference, and the reads connecting TCF3 and its partners are uniquely mapped to its location in the corresponding BAM files. I have two samples with different fusions, and in both of them, the fusion was not predicted even as untranscribed. Can you suggest what else can I try to fix this problem?
Thanks in advance!
Cheers,
Nadezda
I have a question. Does GeneFuse report "Encompassing Reads" in the output? Thank you.
Note:
Encompassing reads are pairs of reads with each in the pair aligned to a different gene, thereby surrounding the fusion junctions or genomic breakpoints, and spanning reads are reads partially aligned to both genes corresponding to a fusion junction.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.