namphuon / vifi

Pipeline for identifying viral integration and fusion mRNA reads from NGS data. Manuscript is currently in preparation.

License: GNU General Public License v3.0

Python 65.89% Perl 29.36% Shell 4.75%


vifi's Issues

What is the source of the prebuilt HBV HMM model?

In the prebuilt HBV HMM model, the sequences are annotated as hbv_ref34, etc. Do you have a detailed annotation of these sequences? Can I download all of the HBV sequence data from NCBI and build the HMM model myself?
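In general, profile HMMs like these can be rebuilt from any set of viral genomes: download the sequences, align them, and run hmmbuild on the alignment (ViFi's build_references.sh, mentioned in a later issue, appears to automate its own variant of this). A minimal sketch of the generic recipe, assuming MAFFT and HMMER are installed; file names are illustrative:

    import subprocess

    # Align the downloaded HBV genomes, then build a profile HMM from the
    # alignment. This is the generic recipe, not ViFi's exact pipeline.
    with open("hbv_alignment.fasta", "w") as aligned:
        subprocess.run(["mafft", "--auto", "hbv_genomes.fasta"],
                       stdout=aligned, check=True)
    subprocess.run(["hmmbuild", "--dna", "hbv.hmmbuild", "hbv_alignment.fasta"],
                   check=True)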

Errors when running with other references

I would like to run ViFi with a different reference than the one provided (hg19). I concatenated the reference together with the viral sequence and indexed it. I provided this index, as well as a list of chromosomes in the reference, to ViFi. However, it seems some other files are required - a BED file with mappability scores, GFF files with genes, etc. - which aren't currently documented. These are listed for hg19 in data_repo/hg19/file_list.txt:

fa_file                         hg19full.fa
chrLen_file                     hg19full.fa.fai
duke35_filename                 wgEncodeDukeMapabilityUniqueness35bp_sorted.bedGraph
mapability_exclude_filename     wgMapabilityExcludable.bed
gene_filename                   human_hg19_september_2011/Genes_July_2010_hg19.gff
exon_file                       human_hg19_september_2011/Exon-Intron_July_2010_hg19.gff
oncogene_filename               cancer/oncogenes/Census_oncomerge.gff
centromere_filename             hg19_centromere.bed
conserved_regions_filename      conserved.bed #conserved.gain5.bed (readdepth >2 samples in turner controls KT51-59) + lumpy XYM
segdup_filename                 annotations/hg19GenomicSuperDup.tab

The FASTA file and index are easy enough, so I made a file with this information for my reference and tried to run ViFi (without HMMs), but I ran into these warnings and errors:

WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.050	 rep_content: Unable to open mapability file "/home/data_repo/test_human/".
Traceback (most recent call last):
  File "/home/scripts/cluster_trans_new.py", line 161, in <module>
    if hg.interval(a, bamfile=bamFile).rep_content() <= 3 and a.mapq >= 10:
  File "/home/scripts/hg19util.py", line 384, in rep_content
    m = interval(duke35[p])
  File "/home/scripts/hg19util.py", line 172, in __init__
    self.load_line(line, file_format)
  File "/home/scripts/hg19util.py", line 182, in load_line
    if len(line.strip().split()) == 1:
AttributeError: 'list' object has no attribute 'strip'

It seems the issue has something to do with the missing mappability scores, which aren't available for my reference genome. Are these required for running ViFi, and are there any options for running with a reference for which they aren't available?
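As a debugging aid, it may help to first verify that every path named in a custom file_list.txt actually resolves, since the warnings above show the pipeline continuing past unreadable files and only failing later. A minimal sketch, assuming the two-column key/filename format shown above:

    import os, sys

    # Check that each file listed in file_list.txt exists under the data repo.
    repo = sys.argv[1]  # e.g. the directory passed to ViFi as the data repo
    with open(os.path.join(repo, "file_list.txt")) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue
            key, filename = fields[0], fields[1]
            path = os.path.join(repo, filename)
            print("ok" if os.path.exists(path) else "MISSING", key, path)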

hg19util.py

Hi,
This script has a possible error at line 202. Is this a problematic for loop?
Thanks.

Can ViFi be used for a non-human genome and virus?

Dear ViFi Team

I have sequencing data from a virus-infected fish, and I would like to try running ViFi on it.
I have both the reference genome and the viral sequence.
Is it possible for me to use ViFi to identify the fusion genes between the virus and the fish genome?

Kind Regards
Sri

Breakpoint positions shifted left by 2 bp, and large memory required

Hi,
We ran ViFi on HCC RNA-seq data downloaded from PRJNA337887 (testing only one sample so far). After manually checking ViFi's results against Supplementary Table 3 of the original paper, the integration sites reported by ViFi are all shifted 2 bp to the left of the breakpoints, relative both to the positions the original paper reports and to a manual inspection of the mapped BAM file with samtools tview.

[screenshot: integration sites reported by ViFi]

[screenshot: integration sites reported by the original paper]

[screenshot: samtools tview at one of the integration sites]

Besides this, the cluster_trans_new.py step consumed about 60 GB of memory on a small input BAM file. Is this normal?

AlignmentHeader does not support item assignment

Hello!
I am getting this error while running ViFi.
Any suggestion regarding this error would be really helpful.

Traceback (most recent call last):
  File "/home/anurag/tools/ViFi/scripts/merge_viral_reads.py", line 118, in <module>
    outputFile.header['SQ'] = references
  File "pysam/libcalignmentfile.pyx", line 537, in pysam.libcalignmentfile.AlignmentHeader.__setitem__
TypeError: AlignmentHeader does not support item assignment (use header.to_dict()

I am using pysam version 0.14.1 and samtools version 1.7

Thank you,
AKS
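For context, newer pysam releases made AlignmentHeader read-only, which is what the error message hints at. A minimal sketch of the usual workaround - build the modified header as a plain dict and pass it when opening the output file (file names and contigs are illustrative, not ViFi's actual code):

    import pysam

    # Example contig list standing in for the merged human+viral references.
    references = [{"SN": "chr19", "LN": 59128983},
                  {"SN": "hpv16", "LN": 7906}]

    with pysam.AlignmentFile("input.bam", "rb") as infile:
        header = infile.header.to_dict()   # mutable copy of the header
        header["SQ"] = references          # what merge_viral_reads.py assigns
        with pysam.AlignmentFile("output.bam", "wb", header=header) as outfile:
            for read in infile:            # reads must stay consistent with
                outfile.write(read)        # the new contig list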

KeyError: 'REFERENCE_REPO' when running run_vifi.py

Hi,
I installed Docker and executed setup_linux_mac.sh.
My command: sudo python $VIFI_DIR/scripts/run_vifi.py --cpus 2 --hmm_list $VIFI_DIR/data/hbv/hmms/hmms.txt -f $VIFI_DIR/test/data/test_R1.fq.gz -r $VIFI_DIR/test/data/test_R2.fq.gz -o $VIFI_DIR/tmp/docker/ --docke
But I got the following error:

  File "/data/program/ViFi/ViFi/scripts/run_vifi.py", line 118, in <module>
    reference_dir = os.environ['REFERENCE_REPO']
  File "/usr/lib/python2.7/UserDict.py", line 40, in __getitem__
    raise KeyError(key)
KeyError: 'REFERENCE_REPO'

I have confirmed my environment variables.
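One likely cause worth checking: sudo normally strips the caller's environment, so REFERENCE_REPO can be unset inside "sudo python" even when it is exported in your shell (sudo -E preserves it). A friendlier lookup than the bare os.environ access would fail with a clear message; a minimal sketch, not ViFi's actual code:

    import os

    # Report a clear error when the variable is missing instead of a raw
    # KeyError (e.g. because sudo dropped the environment).
    reference_dir = os.environ.get("REFERENCE_REPO")
    if reference_dir is None:
        raise SystemExit("REFERENCE_REPO is not set; export it (or run with "
                         "'sudo -E') before launching run_vifi.py")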

Running ViFi with Custom Reference Files

Hello,

I would like to use ViFi with a specific reference file that I have. Therefore, I'm referring to the following content:
#Set up reference for alignment
HUMAN_REF="GRCh38"
HUMAN_REF_FILE_NAME="hg38full.fa"
for virus in "hpv" "hbv" "hcv"; do
  if [ ! -d $REFERENCE_REPO/${virus} ]; then
    echo "Reference for virus $virus is not downloaded. Contact the author to get access to the viral references."
  else
    HUMAN_VIRAL_REF="grch38_${virus}.fas"
    echo "Building the ${HUMAN_REF}+${virus} reference"
    cat $AA_DATA_REPO//${HUMAN_REF}/${HUMAN_REF_FILE_NAME} $REFERENCE_REPO/${virus}/${virus}.unaligned.fas > $REFERENCE_REPO/${virus}/${HUMAN_VIRAL_REF}
    docker run -v $REFERENCE_REPO/${virus}/:/home/${virus}/ docker.io/namphuon/vifi bwa index /home/${virus}/${HUMAN_VIRAL_REF}

    #Build reduced list of HMMs for testing
    echo "Creating the list of hmms for testing in $VIFI_DIR"
    ls $VIFI_DIR/viral_data/${virus}/hmms/*.hmmbuild > $VIFI_DIR/viral_data/${virus}/hmms/hmms.txt
    ls $VIFI_DIR/viral_data/${virus}/hmms/*.[0-9].hmmbuild > $VIFI_DIR/viral_data/${virus}/hmms/partial_hmms.txt
  fi
done

I ran the command:
docker run -v $REFERENCE_REPO/AB033550/:/home/AB033550/ docker.io/namphuon/vifi bwa index /home/AB033550/hybrid_hg19nAB033550.fas

However, the HMM and TRE files were not generated, so I tried running "ViFi/scripts/build_references.sh". I ran the following command on June 13, but it still hasn't completed.

Then I ran:
sh /home/kde/PROJECTS/VirusIntegrationTools/download/ViFi/scripts/build_references.sh /home/kde/PROJECTS/VirusIntegrationTools/download/ViFi/viral_data/AB033550/hybrid_hg19nAB033550.fas /home/kde/PROJECTS/VirusIntegrationTools/download/ViFi/viral_data/AB033550/output hybrid /home/kde/PROJECTS/VirusIntegrationTools/download/ViFi/scripts

If it's appropriate to use "build_references.sh" to run ViFi with my reference files, why is it taking so long? Is there a solution?

Thanks

Error: hg19util

Hello,
I am trying to run ViFi, but every time I get this error.

Traceback (most recent call last):
  File "run_vifi.py", line 4, in <module>
    import hg19util as hg19
ImportError: No module named hg19util

Kindly help me with this error.

Thanks
AKS

ViFi for germline

Hi everyone,

I have a question about ViFi: I want to detect germline virus integrations in my samples, and I would like to know whether it's possible to use ViFi for this purpose. Also, if I have multiple files for fastq_1 and fastq_2, do I have to merge them all into one file for _1 and another for _2?

Thanks for your help

Jordi
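On the second question: paired-end FASTQs are typically merged per mate before alignment, keeping _1 and _2 in the same lane order, and gzip files can simply be concatenated byte-for-byte. A minimal sketch (file names are illustrative):

    import shutil

    # Concatenate lane-level gzipped FASTQs into one file per mate; a stream
    # of gzip members is itself a valid gzip file.
    for mate in ("R1", "R2"):
        with open("merged_%s.fq.gz" % mate, "wb") as out:
            for part in ("lane1_%s.fq.gz" % mate, "lane2_%s.fq.gz" % mate):
                with open(part, "rb") as src:
                    shutil.copyfileobj(src, out)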

reduced.csv not being created

Hello,
When running my cases, I continually get an empty output table. I am faced with the following error:

"Traceback (most recent call last):
File "/home/dnygard/ViFi/scripts/merge_viral_reads.py", line 116, in
scores = read_scores_file(args.reducedName[0])
File "/home/dnygard/ViFi/scripts/merge_viral_reads.py", line 19, in read_scores_file
input = open(hmm_file, 'r')
IOError: [Errno 2] No such file or directory: 'tmp/temp/reduced.csv'
0"

I am wondering at which step tmp/temp/reduced.csv is produced, so that I can trace the source of this error. If you have any suggestions, they would be much appreciated. Thank you.

Problems with --disable_hmms

Hi,

I have encountered problems running ViFi on the EBV genome with --disable_hmms. The possible bug leads to 0 clusters in output.clusters.txt and output.clusters.txt.range for several samples that clearly have traces of EBV integration. The outputs.trans.bam files contain hundreds of reads, though, so ViFi seems to have successfully identified the integrations.

This is a possible duplicate of issue #4, but that issue doesn't seem to have been answered explicitly. The traceback indicates that the script merge_viral_reads.py is run regardless of the --disable_hmms option; nevertheless, it requires the file tmp/temp/reduced.csv, which can only be generated by run_hmms.py.

The actual command and the traceback:
python ${VIFI_DIR}/scripts/run_vifi.py -f ${FQ1} -r ${FQ2} -o ${vifi_output_dir} -v ebv --cpus 8 --disable_hmms 1

4017.630011 45100000 reads done: #(Trans reads) = 995 38 D7ZQJ5M1:683:C4BGFACXX:6:2315:14661:71501 D7ZQJ5M1:683:C4BGFACXX:6:2315:14367:71714
4026.487438 45200000 reads done: #(Trans reads) = 998 38 D7ZQJ5M1:683:C4BGFACXX:6:2316:9952:35091 D7ZQJ5M1:683:C4BGFACXX:6:2316:9757:35047
Traceback (most recent call last):
  File "/home/scripts/get_trans_new.py", line 238, in <module>
    miscFile.write(b)
AttributeError: 'NoneType' object has no attribute 'write'
[Finished identifying chimeric reads]: 6156.258875
[Cluster and identify integration points]: 6156.258919
Traceback (most recent call last):
  File "/home/scripts/merge_viral_reads.py", line 128, in <module>
    scores = read_scores_file(args.reducedName[0])
  File "/home/scripts/merge_viral_reads.py", line 21, in read_scores_file
    input = open(hmm_file, 'r')
IOError: [Errno 2] No such file or directory: 'tmp/temp/reduced.csv'
0
[Finished cluster and identify integration points]: 6158.720271

Thank you in advance,
Sergei
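A minimal illustration of the kind of guard that would avoid the crash - treating a missing reduced.csv as "no HMM scores" when --disable_hmms has skipped run_hmms.py. The function name and the file's column layout here are hypothetical, not ViFi's actual code:

    import os

    def read_scores_file_safe(path):
        # reduced.csv is only written by run_hmms.py, so tolerate its absence
        # when --disable_hmms is set. The comma-split is a placeholder; the
        # file's real columns may differ.
        if not os.path.exists(path):
            return {}
        scores = {}
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split(",")
                scores[fields[0]] = fields[1:]
        return scores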

Change output to use position of most representative strain in cluster

As multiple different viral strains can exist in the reference database, a cluster of chimeric reads might end up having the viral portion map to multiple different strains if the region is highly similar. To get a better output, we should report the viral position of the most representative strain.
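A minimal sketch of one way to do this, assuming each chimeric read in a cluster records the viral strain it mapped to and its position on that strain (the data layout is hypothetical):

    from collections import Counter

    def representative_strain(cluster_reads):
        # cluster_reads: iterable of (strain_name, viral_position) pairs.
        # Pick the strain most reads map to, then report the position range
        # of the cluster on that strain.
        counts = Counter(strain for strain, _ in cluster_reads)
        best, _ = counts.most_common(1)[0]
        positions = [pos for strain, pos in cluster_reads if strain == best]
        return best, min(positions), max(positions)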

Inquiry

Hello!
May I ask for help?
I am not sure what the problem is: I cannot find pysam/libcalignmentfile.pyx, only a file called libcalignmentfile.pxd. I used the HMM files and test FASTA files provided by ViFi, and I finished running the HMMs. Could you please tell me how to deal with the following error?

Traceback (most recent call last):
  File "/home/brz/wsy/vifi/ViFi/scripts/merge_viral_reads.py", line 118, in <module>
    outputFile.header['SQ'] = references
  File "pysam/libcalignmentfile.pyx", line 537, in pysam.libcalignmentfile.AlignmentHeader.__setitem__
TypeError: AlignmentHeader does not support item assignment (use header.to_dict()

Usage: samtools sort [options] <in.bam> <out.prefix>

Would ViFi remove duplicate reads?

Hi,

I would like to know whether ViFi removes duplicate reads before determining the integration sites. If so, in which step does ViFi perform deduplication?

Thanks in advance.

Gina
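If deduplication matters for an analysis, one generic option is to mark duplicates upstream (for example with samtools markdup or Picard MarkDuplicates) and drop the flagged reads before any downstream step. A minimal pysam sketch, with illustrative file names:

    import pysam

    # Copy a duplicate-marked BAM, keeping only reads not flagged as duplicates.
    with pysam.AlignmentFile("marked.bam", "rb") as infile, \
         pysam.AlignmentFile("dedup.bam", "wb", template=infile) as outfile:
        for read in infile:
            if not read.is_duplicate:
                outfile.write(read)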

Problems with understanding output.clusters.txt and *.range files

Hi!

I have run ViFi on a TCGA patient and now I have a couple of questions. Shouldn't the Min and Max columns of output.clusters.txt and output.clusters.txt.range be the same? According to the docs for output.clusters.txt:

The first line is the header information. Afterward, each integration cluster is separated by a line containing =. The first line of an integration cluster describes the following:

  1. Reference chromosome (chr19)
  2. Minimum reference position of all mapped reads belonging to that cluster (36212224)
  3. Maximum reference positions of all mapped reads belonging to that cluster (36212932)

And for output.clusters.txt.range:

The first line is header information. Afterward, each line is information about the cluster. For example:

  1. Reference chromosome (chr19)
  2. Minimum reference position of all mapped reads belonging to that cluster (36212224)
  3. Maximum reference positions of all mapped reads belonging to that cluster (36212932)

However, the Min and Max columns of output.clusters.txt and output.clusters.txt.range in the test sample provided on GitHub do not correspond to each other. Similarly, on my sample the ViFi output was the following.
output.clusters.txt (grepped only the header lines):

chr9 102677385 102677660 54 53 1
chr9 102691056 102691123 6 5 1
chr9 102713338 102713581 54 52 2
chr9 102714438 102714529 80 2 78
chr9 102716962 102717461 1091 1087 4
chr9 102719829 102720390 265 265 0
chr9 103083283 103084075 886 1 885
chr9 103088570 103088778 157 0 157
chr9 103090067 103090202 13 0 13

And output.clusters.txt.range:

Chr,Min,Max,Split1,Split2
chr9,102677612,102677643,-1,-1
chr9,102691059,102691122,-1,-1
chr9,102713404,102713566,-1,-1
chr9,102714438,102714472,-1,-1
chr9,102716962,102717446,-1,-1
chr9,102720073,102720373,-1,-1
chr9,103083283,103083410,-1,-1
chr9,103088570,103088870,-1,-1
chr9,103090067,103090367,-1,-1

Also, it seems that no split reads were found in the TCGA sample despite there being many reads in the integration clusters - could you suggest what the reason for that might be?

PS: I created a custom Docker image from namphuon/vifi and launched ViFi through Python inside the container, without the --docker flag, so in principle it should work this way - shouldn't it?

Thanks in advance,
Sergei
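For anyone comparing the two files programmatically, here is a minimal parse of the .range file shown above (column meanings per the quoted docs; the -1 values in Split1/Split2 appear to mark clusters without supporting split reads):

    import csv

    # Read output.clusters.txt.range, whose header is Chr,Min,Max,Split1,Split2.
    with open("output.clusters.txt.range") as handle:
        for row in csv.DictReader(handle):
            chrom = row["Chr"]
            low, high = int(row["Min"]), int(row["Max"])
            split1, split2 = int(row["Split1"]), int(row["Split2"])
            print(chrom, low, high, split1, split2)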

IOError: [Errno 2] No such file or directory: 'tmp/temp/hmmsearch.0'

Hi
I followed the install instructions (for the source-code version, not the Dockerized version, because my server platform is not on the supported platform list for Docker installation) and everything seemed to go well. Then I ran run_vifi.py on the test data (test_R1.fq.gz and test_R2.fq.gz) but ran into a problem:

[screenshot: the IOError shown in the title - 'tmp/temp/hmmsearch.0' not found]

I wonder what could be the cause of the problem. Thanks!

EBV support

Dear ViFi team,
Do you plan to support detection of EBV? If not, could you please direct me on how to build HMM models for EBV?

AttributeError: 'NoneType' object has no attribute 'write'

Hi, Nam,

I always get this error when running my data with ViFi:

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4 -M ../db/data/hpv/hg19_hpv.fas ../CL100076810_L01_582_clean_1.fq.gz ../CL100076810_L01_582_clean_2.fq.gz
[main] Real time: 613.215 sec; CPU: 2120.682 sec
Traceback (most recent call last):
  File "../program/ViFi/scripts/get_trans_new.py", line 238, in <module>
    miscFile.write(b)
AttributeError: 'NoneType' object has no attribute 'write'

Do you have any idea why I get this error and how I could fix it?

Looking forward to your kind reply.

Best Regards,
Zhihua

hg19util module not found

Hi,
When trying to run ViFi for the first time after installing all the dependencies in the README, I got an error that the hg19util module could not be found. The only reference to a Python module with this name on Google is the one in AmpliconArchitect. You reference AmpliconArchitect in the ViFi paper, but it is not listed as a dependency. Is this a dependency that has gone unlisted, or does the hg19util module come from somewhere else?
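One workaround while the dependency is undocumented: the tracebacks elsewhere on this page reference /home/scripts/hg19util.py inside the Docker image, and the module also ships with AmpliconArchitect, so making whichever copy you have importable should unblock the run. A minimal sketch with a placeholder path:

    import sys

    # Point this at whichever checkout actually contains hg19util.py
    # (e.g. AmpliconArchitect's source tree) before ViFi's imports run.
    sys.path.append("/path/to/AmpliconArchitect/src")
    import hg19util as hg19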

No chimeric results generated from the test data

Hello, after running the test data with the default settings, I got no results in the .clusters.txt and .clusters.txt.range files. I also got some warnings during the run:

  File "pysam/libcalignmentfile.pyx", line 537, in pysam.libcalignmentfile.AlignmentHeader.__setitem__
TypeError: AlignmentHeader does not support item assignment

Also, is it normal to see "[M::bwa_idx_load_from_disk] read 0 ALT contigs"?

When I tested other samples, the results were similar. I also got another warning:

  File ".../ViFi/scripts/get_trans_new.py", line 238, in <module>
    miscFile.write(b)
AttributeError: 'NoneType' object has no attribute 'write'

I am new to Python and currently have no idea how to fix them. Could you help with this? Thank you in advance.

Below is the run log for the test data:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 39718 sequences (4964750 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 986, 34, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (251, 281, 307)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (139, 419)
[M::mem_pestat] mean and std.dev: (279.25, 41.03)
[M::mem_pestat] low and high boundaries for proper pairs: (83, 475)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (7613, 7658, 7687)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (7465, 7835)
[M::mem_pestat] mean and std.dev: (7650.35, 41.77)
[M::mem_pestat] low and high boundaries for proper pairs: (7391, 7909)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation RF
[M::mem_process_seqs] Processed 39718 reads in 19.878 CPU sec, 19.911 real sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa mem -t 1 -M .../ViFi/data//hpv/hg19_hpv.fas .../ViFi/test/data/test_R1.fq.gz .../ViFi/test/data/test_R2.fq.gz
[main] Real time: 56.086 sec; CPU: 28.238 sec
19859 17365 13
Prepared sequences for searching against HMMs: 0.120441s
Running HMMs
Running HMM .......

Finished running against HMMs: 6113.827747s
Processing results

Traceback (most recent call last):
  File ".../ViFi/scripts/merge_viral_reads.py", line 118, in <module>
    outputFile.header['SQ'] = references
  File "pysam/libcalignmentfile.pyx", line 537, in pysam.libcalignmentfile.AlignmentHeader.__setitem__
TypeError: AlignmentHeader does not support item assignment (use header.to_dict()
0
[Running BWA]: 0.032352
[Finished BWA]: 56.216900
[Identifying chimeric reads]: 56.231233
[Finished identifying chimeric reads]: 61.031476
[Running HMMS]: 61.031561
[Finished running HMMS]: 6177.171714
[Cluster and identify integration points]: 6177.172175
[Finished cluster and identify integration points]: 6184.909763

Failed to open file output.bam

I followed the instructions (installed the dependencies, set the paths, downloaded the data, and indexed the reference sequence), and everything went well. Then I ran run_vifi.py using the test data (test_R1.fq.gz and test_R2.fq.gz) and ran into a problem: Failed to open file output.bam.
Please help!

Incorrect link of Step4

Hi,

I am trying to download the data repository for Step 4 and clicked the link, but it's empty. May I have your help to fix it?

FYI, the cluster server at our school does not have Docker installed, and we are not allowed to install it either. So I think I need to go through Step 4?

Thanks,
Wenjin
