pacificbiosciences / falcon

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Home Page: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

License: Other

Shell 27.04% Python 65.17% C 7.57% Makefile 0.22%

falcon's People

Contributors

abretaud, armintoepfer, bwlang, cdunn2001, cschin, imoteph, isovic, isugif, lhon, mjhsieh, pb-dseifert, pb-isovic, pb-jchin, pbjd, rlleras, wenchaolin


falcon's Issues

Variable 'script_fn' referenced before assignment on job_type = local

I installed FALCON according to the instructions, with the following two changes:

  • Ran make before checking out 97b0c27a in DALIGNER to compile DB2Falcon and LA4Falcon
  • Added job_type = local to the e-coli test config file.

When running, I very quickly hit the following error:

"fc_env/lib/python2.7/site-packages/falcon_kit-0.2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fc_run.py", line 93, in run_script
    fc_run_logger.info( "executing %s locally, start job: %s " % (script_fn, job_name) )
UnboundLocalError: local variable 'script_fn' referenced before assignment
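A minimal sketch (not FALCON's actual run_script) of how this class of error arises: script_fn is assigned only inside branches keyed on the job_type value, so an unrecognized or unparsed value falls through to the logging call with the variable still unbound. The branch structure and names below are assumptions for illustration only.

# Hedged sketch, not FALCON's code: script_fn is bound only inside the
# recognized job_type branches, so any other value (a typo, trailing
# whitespace, or an option placed in the wrong config section) reaches the
# logging call with script_fn still unassigned -> UnboundLocalError.
import logging

fc_run_logger = logging.getLogger("fc_run")

def run_script(job_data, job_type="SGE"):
    job_name = job_data["job_name"]
    if job_type == "SGE":
        script_fn = job_data["script_fn"]   # assigned here ...
        # ... qsub the script ...
    elif job_type == "local":
        script_fn = job_data["script_fn"]   # ... or here
        # ... run the script with subprocess ...
    # A defensive fix would be to raise a clear error for an unknown
    # job_type before reaching this point.
    fc_run_logger.info("executing %s locally, start job: %s " % (script_fn, job_name))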

fc_run.py crashes immediately with a traceback similar to the one seen when different SMRT Cells are mixed in the same fasta file

Hi, Jason,

Traceback is below. I know this looks like the traceback when one has sequences in the same fasta file that differ in the characters prior to the "/". However, I checked (by program) and this is not the case. Is there any other cause of this error? Anything I can do to debug it?

Thanks,
David

fc_run.py fc_run_human.cfg
No target specified, assuming "assembly" as target
Your job 4035833 ("build_rdb-92665f66") has been submitted
Exception in thread Thread-6:
Traceback (most recent call last):
File "/net/gs/vol3/software/modules-sw/python/2.7.3/Linux/RHEL6/x86_64/lib/python2.7/threading.py", line 551, in *bootstrap_inner
self.run()
File "/net/gs/vol3/software/modules-sw/python/2.7.3/Linux/RHEL6/x86_64/lib/python2.7/threading.py", line 504, in run
self.__target(_self.__args, _self.__kwargs)
File "/net/gs/vol1/home/dgordon/falcon/FALCON-master/install/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 317, in __call

runFlag = self._getRunFlag()
File "/net/gs/vol1/home/dgordon/falcon/FALCON-master/install/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 147, in _getRunFlag
runFlag = any( [ f(self.inputDataObjs, self.outputDataObjs, self.parameters) for f in self._compareFunctions] )
File "/net/gs/vol1/home/dgordon/falcon/FALCON-master/install/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 812, in timeStampCompare
if min(outputDataObjsTS) < max(inputDataObjsTS):
ValueError: max() arg is an empty sequence

(several more like this in different threads)

falcon hangs with ecoli test dataset

With the ecoli test of falcon, it doesn't complete for
me. It gets to a point where fc_run.py is using about 0.7% of cpu
with no sge jobs running. Up to this point everything works fine:
many d_8b7e4620_raw_reads-6b44f214 jobs are submitted and complete,
many m_00011_raw_reads-77b3196c jobs are submitted and complete,
many ct_00012-8fd20896 jobs are submitted and complete.

I didn't find any errors or warnings in any of the logs I am aware of.

Then 8 jobs with names such as:
d_c6b8e2ad_preads-f4c28b8a
are submitted. They complete in less than 30 seconds, and that is the
end. Forever after (I've let it go for over an hour), fc_run.py just
sits there.

At this point the 0-rawreads directory has cns_done, da_done, and
rdb_build_done as well as many .las files (as well as m_* and job*
subdirectories).

The 1-preads_ovl directory has rdb_build_done as well as 6 m_* and
8 job* subdirectories. It has a large preads_norm.fasta file.

The 2-asm-falcon directory is empty.

Anyone have any things I could investigate?

Note: the only thing I changed from the normal install was changing
"-pe smp" to "-pe orte" because our sge doesn't have "-pe smp". (I've
also tried "-pe serial".)

Missing daligner las file

On some file systems, the .las output files might not be created correctly due to a file-system issue. The log shows the block comparisons finished fine, but empty files are generated. This makes the merging step fail, even though the root cause is not in the merging step. It is really a file-system problem, but it is worth tracking to see whether there is a workaround.

Example:

$ find . -wholename "*job*.las" | xargs ls -l | awk '$6 == 0' | less
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 08:14
./job_00087/preads.59.preads.88.N0.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 08:14
./job_00087/preads.59.preads.88.N2.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 08:14
./job_00087/preads.88.preads.59.N0.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 08:15
./job_00028/preads.25.preads.29.C3.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 08:15
./job_00028/preads.29.preads.25.C0.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 15:17
./job_00109/preads.110.preads.36.N2.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 15:17
./job_00109/preads.36.preads.110.N0.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 15:17
./job_00109/preads.36.preads.110.N3.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 12:20
./job_00108/preads.18.preads.109.N3.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 12:20
./job_00108/preads.109.preads.18.N3.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 12:20
./job_00108/preads.18.preads.109.N0.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 12:20
./job_00108/preads.18.preads.109.N2.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 12:22
./job_00098/preads.16.preads.99.N0.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 12:22
./job_00098/preads.16.preads.99.N2.las
-rw-r--r-- 1 jchin Domain Users      0 Dec  3 12:22
./job_00098/preads.99.preads.16.N0.las

All the daligner logs show these blocks were compared successfully.

In this case, the workflow just idles, since the necessary sentinel files are not generated because of the errors.

To recover, one needs to terminate fc_run.py, kill all jobs, remove the sentinel files job_*done for the problematic jobs, and restart the workflow.
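A hedged sketch of that recovery step in Python (run it only after terminating fc_run.py and killing the jobs): find the zero-length .las files and remove the done sentinels of the affected job_* directories so the workflow re-runs them on restart. The glob patterns and sentinel naming (job_*done inside each job directory) are assumptions based on the listing above.

# Hedged recovery sketch; run from inside 0-rawreads or 1-preads_ovl.
# The patterns below are assumptions based on the listing in this issue.
import glob
import os

affected = set()
for las in glob.glob("job_*/*.las"):
    if os.path.getsize(las) == 0:            # empty output despite a clean log
        print("empty las file:", las)
        affected.add(os.path.dirname(las))

for job_dir in sorted(affected):
    for sentinel in glob.glob(os.path.join(job_dir, "job_*done")):
        print("removing sentinel:", sentinel)
        os.remove(sentinel)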

easy question about .raw_reads.bps

I see daligner processes running:

daligner -v -t16 -H12000 -e0.7 -s1000 raw_reads.3 raw_reads.1 raw_reads.2 raw_reads.3

Where are these "files" raw_reads.3 raw_reads.1 raw_reads.2 ?

I'll bet daligner pulls them out of .raw_reads.bps

Correct?

Recommendations for increasing assembly length

Following up on https://twitter.com/johnomics/status/557816525923811328

I'm trying to use FALCON to assemble a small amount of PacBio data from the butterfly Heliconius melpomene. The genome size estimate is 292 Mb (from flow cytometry). We have ~20x coverage with P4/C2, corrected with PBcR using a mixture of Illumina and 454 data (PacBio data is available here: http://www.ebi.ac.uk/ena/data/view/ERP005954). All the sequence has come from a partially inbred strain; roughly two thirds of the genome is still heterozygous. I am using the PacBio data to scaffold our existing genome assembly, so want to maximise length over basepair quality for the PacBio assembly.

I have tried a range of FALCON assemblies, summarised below. The results are very impressive, especially considering our limited data, and I'm willing to just stick with what we've got. However, the default Celera assembly produced by PBcR is 407 Mb long (22k scaffolds, N50 32 kb). I assume a lot of this is haplotype sequence, but even so it suggests that FALCON may be rejecting some parts of the genome during assembly. So I'm wondering if there's a way to tweak FALCON to include more sequence. Here's what I've tried so far, just including the parameters I've changed:

Version | length_cutoff | length_cutoff_pr | max_diff | max_cov | min_cov | ovlp daligner -l | ovlp daligner -s | Mean read length | Read bases (bn) | Assembly size (Mb) | Scaffolds | N50 (kb)
1 | 3000 | 1200 | 20 | 30 | 2 | 500 | 1000 | 3,514 | 3.75 | 239.7 | 6,751 | 78
2 | 500 | 500 | 20 | 30 | 2 | 500 | 1000 | 2,644 | 4.17 | 239.9 | 6,843 | 78
3 | 500 | 500 | 40 | 60 | 1 | 500 | 1000 | 2,644 | 4.17 | 249.5 | 7,127 | 77
4 | 500 | 500 | 40 | 60 | 1 | 350 | 500 | 2,644 | 4.17 | 248.8 | 7,027 | 80
5 | 500 | 500 | 50 | 100 | 1 | 350 | 500 | 2,644 | 4.17 | 250.9 | 7,073 | 80

Any suggestions for further improvements that might push assembly length up to 290-300 Mb? Or is this likely to be the best we can do?

Meaning of columns from fc_ovlp_stats.py

Hello
I wanted to estimate the best coverage for the overlap filtering but am a little
bit confused by the output.

If I execute fc_ovlp_stats.py --n_core 20 --fofn las.fofn, I obtain the following:

None 13329 1 0     
000000000 13329 8 8    
000000002 10096 2 0    
000000003 11647 5 7    
000000004 14689 2 1    
000000005 13854 0 1    

Could you please explain what each column means?
Thanks

fc_ovlp_to_graph

I'm noticing that "reverse_end" is defined twice (exactly the same) and that there is a call to "oreverse_end", which is undefined. At least that is what PyDev tells me.

Best, Phil

About chemistry

Dear Jason,

Is it possible to use two different chemistries?

I have P4C2 and P5C3.

If I run P4C2 alone, the preads size is 7 Gb.

But when I run them together, it generates only 700 Mb.

Won

local running #question

Dear Jason,

I saw there is a job_type = local option to avoid using the SGE grid.

How can I use it?

Our grid does not allow job submission from compute nodes.

Thank you.

Won

FALCON vs HGAP > Celera > Quiver

I am just testing out FALCON and was wondering what the primary benefits are compared to the standard workflow of HGAP > Celera > Quiver. Is there an application where one would be better than another and if I run FALCON do I still need to follow it up with Celera and Quiver? Thanks for any insights you can provide.

falcon crash

Hi, Jason,

I've had fc_run.py crash on me a number of times with the python key error below. It isn't a show-stopper because I just restart fc_run.py and it then completes successfully, but it is a minor problem because I might not notice it is dead for a few days.

I believe the conditions under which it occurs are the following: some jobs are running and they die (for one reason or another). So I restart fc_run.py which restarts those jobs. This time they complete. fc_run.py sees their done files and then crashes with this key error.

I should mention that the crash doesn't always say:

'task://localhost/build_p_rdb'

Sometimes it says:

'task://localhost/m_00054_preads'
or
'task://localhost/ct_00102'

Is there a simple change I could make to controller.py so that it would handle this problem?

Thanks!
David

fc_run.py fc_run_human.cfg
No target specified, assuming "assembly" as target
Your job 4419007 ("ct_00353-06239db8") has been submitted
Your job 4419008 ("ct_00050-083f149c") has been submitted
Your job 4419009 ("ct_00333-0a7520bc") has been submitted
Traceback (most recent call last):
File "/net/gs/vol1/home/dgordon/falcon3/FALCON-master/install/fc_env/bin/fc_run.py", line 4, in
import('pkg_resources').run_script('falcon-kit==0.2.1', 'fc_run.py')
File "/net/gs/vol3/software/modules-sw/python/2.7.3/Linux/RHEL6/x86_64/lib/python2.7/site-packages/distribute-0.6.34-py2.7.egg/pkg_resources.py", line 505, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/net/gs/vol3/software/modules-sw/python/2.7.3/Linux/RHEL6/x86_64/lib/python2.7/site-packages/distribute-0.6.34-py2.7.egg/pkg_resources.py", line 1245, in run_script
execfile(script_filename, namespace, namespace)
File "/net/gs/vol1/home/dgordon/falcon3/FALCON-master/install/fc_env/lib/python2.7/site-packages/falcon_kit-0.2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fc_run.py", line 801, in
wf.refreshTargets(updateFreq = wait_time) #all
File "/net/gs/vol1/home/dgordon/falcon3/FALCON-master/install/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 519, in refreshTargets
self._refreshTargets(objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
File "/net/gs/vol1/home/dgordon/falcon3/FALCON-master/install/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 649, in _refreshTargets
task2thread[URL].join()
KeyError: 'task://localhost/build_p_rdb'
(fc_env){dgordon}e183:/net/eichler/vol20/projects/pacbio/nobackups/users/dgordon/falcon_chm13
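Not an official fix, but a hedged sketch of the kind of defensive change the question asks about, assuming (from the traceback) that task2thread maps task URLs to threading.Thread objects inside pypeflow's controller: tolerate a task URL with no recorded thread instead of raising KeyError, on the assumption that its outputs were already satisfied by an existing done file.

# Hedged sketch of a defensive helper that could replace the failing
# `task2thread[URL].join()` call shown in the traceback. The assumption
# (unverified) is that a missing entry means the task's outputs already
# existed, so there is simply no worker thread to wait for.
import logging

logger = logging.getLogger("pypeflow.controller")

def join_task_thread(task2thread, url):
    thread = task2thread.get(url)
    if thread is None:
        logger.warning("no running thread recorded for %s; assuming already done", url)
        return
    thread.join()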

sge killing off falcon jobs due to not enough memory

Hi, Jason,

Assembling drosophila as a test. I believe all jobs associated with directory
0-rawreads
have completed and falcon is now working on directory
1-preads_ovl

A number of the jobs with names such as 1-preads_ovl/job_d9d5103/rj_d9d5103a.sh
did not complete, I believe because SGE killed them off for high memory usage.

In the cfg file are sge parameters:

sge_option_da = -pe orte 1 -l mfree=30G
sge_option_la = -pe orte 2 -l mfree=6G
sge_option_pda = -pe orte 6 -l mfree=6G
sge_option_pla = -pe orte 2 -l mfree=6G
sge_option_fc = -pe orte 6 -l mfree=6G
sge_option_cns = -pe orte 6 -l mfree=6G

I'm not clear which sge option is used for this step.
Could you please correct the table (below)? That would answer this question,
and most future questions about which parameter to change when an SGE setting needs adjusting for a particular step:

sge_option_da used for making files in 0-rawreads
sge_option_la used for making files in 0-rawreads
sge_option_pda used for making files in 1-preads_ovl
sge_option_pla used for making files in 1-preads_ovl
sge_option_fc used for making files in 2-asm-falcon
sge_option_cns used for making files in 2-asm-falcon

(You almost say this in the documentation, but not quite, at least for me.)

Thank you!
David

LA4Falcon / fc_consensus.py stalled

These have been running for 2 days using, according to top, <0.5% CPU time. LA4Falcon is usually in D state (which I believe usually indicates it is waiting on a system resource). It is unlikely there is any disk or memory contention.

Any thoughts about how to find out what the problem is?

Running m_ processes (LAsort/LAmerge) on a local disk

Hi, Jason,

Falcon's LAsort/LAmerge stage was disrupting everyone else on the cluster so much that the decision was made to cut back the number of falcon jobs from 100 to 19 (which I think will mean a human assembly will take 10 weeks instead of 2). So I'm looking for ways to speed up the falcon assembly without disrupting others on the cluster.

Would Falcon speed up by running LAsort/LAmerge on a local disk? I am seeing that each LAsort/LAmerge job requires a large number of little files: 13,000 in one case, of which 12,500 are used just once. The files used just once (in the cases I looked at) are used by LAsort. To run LAsort off a local disk would mean copying these files to the local disk. Would the copying itself be as disruptive to other users as running LAsort on them? Would there be any net speedup for falcon?

Thanks,
david

Pacbio header line name inconsisten

Dear All,
I'm experiencing this error when I run Falcon:
File filtered_subreads.fasta, Line 2261629: Pacbio header line name inconsisten

I checked my fasta file and I found this read header

m131122_225813_42179_c100588442550000001823089804281463_s1_p0/13/0_10418

I assume that there is something wrong with it, but I do not know what.

thanks
Luigi
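A hedged checker sketch for this situation: it flags FASTA headers that do not look like PacBio movie/hole/start_end names, or files that mix more than one movie name. The exact format fasta2DB requires is an assumption here, so treat the output as a pointer to suspicious lines rather than a definitive diagnosis.

# Hedged sketch: report headers that do not match the assumed PacBio
# movie/hole/start_end pattern, and files mixing several movie names.
import re
import sys

pacbio_header = re.compile(r"^>(\S+)/\d+/\d+_\d+(\s.*)?$")

def check(path):
    movies = set()
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            if not line.startswith(">"):
                continue
            m = pacbio_header.match(line)
            if not m:
                print("%s:%d: unexpected header: %s" % (path, lineno, line.strip()))
                continue
            movies.add(m.group(1))
    if len(movies) > 1:
        print("%s: multiple movie names found: %s" % (path, ", ".join(sorted(movies))))

if __name__ == "__main__":
    for fasta in sys.argv[1:]:
        check(fasta)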

Wish list for ideal assembler output and post-assembly analysis

This is not about getting a perfect assembly, which is fundamentally a mathematical problem that needs to be solved mathematically. Here we are talking about a wish list of what an assembler can provide, from a researcher's point of view, to make exploring genomic features more efficient for research purposes.

  1. For each location in the assembly, provide some quality measurement about
    1.1 base correctness
    1.2 contiguity correctness
    1.3 alternative possibilities of the assembly model locally
  2. For a region in the assembly, provide ways to get
    2.1 the raw data that support the region of the assembly
    2.2 the algorithm evidence on constructing the final sequences for the region
    2.3 alignment comparison/visualization comparing to other known assembly models (reference sequences)
  3. Non-local connection view for the relation between contigs if information is available
  4. Ability to combine other information to disambiguate not-fully-resolved regions and generate refined assembly sequences
  5. Annotation cross-mapping between the assembly contigs and known references
  6. A well-designed UI in software for all of the above

genome size is underestimated

Hi all,
I was trying to assemble the genome of a fungal pathogen. I know that the genome is about 36 Mb, but the assembly output produces only a few contigs, for a total genome size of 2 Mb. Can you help me optimize the assembly parameters?
Here is the conf file:

[General]
job_type = local
# list of files of the initial bas.h5 files
input_fofn = input.fofn
#input_fofn = preads.fofn

input_type = raw
#input_type = preads


# The length cutoff used for seed reads used for initial mapping
length_cutoff = 5000

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 5000

# Cluster queue setting
sge_option_da = -pe smp 8 -q jobqueue
sge_option_la = -pe smp 2 -q jobqueue
sge_option_pda = -pe smp 8 -q jobqueue
sge_option_pla = -pe smp 2 -q jobqueue
sge_option_fc = -pe smp 24 -q jobqueue
sge_option_cns = -pe smp 8 -q jobqueue

pa_concurrent_jobs = 32
ovlp_concurrent_jobs = 32

pa_HPCdaligner_option =  -v -dal4 -t16 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s1000

pa_DBsplit_option = -x500 -s400
ovlp_DBsplit_option = -x500 -s400

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 6

overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 15 --bestn 10

thanks
Luigi

Alternative to SGE

Hello Jason
I would like to run Falcon on either a LSF cluster or on a single server.
Since your code is so far SGE-only, could you give a hint whether it might be rather trivial to adapt this
for either of them, or whether you would anticipate quite some work?
E.g. which executables will have to be modified and so on.
Cheers

Silent failure of daligner_p, fc_run.py hangs

I ran falcon on pre-corrected PacBio reads (from MHAP/PBcR), so skipping the error-correction steps. We don't have SGE (but SLURM) so I ran it with job_type = local. After all daligner_p processes were finished, nothing was happening. No new files, fc_run.py still showing up in 'top' and the run command not finished.

@pb-jchin suggested that one or more of the daligner jobs might have failed without a signal, and that I take a look at the folders and logs inside 1-preads_ovl for anomalies.

Indeed, I found a folder with far fewer files (3 instead of the usual 50-70, and no .las files):

find 1-preads_ovl | cut -d '/' -f 2 | uniq -c

Maybe a tip for others to always check these folders (and a feature request to catch this behavior :-) )
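A hedged sketch of that tip as a script: list each job_* directory under 1-preads_ovl (the path, and the heuristic that a folder with no .las file means the job silently failed, are assumptions) together with its file counts, so a nearly empty folder stands out.

# Hedged sketch: flag job directories with no .las output.
import glob
import os

for job_dir in sorted(glob.glob("1-preads_ovl/job_*")):
    files = os.listdir(job_dir)
    n_las = sum(1 for f in files if f.endswith(".las"))
    flag = "  <-- suspicious" if n_las == 0 else ""
    print("%-40s %4d files, %4d .las%s" % (job_dir, len(files), n_las, flag))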

Unable to decrease the Identity option ( -e ) below .70 in pa_HPCdaligner_option.

Hello,

We have sequenced 80x PacBio reads of the rice genome, but the quality of these data is relatively low (low alignment identity compared to the E. coli test data, fruit fly data, etc.).

Then I tried to assemble the genome using the parameters from the E. coli test, but the genome size is badly underestimated (290 Mb out of 370 Mb) and the N50 is only 100 kb.

I suspect there could be much lower identity among reads (80% identity to the rice reference, <70% read-to-read), so I tried to decrease the identity option in pa_HPCdaligner_option to .60 or .50.
The program returns the error message "HPCdaligner: Average correlation must be in [.7,1.) (0.5)".
Could you please allow some more freedom for this option in the code?
Also, do you have any experience or suggestions for dealing with such low-quality data?

Regards

Bin

Below are the rice raw reads: [image: ricerawreads]
Below are the rice corrected reads using FALCON: [image: ricecorrectedreads]
Below are the E. coli raw reads: [image: ecolirawreads]
Below are the E. coli corrected reads using FALCON: [image: ecolicorrectedreads]

falcon_overlap on large datasets

Hi,

I have a 9.4 GB preads fasta file. I have tried running with the following parameters:
--d_core 3 --n_core --min_len 8500 preads.fa > preads.ovl

Where --n_core was set at 64, 32, 24, 20 and all failed with memory error: MemoryError: out of memory

I have tried running it on 16 cores, but after a few days the processes just stay idle with high memory usage for 72 hours. Have you experienced this? I am able to successfully execute falcon_overlap on much smaller data sets, but not on this current one.

I am running this on a CentOS 6 box with 512 GB RAM and 64 cores. The files are stored on an NFS SSD RAID array.

Thank you,

Edwin

daligner job distribution

Currently, fc_run uses HPCdaligner to construct the daligner jobs. It would be useful to build the logic for distributing the jobs without calling HPCdaligner. This will be necessary if we want to be able to run daligner jobs progressively as new data are added. It would also let us eliminate the "find" command for the merge and sort step. A sketch of the idea follows.
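A hedged sketch of the distribution idea (an illustration, not HPCdaligner's actual scheme): compare each block i of the database against blocks 1..i so every block pair is computed exactly once, batching a few target blocks per command. With this triangular scheme, newly added blocks only need comparisons against the existing blocks plus themselves, which is what makes progressive runs possible. The database name, options and batch size are illustrative assumptions.

# Hedged sketch of generating daligner commands without HPCdaligner.
def daligner_jobs(db="raw_reads", n_blocks=8, options="-v -t16 -e0.7 -s1000", batch=4):
    for i in range(1, n_blocks + 1):
        targets = ["%s.%d" % (db, j) for j in range(1, i + 1)]   # blocks 1..i
        for k in range(0, len(targets), batch):                  # batch the targets
            yield "daligner %s %s.%d %s" % (options, db, i, " ".join(targets[k:k + batch]))

if __name__ == "__main__":
    for cmd in daligner_jobs():
        print(cmd)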

falcon running failed,help!

To assemble a human genome with PacBio-only sequencing data, we tried to apply the FALCON software. We installed it following the instructions, and no error turned up during installation; the installation information is attached. Here is our server environment, which also satisfies the installation requirements.

[smrtanalysis@node4 FALCON-master]$ uname -a
Linux node4.bnu.edu.cn 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

[smrtanalysis@node4 FALCON-master]$ gcc --version
gcc (GCC) 4.8.2

[smrtanalysis@node4 FALCON-master]$ python --version
Python 2.7.3

However, when I ran FALCON following the instructions, it failed; the runtime error information is also attached.

fc_run.cfg:
{begin}
[General]
# list of files of the initial bas.h5 files
input_fofn = input.fofn
#input_fofn = preads.fofn
input_type = raw
#input_type = preads
# The length cutoff used for seed reads used for initial mapping
length_cutoff = 12000
# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 12000
sge_option_da = -l h=nodesuper -pe openmp 4 -q all.q
sge_option_la = -l h=nodesuper -pe openmp 4 -q all.q
sge_option_pda = -l h=nodesuper -pe openmp 2 -q all.q
sge_option_pla = -l h=nodesuper -pe openmp 2 -q all.q
sge_option_fc = -l h=nodesuper -pe openmp 4 -q all.q
sge_option_cns = -l h=nodesuper -pe openmp 4 -q all.q
pa_concurrent_jobs = 16
cns_concurrent_jobs = 16
ovlp_concurrent_jobs = 16
pa_HPCdaligner_option =  -v -dal4 -t16 -e.70 -l1000 -s400
ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s400
pa_DBsplit_option = -x50 -s50
ovlp_DBsplit_option = -x50 -s50
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 2 --local_match_count_threshold 1 --max_n_read 10 --n_core 2
overlap_filtering_setting = --max_diff 50 --max_cov 30 --min_cov 20 --bestn 10
{end}
{begin}
ERROR 1(input_type = raw):
  command:  vi  ./0-rawreads/job_10684559/rj_10684559.log

"15079 vs 15111 :   3
15104 vs 15111 :   4

     18,604,978 14-mers (7.440891e-09 of matrix)
          5,100 seed hits (2.039698e-12 of matrix)
            560 confirmed hits (2.239669e-13 of matrix)
*** glibc detected *** daligner: double free or corruption (out): 0x0000003ea0c04b30 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3ea0c75366]
/lib64/libc.so.6[0x3ea0c77e93]
daligner[0x401ea6]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3ea0c1ecdd]
daligner[0x401339]
======= Memory map: ========
00400000-0041f000 r-xp 00000000 00:1c 113118876                          /home/nfs_UserData_220/falcon/installation_dir/FALCON-master/fc_env/bin/daligner
0061e000-0061f000 rw-p 0001e000 00:1c 113118876                          /home/nfs_UserData_220/falcon/installation_dir/FALCON-master/fc_env/bin/daligner"
{end}

{begin}
command:    gdb ./0-rawreads/job_58132038/core.9839
"[New Thread 9839]
Core was generated by `daligner -v -t16 -H3000 -e0.7 -s1000 raw_reads.1 raw_reads.1'.
Program terminated with signal 6, Aborted."
{end}

{begin}
ERROR 2(input_type = preads):
when running the command "fc_run.py fc_run.cfg", it shows

"No target specified, assuming "assembly" as target 
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/smrtanalysis/install/smrtanalysis_2.3.0.140936/redist/python2.7/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/home/smrtanalysis/install/smrtanalysis_2.3.0.140936/redist/python2.7/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/nfs_UserData_220/falcon/installation_dir/FALCON-master/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 317, in __call__
    runFlag = self._getRunFlag()
  File "/home/nfs_UserData_220/falcon/installation_dir/FALCON-master/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 147, in _getRunFlag
    runFlag = any( [ f(self.inputDataObjs, self.outputDataObjs, self.parameters) for f in self._compareFunctions] )
  File "/home/nfs_UserData_220/falcon/installation_dir/FALCON-master/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 812, in timeStampCompare
    if min(outputDataObjsTS) < max(inputDataObjsTS):
ValueError: max() arg is an empty sequence

Exception in thread Thread-4:
Traceback (most recent call last):
  File
"/home/smrtanalysis/install/smrtanalysis_2.3.0.140936/redist/python2.7/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/home/smrtanalysis/install/smrtanalysis_2.3.0.140936/redist/python2.7/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/nfs_UserData_220/falcon/installation_dir/FALCON-master/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 317, in __call__
    runFlag = self._getRunFlag()
  File "/home/nfs_UserData_220/falcon/installation_dir/FALCON-master/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 147, in _getRunFlag
    runFlag = any( [ f(self.inputDataObjs, self.outputDataObjs, self.parameters) for f in self._compareFunctions] )
  File "/home/nfs_UserData_220/falcon/installation_dir/FALCON-master/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 812, in timeStampCompare
    if min(outputDataObjsTS) < max(inputDataObjsTS):
ValueError: max() arg is an empty sequence"
{end}

{begin}
and also the file sge_log/falcon-ea66cd36.o9615 shows: "/home/sgeadmin/node10/node10/job_scripts/9615: line 3: 42517               core dumped)DB2Falcon preads"
"in <module>  for res in exe_pool.imap(filter_stage1, inputs): File  '/home/smrtanalysis/install/smrtanalysis_2.3.0.140936/redist/python2.7/lib/python2.7/multiprocessing/pool.py', line 626, in next raise value
OSError: [Errno 2] No such file or directory"
{end}

PS:
We have also run the E. coli example described in the manual, and got the same errors as above.

hardware environment:
supernode: 48 cores, 2.7 GHz, 600 GB memory x 1
node: 24 cores, 2.1 GHz, 64 GB memory x 8

Could you help us with these problems?

generate utg fasta

It would be useful to have FASTA output of the less-resolved (unitig) graph, so a function graph_to_utg_sequence is needed.

gcc 4.9.2 and FALCON

Hi,

My default gcc is version 4.9.2.

When I tried to compile the FALCON code, I got the following error:

$ python setup.py install --prefix=$HOME/src/FALCON
running install
/home/support/yzhang/.local/lib/python2.7/site-packages/setuptools-12.0-py2.7.egg/pkg_resources/__init__.py:2510: PEP440Warning: 'enstaller (4.7.0.dev1-55c9a8e)' is being parsed as a legacy, non PEP 440, version. You may find odd behavior and sort order. In particular it will be sorted as less than 0.0. It is recommend to migrate to PEP 440 compatible versions.
/home/support/yzhang/.local/lib/python2.7/site-packages/setuptools-12.0-py2.7.egg/pkg_resources/__init__.py:2510: PEP440Warning: 'supplement (0.5dev.dev202)' is being parsed as a legacy, non PEP 440, version. You may find odd behavior and sort order. In particular it will be sorted as less than 0.0. It is recommend to migrate to PEP 440 compatible versions.
running bdist_egg
running egg_info
creating falcon_kit.egg-info
writing requirements to falcon_kit.egg-info/requires.txt
writing falcon_kit.egg-info/PKG-INFO
writing top-level names to falcon_kit.egg-info/top_level.txt
writing dependency_links to falcon_kit.egg-info/dependency_links.txt
writing requirements to falcon_kit.egg-info/requires.txt
writing falcon_kit.egg-info/PKG-INFO
writing top-level names to falcon_kit.egg-info/top_level.txt
writing dependency_links to falcon_kit.egg-info/dependency_links.txt
writing manifest file 'falcon_kit.egg-info/SOURCES.txt'
reading manifest file 'falcon_kit.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'falcon_kit.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/falcon_kit
copying src/py/FastaReader.py -> build/lib.linux-x86_64-2.7/falcon_kit
copying src/py/__init__.py -> build/lib.linux-x86_64-2.7/falcon_kit
copying src/py/falcon_kit.py -> build/lib.linux-x86_64-2.7/falcon_kit
copying src/py/fc_asm_graph.py -> build/lib.linux-x86_64-2.7/falcon_kit
running build_ext
building 'falcon_kit.DW_align' extension
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/src
creating build/temp.linux-x86_64-2.7/src/c
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -O2 -fPIC -I/soft/python-epd/canopy/1.4.1/appdata/canopy-1.4.1.1975.rh5-x86_64/include/python2.7 -c src/c/DW_banded.c -o build/temp.linux-x86_64-2.7/src/c/DW_banded.o
src/c/DW_banded.c: In function 'align':
src/c/DW_banded.c:135:5: internal compiler error: Illegal instruction
max_d = (int) (0.3*(q_len + t_len));
^
0x87a75f crash_signal
../../gcc-4.9.2/gcc/toplev.c:337
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See http://gcc.gnu.org/bugs.html for instructions.
error: command 'gcc' failed with exit status 1

I thought you might be interested in knowing this.

For my installation, I reverted to gcc/4.8.2 after this failure, and installed FALCON.

Best,

  • Ying

n_core is hardcoded in fc_run.py for fc_ovlp_filter.py

Line 718 of fc_run.py uses a hardcoded number of cores. It would be better to have this as part of the overlap_filtering_setting (see the sketch after the quoted code).

717 script.append( """fc_ovlp_filter.py --fofn las.fofn %s
718 --n_core 24 --min_len %d > preads.ovl""" % (overlap_filtering_setting, length_cutoff_pr) )
719
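A hedged sketch of the suggested change, with the config option name (ovlp_filter_n_core) and its default hypothetical: take the core count from the config, falling back to the current hardcoded 24, and interpolate it into the generated command.

# Hedged sketch; "ovlp_filter_n_core" is a hypothetical option name and the
# default of 24 preserves the current hardcoded behaviour.
def ovlp_filter_command(config, overlap_filtering_setting, length_cutoff_pr):
    n_core = int(config.get("ovlp_filter_n_core", 24))
    return ("fc_ovlp_filter.py --fofn las.fofn %s --n_core %d --min_len %d > preads.ovl"
            % (overlap_filtering_setting, n_core, length_cutoff_pr))

# Example:
# ovlp_filter_command({}, "--max_diff 100 --max_cov 100 --min_cov 15 --bestn 10", 5000)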

Debug mode

Hi,
I was wondering if there is any kind of debug mode.
Currently I have a run where I do not get any error messages, but the 2-asm-falcon folder has just dummy files (empty files). I would like to find out where exactly the run fails, but I have no clue where to start.
SGE is not the problem, as I tried it in local mode and got the same result.
Maybe someone could describe how to analyze such a situation?

I first tried FALCON with the E. coli set and it looks like it finished without any problems (if it is correct that fc_ovlp_to_graph.log should then be empty).

assembly of error-corrected reads

Hi Jason,
I'm trying to assemble pacbio reads previously corrected with proovread. The genome is ~2Gb, and I used fc_run_LG.py.

All the jobs run through, but the output is really strange: all contigs I get (p_ctg.fa) are repetitions of the short k-mers TAACCC and TTAGGG (for example TTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG......)

The preads4falcon.fasta looks fine, so I grep'ed the seqs present in sg_edges_list from the preads4falcon, and they correspond exactly to these repeats. Thus, it seems that fc_ovlp_to_graph.py is printing only the bad stuff.

Would you have any ideas on why that is happening?

Thanks a lot!

PS: I have ~10.5 million sequences in preads4falcon.fasta, but only 252 entries in sg_edges_list.

determining how many jobs to go

Hi, Jason,

At the initial daligner stage (where it is working in 0-rawreads), it submits hundreds of jobs with names such as: d_7741e0bb_raw_reads-ddbf9706. Is there any way to determine how many more such jobs there will be? Or, since I know how many have been submitted (completed and still running), is there any way to know the total that must be completed? Is this information available, for example, in raw_reads.db (or can it be calculated from there)? I just want to know whether I am, for example, 30% done or 99% done.

Thanks,
David
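A hedged sketch of one way to get a rough percentage, assuming (as another issue below mentions) that 0-rawreads/run_jobs.sh lists every planned daligner command and that each finished job directory contains a job_*done sentinel; both the paths and the one-command-per-job-directory correspondence are assumptions, so treat the result as an estimate only.

# Hedged progress sketch: planned daligner commands vs. finished job dirs.
import glob
import os

planned = 0
with open("0-rawreads/run_jobs.sh") as fh:
    for line in fh:
        if line.lstrip().startswith("daligner"):
            planned += 1

finished = sum(
    1 for job_dir in glob.glob("0-rawreads/job_*")
    if glob.glob(os.path.join(job_dir, "job_*done"))
)

print("daligner jobs planned:  %d" % planned)
print("job dirs with sentinel: %d" % finished)
if planned:
    print("roughly %.0f%% done" % (100.0 * finished / planned))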

E.coli output

Hi,
I'm trying out FALCON with the E. coli test data, but I seem to get a different output (i.e. a different assembly size) compared to the numbers you published on Twitter a few days ago (4,631,625 b). I tried both config files from the 'examples' folder (fc_run_ecoli.cfg and fc_run_ecoli_2.cfg) and get different numbers for both (4,631,535 b and 4,631,559 b respectively).

I'm running Falcon on a CentOS 6.3 machine, and currently going for the job_type = local instead of SGE. The only message I get on the terminal, throughout the entire process, is 'No target specified, assuming "assembly" as target'. I also couldn't find any relevant .log files.

Anyways, even though the difference is minor, I find it still important to understand where this variability comes from. Who knows if and how it will scale up for larger genomes...

With which .cfg file did you assemble the E. coli data (those 3 fasta files from Dropbox)?
Also, do you think that different versions of Python, DAZZ_DB and DALIGNER could account for this difference?

Following this, it would be really great if:

  1. you could add a section to the manual with the expected results of the E.coli test;
  2. add a more detailed description of which files one should generally expect as output (for example, the a_ctg files), and what they represent.

Many thanks,
Juliana

issues with install_notes.sh script

Hi Jason,
There are multiple places in the install_notes.sh where there is hard coded /home instead of $HOME.
Also the very last section on installing samtools is missing the following line.
tar xvjf samtools-0.1.19.tar.bz2

Thanks,
Alicia

Installation Issue

Making a new thread, as per your request. I am hoping you can help me out with an installation issue I have been having at a national computing resource. When I try to rebuild with:

virtualenv --no-site-packages --always-copy $PWD/fc_env
. $PWD/fc_env/bin/activate
git clone https://github.com/cschin/pypeFLOW
cd pypeFLOW
python setup.py install

cd ..
git clone https://github.com/PacificBiosciences/FALCON.git
cd FALCON
python setup.py install

cd ..
git clone https://github.com/pb-jchin/DAZZ_DB.git
cd DAZZ_DB/
make
cp DBrm DBshow DBsplit DBstats fasta2DB ../fc_env/bin/

cd ..
git clone https://github.com/pb-jchin/DALIGNER.git
cd DALIGNER
make
cp daligner daligner_p DB2Falcon HPCdaligner LA4Falcon LAmerge LAsort ../fc_env/bin
cd ..

I see the error "line 6: /usr/bin/time: No such file or directory" when I try to run the test job, for all of the d_* error correction jobs in 0-rawreads. Apologies, but I didn't do the original install; some of the folks at this resource want me to prove the update fixes my issue by doing an install in one of my directories.

I am not sure how to remove "/usr/bin/time" from fc_run.py, as I don't see any instances in the script itself.
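The prefix is presumably added when the per-job shell scripts are generated rather than appearing verbatim in fc_run.py, but rather than guess where, a hedged search sketch like the one below can locate every installed file (and any already-generated job script) that contains the literal string; the search roots are examples to adjust to your own install and run directories.

# Hedged sketch: find which installed or generated files contain "/usr/bin/time".
import os

roots = ["fc_env/lib", "fc_env/bin", "0-rawreads"]   # adjust to your layout
needle = "/usr/bin/time"

for root in roots:
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    if needle.encode() in fh.read():
                        print(path)
            except (IOError, OSError):
                pass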

Seed sequence is not chosen according to design

When outputting the sequences from the daligner database to fc_consensus.py, the first read should be the seed sequence, and the rest of the sequences should be output in decreasing order of length. Currently, get_seq_data(config) sorts the whole array of reads by length, so if a read is longer than the seed read, it becomes the seed read. In the case where the supporting reads are trimmed to be contained within the seed read, this is not a problem. However, LA4Falcon does output full-length reads, so the seed read can change. This is a bug that will impact assembly performance.

The fix is easy:

-                    seqs.sort( key=lambda x: -len(x) )
+                    seqs = seqs[:1] + sorted(seqs[1:], key=lambda x: -len(x))
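A small self-contained illustration of why the one-line change matters: sorting the whole list can move a longer supporting read ahead of the seed, while the patched version keeps the seed read at index 0.

# Toy illustration of the fix above: the seed read must stay first even if a
# supporting read is longer.
seqs = ["SEED" * 3, "SUPPORT" * 4, "X" * 5]   # seed is 12 bp, one support is 28 bp

broken = sorted(seqs, key=lambda x: -len(x))                 # old behaviour: seed displaced
fixed = seqs[:1] + sorted(seqs[1:], key=lambda x: -len(x))   # patched behaviour

print(broken[0] == seqs[0])  # False -- a longer supporting read became the "seed"
print(fixed[0] == seqs[0])   # True  -- seed read preserved at position 0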

More details about length_cutoff?

Dear Dr. Chin,

While playing with different length_cutoff settings, I found that it doesn't seem to be a direct threshold on the output. For example, in 0-rawreads/preads/out.00001.fa (I assume these are error-corrected reads) of the E. coli example, many reads are shorter than 10,000 bp even though I set length_cutoff=12000.

Could you please share some details about how it works?

Thank you very much!

Best,
Yunfei

enhancements I made--happy to share

Summary of modifications to Falcon to improve performance and
debugging:

  1. Created a log for fc_run.py which logs:
    1. when a job is qsub'd (with the time and full qsub command)
    2. when a job's "done" file is found (with the time and job name)
    3. when it is waiting on the done file of a job (just logs this
      every 100 seconds)
This logging information helps in debugging when a process has
died (which happens for a variety of reasons).  When a process
dies, fc_run.py will wait forever for the done flag.  By looking
in this log file at the jobs that have been qsub'd but not yet
finished with "done", and then looking with qstat at the running
jobs, it is easy to find the jobs that have failed.  Using the
qsub line gives you the sh script which gives you the log files so
you can then find why they failed.

I've written a python program (happy to share) that does all of
this and prints out a list of any failed processes and their
stderrs.

Having the full qsub command allows you the ability to restart
from the command line the particular job that failed, without
having to restart the entire assembly.  Since fc_run.py is simply
looking for a done file, and doesn't care whether the done file is
created by the job it qsub'd or the one you qsub'd from the
command line, when your job completes, fc_run.py will carry on as
though there had been no problem.
  2. Rewrote the portion of fc_run.py that qsub's the LA4Falcon jobs.
    Instead it runs a python script I wrote which copies the needed
    files to /var/tmp on the host, runs LA4Falcon, writes the output
    back to the network drive and sets the done file (a sketch of such a wrapper appears after this list).

  3. Modified (per Jason's instructions) LA4Falcon so it uses mmap
    reads instead of normal reads.

The combination of 2 and 3 usually speeds up LA4Falcon by a factor of
between 10 and 100.  However, there are times when copying the
needed files (around 50GB) can itself take a long time (on the
order of a day) for reasons I don't understand (perhaps other
disk/network contention?)
  4. Modified LA4Falcon so it reports its progress to stderr. This enables me to
    estimate how quickly it is running and when it will complete.
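A hedged sketch of the staging wrapper described in item 2 above (not the author's actual script): copy the inputs a job needs to node-local /var/tmp, run the command there, copy the results back to the network directory, then touch the done file fc_run.py is polling for. Every path, file name and the command are placeholders.

# Hedged sketch of a node-local staging wrapper; all names are placeholders.
import os
import shutil
import subprocess
import tempfile

def run_staged(network_dir, input_names, command, output_names, done_file):
    scratch = tempfile.mkdtemp(dir="/var/tmp")
    try:
        for name in input_names:                       # stage inputs to local disk
            shutil.copy(os.path.join(network_dir, name), scratch)
        subprocess.check_call(command, cwd=scratch, shell=True)
        for name in output_names:                      # copy results back to NFS
            shutil.copy(os.path.join(scratch, name), network_dir)
        open(os.path.join(network_dir, done_file), "w").close()   # signal completion
    finally:
        shutil.rmtree(scratch, ignore_errors=True)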

improve consensus for diploid sample

Currently, the consensus module will not generate a base for het-SNP sites, as the coverage of the major base is less than 50% due to the heterozygosity. This can be fixed with some tweaking of the consensus algorithm.

falcon liver cancer assembly fails with UnboundLocalError

Hi jchin, when I run FALCON on liver cancer data, sge_log/falcon-793b5798.o12743 displays: "UnboundLocalError: local variable 'overlap_data' referenced before assignment".
Is there anything wrong with my fc_run.cfg, or should I install HBAR-DTK for large genome assembly?
Thank you.

fc_run.cfg:
[General]
# list of files of the initial bas.h5 files
input_fofn = input.fofn
#input_fofn = preads.fofn

input_type = raw
#input_type = preads

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 15000

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 15000


sge_option_da = -pe openmp 8 -q all.q
sge_option_la = -pe openmp 2 -q all.q
#6 seems too small... 8 might be better for Dmel
sge_option_pda = -pe openmp 8 -q all.q
sge_option_pla = -pe openmp 2 -q all.q
sge_option_fc = -pe openmp 23 -q all.q
sge_option_cns = -pe openmp 8 -q all.q

pa_concurrent_jobs = 32
ovlp_concurrent_jobs = 32

pa_HPCdaligner_option =  -v -dal128 -t16 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -dal128 -t32 -h60 -e.96 -l500 -s1000

pa_DBsplit_option = -x500 -s400
ovlp_DBsplit_option = -x500 -s400

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 6

overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 1 --bestn 10

Is there an estimate of running time?

Dear Dr. Chin,

Is there a way in Falcon to estimate running time? We have been running Falcon on a large data set for several days. I think it should have finished, based on previous analyses, but it's still running (if you use 'top' on the compute nodes, you can see daligner is working). Even an estimate of the number of remaining jobs would be helpful.

Thanks!

Best,
Yunfei

PypeFlow jobs failing intermittently

Two separate FALCON jobs have failed with very similar Tracebacks:

Traceback (most recent call last):
  File "/usr/local/bin/fc_run.py", line 5, in <module>
    pkg_resources.run_script('falcon-kit==0.2.1', 'fc_run.py')
  File "/usr/local/lib/python2.7/dist-packages/setuptools-2.1-py2.7.egg/pkg_resources.py", line 488, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.7/dist-packages/setuptools-2.1-py2.7.egg/pkg_resources.py", line 1338, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit-0.2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fc_run.py", line 738, in <module>
    wf.refreshTargets(updateFreq = wait_time) #all            
  File "/usr/local/lib/python2.7/dist-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 519, in refreshTargets
    self._refreshTargets(objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 649, in _refreshTargets
    task2thread[URL].join()
KeyError: 'task://localhost/m_00017_preads'
Traceback (most recent call last):
  File "/usr/local/bin/fc_run.py", line 5, in <module>
    pkg_resources.run_script('falcon-kit==0.2.1', 'fc_run.py')
  File "/usr/local/lib/python2.7/dist-packages/setuptools-2.1-py2.7.egg/pkg_resources.py", line 488, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.7/dist-packages/setuptools-2.1-py2.7.egg/pkg_resources.py", line 1338, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit-0.2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fc_run.py", line 653, in <module>
    wf.refreshTargets(updateFreq = wait_time) # larger number better for more jobs
  File "/usr/local/lib/python2.7/dist-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 519, in refreshTargets
    self._refreshTargets(objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 649, in _refreshTargets
    task2thread[URL].join()
KeyError: 'task://localhost/ct_00016'

In both cases the files for jobs 00017 / 00016 appeared to be complete, with no errors in the logs. Several other FALCON runs have completed without error (including an exact rerun of the second job).

I'm running on a single 64-core machine running Ubuntu 14.04. Concurrency settings for the second job:

pa_concurrent_jobs = 14
cns_concurrent_jobs = 14
ovlp_concurrent_jobs = 14
pa_HPCdaligner_option =  -v -dal4 -t16 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s1000
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 4


can falcon use local disk rather than nfs disk?

Hi, Jason,

Is there a method of having each of falcon's hundreds of jobs use /tmp or /var/tmp of the node it is running on, rather than the NFS-mounted disk? The I/O to /tmp or /var/tmp (being local) is much faster and doesn't slow down NFS-mounted disk access for everyone else on the cluster.

If this requires source-code changes, how difficult would it be? And how much improvement would it make (I assume each job would have to copy from nfs to local disk at the beginning of its run and then copy back to nfs after it has completed its run)? Are these changes I could make myself?

just asking...

Thanks,
david

About get_rdata.py

Hi!

I try to use FALCON for our species.

How can I define group_ID?

I've got 14 queries; do I need to run get_rdata.py 14 times?

Thank you.

Won

ls -l 1-dist_map-falcon/

drwxr-xr-x 16 wyim 32K Jun 23 12:53 ./
drwxr-xr-x 10 wyim 32K Jun 23 17:03 ../
-rw-r--r-- 1 wyim 0 Jun 16 18:41 gather_target_done
drwxr-xr-x 2 wyim 32K Jun 21 02:31 q00001_md/
drwxr-xr-x 2 wyim 32K Jun 21 00:10 q00002_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:22 q00003_md/
drwxr-xr-x 2 wyim 32K Jun 20 23:56 q00004_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:27 q00005_md/
drwxr-xr-x 2 wyim 32K Jun 20 23:48 q00006_md/
drwxr-xr-x 2 wyim 32K Jun 21 00:03 q00007_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:35 q00008_md/
drwxr-xr-x 2 wyim 32K Jun 20 23:52 q00009_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:39 q00010_md/
drwxr-xr-x 2 wyim 32K Jun 21 00:07 q00011_md/
drwxr-xr-x 2 wyim 32K Jun 20 23:59 q00012_md/
drwxr-xr-x 2 wyim 32K Jun 21 00:16 q00013_md/
drwxr-xr-x 2 wyim 32K Jun 21 02:24 q00014_md/
-rw-r--r-- 1 wyim 1.4K Jun 23 12:53 queries.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00001.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00002.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00003.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00004.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00005.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00006.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00007.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00008.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00009.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00010.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00011.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00012.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00013.fofn
-rw-r--r-- 1 wyim 100 Jun 16 18:41 query_00014.fofn
-rw-r--r-- 1 wyim 0 Jun 16 18:41 split_fofn_done
-rw-r--r-- 1 wyim 500 Jun 16 18:41 target_00001.fofn
-rw-r--r-- 1 wyim 500 Jun 16 18:41 target_00002.fofn
-rw-r--r-- 1 wyim 400 Jun 16 18:41 target_00003.fofn
-rw-r--r-- 1 wyim 1.4K Jun 23 12:52 target.fofn

difference between fc_run_LG.cfg/fc_run_LG.py and fc_run.cfg/fc_run.py

Dear Dr. Chin,

Are the *_LG files better for large genome assembly (e.g. human)? Could you please share something about the differences between these two sets of files?

Also, besides these example *.cfg files, do you have any recommended parameter settings for a high-coverage human genome? I heard from PacBio's recent AGBT presentation that Falcon can generate 10 Mb (N50) contigs, but we only got 50 kb (N50) contigs from the CHM1 data offered with Dr. Chaisson's paper.

I really appreciate your help. Thank you very much!

Best,
Yunfei

kmer_match.count might be zero for some hybrid assembly cases

An issue reported by @lexnederbragt by email, using falcon for hybrid reads:

Then I saw something. I ran this command

/node/work/lex/falcon/fc_env/bin/fc_run.py fc_run.cfg

and saved stdout and stderr, and the file contains this:

Traceback (most recent call last):
  File "/node/work/lex/falcon/fc_env/bin/fc_graph_to_contig.py", line 4, in <module>
    __import__('pkg_resources').run_script('falcon-kit==0.2.1', 'fc_graph_to_contig.py')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 505, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1245, in run_script
  File "/node/work/lex/falcon/fc_env/lib/python2.7/site-packages/falcon_kit-0.2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fc_graph_to_contig.py", line 300, in <module>
    aln_data, x, y = get_aln_data(base_seq, seq)
  File "/node/work/lex/falcon/fc_env/lib/python2.7/site-packages/falcon_kit-0.2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fc_graph_to_contig.py", line 68, in get_aln_data
    x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count)] )
ValueError: need more than 0 values to unpack
 No target specified, assuming "assembly" as target

The root cause is that the aligner can't find any k-mer match for the two sequences representing the two branches of a bubble. I have not seen this in the non-hybrid case. There might be something weird about the two sequences being aligned.
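A hedged sketch of a defensive guard around the failing line shown in the traceback (the names mirror get_aln_data in fc_graph_to_contig.py): when the aligner reports zero k-mer matches, return empty coordinate lists instead of letting zip(*...) raise. Whether "no k-mer match" should simply mean "no alignment data" is an assumption here, not the official fix.

# Hedged sketch of a guard for the zero-match case shown in the traceback.
def kmer_match_positions(kmer_match):
    if kmer_match.count == 0:
        # No k-mer match between the two bubble branches: return empty
        # coordinate lists instead of letting zip(*) raise ValueError.
        return (), ()
    return zip(*[(kmer_match.query_pos[i], kmer_match.target_pos[i])
                 for i in range(kmer_match.count)])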

daligner error in read correction step

I have been experiencing an issue where, during the raw read error correction phase when the initial d_* (daligner) jobs are running, the vast majority complete without issue but a small number never really start. When I check the individual log file (e.g. 0-rawreads/job_d72b16ef/rj_d72b16ef.log),
all I see is:

Fri Feb 20 09:20:28 PST 2015
daligner: System error, read failed!

I have tried many different avenues and can't find the cause of the error.

How many more LA4Falcon/fc_consensus.py jobs yet to be processed?

Hi, Jason,

When Falcon starts running the LA4Falcon | fc_consensus.py jobs, is there any way of determining how many are left to go (or how many total)?

You had pointed me to run_jobs.sh but that only refers to the daligner and LAsort/LAmerge jobs.

Thanks!
David

falcon immediate crash...doesn't like fasta file?

Hi, Jason,

I'm now trying yeast and drosophila test datasets. These are filtered PacBio reads (fastq which I converted to fasta).

falcon immediately crashes with this:

Exception in thread Thread-6:
Traceback (most recent call last):
File "/net/gs/vol3/software/modules-sw/python/2.7.3/Linux/RHEL6/x86_64/lib/python2.7/threading.py", line 551, in *bootstrap_inner
self.run()
File "/net/gs/vol3/software/modules-sw/python/2.7.3/Linux/RHEL6/x86_64/lib/python2.7/threading.py", line 504, in run
self.__target(_self.__args, _self.__kwargs)
File "/net/gs/vol1/home/dgordon/falcon/FALCON-master/install/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 317, in __call

runFlag = self._getRunFlag()
File "/net/gs/vol1/home/dgordon/falcon/FALCON-master/install/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 147, in _getRunFlag
runFlag = any( [ f(self.inputDataObjs, self.outputDataObjs, self.parameters) for f in self._compareFunctions] )
File "/net/gs/vol1/home/dgordon/falcon/FALCON-master/install/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 812, in timeStampCompare
if min(outputDataObjsTS) < max(inputDataObjsTS):
ValueError: max() arg is an empty sequence

There is one other clue: In the sge_logs directory build_rdb-af03c270.o4030757 says:

File yeast_filtered.fasta, Line 488307: Pacbio header line name inconsisten
DBsplit: Cannot open ./raw_reads.db for 'r'
cat: raw_reads.db: No such file or directory
HPCdaligner: Cannot open ./raw_reads.db for 'r'

I do notice that the header lines for your ecoli dataset have RQ=0.808 on each line while my
dataset for yeast and drosophila do not. Could that be the problem?

Thanks!
