smithlabcode / falco

A C++ drop-in replacement for FastQC to assess the quality of sequence read data

Home Page: https://falco.readthedocs.io

License: GNU General Public License v3.0

HTML 72.36% Makefile 0.32% Shell 0.38% C++ 24.40% M4 2.54%

falco's People

Contributors

andrewdavidsmith, guilhermesena1, masarunakajima, y9c

falco's Issues

Memory leak or stall?

Thanks for falco!

Running v1.2.0, installed from bioconda, on a nanopore FASTQ:

falco --format fastq -skip-report -t 1 -skip-summary nanopore.fastq.gz 
[limits]	using default limit cutoffs (no file specified)
[adapters]	using default adapters (no file specified)
[contaminants]	using default contaminant list (no file specified)
[Thu Sep 15 10:26:59 2022] Started reading file nanopore.fastq.gz
[Thu Sep 15 10:27:00 2022] reading file as gzipped FASTQ format
[running falco|===================================================|100%]
[Thu Sep 15 10:27:13 2022] Finished reading file
[Thu Sep 15 10:27:13 2022] Writing text report to ./fastqc_data.txt

It looks like falco generates fastqc_data.txt properly, but after that it consumes over 32 GB of RAM over several minutes until I kill it...
Is this expected? For comparison, FastQC processes the file in 18 s within 1 GB of RAM.

Here are the stats of the file:

seqkit stats nanopore.fastq.gz 
file              format  type  num_seqs      sum_len  min_len  avg_len  max_len
nanopore.fastq.gz  FASTQ   DNA     30,720  204,534,138      100    6,658   35,768

JSON Output

Hi and thanks for this wonderful FastQC alternative!

Would it be possible to export all results to a standardized, machine-readable JSON file, as fastp does, for example?
That would be awesome.
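In the meantime, the text report can be converted downstream. Below is a minimal sketch of such a converter (hypothetical, not part of falco) that parses the `>>Module<TAB>status ... >>END_MODULE` blocks of fastqc_data.txt into JSON:

```python
import json

def fastqc_data_to_json(text):
    """Parse fastqc_data.txt module blocks into a JSON string.

    Each module starts with '>>Name<TAB>status' and ends with
    '>>END_MODULE'; lines beginning with '#' are column headers.
    """
    modules = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            current = None
        elif line.startswith(">>"):
            name, _, status = line[2:].partition("\t")
            current = {"status": status, "header": None, "rows": []}
            modules[name] = current
        elif current is not None:
            if line.startswith("#"):
                current["header"] = line[1:].split("\t")
            else:
                current["rows"].append(line.split("\t"))
    return json.dumps(modules, indent=2)

# tiny synthetic example in fastqc_data.txt format
example = (">>Basic Statistics\tpass\n"
           "#Measure\tValue\n"
           "Total Sequences\t30720\n"
           ">>END_MODULE\n")
print(fastqc_data_to_json(example))
```

A native JSON writer in falco itself would of course be more robust than post-processing the text report.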

No results for Kmer content module in HTML and fastqc_data.txt

Falco version is 0.2.4. The Kmer module was turned on by changing the corresponding line in limits.txt to:

kmer 				ignore 		0

Falco flags Kmer Content as fail, but there is no Kmer content data in fastqc_data.txt:

>>Kmer Content	fail
#Sequence	Count	PValue	Obs/Exp Max	Max Obs/Exp Position
>>END_MODULE

The section is also not reported in the HTML file (screenshot attached).

For the same FASTQ file, FastQC displays the Kmer Content module as expected (screenshot attached).

Q: How to make `falco` work with `multiqc`?

As I understand it, the output of the current version (0.2.4) should be compatible with current MultiQC; I am using 1.9.

No matter which files I create with falco, multiqc always fails to find any analysis results.

sample_S17_L002_R1_001.fastq.gz_fastqc_data.txt
sample_S17_L002_R1_001.fastq.gz_fastqc_report.html
sample_S17_L002_R1_001.fastq.gz_summary.txt

Even when putting these files into sample_S17_L002_R1_001.fastq.gz.zip ... no success.

So I have obviously missed something, probably very basic or simple.

Any idea what I am doing wrong?

Non-identical length distribution

Same file. Running falco v1.2.1 from bioconda and MultiQC 1.12. This can be reproduced by running on nanopore data from the SRA with long read lengths.

MultiQC report of FastQC (screenshot attached):

MultiQC report of falco (screenshot attached):

I believe falco reports the length distribution at every individual length, while FastQC writes a binned histogram to fastqc_data.txt. Which is better? The granularity and detail are nice, but they can also clutter the plots. Should falco reproduce FastQC's behaviour or perform some kind of binning of read lengths? Interested in your thoughts.
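One possible binning scheme caps the number of histogram rows by growing the bin width with the length range. The sketch below illustrates the idea only; it is not FastQC's exact grouping rule:

```python
def bin_lengths(lengths, max_bins=50):
    """Group read lengths into at most max_bins equal-width bins,
    producing histogram-style labels like '100-3590'.
    A sketch of the binning idea, not FastQC's exact rule."""
    lo, hi = min(lengths), max(lengths)
    span = hi - lo + 1
    width = max(1, -(-span // max_bins))  # ceiling division
    bins = {}
    for n in lengths:
        start = lo + ((n - lo) // width) * width
        label = str(start) if width == 1 else f"{start}-{start + width - 1}"
        bins[label] = bins.get(label, 0) + 1
    return bins

# short reads keep single-length bins; long-read data gets wide bins
print(bin_lengths([100, 101, 102, 6000, 35000], max_bins=10))
```

With nanopore-scale length ranges this keeps the plot readable while preserving the overall shape of the distribution.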

setting "--format fastq" with gzip-compressed FASTQ crashes falco

and this is because it reads the gzip-compressed FASTQ as if it were a plain FASTQ file. I'm not 100% sure, but I believe FastQC runs normally when this set of parameters is used, so falco should behave the same way.

We should be able to figure out whether the input is compressed prior to processing; using igzfstream from smithlab_cpp would probably be the best option.
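Independently of which stream class is used, compression can be detected from the file contents rather than the file name by checking the two gzip magic bytes. A sketch of that check (illustrative; not necessarily what falco adopted):

```python
import gzip
import os
import tempfile

def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes
    0x1f 0x8b, regardless of its extension."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# demo: one gzipped and one plain copy of the same FASTQ record
tmp = tempfile.mkdtemp()
gz_path = os.path.join(tmp, "reads.fastq.gz")
txt_path = os.path.join(tmp, "reads.fastq")
record = b"@r1\nACGT\n+\nIIII\n"
with gzip.open(gz_path, "wb") as f:
    f.write(record)
with open(txt_path, "wb") as f:
    f.write(record)
print(looks_gzipped(gz_path), looks_gzipped(txt_path))
```

Content sniffing like this would also fix the later report about --format being ignored for files with unusual suffixes.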

adapter too long. Maximum adapter size is 32bp

Falco version 0.3.0.

Hello!
We want to switch from FastQC to falco.
We are using BGI adapters, one of which is 42 bp long.
There were no problems using it in FastQC.

With falco I receive the error:
adapter too long. Maximum adapter size is 32bp: AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG

I see that 32 is a hardcoded adapter size limit. Why?
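One plausible explanation (an assumption about falco's internals, not confirmed here) is that each adapter is packed into a single 64-bit word at 2 bits per base for fast matching, which caps adapters at 32 bases. The packing idea can be sketched as:

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_adapter(seq):
    """Pack a DNA sequence into one 64-bit integer, 2 bits per base.
    With 64 bits available, at most 32 bases fit -- which would explain
    a hard 32 bp limit (an assumption about falco's design)."""
    if len(seq) > 32:
        raise ValueError("adapter too long. Maximum adapter size is 32bp")
    word = 0
    for base in seq:
        word = (word << 2) | ENCODE[base]
    return word

print(pack_adapter("ACGT"))  # 0b00011011 == 27
```

Supporting longer adapters would then require either multiple words per adapter or truncating adapters to their first 32 bases.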

--threads option

Hello!
Thank you for the tool! Could you please say when you plan to add parallelization? It's critical for me, since I'm processing BAM files of 300 GB and more...

falco crashes with multiQC version 1.9

Tried running multiqc on a falco output directory and it crashed. Here is the error log:

[WARNING]          fastqc : Sample had zero reads: 'falco_out | DNA67062_S3_L001_R1_001'
[ERROR  ]         multiqc : Oops! The 'fastqc' MultiQC module broke...
 Please copy the following traceback and report it at https://github.com/ewels/MultiQC/issues
 If possible, please include a log file that triggers the error - the last file found was:
   falco_out/fastqc_data.txt
============================================================
Module fastqc raised an exception: Traceback (most recent call last):
 File "/home/xlinak/conda_envs/conda_med/lib/python3.8/site-packages/multiqc/multiqc.py", line 569, in run
   output = mod()
 File "/home/xlinak/conda_envs/conda_med/lib/python3.8/site-packages/multiqc/modules/fastqc/fastqc.py", line 117, in __init__
   self.overrepresented_sequences()
 File "/home/xlinak/conda_envs/conda_med/lib/python3.8/site-packages/multiqc/modules/fastqc/fastqc.py", line 824, in overrepresented_sequences
   max_pcnt   = max( [ float(d['percentage']) for d in self.fastqc_data[s_name]['overrepresented_sequences']] )
ValueError: max() arg is an empty sequence
============================================================

Request: support multiple FASTQ files in a subfolder

Falco version 1.0.0 does not construct the correct output directory when I run it on multiple FASTQ files within a subfolder.

Here is the directory structure:

$ tree demo/
demo/
├── test_rep1.fq.gz
└── test_rep2.fq.gz

Then run Falco as follows:

$ falco demo/*gz
[limits]        using default limit cutoffs (no file specified)
[adapters]      using default adapters (no file specified)
[contaminants]  using default contaminant list (no file specified)
[Fri Sep  2 08:46:59 2022] Started reading file demo/test_rep1.fq.gz
[Fri Sep  2 08:46:59 2022] reading file as gzipped FASTQ format
[running falco|===================================================|100%]
[Fri Sep  2 08:46:59 2022] Finished reading file
[Fri Sep  2 08:46:59 2022] Writing text report to demo/demo/test_rep1.fq.gz_fastqc_data.txt
[Fri Sep  2 08:46:59 2022] Writing HTML report to demo/demo/test_rep1.fq.gz_fastqc_report.html
Elapsed time for file demo/test_rep1.fq.gz: 0s
[limits]        using default limit cutoffs (no file specified)
[adapters]      using default adapters (no file specified)
[contaminants]  using default contaminant list (no file specified)
[Fri Sep  2 08:46:59 2022] Started reading file demo/test_rep2.fq.gz
[Fri Sep  2 08:46:59 2022] reading file as gzipped FASTQ format
[running falco|===================================================|100%]
[Fri Sep  2 08:46:59 2022] Finished reading file
[Fri Sep  2 08:46:59 2022] Writing text report to demo/demo/test_rep2.fq.gz_fastqc_data.txt
[Fri Sep  2 08:46:59 2022] Writing HTML report to demo/demo/test_rep2.fq.gz_fastqc_report.html
Elapsed time for file demo/test_rep2.fq.gz: 0s

$ falco --version
falco 1.0.0

The subfolder path is duplicated as follows, so the output files cannot be generated:
Writing text report to demo/demo/test_rep1.fq.gz_fastqc_data.txt
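The duplicated demo/demo/ prefix suggests the report path is built by joining the output directory with the full input path rather than with its basename. A sketch of the basename-based construction (illustrative only; report_path is a hypothetical helper, not falco's actual code):

```python
import os

def report_path(input_path, outdir, suffix="_fastqc_data.txt"):
    """Build the report path from the input file's basename only,
    so 'demo/test_rep1.fq.gz' with outdir 'demo' does not become
    'demo/demo/test_rep1.fq.gz...'."""
    return os.path.join(outdir, os.path.basename(input_path) + suffix)

print(report_path("demo/test_rep1.fq.gz", "demo"))
```

With this construction the output lands in demo/ exactly once, no matter how the input path was spelled on the command line.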

Overrepresented Sequences shows "no hit" for all sequences

We compared a FastQC run and a falco 0.2.4 (from bioconda) run and the Overrepresented sequences table shows hit names such as "Truseq adaptor XX" for FastQC while all overrepresented sequences are shown as "no hit" for falco.

The result is identical when adding --contaminants and the path to the contaminant list file (this file is not shipped with the conda installation).

Thanks

Edgardo

Segfault on fastq data

I've encountered a segfault occurring with some data I was trimming. Falco works fine on the untrimmed data, but fails on the post-trim results.

I was trying to narrow the break down to a single read, but I've encountered a strange result: the attached data is a collection of 16 reads, broken into two groups of 8. Oddly, Falco works on either group of 8, but not on the concatenated 16. Even more strangely, it works on all 16 reads when I change the order of the reads in the file.

segfault_fq.zip

Forcing the file format does not work

I have a fastq.gz file, but the suffix is not fq.gz. I then added the -f fastq.gz argument, but falco still exits with the error:

Cannot recognize file format for file

malloc error with nanopore FASTQ file

Hi there,

I'm running falco on some nanopore sequencing data, and on one out of 45 FASTQ files I hit the following error:

[limits]        using file /usr/local/opt/falco/Configuration/limits.txt
[adapters]      using file /usr/local/opt/falco/Configuration/adapter_list.txt
[contaminants]  using file /usr/local/opt/falco/Configuration/contaminant_list.txt
[Thu Jan 19 13:04:51 2023] Started reading file x.fastq.gz
[Thu Jan 19 13:04:51 2023] reading file as gzipped FASTQ format
[running falco|                                                   |  0%]malloc(): unsorted double linked list corrupted
31/cf81b3880a6686b11bd6f7b4f43575/.command.sh: line 2:    29 Aborted                 (core dumped) falco --threads 1 x.fastq.gz -D x_raw_falco_data.txt -S x_raw_falco_summary.txt -R x_raw_falco_report.html

I'm afraid I can't share the FASTQ file with you. If I can somehow investigate further, please let me know.

I'm running falco from Docker quay.io/biocontainers/falco:1.2.1--h867801b_3.

The command used was:

falco  --threads 1 x.fastq.gz -D x_raw_falco_data.txt -S x_raw_falco_summary.txt -R x_raw_falco_report.html

version number for v0.2.2

Nothing serious...
In the source code of your v0.2.2 release, it seems you forgot to update the value of FALCO_VERSION at line 303 of src/falco.cpp. It is still "falco v0.2.1", so running 'falco -v' prints 'v0.2.1' instead of 'v0.2.2'.

add some additional information to the README for compilation

Hi there

After encountering some errors and problems installing Falco by compilation, I think the README would benefit from a few specifics on exactly how to compile. After some trial and error, my working procedure was:

cd falco-0.2.1
sudo autoreconf -fvi
sudo autoconf
sudo ./configure CXXFLAGS="-O3 -Wall" --enable-hts
sudo make all
sudo make install

Best,
Eckart

Options for using falco without root privilege

The installation instructions for falco appear to require root privileges.

[moldach@cedar1 falco-0.2.1]$ make install
make[1]: Entering directory '/home/moldach/bin/falco-0.2.1'
 /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/mkdir -p '/usr/local/bin'
  /cvmfs/soft.computecanada.ca/custom/bin/install -c falco falcodiff '/usr/local/bin'
/cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/install: cannot create regular file '/usr/local/bin/falco': Permission denied
/cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/install: cannot create regular file '/usr/local/bin/falcodiff': Permission denied
make[1]: *** [Makefile:407: install-binPROGRAMS] Error 1
make[1]: Leaving directory '/home/moldach/bin/falco-0.2.1'
make: *** [Makefile:935: install-am] Error 2

Similarly, Compute Canada staff do not want us using conda.

I'm wondering if you have a Singularity container or another preferred installation method for academics using HPCs.

Thank you

Package adapters and other default files with bioconda

Hi, thanks for the great tool!

Is it still the case that adapter_list.txt, and potentially the other default configuration files, are not shipped with the bioconda package? See #16 and https://github.com/smithlabcode/falco/releases/tag/v0.3.0

I could not find them when I tried a simple find. Is there a reason for this? If not, perhaps they should be shipped with bioconda, and falco should default to using them, similar to how FastQC works. Thanks for your consideration!

Broken script in HTML report

Falco version 0.3.0.

The generated HTML report (attached as fastqc_report.html.zip) contains numbers with leading zeros (y : [004.29734000000]) in the section for the K-mer chart. This is not valid JavaScript syntax, so the script is broken and all charts disappear from the report.

add CSS and javascript to HTML page source code

Currently, the HTML page has static links to the Bootstrap and Plotly CDNs. This means that, in the absence of an internet connection, the page will break. The source code for these files should be built into the HTML so pages display correctly offline (e.g. when viewing from a cluster environment).
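The inlining step amounts to replacing each external script tag with the asset's source embedded in the page. A minimal sketch, assuming the tag appears exactly in the form shown (illustrative only):

```python
def inline_script(html, src, source_code):
    """Replace a '<script src="..."></script>' tag with an inline
    <script> block so the report renders without network access.
    Assumes the tag appears exactly in this form (a simplification)."""
    tag = '<script src="%s"></script>' % src
    return html.replace(tag, "<script>" + source_code + "</script>")

page = '<head><script src="https://cdn.plot.ly/plotly.js"></script></head>'
offline = inline_script(page, "https://cdn.plot.ly/plotly.js",
                        "/* plotly source */")
print(offline)
```

In falco itself the natural place for this would be at build time, baking the library sources into the HTML template so generated reports never reference a CDN.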

Nanopore support

I ran Falco with the following command:
falco --nano reads.fastq.gz
I am using nanopore reads. The HTML report says it is using the Sanger/Illumina encoding even though I specified nanopore. Do I need to change anything else? The publication says falco has been tested with both Illumina and nanopore data. Does falco support the following for nanopore reads:

  • the Q scores of nanopore instead of the Illumina Phred score
  • detecting the adapters or barcodes of nanopore reads
  • the overrepresentation feature for nanopore reads

summary.txt
fastqc_data.txt

Segmentation fault seemingly at random with v1.2.1

I have analyzed this file successfully in the past with previous versions of Falco, but now version 1.2.1 produces a segmentation fault (identical behavior on Linux and Mac):

$ falco --outdir test_falco --threads 1 Antrev_R2.fq.gz
[Mon Oct  9 13:13:27 2023] creating directory for output: test_falco
[limits]	using file /Users/emortiz/y/envs/captus/opt/falco/Configuration/limits.txt
[adapters]	using file /Users/emortiz/y/envs/captus/opt/falco/Configuration/adapter_list.txt
[contaminants]	using file /Users/emortiz/y/envs/captus/opt/falco/Configuration/contaminant_list.txt
[Mon Oct  9 13:13:27 2023] Started reading file Antrev_R2.fq.gz
[Mon Oct  9 13:13:27 2023] reading file as gzipped FASTQ format
[running falco|===================================================|100%]
[Mon Oct  9 13:13:27 2023] Finished reading file
[Mon Oct  9 13:13:27 2023] Writing summary to test_falco/summary.txt
[Mon Oct  9 13:13:27 2023] Writing text report to test_falco/fastqc_data.txt
[Mon Oct  9 13:13:27 2023] Writing HTML report to test_falco/fastqc_report.html
Segmentation fault: 11

I hope you can help... I am also attaching the reads
Antrev_R2.fq.gz

Edgardo

Error when processing bam file: No known encoding with chars < 33. Yours was 9)

I get the error "No known encoding with chars < 33. Yours was 9)" when I try to process a BAM file with falco. Here are the call and stdout:

$ falco ${BAM_FILE} -o falco_fastqc/
[limits]        using default limit cutoffs (no file specified)
[adapters]      using default adapters (no file specified)
[contaminants]  using default contaminant list (no file specified)
[Fri May 13 08:09:56 2022] Started reading file DS-376333.hg19.bam
[Fri May 13 08:09:56 2022] reading file as bam format
[Fri May 13 08:10:00 2022] Processed 1M reads
[Fri May 13 08:10:05 2022] Processed 2M reads
[Fri May 13 08:10:10 2022] Processed 3M reads
[Fri May 13 08:10:15 2022] Processed 4M reads
[Fri May 13 08:10:19 2022] Processed 5M reads
[Fri May 13 08:10:21 2022] Finished reading file
[Fri May 13 08:10:21 2022] Writing text report to falco_fastqc//fastqc_data.txt
[Fri May 13 08:10:21 2022] Writing HTML report to falco_fastqc//fastqc_report.html
No known encoding with chars < 33. Yours was 9)

This is similar to issue #24 but not the same.

The 9 must be referring to the ASCII quality scores. 9 is a TAB (\t).

samtools view DS-376333.hg19.bam | grep -P "\t" | wc -l shows me that every line in ${BAM_FILE} has a \t in it, which makes sense because the text representation of a BAM is tab-delimited. So I'm not sure how to even find the offending \t, which I imagine must be at the beginning, end, or middle of the quality scores.
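Since no standard Phred encoding uses characters below ASCII 33, a small helper like the following (hypothetical, operating on name/quality pairs extracted from the quality column of `samtools view` output) can locate the first offending quality character:

```python
def find_bad_quality(records):
    """Return (record_index, char_code) for the first quality character
    below ASCII 33, or None if all qualities are valid.
    `records` yields (name, quality_string) pairs."""
    for i, (name, qual) in enumerate(records):
        for ch in qual:
            if ord(ch) < 33:
                return i, ord(ch)
    return None

# the second record contains a literal TAB (ASCII 9) in its qualities
print(find_bad_quality([("r1", "IIII"), ("r2", "II\tI")]))
```

Running something like this over the extracted quality column would pinpoint which read (if any) carries the TAB that falco is reporting.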

However, all my BAMs were created with the GATK best-practices pipeline, so I don't see how they could be badly formatted. Additionally, FastQC is able to process them without error, albeit very slowly.

Thanks for any help!

is_fastq_gz for bam file

Is is_fastq_gz supposed to be false for BAM files? I think it is currently set to false, which causes BAM files to be read as regular files instead of binary in get_tile_split_position. I get weird-looking output when I print the first line of the BAM file in this function.

Segmentation fault for BAM files

I'm getting errors on both my C. elegans and H. sapiens .bam files:

[moldach@cedar5 HG03583]$ salloc --time=0:10:0 --mem=100000
[moldach@cedar5 HG03583]$ module load htslib

[moldach@cdr861 HG03583]$ /home/moldach/bin/falco-0.2.1/bin/falco alignment/HG03583_S1_L001.bam
[limitst]       using file /home/moldach/bin/falco-0.2.1/Configuration/limits.txt
[Sun May  3 17:59:40 2020] Started reading file alignment/HG03583_S1_L001.bam
[Sun May  3 17:59:40 2020] reading file as bam format
Segmentation fault (core dumped)

Any idea what the issue is here?

falco v1.1.0 hangs when processing bam file

Hi! I'm trying to use the most recent version of falco. I created a conda env with the recipe file below.

name: falco_env
channels:
  - conda-forge
  - bioconda
dependencies:
  - falco=1.1.0

In that env, when I tried processing a 73 MB BAM file, the output looked like this:

[limits]        using default limit cutoffs (no file specified)
[adapters]      using default adapters (no file specified)
[contaminants]  using default contaminant list (no file specified)
[Wed Sep 21 16:03:09 2022] Started reading file data/DS-366791.hg19.bam
[Wed Sep 21 16:03:09 2022] reading file as BAM format
[running falco|===================================================|100%]
[Wed Sep 21 16:03:18 2022] Finished reading file
[Wed Sep 21 16:03:18 2022] Writing text report to falco_conda_test//fastqc_data.txt
[Wed Sep 21 16:03:18 2022] Writing HTML report to falco_conda_test//fastqc_report.html

But then it just hung for about an hour, at which point I killed it. fastqc_data.txt had some data in it, but fastqc_report.html and summary.txt were empty files. I was able to process the exact same BAM file with falco v1.0.0, also installed with conda, so this seems like a bug worth reporting.

thanks!

memory issue?

Dear developers,
Thanks for your valuable tool! I'm trying to use it on some nanopore data and got the following error:

[Thu Sep  8 14:45:34 2022] creating directory for output: KO_fastqc
[limitst]	using file /falco-1.1.0/Configuration/limits.txt
[adapters]	using file /falco-1.1.0/Configuration/adapter_list.txt
[contaminants]	using file /falco-1.1.0/Configuration/contaminant_list.txt
[Thu Sep  8 14:45:34 2022] Started reading file KO.fq.gz
[Thu Sep  8 14:45:34 2022] reading file as gzipped FASTQ format
[running falco|=                                                  |  2%]/ 2: 19870 Killed                  falco -o KO_fastqc -t 1 KO.fq.gz

I allocated 80 GB of RAM, so I don't think RAM is the problem.

Luca

Does not correctly decode old Illumina Phred score encoding

I ran Falco 0.2.4 and FastQC on old Illumina data and noticed that the recognition of the Phred score encoding differs between the programs as follows:
Falco recognizes it as Illumina 1.9, and Per base sequence quality and Per sequence quality scores look incorrect.
FastQC recognizes it as Illumina 1.5, and everything looks fine.

Attached is a small example of the old FASTQ and the output from each program.

Here are the commands I ran:

# Falco
falco --nogroup read_1.fastq.gz --outdir falco_result

# FastQC
fastqc --nogroup read_1.fastq.gz --outdir fastqc_result

I hope these help reproduce the problem.
Thanks in advance.
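Encoding detection of this kind is typically based on the range of observed quality characters: values below ASCII 59 can only occur with the Sanger offset of 33, while a minimum of 64 or above points to the old Illumina offset of 64. The sketch below is a rough heuristic illustrating that idea, not FastQC's or falco's exact rule:

```python
def guess_encoding(qual_chars):
    """Guess a Phred offset from observed quality characters.
    Rough heuristic: chars below ASCII 59 only occur with offset 33;
    a minimum of 64+ suggests old Illumina offset-64 encoding."""
    lo = min(ord(c) for c in qual_chars)
    if lo < 59:
        return "Sanger / Illumina 1.9 (offset 33)"
    if lo >= 64:
        return "Illumina 1.3/1.5 (offset 64)"
    return "Solexa (offset 64, scores may be negative)"

print(guess_encoding("IIIIBBF#"))   # '#' (ASCII 35) forces offset 33
print(guess_encoding("ghhfeddc"))   # all chars >= 'c' suggests offset 64
```

Note the heuristic is sensitive to how much of the file is sampled: if only the first reads are inspected and they happen to lack low-quality characters, offset-64 data can be misclassified, which may be related to the discrepancy reported here.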

Floating point exception if fastq is empty

When running falco in a sequencing-processing pipeline, some FASTQ files can be empty. In such cases, falco (v1.2.1) throws an error:

Floating point exception

It would be helpful if falco just emitted a warning when the FASTQ is empty. This could be an option (e.g., falco --just-warn).
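The crash is consistent with a division by a zero read count somewhere in the summary statistics, and guarding the denominator is the usual fix. A minimal sketch of the guard (illustrative; mean_gc is a hypothetical helper, not falco's code):

```python
def mean_gc(gc_count, total_reads):
    """Mean GC bases per read, guarding the total_reads == 0 case
    that an empty FASTQ would produce."""
    if total_reads == 0:
        return 0.0  # or: emit a warning and skip the module entirely
    return gc_count / total_reads

print(mean_gc(0, 0), mean_gc(90, 2))
```

Applying the same guard to every per-read average would let falco emit an (optionally warning-only) report for empty inputs instead of dying with a floating point exception.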

Adaptor content is 0 unless we specify --adapters

Thanks for the program; we had been looking for a faster replacement for FastQC.

However, in a default run with falco 0.2.4 (from bioconda), the Adapter Content plot is empty unless we add --adapters and the path to the adapter list file (which, by the way, seems not to be shipped in the conda installation, but I may be wrong about that).

Edgardo

Heatmap by Falco

Hello,

Thanks for writing Falco. I really liked the processing speed: 515M reads in 45 min, while FastQC is still processing.
The results are comparable, and the dynamic nature of the plots is good. However, I am not sure about the heatmap. In FastQC, a "cold map" is shown, which makes it much easier to spot issues. In falco's heatmap, what is the color spectrum based on, and how is the heatmap best interpreted? Additionally, the tiles on the heatmap have a different range from FastQC; how are tiles captured and how is the range set?

I am using version 0.11.9 of FastQC and v0.2.1 of Falco (via conda).

Thanks in advance for any help with this.

Segmentation fault for FASTQ (merged reads from NovaSeq)

Hello!

I'm getting an error while trying to run falco on merged reads (FASTQ, NovaSeq):

[Tue Dec 22 15:05:11 2020] Started reading file smp.fq.gz
[Tue Dec 22 15:05:11 2020] reading file as gzipped fastq format
Segmentation fault (core dumped)

Merging was done with the BBMap tools.
The FASTQ looks normal; no validation errors were found with fq lint (from fqlib).
FastQC accepts this file without errors.

falco v0.2.1, from the bioconda repository.
Please find the example file in the attachment (smp.fq.gz).

With kind regards,
Vladimir

Output is overridden when multiple FASTQ files are provided

If multiple inputs are given and the -o flag sets a directory name, only the results for the last file show up. This is because each file overwrites the previous one: they are all called output_dir/fastqc_data.txt.

FastQC zips each report. We should create subdirectories within the output directory, one for each file name, but only if more than one file is provided.
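The proposed behaviour can be sketched as follows (illustrative only; output_dirs is a hypothetical helper showing the mapping from inputs to per-file subdirectories):

```python
import os

def output_dirs(inputs, outdir):
    """Map each input file to its output directory: a shared outdir for
    a single input, but a per-basename subdirectory when more than one
    input is given (the behaviour proposed above)."""
    if len(inputs) <= 1:
        return {f: outdir for f in inputs}
    return {f: os.path.join(outdir, os.path.basename(f)) for f in inputs}

print(output_dirs(["a.fq.gz", "b.fq.gz"], "out"))
print(output_dirs(["a.fq.gz"], "out"))
```

Keeping single-input behaviour unchanged avoids breaking existing pipelines while still preventing multi-input runs from clobbering each other's reports.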

Segmentation fault upon writing output

Hi,
I've compiled Falco:
configure:

 ./configure CXXFLAGS="-O3 -Wall"
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a race-free mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether the compiler supports GNU C++... yes
checking whether g++ accepts -g... yes
checking for g++ option to enable C++11 features... none needed
checking whether make supports the include directive... yes (GNU style)
checking dependency style of g++... gcc3
checking whether g++ supports C++11 features with -std=c++11... yes
checking for g++ -std=c++11 option to support OpenMP... -fopenmp
checking for zlibVersion in -lz... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating config.h
config.status: executing depfiles commands

make:

make all
make  all-am
make[1]: Entering directory '/home/blahuser/progs/falco-1.2.1'
  CXX      src/falco-falco.o
  CXX      src/falco-FastqStats.o
  CXX      src/falco-HtmlMaker.o
  CXX      src/falco-Module.o
  CXX      src/falco-StreamReader.o
  CXX      src/falco-FalcoConfig.o
  CXX      src/falco-OptionParser.o
  CXX      src/falco-smithlab_utils.o
  CXXLD    falco
make[1]: Leaving directory '/home/blahuser/progs/falco-1.2.1'

install:

sudo make install
make[1]: Entering directory '/home/blahuser/progs/falco-1.2.1'
 /usr/bin/mkdir -p '/usr/local/bin'
  /usr/bin/install -c falco '/usr/local/bin'
make[1]: Nothing to be done for 'install-data-am'.
make[1]: Leaving directory '/home/blahuser/progs/falco-1.2.1'

and run falco:
falco sequencing.fq.gz

and get output:

[limits]        using file /home/blahuser/progs/falco-1.2.1/Configuration/limits.txt
[adapters]      using file /home/blahuser/progs/falco-1.2.1/Configuration/adapter_list.txt
[contaminants]  using file /home/blahuser/progs/falco-1.2.1/Configuration/contaminant_list.txt
[Mon May  8 14:33:37 2023] Started reading file sequencing.fq.gz
[Mon May  8 14:33:37 2023] reading file as gzipped FASTQ format
[running falco|===================================================|100%]
[Mon May  8 14:42:22 2023] Finished reading file
[Mon May  8 14:42:22 2023] Writing summary to ./summary.txt
[Mon May  8 14:42:22 2023] Writing text report to ./fastqc_data.txt
[Mon May  8 14:42:22 2023] Writing HTML report to ./fastqc_report.html
Segmentation fault

I have paired-end sequences that have gone through Trim Galore.

I've tested on the R1 file, and falco runs fine (it's so much faster than FastQC, it's amazing).

But falco crashes with a segfault on the R2 sequence.
The file is a 15302780411-byte (~15.3 GB) gzipped FASTQ file.

The head of the original file started something like this (I passed a modified version of it which had gone through Trim Galore):

@V350096722L1C001R00100001050
GTTCGAACTAATTTCCAAAACGAATATACAAACTTACAATCGCACCAACAATAAAAAAAAATTCCTCTTTCTCCACATCCACACCAACATCTACTATCAC
+
HA=HH;C?BED@;BF9EFFCBGE8AECEEEED/</FGEDBEH7E7BFCEFC7DFEECEC'E.<8D:C=3=@3F1EAD0FD/GDFDFE4E,BCFFD@CGFF
@V350096722L1C001R00100001075
GCGACACTATCAAAACACTACACCCACCTCAATTTACCCAAACTCTACCACCCTTTTTAAAAAAAAAAAAAAACCCCTCTTATCCTAAACTATCTCTCAA
+
G?FBGDDCCBBEEBEADBCCFE792CD<DCCEC;BE:B>EBEBA<:CBDBD9BB@?B@CEEEEEECBCCECBE:+=C61C=EB=AAC@B98E,A:(C5>#
@V350096722L1C001R00100001079
TCGACTACTACAAACCTATCTCCCAACTCCACACTACCTACCTCTACTACACAAAACCCACAAATCAAAAAAACACACAACTAAACACCAAACACGTGTA
+
@ECC5;EDBE=EDBDCCE?DDDCD6FE:8@E*C@'EBD9E7=A7BCADE6F:AC9D8:CDDEDBB=EEEEDDE<C7D>C(1+C?C+/EDCE7*E,2CB:E
@V350096722L1C001R00100001117
CGAATACTTCACTAACTCCAAACAACTCGAAACCAACCTTACCAAACTTACTAAAACGAAATAACGTATTACCCTCTCTAATATTCACTTTCCGAAATCA
+
FIFDDDCCCFDFDDFDCDHDFDHDFDDHIFFEGDDFHGDCDGIFEDEDDFECFFFFGIEFFCDFGHCFCDDGGGCDDHDFEBFCDHDH;DAHHHAEFDGE
@V350096722L1C001R00100001129
GCGAAAAAAAATAAAACCAATCTCATTAATCATTATCATAACTATAAAACAACAAAAAACGAAAATAAAAAAAACACACAACAAAACTCCAATCACGTGT
+
CD:=CCEDE8F-?FCD($ED3EB;E7BE=4AD,<F8ED3B@C@C4FECEFDFAFEAA>60;FC?D$7DECEDDD>B=D9E:<$EDEB3G2B?D&E(;DG%
@V350096722L1C001R00100001130
CGAACACAACCAACCATCTTCAAAAAATCACCACCCTTCACACACACAAACATCAATACACAACAACTCACCACACCTCACAATCCACACACCCCAAACA
+
EFBECCBCDCBEBDACCADBBCBDBCD8BE>:BA.A?B>E@CBEBEEEAEDABBE?BCDE=DCBEEB@=E@3B<E.%?DDFB&??AA5EEBD9@ABEC@E
@V350096722L1C001R00100001146
CGAAACCCGAACCCCCACGAACCGACGACTCTTACCGCCTAATCACCCACCAACAACCAACGATCAACAACAAACGACAAACAACAAACACCACTAAATC
+
=@<EE>B8H2CC?.59CBG$=?8??>GDA@?BAAC>H(2.BA?8A;45E5DEA<?@='CEA7<*)CE&@E8=C?&;C;DCE<6C9CCC.E4;D23ECB1@
@V350096722L1C001R00100001155
CGACCCTACATAATAATTTTAATAATTTAAAAAACGAAACAATTCCGCGATATAAAATTTTCTACTCTAAAACGACATCGAAATTTACAACCGAAAAATC
+
FDDFIFCDFFCDECFDCCCDDFCDECDBDFFFCDDHEEECFEDDG@HEIFDFDDEBFBBCDECDGCGDEEEFFHEED3=HC#D@DCDGDEDBHDEEECCG
@V350096722L1C001R00100001162
CGACCAACAAACAACACACACACCCACACAACTCTAAACACCCCAAACCTTAACACCAAACCTCTCAACCCTAACACCATAACTTAACCCTAACCACAAA
+
FFDDGCCDEECDCECDEEFEEDBEFDEECDEEBBDECEFEDDGGCDDFFDC?CECE@DDEFEBF<CEE@EDABCDC8FD<$E(@CEEEFA<EAA5CD2DC
@V350096722L1C001R00100001181
CGACTTCTACCTAAATAAAACATCCAAAAATTAAATTATATTTTATAAAACTAATACCACCAAAACAAAAAAACACACATCTAAACTCCAATCACGTGTA
+
GHEEDCGCDFCDDFFCDFFDGD,GGDDEDDDDDDFC3DCDDCABDCDEEEGDDDBDGFFFFFCC<GDDFDDDC9DFDFF6C=DEDFA=GDF6GDFHBGCF

It came from an MGI instrument, but it's nothing special. Falco is happy when I pass my R1 file to it.

Any idea why this would segfault?
All three outputs (summary, txt, and html) are empty files when it faults on R2. When falco processes the R1 sequence, it's fine and the output looks good.

I'm running Ubuntu 20.04, if that matters.

many thanks,
Kieran

Requested output format changes

Hey Guilherme.

Thanks for the quick reply! Great that you've got the problem fixed. However, I was hoping to run falco as part of a general pipeline; we rely on conda for managing the environment, and I'd like to keep the custom parts down to a bare minimum.

Could I ask why you don't include a more stable output format, perhaps as a supplement to the FastQC-like output, and write a specific falco module for MultiQC? (https://multiqc.info/docs/#custom-content)

Maybe this could be a more stable way to integrate into MultiQC, and it would also make falco part of the list of modules available in MultiQC (raising awareness). As we are developing pipelines for cloud-based environments, core-minutes become important, and having a fast tool (like falco) is really high on the priority list.

Originally posted by @pbiology in #7 (comment)

Add option to specify file output names

In cases where you have paired-end reads (e.g. HG03583_S1_L001_R1.fastq.gz and HG03583_S1_L001_R2.fastq.gz) or a number of FASTQ files in a directory, falco will overwrite fastqc_data.txt, fastqc_report.html, and summary.txt.

At the moment, the only way around this that I can see would be to have each FASTQ file in its own directory (not ideal, IMO).

It would be nice to be able to specify the output names so you could use wildcard rules in, for example, a Snakemake workflow.

Too many adapters error while using built-in adapter file

[Mon Sep 13 13:26:36 2021] Writing text report to falco_err/fastq/F3D2_S190_L001_R1_001.fastq_fastqc_data.txt
[Mon Sep 13 13:26:36 2021] Writing HTML report to falco_err/fastq/F3D2_S190_L001_R1_001.fastq_fastqc_report.html
Elapsed time for file fastq/F3D2_S190_L001_R1_001.fastq: 3s
[limitst]       using file /local/downloads/falco-0.3.0/Configuration/limits.txt
[adapters]      using file /local/downloads/falco-0.3.0/Configuration/adapter_list.txt
You are testing too many adapters. The maximum number is 128!

I get this error when I specify multiple FASTQ files as input, but not when running them individually. I can run the F3D2_S190 forward and reverse reads just fine:

$ falco fastq/F3D2_S190_L001_R*.fastq -o F3D2_S190
[Mon Sep 13 13:31:13 2021] creating directory for output: F3D2_S190
[limitst]	using file /local/downloads/falco-0.3.0/Configuration/limits.txt
[adapters]	using file /local/downloads/falco-0.3.0/Configuration/adapter_list.txt
[contaminants]	using file /local/downloads/falco-0.3.0/Configuration/contaminant_list.txt
[Mon Sep 13 13:31:13 2021] Started reading file fastq/F3D2_S190_L001_R1_001.fastq
[Mon Sep 13 13:31:13 2021] reading file as uncompressed fastq format
[Mon Sep 13 13:31:13 2021] Finished reading file
[Mon Sep 13 13:31:13 2021] Writing text report to F3D2_S190/fastq/F3D2_S190_L001_R1_001.fastq_fastqc_data.txt
[Mon Sep 13 13:31:13 2021] Writing HTML report to F3D2_S190/fastq/F3D2_S190_L001_R1_001.fastq_fastqc_report.html
Elapsed time for file fastq/F3D2_S190_L001_R1_001.fastq: 0s
[limitst]	using file /local/downloads/falco-0.3.0/Configuration/limits.txt
[adapters]	using file /local/downloads/falco-0.3.0/Configuration/adapter_list.txt
[contaminants]	using file /local/downloads/falco-0.3.0/Configuration/contaminant_list.txt
[Mon Sep 13 13:31:14 2021] Started reading file fastq/F3D2_S190_L001_R2_001.fastq
[Mon Sep 13 13:31:15 2021] reading file as uncompressed fastq format
[Mon Sep 13 13:31:16 2021] Finished reading file
[Mon Sep 13 13:31:16 2021] Writing text report to F3D2_S190/fastq/F3D2_S190_L001_R2_001.fastq_fastqc_data.txt
[Mon Sep 13 13:31:16 2021] Writing HTML report to F3D2_S190/fastq/F3D2_S190_L001_R2_001.fastq_fastqc_report.html
Elapsed time for file fastq/F3D2_S190_L001_R2_001.fastq: 2s

I'm testing with the test dataset from here: https://mothur.org/wiki/miseq_sop/

direct link to fastq zip download: https://mothur.s3.us-east-2.amazonaws.com/wiki/miseqsopdata.zip

[Feature request] Add option to subsample reads

Most of the time, we run falco (or FastQC) to get a rough estimate of data quality, so we do not need to parse every single read in the FASTQ file. I think we could randomly subsample a certain number or fraction of reads to increase speed and save computational power.

Would it be possible to add an argument (--subsample/-s) for this?

Thanks!
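A standard way to implement such an option in a single pass, without knowing the total read count in advance, is reservoir sampling. The sketch below illustrates the technique; it is not an existing falco feature:

```python
import random

def reservoir_sample(reads, k, seed=0):
    """Uniformly sample up to k items from a stream in one pass
    (reservoir sampling). Each item is kept with probability k/n
    overall, without needing to know n up front."""
    rng = random.Random(seed)
    sample = []
    for i, read in enumerate(reads):
        if i < k:
            sample.append(read)
        else:
            # replace a reservoir slot with decreasing probability k/(i+1)
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = read
    return sample

print(reservoir_sample(range(1000), 5))
```

For QC purposes a simpler "every Nth read" scheme would also work and is cheaper, though it can bias results if the file has positional structure (e.g. sorted BAM input).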
