hiseq vs. miseq output mixup? about amptk HOT 9 CLOSED

nextgenusfs commented on August 20, 2024

hiseq vs. miseq output mixup?

from amptk.

Comments (9)

devonorourke commented on August 20, 2024 1

So here's what's really strange to me: I have another MiSeq dataset - same sequencing center, same primers used, same sequencing parameters... same everything. It's working right away with the same kind of amptk illumina commands that my initial dataset is "failing" with.

Here's the log file for that successful amptk illumina process:

2018-01-27 09:15:15,959: usearch9 -fastq_mergepairs trim/CONTROL-mockP11L1IM4hiEN-USA-2017-076-JF.fq.pretrim_R1.fq -reverse trim/CONTROL-mockP11L1IM4hiEN-USA-2017-076-JF.fq.pre
trim_R2.fq -fastqout trim/CONTROL-mockP11L1IM4hiEN-USA-2017-076-JF.fq.merged.fq -fastq_trunctail 5 -fastqout_notmerged_fwd trim/CONTROL-mockP11L1IM4hiEN-USA-2017-076-JF.fq.notm
erged.R1.fq -minhsp 12 -fastq_maxdiffs 8 -report trim/CONTROL-mockP11L1IM4hiEN-USA-2017-076-JF.fq.merge_report.txt -fastq_minmergelen 160
2018-01-27 09:15:18,518: usearch v9.2.64_i86linux32, 4.0Gb RAM (132Gb total), 24 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: [email protected]


2018-01-27 09:15:18,519: 
Merging
  Fwd trim/CONTROL-mockP11L1IM4hiEN-USA-2017-076-JF.fq.pretrim_R1.fq
  Rev trim/CONTROL-mockP11L1IM4hiEN-USA-2017-076-JF.fq.pretrim_R2.fq
  Keep read labels

00:00 71Mb      0.1% 0% merged^M00:01 156Mb     0.2% 50.0% merged^M00:02 157Mb    41.6% 97.6% merged^M00:03 158Mb    83.4% 97.8% merged^M00:03 158Mb   100.0% 97.8% merged

Merged length distribution:
       174  Min
       181  Low quartile
       181  Median
       181  High quartile
       206  Max

Totals:
    408813  Pairs (408.8k)
    399848  Merged (399.8k, 97.81%)
    323831  Alignments with zero diffs (79.21%)
      8750  Too many diffs (> 8) (2.14%)
       207  No alignment found (0.05%)
         0  Alignment too short (< 16) (0.00%)
         8  Merged too short (< 160)
    408541  Staggered pairs (99.93%) merged & trimmed
    180.97  Mean alignment length
    180.97  Mean merged length
      0.74  Mean fwd expected errors
      2.48  Mean rev expected errors
      0.03  Mean merged expected errors

The strange thing: when you look at the sizes of the temporary pretrim_R*.fq files, they're not that different between the failing and succeeding datasets. Both are ~ 100 Mb each. Yet once those files are input into the next process, I get just 100's of reads merging in one case vs. 100,000's of reads in the other.

What's the take away from the fact that only a fraction of these pretrim reads aren't merging?

from amptk.

devonorourke commented on August 20, 2024 1

I tried running USEARCH on its own, with a single pair of my raw .fastq.gz files (not put through any amptk process). This generated an error message suggesting there is an unexpected @ symbol causing the program to fail. Maybe this is just because USEARCH doesn't like .gz files?

[devon@premise p8-2]$ usearch9 -fastq_mergepairs \
> NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R1_001.fastq.gz \
> -reverse NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R2_001.fastq.gz \
> -fastqout test.merged.fq \
> -fastq_trunctail 5 \
> -fastqout_notmerged_fwd test.notmerged.fq \
> -minhsp 12 -fastq_maxdiffs 8 -report test.fq.merge_report.txt -fastq_minmergelen 160
usearch v9.2.64_i86linux32, 4.0Gb RAM (264Gb total), 24 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: [email protected]


Merging
  Fwd NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R1_001.fastq.gz
  Rev NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R2_001.fastq.gz
  Keep read labels

00:01 71Mb      0.1% 0% merged

usearch9 -fastq_mergepairs NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R1_001.fastq.gz -reverse NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R2_001.fastq.gz -fastqout test.merged.fq -fastq_trunctail 5 -fastqout_notmerged_fwd test.notmerged.fq -minhsp 12 -fastq_maxdiffs 8 -report test.fq.merge_report.txt -fastq_minmergelen 160

---Fatal error---
Bad line 1 in FASTQ file 'NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R1_001.fastq.gz': expected '@'

So I then decompressed those files with gzip -d (not the bgzip command you've set up, but that can't be the difference, right?), and then applied the same command, and it not only worked (insofar that it ran without crashing), but it generated a lot more than 113 reads... it generated many more (and what I would have expected)!

[devon@premise test_bin]$ usearch9 -fastq_mergepairs \
> NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R1_001.fastq \
> -reverse NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R2_001.fastq \
> -fastqout test.merged.fq \
> -fastq_trunctail 5 \
> -fastqout_notmerged_fwd test.notmerged.fq \
> -minhsp 12 -fastq_maxdiffs 8 -report test.fq.merge_report.txt -fastq_minmergelen 160
usearch v9.2.64_i86linux32, 4.0Gb RAM (264Gb total), 24 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: [email protected]


Merging
  Fwd NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R1_001.fastq
  Rev NHCS-rut161461-xx-VE-USA-2017-076-JF_S1_L001_R2_001.fastq
  Keep read labels

00:06 161Mb   100.0% 90.0% merged

Merged length distribution:
       160  Min
       181  Low quartile
       181  Median
       187  High quartile
       391  Max

Totals:
    943548  Pairs (943.5k)
    849126  Merged (849.1k, 89.99%)
    462926  Alignments with zero diffs (49.06%)
     92633  Too many diffs (> 8) (9.82%)
      1693  No alignment found (0.18%)
         6  Alignment too short (< 16) (0.00%)
        90  Merged too short (< 160)
    940184  Staggered pairs (99.64%) merged & trimmed
    182.64  Mean alignment length
    182.64  Mean merged length
      1.64  Mean fwd expected errors
      1.56  Mean rev expected errors
      0.14  Mean merged expected errors

from amptk.

devonorourke commented on August 20, 2024

deleted initial comment; have different question now.

from amptk.

nextgenusfs commented on August 20, 2024

The samples you sent seem to be running fine on my system.... In the log file you sent, the script is re-using a previous run intermediate files, so I can't see exactly what it is doing, can you re-run after deleting that rut output directory first. But based on this log file, there are only 113 reads...

2018-01-27 06:10:46,298: Now merging PE reads
2018-01-27 06:10:46,298: usearch9 -fastq_mergepairs rut/rut16-1382.fq.pretrim_R1.fq -reverse rut/rut16-1382.fq.pretrim_R2.fq -fastqout rut/rut16-1382.fq.merged.fq -fastq_trunctail 5 -fastqout_notmerged_fwd rut/rut16-1382.fq.notmerged.R1.fq -minhsp 12 -fastq_maxdiffs 8 -report rut/rut16-1382.fq.merge_report.txt -fastq_minmergelen 160
2018-01-27 06:10:46,311: usearch v9.2.64_i86linux32, 4.0Gb RAM (132Gb total), 24 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: [email protected]


2018-01-27 06:10:46,311: 
Merging
  Fwd rut/rut16-1382.fq.pretrim_R1.fq
  Rev rut/rut16-1382.fq.pretrim_R2.fq
  Keep read labels

00:00 71Mb      0.1% 0% merged
00:00 156Mb   100.0% 81.4% merged

Merged length distribution:
       180  Min
       181  Low quartile
       181  Median
       181  High quartile
       187  Max

Totals:
       113  Pairs (113)
        92  Merged (92, 81.42%)
        40  Alignments with zero diffs (35.40%)
        19  Too many diffs (> 8) (16.81%)
         2  No alignment found (1.77%)
         0  Alignment too short (< 16) (0.00%)
         0  Merged too short (< 160)
       110  Staggered pairs (97.35%) merged & trimmed
    181.01  Mean alignment length
    181.01  Mean merged length
      1.59  Mean fwd expected errors
      1.77  Mean rev expected errors
      0.25  Mean merged expected errors

from amptk.

devonorourke commented on August 20, 2024

okay, I think the problem is resolved, though I have no idea why this was the case (other than perhaps the bgzip command wasn't properly decompressing the files I had?)

I ended up taking my raw reads and first just decompressing them with gzip -d *.gz.

I then just ran the normal amptk illumina command on the files as usual, and all the samples ended up getting properly merged.

Weird.

Thanks for being so quick with your feedback Jon - you're the best!

from amptk.

nextgenusfs commented on August 20, 2024

Interesting. Yes usearch doesn't take the gzipped files. So the method that AMPtk uses I call "fast unzip" I guess, which will run pigz if installed, if not will run bgzip if installed, and then will finally default to plain old gzip. This is because gzip runs on only a single thread and can be quite slow. bgzip was written by Heng Li so hard for me to imagine there would be a bug, but probably more along the lines of a character it wasn't expecting? bgzip is now part of htslib so if this is a reproducible problem they would probably like to hear about it: https://github.com/samtools/htslib.

Can you get pigz installed? That is what I use as it is the fastest.

I agree that strange that this is only happening with one of the runs by the sequencing center? Could it be that they were compressed differently at sequencing center and that is the difference? Or is there actually some characters causing problems, which I would find harder to believe?

from amptk.

hyphaltip commented on August 20, 2024

I thought bgzip was special for making tabix indexed bed/vcf/gff files. I would use pigz or gzip as main gzip ? Jason Jason E Stajich, PhD Professor and Director, Microbiology Graduate Program Department of Microbiology and Plant Pathology University of California, Riverside http://lab.stajich.org @stajichlab @hyphaltip @zygolife +1 951.827.2363

…

On Jan 27, 2018, 9:04 AM -0800, Jon Palmer ***@***.***>, wrote: Interesting. Yes usearch doesn't take the gzipped files. So the method that AMPtk uses I call "fast unzip" I guess, which will run pigz if installed, if not will run bgzip if installed, and then will finally default to plain old gzip. This is because gzip runs on only a single thread and can be quite slow. bgzip was written by Heng Li so hard for me to imagine there would be a bug, but probably more along the lines of a character it wasn't expecting? bgzip is now part of htslib so if this is a reproducible problem they would probably like to hear about it: https://github.com/samtools/htslib. Can you get pigz installed? That is what I use as it is the fastest. I agree that strange that this is only happening with one of the runs by the sequencing center? Could it be that they were compressed differently at sequencing center and that is the difference? Or is there actually some characters causing problems, which I would find harder to believe? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

from amptk.

devonorourke commented on August 20, 2024

I'll keep an eye out for the behavior on subsequent runs and let you know if this was a one-off or something more pervasive. We're getting a bunch of more data coming in over the next couple of weeks so I'll be in touch to confirm.

from amptk.

nextgenusfs commented on August 20, 2024

Perhaps it is now? I thought when it was originally written that Heng was using it for random access to compressed FASTQ files as well. Maybe it has changed. Certainly can just drop it from the fast unzip function if it is causing problems.

from amptk.

hiseq vs. miseq output mixup? about amptk HOT 9 CLOSED

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent