vpc-ccg / calib

Calib clusters barcode tagged paired-end reads based on their barcode and sequence similarity.

License: MIT License

Python 1.10% C++ 1.13% Makefile 0.47% Shell 0.71% Awk 0.01% HTML 96.43% C 0.15%

tagged-reads barcode-sequencing clustering paired-end-sequencing liquid-biopsy

calib's Issues

calib and use with conda

Hey @baraaorabi,

I was wondering how the consensus and error correction steps are performed with the conda installed version of calib?

I was able to generate the test.cluster with the following command:

calib --input-forward R1.fastq.gz --input-reverse R2.fastq.gz --barcode-length 4 --output-prefix test. --minimizer-count 7 --kmer-size 8 --error-tolerance 1 --minimizer-threshold 2

BUT, I'm unable to proceed with the clustering and error correction steps because there are no additional calib arguments with the conda installed version:

$ calib --help
Combined barcode lengths must be a positive integer and each mate barcode length must be non-negative! Note if both mates have the same barcode length you can use -l/--barcode-length parameter instead.
Calib: Clustering without alignment using LSH and MinHashing of barcoded reads
Usage: calib [--PARAMETER VALUE]
Example: calib -f R1.fastq -r R2.fastq -o my_out. -e 1 -l 8 -m 5 -t 2 -k 4 --silent
Calib's paramters arguments:
-f --input-forward (type: string; REQUIRED paramter)
-r --input-reverse (type: string; REQUIRED paramter)
-o --output-prefix (type: string; REQUIRED paramter)
-s --silent (type: no value; default: unset)
-q --no-sort (type: no value; default: unset)
-g --gzip-input (type: no value; default: unset)
-l --barcode-length (type: int; REQUIRED paramter unless -l1 and -l2 are provided)
-l1 --barcode-length-1 (type: int; REQUIRED paramter unless -l is provided)
-l2 --barcode-length-2 (type: int; REQUIRED paramter unless -l is provided)
-p --ignored-sequence-prefix-length (type: int; default: 0)
-m --minimizer-count (type: int; default: Depends on observed read length;)
-k --kmer-size (type: int; default: Depends on observed read length;)
-e --error-tolerance (type: int; default: Depends on observed read length;)
-t --minimizer-threshold (type: int; default: Depends on observed read length;)
-c --threads (type: int; default: 1)
-h --help

Am I missing something here?

Best,
Chad

barcode at 3' end

Hi,

I've just started with NGS data analysis and am not really familiar with it yet, but I'm motivated to analyze my data, and my research on the internet took me here...
Sequencing gave me single-end reads, with my 16N UMI tag on the 3' end of the read.
I've cleaned my data so that only the reads containing the sequence of interest with the UMI attached should be left in my input file.
I'm a little confused by the README. How can I define the position of my UMI, or should I move it in front of the sequence?
And since I don't have paired-end reads, I can copy my dataset without the UMIs for the R2 input, right?

Thanks in advance
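
The options listed in calib --help on this page set barcode lengths but not a barcode position, so one possible workaround is to move the UMI to the front of each read before clustering. The sketch below assumes Calib expects the barcode at the start of the read, which is not confirmed here. The function name and the umi_len parameter are hypothetical and not part of Calib.

```python
from typing import Tuple


def move_umi_to_front(seq: str, qual: str, umi_len: int) -> Tuple[str, str]:
    """Move the last `umi_len` bases and their qualities to the start of the read."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    if not 0 < umi_len <= len(seq):
        raise ValueError("umi_len must be between 1 and the read length")
    # Slice the 3' UMI off the end and put it in front; qualities move with their bases.
    new_seq = seq[-umi_len:] + seq[:-umi_len]
    new_qual = qual[-umi_len:] + qual[:-umi_len]
    return new_seq, new_qual
```

Moving the quality characters together with the bases keeps the sequence and quality lines the same length, which FASTQ requires.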

quality scores of consensus sequences

Hi, I'm trying to use calib in combination with UMI VarCal, but apparently calib adjusts the fastq quality score beyond what is used by Illumina (see https://gitlab.com/vincent-sater/umi-varcal/-/issues/12) and is thus not compatible with UMI VarCal. Is this intended? Is it possible to add an option to adjust this to max 41?
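
Until such an option exists, the qualities could be capped in a post-processing step. This is a minimal sketch assuming Phred+33 encoding, where Q41 is the character J and Q42 is K. The function name cap_quality is hypothetical and not part of Calib or UMI VarCal.

```python
def cap_quality(qual: str, max_q: int = 41, offset: int = 33) -> str:
    """Replace every quality character above `max_q` with the character for `max_q`."""
    cap = chr(max_q + offset)  # 'J' for Q41 under Phred+33
    # Characters compare by code point, so min() picks the lower quality.
    return "".join(min(c, cap) for c in qual)
```

Applied to every fourth line of a consensus FASTQ, this leaves lower qualities untouched and only lowers values above the cap.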

UMI only at Read2

Hi,

I have a FASTQ file in which the UMI is attached to the first 8 bp of Read 2.
How can I use this software with it?

Use igraph lib instead of using own implementation

default output for base quality when setting -m 1

Can calib deal with reads that do not have a molecular duplicate (-m 1)? Currently, I'm seeing all these reads with a default adjusted base quality of K (Q42). Is this correct behaviour? Can these reads retain their original quality values, apart from any overlap between mates where error correction could correctly be applied?

sample index and barcode in deduplicated reads

Hello,

I used Calib to deduplicate my paired-end reads:
calib -f S1_R1.fastq -r S1_R2.fastq -o S1_Calib. -l1 17 -l2 0

I have an 8-base index and a 9-base barcode attached to R1, while nothing is attached to R2. After running calib and calib_cons, I noticed that the 17 bases of index+barcode are still attached to my reads. Is there a way to strip them?

Thanks!
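
If the tools do not strip the barcode themselves, one option is to trim it afterwards. The sketch below assumes a plain four-line-per-record FASTQ and a fixed barcode length at the start of every read. The function name strip_barcode is hypothetical and not part of Calib.

```python
from typing import Iterable, Iterator


def strip_barcode(lines: Iterable[str], barcode_len: int) -> Iterator[str]:
    """Yield FASTQ lines with the first `barcode_len` bases and qualities removed."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # In four-line FASTQ, line 1 (0-based) is the sequence and line 3 the quality.
        if i % 4 in (1, 3):
            line = line[barcode_len:]
        yield line
```

For the case described above, barcode_len would be 17 for R1, and R2 would be left unchanged.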

Paper branch

Make a paper branch and freeze it

Check and try running iDES and other tools

From StarCode paper:

gzip input

I have a gzip file as input. However, when I switched on -g, the program finished with just a few reads processed. I compiled the latest master, 721830a.

No error or minimizer parameters passed. Selecting parameters based on barcode and inferred read length
Inferred read length 149 from sample of 10000 reads
Selected paramters for (mean) barcode length 6 are:
error_tolerance 1
kmer_size 8
minimizer_count 7
minimizer_threshold 2
Extracting minimizers and barcodes...
Memory before reading FASTQ:
1MB
Memory right after reading FASTQ:
1MB
Memory after reserving for read_to_node_vector & node_to_minimizers:
1MB
Memory after filling barcode_to_node_map:
1MB
Memory after releasing node_to_read_map:
1MB
Memory after reserving barcode_to_nodes_vector:
1MB
Memory after filling barcodes & barcode_to_nodes_vector:
1MB
Memory after releasing barcode_to_node_map:
1MB
Read count: 4
Node count: 4
Barcode count: 4
Memory after exiting extract_barcodes_and_minimizers():
1MB
Clustering...
Adding edges due to barcode barcode similarity
Number of masks is 12
011111111111 is assigned to thread 0
Thread 0 built LSH in: 0
Thread 0 processed LSH in: 0
101111111111 is assigned to thread 0
Thread 0 built LSH in: 0
Thread 0 processed LSH in: 0
110111111111 is assigned to thread 0
Thread 0 built LSH in: 0
Thread 0 processed LSH in: 0
111011111111 is assigned to thread 0
Thread 0 built LSH in: 0
Thread 0 processed LSH in: 0
111101111111 is assigned to thread 0
Thread 0 built LSH in: 0
Thread 0 processed LSH in: 0
111110111111 is assigned to thread 0
Thread 0 built LSH in: 0
Thread 0 processed LSH in: 0
111111011111 is assigned to thread 0
Thread 0 built LSH in: 0
Thread 0 processed LSH in: 0
111111101111 is assigned to thread 0
Thread 0 built LSH in: 0
Thread 0 processed LSH in: 0

Output each cluster's reads

Calib run error: Segmentation Fault 11

Hi there,
I'm getting a Segmentation fault 11 error code when trying to run Calib. Please see below.

Thanks for your help!
Kartik

Kartiks-MBP-2:dms2-subamp-seq-test kartik$ calib -f ./fastqfiles_A33R/appendedWithUMI_s1.CTGATCGT-GCGCATAT.J1230.AHCFMLAFX2.L1-4.1.fastq -r ./fastqfiles_A33R/s1.CTGATCGT-GCGCATAT.J1230.AHCFMLAFX2.L1-4.2.fastq -l 9 -o foo_
No error or minimizer parameters passed. Selecting parameters based on barcode and inferred read length
Inferred read length 155 from sample of 10000 reads
Selected paramters for (mean) barcode length 9 are:
error_tolerance 2
kmer_size 8
minimizer_count 7
minimizer_threshold 2
Extracting minimizers and barcodes...
Memory before reading FASTQ:
Segmentation fault: 11

experiments/parameter_tests/README.md broken link

The plots link in experiments/parameter_tests/README.md references the repository through rawgit.com, which has been shut down.

rawgit.com:

RawGit is now in a sunset phase and will soon shut down. It's been a fun five years, but all things must end.

GitHub repositories that served content through RawGit within the last month will continue to be served until at least October of 2019. URLs for other repositories are no longer being served.

If you're currently using RawGit, please stop using it as soon as you can.

The README can contain relative links to files in the repository:
https://help.github.com/en/github/creating-cloning-and-archiving-repositories/about-readmes#relative-links-and-image-paths-in-readme-files

Single-end support

Allow for single end input
Run simulation experiments for single-end parameter optimization.

Bioconda recipe

Dependency on spoa should be resolved by wget
We should add bioconda dependencies on:
- ART
- cmake +3.2 (for spoa)
- Python3 packages for simulation
- GNU Time v1.9 for benchmarking
Other possibilities:
- SiNVICT
- samtools

Clustering and error correction for duplex sequencing

Can calib call duplex consensus reads? I have data with xGen duplex seq adapters.

Downgrade GCC

Calib depends on GCC v5.2. This is because earlier versions have not implemented the full C++11 file stream functions (something about copy constructors is broken in the earlier versions). But if we want Calib to work with bioconda, we should downgrade our GCC to v4.8.5. This should be a relatively easy fix.

not working without parameter selection

Hi,

I don't have paired-end reads, but as described in previous issues I copied my input FASTQ, removed the UMIs (16 N long) and used the copy as the second input file, as shown in the screenshot. I got an error message:
No error or minimizer parameters passed. Selecting parameters based on barcode and inferred read length
Inferred read length 55 from sample of 10000 reads
Then I tried the example command, adjusting only my input file names and the barcode length, and my output file (cluster) was generated. But I'm not sure this is the right parameter selection for my sequences: they are very short, only 55 bases including the 16-base UMI.

I then tried to use the generated cluster file with calib_cons. There was no error message, but the output files were empty. So my question is: does the example command below refer to the same input files as the first calib command for clustering, or to another FASTQ file, different from the input?

To run Calib error correction, run:

calib_cons -c <cluster_file> -q <space_separated_FASTQ_list> -o <space_separated_output_prefix_list>

For example:

calib_cons -c R.cluster -q R1.fastq R2.fastq -o R1. R2.

Thanks in advance, and sorry for what are probably dumb questions to experts, but I'm new to this topic (:

calib_cons cannot accept gzipped files

Hi, I tried to run the program via conda.

My commands:

calib -f SF_1.fastq.gz -r SF_2.fastq.gz -l 8 -o SF --gzip-input --no-sort --threads 8 # for creating cluster file
The first step runs smoothly.

calib_cons -c SFcluster -q SF_1.fastq.gz SF_2.fastq.gz -o SF_1.out SF_2.out -t 8
This exits after reading the fastq file (because it is gzipped). I used file streaming to solve the problem, but can the -g parameter be implemented here as well?

Thank you
@baraaorabi
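
As an alternative to streaming, the inputs could be decompressed to plain FASTQ before calling calib_cons. This is a minimal sketch using only the Python standard library. The helper name gunzip_to and any paths passed to it are hypothetical.

```python
import gzip
import shutil


def gunzip_to(src_path: str, dst_path: str) -> None:
    """Decompress a gzipped file to `dst_path` without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj streams the data in chunks.
        shutil.copyfileobj(src, dst)
```

The decompressed files can then be passed to the -q argument in place of the .gz files, at the cost of extra disk space.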

dealing with UMIs containing N

After running calib, about 30% of my reads contain an UMI with at least one 'N'. How do you suggest that we deal with these? Currently, we are filtering out all these reads.

Thanks.
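
The filtering described above can be sketched as follows, assuming the UMI occupies the first umi_len bases of the read. The function names are hypothetical, and this is not a recommendation from the Calib authors on how such reads should be handled.

```python
from typing import Iterable


def umi_has_n(seq: str, umi_len: int) -> bool:
    """Return True if the first `umi_len` bases contain at least one N."""
    return "N" in seq[:umi_len].upper()


def fraction_with_n(seqs: Iterable[str], umi_len: int) -> float:
    """Fraction of reads whose UMI contains an N (0.0 for an empty input)."""
    total = 0
    flagged = 0
    for seq in seqs:
        total += 1
        flagged += umi_has_n(seq, umi_len)
    return flagged / total if total else 0.0
```

Reporting the fraction before dropping reads makes it easy to see how much data the filter removes.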

using -p for reads with UMI on one read only

Hi
If using "-p" in clustering (--ignored-sequence-prefix-length) with a UMI on one read only (e.g. -l1 10, -l2 0), will the prefix given by -p be ignored on both FASTQ reads, or just on the one with > 0 bases specified for the UMI (-l1 in this case)?
Thanks

Refactoring Makefile

Goals:

Simulation
- Separating simulation from Calib's makefile
- Allowing simulation scripts to be run from any directory
- Allowing the user to specify working directory
Combining calib_cons with Calib's makefile
Benchmarking
- Remove all other tools from master branch
- Allowing the user to run the benchmarking script from any directory
- Allowing the user to specify parameter search space & the working directory
- Add a Slurm configuration file to inherit for child files

"Illegal instruction: 4" error while running calib_cons

Hi there,

I'm working on an Apple M1 Pro Mac running macOS Monterey 12.2.1. I was able to run calib on my fastq files to generate the cluster file but calib_cons gives me an "Illegal instruction: 4" error.

I was able to run Calib fine on my previous Intel Mac so I thought this might be due to the change in chip architecture. I tried running in Rosetta mode by prefixing with "arch -x86_64" but that gave me the same error.

Look forward to your input on how to fix this.

Thanks!
Kartik

Input: Quality sequence needed?

Hi,
Since FASTQ is the input format I was wondering if I could also use a fake-FASTQ file as my input, where the quality sequence is missing or randomly inserted but the read sequence is real?
So in other words, does Calib sort and cluster the reads also based on quality information or only on sequence similarity?
Thanks for upcoming answers.

Error when used conda calib

Hi Baraa!

I tried to use calib installed through conda, but the second step (calib_cons) generated a dump error.
When calib was installed from git, everything worked without problems. I use Ubuntu 18.04 LTS.

As far as I understand, this problem with the conda version has already been encountered by several users on different systems.
I don't have a question as such; I just wanted to report this.

Best wishes,
Marsel

Using Calib with 4 input fastqs

Hello,

I have 4 fastq files for each sample - Read_R1.fq Read_R2.fq Read_I1.fq and Read_I2.fq
The I1 and I2 files have the barcodes for my paired end reads. Can I use these as input to Calib?
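
Since calib takes only a forward and a reverse file, one possible workaround is to prepend each index read to its matching read so the barcode sits at the start of R1 and R2. This is a sketch under that assumption and not a documented Calib feature. The function name is hypothetical, and records in the four files are assumed to be in the same order.

```python
from typing import Tuple


def prepend_barcode(index_seq: str, index_qual: str,
                    read_seq: str, read_qual: str) -> Tuple[str, str]:
    """Return the read with the index read's bases and qualities placed in front."""
    if len(index_seq) != len(index_qual) or len(read_seq) != len(read_qual):
        raise ValueError("sequence and quality lengths differ")
    return index_seq + read_seq, index_qual + read_qual
```

After merging I1 into R1 and I2 into R2, the index read lengths would be the values to pass as -l1 and -l2.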

Get rid of unordered sets

Convert unordered sets to vectors. Resolve removing unmatching nodes.

Getting an "ERROR: Something is fishy with read:" error

Hi there,

I'm trying to run a pair of R1 and R2 files with a 12-bp UMI at the 5' end of each read.

calib ran with the following messages but seemed to exit with the above error.

The reads it lists as "fishy" are the very first read pair in the FASTQ files. I've attached a sample of the R1 and R2 FASTQ files (each read mate has a 12-bp UMI at the 5' end): R1_fastq.txt, R2_fastq.txt. I did some text processing to move the UMIs but did not edit the quality string. Would this cause the error?

Thank you for your help,
Kartik

calib -f ./final_fastq_processed/A-S_R1_filtered.fastq -r ./final_fastq_processed/A-S_R2_reformatted_filtered.fastq -l1 12 -l2 12 -e 0 -k 4 -m 7 -t 3 -o out_
Extracting minimizers and barcodes...
Read count: 3258645
Node count: 2027186
Barcode count: 1145285
Clustering...
Adding edges due to barcode barcode similarity
Number of masks is 1
111111111111111111111111 is assigned to thread 0
Thread 0 built LSH in: 0
Thread 0 processed LSH in: 0
On thread 0 building all LSH took: 0
On thread 0 processing all LSH took: 0
On thread 0 merging local graph with global graph
On thread 0 merging took 0
Building the graph on 1 thread(s) took 1
Adding edges between nodes of identical barcodes with thread 0
Adding edges due to barcodes similarity took: 1807
Extracting clusters
Extracting clusters took: 68
Outputting clusters
ERROR: Something is fishy with read:
name_1 @M01243:273:000000000-K64MD:1:1101:15987:1337 1:N:0:1
sequence_1 TTCCCAGCCGCAACTTTGTGAGTATGGGTAGTAGACTCCTTGAAGAGCTACTACTACAAGTGCTGGGAAGAGCCAACTCAGGGAAATACAGGAAGAGATCACTCGCCATGAGCAGCAGCTTGTCATT
trash +
quality_1 AAAABFFFFFBBGGGFFFF5BGFHFEA2EAAFGBGHCF5BFHFBGHGGHHHHHHHHDGEGBFHHGFFH3EGFGHH3BB@@55c>13BF??BF???DB331AF1GFEHBHHBEEEG3?FGFGFECAC2B??GC20F2FHHC2>FG
name_2 @M01243:273:000000000-K64MD:1:1101:15987:1337 2:N:0:1
sequence_2 TGCGGCTGGGAAAATGNCAAGCTGCTGCTCATGGCGAGTGATCTCTTACTGTATTTCCCAGAGTTGGCTCATCCCAGCACTTGTAGGAGTAGCTCTTCAAGGAGTCTACTAACCATACTCACAAAGT
trash +
quality_2 AA1A#>>1AA1F1BBF11E1BFGE00A0D222D221B122D2AD2F111//1DD11FEFHFHAGBFF0GFAGDD11BG01FFFHHHFD2EA>0GEGDDF2BGF110BBF1GFGBB2BFG?//>/E<C/BG<G0</</B/F<F2<
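
One thing worth checking after editing reads by hand is whether each sequence line still has the same length as its quality line. Whether a mismatch is what triggers the error above is an assumption and is not confirmed by this output. The sketch assumes four-line FASTQ records, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def length_mismatches(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (read_name, seq_len, qual_len) for records whose lengths differ."""
    records = [line.rstrip("\n") for line in lines]
    bad = []
    # Walk the file one complete four-line record at a time.
    for i in range(0, len(records) - 3, 4):
        name, seq, _plus, qual = records[i:i + 4]
        if len(seq) != len(qual):
            bad.append((name, len(seq), len(qual)))
    return bad
```

An empty result means every complete record has matching sequence and quality lengths.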

Change from lsh dictionary to lsh arraylist

Add barcode length & error tolerance command-line flags

Left & right LSH with vectors rather than dicts

Travis testing

Add travis-ci testing to Calib

Illegal instruction (core dumped) - calib_cons

Hi there, wondering if anyone has any ideas on the following.

I'm running calib installed with conda on a Linux server. I've successfully run the calib command and generated a cluster file. The problem occurs when I try to run calib_cons. The following is the full command and output:

calib_cons -c B5-testgz.cluster -q B5_3_1_S5_L001_R1_UMI_full.fastq B5_3_1_S5_L001_R2_UMI_full.fastq -t 4 -o B5gz-R1 B5gz-R2

Reading cluster file: B5-testgz.cluster
Reading fastq file: B5_3_1_S5_L001_R1_UMI_full.fastq
Writing output files: B5gz-R1
Illegal instruction (core dumped)

I end up with 6 empty files in my directory (.msa, .msa1, .fastq and .fast0/1/2).

A factor here is that some preprocessing occurred before the data was passed to me and the UMIs had been removed from the front of the read sequences - I have had to copy them back to the sequence start from the read header and put dummy characters (!!!!!!!!!!!) in the quality scores line so that the lengths match the sequences. Not sure if something here could be problematic. An example read:

@A01439:100:HC3VHDRX2:1:2101:2808:1000:CTTGCATCTTA 1:N:0:CGGCATTA+TGACTGAC
CTTGCATCTTACTTAAAAACCTACAAATGAAACCCAGCATGCATACACACACCCCTCCATACCCTCACATAAATTATATATACCCTTATCTATACTAACTATAAAATGTAT
+
!!!!!!!!!!!F:F:FFF,:F,:F::FFFF::FF:F,,FFFFF:FFF,F:FF,:FFFFF,FFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFF

results of head B5-testgz.cluster if it helps:
26551449 34947546 10 @A01439:100:HC3VHDRX2:1:2101:12319:1000:GGAAACTGTCT GGAAACTGTCTAGTCAGTTTTCTAAATCTATAATGGAAAAGAAAATCGAATCTCGTCTTTATTTTTAAAAAGGGAAGGATGTTCAAGATCGGAAGAGCACACGTCTGAA !!!!!!!!!!!FF::FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @A01439:100:HC3VHDRX2:1:2101:12319:1000:GGAAACTGTCT GGAAACTGTCTTGAACATCCTTCCCTTTTTAAAAATAAAGACGAGATTCGATTTTCTTTTCCATTATAGATTTAGAAAACTGACTTAGATCGGAAGAGCGTCGTGTAGGGAA !!!!!!!!!!!FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 11104884 14491263 46 @A01439:100:HC3VHDRX2:1:2101:6777:1016:CGGATTCATCA CGGATTCATCAGTACTGGAAAGTCCCATTTTTCTCTGCACTGAACAGCCAGAAAAAGAAACAACGTTTCTAACTTAATTGGCTAGATCGGAAGAGCACACGTCTGAACTC !!!!!!!!!!!,,FF:FFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @A01439:100:HC3VHDRX2:1:2101:6777:1016:CGGATTCATCA CGGATTCATCAAGCCAATTAAGTTAGAAACGTTGTTTCTTTTTCTGGCTGTTCAGTGCAGAGAAAAATGGGACTTTCCAGTACAAGATCGGAAGAGCGTCGTGTTGGGAAAG !!!!!!!!!!!FFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFF,FF,,FF,,:F 15253236 19933987 47 @A01439:100:HC3VHDRX2:1:2101:7229:1016:GGAAATCGGTT GGAAATCGGTTGCATAACACAGCAGAGCCACTATGAAATTCAGCTCTTATAGCAAACATTTAAATGATTTTTGTTGGATATTTTCTCTCAGTTGGCATGTGAACAAATGTG !!!!!!!!!!!F,,,FFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @A01439:100:HC3VHDRX2:1:2101:7229:1016:GGAAATCGGTT GGAAATCGGTTCACATTTGTTCACATGCCAACTGAGAGAAAATATCCAACAAAAATCATTTAAATGTTTGCTATAAGAGCTGAATTTCATAGTGGCTCTGCTGTGTTATG !!!!!!!!!!!FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

any thoughts on what may be leading to this error?

Use bed file to target human exons to simulate real ctDNA dataset

Assume non-overlapping probes of 120bp
Assume minimum overlap of 16bp (?) between a probe and a molecule in order for the molecule to be captured
Assume molecule mean length of 170bp and some variation (20bp?)
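
The capture rule in these assumptions can be sketched as an interval-overlap test. The sketch uses half-open coordinates on a single reference sequence, and the function names and default values (120bp probes, 16bp minimum overlap) are illustrative choices taken from the list above, not existing simulation code.

```python
from typing import Iterable


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the overlap between half-open intervals [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def is_captured(mol_start: int, mol_len: int, probe_starts: Iterable[int],
                probe_len: int = 120, min_overlap: int = 16) -> bool:
    """A molecule is captured if it overlaps any probe by at least `min_overlap` bases."""
    mol_end = mol_start + mol_len
    return any(
        overlap(mol_start, mol_end, p, p + probe_len) >= min_overlap
        for p in probe_starts
    )
```

Molecule lengths could then be drawn from a distribution with mean 170bp and the chosen spread, and each molecule kept only if is_captured returns True.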

Accuracy and performance measures

how to deal with read pairs with opposite barcode

The FASTQ was generated as follows:
each DNA molecule was attached to different adaptors, yielding two DNA molecules, one from each strand. The two resulting molecules were sequenced and received different barcodes. For example, mate pairs from the + strand are tagged with barcode ATG-----TTA, while those from the - strand are tagged with barcode TTA------ATG.
We want to call a consensus from the mate pairs of both strands.
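
One way to group both strands, sketched here as a preprocessing idea and not a Calib feature, is to map each barcode pair to a canonical order so that ATG/TTA and TTA/ATG receive the same key. The function name is hypothetical. Whether the mates themselves also need to be swapped before clustering is not addressed here.

```python
from typing import Tuple


def canonical_barcode(bc1: str, bc2: str) -> Tuple[str, str]:
    """Return the two barcode halves in a fixed (lexicographic) order."""
    return (bc1, bc2) if bc1 <= bc2 else (bc2, bc1)
```

Read pairs from the two strands of the same molecule then share one key, while unrelated barcode pairs stay distinct.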

Calib uses more than 128 GB memory on two 1.2 GB FASTQ files

Hi,

I have two 1.2 GB FASTQ files (R1, R2) containing fewer than 3.5 million reads.

Despite the dataset being small, the tool uses more than 128GB. Because of this, I have yet to find a setting where calib runs to completion.

I currently use these settings:

calib --threads 1 --error-tolerance 1 --kmer-size 8 --minimizer-count 7 --minimizer-threshold 2 -f $R1_UNPACKED -r $R2_UNPACKED -o $CALIB_OUT_PREFIX -l $BARCODE_LENGTH

Where BARCODE_LENGTH is 8 in this case. Note that these settings are the defaults for 150 bp reads, but the error tolerance for the barcodes has been set to 1. Since there are only 65536 barcodes, all possible barcodes are used.
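
The figure of 65536 is the number of distinct 8-base barcodes over A, C, G and T, that is 4^8. The sketch below reproduces that arithmetic and gives the average number of reads per barcode for an upper bound of 3.5 million reads. The function names are hypothetical.

```python
def barcode_space(barcode_len: int, alphabet_size: int = 4) -> int:
    """Number of distinct barcodes of the given length over the alphabet."""
    return alphabet_size ** barcode_len


def mean_reads_per_barcode(n_reads: int, barcode_len: int) -> float:
    """Average reads per barcode if every possible barcode is used."""
    return n_reads / barcode_space(barcode_len)


print(barcode_space(8))                                 # 65536
print(round(mean_reads_per_barcode(3_500_000, 8), 1))   # 53.4
```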

I am currently tweaking the settings, but I wonder if there is any advice on how to run this tool on this sort of data. Thanks!

README file

Get rid of dict use when finding nodes

Output read # as part of the TSV file
Sort the TSV file using system's sort by barcode then by minimizers
Turn the node dict into a node list

Seg fault on multiple files and default parameters not selected

I installed calib via conda and tried two different sets of paired-end FASTQ files with a barcode only in Read 1. On the first set of files, I ran without providing parameters and got a segmentation fault.

% calib -f test_illumina_50000_read1.fastq -r test_illumina_50000_read2.fastq -l1 8 -l2 0 -o my_out. --no-sort
No error or minimizer parameters passed. Selecting parameters based on barcode and inferred read length
Inferred read length 250 from sample of 10000 reads
Selected paramters for (mean) barcode length 4 are:
        error_tolerance 1
        kmer_size       8
        minimizer_count 7
        minimizer_threshold     2
Extracting minimizers and barcodes...
Memory before reading FASTQ:
zsh: segmentation fault  calib -f test_illumina_50000_read1.fastq -r  -l1 8 -l2 0 -o  --no-sor

The first few lines of the files tested above looked like this:

% head -4 test_illumina_50000_read1.fastq 
@0
CTGTGACGTGAGGAGACGGTGACCGTGGTCCCTTGGCCCCACGCAGATTCCTTTGTATCGGTGTTCCGGTTGGATAAAGGGTACCTCGCTGAACAGTAATACACGGCCGTGTCCTCAGATCTCAGGCTGCTCAGCTCCATGTAGGCTGTGCTTATGGAGGTGTTCCTGGTGATGGTGACTCTGCCCTGGAACTTCTGTGCATAGCCGAAGAACGCATGAGTTGTCTCCCATCCCATCCACTCAAGCCCTT
+
=B>B=8:9B@@B9=:=98@=@<<BB;B=9A<9@;88<8B<BA;B?;9B=<8<@;>9>BA:B>@A<A99=8>@??B9B8A;<=9B=@B9;==9@@;B<:;9;<<@<::?89>=>8:8:99<@<;8?>@B;<A88?B>:B>@??;9A99A88<<?B@A>A?A9A;9A<:<:9<B=B9;A:8<A@89<@A;??8B9:@8=BB>;8?BA<<<<@8>=<8@<@B9=<8?:<<;:898:@;9<=?:BAA8AB><BA
% head -4 test_illumina_50000_read2.fastq
@0
GCTCTCAGCAGGTGCAGCTGGTGCTGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTTACCCCAGATGCATGCTCCTACTCGCGATCAACTGGGTGCGACAGGCTACTGGACAAGGGCTTGAGTGGATGGGATGGGAGACAACTCATGCGTTCTTCGGCTATGCACAGAAGTTCCAGGGCAGAGTCACCATCACCAGGAACACCTCCATAAGCACAGCCTACAT
+
AAA@<9@?88?<;;BB8?<?;8:>=8>B8;:?>A=@9=8A;8=:>9:B?:>:=8?>;;A;<8=?=;8;9B=>9@8@=<9<8=@?8@><A;>A=?=?>>;A9<;;:9:=88=9;;>8<;<;@@:<:>><:<:;8AB;>:<;9A9::=><>=B?>9>=<>:9<<:B9?A>B=9?88A@=AB=;?B?A8::9?9B8?<;?>:@=8<==8?B=?B8=8?;8=>;:=<=>8B9::A??9??@9;9>8>9>B;A;9

Next, I tried a new set of paired-end reads (10,010 reads) and the selected parameters were all -1:

% calib -f LM_PCR_sample_reads_R1.fastq -r LM_PCR_sample_reads_R2.fastq -l1 8 -l2 0 -o my_out.
No error or minimizer parameters passed. Selecting parameters based on barcode and inferred read length
Inferred read length 301 from sample of 10000 reads
Selected paramters for (mean) barcode length 4 are:
        error_tolerance -1
        kmer_size       -1
        minimizer_count -1
        minimizer_threshold     -1
Missing clustering error and minimizer parameters!
Calib: Clustering without alignment using LSH and MinHashing of barcoded reads
Usage: calib [--PARAMETER VALUE]
Example: calib -f R1.fastq -r R2.fastq -o my_out. -e 1 -l 8 -m 5 -t 2 -k 4 --silent
Calib's paramters arguments:
        -f    --input-forward                   (type: string;   REQUIRED paramter)
        -r    --input-reverse                   (type: string;   REQUIRED paramter)
        -o    --output-prefix                   (type: string;   REQUIRED paramter)
        -s    --silent                          (type: no value; default: unset)
        -q    --no-sort                         (type: no value; default:  unset)
        -g    --gzip-input                      (type: no value; default:  unset)
        -l    --barcode-length                  (type: int;      REQUIRED paramter unless -l1 and -l2 are provided)
        -l1   --barcode-length-1                (type: int;      REQUIRED paramter unless -l is provided)
        -l2   --barcode-length-2                (type: int;      REQUIRED paramter unless -l is provided)
        -p    --ignored-sequence-prefix-length  (type: int;      default: 0)
        -m    --minimizer-count                 (type: int;      default: Depends on observed read length;)
        -k    --kmer-size                       (type: int;      default: Depends on observed read length;)
        -e    --error-tolerance                 (type: int;      default: Depends on observed read length;)
        -t    --minimizer-threshold             (type: int;      default: Depends on observed read length;)
        -c    --threads                         (type: int;      default: 1)
        -h    --help

Then, I tried providing my own parameters but got a segmentation fault.

% calib -f LM_PCR_sample_reads_R1.fastq -r LM_PCR_sample_reads_R2.fastq -l1 8 -l2 0 -o my_out. -m 7 -k 8 -e 1 -t 2
Extracting minimizers and barcodes...
Memory before reading FASTQ:
zsh: segmentation fault  calib -f LM_PCR_sample_reads_R1.fastq -r LM_PCR_sample_reads_R2.fastq -l1 8

The first few lines of the files tested above looked like this:

% head -4 LM_PCR_sample_reads_R1.fastq           
@M03525:380:000000000-CDJ38:1:1101:16781:1441 1:N:0:TTCTGCCT
CGGCTTACAATTCCTGCGACTATTTCCCTTTCCTCCGCTTAAGGGCCTAGGAGTCCGTTGTTGGCATGGTTGCAGTTCCTGGTGGCGTGTTGTGTTGACACGTTCTCTAGAACGCATGCTGCGGAGCAGATGGTTCCGAGGCAGCCACGCTGAGGAAATGCTGTGTGCCTCATGCTAGAGATTTTCCACACTGACTAAAAGGGTCTTATAGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTCTGCCTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAATATCACGCCACTC
+
A-,@@@8E,,CCFFE<@7+8@+CEEEEEFFFFEEFF7CCC7,,,,,8B6,,,,CEE7FG7FE,,C,C8,EF7E9<EFEEFC8C8,C+C:FF8E7EF8,B?F7EEFFEFC9,?E7F@FCC?9BC7B7><F,?,8EEEE7+4=C:7CF@FDEEC+>@8DCF9FF9E8F@FFCF7D9;EGFGCGGGGGGGGGGGGGGGGGFGGEGGGGGGGGFGGGGGGGGGGGCFGGGGGGGFGGGGGGGFGGFFFAFFF=FFFFFCBEAEDFFFFFBF?>B?DB>><AF::4<?0>FFF(4.64),(,((((
% head -4 LM_PCR_sample_reads_R2.fastq 
@M03525:380:000000000-CDJ38:1:1101:16781:1441 2:N:0:TTCTGCCT
GCTATAAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCATGAGGCACACAGCATTTCCTCACCCTTCCTTCCTCTCTCCCCTCTCCTCCTCCTCCTCCCTTCTTCATTCCTTTTCCCCCCCCCACCCCCCCCCCCCCTCCCTCCCTCCCCACCACCCCCTCCTCTTCCCTTCCCCCCTCGTCCCCTTCCTATTCCCCCCCTTTGTCACCCCCCTTCCCCACTCCCTCCTCTTTGCCTCCCTTCTCTTTCTCCCTCCTCCCCCTCTCCTTCCCCATCCCCTCCTCCCCCTCCCCCCCCCCC
+
-ACCCGGGGGGGGGGGGGGGGGC9C<,,6,,;CBEF,,<,C,,,,:,;,;,,;,;CC6@;C,,;,6,,;;,;;66,,,,6:,689,96:9,6,,:,4,:,4599,,,,,,:,5,59,,4+8++6+6+64+84+++++33*6*,43*3,61*,1*,****4*64,,,4622,2***3**************2++23*2*))*)00*2***01))))*2***))*).).0)))/******)()))(0))**.)))(,(((((--((((,()))(((((((((((,(((,(,(,((((,(,((-

Output corrected fastq files

Find consensus of a cluster of reads

From a read cluster, we need to find the left consensus and the right consensus. Since the reads are highly similar, we can assume no indels, and those reads with indels will be highly corrected.
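
Under the no-indel assumption, a consensus can be sketched as a per-position majority vote. This is an illustration only and is not presented as how calib_cons works. The function name is hypothetical, and reads of unequal length are truncated to the shortest read as a simplifying assumption.

```python
from collections import Counter
from typing import List


def majority_consensus(reads: List[str]) -> str:
    """Per-column majority vote over equal-orientation reads, assuming no indels."""
    if not reads:
        raise ValueError("cluster is empty")
    length = min(len(r) for r in reads)
    consensus = []
    for i in range(length):
        # Count the bases seen at column i and keep the most frequent one.
        column = Counter(r[i] for r in reads)
        consensus.append(column.most_common(1)[0][0])
    return "".join(consensus)
```

The same function would be applied once to the left mates and once to the right mates of a cluster.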

Make wrapper around .cc and .py files

And maybe python install prerequisites
And also perhaps include make simulate in the make file

What's the best way to generate de-duplicated fastq files from the cluster file?

Hello,

I have 2 questions -

My UMI barcodes are in this format - 'XXXXXXXXNNNNNNNNN'. So, when I use calib, my -l option for barcode tag length should be 8 or 17?
After running calib and generating the cluster file, what is the best way to generate deduplicated fastq files?

Thanks!

How do I obtain the default parameters?

Dear Baraa,

A quick question. I understand that the parameters for k-mer size, minimizer count, error tolerance and minimizer threshold are set to defaults that depend on read length if I have not set them myself. Is there a way to find out these default parameters?

Thanks,
Wee

Get rid of HTML files somehow

It sucks to see that +90% of the code is HTML
