From [email protected] on July 06, 2012 19:23:08
Hi, I tried to test the performance of CutAdapt, so I simulated 1M reads with 2 adapters randomly added to either the 3' or the 5' end. I also randomized the length of the adapter sequence added to each read.
In summary, I generated 39957 contaminated reads, half of them contaminated with adapter 1 and the other half with adapter 2. Adapter 1 is 25 bp long and adapter 2 is 33 bp long.
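A minimal sketch of how one such contaminated read could be generated is shown below. It is an illustration only, not the actual simulation script: the function name `contaminate`, the placeholder adapter sequence, the 50 bp read length, and the choice of which adapter bases are kept (the last k bases for 5' contamination, the first k bases for 3' contamination) are all assumptions.

```python
import random


def contaminate(read, adapter, rng):
    """Attach a randomly sized piece of `adapter` to one randomly chosen end of `read`.

    Hypothetical helper: returns the contaminated read, the chosen end,
    and the number of adapter bases that were added.
    """
    k = rng.randint(1, len(adapter))  # number of adapter bases to add (1..full length)
    if rng.random() < 0.5:
        # 5' contamination: the last k adapter bases precede the read
        return adapter[-k:] + read, "5'", k
    # 3' contamination: the first k adapter bases follow the read
    return read + adapter[:k], "3'", k


rng = random.Random(0)                      # fixed seed for reproducibility
adapter = "ACGT" * 6 + "A"                  # placeholder 25 bp sequence, not a real adapter
read = "".join(rng.choice("ACGT") for _ in range(50))
contaminated, end, k = contaminate(read, adapter, rng)
print(end, k, contaminated)
```

Recording the chosen end and length for every read gives the ground truth against which the histograms reported by each command can be compared.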
Then I ran CutAdapt 3 times using the following commands:
CutAdapt -b adapter_1 -b adapter_2 contamined.reads
CutAdapt -a adapter_1 -a adapter_2 contamined.reads
CutAdapt -g adapter_1 -g adapter_2 contamined.reads
Here are the histograms of adapter lengths (I only posted the relevant part of the output):
For command: CutAdapt -b adapter_1 -b adapter_2 contamined.reads
=== Adapter 1 ===
Histogram of adapter lengths (5')
length count
24 375
25 402
Histogram of adapter lengths (3' or within)
length count
24 400
25 394
=== Adapter 2 ===
Histogram of adapter lengths (5')
length count
32 277
33 288
Histogram of adapter lengths (3' or within)
length count
32 301
33 346
For command: CutAdapt -a adapter_1 -a adapter_2 contamined.reads
=== Adapter 1 ===
Histogram of adapter lengths
length count
24 400
25 1558
=== Adapter 2 ===
Histogram of adapter lengths
length count
32 301
33 1526
For command: CutAdapt -g adapter_1 -g adapter_2 contamined.reads
=== Adapter 1 ===
Histogram of adapter lengths
length count
24 375
25 1604
=== Adapter 2 ===
Histogram of adapter lengths
length count
32 277
33 1546
In my simulation, the full-length adapters are distributed as follows: 402 reads carry the 25 bp adapter 1 at the 5' end, 394 reads carry it at the 3' end, 288 reads carry the 33 bp adapter 2 at the 5' end, and 346 reads carry it at the 3' end.
Therefore, with the -b option, the sensitivity and specificity of CutAdapt are almost 100%. With the -a or -g option, the sensitivity is still 100%, but the false positive rate of trimming increases significantly (from 0 to around 0.1%).
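The size of the discrepancy can be checked with a short calculation on the full-length counts quoted above. This is a rough sketch under one assumption that should be stated loudly: a full-length adapter at either end of a read is counted as a legitimate match for -a and -g, so only matches beyond the simulated 5' plus 3' totals are treated as excess. The variable names are illustrative.

```python
TOTAL_READS = 1_000_000

# Simulated reads carrying a full-length adapter (5' count + 3' count).
simulated_full_length = {"adapter_1": 402 + 394, "adapter_2": 288 + 346}

# Full-length matches reported in the histograms for -a and -g.
reported_full_length = {
    "-a": {"adapter_1": 1558, "adapter_2": 1526},
    "-g": {"adapter_1": 1604, "adapter_2": 1546},
}

# Assumption: anything beyond the simulated totals is an excess (false positive) match.
excess = {
    option: sum(counts[name] - simulated_full_length[name] for name in counts)
    for option, counts in reported_full_length.items()
}

for option, n in excess.items():
    print(f"{option}: {n} excess full-length matches "
          f"({100 * n / TOTAL_READS:.2f}% of all reads)")
```

Under a stricter assumption (for example, counting only 3' adapters as legitimate matches for -a), the excess would be larger, but it stays in the same order of magnitude.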
I am really confused by this result.
Based on a previous post: https://code.google.com/p/cutadapt/issues/detail?id=8 , I thought the decreased specificity might be caused by the error-tolerant matching of longer adapter sequences. To test this possibility, I ran the command: CutAdapt -a adapter_1 -a adapter_2 -e 0.05 contamined.reads
=== Adapter 1 ===
Histogram of adapter lengths
length count
24 400
25 1171
=== Adapter 2 ===
Histogram of adapter lengths
length count
32 301
33 911
So I did observe a decrease in the false positive rate when I used -e 0.05.
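To make the effect of -e concrete, the number of errors tolerated per match can be sketched as follows. This assumes that cutadapt allows floor(match length * error rate) errors and that the default error rate is 0.1, as I understand the documentation; `max_errors` is a hypothetical helper for illustration, not a cutadapt function.

```python
import math


def max_errors(match_length, error_rate):
    """Errors tolerated for a match of the given length (assumed rule: floor(length * rate))."""
    return math.floor(match_length * error_rate)


for adapter_length in (25, 33):
    default = max_errors(adapter_length, 0.1)    # assumed default error rate
    stricter = max_errors(adapter_length, 0.05)  # value passed with -e 0.05
    print(f"{adapter_length} bp adapter: {default} errors at -e 0.1, {stricter} at -e 0.05")
```

Under this assumption, a full-length match of either adapter tolerates fewer errors with -e 0.05 than with the default, which is consistent with the lower counts in the last histogram.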
What I still do not understand is why the specificity is "perfect" when I use the -b option.
I am running the test with cutadapt version 1.0 on CentOS.
Best,
Ying
Original issue: http://code.google.com/p/cutadapt/issues/detail?id=47