s4hts / htstream
A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
Home Page: https://s4hts.github.io/HTStream/
License: Apache License 2.0
Find a FLASH replacement
Takes a tab-delimited file and generates a report
Perhaps check that output path exists and fail gracefully?
In git commit: 9140633
,"Overlapper_28917": {
"Notes" : "",
"SE_Length" : 50837768,
"R2_Adapter_Trim" : 0,
"R1_Adapter_Trim" : 0,
"R2_Discard" : 0,
"R1_Discard" : 0,
"SE_In" : 3165,
"PE_Out" : 153892,
"TotalReadsOutput" : 0,
"PE_In" : 323417,
"TotalReadsInput" : 326582,
"lins" : 165433,
"SE_Out" : 172690,
"SE_Discard" : 0,
"sins" : 4092,
"Nolins" : 153892
}
,"Phix-Remover_28932": {
"Notes" : "",
"SE_Out" : 0,
"TotalReadsInput" : 221207,
"PE_In" : 221207,
"TotalReadsOutput" : 205954,
"PE_Out" : 205954,
"SE_In" : 0
}
I'm not sure about the cost/benefit of static vs. dynamic linking, but post-install I need to make sure I load my boost module before using HTStream. If static linking is a big no-no, then this can be lived with, and when we get to the point of installing HTStream as a module, it will load boost automatically. More curious as to the choice.
msettles@ganesh: build$super-deduper --version
Unhandled Exception: character conversion failed
Hi, @bioSandMan
I am going to be a pain and change my mind about the drop-down. Any way you could put the drop-down options directly under the drop-down button?
That way the user can immediately find all the applications of interest, and nothing is hidden in the slightest.
Thanks!
Removes low-quality bases from the ends of reads
Just a note to fix it
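A minimal sketch of one way such quality-based end trimming could work (a sliding-window mean with made-up defaults; not necessarily HTStream's actual algorithm):

```python
# Hypothetical sliding-window end trimmer (illustrative only, not HTStream's code):
# walk in from each end, cutting until the window's mean quality clears min_q.

def trim_ends(quals, min_q=20, window=5):
    """Return (start, end) indices of the retained span of `quals`."""
    start, end = 0, len(quals)
    # trim from the left while the leading window is low quality
    while end - start >= window and sum(quals[start:start + window]) / window < min_q:
        start += 1
    # trim from the right while the trailing window is low quality
    while end - start >= window and sum(quals[end - window:end]) / window < min_q:
        end -= 1
    return start, end
```

For example, `trim_ends([2, 2, 30, 30, 30, 30, 2, 2], min_q=20, window=2)` keeps only the high-quality core, returning `(2, 6)`.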
@id
ATGCTCACTACCACTCTCTCCGTACTGGAATATTCATCTACTCCCATTTTGCTGACTCTTATACCTCTCAGCGACATTTCCGTGCTTGGGGTCATATCAGGCAATATTCCGATCATTCCCATTACACTGTCAATGGGTTCAATCCCCCAGTTCTGGAAAAGCACTTTTGCATCCTTTTGGAAATGCCTCAAGAGTTGGTG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@id
TTTCCCCAACGATTGCTCCCTCCTCAGTGAAAGCCCTTAGTAGTATCAAAGTTTCTAATCGATTAAAGATCACACTGAAGTTCGCTTTCAGTATAATGTTCTTTTCCATGATCGCCTGGTCCATTCGCACATAAAGAGGGCCTATTATCTTTTGCCTAGGCATGAGCATGAACCAGTCTCGTGACATTTCCTCGAGGGTCATGTCAGCTATGT
+
"IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@id
NCGCCAAATATGCCTCTAGTTTGTTTCTCTGGCACATTCCGCATTCCTGTTGCCAATTTCAGAGTGTTTTGCTTAACATATCTGGGACAGACCCCATATGTGATCCTGTTTACATTTTGAAAGGGTTTGTCATTGGGAATGCTTCCATTTGGAGTAATGCATTCAGATTTGCAATTGACAATGGGTGCATCTGATCTCATTATTGAGCTTTTCCCACTCTGTATTTTGAAGTAACCCCGAGGGGCAATTAGATTCCCTGTGCTGTTAATCAAAAGTATGTCTCCCGGTTTTACTATTGT
+
#<0<BHIHHHHEGHHEHHHI?GGHIIEEHIE1DHHIIFHHFCDHEHEH7III4IIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII=;GI<<IIGIIIIIIATTCCCTGTGCTGTTAATCAAAAGTATGTCTCCCGGTTTTACTATTGT
@id
GGGCACCCCAGCTCAATCCTATTGATGGACCACTACCTGAAGATAATGAACCAAGTGGGTATGCACAAACAGACTGTGTCCTGGAGGCCATGGCTTTCCTTGAAGAATCCCACCCAGGGATATTTGAGAATTCATGCCTTGAAACAATGGAAATTGTCCAACAAGCAAGGGTGGATAAACTAACTCAGGGTCGCCAGACTTATGATTGGACATTAAATAGAAATCAACCGGCAGCAACTGCATTGGCCAACACCATAGAAGTTTTTAGATCAAATGGTCTCACAGCTAATGAGTCAGGAAGGCTAATAGATTTCTTAAAAGATGTAATGGAATCAATGGATAAAGAGGAAATAGAGATAACAACACACTTTCAAAGAAAAAGGAGAGTAAGGGATAACATGACCAAGAAGATGGTCACACAA
+
DCDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIHIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIACCATAGAAGTTTTTAGATCAAATGGTCTCACAGCTAATGAGTCAGGAAGGCTAATAGATTTCTTAAAAGATGTAATGGAATCAATGGATAAAGAGGAAATAGAGATAACAACACACTTTCAAAGAAAAAGGAGAGTAAGGGATAACATGACCAAGAAGATGGTCACACAA
@id
AATGTACTCAAATGCAAATGTTGCACCTAATGTTGCCTTTTTGGCAGGCCCACATAATGAACCCCAGCAGAACAACACAAAGCAAAAAGCATGACATGGCAAAGGAAATCCATAGGATCCAATCTTTGTATCCTGACTTCAGCTGAACACCTTTGATCTGGAACCGATTGCTTAATGCCTCGTCTCTGTATAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@id
CCTCTGCTGCTTGTTCACTTGACCCAGCCATTTGTTCCATGGCCTTAGCTGTTGTGCTGGCTATTACCATTCTGTTCTCGTGCCTGATTAGTGGATTGGTTGTTGTCACCATTTGTCTATGAGATCGATGCTGGGAATCAGCAATCTGTTCACAGGTTGCGCATACCAGACCAAATGCCACCTCAGTGGCGACAGTCCCCATTCTGTTGTATAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@id
NCAGAATGCCCTCAATGGGAATGGTGACCCAAACAACATGGACAAGGCAGTCAAACTGTATAGGAAACTAAAACGAGAAATAACATTCCATGGGGCCAAAGAAGTAGCACTCAGTTATTCTGCCGGTGCACTTGCCAGTTGCATGGGCCTCATATACAACAGAATGGGGACTGTCGCCACTGAGGTGGCATTTGGTCTGGTATGCGCAACCTGTGAACAGATTGCTGATTCCCAGCATCGATCTCATAGACAAATGGTGACAACAACCAATCCACTAATCAGGCACGAGAACAGAATGGTAATAGCCAGCACAACAG
+
#<DDDIGHHHHIIIIIHHIIHIHHIEHIHIHHIIIIHHHIHIIIIHIIIHCEHIHHHIHHIHIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"AAATGGTGACAACAACCAATCCACTAATCAGGCACGAGAACAGAATGGTAATAGCCAGCACAACAG
@id
How were the parameters chosen? In a quick test, the new Phix-remover identified twice as many reads as phiX compared to the old bowtie2 local method. Not saying this isn't correct, but curious what was done to validate.
Why print the WHOLE phiX sequence on help? It can probably be reduced via ...
default
-x [ --hits ] arg (=50) How many 8-mer hits to phix needs to happen to discard
What happens if there aren't 50 k-mers (a short --seq)? Or wouldn't it be better to use a % of perfect k-mers of the shorter of the seq or the read? Something like that.
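The "% of perfect k-mers" idea can be sketched like this (k and the scoring are illustrative assumptions, not HTStream's actual implementation):

```python
# Sketch: score a read by the fraction of its k-mers found in a phiX k-mer set,
# rather than a fixed hit count. Values and names are illustrative only.

def kmers(seq, k=8):
    """All k-mers of `seq` as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def phix_kmer_fraction(read, phix_kmers, k=8):
    """Fraction of the read's k-mers that hit the phiX k-mer set."""
    n = len(read) - k + 1
    if n <= 0:
        return 0.0  # read shorter than k: no k-mers to check, nothing to score
    hits = sum(1 for i in range(n) if read[i:i + k] in phix_kmers)
    return hits / n
```

A fraction-based threshold degrades gracefully for short reads or a short `--seq`, where a fixed count of 50 hits may be unreachable.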
Stats.out output isn't consistent with the other sub-apps' stats output
If any trimming tool removes enough bases that a read would be discarded, have the option of discarding the other read in the pair as well. This will allow the user to know that all reads are paired if needed.
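A sketch of the requested pair-discard behavior (function name and the min-length default are hypothetical):

```python
# Hypothetical pair filter: if trimming leaves one read of a pair below the
# minimum length, optionally discard its mate too so the output stays paired.

def filter_pair(r1, r2, min_len=50, discard_mate=True):
    """Return (r1, r2), a single surviving orphan, or None."""
    keep1, keep2 = len(r1) >= min_len, len(r2) >= min_len
    if keep1 and keep2:
        return (r1, r2)
    if discard_mate:
        return None  # drop both so every surviving read is guaranteed paired
    return r1 if keep1 else (r2 if keep2 else None)
```

With `discard_mate=True`, downstream tools can rely on strict pairing; with it off, orphans pass through as SE reads.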
Running super-deduper like:
~/HTStream/build/Super-Deduper/super_deduper -1 00-RawData/1116_S61_R1_001.fastq.gz -2 00-RawData/1116_S61_R2_001.fastq.gz -t -O
Will immediately start generating output reads even though the gzip processes are still running. This doesn't seem possible. Is super_deduper using -q mode by default and skipping the best-quality check?
Wondering what the verdict/timeline is for the possible conversion of application output stats to JSON. For me, at least, preprocessing stats are as important as the actual processed reads, as they give SO much information on whether I can even use the data. So I can't really see moving over to HTStream, even to test vigorously, until there are good stats to review afterwards.
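A minimal sketch of what per-app JSON stats could look like, mirroring the shape of the JSON fragments earlier in this dump (the counter names here are illustrative):

```python
# Sketch: serialize per-app counters as JSON, shaped like the
# "Overlapper_28917" / "Phix-Remover_28932" fragments above.
import json

def stats_json(program, counters):
    """Render one app's counters as a pretty-printed JSON object."""
    return json.dumps({program: {"Notes": "", **counters}}, indent=2)
```

Emitting one such object per app (keyed by program name and PID, as in the fragments above) would let downstream reporting tools parse the stats instead of scraping tab-delimited text.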
Removes As and Ts at both ends of the sequences
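One possible reading of this poly-A/T end trimming, sketched with a mismatch allowance (parameter names and defaults are guesses, not HTStream's):

```python
# Hypothetical poly-A/T end trimmer: strip runs of A/T from both ends,
# tolerating up to max_mm interior mismatches. Illustrative only.

def _run(s, bases, max_mm):
    """Length of the leading run of `bases` in s, allowing max_mm mismatches."""
    miss, cut = 0, 0
    for i, c in enumerate(s):
        if c in bases:
            cut = i + 1  # extend cut to the last matching base
        else:
            miss += 1
            if miss > max_mm:
                break
    return cut

def trim_poly_at(seq, bases="AT", max_mm=1):
    left = _run(seq, bases, max_mm)
    seq = seq[left:]
    right = _run(seq[::-1], bases, max_mm)
    return seq[:len(seq) - right] if right else seq
```

For example, `trim_poly_at("AAATAGGGCCCTTT")` strips the poly-A head and poly-T tail, returning `"GGGCCC"`.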
commit 9140633
-s [ --stranded ] If R1 is orphaned, R2 is RC (for
stranded RNA)
-m [ --min-length ] arg (=50) Min length for acceptable outputted read
-l [ --no-left ]                  Turns off trimming of the left side of
                                  the read
-r [ --no-right ]                 Turns off trimming of the right side of
                                  the read
-s [ --seq ] arg                  Please supply a fasta file - default -
                                  Phix Sequence - default
                                  https://www.ncbi.nlm.nih.gov/nuccore/9626372
If -s and -l are out of bounds, ignore the read.
EG:
Program        TotalRecords  R2_Discarded  R1_Discarded  R2_Length  SE_Out  SE_Length  Ignored  TotalReadsInput  PE_In  PE_Out  SE_In  SE_Left_Trim  Sins  Replaced  SE_Discarded  R2_Right_Trim  R1_Left_Trim  SE_Right_Trim  R1_Right_Trim  R1_Length  R2_Left_Trim  Overlap_BPs  Lins
Super-Deduper  267139  0  0  66367161  0  0  0  0  0  264411  0  0  0  2728  0  0  0  0  0  66367161  0  0  0
Q-Trim         264411  14  0  65862254  14  0  0  0  0  264397  0  0  0  0  0  496200  231  0  440264  65926666  5193  0  0
Overlapper     264411  0  0  24600101  163746  41630620  0  0  0  98694  0  0  0  0  1971  0  0  0  0  24669593  0  0  0
Hey, @samhunter
Specific to AT trim - what should the default values be for min trim length and number of mismatches?
In general, what should the min accepted length default be?
All trimming algorithms will also have parameters for stranded, 3' trim, 5' trim.
Anything I am missing? We can run tests later to actually get optimal values, but is there a decent first guess?
Thank you!
Now getting
-- Install configuration: ""
-- Up-to-date: /share/biocore/software/lib/libhts_common.so
-- Up-to-date: /share/biocore/software/bin/super-deduper
-- Up-to-date: /share/biocore/software/bin/tab-converter
-- Up-to-date: /share/biocore/software/bin/polyATtrim
-- Up-to-date: /share/biocore/software/bin/q-window-trim
-- Up-to-date: /share/biocore/software/bin/cut-trim
-- Up-to-date: /share/biocore/software/bin/phix-remover
-- Up-to-date: /share/biocore/software/bin/overlapper
-- Up-to-date: /share/biocore/software/bin/n-remover
msettles@ganesh: build$phix-remover 0h
phix-remover: error while loading shared libraries: libhts_common.so: cannot open shared object file: No such file or directory
msettles@ganesh: build$phix-remover -h
phix-remover: error while loading shared libraries: libhts_common.so: cannot open shared object file: No such file or directory
Just overwrites
Hey @joe-angell and @samhunter , hate to be a noodge, but I'm having issues compiling on both develop and master.
(Talking about develop.) I had to add #include <numeric> to use std::accumulate in common/src/read.cpp; however, after that I am still getting this error when trying to make.
[ 86%] Linking CXX executable Super-Deduper_test
ld: library not found for -llibhts_common.so
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [Super-Deduper/Super-Deduper_test] Error 1
make[1]: *** [Super-Deduper/CMakeFiles/Super-Deduper_test.dir/all] Error 2
Also, a ton of warnings with gtest stuff. Is that expected?
Any suggestions?
Thanks, sorry to be a newb with this cmake stuff.
So I ran the scheme on a dataset already processed and the results differ somewhat, so I would like to also know how this app has been verified; there appear to be some mods.
Question 1
-e [ --hist-file ] arg A tab delimited hist file with insert lengths.
This is to output the hist?? The help text seems to imply it's an input file.
Question 2
-k [ --kmer ] arg (=8) Kmer size of the lookup table for the
longer read
-r [ --kmer-offset ] arg (=1) Offset of kmers. Offset of 1, would be
perfect overlapping kmers. An offset of
kmer would be non-overlapping kmers
that are right next to each other. Must
be greater than 0.
I don't recall kmers being a part of the old algorithm, how are they used here?
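One guess at how a k-mer lookup table with an offset could seed overlap detection — this is an assumption about the algorithm's purpose, not taken from the source:

```python
# Hypothetical seeding scheme: index every `offset`-th k-mer of the longer
# read, then look up k-mers from the shorter read to propose candidate
# alignment offsets. Illustrative only.

def build_table(seq, k=8, offset=1):
    """Map k-mer -> list of start positions, sampled every `offset` bases."""
    table = {}
    for i in range(0, len(seq) - k + 1, offset):
        table.setdefault(seq[i:i + k], []).append(i)
    return table

def candidate_offsets(short, table, k=8):
    """Candidate start positions of `short` within the indexed long read."""
    hits = set()
    for j in range(len(short) - k + 1):
        for i in table.get(short[j:j + k], []):
            hits.add(i - j)  # implied placement of `short` in the long read
    return hits
```

With `offset=1` every k-mer is indexed (maximum sensitivity); with `offset=k` the table is k-fold smaller at the cost of coarser seeding, which would explain the help text's overlapping vs. non-overlapping wording.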
Question 3
-c [ --check-lengths ] arg (=20) Check lengths on the ends
???? What is this
Question 4
-a [ --adapter-trimming ] Trims adapters based on overlap, only
returns PE reads, will correct quality
scores and BP in the PE reads
I think you mean "Only perform adapter trimming based on overlap; PE input produces PE output trimmed of adapters", yes??
Question 5
-s [ --stranded ] Makes sure the correct complement is returned upon overlap
Just curious where this comes up - did someone see an example? 'stranded' is important when 1) the input is RNA and 2) Read 1 is removed and Read 2 is retained; in that case only Read 2 needs to be RC'd in order to correctly represent the underlying strand the read came from, as SE reads are inferred to be R1 reads.
RC R2 if R1 is orphaned
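The stranded behavior described above can be sketched as follows (the function name is hypothetical):

```python
# Sketch: when R1 is dropped and R2 survives as an SE read, reverse-complement
# R2 (and reverse its qualities) so it represents the strand an R1 read would.

COMP = str.maketrans("ACGTN", "TGCAN")

def orphan_r2_as_se(r2_seq, r2_qual, stranded=True):
    """Return (seq, qual) for an orphaned R2 emitted as an SE read."""
    if stranded:
        return r2_seq.translate(COMP)[::-1], r2_qual[::-1]
    return r2_seq, r2_qual
```

Note the quality string is reversed in lockstep with the sequence, so each base keeps its own score.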
$ ~/HTStream/build/Q-Window-Trim/q-window-trim -1 202_S14_R1_001.fastq.gz -2 202_S14_R2_001.fastq.gz -f -p htq -m 200
Unhandled Exception: No write_read class, only accessable with SE
$ ll
total 416
drwxrwxr-x 2 shunter grc 4096 Dec 30 01:27 ./
drwxr-xr-x 8 shunter grc 4096 Dec 30 01:23 ../
lrwxrwxrwx 1 shunter grc 341682103 Dec 30 01:26 202_S14_R1_001.fastq.gz
lrwxrwxrwx 1 shunter grc 366655813 Dec 30 01:26 202_S14_R2_001.fastq.gz
-rw-rw-r-- 1 shunter grc 204735 Dec 30 01:27 htqPE1.fastq
-rw-rw-r-- 1 shunter grc 204243 Dec 30 01:27 htqPE2.fastq
Returns the longest sequence without Ns
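The stated behavior reduces to finding the longest N-free stretch of the read; a minimal sketch:

```python
# Sketch of the stated n-remover behavior: keep the longest stretch of the
# sequence that contains no N. Illustrative, not the actual implementation.

def longest_without_n(seq):
    """Longest substring of `seq` containing no 'N'."""
    return max(seq.split("N"), key=len) if seq else ""
```

For example, `longest_without_n("NNACGTNACGTACGNN")` returns `"ACGTACG"`.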
Hey, @bioSandMan ,
There is something a bit odd when the GitHub Pages site loads. It first shows an unstyled HTML-like page (for a split second), then reformats to look correct. Could you look into why this is happening?
Thanks!
Quick 8-mer primer lookup
Basic meta stats on sequences
ACTG, Avg Length, Avg Quality
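These basic stats could be computed roughly as follows (assuming Phred+33 quality strings; the function and its output shape are illustrative):

```python
# Sketch of basic per-dataset meta stats: base composition, average read
# length, average quality. Assumes Phred+33 quality strings; illustrative only.

def meta_stats(records):
    """records: list of (sequence, quality_string) pairs."""
    counts = {b: 0 for b in "ACTG"}
    total_len = total_q = n_q = 0
    for seq, qual in records:
        total_len += len(seq)
        for b in seq:
            if b in counts:
                counts[b] += 1
        for q in qual:
            total_q += ord(q) - 33  # Phred+33 decoding
            n_q += 1
    n = max(len(records), 1)
    return {"base_counts": counts,
            "avg_length": total_len / n,
            "avg_quality": total_q / max(n_q, 1)}
```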
For all programs, the help-text output should note that the program is part of the HTStream package, probably with a link to the GitHub repo and a short description of what it does, and preferably version information. (I also think it would be amazing if the git commit hash were displayed, but that might be too much of a PITA.)
Currently the "-v" switch reports nothing.
Currently it is difficult to verify that the most recent version is installed.
Hey, @bioSandMan
(sorry to spam you with these)
Let's remove the picture place holder stuff.
I'm not sure how many applications will have corresponding pictures, but I imagine maybe only 1 or 2. If they do need a picture, I can just put them inline.
Let me know if you have any questions, about this one or any of the issues I just submitted.
Thank you!
-l --> hist file and minLength
-l [ --minLength ] arg (=50) Mismatches allowed in overlapped section
-x [ --max-mismatches ] arg (=5) Mismatches allowed in overlapped section
Hoping things are ready to test, I tried to install on our systems and ran into an issue. Hard to tell if it's the installer or our system. I've loaded modules for cmake and boost/1.60 BUT
""""
statussources: /share/biocore/software/src/HTStream/Overlapper/src/overlapper.cpp
CMake Error at /afs/genomecenter.ucdavis.edu/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/share/cmake-3.5/Modules/FindBoost.cmake:1657 (message):
Unable to find the requested Boost libraries.
Boost version: 1.54.0
Boost include path: /usr/include
Detected version of Boost is too old. Requested version was 1.56 (or
newer).
Call Stack (most recent call first):
Overlapper/CMakeLists.txt:14 (FIND_PACKAGE)
""""
It's looking in /usr/include for boost (which is old), while env reports
msettles@ganesh: HTStream$env | grep boost
CPPFLAGS=-I/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/include -I/software/boost/1.60/x86_64-linux-ubuntu14.04/include
LIBRARY_PATH=/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/lib:/software/boost/1.60/x86_64-linux-ubuntu14.04/lib
LD_LIBRARY_PATH=/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/lib:/software/boost/1.60/x86_64-linux-ubuntu14.04/lib
CPATH=/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/include:/software/boost/1.60/x86_64-linux-ubuntu14.04/include
LMFILES=/software/modules/3.2.10/x86_64-linux-ubuntu14.04/Modules/3.2.10/modulefiles/boost/1.60:/software/modules/3.2.10/x86_64-linux-ubuntu14.04/Modules/3.2.10/modulefiles/cmake/3.5.1
LOADEDMODULES=boost/1.60:cmake/3.5.1
BOOST_INCLUDE_DIR=/software/boost/1.60/x86_64-linux-ubuntu14.04/include/boost
BOOST_LIBRARY_DIR=/software/boost/1.60/x86_64-linux-ubuntu14.04/lib
So maybe a hard-coded path for boost??
Matt
@msettles @samhunter (Someone tag Alida as well). Hi, all. :)
I'm starting to update the website. It is located here: https://ibest.github.io/HTStream/ . Please feel free to update gh-pages (_layout/*.html) directly or submit issues/enhancements.
It is looking real rough right now, but it is a start. I will continue to add to it and edit it over the next couple of weeks. There is already some formatting / CSS stuff I need to iron out - however, don't hesitate to open an issue about anything.
When should we be merging into master? In the past, it always lagged pretty far behind.
$ n-remover -h
Tab-Converter
Options:
...
after loading module cmake/boost AND
cmake -DBOOST_ROOT=/software/boost/1.60/x86_64-linux-ubuntu14.04 -DCMAKE_BUILD_TYPE=Release -DBoost_NO_SYSTEM_PATHS=TRUE -DBoost_NO_BOOST_CMAKE=TRUE ..
[ 16%] Built target googletest
[ 18%] Linking CXX executable /usr/local/bin/hts_common_test
/usr/bin/ld: cannot open output file /usr/local/bin/hts_common_test: Permission denied
collect2: error: ld returned 1 exit status
make[2]: *** [/usr/local/bin/hts_common_test] Error 1
make[1]: *** [common/CMakeFiles/hts_common_test.dir/all] Error 2
make: *** [all] Error 2
Looks like it's trying to reference /usr/local/bin, which contains nothing, and which I as a user don't have access to write to.
Matt
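If the problem is the default /usr/local install prefix, the usual CMake workaround is to point CMAKE_INSTALL_PREFIX at a user-writable directory (the paths below are examples only, reusing the BOOST_ROOT from earlier in this thread):

```shell
# Example workaround: build and install into a user-writable prefix
# instead of the default /usr/local. Paths are illustrative.
mkdir -p build && cd build
cmake -DCMAKE_INSTALL_PREFIX="$HOME/htstream" \
      -DBOOST_ROOT=/software/boost/1.60/x86_64-linux-ubuntu14.04 \
      -DCMAKE_BUILD_TYPE=Release ..
make && make install
```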
It takes way too long to select the gh-pages branch!
Hey all.
As @msettles pointed out in the email last night, a pretty big feature we are missing is SE adapter trimmers.
I am assuming adapters can still only show up on the 3' end (just like in PE reads)? Or do we want a more robust tool that can fuzzy-cut both 5' and 3' (in case of primers or something at the start)? Or is that a different tool altogether?
What are your thoughts @samhunter and @msettles ?
Documentations
If you compare the output of superd using single-end reads, we get different results from the old super master. Something different must be happening with the keys, but I didn't look into it much.
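One hypothesis for the SE divergence, sketched below: if the dedup key concatenates fixed substrings of R1 and R2, the SE path must build its key in a consistent way (the offsets and lengths here are made up for illustration):

```python
# Hypothetical dedup key builder: PE keys concatenate fixed substrings of R1
# and R2; an SE read uses only the R1 part. If the SE path sliced differently
# from the PE path, results would diverge. Offsets/lengths are illustrative.

def dedup_key(r1, r2=None, start=10, length=10):
    """Build a duplicate-detection key from fixed read substrings."""
    key = r1[start:start + length]
    if r2 is not None:
        key += r2[start:start + length]
    return key
```

Reads sharing a key would be collapsed to one representative, so any inconsistency in SE key construction directly changes which reads survive.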
When trying to run with default parameters in a folder where I had no write permissions, it gave me a segmentation fault. Need a more informative error message.
Should N's match other N's?
How does overlapper deal with Ns?
Should overlapper subtract the quality value associated with an N, or just use the overlapped base + qual score?
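One possible N-handling policy for the questions above, sketched (this is a proposal, not necessarily what overlapper does):

```python
# Proposed N policy for overlap scoring and consensus: treat N as a wildcard
# when matching, and take the non-N base (with its quality) for consensus.

def bases_match(a, b):
    """Wildcard match: N matches anything, so Ns don't count as mismatches."""
    return a == b or a == "N" or b == "N"

def consensus_base(a, qa, b, qb):
    """Pick the consensus (base, quality) for one overlapped column."""
    if a == "N":
        return b, qb  # ignore the N call and its quality entirely
    if b == "N":
        return a, qa
    # two real bases: keep the higher-quality call
    return (a, qa) if qa >= qb else (b, qb)
```

Under this policy an N never subtracts from the overlapped quality; the real base and its score simply win.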
Seems weird to have to include an ending '_' when you specify the output file prefix (e.g. myoutput_). Can't the '_' be added by default (seems the norm)?
Cleaned reads should have a similar postfix to Illumina's, so instead of PE1/PE2 how about R1/R2 (maybe even include the _001, so cleaned_reads_R1_001.fasta.gz)? Seems more apps will expect R1/R2 within the read id than PE1/PE2 (our use of PE is, I think, legacy).
gz output by default, helps with good behavior ;) but of course only when outputting a file.
Hey, @joe-angell .
I am trying to model the dbhash class you created in Super-Deduper in Phix-Remover (phix_remover.h line 32) for some sensitivity/specificity testing. I keep getting an error saying m_bits is private. I have defined BOOST_DYNAMIC_BITSET_DONT_USE_FRIENDS, but I'm not sure what else I have to do.
I have some working code in TestingPhix that uses to_ulong(), but I think that is going to be too slow. Would you mind taking a look when you have a chance?
Thanks, Joe!