s4hts / htstream
A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
Home Page: https://s4hts.github.io/HTStream/
License: Apache License 2.0
Find a FLASH replacement
Takes a tab-delimited file and generates a report
Perhaps check that output path exists and fail gracefully?
In git commit: 9140633
,"Overlapper_28917": {
"Notes" : "",
"SE_Length" : 50837768,
"R2_Adapter_Trim" : 0,
"R1_Adapter_Trim" : 0,
"R2_Discard" : 0,
"R1_Discard" : 0,
"SE_In" : 3165,
"PE_Out" : 153892,
"TotalReadsOutput" : 0,
"PE_In" : 323417,
"TotalReadsInput" : 326582,
"lins" : 165433,
"SE_Out" : 172690,
"SE_Discard" : 0,
"sins" : 4092,
"Nolins" : 153892
}
,"Phix-Remover_28932": {
"Notes" : "",
"SE_Out" : 0,
"TotalReadsInput" : 221207,
"PE_In" : 221207,
"TotalReadsOutput" : 205954,
"PE_Out" : 205954,
"SE_In" : 0
}
I'm not sure about the cost/benefit of static vs. dynamic linking, but post-install I need to make sure I load my boost module before using HTStream. If static linking is a big no-no, then this can be lived with, and when we get to the point of installing HTStream as a module, it will load boost automatically. More curious as to the choice.
msettles@ganesh: build$super-deduper --version
Unhandled Exception: character conversion failed
Hi, @bioSandMan
I am going to be a pain and change my mind about the drop-down. Any way you could put the drop-down options directly under the drop-down button?
That way the user can immediately find all the applications of interest, and nothing is hidden in the slightest.
Thanks!
Removes low-quality bases from the ends of reads
Just a note to fix it
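A minimal sketch of one way such quality-based end trimming could work (a sliding-window mean with made-up defaults; not necessarily HTStream's actual algorithm):

```python
# Hypothetical sliding-window end trimmer (illustrative only, not HTStream's code):
# walk in from each end, cutting until the window's mean quality clears min_q.

def trim_ends(quals, min_q=20, window=5):
    """Return (start, end) indices of the retained span of `quals`."""
    start, end = 0, len(quals)
    # trim from the left while the leading window is low quality
    while end - start >= window and sum(quals[start:start + window]) / window < min_q:
        start += 1
    # trim from the right while the trailing window is low quality
    while end - start >= window and sum(quals[end - window:end]) / window < min_q:
        end -= 1
    return start, end
```

For example, `trim_ends([2, 2, 30, 30, 30, 30, 2, 2], min_q=20, window=2)` keeps only the high-quality core, returning `(2, 6)`.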
@id
ATGCTCACTACCACTCTCTCCGTACTGGAATATTCATCTACTCCCATTTTGCTGACTCTTATACCTCTCAGCGACATTTCCGTGCTTGGGGTCATATCAGGCAATATTCCGATCATTCCCATTACACTGTCAATGGGTTCAATCCCCCAGTTCTGGAAAAGCACTTTTGCATCCTTTTGGAAATGCCTCAAGAGTTGGTG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@id
TTTCCCCAACGATTGCTCCCTCCTCAGTGAAAGCCCTTAGTAGTATCAAAGTTTCTAATCGATTAAAGATCACACTGAAGTTCGCTTTCAGTATAATGTTCTTTTCCATGATCGCCTGGTCCATTCGCACATAAAGAGGGCCTATTATCTTTTGCCTAGGCATGAGCATGAACCAGTCTCGTGACATTTCCTCGAGGGTCATGTCAGCTATGT
+
"IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@id
NCGCCAAATATGCCTCTAGTTTGTTTCTCTGGCACATTCCGCATTCCTGTTGCCAATTTCAGAGTGTTTTGCTTAACATATCTGGGACAGACCCCATATGTGATCCTGTTTACATTTTGAAAGGGTTTGTCATTGGGAATGCTTCCATTTGGAGTAATGCATTCAGATTTGCAATTGACAATGGGTGCATCTGATCTCATTATTGAGCTTTTCCCACTCTGTATTTTGAAGTAACCCCGAGGGGCAATTAGATTCCCTGTGCTGTTAATCAAAAGTATGTCTCCCGGTTTTACTATTGT
+
#<0<BHIHHHHEGHHEHHHI?GGHIIEEHIE1DHHIIFHHFCDHEHEH7III4IIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII=;GI<<IIGIIIIIIATTCCCTGTGCTGTTAATCAAAAGTATGTCTCCCGGTTTTACTATTGT
@id
GGGCACCCCAGCTCAATCCTATTGATGGACCACTACCTGAAGATAATGAACCAAGTGGGTATGCACAAACAGACTGTGTCCTGGAGGCCATGGCTTTCCTTGAAGAATCCCACCCAGGGATATTTGAGAATTCATGCCTTGAAACAATGGAAATTGTCCAACAAGCAAGGGTGGATAAACTAACTCAGGGTCGCCAGACTTATGATTGGACATTAAATAGAAATCAACCGGCAGCAACTGCATTGGCCAACACCATAGAAGTTTTTAGATCAAATGGTCTCACAGCTAATGAGTCAGGAAGGCTAATAGATTTCTTAAAAGATGTAATGGAATCAATGGATAAAGAGGAAATAGAGATAACAACACACTTTCAAAGAAAAAGGAGAGTAAGGGATAACATGACCAAGAAGATGGTCACACAA
+
DCDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIHIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIACCATAGAAGTTTTTAGATCAAATGGTCTCACAGCTAATGAGTCAGGAAGGCTAATAGATTTCTTAAAAGATGTAATGGAATCAATGGATAAAGAGGAAATAGAGATAACAACACACTTTCAAAGAAAAAGGAGAGTAAGGGATAACATGACCAAGAAGATGGTCACACAA
@id
AATGTACTCAAATGCAAATGTTGCACCTAATGTTGCCTTTTTGGCAGGCCCACATAATGAACCCCAGCAGAACAACACAAAGCAAAAAGCATGACATGGCAAAGGAAATCCATAGGATCCAATCTTTGTATCCTGACTTCAGCTGAACACCTTTGATCTGGAACCGATTGCTTAATGCCTCGTCTCTGTATAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@id
CCTCTGCTGCTTGTTCACTTGACCCAGCCATTTGTTCCATGGCCTTAGCTGTTGTGCTGGCTATTACCATTCTGTTCTCGTGCCTGATTAGTGGATTGGTTGTTGTCACCATTTGTCTATGAGATCGATGCTGGGAATCAGCAATCTGTTCACAGGTTGCGCATACCAGACCAAATGCCACCTCAGTGGCGACAGTCCCCATTCTGTTGTATAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@id
NCAGAATGCCCTCAATGGGAATGGTGACCCAAACAACATGGACAAGGCAGTCAAACTGTATAGGAAACTAAAACGAGAAATAACATTCCATGGGGCCAAAGAAGTAGCACTCAGTTATTCTGCCGGTGCACTTGCCAGTTGCATGGGCCTCATATACAACAGAATGGGGACTGTCGCCACTGAGGTGGCATTTGGTCTGGTATGCGCAACCTGTGAACAGATTGCTGATTCCCAGCATCGATCTCATAGACAAATGGTGACAACAACCAATCCACTAATCAGGCACGAGAACAGAATGGTAATAGCCAGCACAACAG
+
#<DDDIGHHHHIIIIIHHIIHIHHIEHIHIHHIIIIHHHIHIIIIHIIIHCEHIHHHIHHIHIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"AAATGGTGACAACAACCAATCCACTAATCAGGCACGAGAACAGAATGGTAATAGCCAGCACAACAG
@id
How were the parameters chosen? In a quick test, the new Phix-remover identified twice as many reads as phiX compared to the old bowtie2 local method. Not saying this isn't correct, but curious what was done to validate.
Why print the WHOLE phiX sequence on help? It can probably be reduced via ...
default
-x [ --hits ] arg (=50) How many 8-mer hits to phix needs to happen to discard
What happens if there aren't 50 k-mers (a short --seq)? Or wouldn't it be better to use a % of perfect k-mers of the shorter of the seq or the read? Something like that.
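The "% of perfect k-mers" idea can be sketched like this (k and the scoring are illustrative assumptions, not HTStream's actual implementation):

```python
# Sketch: score a read by the fraction of its k-mers found in a phiX k-mer set,
# rather than a fixed hit count. Values and names are illustrative only.

def kmers(seq, k=8):
    """All k-mers of `seq` as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def phix_kmer_fraction(read, phix_kmers, k=8):
    """Fraction of the read's k-mers that hit the phiX k-mer set."""
    n = len(read) - k + 1
    if n <= 0:
        return 0.0  # read shorter than k: no k-mers to check, nothing to score
    hits = sum(1 for i in range(n) if read[i:i + k] in phix_kmers)
    return hits / n
```

A fraction-based threshold degrades gracefully for short reads or a short `--seq`, where a fixed count of 50 hits may be unreachable.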
Stats.out output isn't consistent with the other sub-apps' stats output
If any trimming tool removes enough bases that a read would be discarded, have the option of discarding the other read in the pair as well. This will allow the user to know that all reads are paired if needed.
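A sketch of the requested pair-discard behavior (function name and the min-length default are hypothetical):

```python
# Hypothetical pair filter: if trimming leaves one read of a pair below the
# minimum length, optionally discard its mate too so the output stays paired.

def filter_pair(r1, r2, min_len=50, discard_mate=True):
    """Return (r1, r2), a single surviving orphan, or None."""
    keep1, keep2 = len(r1) >= min_len, len(r2) >= min_len
    if keep1 and keep2:
        return (r1, r2)
    if discard_mate:
        return None  # drop both so every surviving read is guaranteed paired
    return r1 if keep1 else (r2 if keep2 else None)
```

With `discard_mate=True`, downstream tools can rely on strict pairing; with it off, orphans pass through as SE reads.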
Running super-deduper like:
~/HTStream/build/Super-Deduper/super_deduper -1 00-RawData/1116_S61_R1_001.fastq.gz -2 00-RawData/1116_S61_R2_001.fastq.gz -t -O
Will immediately start generating output reads even though the gzip processes are still running. This doesn't seem possible. Is super_deduper using -q mode by default and skipping the best-quality check?
Wondering what the verdict/timeline is for the possible conversion of application output stats to JSON. For me, at least, preprocessing stats are as important as the actual processed reads, as they give SO much information on whether I can even use the data. So I can't really see moving over to HTStream, even to test vigorously, until there are good stats to review afterwards.
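A minimal sketch of what per-app JSON stats could look like, mirroring the shape of the JSON fragments earlier in this dump (the counter names here are illustrative):

```python
# Sketch: serialize per-app counters as JSON, shaped like the
# "Overlapper_28917" / "Phix-Remover_28932" fragments above.
import json

def stats_json(program, counters):
    """Render one app's counters as a pretty-printed JSON object."""
    return json.dumps({program: {"Notes": "", **counters}}, indent=2)
```

Emitting one such object per app (keyed by program name and PID, as in the fragments above) would let downstream reporting tools parse the stats instead of scraping tab-delimited text.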
Removes As and Ts at both ends of the sequences
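One possible reading of this poly-A/T end trimming, sketched with a mismatch allowance (parameter names and defaults are guesses, not HTStream's):

```python
# Hypothetical poly-A/T end trimmer: strip runs of A/T from both ends,
# tolerating up to max_mm interior mismatches. Illustrative only.

def _run(s, bases, max_mm):
    """Length of the leading run of `bases` in s, allowing max_mm mismatches."""
    miss, cut = 0, 0
    for i, c in enumerate(s):
        if c in bases:
            cut = i + 1  # extend cut to the last matching base
        else:
            miss += 1
            if miss > max_mm:
                break
    return cut

def trim_poly_at(seq, bases="AT", max_mm=1):
    left = _run(seq, bases, max_mm)
    seq = seq[left:]
    right = _run(seq[::-1], bases, max_mm)
    return seq[:len(seq) - right] if right else seq
```

For example, `trim_poly_at("AAATAGGGCCCTTT")` strips the poly-A head and poly-T tail, returning `"GGGCCC"`.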
commit 9140633
-s [ --stranded ] If R1 is orphaned, R2 is RC (for
stranded RNA)
-m [ --min-length ] arg (=50) Min length for acceptable outputted read
-l [ --no-left ]                  Turns off trimming of the left side of
                                  the read
-r [ --no-right ]                 Turns off trimming of the right side of
                                  the read
-s [ --seq ] arg                  Please supply a fasta file - default -
                                  Phix Sequence - default
                                  https://www.ncbi.nlm.nih.gov/nuccore/9626372
If -s and -l are out of bounds, ignore the read.
EG:
Program        TotalRecords  R2_Discarded  R1_Discarded  R2_Length  SE_Out  SE_Length  Ignored  TotalReadsInput  PE_In  PE_Out  SE_In  SE_Left_Trim  Sins  Replaced  SE_Discarded  R2_Right_Trim  R1_Left_Trim  SE_Right_Trim  R1_Right_Trim  R1_Length  R2_Left_Trim  Overlap_BPs  Lins
Super-Deduper  267139  0  0  66367161  0  0  0  0  0  264411  0  0  0  2728  0  0  0  0  0  66367161  0  0  0
Q-Trim         264411  14  0  65862254  14  0  0  0  0  264397  0  0  0  0  0  496200  231  0  440264  65926666  5193  0  0
Overlapper     264411  0  0  24600101  163746  41630620  0  0  0  98694  0  0  0  0  1971  0  0  0  0  24669593  0  0  0
Hey, @samhunter
Specific to AT trim - what should the default values be for min trim length and number of mismatches?
In general, what should the min accepted length default be?
All trimming algorithms will also have parameters for stranded, 3' trim, 5' trim.
Anything I am missing? We can run tests later to actually get optimal values, but is there a decent first guess?
Thank you!
Now getting
-- Install configuration: ""
-- Up-to-date: /share/biocore/software/lib/libhts_common.so
-- Up-to-date: /share/biocore/software/bin/super-deduper
-- Up-to-date: /share/biocore/software/bin/tab-converter
-- Up-to-date: /share/biocore/software/bin/polyATtrim
-- Up-to-date: /share/biocore/software/bin/q-window-trim
-- Up-to-date: /share/biocore/software/bin/cut-trim
-- Up-to-date: /share/biocore/software/bin/phix-remover
-- Up-to-date: /share/biocore/software/bin/overlapper
-- Up-to-date: /share/biocore/software/bin/n-remover
msettles@ganesh: build$phix-remover 0h
phix-remover: error while loading shared libraries: libhts_common.so: cannot open shared object file: No such file or directory
msettles@ganesh: build$phix-remover -h
phix-remover: error while loading shared libraries: libhts_common.so: cannot open shared object file: No such file or directory
Just overwrites
Hey @joe-angell and @samhunter , hate to be a noodge, but I'm having issues compiling on both develop and master.
(Talking about develop.) I had to add #include <numeric> to use std::accumulate in common/src/read.cpp; however, after that I am still getting this error when trying to make.
[ 86%] Linking CXX executable Super-Deduper_test
ld: library not found for -llibhts_common.so
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [Super-Deduper/Super-Deduper_test] Error 1
make[1]: *** [Super-Deduper/CMakeFiles/Super-Deduper_test.dir/all] Error 2
Also, a ton of warnings with gtest stuff. Is that expected?
Any suggestions?
Thanks, sorry to be a newb with this cmake stuff.
So I ran the scheme on a dataset already processed and the results differ somewhat, so I would like to also know how this app has been verified; there appear to be some mods.
Question 1
-e [ --hist-file ] arg A tab delimited hist file with insert lengths.
This is to output the hist?? The help text seems to imply it's an input file.
Question 2
-k [ --kmer ] arg (=8) Kmer size of the lookup table for the
longer read
-r [ --kmer-offset ] arg (=1) Offset of kmers. Offset of 1, would be
perfect overlapping kmers. An offset of
kmer would be non-overlapping kmers
that are right next to each other. Must
be greater than 0.
I don't recall kmers being a part of the old algorithm, how are they used here?
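One guess at how a k-mer lookup table with an offset could seed overlap detection — this is an assumption about the algorithm's purpose, not taken from the source:

```python
# Hypothetical seeding scheme: index every `offset`-th k-mer of the longer
# read, then look up k-mers from the shorter read to propose candidate
# alignment offsets. Illustrative only.

def build_table(seq, k=8, offset=1):
    """Map k-mer -> list of start positions, sampled every `offset` bases."""
    table = {}
    for i in range(0, len(seq) - k + 1, offset):
        table.setdefault(seq[i:i + k], []).append(i)
    return table

def candidate_offsets(short, table, k=8):
    """Candidate start positions of `short` within the indexed long read."""
    hits = set()
    for j in range(len(short) - k + 1):
        for i in table.get(short[j:j + k], []):
            hits.add(i - j)  # implied placement of `short` in the long read
    return hits
```

With `offset=1` every k-mer is indexed (maximum sensitivity); with `offset=k` the table is k-fold smaller at the cost of coarser seeding, which would explain the help text's overlapping vs. non-overlapping wording.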
Question 3
-c [ --check-lengths ] arg (=20) Check lengths on the ends
???? What is this
Question 4
-a [ --adapter-trimming ] Trims adapters based on overlap, only
returns PE reads, will correct quality
scores and BP in the PE reads
I think you mean "Only perform adapter trimming based on overlap; PE input produces PE output trimmed of adapters", yes??
Question 5
-s [ --stranded ] Makes sure the correct complement is returned upon overlap
Just curious where this comes up - did someone see an example? 'stranded' is important when 1) the input is RNA and 2) Read 1 is removed and Read 2 is retained; in that case only Read 2 needs to be RC'd in order to correctly represent the underlying strand the read came from, as SE reads are inferred to be R1 reads.
RC R2 if R1 is orphaned
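The stranded behavior described above can be sketched as follows (the function name is hypothetical):

```python
# Sketch: when R1 is dropped and R2 survives as an SE read, reverse-complement
# R2 (and reverse its qualities) so it represents the strand an R1 read would.

COMP = str.maketrans("ACGTN", "TGCAN")

def orphan_r2_as_se(r2_seq, r2_qual, stranded=True):
    """Return (seq, qual) for an orphaned R2 emitted as an SE read."""
    if stranded:
        return r2_seq.translate(COMP)[::-1], r2_qual[::-1]
    return r2_seq, r2_qual
```

Note the quality string is reversed in lockstep with the sequence, so each base keeps its own score.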
$ ~/HTStream/build/Q-Window-Trim/q-window-trim -1 202_S14_R1_001.fastq.gz -2 202_S14_R2_001.fastq.gz -f -p htq -m 200
Unhandled Exception: No write_read class, only accessable with SE
$ ll
total 416
drwxrwxr-x 2 shunter grc 4096 Dec 30 01:27 ./
drwxr-xr-x 8 shunter grc 4096 Dec 30 01:23 ../
lrwxrwxrwx 1 shunter grc 341682103 Dec 30 01:26 202_S14_R1_001.fastq.gz
lrwxrwxrwx 1 shunter grc 366655813 Dec 30 01:26 202_S14_R2_001.fastq.gz
-rw-rw-r-- 1 shunter grc 204735 Dec 30 01:27 htqPE1.fastq
-rw-rw-r-- 1 shunter grc 204243 Dec 30 01:27 htqPE2.fastq
Returns the longest sequence without Ns
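The stated behavior reduces to finding the longest N-free stretch of the read; a minimal sketch:

```python
# Sketch of the stated n-remover behavior: keep the longest stretch of the
# sequence that contains no N. Illustrative, not the actual implementation.

def longest_without_n(seq):
    """Longest substring of `seq` containing no 'N'."""
    return max(seq.split("N"), key=len) if seq else ""
```

For example, `longest_without_n("NNACGTNACGTACGNN")` returns `"ACGTACG"`.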
Hey, @bioSandMan ,
There is something a bit odd when the GitHub Pages site loads. It first shows an unstyled HTML-like page (for a split second), then reformats to look correct. Could you look into why this is happening?
Thanks!
Quick 8-mer primer lookup
Basic meta stats on sequences
ACTG, Avg Length, Avg Quality
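These basic stats could be computed roughly as follows (assuming Phred+33 quality strings; the function and its output shape are illustrative):

```python
# Sketch of basic per-dataset meta stats: base composition, average read
# length, average quality. Assumes Phred+33 quality strings; illustrative only.

def meta_stats(records):
    """records: list of (sequence, quality_string) pairs."""
    counts = {b: 0 for b in "ACTG"}
    total_len = total_q = n_q = 0
    for seq, qual in records:
        total_len += len(seq)
        for b in seq:
            if b in counts:
                counts[b] += 1
        for q in qual:
            total_q += ord(q) - 33  # Phred+33 decoding
            n_q += 1
    n = max(len(records), 1)
    return {"base_counts": counts,
            "avg_length": total_len / n,
            "avg_quality": total_q / max(n_q, 1)}
```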
For all programs, the help-text output should note that the program is part of the HTStream package, probably with a link to the GitHub repo and a short description of what it does, and preferably version information. (I also think it would be amazing if the git commit hash were displayed, but that might be too much of a PITA.)
Currently the "-v" switch reports nothing.
Currently it is difficult to verify that the most recent version is installed.
Hey, @bioSandMan
(sorry to spam you with these)
Let's remove the picture place holder stuff.
I'm not sure how many applications will have corresponding pictures, but I imagine maybe only 1 or 2. If they do need a picture, I can just put them inline.
Let me know if you have any questions, about this one or any of the issues I just submitted.
Thank you!
-l --> hist file and minLength
-l [ --minLength ] arg (=50) Mismatches allowed in overlapped section
-x [ --max-mismatches ] arg (=5) Mismatches allowed in overlapped section
Hoping things are ready to test, I tried to install on our systems and ran into an issue. Hard to tell if it's the installer or our system. I've loaded modules for cmake and boost/1.60 BUT
""""
statussources: /share/biocore/software/src/HTStream/Overlapper/src/overlapper.cpp
CMake Error at /afs/genomecenter.ucdavis.edu/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/share/cmake-3.5/Modules/FindBoost.cmake:1657 (message):
Unable to find the requested Boost libraries.
Boost version: 1.54.0
Boost include path: /usr/include
Detected version of Boost is too old. Requested version was 1.56 (or
newer).
Call Stack (most recent call first):
Overlapper/CMakeLists.txt:14 (FIND_PACKAGE)
""""
It's looking in /usr/include for boost (which is old), while env reports
msettles@ganesh: HTStream$env | grep boost
CPPFLAGS=-I/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/include -I/software/boost/1.60/x86_64-linux-ubuntu14.04/include
LIBRARY_PATH=/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/lib:/software/boost/1.60/x86_64-linux-ubuntu14.04/lib
LD_LIBRARY_PATH=/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/lib:/software/boost/1.60/x86_64-linux-ubuntu14.04/lib
CPATH=/software/cmake/3.5.1/x86_64-linux-ubuntu14.04/include:/software/boost/1.60/x86_64-linux-ubuntu14.04/include
LMFILES=/software/modules/3.2.10/x86_64-linux-ubuntu14.04/Modules/3.2.10/modulefiles/boost/1.60:/software/modules/3.2.10/x86_64-linux-ubuntu14.04/Modules/3.2.10/modulefiles/cmake/3.5.1
LOADEDMODULES=boost/1.60:cmake/3.5.1
BOOST_INCLUDE_DIR=/software/boost/1.60/x86_64-linux-ubuntu14.04/include/boost
BOOST_LIBRARY_DIR=/software/boost/1.60/x86_64-linux-ubuntu14.04/lib
So maybe a hard-coded path for boost??
Matt
@msettles @samhunter (Someone tag Alida as well). Hi, all. :)
I'm starting to update the website. It is located here: https://ibest.github.io/HTStream/ . Please feel free to update gh-pages (_layout/*.html) directly or submit issues/enhancements.
It is looking real rough right now, but it is a start. I will continue to add to it and edit it over the next couple of weeks. There is already some formatting / CSS stuff I need to iron out - however, don't hesitate to open an issue about anything.
When should we be merging into master? In the past, it always lagged pretty far behind.
$ n-remover -h
Tab-Converter
Options:
...
after loading module cmake/boost AND
cmake -DBOOST_ROOT=/software/boost/1.60/x86_64-linux-ubuntu14.04 -DCMAKE_BUILD_TYPE=Release -DBoost_NO_SYSTEM_PATHS=TRUE -DBoost_NO_BOOST_CMAKE=TRUE ..
[ 16%] Built target googletest
[ 18%] Linking CXX executable /usr/local/bin/hts_common_test
/usr/bin/ld: cannot open output file /usr/local/bin/hts_common_test: Permission denied
collect2: error: ld returned 1 exit status
make[2]: *** [/usr/local/bin/hts_common_test] Error 1
make[1]: *** [common/CMakeFiles/hts_common_test.dir/all] Error 2
make: *** [all] Error 2
Looks like it's trying to reference /usr/local/bin, which contains nothing, and which I as a user don't have access to write to.
Matt
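If the problem is the default /usr/local install prefix, the usual CMake workaround is to point CMAKE_INSTALL_PREFIX at a user-writable directory (the paths below are examples only, reusing the BOOST_ROOT from earlier in this thread):

```shell
# Example workaround: build and install into a user-writable prefix
# instead of the default /usr/local. Paths are illustrative.
mkdir -p build && cd build
cmake -DCMAKE_INSTALL_PREFIX="$HOME/htstream" \
      -DBOOST_ROOT=/software/boost/1.60/x86_64-linux-ubuntu14.04 \
      -DCMAKE_BUILD_TYPE=Release ..
make && make install
```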
It takes way too long to select the gh-pages branch!
Hey all.
As @msettles pointed out in the email last night, a pretty big feature we are missing is SE adapter trimmers.
I am assuming adapters can still only show up on the 3' end (just like in PE reads)? Or do we want a more robust tool that can fuzzy-cut both 5' and 3' (in case of primers or something at the start)? Or is that a different tool altogether?
What are your thoughts @samhunter and @msettles ?
Documentations
If you compare the output of superd using single-end reads, we get different results from the old super master. Something different must be happening with the keys, but I didn't look into it much.
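One hypothesis for the SE divergence, sketched below: if the dedup key concatenates fixed substrings of R1 and R2, the SE path must build its key in a consistent way (the offsets and lengths here are made up for illustration):

```python
# Hypothetical dedup key builder: PE keys concatenate fixed substrings of R1
# and R2; an SE read uses only the R1 part. If the SE path sliced differently
# from the PE path, results would diverge. Offsets/lengths are illustrative.

def dedup_key(r1, r2=None, start=10, length=10):
    """Build a duplicate-detection key from fixed read substrings."""
    key = r1[start:start + length]
    if r2 is not None:
        key += r2[start:start + length]
    return key
```

Reads sharing a key would be collapsed to one representative, so any inconsistency in SE key construction directly changes which reads survive.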
When trying to run with default parameters in a folder where I had no write permissions, it gave me a segmentation fault. Need a more informative error message.
Should N's match other N's?
How does overlapper deal with Ns?
Should overlapper subtract the quality value associated with an N, or just use the overlapped base + qual score?
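One possible N-handling policy for the questions above, sketched (this is a proposal, not necessarily what overlapper does):

```python
# Proposed N policy for overlap scoring and consensus: treat N as a wildcard
# when matching, and take the non-N base (with its quality) for consensus.

def bases_match(a, b):
    """Wildcard match: N matches anything, so Ns don't count as mismatches."""
    return a == b or a == "N" or b == "N"

def consensus_base(a, qa, b, qb):
    """Pick the consensus (base, quality) for one overlapped column."""
    if a == "N":
        return b, qb  # ignore the N call and its quality entirely
    if b == "N":
        return a, qa
    # two real bases: keep the higher-quality call
    return (a, qa) if qa >= qb else (b, qb)
```

Under this policy an N never subtracts from the overlapped quality; the real base and its score simply win.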
Seems weird to have to include an ending '_' when you specify the output file prefix (e.g. myoutput_). Can't the '_' be added by default (seems the norm)?
Cleaned reads should have a similar postfix to Illumina's, so instead of PE1/PE2 how about R1/R2 (maybe even include the _001, so cleaned_reads_R1_001.fasta.gz)? Seems more apps will expect R1/R2 within the read id than PE1/PE2 (our use of PE is, I think, legacy).
gz output by default, helps with good behavior ;) but of course only when outputting a file.
Hey, @joe-angell .
I am trying to model the dbhash class you created in Super-Deduper in Phix-Remover (phix_remover.h line 32) for some sensitivity/specificity testing. I keep getting an error saying m_bits is private. I have defined BOOST_DYNAMIC_BITSET_DONT_USE_FRIENDS, but I'm not sure what else I have to do.
I have some working code in TestingPhix that uses to_ulong(), but I think that is going to be too slow. Would you mind taking a look when you have a chance?
Thanks, Joe!