pachterlab / kallisto Goto Github PK

Near-optimal RNA-Seq quantification

Home Page: https://pachterlab.github.io/kallisto

License: BSD 2-Clause "Simplified" License

Python 0.08% CMake 1.40% C++ 21.76% C 38.85% Shell 0.34% JavaScript 0.56% Jupyter Notebook 0.03% Makefile 0.55% M4 0.13% Roff 0.20% Perl 0.32% Scilab 0.03% HTML 26.56% CSS 0.32% TeX 8.86% Dockerfile 0.01% VBScript 0.01%

kallisto rna-seq pseudoalignment

kallisto's Introduction

kallisto

kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment. On benchmarks with standard RNA-Seq data, kallisto can quantify 30 million human bulk RNA-seq reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes than 10 minutes to build. Pseudoalignment of reads preserves the key information needed for quantification, and kallisto is therefore not only fast, but also comparably accurate to other existing quantification tools. In fact, because the pseudoalignment procedure is robust to errors in the reads, in many benchmarks kallisto significantly outperforms existing tools. The kallisto algorithms are described in more detail in:

NL Bray, H Pimentel, P Melsted and L Pachter, Near optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, p 525--527 (2016).

Scripts reproducing all the results of the paper are available here.

kallisto quantified bulk RNA-Seq can be analyzed with sleuth.

kallisto can be used together with bustools to pre-process single-cell RNA-seq data. See the kallistobus.tools website for instructions.

Manual

Please visit http://pachterlab.github.io/kallisto/manual.html for the manual.

License

kallisto is distributed under the BSD-2 license. The license is distributed with kallisto in the file license.txt, which is also viewable here. Please read the license before using kallisto.

Getting help

For help running kallisto, please post to the kallisto-and-applications Google Group.

Reporting bugs

Please report bugs to the Github issues page

Development and pull requests

We typically develop on separate branches, then merge into devel once features have been sufficiently tested. devel is the latest, stable, development branch. master is used only for official releases and is considered to be stable. If you submit a pull request (thanks!) please make sure to request to merge into devel and NOT master. Merges usually only go into master, but not out.

kallisto's People

Contributors

Stargazers

Watchers

Forkers

pmelsted ndaniel eernst friendlyneighborhoodnematode churchill-lab bli25 andurill nraghupa martenson ttriche bradleycolquitt jvandrov xiuying kmakat eyay xtmgah mortya gatoravi ifantasy robertoalvarezm gpcr hrk2109 sunxingqiang biocyberman roryk danielsday ttdtrang yixf-self genomicsnx alenzhao juliahi swatikaushik khl0798 randy3465 makaho renzhonglu vreuter mingleiyang skurscheid tankmermaid flyfl1992 mackaay cwt1 xiaoxiaff harpur bound-to-love jakegoodall raymondshang bioinfotools tlorin myrefoli threadmapper scchess chizhou-siti hliang marcnioz jiaolongsun hanruizhang kpeaton vesteinn leonfodoulian burkovba mr-eyes jaudoux s-cooper18 zihua qiangzhongwork dfajar2 hyphaltip tmaklin biocodings sqsun chrisamiller shahirb yusufhihm wwood sultan-mia pdexheimer dvzhukov lagzxadr qiao-xin zchatt xiyupeng roseorenbuch eawilson-compbio amynjaaye gordian-biotechnology munfred astamagg ravidivya891 fuchanghe rsquared-rnaseq laureneliu ylee9912 vivekbhr brittanymareeott michelle-tan bpar1 miaoyu01 nbahti

kallisto's Issues

./kallisto: error while loading shared libraries: libhdf5.so.7: cannot open shared object file:

I built HDF5 as instructed, (but not in my home directory). Added the lib directory to LD_LIBRARY_PATH.

Running kallisto gives me the error ./kallisto: error while loading shared libraries: libhdf5.so.7: cannot open shared object file: No such file or directory

I solved this by going into /path/to/hdf5-1.8.15-patch1/lib and running ln -s libhdf5.so libhdf5.so.7. Not sure if others have experienced this, but this worked for me.

Index additional information (index inspect)

Add additional info so an index can be "inspected". There are the proposed stats:

kallisto version
index version specification
command that generated index
fasta file name that generated index
MD5 of fasta
date generated

@pmelsted @lakigigar @nicolasbray any other ideas for things you might find helpful?

Bootstrap samples

Probably going to implement tomorrow. It should be pretty quick.

Sample multinomial from ECs
Rerun EM
- Should we seed new bootstrap with previous solution?
Run in parallel

TOLERANCE

So I don't really see why we have to worry about the denominator being too small at all. Can someone explain this to me?

libhdf5.so.7 error with ubuntu binary

I get the following with the ubuntu binary v0.42:

kallisto: error while loading shared libraries: libhdf5.so.7: cannot open shared object file: No such file or directory

maybe linked against an older version of libhdf5?

Fails to compile on Mac OS Mountain Lion: error: 'unordered_map' file not found

If this file is required, perhaps the cmake test could check and produce a better error message.

See http://bot.brew.sh/job/Homebrew%20Science%20Pull%20Requests/1857/version=mountain_lion/testReport/junit/brew-test-bot/mountain_lion/install_Homebrew_science_kallisto/

cd /tmp/kallisto20150506-92455-1q2gpzh/kallisto-0.42/src && /usr/local/Library/ENV/4.3/clang++    -I/tmp/kallisto20150506-92455-1q2gpzh/kallisto-0.42/src/../ext -I/tmp/kallisto20150506-92455-1q2gpzh/kallisto-0.42/src -I/usr/local/include    -std=c++11 -O3 -o CMakeFiles/kallisto_core.dir/Kmer.cpp.o -c /tmp/kallisto20150506-92455-1q2gpzh/kallisto-0.42/src/Kmer.cpp
In file included from /tmp/kallisto20150506-92455-1q2gpzh/kallisto-0.42/src/H5Writer.cpp:1:
In file included from /tmp/kallisto20150506-92455-1q2gpzh/kallisto-0.42/src/H5Writer.h:4:
In file included from /tmp/kallisto20150506-92455-1q2gpzh/kallisto-0.42/src/EMAlgorithm.h:5:
/tmp/kallisto20150506-92455-1q2gpzh/kallisto-0.42/src/KmerIndex.h:8:10: fatal error: 'unordered_map' file not found
#include <unordered_map>
         ^

BAM!

In the BAM branch this now reads the BAM input and compares the alignment reported by the BAM and Kallisto. The stat's I'm showing are for /home/bray/kallisto_test_data/NA12716_7/NA12716_7.bam

I'll look into what is happening, and why there is a mismatch between the alignments. But for now you can stare at the numbers.

Aligned 78619701
exact matches = 7924
Kallisto mapped, not BAM = 2889402
Bam mapped, not Kallisto = 588710
Both mapped, mismatches = 23910707
Neither mapped = 9450778

quant: unrecognized option '--pseudobam'

It is my understanding that we should only supply one sample at a time to kallisto; therefore I ran the following command on one of my 48 Single-end FASTQ files. However, I'm getting a strange error.

Although 'quant' does produce three output files (abundance.h5, abundances.tsv, run_info.jso), I cannot understand why the --pseudobam command is not recognized (or executed) and why there is no SAM format output?
Hoping someone can help!

small benchmark

Create a small benchmark dataset:

Few number of trans, but share equivalence classes
Few reads

Execution time should be fast so can iteratively do tests

Fail with Cmake BUILD_TESTING=ON

Description of error

Pulled source from GitHub. Attempted to build on Ubuntu 12.04.5 LTS. Used the build testing, and the testing did not show errors. However, at the end unittests it claims to not find HDF5, even though it shows HDF5 as found.

Recreating the issue

Command

git clone http://github.com/pachterlab/kallisto
cd kallisto
mkdir build
cd build
cmake -DBUILD_TESTING=ON ..

Output

-- The C compiler identification is GNU 5.1.0
-- The CXX compiler identification is GNU 5.1.0
-- Check for working C compiler: /home/eli/bin/gcc_out/bin/gcc
-- Check for working C compiler: /home/eli/bin/gcc_out/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /home/eli/bin/gcc_out/bin/c++
-- Check for working CXX compiler: /home/eli/bin/gcc_out/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
shared build
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found HDF5: /home/eli/src/hdf5-1.8.15-patch1/hdf5/lib/libhdf5.so;/usr/lib/x86_64-linux-gnu/librt.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.3.4")
-- Found Git: /usr/bin/git (found version "1.7.9.5")
CMake Error at unit_tests/CMakeLists.txt:22 (message):
  HDF5 not found.  Required to output files

System details

Ubuntu 12.04.5 LTS
cmake v3.2.3
gcc v5.1.0
hdf5 v1.8.15-patch1

Attempted to download the precompiled Ubuntu binary, but the link appears to be dead. Unsure what is triggering the error, as HDF5 is listed as found right before the unittest says it can't be found.

--thread option fails on Ubuntu14.04 source build

kallisto v.0.42.3 build from source runs fine as long as the -t or --thread option isn't called. The error arises in the following:

kallisto quant -i ref.fa.idx -o output -b 100 reads_1.fq reads_2.fq --threads=7

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 1
[index] number of k-mers: 857,372
[index] number of equivalence classes: 1
[quant] running in paired-end mode
[quant] will process pair 1: reads_1.fq
reads_2.fq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 43,651,471 reads, 284,674 reads pseudoaligned
[quant] estimated average fragment length: 324.741
[ em] quantifying the abundances ... done
[ em] the Expectation-Maximization algorithm ran for 1 rounds
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)

The resulting directory 'output' is empty.

Read BAM files

My site (and I believe others) do not store the FASTQ file, only the BAM file.

Linux build missing

Hi everyone,

It looks like the past couple of releases are missing the Linux binaries:

https://github.com/pachterlab/kallisto/releases/download/v0.42.2.1/kallisto_linux-v0.42.2.1.tar.gz

and

https://github.com/pachterlab/kallisto/releases/download/v0.42.2/kallisto_linux-v0.42.2.tar.gz

404-- any shot of getting those put up? I can provide builds if that would help.

Thanks!

1 column of output for multiple samples

Are we able to run multiple samples with a single command? Or should we execute one command for each sample?

After reading the documentation, I expected the following to quantify abundances for two samples.

kallisto quant -i idx -o . pairA_1.fastq.gz pairA_2.fastq.gz pairB_1.fastq.gz pairB_2.fastq.gz

However, the output file has only 1 column of output instead of 2.

kallisto quant -i idx -o . pairA_1.fastq.gz pairA_2.fastq.gz pairB_1.fastq.gz pairB_2.fastq.gz
[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 15
[index] number of k-mers: 18902
[index] number of equivalence classes: 22
[quant] running in paired-end mode
[quant] will process pair 1: pairA_1.fastq.gz
                             pairA_2.fastq.gz
[quant] will process pair 2: pairB_1.fastq.gz
                             pairB_2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 20000 reads, 20000 reads pseudoaligned
[quant] estimated average fragment length: 178.074
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 41 rounds

cat abundance.txt
target_id       length  eff_length      est_counts      tpm
NM_001168316    2283    2105.93 328.264 12856.9
NM_174914       2385    2207.93 2991.2  111742
NR_031764       1853    1675.93 208.535 10263.2
NM_004503       1681    1503.93 664.003 36416.7
NM_006897       1541    1363.93 1328    80308.9
NM_014212       2037    1859.93 110     4878.13
NM_014620       2300    2122.93 1185.2  46048.4
NM_017409       1959    1781.93 94      4351.06
NM_017410       2396    2218.93 84      3122.43
NM_018953       1612    1434.93 455.995 26211.2
NM_022658       2288    2110.93 9762    381437
NM_153633       1666    1488.93 719.765 39872.6
NM_153693       2072    1894.93 145.028 6312.71
NM_173860       849     671.926 1924    236179
NR_003084       1640    1462.93 0.00808901      0.456068

Index building for test data crashes

After installing either the latest version from git or source version v0.42.3 from http://pachterlab.github.io/kallisto/download.html on my 32-bit Linux (Debian), index building crashes:

~/Software/kallisto-0.42.3/test$ ~/Software/kallisto/build/src/kallisto index -i transcripts.idx transcripts.fasta.gz

[build] loading fasta file transcripts.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ... done
[build] creating equivalence classes ... kallisto: /home/inken/Software/tmp/kallisto/src/KmerIndex.cpp:588: void KmerIndex::FixSplitContigs(const ProgramOptions&, std::vectorstd::vector&): Assertion `search != kmap.end()' failed.
Aborted

Since I am on a 32-bit Linux, I unfortunately cannot use the binaries.
I guess eventually I want to do this (?), which also crashes:

~/Software/kallisto/build/src/kallisto index -i Homo_sapiens.GRCh38.rel79.cdna.all.fa.idx Homo_sapiens.GRCh38.rel79.cdna.all.fa.gz

[build] loading fasta file Homo_sapiens.GRCh38.rel79.cdna.all.fa.gz
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
from 1372 target sequences
[build] warning: replaced 85 non-ACGUT characters in the input sequence
with pseudorandom nucleotides
[build] counting k-mers ... terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted

Is the second perhaps an issue with the 32-bit architecture?

Thanks for your help!
Inken

hdf5 files corrupted in multi-thread mode

kalliso only produces correct hdf5 files in non-threaded mode. When run in threaded mode (much faster), the abundance file is fine, but the h5 file has an incorrect structure:

## threaded (-t 6 -b 100, about 10 mins per sample)
kallisto h5dump 1563-01/abundance.h5 -o out 2>&1 | head -n 6
HDF5-DIAG: Error detected in HDF5 (1.8.13) thread 24488128:
#000: ../../../src/H5F.c line 1598 in H5Fopen(): unable to open file
major: File accessibilty
minor: Unable to open file
#001: ../../../src/H5F.c line 1389 in H5F_open(): unable to read superblock
major: File accessibilty

$ ls -lh 1563-0[12]
1563-01:
total 20M
-rw-rw-r-- 1 bioinfo_user bioinfo_user 4.3M Aug 24 12:02 abundance.h5
-rw-rw-r-- 1 bioinfo_user bioinfo_user  16M Aug 24 12:02 abundance.tsv
-rw-rw-r-- 1 bioinfo_user bioinfo_user  613 Aug 24 12:02 run_info.json

1563-02:
total 21M
-rw-rw-r-- 1 bioinfo_user bioinfo_user 4.5M Aug 24 12:11 abundance.h5
-rw-rw-r-- 1 bioinfo_user bioinfo_user  16M Aug 24 12:11 abundance.tsv
-rw-rw-r-- 1 bioinfo_user bioinfo_user  613 Aug 24 12:11 run_info.json

## unthreaded (-b 50), about 40 mins per sample
$ kallisto h5dump test2/1563-01/abundance.h5 -o out 2>&1 | head -n 6
[h5dump] number of targets: 418753
[h5dump] number of bootstraps: 50
[h5dump] kallisto version: 0.42.2.1
[h5dump] index version: 10
[h5dump] start time: Tue Aug 25 17:20:52 2015
[h5dump] shell call: kallisto quant -i /mnt/BigBirdShared/BioInfoProjects/GLG_Nippo/Mapping/RefSets/Nb_Trinity/kallisto_out -o /mnt/BigBirdShared/BioInfoProjects/GLG_Nippo/Mapping/MappedSets/TB-15-07_Nb/Nb_Trinity/test2/1563-01/ -b 50 /mnt/BigBirdShared/BioInfoProjects/GLG_Nippo/PreProcessing/ReadSets/TB-15-07_Nb/MIMR_Processed/1563-01-01-01_R1.fastq.gz /mnt/BigBirdShared/BioInfoProjects/GLG_Nippo/PreProcessing/ReadSets/TB-15-07_Nb/MIMR_Processed/1563-01-01-01_R2.fastq.gz

$ ls -lh test2/1563-0[12] 
test2/1563-01:
total 70M
-rw-rw-r-- 1 bioinfo_user bioinfo_user 54M Aug 25 17:54 abundance.h5
-rw-rw-r-- 1 bioinfo_user bioinfo_user 16M Aug 25 17:28 abundance.tsv
-rw-rw-r-- 1 bioinfo_user bioinfo_user 612 Aug 25 17:28 run_info.json

test2/1563-02:
total 74M
-rw-rw-r-- 1 bioinfo_user bioinfo_user 59M Aug 25 19:08 abundance.h5
-rw-rw-r-- 1 bioinfo_user bioinfo_user 16M Aug 25 18:40 abundance.tsv
-rw-rw-r-- 1 bioinfo_user bioinfo_user 612 Aug 25 18:40 run_info.json

Just in case it's important, this was compiled on a different system from the one it was run on, using statically-compiled libraries. This was to work around a lack of hdf5 libraries on the running system. Here's the compilation command that I used:

cmake -DLINK=static -DCMAKE_INSTALL_PREFIX:PATH=~/install/kallisto/kallisto-0.42.2.1 ..

And kallisto's default output:

$ kallisto
kallisto 0.42.2.1

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index 
    quant         Runs the quantification algorithm 
    h5dump        Converts HDF5-formatted results to plaintext
    version       Prints version information

Running kallisto <CMD> without arguments prints usage information for <CMD>

This was compiled on a Debian system:

$ uname -a
Linux elegans 4.1.0-trunk-amd64 #1 SMP Debian 4.1.2-1~exp1 (2015-07-11) x86_64 GNU/Linux
$ aptitude show libhdf5-dev
Package: libhdf5-dev
State: installed
Automatically installed: no
Version: 1.8.13+docs-15
Priority: optional
Section: libdevel
Maintainer: Debian GIS Project <[email protected]>
Architecture: amd64
Uncompressed Size: 25.0 M
Depends: libhdf5-8 (= 1.8.13+docs-15), zlib1g-dev, libjpeg-dev, hdf5-helpers,
         libhdf5-cpp-8 (= 1.8.13+docs-15)
Suggests: libhdf5-doc
Conflicts: libhdf5-dev
Breaks: libhdf5-serial-dev (< 1.8.12-9~), libhdf5-serial-dev (< 1.8.12-9~)
Replaces: libhdf5-serial-dev (< 1.8.12-9~), libhdf5-serial-dev (< 1.8.12-9~)
Provides: libhdf5-serial-dev
Description: Hierarchical Data Format 5 (HDF5) - development files - serial version
 HDF5 is a file format and library for storing scientific data. HDF5 was
 designed and implemented to address the deficiencies of HDF4.x. It has a more
 powerful and flexible data model, supports files larger than 2 GB, and supports
 parallel I/O. 

 This package contains development files for serial platforms.
Homepage: http://hdfgroup.org/HDF5/

Tags: devel::library, role::devel-lib

$ dpkg -S libhdf5_serial
libhdf5-dev: /usr/lib/x86_64-linux-gnu/libhdf5_serial_hl.a
libhdf5-dev: /usr/lib/x86_64-linux-gnu/libhdf5_serial_hl.so
libhdf5-8:amd64: /usr/lib/x86_64-linux-gnu/libhdf5_serialhl_fortran.so.8
libhdf5-8:amd64: /usr/lib/x86_64-linux-gnu/libhdf5_serial_fortran.so.8
libhdf5-dev: /usr/lib/x86_64-linux-gnu/libhdf5_serial.so
libhdf5-dev: /usr/lib/x86_64-linux-gnu/libhdf5_serialhl_fortran.so
libhdf5-dev: /usr/lib/x86_64-linux-gnu/libhdf5_serial_fortran.so
libhdf5-dev: /usr/lib/x86_64-linux-gnu/libhdf5_serial.a
libhdf5-8:amd64: /usr/lib/x86_64-linux-gnu/libhdf5_serial_hl.so.8
libhdf5-8:amd64: /usr/lib/x86_64-linux-gnu/libhdf5_serialhl_fortran.so.8.0.2
libhdf5-8:amd64: /usr/lib/x86_64-linux-gnu/libhdf5_serial_fortran.so.8.0.2
libhdf5-8:amd64: /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.8
libhdf5-dev: /usr/lib/x86_64-linux-gnu/libhdf5_serial_fortran.a
libhdf5-dev: /usr/lib/x86_64-linux-gnu/libhdf5_serialhl_fortran.a
libhdf5-8:amd64: /usr/lib/x86_64-linux-gnu/libhdf5_serial_hl.so.8.0.2
libhdf5-8:amd64: /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.8.0.2
libhdf5-dev: /usr/lib/x86_64-linux-gnu/libhdf5_serial.settings

Test accuracy of k-mers and skips

Test accuracy with several skip lengths and k-mer lengths

Candidate k-mers:

Candidate skips:

None
2
4
8

Output more readable numbers

[per Marcel Schilling]

Add comma separators to numbers greater than 999. E.g:

[index] k-mer length: 31
[index] number of targets: 193,761
[index] number of k-mers: 120,402,343
[index] number of equivalence classes: 782,723
[quant] running in single-end mode
[quant] will process file 1: reads.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 14,977,707 reads, 9,857,295 reads pseudoaligned
[ em] quantifying the abundances ... done
[ em] the Expectation-Maximization algorithm ran for 1,193 rounds

vs current output:

[index] k-mer length: 31
[index] number of targets: 193761
[index] number of k-mers: 120402343
[index] number of equivalence classes: 782723
[quant] running in single-end mode
[quant] will process file 1: reads.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 14977707 reads, 9857295 reads pseudoaligned
[ em] quantifying the abundances ... done
[ em] the Expectation-Maximization algorithm ran for 1193 rounds

Test index fails to build - on 32 bit system

Attempted to follow the Gettting started instructions with the test datasets but got the following abortion:

[nick@laptop test] kallisto index -i transcripts.idx transcripts.fasta.gz 

[build] loading fasta file transcripts.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ... kallisto: /home/nick/Documents/gitRepos/kallisto/src/KmerIndex.cpp:469: void KmerIndex::FixSplitContigs(const ProgramOptions&, std::vector<std::vector<TRInfo> >&): Assertion `search != kmap.end()' failed.
Aborted

I'm using v0.42.1 on Linux after making from the git repo .
Any other info I can provide to help?

Index tools

We should put together a set of tools for indexing, similar to Bowtie. This should make pipelines a bit more straightforward. Suggested tools:

kallisto_build - build the index
kallisto_inspect - get high level stats about the index (e.g. k, # of equivalence classes)
kallisto - runs the code given an index

Give helpful error message when no reads align

Currently, we do not check at the end of a run if no reads aligned. As a result, some people have been getting 0 reads mapping and 'nan' everywhere in their results. We should error, and give a helpful message such as: 'no reads mapped, you should probably check that your k is less than the read length by at least %.'

EM algorithm

For the first cut, we are assuming that each read can only exist once in a transcript. Thus, for every equivalence class you have a count, instead of having a count for each equivalence class => transcript

Tasks to be performed:

~~Calculate weights (done)~~
~~Calculate alphas by iterating through EID and TID in EID~~

Refactor EM/EM-only in main

The EM/EM-only branches in main are kinda hideous. I'll leave this as an open issue to remind me to fix it when there isn't something urgent open.

Strandedness

So I was starting to compare the indices between the python and c++ versions and I notice that the c++ version is reporting precisely half as many k-mers. This makes me suspect that only one strand of each transcript is being indexed although I'm not sure how that would work since based on the results it seems highly unlikely that we're just throwing out all the reads that are on the negative strand...

testing environments

I'm thinking of making either a branch vagrant or a new repo kallisto_vagrant with a few different VMs:

Linux build(s)
Linux testing (several of these)

Any thoughts, @pmelsted ? Preference to either one?

use https in _config.yml

Every time I visit the kallisto page, I have to click this button and "load unsafe scripts":

https is supported for github pages hosted on *.github.io domains.

And the console shows these errors:

Mixed Content: The page at 'https://pachterlab.github.io/kallisto/local_build.html' was loaded over HTTPS, but requested an insecure stylesheet 'http://pachterlab.github.io/kallisto/assets/themes/twitter/bootstrap/css/bootstrap.2.2.2.min.css'. This request has been blocked; the content must be served over HTTPS.

Mixed Content: The page at 'https://pachterlab.github.io/kallisto/local_build.html' was loaded over HTTPS, but requested an insecure stylesheet 'http://pachterlab.github.io/kallisto/assets/themes/twitter/css/style.css?body=1'. This request has been blocked; the content must be served over HTTPS.

The fix should be as simple as changing line 57 in your _config.yml:

  # Use https!
  #BASE_PATH : http://pachterlab.github.io/kallisto
  BASE_PATH : https://pachterlab.github.io/kallisto

Read interleaved paired-end FASTQ files

The output of sam bam2fq of an unaligned BAM file is an interleaved paired-end FASTQ file. It would be useful to be able to read this format.

Estimated Counts Different From Pseudobam Read Counts

I am a bit confused with the meaning of estimated counts. I aligned my (single ended) reads to a transcriptome and generated a pseudobam file. Based on samtools flagstat this was a mapping efficiency of 77% (41335716/53448456), so i cracked open the tsv file and thought the sum of estimated counts should be 41335716, but it is only 20717104, close to half as many. The relevant options are --single -l 174 -b 100 -t 8 --pseudobam. Why doesn't the mapped reads in the pseudobam file match the estimated counts?

Homebrew installation failure due to SHA mismatch

==> Installing kallisto from homebrew/homebrew-science
==> Downloading https://github.com/pachterlab/kallisto/archive/v0.42.2.tar.gz

################################################################## 100.0%

Error: SHA256 mismatch
Expected: 85b89b840dae06c56f59b6ad98b7832ff965df5d259975bc19244584cf918fbf
Actual: 6107fed6089c4c4e8a3f8d16c2539736321c134b518c95600ae93ddd3d699219

No minimum version on HDF5 libraries

HDF5 libraries version 1.8.4 has different API than what kallisto expects.

From cmake
-- Found HDF5: /usr/lib/libhdf5.so (found version "1.8.4")

after make
/home/vagrant/kallisto/src/h5utils.h:122:64: error: too many arguments to function 'hid_t H5Dopen1(hid_t, const char*)'
dataset_id = H5Dopen(group_id, dset_name.c_str(), H5P_DEFAULT);

Linux binaries are missing from download page

This returns 404:

https://github.com/pachterlab/kallisto/releases/download/v0.42.2.1/kallisto_linux-v0.42.2.1.tar.gz

Mac and source versions are fine

cmake_minimum_required_version

The following error message is printed when attempting to compile with older cmake versions:

CMake 2.8.8 or higher is required.  You are running version 2.8.7

But I think this should be 2.8.12 or higher because of _add_compile_options. This is corrected in INSTALL.md but not in the code but caught me out (the only 12.04 cmake backport I found was 2.8.11, so I went for the laziest option...)

Indexer should not accept duplicated fasta entries

See discussion in pachterlab/sleuth#12

Unique counts

Add unique counts column to output.

I'll do this later today

Bootstrap fails when no reads map

and all ECs have a count of 0.

Yes it is a silly input, but it is better to warn agains this, than to run code that will throw an exception.

This fails in mult_.sample()

/Users/pmelsted/work/code/kallisto/src/../ext/seqan/basic/basic_exception.h:368 FAILED! (Uncaught exception of type std::domain_error: nsamp must be -1 or >=1)

Output transcript-equivalence class mapping

We've chatted about this a few times, but let's aim for next release. Perhaps a format that's easy to parse table that looks something like this:

EC1 trans1 trans2 ... transN
EC2 trans3 trans9 ... transM
....

"Smart skip"

That's the name that Lior came up with the skipping strategy that I was talking about before. I think it's not totally clear how it should be implemented in the presence of errors but I just did a little test in python and computed the potential skip distance for each k-mer in the transcriptome.

70% of k-mers have a skip of at least 75.

hdf5 bootstrap representation

Bootstrap names left-padded with zeros bs00001, etc, have a natural sort order, currently these sort as bs0, bs1, bs10, bs11, ..., bs2, bs20, ...

/bootstrap as a 2-dimensional array would be more readily subset, especially by id. This would seem to be a common use case.

Trailing '\001' character

There is a trailing '\001' character in the first target_id that I noticed writing H5Writer. I'm not sure where it's coming from, but it existed before the H5 utils (it shows up in some of the plaintext results). My guess is it's coming from SeqAN... perhaps we can look at this later next week, @pmelsted ?

TPM is “-nan”

Hi, running the following command with this reference index and this file

kallisto quant -i Mus_musculus.GRCm38.rel79.cdna.all.fa.idx -o CAAAG --single -l 100 CAAAG_1.fastq.gz

results in a abundances.tsv like this:

target_id   length  eff_length  est_counts  tpm
ENSMUST00000178537  12  12  0   -nan
ENSMUST00000178862  14  14  0   -nan
ENSMUST00000177564  16  16  0   -nan
ENSMUST00000179664  11  11  0   -nan
ENSMUST00000179883  16  16  0   -nan
...

is there anything wrong with the input or is this a bug?

New intersection rules

Following the inspection of BAM files and some offline discussion here are the new rules for determining an equivalence classes.

For each read of a pair separately do the following

  v <- ec(x_0)
  For each k-mer x_i that maps to an equivalence class (sorted by first appearance within the read).
    if the intersection of v and x_i is non-empty
      let v <- intersect(v,x_i)
  return v

If v1 and v2 are the results of step 1 return intersect(v1,v2)

Here are a few things to think about

What if one end maps and the other doesn't do we report nothing? I think the consensus is that we should report only pairs of reads where both ends map. We'll lose some bad reads but I think that is fine.
What if the first k-mer is bogus, i.e. it maps to a completely different inconsistent transcript by error/chance. My feeling is that this is rare and we can see how often it happens.

Pseudobam output (very) occasionally prints two alignments in one line

I've been looking in to using the pseudoalignment for some downstream analysis. It happens at times that two alignment records end up at the same line. This causes samtools to fail parsing the sam file.

For one file I've been running Kallisto on, this has happened all of the three times I've run it.

The first time I ran Kallisto, the line in the sam file that caused the problem was

SRR1161560.21043094:CELL_TGATGCGC:UMI_GGGG  0   ENSMUST00000183090  308 255SRR1161560.21199582:CELL_GTCTTATC:UMI_TGAA   0   ENSMUST00000119438  473 255 51M *   0   0   AGTTCCCTGGCCGCCAGAAGATCCACATCTCAAAGAAGTGGGGCTTCACCA @??BDDDDFDHFHFHGEIIII@G?DGFFBGGIEAGI@@DCGIIIIIIGIGH NH:i:8

Another read is inserted right after the mapping quality field in the sam record. I didn't see anything special about these reads in the input FASTQ file, so I tried just running Kallisto again. This time a similar thing happened but with a different read!

SRR1161559.26399301:CELL_CATCAGAA:UMI_AAAA  16  ENSMUST00000115104  22954   255 51M *   0   0   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [email protected]:CELL_TGATGGAA:UMI_AAAA 16  ENSMUST00000187290  345 255 51M *   0   0   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA BDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDFJJJHHHHHFFFFFCCC NH:i:129

Now a second read is inserted a bit in to the read quality string.

And just for good measure I tried a third time, and again this phenomenon happened.

SRR1161560.26326741:CELL_TGATTGAA:UMI_TGCG  0   ENSMUST00000082409  656 255 51M *   0   0   TCAGAGTTCTACTAAAATTTCACTTCACATCAAAACATCACTTCGGATTTG @@@DSRR1161560.26345311:CELL_CATCAGAA:UMI_AAAA  16  ENSMUST00000193553  242 255 51M *   0   0   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDFHJJJJHHGHHFFFFF@C@ NH:i:12

I had a very quick look at the code, but couldn't see where anything particularly fishy would happen. It has the characteristics of a race condition, but since Kallisto is single-core, that's probably not it.

The input is a single end 63 982 233 reads fastq file.

What could cause this?

genome + annotation = good

Enhancement:

Input a genome + annotation file. Kallisto pulls out the CDS and indexes/maps appropriately. Should be 'easy' to pull out the coding sequences given these data.

kallisto v0.42.3 make install not working ubuntu VM but v0.42.2.1 is

I tried to install kallisto on a virtual machine of ubuntu 15.04 running on a Windows 10 machine. I had no problem getting anything to work except that last step of installing

\user@user-VirtualBox:/usr/local/bin/kallisto/build$ sudo make install
[ 92%] Built target kallisto_core
[100%] Built target kallisto
Install the project...
-- Install configuration: ""
-- Up-to-date: /usr/local/bin/kallisto
CMake Error at src/cmake_install.cmake:45 (file):
file RPATH_REMOVE given FILE "/usr/local/bin/kallisto" that does not exist.
Call Stack (most recent call first):
cmake_install.cmake:37 (include)

Makefile:66: recipe for target 'install' failed
make: *** [install] Error 1

I got this source from github but I then tried the version before it v0.42.2.1 (from https://github.com/pachterlab/kallisto/archive/v0.42.2.1.tar.gz) and I install kallisto no problem. I would rather use the newer version but I thought I should suggest a potential bug. Also, can you do an in situ upgrade of kallisto?

James

Test data

Sorry about this being so late. I actually put it all together and then completely forgot that I hadn't actually sent an email about it. You can get some test data here:

lucille2:/home/bray/kallisto_test_data

So the files are as follows:

input.fld: fragment length distribution for kallisto. I think right now this is actually cheating a bit since this is the true distribution. It's pretty easy to get (basically) the correct distribution from the simulated reads though. (Although maybe not so easy for real data...)
input_params.xprs + results.xprs: the parameters here are based on the results of running eXpress on one of the Geuvadis samples. This is the express output.
transcripts.in: the other input file (along with input_params.xprs) for the eXpress simulator.
non_ase_levels.xprs: The eXpress results were actually based on personalized genome with (imputed) haplotyped transcripts. This is the total expression of each transcript since we won't (necessarily) be asking programs to get the correct haplotype expression values.
sim_1.fa.gz + sim_2.fa.gz: the reads!
Homo_sapiens.GRCh37.75.cdna.pc.fa + NA12716.fa: the reference and personalized transcriptomes for this sample.

I hear everything up to the actual EM algorithm is working now. Any chance we could see profiling data on this input?

Please rename abundance.txt to abundance.tsv

TSV is the standard file extension for a tab-separated-values table. Having the correct file extension enables pretty rendering in GitHub.
See https://help.github.com/articles/rendering-csv-and-tsv-data/

Better format for kallisto output

While the plaintext format isn't an issue for one file, it grows out of control once you introduce the bootstrap. Each file is about 6 MB... but there are 100 of them. Now do this on many experiments and you have a nightmare.

Output to a binary format that could be read into R
- HDF5?
Add option to kallisto to output to plaintext or binary format
Create a utility that converts binary <-> plaintext (might just be fine as an R utility)

target_ids in H5

Something happened when we changed the variable length strings to fixed length and the target_ids got all mucked up:

"h\xf6H\017" "" "" "" "" "\017"

Could NOT find HDF5

Could I ask you to please include instructions for the following?

Downloading and installing HDF5 from source.
Including this installed version of HDF5 when building kallisto on linux.

I tried the following steps without success.

1. Install HDF5

wget http://www.hdfgroup.org/ftp/HDF5/current/src/CMake/hdf5-1.8.15-patch1-CMake.tar.gz
tar xf hdf5-1.8.15-patch1-CMake.tar.gz
cd HDF518CMake/
./HDF5-1.8.15-patch1-Linux.sh --prefix=$HOME/.local

This leaves me with a new installation here:

$HOME/.local/HDF5-1.8.15-patch1-Linux/HDF_Group/HDF5/1.8.15-patch1/

2. Make kallisto

Download the latest release: https://github.com/pachterlab/kallisto/archive/v0.42.2.1.tar.gz

cd kallisto-0.42.2.1/
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX:PATH=$HOME/.local ..

At this point, I get the following error:

static build
CMake Error at /apps/source/cmake/3.1.0/share/cmake-3.1/Modules/FindPackageHandleStandardArgs.cmake:138 (message):
  Could NOT find HDF5 (missing: HDF5_LIBRARIES)
Call Stack (most recent call first):
  /apps/source/cmake/3.1.0/share/cmake-3.1/Modules/FindPackageHandleStandardArgs.cmake:374 (_FPHSA_FAILURE_MESSAGE)
  /apps/source/cmake/3.1.0/share/cmake-3.1/Modules/FindHDF5.cmake:360 (find_package_handle_standard_args)
  src/CMakeLists.txt:30 (find_package)


-- Configuring incomplete, errors occurred!
See also "/PHShome/ks38/src/kallisto-0.42.2.1/build/CMakeFiles/CMakeOutput.log".
See also "/PHShome/ks38/src/kallisto-0.42.2.1/build/CMakeFiles/CMakeError.log".

The CMakeFiles outputs are not helpful, but I'll send them to you if you wish.

Now, I tried appending the HDF5 directory (mentioned above) or the lib/ subdirectory to my PATH or LD_LIBRARY_PATH without success. I also tried setting HDF5_DIR (as recommended in the page below) without success.

ftp://www.hdfgroup.org/HDF5/current/src/unpacked/release_docs/USING_HDF5_CMake.txt

Indexing

The indexing is still slow and takes too much memory. I know how to fix this, but it's not a priority for the deadline.

If you get bugs with the index, run the inspect command which checks that the index is correct and provides a histogram of number of k-mers that map to an ec and the number of transcripts that map to an ec.

I've run this for the ensembl.75 for k=21,23,25,27. The inspect command works, both for the original index as well as the saved indices after mapping.