This project forked from bustools/bustools

Tools for working with BUS files

Home Page: https://bustools.github.io/

License: BSD 2-Clause "Simplified" License

bustools

bustools is a program for manipulating BUS files for single-cell RNA-seq datasets. It can be used to error-correct barcodes, collapse UMIs, and produce gene count or transcript compatibility count matrices, among other tasks. See the kallisto | bustools website for examples and instructions on how to use bustools as part of a single-cell RNA-seq workflow.

If you use bustools please cite

Melsted, Páll, Booeshaghi, A. Sina et al. Modular and efficient pre-processing of single-cell RNA-seq. BioRxiv (2019): 673285, doi.org/10.1101/673285.

For some background on the design and motivation for the BUS format and bustools see

Melsted, Páll, Ntranos, Vasilis and Pachter, Lior. The Barcode, UMI, Set format and BUStools. Bioinformatics, btz279, 2019.

BUS format

bustools works with BUS files, which can be generated efficiently from raw sequencing data, e.g. using kallisto.

Installation

Binaries for Mac, Linux, Windows, and Rock64 can be downloaded from the bustools website. Binary installation time is less than two minutes.

To compile bustools, download the source code with:

git clone https://github.com/BUStools/bustools.git

Navigate to the bustools directory:

cd bustools

Make a build directory and move there:

mkdir build

cd build

Run cmake:

cmake ..

Build the code:

make

The bustools executable will be located in build/src. To install bustools into the CMake install prefix path, type:

make install

Usage

To see a list of available commands, type bustools in the terminal:

> bustools 
Usage: bustools <CMD> [arguments] ..

Where <CMD> can be one of: 

capture         Capture records from a BUS file
correct         Error correct a BUS file
count           Generate count matrices from a BUS file
inspect         Produce a report summarizing a BUS file
linker          Remove section of barcodes in BUS files
project         Project a BUS file to gene sets
sort            Sort a BUS file by barcodes and UMIs
text            Convert a binary BUS file to a tab-delimited text file
whitelist       Generate a whitelist from a BUS file

Running bustools <CMD> without arguments prints usage information for <CMD>.

capture

bustools capture can separate BUS files into multiple files according to the capture criteria.

Usage: bustools capture [options] bus-files

Options: 
-o, --output          Directory for output 
-c, --capture         List of transcripts to capture
-e, --ecmap           File for mapping equivalence classes to transcripts
-t, --txnames         File with names of transcripts

correct

BUS files can be barcode error corrected with respect to a technology-specific whitelist of barcodes using bustools correct.

> bustools correct
Usage: bustools correct [options] bus-files

Options: 
-o, --output          File for corrected bus output
-w, --whitelist       File of whitelisted barcodes to correct to
-p, --pipe            Write to standard output
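
As a rough illustration of what correcting against a whitelist means, the Python sketch below maps a barcode to a whitelisted barcode when exactly one whitelist entry is a single substitution away. The function name, the one-mismatch rule, and the choice to leave ambiguous barcodes uncorrected are assumptions made for illustration; they are not taken from the bustools source.

```python
from typing import Optional, Set

ALPHABET = "ACGT"


def correct_barcode(barcode: str, whitelist: Set[str]) -> Optional[str]:
    """Return a whitelisted barcode for `barcode`, or None if none fits.

    Assumption: a barcode is correctable only when exactly one whitelist
    entry differs from it by a single substitution.
    """
    if barcode in whitelist:
        return barcode  # already valid, nothing to correct

    candidates = set()
    for i, original in enumerate(barcode):
        for base in ALPHABET:
            if base == original:
                continue
            candidate = barcode[:i] + base + barcode[i + 1:]
            if candidate in whitelist:
                candidates.add(candidate)

    # Zero matches or several matches: leave the barcode uncorrected.
    return candidates.pop() if len(candidates) == 1 else None


if __name__ == "__main__":
    whitelist = {"AAAA", "CCCC"}
    for observed in ("AAAA", "AAAT", "AATT"):
        print(observed, "->", correct_barcode(observed, whitelist))
```

Returning None for ambiguous barcodes keeps the sketch conservative: a read is never assigned to a cell when two whitelist entries are equally close.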

count

BUS files can be converted into a barcode-feature matrix, where the feature can be TCCs (Transcript Compatibility Counts) or genes using bustools count.

> bustools count
Usage: bustools count [options] bus-files

Options: 
-o, --output          File for corrected bus output
-g, --genemap         File for mapping transcripts to genes
-e, --ecmap           File for mapping equivalence classes to transcripts
-t, --txnames         File with names of transcripts
--genecounts          Aggregate counts to genes only
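
To make the idea of a barcode-gene matrix concrete, here is a simplified Python sketch that counts distinct UMIs per (barcode, gene) pair. The record layout (barcode, UMI, equivalence class, count), the dictionary-based maps, and the rule that only equivalence classes resolving to a single gene are counted are all assumptions for illustration; the real command may resolve multi-gene records differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, int, int]  # (barcode, umi, ec, count), assumed layout


def gene_counts(
    records: Iterable[Record],
    ec_to_transcripts: Dict[int, List[int]],
    transcript_to_gene: Dict[int, str],
) -> Dict[Tuple[str, str], int]:
    """Count distinct UMIs per (barcode, gene).

    Simplification: records whose equivalence class spans more than one
    gene are skipped rather than resolved.
    """
    umis_seen = defaultdict(set)  # (barcode, gene) -> set of UMIs
    for barcode, umi, ec, _count in records:
        genes = {transcript_to_gene[t] for t in ec_to_transcripts[ec]}
        if len(genes) == 1:
            (gene,) = genes
            umis_seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


if __name__ == "__main__":
    records = [
        ("AAAA", "GG", 0, 2),
        ("AAAA", "GG", 0, 1),  # same UMI again: collapsed
        ("AAAA", "TT", 1, 1),
        ("CCCC", "GG", 2, 1),  # maps to two genes: skipped here
    ]
    ec_map = {0: [0], 1: [0, 1], 2: [1, 2]}
    tx2gene = {0: "g1", 1: "g1", 2: "g2"}
    print(gene_counts(records, ec_map, tx2gene))
```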

inspect

A report summarizing the contents of a sorted BUS file can be output either to standard out or to a JSON file for further analysis using bustools inspect.

> bustools inspect
Usage: bustools inspect [options] sorted-bus-file

Options: 
-o, --output          File for JSON output (optional)
-e, --ecmap           File for mapping equivalence classes to transcripts
-w, --whitelist       File of whitelisted barcodes to correct to
-p, --pipe            Write to standard output

--ecmap and --whitelist are optional parameters; bustools inspect is much faster without them, especially without the former.

Sample output (to stdout):

Read in 3148815 BUS records
Total number of reads: 3431849

Number of distinct barcodes: 162360
Median number of reads per barcode: 1.000000
Mean number of reads per barcode: 21.137281

Number of distinct UMIs: 966593
Number of distinct barcode-UMI pairs: 3062719
Median number of UMIs per barcode: 1.000000
Mean number of UMIs per barcode: 18.863753

Estimated number of new records at 2x sequencing depth: 2719327

Number of distinct targets detected: 70492
Median number of targets per set: 2.000000
Mean number of targets per set: 3.091267

Number of reads with singleton target: 1233940

Estimated number of new targets at 2x sequencing depth: 6168

Number of barcodes in agreement with whitelist: 92889 (57.211752%)
Number of reads with barcode in agreement with whitelist: 3281671 (95.623992%)
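
The means and percentages in the sample report appear to follow directly from the raw counts it lists. The short Python check below recomputes them; the variable names are illustrative, and the formulas are inferred from the numbers rather than taken from the bustools source.

```python
# Figures copied from the sample output above.
total_reads = 3431849
distinct_barcodes = 162360
barcode_umi_pairs = 3062719
whitelisted_barcodes = 92889
whitelisted_reads = 3281671

# Inferred formulas: each mean is a total divided by the barcode count.
mean_reads_per_barcode = total_reads / distinct_barcodes
mean_umis_per_barcode = barcode_umi_pairs / distinct_barcodes

# Share of barcodes, and of reads, that agree with the whitelist.
pct_barcodes_in_whitelist = 100 * whitelisted_barcodes / distinct_barcodes
pct_reads_in_whitelist = 100 * whitelisted_reads / total_reads

print(f"Mean reads per barcode: {mean_reads_per_barcode:.6f}")
print(f"Mean UMIs per barcode:  {mean_umis_per_barcode:.6f}")
print(f"Barcodes in whitelist:  {pct_barcodes_in_whitelist:.6f}%")
print(f"Reads in whitelist:     {pct_reads_in_whitelist:.6f}%")
```

The gap between 57% of barcodes and 96% of reads agreeing with the whitelist is consistent with the median of one read per barcode: most off-whitelist barcodes carry very few reads.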

linker

bustools linker removes a specified section of the barcode in BUS files.

Usage: bustools linker [options] bus-files

Options: 
-s, --start           Start coordinate for section of barcode to remove (0-indexed, inclusive)
-e, --end             End coordinate for section of barcode to remove (0-indexed, exclusive)
-p, --pipe            Write to standard output

If --start is -1, the removed section begins at the beginning of the barcode. Likewise, if --end is -1, the removed section ends at the end of the barcode. All barcodes in the BUS files should have the same length.
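
The coordinate rules above can be sketched in Python on a plain string barcode. The function name and the string representation are assumptions for illustration; bustools itself operates on binary BUS records.

```python
def remove_linker(barcode: str, start: int, end: int) -> str:
    """Remove barcode[start:end], with start inclusive and end exclusive.

    A value of -1 for `start` means the beginning of the barcode;
    a value of -1 for `end` means the end of the barcode.
    """
    if start == -1:
        start = 0
    if end == -1:
        end = len(barcode)
    if not 0 <= start <= end <= len(barcode):
        raise ValueError(f"invalid section [{start}, {end}) for length {len(barcode)}")
    return barcode[:start] + barcode[end:]


if __name__ == "__main__":
    barcode = "AAAACCGGGG"
    print(remove_linker(barcode, 4, 6))   # drops positions 4 and 5
    print(remove_linker(barcode, -1, 4))  # drops the first four bases
    print(remove_linker(barcode, 6, -1))  # drops everything from position 6
```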

project

The kallisto bus command maps reads to a set of transcripts. bustools project takes as input kallisto's (sorted) output and a transcript to gene map (tr2g file), and outputs a BUS file, a matrix.ec file, and a list of genes, which collectively map each read to a set of genes.

Usage: bustools project [options] sorted-bus-file

Options: 
-o, --output          File for projected bus output and list of genes (no extension)
-g, --genemap         File for mapping transcripts to genes
-e, --ecmap           File for mapping equivalence classes to transcripts
-t, --txnames         File with names of transcripts
-p, --pipe            Write to standard output
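
Conceptually, projection replaces each set of transcripts with the set of genes those transcripts belong to. The Python sketch below shows one way to do that remapping; the dictionary inputs, the function name, and the scheme for numbering gene sets are assumptions for illustration and do not describe the on-disk file formats.

```python
from typing import Dict, FrozenSet, List, Tuple


def project_to_genes(
    ec_to_transcripts: Dict[int, List[int]],
    transcript_to_gene: Dict[int, str],
) -> Tuple[Dict[int, int], Dict[FrozenSet[str], int]]:
    """Map transcript-level equivalence classes to gene-set ids.

    Returns (old ec -> gene-set id, gene set -> gene-set id). Ids are
    assigned in order of first appearance (an illustrative choice).
    """
    gene_set_ids: Dict[FrozenSet[str], int] = {}
    ec_to_gene_set: Dict[int, int] = {}
    for ec, transcripts in ec_to_transcripts.items():
        genes = frozenset(transcript_to_gene[t] for t in transcripts)
        # Reuse the id if this gene set was seen before, else allocate one.
        ec_to_gene_set[ec] = gene_set_ids.setdefault(genes, len(gene_set_ids))
    return ec_to_gene_set, gene_set_ids


if __name__ == "__main__":
    ec_map = {0: [0], 1: [0, 1], 2: [1, 2]}
    tx2gene = {0: "g1", 1: "g1", 2: "g2"}
    mapping, gene_sets = project_to_genes(ec_map, tx2gene)
    print(mapping)
    print(gene_sets)
```

Distinct transcript sets can collapse to the same gene set, which is why the projected output needs its own equivalence-class file.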

sort

Raw BUS output from pseudoalignment programs may be unsorted. To simplify and accelerate downstream processing, BUS files can be sorted using bustools sort.

> bustools sort 
Usage: bustools sort [options] bus-files

Options: 
-t, --threads         Number of threads to use
-m, --memory          Maximum memory used
-T, --temp            Location and prefix for temporary files 
                      required if using -p, otherwise defaults to output
-o, --output          File for sorted output
-p, --pipe            Write to standard output

This will create a new BUS file where the BUS records are sorted by barcode first, UMI second, and equivalence class third.
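
The sort order described above corresponds to a three-part key. A minimal Python sketch follows, using an illustrative BusRecord tuple with string barcodes and UMIs in place of the binary encoding that BUS files use.

```python
from typing import Iterable, List, NamedTuple


class BusRecord(NamedTuple):
    """Illustrative stand-in for a BUS record (not the binary layout)."""
    barcode: str
    umi: str
    ec: int
    count: int


def sort_records(records: Iterable[BusRecord]) -> List[BusRecord]:
    """Sort by barcode first, UMI second, equivalence class third."""
    return sorted(records, key=lambda r: (r.barcode, r.umi, r.ec))


if __name__ == "__main__":
    records = [
        BusRecord("CCCC", "AA", 3, 1),
        BusRecord("AAAA", "TT", 1, 1),
        BusRecord("AAAA", "GG", 7, 2),
        BusRecord("AAAA", "GG", 2, 1),
    ]
    for record in sort_records(records):
        print(record)
```

After sorting, all records for one barcode are contiguous, and within a barcode all records for one UMI are contiguous, so downstream steps can process a file in a single pass.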

text

BUS files can be converted to a tab-separated format for easy inspection and processing using shell scripts or high-level languages with bustools text.

> bustools text
Usage: bustools text [options] bus-files

Options: 
-o, --output          File for text output
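
Once a BUS file is in tab-separated form it can be read with standard tools. The Python sketch below assumes each line holds four columns (barcode, UMI, equivalence class, count); that column layout is an assumption not stated above, so check it against your own output before relying on it.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_bus_text(handle: TextIO) -> Iterator[Tuple[str, str, int, int]]:
    """Yield (barcode, umi, ec, count) from tab-separated text.

    Assumed layout: barcode, UMI, equivalence class, count. Any extra
    trailing columns are ignored.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        barcode, umi, ec, count = row[:4]
        yield barcode, umi, int(ec), int(count)


if __name__ == "__main__":
    # In practice, pass an open file handle instead of this in-memory sample.
    sample = io.StringIO("AAAA\tGG\t0\t2\nCCCC\tTT\t5\t1\n")
    for record in read_bus_text(sample):
        print(record)
```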

whitelist

bustools whitelist generates a whitelist based on the barcodes in a sorted BUS file.

Usage: bustools whitelist [options] sorted-bus-file

Options: 
-o, --output        File for the whitelist
-f, --threshold     Minimum number of times a barcode must appear to be included in whitelist

--threshold is a (highly) optional parameter. If not provided, bustools whitelist will determine a threshold based on the first 200 to 100,200 records.
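
The role of an explicit threshold can be illustrated with a few lines of Python: a barcode is kept when it appears at least the given number of times. The function name and the in-memory list of barcodes are assumptions for illustration, and the sketch does not attempt to reproduce how bustools picks a threshold automatically.

```python
from collections import Counter
from typing import Iterable, List


def make_whitelist(barcodes: Iterable[str], threshold: int) -> List[str]:
    """Return barcodes seen at least `threshold` times, in sorted order."""
    counts = Counter(barcodes)
    return sorted(bc for bc, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # One entry per record; repeated barcodes come from repeated records.
    observed = ["AAAA", "AAAA", "AAAA", "CCCC", "GGGG", "GGGG"]
    print(make_whitelist(observed, threshold=2))
```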

Contributors

anmolsagarwal, jamestwebber, lakigigar, laureneliu, pmelsted, sbooeshaghi