Coder Social home page Coder Social logo

qs's Introduction

qs

Build Status CRAN_Status_Badge CRAN_Downloads_Badge CRAN_Downloads_Total_Badge

Quick serialization of R objects

qs provides an interface for quickly saving and reading objects to and from disk. The goal of this package is to provide a lightning-fast and complete replacement for the saveRDS and readRDS functions in R.

Inspired by the fst package, qs uses a similar block-compression design using either the lz4 or zstd compression libraries. It differs in that it applies a more general approach for attributes and object references.

saveRDS and readRDS are the standard for serialization of R data, but these functions are not optimized for speed. On the other hand, fst is extremely fast, but only works on data.frame’s and certain column types.

qs is both extremely fast and general: it can serialize any R object like saveRDS and is just as fast and sometimes faster than fst.

Usage

library(qs)
df1 <- data.frame(x=rnorm(5e6), y=sample(5e6), z=sample(letters,5e6, replace=T))
qsave(df1, "myfile.qs")
df2 <- qread("myfile.qs")

Installation:

# CRAN version
install.packages("qs")

# CRAN version compile from source (recommended)
remotes::install_cran("qs", type="source", configure.args="--with-simd=AVX2")

Features

The table below compares the features of different serialization approaches in R.

qs fst saveRDS
Not Slow
Numeric Vectors
Integer Vectors
Logical Vectors
Character Vectors
Character Encoding (vector-wide only)
Complex Vectors
Data.Frames
On disk row access
Random column access
Attributes Some
Lists / Nested Lists
Multi-threaded

qs also includes a number of advanced features:

  • For character vectors, qs also has the option of using the new alt-rep system (R version 3.5+) to quickly read in string data.
  • For numerical data (numeric, integer, logical and complex vectors) qs implements byte shuffling filters (adopted from the Blosc meta-compression library). These filters utilize extended CPU instruction sets (either SSE2 or AVX2).

Both of these features have the possibility of additionally increasing performance by orders of magnitude, for certain types of data. See sections below for more details.

Summary Benchmarks

The following benchmarks were performed on a Ryzen 2700x desktop using various data types (detailed below). qs was compared with saveRDS/readRDS in base R and the fst package for serializing and de-serializing a medium sized data.frame with 5 million rows (approximately 115 Mb):

data.frame(a=rnorm(5e6), 
           b=rpois(100,5e6),
           c=sample(starnames$IAU,5e6,T),
           d=sample(state.name,5e6,T),
           stringsAsFactors = F)

qs is highly parameterized and can be tuned by the user to extract as much speed and compression as possible, if desired. For simplicity, qs comes with 4 presets, which trades speed and compression ratio: “fast”, “balanced”, “high” and “archive”.

The tables and plots below summarize the performance of saveRDS, qs and fst with various parameters:

Summary table

Algorithm Threads Write Time (s) Read Time (s) File Size (Mb)
saveRDS / readRDS 1 4.680 1.500 55.2
saveRDS / readRDS 4 1.370 1.050 55.0
fst C=0 1 0.186 0.288 121.0
fst C=0 4 0.184 0.286 121.0
fst C=50 1 0.188 0.300 92.0
fst C=50 4 0.183 0.296 92.0
fst C=85 1 0.612 0.371 70.5
fst C=85 4 0.463 0.332 70.5
qs:lz4 shuffle=0 C=100 (fast) 1 0.196 0.319 106.0
qs:lz4 shuffle=0 C=100 4 0.161 0.322 106.0
qs:lz4 shuffle=7 C=1 (balanced) 1 0.262 0.363 59.4
qs:lz4 shuffle=7 C=1 4 0.194 0.365 59.4
qs:zstd shuffle=7 C=4 (high) 1 0.393 0.409 50.0
qs:zstd shuffle=7 C=4 4 0.212 0.411 50.0
qs:zstd_stream shuffle=7 C=14 (archive) 1 9.160 0.452 46.9

Serializing

De-serializing

Benchmarking write and read speed is a bit tricky and depends highly on a number of factors, such as operating system, the hardware being run on, the distribution of the data, or even the state of the R instance. Reading data is also further subjected to various hardware and software memory caches.

Generally speaking, qs and fst are considerably faster than saveRDS regardless of using single threaded or multi-threaded compression. qs also manages to achieve superior compression ratio through various optimizations (e.g. see “Byte Shuffle” section below).

Byte Shuffle

Byte shuffling (adopted from the Blosc meta-compression library) is a way of re-organizing data to be more ammenable to compression. For example: an integer contains four bytes and the limits of an integer in R are +/- 2^31-1. However, most real data doesn’t use anywhere near the range of possible integer values. For example, if the data were representing percentages, 0% to 100%, the first three bytes would be unused and zero.

Byte shuffling rearranges the data such that all of the first bytes are blocked together, the second bytes are blocked together, etc. This procedure often makes it very easy for compression algorithms to find repeated patterns and can often improves compression ratio by orders of magnitude. In the example below, shuffle compression achieves a compression ratio of over 1000x. See ?qsave for more details.

# With byte shuffling
x <- 1:1e8
qsave(x, "mydat.qs", preset="custom", shuffle_control=15, algorithm="zstd")
cat( "Compression Ratio: ", as.numeric(object.size(x)) / file.info("mydat.qs")$size, "\n" )
# Compression Ratio:  1389.164

# Without byte shuffling
x <- 1:1e8
qsave(x, "mydat.qs", preset="custom", shuffle_control=0, algorithm="zstd")
cat( "Compression Ratio: ", as.numeric(object.size(x)) / file.info("mydat.qs")$size, "\n" )
# Compression Ratio:  1.479294 

Alt-rep character vectors

The alt-rep system was introduced in R version 3.5. Briefly, alt-rep vectors are objects that are not represented by R internal data, but have accesor functions which promise to “materialize” elements within the vector on the fly. To the user, this system is completely hidden and appears seamless.

In qs, only alt-rep character vectors are implemented because it is often the mostly costly of data types to read into R. Numeric and integer data are already fast enough and do not largely benefit. An example use case: if you have a large data.frame, and you are only interested in processing certain columns, it is wasted computation to materialize the whole data.frame. The alt-rep system solves this problem.

df1 <- data.frame(x = randomStrings(1e6), y = randomStrings(1e6), stringsAsFactors = F)
qsave(df1, "temp.qs")
rm(df1); gc() ## remove df1 and call gc for proper benchmarking

# With alt-rep
system.time(qread("temp.qs", use_alt_rep=T))[1]
#     0.109 seconds


# Without alt-rep
gc(verbose=F)
system.time(qread("temp.qs", use_alt_rep=F))[1]
#     1.703 seconds

Serializing to memory

You can use qs to directly serialize objects to memory.

Example:

library(qs)
x <- qserialize(c(1,2,3))
qdeserialize(x)
[1] 1 2 3

Using qs within Rcpp

qs functions can be called directly with C++ code via Rcpp.

Example C++ script:

// [[Rcpp::depends(qs)]]
#include <Rcpp.h>
#include <qs.h>
using namespace Rcpp;

// [[Rcpp::export]]
void test() {
  qs::c_qsave(IntegerVector::create(1,2,3), "/tmp/myfile.qs", "high", "zstd", 1, 15, true, 1);
}

R side:

library(qs)
library(Rcpp)
sourceCpp("test.cpp")
# save file using Rcpp interface
test()
# read in file create through Rcpp interface
qread("/tmp/myfile.qs")
[1] 1 2 3

The direct C++ functions do not have default parameters; all paramters must be specified.

Future developments

  • Additional compression algorithms
  • Improved alt-rep string reading support
  • Improved S4 object serialization

Future versions will be backwards compatible with the current version.

qs's People

Contributors

renkun-ken avatar romainfrancois avatar traversc avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.