
szcompressor / cusz


A GPU-accelerated error-bounded lossy compressor for scientific data.

Home Page: http://szcompressor.org

License: Other

Makefile 0.11% C++ 59.55% Cuda 5.74% Shell 0.42% CMake 2.44% Python 3.82% C 1.58% Jupyter Notebook 26.07% SWIG 0.27%
lossy-compression scientific-data data-reduction gpu cuda

cusz's People

Contributors

codyjrivera, dingwentao, disheng222, jtian0, wenyugai


cusz's Issues

(question) cusz changes input data after compression

Hey everybody,

I've been using the cusz APIs, and after compressing input data I noticed the input data is no longer the same -- it looks modified. This can be reproduced with the cusz API example.

As far as I understood from cusz's wiki, the nondestructive=true configuration should avoid this behavior, but toggling the flag brought no success. I also checked the source code and did not find any implementation of it.

Are my assumptions correct? Should I assume that cusz currently changes the input data during compression, or am I doing something wrong?

Thanks very much!
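A minimal workaround sketch, assuming the caller owns the device buffer and that input preservation is indeed not implemented: hand cuSZ a scratch copy instead of the original. compress_with_cusz below is a placeholder, not a real cuSZ symbol.

#include <cstddef>
#include <cuda_runtime.h>

void compress_preserving_input(const float* d_input, size_t len)
{
    float* d_scratch = nullptr;
    cudaMalloc(&d_scratch, len * sizeof(float));
    cudaMemcpy(d_scratch, d_input, len * sizeof(float), cudaMemcpyDeviceToDevice);

    // compress_with_cusz(d_scratch, len, ...);  // cuSZ may modify d_scratch

    cudaFree(d_scratch);  // d_input stays bit-identical to what was loaded
}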

`internal` cannot handle skipping Huffman scenario

As of 20d4bb3,

> ./bin/cusz -f32 -m r2r -e 1.0e-4.0 -i ~/Parihaka_PSTM_far_stack.f32 -D parihaka -z --skip huffman
[info] datum:           /path/to/Parihaka_PSTM_far_stack.f32 (4850339584 bytes) of type f32
[dbg]  original len:    1212584896 (padding: 34823)
[dbg]  Time loading data:       3.21981s
[info] quant.cap:       1024    input eb:       0.0001
[dbg]  Time inspecting data range:      0.0232671s
[info] eb change:       (input eb) x 12342.2 (rng) = 1.23422 (relative-to-range)
[dbg]  2-byte quant type, 4-byte internal Huff type

[info] Commencing compression...
[info] nnz.outlier:     16607   (0.00136955%)
[info] Compression finished, saved quant.code (Huffman skipped).

tar: Parihaka_PSTM_far_stack.f32.hbyte: Cannot stat: No such file or directory
tar: Parihaka_PSTM_far_stack.f32.canon: Cannot stat: No such file or directory
tar: Parihaka_PSTM_far_stack.f32.hmeta: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
[info] Written to:      /path/to/Parihaka_PSTM_far_stack.f32.sz

When --skip huffman is given, tar throws errors.

P.S. parihaka is a new test case added in 90b521b, ahead of the current 20d4bb3.
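A hypothetical fix sketch (not the actual cuSZ code): build the tar file list only from parts that were actually written, so skipping Huffman does not leave tar stat'ing the never-created .hbyte/.canon/.hmeta files. The extension list here is illustrative.

#include <filesystem>
#include <string>
#include <vector>

std::vector<std::string> existing_archive_parts(const std::string& stem)
{
    std::vector<std::string> parts;
    for (const char* ext : {".outlier", ".hbyte", ".canon", ".hmeta", ".quant"}) {
        std::string p = stem + ext;
        if (std::filesystem::exists(p)) parts.push_back(p);  // skip missing parts
    }
    return parts;  // join these into the tar command line
}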

Buffer overflow crash

I'm trying to use the new pSZ API on the master branch, but it crashes as soon as it executes 'psz_compress_init', with the error 'buffer overflow detected'. This happens both to my own program that links against cuSZ and to the example programs 'demo_capi_cuda' and 'demo_capi_cuda_nvcc' that come with pSZ itself.

Here is how I build cuSZ:

cmake -S . -B build \
      -DPSZ_BACKEND=CUDA \
      -DPSZ_BUILD_EXAMPLES=on \
      -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_TESTING=on \
      -DCMAKE_CUDA_ARCHITECTURES="86" \
      -DCMAKE_INSTALL_PREFIX=install

I'm using CUDA 12.2 with GCC 11.4. The GPU is an RTX 3090 (Arch=86).

I also tried the latest release, v0.6, but it seems I cannot build it due to some missing Thrust headers.

@jtian0 Do you know what might be wrong? Thanks a lot!

`internal` not working for 8-GB device

Tests show it works on a 16-GB V100: #18 (comment)

For a large dataset such as Parihaka (4.8 GB), an 8-GB device cannot generate correct unzipped data. The peak memory usage should be around 1.5x the input datum size (in this case, 7.5 GB), because the workflow can be stated as

  1. the datum of 1x size is loaded to the GPU, generating (at most) quant. code of 0.5x;
  2. CSR gathering of outliers, whose size depends on the compression ratio, after which the 1x datum is freed.
    In one case, a 4.8-GB datum generates 800 MB of CSR; together this may exceed the device memory capacity.
> ./bin/cusz -f32 -m r2r -e 1.0e-4.0 -i ~/Parihaka_PSTM_far_stack.f32 -D parihaka -z          
[info] datum:           /path/to/Parihaka_PSTM_far_stack.f32 (4850339584 bytes) of type f32
[dbg]  original len:    1212584896 (padding: 34823)
[dbg]  Time loading data:       3.19802s
[info] quant.cap:       1024    input eb:       0.0001
[dbg]  Time inspecting data range:      0.0232662s
[info] eb change:       (input eb) x 12342.2 (rng) = 1.23422 (relative-to-range)
[dbg]  2-byte quant type, 4-byte internal Huff type

[info] Commencing compression...
[info] nnz.outlier:     16607   (0.00136955%)
[dbg]  Optimal Huffman deflating chunksize      131072
[info] entropy:         3.85809
[dbg]  Huffman enc:     #chunk=9252, chunksze=131072 => 212256108 4-byte words/6792051551 bits
[dbg]  Time writing Huff. binary:       0.501431s
[info] Compression finished, saved Huffman encoded quant.code.
[dbg]  Time tar'ing     1.01489s
[info] Written to:      /path/to/Parihaka_PSTM_far_stack.f32.sz

> ./bin/cusz -i ~/Parihaka_PSTM_far_stack.f32.sz -x --origin ~/Parihaka_PSTM_far_stack.f32 --skip write.x 
[info] Commencing decompression...
[info] Huffman decoding into quant.code.
[info] Extracted outlier from CSR format.
[info] Decompression finished.

[info] Huffman metadata of chunking and reverse codebook size (in bytes): 150336
[info] Huffman coded output size: 849024432
[info] To compare with the original datum

[info] Verification start ---------------------
| min.val             -6893.359375
| max.val             5448.8828125
| val.rng             12342.2421875
| max.err.abs.val     6893.359375
| max.err.abs.idx     706941819
| max.err.vs.rng      0.55851759107283360795
| max.pw.rel.err      1
| PSNR                32.837037623060211899
| NRMSE               0.022811199295531395248
| correl.coeff        -NAN
| comp.ratio.w/o.gzip 5.709997
[info] Verification end -----------------------

[info] Decompressed file is written to /path/to/Parihaka_PSTM_far_stack.f32.szx.
[info] Please use compressed data (*.sz) to calculate final comp ratio (w/ gzip).
[info] Skipped writing unzipped to filesystem.

misc. todo:

  1. the "written to filesystem" info message is not correct
  2. verification is too slow
  3. trailing slash when printing the output file path
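As a back-of-the-envelope check on the ~1.5x peak-memory estimate at the top of this issue (a sketch, not cuSZ code; assumes the f32 input and 2-byte quant codes are resident at the same time):

#include <cstdio>

int main()
{
    const double input_gb = 4850339584.0 / 1e9;  // Parihaka, ~4.85 GB
    const double quant_gb = input_gb / 2;        // uint16 quant codes, 0.5x the f32 input
    std::printf("estimated peak ~ %.2f GB before CSR outlier buffers\n",
                input_gb + quant_gb);            // already close to an 8-GB card's capacity
    return 0;
}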

`internal` build failure w/ cmake 3.11 or earlier

The problem lies in the build process of the external dependencies. Specifically, cmake -S <src path> -B <build path> is a newer invocation, only available since CMake 3.13.

cuSZ/build.py

Line 180 in 2d12138

cmake_nvcomp_pre_cuda11 = "cmake -DCUB_DIR=$(pwd)/external/cub -DCMAKE_C_COMPILER={1} -DCMAKE_CXX_COMPILER={2} -S {0} -B {0}/build && make -j -C {0}/build".format(NVCOMP_DIR, cc, cxx)

`internal` optimize memory management of Huffman encoder

problem statement
The memory management of the Huffman encoder is scattered. This hurts in several ways:

  1. CUDA memory allocation is slow.
  2. Ad hoc allocation (and freeing) does not consider use cases beyond a one-time/demo run, e.g., when the API is called repeatedly on multiple datasets.
  3. When transferring to the host, the only copy should be for archiving.

proposal
The memory footprint, which depends on the compression ratio, can be

  • determined after obtaining the histogram,
  • sized as 1/(lower-bound CR) x some empirical constant (e.g., 1.1x); see the sketch after this list.

The lower-bound CR is estimated using

  1. information entropy, H(X)
  2. redundancy estimation, R = <b> - H(X)
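A minimal host-side sketch of that sizing rule, assuming 2-byte quant codes, the Huffman bound H(X) <= average code length < H(X)+1, and the 1.1x constant mentioned above; this is an illustration, not the cuSZ implementation.

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shannon entropy H(X), in bits per symbol, from a histogram of quant codes.
double entropy_bits(const std::vector<size_t>& hist, size_t total)
{
    double H = 0.0;
    for (size_t c : hist) {
        if (c == 0) continue;
        double p = static_cast<double>(c) / total;
        H -= p * std::log2(p);
    }
    return H;
}

// Encoder buffer size = input bytes / (lower-bound CR) * safety factor.
size_t huffman_buffer_bytes(const std::vector<size_t>& hist, size_t total)
{
    const double H        = entropy_bits(hist, total);
    const double cr_lower = 16.0 / (H + 1.0);  // <b> = 16 bits; H <= avg code len < H+1
    const double safety   = 1.1;               // empirical constant from the proposal
    return static_cast<size_t>(total * sizeof(uint16_t) / cr_lower * safety);
}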

milestone
consider resolving in release 0.3

related
issue #57

other comment
(horrible, no TeX rendering support)

`log` Pascal GPU support

(v0) This record reflects early-stage development and was written after the corresponding code was committed.
(v0) There are several reasons such a record shows up:
   1. An issue from a user or developer exposes manifold subproblems.
   2. Such a record marks a solved subproblem.
   3. The solved subproblem was not documented immediately.
   4. It is helpful as a cross-reference.

This issue was exposed in #6: a Pascal GPU failed because of an overestimated cache size; resolved in 037bf6e.

`internal` export raw binary of quantization code

problem statement
The quant code is currently exported along with metadata, whereas internal evaluation requires a handy export of the raw quantization code.

proposal

  • new CLI option --export quant (updated in the followup)
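For illustration only (not the cuSZ implementation; the helper name is hypothetical), the host-side write that such a --export quant option boils down to, dumping the uint16 quant codes as a raw binary with no metadata attached:

#include <cstdint>
#include <cstdio>
#include <vector>

void dump_quant_raw(const std::vector<uint16_t>& quant, const char* path)
{
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return;
    std::fwrite(quant.data(), sizeof(uint16_t), quant.size(), fp);  // raw codes only
    std::fclose(fp);
}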

Out of memory error on Power 8 IBM machine

@dingwentao, @MauricioAP, @jiemeng-total

Hi Dingwen,
Thank you very much for letting me know that the cuda version has been released!
Here are couple of issues/questions:

   • It is still not practical for our users, since according to the doc there is no standalone decompression yet.
   • It generates the following four files from the compression operation. Which one could be used to compute the compression ratio (I assume .b16.h)?
          .b16.outlier,  .b16meta, .b16.dh and .b16.cHcb
   • Please provide an option for the user to choose where to output the compressed data.
   • It seems that there is a problem running on IBM Power 8 machines. The following are the environment and error messages:
      os:  Red Hat Enterprise Linux Server 7.4 (Maipo)
      compiler: gcc/7.3.0
      cuda: 10.1
      GPU:  Nvidia Tesla P100, Memory: 16Gig
      
      **Commands/error for small data size:**
      $ cusz -z -f32 -m r2r -e 0.0001 -i  mytestinput_38.dat  -3 1601 5850 38
      [info] datum:           mytestinput_38.dat (1423609200 bytes) of type f32
      [info] quant.capacity:  1024
      [info] input eb:        0.0001 x 10^(0) = 0.0001
      [info] eb change:       0.0001 (input eb) x 1405.54 (rng) = 0.140554 (relative-to-value-range, r2r mode)
      [dbg]  exponent = 0.000 (base10) (or) -13.288 (base2)
      [dbg]  uint16_t to represent quant. code, uint32_t internal Huffman bitstream
      [dbg]  original len: 355902300, m the padded: 18866, mxm: 355925956

      [inof] Commencing compression...
      [info] nnz.outlier:     355902297       (100%)
      [info] entropy:         0
      terminate called after throwing an instance of 'thrust::system::system_error'
       what():  radix_sort: failed on 1st step: cudaErrorInvalidDeviceFunction: invalid device function
       Aborted

      **Command for large data size:**
     $ cusz -z -f32 -m r2r -e 0.0001 -i  mytestinput_150.dat  -3 1601 5850 150
     See error messages in the attached file

cusz_errmsg.txt

Best,
Shelton

Questions about the project

Hi team,
I have a couple of questions about the project. I am doing a study and would like to consider your project in it.

  1. When are you planning to decouple the compression and decompression algorithms?
  2. Do you expose an API/Kernel I can use for compression and decompression or is it just the command?
  3. Can we only save the results to files? I imagine it should not be an issue to store the result in memory instead of in a file.

Looking forward to your answers!

cmake --install . does not work

The generated Makefile/ninja rules do not contain any install rule. While I understand that cuSZ comes with the build.py script, this makes things complicated when trying to use cuSZ from other CMake projects.

`internal` type checking and binding

Continuing the discussion in ce286a8#r44475046: (cross-platform) type checking and binding are needed, and uint<n>_t should be used explicitly. In addition, uint64_t may be reinterpret_cast outside a kernel that contains atomic<Op>; a sketch follows.
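A minimal sketch of that cast (an illustration only, assuming uint64_t and unsigned long long are both 8 bytes on the target platform): the kernel takes the type CUDA's 64-bit atomics accept, and the reinterpret_cast happens once at the host call site.

#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void count_nonzero(const uint32_t* in, size_t len, unsigned long long* counter)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len && in[i] != 0) atomicAdd(counter, 1ULL);  // atomicAdd wants unsigned long long
}

void launch_count(const uint32_t* d_in, size_t len, uint64_t* d_counter)
{
    static_assert(sizeof(uint64_t) == sizeof(unsigned long long), "widths must match");
    count_nonzero<<<(unsigned)((len + 255) / 256), 256>>>(
        d_in, len, reinterpret_cast<unsigned long long*>(d_counter));
}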

Building a standalone example

Hello everybody, I cloned the newest code from the main branch and now I want to build a standalone example, separately from the cusz build.

First, I build cusz following the documentation, replacing [/path/to/install/dir] with /opt/cusz. This step is pretty straightforward: cmake {params} -> make -j8 -> make install.

Then I copied this example to a different folder (for example, standalone_example/src) and created a CMakeLists.txt file -- its source code can be found in this gist.

With this structure in place, I build my standalone example and try to execute it:

mkdir build && cd build
cmake ..
make

./ex-api ${cesm-CLDHGH-3600x1800}

I get this error:

root@c5dd7f9907f7:/cusz/my_example/build# ./ex-api ../../cesm-CLDHGH-3600x1800
peeking uncompressed data, 20 elements
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorUnsupportedPtxVersion: the provided PTX was compiled with an unsupported toolchain.
Aborted

Sometimes I get a segmentation fault.

root@c5dd7f9907f7:/cusz/my_example/build# ./ex-api ../../cesm-CLDHGH-3600x1800
peeking uncompressed data, 20 elements
Segmentation fault

Curiously, when I run the same example from the cusz build folder (build/example/capi) it works well. I suspect something is wrong in my CMakeLists.txt, but I cannot find what it could be. Note: I am building my standalone example on the same machine where I build cusz (my image can be found on Docker Hub).

Any thoughts about that?

Thank you very much in advance.
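A small diagnostic sketch, not cuSZ code: cudaErrorUnsupportedPtxVersion typically means the PTX was produced by a CUDA toolkit newer than what the installed driver supports, so comparing the two versions inside the container is a quick first check.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtime);  // toolkit version this binary was built against
    std::printf("driver supports CUDA %d.%d, runtime is %d.%d\n",
                driver / 1000, (driver % 1000) / 10,
                runtime / 1000, (runtime % 1000) / 10);
    if (runtime > driver)
        std::printf("runtime is newer than the driver: unsupported-PTX errors are expected\n");
    return 0;
}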

Cuda memory problem

Hello,

I am using the cuSZ compressor in my research work. I was able to run the example file capi.cu for the CESM data, but when I altered it and ran it on my own dataset, I got an error:

terminate called after throwing an instance of 'thrust::system::system_error'
what(): copy_if failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

I am running it on a Tesla V100 GPU.

commands executed:
To build SZ:
mkdir build && cd build
cmake .. -DCUSZ_BUILD_EXAMPLES=on -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=on -DCMAKE_CUDA_ARCHITECTURES="70" -DCMAKE_PREFIX_PATH=/usr/local/cuda-11.7/targets/x86_64-linux/lib/cmake -DCMAKE_INSTALL_PREFIX=main
make
make install
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/mnt/tank/bem-nr/cuSZ-develop/build

To compile capi.cu:
nvcc -o capi capi.cu -I/mnt/tank/bem-nr/cuSZ-develop/include -I/usr/local/cuda-11.7/include -L/usr/local/cuda-11.7/lib64 -lcudart -L/mnt/tank/bem-nr/cuSZ-develop/build -lcusz --extended-lambda -g
for CESM data: ./capi /data/cesm-CLDHGH-3600x1800
for my data set: ./capi /data/mydata

My dataset is a binary file of floating-point values in a 1D array (length 1x89999991).

I tried altering the length of my array to match the CESM data, but got the same error.

Can you please tell me why cuSZ gives me this error? Please let me know if you need further information.

Thanks in advance!
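A quick sanity-check sketch (not part of cuSZ; the helper name is hypothetical): an illegal-address crash after swapping in a new dataset is often a length/dimension mismatch, so verifying the file size against the dimensions passed to the compressor is a cheap first step.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <filesystem>

bool dims_match_file(const char* path, std::size_t x, std::size_t y, std::size_t z)
{
    const std::uintmax_t bytes    = std::filesystem::file_size(path);
    const std::uintmax_t expected = std::uintmax_t(x) * y * z * sizeof(float);
    if (bytes != expected)
        std::fprintf(stderr, "size mismatch: file has %ju bytes, dims imply %ju\n",
                     bytes, expected);
    return bytes == expected;  // only launch the compressor when this holds
}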

crash report

On an A10 machine, I could not build using ./build.py ampere, so I compiled it using ./build.py turing.

(With -e 1e-4, 1e-5 it worked)

time ./bin/cusz -t f32 -m abs -e 1e-6 -i ./data/randompatch.bin.npy -l 256,256 -z --report time
  ::  load ./data/randompatch.bin.npy 262144 bytes                         [ok]
  ::  compressing...                                                       [ok]
  ::  #outlier =  65409 (99.806213%)                                       [ok]
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  radix_sort: failed on 1st step: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

real	0m0.570s
user	0m0.261s
sys	0m0.138s

Measuring memory footprint

Hi!

I am running some tests with the cusz API and measuring the memory footprint, and I found something interesting that I'd like to share with the community -- I'd be more than happy to hear your thoughts about it.

After compressing and decompressing, cusz seems to leave a residue of allocated memory. In my code example, I run 10 iterations of compression and decompression on the same input.

In the chart below, we can see a peak of memory usage during compression and decompression (as expected). After decompression, cusz frees memory, my client code also frees all allocated memory, and the memory consumption decreases. However, notice that memory consumption does not return to its original level; each iteration leaves a little memory behind. This might suggest a memory leak.

[chart: GPU memory usage across 10 compress/decompress iterations, showing residual growth after each iteration]

Thank you very much!
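A sketch of how the residue can be measured from the client side (assumption: plain cudaMemGetInfo sampling around each compress/decompress round):

#include <cstddef>
#include <cuda_runtime.h>

std::size_t free_device_bytes()
{
    std::size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);  // free and total device memory, in bytes
    return free_b;
}

// usage: record free_device_bytes() before the loop and after each compress/decompress
// iteration; a value that keeps drifting down after all frees points at leaked allocations.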

`log` 5-GB datum running out of memory

(v0) This record reflects early-stage development and was written after the corresponding code was committed.
(v0) There are several reasons such a record shows up:
   1. An issue from a user or developer exposes manifold subproblems.
   2. Such a record marks a solved subproblem.
   3. The solved subproblem was not documented immediately.
   4. It is helpful as a cross-reference.

This issue was exposed in #6. By freeing up memory, it becomes runnable on 16-GB devices; resolved in 90b521b. Note that this is still not sufficient for an 8-GB device, as mentioned in #19.

`internal` CUDA 11 refresh of gather-scatter

problem statement

  • The gather-scatter APIs are deprecated as of CUDA 11.1, raising warnings when compiling.
  • The cuSPARSE CUDA 10 API currently in use may break on A100 systems (CUDA 11 only).

proposal

  • retain the current implementation for CUDA 10 compatibility
  • implement using the CUDA 11 API
  • add a cmake configuration and macro to prefer the CUDA 11 API and fall back to the CUDA 10 API (a sketch follows)
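A minimal sketch of that macro guard, assuming float values, 32-bit indices, and a caller-managed cuSPARSE handle; it only illustrates the two API generations and is not the cuSZ code.

#include <cuda_runtime.h>
#include <cusparse.h>

void gather_outliers(cusparseHandle_t handle, const float* d_dense, size_t dense_len,
                     float* d_vals, int* d_idx, int nnz)
{
#if CUDART_VERSION >= 11000
    // CUDA 11 generic API: the sparse vector X gathers its values from the dense vector Y.
    cusparseSpVecDescr_t x;
    cusparseDnVecDescr_t y;
    cusparseCreateSpVec(&x, dense_len, nnz, d_idx, d_vals,
                        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&y, dense_len, const_cast<float*>(d_dense), CUDA_R_32F);
    cusparseGather(handle, y, x);
    cusparseDestroySpVec(x);
    cusparseDestroyDnVec(y);
#else
    // CUDA 10 fallback: the legacy gather API, deprecated in CUDA 11 but still available.
    cusparseSgthr(handle, nnz, d_dense, d_vals, d_idx, CUSPARSE_INDEX_BASE_ZERO);
#endif
}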

Additional lossless compression

I'm wondering whether it is possible to invoke other lossless compressors (NVComp, etc.) in addition to the Huffman encoding, to get higher compression ratios.
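This is not an existing cuSZ feature; purely as an illustration of the idea, a second lossless stage can be applied to the archive bytes cuSZ writes out. Host-side zstd is used here for simplicity; a GPU library such as nvCOMP would play the same role on the device.

#include <vector>
#include <zstd.h>

std::vector<char> second_stage(const std::vector<char>& cusz_archive)
{
    std::vector<char> out(ZSTD_compressBound(cusz_archive.size()));
    size_t n = ZSTD_compress(out.data(), out.size(),
                             cusz_archive.data(), cusz_archive.size(), /*level=*/3);
    out.resize(ZSTD_isError(n) ? 0 : n);  // empty on failure
    return out;
}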

`internal` good-shape API

problem statement
put the API in good shape

proposal
general compressor

cusz_compressor<Args...>( config<Args...>(args...) ) 
{
    // predictor candidates
    PredictorL<Args...>(args...);    // release 0.3
    PredictorS<Args...>(args...);    // release 0.3
    // handler candidates
    SparsityHandler_CUDA10<Args...>(args...);   // release 0.3
    SparsityHandler_CUDA11<Args...>(args...);
    // encoder candidates
    HuffmanEncoderFallback<Args...>(args...);   // release 0.3
    HuffmanEncoderFass<Args...>(args...);       // release 0.3
    RLE_LZ_etc<Args...>(args...);    // future, sparsity-aware
    ...
}

It is possible to instantiate all candidates of each component and decide online later.
Predictors, sparsity handlers, and encoders share the same usage pattern:

using BYTE = uint8_t;
size_t poolsize_nbyte;
Component<Args...> c(& poolsize_nbyte, args...);
BYTE *h_space, *d_space;
cudaMallocHost(&h_space, poolsize_nbyte);
cudaMalloc(&d_space, poolsize_nbyte);
// process
c.op1<Args...>(args...).op2<Args...>(args...).op3<Args...>(args...);
// access wrapped array and transfer
c.some_array.d2h();

There is an underlying data wrapper that is expected to be ergonomically better than the plain CUDA APIs, esp. in data movement,

template<typename T>
class DataWrapper {  // name to be refined
  public:
   T *hptr, *dptr;
   unsigned int len;   // we don't deal with large arrays
   unsigned int query_some_size() const;
   DataWrapper& d2h();

   template <MODE, WHERE>
   DataWrapper& alloc();
};

Note that per-data-segment allocation/free can be inefficient; it should be used in development and testing only:

template <typename T>
template <MODE m, WHERE w>
DataWrapper<T>& DataWrapper<T>::alloc() {
   if (not (m == MODE::DEV or m == MODE::TEST))
       throw std::runtime_error("should be in DEV or TEST mode only");
   // ... actual allocation happens here ...
   return *this;
}

And we can put a file-scope (or global?) constexpr MODE to check the API use, as sketched below.
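A small sketch of that idea (names are placeholders, not cuSZ symbols): with a file-scope constexpr MODE, the per-segment allocation path can be rejected at compile time rather than at run time.

#include <cstddef>

enum class MODE { DEV, TEST, PROD };
constexpr MODE kBuildMode = MODE::DEV;   // file-scope build mode

template <MODE m = kBuildMode>
void* alloc_per_segment(std::size_t nbyte)
{
    static_assert(m == MODE::DEV || m == MODE::TEST,
                  "per-segment allocation is for DEV/TEST builds only");
    return ::operator new(nbyte);
}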

timeline
partially to be done in release 0.3

reference

`internal` workflow rework: predictor-quantizer and lossless encoder

problem statement
The workflow is space- and compute-inefficient.

  1. The separation of outliers and quant codes requires an if (or its equivalent) and much more than 1x the original input size.
  2. Gathering outliers (i.e., a parallel scan) across the memory space of the original input size is wasteful.
  3. The data type of the quant code and that of the Huffman encoding space do not overlap; the in-place encoding procedure should reuse the quant-code space.

proposal

  1. a multipass predictor-quantizer
  2. a multipass predictor-quantizer with gather/scan in a much smaller space (a sketch of the current full-span gather follows)
  3. conditionally use uint32 for both the quant code and the Huffman encoding space
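For reference, a sketch of the full-span gather that item 2 wants to shrink (assuming nonzero entries mark outliers; a functor avoids requiring --extended-lambda). The proposal would run the same copy_if over a much smaller candidate region instead of the whole input.

#include <cstddef>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>

struct is_outlier {
    __host__ __device__ bool operator()(float v) const { return v != 0.0f; }
};

size_t gather_outliers_full_span(const float* d_in, size_t len, float* d_out)
{
    // parallel scan + compaction over the entire original input span
    auto end = thrust::copy_if(thrust::device, d_in, d_in + len, d_out, is_outlier());
    return static_cast<size_t>(end - d_out);  // nnz
}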

related
#57

timeline
indefinite

`internal` (optional) overlapping predictor-Huffman codec memory space

problem statement
Quantization codes (uint8/16/32) and the encoding space (uint32, to settle first) live in separate buffers. Together they can take up 1.25x to 2.0x the original data space.

proposal
In most cases, where uint32 is used for the encoding space, the memory footprint can be decreased by using uint32 for both the quantization code and the Huffman encoding space, i.e., sharing one buffer (sketched below).

A potential tradeoff is exposed:

  • footprint-good: multiple passes
  • throughput-good: more space
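A minimal sketch of the footprint-good option (assumption: uint32 is wide enough for both roles): one allocation serves as the quant-code buffer first and is reused in place as the Huffman bitstream afterwards.

#include <cstdint>
#include <cuda_runtime.h>

void shared_quant_huffman_space(size_t len)
{
    uint32_t* d_space = nullptr;
    cudaMalloc(&d_space, len * sizeof(uint32_t));

    // pass 1: the predictor-quantizer writes quant codes into d_space
    // pass 2: the Huffman encoder consumes d_space and writes its bitstream back
    //         into the same allocation (in-place reuse, no second buffer)

    cudaFree(d_space);
}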

milestone
consider resolving in release 0.3

related
issue #56
