fpnge's People

Contributors

animetosho, nigeltao, stefanvk, veluca93


fpnge's Issues

Tweak heuristics for fast predictor selection

Have you tried subtracting 1 iff nbits[0] <= (1/2)?

Even better, I think, could be to double all the bit costs (which by itself changes nothing), but remap a bit cost of 1 into 1 instead of 2 for symbol 0, and perhaps a bit cost of 2 into 3 (or 4), or something like that.

The rationale is that RLE only kicks in for many consecutive 0s, so you'd need the probability of 0 to be quite high.

Originally posted by @veluca93 in #19 (comment)
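
For illustration, here is a minimal sketch of that remapping, assuming a per-symbol bit-cost table like the one the fast heuristic already computes; the function name and table here are hypothetical, not fpnge's actual code:

#include <cstddef>
#include <cstdint>

// Hypothetical cost remap for the fast predictor heuristic: double every bit
// cost (which preserves the relative ordering), but make symbol 0 cheaper than
// plain doubling would, since long runs of 0s are what the RLE path can
// actually exploit.
static inline uint32_t RemappedCost(const uint8_t *nbits, size_t sym) {
  if (sym == 0) {
    if (nbits[0] == 1) return 1;  // 1 -> 1 instead of 2
    if (nbits[0] == 2) return 3;  // 2 -> 3 (or 4) instead of 4
  }
  return 2u * nbits[sym];  // everything else: plain doubling
}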

Filter search short circuiting

In some images (mainly photographic content), considering filters other than Paeth is futile.

I therefore tested a simple heuristic: group scanlines into groups of 8. If the filter search selects Paeth for all of the first three, the remaining five get Paeth applied as well, without considering any other filters.

This reaps most of the benefits of optimal filter selection, at the cost of half of fpnge4's speed advantage over fpnge.

Sorted file sizes for a corpus of both photographic and synthetic images:

(Figure heuristics_2: sorted per-image file sizes.)

While this is just a quick attempt, I think it shows that trying to achieve near optimal filter selection without doing a full filter search is both doable and worthwhile.

The opposite case, getting rid of Paeth calculations in images where Paeth doesn't perform well (flat areas?), may also yield some gains.
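
A rough sketch of how that heuristic could look; FullFilterSearch and ApplyPaeth are hypothetical stand-ins for the per-scanline filter search and an unconditional Paeth filter, not fpnge's actual functions:

#include <algorithm>
#include <cstddef>

enum Filter { kNone, kSub, kUp, kAverage, kPaeth };

Filter FullFilterSearch(size_t row);  // hypothetical: tries all filters, returns the winner
void ApplyPaeth(size_t row);          // hypothetical: applies Paeth without searching

void FilterImage(size_t num_rows) {
  for (size_t group = 0; group < num_rows; group += 8) {
    const size_t end = std::min(group + 8, num_rows);
    size_t paeth_wins = 0;
    size_t row = group;
    // Full filter search on the first three scanlines of each group of 8.
    for (; row < end && row < group + 3; ++row) {
      if (FullFilterSearch(row) == kPaeth) ++paeth_wins;
    }
    // If all three picked Paeth, skip the search for the remaining scanlines.
    for (; row < end; ++row) {
      if (paeth_wins == 3) ApplyPaeth(row);
      else FullFilterSearch(row);
    }
  }
}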

Appears to not handle 16-bit RGB images

The output image is 8-bit for 16-bit image input, according to file and ImageMagick compare.

Steps to reproduce:

git clone <this repo>
./build.sh
# Check inputs
file 8b.png 
# 8b.png: PNG image data, 1920 x 1200, 8-bit/color RGB, non-interlaced
file 12b.png
# 12b.png: PNG image data, 1920 x 1200, 16-bit/color RGB, non-interlaced

# Convert images
./build/fpnge 8b.png 8b-out.png 100
./build/fpnge 12b.png 12b-out.png 100

# Compare:
compare -metric AE 8b.png 8b-out.png compare.png  # gives 0
compare -metric AE 12b.png 12b-out.png compare.png  # Gives a very very large number (2.304e+06)

file 12b-out.png 
# 12b-out.png: PNG image data, 1920 x 1200, 8-bit/color RGB, non-interlaced

Note: 12b-out.png claims to be 8-bit/color. Interestingly, OpenCV still reads it as 16-bit, but with wrong values.

Extra Info:

CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
OS: Debian GNU/Linux 11 (bullseye) (in docker)
Compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110 g++ (Debian 10.2.1-6) 10.2.1 20210110

Extra extra

I stumbled on this via the Python binding here. Below is an example showing the errors:

import pprint

import cv2
import fpnge
import numpy as np

blank_image = np.random.randint(0, high=np.iinfo(np.uint16).max, size=(1, 2, 3), dtype=np.uint16)
image_bytes = fpnge.binding.encode_view(blank_image.data, blank_image.shape[1], blank_image.shape[0], blank_image.shape[2], blank_image.dtype.itemsize * 8)
image_check: cv2.Mat = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
assert image_check.dtype == blank_image.dtype
assert image_check.shape == blank_image.shape
assert (image_check == blank_image).all()  # fails: the decoded pixel values do not match
# If you print these:
pprint.pprint(image_check)
# array([[[51929, 35260,  2880],
#         [ 6872, 23392, 15276]]], dtype=uint16)
pprint.pprint(blank_image)
# array([[[16395, 48265, 55754],
#         [44091, 24667, 55322]]], dtype=uint16)

Images (attached): 8b, 12b

Assertion failed in HuffmanTable::FillBits with certain images

I've discovered some images that fail the assert in HuffmanTable::FillBits. I've been able to reproduce the issue on GCC and MSVC, with samples loaded via either lodepng or stb_image. The failure happens when trying to encode images in RGB format.

Sample images, including raw pixel data at 3 channel RGB: sample_images.zip

Image 1:

  • width: 256
  • height: 256
  • num_channels: 3
  • bytes_per_channel: 1
  • FPNGE_COMPRESS_LEVEL_BEST
  • FPNGE_CICP_NONE
    (attached: img1)

Image 2:

  • width: 512
  • height: 512
  • num_channels: 3
  • bytes_per_channel: 1
  • FPNGE_COMPRESS_LEVEL_BEST / FPNGE_COMPRESS_LEVEL_DEFAULT / 3 / 2 / 1 (fails at all compression levels; libpng encodes this image without issue)
  • FPNGE_CICP_NONE
    (attached: sample0)

Apple Silicon, macOS AArch64 (ARM)

Greetings.

The code does not compile on AArch64, since the SSE/AVX intrinsics are not available there; the equivalent code would depend on NEON.
Example:

#if defined(__SSE4_2__)
#include <nmmintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>
#endif

// z[i] = x[i] + y[i]
void vadd(const int* x, const int* y, int* z, unsigned int count) {
    // process 4 integers (128bits) with simd
    unsigned int i = 0;
    for (; i + 4 <= count; i += 4) {
#if defined(__SSE4_2__)
        const __m128i vx = _mm_lddqu_si128((const __m128i*)(x + i));
        const __m128i vy = _mm_lddqu_si128((const __m128i*)(y + i));
        const __m128i vz = _mm_add_epi32(vx, vy);
        _mm_storeu_si128((__m128i*)(z + i), vz);
#elif defined(__aarch64__)
        const int32x4_t vx = vld1q_s32(x + i);
        const int32x4_t vy = vld1q_s32(y + i);
        const int32x4_t vz = vaddq_s32(vx, vy);
        vst1q_s32(z + i, vz);
#endif
    }

    // tail loop
    for (; i < count; ++i) {
        z[i] = x[i] + y[i];
    }
}

I have set up a working GitHub pipeline for compiling and testing fpnge on AArch64. But I am not a C++ programmer; would you be able and willing to help me when I have questions about the porting?

Java Wrapper

Greetings!

Thank you, and compliments on this great library. I wrapped it into a Java library in pursuit of the fastest PNG encoder. Please see https://github.com/manticore-projects/fpng-java

As it stands, FPNGE is the winner in both size and performance:

Benchmark                                           (imageName)  Mode  Cnt     Score    Error  Units
FPNGEBenchmark.encode                               example.png  avgt    3     5.895 ±  0.369  ms/op
FPNGEBenchmark.encode                   looklet-look-scale6.png  avgt    3   242.326 ±  3.215  ms/op
FPNGEncoderBenchmark.encode                         example.png  avgt    3     7.267 ±  4.725  ms/op
FPNGEncoderBenchmark.encode             looklet-look-scale6.png  avgt    3   353.969 ± 14.833  ms/op
ImageIOEncoderBenchmark.encode                      example.png  avgt    3    54.096 ±  0.742  ms/op
ImageIOEncoderBenchmark.encode          looklet-look-scale6.png  avgt    3  1285.791 ± 14.237  ms/op
ObjectPlanetPNGEncoderBenchmark.encode              example.png  avgt    3    25.882 ±  0.670  ms/op
ObjectPlanetPNGEncoderBenchmark.encode  looklet-look-scale6.png  avgt    3   639.232 ± 23.186  ms/op
PNGEncoderBenchmark.encode                          example.png  avgt    3    29.218 ±  0.965  ms/op
PNGEncoderBenchmark.encode              looklet-look-scale6.png  avgt    3   573.997 ± 16.234  ms/op
PNGEncoderBenchmark.encodeFastest                   example.png  avgt    3    17.628 ±  0.511  ms/op
PNGEncoderBenchmark.encodeFastest       looklet-look-scale6.png  avgt    3   362.208 ±  4.346  ms/op

I still need to improve those benchmarks so that they compare on the basis of similar achieved file sizes, and they also need to cover different JDKs.

Ideas for improving compression ratio?

fpnge is pretty fast, so I was wondering what, in theory, could be done to improve compression slightly without sacrificing too much speed. I was hoping you might have some ideas on the topic (regardless of whether they'd ever be implemented).
Thoughts I had:

RLE

Currently only full vectors of zeroes are candidates for RLE.

  • Detecting sequences starting/ending partway through a vector might help density very slightly, with relatively little performance cost (see the sketch after this list)
  • Detecting sequences completely within a vector may be possible without too much cost, but I suspect it would have little impact on compression, since such sequences are relatively short
  • Detecting sequences of bytes other than 0 - not sure of the effectiveness
    • Perhaps use the distance of one pixel to the left instead of one byte?
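
For the first bullet, one way to find a zero run that only partially covers a 16-byte vector is a byte compare plus movemask; this is only an illustration, not fpnge's RLE code:

#include <cstdint>
#include <emmintrin.h>  // SSE2

// Bit i of the result is set when byte i of v is zero.
static inline uint32_t ZeroByteMask(__m128i v) {
  return (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
}

// Length of the zero run at the start of the vector (0..16). Counting from the
// other end of the mask gives the run length at the tail, so a run crossing a
// vector boundary can be stitched together from adjacent vectors.
static inline unsigned LeadingZeroBytes(__m128i v) {
  uint32_t non_zero = ~ZeroByteMask(v) & 0xFFFFu;
  return non_zero ? (unsigned)__builtin_ctz(non_zero) : 16u;  // GCC/Clang builtin
}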

Unfortunately, due to how the Huffman table is currently constructed, the LZ77 symbols always get assigned rather long bit sequences, making RLE unattractive (particularly with 0s) unless the run is rather long. Which leads to the next point.

Huffman encoding

The current implementation doesn't consider symbol counts for many symbols, as the Huffman table has to be specially crafted for the algorithm used. Trying to relax some length restrictions often leads to ComputeCodeLengths becoming rather slow; from what I can tell, it appears to find the optimal tree via a brute-force approach. Would you happen to know of a faster algorithm for this?
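
For reference, the package-merge algorithm computes optimal length-limited codes in roughly O(symbols × max_length), and several deflate encoders instead use a cheaper clamp-and-repair heuristic. The sketch below shows the latter; it is not fpnge's ComputeCodeLengths and it ignores the extra per-symbol constraints fpnge needs:

#include <cstdint>
#include <vector>

// Clamp Huffman code lengths to max_len, then repair the Kraft sum so the
// lengths still describe a valid prefix code. Assumes the number of used
// symbols is at most 2^max_len (always true for deflate alphabets).
void LimitCodeLengths(std::vector<uint32_t> &lengths, uint32_t max_len) {
  const uint64_t budget = 1ull << max_len;  // Kraft sum scaled by 2^max_len
  uint64_t kraft = 0;
  for (uint32_t &l : lengths) {
    if (l == 0) continue;
    if (l > max_len) l = max_len;
    kraft += 1ull << (max_len - l);
  }
  // Over budget: lengthen codes that are still shorter than max_len, starting
  // with the longest ones, as lengthening those costs the fewest expected bits.
  while (kraft > budget) {
    for (uint32_t l = max_len - 1; l >= 1 && kraft > budget; --l) {
      for (uint32_t &len : lengths) {
        if (len == l) {
          ++len;
          kraft -= 1ull << (max_len - l - 1);
          if (kraft <= budget) break;
        }
      }
    }
  }
}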

The other idea is to remove almost all length restrictions, which allows normal optimal Huffman tree algorithms to be used. I can't think of a quick way to allow the middle section (symbols 16-239) to have differing bit lengths, so you'd need to manipulate the symbol counts there a bit to ensure they all end up with the same bit length. Otherwise, the unrestricted Huffman code could be handled with some slowdown, as, in addition to the shuffle-lookup for the low bits, you'd need to do it for the high bits as well.

Another thing to consider might be having multiple deflate blocks and sampling the image at various points, to be able to adapt to changes which occur vertically down the image.

I suspect a more optimal Huffman code might generally give a low single-digit percentage gain most of the time, but at some speed cost.

LZ77

It seems that there's a fair bit of scope here to improve compression through better match finding, but this is also often the slowest part of deflate implementations. I can't think of a way to implement this without dropping to scalar code either.

Furthermore, I'm not sure that knowing the data is an image provides much benefit over generic LZ77 implementations, except that perhaps you could look for matches only on pixel boundaries instead of at every byte.

Any ideas here?
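
On the pixel-boundary idea, here is a very small sketch of what a scalar, pixel-aligned match finder could look like: hash a few bytes at each pixel boundary and remember the last position per bucket. This is entirely hypothetical, not fpnge code, and it ignores the 32 KiB deflate window:

#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical pixel-aligned match finder: hash 4 bytes at each pixel boundary
// and remember the last position that fell into the same bucket.
struct PixelMatcher {
  PixelMatcher() : head(1u << 15, SIZE_MAX) {}

  // Look up a previous pixel-aligned position with the same 4-byte hash,
  // record pos for future lookups, and return the match length (0 if none).
  // The caller must ensure pos + 4 <= size and advance pos in whole pixels.
  size_t FindAndInsert(const uint8_t *data, size_t size, size_t pos) {
    const uint32_t h = Hash(data + pos);
    const size_t prev = head[h];
    head[h] = pos;
    if (prev == SIZE_MAX) return 0;
    size_t len = 0;
    while (pos + len < size && data[prev + len] == data[pos + len]) ++len;
    return len;
  }

  static uint32_t Hash(const uint8_t *p) {
    uint32_t v;
    std::memcpy(&v, p, 4);
    return (v * 2654435761u) >> 17;  // fold 32 bits into a 15-bit bucket index
  }

  std::vector<size_t> head;  // last position seen per hash bucket
};

A caller would step pos by the pixel size (3 or 4 bytes) so both insertions and lookups stay on pixel boundaries; whether the reduced number of hash insertions pays for the missed matches is exactly the open question above.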

FPNGEOutputAllocSize can underestimate the required size

There exist scenarios in which FPNGEOutputAllocSize underestimates the required size. This can lead to heap buffer overflows.

So far I have only run into the problem with 1-pixel-wide grayscale images at compression level 2.

Example image: 1 channel, 1x64000, noise (attached)

With an added -g option for grayscale images:
./fpnge -2 -g ../testdata/narrow_gv.png out.png
Segmentation fault (core dumped)

Allocated size: 129024
Encoded size: 129875
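
Until the bound itself is fixed, a possible caller-side stopgap is to pad whatever FPNGEOutputAllocSize reports before allocating the output buffer; the margin below is arbitrary, not a proven bound:

#include <cstddef>
#include <vector>

// Stopgap only: pad the size reported by FPNGEOutputAllocSize(...) before
// allocating. The observed shortfall above was under 1 KiB, but there is no
// guarantee this margin always covers it.
std::vector<unsigned char> AllocateOutputWithMargin(size_t reported_size) {
  const size_t padded = reported_size + reported_size / 64 + 1024;
  return std::vector<unsigned char>(padded);
}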

[fjxld] group tiles by type, then decode using SIMD

Hi, Luca.

This question is related to jxl.

Previously I noticed that the decoder is 10 times slower than the encoder: libjxl/libjxl#1975 (comment)

I guess it is because of the dependency between pixels during decoding. What about grouping tiles with the same predictor type?

I said "guess" because I'm unable to locate the code for encoding and decoding with -e 1; the structure of libjxl is a bit over-complicated. Can you provide direct links into current master?

Or maybe you want to test this idea and write fjxld?
