
pffft's Introduction


PFFFT: a pretty fast FFT and fast convolution with PFFASTCONV



Brief description:

PFFFT does 1D Fast Fourier Transforms of single-precision real and complex vectors. It tries to do it fast, it tries to be correct, and it tries to be small. Computations take advantage of SSE1 instructions on x86 CPUs, Altivec on PowerPC CPUs, and NEON on ARM CPUs. The license is BSD-like.

PFFFT is a fork of Julien Pommier's library on bitbucket with some changes and additions.

PFFASTCONV does fast convolution (FIR filtering), of single precision real vectors, utilizing the PFFFT library. The license is BSD-like.

PFDSP contains a few other signal processing functions. Currently it contains mixing and carrier generation functions. It is a work in progress, and so is the API! The fast convolution from PFFASTCONV might get merged into PFDSP.

Why does it exist:

I (Julien Pommier) was in search of a good performing FFT library, preferably very small and with a very liberal license.

When one says "fft library", FFTW ("Fastest Fourier Transform in the West") is probably the first name that comes to mind -- I guess that 99% of open-source projects that need an FFT do use FFTW, and are happy with it. However, it is quite a large library, which does everything fft related (2d transforms, 3d transforms, other transformations such as discrete cosine, or fast hartley). And it is licensed under the GNU GPL, which means that it cannot be used in non open-source products.

An alternative to FFTW that is really small is the venerable FFTPACK v4, which is available on NETLIB. A more recent version (v5) exists, but it is larger as it deals with multi-dimensional transforms. This is a library that is written in FORTRAN 77, a language that is now considered a bit antiquated by many. FFTPACK v4 was written in 1985 by Dr Paul Swarztrauber of NCAR, more than 25 years ago! And despite its age, benchmarks show that it is still a very good performing FFT library, see for example the 1d single precision benchmarks here. It is however not competitive with the fastest ones, such as FFTW, Intel MKL, AMD ACML, Apple vDSP. The reason is that those libraries take advantage of the SSE SIMD instructions available on Intel CPUs since the days of the Pentium III. These instructions deal with small vectors of 4 floats at a time, instead of a single float for a traditional FPU, so when using these instructions one may expect up to a 4-fold performance improvement.

The idea was to take this fortran fftpack v4 code, translate it to C, modify it to deal with those SSE instructions, and check that the final performance is not completely ridiculous when compared to other SIMD FFT libraries. Translation to C was performed with f2c. The resulting file was edited a bit in order to remove the thousands of gotos that were introduced by f2c. You will find the fftpack.h and fftpack.c sources in the repository; this is a complete translation of fftpack, with the discrete cosine transform and the test program. There is no license information in the netlib repository, but it was confirmed to me by the fftpack v5 curators that the same terms do apply to fftpack v4 (http://www.cisl.ucar.edu/css/software/fftpack5/ftpk.html). This is a "BSD-like" license, compatible with proprietary projects.

Adapting fftpack to deal with the SIMD 4-element vectors instead of scalar single precision numbers was more complex than I originally thought, especially with the real transforms, and I ended up writing more code than I planned.

The code:

Good old C:

The FFT API is very very simple, just make sure that you read the comments in pffft.h.

The Fast convolution's API is also very simple, just make sure that you read the comments in pffastconv.h.

C++:

A simple C++ wrapper is available in pffft.hpp.

Git:

This archive's source can be downloaded with git (without the submodules):

git clone https://github.com/marton78/pffft.git

Only two files?:

"Only two files, in good old C, pffft.c and pffft.h"

This statement NO LONGER holds!

With new functionality and support for AVX, the sources needed to be restructured. But you can compile and link pffft as a static library.

CMake:

There's now CMake support to build the static libraries libPFFFT.a and libPFFASTCONV.a from the source files, plus the additional libFFTPACK.a library. The latter's sources are included anyway for the benchmark.

There are several CMake options to modify library size and optimization. You can explore all available options with cmake-gui or ccmake, the console version - after having installed (on Debian/Ubuntu Linux) one of

sudo apt-get install cmake-qt-gui
sudo apt-get install cmake-curses-gui

Some of the options:

  • PFFFT_USE_TYPE_FLOAT to activate single precision 'float' (default: ON)
  • PFFFT_USE_TYPE_DOUBLE to activate 'double' precision float (default: ON)
  • PFFFT_USE_SIMD to use SIMD (SSE/AVX/NEON/ALTIVEC) CPU features? (default: ON)
  • DISABLE_SIMD_AVX to disable AVX CPU features (default: OFF)
  • PFFFT_USE_SIMD_NEON to force using NEON on ARM (requires PFFFT_USE_SIMD) (default: OFF)
  • PFFFT_USE_SCALAR_VECT to use 4-element vector scalar operations (if no other SIMD) (default: ON)

Options can be passed to cmake at command line, e.g.

cmake -DPFFFT_USE_TYPE_FLOAT=OFF -DPFFFT_USE_TYPE_DOUBLE=ON

My Linux distribution defaults to GCC. With Clang installed and the bash shell, you can select it with

mkdir build
cd build
CC=/usr/bin/clang CXX=/usr/bin/clang++ cmake -DCMAKE_BUILD_TYPE=Debug ../
cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=~ ../
ccmake .                          # or: cmake-gui .
cmake --build .                   # or simply: make
ctest                             # to execute some tests - including benchmarks
cmake --build . --target install  # or simply: [sudo] make install

With MSVC on Windows, you need some different options. The following builds a 64-bit Release with Visual Studio 2019:

mkdir build
cd build
cmake -G "Visual Studio 16 2019" -A x64 ..
cmake --build . --config Release
ctest -C Release

see https://cmake.org/cmake/help/v3.15/manual/cmake-generators.7.html#visual-studio-generators

History / Origin / Changes:

Origin for this code/fork is Julien Pommier's pffft on bitbucket: https://bitbucket.org/jpommier/pffft/

Git history shows the following first commits of the major contributors:

  • Julien Pommier: November 19, 2011
  • Marton Danoczy: September 30, 2015
  • Hayati Ayguen: December 22, 2019
  • Dario Mambro: March 24, 2020

There are a few other contributors not listed here.

The main changes include:

  • improved benchmarking, see https://github.com/hayguen/pffft_benchmarks
  • double support
  • avx(2) support
  • c++ headers (wrapper)
  • additional API helper functions
  • additional library for fast convolution
  • cmake support
  • ctest

Comparison with other FFTs:

The idea was not to break speed records, but to get a decently fast fft that is at least 50% as fast as the fastest FFT -- especially on the slowest computers. I'm more focused on getting the best performance on slow cpus (Atom, Intel Core 1, old Athlons, ARM Cortex-A9...) than on getting top performance on today's fastest cpus.

It can be used in a real-time context as the fft functions do not perform any memory allocation -- that is why they accept a 'work' array in their arguments.

It is also a bit focused on performing 1D convolutions, which is why it provides "unordered" FFTs and a fourier domain convolution operation.

Very interesting is https://www.nayuki.io/page/free-small-fft-in-multiple-languages. It shows how small an FFT can be, including the Bluestein algorithm, but it is anything but fast. The whole C++ implementation file is 161 lines, including the copyright header, see https://github.com/nayuki/Nayuki-web-published-code/blob/master/free-small-fft-in-multiple-languages/FftComplex.cpp

Dependencies / Required Linux packages

On Debian/Ubuntu Linux the following packages should be installed:

sudo apt-get install build-essential gcc g++ cmake

Benchmarks and results

Quicklink

Find results at https://github.com/hayguen/pffft_benchmarks.

General

My (Hayati Ayguen) first look at FFT benchmarks was with benchFFT, and especially its published benchmark results, which demonstrate the performance of FFTW. Seen from today's perspective (2021), the benchmarked computer systems are quite outdated.

Having a look into the benchFFT source code, the latest source changes, including competitive fft implementations, are dated November 2003.

In 2019, when pffft got my attention at bitbucket, there were also some benchmark results. Unfortunately the results are tables of numbers, without graphical plots. Without the plots, I could not get an impression. That is why I started https://github.com/hayguen/pffft_benchmarks, which includes GnuPlot figures.

Today, in June 2021, I realized the existence of https://github.com/FFTW/benchfft. This repository is much more up-to-date, with a commit in December 2020. Unfortunately, it does not look simple to get it running, including the generation of plots.

Is there any website showing benchFFT results of more recent computer systems?

Of course, it is very important that a benchmark can compare a bunch of different FFT algorithms/implementations. This requires having them compiled/built and usable.

Git submodules for Green-, Kiss- and Pocket-FFT

Sources for Green-, Kiss- and Pocket-FFT can be downloaded directly with the sources of this repository - using git submodules:

git clone --recursive https://github.com/marton78/pffft.git

The --recursive option is important: it also fetches the submodules directly. But you can retrieve the submodules later, too:

git submodule update --init

Fastest Fourier Transform in the West: FFTW

To allow comparison with FFTW (http://www.fftw.org/), the cmake option -DPFFFT_USE_BENCH_FFTW=ON has to be used with the commands below. The option requires the following (Debian/Ubuntu) package to be installed first:

sudo apt-get install libfftw3-dev

Intel Math Kernel Library: MKL

Intel's MKL https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html currently looks even faster than FFTW.

On Ubuntu Linux it is easy to set up with the package intel-mkl. Similarly on Debian: intel-mkl-full.

There are special repositories for following Linux distributions:

Performing the benchmarks - with CMake

Benchmarks should be prepared by creating a special build folder

mkdir build_benches
cd build_benches
cmake ../bench

There are several CMake options to parametrize, which fft implementations should be benched. You can explore all available options with cmake-gui or ccmake, see CMake.

Some of the options:

  • BENCH_ID name the benchmark - used in filename
  • BENCH_ARCH target architecture passed to compiler for code optimization
  • PFFFT_USE_BENCH_FFTW use (system-installed) FFTW3 in fft benchmark? (default: OFF)
  • PFFFT_USE_BENCH_GREEN use Green FFT in fft benchmark? (default: ON)
  • PFFFT_USE_BENCH_KISS use KissFFT in fft benchmark? (default: ON)
  • PFFFT_USE_BENCH_POCKET use PocketFFT in fft benchmark? (default: ON)
  • PFFFT_USE_BENCH_MKL use Intel MKL in fft benchmark? (default: OFF)

These options can be passed to cmake at command line, e.g.

cmake -DBENCH_ARCH=native -DPFFFT_USE_BENCH_FFTW=ON -DPFFFT_USE_BENCH_MKL=ON ../bench

The benchmarks are built and executed with

cmake --build .

You can also specify to use a different compiler/version with the cmake step, e.g.:

CC=/usr/bin/gcc-9 CXX=/usr/bin/g++-9 cmake -DBENCH_ID=gcc9 -DBENCH_ARCH=native -DPFFFT_USE_BENCH_FFTW=ON -DPFFFT_USE_BENCH_MKL=ON ../bench
CC=/usr/bin/clang-11 CXX=/usr/bin/clang++-11 cmake -DBENCH_ID=clang11 -DBENCH_ARCH=native -DPFFFT_USE_BENCH_FFTW=ON -DPFFFT_USE_BENCH_MKL=ON ../bench

When using MSVC on Windows, the cmake command needs the generator and architecture options and has to be called from the VS Developer prompt:

cmake -G "Visual Studio 16 2019" -A x64 ../bench/

see https://cmake.org/cmake/help/v3.15/manual/cmake-generators.7.html#visual-studio-generators

For running with different compiler version(s):

  • copy the result file (.tgz), e.g. cp *.tgz ../
  • delete the build directory: rm -rf *
  • then continue with the cmake step

Benchmark results and contribution

You might contribute by providing us with the results of your computer(s).

The benchmark results are stored in a separate git-repository: See https://github.com/hayguen/pffft_benchmarks.

This is to keep this repository's sources small.

pffft's People

Contributors

grmblbl, hayguen, howard0su, ilmai, marton78, menzi11, mx, unevens

pffft's Issues

My code works without SIMD, but gives the wrong answer when enabled

Hi, thanks for a cool library.

I've been testing my code with TCC (tiny c compiler), where there are no SIMD headers, so I disabled SIMD. It's a Yin Pitch Detection algorithm, and I get a fairly accurate pitch when SIMD is disabled, but once I enable it in GCC and MSVC, the detected pitch is quite far from the non-SIMD pitch. There is no other code using SIMD, and I've verified that the values in the function difference are... different with/without SIMD.

I'm using doubles, but it's the same with floats.

The result is supposed to be as close as possible to the frequency (3398) and without SIMD I get 3443.696594 while SIMD gets me 3793.724228 in both GCC and MSVC. All compilers return 3443.696594 without SIMD.

Can someone tell me what I'm doing wrong?

The code is attached below and is compiled like so:

tcc -Ilib lib/pffft/*.c main.c -o main.exe -D_USE_MATH_DEFINES -DPFFFT_SIMD_DISABLE -D__GNUC__ -D__MINGW32__
gcc -Ilib lib/pffft/*.c main.c -o main.exe -march=native -D_USE_MATH_DEFINES
cl main.c lib/pffft/*.c /Fe:main.exe /I lib
#include <stdio.h>
#include <stdlib.h>
#include <float.h>
#include <math.h>

#include "pffft/pffft_double.h"

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define YIN_THRESHOLD 0.20

void sinewave(double frequency, int samplerate, int size, double *output)
{
    int lut_size = size;
    double delta_phi = frequency * lut_size * 1.0 / samplerate;
    double phase = 0.0;
    double min = DBL_MAX;
    double max = -DBL_MAX;
    int *lut = malloc(lut_size * sizeof(int));

    for (int i = 0; i < lut_size; ++i)
        lut[i] = (int)roundf(0x7FFF * sin(2.0 * M_PI * i / lut_size));

    for (int i = 0; i < size; ++i)
    {
        /* look up the current table entry; keep it as a double */
        double val = lut[(int)phase];
        max = fmax(max, val);
        min = fmin(min, val);
        output[i] = val;
        phase += delta_phi;
        if (phase >= lut_size)
            phase -= lut_size;
    }

    free(lut);
}

void difference(double *audio_buffer, int audio_buffer_size, double *yin_buffer)
{
    int yin_buffer_size = audio_buffer_size / 2;
    PFFFTD_Setup *setup = pffftd_new_setup(audio_buffer_size, PFFFT_COMPLEX);
    double *data = pffftd_aligned_malloc(2 * audio_buffer_size * sizeof(double));
    double *power_terms = malloc(yin_buffer_size * sizeof(double));
    double *kernel = pffftd_aligned_malloc(2 * audio_buffer_size * sizeof(double));

    power_terms[0] = 0.0;  /* malloc does not zero the memory */
    for (int j = 0; j < yin_buffer_size; ++j)
        power_terms[0] += audio_buffer[j] * audio_buffer[j];
    
    for (int tau = 1; tau < yin_buffer_size; ++tau)
        power_terms[tau] =
            power_terms[tau-1] - audio_buffer[tau-1] * audio_buffer[tau-1] +
            audio_buffer[tau+yin_buffer_size] * audio_buffer[tau+yin_buffer_size];

    for (int i = 0; i < audio_buffer_size; ++i)
    {
        data[2*i+0] = audio_buffer[i];
        data[2*i+1] = 0;
    }

    pffftd_transform(setup, data, data, 0, PFFFT_FORWARD);

    for (int j = 0; j < yin_buffer_size; ++j)
    {
        kernel[2*j+0] = audio_buffer[(audio_buffer_size / 2 - 1) - j];
        kernel[2*j+1] = 0;
        kernel[2*j+audio_buffer_size+0] = 0;
        kernel[2*j+audio_buffer_size+1] = 0;
    }
    
    pffftd_transform(setup, kernel, kernel, 0, PFFFT_FORWARD);
    
    for (int j = 0; j < audio_buffer_size; ++j)
    {
        data[2*j+0] = data[2*j+0] * kernel[2*j] - data[2*j+1] * kernel[2*j+1];
        data[2*j+1] = data[2*j+1] * kernel[2*j] + data[2*j+0] * kernel[2*j+1];
    }

    pffftd_transform(setup, data, data, 0, PFFFT_BACKWARD);

    for (int j = 0; j < yin_buffer_size; ++j)
        yin_buffer[j] =
            power_terms[0] + power_terms[j] - 2 * data[2 * (yin_buffer_size - 1 + j)];

    free(power_terms);
    pffftd_aligned_free(data);
    pffftd_aligned_free(kernel);
    pffftd_destroy_setup(setup);
}

void cumulative_mean_normalized_difference(double *yin_buffer, int yin_buffer_size)
{
    double running_sum = 0.0;

    yin_buffer[0] = 1;

    for (int tau = 1; tau < yin_buffer_size; tau++) {
        running_sum += yin_buffer[tau];
        yin_buffer[tau] *= tau / running_sum;
    }
}

int absolute_threshold(double *yin_buffer, int yin_buffer_size)
{
    int tau;

    for (tau = 2; tau < yin_buffer_size; tau++)
        if (yin_buffer[tau] < YIN_THRESHOLD)
        {
            while (tau + 1 < yin_buffer_size && yin_buffer[tau + 1] < yin_buffer[tau])
                tau++;
            break;
        }

    return (tau == yin_buffer_size || yin_buffer[tau] >= YIN_THRESHOLD) ? -1 : tau;
}

double parabolic_interpolation(int tau_estimate, double *yin_buffer, int yin_buffer_size)
{
    double better_tau;
    int x0;
    int x2;

    if (tau_estimate < 1)
        x0 = tau_estimate;
    else
        x0 = tau_estimate - 1;
    if (tau_estimate + 1 < yin_buffer_size)
        x2 = tau_estimate + 1;
    else
        x2 = tau_estimate;

    if (x0 == tau_estimate)
        if (yin_buffer[tau_estimate] <= yin_buffer[x2])
            better_tau = tau_estimate;
        else
            better_tau = x2;
    else if (x2 == tau_estimate)
        if (yin_buffer[tau_estimate] <= yin_buffer[x0])
            better_tau = tau_estimate;
        else
            better_tau = x0;
    else
    {
        double s0, s1, s2;
        s0 = yin_buffer[x0];
        s1 = yin_buffer[tau_estimate];
        s2 = yin_buffer[x2];
        better_tau = tau_estimate + (s2 - s0) / (2 * (2 * s1 - s2 - s0));
    }

    return better_tau;
}

double yin_pitch(double *audio_buffer, int audio_buffer_size, int samplerate)
{
    int yin_buffer_size = audio_buffer_size / 2;
    double *yin_buffer = malloc(yin_buffer_size * sizeof(double));
    int tau_estimate;
    double better_tau;

    difference(audio_buffer, audio_buffer_size, yin_buffer);
    cumulative_mean_normalized_difference(yin_buffer, yin_buffer_size);
    tau_estimate = absolute_threshold(yin_buffer, yin_buffer_size);
    better_tau = parabolic_interpolation(tau_estimate, yin_buffer, yin_buffer_size);

    free(yin_buffer);

    return samplerate / better_tau;
}

int main()
{
    int audio_buffer_size = 8192;
    double frequency = 3398.0;
    int samplerate = 48000;
    double *audio_buffer = malloc(audio_buffer_size * sizeof(double));
    double pitch;

    sinewave(frequency, samplerate, audio_buffer_size, audio_buffer);

    pitch = yin_pitch(audio_buffer, audio_buffer_size, samplerate);

    printf("result=%f\n", pitch);
    printf("Success.\n");

    free(audio_buffer);

    return 0;
}

(Code is influenced by https://github.com/JorenSix/TarsosDSP/blob/master/core/src/main/java/be/tarsos/dsp/pitch/FastYin.java and https://github.com/sevagh/pitch-detection/blob/master/src/yin.cpp)
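
A note on the frequency-domain multiplication loop in the code above: it assigns data[2*j+0] and then reads it again on the next line to compute the imaginary part, so the imaginary part is built from the already updated real part. The sketch below is plain C and independent of pffft; the function name cmul_inplace is made up for illustration. It reads both operands into temporaries before writing anything back. Separately, as far as I read the comments in pffft.h, the plain transform call leaves its output in an internal order that can depend on the SIMD width, while the ordered variant returns canonical interleaved complex values. That may explain the SIMD/non-SIMD difference reported here, but it has not been verified against this program.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply n interleaved complex values in place: a[k] = a[k] * b[k].
 * Layout: a[2*k] is the real part, a[2*k+1] the imaginary part.
 * Hypothetical helper, not part of pffft. */
void cmul_inplace(double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t k = 0; k < n; ++k) {
        /* read everything first, so no input is overwritten too early */
        const double ar = a[2 * k], ai = a[2 * k + 1];
        const double br = b[2 * k], bi = b[2 * k + 1];
        a[2 * k]     = ar * br - ai * bi;
        a[2 * k + 1] = ai * br + ar * bi;
    }
}
```

For example, multiplying 1+2i by 3+4i with this helper yields -5+10i, whereas the in-place pattern from the issue would compute the imaginary part from the overwritten real part.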

There is an SSE instruction in pf_neon_float.h.

/* reverse/flip all floats */
#  define VREV_S(a)    _mm_shuffle_ps(a, a, _MM_SHUFFLE(0,1,2,3))
/* reverse/flip complex floats */
#  define VREV_C(a)    _mm_shuffle_ps(a, a, _MM_SHUFFLE(1,0,3,2))

Perhaps the following is correct.

/* reverse/flip all floats */
#  define VREV_S(a)    vcombine_f32(vrev64_f32(vget_high_f32(a)), vrev64_f32(vget_low_f32(a)))
/* reverse/flip complex floats */
#  define VREV_C(a)    vextq_f32(a, a, 2)

I consulted the following site.

https://stackoverflow.com/questions/32536265/how-to-convert-mm-shuffle-ps-sse-intrinsic-to-neon-intrinsic
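
To check a NEON replacement against the SSE originals, a scalar reference of what the two shuffles do can help. With _mm_shuffle_ps(a, a, _MM_SHUFFLE(0,1,2,3)) the four lanes come out fully reversed, and with _MM_SHUFFLE(1,0,3,2) the two complex pairs swap places. The sketch below is plain C; the names vrev_s_ref and vrev_c_ref are made up for illustration and are not part of pffft.

```c
#include <assert.h>

/* Scalar reference for VREV_S: [a0 a1 a2 a3] -> [a3 a2 a1 a0]. */
void vrev_s_ref(const float in[4], float out[4])
{
    assert(in != out);  /* separate buffers keep the reference trivial */
    for (int i = 0; i < 4; ++i)
        out[i] = in[3 - i];
}

/* Scalar reference for VREV_C: swap the two complex pairs,
 * [a0 a1 a2 a3] -> [a2 a3 a0 a1]. */
void vrev_c_ref(const float in[4], float out[4])
{
    assert(in != out);
    for (int i = 0; i < 4; ++i)
        out[i] = in[(i + 2) % 4];
}
```

The proposed NEON versions appear to produce the same lane orders: the vcombine/vrev64 combination reverses all four lanes, and vextq_f32(a, a, 2) rotates by two lanes.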

ZCONVOLVE_USING_INLINE_NEON_ASM is bugged

There are two bugs involving ZCONVOLVE_USING_INLINE_NEON_ASM.

First of all, there's a typo which results in the hand-written assembler version never getting used. See

# ifndef __clang__
#   define ZCONVOLVE_USING_INLINE_NEON_ASM
# endif

vs

#ifdef ZCONVOLVE_USING_INLINE_ASM

However, if that's fixed, we get a lot of complaints from GCC:

[build] /tmp/ccN5565L.s: Assembler messages:
[build] /tmp/ccN5565L.s:5723: Error: operand 1 must be an integer register -- `mov r8,x7'
[build] /tmp/ccN5565L.s:5724: Error: unknown mnemonic `vdup.f32' -- `vdup.f32 q15,x1'
[build] /tmp/ccN5565L.s:5726: Error: unknown mnemonic `pld' -- `pld [x5,#64]'
[build] /tmp/ccN5565L.s:5727: Error: unknown mnemonic `pld' -- `pld [x6,#64]'
[build] /tmp/ccN5565L.s:5728: Error: unknown mnemonic `pld' -- `pld [x7,#64]'
[build] /tmp/ccN5565L.s:5729: Error: unknown mnemonic `pld' -- `pld [x5,#96]'
[build] /tmp/ccN5565L.s:5730: Error: unknown mnemonic `pld' -- `pld [x6,#96]'
[build] /tmp/ccN5565L.s:5731: Error: unknown mnemonic `pld' -- `pld [x7,#96]'
[build] /tmp/ccN5565L.s:5732: Error: unknown mnemonic `vld1.f32' -- `vld1.f32 {q0,q1},[x5,:128]!'
[build] /tmp/ccN5565L.s:5733: Error: unknown mnemonic `vld1.f32' -- `vld1.f32 {q4,q5},[x6,:128]!'
[build] /tmp/ccN5565L.s:5734: Error: unknown mnemonic `vld1.f32' -- `vld1.f32 {q2,q3},[x5,:128]!'
[build] /tmp/ccN5565L.s:5735: Error: unknown mnemonic `vld1.f32' -- `vld1.f32 {q6,q7},[x6,:128]!'
[build] /tmp/ccN5565L.s:5736: Error: unknown mnemonic `vld1.f32' -- `vld1.f32 {q8,q9},[r8,:128]!'
[build] /tmp/ccN5565L.s:5737: Error: unknown mnemonic `vmul.f32' -- `vmul.f32 q10,q0,q4'
[build] /tmp/ccN5565L.s:5738: Error: unknown mnemonic `vmul.f32' -- `vmul.f32 q11,q0,q5'
[build] /tmp/ccN5565L.s:5739: Error: unknown mnemonic `vmul.f32' -- `vmul.f32 q12,q2,q6'
[build] /tmp/ccN5565L.s:5740: Error: unknown mnemonic `vmul.f32' -- `vmul.f32 q13,q2,q7'
[build] /tmp/ccN5565L.s:5741: Error: unknown mnemonic `vmls.f32' -- `vmls.f32 q10,q1,q5'
[build] /tmp/ccN5565L.s:5742: Error: unknown mnemonic `vmla.f32' -- `vmla.f32 q11,q1,q4'
[build] /tmp/ccN5565L.s:5743: Error: unknown mnemonic `vld1.f32' -- `vld1.f32 {q0,q1},[r8,:128]!'
[build] /tmp/ccN5565L.s:5744: Error: unknown mnemonic `vmls.f32' -- `vmls.f32 q12,q3,q7'
[build] /tmp/ccN5565L.s:5745: Error: unknown mnemonic `vmla.f32' -- `vmla.f32 q13,q3,q6'
[build] /tmp/ccN5565L.s:5746: Error: unknown mnemonic `vmla.f32' -- `vmla.f32 q8,q10,q15'
[build] /tmp/ccN5565L.s:5747: Error: unknown mnemonic `vmla.f32' -- `vmla.f32 q9,q11,q15'
[build] /tmp/ccN5565L.s:5748: Error: unknown mnemonic `vmla.f32' -- `vmla.f32 q0,q12,q15'
[build] /tmp/ccN5565L.s:5749: Error: unknown mnemonic `vmla.f32' -- `vmla.f32 q1,q13,q15'
[build] /tmp/ccN5565L.s:5750: Error: unknown mnemonic `vst1.f32' -- `vst1.f32 {q8,q9},[x7,:128]!'
[build] /tmp/ccN5565L.s:5751: Error: unknown mnemonic `vst1.f32' -- `vst1.f32 {q0,q1},[x7,:128]!'
[build] /tmp/ccN5565L.s:5752: Error: operand 2 must be an integer or stack pointer register -- `subs x4,#2'

This is on a Raspberry Compute Module 4 with the beta 64-bit operating system and GCC 8.3.0:
Linux convolutionpi 5.10.17-v8+ #1414 SMP PREEMPT Fri Apr 30 13:23:25 BST 2021 aarch64 GNU/Linux
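
The assembler messages above show 32-bit ARM NEON mnemonics (vld1.f32, vmla.f32, q registers) being assembled for an AArch64 target, which uses a different instruction syntax. A minimal sketch of a guard that would address both points, assuming the inline assembly is only valid on 32-bit ARM: define and test the same macro name, and enable it only when the compiler defines __arm__ (GCC and Clang define __aarch64__ on 64-bit targets instead). The helper function zconvolve_uses_inline_asm is hypothetical and only makes the chosen path observable.

```c
#include <assert.h>

/* Enable the hand-written NEON assembly only on 32-bit ARM with GCC.
 * This is an assumed fix, not the library's actual guard. */
#if defined(__arm__) && !defined(__aarch64__) && !defined(__clang__)
#  define ZCONVOLVE_USING_INLINE_NEON_ASM
#endif

/* Returns 1 when the inline-assembly path would be compiled in, else 0. */
int zconvolve_uses_inline_asm(void)
{
#ifdef ZCONVOLVE_USING_INLINE_NEON_ASM  /* same name as in the #define above */
    return 1;
#else
    return 0;
#endif
}
```

On the 64-bit Raspberry Pi OS build described here this would select the intrinsics path and skip the inline assembly.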

Build fails on powerpc64le

LLVM 11.0.1 on FreeBSD 13.0-RELEASE

/usr/bin/cc -DPFFFT_EXPORTS -DPFFFT_SCALVEC_ENABLED=1 -D_USE_MATH_DEFINES  -O2 -pipe  -fstack-protector-strong -fno-strict-aliasing -O2 -pipe  -fstack-protector-strong -fno-strict-aliasing -fPIC -std=c99 -MD -MT CMakeFiles/PFFFT.dir/pffft.c.o -MF CMakeFiles/PFFFT.dir/pffft.c.o.d -o CMakeFiles/PFFFT.dir/pffft.c.o -c /wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft.c
In file included from /wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft.c:98:
In file included from /wrkdirs/usr/ports/math/pffft/work/pffft-9603871/simd/pf_float.h:64:
/wrkdirs/usr/ports/math/pffft/work/pffft-9603871/simd/pf_altivec_float.h:41:9: warning: /wrkdirs/usr/ports/math/pffft/work/pffft-9603871/simd/pf_altivec_float.h: ALTIVEC float macros are defined [-W#pragma-messages]
#pragma message( __FILE__ ": ALTIVEC float macros are defined" )
        ^
In file included from /wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft.c:132:
/wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft_priv_impl.h:1937:15: warning: implicit declaration of function 'VLOAD_ALIGNED' is invalid in C99 [-Wimplicit-function-declaration]
        C.v = VLOAD_ALIGNED( ptr );
              ^
/wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft_priv_impl.h:1937:13: error: assigning to 'v4sf' (vector of 4 'float' values) from incompatible type 'int'
        C.v = VLOAD_ALIGNED( ptr );
            ^ ~~~~~~~~~~~~~~~~~~~~
/wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft_priv_impl.h:1943:15: warning: implicit declaration of function 'VLOAD_UNALIGNED' is invalid in C99 [-Wimplicit-function-declaration]
        C.v = VLOAD_UNALIGNED( ptr );
              ^
/wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft_priv_impl.h:1943:13: error: assigning to 'v4sf' (vector of 4 'float' values) from incompatible type 'int'
        C.v = VLOAD_UNALIGNED( ptr );
            ^ ~~~~~~~~~~~~~~~~~~~~~~
/wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft_priv_impl.h:2186:11: warning: implicit declaration of function 'VREV_S' is invalid in C99 [-Wimplicit-function-declaration]
    C.v = VREV_S(A.v);
          ^
/wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft_priv_impl.h:2186:9: error: assigning to 'v4sf' (vector of 4 'float' values) from incompatible type 'int'
    C.v = VREV_S(A.v);
        ^ ~~~~~~~~~~~
/wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft_priv_impl.h:2206:11: warning: implicit declaration of function 'VREV_C' is invalid in C99 [-Wimplicit-function-declaration]
    C.v = VREV_C(A.v);
          ^
/wrkdirs/usr/ports/math/pffft/work/pffft-9603871/pffft_priv_impl.h:2206:9: error: assigning to 'v4sf' (vector of 4 'float' values) from incompatible type 'int'
    C.v = VREV_C(A.v);
        ^ ~~~~~~~~~~~
5 warnings and 4 errors generated.

Question about real forward transform output size

Hi, I tried to use pffft in my project and I found output size is different between pffft and fftw.
In an N-sample real forward fft, the output of pffft is of length N, with re and im interleaved. In fftw, however, the output is a fftw_complex array which holds N real and N imaginary samples.
Could anyone explain the difference?
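
Some background that may help: the DFT of N real samples is conjugate-symmetric, so only the bins 0..N/2 carry information, and bins 0 and N/2 are purely real. FFTW's real-to-complex transforms return those N/2+1 complex bins. As far as I read the comments in pffft.h, the ordered real transform stores the same information in exactly N floats by placing the real value of bin N/2 where the (always zero) imaginary part of bin 0 would be. The sketch below converts one layout into the other; the function name pack_real_spectrum is made up for illustration, and the packed layout is an assumption based on that header comment.

```c
#include <assert.h>
#include <stddef.h>

/* bins:   n/2+1 interleaved complex values (re, im), bins 0..n/2
 * packed: n floats laid out as [Re X0, Re X(n/2), Re X1, Im X1, Re X2, ...]
 * Hypothetical helper, not part of pffft or FFTW. */
void pack_real_spectrum(const float *bins, float *packed, size_t n)
{
    assert(n >= 2 && n % 2 == 0);
    packed[0] = bins[0];  /* Re X[0]; its imaginary part is zero */
    packed[1] = bins[n];  /* Re X[n/2], stored at index 2*(n/2) in bins */
    for (size_t k = 1; k < n / 2; ++k) {
        packed[2 * k]     = bins[2 * k];
        packed[2 * k + 1] = bins[2 * k + 1];
    }
}
```

For the input [1, 2, 3, 4] the half spectrum is 10, -2+2i, -2, which packs into the four floats 10, -2, -2, 2.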

confusion about memory alignment.

Hello, I have some questions about this part of the code related to memory alignment.

static void * Valigned_malloc(size_t nb_bytes) {
  void *p, *p0 = malloc(nb_bytes + MALLOC_V4SF_ALIGNMENT);
  if (!p0) return (void *) 0;
  p = (void *) (((size_t) p0 + MALLOC_V4SF_ALIGNMENT) & (~((size_t) (MALLOC_V4SF_ALIGNMENT-1))));
  *((void **) p - 1) = p0;
  return p;
}

If p0 were allocated at an address of xxxxx63, the address after alignment would be xxxxx64, and *((void **) p - 1) would then write outside the space of p0.
Is my understanding correct?
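
A sketch of why the scenario above should not occur in practice: malloc returns memory aligned for any fundamental type, so p0 is at least pointer-aligned and can never end in ...63. The rounded-up address is then between sizeof(void *) and the full alignment above p0, which leaves room for the saved pointer. The code below restates the same arithmetic with the assumptions checked by asserts; ALIGNMENT is an assumed stand-in for MALLOC_V4SF_ALIGNMENT, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* assumed value, standing in for MALLOC_V4SF_ALIGNMENT */

/* Over-allocate, round up to the next ALIGNMENT boundary above p0,
 * and store p0 just below the aligned pointer. */
void *aligned_malloc_sketch(size_t nb_bytes)
{
    void *p0 = malloc(nb_bytes + ALIGNMENT);
    if (!p0) return NULL;
    uintptr_t raw = (uintptr_t)p0;
    uintptr_t aligned = (raw + ALIGNMENT) & ~(uintptr_t)(ALIGNMENT - 1);
    /* raw is pointer-aligned, so the gap is at least sizeof(void *) */
    assert(aligned - raw >= sizeof(void *));
    assert(aligned - raw <= ALIGNMENT);
    void *p = (void *)aligned;
    ((void **)p)[-1] = p0;  /* lies inside [p0, p), so still in the block */
    return p;
}

/* Recover the original pointer and release it. */
void aligned_free_sketch(void *p)
{
    if (p) free(((void **)p)[-1]);
}
```

Since the gap is at most ALIGNMENT bytes, the nb_bytes the caller uses also stay inside the over-allocated block.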

Maybe give another name to the library?

Hi,

It seems that your project diverged from the original and your changes didn't make it back into the main.
At this point would it make sense to rename your library to avoid confusion?

Cheers,
Alex

github's about text

GitHub's main page for https://github.com/marton78/pffft shows the following About text in the upper right corner:

A GitHub mirror of Julien Pommier's PFFFT: a pretty fast FFT.

With the many changes and additions, I would suggest changing it to something like

A fork of pretty fast FFT (PFFFT) with several additions

Support for 2D?

Thanks for this great library! Is there a plan to support also 2D transforms? It would be quite helpful.

ARM compiler options are wrong

16:37:39 [ 8%] Building C object PFFFT/CMakeFiles/PFFFT.dir/pffft_double.c.o
16:37:39 arm-buildroot-linux-gnueabihf-gcc.br_real: error: unrecognized command-line option '-msse2'

Starting from line 170 in CMakeLists.txt there is no check for arm platforms, like this:

elseif(CMAKE_COMPILER_IS_GNUCC AND NOT USE_SIMD_NEON)

Missing September 2016 commit

So something really weird is going on... First, thanks for hosting a mirror of sorts.

A piece of software called VCVRack requires the pffft library.
I've got a GitHub Action which builds that software from source for a few versions across all three major OSes.
It ran successfully in 2020, fetching https://bitbucket.org/jpommier/pffft/get/29e4f76ac53b.zip, which does not seem to exist any more. The files in that archive are from September 22 2016, which is not a commit I can see on bitbucket or here.

Why does it seem like the original author and this repo have missing commits?

crashes on macOS in threads with buffer size > 64k

not sure if you want to deal with issues since this is technically a mirror, but you are ahead of the bitbucket repo for fixes ...

I've managed to find an issue where pffft crashes with an address boundary error when pffft is used in a thread with a buffer size > 64k:

#include <pthread.h>
#include <stdio.h>

#include "pffft.h"

void *thread_test(void *f) {
  PFFFT_Setup* setup = pffft_new_setup(65536, PFFFT_REAL);

  float *in = pffft_aligned_malloc(sizeof(float) * 65536 * 2);
  float *out = pffft_aligned_malloc(sizeof(float) * 65536 * 2);
  pffft_transform_ordered(setup, in, out, NULL, PFFFT_FORWARD);
  pffft_transform_ordered(setup, in, out, NULL, PFFFT_BACKWARD);

  pffft_destroy_setup(setup);

  return NULL;
}

int main() {
  fprintf(stderr, "does not crash\n");
  thread_test(NULL);
  fprintf(stderr, "does crash\n");
  pthread_t thread_id;
  pthread_create(&thread_id, NULL, thread_test, NULL);

  pthread_join(thread_id, NULL);

  return 0;
}

seems to work in linux.

$ clang --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

apple support

i have an experimental branch with support for Apple M1 and Raspberry 400.
it is here https://github.com/unevens/pffft/tree/m1
i'm not sure about the stuff that i commented regarding "-mfpu=neon", which may need to be conditionally enabled on other platforms.

clang-cl (windows) confused by std::complex

Hi,

clang-cl (windows) is confused by std::complex<>. You need to add the :: in front to fix that issue:

@@ -492,8 +492,8 @@
 template<>
-class Setup< std::complex<float> >
+class Setup< ::std::complex<float> >
 {
   PFFFT_Setup* self;
 
 public:
@@ -496,8 +496,8 @@
 {
   PFFFT_Setup* self;
 
 public:
-  typedef std::complex<float> value_type;
+  typedef ::std::complex<float> value_type;
   typedef Types< value_type >::Scalar Scalar;
 
   Setup()

Building on M1 computer failed: CMake warning: unsupported CMAKE_SYSTEM_PROCESSOR 'arm64'

When I run this command: CC=/usr/bin/clang CXX=/usr/bin/clang++ cmake -DCMAKE_BUILD_TYPE=Debug ../
I get the following warning:

CMake Warning at cmake/target_optimizations.cmake:57 (message):
  unsupported CMAKE_SYSTEM_PROCESSOR 'arm64'

Does that mean the arm64 architecture is not supported by the makefile? I have zero experience with modifying makefiles, so what changes are needed to make it build for arm64? Thank you!

Prime decomposition

It seems that this implementation does not support decomposing N into prime factors > 5, which FFTPACK initially supported.
The ifac variable in the decompose output is incorrect and only contains primes <= 5.
Example: N=55
Legacy FFTPACK can decompose 55 as ifac={5,11}.
PFFFT's decompose reports ifac={5}.
There is no return code or assertion that reports this limitation, and it is only documented in the pffft.h file. Could you add the limitation to the README?
Or add Bluestein support?
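
Until the library reports this itself, a caller can check the length before creating a setup. The following is a minimal sketch of such a check. strip_small_radices and is_supported_length are hypothetical helpers, not pffft's own decompose(), and they model only the prime-factor condition described above, not any further constraints on N documented in pffft.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: strips the radices 2, 3 and 5 from n, stores them
   in factors[] (at most max_factors entries) and returns what is left.
   A return value of 1 means n has no prime factor larger than 5. */
static int strip_small_radices(int n, int *factors, size_t max_factors,
                               size_t *count) {
  static const int radices[] = {2, 3, 5};
  size_t r;

  assert(n > 0 && factors != NULL && count != NULL);
  *count = 0;
  for (r = 0; r < sizeof radices / sizeof radices[0]; ++r) {
    while (n % radices[r] == 0 && *count < max_factors) {
      factors[(*count)++] = radices[r];
      n /= radices[r];
    }
  }
  return n; /* leftover, e.g. 11 for n = 55 */
}

/* Returns 1 if n is a product of 2s, 3s and 5s only, 0 otherwise. */
static int is_supported_length(int n) {
  int factors[32]; /* an int has fewer than 32 prime factors */
  size_t count;
  return n > 0 && strip_small_radices(n, factors, 32, &count) == 1;
}
```

For N=55 the helper records the single factor 5 and returns the leftover 11, which matches the ifac={5} behaviour reported above. A caller could refuse such a length, or round it up to the next supported one and zero-pad.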

Algorithm verification

Dear People

Thanks for all the effort; I like this project. I have built the project in VS2019 with CMake (C++11).
I have locally updated some of the project's CMake files to work with Visual Studio (MSVC compiler):

target_optimizations.cmake:
line 62: set(TARGET_C_ARCH "none" CACHE STRING "msvc target C architecture (/arch): SSE2/AVX/AVX2/AVX512")
changed to: ADD_DEFINITIONS(/arch:AVX2)
The compiler accepts only one value: AVX, AVX2 or AVX512.
The same for line 109:
set(TARGET_CXX_ARCH "none" CACHE STRING "msvc target C++ architecture (/arch): SSE2/AVX/AVX2/AVX512")
changed to: ADD_DEFINITIONS(/arch:AVX2)

Peter

Anonymous namespace causing warning

I'm getting the following warning (or error in this case, as I have warnings as errors enabled) while compiling pffft with GCC on ARM64, with strict warnings enabled:

[build] /home/pi/convolution-thing/source/../externals/pffft/pffft.hpp: In instantiation of ‘class pffft::Fft<float>’:
[build] /home/pi/convolution-thing/source/./math/FFTpffft.h:25:20:   required from here
[build] /home/pi/convolution-thing/source/../externals/pffft/pffft.hpp:124:7: error: ‘pffft::Fft<float>’ has a field ‘pffft::Fft<float>::setup’ whose type uses the anonymous namespace [-Werror=subobject-linkage]

I looked into the issue, and it looks like the problem is that each source file including the header gets its own copy of the anonymous namespace, which means the type has a different definition in each compilation unit. I was unable to fix the issue myself as I don't have a good grasp of the structure of pffft, so I can't attach a pull request either.

Here's some discussion on the issue:
https://stackoverflow.com/questions/37722850/how-to-silence-whose-type-uses-the-anonymous-namespace-werror-gcc-version-4

Support for unaligned arrays

Do I understand correctly that the library only supports aligned arrays as input (and output)? Is there a way to make it work with a non-aligned array?

Background: I'm trying to wrap the library in the Java Native Interface, but it keeps crashing with a segfault at a vmovapd instruction. Inspecting the register dump, I can see that the library is indeed trying to do an aligned load into an AVX register from an address that's only aligned to 16 bytes. Java's double arrays are normally aligned to 32 bytes, but they also have a 16-byte header.
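
One common workaround, sketched here under assumptions, is to copy the data into an aligned native buffer before the transform and copy the result back afterwards. The helpers is_aligned and copy_to_aligned are hypothetical and use C11 aligned_alloc so that the example needs no pffft headers; in real code the buffer would more naturally come from pffft_aligned_malloc. The 32-byte alignment is an assumption based on the AVX aligned load described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 32 bytes, the alignment an aligned AVX load requires. */
#define SIMD_ALIGNMENT 32u

/* Returns 1 if p is a multiple of alignment, 0 otherwise. */
static int is_aligned(const void *p, size_t alignment) {
  return ((uintptr_t)p % alignment) == 0;
}

/* Copies n doubles from src into a newly allocated, SIMD_ALIGNMENT-aligned
   buffer. Returns NULL if n is 0 or the allocation fails.
   The caller releases the buffer with free(). */
static double *copy_to_aligned(const double *src, size_t n) {
  size_t bytes = n * sizeof(double);
  /* aligned_alloc wants a size that is a multiple of the alignment */
  size_t rounded = (bytes + SIMD_ALIGNMENT - 1) / SIMD_ALIGNMENT * SIMD_ALIGNMENT;
  double *dst;

  assert(src != NULL);
  if (n == 0) return NULL;

  dst = aligned_alloc(SIMD_ALIGNMENT, rounded);
  if (dst != NULL) memcpy(dst, src, bytes);
  return dst;
}
```

In a JNI wrapper the copy would run between the Java array and the aligned native buffer. The extra copy costs time and memory, but it keeps the aligned-load fast path inside the library intact.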
