
Comments (4)

ElSeidy commented on July 16, 2024

I am also very interested in this functionality. Is there any progress there? I can do this if you like; I just need more info. FLINT is amazing, keep up the good job, guys 👍


wbhart commented on July 16, 2024

Nobody is currently working on this. There is a --with-blas option in configure which should at least allow FLINT to be linked against BLAS, though of course at present that does nothing.

The nmod_poly module is obviously the logical place for this to be used, particularly in the case where dim*p^2 fits into 53 bits (for doubles) or 24 bits (for single-precision floats).

Bill.

fredrik-johansson commented on July 16, 2024

Correction to Bill's comment: nmod_mat, not nmod_poly.

The first step after making linking to BLAS work should probably be to write an nmod_mat_mul_blas that multiplies two nmod_mats by converting them to floating-point arrays, doing a BLAS multiplication, and converting the result back to an nmod_mat.
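
A minimal sketch of that round trip, assuming the inner dimension k and the modulus p satisfy k*(p-1)^2 < 2^53 so that every dot product is computed exactly in doubles. The function name nmod_mat_mul_blas_sketch is hypothetical; cblas_dgemm and the nmod_mat accessors are real:

```c
#include <stdlib.h>
#include <cblas.h>
#include <flint/nmod_mat.h>
#include <flint/ulong_extras.h>

/* Hypothetical sketch: assumes k*(p-1)^2 < 2^53, so no rounding occurs. */
void nmod_mat_mul_blas_sketch(nmod_mat_t C, const nmod_mat_t A, const nmod_mat_t B)
{
    slong m = A->r, k = A->c, n = B->c, i, j;
    double *a = malloc(m * k * sizeof(double));
    double *b = malloc(k * n * sizeof(double));
    double *c = malloc(m * n * sizeof(double));

    /* convert the operands to floating-point arrays */
    for (i = 0; i < m; i++)
        for (j = 0; j < k; j++)
            a[i * k + j] = (double) nmod_mat_entry(A, i, j);
    for (i = 0; i < k; i++)
        for (j = 0; j < n; j++)
            b[i * n + j] = (double) nmod_mat_entry(B, i, j);

    /* c = a*b over the reals; exact because all sums stay below 2^53 */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, a, k, b, n, 0.0, c, n);

    /* convert the result back, reducing each entry mod p */
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            nmod_mat_entry(C, i, j) =
                n_mod2_preinv((ulong) c[i * n + j], C->mod.n, C->mod.ninv);

    free(a); free(b); free(c);
}
```

With row-major storage the leading dimensions are just the row lengths, and beta = 0.0 makes dgemm overwrite c rather than accumulate into it.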

Then you might want to tune it for different sizes. You probably want to use single precision floats instead of doubles when the modulus is small enough. When the modulus is really small, you could pack several entries into each double (nmod_mat_mul already does such packing into 64-bit words). Also, when dim*p^2 does not fit in 53 bits, you can still use BLAS by computing the matrix product modulo a couple of smaller primes and then patching the results together using the Chinese remainder theorem.
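
The CRT step is just a per-entry combine. A sketch under the assumption of two BLAS-friendly primes p1, p2 with p1*p2 > dim*(p-1)^2, so the exact integer entry is recovered before reducing mod the target modulus p; the function name is illustrative, while the n_* helpers are real FLINT calls:

```c
#include <flint/ulong_extras.h>

/* Given r1 = x mod p1 and r2 = x mod p2 for an exact product entry
   0 <= x < p1*p2, recover x by Garner's formula and reduce it mod p. */
ulong crt_entry(ulong r1, ulong p1, ulong r2, ulong p2, ulong p)
{
    ulong t = n_submod(r2, r1 % p2, p2);             /* (r2 - r1) mod p2 */
    t = n_mulmod2_preinv(t, n_invmod(p1 % p2, p2),   /* t /= p1 mod p2   */
                         p2, n_preinvert_limb(p2));
    return (r1 + p1 * t) % p;  /* no overflow: p1*p2 < 2^52 for ~25-bit primes */
}
```

In a real implementation the inverse and the precomputed inverse of p2 would of course be hoisted out of the per-entry loop; they are inline here only to keep the sketch self-contained.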

Finally, there's making nmod_mat_mul use BLAS multiplication. You might want to do Strassen multiplication on top of the BLAS multiplication, as this should be faster when sizes get large enough, so presumably the BLAS multiplication should just replace nmod_mat_mul_basecase. Some careful profiling is needed to determine good cutoffs.
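
The resulting dispatch might look roughly like this. The cutoff values below are placeholders for whatever the profiling suggests, and the check that the modulus is small enough for the BLAS path is omitted for brevity; nmod_mat_mul_classical and nmod_mat_mul_strassen are existing FLINT functions, while nmod_mat_mul_blas_sketch is the hypothetical routine from above:

```c
#include <flint/nmod_mat.h>

void nmod_mat_mul_blas_sketch(nmod_mat_t, const nmod_mat_t, const nmod_mat_t);

#define BLAS_CUTOFF      64   /* placeholder: below this the packed word code wins */
#define STRASSEN_CUTOFF 256   /* placeholder: above this Strassen should pay off   */

void nmod_mat_mul_tuned(nmod_mat_t C, const nmod_mat_t A, const nmod_mat_t B)
{
    slong d = FLINT_MIN(FLINT_MIN(A->r, B->c), A->c);

    if (d < BLAS_CUTOFF)
        nmod_mat_mul_classical(C, A, B);    /* existing basecase */
    else if (d < STRASSEN_CUTOFF)
        nmod_mat_mul_blas_sketch(C, A, B);  /* plain BLAS product */
    else
        nmod_mat_mul_strassen(C, A, B);     /* Strassen, whose basecase
                                               would then be the BLAS path */
}
```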

Optionally, you could also improve fmpz_mat_mul_multi_mod so that it does BLAS multiplications directly instead of converting to nmod_mats first. This should eliminate a lot of overhead, and should be faster if you also choose primes here that are optimal for BLAS (i.e. much smaller than the default ~64 bits).
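
To see why smaller primes help, here is a stand-alone back-of-the-envelope computation (plain C, no FLINT API) of the largest usable prime for an exact dgemm at various inner dimensions k, using the same k*(p-1)^2 < 2^53 bound as above:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Each prime p must satisfy k*(p-1)^2 < 2^53 for dgemm to be exact. */
    for (long k = 64; k <= 4096; k *= 4)
    {
        double pmax = sqrt(ldexp(1.0, 53) / (double) k);
        printf("k = %4ld: primes up to ~%.0f (~%.1f bits)\n",
               k, pmax, log2(pmax));
    }
    return 0;
}
```

Even at k = 64 the primes are only about 23 bits, so roughly three of them replace each ~64-bit prime, but each modular multiplication then runs at full dgemm speed.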


wbhart commented on July 16, 2024

On 19 January 2014 23:04, Fredrik Johansson [email protected] wrote:

> Correction to Bill's comment: nmod_mat, not nmod_poly.

Yes, that was a stupid typo. Thanks for spotting it, Fredrik.

> Also, when dim*p^2 does not fit in 53 bits, you can still use BLAS by computing the matrix product modulo a couple of smaller primes and then patching the results together using the Chinese remainder theorem.

I am sceptical that this last one will actually be faster than what we currently have. But I'm definitely interested in seeing a comparison and being proved wrong.

> You might want to do Strassen multiplication on top of the BLAS multiplication, as this should be faster when sizes get large enough, so presumably the BLAS multiplication should just replace nmod_mat_mul_basecase.

Definitely. It's unlikely that BLAS alone will be faster beyond about dimension 256.
