Hi! I'm trying to increase the speed of LPCNet with OpenMP (and want to make PR af


Add OpenMP support about lpcnet HOT 6 CLOSED

xiph commented on July 4, 2024

Add OpenMP support

from lpcnet.

Comments (6)

jmvalin commented on July 4, 2024

Any particular reason you want OpenMP support? The current code is already much faster than real-time on x86 and faster ARM chips (e.g. smartphones, but not RPi yet).


gosha20777 commented on July 4, 2024

Yes, we know. But it only loads one CPU thread, and we are trying to parallelize it with OpenMP. But our code does not work and we have no idea why. We want to understand this problem and overcome it. That's why we are asking you about it.

This puzzles us all the more, and we want to solve the problem, if only for fun. A further performance increase would not hurt either.


jmvalin commented on July 4, 2024

I'm not sure I understand this code, but I don't see how you can parallelize it without restructuring the data. As for my original question, parallelizing is a means to get the code to run fast enough, but the current code is already fast enough.


gosha20777 commented on July 4, 2024

Yes, I understand. But it is interesting to me :)
So now the code fully works, and LPCNet uses all CPU cores.

But in fact the code has not become much faster (it is even a bit slower on small samples), and I have no idea why. When I looked at the profiler, sparse_sgemv_accum16 took 80% of the time, which is why I decided to parallelize it. But the result is not even 2 times faster... Can someone explain why?

static void sparse_sgemv_accum16(float *out, const float *weights, int rows, const int *idx, const float *x)
{  
   int i, j;
   //initialization
   const int *precomputed_idx[rows];
   const float *precomputed_weights[rows];
   for (i=0;i<rows;i+=16)
   {
      precomputed_weights[i] = weights;
      weights += 16 * (*idx);
      precomputed_idx[i] = idx++;
      idx += *precomputed_idx[i];
   }

   #pragma omp parallel
   {
      const int *lc_idx;
      const float *lc_weights;
      float * restrict y;
      __m256 vy0, vy8;
      int cols;
      for (i=0;i<rows;i+=16)
      {  
         lc_weights = precomputed_weights[i];
         y = &out[i];
         vy0 = _mm256_loadu_ps(&y[0]);
         vy8 = _mm256_loadu_ps(&y[8]);
         lc_idx = precomputed_idx[i];
         cols = *lc_idx++;
         #pragma omp critical
         for (j=0;j<cols;j++)
         {
            int id;
            __m256 vxj;
            __m256 vw;
            id = *lc_idx++;
            vxj = _mm256_broadcast_ss(&x[id]);

            vw = _mm256_loadu_ps(&lc_weights[0]);
            vy0 = _mm256_fmadd_ps(vw, vxj, vy0);

            vw = _mm256_loadu_ps(&lc_weights[8]);
            vy8 = _mm256_fmadd_ps(vw, vxj, vy8);
            lc_weights += 16;
         }
         _mm256_storeu_ps (&y[0], vy0);
         _mm256_storeu_ps (&y[8], vy8);
      }
   }
}
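
For comparison, here is a minimal sketch of how the same loop could be work-shared, assuming the block-sparse layout implied by the function above: for each block of 16 rows, idx holds a column count followed by that many column indices, and weights holds 16 floats per listed column. The name sparse_sgemv_accum16_blocks and the blk_idx/blk_w arrays are hypothetical, and the inner loop is plain scalar C rather than AVX so it stays portable. The serial pass restructures the data into per-block pointers; the `#pragma omp parallel for` then hands each block to exactly one thread, so no critical section is needed because blocks write disjoint parts of out. It assumes rows is a multiple of 16.

```c
#include <stdlib.h>

/* Sketch only: block-sparse matrix-vector accumulate, out += W * x.
   Returns 0 on success, -1 if the temporary arrays cannot be allocated. */
static int sparse_sgemv_accum16_blocks(float *out, const float *weights, int rows,
                                       const int *idx, const float *x)
{
   int nblocks = rows / 16;
   int b;
   const int **blk_idx;    /* hypothetical: start of each block's index list */
   const float **blk_w;    /* hypothetical: start of each block's weights */

   if (nblocks <= 0) return 0;
   blk_idx = malloc(nblocks * sizeof(*blk_idx));
   blk_w = malloc(nblocks * sizeof(*blk_w));
   if (!blk_idx || !blk_w) { free(blk_idx); free(blk_w); return -1; }

   /* Serial pass: each block's offset depends on the counts before it. */
   for (b = 0; b < nblocks; b++) {
      int cols = *idx;
      blk_idx[b] = idx;
      blk_w[b] = weights;
      idx += 1 + cols;
      weights += 16 * cols;
   }

   /* Work-sharing loop: iterations are split among threads, one block each.
      Without OpenMP enabled the pragma is ignored and the loop runs serially. */
   #pragma omp parallel for schedule(static)
   for (b = 0; b < nblocks; b++) {
      const int *p = blk_idx[b];
      const float *w = blk_w[b];
      float *y = &out[16 * b];
      int cols = *p++;
      int j, k;
      for (j = 0; j < cols; j++) {
         float xj = x[p[j]];
         for (k = 0; k < 16; k++) y[k] += w[k] * xj;
         w += 16;
      }
   }

   free(blk_idx);
   free(blk_w);
   return 0;
}
```

Even in this form a large speedup is not guaranteed: each call does a small amount of work, so the cost of starting and synchronizing the thread team on every call may be comparable to the computation itself. That is a general property of fine-grained OpenMP regions, not a measurement on LPCNet.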

UPD:
@SashaMN your code does not work because #pragma omp parallel applies to ALL the loops inside the region (OMG). And you get a segmentation fault after id = *local_idx++;.
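
The remark about #pragma omp parallel can be checked with a small experiment. A bare parallel region does not split a loop: every thread in the team runs all of its iterations. Only a work-sharing construct such as `parallel for` divides the iterations among threads. The two helper functions below are hypothetical names used just for this illustration; they count how many times the loop body runs in each case.

```c
/* Loop inside a plain parallel region: every thread runs all n iterations,
   so the summed count is n times the team size when OpenMP is enabled
   (and just n when the pragma is ignored). */
static long count_parallel_region(int n)
{
   long count = 0;
   #pragma omp parallel reduction(+:count)
   {
      int i;
      for (i = 0; i < n; i++) count++;
   }
   return count;
}

/* Work-sharing loop: the n iterations are divided among the threads,
   so the summed count is always n. */
static long count_parallel_for(int n)
{
   long count = 0;
   int i;
   #pragma omp parallel for reduction(+:count)
   for (i = 0; i < n; i++) count++;
   return count;
}
```

With shared loop counters or pointers, as in a plain parallel region over the whole function body, the threads also race on those variables, which is one way to end up reading past the end of an index array.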


gosha20777 commented on July 4, 2024

I've parallelized the main loop in the functions called from main and split the input data. It worked and I got a slight performance increase, but clicks and noise can be heard at the places where the data was split. So this requires more complex work, and it is not certain that the performance gain would be good.
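
The thread does not say what causes the noise at the split points. One plausible explanation (an assumption, not something confirmed here) is that the synthesis is recurrent: each output sample depends on state built up from the previous samples, so a worker that starts in the middle of the signal begins from the wrong state. The toy one-pole filter below is hypothetical and has nothing to do with LPCNet's actual network; it only shows the effect on a simple recursive process.

```c
/* Hypothetical recursive filter: y[n] = a * y[n-1] + x[n].
   Returns the final state so a caller can carry it into the next chunk. */
static float one_pole(const float *x, float *y, int n, float a, float state)
{
   int i;
   for (i = 0; i < n; i++) {
      state = a * state + x[i];
      y[i] = state;
   }
   return state;
}

/* Processes x in two halves. With carry_state == 0 the second half starts
   from zero state, like an independent worker given only its own chunk. */
static void process_in_two_chunks(const float *x, float *y, int n, float a,
                                  int carry_state)
{
   int half = n / 2;
   float s = one_pole(x, y, half, a, 0.0f);
   if (!carry_state) s = 0.0f;
   one_pole(x + half, y + half, n - half, a, s);
}
```

Carrying the state across the boundary reproduces the single-pass output, but it also makes the second chunk wait for the first, so the two chunks can no longer run at the same time. Starting cold keeps them independent at the cost of a discontinuity at the split.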

We can close this issue.


hdmjdp commented on July 4, 2024

@jmvalin @gosha20777 Faster ARM chips? Which ones, and what is the clock frequency? Thank you.

Any particular reason you want OpenMP support? The current code is already much faster than real-time on x86 and faster ARM chips (e.g. smartphones, but not RPi yet).
