
Comments (17)

guoyilin commented on August 19, 2024

Hi @jiaxiang-wu, I did some more experiments.
(1) Built with Makefile.native, without any BLAS library or OpenVML, testing 100 images:
with Q-CNN, time: 19 s;
without Q-CNN, time: 66 s.
(2) Built with OpenBLAS + OpenVML, testing 100 images:
with Q-CNN, time: 17 s;
without Q-CNN, time: 19 s. Watching the 'top' command, CPU usage stays at only 100% (a single core).
(3) Caffe + OpenBLAS: 650 ms per image, single-threaded.
So my questions are:
(1) Does the Q-CNN time not depend on the BLAS library, since it only performs additions, not multiplications?
(2) The non-Q-CNN time does seem to depend on the BLAS library. With OpenBLAS, this repo (190 ms per image) is faster than Caffe (650 ms). Why?


jiaxiang-wu commented on August 19, 2024

Dear @guoyilin ,

For the first question, the classification accuracy of the original and the quantized AlexNet is indeed very close. On the ILSVRC-12 validation set (50k images), Q-CNN's accuracy is slightly inferior, but since you only tested a subset of 1k images, the results may differ.
As for the run-time speed, it can be affected by various factors, e.g., CPU, memory, BLAS library, and multi-threading. The time comparison reported in the paper is based on a single-thread setting.


hiyijian commented on August 19, 2024

In my experiment, Q-CNN is about twice as slow as Caffe with MKL, while about three times faster than Caffe with ATLAS and OpenVML. Q-CNN is really intended for mobile devices, where MKL is unavailable.
On a server or PC with an Intel chip inside, MKL does a better job even though Q-CNN has much lower time complexity, which is quite a pity.
So I wonder whether it is possible to optimize Q-CNN with techniques such as SIMD (SSE), or by fully applying the functions provided by MKL. Would you please shed some light on this? @jiaxiang-wu


jiaxiang-wu commented on August 19, 2024

@guoyilin @hiyijian I have already tried optimizing the code with SSE instructions, and this does speed up the table look-up operations by ~20%. Further improvement may be possible, but I am not at all an expert in SSE instructions. I am now considering including both implementations (with and without SSE) in this project, so that others may also help optimize the code.


hiyijian commented on August 19, 2024

Thanks a lot. Let's rock


guoyilin commented on August 19, 2024

Thanks, jiaxiang.


jiaxiang-wu commented on August 19, 2024

@guoyilin

  1. Q-CNN partially depends on the BLAS library, since the pre-computation of the look-up tables involves matrix multiplication. In the current implementation, cblas_saxpy is used instead of a direct matrix multiplication for better efficiency (see CaffeEva::GetInPdMat(), and the sketch after this list).
  2. OpenVML can speed up certain layers' computation, e.g., the LRN layer. Besides, other optimization tricks are applied to speed up the non-Q-CNN computation.
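
To illustrate item 1, here is a minimal, hypothetical sketch (the function name BuildLUT and the dimension-major centroid layout are illustrative, not the repo's actual code) of how cblas_saxpy can pre-compute one sub-space's look-up table, one input dimension per BLAS-1 call:

// ---------------------- illustrative sketch: saxpy-based LUT pre-computation ----------------------
#include <cblas.h>
#include <cstring>

// x:     one input sub-vector of <subDimCnt> floats
// ctrds: sub-space centroids, dimension-major: ctrds[d * ctrdCnt + k] is the
//        d-th component of the k-th centroid
// lut:   output, the <ctrdCnt> inner products <x, centroid_k>
void BuildLUT(const float* x, const float* ctrds,
              int subDimCnt, int ctrdCnt, float* lut) {
  std::memset(lut, 0, sizeof(float) * ctrdCnt);
  for (int d = 0; d < subDimCnt; d++) {
    // lut[k] += x[d] * ctrds[d][k] for all k, in a single BLAS-1 call
    cblas_saxpy(ctrdCnt, x[d], ctrds + d * ctrdCnt, 1, lut, 1);
  }
}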

P.S.: @hiyijian @guoyilin The SSE-accelerated version might be slightly delayed, since I am a little bit busy at the moment. Sorry for this.


hiyijian commented on August 19, 2024

@jiaxiang-wu That's OK


hiyijian commented on August 19, 2024

Dear @jiaxiang-wu,
I sped up the original Q-CNN with SSE/AVX instructions by ~20%, as you once reported.
However, I found it difficult to push further with SIMD alone.
Below is the heart of the acceleration:

// ------------------------------------ original ----------------------------------------------------
#include <cstdint>

void sumupPerInstanceA(int subSpaceCnt, int ctrdCntPerSpace, int nodCnt, int dstChannel,
                       const float* pLUTVec, const uint8_t* pAsmtVec, float* featVecDst)
{
  for (int subSpaceInd = 0; subSpaceInd < subSpaceCnt; subSpaceInd++) {
    // manually unrolled by 8: look up each output channel's LUT entry and accumulate
    for (int chnIndDst = 0; chnIndDst < nodCnt; chnIndDst += 8) {
      featVecDst[chnIndDst] += pLUTVec[pAsmtVec[chnIndDst]];
      featVecDst[chnIndDst + 1] += pLUTVec[pAsmtVec[chnIndDst + 1]];
      featVecDst[chnIndDst + 2] += pLUTVec[pAsmtVec[chnIndDst + 2]];
      featVecDst[chnIndDst + 3] += pLUTVec[pAsmtVec[chnIndDst + 3]];
      featVecDst[chnIndDst + 4] += pLUTVec[pAsmtVec[chnIndDst + 4]];
      featVecDst[chnIndDst + 5] += pLUTVec[pAsmtVec[chnIndDst + 5]];
      featVecDst[chnIndDst + 6] += pLUTVec[pAsmtVec[chnIndDst + 6]];
      featVecDst[chnIndDst + 7] += pLUTVec[pAsmtVec[chnIndDst + 7]];
    }
    pLUTVec += ctrdCntPerSpace;  // advance to the next sub-space's look-up table
    pAsmtVec += dstChannel;      // advance to the next sub-space's assignment codes
  }
}

// ------------------------------------ SSE/AVX -----------------------------------------------------
#include <immintrin.h>
#include <cstdint>

void sumupPerInstance(int subSpaceCnt, int ctrdCntPerSpace, int nodCnt, int dstChannel,
                      const float* pLUTVec, const uint8_t* pAsmtVec, float* featVecDst)
{
  __m256 regLUT;
  float* pTarget;
  for (int subSpaceInd = 0; subSpaceInd < subSpaceCnt; subSpaceInd++) {
    for (int chnIndDst = 0; chnIndDst < nodCnt; chnIndDst += 8) {
      pTarget = featVecDst + chnIndDst;
      // gather 8 LUT entries (via scalar loads) into one 256-bit register
      regLUT = _mm256_set_ps(pLUTVec[pAsmtVec[chnIndDst + 7]],
                             pLUTVec[pAsmtVec[chnIndDst + 6]],
                             pLUTVec[pAsmtVec[chnIndDst + 5]],
                             pLUTVec[pAsmtVec[chnIndDst + 4]],
                             pLUTVec[pAsmtVec[chnIndDst + 3]],
                             pLUTVec[pAsmtVec[chnIndDst + 2]],
                             pLUTVec[pAsmtVec[chnIndDst + 1]],
                             pLUTVec[pAsmtVec[chnIndDst]]);
      // accumulate into the destination; unaligned load/store avoids alignment faults
      _mm256_storeu_ps(pTarget, _mm256_add_ps(regLUT, _mm256_loadu_ps(pTarget)));
    }
    pLUTVec += ctrdCntPerSpace;
    pAsmtVec += dstChannel;
  }
}
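
A possible further step, assuming an AVX2-capable target (the function name sumupPerInstanceGather below is hypothetical): replace the eight scalar loads inside _mm256_set_ps with a hardware gather. Whether this actually wins is microarchitecture-dependent (gathers are slow on some CPUs), so it would need benchmarking:

// ------------------------------ illustrative sketch: AVX2 gather variant ------------------------------
#include <immintrin.h>
#include <cstdint>

void sumupPerInstanceGather(int subSpaceCnt, int ctrdCntPerSpace, int nodCnt, int dstChannel,
                            const float* pLUTVec, const uint8_t* pAsmtVec, float* featVecDst)
{
  for (int subSpaceInd = 0; subSpaceInd < subSpaceCnt; subSpaceInd++) {
    for (int chnIndDst = 0; chnIndDst < nodCnt; chnIndDst += 8) {
      // load 8 uint8_t assignment codes and zero-extend them to 32-bit indices
      __m128i codes = _mm_loadl_epi64(
          reinterpret_cast<const __m128i*>(pAsmtVec + chnIndDst));
      __m256i idx = _mm256_cvtepu8_epi32(codes);
      // fetch 8 LUT entries in one gather (scale = 4 bytes per float)
      __m256 regLUT = _mm256_i32gather_ps(pLUTVec, idx, 4);
      float* pTarget = featVecDst + chnIndDst;
      _mm256_storeu_ps(pTarget, _mm256_add_ps(regLUT, _mm256_loadu_ps(pTarget)));
    }
    pLUTVec += ctrdCntPerSpace;
    pAsmtVec += dstChannel;
  }
}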


jiaxiang-wu commented on August 19, 2024

@hiyijian My version of SIMD-based optimization is basically the same as yours.


hiyijian commented on August 19, 2024

@jiaxiang-wu I compared AlexNet-Caffe and AlexNet-QCNN (both linked against OpenBLAS) on a Huawei Mate 8. They are almost the same in test speed, 600-700 ms per 227*227 image. This is quite different from the report in your paper, which is 2.93 s vs. 0.95 s.
The slight difference in Q-CNN speed is unsurprising, since we used different mobile devices (Mate 8 vs. Mate 7). However, the difference in Caffe speed is really unreasonable.
Test speed can be affected by various factors, so would you like to share your experimental settings, such as the BLAS library, the Caffe version (this repo or the official one), and so on?


jiaxiang-wu commented on August 19, 2024

@hiyijian In our experiments, neither AlexNet-Caffe nor AlexNet-QCNN was compiled with OpenBLAS. I was not sure whether OpenBLAS is supported on all mobile devices.


hiyijian commented on August 19, 2024

So which BLAS library did you link against, and which Caffe (this repo or the official one) did you use?


jiaxiang-wu commented on August 19, 2024

@hiyijian Sorry, I did not make myself clear. In our experiments, I used this repo to evaluate Caffe's efficiency on mobile devices, and for both CNN and Q-CNN, no BLAS library was used, to minimize dependencies on external libraries. In other words, we compiled the source code with Makefile.native.


hiyijian commented on August 19, 2024

Crystal clear now. Thanks


Xuezhi-Liang commented on August 19, 2024

Your idea is very cool. I want to know how to learn the codebooks from a model file. Could you give me some advice? Thank you!


jiaxiang-wu commented on August 19, 2024

@lianghu2015 Please check out the detailed method described in our paper. The code for the training phase is not included in this repo.
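
For readers looking for a starting point: below is a minimal, hypothetical sketch of plain product-quantization codebook learning for one layer's weight matrix. The function LearnCodebook and its layout convention are illustrative only; the paper's actual method additionally minimizes the error of the layer response with an error-corrected update, which this sketch omits.

// ------------------------ illustrative sketch: plain PQ codebook learning ------------------------
#include <cstdlib>
#include <vector>
#include <cfloat>

// weights: rowCnt x colCnt, row-major; split the columns into sub-spaces of
// width <subDim> and learn <ctrdCnt> centroids per sub-space with Lloyd's k-means.
// ctrds (output): (colCnt / subDim) sub-spaces x ctrdCnt centroids x subDim floats.
void LearnCodebook(const std::vector<float>& weights, int rowCnt, int colCnt,
                   int subDim, int ctrdCnt, int iterCnt, std::vector<float>& ctrds) {
  int subSpaceCnt = colCnt / subDim;
  ctrds.assign((size_t)subSpaceCnt * ctrdCnt * subDim, 0.0f);
  for (int s = 0; s < subSpaceCnt; s++) {
    float* C = &ctrds[(size_t)s * ctrdCnt * subDim];
    // initialize centroids with randomly chosen sub-vectors
    for (int k = 0; k < ctrdCnt; k++) {
      const float* x = &weights[(size_t)(rand() % rowCnt) * colCnt + s * subDim];
      for (int d = 0; d < subDim; d++) C[k * subDim + d] = x[d];
    }
    std::vector<int> asmt(rowCnt);
    for (int it = 0; it < iterCnt; it++) {
      // assignment step: nearest centroid for each row's sub-vector
      for (int r = 0; r < rowCnt; r++) {
        const float* x = &weights[(size_t)r * colCnt + s * subDim];
        float best = FLT_MAX; int bestK = 0;
        for (int k = 0; k < ctrdCnt; k++) {
          float dist = 0.0f;
          for (int d = 0; d < subDim; d++) {
            float diff = x[d] - C[k * subDim + d];
            dist += diff * diff;
          }
          if (dist < best) { best = dist; bestK = k; }
        }
        asmt[r] = bestK;
      }
      // update step: each centroid becomes the mean of its assigned sub-vectors
      std::vector<float> sum((size_t)ctrdCnt * subDim, 0.0f);
      std::vector<int> cnt(ctrdCnt, 0);
      for (int r = 0; r < rowCnt; r++) {
        const float* x = &weights[(size_t)r * colCnt + s * subDim];
        for (int d = 0; d < subDim; d++) sum[asmt[r] * subDim + d] += x[d];
        cnt[asmt[r]]++;
      }
      for (int k = 0; k < ctrdCnt; k++)
        if (cnt[k] > 0)
          for (int d = 0; d < subDim; d++) C[k * subDim + d] = sum[k * subDim + d] / cnt[k];
    }
  }
}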

