
Comments (17)

guoyilin commented on August 19, 2024

Hi @jiaxiang-wu, I did some more experiments.
(1) Built with Makefile.native, without any BLAS library or OpenVML, testing 100 images:
with Q-CNN, time: 19 s;
without Q-CNN, time: 66 s.
(2) Built with OpenBLAS + OpenVML, testing 100 images:
with Q-CNN, time: 17 s;
without Q-CNN, time: 19 s. Watching the 'top' command, CPU usage stays at only 100% (a single core).
(3) Caffe + OpenBLAS: 650 ms per image, single-threaded.
So my questions are:
(1) Does the Q-CNN time not depend on the BLAS library, since it only performs additions, not multiplications?
(2) The non-Q-CNN time does seem to depend on the BLAS library. With OpenBLAS, this repo (190 ms per image) is faster than Caffe (650 ms). Why?


jiaxiang-wu commented on August 19, 2024

Dear @guoyilin ,

For the first question, the classification accuracy of the original and the quantized AlexNet is indeed very close. On the ILSVRC-12 validation set (50k images), Q-CNN's accuracy is slightly inferior, but since you only tested a subset of 1k images, the results may differ.
As for the run-time speed, it can be affected by various factors, e.g., CPU, memory, BLAS library, and multi-threading. The time comparison reported in the paper is based on a single-thread setting.


hiyijian commented on August 19, 2024

In my experiment, Q-CNN is about twice as slow as Caffe with MKL, while about three times faster than Caffe with ATLAS and OpenVML. Q-CNN is really intended for mobile devices, where MKL is unavailable.
On a server or PC with an Intel chip inside, MKL does a better job even though Q-CNN has much lower time complexity, which is quite a pity.
So I wonder whether it is possible to optimize Q-CNN with techniques such as SIMD (SSE), or by fully applying the functions provided by MKL. Would you please shed some light on this? @jiaxiang-wu


jiaxiang-wu commented on August 19, 2024

@guoyilin @hiyijian I have already tried optimizing the code with SSE instructions, and this does speed up the table look-up operations by ~20%. Further improvement may be possible, but I am not at all an expert in SSE instructions. I am now considering including both implementations (with and without SSE) in this project, so that others may also help optimize the code.


hiyijian commented on August 19, 2024

Thanks a lot. Let's rock


guoyilin commented on August 19, 2024

Thanks, jiaxiang.


jiaxiang-wu commented on August 19, 2024

@guoyilin

  1. Q-CNN partially depends on the BLAS library, since the pre-computation of the look-up tables involves matrix multiplication. In the current implementation, cblas_saxpy is used instead of a direct matrix multiplication for better efficiency (see CaffeEva::GetInPdMat(), and the sketch after this list).
  2. OpenVML can speed up certain layers' computation, e.g., the LRN layer. Besides, other optimization tricks are applied to speed up the non-Q-CNN computation.
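
To illustrate item 1, here is a minimal, hypothetical sketch (the function name BuildLUT and the dimension-major centroid layout are illustrative, not the repo's actual code) of how cblas_saxpy can pre-compute one sub-space's look-up table, one input dimension per BLAS-1 call:

// ---------------------- illustrative sketch: saxpy-based LUT pre-computation ----------------------
#include <cblas.h>
#include <cstring>

// x:     one input sub-vector of <subDimCnt> floats
// ctrds: sub-space centroids, dimension-major: ctrds[d * ctrdCnt + k] is the
//        d-th component of the k-th centroid
// lut:   output, the <ctrdCnt> inner products <x, centroid_k>
void BuildLUT(const float* x, const float* ctrds,
              int subDimCnt, int ctrdCnt, float* lut) {
  std::memset(lut, 0, sizeof(float) * ctrdCnt);
  for (int d = 0; d < subDimCnt; d++) {
    // lut[k] += x[d] * ctrds[d][k] for all k, in a single BLAS-1 call
    cblas_saxpy(ctrdCnt, x[d], ctrds + d * ctrdCnt, 1, lut, 1);
  }
}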

P.S.: @hiyijian @guoyilin The SSE-accelerated version might be slightly delayed, since I am a little bit busy at the moment. Sorry for this.


hiyijian commented on August 19, 2024

@jiaxiang-wu That's OK


hiyijian commented on August 19, 2024

Dear @jiaxiang-wu,
I sped up the original Q-CNN with SSE/AVX instructions by ~20%, as you once reported.
However, I found it difficult to push further with SIMD alone.
Below is the heart of the acceleration:

// ------------------------------------ original ----------------------------------------------------
#include <cstdint>

void sumupPerInstanceA(int subSpaceCnt, int ctrdCntPerSpace, int nodCnt, int dstChannel,
                       const float* pLUTVec, const uint8_t* pAsmtVec, float* featVecDst)
{
  for (int subSpaceInd = 0; subSpaceInd < subSpaceCnt; subSpaceInd++) {
    // manually unrolled by 8: look up each output channel's LUT entry and accumulate
    for (int chnIndDst = 0; chnIndDst < nodCnt; chnIndDst += 8) {
      featVecDst[chnIndDst] += pLUTVec[pAsmtVec[chnIndDst]];
      featVecDst[chnIndDst + 1] += pLUTVec[pAsmtVec[chnIndDst + 1]];
      featVecDst[chnIndDst + 2] += pLUTVec[pAsmtVec[chnIndDst + 2]];
      featVecDst[chnIndDst + 3] += pLUTVec[pAsmtVec[chnIndDst + 3]];
      featVecDst[chnIndDst + 4] += pLUTVec[pAsmtVec[chnIndDst + 4]];
      featVecDst[chnIndDst + 5] += pLUTVec[pAsmtVec[chnIndDst + 5]];
      featVecDst[chnIndDst + 6] += pLUTVec[pAsmtVec[chnIndDst + 6]];
      featVecDst[chnIndDst + 7] += pLUTVec[pAsmtVec[chnIndDst + 7]];
    }
    pLUTVec += ctrdCntPerSpace;  // advance to the next sub-space's look-up table
    pAsmtVec += dstChannel;      // advance to the next sub-space's assignment codes
  }
}

// ------------------------------------ SSE/AVX -----------------------------------------------------
#include <immintrin.h>
#include <cstdint>

void sumupPerInstance(int subSpaceCnt, int ctrdCntPerSpace, int nodCnt, int dstChannel,
                      const float* pLUTVec, const uint8_t* pAsmtVec, float* featVecDst)
{
  __m256 regLUT;
  float* pTarget;
  for (int subSpaceInd = 0; subSpaceInd < subSpaceCnt; subSpaceInd++) {
    for (int chnIndDst = 0; chnIndDst < nodCnt; chnIndDst += 8) {
      pTarget = featVecDst + chnIndDst;
      // gather 8 LUT entries (via scalar loads) into one 256-bit register
      regLUT = _mm256_set_ps(pLUTVec[pAsmtVec[chnIndDst + 7]],
                             pLUTVec[pAsmtVec[chnIndDst + 6]],
                             pLUTVec[pAsmtVec[chnIndDst + 5]],
                             pLUTVec[pAsmtVec[chnIndDst + 4]],
                             pLUTVec[pAsmtVec[chnIndDst + 3]],
                             pLUTVec[pAsmtVec[chnIndDst + 2]],
                             pLUTVec[pAsmtVec[chnIndDst + 1]],
                             pLUTVec[pAsmtVec[chnIndDst]]);
      // accumulate into the destination; unaligned load/store avoids alignment faults
      _mm256_storeu_ps(pTarget, _mm256_add_ps(regLUT, _mm256_loadu_ps(pTarget)));
    }
    pLUTVec += ctrdCntPerSpace;
    pAsmtVec += dstChannel;
  }
}
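
A possible further step, assuming an AVX2-capable target (the function name sumupPerInstanceGather below is hypothetical): replace the eight scalar loads inside _mm256_set_ps with a hardware gather. Whether this actually wins is microarchitecture-dependent (gathers are slow on some CPUs), so it would need benchmarking:

// ------------------------------ illustrative sketch: AVX2 gather variant ------------------------------
#include <immintrin.h>
#include <cstdint>

void sumupPerInstanceGather(int subSpaceCnt, int ctrdCntPerSpace, int nodCnt, int dstChannel,
                            const float* pLUTVec, const uint8_t* pAsmtVec, float* featVecDst)
{
  for (int subSpaceInd = 0; subSpaceInd < subSpaceCnt; subSpaceInd++) {
    for (int chnIndDst = 0; chnIndDst < nodCnt; chnIndDst += 8) {
      // load 8 uint8_t assignment codes and zero-extend them to 32-bit indices
      __m128i codes = _mm_loadl_epi64(
          reinterpret_cast<const __m128i*>(pAsmtVec + chnIndDst));
      __m256i idx = _mm256_cvtepu8_epi32(codes);
      // fetch 8 LUT entries in one gather (scale = 4 bytes per float)
      __m256 regLUT = _mm256_i32gather_ps(pLUTVec, idx, 4);
      float* pTarget = featVecDst + chnIndDst;
      _mm256_storeu_ps(pTarget, _mm256_add_ps(regLUT, _mm256_loadu_ps(pTarget)));
    }
    pLUTVec += ctrdCntPerSpace;
    pAsmtVec += dstChannel;
  }
}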


jiaxiang-wu commented on August 19, 2024

@hiyijian My version of SIMD-based optimization is basically the same as yours.


hiyijian commented on August 19, 2024

@jiaxiang-wu I compared AlexNet-Caffe and AlexNet-QCNN (both linked against OpenBLAS) on a Huawei Mate 8. They are almost the same in test speed, 600-700 ms per 227*227 image. This is quite different from the report in your paper, which is 2.93 s vs. 0.95 s.
The slight difference in Q-CNN speed is unsurprising, since we used different mobile devices (Mate 8 vs. Mate 7). However, the difference in Caffe speed is really unreasonable.
Test speed can be affected by various factors, so would you like to share your experimental settings, such as the BLAS library, the Caffe version (this repo or the official one), and so on?


jiaxiang-wu commented on August 19, 2024

@hiyijian In our experiments, neither AlexNet-Caffe nor AlexNet-QCNN was compiled with OpenBLAS. I was not sure whether OpenBLAS is supported on all mobile devices.


hiyijian commented on August 19, 2024

So which BLAS library did you link against, and which Caffe (this repo or the official one) did you use?


jiaxiang-wu commented on August 19, 2024

@hiyijian Sorry, I did not make myself clear. In our experiments, I used this repo to evaluate Caffe's efficiency on mobile devices, and for both CNN and Q-CNN, no BLAS library was used, to minimize dependencies on external libraries. In other words, we compiled the source code with Makefile.native.


hiyijian commented on August 19, 2024

Crystal clear now. Thanks


Xuezhi-Liang commented on August 19, 2024

Your idea is very cool. I want to know how to learn the codebooks from a model file. Could you give me some advice? Thank you!


jiaxiang-wu commented on August 19, 2024

@lianghu2015 Please check out the detailed method described in our paper. The code for the training phase is not included in this repo.
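
For readers looking for a starting point: below is a minimal, hypothetical sketch of plain product-quantization codebook learning for one layer's weight matrix. The function LearnCodebook and its layout convention are illustrative only; the paper's actual method additionally minimizes the error of the layer response with an error-corrected update, which this sketch omits.

// ------------------------ illustrative sketch: plain PQ codebook learning ------------------------
#include <cstdlib>
#include <vector>
#include <cfloat>

// weights: rowCnt x colCnt, row-major; split the columns into sub-spaces of
// width <subDim> and learn <ctrdCnt> centroids per sub-space with Lloyd's k-means.
// ctrds (output): (colCnt / subDim) sub-spaces x ctrdCnt centroids x subDim floats.
void LearnCodebook(const std::vector<float>& weights, int rowCnt, int colCnt,
                   int subDim, int ctrdCnt, int iterCnt, std::vector<float>& ctrds) {
  int subSpaceCnt = colCnt / subDim;
  ctrds.assign((size_t)subSpaceCnt * ctrdCnt * subDim, 0.0f);
  for (int s = 0; s < subSpaceCnt; s++) {
    float* C = &ctrds[(size_t)s * ctrdCnt * subDim];
    // initialize centroids with randomly chosen sub-vectors
    for (int k = 0; k < ctrdCnt; k++) {
      const float* x = &weights[(size_t)(rand() % rowCnt) * colCnt + s * subDim];
      for (int d = 0; d < subDim; d++) C[k * subDim + d] = x[d];
    }
    std::vector<int> asmt(rowCnt);
    for (int it = 0; it < iterCnt; it++) {
      // assignment step: nearest centroid for each row's sub-vector
      for (int r = 0; r < rowCnt; r++) {
        const float* x = &weights[(size_t)r * colCnt + s * subDim];
        float best = FLT_MAX; int bestK = 0;
        for (int k = 0; k < ctrdCnt; k++) {
          float dist = 0.0f;
          for (int d = 0; d < subDim; d++) {
            float diff = x[d] - C[k * subDim + d];
            dist += diff * diff;
          }
          if (dist < best) { best = dist; bestK = k; }
        }
        asmt[r] = bestK;
      }
      // update step: each centroid becomes the mean of its assigned sub-vectors
      std::vector<float> sum((size_t)ctrdCnt * subDim, 0.0f);
      std::vector<int> cnt(ctrdCnt, 0);
      for (int r = 0; r < rowCnt; r++) {
        const float* x = &weights[(size_t)r * colCnt + s * subDim];
        for (int d = 0; d < subDim; d++) sum[asmt[r] * subDim + d] += x[d];
        cnt[asmt[r]]++;
      }
      for (int k = 0; k < ctrdCnt; k++)
        if (cnt[k] > 0)
          for (int d = 0; d < subDim; d++) C[k * subDim + d] = sum[k * subDim + d] / cnt[k];
    }
  }
}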

