drnikolaev / caffe

This project forked from nvidia/caffe

Caffe: a fast open framework for deep learning.

Home Page: http://caffe.berkeleyvision.org/

License: Other

CMake 2.30% Makefile 0.59% Shell 0.63% C++ 76.92% MATLAB 0.66% Python 10.00% Cuda 8.82% Dockerfile 0.07%

caffe's Issues

Convolution error

Caffe compiled with:

cmake .. -DPROTOBUF_INCLUDE_DIR="/beegfs/120x/home/ilia/protobuf/include/" -DUSE_NCCL=True -DCUDA_ARCH_NAME=Manual -DCUDA_ARCH_BIN="30 35 50 52 60 61 62 70" -DCUDA_ARCH_PTX="30 35 50 52 60 61 62 70" -DCUDA_NVCC_FLAGS=--Wno-deprecated-gpu-targets -Wno-dev

-- Boost version: 1.54.0
-- Found the following Boost libraries:
--   system
--   thread
--   filesystem
-- Found gflags  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libgflags.so)
-- Found glog    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libglog.so)
-- Found PROTOBUF Compiler: /beegfs/120x/home/ilia/protobuf/bin/protoc
-- Found lmdb    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/liblmdb.so)
-- Found LevelDB (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libleveldb.so)
-- Found Snappy  (include: /usr/include, library: /usr/lib/libsnappy.so)
-- Found JPEGTurbo: /usr/include
-- CUDA detected: 9.0
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so (found version "7.0")
-- Added CUDA NVCC flags for: sm_30 sm_35 sm_50 sm_52 sm_60 sm_61 sm_62 sm_70 compute_30 compute_35 compute_50 compute_52 compute_60 compute_61 compute_62 compute_70
-- Found OpenCV 2.x: /usr/share/OpenCV
-- Found Atlas: /usr/include
-- Found Atlas (include: /usr/include, library: /usr/lib/libatlas.so)
-- Found PythonInterp: /beegfs/120x/home/ilia/nvcaffe_comp/bin/python2.7 (found suitable version "2.7.6", minimum required is "2.7")
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython2.7.so (found suitable version "2.7.6", minimum required is "2.7")
-- Found NumPy: /beegfs/120x/home/ilia/nvcaffe_comp/local/lib/python2.7/site-packages/numpy/core/include (found suitable version "1.13.1", minimum required is "1.7.1")
-- NumPy ver. 1.13.1 found (include: /beegfs/120x/home/ilia/nvcaffe_comp/local/lib/python2.7/site-packages/numpy/core/include)
-- Boost version: 1.54.0
-- Found the following Boost libraries:
--   python
-- Could NOT find Doxygen (missing:  DOXYGEN_EXECUTABLE)
-- Found NCCL: /usr/include
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so)
-- Found NVML: /usr/include
-- Found NVML (include: /usr/include, library: /usr/lib/nvidia-384/libnvidia-ml.so)
-- Found Git: /usr/bin/git (found version "1.9.1")
--
-- ******************* Caffe Configuration Summary *******************
-- General:
--   Version           :   0.16.4
--   Git               :   v0.16.1-404-g860701c
--   System            :   Linux
--   C++ compiler      :   /usr/bin/c++
--   Release CXX flags :   -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Debug CXX flags   :   -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Build type        :   Release
--
--   BUILD_SHARED_LIBS :   ON
--   BUILD_python      :   ON
--   BUILD_matlab      :   OFF
--   BUILD_docs        :   ON
--   CPU_ONLY          :   OFF
--   USE_LEVELDB       :   ON
--   USE_LMDB          :   ON
--   ALLOW_LMDB_NOLOCK :   OFF
--   TEST_FP16         :   OFF
--
-- Dependencies:
--   BLAS              :   Yes (Atlas)
--   Boost             :   Yes (ver. 1.54)
--   glog              :   Yes
--   gflags            :   Yes
--   protobuf          :   Yes (ver. 3.4.0)
--   lmdb              :   Yes (ver. 0.9.10)
--   LevelDB           :   Yes (ver. 1.15)
--   Snappy            :   Yes (ver. 1.1.0)
--   OpenCV            :   Yes (ver. 2.4.8)
--   JPEGTurbo         :   No
--   CUDA              :   Yes (ver. 9.0)
--
-- NVIDIA CUDA:
--   Target GPU(s)     :   Manual
--   GPU arch(s)       :   sm_30 sm_35 sm_50 sm_52 sm_60 sm_61 sm_62 sm_70 compute_30 compute_35 compute_50 compute_52 compute_60 compute_61 compute_62 compute_70
--   cuDNN             :   Yes (ver. 7.0)
--   NCCL              :   Yes (ver. 2.0.5)
--   NVML              :   /usr/lib/nvidia-384/libnvidia-ml.so
--
-- Python:
--   Interpreter       :   /beegfs/120x/home/ilia/nvcaffe_comp/bin/python2.7 (ver. 2.7.6)
--   Libraries         :   /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.6)
--   NumPy             :   /beegfs/120x/home/ilia/nvcaffe_comp/local/lib/python2.7/site-packages/numpy/core/include (ver 1.13.1)
--
-- Documentaion:
--   Doxygen           :   No
--   config_file       :
--
-- Install:
--   Install path      :   /beegfs/120x/home/ilia/caffe_builds/nvc/build/install
--
-- Configuring done
-- Generating done
-- Build files have been written to: /beegfs/120x/home/ilia/caffe_builds/nvc/build

When I tried to run the training process with:
./build/tools/caffe train -solver='solver.prototxt'

I got the following error:

I1019 15:17:12.441572 108568 solver.cpp:315] Iteration 0 (0.371277 s), loss = 1383.36
I1019 15:17:12.441620 108568 solver.cpp:332]     Train net output #0: loss_bbox = 8.39254e-06 (* 100 = 0.000839254 loss)
I1019 15:17:12.441634 108568 solver.cpp:332]     Train net output #1: loss_cls = 2.47564 (* 500 = 1237.82 loss)
I1019 15:17:12.441706 108568 solver.cpp:332]     Train net output #2: rpn_cls_loss = 0.693479 (* 100 = 69.3479 loss)
I1019 15:17:12.441738 108568 solver.cpp:332]     Train net output #3: rpn_loss_bbox = 0.93409 (* 100 = 93.409 loss)
I1019 15:17:12.441750 108568 sgd_solver.cpp:136] Iteration 0, lr = 5e-05, m = 0.5

*** Aborted at 1508415432 (unix time) try "date -d @1508415432" if you are using GNU date ***
PC: @     0x7f8922466b8d caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
*** SIGSEGV (@0x0) received by PID 108568 (TID 0x7f89242d4900) from PID 0; stack trace: ***
    @     0x7f8920263cb0 (unknown)
    @     0x7f8922466b8d caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
    @     0x7f892248b9f0 caffe::CuDNNConvolutionLayer<>::Reshape()
    @     0x7f89223a7d0b caffe::Layer<>::Forward()
    @     0x7f892261da3b caffe::Net::ForwardFromTo()
    @     0x7f892261db97 caffe::Net::Forward()
    @     0x7f8922620325 caffe::Net::ForwardBackward()
    @     0x7f8922630652 caffe::Solver::Step()
    @     0x7f8922631395 caffe::Solver::Solve()
    @           0x40d9e8 train()
    @           0x40ae18 main
    @     0x7f892024ef45 (unknown)
    @           0x40b6fb (unknown)
    @                0x0 (unknown)

It's a bug in elementwise/sum starting from 0.16

The diff error in my topology was caused by the ShareDiff call that is commented out in the snippet here.
Changing back to the old style (copying the diff) works around the issue.
Just FYI.
case EltwiseParameter_EltwiseOp_SUM:
  if (coeffs_[i] == 1.F) {
    Btype* bottom_diff = bottom[i]->mutable_gpu_diff();
    // bottom[i]->ShareDiff(top[0]);
    caffe_copy(count, top_diff, bottom_diff);
  } else {
    Btype* bottom_diff = bottom[i]->mutable_gpu_diff();
    caffe_gpu_scale(count, Btype(coeffs_[i]), top_diff, bottom_diff);
  }
  break;
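
For readers unfamiliar with the layer, here is a minimal CPU sketch of what the backward pass of an element-wise SUM computes: each input (bottom) receives its own gradient buffer holding coeff_i * top_diff. The function name `eltwise_sum_backward` and the use of `std::vector` are assumptions made for illustration; this is not NVCaffe's code, and it does not reproduce the reported bug. It only shows the copy-based behaviour that the workaround restores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for the backward pass of an element-wise SUM.
// Every bottom owns a separate gradient buffer, so a later in-place change
// to one bottom's gradient cannot leak into another bottom or into the top.
void eltwise_sum_backward(const std::vector<float>& top_diff,
                          const std::vector<float>& coeffs,
                          std::vector<std::vector<float>>& bottom_diffs) {
  assert(bottom_diffs.size() == coeffs.size());
  for (std::size_t i = 0; i < coeffs.size(); ++i) {
    bottom_diffs[i].resize(top_diff.size());
    for (std::size_t k = 0; k < top_diff.size(); ++k) {
      // With coeff == 1 this is a plain copy, like the caffe_copy branch.
      bottom_diffs[i][k] = coeffs[i] * top_diff[k];
    }
  }
}
```

Calling it with `top_diff = {1, 2}` and `coeffs = {1, 0.5}` fills two independent buffers, `{1, 2}` and `{0.5, 1}`. Sharing one buffer instead of copying saves memory, but it makes all aliases see any later in-place modification.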

Bugs with caffe-0.14-cnmem branch

I'm working with this code:

git checkout NVIDIA/v0.14.0-alpha
git merge lukeyeager/nvidia/versioning
git merge drnikolaev/caffe-0.14-cnmem

When I build and run the tests, I get two failures:

[  FAILED  ] CuDNNConvolutionLayerTest/0.TestGradientGroupCuDNN, where TypeParam = float (561 ms)

...

[ RUN      ] CuDNNConvolutionLayerTest/1.TestGradientGroupCuDNN
F1009 18:25:32.508326 17488 cudnn_conv_layer.cu:37] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)  CUDNN_STATUS_MAPPING_ERROR
*** Check failure stack trace: ***
    @     0x2aafa63a8daa  (unknown)
    @     0x2aafa63a8ce4  (unknown)
    @     0x2aafa63a86e6  (unknown)
    @     0x2aafa63ab687  (unknown)
    @     0x2aafa5cfd17a  caffe::CuDNNConvolutionLayer<>::Forward_gpu()
    @           0x7bc756  caffe::Layer<>::Forward()
    @           0x8c8890  caffe::GradientChecker<>::CheckGradientSingle()
    @           0x8d56d3  caffe::GradientChecker<>::CheckGradientExhaustive()
    @           0xa62bb2  caffe::CuDNNConvolutionLayerTest_TestGradientGroupCuDNN_Test<>::TestBody()
    @           0xbd6b23  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @           0xbcd5a7  testing::Test::Run()
    @           0xbcd64e  testing::TestInfo::Run()
    @           0xbcd755  testing::TestCase::Run()
    @           0xbd04a8  testing::internal::UnitTestImpl::RunAllTests()
    @           0xbd0747  testing::UnitTest::Run()
    @           0x7af817  main
    @     0x2aafabb81ec5  (unknown)
    @           0x7b62a2  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)
make[3]: *** [src/caffe/test/CMakeFiles/runtest] Error 134
make[2]: *** [src/caffe/test/CMakeFiles/runtest.dir/all] Error 2
make[1]: *** [src/caffe/test/CMakeFiles/runtest.dir/rule] Error 2
make: *** [runtest] Error 2

The second one kills the test suite, so I can't say if any more would have failed or not.

Encountered a GPU memory increase when using nvcaffe and TensorRT

Recently we encountered a very strange problem. When we use nvcaffe and TensorRT through JNI inside a Tomcat container in Java, GPU memory usage increases slightly; when we use them from C++, everything is fine.

Error on DIGITS server when using bvlc_cub_v4_v5

Hi,
I have successfully compiled the nvcaffe bvlc_cub_v4_v5 version, but when I try to run DIGITS I get the following error:

ERROR: Library at "libcaffe.so.1.0.0-rc3" does not have expected suffix "-nv". Are you using the NVIDIA/caffe fork? Invalid input
In the nvcaffe/build/lib dir I see libcaffe.a, libcaffe.so and libcaffe.so.1.0.0-rc3.

I can successfully run nvcaffe from the master branch and also nvcaffe 0.15 by drnikolaev, but it fails with drnikolaev's bvlc_cub_v4_v5 version. I have just updated DIGITS to the latest version, but the error is still there.
Any hints?
Thanks
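
The error message says the library name is expected to end in "-nv". A minimal sketch of such a suffix test follows, assuming it is a plain string comparison on the file name. The helper `has_suffix` is hypothetical and is not DIGITS's actual code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns true when `name` ends with `suffix`.
bool has_suffix(const std::string& name, const std::string& suffix) {
  assert(!suffix.empty());  // an empty suffix would match every name
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

Under that assumption, `has_suffix("libcaffe.so.1.0.0-rc3", "-nv")` is false, which matches the reported failure: the library built from this branch carries the upstream version string rather than one ending in "-nv".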

learn nvcaffe code

Hi Sergei Nikolaev, Ph.D. I am an AI engineer and want to explore the lower-level workings of AI frameworks. I have been studying the nvcaffe-017.3 code for more than a month, but I am still confused about its details. Could you share any notes or documents that would help me understand the nvcaffe code? Thanks for your response.

Batch normalisation in fp16?

I am porting some Caffe code to fp16 and am facing a problem with batch normalisation (BN). It seems that cuDNN doesn't support BN for fp16, and the Caffe engine seems prone to overflow if I use fp16. Do you have a solution?

Thanks.
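
To illustrate the overflow concern: the largest finite IEEE 754 half-precision value is 65504, so a sum of squares over a batch can exceed it even when every single activation fits. A common mitigation in mixed-precision code is to store data in fp16 but accumulate the statistics in fp32. The sketch below assumes that approach; `Moments` and `batch_moments` are made-up names, not NVCaffe or cuDNN API, and plain `float` stands in for the fp32 accumulator.

```cpp
#include <cassert>
#include <vector>

// Largest finite value representable in IEEE 754 half precision.
constexpr float kFp16Max = 65504.0f;

struct Moments {
  float mean;
  float variance;
  bool sum_sq_fits_fp16;  // false if an fp16 accumulator would have overflowed
};

// Hypothetical helper: batch mean and variance with fp32 accumulators.
Moments batch_moments(const std::vector<float>& x) {
  assert(!x.empty());
  float sum = 0.0f;
  float sum_sq = 0.0f;
  for (float v : x) {
    sum += v;
    sum_sq += v * v;
  }
  const float n = static_cast<float>(x.size());
  const float mean = sum / n;
  const float variance = sum_sq / n - mean * mean;
  return {mean, variance, sum_sq <= kFp16Max};
}
```

For 1024 activations equal to 16, the sum of squares is 262144, well past the fp16 range, while the fp32 accumulators still give a mean of 16 and a variance of 0.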
