drnikolaev / caffe
This project forked from nvidia/caffe
Caffe: a fast open framework for deep learning.
Home Page: http://caffe.berkeleyvision.org/
License: Other
Caffe compiled with:
cmake .. -DPROTOBUF_INCLUDE_DIR="/beegfs/120x/home/ilia/protobuf/include/" -DUSE_NCCL=True -DCUDA_ARCH_NAME=Manual -DCUDA_ARCH_BIN="30 35 50 52 60 61 62 70" -DCUDA_ARCH_PTX="30 35 50 52 60 61 62 70" -DCUDA_NVCC_FLAGS=--Wno-deprecated-gpu-targets -Wno-dev
-- Boost version: 1.54.0
-- Found the following Boost libraries:
-- system
-- thread
-- filesystem
-- Found gflags (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libgflags.so)
-- Found glog (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libglog.so)
-- Found PROTOBUF Compiler: /beegfs/120x/home/ilia/protobuf/bin/protoc
-- Found lmdb (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/liblmdb.so)
-- Found LevelDB (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libleveldb.so)
-- Found Snappy (include: /usr/include, library: /usr/lib/libsnappy.so)
-- Found JPEGTurbo: /usr/include
-- CUDA detected: 9.0
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so (found version "7.0")
-- Added CUDA NVCC flags for: sm_30 sm_35 sm_50 sm_52 sm_60 sm_61 sm_62 sm_70 compute_30 compute_35 compute_50 compute_52 compute_60 compute_61 compute_62 compute_70
-- Found OpenCV 2.x: /usr/share/OpenCV
-- Found Atlas: /usr/include
-- Found Atlas (include: /usr/include, library: /usr/lib/libatlas.so)
-- Found PythonInterp: /beegfs/120x/home/ilia/nvcaffe_comp/bin/python2.7 (found suitable version "2.7.6", minimum required is "2.7")
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython2.7.so (found suitable version "2.7.6", minimum required is "2.7")
-- Found NumPy: /beegfs/120x/home/ilia/nvcaffe_comp/local/lib/python2.7/site-packages/numpy/core/include (found suitable version "1.13.1", minimum required is "1.7.1")
-- NumPy ver. 1.13.1 found (include: /beegfs/120x/home/ilia/nvcaffe_comp/local/lib/python2.7/site-packages/numpy/core/include)
-- Boost version: 1.54.0
-- Found the following Boost libraries:
-- python
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- Found NCCL: /usr/include
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so)
-- Found NVML: /usr/include
-- Found NVML (include: /usr/include, library: /usr/lib/nvidia-384/libnvidia-ml.so)
-- Found Git: /usr/bin/git (found version "1.9.1")
--
-- ******************* Caffe Configuration Summary *******************
-- General:
-- Version : 0.16.4
-- Git : v0.16.1-404-g860701c
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- Release CXX flags : -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Debug CXX flags : -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Build type : Release
--
-- BUILD_SHARED_LIBS : ON
-- BUILD_python : ON
-- BUILD_matlab : OFF
-- BUILD_docs : ON
-- CPU_ONLY : OFF
-- USE_LEVELDB : ON
-- USE_LMDB : ON
-- ALLOW_LMDB_NOLOCK : OFF
-- TEST_FP16 : OFF
--
-- Dependencies:
-- BLAS : Yes (Atlas)
-- Boost : Yes (ver. 1.54)
-- glog : Yes
-- gflags : Yes
-- protobuf : Yes (ver. 3.4.0)
-- lmdb : Yes (ver. 0.9.10)
-- LevelDB : Yes (ver. 1.15)
-- Snappy : Yes (ver. 1.1.0)
-- OpenCV : Yes (ver. 2.4.8)
-- JPEGTurbo : No
-- CUDA : Yes (ver. 9.0)
--
-- NVIDIA CUDA:
-- Target GPU(s) : Manual
-- GPU arch(s) : sm_30 sm_35 sm_50 sm_52 sm_60 sm_61 sm_62 sm_70 compute_30 compute_35 compute_50 compute_52 compute_60 compute_61 compute_62 compute_70
-- cuDNN : Yes (ver. 7.0)
-- NCCL : Yes (ver. 2.0.5)
-- NVML : /usr/lib/nvidia-384/libnvidia-ml.so
--
-- Python:
-- Interpreter : /beegfs/120x/home/ilia/nvcaffe_comp/bin/python2.7 (ver. 2.7.6)
-- Libraries : /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.6)
-- NumPy : /beegfs/120x/home/ilia/nvcaffe_comp/local/lib/python2.7/site-packages/numpy/core/include (ver 1.13.1)
--
-- Documentaion:
-- Doxygen : No
-- config_file :
--
-- Install:
-- Install path : /beegfs/120x/home/ilia/caffe_builds/nvc/build/install
--
-- Configuring done
-- Generating done
-- Build files have been written to: /beegfs/120x/home/ilia/caffe_builds/nvc/build
When I tried to run the training process with:
./build/tools/caffe train -solver='solver.prototxt'
I got the following error:
I1019 15:17:12.441572 108568 solver.cpp:315] Iteration 0 (0.371277 s), loss = 1383.36
I1019 15:17:12.441620 108568 solver.cpp:332] Train net output #0: loss_bbox = 8.39254e-06 (* 100 = 0.000839254 loss)
I1019 15:17:12.441634 108568 solver.cpp:332] Train net output #1: loss_cls = 2.47564 (* 500 = 1237.82 loss)
I1019 15:17:12.441706 108568 solver.cpp:332] Train net output #2: rpn_cls_loss = 0.693479 (* 100 = 69.3479 loss)
I1019 15:17:12.441738 108568 solver.cpp:332] Train net output #3: rpn_loss_bbox = 0.93409 (* 100 = 93.409 loss)
I1019 15:17:12.441750 108568 sgd_solver.cpp:136] Iteration 0, lr = 5e-05, m = 0.5
*** Aborted at 1508415432 (unix time) try "date -d @1508415432" if you are using GNU date ***
PC: @ 0x7f8922466b8d caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
*** SIGSEGV (@0x0) received by PID 108568 (TID 0x7f89242d4900) from PID 0; stack trace: ***
@ 0x7f8920263cb0 (unknown)
@ 0x7f8922466b8d caffe::CuDNNConvolutionLayer<>::FindExConvAlgo()
@ 0x7f892248b9f0 caffe::CuDNNConvolutionLayer<>::Reshape()
@ 0x7f89223a7d0b caffe::Layer<>::Forward()
@ 0x7f892261da3b caffe::Net::ForwardFromTo()
@ 0x7f892261db97 caffe::Net::Forward()
@ 0x7f8922620325 caffe::Net::ForwardBackward()
@ 0x7f8922630652 caffe::Solver::Step()
@ 0x7f8922631395 caffe::Solver::Solve()
@ 0x40d9e8 train()
@ 0x40ae18 main
@ 0x7f892024ef45 (unknown)
@ 0x40b6fb (unknown)
@ 0x0 (unknown)
The diff error in my topology was caused by the ShareDiff call, which is commented out below.
Changing back to the old-style copy works around the issue.
Just FYI.
case EltwiseParameter_EltwiseOp_SUM:
  if (coeffs_[i] == 1.F) {
    Btype* bottom_diff = bottom[i]->mutable_gpu_diff();
    // bottom[i]->ShareDiff(top[0]);  // disabled: sharing the top diff triggered the error
    caffe_copy(count, top_diff, bottom_diff);  // old style: copy the top diff instead
  } else {
    Btype* bottom_diff = bottom[i]->mutable_gpu_diff();
    caffe_gpu_scale(count, Btype(coeffs_[i]), top_diff, bottom_diff);
  }
  break;
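For readers unfamiliar with the difference, here is a minimal standalone sketch of how sharing a gradient buffer can behave differently from copying it. ToyBlob and second_bottom_after_inplace_scale are hypothetical names made up for this illustration; this is not Caffe's Blob class, and the report above does not say what actually went wrong in the topology. The sketch only shows one plausible mechanism: when two bottoms alias the same diff, an in-place change to one is visible through the other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a blob's gradient storage; not Caffe's Blob class.
struct ToyBlob {
  std::shared_ptr<std::vector<float>> diff;

  explicit ToyBlob(std::size_t n)
      : diff(std::make_shared<std::vector<float>>(n, 0.0f)) {}

  // Alias the other blob's gradient buffer (no data is copied).
  void ShareDiff(const ToyBlob& other) { diff = other.diff; }

  // Copy the other blob's gradient values into this blob's own buffer.
  void CopyDiffFrom(const ToyBlob& other) {
    assert(diff->size() == other.diff->size());
    std::copy(other.diff->begin(), other.diff->end(), diff->begin());
  }
};

// Returns the first gradient element of the second bottom after the first
// bottom's gradient has been scaled in place by 2.
float second_bottom_after_inplace_scale(bool share) {
  ToyBlob top(4), a(4), b(4);
  std::fill(top.diff->begin(), top.diff->end(), 1.0f);
  if (share) {
    a.ShareDiff(top);
    b.ShareDiff(top);
  } else {
    a.CopyDiffFrom(top);
    b.CopyDiffFrom(top);
  }
  // Something downstream modifies a's gradient in place.
  for (float& g : *a.diff) g *= 2.0f;
  return (*b.diff)[0];
}
```

With share set to true the second bottom sees the scaled value, because all three blobs point at one buffer; with share set to false it keeps the original gradient. Copying costs extra memory and bandwidth, which is presumably why sharing was introduced, but it keeps each bottom's diff independent.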
I'm working with this code:
git checkout NVIDIA/v0.14.0-alpha
git merge lukeyeager/nvidia/versioning
git merge drnikolaev/caffe-0.14-cnmem
When I build and run the tests, I get two failures:
[ FAILED ] CuDNNConvolutionLayerTest/0.TestGradientGroupCuDNN, where TypeParam = float (561 ms)
...
[ RUN ] CuDNNConvolutionLayerTest/1.TestGradientGroupCuDNN
F1009 18:25:32.508326 17488 cudnn_conv_layer.cu:37] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0) CUDNN_STATUS_MAPPING_ERROR
*** Check failure stack trace: ***
@ 0x2aafa63a8daa (unknown)
@ 0x2aafa63a8ce4 (unknown)
@ 0x2aafa63a86e6 (unknown)
@ 0x2aafa63ab687 (unknown)
@ 0x2aafa5cfd17a caffe::CuDNNConvolutionLayer<>::Forward_gpu()
@ 0x7bc756 caffe::Layer<>::Forward()
@ 0x8c8890 caffe::GradientChecker<>::CheckGradientSingle()
@ 0x8d56d3 caffe::GradientChecker<>::CheckGradientExhaustive()
@ 0xa62bb2 caffe::CuDNNConvolutionLayerTest_TestGradientGroupCuDNN_Test<>::TestBody()
@ 0xbd6b23 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0xbcd5a7 testing::Test::Run()
@ 0xbcd64e testing::TestInfo::Run()
@ 0xbcd755 testing::TestCase::Run()
@ 0xbd04a8 testing::internal::UnitTestImpl::RunAllTests()
@ 0xbd0747 testing::UnitTest::Run()
@ 0x7af817 main
@ 0x2aafabb81ec5 (unknown)
@ 0x7b62a2 (unknown)
@ (nil) (unknown)
Aborted (core dumped)
make[3]: *** [src/caffe/test/CMakeFiles/runtest] Error 134
make[2]: *** [src/caffe/test/CMakeFiles/runtest.dir/all] Error 2
make[1]: *** [src/caffe/test/CMakeFiles/runtest.dir/rule] Error 2
make: *** [runtest] Error 2
The second one kills the test suite, so I can't say if any more would have failed or not.
Recently we encountered a very strange problem. When we use NVCaffe and TensorRT through JNI inside a Tomcat container in Java, GPU memory usage increases slightly; when we use them from C++, it is fine.
Hi,
I have successfully compiled the nvcaffe bvlc_cub_v4_v5 version, but when trying to run DIGITS I get the following error:
ERROR: Library at "libcaffe.so.1.0.0-rc3" does not have expected suffix "-nv". Are you using the NVIDIA/caffe fork? Invalid input
In the nvcaffe/build/lib directory I see: libcaffe.a, libcaffe.so, libcaffe.so.1.0.0-rc3
I can successfully run nvcaffe from the master branch and also nvcaffe 0.15 by drnikolaev, but I fail to run the drnikolaev bvlc_cub_v4_v5 version. I have just updated DIGITS to the latest version, but the error is still there.
Any hints?
Thanks
Hi, Sergei Nikolaev, Ph.D. I am an AI engineer and I want to explore the lower levels of AI frameworks. I have been studying the nvcaffe-017.3 code for more than a month, but I am still confused about its details. Could you share any notes or documents that would help me understand the nvcaffe code? Thanks for your response.
I am porting some Caffe code to fp16 and I have a problem with batch normalization (BN). It seems that cuDNN doesn't support BN for fp16, and the Caffe engine seems prone to overflow if I use fp16. Do you have a solution?
Thanks.
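To make the overflow concern concrete, here is a rough sketch of why accumulating BN-style statistics in fp16 is fragile. It is an assumption-laden model, not NVCaffe or cuDNN code: fp16 is imitated with ordinary floats plus a range check against 65504 (the largest finite fp16 value), rounding is ignored, and the helper names to_half_range, mean_sq_half_accum and mean_sq_float_accum are invented for this example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Largest finite value representable in IEEE fp16.
constexpr float kHalfMax = 65504.0f;

// Crude model of storing a value in fp16: anything beyond the finite range
// becomes infinity. Rounding to fp16 precision is deliberately ignored.
float to_half_range(float v) {
  if (std::fabs(v) > kHalfMax) {
    return std::copysign(std::numeric_limits<float>::infinity(), v);
  }
  return v;
}

// Mean of squares with every intermediate forced into the fp16 range,
// as if the running sum itself were kept in fp16.
float mean_sq_half_accum(const std::vector<float>& x) {
  assert(!x.empty());
  float acc = 0.0f;
  for (float v : x) acc = to_half_range(acc + to_half_range(v * v));
  return to_half_range(acc / static_cast<float>(x.size()));
}

// Mean of squares accumulated in fp32; only the final result is narrowed.
float mean_sq_float_accum(const std::vector<float>& x) {
  assert(!x.empty());
  float acc = 0.0f;
  for (float v : x) acc += v * v;
  return to_half_range(acc / static_cast<float>(x.size()));
}
```

For 1000 activations of value 10, the sum of squares is 100000, which exceeds the fp16 range, so the fp16-style accumulation ends at infinity even though the final mean (100) fits comfortably. Keeping the accumulator in fp32 and narrowing only the result is a common mitigation; whether a given Caffe build does this in its BN layer is not something this thread establishes.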