tusimple / mx-maskrcnn
An MXNet implementation of Mask R-CNN
License: Apache License 2.0
In the following code:
mx-maskrcnn/rcnn/CXX_OP/roi_align.cu
Lines 72 to 73 in e8a05da
3.0 is used when computing the strides. Why use 3.0? Shouldn't it be pooled_height or pooled_width?
How can I download the Cityscapes dataset? Thanks.
My training stopped at rcnn1-0005.params because someone else killed my program. How can I resume training from this checkpoint?
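Checkpoints in this repository appear to follow the prefix-NNNN.params naming (a traceback further down this page shows the loader opening '%s-%04d.params' % (prefix, epoch)). A minimal sketch, assuming that naming, for finding the most recent epoch saved under a given name. The helper latest_epoch is hypothetical and not part of the repository, and how the training script accepts a starting epoch is not confirmed here.

```python
import os
import re

def latest_epoch(model_dir, name):
    """Return the highest epoch N for which '<name>-<NNNN>.params' exists, or None."""
    pattern = re.compile(r'^%s-(\d{4})\.params$' % re.escape(name))
    epochs = []
    for filename in os.listdir(model_dir):
        match = pattern.match(filename)
        if match:
            epochs.append(int(match.group(1)))
    return max(epochs) if epochs else None
```

For a directory holding rcnn1-0003.params and rcnn1-0005.params, latest_epoch(model_dir, 'rcnn1') returns 5, which is the epoch to load before continuing.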
Now I have run into a new problem, as above. I guess the cuDNN version is different.
To avoid this error, can I compile MXNet without cuDNN?
I tried to test my own image, but got errors. Can you explain in detail how to run the test? Thank you very much.
A small note: the documentation refers to train“_”alternate.sh, but the file is actually named train“-”alternate.sh.
Traceback (most recent call last):
File "train_alternate_mask_fpn.py", line 114, in <module>
main()
File "train_alternate_mask_fpn.py", line 111, in main
args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
File "train_alternate_mask_fpn.py", line 31, in alternate_train
train_shared=False, lr=rpn_lr, lr_step=rpn_lr_step)
File "/home/luo/mx-maskrcnn-master/rcnn/tools/train_rpn.py", line 149, in train_rpn
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/base_module.py", line 460, in fit
for_training=True, force_rebind=force_rebind)
File "/home/luo/mx-maskrcnn-master/rcnn/core/module.py", line 141, in bind
force_rebind=False, shared_module=None)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/module.py", line 417, in bind
state_names=self._state_names)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/executor_group.py", line 231, in __init__
self.bind_exec(data_shapes, label_shapes, shared_group)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/executor_group.py", line 327, in bind_exec
shared_group))
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/executor_group.py", line 603, in _bind_ith_exec
shared_buffer=shared_data_arrays, **input_shapes)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/symbol/symbol.py", line 1491, in simple_bind
raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 1024, 2048)
bbox_weight: (1, 12, 174592)
bbox_target: (1, 12, 174592)
label: (1, 523776)
[15:49:36] src/storage/storage.cc:59: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid device ordinal
What should I do? Thanks.
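The argument shapes in the simple_bind message are consistent with the FPN anchor layout printed in the configuration dumps elsewhere on this page (RPN_FEAT_STRIDE [64, 32, 16, 8, 4], NUM_ANCHORS 3, SCALES (1024, 2048)). A small sketch of the arithmetic, assuming this run used those same settings; the function name rpn_positions is illustrative only.

```python
def rpn_positions(height, width, strides):
    """Total number of feature-map positions summed over all pyramid levels."""
    return sum((height // s) * (width // s) for s in strides)

height, width = 1024, 2048
strides = [64, 32, 16, 8, 4]
num_anchors = 3

positions = rpn_positions(height, width, strides)
print(positions)                # 174592, last dimension of bbox_target / bbox_weight
print(4 * num_anchors)          # 12, four box coordinates per anchor
print(num_anchors * positions)  # 523776, length of the label vector
```

So the shapes themselves look as expected. The failure reported is "invalid device ordinal", which CUDA raises when the requested GPU index does not exist on the machine, so the GPU ids passed to the training script are the first thing to check.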
The number of sampled regular locations in your implementation seems to be 3.
Dtype h_stride = (hend - hstart)/3.0;
Dtype w_stride = (wend - wstart)/3.0;
But the authors sample 4 regular locations.
Is 3 better than 4?
Do you observe diminishing returns as the number of regular locations increases?
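One possible reading of the division by 3.0, offered as an assumption since the sampling loop itself is not quoted here: if samples are taken only at the interior points of a bin split into thirds, each axis gets two points and the bin gets 2 x 2 = 4 regular locations, which would match the paper. A sketch of that reading in Python; interior_samples is a hypothetical helper, not code from roi_align.cu.

```python
def interior_samples(start, end, parts=3):
    """Split [start, end] into `parts` equal segments and return the interior grid points."""
    stride = (end - start) / float(parts)
    return [start + i * stride for i in range(1, parts)]

hs = interior_samples(0.0, 3.0)   # two rows:    [1.0, 2.0]
ws = interior_samples(0.0, 6.0)   # two columns: [2.0, 4.0]
locations = [(h, w) for h in hs for w in ws]
print(len(locations))             # 4
```

Under this reading the constant 3 counts segments per axis, not sample points, so it would not need to depend on pooled_height or pooled_width.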
Any instructions on how to train it on our own dataset?
I got some warning when I ran make:
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/cuda/include -I/usr/include/python2.7 -c gpu_nms.cpp -o build/temp.linux-x86_64-2.7/gpu_nms.o -Wno-unused-function
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1788:0,
from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4,
from gpu_nms.cpp:499:
/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
c++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wl,-Bsymbolic-functions -Wl,-z,relro -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/nms_kernel.o build/temp.linux-x86_64-2.7/gpu_nms.o -L/usr/local/cuda/lib64 -Wl,-R/usr/local/cuda/lib64 -lcudart -o /home/lzq12138/lkd/maskrcnn/mx-maskrcnn/rcnn/cython/gpu_nms.so
cd rcnn/pycocotools; python2 setup.py build_ext --inplace; rm -rf build; cd ../../
Warning: Extension name '_mask' does not match fully qualified name 'rcnn.pycocotools._mask' of '_mask.pyx'
Compiling _mask.pyx because it depends on /usr/local/lib/python2.7/dist-packages/Cython/Includes/numpy/init.pxd.
I can run bash scripts/train_alternate.sh successfully. Then I cancelled the training with Ctrl+C and turned to bash scripts/demo.sh:
Error in CustomOp.forward: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/mxnet/operator.py", line 782, in forward_entry
aux=tensors[4])
File "/home/lzq12138/lkd/maskrcnn/mx-maskrcnn/rcnn/PY_OP/fpn_roi_pooling.py", line 76, in forward
roi_pool = mx.nd.ROIAlign(feat_dict['stride%s' % s], _rois, (self._pool_h, self._pool_w), 1.0 / float(s))
AttributeError: 'module' object has no attribute 'ROIAlign'
[00:34:38] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [00:34:38] src/operator/custom/custom.cc:293: Check failed: reinterpret_cast(params.info->callbacks[kCustomOpForward])( ptrs.size(), ptrs.data(), tags.data(), reinterpret_cast<const int*>(req.data()), static_cast(ctx.is_train), params.info->contexts[kCustomOpForward])
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x272c4c) [0x7f1961857c4c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x33ffaf) [0x7f1961924faf]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x20ef112) [0x7f19636d4112]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(MXExecutorForward+0x15) [0x7f1963665675]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f19d312ce40]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f19d312c8ab]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f19d333c3df]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f19d3340d82]
[bt] (8) python2(PyObject_Call+0x43) [0x4b0cb3]
[bt] (9) python2(PyEval_EvalFrameEx+0x5faf) [0x4c9faf]
Hmm... I am raising this issue again.
My dataset has 5 classes. Here are the relevant entries of my config:
config.NUM_CLASSES = 5
config.SCALES = [(1024, 2048)]
config.CLASS_ID = [0, 1, 2, 3, 4]
config.TRAIN.SCALE = True (!!! You never use this option in your project !!!)
default.rpn_epoch = 4
default.rcnn_epoch = 12
dataset.ObjectSnap.NUM_CLASSES = 5
dataset.ObjectSnap.CLASS_ID = [0, 1, 2, 3, 4]
dataset.ObjectSnap.SCALES = [(1024, 2048)]
I built a small dataset of 41 images for verification. The net trains well. Part of my log:
Epoch[2] Batch [20] Speed: 10.90 samples/sec Train-RPNAcc=0.903181, RPNLogLoss=0.238827, RPNL1Loss=0.933737,
Epoch[2] Batch [40] Speed: 12.69 samples/sec Train-RPNAcc=0.919255, RPNLogLoss=0.200416, RPNL1Loss=0.927370,
Epoch[2] Train-RPNAcc=0.919255
Epoch[2] Train-RPNLogLoss=0.200416
Epoch[2] Train-RPNL1Loss=0.927370
Epoch[2] Time cost=7.387
Epoch[0] Batch [20] Speed: 1.18 samples/sec Train-RCNNAcc=0.936756, RCNNLogLoss=0.921319, RCNNL1Loss=1.693366, MaskACC=0.981722, MaskLogLoss=0.033281,
Epoch[0] Batch [40] Speed: 1.28 samples/sec Train-RCNNAcc=0.946408, RCNNLogLoss=0.617575, RCNNL1Loss=1.681111, MaskACC=0.983959, MaskLogLoss=0.030308,
Epoch[0] Train-RCNNAcc=0.946408
Epoch[0] Train-RCNNLogLoss=0.617575
Epoch[0] Train-RCNNL1Loss=1.681111
Epoch[0] Train-MaskACC=0.983959
Epoch[0] Train-MaskLogLoss=0.030308
Epoch[0] Time cost=67.991
Finally, I get my final-0000.params model.
However, when I run it on several images from my training list, I get NaN scores for everything, in both the demo test and the evaluation test. How could that happen?
PS: One more thing I doubt: the log shows a very high accuracy (around 0.8~0.9) even at the very beginning of training.
Hope to get your reply soon. Thanks!
It seems to be out of resources, but I have 8 GB of GPU memory and not all of it is used. Why does this happen?
MXNetError: [11:03:30] /media/jintian/Netac/CodeSpace/ng/auto_car/mx-maskrcnn/incubator-mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch
I am both sad and happy that you found the flip-mask mistake. I do think it hurt the effectiveness of my trained net (which cost more than 6 days). So I wonder: are there other commits that will affect training?
Thanks! I still admire your work!
mx-maskrcnn/rcnn/CXX_OP/roi_align.cu
Lines 77 to 78 in 36910cf
hlow seems to be at most height-1, and the max value of hhigh is height. So is there a risk that the bottom-most or right-most value falls outside the feature array boundary?
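For reference, a common way to keep bilinear sampling inside the feature map is to clamp both neighbouring indices to the last valid row or column. The sketch below illustrates that idea in plain Python; it is not the code from roi_align.cu, and the function and variable names are assumptions made for illustration.

```python
import math

def bilinear_sample(feature, h, w):
    """Bilinearly interpolate a 2-D list `feature` at the real-valued point (h, w)."""
    height, width = len(feature), len(feature[0])
    # Clamp the sample point, then clamp the upper neighbours so they never
    # index past the last row or column.
    h = min(max(h, 0.0), height - 1.0)
    w = min(max(w, 0.0), width - 1.0)
    hlow, wlow = int(math.floor(h)), int(math.floor(w))
    hhigh, whigh = min(hlow + 1, height - 1), min(wlow + 1, width - 1)
    dh, dw = h - hlow, w - wlow
    top = feature[hlow][wlow] * (1 - dw) + feature[hlow][whigh] * dw
    bottom = feature[hhigh][wlow] * (1 - dw) + feature[hhigh][whigh] * dw
    return top * (1 - dh) + bottom * dh

grid = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear_sample(grid, 0.5, 0.5))  # 1.5
print(bilinear_sample(grid, 5.0, 5.0))  # 3.0
```

With unit spacing between neighbours, the weights dh and dw need no normalising denominator.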
Also, in these two lines,
mx-maskrcnn/rcnn/CXX_OP/roi_align.cu
Lines 86 to 87 in 36910cf
hlow is never equal to hhigh and the denominator always seems to be 1.
Ubuntu 14.04.2
GPU:TITAN X;
Driver Version: 375.39
CUDA8.0;
cudnn 5.1.5
python2.7;
numpy1.8.2
mxnet 0.12.0
How to solve this problem? Thanks!
When I compile MXNet, I get this error: OSError: libPlayCtrl.so: undefined symbol: AR_SetParam. Could you help me solve this problem? I usually use the Caffe framework; this is my first time using MXNet. Thank you.
mkdir: cannot create directory 'data/cityscape/results/': File exists
Namespace(dataset='Cityscape', dataset_path='data/cityscape', epoch=0, gpu=0, has_rpn=True, image_set='val', network='resnet_fpn', prefix='model/final', proposal='rpn', result_path='data/cityscape/results/', root_path='data', shuffle=False, thresh=0.001, vis=False)
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8],
'CLASS_ID': [0, 24, 25, 26, 27, 28, 31, 32, 33],
'FIXED_PARAMS': ['conv0', 'stage1', 'gamma', 'beta'],
'FIXED_PARAMS_SHARED': ['conv0',
'stage1',
'stage2',
'stage3',
'stage4',
'P5',
'P4',
'P3',
'P2',
'gamma',
'beta'],
'NUM_ANCHORS': 3,
'NUM_CLASSES': 9,
'PIXEL_MEANS': array([0, 0, 0]),
'RCNN_FEAT_STRIDE': [32, 16, 8, 4],
'ROIALIGN': True,
'RPN_FEAT_STRIDE': [64, 32, 16, 8, 4],
'SCALES': [(1024, 2048)],
'TEST': {'BATCH_IMAGES': 1,
'HAS_RPN': True,
'NMS': 0.3,
'PROPOSAL_MIN_SIZE': [64, 32, 16, 8, 4],
'PROPOSAL_NMS_THRESH': 0.7,
'PROPOSAL_POST_NMS_TOP_N': 2000,
'PROPOSAL_PRE_NMS_TOP_N': 20000,
'RPN_MIN_SIZE': [64, 32, 16, 8, 4],
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 1000,
'RPN_PRE_NMS_TOP_N': 6000},
'TRAIN': {'ASPECT_GROUPING': True,
'BATCH_IMAGES': 1,
'BATCH_ROIS': 256,
'BBOX_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZATION_PRECOMPUTED': False,
'BBOX_REGRESSION_THRESH': 0.5,
'BBOX_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_WEIGHTS': array([ 1., 1., 1., 1.]),
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'RPN_BATCH_SIZE': 256,
'RPN_BBOX_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': [64, 32, 16, 8, 4],
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALE': True,
'SCALE_RANGE': [0.8, 1]}}
num_images 500
cityscape_val gt roidb loaded from data/cache/cityscape_val_gt_roidb.pkl
scripts/demo.sh: line 24: 31359 Segmentation fault (core dumped) python demo_mask.py --network resnet_fpn --dataset ${DATASET} --image_set ${TEST_SET} --prefix ${PREFIX} --result_path ${RESULT_PATH} --has_rpn --epoch 0 --gpu 0
This is my running log. I have finished installing MXNet, but when I run demo.sh I hit this problem. Can anyone help?
When I build the Cython extensions, the build aborts with the error message below. I am running this in a virtual machine. I know it is a problem with the GPU or CPU, I just don't know how to fix it. Can someone help me?
cd rcnn/cython/; python setup.py build_ext --inplace; rm -rf build; cd ../../
Traceback (most recent call last):
File "setup.py", line 58, in <module>
CUDA = locate_cuda()
File "setup.py", line 46, in locate_cuda
raise EnvironmentError('The nvcc binary could not be '
EnvironmentError: The nvcc binary could not be located in your $PATH. Either add it to your path, or set $CUDAHOME
cd rcnn/pycocotools; python setup.py build_ext --inplace; rm -rf build; cd ../../
Warning: Extension name '_mask' does not match fully qualified name 'rcnn.pycocotools._mask' of '_mask.pyx'
running build_ext
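The error comes from the locate_cuda check in setup.py, which needs nvcc on $PATH or a CUDAHOME variable. A rough sketch of that kind of lookup, written for Python 3 with only the standard library. It is an illustration under assumptions (the bin/nvcc layout under CUDAHOME, the function name find_nvcc), not the repository's actual setup.py.

```python
import os
import shutil

def find_nvcc(environ=None):
    """Return a path to nvcc from CUDAHOME or PATH, or None if it cannot be found."""
    environ = os.environ if environ is None else environ
    cuda_home = environ.get('CUDAHOME')
    if cuda_home:
        candidate = os.path.join(cuda_home, 'bin', 'nvcc')
        if os.path.isfile(candidate):
            return candidate
    # Fall back to searching the directories listed in PATH.
    return shutil.which('nvcc', path=environ.get('PATH', ''))

print(find_nvcc() or 'nvcc not found: install the CUDA toolkit or set CUDAHOME')
```

If this prints the "not found" message inside the virtual machine, the CUDA toolkit is either not installed there or not visible to the shell that runs the build.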
I converted my dataset to the Cityscapes format to use this net. Of course, I modified several places for the dataset file index and class_id.
However, training crashes after finishing the first RPN training stage.
The crash happened in
data_on_imgs['img_%s' % im_i]['bbox_targets_on_levels']['stride%s' % s] = np.concatenate([_bbox_targets, bbox_targets_pad])
data_on_imgs['img_%s' % im_i]['bbox_weights_on_levels']['stride%s' % s] = np.concatenate([_bbox_weights, bbox_weights_pad])
File "mx-maskrcnn/rcnn/core/loader.py", line 278, in _make_data_and_labels
ValueError: all the input array dimensions except for the concatenation axis must match exactly
PS: I checked the corresponding image and label pairs; each pair has the same size.
Thank you all very much for the help!
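np.concatenate raises this ValueError when the arrays differ in any dimension other than the one being joined, so comparing the shapes of _bbox_targets and bbox_targets_pad just before the failing line should show which dimension disagrees. A small standard-library sketch of the rule; mismatched_axes is a hypothetical helper and the example shapes are made up for illustration.

```python
def mismatched_axes(shape_a, shape_b, axis=0):
    """List the non-concatenation axes on which two array shapes differ."""
    if len(shape_a) != len(shape_b):
        raise ValueError('arrays must have the same number of dimensions')
    return [i for i, (a, b) in enumerate(zip(shape_a, shape_b))
            if i != axis and a != b]

print(mismatched_axes((12, 20), (4, 20)))  # []: only axis 0 differs, so joining works
print(mismatched_axes((12, 20), (4, 16)))  # [1]: axis 1 differs, so np.concatenate fails
```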
The terminal suggests "you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb)." What should I do?
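MXNET_ENGINE_TYPE is read when MXNet starts, so it has to be set before the library is loaded: either export it in the shell before launching the script, or set it at the very top of the Python entry point. A minimal sketch of the second option (the mxnet import is left commented out so the snippet runs without MXNet installed).

```python
import os

# Must run before `import mxnet`. NaiveEngine executes operators synchronously,
# so the failing operator is easier to locate in the traceback or under gdb.
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

# import mxnet as mx
print(os.environ['MXNET_ENGINE_TYPE'])  # NaiveEngine
```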
Hi, I encountered the following problem while running scripts/demo_single_image.sh:
~/mx-maskrcnn$ bash scripts/demo_single_image.sh
Namespace(dataset='Cityscape', epoch=0, gpu=0, image_name='figures/test.jpg', network='resnet_fpn', prefix='model/final', thresh=0.3, vis=True)
This application failed to start because it could not find or load the Qt platform plugin "xcb"
in "".
Available platform plugins are: minimal, offscreen, xcb.
Reinstalling the application may fix this problem.
scripts/demo_single_image.sh: line 18: 18915 Aborted (core dumped) python2 -m rcnn.tools.demo_single_image --network resnet_fpn --dataset ${DATASET} --prefix ${PREFIX} --epoch 0 --gpu 0 --image_name figures/test.jpg --thresh 0.3 --vis true
Thank you for your answer!!
How long did the full training take for the provided training script? Also, what hardware did you use?
Thanks for this "wonderful" open-source project, but I have to say the code is really shit. I opened an issue to report a bug and was pointed to a commit; after I tried it, the same errors happened again. The technical people are too simple.
I tried to train the model on the COCO dataset, but when it trains the RCNN part, it reports out of memory.
Could you give me some hints to solve this problem?
My GPU is a GTX 1080 with 8 GB.
Here is some information printed during training:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8],
'CLASS_ID': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]),
'FIXED_PARAMS': ['conv0', 'stage1', 'gamma', 'beta'],
'FIXED_PARAMS_SHARED': ['conv0',
'stage1',
'stage2',
'stage3',
'stage4',
'P5',
'P4',
'P3',
'P2',
'gamma',
'beta'],
'NUM_ANCHORS': 3,
'NUM_CLASSES': 81,
'PIXEL_MEANS': array([0, 0, 0]),
'RCNN_FEAT_STRIDE': [32, 16, 8, 4],
'ROIALIGN': True,
'RPN_FEAT_STRIDE': [64, 32, 16, 8, 4],
'SCALES': [(800, 1000)],
'TEST': {'BATCH_IMAGES': 1,
'HAS_RPN': True,
'NMS': 0.3,
'PROPOSAL_MIN_SIZE': [64, 32, 16, 8, 4],
'PROPOSAL_NMS_THRESH': 0.7,
'PROPOSAL_POST_NMS_TOP_N': 2000,
'PROPOSAL_PRE_NMS_TOP_N': 20000,
'RPN_MIN_SIZE': [64, 32, 16, 8, 4],
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 1000,
'RPN_PRE_NMS_TOP_N': 6000},
'TRAIN': {'ASPECT_GROUPING': True,
'BATCH_IMAGES': 1,
'BATCH_ROIS': 256,
'BBOX_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZATION_PRECOMPUTED': False,
'BBOX_REGRESSION_THRESH': 0.5,
'BBOX_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_WEIGHTS': array([ 1., 1., 1., 1.]),
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'RPN_BATCH_SIZE': 32,
'RPN_BBOX_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': [64, 32, 16, 8, 4],
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALE': True,
'SCALE_RANGE': [0.8, 1]}}
Your implementation uses multi-scale training:
config.TRAIN.SCALE_RANGE = (0.8, 1)
But the authors resize the shorter edge to 800 pixels on COCO and use 2048×1024 pixels on Cityscapes.
The FPN backbone should already give good scale invariance (I'm not sure whether this description is accurate).
Looking forward to single-scale training results.
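To make the difference concrete, here is a sketch of what a SCALE_RANGE of (0.8, 1) could mean for a 1024 x 2048 input, assuming the factor is drawn uniformly and applied to both sides. How the repository actually applies the range is not shown here, and random_scale is a hypothetical name.

```python
import random

def random_scale(base_size, scale_range, rng=random):
    """Draw one factor from scale_range and apply it to (height, width)."""
    factor = rng.uniform(*scale_range)
    height, width = base_size
    return int(round(height * factor)), int(round(width * factor))

print(random_scale((1024, 2048), (0.8, 1.0)))  # between (819, 1638) and (1024, 2048)
print(random_scale((1024, 2048), (1.0, 1.0)))  # single-scale: (1024, 2048)
```

Under this assumption, single-scale training corresponds to collapsing the range to (1, 1).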
When training in the stage TRAIN RCNN WITH RPN INIT AND DETECTION (after COMBINE RPN2 WITH RCNN1), the RCNNL1Loss is always NaN. I tried several times and set base_lr really low, but it is still NaN.
The environment is 8 P40 GPUs.
num_images 1962
cityscape_small_train gt roidb loaded from model/res50-fpn/cityscape/alternate/cache/cityscape_small_train_gt_roidb.pkl
append flipped images to roidb
filtered 3924 roidb entries: 3924 -> 0
[]
0
Traceback (most recent call last):
File "train_alternate_mask_fpn.py", line 115, in <module>
main()
File "train_alternate_mask_fpn.py", line 112, in main
args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
File "train_alternate_mask_fpn.py", line 31, in alternate_train
train_shared=False, lr=rpn_lr, lr_step=rpn_lr_step)
File "/home/zhkj/mx-maskrcnn/rcnn/tools/train_rpn.py", line 54, in train_rpn
allowed_border=9999)
File "/home/zhkj/mx-maskrcnn/rcnn/core/loader.py", line 407, in __init__
self.get_batch()
File "/home/zhkj/mx-maskrcnn/rcnn/core/loader.py", line 507, in get_batch
iroidb = [roidb[i] for i in range(islice.start, islice.stop)]
IndexError: list index out of range
How can I resolve it? Thanks for your answer!
Hi, thank you very much for sharing. I ran your demo and successfully segmented 500 pictures. The result is really wonderful!
However, how can I get the accuracy (as you mention in your table, you get an average accuracy of 26.2 on the test set)?
When I try to resume training my net, this error happens:
File "train_alternate_mask_fpn.py", line 121, in <module>
main()
File "train_alternate_mask_fpn.py", line 118, in main
args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
File "train_alternate_mask_fpn.py", line 35, in alternate_train
train_shared=False, lr=rpn_lr, lr_step=rpn_lr_step)
File "/home/liyuw/Geonet-mx-maskrcnn/rcnn/tools/train_rpn.py", line 73, in train_rpn
arg_params, aux_params = load_param(prefix, begin_epoch, convert=True)
File "/home/liyuw/Geonet-mx-maskrcnn/rcnn/utils/load_model.py", line 49, in load_param
arg_params, aux_params = load_checkpoint(prefix, epoch)
File "/home/liyuw/Geonet-mx-maskrcnn/rcnn/utils/load_model.py", line 15, in load_checkpoint
save_dict = mx.nd.load('%s-%04d.params' % (prefix, epoch))
File "./incubator-mxnet/python/mxnet/ndarray/utils.py", line 174, in load
ctypes.byref(names)))
File "./incubator-mxnet/python/mxnet/base.py", line 146, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:27:16] src/io/local_filesys.cc:166: Check failed: allow_null LocalFileSystem: fail to open "model/res50-fpn/cityscape/alternate/rpn1-0000.params"
I do not have sudo permission.
Thank you for your answer.
Hi, I converted our data to the Cityscapes format, trained successfully, and got the final.params model. But the result is:
average : 0.000 0.000
And I get no masks from the demo scripts.
Could you please suggest whether the problem is in the training process or somewhere else?
Thank you.
Is "COCO coming soon, stay tuned" a matter of days, weeks, or months?
Thanks a lot
When I run train_alternate_mask_fpn.py, I get this error: ImportError: No module named bbox. I checked that the script exists at /mx-maskrcnn/rcnn/cython/bbox.pyx, but I don't know why it can't be imported. Is it because of the .pyx extension?
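A .pyx file is Cython source and cannot be imported until it has been compiled into an extension module; elsewhere on this page the build step appears as python setup.py build_ext --inplace, run inside rcnn/cython/. A small sketch, assuming a Linux build that leaves a bbox*.so file next to the source, for checking whether that step has been done. has_compiled_extension is a hypothetical helper.

```python
import glob
import os

def has_compiled_extension(directory, module_name):
    """True if a compiled '<module_name>*.so' file exists in `directory`."""
    pattern = os.path.join(directory, module_name + '*.so')
    return len(glob.glob(pattern)) > 0

if not has_compiled_extension('rcnn/cython', 'bbox'):
    print('bbox is not built yet: run the Cython build step in rcnn/cython first')
```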
Hi,
I could see the Dropbox link about 2 hours ago on my mobile, but when I tried to download it just now, the site could no longer be accessed from either PC or mobile. Has the link been deactivated or cancelled? Can you repost the params file somewhere else?
Thanks
Hi guys,
I am encountering the following issue while training on a single TITAN X GPU:
INFO:root:Epoch[0] Batch [820] Speed: 0.53 samples/sec Train-RCNNAcc=0.753992, RCNNLogLoss=0.927809, RCNNL1Loss=4.257946, MaskACC=0.902436, MaskLogLoss=0.166465,
/home/twang/work/fine_segmentation/baselines/mx-maskrcnn/rcnn/core/metric.py:134: RuntimeWarning: invalid value encountered in greater
idx = np.where(np.logical_and(mask_prob > 0.5, mask_weight == 1))
INFO:root:Epoch[0] Batch [840] Speed: 0.56 samples/sec Train-RCNNAcc=0.753502, RCNNLogLoss=0.987536, RCNNL1Loss=24666331660818701805251597434880.000000, MaskACC=0.901641, MaskLogLoss=nan,
The RCNNL1Loss basically explodes after this RuntimeWarning.
The MaskLogLoss was okay before this RuntimeWarning, but becomes NaN afterwards.
The error occurred during the 3rd training stage: TRAIN RCNN WITH IMAGENET INIT AND RPN DETECTION
Any advice will be much appreciated. Thanks!
thanks
mxnet.base.MXNetError: [22:47:15] src/io/local_filesys.cc:154: Check failed: allow_null LocalFileSystem: fail to open "model/resnet-50-0000.params"
Hi I encountered the following problem while training:
Calling initializer with init(str, NDArray) has been deprecated.please use init(mx.init.InitDesc(...), NDArray) instead.
init_internal(k, arg_params[k])
init P5_lateral_bias
init rpn_conv_weight
init rpn_conv_bias
init rpn_conv_cls_weight
init rpn_conv_cls_bias
init P4_lateral_weight
init P4_lateral_bias
init P4_aggregate_weight
init P4_aggregate_bias
init P3_lateral_weight
init P3_lateral_bias
init P3_aggregate_weight
init P3_aggregate_bias
init P2_lateral_weight
init P2_lateral_bias
init P2_aggregate_weight
init P2_aggregate_bias
init rpn_conv_bbox_weight
init rpn_conv_bbox_bias
lr 0.004 lr_epoch_diff [6] lr_iters [8895]
[14:30:08] src/operator/././cudnn_algoreg-inl.h:112: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[14:30:12] /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/dmlc-core/include/dmlc/./logging.h:308: [14:30:12] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory
Stack trace returned 10 entries:
[bt] (0) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f71355ef3cc]
[bt] (1) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x15e) [0x7f71365e80de]
[bt] (2) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x69) [0x7f71365eb5e9]
[bt] (3) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x172088f) [0x7f713665b88f]
[bt] (4) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor19InitDataEntryMemoryEPSt6vectorINS_7NDArrayESaIS3_EE+0x2a54) [0x7f71366643b4]
[bt] (5) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor15FinishInitGraphEN4nnvm6SymbolENS2_5GraphEPNS_8ExecutorERKSt13unordered_mapINS2_9NodeEntryENS_7NDArrayENS2_13NodeEntryHashENS2_14NodeEntryEqualESaISt4pairIKS8_S9_EEE+0xa11) [0x7f713666ac81]
[bt] (6) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_St4lessISD_ESaISt4pairIKSD_S4_EEERKSt6vectorIS4_SaIS4_EESR_SR_RKSt13unordered_mapISD_NS2_6TShapeESt4hashISD_ESt8equal_toISD_ESaISG_ISH_ST_EEERKSS_ISD_iSV_SX_SaISG_ISH_iEEERKSN_INS_9OpReqTypeESaIS18_EERKSt13unordered_setISD_SV_SX_SaISD_EEPSN_INS_7NDArrayESaIS1I_EES1L_S1L_PSS_ISD_S1I_SV_SX_SaISG_ISH_S1I_EEEPNS_8ExecutorERKSS_INS2_9NodeEntryES1I_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISG_IKS1S_S1I_EEE+0x909) [0x7f713666d1e9]
[bt] (7) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(ZN5mxnet8Executor10SimpleBindEN4nnvm6SymbolERKNS_7ContextERKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES3_St4lessISC_ESaISt4pairIKSC_S3_EEERKSt6vectorIS3_SaIS3_EESQ_SQ_RKSt13unordered_mapISC_NS1_6TShapeESt4hashISC_ESt8equal_toISC_ESaISF_ISG_SS_EEERKSR_ISC_iSU_SW_SaISF_ISG_iEEERKSM_INS_9OpReqTypeESaIS17_EERKSt13unordered_setISC_SU_SW_SaISC_EEPSM_INS_7NDArrayESaIS1H_EES1K_S1K_PSR_ISC_S1H_SU_SW_SaISF_ISG_S1H_EEEPS0+0x233) [0x7f713666d813]
[bt] (8) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorSimpleBind+0x2d4a) [0x7f713662c40a]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f715bc80e40]
Traceback (most recent call last):
File "train_alternate_mask_fpn.py", line 114, in <module>
main()
File "train_alternate_mask_fpn.py", line 111, in main
args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
File "train_alternate_mask_fpn.py", line 31, in alternate_train
train_shared=False, lr=rpn_lr, lr_step=rpn_lr_step)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/rcnn/tools/train_rpn.py", line 149, in train_rpn
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/base_module.py", line 460, in fit
for_training=True, force_rebind=force_rebind)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/rcnn/core/module.py", line 141, in bind
force_rebind=False, shared_module=None)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/module.py", line 417, in bind
state_names=self._state_names)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/executor_group.py", line 231, in __init__
self.bind_exec(data_shapes, label_shapes, shared_group)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/executor_group.py", line 327, in bind_exec
shared_group))
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/executor_group.py", line 603, in _bind_ith_exec
shared_buffer=shared_data_arrays, **input_shapes)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/symbol.py", line 1479, in simple_bind
raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 1024, 2048)
bbox_weight: (1, 12, 174592)
bbox_target: (1, 12, 174592)
label: (1, 523776)
[14:30:12] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory
Stack trace returned 10 entries:
[bt] (0) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f71355ef3cc]
[bt] (1) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x15e) [0x7f71365e80de]
[bt] (2) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x69) [0x7f71365eb5e9]
[bt] (3) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x172088f) [0x7f713665b88f]
[bt] (4) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor19InitDataEntryMemoryEPSt6vectorINS_7NDArrayESaIS3_EE+0x2a54) [0x7f71366643b4]
[bt] (5) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor15FinishInitGraphEN4nnvm6SymbolENS2_5GraphEPNS_8ExecutorERKSt13unordered_mapINS2_9NodeEntryENS_7NDArrayENS2_13NodeEntryHashENS2_14NodeEntryEqualESaISt4pairIKS8_S9_EEE+0xa11) [0x7f713666ac81]
[bt] (6) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_St4lessISD_ESaISt4pairIKSD_S4_EEERKSt6vectorIS4_SaIS4_EESR_SR_RKSt13unordered_mapISD_NS2_6TShapeESt4hashISD_ESt8equal_toISD_ESaISG_ISH_ST_EEERKSS_ISD_iSV_SX_SaISG_ISH_iEEERKSN_INS_9OpReqTypeESaIS18_EERKSt13unordered_setISD_SV_SX_SaISD_EEPSN_INS_7NDArrayESaIS1I_EES1L_S1L_PSS_ISD_S1I_SV_SX_SaISG_ISH_S1I_EEEPNS_8ExecutorERKSS_INS2_9NodeEntryES1I_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISG_IKS1S_S1I_EEE+0x909) [0x7f713666d1e9]
[bt] (7) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(ZN5mxnet8Executor10SimpleBindEN4nnvm6SymbolERKNS_7ContextERKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES3_St4lessISC_ESaISt4pairIKSC_S3_EEERKSt6vectorIS3_SaIS3_EESQ_SQ_RKSt13unordered_mapISC_NS1_6TShapeESt4hashISC_ESt8equal_toISC_ESaISF_ISG_SS_EEERKSR_ISC_iSU_SW_SaISF_ISG_iEEERKSM_INS_9OpReqTypeESaIS17_EERKSt13unordered_setISC_SU_SW_SaISC_EEPSM_INS_7NDArrayESaIS1H_EES1K_S1K_PSR_ISC_S1H_SU_SW_SaISF_ISG_S1H_EEEPS0+0x233) [0x7f713666d813]
[bt] (8) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorSimpleBind+0x2d4a) [0x7f713662c40a]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f715bc80e40]
Any ideas on how to resolve this?
I encountered the above error while training. How can I solve it? Thanks!
Hello! Thanks for your amazing work!
I want to prepare my dataset like Cityscapes, but I am confused about the annotation in the Cityscapes dataset. In the load_from_seg function (rcnn/dataset/cityscape), what is the meaning of ins_id?
I tried to parse a gtFine png file and found that each value is 1000*class_id plus instance_id. Is that correct? For example, if label 23 has 8 instances, the values in the segmentation would be [23000, 23001, ..., 23007]. However, in aachen_000000_000019_gtFine_instanceIds.png the ids of label 26 are
[26003, 26004, 26005, 26006, 26007, 26008, 26009, 26010]
To sum up, I would appreciate it if you could explain what ins_id is for, and perhaps the annotation format of Cityscapes (I cannot find detailed documentation on the official website :( ). Thank you very much.
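In the Cityscapes instanceIds images, instance-level classes are stored as class_id * 1000 + instance index, while pixels without an instance annotation keep the plain class id (a value below 1000). What matters for separating objects is only that each distinct value marks one instance, so a run such as 26003 to 26010 still decodes to eight instances of class 26. A short sketch of decoding such values; decode_instance_id is an illustrative name, and the guess that ins_id in load_from_seg plays this role is an assumption.

```python
def decode_instance_id(value):
    """Split a Cityscapes instanceIds pixel value into (class_id, instance_index)."""
    if value < 1000:
        # Plain class id: no per-instance index is encoded for this pixel.
        return value, None
    return divmod(value, 1000)

print(decode_instance_id(26003))  # (26, 3)
print(decode_instance_id(23))     # (23, None)
```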
/home/cgangee/code/mxnet/mshadow/mshadow/./base.h:371:43: error: ‘CUDA_R_32I’ was not declared in this scope
static const cudaDataType_t kCudaFlag = CUDA_R_32I;
I am running train_alternate.sh, and it is really slow when training Mask R-CNN (step 2): reaching Epoch[0] Batch [2740] takes more than 4 hours (and there are 20 epochs in total!).
I am using 2 Titan X GPUs with 12 GB of memory each, so the effective minibatch size is 4 for me. How can I change it?
I find that your code may not be correct when using a multi-scale dataset.
After I fixed the scale error about the unused Config.scale, the network always gets a NaN loss when training multi-scale, especially when changing the scale size from [1024, 2048] to [640, 1024].
BTW, I did change the value of config.TRAIN.SCALE to True. Also, core/tester.py line 160 and line 226 will raise an error during evaluation if you do not pass "scale" to the function "im_detect_mask".
demo_mask.py needs to read imdb