The mx-maskrcnn from tusimple

Why dividing 3.0 for feature stride in RoIAlign?

In the following code (lines 72 to 73 in e8a05da):

    Dtype h_stride = (hend - hstart)/3.0;
    Dtype w_stride = (wend - wstart)/3.0;

you use 3.0 to compute the strides. Why 3.0? Shouldn't it be pooled_height or pooled_width?

@huangzehao @Zehaos

How can I download the cityscape database?

How can I download the cityscape database? THX

Need help!!!How to resume the training?

My training stopped at rcnn1-0005.params because someone else killed my program. How can I resume training from this checkpoint?

Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

I now hit the new problem above. I guess that my cuDNN version is different.
To avoid this error, can I compile mxnet without cuDNN?

How to test my own picture

I tried to test my own image, but I got errors. Can you describe the testing steps in detail? Thank you very much.

fps and pose estimation

During inference, can it achieve 5+ fps?
Does the current version support pose estimation?

Advice: train“-”alternate.sh

A small note about a mismatch: the documentation refers to train“_”alternate.sh, but the file is named train“-”alternate.sh.

Error "has no attribute 'cv'" in cv2.cvtColor(im, cv2.cv.CV_RGB2BGR)

I get the error "has no attribute 'cv'" on the line "im = cv2.cvtColor(im, cv2.cv.CV_RGB2BGR)" when I run demo.sh. Can anyone solve it? Thanks!
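A possible workaround, offered as a sketch rather than a confirmed fix for this repository: the cv2.cv submodule exists only in OpenCV 2.x, while the constant cv2.COLOR_RGB2BGR is available directly on the cv2 module. The helper name rgb2bgr_flag is hypothetical, and stand-in objects are used so the snippet runs without OpenCV installed.

```python
from types import SimpleNamespace


def rgb2bgr_flag(cv2_module):
    """Return the RGB->BGR conversion flag for whichever OpenCV API is present."""
    # Name that lives directly on the cv2 module.
    if hasattr(cv2_module, "COLOR_RGB2BGR"):
        return cv2_module.COLOR_RGB2BGR
    # Legacy OpenCV 2.x name, which lives in the cv2.cv submodule.
    return cv2_module.cv.CV_RGB2BGR


# Stand-ins that mimic the two API layouts (the value 4 is only a placeholder).
new_api = SimpleNamespace(COLOR_RGB2BGR=4)
old_api = SimpleNamespace(cv=SimpleNamespace(CV_RGB2BGR=4))

print(rgb2bgr_flag(new_api), rgb2bgr_flag(old_api))
```

With a real installation the call would become cv2.cvtColor(im, rgb2bgr_flag(cv2)); simply replacing cv2.cv.CV_RGB2BGR with cv2.COLOR_RGB2BGR in the failing line should have the same effect.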

When I run bash scripts/train_alternate.sh, I get the error below:

Traceback (most recent call last):
File "train_alternate_mask_fpn.py", line 114, in <module>
main()
File "train_alternate_mask_fpn.py", line 111, in main
args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
File "train_alternate_mask_fpn.py", line 31, in alternate_train
train_shared=False, lr=rpn_lr, lr_step=rpn_lr_step)
File "/home/luo/mx-maskrcnn-master/rcnn/tools/train_rpn.py", line 149, in train_rpn
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/base_module.py", line 460, in fit
for_training=True, force_rebind=force_rebind)
File "/home/luo/mx-maskrcnn-master/rcnn/core/module.py", line 141, in bind
force_rebind=False, shared_module=None)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/module.py", line 417, in bind
state_names=self._state_names)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/executor_group.py", line 231, in __init__
self.bind_exec(data_shapes, label_shapes, shared_group)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/executor_group.py", line 327, in bind_exec
shared_group))
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/module/executor_group.py", line 603, in _bind_ith_exec
shared_buffer=shared_data_arrays, **input_shapes)
File "/home/luo/mx-maskrcnn-master/incubator-mxnet/python/mxnet/symbol/symbol.py", line 1491, in simple_bind
raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 1024, 2048)
bbox_weight: (1, 12, 174592)
bbox_target: (1, 12, 174592)
label: (1, 523776)
[15:49:36] src/storage/storage.cc:59: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid device ordinal

What should I do? Thanks.

The number of sampled regular locations

The number of sampled regular locations in your implementation seems to be 3.

Dtype h_stride = (hend - hstart)/3.0;
Dtype w_stride = (wend - wstart)/3.0;

But the authors sample 4 regular locations.

Is 3 better than 4?
Do you observe diminishing returns as the number of regular locations increases?
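One possible reading of the two quoted lines, sketched under an assumption that the issue does not confirm: if the sampling loop starts one stride after the bin start and stops one stride before the bin end, then dividing by 3.0 gives two interior points per axis, which is 2 x 2 = 4 sample locations per bin. The function name sample_points is hypothetical.

```python
def sample_points(start, end, parts=3.0):
    """Interior sample coordinates along one axis of a pooling bin.

    Assumption (not shown in the issue): sampling starts one stride after
    `start` and stops one stride before `end`.
    """
    stride = (end - start) / parts
    n_interior = int(parts) - 1
    return [start + stride * (i + 1) for i in range(n_interior)]


hs = sample_points(0.0, 3.0)   # vertical coordinates inside the bin
ws = sample_points(0.0, 3.0)   # horizontal coordinates inside the bin
grid = [(h, w) for h in hs for w in ws]
print(hs, len(grid))
```

Under that assumption the 3.0 would describe the spacing between samples, not the number of samples, so it would not need to depend on pooled_height or pooled_width.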

Train on own dataset

Any instructions on how to train it on our own dataset?

AttributeError: 'module' object has no attribute 'ROIAlign' When I run demo.sh

I got some warnings when I ran make:

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/cuda/include -I/usr/include/python2.7 -c gpu_nms.cpp -o build/temp.linux-x86_64-2.7/gpu_nms.o -Wno-unused-function
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1788:0,
from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4,
from gpu_nms.cpp:499:
/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
c++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wl,-Bsymbolic-functions -Wl,-z,relro -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/nms_kernel.o build/temp.linux-x86_64-2.7/gpu_nms.o -L/usr/local/cuda/lib64 -Wl,-R/usr/local/cuda/lib64 -lcudart -o /home/lzq12138/lkd/maskrcnn/mx-maskrcnn/rcnn/cython/gpu_nms.so
cd rcnn/pycocotools; python2 setup.py build_ext --inplace; rm -rf build; cd ../../
Warning: Extension name '_mask' does not match fully qualified name 'rcnn.pycocotools._mask' of '_mask.pyx'
Compiling _mask.pyx because it depends on /usr/local/lib/python2.7/dist-packages/Cython/Includes/numpy/init.pxd.

I can run bash scripts/train_alternate.sh successfully. I then cancelled the training with Ctrl+C and switched to bash scripts/demo.sh:

Error in CustomOp.forward: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/mxnet/operator.py", line 782, in forward_entry
aux=tensors[4])
File "/home/lzq12138/lkd/maskrcnn/mx-maskrcnn/rcnn/PY_OP/fpn_roi_pooling.py", line 76, in forward
roi_pool = mx.nd.ROIAlign(feat_dict['stride%s' % s], _rois, (self._pool_h, self._pool_w), 1.0 / float(s))
AttributeError: 'module' object has no attribute 'ROIAlign'

[00:34:38] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [00:34:38] src/operator/custom/custom.cc:293: Check failed: reinterpret_cast(params.info->callbacks[kCustomOpForward])( ptrs.size(), ptrs.data(), tags.data(), reinterpret_cast<const int*>(req.data()), static_cast(ctx.is_train), params.info->contexts[kCustomOpForward])

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x272c4c) [0x7f1961857c4c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x33ffaf) [0x7f1961924faf]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x20ef112) [0x7f19636d4112]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(MXExecutorForward+0x15) [0x7f1963665675]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f19d312ce40]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f19d312c8ab]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f19d333c3df]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f19d3340d82]
[bt] (8) python2(PyObject_Call+0x43) [0x4b0cb3]
[bt] (9) python2(PyEval_EvalFrameEx+0x5faf) [0x4c9faf]

Demo test got all "NaN" scores

I am raising this issue again.
My dataset has 5 classes. Here are the relevant entries of my config:

config.NUM_CLASSES = 5
config.SCALES = [(1024, 2048)]
config.CLASS_ID = [0, 1, 2, 3, 4]
config.TRAIN.SCALE = True (!!! This term you never use in your project!!!)
default.rpn_epoch = 4
default.rcnn_epoch = 12
dataset.ObjectSnap.NUM_CLASSES = 5
dataset.ObjectSnap.CLASS_ID = [0, 1, 2, 3, 4]
dataset.ObjectSnap.SCALES = [(1024, 2048)]

I built a small dataset of 41 images for verification. The net trains well. Part of my log:

Epoch[2] Batch [20] Speed: 10.90 samples/sec Train-RPNAcc=0.903181, RPNLogLoss=0.238827, RPNL1Loss=0.933737,
Epoch[2] Batch [40] Speed: 12.69 samples/sec Train-RPNAcc=0.919255, RPNLogLoss=0.200416, RPNL1Loss=0.927370,
Epoch[2] Train-RPNAcc=0.919255
Epoch[2] Train-RPNLogLoss=0.200416
Epoch[2] Train-RPNL1Loss=0.927370
Epoch[2] Time cost=7.387

Epoch[0] Batch [20] Speed: 1.18 samples/sec Train-RCNNAcc=0.936756, RCNNLogLoss=0.921319, RCNNL1Loss=1.693366, MaskACC=0.981722, MaskLogLoss=0.033281,
Epoch[0] Batch [40] Speed: 1.28 samples/sec Train-RCNNAcc=0.946408, RCNNLogLoss=0.617575, RCNNL1Loss=1.681111, MaskACC=0.983959, MaskLogLoss=0.030308,
Epoch[0] Train-RCNNAcc=0.946408
Epoch[0] Train-RCNNLogLoss=0.617575
Epoch[0] Train-RCNNL1Loss=1.681111
Epoch[0] Train-MaskACC=0.983959
Epoch[0] Train-MaskLogLoss=0.030308
Epoch[0] Time cost=67.991

Finally, I get my final-0000.params model.
However, when I test several images from my training list, I get all NaN scores in both the demo test and the evaluation test. How could that happen?

PS: One more doubt: the log shows a very high Acc (around 0.8~0.9) even at the very beginning of training.

I hope to get your reply soon. Thanks!

MXNetError: [11:03:30] /mx-maskrcnn/incubator-mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch

It seems to be out of resources, but I have 8 GB of GPU memory and not all of it is in use. Why does this happen?

MXNetError: [11:03:30] /media/jintian/Netac/CodeSpace/ng/auto_car/mx-maskrcnn/incubator-mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch

Ask for: The commits which affect the performance of net.

I am both sad and happy that you found the flip-mask mistake. I think it affected the effectiveness of my trained net (which cost more than 6 days). So I wonder: are there other commits that will affect the training of the net?

Thanks! I still admire your work!

Out of array boundary?

In mx-maskrcnn/rcnn/CXX_OP/roi_align.cu, lines 77 to 78 in 36910cf:

    int hlow = min(max(static_cast<int>(floor(h)), 0), height-1);
    int hhigh = hlow + 1;

The maximum possible value of hlow seems to be height-1, so the maximum value of hhigh is height. Is there a risk that the bottom-most or right-most value is read from outside the feature array boundary?

Also, in these two lines of mx-maskrcnn/rcnn/CXX_OP/roi_align.cu (lines 86 to 87 in 36910cf),

    Dtype alpha = (hlow == hhigh) ? static_cast<Dtype>(0.5) : (h - hlow) / (hhigh - hlow);
    Dtype beta = (wleft == wright) ? static_cast<Dtype>(0.5) : (w - wleft) / (wright - wleft);

hlow is never equal to hhigh, so the denominator always seems to be 1.
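The concern can be reproduced with a small Python model of the quoted index logic. This is only a sketch: neighbours mirrors the two quoted lines, while neighbours_clamped is a hypothetical variant that also clamps the upper index and is not taken from the repository.

```python
import math


def neighbours(h, height):
    """Row indices used for interpolation, mirroring the two quoted lines."""
    hlow = min(max(int(math.floor(h)), 0), height - 1)
    hhigh = hlow + 1          # may equal `height`, one past the last row
    return hlow, hhigh


def neighbours_clamped(h, height):
    """Hypothetical variant that also clamps the upper neighbour."""
    hlow = min(max(int(math.floor(h)), 0), height - 1)
    hhigh = min(hlow + 1, height - 1)
    return hlow, hhigh


def alpha(h, hlow, hhigh):
    """Interpolation weight, with the same 0.5 fallback as the quoted code."""
    return 0.5 if hlow == hhigh else (h - hlow) / float(hhigh - hlow)


height = 8
print(neighbours(7.4, height))          # upper index reaches `height`
print(neighbours_clamped(7.4, height))  # both indices stay in range
```

With the clamped variant the hlow == hhigh branch becomes reachable at the last row, which is the only case where the 0.5 fallback would matter.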

src/operator/./crop-inl.h:82: Check failed: req[crop_enum::kOut] == kWriteTo (0 vs. 1)

Ubuntu 14.04.2
GPU:TITAN X;
Driver Version: 375.39
CUDA8.0;
cudnn 5.1.5
python2.7;
numpy1.8.2
mxnet 0.12.0

How to solve this problem? Thanks!

libPlayCtrl.so undefined symbol: AR_SetParam

When I compile mxnet, I get this error: OSError: libPlayCtrl.so: undefined symbol: AR_SetParam. Could you help me solve this problem? I usually use the Caffe framework, and this is my first time using mxnet. Thank you.

How can I test my own picture and how to train my own datasets

demo problem

mkdir: cannot create directory 'data/cityscape/results/': File exists
Namespace(dataset='Cityscape', dataset_path='data/cityscape', epoch=0, gpu=0, has_rpn=True, image_set='val', network='resnet_fpn', prefix='model/final', proposal='rpn', result_path='data/cityscape/results/', root_path='data', shuffle=False, thresh=0.001, vis=False)
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8],
'CLASS_ID': [0, 24, 25, 26, 27, 28, 31, 32, 33],
'FIXED_PARAMS': ['conv0', 'stage1', 'gamma', 'beta'],
'FIXED_PARAMS_SHARED': ['conv0',
'stage1',
'stage2',
'stage3',
'stage4',
'P5',
'P4',
'P3',
'P2',
'gamma',
'beta'],
'NUM_ANCHORS': 3,
'NUM_CLASSES': 9,
'PIXEL_MEANS': array([0, 0, 0]),
'RCNN_FEAT_STRIDE': [32, 16, 8, 4],
'ROIALIGN': True,
'RPN_FEAT_STRIDE': [64, 32, 16, 8, 4],
'SCALES': [(1024, 2048)],
'TEST': {'BATCH_IMAGES': 1,
'HAS_RPN': True,
'NMS': 0.3,
'PROPOSAL_MIN_SIZE': [64, 32, 16, 8, 4],
'PROPOSAL_NMS_THRESH': 0.7,
'PROPOSAL_POST_NMS_TOP_N': 2000,
'PROPOSAL_PRE_NMS_TOP_N': 20000,
'RPN_MIN_SIZE': [64, 32, 16, 8, 4],
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 1000,
'RPN_PRE_NMS_TOP_N': 6000},
'TRAIN': {'ASPECT_GROUPING': True,
'BATCH_IMAGES': 1,
'BATCH_ROIS': 256,
'BBOX_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZATION_PRECOMPUTED': False,
'BBOX_REGRESSION_THRESH': 0.5,
'BBOX_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_WEIGHTS': array([ 1., 1., 1., 1.]),
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'RPN_BATCH_SIZE': 256,
'RPN_BBOX_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': [64, 32, 16, 8, 4],
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALE': True,
'SCALE_RANGE': [0.8, 1]}}
num_images 500
cityscape_val gt roidb loaded from data/cache/cityscape_val_gt_roidb.pkl
scripts/demo.sh: line 24: 31359 Segmentation fault (core dumped) python demo_mask.py --network resnet_fpn --dataset ${DATASET} --image_set ${TEST_SET} --prefix ${PREFIX} --result_path ${RESULT_PATH} --has_rpn --epoch 0 --gpu 0
This is my run log. I have finished installing mxnet, but when I run demo.sh I hit this problem. Can anyone help?

make cython failed

When I built the Cython code, it aborted with the error message below. I am running this in a virtual machine. I know it is a problem related to the GPU or CPU, but I don't know how to fix it. Can someone help me?

cd rcnn/cython/; python setup.py build_ext --inplace; rm -rf build; cd ../../
Traceback (most recent call last):
File "setup.py", line 58, in <module>
CUDA = locate_cuda()
File "setup.py", line 46, in locate_cuda
raise EnvironmentError('The nvcc binary could not be '
EnvironmentError: The nvcc binary could not be located in your $PATH. Either add it to your path, or set $CUDAHOME
cd rcnn/pycocotools; python setup.py build_ext --inplace; rm -rf build; cd ../../
Warning: Extension name '_mask' does not match fully qualified name 'rcnn.pycocotools._mask' of '_mask.pyx'
running build_ext
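The error message names two places where the build looks for nvcc: $CUDAHOME and $PATH. A minimal sketch of such a lookup, written only from that message (it is not the repository's setup.py, and the function name find_nvcc is hypothetical), can be used to check an environment before running make:

```python
import os


def find_nvcc(environ, exists=os.path.exists):
    """Return a path to nvcc, or None if it cannot be found.

    Looks at $CUDAHOME first, then at every directory listed in $PATH.
    """
    cuda_home = environ.get("CUDAHOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if exists(candidate):
            return candidate
    for directory in environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, "nvcc")
        if directory and exists(candidate):
            return candidate
    return None


print(find_nvcc(os.environ))
```

If this prints None, the CUDA toolkit is either not installed or not visible to the shell, which would be expected on a virtual machine without the toolkit.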

Error in "_make_data_and_labels" -- Use my dataset to train net

I converted my dataset to the Cityscapes format to use this net. Of course, I modified several places for the dataset file index and class_id.
However, training crashes after finishing the first RPN training stage.
The crash happened in
data_on_imgs['img_%s' % im_i]['bbox_targets_on_levels']['stride%s' % s] = np.concatenate([_bbox_targets, bbox_targets_pad])

data_on_imgs['img_%s' % im_i]['bbox_weights_on_levels']['stride%s' % s] = np.concatenate([_bbox_weights, bbox_weights_pad])

File "mx-maskrcnn/rcnn/core/loader.py", line 278, in _make_data_and_labels
ValueError: all the input array dimensions except for the concatenation axis must match exactly

PS: I check the pairs of corresponding img and label, they all get the same size in pairs.

Really Thank you guys for these a lot of help!

fail to "make"

When I build related cython code in step 4, I encountered the following error:

How could I solve it?
Thanks!

Check failed: err == cudaSuccess (7 vs. 0) Name: col2im_gpu_kernel ErrStr:too many resources requested for launch

The terminal suggests "you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb)." What should I do?
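A minimal sketch of following the hint from Python instead of the shell. Setting the variable before MXNet is imported is an assumption made here to be safe; the variable name and value come from the message itself.

```python
import os

# Set the variable in this process; done before any `import mxnet` to be safe.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

print(os.environ["MXNET_ENGINE_TYPE"])
```

The message suggests this mode for debugging, so it is presumably meant to make the failing operator easier to locate rather than to fix the error itself.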

This application failed to start because it could not find or load the Qt platform plugin "xcb" in "". Available platform plugins are: minimal, offscreen, xcb.

Hi, I encountered the following problem while running scripts/demo_single_image.sh:

~/mx-maskrcnn$ bash scripts/demo_single_image.sh
Namespace(dataset='Cityscape', epoch=0, gpu=0, image_name='figures/test.jpg', network='resnet_fpn', prefix='model/final', thresh=0.3, vis=True)
This application failed to start because it could not find or load the Qt platform plugin "xcb"
in "".

Available platform plugins are: minimal, offscreen, xcb.

Reinstalling the application may fix this problem.
scripts/demo_single_image.sh: line 18: 18915 Aborted (core dumped) python2 -m rcnn.tools.demo_single_image --network resnet_fpn --dataset ${DATASET} --prefix ${PREFIX} --epoch 0 --gpu 0 --image_name figures/test.jpg --thresh 0.3 --vis true

Thank you for your answer!!

training hardware and time info?

How long did the full training take for the provided training script? Also, what hardware did you use?

Full of errors.

Thanks for this "wonderful" open-source project, but I have to say the code is really shit. I opened an issue reporting a bug and was pointed to a commit; after I tried it, the same errors happened again. The technical people are too simple.

src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

I tried to train the model on the COCO dataset, but when it trains the RCNN part, it reports out of memory.
Could you give me some hints to solve this problem?
My GPU is a GTX 1080 with 8 GB.
Here is some information printed during training:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8],
'CLASS_ID': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]),
'FIXED_PARAMS': ['conv0', 'stage1', 'gamma', 'beta'],
'FIXED_PARAMS_SHARED': ['conv0',
'stage1',
'stage2',
'stage3',
'stage4',
'P5',
'P4',
'P3',
'P2',
'gamma',
'beta'],
'NUM_ANCHORS': 3,
'NUM_CLASSES': 81,
'PIXEL_MEANS': array([0, 0, 0]),
'RCNN_FEAT_STRIDE': [32, 16, 8, 4],
'ROIALIGN': True,
'RPN_FEAT_STRIDE': [64, 32, 16, 8, 4],
'SCALES': [(800, 1000)],
'TEST': {'BATCH_IMAGES': 1,
'HAS_RPN': True,
'NMS': 0.3,
'PROPOSAL_MIN_SIZE': [64, 32, 16, 8, 4],
'PROPOSAL_NMS_THRESH': 0.7,
'PROPOSAL_POST_NMS_TOP_N': 2000,
'PROPOSAL_PRE_NMS_TOP_N': 20000,
'RPN_MIN_SIZE': [64, 32, 16, 8, 4],
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 1000,
'RPN_PRE_NMS_TOP_N': 6000},
'TRAIN': {'ASPECT_GROUPING': True,
'BATCH_IMAGES': 1,
'BATCH_ROIS': 256,
'BBOX_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZATION_PRECOMPUTED': False,
'BBOX_REGRESSION_THRESH': 0.5,
'BBOX_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_WEIGHTS': array([ 1., 1., 1., 1.]),
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'RPN_BATCH_SIZE': 32,
'RPN_BBOX_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': [64, 32, 16, 8, 4],
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALE': True,
'SCALE_RANGE': [0.8, 1]}}

About multi-scale training

Your implementation uses multi-scale training:

config.TRAIN.SCALE_RANGE = (0.8, 1)

But the authors resize the shorter edge to 800 pixels on COCO and use 2048×1024 pixels on Cityscapes.

The FPN backbone should already be fairly scale invariant (I'm not sure whether this description is accurate).

I am looking forward to single-scale training results.

Training on the Cityscapes dataset gives RCNNL1Loss=nan?

When training in the stage TRAIN RCNN WITH RPN INIT AND DETECTION (after COMBINE RPN2 WITH RCNN1), the RCNNL1Loss is always nan. I tried several times and set base_lr really low, but it is still nan.
The environment is P40, 8 GPUs.

When training on my own pictures, I get the error below

num_images 1962
cityscape_small_train gt roidb loaded from model/res50-fpn/cityscape/alternate/cache/cityscape_small_train_gt_roidb.pkl
append flipped images to roidb
filtered 3924 roidb entries: 3924 -> 0
[]
0
Traceback (most recent call last):
File "train_alternate_mask_fpn.py", line 115, in <module>
main()
File "train_alternate_mask_fpn.py", line 112, in main
args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
File "train_alternate_mask_fpn.py", line 31, in alternate_train
train_shared=False, lr=rpn_lr, lr_step=rpn_lr_step)
File "/home/zhkj/mx-maskrcnn/rcnn/tools/train_rpn.py", line 54, in train_rpn
allowed_border=9999)
File "/home/zhkj/mx-maskrcnn/rcnn/core/loader.py", line 407, in __init__
self.get_batch()
File "/home/zhkj/mx-maskrcnn/rcnn/core/loader.py", line 507, in get_batch
iroidb = [roidb[i] for i in range(islice.start, islice.stop)]
IndexError: list index out of range

How can I resolve it? Thanks for your answer!
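The log line "filtered 3924 roidb entries: 3924 -> 0" shows that every entry was dropped before the loader ran, so the IndexError is a consequence of an empty roidb. The sketch below illustrates that failure mode with an explicit guard. The filtering criterion (at least one box) and the "boxes" key are assumptions for illustration, not the repository's actual rule.

```python
def filter_roidb(roidb):
    """Keep only entries with at least one ground-truth box (assumed criterion)."""
    kept = [entry for entry in roidb if len(entry.get("boxes", [])) > 0]
    print("filtered %d roidb entries: %d -> %d"
          % (len(roidb) - len(kept), len(roidb), len(kept)))
    if not kept:
        # Fail early with a readable message instead of an IndexError later.
        raise ValueError("no usable roidb entries; check annotations and class ids")
    return kept


sample = [{"boxes": [[10, 20, 50, 80]]}, {"boxes": []}]
usable = filter_roidb(sample)
```

When all entries are filtered out, a likely place to look is whether the annotation files and class ids of the custom dataset match what the dataset loader expects; a stale cached roidb .pkl may also be worth deleting.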

How to evaluate the result?

Hi, thank you very much for sharing. I ran your demo and successfully segmented 500 pictures. The result is really wonderful!

However, how can I get the accuracy (as you mention in your table, you get an average accuracy of 26.2 on the test set)?

Error when resume to train net

When I try to resume my net, this error happened.

File "train_alternate_mask_fpn.py", line 121, in <module>
main()
File "train_alternate_mask_fpn.py", line 118, in main
args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
File "train_alternate_mask_fpn.py", line 35, in alternate_train
train_shared=False, lr=rpn_lr, lr_step=rpn_lr_step)
File "/home/liyuw/Geonet-mx-maskrcnn/rcnn/tools/train_rpn.py", line 73, in train_rpn
arg_params, aux_params = load_param(prefix, begin_epoch, convert=True)
File "/home/liyuw/Geonet-mx-maskrcnn/rcnn/utils/load_model.py", line 49, in load_param
arg_params, aux_params = load_checkpoint(prefix, epoch)
File "/home/liyuw/Geonet-mx-maskrcnn/rcnn/utils/load_model.py", line 15, in load_checkpoint
save_dict = mx.nd.load('%s-%04d.params' % (prefix, epoch))
File "./incubator-mxnet/python/mxnet/ndarray/utils.py", line 174, in load
ctypes.byref(names)))
File "./incubator-mxnet/python/mxnet/base.py", line 146, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))

mxnet.base.MXNetError: [19:27:16] src/io/local_filesys.cc:166: Check failed: allow_null LocalFileSystem: fail to open "model/res50-fpn/cityscape/alternate/rpn1-0000.params"
I do not have the sudo permission.
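The traceback shows that the file name is built as '%s-%04d.params' % (prefix, epoch), so resuming requires the checkpoint for the requested prefix and epoch to exist on disk. A small sketch for checking this before starting a run (the helper names checkpoint_path and missing_checkpoints are hypothetical):

```python
import os


def checkpoint_path(prefix, epoch):
    """Build the parameter file name using the pattern from the traceback."""
    return "%s-%04d.params" % (prefix, epoch)


def missing_checkpoints(prefix, epochs, exists=os.path.exists):
    """Return the checkpoint files that are not on disk."""
    return [checkpoint_path(prefix, e) for e in epochs
            if not exists(checkpoint_path(prefix, e))]


path = checkpoint_path("model/res50-fpn/cityscape/alternate/rpn1", 0)
print(path)
```

Here the loader asks for epoch 0 of rpn1, which suggests that the begin epoch passed when resuming does not match any saved checkpoint; sudo permission is probably unrelated.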

How do I make data sets like cityscapes using own images?

Thank you for your answer

I got no mask on demo scripts

Hi, I converted our data to the Cityscapes format, trained on it successfully, and got the final.params model.

But when I evaluate it, got the following result:
##################################################
what : AP AP_50%
##################################################
person : 0.000 0.000
rider : nan nan
car : nan nan
truck : nan nan
bus : nan nan
train : nan nan
motorcycle : nan nan
bicycle : nan nan

average : 0.000 0.000

And got no mask on demo scripts.
Could you please help me figure out whether the problem happened in the training process or has some other reason?

Thank you.

COCO, how much coming soon?

Is "COCO coming soon, stay tuned" about days, weeks, or months?
Thanks a lot

Tets

ImportError: No module named bbox

When I run train_alternate_mask_fpn.py, I get this error: ImportError: No module named bbox. I checked and the script is there in /mx-maskrcnn/rnn/cython/bbox.pyx, but I don't know why it can't be imported. Is it because it is a .pyx file?

Can not access model on dropbox anymore

Hi,
I could see the drop box link about 2h ago on my mobile, but when I tried again now to download it, the site can not be accessed anymore, whether from PC or mobile. Is the link de-activated or cancelled? Can you repost the param file somewhere else?
Thanks
Tets

Can you share the Cityscapes dataset?

Abnormal RCNNL1Loss and MaskLogLoss after RuntimeWarning

Hi guys,

I am encountering the following issue while training on a single TITAN X GPU:

INFO:root:Epoch[0] Batch [820] Speed: 0.53 samples/sec Train-RCNNAcc=0.753992, RCNNLogLoss=0.927809, RCNNL1Loss=4.257946, MaskACC=0.902436, MaskLogLoss=0.166465,
/home/twang/work/fine_segmentation/baselines/mx-maskrcnn/rcnn/core/metric.py:134: RuntimeWarning: invalid value encountered in greater
idx = np.where(np.logical_and(mask_prob > 0.5, mask_weight == 1))
INFO:root:Epoch[0] Batch [840] Speed: 0.56 samples/sec Train-RCNNAcc=0.753502, RCNNLogLoss=0.987536, RCNNL1Loss=24666331660818701805251597434880.000000, MaskACC=0.901641, MaskLogLoss=nan,

The RCNNL1Loss basically explodes after this RuntimeWarning.
The MaskLogLoss was okay before this RuntimeWarning, but becomes NaN afterwards.

The error occurred during the 3rd training stage: TRAIN RCNN WITH IMAGENET INIT AND RPN DETECTION

Any advice will be much appreciated. Thanks!

AttributeError: 'module' object has no attribute 'ROIAlign'

thanks

multi gpu batch norm is synchronized or not

Check failed: allow_null LocalFileSystem: fail to open "model/resnet-50-0000.params"

mxnet.base.MXNetError: [22:47:15] src/io/local_filesys.cc:154: Check failed: allow_null LocalFileSystem: fail to open "model/resnet-50-0000.params"

Problem while training

Hi I encountered the following problem while training:

Calling initializer with init(str, NDArray) has been deprecated.please use init(mx.init.InitDesc(...), NDArray) instead.
init_internal(k, arg_params[k])
init P5_lateral_bias
init rpn_conv_weight
init rpn_conv_bias
init rpn_conv_cls_weight
init rpn_conv_cls_bias
init P4_lateral_weight
init P4_lateral_bias
init P4_aggregate_weight
init P4_aggregate_bias
init P3_lateral_weight
init P3_lateral_bias
init P3_aggregate_weight
init P3_aggregate_bias
init P2_lateral_weight
init P2_lateral_bias
init P2_aggregate_weight
init P2_aggregate_bias
init rpn_conv_bbox_weight
init rpn_conv_bbox_bias
lr 0.004 lr_epoch_diff [6] lr_iters [8895]
[14:30:08] src/operator/././cudnn_algoreg-inl.h:112: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[14:30:12] /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/dmlc-core/include/dmlc/./logging.h:308: [14:30:12] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f71355ef3cc]
[bt] (1) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x15e) [0x7f71365e80de]
[bt] (2) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x69) [0x7f71365eb5e9]
[bt] (3) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x172088f) [0x7f713665b88f]
[bt] (4) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor19InitDataEntryMemoryEPSt6vectorINS_7NDArrayESaIS3_EE+0x2a54) [0x7f71366643b4]
[bt] (5) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor15FinishInitGraphEN4nnvm6SymbolENS2_5GraphEPNS_8ExecutorERKSt13unordered_mapINS2_9NodeEntryENS_7NDArrayENS2_13NodeEntryHashENS2_14NodeEntryEqualESaISt4pairIKS8_S9_EEE+0xa11) [0x7f713666ac81]
[bt] (6) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_St4lessISD_ESaISt4pairIKSD_S4_EEERKSt6vectorIS4_SaIS4_EESR_SR_RKSt13unordered_mapISD_NS2_6TShapeESt4hashISD_ESt8equal_toISD_ESaISG_ISH_ST_EEERKSS_ISD_iSV_SX_SaISG_ISH_iEEERKSN_INS_9OpReqTypeESaIS18_EERKSt13unordered_setISD_SV_SX_SaISD_EEPSN_INS_7NDArrayESaIS1I_EES1L_S1L_PSS_ISD_S1I_SV_SX_SaISG_ISH_S1I_EEEPNS_8ExecutorERKSS_INS2_9NodeEntryES1I_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISG_IKS1S_S1I_EEE+0x909) [0x7f713666d1e9]
[bt] (7) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(ZN5mxnet8Executor10SimpleBindEN4nnvm6SymbolERKNS_7ContextERKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES3_St4lessISC_ESaISt4pairIKSC_S3_EEERKSt6vectorIS3_SaIS3_EESQ_SQ_RKSt13unordered_mapISC_NS1_6TShapeESt4hashISC_ESt8equal_toISC_ESaISF_ISG_SS_EEERKSR_ISC_iSU_SW_SaISF_ISG_iEEERKSM_INS_9OpReqTypeESaIS17_EERKSt13unordered_setISC_SU_SW_SaISC_EEPSM_INS_7NDArrayESaIS1H_EES1K_S1K_PSR_ISC_S1H_SU_SW_SaISF_ISG_S1H_EEEPS0+0x233) [0x7f713666d813]
[bt] (8) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorSimpleBind+0x2d4a) [0x7f713662c40a]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f715bc80e40]

Traceback (most recent call last):
File "train_alternate_mask_fpn.py", line 114, in <module>
main()
File "train_alternate_mask_fpn.py", line 111, in main
args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
File "train_alternate_mask_fpn.py", line 31, in alternate_train
train_shared=False, lr=rpn_lr, lr_step=rpn_lr_step)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/rcnn/tools/train_rpn.py", line 149, in train_rpn
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/base_module.py", line 460, in fit
for_training=True, force_rebind=force_rebind)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/rcnn/core/module.py", line 141, in bind
force_rebind=False, shared_module=None)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/module.py", line 417, in bind
state_names=self._state_names)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/executor_group.py", line 231, in __init__
self.bind_exec(data_shapes, label_shapes, shared_group)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/executor_group.py", line 327, in bind_exec
shared_group))
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/module/executor_group.py", line 603, in _bind_ith_exec
shared_buffer=shared_data_arrays, **input_shapes)
File "/home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/symbol.py", line 1479, in simple_bind
raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 1024, 2048)
bbox_weight: (1, 12, 174592)
bbox_target: (1, 12, 174592)
label: (1, 523776)
[14:30:12] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f71355ef3cc]
[bt] (1) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x15e) [0x7f71365e80de]
[bt] (2) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x69) [0x7f71365eb5e9]
[bt] (3) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x172088f) [0x7f713665b88f]
[bt] (4) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor19InitDataEntryMemoryEPSt6vectorINS_7NDArrayESaIS3_EE+0x2a54) [0x7f71366643b4]
[bt] (5) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor15FinishInitGraphEN4nnvm6SymbolENS2_5GraphEPNS_8ExecutorERKSt13unordered_mapINS2_9NodeEntryENS_7NDArrayENS2_13NodeEntryHashENS2_14NodeEntryEqualESaISt4pairIKS8_S9_EEE+0xa11) [0x7f713666ac81]
[bt] (6) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_St4lessISD_ESaISt4pairIKSD_S4_EEERKSt6vectorIS4_SaIS4_EESR_SR_RKSt13unordered_mapISD_NS2_6TShapeESt4hashISD_ESt8equal_toISD_ESaISG_ISH_ST_EEERKSS_ISD_iSV_SX_SaISG_ISH_iEEERKSN_INS_9OpReqTypeESaIS18_EERKSt13unordered_setISD_SV_SX_SaISD_EEPSN_INS_7NDArrayESaIS1I_EES1L_S1L_PSS_ISD_S1I_SV_SX_SaISG_ISH_S1I_EEEPNS_8ExecutorERKSS_INS2_9NodeEntryES1I_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISG_IKS1S_S1I_EEE+0x909) [0x7f713666d1e9]
[bt] (7) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet8Executor10SimpleBindEN4nnvm6SymbolERKNS_7ContextERKSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES3_St4lessISC_ESaISt4pairIKSC_S3_EEERKSt6vectorIS3_SaIS3_EESQ_SQ_RKSt13unordered_mapISC_NS1_6TShapeESt4hashISC_ESt8equal_toISC_ESaISF_ISG_SS_EEERKSR_ISC_iSU_SW_SaISF_ISG_iEEERKSM_INS_9OpReqTypeESaIS17_EERKSt13unordered_setISC_SU_SW_SaISC_EEPSM_INS_7NDArrayESaIS1H_EES1K_S1K_PSR_ISC_S1H_SU_SW_SaISF_ISG_S1H_EEEPS0_+0x233) [0x7f713666d813]
[bt] (8) /home/cougarnet.uh.edu/csmailis/mx-maskrcnn/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorSimpleBind+0x2d4a) [0x7f713662c40a]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f715bc80e40]

Any ideas on how to resolve this?

src/storage/storage.cc:59: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid device ordinal

I encountered the above error while training. How can I solve it? Thanks!
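
CUDA reports "invalid device ordinal" when a requested GPU index does not exist on the machine, for example asking for GPU 2 on a box where only indices 0 and 1 are valid. Below is a minimal sketch of checking a comma-separated GPU list before training starts. The helper name `check_gpu_ids` and the idea of passing the device count by hand are assumptions made for illustration; neither is part of this repository or of MXNet.

```python
def check_gpu_ids(gpu_arg, num_devices):
    """Parse a comma-separated GPU list and reject out-of-range indices.

    gpu_arg     -- string such as "0,1" (hypothetical command-line value)
    num_devices -- number of GPUs actually visible on the machine
    """
    # Turn "0,1" into [0, 1], ignoring empty tokens such as a trailing comma.
    ids = [int(tok) for tok in gpu_arg.split(",") if tok.strip()]
    # Valid CUDA device ordinals run from 0 to num_devices - 1.
    bad = [i for i in ids if i < 0 or i >= num_devices]
    if bad:
        raise ValueError(
            "GPU ids %s are out of range; valid ids are 0..%d"
            % (bad, num_devices - 1)
        )
    return ids
```

On a machine with two GPUs, `check_gpu_ids("0,1,2,3", 2)` raises a `ValueError`, which mirrors the case where the GPU list given to the training script is longer than the hardware allows.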

Some problems with preparing a dataset

Hello! Thanks for your amazing work!

I want to prepare my dataset in the Cityscapes format, but I am confused about the annotations in the Cityscapes dataset. In the load_from_seg function (rcnn/dataset/cityscape), what is the meaning of ins_id?

I tried parsing a gtFine PNG file and found that each value appears to be 1000 * class_id plus instance_id. Is that correct? For example, if label 23 has 8 instances, I would expect the values in the segmentation to be [23000, 23001, ..., 23007]. However, in aachen_000000_000019_gtFine_instanceIds.png the IDs of label 26 are
[26003, 26004, 26005, 26006, 26007, 26008, 26009, 26010]

To sum up, I would appreciate it if you could explain what ins_id is for, and perhaps the annotation format of Cityscapes (I cannot find detailed documentation on the official website :( ). Thank you very much.
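
In the Cityscapes `*_instanceIds.png` files, a pixel that belongs to an annotated instance stores `class_id * 1000 + instance_index`, while pixels of classes without per-instance annotation store the plain class ID (below 1000). The following is a minimal sketch of splitting such values. The function name `split_instance_id` is hypothetical, and whether `ins_id` in `load_from_seg` follows exactly this convention is an assumption based on the question above, not something confirmed from the repository.

```python
def split_instance_id(pixel_value):
    """Split a Cityscapes-style instance ID into (class_id, instance_index).

    Returns (class_id, None) for values below 1000, which carry only a
    class label and no per-instance index.
    """
    if pixel_value < 1000:
        return pixel_value, None
    return pixel_value // 1000, pixel_value % 1000


# Values quoted in the question for label 26.
ids = [26003, 26004, 26005, 26006, 26007, 26008, 26009, 26010]
decoded = [split_instance_id(v) for v in ids]
```

As the quoted values show, the instance indices within one image need not start at 0, so it is safer to treat the index as an opaque identifier rather than as a count.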

MXNet compile error with CUDA 8.0

/home/cgangee/code/mxnet/mshadow/mshadow/./base.h:371:43: error: ‘CUDA_R_32I’ was not declared in this scope
static const cudaDataType_t kCudaFlag = CUDA_R_32I;

Training is too slow. How can I make it more efficient? Should I adjust the mini-batch size?

I am running train_alternate.sh, and it is really slow when training Mask R-CNN (step 2): reaching Epoch[0] Batch [2740] takes more than 4 hours, and there are 20 epochs in total.

I am using 2 Titan X GPUs with 12 GB of memory each, so the effective mini-batch size is 4 for me. How can I change it?
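
The numbers in the question can be turned into a rough throughput estimate. The sketch below only does the arithmetic; it assumes the effective mini-batch is split evenly across GPUs, and the helper names are hypothetical rather than taken from the repository's configuration.

```python
def seconds_per_batch(elapsed_hours, batches_done):
    """Average wall-clock seconds spent on each batch so far."""
    return elapsed_hours * 3600.0 / batches_done


def images_per_gpu(effective_batch, num_gpus):
    """Images each GPU processes per step, assuming an even split."""
    if effective_batch % num_gpus != 0:
        raise ValueError("effective batch must be divisible by the GPU count")
    return effective_batch // num_gpus


# Figures quoted above: 2740 batches in about 4 hours, batch of 4 on 2 GPUs.
rate = seconds_per_batch(4, 2740)   # a little over 5 seconds per batch
per_gpu = images_per_gpu(4, 2)      # 2 images per GPU
```

Since the question says "more than 4 hours", the per-batch time computed here is a lower bound.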

Scale Problem

I find that your code may not be correct when using a multi-scale dataset.
After I fixed the scale error caused by the unused Config.scale, the network always gets NaN loss when training with multiple scales, especially when changing the scale size from [1024, 2048] to [640, 1024].

BTW, I did change the value of config.TRAIN.SCALE to True. Also, lines 160 and 226 of core/tester.py raise an error during evaluation if "scale" is not passed to the function "im_detect_mask".

Why not add a demo.py that just reads an image and plots the output after prediction?

demo_mask.py needs to read the imdb.

	Dtype h_stride = (hend - hstart)/3.0;
	Dtype w_stride = (wend - wstart)/3.0;

	int hlow = min(max(static_cast<int>(floor(h)), 0), height-1);
	int hhigh = hlow + 1;

	Dtype alpha = (hlow == hhigh) ? static_cast<Dtype>(0.5) : (h - hlow) / (hhigh - hlow);
	Dtype beta = (wleft == wright) ? static_cast<Dtype>(0.5) : (w - wleft) / (wright - wleft);
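
The fragments above look like pieces of a bilinear interpolation step: `alpha` and `beta` are the fractional offsets of the sample point between its neighbouring rows and columns. As a point of comparison, here is a minimal Python sketch of bilinear sampling on a 2-D grid. It is an illustration written under that reading, not the repository's RoIAlign kernel; in particular, it clamps the upper neighbours to the last row and column, which is a choice made for this sketch.

```python
import math


def bilinear_sample(feature, h, w):
    """Sample a 2-D list `feature` at fractional position (h, w)."""
    height, width = len(feature), len(feature[0])
    # Keep the sample point inside the grid.
    h = min(max(h, 0.0), height - 1)
    w = min(max(w, 0.0), width - 1)
    hlow, wleft = int(math.floor(h)), int(math.floor(w))
    # Clamp the upper neighbours so they never index past the border.
    hhigh = min(hlow + 1, height - 1)
    wright = min(wleft + 1, width - 1)
    alpha = h - hlow    # vertical weight of the lower row (hhigh)
    beta = w - wleft    # horizontal weight of the right column
    top = (1 - beta) * feature[hlow][wleft] + beta * feature[hlow][wright]
    bottom = (1 - beta) * feature[hhigh][wleft] + beta * feature[hhigh][wright]
    return (1 - alpha) * top + alpha * bottom
```

Sampling the centre of a 2x2 grid returns the mean of its four values, and sampling exactly on a grid point returns that point's value.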
