zhengyang-wang / deeplab-v2--resnet-101--tensorflow

A (re-)implementation of DeepLab v2 (ResNet-101) in TensorFlow for semantic image segmentation on the PASCAL VOC 2012 dataset.

License: GNU General Public License v3.0

deep-learning tensorflow semantic-segmentation deeplab-resnet deeplabv2 pascal-voc

deeplab-v2--resnet-101--tensorflow's Introduction

Deeplab v2 ResNet for Semantic Image Segmentation

This is a (re-)implementation of DeepLab v2 (ResNet-101) in TensorFlow for semantic image segmentation on the PASCAL VOC 2012 dataset. We refer to DrSleep's implementation (many thanks!). We do not use tf-to-caffe packages like kaffe, so you only need TensorFlow 1.3.0+ to run this code.

The deeplab pre-trained ResNet-101 ckpt files (pre-trained on MSCOCO) are provided by DrSleep -- here. Thanks again!

Created by Zhengyang Wang and Shuiwang Ji at Texas A&M University.

Update

05/08/2018:

Our work based on this implementation has led to a paper accepted for long presentation at KDD 2018. You may find the code for that work in this branch.

If you use this code, please cite our paper.

@inproceedings{wang2018smoothed,
  title={Smoothed Dilated Convolutions for Improved Dense Prediction},
  author={Wang, Zhengyang and Ji, Shuiwang},
  booktitle={Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
  pages={2486--2495},
  year={2018},
  organization={ACM}
}

02/02/2018:

  • A clarification:

As reported, the ResNet pre-trained models from TensorFlow (NOT the deeplab one) were trained using the channel order RGB instead of BGR (https://github.com/tensorflow/models/blob/master/research/slim/preprocessing/vgg_preprocessing.py).

Thus, the correct way to apply them is to use the same RGB order. The original code is written for pre-trained models from Caffe and uses BGR. To correct this when you use res101 or res50, delete lines 116 and 117 in utils/image_reader.py to remove the RGB-to-BGR step when reading images. Then modify line 77 in utils/label_utils.py to remove the BGR-to-RGB step in the inverse process for image visualization. Finally, change IMG_MEAN by swapping the first and third values, in line 26 of model.py and line 26 of model_msc.py for non-msc and msc training, respectively.

However, this change does not actually affect the performance much, as the discussion in issue 30 shows. In this task, the size of the training patches differs from that in ImageNet, and the set of images is different, so the IMG_MEAN is never exact. I suspect that simply using IMG_MEAN = [127.5, 127.5, 127.5] would work as well.
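For reference, a minimal sketch of the mean-value change (the BGR means below are the Caffe-style values used elsewhere in this repository; swapping the first and third entries gives the RGB order):

import numpy as np

# BGR means used by the Caffe-derived pre-trained models (original code).
IMG_MEAN_BGR = np.array((103.939, 116.779, 123.68), dtype=np.float32)

# RGB order for the TensorFlow-slim ResNet checkpoints: swap first and third values.
IMG_MEAN_RGB = np.array((123.68, 116.779, 103.939), dtype=np.float32)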

12/13/2017:

  • Now the test code will output the mIoU as well as the IoU for each class.

12/12/2017:

  • Add a 'predict' function: you can use '--option=predict' to save your outputs now (both the raw prediction, where each pixel takes a value between 0 and 20, and a visual one, where each class has its own color).

  • Add multi-scale training, testing and predicting. Check main_msc.py and model_msc.py, and use them just as you would main.py and model.py.

  • Add plot_training_curve.py, which uses log.txt to plot training curves.

  • Now this is a 'full' (re-)implementation of DeepLab v2 (ResNet-101) in TensorFlow. Thank you for the support. You are welcome to report your settings and results, as well as any bugs!

11/09/2017:

  • The new version enables using the original ImageNet pre-trained ResNet models (without pre-training on MSCOCO). You may change the arguments ('encoder_name' and 'pretrain_file') in main.py to use the corresponding pre-trained models. The original pre-trained ResNet ckpt files are provided officially by TensorFlow -- res101 and res50.

  • To help those who want to use this model on the CityScapes dataset, I have shared the corresponding txt files and the python file which generates them. Note that you need to use the tools here to generate labels with trainID first. Hope it helps. Do not forget to change IMG_MEAN in model.py and other settings in main.py.

  • The 'is_training' argument is removed and 'self._batch_norm' is changed. Basically, for a small batch size, it is better to keep the statistics of the BN layers (running means and variances) frozen, i.e. not to update the values provided by the pre-trained model, by setting 'is_training=False'. Note that is_training=False still updates the BN parameters gamma (scale) and beta (offset) if they are present in the var_list of the optimizer definition. Set 'trainable=False' in the BN functions to remove them from trainable_variables.
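Example: For the pre-trained layers, the fully frozen form of the same call would look like the following sketch (it follows the '_batch_norm' signature used in this repository; 'bn1' is an illustrative layer name):

outputs = self._batch_norm(inputs, name='bn1', is_training=False, activation_fn=tf.nn.relu, trainable=False)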

  • Add a 'phase' argument in network.py for future development. 'phase=True' means training. It is mainly for controlling batch normalization (if any) in the non-pre-trained part.

Example: If you have a batch normalization layer in the decoder, you should use 

outputs = self._batch_norm(inputs, name='g_bn1', is_training=self.phase, activation_fn=tf.nn.relu, trainable=True)
  • Some changes to make the code more readable and easier to modify for future research.

  • I plan to add a 'predict' function to enable saving predicted results for offline evaluation, post-processing, etc.

System requirement

Programming language

Python 3.5

Python Packages

tensorflow-gpu 1.3.0

Configure the network

All network hyperparameters are configured in main.py.

Training

num_steps: how many iterations to train

save_interval: how many steps to save the model

random_seed: random seed for tensorflow

weight_decay: L2 regularization parameter

learning_rate: initial learning rate

power: parameter for the poly learning rate policy (see the sketch after this list)

momentum: momentum

encoder_name: name of pre-trained model, res101, res50 or deeplab

pretrain_file: the initial pre-trained model file for transfer learning

data_list: training data list file

grad_update_every (msc only): accumulate the gradients for how many steps before updating weights. Note that in the msc case, this is actually the true training batch size.
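The poly learning rate policy referenced above decays the learning rate as learning_rate * (1 - step / num_steps) ** power. A minimal sketch (the function name is illustrative, not taken from main.py):

def poly_lr(base_lr, step, num_steps, power=0.9):
    """Poly learning-rate schedule, as used by DeepLab v2."""
    return base_lr * (1.0 - float(step) / num_steps) ** power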

Testing/Validation

valid_step: checkpoint number for testing/validation

valid_num_steps: = number of testing/validation samples

valid_data_list: testing/validation data list file

Prediction

out_dir: directory for saving prediction outputs

test_step: checkpoint number for prediction

test_num_steps: = number of prediction samples

test_data_list: prediction data list filename

visual: whether to save visualizable prediction outputs

Data

data_dir: data directory

batch_size: training batch size

input_height: height of input image

input_width: width of input image

num_classes: number of classes

ignore_label: label pixel value that should be ignored

random_scale: whether to perform random scaling data-augmentation

random_mirror: whether to perform random left-right flipping data-augmentation

Log

modeldir: where to store saved models

logfile: where to store training log

logdir: where to store log for tensorboard

Training and Testing

Start training

After configuring the network, we can start to train. Run

python main.py

The training of Deeplab v2 ResNet will start.

Training process visualization

We employ tensorboard for visualization.

tensorboard --logdir=log --port=6006

You may visualize the graph of the model and (training images + ground truth labels + predicted labels).

To visualize the training loss curve, use plot_training_curve.py or write your own script to make use of the training log.
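A minimal sketch of such a script, assuming log lines of the form "step 10   loss = 1.234, (0.700 sec/step)" as shown in the issues below (the exact format of log.txt may differ):

import re
import matplotlib.pyplot as plt

steps, losses = [], []
with open('log.txt') as f:
    for line in f:
        # Assumed line format: "step 10   loss = 1.234, (0.700 sec/step)"
        m = re.search(r'step (\d+).*loss = ([\d.]+)', line)
        if m:
            steps.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(steps, losses)
plt.xlabel('step')
plt.ylabel('training loss')
plt.savefig('training_curve.png')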

Testing and prediction

Select a checkpoint to test/validate your model in terms of pixel accuracy and mean IoU.

Fill in valid_step in main.py with the checkpoint number you want to test. Change valid_num_steps and valid_data_list accordingly. Run

python main.py --option=test

The final output includes pixel accuracy and mean IoU.

Run

python main.py --option=predict

The outputs will be saved in the 'output' folder.

deeplab-v2--resnet-101--tensorflow's People

Contributors

zhengyang-wang

deeplab-v2--resnet-101--tensorflow's Issues

How to fine-tune from the original pre-trained model?

I tried to use the original ImageNet pre-trained ResNet models to fine-tune, but I get the error "NotFoundError (see above for traceback): Key resnet_v1_50/block1/unit_1/bottleneck_v1/conv2/BatchNorm/beta not found in checkpoint".
Could anybody tell me how to solve this problem?
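A quick way to diagnose such key mismatches is to list the tensor names actually stored in the checkpoint; a minimal sketch using the TF 1.x checkpoint reader (the checkpoint path is illustrative):

import tensorflow as tf

reader = tf.train.NewCheckpointReader('./resnet_v1_50.ckpt')  # illustrative path
for name in sorted(reader.get_variable_to_shape_map()):
    print(name)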

Save prediction probabilities.

Hi,
this code is great and so easy to implement, thanks!
I would like to have the probability for each class saved (with a visual probability map: for each image, I would get 20 probability images, one per class). But just getting a probability matrix would also be nice.
How do I have to modify the code to get this?

Thanks in advance!
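A sketch of one way to obtain per-class probabilities, assuming access to the raw logits tensor before the argmax (as in the inference script quoted later on this page):

import tensorflow as tf

def class_probability_maps(logits):
    """logits: tensor of shape [1, H, W, num_classes] (raw network outputs)."""
    return tf.nn.softmax(logits)  # softmax over the last (class) axis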

Apply the code on my own dataset

What code do I need to change if I want to use your code to segment my grayscale images? Change the channels, change the size? Looking forward to your reply.

MultiGPU

How to run your code on multi-GPU? Thank you very much.

How to train your project in multiple GPU?

Hello, I have 4 GTX 1080 GPUs. Could you tell me how I can train your project on my server, since I want to use a bigger batch size? I have tried uncommenting the line in your main.py that allows GPU memory growth, but I think we also need to modify the gradient update (average the gradients across GPUs). I am using tf 1.2.1. Thanks.

The accuracy on the cityscape

Could you tell me the accuracy on Cityscapes using your code? I want to check whether it is my fault or just the performance of DeepLab v2; I got a very low mean IoU on Cityscapes. Thank you very much. @zhengyang-wang

about dataset

Hello, why does the dataset I downloaded contain only about 2000 segmentation images, while the train.txt in the code lists more than 10000? Did I download the wrong dataset?

Could you provide the demo/inference code?

This is not a bug, just a feature request. I hope it can be useful not only to me but also to other people.

Given an image and a trained model, we segment the image and save the prediction result to a file. You can use it in your branch; it may help someone as a reference. This is the code:

"""Run DeepLab-ResNet on a given image.

This script computes a segmentation mask for a given image.
"""

from __future__ import print_function

import argparse
import os
from PIL import Image
from network import *
from utils import decode_labels


IMG_MEAN = np.array((103.939, 116.779, 123.68), dtype=np.float32)

NUM_CLASSES = 19
SAVE_DIR = './output/'


def get_arguments():
    """Parse all the arguments provided from the CLI.

    Returns:
      A list of parsed arguments.
    """
    parser = argparse.ArgumentParser(description="DeepLabLFOV Network Inference.")
    parser.add_argument("--img_path", type=str,default='./input/frankfurt_000000_000294_leftImg8bit.png',
                        help="Path to the RGB image file.")
    parser.add_argument("--model_weights", type=str, default='./model_cityscape/model.ckpt-10000',
                        help="Path to the file with model weights.")
    parser.add_argument("--num-classes", type=int, default=NUM_CLASSES,
                        help="Number of classes to predict (including background).")
    parser.add_argument("--save-dir", type=str, default=SAVE_DIR,
                        help="Where to save predicted mask.")
    return parser.parse_args()


def load(saver, sess, ckpt_path):
    '''Load trained weights.

    Args:
      saver: TensorFlow saver object.
      sess: TensorFlow session.
      ckpt_path: path to checkpoint file with parameters.
    '''
    saver.restore(sess, ckpt_path)
    print("Restored model parameters from {}".format(ckpt_path))


def main():
    """Create the model and start the evaluation process."""
    args = get_arguments()

    # Prepare image.
    img = tf.image.decode_jpeg(tf.read_file(args.img_path), channels=3)
    # Convert RGB to BGR.
    img_r, img_g, img_b = tf.split(axis=2, num_or_size_splits=3, value=img)
    img = tf.cast(tf.concat(axis=2, values=[img_b, img_g, img_r]), dtype=tf.float32)
    # Extract mean.
    img -= IMG_MEAN

    # Create network. Deeplab_v2(self.image_batch, self.conf.num_classes, False)
    net = Deeplab_v2(tf.expand_dims(img, dim=0), args.num_classes, False)

    # Which variables to load.
    restore_var = tf.global_variables()

    # Predictions.
    raw_output = net.outputs
    raw_output_up = tf.image.resize_bilinear(raw_output, tf.shape(img)[0:2])
    raw_output_up = tf.argmax(raw_output_up, dimension=3)
    pred = tf.expand_dims(raw_output_up, dim=3)

    # Set up TF session and initialize variables.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)
    init = tf.global_variables_initializer()

    sess.run(init)

    # Load weights.
    loader = tf.train.Saver(var_list=restore_var)
    load(loader, sess, args.model_weights)

    # Perform inference.
    preds = sess.run(pred)

    msk = decode_labels(preds, num_classes=args.num_classes)
    im = Image.fromarray(msk[0])
    if not os.path.exists(args.save_dir):
        os.makedirs(args.save_dir)
    im.save(args.save_dir + 'mask.png')

    print('The output file has been saved to {}'.format(args.save_dir + 'mask.png'))


if __name__ == '__main__':
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    main()
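Assuming the script above is saved as inference.py (the name is illustrative), a typical invocation would be

python inference.py --img_path ./input/frankfurt_000000_000294_leftImg8bit.png --model_weights ./model_cityscape/model.ckpt-10000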

Note that you must change the label_colours code accordingly:

label_colours = [(128, 64, 128), (244, 35, 231), (69, 69, 69)
                # 0 = road, 1 = sidewalk, 2 = building
                ,(102, 102, 156), (190, 153, 153), (153, 153, 153)
                # 3 = wall, 4 = fence, 5 = pole
                ,(250, 170, 29), (219, 219, 0), (106, 142, 35)
                # 6 = traffic light, 7 = traffic sign, 8 = vegetation
                ,(152, 250, 152), (69, 129, 180), (219, 19, 60)
                # 9 = terrain, 10 = sky, 11 = person
                ,(255, 0, 0), (0, 0, 142), (0, 0, 69)
                # 12 = rider, 13 = car, 14 = truck
                ,(0, 60, 100), (0, 79, 100), (0, 0, 230)
                # 15 = bus, 16 = train, 17 = motorcycle
                ,(119, 10, 32), (1, 1, 1)]
                # 18 = bicycle, 19 = void label

some class Iou is nan


I downloaded the augmented dataset and converted the .mat labels to .png format. Then I trained with your train.txt file. I adjusted input_height and input_width to 270*270 because of my GPU (a 1080); after 20000 iterations the training loss is 1.18. I tested the model with test.txt but found the IoU is very low and some class values are nan. Also, every prediction result contains only two classes (two colors) and some are wrong. How can I solve this? Thank you very much.

compute_IoU_per_class

When I use the compute_IoU_per_class function, I get this error:

print('class %d: %.3f'%(i,IoU))
TypeError: a float is required
How can I fix it?
Thanks a lot.
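The '%.3f' specifier requires a Python float; if IoU comes back as a zero-dimensional numpy array or similar, an explicit cast is one possible fix (a guess, since the surrounding code is not shown):

print('class %d: %.3f' % (i, float(IoU)))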

Image order: RGB or BGR?

Hello, I am using the official ResNet-101 pre-trained model (from your link). It is trained on ImageNet, with RGB image order, and the IMAGE_MEAN is

_R_MEAN = 123.68 / 255
_G_MEAN = 116.78 / 255
_B_MEAN = 103.94 / 255

The official resnet-101 pre-processing L223 is

channels = tf.split(axis=2, num_or_size_splits=num_channels, value=image)
for i in range(num_channels):
    channels[i] -= means[i]
return tf.concat(axis=2, values=channels)

Your code, however, converts RGB to BGR and uses a different IMAGE_MEAN. I think we should use the same pre-processing as the pre-trained model did, i.e. RGB order and the ImageNet image mean. Am I right?

Results for VOC2012 are not correct

I use the original images from VOC2012 for training.
Training set: 1464
Test set: 1449
I use the Res101 pre-trained model from slim's checkpoints.
The training loss is about 1.8.
When I test the model, the IoU is very low, only 0.1, and some classes are nan.
What's wrong with my configuration? I have set up the environment following the README.

Cityscape training parameters

Thanks for sharing this nice work. I have reproduced Pascal VOC as you did. Now I would like to evaluate the performance on Cityscapes. I have created the training, validation and testing files as your code expects. I only have a TITAN X Pascal with 12GB. Could you share your parameter settings for training on the Cityscapes dataset? Currently this is my setting, but it does not work in the testing phase. I am using input_height of 512 and input_width of 1024. In addition, do you use both fine and coarse data for training?

IMG_MEAN = np.array((103.939, 116.779, 123.68), dtype=np.float32)

In the main.py

# training
flags.DEFINE_integer('num_steps', 20000, 'maximum number of iterations')
flags.DEFINE_integer('save_interval', 1000, 'number of iterations for saving and visualization')
flags.DEFINE_integer('random_seed', 1234, 'random seed')
flags.DEFINE_float('weight_decay', 0.0005, 'weight decay rate')
flags.DEFINE_float('learning_rate', 2.5e-4, 'learning rate')
flags.DEFINE_float('power', 0.9, 'hyperparameter for poly learning rate')
flags.DEFINE_float('momentum', 0.9, 'momentum')
flags.DEFINE_string('encoder_name', 'deeplab', 'name of pre-trained model, res101, res50 or deeplab')
flags.DEFINE_string('pretrain_file', './reference model/deeplab_resnet_init.ckpt', 'pre-trained model filename corresponding to encoder_name')
flags.DEFINE_string('data_list', './dataset_cityscapes/train_fine.txt', 'training data list filename')

# testing / validation
flags.DEFINE_integer('valid_step', 2000, 'checkpoint number for testing/validation')
flags.DEFINE_integer('valid_num_steps', 1449, '= number of testing/validation samples')
flags.DEFINE_string('valid_data_list', './dataset_cityscapes/val_fine.txt', 'testing/validation data list filename')

# data
flags.DEFINE_string('data_dir', './cityscapes/leftImg8bit_trainvaltest', 'data directory')
flags.DEFINE_integer('batch_size', 2, 'training batch size')
flags.DEFINE_integer('input_height', 512, 'input image height')
flags.DEFINE_integer('input_width', 1024, 'input image width')
flags.DEFINE_integer('num_classes', 19, 'number of classes')
flags.DEFINE_integer('ignore_label', 255, 'label pixel value that should be ignored')
flags.DEFINE_boolean('random_scale', True, 'whether to perform random scaling data-augmentation')
flags.DEFINE_boolean('random_mirror', True, 'whether to perform random left-right flipping data-augmentation')

What should be changed if there are only two classes?

Hi, I am new to ML/CV; I am doing image segmentation for skin lesions. I only need to separate the lesion area from the background (two classes).

When I configure main.py and run training, the loss stays around 1.231 after a few steps. Then when I run testing, the pixel accuracy stays at 1.00 and the mean IoU at 0.5. Did you encounter the same problem?

I found in DrSleep's notes that when loading a checkpoint for a different number of classes (not 21), the --not-restore-last flag should be passed. Have you implemented this as well?

question about CRFs

Hi Zhengyang, thanks for your code! Here is one question: did you use CRFs in your code?

Optimizer choice: Adam VS SGD

Dear Doctor Wang,

I noticed that DrSleep uses Adam for training, while yours and the original paper employ standard SGD. I am curious whether you have any experience with the performance difference between these two approaches on this problem.

Bests,

Xiong

Problem with pre-trained model

Hey, I want to use ResNet v2 to train DeepLab v2, and I have a question about the pre-trained model.
In your code, you mention deeplab_resnet.ckpt and resnet101.ckpt. Can I just use resnet101.ckpt as the initialization model to train DeepLab v2? If that works, then I could use ResNet v2 to train DeepLab v2.

How to print the mean IoU during training

Hello, I would like to add one thing to your training code: during training, I want to print the mean IoU (besides step and loss). I added the following at line 262 of your model.py:

# mIoU
pred_logits = tf.reshape(self.pred, [-1, ])
gt = tf.reshape(self.label_batch, [-1, ])
# Ignoring all labels greater than or equal to n_classes.
temp = tf.less_equal(gt, self.conf.num_classes - 1)
weights = tf.cast(temp, tf.int32)
# fix for tf 1.3.0
gt = tf.where(temp, gt, tf.cast(temp, tf.uint8))
self.mIoU, self.mIou_update_op = tf.contrib.metrics.streaming_mean_iou(pred_logits, gt, num_classes=self.conf.num_classes, weights=weights)

Line 39

self.sess.run(tf.local_variables_initializer())

Line 53

loss_value, images, labels, preds, summary, _, _ = self.sess.run(
    [self.reduced_loss,
     self.image_batch,
     self.label_batch,
     self.pred,
     self.total_summary,
     self.train_op,
     self.mIou_update_op],
    feed_dict=feed_dict)
m_IoU = self.mIoU.eval(session=self.sess)

And line 68

print('step {:d} \t loss = {:.3f}, ({:.3f} sec/step), Mean IoU: {:.3f}'.format(step, loss_value, duration, m_IoU))

But the mIoU value does not change across steps. Do you know the reason, and how could I fix it? Thanks.

step 0 	 loss = 4.039, (3.927 sec/step), Mean IoU: 0.004
step 1 	 loss = 3.142, (0.893 sec/step), Mean IoU: 0.004
step 2 	 loss = 2.124, (0.660 sec/step), Mean IoU: 0.004
step 3 	 loss = 1.549, (0.655 sec/step), Mean IoU: 0.004
step 4 	 loss = 2.705, (0.677 sec/step), Mean IoU: 0.004
step 5 	 loss = 1.956, (0.700 sec/step), Mean IoU: 0.004
...
step 502 	 loss = 0.769, (0.852 sec/step), Mean IoU: 0.004
step 503 	 loss = 0.525, (0.831 sec/step), Mean IoU: 0.004
step 504 	 loss = 1.202, (0.812 sec/step), Mean IoU: 0.004

Is 'is_training' for batch norm always False?

Thank you for your great implementation. Can you please explain why 'is_training' in all batch normalization layers is always False? This results in the batch normalization statistics never being updated.

test mode cannot print block shape

This is the printed info after running '--option=test'.
Did anyone encounter the same problem?

-----------build encoder: deeplab pre-trained-----------
after start block: (1, ?, ?, 64)
after block1: (1, ?, ?, 256)
after block2: (1, ?, ?, 512)
after block3: (1, ?, ?, 1024)
after block4: (1, ?, ?, 2048)
-----------build decoder-----------
after aspp block: (1, ?, ?, 2)

Getting error 'FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 10, current size 0)'

When I run main.py I get:

OutOfRangeError (see above for traceback): FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 10, current size 0) [[Node: create_inputs/batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]

I've downloaded VOC and have the path in main.py set to:
flags.DEFINE_string('data_dir', './VOCdevkit/VOC2012', 'data directory')

I've also set the following in main.py:
flags.DEFINE_string('encoder_name', 'res101', 'name of pre-trained model, res101, res50 or deeplab')
flags.DEFINE_string('pretrain_file', './resnet_v1_101.ckpt', 'pre-trained model filename corresponding to encoder_name')

Apologies if I'm doing something obviously wrong.

There is no directory called SegmentationClassAug; how to solve it?

-----------build encoder: deeplab pre-trained-----------
after start block: (10, 81, 81, 64)
after block1: (10, 81, 81, 256)
after block2: (10, 41, 41, 512)
after block3: (10, 41, 41, 1024)
after block4: (10, 41, 41, 2048)
-----------build decoder-----------
after aspp block: (10, 41, 41, 21)
INFO:tensorflow:Restoring parameters from ../reference model/deeplab_resnet.ckpt
Restored model parameters from ../reference model/deeplab_resnet.ckpt
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, /home/hp/VOCdevkit/VOC2012/SegmentationClassAug/2008_008300.png; No such file or directory
[[Node: create_inputs/ReadFile_1 = ReadFile_device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Redundant condition in model.py

I have checked your code and your note:

'is_training' argument is removed and 'self._batch_norm' changes. Basically, for a small batch size, it is better to keep the statistics of the BN layers (running means and variances) frozen, and to not update the values provided by the pre-trained model by setting 'is_training=False'. Note that is_training=False still updates BN parameters gamma (scale) and beta (offset) if they are presented in var_list of the optimiser definition. Set 'trainable=False' in BN fuctions to remove them from trainable_variables

It means that when we freeze BN, we keep its parameters fixed (including mean, variance, beta, and gamma). In your BN layer in network.py, you have set trainable=False, so gamma and beta will not appear in the trainable list. Therefore, I think you have a redundant condition in these lines of model.py:

decoder_w_trainable = [v for v in decoder_trainable if 'weights' in v.name or 'gamma' in v.name] # lr * 10.0
decoder_b_trainable = [v for v in decoder_trainable if 'biases' in v.name or 'beta' in v.name] # lr * 20.0

Am I right? Or, to be safe, should the condition be 'gamma' not in v.name instead of 'gamma' in v.name?

Tensor name "bn4b17_branch2c/moving_mean" not found in checkpoint files

I used the pre-trained ResNet-101 provided by TensorFlow, as you described in the 11/09/2017 update (http://download.tensorflow.org/models/resnet_v1_101_2016_08_28.tar.gz), but a NotFoundError is raised when loading the model, as shown below:

NotFoundError (see above for traceback): Tensor name "bn4b17_branch2c/moving_mean" not found in checkpoint files G:/DeepLab/reference_model/tensorflow_official/resnet_v1_101.ckpt
[[Node: save_1/RestoreV2_202 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2_202/tensor_names, save_1/RestoreV2_202/shape_and_slices)]]
[[Node: save_1/RestoreV2_242/_183 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1226_save_1/RestoreV2_242", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

I checked the downloaded "resnet_v1_101.ckpt" files and found that the tensor names in the ckpt look like

tensor_name: resnet_v1_101/block3/unit_5/bottleneck_v1/conv2/BatchNorm/moving_variance
tensor_name: resnet_v1_101/block3/unit_14/bottleneck_v1/conv3/BatchNorm/gamma
tensor_name: resnet_v1_101/block1/unit_3/bottleneck_v1/conv1/weights
tensor_name: resnet_v1_101/block1/unit_3/bottleneck_v1/conv3/weights
tensor_name: resnet_v1_101/block3/unit_16/bottleneck_v1/conv1/BatchNorm/gamma
tensor_name: resnet_v1_101/block3/unit_7/bottleneck_v1/conv3/BatchNorm/gamma
tensor_name: resnet_v1_101/block3/unit_18/bottleneck_v1/conv2/BatchNorm/gamma
tensor_name: resnet_v1_101/block3/unit_7/bottleneck_v1/conv1/BatchNorm/beta
tensor_name: resnet_v1_101/block3/unit_18/bottleneck_v1/conv2/weights

which is quite different from what is expected.
Is there any processing needed before restoring the model, or have I used the wrong model? Help!
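One general TF 1.x mechanism for such naming mismatches is to pass a dictionary mapping checkpoint tensor names to graph variables when constructing the Saver; a minimal sketch (the mapping function is hypothetical and depends on both naming schemes):

import tensorflow as tf

def to_ckpt_name(var):
    # Hypothetical translation from a graph variable name to the
    # checkpoint's naming scheme; the real mapping depends on how
    # the network layers are scoped.
    return 'resnet_v1_101/' + var.op.name

var_map = {to_ckpt_name(v): v for v in tf.global_variables()}
saver = tf.train.Saver(var_list=var_map)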

Bug: output folders are not created

There is a bug when someone runs prediction for the first time: the source code does not create folders such as ./output/prediction, ./output/visual_prediction, and the model folder. It would be better to provide a short script that creates these folders if they do not exist.
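A minimal sketch of such a script (the folder paths follow those mentioned above):

import os

for d in ('./output/prediction', './output/visual_prediction', './model'):
    os.makedirs(d, exist_ok=True)  # no-op if the folder already exists (Python 3.2+)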

In addition, could you tell me what mIoU you achieved with the multi-scale code on PASCAL and Cityscapes? Thanks for your good work.

[use my own pretrained model]

If I want to use my own pre-trained ResNet-50 model, the tensor names in my ckpt file are:
tensor_name group2/block3/conv3/bn/mean/EMA

tensor_name group3/block1/conv2/bn/variance/EMA

tensor_name group3/block0/conv1/bn/beta

tensor_name group3/block2/conv3/bn/beta/Momentum

tensor_name group3/block2/conv1/bn/gamma

tensor_name group2/block4/conv3/W/Momentum

tensor_name group0/block1/conv3/bn/gamma/Momentum

tensor_name group1/block0/conv3/W
........

and in your ckpt file, the tensor names are:
tensor_name resnet_v1_50/block3/unit_2/bottleneck_v1/conv1/BatchNorm/moving_mean

tensor_name resnet_v1_50/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta

tensor_name resnet_v1_50/block3/unit_2/bottleneck_v1/conv3/BatchNorm/gamma
tensor_name resnet_v1_50/block2/unit_1/bottleneck_v1/conv3/weights

tensor_name resnet_v1_50/block3/unit_1/bottleneck_v1/conv3/BatchNorm/moving_variance
........
I have changed the names in network.py:

with tf.variable_scope(scope_name) as scope:
    outputs = self._start_block('conv0')
    print("after start block:", outputs.shape)
    with tf.variable_scope('group0') as scope:
        outputs = self._bottleneck_resblock(outputs, 256, 'block0', identity_connection=False)
        outputs = self._bottleneck_resblock(outputs, 256, 'block1')
        outputs = self._bottleneck_resblock(outputs, 256, 'block2')
        print("after group0 :", outputs.shape)

..........
def _bottleneck_resblock(self, x, num_o, name, half_size=False, identity_connection=True):
    first_s = 2 if half_size else 1
    assert num_o % 4 == 0, 'Bottleneck number of output ERROR!'
    # branch1
    if not identity_connection:
        o_b1 = self._conv2d(x, 1, num_o, first_s, name='%s/shortcut' % name)
        o_b1 = self._batch_norm(o_b1, name='%s/shortcut' % name, is_training=False, activation_fn=None)
    else:
        o_b1 = x
    # branch2
    o_b2a = self._conv2d(x, 1, num_o / 4, first_s, name='%s/conv1' % name)
    o_b2a = self._batch_norm(o_b2a, name='%s/conv1' % name, is_training=False, activation_fn=tf.nn.relu)

    o_b2b = self._conv2d(o_b2a, 3, num_o / 4, 1, name='%s/conv2' % name)
    o_b2b = self._batch_norm(o_b2b, name='%s/conv2' % name, is_training=False, activation_fn=tf.nn.relu)

    o_b2c = self._conv2d(o_b2b, 1, num_o, 1, name='%s/conv3' % name)
    o_b2c = self._batch_norm(o_b2c, name='%s/conv3' % name, is_training=False, activation_fn=None)
    # add
    outputs = self._add([o_b1, o_b2c], name='%s/add' % name)
    # relu
    outputs = self._relu(outputs, name='%s/relu' % name)
    return outputs

........

What else should I do to run this code with my own pre-trained ckpt file?

What's the result of mIOU in val/test with your code?

When I used this code, I reduced 'batch_size' and 'image_size' because my GPU memory was not enough:
batch_size = 7, image_size = 257; unfortunately, I only got 73.2 mIoU.
Additionally, I added a new model (model_msc) with multi-scale input for training and testing: mIoU = 74.2.
Both are far from 76.35 (the paper's result).
Could you tell me what the reason might be?
And what mIoU did you get in your experiments?
Thank you very much!

reference model

Hello,

I downloaded your project, but it fails to run on my machine; it reports that no matching files can be found:

2018-03-31 12:26:55.147942: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ../reference model/deeplab_resnet_init.ckpt

How can I solve this problem? Thanks!

Another dataset

Hello,
I want to ask if it is possible to use your DeepLab implementation on my own dataset for a semantic segmentation task.
I have a dataset with semantic segmentation labels for 12 classes; is it possible to use your model?

Loss goes to nan when using res101

I am training your code with res101 on the Cityscapes dataset (with the deeplab pre-trained model it worked well). I set up main.py as

flags.DEFINE_float('momentum', 0.9, 'momentum')
flags.DEFINE_string('encoder_name', 'res101', 'name of pre-trained model, res101, res50 or deeplab')
flags.DEFINE_string('pretrain_file', './reference model/resnet_v1_101.ckpt', 'pre-trained model filename corresponding to encoder_name')
flags.DEFINE_string('data_list', './dataset_cityscapes/train_fine.txt', 'training data list filename')
...
flags.DEFINE_integer('input_height', 713, 'input image height')
flags.DEFINE_integer('input_width', 713, 'input image width')
flags.DEFINE_integer('num_classes', 19, 'number of classes')

After some iterations, the loss goes to nan. I am using Python 3 and TensorFlow 1.3. Did you meet the same problem? How could I fix it? Thanks.

This is the loss log:

step 0 	 loss = 6.337, (12.625 sec/step)
step 1 	 loss = 6.133, (1.820 sec/step)
step 2 	 loss = 2.675, (1.625 sec/step)
step 3 	 loss = 6.042, (1.630 sec/step)
step 4 	 loss = 15.278, (1.545 sec/step)
step 5 	 loss = 12.153, (1.532 sec/step)
step 6 	 loss = 52.724, (1.100 sec/step)
step 7 	 loss = 940.443, (1.041 sec/step)
step 8 	 loss = 5393914151199927058379743980628738048.000, (1.052 sec/step)
step 9 	 loss = nan, (1.093 sec/step)
step 10 	 loss = nan, (1.445 sec/step)
step 11 	 loss = nan, (1.025 sec/step)

[use my own pretrained model]

In my ckpt file the tensor names are:
group3/block1/conv2/bn/mean
group3/block1/conv2/bn/variance
......
After my change, when I run the code, there are errors like:
Not found: key group3/block1/conv2/bn/moving_mean not found in checkpoint
Not found: key group3/block1/conv2/bn/moving_variance not found in checkpoint
......
Where do I need to change the code so that it reads mean and variance instead of moving_mean and moving_variance?

fc7 and fc8 layers?

Though there are fc7 and fc8 layers in Fig. 7 of the DeepLab v2 paper, I couldn't find them in your implementation.
Is it my failure to understand the code, or have they just not been implemented?

OutOfRangeError Occurred

Hi there. I came across an out-of-range error after executing the instruction: python main.py

Here's some of my configuration:

OS: Windows 10
python: 3.5 ( not Anaconda )
Tensorflow: 1.3 + CPU

Would you please help me with this? Thanks!

===================================================================

2017-12-28 09:50:03.044001: W C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-12-28 09:50:03.044142: W C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
-----------build encoder: deeplab pre-trained-----------
after start block: (10, 81, 81, 64)
after block1: (10, 81, 81, 256)
after block2: (10, 41, 41, 512)
after block3: (10, 41, 41, 1024)
after block4: (10, 41, 41, 2048)
-----------build decoder-----------
after aspp block: (10, 41, 41, 21)
Restored model parameters from ../reference model/deeplab_resnet_init.ckpt
2017-12-28 09:50:44.847421: W C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Not found: NewRandomAccessFile failed to Create/Open: E:\Data\VOC_data\VOC\VOCdevkit\VOC2012/SegmentationClassAug/2008_003519.png : The system cannot find the path specified.

Traceback (most recent call last):
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\client\session.py", line 1327, in _do_call
return fn(*args)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\client\session.py", line 1306, in _run_fn
status, run_metadata)
File "C:\Program Files\Python 3.5\lib\contextlib.py", line 66, in exit
next(self.gen)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 10, current size 0)
[[Node: create_inputs/batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 82, in
tf.app.run()
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "main.py", line 76, in main
getattr(model, args.option)()
File "C:\Users\Shane\Desktop\Deeplab-v2--ResNet-101--Tensorflow\model.py", line 60, in train
feed_dict=feed_dict)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
options, run_metadata)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 10, current size 0)
[[Node: create_inputs/batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]

Caused by op 'create_inputs/batch', defined at:
File "main.py", line 82, in
tf.app.run()
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "main.py", line 76, in main
getattr(model, args.option)()
File "C:\Users\Shane\Desktop\Deeplab-v2--ResNet-101--Tensorflow\model.py", line 36, in train
self.train_setup()
File "C:\Users\Shane\Desktop\Deeplab-v2--ResNet-101--Tensorflow\model.py", line 169, in train_setup
self.image_batch, self.label_batch = reader.dequeue(self.conf.batch_size)
File "C:\Users\Shane\Desktop\Deeplab-v2--ResNet-101--Tensorflow\utils\image_reader.py", line 179, in dequeue
num_elements)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\training\input.py", line 922, in batch
name=name)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\training\input.py", line 716, in _batch
dequeued = queue.dequeue_many(batch_size, name=name)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\ops\data_flow_ops.py", line 457, in dequeue_many
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\ops\gen_data_flow_ops.py", line 1342, in _queue_dequeue_many_v2
timeout_ms=timeout_ms, name=name)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\framework\ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Program Files\Python 3.5\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 10, current size 0)
[[Node: create_inputs/batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]

TypeError

Hello, what is the format of your data? Can you upload the dataset you used? I always get the error "TypeError: Value passed to parameter 'x' has DataType uint8 not in list of allowed values: float16, float32, float64, int32, int64, complex64, complex128"

tf.flags

Hi Zhengyang, I want to know why you set flags.FLAGS.__dict__['__parsed'] = False in configure()? What does it mean?
Thank you!

NewRandomAccessFile failed to Create/Open: E:\Datase\VOC2012 : access denied

Hi Zhengyang, I run your code on Windows, and I set
flags.DEFINE_string('data_dir', 'E:\Dataset\VOC2012', 'data directory').
Then I get "NewRandomAccessFile failed to Create/Open: E:\Datase\VOC2012 : access denied"
and "OutOfRangeError: FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 10, current size 0)", the same as @cclough.
This confuses me a lot; can you tell me where I went wrong?
Thank you!!!!

How could the network train batch norm?

This is not an issue; I just want to extend the training process by training with batch norm. As you mentioned, training BN may not be good when the batch size is small. However, I am running on a powerful machine, so I think it can train with a batch size of 16. After that, the BN layers will be frozen, and the network will train with a small batch size and learning rate.

As you mentioned in the README

Example: If you have a batch normalization layer in the decoder, you should use
outputs = self._batch_norm(inputs, name='g_bn1', is_training=self.phase, activation_fn=tf.nn.relu, trainable=True)

To train with BN, I will set the is_training flag to True and trainable=True in network.py. Is that all? Or do I also need to change something in model.py in these lines:

restore_var = [v for v in tf.global_variables() if 'fc' not in v.name]
# Trainable Variables
all_trainable = tf.trainable_variables()
# Fine-tune part
encoder_trainable = [v for v in all_trainable if 'fc' not in v.name] # lr * 1.0
# Decoder part
decoder_trainable = [v for v in all_trainable if 'fc' in v.name]
....
decoder_w_trainable = [v for v in decoder_trainable if 'weights' in v.name or 'gamma' in v.name] # lr * 10.0
decoder_b_trainable = [v for v in decoder_trainable if 'biases' in v.name or 'beta' in v.name] # lr * 20.0

This is my complete code for training BN in the decoder:

o=self._conv2d_bn(x, 1, 256, 1, name='fc1', biased=True)
o_bn = self._batch_norm(o, name='fc1_bn', is_training=self.phase, trainable=self.phase,activation_fn=tf.nn.relu)

Thanks so much

mIoU is very low when using my own dataset

Hello, thank you very much for your code. When I use the PASCAL VOC dataset, the mIoU and the final predicted pixel values look good; but when I use my own dataset, although the loss decreases, the mIoU is very low.
My own dataset is quite small: the training and validation sets each contain only about 100 images, the image resolution is 640x480, and the images are mostly similar to each other.

Hope someone can help; thanks a lot!

The loss is still about 2 after about 4000 iterations.

Hello, thank you for making your implementation public; it's great work and very helpful for me. When I train the model on the Cityscapes dataset, the loss almost stops decreasing after 100 iterations and stays around 2, while the mIoU increases only slowly (about 0.001 per 80 iterations). I want to know if this is normal, and if it's abnormal, whether you have any advice about what might cause it.
Thanks in advance. ^_^

ResourceExhaustedError: OOM when allocating tensor with shape[5760,4,4,2048]

I successfully deployed the project on a Windows 10 system with TensorFlow 1.3.0 (CPU only). However, when I deployed it on an Ubuntu system with TensorFlow 1.4 (GeForce GTX 1080 Ti), I ran into the following problem.

sheldon@amax:~/Projects/Deeplab-v2--ResNet-101--Tensorflow$ python3 main.py
2017-12-28 11:41:44.924418: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-12-28 11:41:45.623542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:8a:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2017-12-28 11:41:45.623612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:8a:00.0, compute capability: 6.1)
-----------build encoder: deeplab pre-trained-----------
after start block: (10, 81, 81, 64)
after block1: (10, 81, 81, 256)
after block2: (10, 41, 41, 512)
after block3: (10, 41, 41, 1024)
after block4: (10, 41, 41, 2048)
-----------build decoder-----------
after aspp block: (10, 41, 41, 21)
Restored model parameters from /data2/deeplab_resnet_init.ckpt
2017-12-28 11:42:09.929249: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 720.00MiB.  Current allocation summary follows.
2017-12-28 11:42:09.929406: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (256):   Total Chunks: 278, Chunks in use: 216. 69.5KiB allocated for chunks. 54.0KiB in use in bin. 15.6KiB client-requested in use in bin.
2017-12-28 11:42:09.929433: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (512):   Total Chunks: 65, Chunks in use: 64. 32.5KiB allocated for chunks. 32.0KiB in use in bin. 32.0KiB client-requested in use in bin.
2017-12-28 11:42:09.929452: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (1024):  Total Chunks: 401, Chunks in use: 401. 401.2KiB allocated for chunks. 401.2KiB in use in bin. 401.0KiB client-requested in use in bin.
2017-12-28 11:42:09.929471: I tensorflow/core/common_runtime/bfc_allocator.cc:627] Bin (2048):  Total Chunks: 88, Chunks in use: 88. 176.0KiB allocated for chunks. 176.0KiB in use in bin. 176.0KiB client-requested in use in bin.

> (There was way too much similar output, so I left out most of the lines here)

2017-12-28 11:42:09.950169: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 424673280 totalling 405.00MiB
2017-12-28 11:42:09.950182: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 663552000 totalling 632.81MiB
2017-12-28 11:42:09.950194: I tensorflow/core/common_runtime/bfc_allocator.cc:683] Sum Total of in-use chunks: 9.55GiB
2017-12-28 11:42:09.950211: I tensorflow/core/common_runtime/bfc_allocator.cc:685] Stats:
Limit:                 10968825856
InUse:                 10253331200
MaxInUse:              10265177344
NumAllocs:                    4856
MaxAllocSize:            802160640

2017-12-28 11:42:09.950341: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************************************************************************************______
2017-12-28 11:42:09.950375: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[5760,4,4,2048]
2017-12-28 11:42:09.973697: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.00GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.973763: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.973796: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 928.77MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.999611: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.58GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.999662: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:09.999689: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 415.06MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:10.017460: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.21GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-12-28 11:42:10.017501: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.67GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
Traceback (most recent call last):
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[5760,4,4,2048]
         [[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
         [[Node: add/_1131 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5776_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 82, in <module>
    tf.app.run()
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 76, in main
    getattr(model, args.option)()
  File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/model.py", line 60, in train
    feed_dict=feed_dict)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[5760,4,4,2048]
         [[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
         [[Node: add/_1131 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5776_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'fc1_voc12_c3/convolution/SpaceToBatchND', defined at:
  File "main.py", line 82, in <module>
    tf.app.run()
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 76, in main
    getattr(model, args.option)()
  File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/model.py", line 36, in train
    self.train_setup()
  File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/model.py", line 177, in train_setup
    net = Deeplab_v2(self.image_batch, self.conf.num_classes, True)
  File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 34, in __init__
    self.build_network()
  File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 38, in build_network
    self.outputs = self.build_decoder(self.encoding)
  File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 64, in build_decoder
    outputs = self._ASPP(encoding, self.num_classes, [6, 12, 18, 24])
  File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 125, in _ASPP
    o.append(self._dilated_conv2d(x, 3, num_o, d, name='fc1_voc12_c%d' % i, biased=True))
  File "/home/sheldon/Projects/Deeplab-v2--ResNet-101--Tensorflow/network.py", line 150, in _dilated_conv2d
    o = tf.nn.atrous_conv2d(x, w, dilation_factor, padding='SAME')
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 1137, in atrous_conv2d
    name=name)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 751, in convolution
    return op(input, filter)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 835, in __call__
    return self.conv_op(inp, filter)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 499, in __call__
    return self.call(inp, filter)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_ops.py", line 490, in _with_space_to_batch_call
    paddings=paddings)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4922, in space_to_batch_nd
    paddings=paddings, name=name)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/sheldon/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[5760,4,4,2048]
         [[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
         [[Node: add/_1131 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5776_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

It seems that the system ran out of resources. What shall I do to fix the problem?

Summary of all trained models?

Could you briefly summarize the trained models with different configurations and different databases? For example, for the PASCAL VOC 2012 validation set: pre-trained models (resnet50/resnet101/deeplab), training with/without msc, evaluation with/without msc.

(By the way, is the difference between the two pre-trained models, resnet101 and deeplab, just that resnet101 is pre-trained on ImageNet, while deeplab is pre-trained on ImageNet and COCO?)
