
ika-rwth-aachen / cam2bev

679 stars · 26 watchers · 115 forks · 6.73 MB

TensorFlow Implementation for Computing a Semantically Segmented Bird's Eye View (BEV) Image Given the Images of Multiple Vehicle-Mounted Cameras.

License: MIT License

Shell 0.79% Python 99.21%
computer-vision machine-learning deep-learning autonomous-vehicles ipm segmentation sim2real simulation birds-eye-view

cam2bev's People

Contributors

bastilam, lreiher



cam2bev's Issues

original images as input

Hello, it seems strange that you use image masks as the model's input. Have you ever tried using the original images as input?

field of view single-input model

First of all thank you so much for this great work.

In the paper I read that the field of view of the ground truth image for the full 360° setup is approximately 70 m x 44 m. Do you have an approximation of the field of view for the single-input model as well?
Based on the ground truth BEV image, could we also approximate which field of view the model would cover in a real-world application?

Thanks a lot in advance.

Performance issues in /model/train.py (by P3)

Hello! I've found a performance issue in /model/train.py: batch() should be called before map(), which could make your program more efficient. The TensorFlow documentation on data pipeline performance supports this.

Detailed description is listed below:

  • dataTrain.batch(conf.batch_size, drop_remainder=True) should be called before dataTrain.map(parse_sample, num_parallel_calls=tf.data.experimental.AUTOTUNE).
  • dataValid.batch(1) should be called before dataValid.map(parse_sample, num_parallel_calls=tf.data.experimental.AUTOTUNE).

Besides, you need to check whether the function called in map() (e.g., parse_sample in dataValid.map(parse_sample, num_parallel_calls=tf.data.experimental.AUTOTUNE)) is affected, to make the changed code work properly. For example, if parse_sample expected data with shape (x, y, z) as its input before the fix, it would receive data with shape (batch_size, x, y, z) afterwards.
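For illustration, here is a minimal, self-contained sketch of the suggested reordering (not the actual model/train.py code; parse_sample_batched is a hypothetical vectorized parser that must accept batched tensors):

import tensorflow as tf

def parse_sample_batched(batch):
    # operates on a whole batch of shape (batch_size, ...) instead of a single sample
    return tf.cast(batch, tf.float32) / 255.0

dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([100, 8], dtype=tf.uint8))

# before: dataset.map(parse_sample, ...).batch(batch_size)
# suggested: batch first, then map a vectorized parse function
dataset = (dataset
           .batch(4, drop_remainder=True)
           .map(parse_sample_batched, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .prefetch(tf.data.experimental.AUTOTUNE))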

Looking forward to your reply. Btw, I am very glad to create a PR to fix it if you are too busy.

Training failed

Hello! When training on your dataset, the run fails every time at the last step. I changed the batch size to 2 and it still failed at the last step. What is your training time? Can you help me out here? Thank you!

how to generate this simulated data

Hello, I was inspired after reading the paper.
I would like to know how this simulated data was generated. I want to simulate a garage scene to train the model.

Label Images vs. Input Images

What is the difference between label and input images in the response I got from a prior issue?
If I'm using my own dataset with only front camera images as input, is the following organization correct?

  • front folder: segmented images
  • bev: ground truth BEV images on non-segmented data
  • bev+occlusion: occlusion.py run on bev folder imgs
  • homographies: ipm.py run on front folder imgs

To confirm: the segmented images do not go through the preprocessing scripts?


Problem is that your input image has a fourth alpha channel, s.t. the resized image has shape (256, 512, 4). This causes the crash during one-hot-encoding.

I will push a fix tomorrow, s.t. an image will always be loaded as RGB instead of RGBA, even if an alpha channel is present. In the meantime, you can fix it yourself by replacing utils.py#L77 with

img = tf.image.decode_png(img, channels=3)
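For context, a minimal sketch of why the extra alpha channel breaks palette-based one-hot encoding (using a hypothetical three-color palette, not the repository's actual utils.py code): each pixel is compared against 3-channel RGB palette entries, so a 4-channel RGBA image cannot match.

import tensorflow as tf

# hypothetical 3-entry RGB palette; the real palettes live in the convert_*.xml files
palette = tf.constant([[0, 0, 142], [128, 64, 128], [70, 70, 70]], dtype=tf.uint8)

def one_hot_from_palette(img):
    # img: (H, W, 3) uint8 RGB image; an RGBA image of shape (H, W, 4) would make
    # the per-pixel comparison against the 3-channel palette entries fail
    matches = [tf.reduce_all(tf.equal(img, color), axis=-1) for color in tf.unstack(palette)]
    return tf.cast(tf.stack(matches, axis=-1), tf.float32)  # (H, W, n_classes)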

Some more notes on your files:

  • The standard implementation expects semantically segmented input and output images, which are then one-hot-en/decoded as part of the pipeline. Your images are a blend of the real-world-image and the semantic segmentation. One-hot-en/decoding will not work properly this way.
  • Your input image color-codes vehicles in a purple-ish way, but the standard 0,0,142 (RGB) blue is listed in the convert_10.xml. You need to check the colors you specify there.
  • Your label image has shape (640, 480), while your input image has shape (480, 640). Keep in mind that both will be center-cropped/resized to (256, 512).
  • It's important that you provide a good estimate of the homography matrix. I mention this because I couldn't have a look at your homography file.

Originally posted by @lreiher in #3 (comment)

Performance when evaluating custom data

Firstly I would like to thank you very much for this repository and the ideas introduced in the paper. I have reproduced the results using 1_FRLR deeplab_mobilenet and the provided dataset and was looking to test on some images we have collected ourselves.
What I found is that, using the trained model, I wasn't able to interpret the results on my own images (we mount a fisheye camera on a truck and run segmentation and our own IPM to generate the homography image used as input).

Below are the input and result for our custom image:
Input: [image: 0_rotate_orig]
Result: [image: 0_rotate]

Result using the provided 1_FRLR validation data:
Input: [image: v_0_0411000]
Result: [image: v_0_0411000]

I am not sure why there is such a huge difference in the results. Although I was expecting poorer results because of the change in data domain, I wasn't expecting instances like cars, which are represented in our own homography image, to disappear upon inference.

I was wondering whether I did anything wrong during the predict step, and I would appreciate it if anyone could shed light on whether this is a prediction-step issue or a model-generalization issue.

Thank you

some issue about "homography_converter"

Hello, thank you for this repository and the ideas. I'm currently trying to reproduce this work.
On my dataset, I find that using the SpatialTransformer to warp the input images does not perform better than simply splicing the IPM images together, so I am trying to find out what I did wrong with the SpatialTransformer unit.

In order to use the uNetXST model, I need to pre-calculate the uNetXST homographies in 1_FRLR.py as theta_init, the init parameter passed to the SpatialTransformer. However, the uNetXST homography I calculated with the provided script for the front camera does not match the data provided in preprocessing/homography_converter/uNetXST_homographies/1_FRLR.py, while the values I calculated for the other three cameras all match.

# uNetXST homography for the front camera
# as provided in 1_FRLR.py
  np.array([[4.651574574230558e-14, 10.192351107009959, -5.36318723862984e-07],
            [-5.588661045867985e-07, 0.0, 2.3708767903941617],
            [35.30731833118676, 0.0, -1.7000018578614013]]),     # front

# what I calculated,
# using the IPM homographies provided in preprocessing/homography_converter/README.md:
[[6.5627512483814406e-15, 10.192351107009959, -5.363187232807152e-07], 
[-5.588661045867985e-07, 0.0, 2.3708767903941617], 
[35.30731833118676, 0.0, -1.7000018578614013]]

For this reason, I recalculated the IPM homographies and found that the values are also not the same as those provided in preprocessing/homography_converter/README.md.

# what is provided in preprocessing/homography_converter/README.md
# OpenCV homography for front:
# [[0.0, 0.8841865353311344, -253.37277367000263], [0.049056392233805146, 0.5285437237795494, -183.265385638118], [-0.0, 0.001750144780726984, -0.5285437237795492]]
# OpenCV homography for rear:
# [[6.288911300436434e-18, 0.8292344604207404, -264.08036704706365], [-0.04905639223380515, 0.5285437237795513, -135.9750235247304], [-0.0, 0.0017501447807269904, -0.5285437237795512]] 
# OpenCV homography for left:
# [[0.04905639223380514, 0.7984814950483465, -264.7865925612947], [3.0038376863423275e-18, 0.4821577791689496, -159.26320930902278], [-0.0, 0.0016334684620118568, -0.49330747552758086]] 
# OpenCV homography for right:
# [[-0.04905639223380516, 0.7984814950483448, -217.49623044790604], [3.0038376863423283e-18, 0.5044571718862112, -138.69450590963578], [-0.0, 0.0016334684620118542, -0.49330747552758]] 

# what i calculated:
OpenCV homography for front:
[[6.288911300436432e-18, 0.8841865353311336, -253.37277367000237], [0.049056392233805146, 0.5285437237795487, -183.26538563811778], [1.1577177764518928e-20, 0.0017501447807269821, -0.5285437237795486]]
OpenCV homography for rear:
[[6.288911300436434e-18, 0.8292344604207396, -264.08036704706336], [-0.049056392233805146, 0.5285437237795507, -135.97502352473026], [-4.194994772127064e-21, 0.0017501447807269886, -0.5285437237795506]]
OpenCV homography for left:
[[0.04905639223380514, 0.7984814950483464, -264.7865925612947], [3.0038376863423302e-18, 0.48215777916894953, -159.26320930902278], [8.805125002324379e-36, 0.0016334684620118568, -0.49330747552758086]]
OpenCV homography for right:
[[-0.04905639223380516, 0.7984814950483449, -217.49623044790607], [3.0038376863423263e-18, 0.5044571718862112, -138.69450590963578], [-5.442256126106322e-36, 0.0016334684620118542, -0.49330747552758]]

This leads to further differences in the resulting uNetXST homographies, but surprisingly the result for the right camera is the same as in the provided file:

H = [
    np.array([[4.651574574230558e-14, 10.192351107009959, -5.36318723862984e-07],
              [-5.588661045867985e-07, 0.0, 2.3708767903941617],
              [35.30731833118676, 0.0, -1.7000018578614013]]),  # front
    # what i calculated
    # [[-5.336674296656391e-14, 10.192351107009959, -5.363187163709389e-07],
    # [-5.588660999399972e-07, -1.3484445368213003e-16, 2.3708767903941643],
    # [35.30731833118661, -1.5835212325065431e-16, -1.700001857861401]]

    np.array([[-5.336674306912119e-14, -10.192351107009957, 5.363187220578325e-07],
              [5.588660952931949e-07, 3.582264351370481e-23, 2.370876772982613],
              [-35.30731833118661, -2.263156574813233e-15, -0.5999981421386035]]),  # rear
    # what i calculated
    # [[2.6539246969884692e-14, -10.192351107009959, 5.363187207328902e-07],
    # [5.58866099939995e-07, -4.8860892910808006e-17, 2.3708767729826152],
    # [-35.30731833118661, -2.9784330148858767e-15, -0.5999981421386087]]

    np.array([[20.38470221401992, 7.562206982469407e-14, -0.28867638384075833],
              [-3.422067857504854e-23, 2.794330463189411e-07, 2.540225111648729],
              [2.1619497190382224e-15, -17.65365916559334, -0.4999990710692976]]),  # left
    # what i calculated
    # [[20.38470221401992, 3.566907532807007e-14, -0.28867638384075667],
    # [-3.422067879481381e-23, 2.794330463189408e-07, 2.5402251116487293],
    # [2.1619497190382196e-15, -17.65365916559332, -0.4999990710692995]]

    np.array([[-20.38470221401991, -4.849709834037436e-15, 0.2886763838407495],
              [-3.4220679184765114e-23, -2.794330512976549e-07, 2.5402251116487626],
              [2.161949719038217e-15, 17.653659165593304, -0.5000009289306967]])  # right
    # what i calculated
    # [[-20.38470221401991, -4.849709834037436e-15, 0.28867638384074945],
    # [-3.4220679184765114e-23, -2.794330512976549e-07, 2.5402251116487626],
    # [2.161949719038217e-15, 17.653659165593304, -0.5000009289306967]]
]

The differences seem small given their order of magnitude, but I'm not sure whether they affect the performance of the uNetXST model, because on my dataset uNetXST does not perform as well as directly using the IPM image as input.
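As a side note (this comparison helper is my own sketch, not repository code): homographies are only defined up to a scale factor, and entries on the order of 1e-14 are numerical noise, so two matrices can be checked for equivalence after normalization, e.g.:

import numpy as np

def homographies_close(H1, H2, atol=1e-6):
    H1 = np.asarray(H1, dtype=float)
    H2 = np.asarray(H2, dtype=float)
    # remove the scale ambiguity by normalizing with the largest-magnitude entry
    H1 = H1 / H1.flat[np.argmax(np.abs(H1))]
    H2 = H2 / H2.flat[np.argmax(np.abs(H2))]
    return np.allclose(H1, H2, atol=atol)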

Finally, is my calculation correct? If I want to use my own dataset, how can I obtain the correct uNetXST homography values?

Training on Google Colab

Hi,

I'm trying to train your uNetXST model for the front view only in a Google Colab notebook, but each epoch takes forever (2+ hours).
The notebook runs Python 3.10.11 with CUDA 11.8 and TensorFlow 2.12.0 preinstalled. In requirements.txt you suggest training with TensorFlow < 2.5.0, but this seems to affect only the DeepLab models.

Another thing I have noticed is the TF-TRT Warning: Could not find TensorRT. Do I need to install TensorRT for training?

Thank you in advance!

Below is the output of the terminal while training:

2023-05-22 12:35:07.066135: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-05-22 12:35:07.119840: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-22 12:35:08.005860: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Found 32190 training samples
Found 3172 validation samples
2023-05-22 12:36:46.183583: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-05-22 12:36:46.183643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38284 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0
Built data pipeline for training
Built data pipeline for validation
Compiled model uNetXST.py
Starting training...
2023-05-22 12:36:59.890772: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_1' with dtype string and shape [32190]
[[{{node Placeholder/_1}}]]
2023-05-22 12:36:59.891129: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_1' with dtype string and shape [32190]
[[{{node Placeholder/_1}}]]
Epoch 1/100
2023-05-22 12:37:19.169642: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape inmodel_5/dropout/dropout/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
2023-05-22 12:37:25.133130: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8700
2023-05-22 12:37:31.275985: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x7f0603f4c350 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-05-22 12:37:31.276032: I tensorflow/compiler/xla/service/service.cc:177] StreamExecutor device (0): NVIDIA A100-SXM4-40GB, Compute Capability 8.0
2023-05-22 12:37:31.420157: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2023-05-22 12:37:31.919526: I ./tensorflow/compiler/jit/device_compiler.h:180] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
54/6438 [..............................] - ETA: 2:13:38 - loss: 1.3788 - categorical_accuracy: 0.5008 - mean_io_u_with_one_hot_labels: 0.2887

the R matrix calculate order may have some problem?

Hello, in ipm.py, when setting the R matrix for a Camera, the rotations are applied pitch first and then yaw:
[image: rotation matrix construction in setR()]
Doesn't this make the yaw rotation incorrect, because the Y axis changes after applying the pitch?
(It works when the pitch is 0, but when I change the pitch to a non-zero value it generates wrong results.)
I think we should first apply yaw and then pitch.
[image: proposed rotation order]
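To illustrate the point (this snippet is my own sketch, not the code from ipm.py): rotation composition is not commutative, so applying pitch before yaw generally differs from applying yaw before pitch, and the two orders only coincide when one of the angles is zero.

import numpy as np

def Ry(pitch):  # rotation about the Y axis
    c, s = np.cos(pitch), np.sin(pitch)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(yaw):    # rotation about the Z axis
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

pitch, yaw = np.deg2rad(10.0), np.deg2rad(30.0)
print(np.allclose(Rz(yaw) @ Ry(pitch), Ry(pitch) @ Rz(yaw)))  # False for non-zero angles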

From BEV to front view

Hi,

Thank you again for the great work!
I am wondering whether this model can also work well for BEV-to-front-view translation?

Training of deeplab-mobilenet and deeplab-xception failed

I tried reproducing the results mentioned in the paper. To train deeplab-xception and deeplab-mobilenet, I simply changed "./train.py -c config.1_FRLR.unetxst.yml" to "./train.py -c config.1_FRLR.deeplab-xception.yml" and "./train.py -c config.1_FRLR.deeplab-mobilenet.yml". The training successfully starts but no improvement in performance occurs. I did not change any of the configurations and the training stops at epoch 21 for both models. Thanks for your time and I look forward to your guidance.

Pretrained weights

Hello,
Thanks for sharing your work, it looks awesome. Would you be willing to provide pretrained weights for your models? It would be much easier to compare custom results with your work if there were no need to reproduce the training, which unfortunately may take a lot of time.

Best regards.

How to test on real-world images

Hello,

Could you please explain how one can use real-world images after training the model? I have tested the model successfully on the validation dataset from VTD, but for real-world data, I believe that I have to semantically segment it with the color palette that was used for training. Is there an existing model that you recommend for segmentation? (i.e. which model did you use to label the left-most real-world input pictures in Fig. 6 of the paper?)

Thank you

Originally posted by @mikqs in #16 (comment)

Pretrained Model

Dear Sir

Thanks so much for your amazing work. Could you share the pre-trained models for both the 360° view and the frontal view?

Yours
Hazem

BEV image

Thanks for your great work. I am reading the source code. In ipm.py, why is t computed by multiplying the camera position with -R?

def setT(self, XCam, YCam, ZCam):
    X = np.array([XCam, YCam, ZCam])
    self.t = -self.R.dot(X)
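For context (standard pinhole-camera algebra, not an explanation from the authors): if R maps world to camera coordinates and C = (XCam, YCam, ZCam) is the camera position in world coordinates, then a world point transforms as x_cam = R (X_world - C) = R X_world + t, so the extrinsic translation is t = -R C. A quick numeric check of this identity:

import numpy as np

R = np.array([[0, -1, 0], [0, 0, -1], [1, 0, 0]], dtype=float)  # some world-to-camera rotation
C = np.array([2.0, 0.5, 1.2])                                   # camera position in world frame
t = -R.dot(C)

X_world = np.array([5.0, -1.0, 0.3])
print(np.allclose(R.dot(X_world - C), R.dot(X_world) + t))      # True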

What changes to make to resume training from where it was left off?

Hi, I cannot do the whole training in one go, so I have to split it across multiple training sessions. I couldn't figure out a way to resume training from where it left off in the last session, i.e. loading the last trained weights and then continuing the training.
Thank you for your time. Looking forward to your reply.
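As a starting point, here is a minimal, self-contained sketch of resuming a Keras training run (toy model and hypothetical checkpoint paths, not the Cam2BEV training script): load the newest saved weights if any exist and continue fitting from the corresponding epoch.

import glob
import os
import tensorflow as tf

os.makedirs("checkpoints", exist_ok=True)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

ckpts = sorted(glob.glob("checkpoints/*.h5"), key=os.path.getmtime)
initial_epoch = 0
if ckpts:
    model.load_weights(ckpts[-1])      # resume from the most recent checkpoint
    initial_epoch = len(ckpts)         # or parse the epoch number from the filename

x, y = tf.random.normal([32, 4]), tf.random.normal([32, 1])
model.fit(x, y,
          epochs=5,                    # total epochs; training continues from initial_epoch
          initial_epoch=initial_epoch,
          callbacks=[tf.keras.callbacks.ModelCheckpoint("checkpoints/ep{epoch:02d}.h5",
                                                        save_weights_only=True)])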

Real-world application

Hi,

I am planning to use your software for a real application. I want to use the single-input model with the uNetXST configuration and a single front-facing camera. Is there any way to train the model with your dataset, or do I need to create my own dataset with my camera parameters? Also, my camera operates at HD resolution (1280x720); does the training data also need to have the same resolution?

Many thanks in advance!

Frame rate

Hi, may I know what frame rate the images in the dataset were captured at?

OOM Issue on UNetXST -- Occupies 55GB of physical RAM

Hello @lreiher, thank you for the fabulous work!

We're trying to reproduce your results for the uNetXST model with your dataset and your configuration (config.1_FRLR.unetxst.yml). I can confirm a previous issue that the RAM usage grows continuously and rapidly as soon as training starts. I am talking about main RAM, not GPU RAM. By the time the process is killed due to OOM, it occupies ~55 GB of physical memory. We're trying to train on two RTX 3090s with 24 GB of memory each.

Can you think of a part of your code where stuff accumulates in memory without being garbage collected? It's very strange that you haven't encountered this. As a note, we were able to train MobileNetV2 with your 1.FRLR configuration so this issue is endemic to UNetXST.

Fixing 'Incompatible Shapes' error

I am trying to train this model on my own data. I have been able to get it to work with my own data before, but I wanted my images to be semantically segmented beforehand, so I used a different model to do so. I think that in doing so I must have changed my environment enough to start getting this error, because I highly doubt it's an issue with the new images: they are the same size as the previous ones. I have reinstalled from requirements.txt and am still getting this issue. I have posted the error below. Any help on what the problem might be and how to fix it would be greatly appreciated. Thank you in advance!

Starting training...
Train for 200 steps, validate for 1169 steps
Epoch 1/100
2020-12-06 20:35:50.965473: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: Incompatible shapes: [3] vs. [256,512,4]
[[{{node Equal_29}}]]
[[IteratorGetNext]]
2020-12-06 20:35:50.987892: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: Incompatible shapes: [3] vs. [256,512,4]
[[{{node Equal_29}}]]
[[IteratorGetNext]]
[[metrics/mean_io_u_with_one_hot_labels/StatefulPartitionedCall/confusion_matrix/assert_non_negative/assert_less_equal/Assert/AssertGuard/else/_6/Assert/data_1/_20]]
1/200 [..............................] - ETA: 35:21WARNING:tensorflow:Can save best model only with val_mean_io_u_with_one_hot_labels available, skipping.
WARNING:tensorflow:Early stopping conditioned on metric val_mean_io_u_with_one_hot_labels which is not available. Available metrics are:
Traceback (most recent call last):
File "./train.py", line 185, in
callbacks=callbacks)
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
total_epochs=epochs)
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in call
result = self._call(*args, **kwds)
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 632, in _call
return self._stateless_fn(*args, **kwds)
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in call
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
self.captured_inputs)
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
ctx=ctx)
File "/home/techlab_grizzly/Desktop/Cam2BEV/env37/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [3] vs. [256,512,4]
[[{{node Equal_29}}]]
[[IteratorGetNext]] [Op:__inference_distributed_function_17137]
Function call stack:
distributed_function

Co-ordinate system convention

Hello,

Thank you for sharing such a well-documented code repository for your work! Could you also share the coordinate system convention (left-handed, right-handed, or something else) that the preprocessing/ipm/ipm.py code follows? I see a comment about switching axes in the setR() function, so I just wanted to make sure.

Thanks!

A Recommendation on Modifying the Resolution of Example Images in Your Dataset

First and foremost, thank you for providing such a brilliant work!
Now here comes my issue. I first downloaded the synthetic dataset you provide. Then, when I tried to use ipm.py in your preprocessing directory, I used the front/rear/left/right images in data/1_FRLR/examples and the camera calibration files in preprocessing/camera_configs/1_FRLR/ as arguments. This results in a totally blank output image.
After debugging through your code, I realized that it is caused by the mismatched resolution of the images in the examples directory. Your configuration files state that your cameras have a resolution of 964x604, while the example images have a resolution of 320x200.
Now I understand that this directory stores the images shown in your README and that I should instead select a set of images from the train or val directory. However, for anyone coming across this repo for the first time, this arrangement is very confusing, because at that point one can hardly tell that the actual data lives in your GitLab repository.
Therefore I highly recommend adjusting this: maybe give the images in the examples directory the proper resolution, or put a separate set of example images in your preprocessing directory.
It is not a critical issue. Your work is still generally wonderful!

Non-360 View

Our camera suite lacks coverage of a ~30 degree region at the rear of the vehicle. Classical image-stitching methods would break down without overlapping regions between cameras. Since this method uses a learned network, I could see it being robust to this.

So to clarify the question: how would you expect this method to respond to a non-360° view? Would the region just show up as the added "occluded" class, since it is not within the view of any of the cameras?

Model for testing

Hello, may I ask whether there is a trained model available that can directly be used for inference?

drone camera config file

Hello! I would like to ask whether the drone config parameters have to be the exact values or whether we can choose suitable ones. If exact values are required, how were they obtained?

one-hot-palette-label in 2_F versus 1_FRLR

Thanks so much for sharing this method and code in such a well-documented fashion. I just had a question of clarification regarding the use of different one-hot-palette labels in the multi-view versus single-view networks.

I started off training the 1_FRLR method, and I observe that the one-hot-palette-input, convert_10, and the one-hot-palette-label, convert_9+occl, seem very similar - the main difference is that the "sky" RGB values are changed to "occluded" instead, which makes sense because the inputs will not see an occluded class, and the BEV will not see the sky.

However, now looking at the 2_F config file, I see that while convert_10 is still used for the one-hot-palette-input, the one-hot-palette-label is now using convert_3+occl. My understanding is that now the network input views are seeing classes that the ground truth input will never have - for example, terrain that was seen as "9" in the input will be understood as "3" in the ground truth BEV. So my questions are:

  1. What was the rationale of reducing the label classes? Is it because less camera views mean 4 times less data, and having too many classes doesn't allow adequate training on all of them?
  2. If my understanding of the input-to-label process is correct, does the network eventually just learn to convert those terrain areas, originally understood as "9", to "3" to match the ground truth?

OOMKilled when training the code

Hi @lreiher, thanks for your great work.
I am running into a training problem when I run the code (with the released dataset):
at the end of the first epoch (6638/6639), possibly in the validation stage, the GPU runs out of memory and training breaks. I run the code on a Titan X GPU.
Here is some of the error output:
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
2020-12-24 16:12:51.644620: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 8388608 totalling 8.00MiB
2020-12-24 16:12:51.644637: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 8415488 totalling 8.03MiB
2020-12-24 16:12:51.644654: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 3 Chunks of size 10485760 totalling 30.00MiB
2020-12-24 16:12:51.644672: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 15204352 totalling 14.50MiB
2020-12-24 16:12:51.644689: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 28 Chunks of size 26214400 totalling 700.00MiB
2020-12-24 16:12:51.644707: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 29360128 totalling 28.00MiB
2020-12-24 16:12:51.644725: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 32505856 totalling 31.00MiB
2020-12-24 16:12:51.644743: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 64.00GiB
2020-12-24 16:12:51.644760: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 68719476736 memory_limit_: 68719476736 available bytes: 0 curr_region_allocation_bytes_: 68719476736
2020-12-24 16:12:51.644783: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 68719476736
InUse: 68714584576
MaxInUse: 68719357696
NumAllocs: 115809
MaxAllocSize: 33554432

2020-12-24 16:12:51.645379: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ****************************************************************************************************

Frame Rate

What is the speed of this implementation? We are mostly interested in finding a way to stitch camera images and had thoughts about transforming to a bird's eye view for depth information. Our purpose is to implement this on our autonomous race car, and since we will be running at high speeds, traditional homography seems to have too many inaccuracies.

Your methods are of interest to us. However, given that this is on a race car, speed is of high importance. We are currently running our cameras anywhere from 25 to 40 Hz. In the paper, you mention 2 Hz. Is this a consequence of the speed of the network?

In short, what's the latency or maximum speed of the network?

A few questions about your work

Thank you for your excellent work! I have some questions to ask.

  1. I would like to know how the synthetic data was obtained. I noticed that the synthetic data is related to Cityscapes; how is the ground truth from the BEV perspective synthesized? I would like to use this generation method to synthesize indoor scene data, can you elaborate?
  2. Regarding Section 3-C of the paper (Single-Input Model): is the input to this part the IPM projection of the inference results obtained from a trained segmentation algorithm? So the function of the network here is to correct the errors caused by IPM, and the pipeline for the application is:
    input image -> segmentation result -> IPM -> single-input model -> final result. Is my understanding correct?
  3. In Section 3-D (Multi-Input Model), do you integrate IPM into the network and perform end-to-end segmentation in the BEV perspective? But what if there is no segmentation ground truth in the BEV perspective? How should this be done? For example, if I only have segmentation annotations in the perspective view, does that mean I can't use this method at all? How do I apply it to real scenarios?
  4. Following from the above, can I understand that the key to the algorithm lies in generating segmentation annotations from the BEV perspective, i.e. going back to the first question, the most important part is the data synthesis? Is my understanding correct? I'm curious to know: if I only have a self-labeled 2D indoor-scene dataset, how can I extend your method to the BEV perspective?
    Looking forward to your reply!

Training on original input

Hello!

I want to train the model (uNetXST) on the original images as input. I was wondering what change needs to be done in that case?

I assume I will have to change train.py so that no one-hot-encoding is done on the input? In that case, what will the value of n_classes_input be? Just the channels of the image (3)?
