skhadem / 3D-BoundingBox
PyTorch implementation for 3D Bounding Box Estimation Using Deep Learning and Geometry
License: MIT License
I want to know: if we have ground-truth 2D bounding box annotations, to what level can the performance of this model be improved? Can someone give me an idea?
Hello!
I have a question about the derivation:
In the given material (http://ywpkwon.github.io/pdf/bbox3d-study.pdf), what does K mean?
I think it is the intrinsic matrix, but why is its shape 3x4?
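My current guess (please correct me if I am wrong) is that the 3x4 matrix is not the bare intrinsics but the full projection matrix, i.e. the 3x3 intrinsics already multiplied with [R | t], as in KITTI's P2. A minimal sketch with made-up but KITTI-like numbers:

import numpy as np

# Assumption: the 3x4 "K" is really P = K_3x3 [R | t], mapping homogeneous 3D
# camera-frame points directly to pixels. The values below are only illustrative.
P = np.array([[721.5,   0.0, 609.6, 44.9],
              [  0.0, 721.5, 172.9,  0.2],
              [  0.0,   0.0,   1.0,  0.003]])

X = np.array([1.84, 1.47, 8.41, 1.0])   # homogeneous 3D point (x, y, z, 1)
u, v, w = P @ X
print(u / w, v / w)                      # pixel coordinates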
Thank you very much!
I have my own dataset, whose format is like Pascal VOC, with labeled images only; I do not have the calibration files. How can I train on my own dataset?
Is there another link we can use to access the pretrained weights? Thanks!
When I run this code:
my_vgg = vgg.vgg19_bn(pretrained=True)
model = Model(features=my_vgg.features, bins=2).cuda()
I get the error below:
model = Model(features=my_vgg.features, bins=2).cuda()
TypeError: 'module' object is not callable
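A likely cause (an assumption, since the import statements are not shown): Model here refers to the module torch_lib/Model.py rather than the class defined inside it, and calling a module raises exactly this error. A minimal sketch of the fix:

# import torch_lib.Model as Model        # Model is a module -> not callable
from torch_lib.Model import Model        # Model is the class -> callable
from torchvision.models import vgg

my_vgg = vgg.vgg19_bn(pretrained=True)
model = Model(features=my_vgg.features, bins=2).cuda()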
I want to know how KITTI generated their calib files, because when I generate one myself (with the OpenCV chessboard method) the syntax is totally different from theirs and I get no useful results. Could you kindly tell me how I can produce a calib file in the same syntax as KITTI's? Thanks.
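For what it's worth, the KITTI object-detection calib files are plain text with one row-major matrix per line, and this repo presumably only needs the P2 line. A hedged sketch of turning an OpenCV chessboard calibration into that syntax, assuming a single already-rectified camera so the extra translation column can stay zero:

import numpy as np

fx, fy, cx, cy = 1000.0, 1000.0, 640.0, 360.0   # placeholders: take these from your cv2.calibrateCamera output
camera_matrix = np.array([[fx, 0.0, cx],
                          [0.0, fy, cy],
                          [0.0, 0.0, 1.0]])

P2 = np.hstack([camera_matrix, np.zeros((3, 1))])   # 3x4, no stereo offset

with open("calib.txt", "w") as f:
    f.write("P2: " + " ".join(f"{v:.6e}" for v in P2.flatten()) + "\n")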
Does anybody know if this method could be used for vehicle 3D detection in residential scenes?
In that case, the images come from surveillance cameras, which are mounted about 3-4 meters above the ground.
Thus, the roll angle of the cameras is definitely not 0.
If I directly use the pretrained model and the original camera calibration files, the predictions are really bad in my scenes.
So how could I change the camera calibration files to make them suitable for image datasets from surveillance cameras?
Or is that simply not possible?
Hi ... I am currently working on data generated using Unity. I am not able to get a proper 3D bounding box: my 2D bounding box output is correct, but the 3D box is way off. My image size is 1024 x 1024. What changes do I need to make in order to map the 3D box onto my data?
Hello,
Thanks for this great repo! Could you please add a licence file, to clarify how it can be used?
Thanks
Once a 3D bounding box is plotted, it is drawn on the original image (named 'img' in your code). However, the next cropped image is then cropped from that same 'img', which already contains one or more 3D bounding boxes (i.e. crops with partial 3D bounding box plots are fed into the model for training or testing). So I changed the code to plot_img = np.copy(truth_img) and drew the 3D bounding box on plot_img rather than on 'img'.
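In code, the change described above is roughly (using the names from this post):

import numpy as np

plot_img = np.copy(truth_img)    # draw on a copy of the clean image
plot_3d_box(plot_img, cam_to_img, orient, dimensions, location)   # instead of drawing on 'img'
# later crops are then still taken from the untouched 'img'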
Hi,
I am attempting to use this method to train on my own dataset which I have generated in Unity using the Unity Perception Package, therefore this requires quite a few modifications of the Dataset class. Unity will generate the ground truth and provide me with the following:
X,Y,Z position of the 3D bounding box center wrt. the camera
Object dimensions
Object rotation wrt. global coordinate frame
2D bounding box coordinates within the image
Camera intrinsic matrix
In the corresponding paper, the three angles of interest are Theta Ray, Theta L, and Theta. I believe I understand what these are and the correspondence between them:
Theta ray is the ray angle of the object center (calculated as the angle between the camera principal point and 3D bounding box center).
Theta L is the local orientation i.e. orientation of object wrt. to the camera.
Theta is the global orientation of the object.
Theta = Theta Ray + Theta L
However, looking in the Dataset class, there are references to three different angles: Alpha, Ry, and theta_ray. As far as I understand it, Alpha is equivalent to Theta L (as this is what you are regressing), Ry is equivalent to Theta (the global orientation), and theta_ray is self-explanatory.
As far as I am aware, theta_ray is calculated using the position of the 2D bounding box within the image, the model predicts Alpha, and using the correspondence between these we can find the global orientation of the object.
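In code, my understanding looks roughly like this (the numbers are made up, and the intrinsics handling is simplified to a single focal length and principal point):

import numpy as np

def theta_ray_from_box(box_center_x, fx, cx):
    # angle between the optical axis and the ray through the 2D box centre
    return np.arctan2(box_center_x - cx, fx)

theta_ray = theta_ray_from_box(box_center_x=850.0, fx=721.5, cx=609.6)
alpha = -0.2                    # local orientation (Theta L), what the network regresses
ry = alpha + theta_ray          # global orientation, Theta = Theta L + Theta Ray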
I would just like to confirm that all this is correct, as I have been having a hard time understanding this.
Your feedback is greatly appreciated :)
Hello, does this work OK now?
Hi there,
I would like to transfer the PyTorch model to a Caffe model myself.
The cfg file, which stores the network structure, needs to be built.
However, the model we get after training is divided into three parts: dimension, orientation, and confidence,
so I am a bit confused about how to write the cfg file on my own.
I would appreciate it if you could provide a solution or suggestion.
Best regards.
I was unable to understand the formation of the constraints in the calculation of the translation vector in the Math.py file. Could you give me a hint about it?
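My current (possibly wrong) reading of the paper, for reference: each edge of the 2D box is assumed to touch the projection of one of the 3D box corners, and each such edge gives one equation that is linear in the unknown translation T, so stacking the four edges yields an over-determined linear system. A rough sketch, with K as the 3x3 intrinsics (this is my own reconstruction, not the exact code in Math.py):

import numpy as np

def constraint_row(K, R, X_c, value, axis):
    # axis=0 for a left/right (x) edge at pixel `value`, axis=1 for a top/bottom (y) edge.
    # From value * [K (R X_c + T)]_z = [K (R X_c + T)]_axis, which is linear in T:
    M = K @ R
    A_row = K[axis] - value * K[2]              # coefficients of T
    b_val = (value * M[2] - M[axis]) @ X_c      # constants from the rotated corner
    return A_row, b_val

# Stack one row per edge (x_min, y_min, x_max, y_max) into A and b, then
# T, *rest = np.linalg.lstsq(A, b, rcond=None)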
The predictions are way off from the ground truth when the object is truncated, i.e. only part of the object is within the image boundary. Estimating the position uses the four edges of the 2D bounding box, but when there is truncation the bounding box does not cover all of the object, only the part that is within the image.
Is there a way to overcome this problem? Or is this method simply just not suitable for cases where truncation occurs?
Hi
I have a question. In the paper 3D Bounding Box Estimation Using Deep Learning and Geometry, there is an assumption that the 3D bounding box fits tightly into the 2D detection window, which requires each side of the 2D bounding box to be touched by the projection of at least one of the 3D box corners. I have tested your code, and it seems that you have not considered that. Could we please discuss this?
Hello @skhadem, thank you so much for this implementation. I would like to ask how to convert the results back into KITTI format. I plan to reproduce the paper's results. Could you give me a hint about which values should be put in the KITTI format?
I also have a problem understanding the label, as described in the development kit below.
#Values Name Description
----------------------------------------------------------------------------
1 type Describes the type of object: 'Car', 'Van', 'Truck',
'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram',
'Misc' or 'DontCare'
1 truncated Float from 0 (non-truncated) to 1 (truncated), where
truncated refers to the object leaving image boundaries
1 occluded Integer (0,1,2,3) indicating occlusion state:
0 = fully visible, 1 = partly occluded
2 = largely occluded, 3 = unknown
1 alpha Observation angle of object, ranging [-pi..pi]
4 bbox 2D bounding box of object in the image (0-based index):
contains left, top, right, bottom pixel coordinates
3 dimensions 3D object dimensions: height, width, length (in meters)
3 location 3D object location x,y,z in camera coordinates (in meters)
1 rotation_y Rotation ry around Y-axis in camera coordinates [-pi..pi]
1 score Only for results: Float, indicating confidence in
detection, needed for p/r curves, higher is better.
However, when I look at a sample label, say 000000.txt, the content is as follows.
Pedestrian 0.00 0 -0.20 712.40 143.00 810.73 307.92 1.89 0.48 1.20 1.84 1.47 8.41 0.01
As we can see, here we only have 15 values instead of the 16 values in the description. For evaluation, is it necessary to provide the score at the end?
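For reference, my current attempt at writing a result line, based purely on the description above (so the choice of -1 for the unknown truncated/occluded fields is my own assumption):

def to_kitti_result_line(cls, alpha, bbox, dim, loc, ry, score):
    # bbox = (left, top, right, bottom), dim = (h, w, l), loc = (x, y, z)
    values = [-1, -1, alpha, *bbox, *dim, *loc, ry, score]
    return cls + " " + " ".join(f"{v:.2f}" for v in values)

print(to_kitti_result_line("Pedestrian", -0.20, (712.40, 143.00, 810.73, 307.92),
                           (1.89, 0.48, 1.20), (1.84, 1.47, 8.41), 0.01, 0.90))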
Thank you so much
$ python Run.py
output:
Traceback (most recent call last):
File "Run.py", line 201, in <module>
main()
File "Run.py", line 137, in main
detections = yolo.detect(yolo_img)
File "code/3D-BoundingBox-master/yolo/yolo.py", line 34, in detect
ln = [ln[i[0] - 1] for i in self.net.getUnconnectedOutLayers()]
File "code/3D-BoundingBox-master/yolo/yolo.py", line 34, in <listcomp>
ln = [ln[i[0] - 1] for i in self.net.getUnconnectedOutLayers()]
IndexError: invalid index to scalar variable.
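A likely cause (an assumption based on the traceback): newer OpenCV releases return getUnconnectedOutLayers() as a flat 1-D array of ints, so i[0] indexes into a scalar. Flattening first works on both old and new versions:

ln = self.net.getLayerNames()
ln = [ln[i - 1] for i in self.net.getUnconnectedOutLayers().flatten()]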
Hi
Thanks for your great work. I am now trying to replace the backbone of the second stage with ResNet, since ResNet usually performs better on the same task. However, after I replace the VGG with ResNet, the results are terrible. I wonder if you have done the same thing; could you please tell me whether this idea is useful? Thank you!
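For reference, the swap I tried looks roughly like the sketch below, so you can see whether I am doing something obviously wrong. Note that torchvision ResNets have no .features attribute, and if the fully connected heads in Model assume a 512*7*7 feature map from VGG, the first Linear layer of each head has to match the new backbone as well:

import torch.nn as nn
from torchvision.models import resnet
from torch_lib.Model import Model    # path as used elsewhere in the repo (my assumption)

backbone = resnet.resnet34(pretrained=True)
features = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool and fc
# resnet34 also ends in 512 channels with a 7x7 map for 224x224 crops, so the
# 512*7*7 head input may be reusable; resnet50 would need 2048*7*7 instead.
model = Model(features=features, bins=2).cuda()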
Best regards
I think the YOLO implementation here is CPU-based, while the PyTorch part runs on CUDA. How can I use CUDA for the YOLO detections as well?
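One option, assuming the YOLO detector here goes through cv2.dnn and that your OpenCV build was compiled with CUDA support (otherwise these calls silently fall back to CPU):

import cv2

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")   # placeholder file names
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)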
If I want to do 3D detection on an image (without a label file), what should I do?
Hi
Thanks for your great work, it helped me a lot. However, there is some code that I don't understand, for example: dim += averages.get_item(label['Class']). Could you please explain it? Thank you very much!
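My own guess at what that line does (an assumption, not confirmed by the author): the network regresses a residual relative to the per-class average dimensions, so the dataset subtracts the class mean during training and this line adds it back at inference. Roughly:

class_avg_hwl = {"Car": [1.53, 1.63, 3.88]}                 # illustrative numbers only
dim_residual = [0.04, -0.02, 0.10]                          # what the network predicts
dim = [r + m for r, m in zip(dim_residual, class_avg_hwl["Car"])]   # actual h, w, l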
Hi folks,
I have observed this part of the source code:
"""
det.type in self.classes and det.score > self.score_thres):
intrinsics = ros_intrinsics(self.camera_info.P)
input_tensor,theta_ray = preprocessing(image,det,intrinsics)
[orient, conf, dim] = self.model(input_tensor) #Apply the model to get the estimation
orient = orient.cpu().data.numpy()[0, :, :]
conf = conf.cpu().data.numpy()[0, :]
dim = dim.cpu().data.numpy()[0, :]
# print("Conf:{}".format(conf))
dim += self.averages.get_item(det.type)
argmax = np.argmax(conf)
orient = orient[argmax, :]
cos = orient[0]
sin = orient[1]
alpha = np.arctan2(sin, cos)
alpha += self.angle_bins[argmax]
alpha -= np.pi
"""
But that conf is a pair of two numbers, which is used to determine the best orientation bin, like this:
"""
Conf:[ 6.3896847 -6.5501723]
Conf:[ 6.496025 -6.7066655]
Conf:[ 5.410366 -5.5474744]
Conf:[ 7.092432 -7.3124714]
Conf:[ 9.061753 -9.251386]
Conf:[ 7.587371 -7.831802]
Conf:[ 2.149212 -2.1235662]
Conf:[-0.84504336 0.89392436]
Conf:[ 4.436549 -4.5268965]
Conf:[ 1.2938225 -1.4327605]
"""
How can I get the score of the final 3D Bounding Box? (0 to 1 value, like in every 2D or 3D object detector)
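My interpretation so far (please correct me): the two values are unnormalised logits, one per orientation bin, so a softmax turns them into a 0-1 probability over bins. That is a confidence in the chosen angle bin, though, not an objectness score, so for a detector-style score I would still rely on (or multiply in) the 2D detector's confidence:

import numpy as np

conf = np.array([6.3896847, -6.5501723])     # one of the printed pairs above
probs = np.exp(conf - conf.max())
probs /= probs.sum()
print(probs[np.argmax(conf)])                # in [0, 1], close to 1.0 for this pair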
Thanks in advance.
Feeding in the same image multiple times produces different orientation estimations each time. Does anyone know why this would be the case?
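One common cause worth checking (an assumption, I have not traced this repo): the network is left in training mode, so any Dropout or BatchNorm layers behave stochastically at inference time. A minimal sketch:

import torch

model.eval()                 # fix Dropout / BatchNorm to inference behaviour
with torch.no_grad():        # no gradients needed at inference
    orient, conf, dim = model(input_tensor)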
Hi,
I'm wondering how you construct your dataset from KITTI, especially the "Location" keyword in the label. I don't quite understand how the following lines of code from torch_lib/Dataset.py work:
Location = [line[11], line[12], line[13]] # x, y, z
Location[1] -= Dimension[0] / 2 # bring the KITTI center up to the middle of the object
Why does the y component of "Location" use the first component of "Dimension"?
This also appears in library/Math.py in the calc_location() function:
# using a different coord system
dx = dimension[2] / 2
dy = dimension[0] / 2
dz = dimension[1] / 2
Why are you switching the coordinate system, and can you please tell me how you parse the raw location information from the KITTI dataset?
P.S. I noticed this because I tried to read the ground truth labels directly from the generated dataset and plot them using your plot_3d_box(img, cam_to_img, orient, dimensions, location) function. However, there is a serious offset in location, especially in the y coordinate. Can you please tell me how to read the ground truth location information from the generated dataset and plot it correctly?
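For reference, my reading of the label fields, based on the KITTI devkit readme rather than on this repo (so treat the conventions below as assumptions): dimensions are stored as (h, w, l), location is the bottom centre of the box in camera coordinates, and y points down, which is why subtracting h/2 from y moves the point to the geometric centre:

fields = line.split()                                                   # one KITTI label line
Dimension = [float(fields[8]), float(fields[9]), float(fields[10])]     # h, w, l
Location  = [float(fields[11]), float(fields[12]), float(fields[13])]   # x, y, z (bottom centre)
Location[1] -= Dimension[0] / 2                                         # bottom centre -> box centre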
Thank you so much~
yolo
Using previous model epoch_90.pkl
/home/zjut/anaconda3/envs/pytorch_GPU/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
f"The parameter '{pretrained_param}' is deprecated since 0.13 and will be removed in 0.15, "
/home/zjut/anaconda3/envs/pytorch_GPU/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=VGG19_BN_Weights.IMAGENET1K_V1`. You can also use `weights=VGG19_BN_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Traceback (most recent call last):
File "Run.py", line 203, in
main()
File "Run.py", line 137, in main
detections = yolo.detect(yolo_img)
File "/home/zjut/code/3D-BoundingBox/yolo/yolo.py", line 31, in detect
(H,W) = image.shape[:2]
ValueError: not enough values to unpack (expected 2, got 0)
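One thing I would check (an assumption, since the traceback only shows the unpack failing): whether the image was actually loaded before being handed to yolo.detect. A minimal sketch, with img_path as a hypothetical path variable:

import cv2

yolo_img = cv2.imread(img_path)
assert yolo_img is not None and yolo_img.size > 0, f"could not read {img_path}"
(H, W) = yolo_img.shape[:2]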
Hi
I am just wondering whether the training process can run on normal images, such as those taken with a regular camera, instead of images from a velodyne or stereo setup?