Comments (10)
This error seems to happen because the validation projection indices are wrong. I changed the implementation of these indices recently. Did you try with the latest version of the code?
On my computer, I started from scratch with the current implementation and did not get this error.
If you want to retry, you should delete the input_0.XXX folder, which contains precomputed inputs. That way, the code will have to compute them again. Hopefully that will correct your error.
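For reference, a minimal sketch of that cleanup step in Python. The exact folder suffix (the XXX above) depends on your configuration, so this sketch matches it with a glob instead of hard-coding it; the function name and the dataset path are illustrative, not from the KPConv code:

```python
import glob
import os
import shutil

def delete_precomputed_inputs(dataset_dir):
    """Remove precomputed input_0.* folders so the code rebuilds them.

    The suffix after 'input_0.' depends on the run configuration, so we
    match it with a glob rather than naming it explicitly.
    """
    for folder in glob.glob(os.path.join(dataset_dir, "input_0.*")):
        print(f"Deleting {folder}")
        shutil.rmtree(folder)

# Hypothetical usage:
# delete_precomputed_inputs("Data/S3DIS")
```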
Best,
Hugues
from kpconv.
Hi, @HuguesTHOMAS
Sorry for neglecting your updated code. I will try it again following your advice. Thank you so much again.
Best,
Hlxwk
Hi @HuguesTHOMAS, I ran into a similar problem in the middle of training (at around iteration 20000) with the latest code. I didn't modify the default code; would you mind helping us figure out how to solve the problem? Great thanks~
Best,
Ken
Hi @HuguesTHOMAS,
I also have the same problem when running validation on the S3DIS dataset. I pulled your changes today, deleted the precomputed files, and ran it again. The problem still persists. Thanks again.
My setup:
- Python 3.6.8
- Tensorflow 1.13.1
- CUDA 10.0 (cuDNN 7.4.1)
P.S. (off topic) I saw that you reported a bug with TF 1.13.0 and CUDA 10 regarding matrix multiplications. Fortunately, I did not face it on my setup with the above-mentioned software versions.
@nejcd,
The bug I reported is really strange: the same version of the code gives different results depending on the GPU drivers installed and the CUDA version used. I would advise against using CUDA 10.0 to avoid any problems. How do you know you did not face the matrix multiplication bug? I ask because this bug does not occur on simple matrix multiplication operations. It only appears inside the network's computational flow, as if CUDA could not handle a big graph of TF operations.
@kentangSJTU,
could you be more specific? Is it the same error message? Did the loss become NaN? Which TF and CUDA versions have you installed?
Hi @HuguesTHOMAS, sorry for not making the question clear.
The error happens during train-time validation (to be more specific, just after the smaller evaluation in epoch 49 ends), and is divided into three parts. The first part is:
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[{{node IteratorGetNext}} = IteratorGetNext[output_shapes=[[?,3], [?,3], [?,3], [?,3], [?,3], ..., [?], [?,3], [?,3,3], [?], [?]], output_types=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]
[[{{node optimizer/gradients/KernelPointNetwork/layer_2/resnetb_0/conv2/concat_1_grad/GatherV2_1/_571}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_3325_...GatherV2_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
In the main thread, the error happens in layer_0; but in my case, the error happens in layer_2, so it seems to be different.
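For readers unfamiliar with this first error: `OutOfRangeError` is how a TF 1.x `tf.data` iterator signals that it is exhausted, much like Python's built-in `StopIteration`, so validation loops are normally wrapped in a try/except. A minimal sketch of that pattern in plain Python (the function and names are illustrative, not taken from the KPConv code):

```python
def run_validation(batches):
    """Consume validation batches until the iterator is exhausted.

    In TF 1.x, calling sess.run(...) on an exhausted tf.data iterator
    raises tf.errors.OutOfRangeError instead of Python's StopIteration;
    if the loop does not catch it, the error propagates as in the
    traceback above.
    """
    it = iter(batches)
    count = 0
    while True:
        try:
            batch = next(it)       # analogue of sess.run(next_batch_ops)
        except StopIteration:      # analogue of tf.errors.OutOfRangeError
            break
        count += 1
    return count

print(run_validation(range(3)))  # → 3
```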
The second part occurs during handling of the above exception; its error message is very similar to the one reported above. The last part is the same as in the main thread, "IndexError: arrays used as indices must be of integer (or boolean) type", with exactly the same line number (Line 806 in trainer.py).
The loss didn't become NaN, and I use TF 1.12.0 + CUDA 9.0, as suggested by the official documentation. Thanks a lot for your reply~
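For context on that last part: NumPy refuses to index with a float-typed array, which is what that `IndexError` means, so it points at reprojection indices that ended up stored as floats somewhere. A minimal reproduction with hypothetical data (not the actual KPConv arrays):

```python
import numpy as np

labels = np.array([1, 2, 3, 4])
proj_inds = np.array([0.0, 2.0])  # indices accidentally stored as floats

try:
    labels[proj_inds]
except IndexError as e:
    # Prints: arrays used as indices must be of integer (or boolean) type
    print(e)

# Casting the indices back to an integer dtype restores valid indexing:
print(labels[proj_inds.astype(np.int32)])  # → [1 3]
```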
I have an update:
By temporarily disabling the code around Line 806 of trainer.py, I was able to train the model normally. But during testing the same error happens again, when the script is calculating Reprojection Vote #15. Thus, I believe this error is related to testing rather than training.
Best,
Ken
@kentangSJTU,
Thank you for the details. It seems that there is a problem with the reprojection indices, which is not surprising, as I changed this part of the code very recently. Since this happens in the middle of training, it could be caused by a particular input, for example one with empty reprojection indices or something similar that is not handled well.
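One plausible way an empty index array can trigger exactly this `IndexError`, sketched in NumPy (this is a guess at the mechanism, not taken from the KPConv code): an index array built from an empty list silently gets a float dtype, which only fails once it is actually used for indexing.

```python
import numpy as np

# An empty array created without an explicit dtype defaults to float64,
# which later fails when used as an index array:
proj_inds = np.array([])
print(proj_inds.dtype)  # → float64

# Guard: force an integer dtype, even (especially) when the array is empty.
safe_inds = proj_inds.astype(np.int32)

preds = np.array([0, 1, 2, 1])
print(preds[safe_inds])  # → [] (empty selection, but no IndexError)
```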
I am going to run the code myself and see if I find what causes the error.
Best,
Hugues
The bug has been fixed.
The validation and test should work now on all datasets, but you will have to delete the input_0.XXX folders so that the old reprojection indices can be replaced.
Thanks a lot for your reply~
@HuguesTHOMAS, as I understand it, NaNs start to appear during training and it is not possible to train with an affected version? All the training runs I have done converged nicely, therefore I assumed that the version and hardware (GPU: GTX 1080 Ti) I have are not affected. If I am missing something, please let me know, or tell me how I should test it.