Coder Social home page Coder Social logo

Comments (17)

dallison avatar dallison commented on June 4, 2024 1

Alright, found the issue. The getpid() call is always returning the number 12. This is used to generate the name of the shared memory files, so all processes will use the same files for their shared memory. I'm changing it to use mkstemp instead. This is very weird and I've no idea why getpid would return 12. It feels like this is an errno value or something but getpid doesn't return errno values.

This only happens when you run it with 'bazel test'. If you run the client_test program directly, the getpid is correct. I suspect a bug in bazel test but I've no idea how it would make getpid return 12.

I'll upload a fix.

from subspace.

dallison avatar dallison commented on June 4, 2024

Thanks Mikael, I'll try to reproduce it tonight after work. It feels like some sort of locking issue but you have given me enough information to repro it. I assume these are unreliable messages. Can you tell me the number of slots in the channel?

from subspace.

mikael-s-persson avatar mikael-s-persson commented on June 4, 2024

Yeah, the parameters are /*slot_size=*/1024, /*num_slots=*/32,
It should be reproducible. I couldn't run your client_test with bazel, I could only compile and then run it manually, maybe something to do with the forking for the server (I used the Server class doing "Run()" in a background thread instead). The non-determinism of it makes me think it has to be some race condition, and the fact that the double publish is consistent from start to finish makes me think it's somewhere in the creation.

from subspace.

dallison avatar dallison commented on June 4, 2024

There is a --start_server=false option in the client test and you can run the server manually (changing the socket name to match the server's socket).

Could you check the number of publishers in your test? There should be one. It's definitely weird behavior and I can't think what would cause it at the moment. I'll see if I can repro it over lunch.

from subspace.

mikael-s-persson avatar mikael-s-persson commented on June 4, 2024

Some more evidence (of what, I don't know).

I tried to use the notify fd in the server instead of a loop on trying to connect a client to it. That does not work. I think the server is lying about being ready when it notifies the fd.

I tried switching to "local=false" for the server. With that, I get some errors (~15 / 1000 runs) for failing to create the broadcasting (maybe some port collisions, etc.). But interestingly, I don't see any errors with the double publishing... so maybe it slows things down, making the race condition become rarer, or it's something about local=true that is buggy.

I tried to get the ChannelDirectory to see what the server knows about the channels. Generally, that fails on the protobuf encoding, but the few times that channel's data was not corrupted, it said that my "star wars" channel had zero slot count and zero slot size, as if no publisher was ever created on it.

I turned on the debug on the clients, and got a few more interesting print outs:

[ RUN      ] TestSubspaceMsgMap.Basic
Publishing new message with of length 37 with payload 000
����ڂٞ

A New Hopeffffff�?
publish old slot: 1: 0
publish new slot: 3: 0
Published message with ordinal 2 size 37
Publishing new message with of length 50 with payload 000
��������
The Empire Strikes Back000000�?
publish old slot: 3: 0
publish new slot: 5: 0
Published message with ordinal 4 size 50
Publishing new message with of length 49 with payload 000
��̗ճ��
The Return of the Jedi�������?
publish old slot: 5: 0
publish new slot: 7: 0
Published message with ordinal 6 size 49
Publishing new message with of length 45 with payload 000
�������

The Phantom Menace000000�?
publish old slot: 7: 0
publish new slot: 9: 0
Published message with ordinal 8 size 45
Publishing new message with of length 47 with payload 000
������ۖ
Attack of the Clones333333�?
publish old slot: 9: 0
publish new slot: 11: 0
Published message with ordinal 10 size 47
Publishing new message with of length 46 with payload 000
���͏��
Revenge of the Sith000000�?
publish old slot: 11: 0
publish new slot: 13: 0
Published message with ordinal 12 size 46
Publishing new message with of length 44 with payload 000
������ʐ
The Force Awakens�������?
publish old slot: 13: 0
publish new slot: 16: 0
Published message with ordinal 15 size 44
Publishing new message with of length 40 with payload 000
����Ǒ��
The Last Jed�������?
publish old slot: 16: 0
publish new slot: 18: 0
Published message with ordinal 17 size 40
Publishing new message with of length 48 with payload 000
���ޚ���
The Rise of Skywalker	333333�?
publish old slot: 18: 0
publish new slot: 19: 0
Published message with ordinal 18 size 48
Source was triggered!
read new_slot: 0: 1
Staging new message with ordinal 1 and payload 000
����ڂٞ

A New Hopeffffff�?
Emitting message 1
Reading next message!
read old slot: 0: 1
read new_slot: 1: 2
Staging new message with ordinal 2 and payload 000
����ڂٞ

A New Hopeffffff�?
Emitting message 2
Reading next message!
read old slot: 1: 2
read new_slot: 2: 3
Staging new message with ordinal 3 and payload 000
��������
The Empire Strikes Back000000�?
Emitting message 3
Reading next message!
read old slot: 2: 3
read new_slot: 3: 4
Staging new message with ordinal 4 and payload 000
��������
The Empire Strikes Back000000�?
Emitting message 4
Reading next message!
read old slot: 3: 4
read new_slot: 4: 5
Staging new message with ordinal 5 and payload 000
��̗ճ��
The Return of the Jedi�������?
Emitting message 5
Reading next message!
read old slot: 4: 5
read new_slot: 5: 6
Staging new message with ordinal 6 and payload 000
��̗ճ��
The Return of the Jedi�������?
Emitting message 6
Reading next message!
read old slot: 5: 6
read new_slot: 6: 7
Staging new message with ordinal 7 and payload 000
�������

The Phantom Menace000000�?
Emitting message 7
Reading next message!
read old slot: 6: 7
read new_slot: 7: 8
Staging new message with ordinal 8 and payload 000
�������

The Phantom Menace000000�?
Emitting message 8
Reading next message!
read old slot: 7: 8
read new_slot: 8: 9
Staging new message with ordinal 9 and payload 000
������ۖ
Attack of the Clones333333�?
Emitting message 9
Reading next message!
read old slot: 8: 9
read new_slot: 9: 10
Staging new message with ordinal 10 and payload 000
������ۖ
Attack of the Clones333333�?
Emitting message 10
Reading next message!
read old slot: 9: 10
read new_slot: 10: 11
Staging new message with ordinal 11 and payload 000
���͏��
Revenge of the Sith000000�?
Emitting message 11
Reading next message!
read old slot: 10: 11
read new_slot: 11: 12
Staging new message with ordinal 12 and payload 000
���͏��
Revenge of the Sith000000�?
Emitting message 12
Reading next message!
read old slot: 11: 12
read new_slot: 12: 13
Staging new message with ordinal 13 and payload 000
������ʐ
The Force Awakens�������?
Emitting message 13
Reading next message!
read old slot: 12: 13
read new_slot: 14: 14
Staging new message with ordinal 14 and payload 000
����Ǒ��
The Last Jed�������?
Emitting message 14
Reading next message!
read old slot: 14: 14
read new_slot: 13: 15
Staging new message with ordinal 15 and payload 000
������ʐ
The Force Awakens�������?
Emitting message 15
Reading next message!
read old slot: 13: 15
read new_slot: 15: 16
Staging new message with ordinal 16 and payload 000
���ޚ���
The Rise of Skywalker	333333�?
Emitting message 16
Reading next message!
read old slot: 15: 16
read new_slot: 16: 17
Staging new message with ordinal 17 and payload 000
����Ǒ��
The Last Jed�������?
Emitting message 17
Reading next message!
read old slot: 16: 17
read new_slot: 18: 18
Staging new message with ordinal 18 and payload 000
���ޚ���
The Rise of Skywalker	333333�?
Emitting message 18
Reading next message!
read old slot: 18: 18
msg_map/subspace_source_test.cc:117: Failure
Value of: reviews_received
Expected: has 9 elements where
element #0 is equal to (8-byte object <00-40 37-F7 52-D4 00-00>, "A New Hope", 4, 0.95),
element #1 is equal to (8-byte object <00-C0 EC-41 22-2A 01-00>, "The Empire Strikes Back", 5, 1),
element #2 is equal to (8-byte object <00-E0 85-5D 7E-80 01-00>, "The Return of the Jedi", 6, 0.9),
element #3 is equal to (8-byte object <00-20 15-E3 3E-4B 03-00>, "The Phantom Menace", 1, 0.5),
element #4 is equal to (8-byte object <00-00 A2-4B 22-A1 03-00>, "Attack of the Clones", 2, 0.6),
element #5 is equal to (8-byte object <00-A0 DD-EF 2D-F7 03-00>, "Revenge of the Sith", 3, 0.75),
element #6 is equal to (8-byte object <00-24 77-C6 35-27 05-00>, "The Force Awakens", 7, 0.8),
element #7 is equal to (8-byte object <00-24 F0-A2 6A-60 05-00>, "The Last Jedi", 8, 0.1),
element #8 is equal to (8-byte object <00-C4 4C-50 2C-9A 05-00>, "The Rise of Skywalker", 9, 0.3)
  Actual: { (8-byte object <00-40 37-F7 52-D4 00-00>, "A New Hope", 4, 0.95), (8-byte object <00-40 37-F7 52-D4 00-00>, "A New Hope", 4, 0.95), (8-byte object <00-C0 EC-41 22-2A 01-00>, "The Empire Strikes Back", 5, 1), (8-byte object <00-C0 EC-41 22-2A 01-00>, "The Empire Strikes Back", 5, 1), (8-byte object <00-E0 85-5D 7E-80 01-00>, "The Return of the Jedi", 6, 0.9), (8-byte object <00-E0 85-5D 7E-80 01-00>, "The Return of the Jedi", 6, 0.9), (8-byte object <00-20 15-E3 3E-4B 03-00>, "The Phantom Menace", 1, 0.5), (8-byte object <00-20 15-E3 3E-4B 03-00>, "The Phantom Menace", 1, 0.5), (8-byte object <00-00 A2-4B 22-A1 03-00>, "Attack of the Clones", 2, 0.6), (8-byte object <00-00 A2-4B 22-A1 03-00>, "Attack of the Clones", 2, 0.6), (8-byte object <00-A0 DD-EF 2D-F7 03-00>, "Revenge of the Sith", 3, 0.75), (8-byte object <00-A0 DD-EF 2D-F7 03-00>, "Revenge of the Sith", 3, 0.75), (8-byte object <00-24 77-C6 35-27 05-00>, "The Force Awakens", 7, 0.8), (8-byte object <00-24 F0-A2 6A-60 05-00>, "The Last Jedi", 8, 0.1), (8-byte object <00-24 77-C6 35-27 05-00>, "The Force Awakens", 7, 0.8), (8-byte object <00-C4 4C-50 2C-9A 05-00>, "The Rise of Skywalker", 9, 0.3), (8-byte object <00-24 F0-A2 6A-60 05-00>, "The Last Jedi", 8, 0.1), (8-byte object <00-C4 4C-50 2C-9A 05-00>, "The Rise of Skywalker", 9, 0.3) }, which has 18 elements
[  FAILED  ] TestSubspaceMsgMap.Basic (2 ms)

from subspace.

dallison avatar dallison commented on June 4, 2024

I've added a couple of things to help debug this:

  1. Run a thread in client_test instead of a process
  2. A regression test called Mikael in client_test that should mimic your test.
  3. Wrote 8 bytes to the notify_fd to allow an eventfd to be used on Linux.

Due to the server running as a thread, I also added a mutex in the debug memory maps and dealt with C++ undefined initialization order.

I've run my tests several thousand times and never seen a failure. Could you check out my test to see what is different with yours?

from subspace.

dallison avatar dallison commented on June 4, 2024

I'm wondering if this is due to the lack of locking on the std::map used to track mmap/munmap calls in channel.cc. I've switched that off with a macro.

from subspace.

mikael-s-persson avatar mikael-s-persson commented on June 4, 2024

Ummm it appears that it might only be observable under Clang. I've been using that on my end. I added a config for that, see #4.

When I run this command, I get a similar rate of failures, with similar duplicated publishing:

$ bazel test --runs_per_test=1000 -c opt --config=clang //client:client_test --test_arg=--gtest_filter=ClientTest.Mikael

from subspace.

mikael-s-persson avatar mikael-s-persson commented on June 4, 2024

In fact, I get a bunch of different failures on all the tests. Here is a cat of all the failures from one run with the command I just posted, but using "-c dbg" instead of opt:

exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //client:client_test
-----------------------------------------------------------------------------
Note: Google Test filter = ClientTest.Mikael
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
2023-06-28 01:38:31.176713556: F: Failed to create statistics channel: INTERNAL: No slot available for publisher
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //client:client_test
-----------------------------------------------------------------------------
Note: Google Test filter = ClientTest.Mikael
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
2023-06-28 01:38:31.176725343: F: Failed to create statistics channel: INTERNAL: No slot available for publisher
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //client:client_test
-----------------------------------------------------------------------------
Note: Google Test filter = ClientTest.Mikael
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
Starting Subspace server
[ RUN      ] ClientTest.Mikael
client/client_test.cc:1115: Failure
Value of: pub.ok()
  Actual: false
Expected: true
[  FAILED  ] ClientTest.Mikael (2 ms)
Stopping Subspace server
[----------] 1 test from ClientTest (2 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ClientTest.Mikael

 1 FAILED TEST
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //client:client_test
-----------------------------------------------------------------------------
Note: Google Test filter = ClientTest.Mikael
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
Starting Subspace server
[ RUN      ] ClientTest.Mikael
client/client_test.cc:1115: Failure
Value of: pub.ok()
  Actual: false
Expected: true
[  FAILED  ] ClientTest.Mikael (20 ms)
Stopping Subspace server
[----------] 1 test from ClientTest (20 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (41 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ClientTest.Mikael

 1 FAILED TEST
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //client:client_test
-----------------------------------------------------------------------------
Note: Google Test filter = ClientTest.Mikael
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
Starting Subspace server
[ RUN      ] ClientTest.Mikael
client/client_test.cc:1146: Failure
Expected equality of these values:
  sent_msgs[i]
    Which is: "foobar 0\0"
  received_msgs[i]
    Which is: "\0\0\0\0\0\0\0\0\0"
i = 0
client/client_test.cc:1146: Failure
Expected equality of these values:
  sent_msgs[i]
    Which is: "foobar 1\0"
  received_msgs[i]
    Which is: "\0\0\0\0\0\0\0\0\0"
i = 1
client/client_test.cc:1146: Failure
Expected equality of these values:
  sent_msgs[i]
    Which is: "foobar 2\0"
  received_msgs[i]
    Which is: "\0\0\0\0\0\0\0\0\0"
i = 2
client/client_test.cc:1146: Failure
Expected equality of these values:
  sent_msgs[i]
    Which is: "foobar 3\0"
  received_msgs[i]
    Which is: "\0\0\0\0\0\0\0\0\0"
i = 3
client/client_test.cc:1146: Failure
Expected equality of these values:
  sent_msgs[i]
    Which is: "foobar 4\0"
  received_msgs[i]
    Which is: "\0\0\0\0\0\0\0\0\0"
i = 4
client/client_test.cc:1146: Failure
Expected equality of these values:
  sent_msgs[i]
    Which is: "foobar 5\0"
  received_msgs[i]
    Which is: "\0\0\0\0\0\0\0\0\0"
i = 5
[  FAILED  ] ClientTest.Mikael (7 ms)
Stopping Subspace server
[----------] 1 test from ClientTest (7 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (9 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ClientTest.Mikael

 1 FAILED TEST
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //client:client_test
-----------------------------------------------------------------------------
Note: Google Test filter = ClientTest.Mikael
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
Starting Subspace server
[ RUN      ] ClientTest.Mikael
client/client_test.cc:1144: Failure
Expected equality of these values:
  sent_msgs.size()
    Which is: 6
  received_msgs.size()
    Which is: 0
-- Test timed out at 2023-06-28 01:39:41 UTC --
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //client:client_test
-----------------------------------------------------------------------------
Note: Google Test filter = ClientTest.Mikael
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
Starting Subspace server
[ RUN      ] ClientTest.Mikael
client/client_test.cc:1144: Failure
Expected equality of these values:
  sent_msgs.size()
    Which is: 6
  received_msgs.size()
    Which is: 0
-- Test timed out at 2023-06-28 01:39:41 UTC --

from subspace.

dallison avatar dallison commented on June 4, 2024

I've been using a Mac with clang. I've incorporated your changes and will try it on linux.

from subspace.

dallison avatar dallison commented on June 4, 2024

OK, I'm able to reproduce on Linux. Debugging. Hopefully will have a fix tomorrow. Thanks for the help.

from subspace.

mikael-s-persson avatar mikael-s-persson commented on June 4, 2024

OMG! That's good to know. I guess bazel test must be doing some funky forking without forking. I had a suspicion that shared-mem files were being, well, shared more than they were supposed to be.

from subspace.

dallison avatar dallison commented on June 4, 2024

Pretty dangerous if you ask me. BTW, I found this by putting calls to PrintLists() into the common/channel.cc functions and noticed that the active list already had something in it before the first call to publish. That's impossible, I said to myself, then realized it must be picking up something that already existed. I added 'if (pid == 12) abort()' in the CreateSharedMemory function (after a printf showed the shm_name had the number 12 in it) and sure enough, everything failed.

Please test this out and let me know if you see any other issues. Thanks for your patience.

from subspace.

mikael-s-persson avatar mikael-s-persson commented on June 4, 2024

Awesome! Thanks Dave!

Works on a 10,000 runs:

INFO: Build completed successfully, 10001 total actions
//msg_map:subspace_source_test                                           PASSED in 0.3s
  Stats over 10000 runs: max = 0.3s, min = 0.1s, avg = 0.1s, dev = 0.0s

Executed 1 out of 1 test: 1 test passes.

from subspace.

mikael-s-persson avatar mikael-s-persson commented on June 4, 2024

I guess it would be a good time for a new release. ;)

from subspace.

dallison avatar dallison commented on June 4, 2024

Thanks Mikael, I'll release a new version now.

from subspace.

dallison avatar dallison commented on June 4, 2024

I figured out how Bazel runs tests. It uses a program it builds, called linux-sandbox that is basically a changeroot with a new PID namespace. That means that all process ids in the tests start at 1 and increment from there. However, it appears that the /dev/shm directory isn't per namespace so tests can clash. That's seems like a bug to me.

from subspace.

Related Issues (6)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.