
homamodule's Introduction

This repo contains an implementation of the Homa transport protocol as a Linux kernel module.

  • For more information on Homa in general, see the Homa Wiki.

  • More information about this implementation and its performance is available in the paper A Linux Kernel Implementation of the Homa Transport Protocol, which appeared at the USENIX Annual Technical Conference in July 2021.

  • A synopsis of the protocol implemented by this module is available in protocol.md.

  • As of August 2020, Homa has complete functionality for running real applications, and its tail latency is more than 10x better than TCP's for all workloads I have measured (Homa's 99th-percentile latency is usually better than TCP's mean latency). Here is a list of the most significant functionality that is still missing:

    • The incast optimization from Section 3.6 of the SIGCOMM paper has not been implemented yet. If you would like to test Homa under large incasts, let me know and I will implement this feature.
    • Socket buffer memory management needs more work. Large numbers of large messages (hundreds of MB?) may cause buffer exhaustion and deadlock.
  • Please contact me if you have any problems using this repo; I'm happy to provide advice and support.

  • The head is known to work under Linux 6.1.38. In the past, Homa has run under several earlier versions of Linux, including 5.17.7, 5.4.80, and 4.15.18. There is a separate branch for each of these older versions, with names such as linux_4.15.18. Older branches are out of date feature-wise: recent commits have not been back-ported to them. Other versions of Linux have not been tested and may require code changes (these upgrades rarely take more than a couple of hours). If you get Homa working on some other version, please submit a pull request with the required code changes.

  • Related work that you may find useful:

  • To build the module, type make all; then type sudo insmod homa.ko to install it, and sudo rmmod homa to remove an installed module. In practice, though, you'll probably want to do several other things as part of installing Homa. I have created a Python script that I use for installing Homa on clusters managed by the CloudLab project; it's in cloudlab/bin/config. I normally invoke it with no parameters to install and configure Homa on the current machine.
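
    For reference, here is the minimal build/install sequence described above, run from the top-level HomaModule directory (the cloudlab/bin/config script automates this plus the other configuration steps):

      make all                # build homa.ko
      sudo insmod homa.ko     # install the module
      sudo rmmod homa         # remove an installed module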

  • The script cloudlab/bin/install will copy relevant Homa files across a cluster of machines and configure Homa on each node. It assumes that nodes have names nodeN where N is a small integer, and it also assumes that you have already run make both in the top-level directory and in util.

  • For best Homa performance, you should also make the following configuration changes:

    • Enable priority queues in your switches, selected by the 3 high-order bits of the DSCP field in IPv4 packet headers or the 4 high-order bits of the Traffic Class field in IPv6 headers. You can use sysctl to configure Homa's use of priorities (e.g., if you want it to use fewer than 8 levels); a sample command appears after this list. See the man page homa.7 for more info.
    • Enable jumbo frames on your switches and on the Linux nodes.
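
    For example, a minimal sketch of the priority configuration mentioned above (num_priorities is the parameter name that appears in the configuration dumps later on this page; the value 4 is only an illustration):

      sudo sysctl net.homa.num_priorities=4    # use only 4 priority levels
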
  • NIC support for TSO: Homa can use TCP Segmentation Offload (TSO) in order to send large messages more efficiently. To do this, it uses a header format that matches TCP's headers closely enough to take advantage of TSO support in NICs. It is not clear that this approach will work with all NICs, but the following NICs are known to work:

    • Mellanox ConnectX-4, ConnectX-5, and ConnectX-6

    There have been reports of problems with the following NICs (these have not yet been explored thoroughly enough to know whether the problems are insurmountable):

    • Intel E810 (ice), XXV710 (i40e), XL710

    Please let me know if you find other NICs that work (or NICs that don't work). If the NIC doesn't support TSO for Homa, then you can request that Homa perform segmentation in software by setting the gso_force_software parameter to a nonzero value using sysctl. Unfortunately, software segmentation is inefficient because it has to copy the packet data. Alternatively, you can ensure that the max_gso_size parameter is the same as the maximum packet size, which eliminates GSO in any form. This is also inefficient because it requires more packets to traverse the Linux networking stack.
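
    For example, a sketch of the sysctl commands described above (the parameter names gso_force_software and max_gso_size come from the text; the values shown are illustrative only, and the max_gso_size value should match your actual maximum packet size):

      sudo sysctl net.homa.gso_force_software=1    # segment in software if the NIC's TSO doesn't work for Homa
      sudo sysctl net.homa.max_gso_size=1500       # or eliminate GSO entirely (use your real maximum packet size)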

  • A collection of man pages is available in the "man" subdirectory. The API for Homa is different from that of TCP sockets.

  • The subdirectory "test" contains unit tests, which you can run by typing "make" in that subdirectory.

  • The subdirectory "util" contains an assortment of utility programs that you may find useful in exercising and benchmarking Homa. Compile them by typing make in that subdirectory. Here are some examples of benchmarks you might find useful:

    • The cp_node program can be run stand-alone on clients and servers to run simple benchmarks. For a simple latency test, run cp_node server on node1 of the cluster, then run cp_node client on node0. The client will send continuous back-to-back short requests to the server and output timing information. Or, run cp_node client --workload 500000 on the client: this will send continuous 500 KB messages for a simple throughput test. Type cp_node --help to learn about other ways you can use this program (sample commands appear after this list).
    • The cp_vs_tcp script uses cp_node to run cluster-wide tests comparing Homa with TCP (and/or DCTCP); it was used to generate the data for Figures 3 and 4 in the Homa ATC paper. Here is an example command:
      cp_vs_tcp -n 10 -w w4 -b 20
      
      When invoked on node0, this will run a benchmark using the W4 workload from the ATC paper, running on 10 nodes and generating 20 Gbps of offered load (80% network load on a 25 Gbps network). Type cp_vs_tcp --help for information on all available options.
    • Other cp_ scripts can be used for different benchmarks. See util/README.md for more information.
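
    For reference, here are the cp_node commands described above (run the server on node1 and the clients on node0):

      cp_node server                      # node1: accept and answer requests
      cp_node client                      # node0: continuous short requests (latency test)
      cp_node client --workload 500000    # node0: continuous 500 KB messages (throughput test)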
  • Some additional tools you might find useful:

    • Homa collects various metrics about its behavior, such as the size distribution of incoming messages. You can access these through the file /proc/net/homa_metrics. The script util/metrics.py will collect metrics and print out all the numbers that have changed since its last run.
    • Homa exports a collection of configuration parameters through the sysctl mechanism. For details, see the man page homa.7.
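
    For example, to inspect the metrics and configuration parameters described above (this assumes the net.homa sysctl prefix used elsewhere on this page):

      cat /proc/net/homa_metrics     # raw metric counters
      util/metrics.py                # metrics that changed since the last run
      sysctl -a | grep net.homa      # list Homa's configuration parameters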

Significant recent improvements

  • June 2024: refactored sk_buff management to use frags; improves efficiency significantly.
  • April 2024: replaced master branch with main
  • December 2022: Version 2.0. This includes a new mechanism for managing buffer space for incoming messages, which improves throughput by 50-100% in many situations. In addition, Homa now uses the sendmsg and recvmsg system calls, rather than ioctls, for sending and receiving messages. The API for receiving messages is incompatible with 1.01.
  • November 2022: implemented software GSO for Homa.
  • September 2022: added support for IPv6, as well as completion cookies. This required small but incompatible changes to the API. Many thanks to Dan Manjarres for contributing these improvements.
  • September 2022: Homa now works on Linux 5.18 as well as 5.17.7
  • June 2022: upgraded to Linux 5.17.7.
  • November 2021: changed semantics to at-most-once (servers can no longer see multiple instances of the same RPC).
  • August 2021: added new versions of the Homa system calls that support iovecs; in addition, incoming messages can be read incrementally across several homa_recv calls.
  • November 2020: upgraded to Linux 5.4.3.
  • June 2020: implemented busy-waiting during homa_recv: shaves 2 microseconds off latency.
  • June 2020: several fixes to prevent RPCs from getting "stuck", where they never make progress.
  • May 2020: got priorities working correctly using the DSCP field of IP headers.
  • December 2019: first versions of cperf ("cluster performance") benchmark.
  • December 2019 - June 2020: many improvements to the GRO mechanism, including better hashing and batching across RPCs; improves both throughput and latency.
  • Fall 2019: many improvements to pacer, spread over a couple of months.
  • November 6, 2019: reworked locking to use RPC-level locks instead of socket locks for most things (significantly reduces socket lock contention). Many more refinements to this in subsequent commits.
  • September 25, 2019: reworked timeout mechanism to eliminate over-hasty timeouts. Also, limit the rate at which RESENDs will be sent to an overloaded machine.
  • August 1, 2019: GSO and GRO are now working.
  • March 13, 2019: added support for the shutdown kernel call, plus poll, select, and epoll. Homa now connects with all of the essential Linux plumbing.
  • March 11, 2019: extended homa_recv API with new arguments: flags, id.
  • February 16, 2019: added manual entries in the subdirectory "man".
  • February 14, 2019: output queue throttling now seems to work (i.e., senders implement SRPT properly).
  • November 6, 2018: timers and packet retransmission now work.

homamodule's People

Contributors

alexisbarraza3293, amjal, breakertt, danmanj, joft-mle, johnousterhout, pedrogzz18, pmo73, sarsanaee, shanemhansen, wu-cl


homamodule's Issues

How does Homa perform against TCP_NODELAY sockets?

Hi @johnousterhout, I just read your ATC '21 Homa paper and found that this is a really interesting project! However, I have some questions. The paper says that Homa can outperform TCP by a huge margin for small payloads. Is the TCP being compared there just plain TCP connections, or TCP connections configured with the TCP_NODELAY option, which can reduce latency a lot? If not NODELAY sockets, how does Homa perform compared against TCP_NODELAY sockets? Thanks!

Build error: Implicit declaration of function `kthread_complete_and_exit`

I am trying to build the Homa module on CloudLab, but was not able to proceed beyond the error Implicit declaration of function 'kthread_complete_and_exit' as shown in the picture below. Does anyone know how I can resolve this issue? Thank you!

[screenshot: build error output]

I tried mocking the declaration in homa_impl.h below where it is defined with the line extern void kthread_complete_and_exit(struct completion *comp, long code);, but I get the same error as shown above.

I then tried to add the mocking line to homa_outgoing.c, just below where homa_impl.h is included, and to homa_plumbing.c (because the kthread_complete_and_exit function is used in these two files), but I get the error shown below.

[screenshot: error after adding the declaration]

How do I resolve the original error and/or the error after mocking?

Machine specs:

  • OS: Ubuntu 22.04.2 LTS (jammy)
  • Linux kernel: Linux 5.15.0-70-generic x86_64
  • Testbed: CloudLab's xl170 machines

Thank you!

Allow sending zero length message

A zero-length message might sound like an implausible idea, but since Homa is built around the notion of request/response (RPC) style communication, a zero-length message serves perfectly as an ACK.

cp_node should not pre-compute a pseudo-random pattern for the client requests

We shouldn't be pre-computing the information about the client requests (servers, lengths of the requests and intervals of time between requests) in the client constructor inside the cp_node.cc file.

Instead, we could compute these details on the fly inside the sender methods for both homa_client and tcp_client using the random generators, as this will not significantly impact the performance of the tests and it will make Homa's benchmark more realistic.

Napi and SoftIRQ CPU Metrics

Hello, I am currently preparing my master's thesis and would like to analyze the pacer in more detail. My idea was to use a smart NIC to offload the functionality to hardware, for both tx and rx, to reduce CPU utilization and hopefully reduce tail latency even further. In the course of this, I took a closer look at the CPU metrics and saw that the code states that the usage for NAPI and SoftIRQ can only be seen with a modified kernel. Could you tell me what needs to be modified or provide me with a patch, as these measurements would be interesting for the overall CPU utilization?

Wireshark dissector

Hi, we have implemented a first version of a Wireshark dissector as a plugin, which attaches to the IP protocol on the same protocol type as in the kernel module. The dissector then displays the recognized header fields in the GUI and shows the rest as payload. We would like to contribute this first version of the dissector so that it can be used with the Homa kernel module. The question is: should it come via a pull request into the kernel module repository, or as a separate repository next to the existing repos?

Homa's Maximum Message Length Modification

The maximum length of a message in Homa is set by the HOMA_MAX_MESSAGE_LENGTH macro defined in homa.h. In previous builds of the module, particularly for kernel version 5.17.7, I could just change the macro value, and it worked with larger messages.

With the newest build, however, I get an assertion error from these lines:

#if !defined(__cplusplus)
_Static_assert(sizeof(struct homa_recvmsg_args) >= 96,
        "homa_recvmsg_args shrunk");
_Static_assert(sizeof(struct homa_recvmsg_args) <= 96,
        "homa_recvmsg_args grew");
#endif

I would appreciate it if you could explain why this limit is there in the first place. Is it an implementation limitation, or is it just there for performance optimization? What is a safe way of modifying it? If these questions are answered somewhere in the documentation, please just direct me towards it. Thanks.

The message length problem in socket

We built and ran the HomaModule successfully. However, there is a sending error when we enlarge the message length (exceeding the MTU) by modifying MSGLEN. The return value of homa_send() is -1 and errno is 14. Is there anything wrong with the way we set the message length?
[screenshot: WechatIMG715 error output]

Build error: `cpu_khz` undeclared

I am trying to build the Homa module on an Ubuntu machine on CloudLab, but am facing an error ('cpu_khz undeclared') as shown in the image below. Does anyone know how I can resolve this? Thank you!

[screenshot: build error output]

Machine specs:

  • OS: Ubuntu 22.04.1 LTS
    • Codename: jammy
  • Kernel: Linux 5.15.0-69-generic aarch64
    • This is the output of the command uname -srm

Cc: @KartikSoneji

compilation failed with Fedora38

It works on Fedora 37 with kernel 6.4.15-100, but fails on Fedora 38 with kernel 6.5.5-200:

[mikehuang@fedora HomaModule]$ make
make -C /lib/modules/6.5.5-200.fc38.x86_64/build M=/home/mikehuang/Downloads/HomaModule modules
make[1]: Entering directory '/usr/src/kernels/6.5.5-200.fc38.x86_64'
  CC [M]  /home/mikehuang/Downloads/HomaModule/homa_incoming.o
  CC [M]  /home/mikehuang/Downloads/HomaModule/homa_offload.o
  CC [M]  /home/mikehuang/Downloads/HomaModule/homa_outgoing.o
  CC [M]  /home/mikehuang/Downloads/HomaModule/homa_peertab.o
  CC [M]  /home/mikehuang/Downloads/HomaModule/homa_pool.o
  CC [M]  /home/mikehuang/Downloads/HomaModule/homa_plumbing.o
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:79:10: error: ‘const struct proto_ops’ has no member named ‘sendpage’
   79 |         .sendpage          = sock_no_sendpage,
      |          ^~~~~~~~
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:79:30: error: ‘sock_no_sendpage’ undeclared here (not in a function); did you mean ‘sock_no_sendmsg’?
   79 |         .sendpage          = sock_no_sendpage,
      |                              ^~~~~~~~~~~~~~~~
      |                              sock_no_sendmsg
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:101:10: error: ‘const struct proto_ops’ has no member named ‘sendpage’
  101 |         .sendpage          = sock_no_sendpage,
      |          ^~~~~~~~
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:117:30: error: initialization of ‘int (*)(struct sock *, int,  int *)’ from incompatible pointer type ‘int (*)(struct sock *, int,  long unsigned int)’ [-Werror=incompatible-pointer-types]
  117 |         .ioctl             = homa_ioctl,
      |                              ^~~~~~~~~~
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:117:30: note: (near initialization for ‘homa_prot.ioctl’)
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:124:10: error: ‘struct proto’ has no member named ‘sendpage’
  124 |         .sendpage          = homa_sendpage,
      |          ^~~~~~~~
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:124:30: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init]
  124 |         .sendpage          = homa_sendpage,
      |                              ^~~~~~~~~~~~~
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:124:30: note: (near initialization for ‘homa_prot’)
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:124:30: error: initialization of ‘void (*)(struct socket *)’ from incompatible pointer type ‘int (*)(struct sock *, struct page *, int,  size_t,  int)’ {aka ‘int (*)(struct sock *, struct page *, int,  long unsigned int,  int)’} [-Werror=incompatible-pointer-types]
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:124:30: note: (near initialization for ‘homa_prot.splice_eof’)
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:145:30: error: initialization of ‘int (*)(struct sock *, int,  int *)’ from incompatible pointer type ‘int (*)(struct sock *, int,  long unsigned int)’ [-Werror=incompatible-pointer-types]
  145 |         .ioctl             = homa_ioctl,
      |                              ^~~~~~~~~~
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:145:30: note: (near initialization for ‘homav6_prot.ioctl’)
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:152:10: error: ‘struct proto’ has no member named ‘sendpage’
  152 |         .sendpage          = homa_sendpage,
      |          ^~~~~~~~
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:152:30: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init]
  152 |         .sendpage          = homa_sendpage,
      |                              ^~~~~~~~~~~~~
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:152:30: note: (near initialization for ‘homav6_prot’)
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:152:30: error: initialization of ‘void (*)(struct socket *)’ from incompatible pointer type ‘int (*)(struct sock *, struct page *, int,  size_t,  int)’ {aka ‘int (*)(struct sock *, struct page *, int,  long unsigned int,  int)’} [-Werror=incompatible-pointer-types]
/home/mikehuang/Downloads/HomaModule/homa_plumbing.c:152:30: note: (near initialization for ‘homav6_prot.splice_eof’)
cc1: some warnings being treated as errors
make[3]: *** [scripts/Makefile.build:243: /home/mikehuang/Downloads/HomaModule/homa_plumbing.o] Error 1
make[2]: *** [/usr/src/kernels/6.5.5-200.fc38.x86_64/Makefile:2046: /home/mikehuang/Downloads/HomaModule] Error 2
make[1]: *** [Makefile:246: __sub-make] Error 2
make[1]: Leaving directory '/usr/src/kernels/6.5.5-200.fc38.x86_64'
make: *** [Makefile:22: all] Error 2

Homa throughput and RTT is not as good as TCP?

Hi,

I have two Debian servers running v5.10.70. With trivial changes I built Homa (the linux_5.4.80 branch) successfully.
Running a few simple tests, I'm a bit surprised that Homa's performance is not as good as TCP's, in either throughput or RTT:

Server side command: ./cp_node server --protocol homa/tcp
Client side command:
./cp_node client --ports 1 --protocol homa --workload 1000
1659663158.268025572 Clients: 8.48 Kops/sec, 0.07 Gbps, RTT (us) P50 121.11 P99 123.87 P99.9 183.79, avg. length 1000.0 bytes
./cp_node client --ports 1 --protocol tcp --workload 1000
1659663180.842814970 Clients: 9.99 Kops/sec, 0.08 Gbps, RTT (us) P50 104.64 P99 107.49 P99.9 184.44, avg. length 1000.0 bytes

./cp_node client --ports 1 --protocol homa --workload 1000 --client-max 100
1659663463.048425659 Clients: 70.92 Kops/sec, 0.57 Gbps, RTT (us) P50 861.38 P99 12590.59 P99.9 12851.42, avg. length 1000.0 bytes
./cp_node client --ports 1 --protocol tcp --workload 1000 --client-max 100
1659663434.857581058 Clients: 115.93 Kops/sec, 0.93 Gbps, RTT (us) P50 838.62 P99 1004.78 P99.9 1036.63, avg. length 1000.0 bytes

I'm using a 1 Gbps Ethernet port and there is a switch between the two servers. QoS is not configured on the switch side.

Also I tried the "lo" interface:
./cp_node client --ports 1 --protocol homa --workload 100
1659660550.276439665 Clients: 9.40 Kops/sec, 0.01 Gbps, RTT (us) P50 22.92 P99 28.06 P99.9 10352.63, avg. length 100.0 bytes
./cp_node client --ports 1 --protocol homa --workload 100
1659660616.124633889 Clients: 51.79 Kops/sec, 0.04 Gbps, RTT (us) P50 16.88 P99 18.23 P99.9 24.29, avg. length 100.0 bytes

Just wondering, did I do the test wrongly? Or is Homa designed to be better in specific data-center scenarios that I'm not exercising here? I was expecting better performance from Homa.

I'd appreciate it if you could share your thoughts on this.
Thanks.

Poor Homa performance on 40Gb testbed

Hi,

I have been able to reproduce cp_load benchmark results on Cloudlab's XL170 cluster with two servers. Even without jumbo frames, Homa performs well and everything seems to be working.

However, deploying the Homa module on my own testbed, I'm having trouble getting good performance out of it.
My testbed differs from the CloudLab setup a bit, though.
The NICs are Intel XL710 40 Gbps, and I'm using a Tofino switch with default traffic-management configuration. My kernel version was 5.4.0 both in the CloudLab tests and on my testbed.

Here's a sample output I'm getting from cp_node benchmark on my testbed:

% client --ports 3 --port-receivers 3 --server-ports 3 --workload w1  --server-nodes 1 --first-server 1 --gbps 10 --client-max 1 --protocol homa
1651509672.377323601 Average message length 0.2 KB (expected 0.2KB), rate 2171.30 K/sec, expected BW 3.4 Gbps
1651509672.384952516 Average message length 0.2 KB (expected 0.2KB), rate 2131.21 K/sec, expected BW 3.3 Gbps
1651509672.393831923 Average message length 0.2 KB (expected 0.2KB), rate 2167.65 K/sec, expected BW 3.5 Gbps
% 1651509672.783422250 Outstanding client RPCs: 3
1651509673.783635619 Clients: 1.47 Kops/sec, 0.00 Gbps, RTT (us) P50 22.91 P99 92447.70 P99.9 93063.08, avg. length 202.9 bytes
1651509673.783650504 Lag due to overload: 97.4%
1651509673.783653186 Outstanding client RPCs: 3
1651509674.783969683 Clients: 1.26 Kops/sec, 0.00 Gbps, RTT (us) P50 22.96 P99 92522.27 P99.9 93227.96, avg. length 200.8 bytes
1651509674.783982950 Lag due to overload: 99.6%
1651509674.783985663 Outstanding client RPCs: 3
1651509675.784360574 Clients: 1.49 Kops/sec, 0.00 Gbps, RTT (us) P50 22.89 P99 92589.05 P99.9 93228.28, avg. length 207.6 bytes
1651509675.784374451 Lag due to overload: 102.6%
1651509675.784377258 Outstanding client RPCs: 3

Here's the metrics dump:

rdtsc_cycles                 537299507584440 (256142.6 s) RDTSC cycle counter when metrics were gathered
clock_rate                              2.10              CPU clock rate (GHz)
msg_bytes_64                         1325650 (  5.2  /s)  Bytes in incoming messages containing 0-64 bytes
msg_bytes_128                        1290967 (  5.0  /s)  Bytes in incoming messages containing 65-128 bytes
msg_bytes_192                        1570408 (  6.1  /s)  Bytes in incoming messages containing 129-192 bytes
msg_bytes_256                        1468035 (  5.7  /s)  Bytes in incoming messages containing 193-256 bytes
msg_bytes_320                        1600099 (  6.2  /s)  Bytes in incoming messages containing 257-320 bytes
msg_bytes_384                        1397238 (  5.5  /s)  Bytes in incoming messages containing 321-384 bytes
msg_bytes_448                        1496449 (  5.8  /s)  Bytes in incoming messages containing 385-448 bytes
msg_bytes_512                        1029311 (  4.0  /s)  Bytes in incoming messages containing 449-512 bytes
msg_bytes_576                        1052136 (  4.1  /s)  Bytes in incoming messages containing 513-576 bytes
msg_bytes_640                        1003884 (  3.9  /s)  Bytes in incoming messages containing 577-640 bytes
msg_bytes_704                         837983 (  3.3  /s)  Bytes in incoming messages containing 641-704 bytes
msg_bytes_768                         837325 (  3.3  /s)  Bytes in incoming messages containing 705-768 bytes
msg_bytes_832                         819130 (  3.2  /s)  Bytes in incoming messages containing 769-832 bytes
msg_bytes_896                         504275 (  2.0  /s)  Bytes in incoming messages containing 833-896 bytes
msg_bytes_960                         545910 (  2.1  /s)  Bytes in incoming messages containing 897-960 bytes
msg_bytes_1024                        551174 (  2.2  /s)  Bytes in incoming messages containing 961-1024 bytes
msg_bytes_1088                        495804 (  1.9  /s)  Bytes in incoming messages containing 1025-1088 bytes
msg_bytes_1152                        478656 (  1.9  /s)  Bytes in incoming messages containing 1089-1152 bytes
msg_bytes_1216                        455900 (  1.8  /s)  Bytes in incoming messages containing 1153-1216 bytes
msg_bytes_1280                        483448 (  1.9  /s)  Bytes in incoming messages containing 1217-1280 bytes
msg_bytes_1408                        708560 (  2.8  /s)  Bytes in incoming messages containing 1345-1408 bytes
msg_bytes_1472                         12870 (  0.1  /s)  Bytes in incoming messages containing 1409-1472 bytes
msg_bytes_1536                        598455 (  2.3  /s)  Bytes in incoming messages containing 1473-1536 bytes
msg_bytes_1664                        549519 (  2.1  /s)  Bytes in incoming messages containing 1601-1664 bytes
msg_bytes_1856                        796214 (  3.1  /s)  Bytes in incoming messages containing 1793-1856 bytes
msg_bytes_2112                        951159 (  3.7  /s)  Bytes in incoming messages containing 2049-2112 bytes
msg_bytes_2624                        931680 (  3.6  /s)  Bytes in incoming messages containing 2561-2624 bytes
msg_bytes_2880                         48620 (  0.2  /s)  Bytes in incoming messages containing 2817-2880 bytes
msg_bytes_3200                        911512 (  3.6  /s)  Bytes in incoming messages containing 3137-3200 bytes
msg_bytes_3904                        615726 (  2.4  /s)  Bytes in incoming messages containing 3841-3904 bytes
msg_bytes_5120                        432702 (  1.7  /s)  Bytes in incoming messages containing 4097-5120 bytes
msg_bytes_6144                        286496 (  1.1  /s)  Bytes in incoming messages containing 5121-6144 bytes
msg_bytes_7168                         57200 (  0.2  /s)  Bytes in incoming messages containing 6145-7168 bytes
msg_bytes_8192                        309643 (  1.2  /s)  Bytes in incoming messages containing 7169-8192 bytes
msg_bytes_9216                        299400 (  1.2  /s)  Bytes in incoming messages containing 8193-9216 bytes
msg_bytes_10240                        20020 (  0.1  /s)  Bytes in incoming messages containing 9217-10240 bytes
msg_bytes_11264                       151802 (  0.6  /s)  Bytes in incoming messages containing 10241-11264 bytes
msg_bytes_12288                        11440 (  0.0  /s)  Bytes in incoming messages containing 11265-12288 bytes
msg_bytes_13312                        92985 (  0.4  /s)  Bytes in incoming messages containing 12289-13312 bytes
msg_bytes_15360                        44100 (  0.2  /s)  Bytes in incoming messages containing 14337-15360 bytes
msg_bytes_16384                       163104 (  0.6  /s)  Bytes in incoming messages containing 15361-16384 bytes
msg_bytes_19456                        57330 (  0.2  /s)  Bytes in incoming messages containing 18433-19456 bytes
msg_bytes_21504                       205800 (  0.8  /s)  Bytes in incoming messages containing 20481-21504 bytes
msg_bytes_22528                       132300 (  0.5  /s)  Bytes in incoming messages containing 21505-22528 bytes
msg_bytes_23552                       164640 (  0.6  /s)  Bytes in incoming messages containing 22529-23552 bytes
msg_bytes_25600                       124542 (  0.5  /s)  Bytes in incoming messages containing 24577-25600 bytes
msg_bytes_26624                       238140 (  0.9  /s)  Bytes in incoming messages containing 25601-26624 bytes
msg_bytes_28672                        27930 (  0.1  /s)  Bytes in incoming messages containing 27649-28672 bytes
msg_bytes_29696                       176400 (  0.7  /s)  Bytes in incoming messages containing 28673-29696 bytes
msg_bytes_30720                        60328 (  0.2  /s)  Bytes in incoming messages containing 29697-30720 bytes
msg_bytes_31744                        61740 (  0.2  /s)  Bytes in incoming messages containing 30721-31744 bytes
msg_bytes_32768                        32340 (  0.1  /s)  Bytes in incoming messages containing 31745-32768 bytes
msg_bytes_34816                       135240 (  0.5  /s)  Bytes in incoming messages containing 33793-34816 bytes
msg_bytes_35840                        35280 (  0.1  /s)  Bytes in incoming messages containing 34817-35840 bytes
msg_bytes_36864                       147000 (  0.6  /s)  Bytes in incoming messages containing 35841-36864 bytes
msg_bytes_38912                       114660 (  0.4  /s)  Bytes in incoming messages containing 37889-38912 bytes
msg_bytes_39936                       158760 (  0.6  /s)  Bytes in incoming messages containing 38913-39936 bytes
msg_bytes_41984                       123480 (  0.5  /s)  Bytes in incoming messages containing 40961-41984 bytes
msg_bytes_43008                        42630 (  0.2  /s)  Bytes in incoming messages containing 41985-43008 bytes
msg_bytes_45056                        44100 (  0.2  /s)  Bytes in incoming messages containing 44033-45056 bytes
msg_bytes_46080                        91140 (  0.4  /s)  Bytes in incoming messages containing 45057-46080 bytes
msg_bytes_47104                       188160 (  0.7  /s)  Bytes in incoming messages containing 46081-47104 bytes
msg_bytes_49152                       145530 (  0.6  /s)  Bytes in incoming messages containing 48129-49152 bytes
msg_bytes_50176                        99960 (  0.4  /s)  Bytes in incoming messages containing 49153-50176 bytes
msg_bytes_53248                        52920 (  0.2  /s)  Bytes in incoming messages containing 52225-53248 bytes
msg_bytes_56320                       279300 (  1.1  /s)  Bytes in incoming messages containing 55297-56320 bytes
msg_bytes_57344                       114660 (  0.4  /s)  Bytes in incoming messages containing 56321-57344 bytes
msg_bytes_59392                       176400 (  0.7  /s)  Bytes in incoming messages containing 58369-59392 bytes
msg_bytes_60416                        60270 (  0.2  /s)  Bytes in incoming messages containing 59393-60416 bytes
msg_bytes_62464                       123480 (  0.5  /s)  Bytes in incoming messages containing 61441-62464 bytes
msg_bytes_63488                       189630 (  0.7  /s)  Bytes in incoming messages containing 62465-63488 bytes
msg_bytes_65536                       129360 (  0.5  /s)  Bytes in incoming messages containing 64513-65536 bytes
msg_bytes_66560                        66150 (  0.3  /s)  Bytes in incoming messages containing 65537-66560 bytes
msg_bytes_68608                       270480 (  1.1  /s)  Bytes in incoming messages containing 67585-68608 bytes
msg_bytes_69632                       276360 (  1.1  /s)  Bytes in incoming messages containing 68609-69632 bytes
msg_bytes_70656                       211680 (  0.8  /s)  Bytes in incoming messages containing 69633-70656 bytes
msg_bytes_72704                       144060 (  0.6  /s)  Bytes in incoming messages containing 71681-72704 bytes
msg_bytes_73728                       147000 (  0.6  /s)  Bytes in incoming messages containing 72705-73728 bytes
msg_bytes_75776                       149940 (  0.6  /s)  Bytes in incoming messages containing 74753-75776 bytes
msg_bytes_76800                       305760 (  1.2  /s)  Bytes in incoming messages containing 75777-76800 bytes
msg_bytes_78848                        77910 (  0.3  /s)  Bytes in incoming messages containing 77825-78848 bytes
msg_bytes_82944                        82320 (  0.3  /s)  Bytes in incoming messages containing 81921-82944 bytes
msg_bytes_92160                        91140 (  0.4  /s)  Bytes in incoming messages containing 91137-92160 bytes
msg_bytes_104448                      104370 (  0.4  /s)  Bytes in incoming messages containing 103425-104448 bytes
msg_bytes_109568                      217560 (  0.8  /s)  Bytes in incoming messages containing 108545-109568 bytes
msg_bytes_113664                      113190 (  0.4  /s)  Bytes in incoming messages containing 112641-113664 bytes
msg_bytes_117760                      117600 (  0.5  /s)  Bytes in incoming messages containing 116737-117760 bytes
msg_bytes_126976                      126420 (  0.5  /s)  Bytes in incoming messages containing 125953-126976 bytes
large_msg_count                          150 (  0.0  /s)  # of incoming messages >= 131073 bytes
large_msg_bytes                    119719250 (467.4  /s)  Bytes in incoming messages >= 131073 bytes
received_msg_bytes                 153191629 (598.1  /s)  Total bytes in all incoming messages
sent_msg_bytes                     153191961 (598.1  /s)  Total bytes in all outgoing messages
packets_sent_DATA                     248097 (  1.0  /s)  DATA packets sent
packets_sent_GRANT                      5245 (  0.0  /s)  GRANT packets sent
packets_sent_RESEND                    12438 (  0.0  /s)  RESEND packets sent
packets_sent_UNKNOWN                      27 (  0.0  /s)  UNKNOWN packets sent
packets_sent_BUSY                       8781 (  0.0  /s)  BUSY packets sent
packets_sent_CUTOFFS                       1 (  0.0  /s)  CUTOFFS packets sent
packets_sent_ACK                        3118 (  0.0  /s)  ACK packets sent
packets_rcvd_DATA                     230399 (  0.9  /s)  DATA packets received
packets_rcvd_GRANT                      3323 (  0.0  /s)  GRANT packets received
packets_rcvd_RESEND                    10837 (  0.0  /s)  RESEND packets received
packets_rcvd_UNKNOWN                    2961 (  0.0  /s)  UNKNOWN packets received
packets_rcvd_BUSY                       1469 (  0.0  /s)  BUSY packets received
packets_rcvd_CUTOFFS                       1 (  0.0  /s)  CUTOFFS packets received
packets_rcvd_NEED_ACK                 167753 (  0.7  /s)  NEED_ACK packets received
priority0_bytes                    112400186 (438.8  /s)  Bytes transmitted at priority 0
priority4_bytes                     33368336 (130.3  /s)  Bytes transmitted at priority 4
priority5_bytes                      6609782 ( 25.8  /s)  Bytes transmitted at priority 5
priority6_bytes                     27081009 (105.7  /s)  Bytes transmitted at priority 6
priority7_bytes                    134029560 (523.3  /s)  Bytes transmitted at priority 7
priority0_packets                      12999 (  0.1  /s)  Packets transmitted at priority 0
priority4_packets                      13112 (  0.1  /s)  Packets transmitted at priority 4
priority5_packets                       3315 (  0.0  /s)  Packets transmitted at priority 5
priority6_packets                      38222 (  0.1  /s)  Packets transmitted at priority 6
priority7_packets                     210059 (  0.8  /s)  Packets transmitted at priority 7
responses_received                    137653 (  0.5  /s)  Incoming response messages
fast_wakeups                          129873 (  0.5  /s)  Messages received while polling
slow_wakeups                            7843 (  0.0  /s)  Messages received after thread went to sleep
poll_cycles                       6572851972 (0.0%)       Time spent polling for incoming messages
softirq_calls                         316552 (  1.2  /s)  Calls to homa_softirq (i.e. # GRO pkts received)
softirq_cycles                     914663980 (0.0%)       Time spent in homa_softirq
send_cycles                        841951466 (0.0%)       Time spent in homa_ioc_send kernel call
send_calls                            137656 (  0.5  /s)  Total invocations of send kernel call
recv_cycles                       7110916433 (0.0%)       Time spent in homa_ioc_recv kernel call
recv_calls                            137725 (  0.5  /s)  Total invocations of recv kernel call
blocked_cycles                 4717570905561 (0.9%)       Time spent blocked in homa_ioc_recv
grant_cycles                        54717084 (0.0%)       Time spent sending grants
user_cycles                     801185038993 (0.1%)       App. time outside Homa kernel call handler
timer_cycles                   2325643069515 (0.4%)       Time spent in homa_timer
pacer_cycles                        56038632 (0.0%)       Time spent in homa_pacer_main
homa_cycles                    2334566640026 (0.4%)       Total time in all Homa-related functions
pacer_lost_cycles                  126995770 (0.0%)       Lost transmission time because pacer was slow
pacer_bytes                         81712130 (319.0  /s)  Bytes transmitted when the pacer was active
pacer_skipped_rpcs                        53 (  0.0  /s)  Pacer aborts because of locked RPCs
pacer_needed_help                        612 (  0.0  /s)  homa_pacer_xmit invocations from homa_check_pacer
throttled_cycles                   187408446 (0.0%)       Time when the throttled queue was nonempty
resent_packets                         95720 (  0.4  /s)  DATA packets sent in response to RESENDs
peer_new_entries                           1 (  0.0  /s)  New entries created in peer table
resent_packets_used                    95693 (  0.4  /s)  Retransmitted packets that were actually needed
peer_timeouts                              1 (  0.0  /s)  Peers found to be nonresponsive
client_lock_misses                     44178 (  0.2  /s)  Bucket lock misses for client RPCs
client_lock_miss_cycles             35311635 (0.0%)       Time lost waiting for client bucket locks
client_lock_miss_delay                 381.0              Avg. wait time per client_lock miss (ns)
socket_lock_misses                     27431 (  0.1  /s)  Socket lock misses
socket_lock_miss_cycles             11711361 (0.0%)       Time lost waiting for socket locks
socket_lock_miss_delay                 203.5              Avg. wait time per socket_lock miss (ns)
throttle_lock_misses                       2 (  0.0  /s)  Throttle lock misses
throttle_lock_miss_cycles                657 (0.0%)       Time lost waiting for throttle locks
throttle_lock_miss_delay               156.6              Avg. wait time per throttle_lock miss (ns)
grantable_lock_misses                      6 (  0.0  /s)  Grantable lock misses
grantable_lock_miss_cycles              5067 (0.0%)       Time lost waiting for grantable lock
grantable_lock_miss_delay              402.6              Avg. wait time per grantable_lock miss (ns)
reaper_calls                          147653 (  0.6  /s)  Reaper invocations that were not disabled
reaper_dead_skbs                     3952029 ( 15.4  /s)  Sum of hsk->dead_skbs across all reaper calls
avg_dead_skbs                           26.8              Avg. hsk->dead_skbs in reaper
throttle_list_adds                      2497 (  0.0  /s)  Calls to homa_add_to_throttled
throttle_list_checks                      19 (  0.0  /s)  List elements checked in homa_add_to_throttled
gro_benefit                             1.32              Homa packets per homa_softirq call

Per-Core CPU Usage:
-------------------
               Core0   Core1   Core2   Core3   Core4   Core5   Core6   Core7
napi            0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
softirq         0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
send            0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
recv            0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
reply           0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
timer           0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
pacer           0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Total           0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

               Core8   Core9   Core10  Core11  Core12  Core13  Core14  Core15
napi            0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
softirq         0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
send            0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
recv            0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
reply           0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
timer           0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
pacer           0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Total           0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

Overall Core Utilization:
-------------------------
send_syscall             0.00      2.92 us/syscall
recv_syscall (-poll)     0.00      1.86 us/syscall
reply_syscall            0.00      0.00 us/syscall
NAPI                     0.00      0.00 us/packet
Linux SoftIRQ            0.00      0.00 us/packet
  Homa SoftIRQ           0.00      1.05 us/packet
  Sending grants         0.00      0.06 us/packet
Pacer                    0.00
Timer handler            0.00
------------------------------
Total Core Utilization   0.00

Polling in recv          0.00     22.75 us/syscall
App/syscall              0.00   1386.96 us/syscall

Lock Misses:
------------
            Misses/sec.  ns/Miss   %CPU
client          0.2       381.0     0.0
socket          0.1       203.5     0.0
grantable       0.0       402.6     0.0
throttle        0.0       156.6     0.0
peer            0.0         0.0     0.0

Receiving Messages:
-------------------
Available immediately:   -0.0%
Arrival while polling:   94.3%
Arrival while sleeping:   5.7%

Miscellaneous:
--------------
Bytes/packet:         368
Packets received:   0.000 M/sec
Packets sent:       0.000 M/sec
Core efficiency:    0.001 M packets/sec/core (sent & received combined)
                    0.00  Gbps/core (goodput)
Pacer throughput:   7.32  Gbps

Canaries (possible problem indicators):
---------------------------------------
resent_packets                         95720 (  0.4  /s)  DATA packets sent in response to RESENDs
resent_packets_used                    95693 (  0.4  /s)  Retransmitted packets that were actually needed
peer_timeouts                              1 (  0.0  /s)  Peers found to be nonresponsive
pacer_lost_cycles                  126995770 (0.0%)       Lost transmission time because pacer was slow
checks_per_throttle_insert               0.0              List traversals per throttle list insert
acks_per_rpc                            22.7              ACK packets sent per 1000 client RPCs

I'd appreciate any tips on parameter settings. I imagine some misconfiguration is causing this behavior.
Note 1: The TCP benchmark in cp_node works okay on my testbed.
Note 2: The "unloaded" benchmark in the cp_load script times out on my testbed.

Problem with Multiple Links

I have been trying to incorporate Homa into Mininet hosts. That is a cheaper and easier way for me to do certain tests that do not require >10Gbps bandwidth. Plus, this way I can use different software switches like bmv2.

I set up a small topology consisting of 12 hosts and 100Mbps links. Adding entries to /etc/hosts allowed me to run the cp_node utility program on my Mininet network. In my benchmark, I ran these commands each on a different host:

server1 > cp_node server --protocol homa --first-port 4000 > logs/homa_vs_tcp/homa-server-1.txt &
client1 > cp_node client --protocol homa --first-port 4000 --workload 10000 --servers 1 > logs/homa_vs_tcp/homa-client-1.txt &
server2 > cp_node server --protocol homa --first-port 4001 > logs/homa_vs_tcp/homa-server-2.txt &
client2 > cp_node client --protocol homa --first-port 4001 --workload 10000 --servers 2 > logs/homa_vs_tcp/homa-client-2.txt &
...

I also ran these commands to test TCP:

server1 > cp_node server --protocol tcp --first-port 4000 > logs/homa_vs_tcp/tcp-server-1.txt &
client1 > cp_node client --protocol tcp --first-port 4000 --workload 10000 --servers 1 > logs/homa_vs_tcp/tcp-client-1.txt &
server2 > cp_node server --protocol tcp --first-port 4001 > logs/homa_vs_tcp/tcp-server-2.txt &
client2 > cp_node client --protocol tcp --first-port 4001 --workload 10000 --servers 2 > logs/homa_vs_tcp/tcp-client-2.txt &
...

Here is the module configuration:

sysctl .net.homa.link_mbps=100
sysctl .net.homa.timeout_resends=50
sysctl .net.homa.resend_ticks=50
sysctl .net.homa.resend_interval=50

Here are the results I got for Homa on the clients (these are for one client but the results are the same for all):

1708487176.717311252 Clients: 0.09 Kops/sec, 0.01 Gbps out, 0.01 Gbps in, RTT (us) P50 10667.64 P99 10676.73 P99.9 10680.13, avg. req. length 10000.0 bytes
1708487180.718111920 Clients: 0.09 Kops/sec, 0.01 Gbps out, 0.01 Gbps in, RTT (us) P50 10667.84 P99 10700.83 P99.9 10714.93, avg. req. length 10000.0 bytes

Here are the results I got for TCP on the clients (these are for one client but the results are the same for all):

1708487191.763639441 Clients: 0.56 Kops/sec, 0.04 Gbps out, 0.04 Gbps in, RTT (us) P50 1772.89 P99 1808.76 P99.9 1976.60, avg. req. length 10000.0 bytes
1708487195.765673529 Clients: 0.56 Kops/sec, 0.05 Gbps out, 0.05 Gbps in, RTT (us) P50 1774.09 P99 1831.05 P99.9 1905.58, avg. req. length 10000.0 bytes

Looking at the time traces, there is evidence as to why Homa has lower throughput and higher latency compared to TCP:

 1007.364 us (+   0.683 us) [C24] Finished queueing packet: rpc id 1694709, offset 8520, len 1480, granted 10000
 1149.781 us (+   1.410 us) [C24] Finished queueing packet: rpc id 1694712, offset 0, len 8520, granted 10000
 1894.174 us (+   0.404 us) [C24] Finished queueing packet: rpc id 1694712, offset 8520, len 1480, granted 10000
 2036.510 us (+   1.410 us) [C24] Finished queueing packet: rpc id 1694714, offset 0, len 8520, granted 10000
 2781.758 us (+   0.791 us) [C24] Finished queueing packet: rpc id 1694714, offset 8520, len 1480, granted 10000
 2923.029 us (+   1.413 us) [C24] Finished queueing packet: rpc id 1694716, offset 0, len 8520, granted 10000

The send calls are at least 120-150 us apart. What I expect is that when c1 -> s1, c2 -> s2, ... are happening at the same time, we should see packet transmissions that are only a couple of us apart. But it seems like the pacer thread is pacing all the packets globally. That is because of the line in the homa_pacer_xmit function where it busy-waits for the NIC queue wait time to drop below the max_nic_queue_ns config parameter.

Just increasing the max_nic_queue_ns config parameter is not a solution, since it interferes with SRPT on the client side. The problem is that the module keeps and updates only one link_idle_time value, whereas we might have multiple links on the host, each of which requires its own link_idle_time value (Mininet creates virtual links for the hosts). To verify this is the source of the issue, I disabled the throttle queue by setting the HOMA_FLAG_DONT_THROTTLE bit. Here are the results when we bypass the throttle queue and, thereby, the busy wait on link_idle_time:

1708567417.962724782 Clients: 0.49 Kops/sec, 0.04 Gbps out, 0.04 Gbps in, RTT (us) P50 2047.35 P99 2091.64 P99.9 3550.46, avg. req. length 10000.0 bytes
1708567421.964527966 Clients: 0.49 Kops/sec, 0.04 Gbps out, 0.04 Gbps in, RTT (us) P50 2045.86 P99 2090.98 P99.9 2311.49, avg. req. length 10000.0 bytes

These are a lot better. So how about keeping one link_idle_time value for each link instead of one for the host? I don't think implementing it is too much of a headache, since the device name for each RPC is already available through homa_get_dst(rpc->peer, rpc->hsk)->dev, so we can keep track of the active Homa links using a linked list or something similar.

cp_* test code questions and logs/ availability?

Hi Team HOMA, hi @johnousterhout ,

Preamble: This is not an issue with the HomaModule itself; rather, this entry is about usage/knowledge questions regarding the test code and results. If I should move the following to a different communication channel, I'm happy to do so. Please advise.

The questions are:

  1. Why do the associated Homa and TCP experiments, the 1:N node experiments in cp_basic, mostly use different numbers of maximum outstanding requests? homa_client_rpc_tput uses the default (from cperf.py) of 200. However, tcp_client_rpc_tput uses 100 by direct definition in cp_basic. The situation is similar for *_client_tput (50 vs 20) and *_server_rpc_tput (10 vs 50). Only the *_server_tput experiments use the same number (5) for all protocols. What is the reason for this asymmetry? cp_vs_tcp (in contrast to cp_basic) uses 200 outstanding requests (the default, or whatever is specified with --client-max) in any case.

  2. The same question as above arises for the number of client/server ports. For example, in cp_basic TCP servers run with 16 ports, whereas Homa servers run with just 6. This is in contrast to the number of client ports, which is 9 in both cases. Is my assumption correct that this number of ports, and also the number of, for example, port_threads, is the result of working out what the "best" scenario is for a certain cluster (the CloudLab cluster, in this case)? And is this true not only for cp_basic, but also for any cp_* experiments using defaults from cperf.py, which also seem to show "asymmetry" between protocols (e.g., Homa server ports 3 and TCP server ports 8)?

  3. I know the following may be an "it depends" type of question, potentially on multiple levels: I wonder if you happen to have results of the experiments in cp_basic (and cp_vs_tcp) for a 10GbE network? Would it be possible to get the logs/ output(s)? Or do you have an educated "guess" at what to expect out of a 10GbE network? Say we have 4 nodes and do one of these throughput tests with "large messages" (either 1 client to multiple servers (*_client_tput) or 1 server to multiple clients (*_server_tput)). What Gbit/s value(s) should one expect?

  4. Would it be possible to get a "typical" logs/ output for cp_basic and cp_vs_tcp from runs on the "cloudlab" 25GbE cluster setup?

No new file homa_skb.c

I see in the newest commits, there seems to be some intention to introduce a homa_skb.c file. I cannot seem to find this anywhere and this is leading to a compilation error. Perhaps the file is still untracked in the local repo?

Error in making HOMA

Hi, here is a problem I have while making HOMA!

alireza@Alireza:~/Documents/HomaModule$ make
make -C /lib/modules/4.15.0-29-generic/build M=/home/alireza/Documents/HomaModule modules
make[1]: Entering directory '/usr/src/linux-headers-4.15.0-29-generic'
  CC [M]  /home/alireza/Documents/HomaModule/homa_input.o
In file included from /home/alireza/Documents/HomaModule/homa_input.c:3:0:
/home/alireza/Documents/HomaModule/homa_impl.h:535:8: error: unknown type name ‘__poll_t’
 extern __poll_t
        ^
scripts/Makefile.build:332: recipe for target '/home/alireza/Documents/HomaModule/homa_input.o' failed
make[2]: *** [/home/alireza/Documents/HomaModule/homa_input.o] Error 1
Makefile:1552: recipe for target '_module_/home/alireza/Documents/HomaModule' failed
make[1]: *** [_module_/home/alireza/Documents/HomaModule] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-4.15.0-29-generic'
Makefile:7: recipe for target 'all' failed
make: *** [all] Error 2

Is this a problem related to the packages on my system, or is it a logical error? It cannot find __poll_t! I guess something should be included in homa_impl.h.

Thanks

cp_basic with "large messages" show very odd behavior

Hi Team HOMA, hi John,

Preamble: This is most likely, I think, not an issue with the HomaModule, but more of a user/knowledge issue. If I should move the following to a different communication channel, I'm happy to do so. Please advise.

We are currently getting started with the Homa protocol and the HomaModule in particular. As a first major step we would like to get the HomaModule and the associated cp_* test scripts running on our very small in-house 10 Gbit/s "cluster", nothing remotely fancy. For starters we took cp_basic with just 2 nodes and are looking for a successful run-through. This is where we run into the issue that the sub-tests/experiments with "large messages" (500 KB, e.g. "homa_1msg_tput") are more or less obviously not producing any significant load on the network interfaces.

We commented out other experiments in cp_basic and just left e.g. "homa_1msg_tput" uncommented to look at just that. Then the stdout/stderr of cp_basic is:

$ ./HomaModule-a/util/cp_basic -v -n 2 -s 10
cperf starting at 2023-09-08 12:50:51.943014
Options: --alt_slowdown: False, --client_max: 200, --client_ports: 9, --clients: [0, 1], --cperf_log: cperf.log, --dctcp: False, --debug: False, --delete_rtts: False, --gbps: 0.0, --ipv6: , --log_dir: logs/20230908125051, --mtu: 0, --no_homa_prio: False, --no_rtt_files: True, --nodes: [0, 1], --num_nodes: 2, --plot_only: False, --port_receivers: 1, --port_threads: 3, --protocol: homa, --seconds: 10, --server_ports: 6, --servers: [0, 1], --skip: None, --tcp_client_ports: 9, --tcp_port_receivers: 1, --tcp_port_threads: 1, --tcp_server_ports: 16, --unsched: 0, --unsched_boost: 0, --verbose: True, --workload:
Homa configuration:
  dead_buffs_limit     5000
  dynamic_windows      0
  grant_fifo_fraction  50
  gro_policy           82
  link_mbps            10000
  max_dead_buffs       3245
  max_grantable_rpcs   0
  max_gro_skbs         20
  max_gso_size         10000
  max_nic_queue_ns     2000
  max_incoming         0
  max_overcommit       8
  max_rpcs_per_peer    1
  num_priorities       8
  pacer_fifo_fraction  50
  poll_usecs           50
  reap_limit           10
  resend_interval      10
  resend_ticks         15
  throttle_min_bytes   1000
  timeout_resends      5
  unsched_bytes        10000
Starting homa servers on nodes range(1, 2)
Starting cp_node on node1
Command for node1: server --ports 6 --port-threads 3 --protocol homa
Starting cp_node on node0
Starting homa_1msg_tput experiment with clients range(0, 1)
Command for node0: client --ports 1 --port-receivers 0 --server-ports 1 --workload 500000 --servers 0,1 --gbps 0.000 --client-max 1 --protocol homa --id 0
Recording initial metrics
Command for node1: log Starting homa_1msg_tput experiment
Command for node0: log Starting homa_1msg_tput experiment
Command for node1: log Ending homa_1msg_tput experiment
Command for node0: log Ending homa_1msg_tput experiment
Retrieving data for homa_1msg_tput experiment
Recording final metrics from nodes [0, 1]
Command for node0: stop senders
Command for node0: stop clients
Traceback (most recent call last):
  File "$PATHTO/HomaModule-a/util/cp_basic", line 166, in <module>
    set_sysctl_parameter("net.ipv4.tcp_congestion_control", congestion,
                                                            ^^^^^^^^^^
NameError: name 'congestion' is not defined

Of course the evaluation that cp_basic does at the end fails, since we commented out every experiment other than "homa_1msg_tput" (especially the TCP tests).

We then took the cp_node server and cp_node client command lines shown by cp_basic for the experiment "homa_1msg_tput" and executed them on our node0 and node1 respectively (while homa_prio is still running from the cp_basic invocation). The results/stdout are:

node0:

$ pgrep -a homa_prio
1897 bin/homa_prio --interval 500 --unsched 0 --unsched-boost 0
$ cp_node client --ports 1 --port-receivers 0 --server-ports 1 --workload 500000 --servers 0,1 --gbps 0.000 --client-max 1 --protocol homa --id 0
1694170446.906141958 Average message length 500.0 KB, rate 0.00 K/sec, expected BW 0.0 Gbps
1694170447.906504259 Outstanding client RPCs: 1
1694170448.906655629 Clients: 0.00 Kops/sec, 0.00 Gbps out, 0.00 Gbps in, RTT (us) P50 0.00 P99 0.00 P99.9 0.00, avg. req. length -nan bytes
1694170448.906677672 Outstanding client RPCs: 1
1694170449.906809664 Clients: 0.00 Kops/sec, 0.00 Gbps out, 0.00 Gbps in, RTT (us) P50 0.00 P99 0.00 P99.9 0.00, avg. req. length -nan bytes
1694170449.906828062 Outstanding client RPCs: 1
1694170450.906963385 Clients: 0.00 Kops/sec, 0.00 Gbps out, 0.00 Gbps in, RTT (us) P50 0.00 P99 0.00 P99.9 0.00, avg. req. length -nan bytes
1694170450.906981678 Outstanding client RPCs: 1
1694170451.907138367 Clients: 0.00 Kops/sec, 0.00 Gbps out, 0.00 Gbps in, RTT (us) P50 0.00 P99 0.00 P99.9 0.00, avg. req. length -nan bytes
1694170451.907156330 Outstanding client RPCs: 1
^C

node1:

$ pgrep -a homa_prio
1792 bin/homa_prio --interval 500 --unsched 0 --unsched-boost 0
$ cp_node server --ports 6 --port-threads 3 --protocol homa
1694170435.679299266 Successfully bound to Homa port 4000
1694170435.679520677 Successfully bound to Homa port 4001
1694170435.679893405 Successfully bound to Homa port 4002
1694170435.680120748 Successfully bound to Homa port 4003
1694170435.680386196 Successfully bound to Homa port 4004
1694170435.680606426 Successfully bound to Homa port 4005

So the server on node1 is not reporting anything to stdout/stderr, and the client on node0 is reporting a suspicious 0.00 Gbps throughout.

Looking at the traffic between node0 and node1 via port mirroring on our switch, we see that the client sends out a burst of DATA packets, which look OK. It stops once exactly 17040 bytes have been sent; apparently the client is using 17040 bytes as its limit of unscheduled data, right? Shortly after the last DATA packet the client/node0 sends a RESEND packet, which makes sense, since node0 is expecting a response. The server/node1 then replies with a BUSY packet, apparently rejecting the RESEND request. This then repeats over and over in a rather slow ping-pong back and forth.

However, the server does at least react to the client's DATA packets by sending one CUTOFFS packet. When repeating the test, the CUTOFFS packet sometimes appears right after the client has sent its last DATA packet (reaching 17040 bytes of payload), and sometimes it shows up in the middle of the client's DATA burst. The Wireshark screenshot below shows the former case.

  • node0: 192.168.250.1
  • node1: 192.168.250.2

[Wireshark screenshot: homa_1msg_tput packet capture between node0 and node1]

In further tests it turned out that any --workload of 17040 bytes or larger produces this same behavior.

We are working with Ubuntu 18.04 with a custom Linux 6.1.38 (GCC 7.5.0) and with Debian 12 with its shipped Linux 6.1.38 (GCC 12.2.0); both show the same result. We also experimented with switching TSO, GSO and GRO on and off in various combinations, but did not detect any difference. The 2 nodes in question each have an older Intel X520-2 10G NIC, nothing fancy, and they are interconnected through an Arista 10G switch. We used the current main branch (74f29a5).

What is especially puzzling to me is that the client seems to use 17040 bytes as the unscheduled-bytes limit, while at the same time the sysctl setting is left at its default (10000 bytes).

I think we are missing some crucial point?
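
For completeness, here is roughly how we have been double-checking the relevant settings on each node. The net.homa sysctl names are taken from the configuration dump above, and the interface name is just a placeholder for our X520 port:

# Homa's view of the unscheduled-bytes limit and GSO sizing
sysctl net.homa.unsched_bytes net.homa.max_gso_size net.homa.link_mbps

# MTU configured on the NIC (jumbo frames would show up here)
ip link show dev enp3s0f0

# Current TSO/GSO/GRO offload settings
ethtool -k enp3s0f0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'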

Poor behavior of RSS

Hi John! I found that Homa's use of RSS does not seem to work very well; I'm not sure whether it's because of my configuration. For example, I use 3 machines as clients and 1 machine as a server (xl170 nodes in CloudLab).
Client command:

sudo ./cp_node client --protocol homa --ports 9 --port-receivers 1 --server-ports 6 --client-max 10 --workload 100 --first-server 0

Server command:

sudo ./cp_node server --protocol homa --ports 6 --port-threads 3

However, through command:

sudo ethtool -S ens1f1np1 | grep -E 'rx[0-9]+_packets' 

I observed that only 3 queues are receiving packets. For comparison, I tried running TCP with the same parameters and I found that almost all queues had received packets.

I would be extremely grateful if you could provide some insight!
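
In case it helps, this is how I inspected the RSS setup on the NIC; I'm not sure whether the indirection table or the flow-hash configuration is the relevant factor here (ens1f1np1 is the same device as in the command above):

# Show the RSS hash key and the indirection table (which RX queues the hash maps to)
sudo ethtool -x ens1f1np1

# Show which header fields are hashed for protocols the NIC knows about, for comparison with TCP
sudo ethtool -n ens1f1np1 rx-flow-hash tcp4
sudo ethtool -n ens1f1np1 rx-flow-hash udp4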

Dist.h exposes too many internal details and lacks unit tests.

The contents of dist.h/dist.cc should be encapsulated into a class that conforms to the standard C++ random-number facilities, with the interface organized so as to hide implementation details from the user. The files also need unit tests to ensure that all of the functions operate as intended.

Does not build on kernel 6.0.0-rc1 (or 5.19 I think)

This commit changes the recvmsg API to remove the nonblocking parameter:

https://lore.kernel.org/all/[email protected]/

The following patch resolves the build issue:

diff --git a/homa_impl.h b/homa_impl.h
index a5b0582..4a33938 100644
--- a/homa_impl.h
+++ b/homa_impl.h
@@ -2767,7 +2767,7 @@ extern void     homa_prios_changed(struct homa *homa);
 extern int      homa_proc_read_metrics(char *buffer, char **start, off_t offset,
                     int count, int *eof, void *data);
 extern int      homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
-                    int noblock, int flags, int *addr_len);
+                    int flags, int *addr_len);
 extern int      homa_register_interests(struct homa_interest *interest,
                     struct homa_sock *hsk, int flags, __u64 id,
                    const sockaddr_in_union *client_addr);
diff --git a/homa_plumbing.c b/homa_plumbing.c
index ae5efbd..551ca00 100644
--- a/homa_plumbing.c
+++ b/homa_plumbing.c
@@ -1148,7 +1148,7 @@ int homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) {
  * Return:       0 on success, otherwise a negative errno.
  */
 int homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
-                int nonblocking, int flags, int *addr_len) {
+                int flags, int *addr_len) {
        /* Homa doesn't support the usual read-write kernel calls; must
         * invoke operations through ioctls in order to manipulate RPC ids.
         */

Potential MTU related miscalculation

When testing Homa with nccl-tests, I noticed that small test cases (whose messages fall under the MTU) pass, while large test cases fail. Looking through the logs it seems that the packets never make it to the other end; tcpdump confirms that it is Homa sending oversized packets with the DF bit set.

13:56:10.888085 IP (tos 0xa0, ttl 64, id 0, offset 0, flags [DF], proto unknown (253), length 4237)
    172.25.230.89 > 172.25.230.90:  exptest-253 4217

(Background: I'm on Ubuntu 22.04, with kernel 6.1.0-1007-oem.)
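
For what it's worth, these are the checks I used to compare the interface MTU against what actually fits through the path; the peer address is the one from the tcpdump output above and the interface name is just illustrative:

# MTU configured on the local interface
ip link show dev eth0

# Probe the path with DF set (payload 1472 = 1500 minus 28 bytes of IP+ICMP headers)
ping -M do -s 1472 -c 3 172.25.230.90

# Report the discovered path MTU
tracepath 172.25.230.90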

Manuals and code inconsistent about throttle_min_bytes?

I was looking at the throttle queue implementation and reading up about it, when I noticed that the Homa manual mentions this about the throttle_min_bytes configuration parameter:

An integer value specifying the smallest packet size subject to
output queue throttling. Packets smaller than this will be immediately added to the NIC
queue without considering the queue length...

But the way it is used in the code here, it seems to be the remaining bytes of the message, rather than the packet size, that determine whether the message is added to the throttle queue. So the packets could all be uniform in size during message transmission, and every one of them smaller than the throttle_min_bytes value, yet if the remaining bytes of the message exceed throttle_min_bytes, the message will still be added to the throttle queue. I hope someone can correct me if I am misunderstanding.
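
For reference, this is how I have been reading and changing the parameter while experimenting, assuming it lives under net.homa like the other configuration values (the value below is just an example):

# Read the current threshold
sysctl net.homa.throttle_min_bytes

# Change the threshold to see how it affects what ends up on the throttle queue
sudo sysctl -w net.homa.throttle_min_bytes=1000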

How to run send_raw/receive_raw without sudo permissions?

Hi, I'm exploring the Homa kernel module. I've successfully compiled it and loaded it into the kernel. But when trying util/receive_raw and util/send_raw, they require root privileges to open a socket. How can I avoid running them with sudo? Thanks!

I'm running x86_64 Ubuntu 22.04.3 LTS, w/ 6.2.0-37-generic stock kernel.
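
One approach I am considering (I'm not sure whether it is the recommended one) is to grant the binaries the raw-socket capability instead of running them under sudo:

# Grant CAP_NET_RAW so the binaries can open raw sockets without root
sudo setcap cap_net_raw+ep util/send_raw
sudo setcap cap_net_raw+ep util/receive_raw

# Verify that the capability was applied
getcap util/send_raw util/receive_raw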

Error compiling

I'm trying to compile on AlmaLinux 8.7, but I get the error shown below. Please help.
[Screenshot of the compiler error: IMG-20230131-WA0000]

Homa on newer kernel

Are there any plans to port Homa to newer kernels?

I tried to build on 5.17 without success. The first issue was solved by adding a header:

--- a/homa_offload.c
+++ b/homa_offload.c
@@ -18,6 +18,7 @@
  */
 
 #include "homa_impl.h"
+#include <net/gro.h>

but I did not find a quick solution for the following errors:

  CC [M]  /tmp/HomaModule/homa_peertab.o
/tmp/HomaModule/homa_peertab.c: In function ‘homa_peer_find’:
/tmp/HomaModule/homa_peertab.c:147:46: error: passing argument 2 of ‘security_sk_classify_flow’ from incompatible pointer type [-Werror=incompatible-pointer-types]
  147 |         security_sk_classify_flow(&inet->sk, &peer->flow);
      |                                              ^~~~~~~~~~~
      |                                              |
      |                                              struct flowi *
In file included from ./include/net/scm.h:8,
                 from ./include/linux/netlink.h:9,
                 from ./include/uapi/linux/neighbour.h:6,
                 from ./include/linux/netdevice.h:45,
                 from ./include/linux/if_vlan.h:10,
                 from /tmp/HomaModule/homa_impl.h:27,
                 from /tmp/HomaModule/homa_peertab.c:20:
./include/linux/security.h:1401:70: note: expected ‘struct flowi_common *’ but argument is of type ‘struct flowi *’
 1401 | void security_sk_classify_flow(struct sock *sk, struct flowi_common *flic);
      |                                                 ~~~~~~~~~~~~~~~~~~~~~^~~~
/tmp/HomaModule/homa_peertab.c: In function ‘homa_dst_refresh’:
/tmp/HomaModule/homa_peertab.c:197:39: error: passing argument 2 of ‘security_sk_classify_flow’ from incompatible pointer type [-Werror=incompatible-pointer-types]
  197 |         security_sk_classify_flow(sk, &peer->flow);
      |                                       ^~~~~~~~~~~
      |                                       |
      |                                       struct flowi *
In file included from ./include/net/scm.h:8,
                 from ./include/linux/netlink.h:9,
                 from ./include/uapi/linux/neighbour.h:6,
                 from ./include/linux/netdevice.h:45,
                 from ./include/linux/if_vlan.h:10,
                 from /tmp/HomaModule/homa_impl.h:27,
                 from /tmp/HomaModule/homa_peertab.c:20:
./include/linux/security.h:1401:70: note: expected ‘struct flowi_common *’ but argument is of type ‘struct flowi *’
 1401 | void security_sk_classify_flow(struct sock *sk, struct flowi_common *flic);
      |                                                 ~~~~~~~~~~~~~~~~~~~~~^~~~
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:288: /tmp/HomaModule/homa_peertab.o] Error 1
make[1]: *** [Makefile:1837: /tmp/HomaModule] Error 2
