Comments (11)
I am on the fence on this issue. Some experiments on a platform that supports other data formats natively would definitely be beneficial, as suggested by @goodell. Pending that, I propose that we table the extensions.
from libfabric.
From an email from Jeff Hammond:
- AR: Jeff Hammond to provide a strawman of the datatypes to be supported
I propose that the following MPI datatypes have explicit support in
the API. As Chapter 4 of MPI-3 is extremely detailed in its
definition of these operations, I see no reason to summarize here. I
only list the motivation and some caveats.
The over-arching caveat is that I do not think we need to support
recursive type creating the way MPI does. The derived types below
only need to be composed of built-in types.
- MPI_TYPE_INDEXED
General-purpose IOVEC support allows the user to act on arbitrary
noncontiguous data using a single API call.
- MPI_TYPE_VECTOR (shmem__i{put,get} are special cases of this
where blocklength=1)
Vector types are an obvious optimization because communicating such
data regions otherwise requires O(n) operations or a single IOVEC
operation with O(n) metadata, where n is the number of elements in the
vector. Explicit expression of vector types also enables the use of
scatter-gather engines in the CPU and/or NIC.
- MPI_TYPE_CREATE_INDEXED_BLOCK
This is an optimization of MPI_TYPE_INDEXED that reduces the vector of
block sizes to a scalar.
This is not necessarily a complete list of type support but my three
highest priorities as a user and/or implementor of MPI-3, GA/ARMCI and
OpenSHMEM.
from libfabric.
Removed this as a dependency on an alpha release of software. This could be handled through extended APIs or structures, which may be a better option than attempting to fit this into an existing call.
from libfabric.
Proposal is to define new 'strided' iovectors. E.g.
struct fi_strided_iov {
	struct iovec iov;  /* starting address and length of data segment to read/write */
	offset_t stride;   /* number of bytes from start of segment x to start of segment x + 1 */
	size_t count;      /* number of segments */
};
We can work on the name. There would be similar definitions for the other iov structures (e.g. rma, atomic). The use of a strided iov would be indicated by the op_flag FI_STRIDED_IOV.
No changes to the API would be needed to support this.
from libfabric.
Should stride be allowed to be negative? Check whether the result still works out to simple address arithmetic.
from libfabric.
Can count be zero?
from libfabric.
The use case for this feature is still unclear.
from libfabric.
I think it's clear that supporting strided IOVs could end up being a better match for MPI implementations, but it seems like a pretty weak win to me unless there actually is hardware/firmware out there that can use such a structure more efficiently. Otherwise we've probably just created more net work: the MPI implementation adds additional checks to see whether the MPI datatype conforms to whatever the libfabric restrictions are, then passes it to libfabric, which then expands the strided IOV under the covers just like the MPI implementation (unconditionally) would have.
from libfabric.
@goodell
MPI implementations that are expanding iovecs are most likely suboptimal. The MPI datatype state machine implementations that I know about (IBM dgsp style, and the MPICH dataloops) do not require expansion to iovecs, and are generally compact representations of the datatypes supported by MPI. The iovec expansion is a temporary state that is noncompact and cache-unfriendly, especially for small chunks (where the iovec size is on the order of the message size). Even for a pure software implementation, the datatype state machine should be processed iteratively, without expansion into temporary iovecs.
MPICH can do this today via the segment manipulate routines, although the function call overhead to process each chunk could still dominate on fast hardware. This is an implementation issue that MPICH 3.3 is planning to address with DAME, a JIT-compiled, optimized datatype infrastructure (http://dl.acm.org/citation.cfm?id=2802659).
I'm not saying we have to add this feature to OFI, especially for hardware that can process iovecs, I'm pointing out that overheads could limit message rates in a software driven implementation. If fi_write/read do not introduce too much overhead (maybe via FI_DIRECT), then the MPI state machines can drive the software implementations without using iovecs.
from libfabric.
The MPI datatype state machine implementations that I know about (IBM dgsp style, and the MPICH dataloops) do not require expansion to iovecs, and are generally compact representations of the datatypes supported by MPI. The iovec expansion is a temporary state that is noncompact and cache-unfriendly, especially for small chunks (where the iovec size is on the order of the message size). Even for a pure software implementation, the datatype state machine should be processed iteratively, without expansion into temporary iovecs.
No disagreement here about how MPI datatype processing is usually implemented, nor am I trying to imply that expansion to full iovec is a common or efficient approach for dealing with strided data transfers. I've fixed at least a few MPICH dataloop bugs in the past and am familiar with how they operate.
However, in the absence of a lower level messaging API (i.e., libfabric here) that natively supports strided IOVs, the MPI implementation basically has three options as I see it for effecting a strided data transfer:
1. Pack the data to temporary buffers and use contiguous transfers.
2. Translate the MPI-level data representation to IOVs, then use that IOV to invoke the lower-level transfer.
3. Transfer each contiguous segment separately (as in your fi_write suggestion).
Obviously (1) and (2) can be chunked and possibly pipelined instead of a "transfer it all in one shot" approach.
IMO, the questions we need to answer to justify adding strided IOVs to libfabric are: (A) is there a greater overall efficiency actually enabled by adding this support, e.g., through a better match to hardware capabilities, or (B) is there something currently impossible that would be made possible?
For (A), I don't know the answer. Current generation Cisco hardware would not benefit. I'm pretty sure we'll be able to implement it on future generation Cisco hardware, but it's not clear to me yet how much of a performance win it would actually be compared with some packing-based strategies -- experiments would be needed. I don't understand the other NICs out there well enough to know the answer, though I suspect current generation IB wouldn't support this either. If the answer to this question is "no", then we've just pushed a portion of one of the core software tasks of MPI onto libfabric, where it will be handled in software as well. That doesn't seem like much of a win to me.
For (B) the main thing I can think of is when using tagged messaging at the libfabric level, this allows you to send a strided, tagged message that goes through the normal matching logic without having to fall back on some other rendezvous mechanism. Whether that actually results in greater efficiency still depends on the answer to (A), but maybe it could lead to a cleaner design in some cases? You still won't be able to completely eliminate that rendezvous case for more exotic MPI datatypes, so I'm not sure you actually get a reduction in your MPI implementation's complexity.
If fi_write/read do not introduce too much overhead (maybe via FI_DIRECT), then the MPI state machines can drive the software implementations without using iovecs.
So... sounds like a strided-IOV representation might not be needed anyway from your perspective.
from libfabric.
No strong desire has come up in 6 years to address this. Closing.
from libfabric.