
Comments (11)

sayantansur commented on July 17, 2024

I am on the fence on this issue. Some experiments on a platform that natively supports any other data format would definitely be beneficial, as suggested by @goodell. Pending that, I propose that we table the extensions.

shefty commented on July 17, 2024

From an email from Jeff Hammond:

o AR: Jeff Hammond to provide a strawman of the datatypes to be supported

I propose that the following MPI datatypes have explicit support in
the API. As Chapter 4 of MPI-3 is extremely detailed in its
definition of these operations, I see no reason to summarize here. I
only list the motivation and some caveats.

The over-arching caveat is that I do not think we need to support
recursive type creation the way MPI does. The derived types below
only need to be composed of built-in types.

  1. MPI_TYPE_INDEXED

General-purpose IOVEC support allows the user to act on arbitrary
noncontiguous data with a single API call.

  2. MPI_TYPE_VECTOR (shmem__i{put,get} are special cases of this
    where blocklength=1)

Vector types are an obvious optimization because communicating such
data regions otherwise requires O(n) operations or a single IOVEC
operation with O(n) metadata, where n is the number of elements in the
vector. Explicit expression of vector types also enables the use of
scatter-gather engines in the CPU and/or NIC.

  3. MPI_TYPE_CREATE_INDEXED_BLOCK

This is an optimization to (1) that reduces the vector of block sizes
to a scalar.

This is not necessarily a complete list of the type support needed,
but these are my three highest priorities as a user and/or implementor
of MPI-3, GA/ARMCI, and OpenSHMEM.
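
To make the vector case concrete, here is a minimal MPI sketch (mine, not from the original email) that sends one column of a row-major matrix with MPI_TYPE_VECTOR; blocklength=1 is the shmem iput/iget case mentioned above.

/* Minimal sketch: send one column of a row-major ROWS x COLS double
 * matrix using MPI_Type_vector. Without a vector type this takes ROWS
 * separate sends or an iovec with ROWS entries, i.e. the O(n) metadata
 * mentioned above. */
#include <mpi.h>

#define ROWS 128
#define COLS 64

void send_column(double matrix[ROWS][COLS], int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;

    /* ROWS blocks of 1 double each, start-to-start stride of COLS doubles */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&matrix[0][col], 1, column, dest, 0 /* tag */, comm);

    MPI_Type_free(&column);
}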

shefty commented on July 17, 2024

Removed this as a dependency for the alpha software release. This could be handled through extended APIs or structures, which may be a better option than attempting to fit it into an existing call.

shefty commented on July 17, 2024

The proposal is to define new 'strided' iovectors, e.g.:

struct fi_strided_iov {
    struct iovec iov;   /* starting address and length of data segment to read/write */
    offset_t stride;    /* number of bytes from start of segment x to start of segment x + 1 */
    size_t count;       /* number of segments */
};

We can work on the name. There would be similar definitions for other iov (e.g. rma, atomic) structures. The use of a strided iov would be indicated by the op flag FI_STRIDED_IOV.

No changes to the API would be needed to support this.
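
As a usage sketch of the proposal (the structure and flag above are proposed, not existing API; the matrix dimensions are made up for illustration), one column of a row-major matrix would be described as:

/* Hypothetical usage of the proposed structure: describe one column of
 * a row-major 128 x 64 double matrix as 128 segments of one double
 * each, spaced one row apart. */
double matrix[128][64];
int col = 3;

struct fi_strided_iov siov = {
    .iov    = { .iov_base = &matrix[0][col],
                .iov_len  = sizeof(double) },   /* first segment */
    .stride = 64 * sizeof(double),              /* start of segment x to start of segment x + 1 */
    .count  = 128,                              /* number of segments */
};

/* Per the proposal, this would be passed through the existing message
 * calls with the FI_STRIDED_IOV op flag set, rather than through a new
 * entry point. */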

shefty commented on July 17, 2024

Should stride be allowed to be negative? Check whether the result still works out with simple arithmetic.

shefty commented on July 17, 2024

Can count be zero?

shefty commented on July 17, 2024

The use case for this feature is still unclear.

goodell commented on July 17, 2024

I think it's clear that supporting strided IOVs could end up being a better match for MPI implementations, but it seems like a pretty weak win to me unless there actually is hardware/firmware out there that can use such a structure more efficiently. Otherwise we've probably just created more net work: the MPI implementation adds additional checks to see whether the MPI datatype conforms to whatever the libfabric restrictions are, then passes it to libfabric, which then expands the strided IOV under the covers just as the MPI implementation (unconditionally) would have.

charlesarcher commented on July 17, 2024

@goodell
MPI implementations that expand iovecs are most likely suboptimal. The MPI datatype state machine implementations that I know about (IBM dgsp style, and the MPICH dataloops) do not require expansion to iovecs, and are generally compact representations of the datatypes supported by MPI. The iovec expansion is a temporary state that is noncompact and cache-unfriendly, especially for small chunks (where the iovec size is on the order of the message size). Even for a pure software implementation, the datatype state machine should be processed iteratively, without expansion into temporary iovecs.

MPICH can do this today via the segment manipulate routines, although function call overhead to process each chunk could still dominate on fast hardware. This is an implementation issue that MPICH 3.3 plans to address with DAME, a JIT-compiled, optimized datatype infrastructure (http://dl.acm.org/citation.cfm?id=2802659).

I'm not saying we have to add this feature to OFI, especially for hardware that can process iovecs; I'm pointing out that overheads could limit message rates in a software-driven implementation. If fi_write/read do not introduce too much overhead (maybe via FI_DIRECT), then the MPI state machines can drive the software implementations without using iovecs.
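
To illustrate that last point, here is a rough sketch (mine; endpoint setup, memory registration, addressing, and completion handling are assumed to exist elsewhere) of a datatype state machine emitting one fi_write per contiguous chunk, with no iovec expansion:

/* Sketch: drive per-chunk RMA writes directly from a strided
 * description. Segments are packed contiguously at the target.
 * Error handling is reduced to bailing out; real code would retry
 * on -FI_EAGAIN. */
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

static int write_strided(struct fid_ep *ep, void *desc, fi_addr_t dest,
                         const char *base, size_t seg_len, size_t stride,
                         size_t count, uint64_t raddr, uint64_t rkey)
{
    for (size_t i = 0; i < count; i++) {
        ssize_t ret = fi_write(ep, base + i * stride, seg_len, desc,
                               dest, raddr + i * seg_len, rkey, NULL);
        if (ret)
            return (int) ret;
    }
    return 0;
}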

goodell commented on July 17, 2024

The MPI datatype state machine implementations that I know about (IBM dgsp style, and the MPICH dataloops) do not require expansion to iovecs, and are generally compact representations of the datatypes supported by MPI. The iovec expansion is a temporary state that is noncompact and cache-unfriendly, especially for small chunks (where the iovec size is on the order of the message size). Even for a pure software implementation, the datatype state machine should be processed iteratively, without expansion into temporary iovecs.

No disagreement here about how MPI datatype processing is usually implemented, nor am I trying to imply that expansion to full iovec is a common or efficient approach for dealing with strided data transfers. I've fixed at least a few MPICH dataloop bugs in the past and am familiar with how they operate.

However, in the absence of a lower level messaging API (i.e., libfabric here) that natively supports strided IOVs, the MPI implementation basically has three options as I see it for effecting a strided data transfer:

  1. Pack the data to temporary buffers and use contiguous transfers.
  2. Translate the MPI-level data representation to IOVs, then use that IOV to invoke the lower-level transfer (sketched below, after this list).
  3. Transfer each contiguous segment separately (as in your fi_write suggestion).

Obviously (1) and (2) can be chunked and possibly pipelined instead of a "transfer it all in one shot" approach.
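
For option (2), a rough sketch of what that translation looks like in practice (again mine; endpoint setup, registration, and completion handling are assumed, and the iov limit is a made-up value):

/* Sketch of option (2): expand a strided region into a temporary iovec
 * array and issue it with fi_writev, chunking at an assumed provider
 * iov limit. Segments are packed contiguously at the target;
 * pipelining is omitted and errors just bail out (real code would
 * retry on -FI_EAGAIN). */
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

#define IOV_CHUNK 16   /* assumed provider iov limit */

static int writev_strided(struct fid_ep *ep, void *desc, fi_addr_t dest,
                          char *base, size_t seg_len, size_t stride,
                          size_t count, uint64_t raddr, uint64_t rkey)
{
    struct iovec iov[IOV_CHUNK];
    void *descs[IOV_CHUNK];

    for (size_t done = 0; done < count; ) {
        size_t n = count - done < IOV_CHUNK ? count - done : IOV_CHUNK;

        for (size_t i = 0; i < n; i++) {
            iov[i].iov_base = base + (done + i) * stride;
            iov[i].iov_len  = seg_len;
            descs[i] = desc;
        }
        ssize_t ret = fi_writev(ep, iov, descs, n, dest,
                                raddr + done * seg_len, rkey, NULL);
        if (ret)
            return (int) ret;
        done += n;
    }
    return 0;
}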

IMO, the questions that we need to answer to justify adding strided IOVs to libfabric are: (A) is there greater overall efficiency actually enabled by adding this support, e.g., through a better match to hardware capabilities, or (B) is there something currently impossible that would be made possible?

For (A), I don't know the answer to this. Current generation Cisco hardware would not benefit. I'm pretty sure we'll be able to implement it on future generation Cisco hardware, but it's not clear to me yet how much of a performance win it would actually be compared with some packing-based strategies -- experiments would be needed. I don't understand the other NICs out there well enough to know the answer, though I suspect current generation IB wouldn't support this either. If the answer to this question is "no", then we've just pushed a portion of one of the core software tasks of MPI onto libfabric, where it will be handled in software as well. That doesn't seem like much of a win to me.

For (B), the main thing I can think of is that when using tagged messaging at the libfabric level, this allows you to send a strided, tagged message that goes through the normal matching logic without having to fall back on some other rendezvous mechanism. Whether that actually results in greater efficiency still depends on the answer to (A), but maybe it could lead to a cleaner design in some cases? You still won't be able to completely eliminate that rendezvous case for more exotic MPI datatypes, so I'm not sure you actually get a reduction in your MPI implementation's complexity.

If fi_write/read do not introduce too much overhead (maybe via FI_DIRECT), then the MPI state machines can drive the software implementations without using iovecs.

So... sounds like a strided-IOV representation might not be needed anyway from your perspective.

shefty commented on July 17, 2024

No strong desire has come up in 6 years to address this. Closing.

