Comments (11)
I am on the fence on this issue. Some experiments on a platform that supports other data formats natively would definitely be beneficial, as suggested by @goodell. Pending that, I propose that we table the extensions.
from libfabric.
From an email from Jeff Hammond:
- AR: Jeff Hammond to provide a strawman of the datatypes to be supported
I propose that the following MPI datatypes have explicit support in
the API. As Chapter 4 of MPI-3 is extremely detailed in its
definition of these operations, I see no reason to summarize here. I
only list the motivation and some caveats.
The over-arching caveat is that I do not think we need to support
recursive type creating the way MPI does. The derived types below
only need to be composed of built-in types.
- MPI_TYPE_INDEXED
General-purpose IOVEC support allows the user to act on arbitrary
noncontiguous data using a single API call.
- MPI_TYPE_VECTOR (shmem__i{put,get} are special cases of this
where blocklength=1)
Vector types are an obvious optimization because communicating such
data regions otherwise requires O(n) operations or a single IOVEC
operation with O(n) metadata, where n is the number of elements in the
vector. Explicit expression of vector types also enables the use of
scatter-gather engines in the CPU and/or NIC.
- MPI_TYPE_CREATE_INDEXED_BLOCK
This is an optimization of MPI_TYPE_INDEXED that reduces the vector of
block sizes to a scalar.
This is not necessarily a complete list of type support but my three
highest priorities as a user and/or implementor of MPI-3, GA/ARMCI and
OpenSHMEM.
from libfabric.
Removed this as a dependency on an alpha release of software. This could be handled through extended APIs or structures, which may be a better option than attempting to fit this into an existing call.
from libfabric.
Proposal is to define new 'strided' iovectors. E.g.
struct fi_strided_iov {
	struct iovec iov;  /* starting address and length of data segment to read/write */
	offset_t stride;   /* number of bytes from start of segment x to start of segment x + 1 */
	size_t count;      /* number of segments */
};
We can work on the name. There would be similar definitions for the other iov structures (e.g. rma, atomic). The use of a strided iov would be indicated by the op_flag FI_STRIDED_IOV.
No changes to the API would be needed to support this.
from libfabric.
Should stride be allowed to be negative? Check whether the result still works out to simple address arithmetic.
from libfabric.
Can count be zero?
from libfabric.
The use case for this feature is still unclear.
from libfabric.
I think it's clear that supporting strided IOVs could end up being a better match for MPI implementations, but it seems like a pretty weak win to me unless there actually is hardware/firmware out there that can use such a structure more efficiently. Otherwise we've probably just created more net work: the MPI implementation adds additional checks to see whether the MPI datatype conforms to whatever the libfabric restrictions are, then passes it to libfabric, which then expands the strided IOV under the covers just like the MPI implementation (unconditionally) would have.
from libfabric.
@goodell
MPI implementations that are expanding iovecs are most likely suboptimal. The MPI datatype state machine implementations that I know about (IBM dgsp style, and the MPICH dataloops) do not require expansion to iovecs, and are generally compact representations of the datatypes supported by MPI. The iovec expansion is a temporary state that is noncompact and cache-unfriendly, especially for small chunks (where the iovec size is on the order of the message size). Even for a pure software implementation, the datatype state machine should be processed iteratively, without expansion into temporary iovecs.
MPICH can do this today via the segment manipulate routines, although the function call overhead to process each chunk could still dominate on fast hardware. This is an implementation issue that MPICH 3.3 is planning to address with DAME, a JIT-compiled, optimized datatype infrastructure (http://dl.acm.org/citation.cfm?id=2802659).
I'm not saying we have to add this feature to OFI, especially for hardware that can process iovecs, I'm pointing out that overheads could limit message rates in a software driven implementation. If fi_write/read do not introduce too much overhead (maybe via FI_DIRECT), then the MPI state machines can drive the software implementations without using iovecs.
from libfabric.
The MPI datatype state machine implementations that I know about (IBM dgsp style, and the MPICH dataloops) do not require expansion to iovecs, and are generally compact representations of the datatypes supported by MPI. The iovec expansion is a temporary state that is noncompact and cache-unfriendly, especially for small chunks (where the iovec size is on the order of the message size). Even for a pure software implementation, the datatype state machine should be processed iteratively, without expansion into temporary iovecs.
No disagreement here about how MPI datatype processing is usually implemented, nor am I trying to imply that expansion to full iovec is a common or efficient approach for dealing with strided data transfers. I've fixed at least a few MPICH dataloop bugs in the past and am familiar with how they operate.
However, in the absence of a lower level messaging API (i.e., libfabric here) that natively supports strided IOVs, the MPI implementation basically has three options as I see it for effecting a strided data transfer:
1. Pack the data to temporary buffers and use contiguous transfers.
2. Translate the MPI-level data representation to IOVs, then use that IOV to invoke the lower-level transfer.
3. Transfer each contiguous segment separately (as in your fi_write suggestion).
Obviously (1) and (2) can be chunked and possibly pipelined instead of a "transfer it all in one shot" approach.
IMO, the questions we need to answer to justify adding strided IOVs to libfabric are: (A) is there a greater overall efficiency actually enabled by adding this support, e.g., through a better match to hardware capabilities, or (B) is there something currently impossible that would be made possible?
For (A), I don't know the answer. Current generation Cisco hardware would not benefit. I'm pretty sure we'll be able to implement it on future generation Cisco hardware, but it's not clear to me yet how much of a performance win it would actually be compared with some packing-based strategies -- experiments would be needed. I don't understand the other NICs out there well enough to know the answer, though I suspect current generation IB wouldn't support this either. If the answer to this question is "no", then we've just pushed a portion of one of the core software tasks of MPI onto libfabric, where it will be handled in software as well. That doesn't seem like much of a win to me.
For (B) the main thing I can think of is when using tagged messaging at the libfabric level, this allows you to send a strided, tagged message that goes through the normal matching logic without having to fall back on some other rendezvous mechanism. Whether that actually results in greater efficiency still depends on the answer to (A), but maybe it could lead to a cleaner design in some cases? You still won't be able to completely eliminate that rendezvous case for more exotic MPI datatypes, so I'm not sure you actually get a reduction in your MPI implementation's complexity.
If fi_write/read do not introduce too much overhead (maybe via FI_DIRECT), then the MPI state machines can drive the software implementations without using iovecs.
So... sounds like a strided-IOV representation might not be needed anyway from your perspective.
from libfabric.
No strong desire has come up in 6 years to address this. Closing.
from libfabric.