Comments (9)

dledford commented on August 16, 2024

@shefty For this, are you referring to adding fields to struct fi_info (which is used when creating an endpoint and therefore could be preset to requested values versus doing a modify sequence after the endpoint is created)? And if so, are you wanting to get as detailed as IB gets here, with options for send queue size, receive queue size, send and recv queue maximum SG entries, and possibly a request for maximum inline data too? Or do you think that's getting too fabric specific and defeating the purpose of abstracting the fabric out?

from libfabric.

shefty commented on August 16, 2024

I was thinking of adding the fields to fi_ep_attr, but I don't know what fields to add, if any. I was thinking along the lines of send/recv queue size and SGL sizes. But if we expand the endpoint to include the concept of sessions for multi-threaded purposes, then there may be multiple sizes, corresponding to different HW work queues. So a single send queue size value may not work. Personally, I like the idea of trying to keep things abstract and using return codes to keep the user from overrunning any lower level queues. I'm not sure that works for all apps though. And I'm not sure what to do with SGL limits. SGL limits seem easier to expose through fi_ep_attr.
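The fields being floated here might look something like the following sketch. Every name below is illustrative only; nothing in it is committed libfabric API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical additions to fi_ep_attr under discussion in this thread.
 * A zeroed field would mean "let the provider choose". */
struct ep_attr_sketch {
    size_t tx_queue_size;   /* requested send queue depth; 0 = provider default */
    size_t rx_queue_size;   /* requested receive queue depth; 0 = provider default */
    size_t max_tx_sgl;      /* max scatter-gather entries per send */
    size_t max_rx_sgl;      /* max scatter-gather entries per receive */
};
```

Keeping the fields optional this way matches the later point in the thread that only apps wanting to deal at the lower level should have to fill them out.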

dledford commented on August 16, 2024

On 09/16/2014 01:58 PM, Sean Hefty wrote:

I was thinking of adding the fields to fi_ep_attr, but I don't know what
fields to add, if any. I was thinking along the lines of send/recv queue
size and SGL sizes.

OK.

But if we expand the endpoint to include the concept
of sessions

Definition please. By sessions do you mean multiple connections between
two hosts not following the same path (like one over ib0 and one over
ib1, or say one over eth1 and one over ib0 where eth1 is RoCE enabled
and ib0 is InfiniBand)?

for multi-threaded purposes, then there may be multiple
sizes, corresponding to different HW work queues.

This will probably go beyond the scope of libfabrics. Or at least I
would think beyond the scope of the bottom layer of libfabrics. We've
talked multiple times about the difference between a libfabrics that
MPIs or other apps that want really low level, "get out of my way" type
access to the underlying fabric want, and then there are apps that want
"abstract away all that fabric stuff and give me a simple, but
performant, interface". The overhead associated with sessions seems
like it would pre-emptively force support for sessions up to that higher
layer abstraction. As such, I'm not sure you want to build that into
the lower layer data structures versus handling it entirely at a higher
layer.

At a minimum though, I can see that if you are going to support the
notion of sessions, then not only would the queue size and other
parameters need to be in fi_ep_attr, but I think you would need to move
the src_addr and dst_addr from fi_info to fi_ep_attr as well since the
addresses of each session would likely be unique.

So a single send queue
size value may not work. Personally, I like the idea of trying to keep
things abstract and using return codes to keep the user from overrunning
any lower level queues.

I had thought about that. But that is decidedly performance unfriendly in
the IB case. And it would prevent the app from implementing any sort of
credit mechanism themselves. But, for some providers, there is no
concept of a queue depth (sockets provider immediately comes to mind).
So I was thinking to add it, but define it in the API such that a user
can specify a requested queue depth in the fi_ep_attr struct, and
depending on the provider the endpoint is created on, the following
matrix of values will be placed in the fi_ep_attr struct on return:

User fills in   Provider has a notion   Provider is queue
queue size?     of queue size           deficient
-------------   ---------------------   -----------------
Yes             return min(max queue    return -1
                depth, requested
                queue depth)
No              return default queue    return -1
                depth
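That matrix, expressed as a small C helper. This is a sketch of the proposed semantics only; the -1 convention comes from the table, and the function and parameter names are illustrative, not libfabric API.

```c
#include <assert.h>
#include <stdbool.h>

/* Negotiate the queue depth placed back in the attr struct on return.
 * requested == 0 means the user left the field unset. */
static int negotiate_queue_depth(int requested, int provider_max,
                                 int provider_default, bool has_queue_concept)
{
    if (!has_queue_concept)
        return -1;                  /* provider is queue deficient */
    if (requested > 0)              /* user filled in a queue size */
        return requested < provider_max ? requested : provider_max;
    return provider_default;        /* user left the field zeroed */
}
```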

I'm not sure that works for all apps though. And
I'm not sure what to do with SGL limits. SGL limits seem easier to
expose through fi_ep_attr.

I would agree with that. I think there are a number of things that are
in fi_info that can go to fi_ep_attr if you are going to support
sessions, and a number of things that can come back if you aren't.
However, since there's no harm in them being in fi_ep_attr, we can put
stuff there and plan for the possible future that way.

shefty commented on August 16, 2024

For 'sessions', what I mean are multiple HW command queues mapped to the application. The command queues have the same transport and network level address. If the queues can receive data, they may have a different session level address, which ideally would be exposed to the app as an index. A very simple use case would be an app using different sessions to communicate with different sets of remote processes. (I haven't thought through this concept, so my ideas are just up in the air at the moment.)

I agree that we will need to expose a size for application credit schemes. Maybe the answer is in the definition. (Note that I'm lousy coming up with names.)

min_outstanding_send - The minimum number of data transfers that a provider will queue to an endpoint.

This still allows for returning EBUSY. A provider may be able to queue more requests.

I also want to consider software providers that enhance the capabilities of a HW provider. E.g. there could be a provider that supports transfers larger than 4 GB, by breaking up a large request into multiple smaller requests. I don't think this causes any issues to a reported queue size, but I haven't thought through it.

Btw, it's kind of arbitrary which fields go into fi_info versus fi_ep_attr. I wanted to keep all mandatory fields in fi_info, and only require those apps that want to deal at the lower level fill out fi_ep_attr.

dledford commented on August 16, 2024

On 09/16/2014 02:45 PM, Sean Hefty wrote:

For 'sessions', what I mean are multiple HW command queues mapped to the
application. The command queues have the same transport and network
level address. If the queues can receive data, they may have a different
session level address, which ideally would be exposed to the app as an
index. A very simple use case would be an app using different sessions
to communicate with different sets of remote processes. (I haven't
thought through this concept, so my ideas are just up in the air at the
moment.)

I think I get what you mean (but I doubt that sets of different
processes are reasonable; you will likely need a whole new EP for each
different process you talk to due to the requirement of having to
listen/connect to different ports/services). However, an example that
does make sense to me, and something I've been looking at doing as an
optimization to conserve memory use in IB communications, is the idea of
having multiple queue pairs between two apps where the queue pairs
utilized different maximum message sizes and queue depths in order to
allow you to send lots of small messages without wasting huge amounts of
space. Such as a queue pair with a max message size of 256 bytes,
another at 1k, another at 4k, another at 16k, and one at 64k, with each
queue pair having progressively fewer entries as the size got larger.
For apps that send lots of small messages with some medium and large
size messages mixed in, this would make a lot of sense (ordering issues
not being considered here, the app would either need to take care of
that or there would need to be a layered ordering provider on top of
this scheme).
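The tiered queue-pair selection described above could be sketched as follows. The tier sizes come straight from the example; the function itself is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Max message size per queue pair, from the example above. */
static const size_t tier_max_msg[] = { 256, 1024, 4096, 16384, 65536 };
enum { NUM_TIERS = sizeof(tier_max_msg) / sizeof(tier_max_msg[0]) };

/* Return the index of the smallest queue pair whose max message size
 * fits the payload, or -1 if the payload exceeds every tier. */
static int pick_tier(size_t len)
{
    for (int i = 0; i < NUM_TIERS; i++)
        if (len <= tier_max_msg[i])
            return i;
    return -1;
}
```

Progressively fewer entries at the larger tiers then bounds the total buffer memory while small messages still get deep queues.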

I agree that we will need to expose a size for application credit
schemes. Maybe the answer is in the definition. (Note that I'm lousy
coming up with names.)

min_outstanding_send - The minimum number of data transfers that a
provider will queue to an endpoint.

Except that most credit schemes are based on the opposite of this: a
maximum that the app knows and can plan for minus the currently
in-flight number.

This still allows for returning EBUSY.

When it comes to applications that want to manage their credits, if we
ever return EBUSY, we've failed.

A provider may be able to queue
more requests.

I think it's fair to say that, if an app wants to manage its own
in-flight counts and credits, that the maximum queue depth plus sends
sent minus completions received should allow them to do so
deterministically, and that only applications that don't bother to
track queue state should ever hit EBUSY, but for them it should exist
and the tracking of queue depths versus sent versus completed should be
an optional optimization left up to the application. Fair enough?
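The proposed accounting (maximum queue depth, plus sends posted, minus completions reaped) amounts to the following sketch. The struct and function names are hypothetical, not libfabric API.

```c
#include <assert.h>
#include <stdbool.h>

/* App-side credit tracking against a provider-reported queue depth. */
struct ep_credits {
    unsigned max_depth;   /* provider-reported maximum queue depth */
    unsigned posted;      /* sends submitted so far */
    unsigned completed;   /* completions reaped so far */
};

/* True while posting one more send cannot overrun the queue; an app
 * that checks this deterministically should never see EBUSY. */
static bool ep_can_post(const struct ep_credits *c)
{
    return (c->posted - c->completed) < c->max_depth;
}
```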

I also want to consider software providers that enhance the capabilities
of a HW provider. E.g. there could be a provider that supports transfers
larger than 4 GB, by breaking up a large request into multiple smaller
requests. I don't think this causes any issues to a reported queue size,
but I haven't thought through it.

It shouldn't, but it would mean that the software provider will have to
provide a minimal queue of their own to compensate for split packets.
But that's OK, they have to split and recombine packets, a small queue
is nothing major to add to that.
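The split arithmetic such a layered provider would do is simple ceiling division; the 4 GB figure comes from the example above, and the names here are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* How many sub-requests a software provider issues for one large
 * transfer, given the HW provider's per-request limit. Each
 * sub-request also consumes a slot on the provider's internal queue. */
static uint64_t split_count(uint64_t total_len, uint64_t max_req_len)
{
    return (total_len + max_req_len - 1) / max_req_len;  /* ceiling divide */
}
```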

Btw, it's kind of arbitrary which fields go into fi_info versus
fi_ep_attr. I wanted to keep all mandatory fields in fi_info, and only
require those apps that want to deal at the lower level fill out fi_ep_attr.

OK, I can understand that. I'll make a note to that effect in the
header file ;-)

shefty commented on August 16, 2024

I think I get what you mean (but I doubt that sets of different
processes is reasonable, you will likely need a whole new EP for each
different process you talk to due to the requirement of having to
listen/connect to different ports/services).

Ah - I was thinking more of unconnected endpoints. HPC apps in general want reliable unconnected endpoints. There are at least a couple of vendors that support this (including Intel). The Mellanox XRC and dynamic connection features are steps in this direction.

does make sense to me, and something I've been looking at doing as an
optimization to conserve memory use in IB communications, is the idea of
having multiple queue pairs between two apps where the queue pairs
utilized different maximum message sizes and queue depths in order to

The libfabric feature to do this is the FI_MULTI_RECV flag, which is support for 'slab based' memory buffering. I.e. the user posts a single large buffer, and multiple receives simply fill in the buffer. This would be more for future HW or non-offload HW. IB could simulate this by using RDMA writes with immediate in place of sending messages.
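A toy model of the slab bookkeeping under FI_MULTI_RECV: each arriving message consumes space from one posted buffer, and when the leftover space drops below a minimum threshold (cf. the FI_OPT_MIN_MULTI_RECV endpoint option) the buffer is released back to the application. The flag and option are real libfabric concepts; this simulation itself is only illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Consume msg_len bytes from a posted multi-recv buffer; assumes
 * msg_len <= *remaining. Returns true once the buffer no longer has
 * min_free bytes left and is handed back to the app for reposting. */
static bool slab_consume(size_t *remaining, size_t msg_len, size_t min_free)
{
    *remaining -= msg_len;
    return *remaining < min_free;   /* true => buffer released to the app */
}
```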

I agree that we will need to expose a size for application credit
schemes. Maybe the answer is in the definition. (Note that I'm lousy
coming up with names.)

min_outstanding_send - The minimum number of data transfers that a
provider will queue to an endpoint.

Except that most credit schemes are based on the opposite of this: a
maximum that the app knows and can plan for minus the currently
in-flight number.

The app can set its starting max_credits to the min_outstanding. I used min instead of max, since the app may be able to post more. E.g. for iWarp to support RDMA write with immediate, it would consume 2 queue entries (RDMA write + send message). So the provider would set min_outstanding = 1/2 the queue size. If the app posts nothing but writes with immediate, it will block at min_outstanding. But if it only does sends, it can queue twice that amount.
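The iWarp arithmetic here, as a sketch (names are illustrative):

```c
#include <assert.h>

/* Worst case: one operation may expand to several work queue entries.
 * For the iWarp example above, an RDMA write with immediate consumes
 * two entries (RDMA write + send message), so max_wr_per_op = 2. */
static unsigned min_outstanding(unsigned queue_size, unsigned max_wr_per_op)
{
    return queue_size / max_wr_per_op;
}
```

With a 64-entry queue, the provider reports 32, yet an app issuing only plain sends could still post all 64.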

I think this meets the intent that you want. The only issue is really the name.

shefty commented on August 16, 2024

A general proposal to expose this is described here:

http://lists.openfabrics.org/pipermail/ofiwg/2014-September/000354.html

I will post a patch for this idea for further discussion.

shefty commented on August 16, 2024

A patch has been developed, but has not been committed.

shefty commented on August 16, 2024

An initial patch for this was committed as 5cb07ab. Discussions are continuing on the mailing list to enhance this, but I am closing this issue, since the queue sizes are now available.
