concert_scheduling's People

Contributors

jack-oquin, stonier

concert_scheduling's Issues

Requester workflow problem

Seems to be a problem with a particular flow of events in the scheduler:

I have an example in which the following happens.

  • A requester immediately sends off a request to the scheduler.
  • The SchedulerRequests callback _allocate_resources catches the request
  • Eventually this callback tries to construct a _RequesterStatus object here:

    self.requesters[rqr_id] = _RequesterStatus(self, msg)
    self.requesters[requester_id].send_feedback()

which doesn't exist yet, because we haven't actually got through the construction of _RequesterStatus yet, nor added it to the dictionary.

Not sure where you'd like to tackle a fix for this. If you need instructions for a reproducible example, let me know.
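To illustrate the construction-order hazard (with hypothetical stand-in classes, not the scheduler's real ones): any lookup of self.requesters[requester_id] that runs before the dictionary assignment completes will miss the entry.

```python
class RequesterStatus(object):
    """Hypothetical stand-in for _RequesterStatus."""
    def __init__(self, scheduler, requester_id, msg):
        self.msg = msg
        # Anything that consults the scheduler's dictionary during
        # construction runs before the assignment below has happened:
        self.visible_during_init = requester_id in scheduler.requesters

    def send_feedback(self):
        return 'feedback:' + self.msg


class Scheduler(object):
    def __init__(self):
        self.requesters = {}

    def handle_request(self, requester_id, msg):
        # Mirrors the two lines quoted above: construct, store, then look up.
        self.requesters[requester_id] = RequesterStatus(self, requester_id, msg)
        return self.requesters[requester_id].send_feedback()


s = Scheduler()
result = s.handle_request('rqr-1', 'hello')
assert result == 'feedback:hello'
# The failure window: during __init__ the dictionary entry did not exist yet.
assert s.requesters['rqr-1'].visible_during_init is False
```

Deferring any feedback or lookup until after the assignment, or passing the freshly constructed object directly instead of re-fetching it from the dictionary, would close the window.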

Not processing newly allocatable resources

If a new resource is discovered on concert_client_changes, the scheduler doesn't proceed to allocate that resource and finally grant a request that is waiting on it.

I'm digging around for more information now and will update back here.

scheduler_node : left vs lost resources

This is a carry-over from utexas-bwi/rocon_scheduler_requests#10. I'll try to crystallise what was there and then raise the issue.

Summary

  • We have resources within a request going missing
  • The scheduler should not act on this, just pass notification back to the requester
  • The requester has the freedom to choose its own response
    • cancel and resubmit, or delay, or other...

Current Status

Resources are currently tracked with a status flag with AVAILABLE, ALLOCATED and MISSING values. If a resource disappears from the conductor's /concert/conductor/concert_clients publication, then it is flagged as MISSING and published in turn on /concert/scheduler/resource_pool.

Issues

  1. There are actually two kinds of missing - left and lost

The first is when the concert client has left the concert. If this happens and it does come back, it will come back under a different rocon uri (probably with a counter postfixed to the name, e.g. kobuki2 instead of kobuki). The second is when it just loses its network connection to the concert and will come back at some point.

The conductor's concert_clients topic provides the necessary information for both cases.

When a client has left, it just disappears from the topic's list of clients and this is the case that is handled in the resource pool update.

When a client is lost, it will get shown (I have a bit of work to do on this this week) in the client_status variable, which indicates its current connectivity, e.g.

clients: 
  - 
    name: kobuki
    gateway_name: kobukib093a49d98e747cd9ef2f1cb337408f3
    platform_info: 
      uri: rocon:/pc/kobuki/hydro/precise
      version: acdc
      icon: 
        resource_name: ''
        format: png
    client_status: connected
    app_status: running
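
Given a message like the one above, the two cases could be distinguished along these lines. This is only a sketch: the dict-based message shape and the 'connected' comparison are assumptions modeled on the echo above, not the real conductor API.

```python
def classify_missing(tracked_names, clients_msg):
    """Return {name: 'left' | 'lost'} for tracked clients that went missing.

    clients_msg: list of dicts with 'name' and 'client_status' keys
    (an assumed stand-in for the concert_clients message).
    """
    present = {c['name']: c['client_status'] for c in clients_msg}
    missing = {}
    for name in tracked_names:
        if name not in present:
            missing[name] = 'left'   # vanished from the clients list entirely
        elif present[name] != 'connected':
            missing[name] = 'lost'   # still listed, but connectivity dropped
    return missing


clients = [
    {'name': 'kobuki', 'client_status': 'lost'},
    {'name': 'guimul', 'client_status': 'connected'},
]
assert classify_missing(['kobuki', 'guimul', 'turtlebot'], clients) == \
    {'kobuki': 'lost', 'turtlebot': 'left'}
```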

Questions

Queues and scheduler rset not in sync

The scheduler will grant a request, but the request feedback sent back to the requester carries the pre-grant status of the request (i.e. the grant update did not take effect).

I added some logging to get a handle on what is going on, the most interesting point of which is in the dispatch function. I pass it the rset variable from the callback method, i.e.

    def callback(self, rset):
        rospy.logdebug('scheduler callback:')
        for rq in rset.values():
            rospy.logwarn("DJS : request address in callback [%s]" % hex(id(rq)))
            rospy.logdebug('  ' + str(rq))
            if rq.msg.status == Request.NEW:
                self.queue(rq, rset.requester_id)
            elif rq.msg.status == Request.CANCELING:
                self.free(rq, rset.requester_id)
        self.dispatch(rset)                 # try to allocate ready requests

    def dispatch(self, rset=None):
        while len(self.ready_queue) > 0:
            # Try to allocate top element in the ready queue.
            elem = self.ready_queue.pop()
            rospy.logwarn("DJS: elem -> address [%s]" % hex(id(elem.request)))
            rospy.logwarn("DJS: elem -> id [%s]" % unique_id.toHexString(elem.request.msg.id))
            rospy.logwarn("DJS: elem -> status [%s]" % elem.request.msg.status)
            resources = []
            try:
                resources = self.pool.allocate(elem.request)
            except InvalidRequestError as ex:
                self.reject_request(elem, ex)
                continue                # skip to next queue element

            if not resources:           # top request cannot be satisfied?
                # Return it to head of queue.
                self.ready_queue.add(elem)
                break                   # stop looking

            try:
                elem.request.grant(resources)
                rospy.logwarn("DJS: dispatch -> address [%s]" % hex(id(elem.request)))
                rospy.logwarn("DJS: dispatch -> msg address [%s]" % hex(id(elem.request.msg)))
                rospy.logwarn("DJS: dispatch -> id [%s]" % unique_id.toHexString(elem.request.msg.id))
                rospy.logwarn("DJS: dispatch -> status [%s]" % elem.request.msg.status)
                if rset is not None:
                    for rq in rset.values():
                        rospy.logwarn("DJS: rq -> address [%s]" % hex(id(rq)))
                        rospy.logwarn("DJS: rq -> msg address [%s]" % hex(id(rq.msg)))
                        rospy.logwarn("DJS: rq -> id [%s]" % unique_id.toHexString(rq.msg.id))
                        rospy.logwarn("DJS: rq -> status [%s]" % rq.msg.status)
                rospy.loginfo(
                    'Request granted: ' + str(elem.request.uuid))
            except TransitionError:     # request no longer active?
                # Return allocated resources to the pool.
                self.pool.release_resources(resources)
            self.notification_set.add(elem.requester_id)

I get this output:

[WARN] [WallTime: 1396153452.730991] DJS : request address in callback [0x1f18490]
[WARN] [WallTime: 1396153452.731454] DJS: transitions [w] -> address [0x1f18490]
[WARN] [WallTime: 1396153452.731776] DJS: transitions [w] -> id [10d4e332-4e90-40f4-8f79-57c8bc098182]
[WARN] [WallTime: 1396153452.732043] DJS: transitions [w] -> grant [2]
[WARN] [WallTime: 1396153452.732398] DJS: transitions [w] -> resources [[rapp: rocon_apps/teleop
id: 
  uuid: [54, 104, 244, 114, 150, 196, 79, 31, 169, 128, 168, 115, 67, 248, 50, 146]
uri: rocon:/pc/guimul/hydro/precise
remappings: []]]
[INFO] [WallTime: 1396153452.733212] Request queued: 10d4e332-4e90-40f4-8f79-57c8bc098182
[WARN] [WallTime: 1396153452.733482] DJS: elem -> address [0x1f185d0]
[WARN] [WallTime: 1396153452.733764] DJS: elem -> id [10d4e332-4e90-40f4-8f79-57c8bc098182]
[WARN] [WallTime: 1396153452.734061] DJS: elem -> status [2]
[WARN] [WallTime: 1396153452.734742] DJS: transitions [g] -> address [0x1f185d0]
[WARN] [WallTime: 1396153452.735097] DJS: transitions [g] -> id [10d4e332-4e90-40f4-8f79-57c8bc098182]
[WARN] [WallTime: 1396153452.735369] DJS: transitions [g] -> grant [3]
[WARN] [WallTime: 1396153452.735695] DJS: transitions [g] -> resources [[rapp: rocon_apps/teleop
id: 
  uuid: [54, 104, 244, 114, 150, 196, 79, 31, 169, 128, 168, 115, 67, 248, 50, 146]
uri: rocon:/pc/guimul/hydro/precise
remappings: []]]
[WARN] [WallTime: 1396153452.736015] DJS: dispatch -> address [0x1f185d0]
[WARN] [WallTime: 1396153452.736313] DJS: dispatch -> msg address [0x1eedb40]
[WARN] [WallTime: 1396153452.736619] DJS: dispatch -> id [10d4e332-4e90-40f4-8f79-57c8bc098182]
[WARN] [WallTime: 1396153452.736906] DJS: dispatch -> status [3]
[WARN] [WallTime: 1396153452.737174] DJS: rq -> address [0x1f18490]
[WARN] [WallTime: 1396153452.737463] DJS: rq -> msg address [0x1eed830]
[WARN] [WallTime: 1396153452.737759] DJS: rq -> id [10d4e332-4e90-40f4-8f79-57c8bc098182]
[WARN] [WallTime: 1396153452.738022] DJS: rq -> status [2]
[INFO] [WallTime: 1396153452.738291] Request granted: 10d4e332-4e90-40f4-8f79-57c8bc098182
[WARN] [WallTime: 1396153452.738597] DJS: sending -> address [0x1f18490]
[WARN] [WallTime: 1396153452.738933] DJS: sending -> id 10d4e332-4e90-40f4-8f79-57c8bc098182
[WARN] [WallTime: 1396153452.739228] DJS: sending -> status 2
[WARN] [WallTime: 1396153452.739579] DJS: sending -> resources [rapp: rocon_apps/teleop
id: 
  uuid: [54, 104, 244, 114, 150, 196, 79, 31, 169, 128, 168, 115, 67, 248, 50, 146]
uri: rocon:/pc/guimul/hydro/precise
remappings: []]
[WARN] [WallTime: 1396153452.740037] DJS: resource pool changed
[WARN] [WallTime: 1396153452.740379] DJS: publishing known resources
[WARN] [WallTime: 1396153452.740843] DJS: requester feedback - request id 10d4e332-4e90-40f4-8f79-57c8bc098182
[WARN] [WallTime: 1396153452.741350] DJS: requester feedback - request status 2
[WARN] [WallTime: 1396153452.741736] DJS: requester feedback - request resources [rapp: rocon_apps/teleop

As you can see, the rset request and the popped ready_queue request are not the same thing, so any grant() operation on the popped request is irrelevant. I suspect that they are supposed to be one and the same object?
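The symptom is consistent with the ready queue holding a different object than the rset, e.g. a copy made somewhere along the way. A minimal illustration with a hypothetical stand-in class (status values 2 and 3 chosen to match the log, not the real message constants):

```python
import copy

class Request(object):
    """Hypothetical stand-in; 2 = pre-grant, 3 = granted, as seen in the log."""
    def __init__(self, status):
        self.status = status

    def grant(self):
        self.status = 3


rset = {'rq1': Request(status=2)}

# Buggy pattern: the ready queue ends up with a *copy* of the request
# rather than a reference to the object stored in the rset.
queued = copy.deepcopy(rset['rq1'])
queued.grant()

assert queued.status == 3          # the popped element was granted (dispatch log)...
assert rset['rq1'].status == 2     # ...but the rset, which feedback reads, never sees it
assert queued is not rset['rq1']   # two distinct addresses, matching the log output
```

If the queue element instead held a reference to the very object stored in the rset, the grant would be visible on the feedback path immediately.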

deb release for indigo

@jack-oquin Any chance you could do a deb release for this? I could also do it for you along with our other rocon releases if you like (letting you just worry about the sources).

concert_scheduler_requests crash

I'm afraid the error message is a bit too cryptic for me to explain, and I can't reproduce the error with certainty.

[ERROR] [WallTime: 1398033132.501574] bad callback: <bound method CompatibilityTreeScheduler._ros_subscriber_concert_client_changes of <concert_schedulers.compatibility_tree_scheduler.scheduler.CompatibilityTreeScheduler object at 0x17fd848>>
Traceback (most recent call last):
  File "/opt/ros/hydro/lib/python2.7/dist-packages/rospy/topics.py", line 682, in _invoke_callback
    cb(msg)
  File "/home/piyushk/rocon_catkin_ws/src/rocon_concert/concert_schedulers/src/concert_schedulers/compatibility_tree_scheduler/scheduler.py", line 118, in _ros_subscriber_concert_client_changes
    self._update()
  File "/home/piyushk/rocon_catkin_ws/src/rocon_concert/concert_schedulers/src/concert_schedulers/compatibility_tree_scheduler/scheduler.py", line 250, in _update
    reply.grant(resources)
  File "/home/piyushk/rocon_catkin_ws/src/concert_scheduling/concert_scheduler_requests/src/concert_scheduler_requests/transitions.py", line 306, in grant
    self._transition(EVENT_GRANT, reason=Request.NONE)
  File "/home/piyushk/rocon_catkin_ws/src/concert_scheduling/concert_scheduler_requests/src/concert_scheduler_requests/transitions.py", line 224, in _transition
    + ' in state ' + str(self.msg.status))
TransitionError: invalid event grant in state 5
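The traceback shows grant() being driven from a subscriber callback while the request is already in state 5, so the transition table rejects it. One defensive pattern, sketched here with illustrative status values (not the real scheduler_msgs constants), is to treat an out-of-state grant as a recoverable condition inside the callback rather than letting TransitionError escape:

```python
# Illustrative stand-in state values; the real constants live in scheduler_msgs.
WAITING, GRANTED, CANCELING = 2, 3, 5
GRANTABLE_STATES = {WAITING}


class TransitionError(Exception):
    pass


class Request(object):
    def __init__(self, status):
        self.status = status

    def grant(self):
        if self.status not in GRANTABLE_STATES:
            raise TransitionError('invalid event grant in state %d' % self.status)
        self.status = GRANTED


def try_grant(request):
    """Attempt a grant; report failure instead of crashing the callback."""
    try:
        request.grant()
        return True
    except TransitionError:
        return False


assert try_grant(Request(WAITING)) is True
assert try_grant(Request(CANCELING)) is False   # state 5, as in the traceback
```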

Start app on allocated resources

Do you plan to start apps upon granting resources?

We talked about this fleetingly earlier. Probably the number of schedulers is smaller than the number of requesters, so it would be good to centralise the start-app code in the few schedulers we have.

I can hack on this and send you a PR if you like.

tests: simulate ROCON conductor for testing

This is needed to provide the scheduler with resources to allocate.

The mock conductor script must simulate resources starting and stopping, both randomly and on a known timeline.
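The two timeline modes could be driven from something like the following (a sketch with hypothetical client names; a real mock would publish these events as conductor messages over ROS):

```python
import random

# Emit (time, client, event) tuples either from a fixed script or randomly.

def scripted_timeline():
    """A known, repeatable sequence of clients starting and stopping."""
    return [
        (0.0, 'kobuki', 'start'),
        (2.0, 'guimul', 'start'),
        (5.0, 'kobuki', 'stop'),
    ]


def random_timeline(clients, n_events, seed=None):
    """A random, but reproducible when seeded, sequence of start/stop events."""
    rng = random.Random(seed)
    t, events, up = 0.0, [], set()
    for _ in range(n_events):
        t += rng.uniform(0.5, 3.0)
        name = rng.choice(clients)
        event = 'stop' if name in up else 'start'
        (up.discard if event == 'stop' else up.add)(name)
        events.append((t, name, event))
    return events


assert scripted_timeline()[0] == (0.0, 'kobuki', 'start')
# Seeded runs are reproducible, which makes test failures repeatable:
assert random_timeline(['a', 'b'], 4, seed=1) == random_timeline(['a', 'b'], 4, seed=1)
```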

concert_scheduler_requests: support requests using logical connectives

Migrated from utexas-bwi/rocon_scheduler_requests#8 from @stonier:

The current specification specifies a group of resources required for the request:

Resource[] resources

which is simple, and works for a lot of situations. Jack mentioned something along the lines of making requests of requests, i.e. using logical connectives to form more particularly gnarly requests, e.g.

((two foo or one bar) and 3 baz) 

in which case, a currently specified list of resources such as five different robots would look like:

(a and b and c and d and e)

This is a nice idea and a lot more powerful than just a list of resources. Does it really have many practical use cases, though?

It would be nice to explore this if we identify a need for adding this complexity in a future iteration, so I'm posting this as a kind of TODO marker.
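As a sketch of what evaluating such a request might look like (an illustration of the idea only, not a proposed message format): represent the request as a nested expression and test it against available resource counts.

```python
def satisfiable(expr, available):
    """expr is a nested tuple: ('and', ...), ('or', ...), or a (name, count) leaf."""
    op = expr[0]
    if op == 'and':
        return all(satisfiable(e, available) for e in expr[1:])
    if op == 'or':
        return any(satisfiable(e, available) for e in expr[1:])
    name, count = expr                     # leaf: need `count` of resource `name`
    return available.get(name, 0) >= count


# ((two foo or one bar) and 3 baz), as in the example above:
request = ('and', ('or', ('foo', 2), ('bar', 1)), ('baz', 3))

assert satisfiable(request, {'bar': 1, 'baz': 3}) is True
assert satisfiable(request, {'foo': 1, 'baz': 3}) is False   # only one foo, no bar
```

Note this only answers satisfiability; a real scheduler would also have to decide which branch to actually allocate, which is where most of the added complexity would live.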

concert_scheduler_requests: add request preemption timeout

Moved from utexas-bwi/rocon_scheduler_requests#28:

The current scheduler module allows a requester to hang on to a preempted resource indefinitely. That may be useful in some cases, but relies on all requesters being responsive and well-behaved. While the scheduler protocol is intentionally co-operative, providing a timeout does seem desirable for cases where the requester is not working correctly.

We want a mechanism for the scheduler to allow a reasonable period of time for clean-up without tying up resources forever. Since a fixed time is unlikely to work for all situations, the requester and scheduler should negotiate how long to wait using the hold_time and availability fields. The scheduler's preempt() call would set availability to some reasonable time limit, and the requester could update hold_time if it has different requirements.

See utexas-bwi/rocon_scheduler_requests#10 for a longer discussion of some aspects involving lost resources.
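The negotiation might work out as follows. This is a sketch under the assumption that both fields express a wait duration in seconds; the field names follow the hold_time and availability fields mentioned above, but the combination rule is a guess for illustration, not part of the protocol.

```python
def preemption_deadline(now, availability, hold_time):
    """Scheduler proposes `availability` via preempt(); the requester may
    adjust the clean-up window via `hold_time`. The combination rule here
    (take the larger, capped by a hard maximum) is an illustrative
    assumption, not the spec."""
    HARD_CAP = 300.0   # never wait forever on a misbehaving requester
    wait = min(max(availability, hold_time), HARD_CAP)
    return now + wait


assert preemption_deadline(100.0, availability=30.0, hold_time=0.0) == 130.0
assert preemption_deadline(100.0, availability=30.0, hold_time=60.0) == 160.0
# An unresponsive or greedy requester cannot stretch the wait past the cap:
assert preemption_deadline(100.0, availability=30.0, hold_time=1e6) == 400.0
```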

scheduler_node : resource status via the requester feedback

Some background information can also be found in #23.

A requester currently has to track resource status (i.e. MISSING or not) via the resource_pool topic. While this is eminently doable, it would be convenient to have this information show up in the requester feedback function, which is where a requester (I think) typically does a lot of its decision making.

The variable that gets passed back in the requester feedback function is the RequestSet, which has resource information embedded in scheduler_msgs.Resource; that would feel like a more natural place to get resource status feedback.

Note: this doesn't have to be done, since we already have a way of getting the information. What are your thoughts on it though, Jack?
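From the requester's side, the convenience being asked for might look like this. Field names and stand-in classes are illustrative only, not the current scheduler_msgs definitions; surfacing a status on the resource is precisely what this issue proposes.

```python
MISSING = 'missing'   # illustrative status value


def feedback(rset):
    """Requester feedback callback: react to missing resources inline,
    without separately tracking the resource_pool topic."""
    lost = []
    for rq in rset.values():
        for resource in rq.resources:
            if getattr(resource, 'status', None) == MISSING:
                lost.append(resource.uri)
    return lost


class Res(object):   # stand-in for scheduler_msgs.Resource
    def __init__(self, uri, status):
        self.uri, self.status = uri, status


class Rq(object):    # stand-in for a request in the RequestSet
    def __init__(self, resources):
        self.resources = resources


rset = {'rq1': Rq([Res('rocon:/pc/kobuki', MISSING),
                   Res('rocon:/pc/guimul', 'allocated')])}
assert feedback(rset) == ['rocon:/pc/kobuki']
```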
