
Comments (5)

werkt commented on September 25, 2024

There are no timeouts for this per se. These entries essentially represent operations in flight that are waiting for appropriate workers to pull them off the mentioned queue. I'll turn this around and ask: what behavior would you expect when there are effectively no workers available to pull from a queue right now? There are implications here as well: we need to act with some tolerance for transient conditions. Should this be size based or age based? Should there be a TTL after which an entry exits the queue? Would a TTL (essentially a maximum latency for a queue) be appropriate time-wise, since it would slow down every misconfigured build's actions by this amount each time? We could also use active configuration awareness from workers, on servers, to decide whether any currently available worker could pull from this queue, assuming the worker matches the queue's requirements.

All of this requires additional monitoring and consensus. For what is essentially misconfiguration, is this required? Or could an external system monitor fill the role of detecting this state and alerting an administrator?

from bazel-buildfarm.

coder1363691 commented on September 25, 2024

@werkt Thank you for the discussion. Indeed, a misconfiguration like this should occur rarely or not at all. However, once it happens, it results in a strange situation that requires manual intervention on the backplane to resolve. Ideally, I would like to set a timeout for the operations in the OperationQueue, such as OperationQueueTimeout. When scanning the queue, if we find that a task has been stalled for too long, we can terminate that operation and notify the client that the task has timed out. This differs from the actual execution timeout configuration, known as defaultActionTimeout. OperationQueueTimeout would be the maximum time a task may wait in the queue without being consumed, while defaultActionTimeout applies to tasks that are consumed but may time out during execution. However, this is just a possible suggestion. If necessary, I may try to fix this issue locally first. If there are further ideas in the future, we can continue the discussion.


coder1363691 commented on September 25, 2024

@werkt Hello, I made some modifications following the previous discussion: when an Operation stays in the OperationQueue for too long, I attempt to clean it up. It seems to be working fine in my tests:

           protected void visit(QueueEntry queueEntry, String queueEntryJson) {
             onOperationName.accept(queueEntry.getExecuteEntry().getOperationName());
+            // check task timeout in queue
+            long queueAt = queueEntry.getExecuteEntry().getQueuedTimestamp().getSeconds();
+            long now = System.currentTimeMillis() / 1000;
+            long durationTime = now - queueAt;
+            if (durationTime > configs.getBackplane().getMaxQueueTimeout()) {
+              Status status = Status.newBuilder()
+                .setCode(Code.CANCELLED.getNumber()).setMessage("Operation Queued Timeout").build();
+              ExecuteEntry executeEntry = queueEntry.getExecuteEntry();
+              ExecuteOperationMetadata metadata =
+                  ExecuteOperationMetadata.newBuilder()
+                      .setActionDigest(executeEntry.getActionDigest())
+                      .setStdoutStreamName(executeEntry.getStdoutStreamName())
+                      .setStderrStreamName(executeEntry.getStderrStreamName())
+                      .setStage(ExecutionStage.Value.COMPLETED)
+                      .build();
+              Operation queueTimeoutOperation = Operation.newBuilder()
+                  .setName(executeEntry.getOperationName())
+                  .setDone(true)
+                  .setMetadata(Any.pack(metadata))
+                  .setResponse(Any.pack(ExecuteResponse.newBuilder().setStatus(status).build()))
+                  .build();
+              // publish operation status
+              try {
+                putOperation(queueTimeoutOperation, ExecutionStage.Value.COMPLETED);
+              } catch (IOException e) {
+                log.log(Level.SEVERE, format("Error put expired %s", executeEntry.getOperationName()), e);
+              }
+              // remove operation from queue
+              if (!state.operationQueue.removeFromQueue(jedis, queueEntryJson)) {
+                log.log(Level.WARNING, format("removeFromQueue %s failed",executeEntry.getOperationName()));
+              } else {
+                queueTimeoutCounter.inc();
+                log.log(Level.WARNING, format("Operation queued expired,%s", executeEntry.getOperationName()));
+              }
+            }
           }
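
For illustration only, the age check in the patch above can be isolated into a small pure function, which makes the boundary condition easy to test. This is a minimal sketch, not Buildfarm code: `QueueAgeCheck`, `isExpired`, and the rule that a non-positive limit disables the check are all hypothetical. It uses `java.time` instead of mixing `getSeconds()` with `System.currentTimeMillis() / 1000`.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration; not part of Buildfarm. */
final class QueueAgeCheck {
  private QueueAgeCheck() {}

  /**
   * Returns true when an entry queued at queuedAt has waited strictly longer
   * than maxQueueTime as of now.
   */
  static boolean isExpired(Instant queuedAt, Instant now, Duration maxQueueTime) {
    // Assumption of this sketch: a zero or negative limit disables the check.
    if (maxQueueTime.isZero() || maxQueueTime.isNegative()) {
      return false;
    }
    // Strict comparison mirrors the "durationTime > max" test in the patch.
    return Duration.between(queuedAt, now).compareTo(maxQueueTime) > 0;
  }
}
```

In the visitor, the queued timestamp could be converted with `Instant.ofEpochSecond(...)` and compared against `Instant.now()`, assuming the configured limit is expressed in seconds.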

In addition, I found an interesting issue while using bazel-remote (https://github.com/buchgr/bazel-remote/) as a CAS (Content Addressable Storage) service. When a previously uploaded (hash, blob) already exists in the CAS, the code here https://github.com/bazelbuild/bazel-buildfarm/blob/8d6e93fe0798978bff997c78458ac00fc35d0eeb/src/main/java/build/buildfarm/common/grpc/StubWriteOutputStream.java#L258C8-L258C8 throws an exception without properly completing the writeObserver. This leads to a goroutine leak on the bazel-remote side. I made some modifications to the logic locally, and it seems to be working as expected now. Do you have any thoughts on this?

   public void write(byte[] b, int off, int len) throws IOException {
     if (isComplete()) {
+      synchronized (this) {
+        if (writeObserver != null) {
+          writeObserver.onCompleted();
+          writeObserver = null;
+        }
+      }
       throw new WriteCompleteException();
     }


werkt commented on September 25, 2024

That change to the visitor for queue lifetime looks pretty good. I can comment further if you put up a PR for it.

I wasn't aware of that leak, and I've encountered some recent edge cases around StubWriteOutputStream in general that make me concerned about its resource implications overall. I don't think your change covers quite enough of the cases where we should complete the write observer, though. I would rather we do this immediately in response to the onNext reception, so that we're properly triggering off of events (which don't depend on the client actually calling write, or any other method).


coder1363691 commented on September 25, 2024

@werkt OK, I will try to put up a PR later.
BTW, I saw your commit here, dd5c87b,
and I think always handling writeObserver.onCompleted() in close() is a better idea. I tested it and it works fine, so I will patch my local copy to match what this commit does.
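
As a rough illustration of the "always complete in close()" idea (this is not the code from dd5c87b, which is not reproduced here), completion can be made idempotent so that an early exception from write() never leaves the observer open. `WriteObserver` and `CompletingStream` are hypothetical stand-ins, and the real gRPC observer type is deliberately avoided to keep the sketch self-contained.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical stand-in for a gRPC request observer. */
interface WriteObserver {
  void onCompleted();
}

/** Hypothetical stream that completes its observer exactly once, in close(). */
final class CompletingStream implements Closeable {
  private WriteObserver writeObserver; // guarded by this; null once completed
  private boolean complete; // guarded by this

  CompletingStream(WriteObserver writeObserver) {
    this.writeObserver = writeObserver;
  }

  /** Simulates the server reporting that the blob is already fully written. */
  synchronized void markComplete() {
    complete = true;
  }

  synchronized void write(byte[] b, int off, int len) throws IOException {
    if (complete) {
      // No observer handling here: cleanup is left entirely to close().
      throw new IOException("write already complete");
    }
    // Forwarding the bytes to the observer is omitted in this sketch.
  }

  @Override
  public synchronized void close() {
    // Idempotent: the observer is completed once, however often close() runs.
    if (writeObserver != null) {
      writeObserver.onCompleted();
      writeObserver = null;
    }
  }
}
```

With this shape, a caller that wraps the stream in try-with-resources gets the observer completed even when write() throws.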

