Comments (5)
There are no timeouts for this per se. These entries represent operations in flight that are waiting for an appropriate worker to pull them off the mentioned queue. I'll turn this around and ask: what behavior would you expect when there are effectively no workers available to pull from a queue right now? This also implies that we need some tolerance for transient conditions: should the limit be size-based or age-based? Should entries have a TTL after which they exit the queue? Would a TTL (essentially a maximum latency for a queue) be appropriate time-wise, given that it will slow down every misconfigured build's actions by this amount each time? We could also have servers track the active configuration of workers, and use it to decide whether any currently available worker matches the queue's requirements and could pull from it.
All of this requires additional monitoring and consensus. For what is essentially misconfiguration, is this required? Or could an external system monitor fill the role of detecting this state and alerting an administrator?
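As a rough illustration of the last option above (servers checking whether any worker could serve a queue), here is a minimal sketch. It assumes that queue requirements and worker provisions are both plain string maps and that a match means exact equality on every required key; the class and method names are hypothetical and are not Buildfarm APIs, and real platform matching is likely more involved.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Buildfarm code: decides whether any registered
// worker could pull from a queue, given the queue's platform requirements.
class QueueEligibility {

  // True if the worker's provisions contain every required key with an equal value.
  static boolean satisfies(Map<String, String> provisions, Map<String, String> requirements) {
    return requirements.entrySet().stream()
        .allMatch(r -> r.getValue().equals(provisions.get(r.getKey())));
  }

  // True if at least one worker satisfies the queue's requirements.
  static boolean hasEligibleWorker(
      List<Map<String, String>> workers, Map<String, String> requirements) {
    return workers.stream().anyMatch(w -> satisfies(w, requirements));
  }

  public static void main(String[] args) {
    List<Map<String, String>> workers =
        List.of(Map.of("os", "linux", "cores", "8"), Map.of("os", "linux", "gpu", "true"));
    // A queue that needs a linux worker with a GPU has a candidate.
    System.out.println(hasEligibleWorker(workers, Map.of("os", "linux", "gpu", "true")));
    // A queue that needs windows has none, so its entries would sit there indefinitely.
    System.out.println(hasEligibleWorker(workers, Map.of("os", "windows")));
  }
}
```

A check like this only reflects the workers known at the moment it runs, so it would still need the tolerance for transient states discussed above before an operation is failed.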
from bazel-buildfarm.
@werkt Thank you for the discussion. Indeed, a misconfiguration like this should never occur, or only very rarely. However, once it happens, it leaves the system in a strange state that requires manual intervention on the backplane to resolve. Ideally, I would like to set a timeout for the operations in the OperationQueue, such as OperationQueueTimeout. When scanning the queue, if we find that a task has been stalled for too long, we can terminate that operation and notify the client that the task has timed out. This differs from the existing execution timeout configuration, defaultActionTimeout: OperationQueueTimeout would bound how long a task may wait in the queue without being consumed, while defaultActionTimeout applies to tasks that are consumed but may time out during execution. However, this is just one possible suggestion. If necessary, I may try to fix this issue locally first. If there are further ideas in the future, we can continue the discussion.
@werkt Hello, following up on the previous discussion, I made some modifications: when an Operation stays in the OperationQueue for too long, I attempt to clean it up. It seems to work fine in my tests:
  protected void visit(QueueEntry queueEntry, String queueEntryJson) {
    onOperationName.accept(queueEntry.getExecuteEntry().getOperationName());
+   // check task timeout in queue
+   long queueAt = queueEntry.getExecuteEntry().getQueuedTimestamp().getSeconds();
+   long now = System.currentTimeMillis() / 1000;
+   long durationTime = now - queueAt;
+   if (durationTime > configs.getBackplane().getMaxQueueTimeout()) {
+     Status status = Status.newBuilder()
+         .setCode(Code.CANCELLED.getNumber()).setMessage("Operation Queued Timeout").build();
+     ExecuteEntry executeEntry = queueEntry.getExecuteEntry();
+     ExecuteOperationMetadata metadata =
+         ExecuteOperationMetadata.newBuilder()
+             .setActionDigest(executeEntry.getActionDigest())
+             .setStdoutStreamName(executeEntry.getStdoutStreamName())
+             .setStderrStreamName(executeEntry.getStderrStreamName())
+             .setStage(ExecutionStage.Value.COMPLETED)
+             .build();
+     Operation queueTimeoutOperation = Operation.newBuilder()
+         .setName(executeEntry.getOperationName())
+         .setDone(true)
+         .setMetadata(Any.pack(metadata))
+         .setResponse(Any.pack(ExecuteResponse.newBuilder().setStatus(status).build()))
+         .build();
+     // publish operation status
+     try {
+       putOperation(queueTimeoutOperation, ExecutionStage.Value.COMPLETED);
+     } catch (IOException e) {
+       log.log(Level.SEVERE, format("Error put expired %s", executeEntry.getOperationName()), e);
+     }
+     // remove operation from queue
+     if (!state.operationQueue.removeFromQueue(jedis, queueEntryJson)) {
+       log.log(Level.WARNING, format("removeFromQueue %s failed", executeEntry.getOperationName()));
+     } else {
+       queueTimeoutCounter.inc();
+       log.log(Level.WARNING, format("Operation queued expired,%s", executeEntry.getOperationName()));
+     }
+   }
  }
In addition, I found an interesting issue while using bazel-remote (https://github.com/buchgr/bazel-remote/) as a CAS (Content Addressable Storage) service. When the CAS already contains a previously uploaded (hash, blob) pair, the code at https://github.com/bazelbuild/bazel-buildfarm/blob/8d6e93fe0798978bff997c78458ac00fc35d0eeb/src/main/java/build/buildfarm/common/grpc/StubWriteOutputStream.java#L258C8-L258C8 throws an exception without properly completing the writeObserver. This leads to a goroutine leak on the bazel-remote side. I made some modifications to the logic locally, and it seems to be working as expected now. Do you have any thoughts on this?
  public void write(byte[] b, int off, int len) throws IOException {
    if (isComplete()) {
+     synchronized (this) {
+       if (writeObserver != null) {
+         writeObserver.onCompleted();
+         writeObserver = null;
+       }
+     }
      throw new WriteCompleteException();
    }
That change to the visitor for queue lifetime looks pretty good - can comment further if you put up a PR for it.
I wasn't aware of that leak, and I've encountered some recent edge cases around StubWriteOutputStream in general that make me concerned about its resource implications overall. I don't think that change covers quite enough of the cases where we should complete the write observer, though. I would rather we do this immediately in response to receiving onNext, so that we're properly triggering off of events, which don't depend on the client actually calling write or any other method.
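A minimal sketch of that event-driven idea follows. It is not the StubWriteOutputStream code: the `Observer` interface is a hypothetical stand-in for `io.grpc.stub.StreamObserver` so that the example runs without gRPC, the response type is simplified to a committed size, and the class names are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: complete the request side of a write as soon as the
// server responds, instead of waiting for the client to call write() again.
class EventDrivenWriteCompletion {

  // Minimal stand-in for io.grpc.stub.StreamObserver (assumption for this sketch).
  interface Observer<T> {
    void onNext(T value);

    void onError(Throwable t);

    void onCompleted();
  }

  // Response-side observer that completes the request observer at most once.
  static final class CompletingResponseObserver implements Observer<Long> {
    private final AtomicReference<Observer<byte[]>> requestObserver;

    CompletingResponseObserver(Observer<byte[]> requestObserver) {
      this.requestObserver = new AtomicReference<>(requestObserver);
    }

    @Override
    public void onNext(Long committedSize) {
      // The server has answered (for example, because the blob already exists),
      // so no further chunks are needed: complete the request side right away.
      completeRequestOnce();
    }

    @Override
    public void onError(Throwable t) {
      // The call has already failed; just drop the reference.
      requestObserver.set(null);
    }

    @Override
    public void onCompleted() {
      completeRequestOnce();
    }

    private void completeRequestOnce() {
      Observer<byte[]> observer = requestObserver.getAndSet(null);
      if (observer != null) {
        observer.onCompleted();
      }
    }
  }

  public static void main(String[] args) {
    int[] completions = {0};
    Observer<byte[]> request =
        new Observer<byte[]>() {
          @Override
          public void onNext(byte[] chunk) {}

          @Override
          public void onError(Throwable t) {}

          @Override
          public void onCompleted() {
            completions[0]++;
          }
        };
    CompletingResponseObserver response = new CompletingResponseObserver(request);
    response.onNext(42L); // early server response completes the request side
    response.onCompleted(); // a later completion does not complete it twice
    System.out.println("request completions: " + completions[0]);
  }
}
```

The point of the design is that completion is triggered by the response event itself, so the remote side is released even if the client never calls write again.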
@werkt OK, I will try to put up a PR later.
BTW, I saw your commit here, dd5c87b,
and I think always handling writeObserver.onCompleted() in close() is a better idea. I tested it and it works fine, so I will patch my local copy to match what this commit does.
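For readers without the commit at hand, the general pattern of completing the observer in close() can be sketched as below. This is an assumption-laden illustration and not the contents of dd5c87b: `RequestObserver`, `ClosingWriteStream`, and `markComplete` are hypothetical names, and a plain IOException stands in for WriteCompleteException.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical sketch: an output stream whose close() always completes the
// request observer exactly once, regardless of how writing ended.
class ClosingWriteStream extends OutputStream {

  // Minimal stand-in for the request-side gRPC observer (assumption).
  interface RequestObserver {
    void onNext(byte[] chunk);

    void onCompleted();
  }

  private RequestObserver writeObserver;
  private boolean complete;

  ClosingWriteStream(RequestObserver writeObserver) {
    this.writeObserver = writeObserver;
  }

  // Simulates the server reporting that the write is already complete.
  synchronized void markComplete() {
    complete = true;
  }

  @Override
  public void write(int b) throws IOException {
    write(new byte[] {(byte) b}, 0, 1);
  }

  @Override
  public synchronized void write(byte[] b, int off, int len) throws IOException {
    if (complete || writeObserver == null) {
      // No cleanup here: the caller is expected to close(), which handles it.
      throw new IOException("write already complete");
    }
    writeObserver.onNext(Arrays.copyOfRange(b, off, off + len));
  }

  @Override
  public synchronized void close() {
    // Idempotent: the observer is completed on the first close() only.
    if (writeObserver != null) {
      writeObserver.onCompleted();
      writeObserver = null;
    }
  }
}
```

Used with try-with-resources, close() runs even when write throws, so the observer is released on every path without duplicating cleanup inside write.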