Here is the updated config for another_memory_instance:
# the hash function for this worker, required
# to match out of band between the client and server,
# since resource names must be determined on the client
# for a valid upload
hash_function: SHA256
# the endpoint used to execute operations
operation_queue: {
  target: "portland.home:8980"
  # the instance domain that this worker will execute work in
  # all requests will be tagged with this instance name
  instance_name: "another_memory_instance"
}
# the endpoint used for cas interactions
content_addressable_storage: {
  target: "portland.home:8980"
  # the instance domain that this worker will make resource requests in
  # all requests will be tagged with this instance name
  instance_name: "another_memory_instance"
}
# the endpoint used for action cache interactions
action_cache: {
  target: "portland.home:8980"
  # the instance domain that this worker will make resource requests in
  # all requests will be tagged with this instance name
  instance_name: "another_memory_instance"
}
# all content for the operations will be stored under this path
root: "/tmp/worker"
# the local cache location relative to the 'root', or absolute
cas_cache_directory: "cache"
# total size in bytes of inline content for action results;
# output files, stdout, and stderr content, in that order,
# will be inlined if their cumulative size does not exceed this limit
inline_content_limit: 1048576 # 1024 * 1024
# whether the stdout of running processes should be streamed
stream_stdout: true
# whether to insert stdout into the CAS, can be:
# ALWAYS_INSERT: stdout is always inserted into the CAS
# INSERT_ABOVE_LIMIT: stdout is inserted into the CAS when it exceeds the inline limit above
stdout_cas_policy: ALWAYS_INSERT
# whether the stderr of running processes should be streamed
stream_stderr: true
# whether to insert stderr into the CAS, can be:
# ALWAYS_INSERT: stderr is always inserted into the CAS
# INSERT_ABOVE_LIMIT: stderr is inserted into the CAS when it exceeds the inline limit above
stderr_cas_policy: ALWAYS_INSERT
# whether to insert output files into the CAS, can be:
# ALWAYS_INSERT: output files are always inserted into the CAS
# INSERT_ABOVE_LIMIT: output files are inserted into the CAS when they exceed the inline limit above
file_cas_policy: ALWAYS_INSERT
# the worker will take it upon itself to requeue (exceptionally)
# failed operations via the OperationQueue#put method with queued
# status
requeue_on_failure: true
# ContentAddressableStorage#getTree per-page directory count;
# a value of '0' means let the server decide
tree_page_size: 0
# the period between poll operations at any stage
operation_poll_period: {
  seconds: 1
  nanos: 0
}
# key/value set of defining capabilities of this worker;
# all execute requests must match perfectly with workers which
# provide capabilities, so an action with a required
# platform: { arch: "x86_64" } must match with a worker
# with at least { arch: "x86_64" } here
platform: {
  # commented out here for illustrative purposes; a default empty
  # 'platform' is a sufficient starting point without specifying
  # any platform requirements on the actions' side
  ###
  # property: {
  #   name: "key_name"
  #   value: "value_string"
  # }
}
# limit for contents of files retained
# from CAS in the cache
cas_cache_max_size_bytes: 2147483648 # 2 * 1024 * 1024 * 1024
# the number of concurrently available slots in the execute phase
execute_stage_width: 10
# an imposed action-key-invariant timeout used in the unspecified timeout case
default_action_timeout: {
  seconds: 600
  nanos: 0
}
# a limit on the action timeout specified in the action, above which
# the operation will report a failed result immediately
maximum_action_timeout: {
  seconds: 3600
  nanos: 0
}
This runs on a separate box, pointing to portland.home, where the server and default_memory_instance run.
from bazel-buildfarm.
From bazel help --long build:
  --remote_instance_name (a string; default: "")
    Value to pass as instance_name in the remote execution API.
Try swapping between the two names of your instances when you build.
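As a sketch of how that swapping could be kept convenient, the client's .bazelrc can define one named config per instance. --remote_instance_name is a real Bazel flag; the config names instance_a/instance_b are made up for illustration:

```
# hypothetical client-side .bazelrc entries
build:instance_a --remote_instance_name=default_memory_instance
build:instance_b --remote_instance_name=another_memory_instance
```

Then "bazel build --config=instance_a //..." sends work to one instance and "--config=instance_b" to the other, without editing flags by hand each time.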
Both workers spin up without error, but from what I can tell, only the worker designated as default_instance_name receives traffic. I can redirect builds to either worker by changing the default instance, but I can't seem to scale the build by running both workers in parallel.
This makes me believe that you are trying to increase your total pool size by using multiple workers. If you want both of your workers to be able to execute work queued by your single-instance-configured client, you need to configure both workers to use the same instance.
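For illustration, with the field names from the worker config earlier in the thread, sharing one instance would mean every worker's config tags all three endpoints with the same name the client builds against:

```
# in each worker.config (identical on every worker)
operation_queue: {
  target: "portland.home:8980"
  instance_name: "default_memory_instance"
}
content_addressable_storage: {
  target: "portland.home:8980"
  instance_name: "default_memory_instance"
}
action_cache: {
  target: "portland.home:8980"
  instance_name: "default_memory_instance"
}
```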
Ah, I see. Thanks!
If I specify default_memory_instance in all worker.configs, all worker instances seem to participate in the build.
I am running each worker instance on a separate machine now with execute_stage_width: 10.
Two of the machines have 8 cores but the third box only has 4.
CPU utilization seems to be fairly good on all three machines. I also confirmed that the build keeps going if I remove workers.
I run the actual bazel build with --jobs=30.
Are there any recommendations for how to optimally configure parameters like --jobs, execute_stage_width, and the number of workers?
Also, is there any way to get output in the worker console during execution to prove that the worker is actually taking part in the build? Right now the only indications I have are that CPU utilization is way up and that any one of the workers will keep the build going even after the other workers are removed.
from bazel-buildfarm.
There is no recommendation for how to configure --jobs to match worker capacities in terms of total saturation. The possible behaviors with a single bazel client become unruly when large values of -j are considered (concurrent competing downloads for cache, non-execute activity for actions, digest calculations, local fallback oversubscription, etc.), so I would recommend that, just like with a local -j value, you increase it until there is no further benefit from doing so (and if you manage to break buildfarm in the process, file a bug).
From my perspective, buildfarm and remote execution form a distributed service with a loosely defined SLA, where many distinct clients (users doing builds) can capitalize on shared execution resources and cache; and while improving the scaling of a single client is important, it is also substantially tangential.
Similarly, such enumeration of 'what is executing/being fetched remote compared to local' is a burden of the client to communicate to the users.
That said, I'm pursuing a couple of efforts in bazel at least to accomplish better melding with the distributed environments along this vein. More to come.