
Comments (5)

thelgevold commented on May 26, 2024

Here is the updated config for another_memory_instance:

# the hash function for this worker, required
# to match out of band between the client and server
# since resource names must be determined on the client
# for a valid upload
hash_function: SHA256

# the endpoint used to execute operations
operation_queue: {
  target: "portland.home:8980"

  # the instance domain that this worker will execute work in
  # all requests will be tagged with this instance name
  instance_name: "another_memory_instance"
}

# the endpoint used for cas interactions
content_addressable_storage: {
  target: "portland.home:8980"

  # the instance domain that this worker will make resource requests in
  # all requests will be tagged with this instance name
  instance_name: "another_memory_instance"
}

# the endpoint used for action cache interactions
action_cache: {
  target: "portland.home:8980"

  # the instance domain that this worker will make resource requests in
  # all requests will be tagged with this instance name
  instance_name: "another_memory_instance"
}

# all content for the operations will be stored under this path
root: "/tmp/worker"

# the local cache location relative to the 'root', or absolute
cas_cache_directory: "cache"

# total size in bytes of inline content for action results
# output files, stdout, and stderr content, in that order
# will be inlined if their cumulative size does not exceed this limit.
inline_content_limit: 1048576 # 1024 * 1024

# whether the stdout of running processes should be streamed
stream_stdout: true

# whether to insert stdout into the CAS, can be:
#   ALWAYS_INSERT: stdout is always inserted into the CAS
#   INSERT_ABOVE_LIMIT: stdout is inserted into the CAS when it exceeds the inline limit above
stdout_cas_policy: ALWAYS_INSERT

# whether the stderr of running processes should be streamed
stream_stderr: true

# whether to insert stderr into the CAS, can be:
#   ALWAYS_INSERT: stderr is always inserted into the CAS
#   INSERT_ABOVE_LIMIT: stderr is inserted into the CAS when it exceeds the inline limit above
stderr_cas_policy: ALWAYS_INSERT

# whether to insert output files into the CAS, can be:
#   ALWAYS_INSERT: output files are always inserted into the CAS
#   INSERT_ABOVE_LIMIT: output files are inserted into the CAS when they exceed the inline limit above
file_cas_policy: ALWAYS_INSERT

# the worker will take it upon itself to requeue (exceptionally)
# failed operations via the OperationQueue#put method with queued
# status.
requeue_on_failure: true

# ContentAddressableStorage#getTree per-page directory count
# value of '0' means let the server decide
tree_page_size: 0

# the period between poll operations at any stage
operation_poll_period: {
  seconds: 1
  nanos: 0
}

# key/value set of defining capabilities of this worker
# all execute requests must match perfectly with workers which
# provide capabilities
# so an action with a required platform: { arch: "x86_64" } must
# match with a worker with at least { arch: "x86_64" } here
platform: {
  # commented out here for illustrative purposes, a default empty
  # 'platform' is a sufficient starting point without specifying
  # any platform requirements on the actions' side
  ###
  # property: {
  #   name: "key_name"
  #   value: "value_string"
  # }
}

# limit for contents of files retained
# from CAS in the cache
cas_cache_max_size_bytes: 2147483648 # 2 * 1024 * 1024 * 1024

# the number of concurrently available slots in the execute phase
execute_stage_width: 10

# an imposed action-key-invariant timeout used in the unspecified timeout case
default_action_timeout: {
  seconds: 600
  nanos: 0
}

# a limit on the action timeout specified in the action, above which
# the operation will report a failed result immediately
maximum_action_timeout: {
  seconds: 3600
  nanos: 0
}

This runs on a separate box, pointing to portland.home, where the server and default_memory_instance run.


werkt commented on May 26, 2024

from bazel help --long build:

  --remote_instance_name (a string; default: "")
    Value to pass as instance_name in the remote execution API.

Try swapping between the two names of your instances when you build.
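
For example, a hypothetical invocation against the server described above (//your:target stands in for a real label, and the exact remote flags depend on your bazel version):

  # swap the instance name here to route the build to the other instance
  bazel build \
    --remote_executor=grpc://portland.home:8980 \
    --remote_instance_name=another_memory_instance \
    //your:target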


werkt commented on May 26, 2024

Both workers spin up without error, but from what I can tell, only the worker designated as default_instance_name receives traffic. I can redirect builds to any of the workers by changing the default instance, but I can't seem to scale the build by running both workers in parallel.

This, however, makes me believe that you are trying to increase your total pool size by using multiple workers. If you want both of your workers to be able to execute work queued by your single-instance configured client, you need to have both of the workers configured to use the same instance.
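
Concretely, that would mean pointing every endpoint stanza in each worker's config at the same instance. A sketch, mirroring the config above with only the instance name changed:

# in every worker's config, use one shared instance name
operation_queue: {
  target: "portland.home:8980"
  instance_name: "default_memory_instance"
}
content_addressable_storage: {
  target: "portland.home:8980"
  instance_name: "default_memory_instance"
}
action_cache: {
  target: "portland.home:8980"
  instance_name: "default_memory_instance"
}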


thelgevold commented on May 26, 2024

Ah I see. Thanks!

If I specify default_memory_instance in all of the worker configs, all worker instances seem to participate in the build.

I am running each worker instance on separate machines now with execute_stage_width: 10.
Two of the machines have 8 cores but the third box only has 4.

CPU utilization seems to be fairly good on all three machines. I also confirmed that the build keeps going if I remove workers.

I run the actual bazel build with --jobs=30.

Are there any recommendations for how to optimally configure params like --jobs, execute_stage_width and number of workers?

Also, is there any way to get output in the worker console during execution to prove that the worker is actually taking part in the build?

Right now the only indications I have are that CPU utilization is way up, and that any of the workers will keep the build going even after the other workers are removed.


werkt commented on May 26, 2024

There is no recommendation for how to configure --jobs to match worker capacities in terms of total saturation. The possible behaviors with a single bazel client become unruly at large values of -j (concurrent competing downloads from the cache, non-execute activity for actions, digest calculations, local fallback oversubscription, etc.), so I would recommend that, just as with a local -j value, you increase it until there is no further benefit from doing so (and if you manage to break buildfarm in the process, file a bug).
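
As a rough sketch of that tuning loop (hypothetical target; --noremote_accept_cached skips remote cache reads so repeated runs actually re-execute, and the remote flags from earlier still apply):

  # raise --jobs until wall time stops improving
  for j in 10 20 40 80; do
    bazel clean
    time bazel build --jobs=$j --noremote_accept_cached //your:target
  done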

From my perspective, buildfarm and remote execution are a distributed service with a loosely defined SLA, where many distinct clients (users doing builds) can capitalize on shared execution resources and cache; while improving the scaling of the client is important, it is also substantially tangential.

Similarly, enumerating 'what is executing/being fetched remotely compared to locally' is a burden for the client to communicate to its users.

That said, I'm pursuing a couple of efforts, in bazel at least, to accomplish better melding with distributed environments along this vein. More to come.

