parca-dev / parca-agent

eBPF-based always-on profiler auto-discovering targets in Kubernetes and systemd, zero code changes or restarts needed!

Home Page: https://parca.dev/

License: Apache License 2.0

Go 86.35% Dockerfile 0.05% Makefile 0.86% Jsonnet 9.83% Shell 2.91%
ebpf profiling pprof performance kubernetes observability linux golang continuous-profiling libbpf systemd hacktoberfest bpf ebpf-programs c cpp go python ruby rust

parca-agent's Introduction


Parca Agent

Parca Agent is an always-on sampling profiler that uses eBPF to capture raw profiling data with very low overhead. It observes user-space and kernel-space stacktraces 19 times per second and builds pprof-formatted profiles from the extracted data. Read more details in the design documentation.

The collected data can be viewed locally via HTTP endpoints and can optionally be sent to a Parca server to be queried and analyzed over time.

Requirements

  • Linux Kernel version 5.3+ with BTF

Quickstart

See the Kubernetes Getting Started.

Language Support

Parca Agent is continuously expanding its language support. A non-exhaustive list of currently supported languages:

  • C, C++, Go (with extended support), Rust
  • .NET, Deno, Erlang, Java, Julia, Node.js, Wasmtime, PHP 8 and above
  • Ruby, Python

Please check our docs for further information.

Note

Support for further languages is coming in the weeks and months ahead.

Supported Profiles

Types of profiles that are available:

  • On-CPU
  • Soon: Network usage, Allocations

Note

Please check our docs to see whether your language is supported.

The following types of profiles require explicit instrumentation:

  • Runtime-specific information such as goroutines (see the sketch below)
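
For Go programs, a minimal sketch of such instrumentation: importing net/http/pprof exposes goroutine (and other runtime) profiles over HTTP; the address below is an arbitrary choice:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers as a side effect
)

func main() {
	// Goroutine profiles become available at
	// http://localhost:6060/debug/pprof/goroutine once this is running.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}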

Debugging

Logging

To debug potential errors, enable debug logging using --log-level=debug.

Configuration

Flags:

Usage: parca-agent

Flags:
  -h, --help                       Show context-sensitive help.
      --log-level="info"           Log level.
      --log-format="logfmt"        Configure structured logging format: JSON or
                                   logfmt.
      --http-address="127.0.0.1:7071"
                                   Address to bind HTTP server to.
      --version                    Show application version.
      --node="hostname"            The name of the node that the process is
                                   running on. If on Kubernetes, this must match
                                   the Kubernetes node name.
      --config-path=""             Path to config file.
      --memlock-rlimit=0           The value for the maximum number of bytes
                                   of memory that may be locked into RAM. It is
                                   used to ensure the agent can lock memory for
                                   eBPF maps. 0 means no limit.
      --mutex-profile-fraction=0
                                   Fraction of mutex profile samples to collect.
      --block-profile-rate=0       Sample rate for block profile.
      --profiling-duration=10s     The agent profiling duration to use. Leave
                                   this empty to use the defaults.
      --profiling-cpu-sampling-frequency=19
                                   The frequency at which profiling data is
                                   collected, e.g., 19 samples per second.
      --profiling-perf-event-buffer-poll-interval=250ms
                                   The interval at which the perf event buffer
                                   is polled for new events.
      --profiling-perf-event-buffer-processing-interval=100ms
                                   The interval at which the perf event buffer
                                   is processed.
      --profiling-perf-event-buffer-worker-count=4
                                   The number of workers that process the perf
                                   event buffer.
      --metadata-external-labels=KEY=VALUE;...
                                   Label(s) to attach to all profiles.
      --metadata-container-runtime-socket-path=STRING
                                   The filesystem path to the container runtimes
                                   socket. Leave this empty to use the defaults.
      --metadata-disable-caching
                                   Disable caching of metadata.
      --local-store-directory=STRING
                                   The local directory to store the profiling
                                   data.
      --remote-store-address=STRING
                                   gRPC address to send profiles and symbols to.
      --remote-store-bearer-token=STRING
                                   Bearer token to authenticate with store
                                   ($PARCA_BEARER_TOKEN).
      --remote-store-bearer-token-file=STRING
                                   File to read bearer token from to
                                   authenticate with store.
      --remote-store-insecure      Send gRPC requests via plaintext instead of
                                   TLS.
      --remote-store-insecure-skip-verify
                                   Skip TLS certificate verification.
      --remote-store-batch-write-interval=10s
                                   Interval between batch remote client writes.
                                   Leave this empty to use the default value of
                                   10s.
      --remote-store-rpc-logging-enable
                                   Enable gRPC logging.
      --remote-store-rpc-unary-timeout=5m
                                   Maximum timeout window for unary gRPC
                                   requests including retries.
      --debuginfo-directories=/usr/lib/debug,...
                                   Ordered list of local directories to search
                                   for debuginfo files.
      --debuginfo-temp-dir="/tmp"
                                   The local directory path to store the interim
                                   debuginfo files.
      --debuginfo-strip            Only upload information needed for
                                   symbolization. If false the exact binary the
                                   agent sees will be uploaded unmodified.
      --debuginfo-compress         Compress debuginfo files' DWARF sections
                                   before uploading.
      --debuginfo-upload-disable
                                   Disable debuginfo collection and upload.
      --debuginfo-upload-max-parallel=25
                                   The maximum number of debuginfo upload
                                   requests to make in parallel.
      --debuginfo-upload-timeout-duration=2m
                                   The timeout duration to cancel upload
                                   requests.
      --debuginfo-upload-cache-duration=5m
                                   The duration to cache debuginfo upload
                                   responses for.
      --debuginfo-disable-caching
                                   Disable caching of debuginfo.
      --symbolizer-jit-disable     Disable JIT symbolization.
      --otlp-address=STRING        The endpoint to send OTLP traces to.
      --otlp-exporter="grpc"       The OTLP exporter to use.
      --object-file-pool-eviction-policy="lru"
                                   The eviction policy to use for the object
                                   file pool.
      --object-file-pool-size=100
                                   The maximum number of object files to keep in
                                   the pool. This is used to avoid re-reading
                                   object files from disk. It keeps FDs open,
                                   so it should be kept in sync with ulimits.
                                   0 means no limit.
      --dwarf-unwinding-disable    Do not unwind using .eh_frame information.
      --dwarf-unwinding-mixed      Unwind using .eh_frame information and frame
                                   pointers.
      --python-unwinding-disable
                                   Disable Python unwinder.
      --ruby-unwinding-disable     Disable Ruby unwinder.
      --analytics-opt-out          Opt out of sending anonymous usage
                                   statistics.
      --telemetry-disable-panic-reporting

      --telemetry-stderr-buffer-size-kb=4096

      --bpf-verbose-logging        Enable verbose BPF logging.
      --bpf-events-buffer-size=8192
                                   Size in pages of the events buffer.
      --verbose-bpf-logging        [deprecated] Use --bpf-verbose-logging.
                                   Enable verbose BPF logging.

Metadata Labels

Parca Agent supports Prometheus relabeling. The following labels are always attached to profiles:

  • node: The name of the node that the process is running on as specified by the --node flag.
  • comm: The command name of the process being profiled.

Optionally, you can attach additional labels using the --metadata-external-labels flag.

Using relabeling, the following labels can be attached to profiles (see the sketch after this list):

  • __meta_process_pid: The process ID of the process being profiled.
  • __meta_process_cmdline: The command line arguments of the process being profiled.
  • __meta_process_cgroup: The (main) cgroup of the process being profiled.
  • __meta_process_ppid: The parent process ID of the process being profiled.
  • __meta_process_executable_file_id: The file ID (a hash) of the executable of the process being profiled.
  • __meta_process_executable_name: The basename of the executable of the process being profiled.
  • __meta_process_executable_build_id: The build ID of the executable of the process being profiled.
  • __meta_process_executable_compiler: The compiler used to build the executable of the process being profiled.
  • __meta_process_executable_static: Whether the executable of the process being profiled is statically linked.
  • __meta_process_executable_stripped: Whether the executable of the process being profiled is stripped of debuginfo.
  • __meta_system_kernel_release: The kernel release of the system.
  • __meta_system_kernel_machine: The kernel machine of the system (typically the architecture).
  • __meta_agent_revision: The revision of the agent.
  • __meta_kubernetes_namespace: The namespace of the pod the process is running in.
  • __meta_kubernetes_pod_name: The name of the pod the process is running in.
  • __meta_kubernetes_pod_label_*: The value of the label * of the pod the process is running in.
  • __meta_kubernetes_pod_labelpresent_*: Whether the label * of the pod the process is running in is present.
  • __meta_kubernetes_pod_annotation_*: The value of the annotation * of the pod the process is running in.
  • __meta_kubernetes_pod_annotationpresent_*: Whether the annotation * of the pod the process is running in is present.
  • __meta_kubernetes_pod_ip: The IP of the pod the process is running in.
  • __meta_kubernetes_pod_container_name: The name of the container the process is running in.
  • __meta_kubernetes_pod_container_id: The ID of the container the process is running in.
  • __meta_kubernetes_pod_container_image: The image of the container the process is running in.
  • __meta_kubernetes_pod_container_init: Whether the container the process is running in is an init container.
  • __meta_kubernetes_pod_ready: Whether the pod the process is running in is ready.
  • __meta_kubernetes_pod_phase: The phase of the pod the process is running in.
  • __meta_kubernetes_node_name: The name of the node the process is running on.
  • __meta_kubernetes_pod_host_ip: The host IP of the pod the process is running in.
  • __meta_kubernetes_pod_uid: The UID of the pod the process is running in.
  • __meta_kubernetes_pod_controller_kind: The kind of the controller of the pod the process is running in.
  • __meta_kubernetes_pod_controller_name: The name of the controller of the pod the process is running in.
  • __meta_kubernetes_node_label_*: The value of the label * of the node the process is running on.
  • __meta_kubernetes_node_labelpresent_*: Whether the label * of the node the process is running on is present.
  • __meta_kubernetes_node_annotation_*: The value of the annotation * of the node the process is running on.
  • __meta_kubernetes_node_annotationpresent_*: Whether the annotation * of the node the process is running on is present.
  • __meta_docker_container_id: The ID of the container the process is running in.
  • __meta_docker_container_name: The name of the container the process is running in.
  • __meta_docker_build_kit_container_id: The ID of the container the process is running in.
  • __meta_containerd_container_id: The ID of the container the process is running in.
  • __meta_containerd_container_name: The name of the container the process is running in.
  • __meta_containerd_pod_name: The name of the pod the process is running in.
  • __meta_lxc_container_id: The ID of the container the process is running in.
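
To illustrate the relabeling semantics, here is a conceptual sketch of a Prometheus-style keep action on __meta_kubernetes_namespace (this is not the agent's actual implementation; the agent applies Prometheus relabeling rules from its configuration):

package main

import (
	"fmt"
	"regexp"
)

// keepByNamespace mimics a Prometheus-style "keep" relabel action:
// a profile is kept only if its __meta_kubernetes_namespace label
// matches the (fully anchored) regex.
func keepByNamespace(labels map[string]string, pattern string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(labels["__meta_kubernetes_namespace"])
}

func main() {
	profileLabels := map[string]string{
		"node":                        "worker-1",
		"comm":                        "etcd",
		"__meta_kubernetes_namespace": "kube-system",
	}
	fmt.Println(keepByNamespace(profileLabels, "kube-system|default")) // true
}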

Security

Parca Agent must run as the root user (or with CAP_SYS_ADMIN). Various security precautions have been taken to protect users running Parca Agent. See details in Security Considerations.

To report a security vulnerability, see this guide.

Contributing

Check out our Contributing Guide to get started!

License

User-space code: Apache 2

Kernel-space code (eBPF profilers): GNU General Public License, version 2

Credits

Thanks to:

  • Kinvolk for creating Inspektor Gadget; some parts of this project were inspired by parts of it.

parca-agent's People

Contributors

andrewa-stripe, brancz, dependabot[bot], derekparker, dreamerlzl, gnurizen, heylongdacoder, importhuman, javierhonduco, jnsgruk, kakkoyun, korniltsev, manojvivek, marselester, maxbrunet, metalmatze, mrueg, namanl2001, paulfantom, pre-commit-ci-lite[bot], pryz, renovate[bot], slashpai, sylfrena, thorfour, umanwizard, v-thakkar, vadorovsky, zdyj3170101136, zecke


parca-agent's Issues

One-off mode

It would be really nice if parca-agent supported a one-off mode, similar to perf, for profiling CLI commands. I imagine the experience being very similar to that of perf, but integrated with the Parca APIs. E.g.:

$ parca-agent --one-off -- ./my-cli
View profiling data at https://demo.parca.dev/<query parameters>

Cache debuginfo.Exists responses

As far as I'm aware, the requests sent with debuginfo.Exists check whether a binary's symbols are already present on the Parca server side.
Given that build IDs are practically unique, I propose adding caching on the Parca Agent side to reduce the number of Exists requests sent.
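
A minimal sketch of such a cache, assuming a TTL-based design (all names here are hypothetical, not the agent's actual code; a real implementation would also want a size bound and negative-result caching):

package debuginfo

import (
	"sync"
	"time"
)

// existsCache memoizes positive debuginfo.Exists responses keyed by
// build ID, so repeated Exists RPCs for the same binary are skipped.
type existsCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]time.Time // build ID -> time the positive answer was cached
}

func newExistsCache(ttl time.Duration) *existsCache {
	return &existsCache{ttl: ttl, m: map[string]time.Time{}}
}

// known reports whether we recently confirmed the server has this build ID.
func (c *existsCache) known(buildID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	t, ok := c.m[buildID]
	if !ok || time.Since(t) > c.ttl {
		delete(c.m, buildID)
		return false
	}
	return true
}

// markExists records a positive Exists response.
func (c *existsCache) markExists(buildID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[buildID] = time.Now()
}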

metadata/discovery: Support hybrid cgroup v1/v2

Currently, parca-agent supports cgroup v1 and v2; however, we don't have a great story for what happens in hybrid mode when both are mounted.

From our perspective we care mostly about the perf_events controller. In cgroup v2 this is always implicitly enabled, so we don't have to do anything special within the agent. With cgroup v1 we have to enable the controller manually in some instances and then add processes to it.

In hybrid mode we are going to want to stick with either v1 or v2 and not mix and match them. This may involve inspecting the cgroup fs, making a determination per systemd unit / pod / etc. of which version is being used, and sticking with that going forward for that specific service.
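
For reference, a common way to make that determination, assuming the standard mount points: a v2-only (unified) host exposes cgroup.controllers at the cgroup root, while hybrid setups typically mount the v2 hierarchy at /sys/fs/cgroup/unified alongside the v1 controllers. A sketch:

package cgroups

import "os"

// Mode is the detected cgroup setup of the host.
type Mode int

const (
	ModeV1 Mode = iota
	ModeV2
	ModeHybrid
)

// detectMode probes the standard mount points; this is a sketch, not
// the agent's actual implementation.
func detectMode() Mode {
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		return ModeV2 // unified hierarchy mounted at the root
	}
	if _, err := os.Stat("/sys/fs/cgroup/unified/cgroup.controllers"); err == nil {
		return ModeHybrid // v2 mounted next to v1 controllers
	}
	return ModeV1
}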

Heap Profiling Support

  • Figuring out what is in the program heap at any given time

  • Locating memory leaks

  • Finding places that do a lot of allocation

External label support

Similar to Prometheus, it often happens that one wants to attach the same label(s) to all profiles from a node, such as region, env, or others, to better distinguish and slice and dice profiles.

When this is implemented, a user should be able to specify multiple --external-label=key=value style flags, which then automatically get added to all targets.

Sign releases with sigstore

Since users need to run the Parca agent as root or with CAP_SYS_ADMIN, we want to do our utmost to secure its supply chain. In addition to what we are already doing today, we should sign our artifacts. A popular and well-maintained solution is https://www.sigstore.dev/

Support multi-arch container images

Currently, the Dockerfile can only build x86_64 images, due to the following lines:

parca-agent/Dockerfile

Lines 33 to 40 in 8338e77

COPY --from=build /lib/x86_64-linux-gnu/libpthread.so.0 /lib/x86_64-linux-gnu/libpthread.so.0
COPY --from=build /usr/lib/x86_64-linux-gnu/libelf-0.176.so /usr/lib/x86_64-linux-gnu/libelf-0.176.so
COPY --from=build /usr/lib/x86_64-linux-gnu/libdw.so.1 /usr/lib/x86_64-linux-gnu/libdw.so.1
RUN ln -s /usr/lib/x86_64-linux-gnu/libelf-0.176.so /usr/lib/x86_64-linux-gnu/libelf.so.1
COPY --from=build /lib/x86_64-linux-gnu/libz.so.1 /lib/x86_64-linux-gnu/libz.so.1
COPY --from=build /lib/x86_64-linux-gnu/libc.so.6 /lib/x86_64-linux-gnu/libc.so.6
COPY --from=build /usr/lib/x86_64-linux-gnu/libbfd-2.31.1-system.so /usr/lib/x86_64-linux-gnu/libbfd-2.31.1-system.so
COPY --from=build /lib/x86_64-linux-gnu/libdl.so.2 /lib/x86_64-linux-gnu/libdl.so.2

On Apple Silicon Macs, for example, this location is actually aarch64-linux-gnu. Changing the Dockerfile to point at the correct location lets docker build produce a working image.

OpenShift manifests

Things we need added to the Kubernetes setup.

In the Parca Agent DaemonSet/container:

          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
            privileged: true
            runAsUser: 0

And a ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: parca-agent-scc
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use

And a RoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: <name>
  namespace: <namespace>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: parca-agent-scc
subjects:
- kind: ServiceAccount
  name: <name>
  namespace: <namespace>

Reduce amount of host mounts

While it's completely understandable why parca-agent requires certain mounts, it would be nice to reduce the number of host mounts in k8s.

The main point (as discussed in Discord, if I recall correctly) is:

- mountPath: /host/root
  name: root
  readOnly: true

Is mounting whole host root really needed?

Support for namespace selector

It would be nice to give the agent the ability to only collect profiles from pods running in specific namespaces. The use case is to reduce Parca resource usage by not profiling namespaces that we're not interested in.

It is somewhat possible to do this today by adding a new label to all pods of the namespaces we're interested in and using --pod-label-selector, but sometimes we don't have the power/permissions to modify pod metadata.

Elixir/Erlang VM support

The Erlang VM has support for perf maps via the ERL_FLAGS="+S 1 +JPperf true" flags. However, even when setting those flags, profiling an Erlang process does not work consistently (only occasionally are individual addresses symbolized).

Working theory: Erlang has a multi-process model, which could be a problem here if a process is short-lived (in the sense that the process is created and ends within a single 10-second profiling loop).

Ultimately, even if the perf-map support works, it would be great for Erlang users not to have to change anything about their deployment to reap the benefits; still, it's a good intermediate step.

parca-debug-info doesn't work with Rust binaries

Steps to reproduce:

$ cargo new --bin parca-test
$ cd parca-test
$ cargo build --quiet
$ parca-debug-info extract --log-level="debug" ./target/debug/parca-test
level=debug ts=2022-01-04T15:50:08.435179792Z caller=debuginfo.go:278 msg="using eu-strip" file=./target/debug/parca-test
level=error ts=2022-01-04T15:50:08.441557328Z caller=debuginfo.go:259 msg="external binutils command call failed" output="eu-strip: Cannot remove allocated section '.debug_gdb_scripts'" file=./target/debug/parca-test
level=error ts=2022-01-04T15:50:08.441599333Z caller=debuginfo.go:130 msg="failed to extract debug information" buildid=f17a111406287bd2e6a658a8ac539e3b6bfa35ac file=./target/debug/parca-test err="failed to extract debug information from binary: exit status 1"
level=info ts=2022-01-04T15:50:08.441618562Z caller=main.go:170 msg=done!

Duplicate metrics collector registration attempted

panic: duplicate metrics collector registration attempted

goroutine 68 [running]:
github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0xc000568f70, {0xc000568fb0, 0x1, 0x0})
    github.com/prometheus/[email protected]/prometheus/registry.go:403 +0x7f
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewCounterVec({{0x1d9a6e8, 0xc00022d590}}, {{0x0, 0x0}, {0x0, 0x0}, {0x1b77405, 0x29}, {0x1b65bd8, 0x20}, ...}, ...)
    github.com/prometheus/[email protected]/prometheus/promauto/auto.go:276 +0x133
github.com/parca-dev/parca-agent/pkg/profiler.NewCgroupProfiler({0x1d78d20, 0xc00022d4f0}, {0x1d9a6e8, 0xc00022d590}, 0xc000307740, {0x1d7a1e0, 0xc00023bf40}, {0x1d94610, 0xc000569bf0}, 0xc0005a1230, ...)
    github.com/parca-dev/parca-agent/pkg/profiler/profiler.go:103 +0x465
github.com/parca-dev/parca-agent/pkg/target.(*ProfilerPool).Sync(0xc000566ea0, {0xc000648000, 0x9, 0x3})
    github.com/parca-dev/parca-agent/pkg/target/profiler_pool.go:132 +0x266
github.com/parca-dev/parca-agent/pkg/target.(*Manager).reconcileTargets(0xc000648e00, {0x1dadc70, 0xc000591000}, 0x0)
    github.com/parca-dev/parca-agent/pkg/target/manager.go:97 +0x1eb
github.com/parca-dev/parca-agent/pkg/target.(*Manager).Run(0xc000308dc0, {0x1dadc70, 0xc000591000}, 0xc0002d0660)
    github.com/parca-dev/parca-agent/pkg/target/manager.go:75 +0xc9
main.main.func8()
    github.com/parca-dev/parca-agent/cmd/parca-agent/main.go:370 +0x105
github.com/oklog/run.(*Group).Run.func1({0xc000591040, 0xc0001bca20})
    github.com/oklog/[email protected]/group.go:38 +0x2f
created by github.com/oklog/run.(*Group).Run
    github.com/oklog/[email protected]/group.go:37 +0x22f
[event: pod parca/parca-agent-4h6gm] Back-off restarting failed container

cc @Sylfrena @metalmatze

Allocation Profiling Support

Aside from CPU profiling, allocation profiling is very useful as well, especially because a lot of CPU is spent on poor allocation practices. Allocation profiling can also be useful for troubleshooting memory leaks.

This can be done with USDT (Userland Statically Defined Tracing) probes attached via uprobes. It's likely that this will require language/runtime-specific implementations (see bcc's uobjnew).

Use batch operations

// TODO(brancz): Use libbpf batch functions.

libbpf supports batch operations such as:

  • bpf_map_lookup_and_delete_batch
  • bpf_map_lookup_batch
  • bpf_map_delete_batch

These make sense to use: we already load all the data from the BPF maps into our user-space program's memory, so we might as well do it efficiently. Batching not only reduces syscall overhead, it also lets us delete items at the same time, which is exactly what we do after each profiling iteration anyway.

They first need to be implemented in libbpfgo.

Parca agent brings Oracle Kubernetes Engine worker nodes down

The subject may sound weird, but that's what I have experienced. Hope you can shed some light. :)

I started testing Parca in OKE (Oracle Kubernetes Engine) and found that installing the Parca agent made worker nodes unresponsive to SSH; they had to be hard-rebooted. At first, I suspected an issue with OKE or its underlying cloud infrastructure, but over time I was able to isolate the problem: it only happens after the Parca agent is installed. The reproduction was done across multiple clusters, node pools, nodes, etc.

But I still don't know what exactly causes this issue. I've scanned the system stats and logs, but they looked clean, and system resources were fine. I could have missed something, as I have no idea where to look.

The Parca agent showed this log on all nodes where it was installed, but I am not sure if it is relevant (the log does say "please report"):

level=error ts=2021-12-31T05:29:33.514162476Z caller=debuginfo.go:259 namespace=kube-system pod=coredns-94d6cc8b6-br22d container=coredns component=debuginfoextractor msg="external binutils command call failed" output="objcopy: BFD (GNU Binutils for Debian) 2.31.1 internal error, aborting at ../../bfd/elf.c:7085 in rewrite_elf_program_headerobjcopy: Please report this bug." file=/proc/16763/root/coredns

Environment

  • Oracle Kubernetes Engine (Kubernetes 1.21)
  • OS: Oracle Linux Server 7.9 (a cousin to RHEL 7.9)

Steps to reproduce

  1. kubectl create namespace parca
  2. kubectl apply -f https://github.com/parca-dev/parca-agent/releases/download/v0.3.0/kubernetes-manifest.yaml
  3. Watch the node become "not ready" and unresponsive. Sometimes this happens within a few minutes, sometimes it takes much longer (the nodes running the Parca agent DaemonSet went unresponsive one by one).
  4. Hard-reboot the node.

Automate updating debian snapshots

E: Release file for http://snapshot.debian.org/archive/debian-security/20210621T000000Z/dists/buster/updates/InRelease is expired (invalid since 30d 18h 29min 0s). Updates for this repository will not be applied.
E: Release file for http://snapshot.debian.org/archive/debian/20210621T000000Z/dists/buster-updates/InRelease is expired (invalid since 30d 18h 32min 9s). Updates for this repository will not be applied.
E: Release file for http://snapshot.debian.org/archive/debian/20210621T000000Z/dists/buster-backports/InRelease is expired (invalid since 30d 18h 32min 10s). Updates for this repository will not be applied.
STEP 4: FROM docker.io/debian@sha256:c6e92d5b7730fdfc2753c4cce68c90d6c86a6a3391955549f9fe8ad6ce619ce0 AS all
error building at STEP "RUN apt-get update && apt-get install -y clang-11 make gcc coreutils elfutils binutils zlib1g-dev libelf-dev ca-certificates netbase &&         ln -s /usr/bin/clang-11 /usr/bin/clang &&         ln -s /usr/bin/llc-11 /usr/bin/llc": error while running runtime: exit status 100
Trying to pull docker.io/library/debian@sha256:c6e92d5b7730fdfc2753c4cce68c90d6c86a6a3391955549f9fe8ad6ce619ce0...
time="2021-07-28T14:41:21Z" level=error msg="exit status 100"
make: *** [Makefile:182: container] Error 100
Error: Process completed with exit code 2.

For now, we have just disabled the checks. From https://snapshot.debian.org/:

To access snapshots of suites using Valid-Until that are older than a dozen days, it is necessary to ignore the Valid-Until header within Release files, in order to prevent apt from disregarding snapshot entries ("Release file expired"). Use aptitude -o Acquire::Check-Valid-Until=false update or apt-get -o Acquire::Check-Valid-Until=false update for this purpose.

Multiple systemd units with the same name

I have just started playing with parca and this is the first thing that came up. ;)

Use case: multiple containers with systemd inside.
Cgroup v2 structure:

/sys/fs/cgroup/machine.slice/machine-libpod_pod_<some_hash>.slice/libpod-<some_hash>.scope/container/system.slice
/sys/fs/cgroup/machine.slice/machine-libpod_pod_<other_hash>.slice/libpod-<other_hash>.scope/container/system.slice
/sys/fs/cgroup/machine.slice/machine-libpod_pod_<another_hash>.slice/libpod-<another_hash>.scope/container/system.slice

Each of the containers runs a systemd unit with the same name. Obviously, from the container perspective these are separate systemd instances, but from the host perspective, when using a single parca-agent, there is a problem: I can't just use --systemd-units=my-app.service, because parca-agent will try to find that service under /sys/fs/cgroup/system.slice. (BTW, the default for now is /sys/fs/cgroup/systemd/system.slice, which is wrong for cgroup v2. Shouldn't this be autodetected?)

Can this be supported, or should I run multiple agents for each container with --systemd-cgroup-path= pointing at each system.slice inside machine.slice?

Env: RHEL 8 and podman.

JVM support

This could be a built-in and pprof-integrated version of perf-map-agent and/or async-profiler.

Plan

  1. In Phase one, use async-profiler to profile JVM processes. #1115
  2. In Phase two, use the eBPF profiler to collect the profiles.

Agent fails when started without specifying `--profiling-duration`

ts=2021-10-18T15:42:26.248832216Z caller=main.go:83 msg=starting... node=k3s-parca-43a7dd92-node-pool-ad21 store=parca.parca.svc.cluster.local:7070
level=info ts=2021-10-18T15:42:26.273059724Z caller=podinformer.go:143 msg="starting pod controller"
panic: non-positive interval for NewTicker

goroutine 67 [running]:
time.NewTicker(0x0, 0x1d0def3)
	time/tick.go:24 +0x151
github.com/parca-dev/parca-agent/pkg/agent.(*CgroupProfiler).Run(0xc000506780, 0x1f6ba10, 0xc000632040, 0x0, 0x0)
	github.com/parca-dev/parca-agent/pkg/agent/profile.go:229 +0x9a7
github.com/parca-dev/parca-agent/pkg/agent.(*PodManager).Run.func1(0xc000506780, 0x1f6ba10, 0xc00007dcc0, 0x1f31140, 0xc0004312c0)
	github.com/parca-dev/parca-agent/pkg/agent/podmanager.go:141 +0x4c
created by github.com/parca-dev/parca-agent/pkg/agent.(*PodManager).Run
	github.com/parca-dev/parca-agent/pkg/agent/podmanager.go:140 +0x995

cc @slashpai
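
A sketch of a guard for this: validate the duration and fall back to a default (10s, mirroring the flag's documented default) before constructing the ticker, instead of passing zero straight to time.NewTicker. The function name is hypothetical:

package agent

import "time"

// newProfilingTicker guards against the "non-positive interval for
// NewTicker" panic above by falling back to a default when the
// --profiling-duration flag is unset.
func newProfilingTicker(d time.Duration) *time.Ticker {
	if d <= 0 {
		d = 10 * time.Second
	}
	return time.NewTicker(d)
}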

Error parsing erlang perf-map

I launched a simple RabbitMQ setup (RabbitMQ is written in Erlang) and added the environment variable ERL_FLAGS="+S 1 +JPperf true", which makes the Erlang VM emit Linux kernel JIT perf maps, which Parca Agent supports. I verified that perf maps are indeed being emitted by the Erlang VM, but saw this error in the logs when I turned on debug logging:

level=debug ts=2021-11-19T09:38:32.640589484Z caller=profile.go:382 namespace=test-rabbitmq pod=rabbitmq-0 container=rabbitmq msg="no perfmap" err="parsing start failed on [0x7f208ac3b000 88 $global::arith_compare_shared]: strconv.ParseUint: parsing \"0x7f208ac3b000\": invalid syntax"

Theory: it appears that the Erlang VM emits the memory address slightly differently than other runtimes we've tested with (e.g., nodejs): it adds a 0x prefix before the uint64 address.
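
If the theory holds, the parser can simply tolerate both styles. A minimal sketch in Go (function name hypothetical, not the agent's actual code):

package perf

import (
	"strconv"
	"strings"
)

// parseAddr accepts both the bare hex addresses emitted by runtimes
// like nodejs ("7f208ac3b000") and Erlang's 0x-prefixed form
// ("0x7f208ac3b000") that triggered the error above.
func parseAddr(s string) (uint64, error) {
	return strconv.ParseUint(strings.TrimPrefix(s, "0x"), 16, 64)
}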

Customizing labeling

We currently apply certain non-customizable labeling of data, but that may not match how users label their other observability data. We should allow customizing target labels. In the Prometheus ecosystem this is done via relabel configs; however, relabeling is often confusing, especially to newcomers, so unless we conclude we need all of the functionality that relabeling provides, I'd be OK with investigating alternative paths as well.

Multiple pprof normalization errors

I am not sure if this will be of any use, but I got multiple errors from parca-agent:

$ k logs -f parca-agent-r7xsz | grep 'level=error'
level=error ts=2021-10-10T13:24:16.98371093Z caller=profile.go:458 namespace=flux-system pod=kustomize-controller-6f6647b88d-kqf6l container=manager msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: execute SQL statement: constraint failed: UNIQUE constraint failed: locations.mapping_id, locations.is_folded, locations.normalized_address, locations.lines (2067)"
level=error ts=2021-10-10T13:25:55.20761276Z caller=profile.go:458 namespace=monitoring pod=blackbox-exporter-8578ddc7c4-f2zs7 container=blackbox-exporter msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: interrupted (9)"
level=error ts=2021-10-10T13:27:25.241547919Z caller=profile.go:458 namespace=monitoring pod=prometheus-k8s-1 container=prometheus msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: interrupted (9)"
level=error ts=2021-10-10T13:29:15.256740408Z caller=profile.go:458 namespace=monitoring pod=prometheus-k8s-1 container=prometheus msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: execute SQL statement: interrupted (9)"
level=error ts=2021-10-10T13:29:45.429935595Z caller=profile.go:458 namespace=parca pod=parca-799874bf59-q5pjh container=parca msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: interrupted (9)"
level=error ts=2021-10-10T13:30:15.415817902Z caller=profile.go:458 namespace=multimedia pod=radarr-0 container=radarr msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: get lines by location ID: execute SQL query: interrupted (9)"
level=error ts=2021-10-10T13:30:15.415889377Z caller=profile.go:458 namespace=nextcloud pod=mysql-dd64d4b74-rz77s container=mysql msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: get lines by location ID: execute SQL query: interrupted (9)"
level=error ts=2021-10-10T13:30:55.233376238Z caller=profile.go:458 namespace=storage-system pod=restic-robot-7b47bd67fb-wlp72 container=restic msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: execute SQL statement: interrupted (9)"
level=error ts=2021-10-10T13:34:35.273917565Z caller=profile.go:458 namespace=flux-system pod=kustomize-controller-6f6647b88d-kqf6l container=manager msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: interrupted (9)"
level=error ts=2021-10-10T13:35:45.309098538Z caller=profile.go:458 namespace=nextcloud pod=mysql-dd64d4b74-rz77s container=mysql msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: get lines by location ID: execute SQL query: interrupted (9)"
level=error ts=2021-10-10T13:35:45.30941254Z caller=profile.go:458 namespace=parca pod=parca-agent-r7xsz container=parca-agent msg="failed to send profile" err="rpc error: code = Internal desc = failed to normalize pprof: execute SQL statement: interrupted (9)"

Support I/O in generic profiles

For a class of applications (such as databases), it is important to be able to profile I/O, including network- and filesystem-related activity.

Hence, I'd like to propose supporting I/O in the profiles the Parca agent is able to obtain.

Check symlink when trying to fetch debuginfo

Got this error from parca-agent:

level=debug ts=2021-10-08T01:13:40.251386198Z caller=debuginfo.go:166 namespace=gitops-platform-storage pod=rook-ceph-osd-10-7746699478-n72kk container=osd component=debuginfoextractor msg="failed to find additional debug information" root=/proc/3820454/root err="failed to walk debug files: failed to extract elf build ID, failed to open elf: read /proc/3820454/root/usr/lib/debug/bin: is a directory"

/proc/3820454/root/usr/lib/debug/bin is actually a symlink to a directory.

/proc/3822251/root/usr/lib/debug# ll
total 4
lrwxrwxrwx 1 root root    7 Aug  9  2020 bin -> usr/bin
lrwxrwxrwx 1 root root    7 Aug  9  2020 lib -> usr/lib
lrwxrwxrwx 1 root root    9 Aug  9  2020 lib64 -> usr/lib64
lrwxrwxrwx 1 root root    8 Aug  9  2020 sbin -> usr/sbin
drwxr-xr-x 6 root root 4096 Aug  9  2020 usr

Source code of this error: https://github.com/parca-dev/parca-agent/blob/main/pkg/debuginfo/debuginfo.go#L204
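
A sketch of a possible fix: filepath.Walk hands the callback Lstat-based info, which doesn't resolve symlinks, so a check based on os.Stat (which follows links) would let the walker skip directories reached through symlinks like the ones above (helper name hypothetical):

package debuginfo

import "os"

// isRegularFile follows symlinks before classifying an entry, so a
// symlink such as /usr/lib/debug/bin -> usr/bin is recognized as a
// directory and skipped instead of being opened as an ELF file.
func isRegularFile(path string) (bool, error) {
	fi, err := os.Stat(path) // os.Stat resolves symlinks; Lstat would not
	if err != nil {
		return false, err
	}
	return fi.Mode().IsRegular(), nil
}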

grpc: received message larger than max

Opening this here, but it could be solved on the agent side as well. This is probably an overlooked side effect of write request batching on the agent side (#116). We probably need to communicate the max message size to the agent (with a flag or an exposed discovery API?).

I now see another message, a lot of them:

level=error ts=2022-01-10T13:20:04.693514836Z caller=write_client.go:83 msg="Writeclient failed to send profiles" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (174243321 vs. 4194304)"

Originally posted by @korjavin in parca-dev/parca#514 (comment)

cc @Sylfrena @brancz
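
For reference, both limits are configurable in grpc-go; whether raising them (versus keeping batches under the limit) is the right fix is the open question here. A sketch with purely illustrative sizes:

package store

import "google.golang.org/grpc"

const maxMsgSize = 256 << 20 // 256 MiB, illustrative only

// Client side (agent): allow larger outgoing write requests.
func dialStore(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(grpc.MaxCallSendMsgSize(maxMsgSize)),
	)
}

// Server side (Parca): accept larger incoming messages.
var serverOpts = []grpc.ServerOption{grpc.MaxRecvMsgSize(maxMsgSize)}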

Explain further what "Show Profile" does

This feature originated as a debug mechanism for me to understand profiling data without a storage backend. It turns out people like it as a way to understand what is going on under the hood, but for that it is still quite a raw feature and needs more explanation (including why clicking Show Profile can take up to 10 seconds to display anything). Beyond that, the actual rendering of the profile could use some explanation of what the user is looking at.

Parca Agent symbols are not resolving

I'm not sure where the problem lies, but when viewing a CPU profile taken from the Parca Agent itself (run on minikube), the symbols don't resolve in the profile view even though they appear to have been uploaded.

Steps to reproduce:

  • start minikube with VM: minikube start --driver=virtualbox
  • deploy parca agent with --store-address=grpc.polarsignals.com:443 and --bearer-token=<project-token>

Allow running Parca Agent within user namespaces

I'm not 100% sure whether this is even possible, but here it goes. Currently, we recommend minikube for demo purposes; for various reasons people may prefer kind, k3d, or others, which may not use actual virtual machines but rather Linux user namespaces. The problem this causes: even if we can load the eBPF program, from within it we see the true host/kernel view of a process, i.e. the host-wide PID. With only the host-wide PID, the user-space part of parca-agent cannot find the process maps, perf maps, etc. that we need to create a profile with useful data.

Something I've had in my head for a while that might be worth exploring: if the Parca Agent knows it's running in user namespaces (which we might be able to discover, or failing that, expose as a flag), it might be able to communicate to the eBPF program which level of namespace to consider the "root", and have the eBPF program only capture data for processes within that nested namespace, reporting PIDs local to that namespace.

There may very well be strategies that I haven't thought of.

Extended Go support

I've only done a light investigation on this, so this may still be bogus.

ELF symbols (.symtab) are often stripped using -ldflags="-s -w", but the .gosymtab ELF section is almost always there. Ideally we convert .gosymtab to a "normal" ELF debug symbol table (.symtab), if it's not already the same, and upload that as debug info as if it had been there the entire time.

Edit: according to this email thread, .gosymtab is empty and the content of .gopclntab is what we should be able to use.
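
The standard library can already read .gopclntab via debug/gosym, so a proof of concept for symbolizing a stripped Go binary is small. A sketch (note that PIE binaries may name the section differently):

package main

import (
	"debug/elf"
	"debug/gosym"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := elf.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sect := f.Section(".gopclntab") // PIE binaries may use .data.rel.ro.gopclntab instead
	if sect == nil {
		log.Fatal("no .gopclntab section")
	}
	pclntab, err := sect.Data()
	if err != nil {
		log.Fatal(err)
	}

	text := f.Section(".text")
	if text == nil {
		log.Fatal("no .text section")
	}
	tab, err := gosym.NewTable(nil, gosym.NewLineTable(pclntab, text.Addr))
	if err != nil {
		log.Fatal(err)
	}

	// Resolve an address (here, the start of .text) to file/line/function.
	if file, line, fn := tab.PCToLine(text.Addr); fn != nil {
		fmt.Printf("%s:%d %s\n", file, line, fn.Name)
	}
}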

Batch write requests

Currently, whenever a profile has been recorded, it is individually sent directly to the configured server. With potentially hundreds of profilers running on a single machine, it would make sense to batch these requests and only send them out once every 10 seconds or so. The write API already supports batched writes, both in the series dimension and in the number of samples.
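
A sketch of the shape this could take (all names hypothetical; send stands in for the actual remote-store write client):

package agent

import (
	"sync"
	"time"
)

// batcher accumulates finished profiles from many profilers and
// flushes them once per interval, mirroring the 10s suggested above.
type batcher struct {
	mu      sync.Mutex
	pending [][]byte
	send    func(batch [][]byte)
}

func (b *batcher) add(profile []byte) {
	b.mu.Lock()
	b.pending = append(b.pending, profile)
	b.mu.Unlock()
}

func (b *batcher) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			b.mu.Lock()
			batch := b.pending
			b.pending = nil
			b.mu.Unlock()
			if len(batch) > 0 {
				b.send(batch)
			}
		case <-stop:
			return
		}
	}
}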

Problems with PID detection on k3s

I got multiple warnings like the one below when running parca-agent in k3s on Ubuntu 20.04 Server.

level=warn ts=2021-10-10T12:03:44.054726335Z caller=k8s.go:200 msg="skipping pod, cannot find pid" namespace=monitoring pod=prometheus-k8s-1 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: no such file or directory\""

On k3s, containerd.sock is available at /run/k3s/containerd/containerd.sock, so it might be good to add that path to auto-discovery.

An alternative approach could be to add a parameter to the jsonnet library that allows passing the socket configuration to the agent instead of patching arguments.
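
Extending auto-discovery with the k3s path could be as simple as probing well-known locations and using the first socket that exists. A sketch (names hypothetical):

package discovery

import "os"

// Well-known containerd socket locations, including the k3s path
// from this report.
var containerdSocketCandidates = []string{
	"/run/containerd/containerd.sock",
	"/run/k3s/containerd/containerd.sock", // k3s
}

// findContainerdSocket returns the first candidate that exists.
func findContainerdSocket() (string, bool) {
	for _, p := range containerdSocketCandidates {
		if _, err := os.Stat(p); err == nil {
			return p, true
		}
	}
	return "", false
}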
