signalfx / signalfx-agent
The SignalFx Smart Agent

Home Page: https://signalfx.com

License: Apache License 2.0


signalfx-agent's Introduction

ℹ️  SignalFx was acquired by Splunk in October 2019. See Splunk SignalFx for more information.

⚠️ End of Support (EoS) Notice

The SignalFx Smart Agent has reached End of Support.

The Splunk Distribution of OpenTelemetry Collector is the successor. Smart Agent monitors are available and supported through the Smart Agent receiver in the Splunk Distribution of OpenTelemetry Collector.

To learn how to migrate, see Migrate from SignalFx Smart Agent to the Splunk Distribution of OpenTelemetry Collector.

SignalFx Smart Agent

GoDoc CircleCI

The SignalFx Smart Agent is a metric agent written in Go for monitoring infrastructure and application services in a variety of environments. It is meant as a successor to our previous collectd agent, which it still uses internally on Linux -- so any existing Python or C-based collectd plugins will work without modification. On Windows, collectd is not included, but the agent can run Python-based collectd plugins without collectd; C-based collectd plugins are not available on Windows.

Concepts

The agent has three main components:

  1. Observers that discover applications and services running on the host
  2. Monitors that collect metrics, events, and dimension properties from the host and applications
  3. The Writer that sends the metrics, events, and dimension updates collected by monitors to SignalFx.
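These three components correspond to top-level sections of the agent's YAML config. As an illustrative sketch only (the observer and monitor types shown are examples, not a recommended setup):

```yaml
observers:
  - type: host        # discovers services listening on this host's ports
monitors:
  - type: cpu         # collects host CPU metrics
writer: {}            # writer options; the defaults are usually fine
```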

Observers

Observers watch the various environments that we support to discover running services and automatically configure the agent to send metrics for those services.

For a list of supported observers and their configurations, see Observer Config.

Monitors

Monitors collect metrics from the host system and services. They are configured under the monitors list in the agent config. For application-specific monitors, you can define discovery rules in your monitor configuration. A separate monitor instance is created for each discovered instance of applications that match a discovery rule. See Auto Discovery for more information.
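For example, a discovery rule in a monitor config might look like the following sketch (the monitor type and rule are illustrative; see Auto Discovery for the actual rule syntax):

```yaml
monitors:
  - type: collectd/redis
    # Only instantiated for discovered containers matching this rule
    discoveryRule: container_image =~ "redis" && port == 6379
```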

Many of the monitors are built around collectd, an open-source third-party monitoring daemon, and use it to collect metrics. Other monitors do not use collectd, but both kinds are configured in the same way.

For a list of supported monitors and their configurations, see Monitor Config.

The agent is primarily intended to monitor services/applications running on the same host as the agent, in keeping with the collectd model. The main issue with monitoring services on other hosts is that the host dimension that collectd sets on all metrics is set to the hostname of the machine the agent runs on. Keeping a consistent host dimension on everything allows metrics to be matched to a specific machine during metric analysis.

Writer

The writer collects metrics emitted by the configured monitors and sends them to SignalFx on a regular basis. There are a few things that can be configured in the writer, but this is generally only necessary if you have a very large number of metrics flowing through a single agent.

Installation

The agent is available for Linux in both a containerized and standalone form. Whichever form you use, the dependencies are completely bundled with the agent, including a Java runtime (JRE) and a Python runtime, so nothing else needs to be installed. The agent should therefore work on any relatively modern Linux distribution (kernel version 2.6+).

Note: The agent is incompatible with Linux systems that have SELinux enabled. Check the documentation for your distribution to learn how to disable SELinux.

The agent is also available on Windows in standalone form. It contains its own Python runtime. The agent supports Windows Server 2012 and above.

To get started deploying the Smart Agent directly on a host, see the Smart Agent Quick Install guide.

Deployment

We support the following deployment/configuration management tools to automate the installation process. See Bundles for a list of underlying packages for the agent.

Installer Script

For non-containerized environments, there is a convenience script that you can run on your host to install the agent package. This is useful for testing and trials, but for full-scale deployments you will probably want to use a configuration management system like Chef or Puppet.

Linux

You can view the source for the installer script and use it on your hosts by running:

curl -sSL https://dl.signalfx.com/signalfx-agent.sh > /tmp/signalfx-agent.sh;
sudo sh /tmp/signalfx-agent.sh --realm YOUR_SIGNALFX_REALM -- YOUR_SIGNALFX_API_TOKEN
Windows

The agent has one dependency on Windows that must be satisfied before running the installer script: PowerShell v3.0 or above. The script will not function correctly on earlier versions.

Once that dependency is satisfied, run the installer script below. You can view the source for the installer script and use it on your hosts in PowerShell by running:

& {Set-ExecutionPolicy Bypass -Scope Process -Force; $script = ((New-Object System.Net.WebClient).DownloadString('https://dl.signalfx.com/signalfx-agent.ps1')); $params = @{access_token = "YOUR_SIGNALFX_ACCESS_TOKEN"; ingest_url = "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com"; api_url = "https://api.YOUR_SIGNALFX_REALM.signalfx.com"}; Invoke-Command -ScriptBlock ([scriptblock]::Create(". {$script} $(&{$args} @params)"))}

The agent files are installed to \Program Files\SignalFx\SignalFxAgent, and the default configuration file is installed at \ProgramData\SignalFxAgent.

You can also use Chocolatey to install the agent. See the section Windows Chocolatey Package.

Chef

We offer a Chef cookbook to install and configure the agent. See the cookbook source, which is also published on the Chef Supermarket.

Puppet

We also offer a Puppet manifest to install and configure the agent on Linux. See the manifest source, which is also published on the Puppet Forge.

Ansible

We also offer an Ansible Role to install and configure the Smart Agent on Linux. See the role source.

Salt

We also offer a Salt Formula to install and configure the Smart Agent on Linux. See the formula source.

Docker Image

See Docker Deployment for more information.

Kubernetes

See our Kubernetes setup instructions and Monitor Kubernetes for more information. Helm version 3 or higher is supported.

AWS Elastic Container Service (ECS)

See the ECS directory, which includes a sample config and task definition for the agent.

Bundles

We offer the agent in the following forms:

Debian Package

We provide a Debian package repository that you can use with the following commands:

curl -sSL https://splunk.jfrog.io/splunk/signalfx-agent-deb/splunk-B3CD4420.gpg > /etc/apt/trusted.gpg.d/splunk.gpg
echo 'deb https://splunk.jfrog.io/splunk/signalfx-agent-deb release main' > /etc/apt/sources.list.d/signalfx-agent.list
apt-get update
apt-get install -y signalfx-agent

RPM Package

We provide a RHEL/RPM package repository that you can use with the following commands:

cat <<EOH > /etc/yum.repos.d/signalfx-agent.repo
[signalfx-agent]
name=SignalFx Agent Repository
baseurl=https://splunk.jfrog.io/splunk/signalfx-agent-rpm/release
gpgcheck=1
gpgkey=https://splunk.jfrog.io/splunk/signalfx-agent-rpm/splunk-B3CD4420.pub
enabled=1
EOH

yum install -y signalfx-agent

Linux Standalone tar.gz

If you don't want to use a distro package, we offer a .tar.gz bundle that can be deployed to the target host. This bundle is available for download on the GitHub Releases Page for each new release.

To use the bundle:

  1. Unarchive it to a directory of your choice on the target system.

  2. Go into the unarchived signalfx-agent directory and run bin/patch-interpreter $(pwd). This ensures that the binaries in the bundle have the right loader set on them since your host's loader may not be compatible.

  3. Ensure a valid configuration file is available somewhere on the target system. The main thing that the distro packages provide -- but that you will have to provide manually with the bundle -- is a run directory for the agent to use. Since you aren't installing from a package, there are three config options that you will especially want to consider:

  • internalStatusHost - The host name the agent listens on so that the signalfx-agent status command can read diagnostic information from a running agent. This is also the host name the agent listens on to serve internal metrics about the agent, which can be scraped by the internal-metrics monitor. Defaults to localhost if left blank.

  • internalStatusPort - The port the agent listens on for the same purposes: serving diagnostic information to the signalfx-agent status command and serving internal metrics that can be scraped by the internal-metrics monitor. Defaults to 8095.

  • collectd.configDir - This is where the agent writes the managed collectd config, since collectd can only be configured by files. Note that this entire dir will be wiped by the agent upon startup so that it doesn't pick up stale collectd config, so be sure that it is not used for anything else. Also note that these files could have sensitive information in them if you have passwords configured for collectd monitors, so you might want to place this dir on a tmpfs mount to avoid credentials persisting on disk.

See the section on Privileges for information on the capabilities the agent requires.

  4. Run the agent by invoking the archive path signalfx-agent/bin/signalfx-agent -config <path to config.yaml>. By default, the agent logs only to stdout/err. If you want to persist logs, you must direct the output to a log file or other log management system. See the signalfx-agent command doc for more information on supported command flags.
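A sketch of how those bundle-specific options might look in the config file (the collectd directory shown is an example path only):

```yaml
internalStatusHost: localhost
internalStatusPort: 8095
collectd:
  configDir: /run/signalfx-agent/collectd   # wiped on startup; a tmpfs mount avoids persisting credentials
```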

Windows Chocolatey Package

Only available for Smart Agent v5.3.0 and higher.

To install the Smart Agent using Chocolatey, run the following PowerShell command as an administrator:

choco install signalfx-agent [choco options] --params="'/access_token:YOUR_SIGNALFX_ACCESS_TOKEN /ingest_url:https://ingest.YOUR_SIGNALFX_REALM.signalfx.com /api_url:https://api.YOUR_SIGNALFX_REALM.signalfx.com'"

The Smart Agent looks for a configuration file at \ProgramData\SignalFxAgent\agent.yaml. If this file does not already exist during installation, a default config file will be copied into place by the installer.

The following package parameters are available:

  • /access_token - The access token (org token) used to send metric data to SignalFx. If the parameter is specified, the token will be saved to the \ProgramData\SignalFxAgent\token file. If the parameter is not specified and \ProgramData\SignalFxAgent\token does not exist or is empty, the Smart Agent service is not started after installation or upgrade. To start the service, add or update \ProgramData\SignalFxAgent\token with a valid token, and then either restart Windows or run the following PowerShell command: & "\Program Files\SignalFx\SignalFxAgent\bin\signalfx-agent.exe" -service "start"
  • /ingest_url - URL of the SignalFx ingest endpoint (e.g. https://ingest.YOUR_SIGNALFX_REALM.signalfx.com). The URL will be saved to the \ProgramData\SignalFxAgent\ingest_url file. If the parameter is not specified, the value found in \ProgramData\SignalFxAgent\ingest_url (if it exists) will be used. Otherwise, defaults to https://ingest.us0.signalfx.com.
  • /api_url - URL of the SignalFx API endpoint (e.g. https://api.YOUR_SIGNALFX_REALM.signalfx.com). The URL will be saved to the \ProgramData\SignalFxAgent\api_url file. If the parameter is not specified, the value found in \ProgramData\SignalFxAgent\api_url (if it exists) will be used. Otherwise, defaults to https://api.us0.signalfx.com.
  • /install_dir - Installation directory. Defaults to \Program Files\SignalFx\SignalFxAgent.

To learn more, see the Chocolatey SignalFx Smart Agent page.

Windows Standalone .zip

A .zip bundle is also available that can be deployed to the target host. To obtain the bundle, go to the GitHub Releases Page and download the most recent release.

To learn more, see Install to Windows using a ZIP file.

Privileges

Linux

When using the host observer, the agent requires the Linux capabilities DAC_READ_SEARCH and SYS_PTRACE, both of which are necessary to allow the agent to determine which processes are listening on network ports on the host. Otherwise, there is nothing built into the agent that requires privileges. When using a package to install the agent, the agent binary is given those capabilities in the package post-install script, but the agent is run as the signalfx-agent user. If you are not using the host observer, then you can strip those capabilities from the agent binary if so desired.

You should generally not run the agent as root unless you can't use capabilities for some reason.

Windows

On Windows, the Smart Agent can be installed and run using an Administrator account. You can also run the Smart Agent in non-Administrator mode; see Configure user privileges.

Configuration

The agent is configured primarily through a YAML file. By default, the agent looks for its config at /etc/signalfx/agent.yaml on Linux and \ProgramData\SignalFxAgent\agent.yaml on Windows, which is also where the packages install it. This can be overridden with the -config command line flag.

For the full schema of the config, see Config Schema.

For information on how to configure the agent from remote sources, such as other files on the filesystem or KV stores such as Etcd, see Remote Configuration.
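As a hedged sketch of that remote-config syntax (the #from directive, the env source, and the flatten/optional flags are assumptions about the Smart Agent's remote config support; verify the exact syntax against the Remote Configuration doc before use):

```yaml
signalFxAccessToken: {"#from": "env:SFX_ACCESS_TOKEN"}
monitors:
  - {"#from": "/etc/signalfx/monitors/*.yaml", flatten: true, optional: true}
```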

Logging

The default log level is info, which logs anything noteworthy in the agent without spamming the logs too much. Most info-level logs occur on startup and upon service discovery changes. The debug level produces very verbose output and should only be used when trying to resolve a problem with the agent. You can change the log level with the logging: {level: debug} YAML config option.

The agent emits logs in either an unstructured text (default) or JSON format. You can configure it to emit JSON logs with the logging: {format: json} YAML config option.
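Combining both options, the logging section would look like:

```yaml
logging:
  level: debug   # very verbose; only while troubleshooting
  format: json
```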

Linux

Currently the agent only supports logging to stdout/stderr, which will generally be redirected by the init scripts we provide to either a file at /var/log/signalfx-agent.log or to the systemd journal on newer distros.

Windows

On Windows, the agent logs to the console when executed directly in a shell. If the agent is configured as a Windows service, log events are written to the Windows Event Log. Use the Event Viewer application to read the logs; it is located under Start > Administrative Tools > Event Viewer. Logged events from the agent service appear under Windows Logs > Application.

Proxy Support

To use an HTTP(S) proxy, set the HTTP_PROXY and/or HTTPS_PROXY environment variables in the container configuration to proxy either protocol. The SignalFx ingest and API servers both use HTTPS. If the NO_PROXY environment variable exists, the agent automatically appends its local services to it so that they bypass the proxy.

If the agent is running as a local service on the host, refer to the host's service management documentation for how to pass environment variables to the agent service in order to enable proxy support when the agent service is started.

For example, if the host services are managed by systemd, create the /etc/systemd/system/signalfx-agent.service.d/myproxy.conf file and add the following to the file:

[Service]
Environment="HTTP_PROXY=http://proxy.example.com:8080/"
Environment="HTTPS_PROXY=https://proxy.example.com:8081/"

Then execute systemctl daemon-reload and systemctl restart signalfx-agent.service to restart the agent service with proxy support.

Sys-V based init.d systems: Debian and RHEL

Create /etc/default/signalfx-agent with the following contents:

HTTP_PROXY="http://proxy.example.com:8080/"
HTTPS_PROXY="https://proxy.example.com:8081/"

Diagnostics

The agent serves diagnostic information on an HTTP server at the address configured by the internalStatusHost and internalStatusPort options. As a convenience, the command signalfx-agent status queries this server and dumps out its contents. That command also explains how to get further diagnostic information.

Also see our FAQ for more troubleshooting help.

Development

If you wish to contribute to the agent, see the Developer's Guide.

signalfx-agent's People

Contributors

asuresh4, atoulme, barbara-sfx, benkeith-splunk, bjsignalfx, brianashby, charless-splunk, chien-splunk, codesmith14, crobert-1, dependabot[bot], dloucasfx, flands, galpizarsfx, harnitsignalfx, jadeblaquiere, jchengsfx, jeffreyc-splunk, jrcamp, keitwb, molner, mpetazzoni, nitaliya, pellared, prasadsfx, rmfitzpatrick, seonsfx, tmorrisonsfx, varadaprakash, xp-1000


signalfx-agent's Issues

K8s: Pod Container Status Metrics

Hi,

kube-state-metrics includes three useful metrics for checking whether you have pods stuck in CrashLoopBackoff, ErrImagePull or ContainerCreating:

  • kube_pod_container_status_running
  • kube_pod_container_status_waiting_reason
  • kube_pod_container_status_last_terminated_reason

The docs for these metrics can be found here

It'd be really useful to not have to deploy kube-state-metrics in order to retrieve these metrics. Perhaps there is a way they could be put into the agent?

Cheers,
DD

x509: certificate signed by unknown authority

Using rancher 2.0.0-beta2

time="2018-04-12T18:29:51Z" level=info msg="Initialization complete, entering read-loop." collectdInstance=global
time="2018-04-12T18:29:51Z" level=info msg="found host migration-dev1" collectdInstance=global plugin=signalfx-metadata
time="2018-04-12T18:29:56Z" level=info msg="adding small dither of 19 seconds before sending notifications" collectdInstance=global plugin=signalfx-metadata
time="2018-04-12T18:29:57Z" level=error msg="Couldn't get machine info: Get https://migration-dev1:10250/spec/: x509: certificate signed by unknown authority"
time="2018-04-12T18:29:57Z" level=error msg="Couldn't get cAdvisor container stats" error="failed to get all container stats from Kubelet URL \"https://migration-dev1:10250/stats/container/\": Post https://migration-dev1:10250/stats/container/: x509: certificate signed by unknown authority"
time="2018-04-12T18:30:02Z" level=error msg="Could not fetch docker stats for container id 44efdc1bd4ebcc3d5a2e14a489347f3bef04c59fe9f0c478fe35fd9982373d54" error="context deadline exceeded" monitorType=docker-container-stats
time="2018-04-12T18:30:02Z" level=error msg="Could not fetch docker stats for container id bf1f1dbf94a80467369d48608f896d765a090de2a657232410d0c02249382647" error="context deadline exceeded" monitorType=docker-container-stats
time="2018-04-12T18:30:02Z" level=error msg="Could not fetch docker stats for container id e478525a3f08f917937bb8d251c41534fe59f083637d7257ef0622541d50cd0e" error="context deadline exceeded" monitorType=docker-container-stats
time="2018-04-12T18:30:02Z" level=error msg="Could not fetch docker stats for container id 0253ac7cf61a0106f3a9176df857ef6cb03a8da19b866d42354b3abea8222361" error="context deadline exceeded" monitorType=docker-container-stats

Chef cookbook fails on ubuntu 14.04

The chef cookbook fails to run on ubuntu 14.04 with the following error:

  * file[/etc/apt/sources.list.d/signalfx-agent.list] action create[2019-02-18T10:25:16+00:00] INFO: Processing file[/etc/apt/sources.list.d/signalfx-agent.list] action create (signalfx_agent::deb_repo line 8)
 (up to date)
  * execute[apt-get update] action nothing[2019-02-18T10:25:16+00:00] INFO: Processing execute[apt-get update] action nothing (signalfx_agent::deb_repo line 13)
 (skipped due to action :nothing)
Recipe: signalfx_agent::default
  * apt_package[signalfx-agent] action install[2019-02-18T10:25:16+00:00] INFO: Processing apt_package[signalfx-agent] action install (signalfx_agent::default line 33)


    ================================================================================
    Error executing action `install` on resource 'apt_package[signalfx-agent]'
    ================================================================================

    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '100'
    ---- Begin output of ["apt-get", "-q", "-y", "-o", "Dpkg::Options::=--force-confdef", "-o", "Dpkg::Options::=--force-confold", "--allow-downgrades", "install", "signalfx-agent=4.0.1-1"] ----
    STDOUT:
    STDERR: E: Command line option --allow-downgrades is not understood
    ---- End output of ["apt-get", "-q", "-y", "-o", "Dpkg::Options::=--force-confdef", "-o", "Dpkg::Options::=--force-confold", "--allow-downgrades", "install", "signalfx-agent=4.0.1-1"] ----
    Ran ["apt-get", "-q", "-y", "-o", "Dpkg::Options::=--force-confdef", "-o", "Dpkg::Options::=--force-confold", "--allow-downgrades", "install", "signalfx-agent=4.0.1-1"] returned 100

    Resource Declaration:
    ---------------------
    # In /var/chef/cache/cookbooks/signalfx_agent/recipes/default.rb

     33: package 'signalfx-agent' do  # ~FC009
     34:   action :install
     35:   version node['signalfx_agent']['package_version'] if !node['signalfx_agent']['package_version'].nil?
     36:   options '--allow-downgrades' if platform_family?('debian')
     37:   allow_downgrade true if platform_family?('rhel', 'amazon', 'fedora')
     38:   notifies :restart, 'service[signalfx-agent]', :delayed
     39: end
     40:

    Compiled Resource:
    ------------------
    # Declared in /var/chef/cache/cookbooks/signalfx_agent/recipes/default.rb:33:in `from_file'

    apt_package("signalfx-agent") do
      package_name "signalfx-agent"
      action [:install]
      default_guard_interpreter :default
      declared_type :package
      cookbook_name "signalfx_agent"
      recipe_name "default"
      options ["--allow-downgrades"]
    end

    System Info:
    ------------
    chef_version=14.1.12
    platform=ubuntu
    platform_version=14.04
    ruby=ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
    program_name=/usr/bin/chef-client
    executable=/opt/chef/bin/chef-client

Running on AWS. System info:

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.2 LTS"

$ uname -a
Linux test-splunk-master-1 3.13.0-48-generic #80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ chef-client --version
Chef: 14.1.12

Chef won't update the package on windows

The chef run will update the version file even if it fails to properly update the package. This means it won't even try to do it the next time and it will just silently ignore it. The version check should be based on the actual output of the agent, not a file so at least it keeps trying and failing.

PS C:\> & 'C:\Program Files\SignalFx\SignalFxAgent\bin\signalfx-agent.exe' --version
agent-version: '4.0.2', built-time: '2019-02-22T12:35:01-08'
PS C:\>
PS C:\> chef-client
Starting Chef Client, version 14.11.21
[2019-03-26T12:07:58+00:00] INFO: *** Chef 14.11.21 ***
[2019-03-26T12:07:58+00:00] INFO: Platform: x64-mingw32
[2019-03-26T12:07:58+00:00] INFO: Chef-client pid: 1296
[2019-03-26T12:07:58+00:00] INFO: The plugin path C:\chef\ohai\plugins does not exist. Skipping...
[2019-03-26T12:08:03+00:00] INFO: Run List is [recipe[signalfx_agent]]
[2019-03-26T12:08:03+00:00] INFO: Run List expands to [signalfx_agent]
[2019-03-26T12:08:03+00:00] INFO: Starting Chef Run for wint
[2019-03-26T12:08:03+00:00] INFO: Running start handlers
[2019-03-26T12:08:03+00:00] INFO: Start handlers complete.
[2019-03-26T12:08:03+00:00] INFO: Error while reporting run start to Data Collector. URL: https://base.chef.cloudreach.c
om/organizations/cloudreach-test-mock-environment/data-collector Exception: 404 -- 404 "Not Found"  (This is normal if y
ou do not have Chef Automate)
resolving cookbooks for run list: ["signalfx_agent"]
[2019-03-26T12:08:04+00:00] INFO: Loading cookbooks [signalfx_agent@0.2.2, windows@5.2.3]
Synchronizing Cookbooks:
  - signalfx_agent (0.2.2)
  - windows (5.2.3)
Installing Cookbook Gems:
Compiling Cookbooks...
Converging 6 resources
Recipe: signalfx_agent::default
  * directory[\ProgramData\SignalFxAgent] action create[2019-03-26T12:08:04+00:00] INFO: Processing directory[\ProgramDa
ta\SignalFxAgent] action create (signalfx_agent::default line 20)
 (up to date)
Recipe: signalfx_agent::win
  * windows_zipfile[\Program Files\SignalFx\] action unzip[2019-03-26T12:08:04+00:00] INFO: Processing windows_zipfile[\
Program Files\SignalFx\] action unzip (signalfx_agent::win line 2)

    - unzip https://dl.signalfx.com/windows/final/zip/SignalFxAgent-4.2.0-win64.zip
    * remote_file[c:/chef/cache/SignalFxAgent-4.2.0-win64.zip] action create[2019-03-26T12:08:04+00:00] INFO: Processing
 remote_file[c:/chef/cache/SignalFxAgent-4.2.0-win64.zip] action create (c:/chef/cache/cookbooks/windows/resources/zipfi
le.rb line 40)
[2019-03-26T12:08:07+00:00] INFO: remote_file[c:/chef/cache/SignalFxAgent-4.2.0-win64.zip] created file c:/chef/cache/Si
gnalFxAgent-4.2.0-win64.zip

      - create new file c:/chef/cache/SignalFxAgent-4.2.0-win64.zip[2019-03-26T12:08:08+00:00] INFO: remote_file[c:/chef
/cache/SignalFxAgent-4.2.0-win64.zip] updated file contents c:/chef/cache/SignalFxAgent-4.2.0-win64.zip

      - update content in file c:/chef/cache/SignalFxAgent-4.2.0-win64.zip from none to 767f07
      (file sizes exceed 10000000 bytes, diff output suppressed)
    * ruby_block[Unzipping] action run[2019-03-26T12:08:08+00:00] INFO: Processing ruby_block[Unzipping] action run (c:/
chef/cache/cookbooks/windows/resources/zipfile.rb line 54)
[2019-03-26T12:08:13+00:00] INFO: ruby_block[Unzipping] called

      - execute the ruby block Unzipping

  * file[\Program Files\SignalFx\\version.txt] action create[2019-03-26T12:08:13+00:00] INFO: Processing file[\Program F
iles\SignalFx\\version.txt] action create (signalfx_agent::win line 11)
[2019-03-26T12:08:13+00:00] INFO: file[\Program Files\SignalFx\\version.txt] backed up to c:/chef/backup\Program Files\S
ignalFx\\version.txt.chef-20190326120813.853506
[2019-03-26T12:08:13+00:00] INFO: file[\Program Files\SignalFx\\version.txt] updated file contents \Program Files\Signal
Fx\\version.txt

    - update content in file \Program Files\SignalFx\\version.txt from 5fde70 to 96507e
    --- \Program Files\SignalFx\\version.txt    2019-03-26 12:06:56.563825400 +0000
    +++ \Program Files\SignalFx/chef-version20190326-1296-1cu0dtj.txt   2019-03-26 12:08:13.837742400 +0000
    @@ -1,2 +1,2 @@
    -4.0.2
    +4.2.0
  * powershell_script[ensure service created] action run[2019-03-26T12:08:13+00:00] INFO: Processing powershell_script[e
nsure service created] action run (signalfx_agent::win line 23)
[2019-03-26T12:08:14+00:00] INFO: powershell_script[ensure service created] ran successfully

    - execute "C:\Windows\system32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NonInteractive -NoProfile -ExecutionP
olicy Bypass -InputFormat None -File "C:/Users/ADMINI~1/AppData/Local/Temp/2/chef-script20190326-1296-1uf2ibj.ps1"
Recipe: signalfx_agent::default
  * template[\ProgramData\SignalFxAgent\agent.yaml] action create[2019-03-26T12:08:14+00:00] INFO: Processing template[\
ProgramData\SignalFxAgent\agent.yaml] action create (signalfx_agent::default line 47)
 (up to date)
  * windows_service[signalfx-agent] action enable[2019-03-26T12:08:14+00:00] INFO: Processing windows_service[signalfx-a
gent] action enable (signalfx_agent::default line 55)
C:/opscode/chef/embedded/lib/ruby/gems/2.5.0/gems/win32-service-1.0.1/lib/win32/service.rb:1086:in `rescue in block in s
ervices' : WARNING: Failed to retrieve description for the CDPUserSvc_93d7e service. (StructuredWarnings::StandardWarnin
g)C:/opscode/chef/embedded/lib/ruby/gems/2.5.0/gems/win32-service-1.0.1/lib/win32/service.rb:1099:in `rescue in block in
 services' : WARNING: Unable to get delayed auto start information for the CDPUserSvc_93d7e service (StructuredWarnings:
:StandardWarning)C:/opscode/chef/embedded/lib/ruby/gems/2.5.0/gems/win32-service-1.0.1/lib/win32/service.rb:1157:in `res
cue in block in services' : WARNING: Unable to retrieve failure actions for the CDPUserSvc_93d7e service (StructuredWarn
ings::StandardWarning) (up to date)
  * windows_service[signalfx-agent] action start[2019-03-26T12:08:14+00:00] INFO: Processing windows_service[signalfx-ag
ent] action start (signalfx_agent::default line 55)
C:/opscode/chef/embedded/lib/ruby/gems/2.5.0/gems/win32-service-1.0.1/lib/win32/service.rb:1086:in `rescue in block in s
ervices' : WARNING: Failed to retrieve description for the CDPUserSvc_93d7e service. (StructuredWarnings::StandardWarnin
g)C:/opscode/chef/embedded/lib/ruby/gems/2.5.0/gems/win32-service-1.0.1/lib/win32/service.rb:1099:in `rescue in block in
 services' : WARNING: Unable to get delayed auto start information for the CDPUserSvc_93d7e service (StructuredWarnings:
:StandardWarning)C:/opscode/chef/embedded/lib/ruby/gems/2.5.0/gems/win32-service-1.0.1/lib/win32/service.rb:1157:in `res
cue in block in services' : WARNING: Unable to retrieve failure actions for the CDPUserSvc_93d7e service (StructuredWarn
ings::StandardWarning) (up to date)
[2019-03-26T12:08:15+00:00] INFO: windows_zipfile[\Program Files\SignalFx\] sending restart action to windows_service[si
gnalfx-agent] (delayed)
  * windows_service[signalfx-agent] action restart[2019-03-26T12:08:15+00:00] INFO: Processing windows_service[signalfx-
agent] action restart (signalfx_agent::default line 55)
C:/opscode/chef/embedded/lib/ruby/gems/2.5.0/gems/win32-service-1.0.1/lib/win32/service.rb:1086:in `rescue in block in s
ervices' : WARNING: Failed to retrieve description for the CDPUserSvc_93d7e service. (StructuredWarnings::StandardWarnin
g)C:/opscode/chef/embedded/lib/ruby/gems/2.5.0/gems/win32-service-1.0.1/lib/win32/service.rb:1099:in `rescue in block in
 services' : WARNING: Unable to get delayed auto start information for the CDPUserSvc_93d7e service (StructuredWarnings:
:StandardWarning)C:/opscode/chef/embedded/lib/ruby/gems/2.5.0/gems/win32-service-1.0.1/lib/win32/service.rb:1157:in `res
cue in block in services' : WARNING: Unable to retrieve failure actions for the CDPUserSvc_93d7e service (StructuredWarn
ings::StandardWarning)[2019-03-26T12:08:16+00:00] INFO: windows_service[signalfx-agent] configured.
[2019-03-26T12:08:17+00:00] INFO: windows_service[signalfx-agent] restarted

    - restart service windows_service[signalfx-agent]
[2019-03-26T12:08:17+00:00] INFO: Chef Run complete in 13.800471 seconds

Running handlers:
[2019-03-26T12:08:17+00:00] INFO: Running report handlers
Running handlers complete
[2019-03-26T12:08:17+00:00] INFO: Report handlers complete
Chef Client finished, 6/10 resources updated in 18 seconds
PS C:\> & 'C:\Program Files\SignalFx\SignalFxAgent\bin\signalfx-agent.exe' --version
agent-version: '4.0.2', built-time: '2019-02-22T12:35:01-08'

As per the above, the file gets updated:

    -4.0.2
    +4.2.0

But the actual version doesn't change. It's also the only cookbook in the run-list.

Docker memory stats are misleading

We've just switched to the smart agent, and had some problems with docker memory stats again, which we'd also had with the old collectd agent.

In the smart agent, you've correctly deducted cache when exposing memory.percent, in the enhanced statistics here:

100.0*(float64(stats.Usage)-float64(stats.Stats["cache"]))/float64(stats.Limit)),

However, you are not deducting it here:

sfxclient.Gauge("memory.usage.total", nil, int64(stats.Usage)),

I think this is incorrect, and gives a misleading figure which will cause people to think their app is using too much memory.

A lot of people will try to calculate percentage by doing memory.usage.total / memory.usage.limit * 100. This will then produce a completely incorrect value.

It's confused some of our developers, as they have seen very odd memory behaviour in applications.

The reason this is misleading is that the buffer cache grows with I/O. As such, any application doing enough I/O will eventually appear to consume all of the memory, which of course is not what is actually happening!

Here's an example of the problem, which I reproduced with the smart agent. To do it, I exec'd into a couple of containers and ran these commands to create a 500MB and then a 1000MB file and delete them again.

It is then reflected as 500 and 1000MB more memory use in the stats.

dd if=/dev/zero of=test.dd bs=1M count=500
rm test.dd

And then on the other container:

dd if=/dev/zero of=test.dd bs=1M count=1000
rm test.dd

Here's the memory graph.

[image: container memory usage graph]

As you can see, the memory use spikes to 100% and then back down when I delete the file. This shows that the stats are really not helpful once the container does any I/O.

If you compare these stats with what comes from ECS in Amazon CloudWatch, the stats there are much more accurate, because they decided to start deducting the buffer cache here:

aws/amazon-ecs-agent#280

I think the root problem is down to this decision by the docker API developers to expose memory usage, including buffer cache:

docker-archive/libcontainer#506

This was a questionable decision in my opinion, because developers consuming the API (like SignalFx) will trust the figure and then expose it to end users. It causes much confusion, since it bears no real link to the memory use of the application inside the container.

I guess it's another one of those "naming is hard" problems, since the devs who put cgroups into the Linux kernel probably made a bad choice of name in the first place too. memory.usage_in_bytes is deceptively simple, and doesn't tell you that it actually includes page cache and kernel memory use too.

Chef cookbook fails on Ubuntu 12.04

The chef cookbook fails to run on Ubuntu 12.04, with the error as follows:

  ArgumentError
  -------------
  Malformed version number string 0.8.16~exp12ubuntu10.21

  Cookbook Trace:
  ---------------
    /var/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/default.rb:42:in `block in from_file'
    /var/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/default.rb:35:in `from_file'

  Relevant File Content:
  ----------------------
  /var/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/default.rb:

   35:    package 'signalfx-agent' do # ~FC009
   36:      action :install
   37:      version node['signalfx_agent']['package_version'] unless node['signalfx_agent']['package_version'].nil?
   38:      flush_cache [ :before ] if platform_family?('rhel')
   39:      options '--allow-downgrades' if platform_family?('debian') \
   40:        && node['packages'] \
   41:        && node['packages']['apt'] \
   42>>       && Gem::Version.new(node['packages']['apt']['version']) >= Gem::Version.new('1.1.0')
   43:      allow_downgrade true if platform_family?('rhel', 'amazon', 'fedora')
   44:      notifies :restart, 'service[signalfx-agent]', :delayed
   45:    end
   46:  end
   47:
   48:  template node['signalfx_agent']['conf_file_path'] do
   49:    source 'agent.yaml.erb'
   50:    owner node['signalfx_agent']['user']
   51:    group node['signalfx_agent']['group']

  System Info:
  ------------
  chef_version=13.4.24
  platform=ubuntu
  platform_version=12.04
  ruby=ruby 2.4.2p198 (2017-09-14 revision 59899) [x86_64-linux]
  program_name=chef-client worker: ppid=1461;start=22:54:54;
  executable=/opt/chef/bin/chef-client


  Running handlers:
[2019-05-22T22:55:13+00:00] ERROR: Running exception handlers
  Running handlers complete
[2019-05-22T22:55:13+00:00] ERROR: Exception handlers complete
  Chef Client failed. 2 resources updated in 18 seconds
[2019-05-22T22:55:13+00:00] FATAL: Stacktrace dumped to /var/chef.cloudreach/cache/chef-stacktrace.out
[2019-05-22T22:55:13+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2019-05-22T22:55:13+00:00] ERROR: Malformed version number string 0.8.16~exp12ubuntu10.21
[2019-05-22T22:55:13+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)

K8s: Horizontal Pod Autoscaler Metrics

Hi,

It'd be great to have access to HPA metrics like:

  • hpa_current_replicas,
  • hpa_min_replicas
  • hpa_max_replicas
  • hpa_desired_replicas

These are really useful for alerting when our HPA policies are incorrectly sized. kube-state-metrics has similar values here

Cheers,
DD

Enable building custom monitors from external repos

I have some custom integrations I would like to create as native signalfx agent monitors. I definitely prefer this to alternatives (python / collectd exec / prometheus) because it is fairly straightforward, and we already have a body of go collectors that can be fairly easily transformed into monitors.

The main problems are in logistics. Primarily the fact that all of the interfaces I need to create a monitor are in the internal packages. This means I can only build them inside of this repository. There are a number of reasons I (or others) might not want to do that:

  • We have very specific integrations that will never be of interest to the community and will not be open-sourced.
  • We would happily open source an integration, but would prefer to keep it separate from this repo. Possibly for governance or licensing or ownership reasons.
  • We would like to be able to test and compile the integration in isolation from the rest of the agent code as much as possible.

Currently to make custom integrations I must build the agent myself. I agree this is probably unavoidable, and plugins are probably not worth the trouble for all the headache they give. I have another suggestion for making this part a little easier, but that will need to be another issue.

But right now my only option is to essentially:

  1. Fork this repo
  2. Add my own monitors (adding files that are extremely unlikely to introduce future conflicts).
  3. Add a new file somewhere solely to import _ my package(s).
  4. Compile the code.
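Step 3 relies on Go's blank-import registration pattern: each monitor package registers itself into a shared registry from init(), so the binary only needs one import _ line per monitor. A self-contained sketch of the mechanism (names are illustrative, not the agent's actual API):

```go
package main

import "fmt"

// A shared registry that monitor packages populate via init() side effects.
var registry = map[string]func() string{}

func register(name string, factory func() string) {
	registry[name] = factory
}

// In the real agent this init() would live in a separate monitor package,
// pulled in with a line like: import _ "example.com/me/mymonitor"
func init() {
	register("my-monitor", func() string { return "collecting" })
}

func main() {
	for name, run := range registry {
		fmt.Println(name, "->", run())
	}
}
```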

I don't like this for a few reasons:

  1. It is a bit of a pain to merge in upstream changes.
  2. New developers on our end don't know the boundaries between our code and your code.

I would much rather have a dedicated monitors repo that has scripts to build the entire agent as appropriate. So my proposal is simply this:

I would like to move some packages from github.com/signalfx/signalfx-agent/internal into github.com/signalfx/signalfx-agent/pkg.

Specifically, I just need to move the packages that need to be directly imported to build a monitor. Going from the documentation, that includes:

  • github.com/signalfx/signalfx-agent/internal/core/config
  • github.com/signalfx/signalfx-agent/internal/monitors
  • github.com/signalfx/signalfx-agent/internal/monitors/types
  • github.com/signalfx/signalfx-agent/internal/utils

It is a fairly simple change, but it does touch a large number of files. Is there a significant reason not to make that change? I have tested this change, and have gotten things compiling after moving the packages. I will make a PR shortly.

Issues with docker observer and host-based networking

Somewhat related to #907.

While setting up the agent in our environment, we came across a series of issues with the docker observer and the way the agent handles endpoints, especially when using host networking. While we managed to work around the issues, it certainly seems like the agent could be improved to reduce friction. The issues were:

  1. When using host networking and useHostnameIfPresent: true, the host config parameter ends up being the machine's hostname. However, in some cases, we actually want to use localhost. In our case, this is true for consul. consul only binds to localhost to avoid exposing the API for security reasons. We were able to work around this by using the host observer, but with that, we lose the extra metadata provided through the docker observer. Our current config is:
- type: collectd/consul
  discoveryRule: discovered_by == "host" && port == 8500
  2. The observer requires ports to be exposed (this can be done via a Dockerfile or at runtime). This in itself is not a major issue, but starts to break down when using --net=host, as you don't need to expose ports anymore. The observer will still work with host networking as long as the ports were exposed in the Dockerfile.

  3. For some containers, the ports exposed via the Dockerfile aren't the ones you want to check against. This is particularly true for the JMX check (as well as Kafka and Cassandra). For example, the default confluent image for Kafka does not expose the JMX port. The monitor configuration would fail to find the Kafka container if the discovery rule searches for the JMX port. The other option is to use the Kafka "listening" port 9092. This makes Kafka observable, but the port is wrong.

  4. You cannot override the "discovered" port by supplying your own port config. The discovered port overwrites the supplied port. This is true for host as well. Luckily, for JMX checks, we can supply a custom serviceUrl which takes precedence above all this, allowing us to override the port that way.
    Our Kafka monitor config looks like this, for example:

- type: collectd/kafka
  # Override url temporarily until we more widely deploy a new kafka image that exposes JMX ports
  serviceURL: service:jmx:rmi:///jndi/rmi://{{.Host}}:9999/jmxrmi
  clusterName: ${ENV_NAME}-${REGION}
  discoveryRule: container_name == "kafka" && port == 9092

Allowing users to override the discovered host and port would solve most, if not all, the issues we faced. It would be great if we could supply the host or port manually to overwrite what is discovered by observers in special cases (especially with host networking).
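For illustration, a hypothetical config sketch of such an override (the host and port keys under a discovered endpoint are not currently supported by the agent; this is only what the requested feature could look like):

```yaml
# Hypothetical syntax -- not currently supported by the agent:
- type: collectd/consul
  discoveryRule: container_name == "consul"
  host: localhost   # override the host discovered by the observer
  port: 8500        # override the port discovered by the observer
```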

Chef cookbook agent config yaml doesn't work with smart agent

Currently, the chef cookbook will create an agent.yaml from conf attributes as JSON-ified YAML, which the smart agent, in this case, can't use, and so it does not start up.

Such as:

monitors: [{type: "trace-forwarder", listenAddress: "127.0.0.1:9080"}, {"type": "collectd/cpu"}]
signalFxAccessToken: "howdy"
traceEndpointUrl: "http://somewhere:8080/v1/trace"
writer: {"sendTraceHostCorrelationMetrics": true}

Whereas the correct YAML that the agent needs in order to start is:

monitors:
  - type: trace-forwarder
    listenAddress: 127.0.0.1:9080
  - type: collectd/cpu
signalFxAccessToken: howdy
traceEndpointUrl: http://somewhere:8080/v1/trace
writer:
  sendTraceHostCorrelationMetrics: true

Docker observer misbehaving when docker daemon is under load

We are currently running quay.io/signalfx/signalfx-agent:4.6.3 docker image within our production environment in AWS. Our version of docker is Docker version 18.06.1-ce.

What we have noticed is that if the docker daemon is unresponsive when accessed via the docker socket, the agent furiously retries, causing increased CPU load and making other applications unresponsive.

During this state, the main application that responds to the health check fails, so I don't have any debug logs from the agent at the time, as it was configured to log errors only.

Chef convergence fails on Windows

During the first convergence, the cookbook fails on Windows. The service gets created but it is stopped. It can be manually started, and it will then report data correctly. Probably a race condition?

Recipe: signalfx_agent::default
  * directory[\ProgramData\SignalFxAgent] action create[2019-02-25T14:06:04+00:00] INFO: Processing directory[\ProgramDa
ta\SignalFxAgent] action create (signalfx_agent::default line 23)
[2019-02-25T14:06:04+00:00] INFO: directory[\ProgramData\SignalFxAgent] created directory \ProgramData\SignalFxAgent

    - create new directory \ProgramData\SignalFxAgent[2019-02-25T14:06:04+00:00] INFO: directory[\ProgramData\SignalFxAg
ent] owner changed to S-1-5-21-2064249170-1864486422-3813938014-500
[2019-02-25T14:06:04+00:00] INFO: directory[\ProgramData\SignalFxAgent] group changed to S-1-5-21-2064249170-1864486422-
3813938014-500

    - change owner
    - change group
Recipe: signalfx_agent::win
  * windows_zipfile[\Program Files\SignalFx\] action unzip[2019-02-25T14:06:04+00:00] INFO: Processing windows_zipfile[\
Program Files\SignalFx\] action unzip (signalfx_agent::win line 2)

    - unzip https://dl.signalfx.com/windows/final/zip/SignalFxAgent-4.0.1-win64.zip
    * remote_file[C:/chef.cloudreach/cache/SignalFxAgent-4.0.1-win64.zip] action create[2019-02-25T14:06:04+00:00] INFO:
 Processing remote_file[C:/chef.cloudreach/cache/SignalFxAgent-4.0.1-win64.zip] action create (C:/chef.cloudreach/cache/
cookbooks/windows/resources/zipfile.rb line 40)
[2019-02-25T14:06:07+00:00] INFO: remote_file[C:/chef.cloudreach/cache/SignalFxAgent-4.0.1-win64.zip] created file C:/ch
ef.cloudreach/cache/SignalFxAgent-4.0.1-win64.zip

      - create new file C:/chef.cloudreach/cache/SignalFxAgent-4.0.1-win64.zip[2019-02-25T14:06:10+00:00] INFO: remote_f
ile[C:/chef.cloudreach/cache/SignalFxAgent-4.0.1-win64.zip] updated file contents C:/chef.cloudreach/cache/SignalFxAgent
-4.0.1-win64.zip

      - update content in file C:/chef.cloudreach/cache/SignalFxAgent-4.0.1-win64.zip from none to be6afe
      (file sizes exceed 10000000 bytes, diff output suppressed)
    * ruby_block[Unzipping] action run[2019-02-25T14:06:11+00:00] INFO: Processing ruby_block[Unzipping] action run (C:/
chef.cloudreach/cache/cookbooks/windows/resources/zipfile.rb line 54)
[2019-02-25T14:06:31+00:00] INFO: ruby_block[Unzipping] called

      - execute the ruby block Unzipping

  * file[\Program Files\SignalFx\\version.txt] action create[2019-02-25T14:06:31+00:00] INFO: Processing file[\Program F
iles\SignalFx\\version.txt] action create (signalfx_agent::win line 11)
[2019-02-25T14:06:31+00:00] INFO: file[\Program Files\SignalFx\\version.txt] created file \Program Files\SignalFx\\versi
on.txt

    - create new file \Program Files\SignalFx\\version.txt[2019-02-25T14:06:31+00:00] INFO: file[\Program Files\SignalFx
\\version.txt] updated file contents \Program Files\SignalFx\\version.txt

    - update content in file \Program Files\SignalFx\\version.txt from none to f4da6c
    --- \Program Files\SignalFx\\version.txt    2019-02-25 14:06:31.000000000 +0000
    +++ \Program Files\SignalFx/chef-version20190225-2604-x7pno1.txt    2019-02-25 14:06:31.000000000 +0000
    @@ -1 +1,2 @@
    +4.0.1
  * powershell_script[ensure service created] action run[2019-02-25T14:06:31+00:00] INFO: Processing powershell_script[e
nsure service created] action run (signalfx_agent::win line 23)
[2019-02-25T14:06:35+00:00] INFO: powershell_script[ensure service created] ran successfully

    - execute "C:\Windows\system32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NonInteractive -NoProfile -ExecutionP
olicy Bypass -InputFormat None -File "C:/Users/ADMINI~1/AppData/Local/Temp/2/chef-script20190225-2604-6g6g8d.ps1"
Recipe: signalfx_agent::default
  * template[\ProgramData\SignalFxAgent\agent.yaml] action create[2019-02-25T14:06:35+00:00] INFO: Processing template[\
ProgramData\SignalFxAgent\agent.yaml] action create (signalfx_agent::default line 49)
[2019-02-25T14:06:35+00:00] INFO: template[\ProgramData\SignalFxAgent\agent.yaml] created file \ProgramData\SignalFxAgen
t\agent.yaml

    - create new file \ProgramData\SignalFxAgent\agent.yaml[2019-02-25T14:06:35+00:00] INFO: template[\ProgramData\Signa
lFxAgent\agent.yaml] updated file contents \ProgramData\SignalFxAgent\agent.yaml

    - update content in file \ProgramData\SignalFxAgent\agent.yaml from none to 201e6b
    --- \ProgramData\SignalFxAgent\agent.yaml   2019-02-25 14:06:35.000000000 +0000
    +++ \ProgramData\SignalFxAgent/chef-agent20190225-2604-1iz4jrg.yaml 2019-02-25 14:06:35.000000000 +0000
    @@ -1 +1,26 @@
    +# Automatically generated by Chef
    +
    +---
    +hostname: xxxxx
    +ingestUrl: https://ingest.eu0.signalfx.com
    +apiUrl: https://api.eu0.signalfx.com
    +signalFxAccessToken: xxxxx
    +globalDimensions:
    +  chef_environment: mock-environment
    +  shortcode: TEST
    +monitors:
    +- type: cpu
    +- type: processlist
    +- type: filesystems
    +- type: disk-io
    +- type: memory
    +- type: net-io
    +- type: vmem
    +- type: host-metadata
    +observers:
    +- type: host
    +metricsToExclude:
    +- "#from": C:\Program Files\SignalFx\SignalFxAgent\lib\whitelist.json
    +  flatten: true
    +[2019-02-25T14:06:35+00:00] INFO: template[\ProgramData\SignalFxAgent\agent.yaml] owner changed to S-1-5-21-2064249
170-1864486422-3813938014-500
[2019-02-25T14:06:35+00:00] INFO: template[\ProgramData\SignalFxAgent\agent.yaml] group changed to S-1-5-21-2064249170-1
864486422-3813938014-500
[2019-02-25T14:06:35+00:00] INFO: template[\ProgramData\SignalFxAgent\agent.yaml] permissions changed to [WIN-H4KL0TTBQV
O\Administrator/flags:0/mask:c0010000]

    - change dacl
    - change owner
    - change group
[2019-02-25T14:06:35+00:00] INFO: template[\ProgramData\SignalFxAgent\agent.yaml] not queuing delayed action restart on
windows_service[signalfx-agent] (delayed), as it's already been queued
  * windows_service[signalfx-agent] action enable[2019-02-25T14:06:35+00:00] INFO: Processing windows_service[signalfx-a
gent] action enable (signalfx_agent::default line 57)


    ================================================================================
    Error executing action `enable` on resource 'windows_service[signalfx-agent]'
    ================================================================================

    SystemCallError
    ---------------
    The specified service does not exist as an installed service. - OpenService: The specified service does not exist as
 an installed service.

    Resource Declaration:
    ---------------------
    # In C:/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/default.rb

     57: service node['signalfx_agent']['service_name'] do
     58:   action [:enable, :start]
     59: end

    Compiled Resource:
    ------------------
    # Declared in C:/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/default.rb:57:in `from_file'

    windows_service("signalfx-agent") do
      action [:enable, :start]
      default_guard_interpreter :default
      service_name "signalfx-agent"
      enabled nil
      running nil
      masked nil
      pattern "signalfx-agent"
      declared_type :service
      cookbook_name "signalfx_agent"
      recipe_name "default"
    end

    System Info:
    ------------
    chef_version=14.10.9
    platform=windows
    platform_version=6.1.7601
    ruby=ruby 2.5.3p105 (2018-10-18 revision 65156) [x64-mingw32]
    program_name=C:/opscode/chef/bin/chef-client
    executable=C:/opscode/chef/bin/chef-client

[2019-02-25T14:06:35+00:00] INFO: Running queued delayed notifications before re-raising exception
[2019-02-25T14:06:35+00:00] INFO: template[C:\Program Files\SplunkUniversalForwarder/etc/system/local/inputs.conf] sendi
ng restart action to windows_service[splunk] (delayed)
Recipe: splunk::default
  * windows_service[splunk] action restart[2019-02-25T14:06:35+00:00] INFO: Processing windows_service[splunk] action re
start (splunk::default line 342)
[2019-02-25T14:06:37+00:00] INFO: windows_service[splunk] configured.
[2019-02-25T14:06:45+00:00] INFO: windows_service[splunk] restarted

    - restart service windows_service[splunk]
[2019-02-25T14:06:45+00:00] INFO: windows_zipfile[\Program Files\SignalFx\] sending restart action to windows_service[si
gnalfx-agent] (delayed)
Recipe: signalfx_agent::default
  * windows_service[signalfx-agent] action restart[2019-02-25T14:06:45+00:00] INFO: Processing windows_service[signalfx-
agent] action restart (signalfx_agent::default line 57)
[2019-02-25T14:06:45+00:00] INFO: windows_service[signalfx-agent] restarted

    - restart service windows_service[signalfx-agent]

Running handlers:
[2019-02-25T14:06:45+00:00] ERROR: Running exception handlers
  - CRChefReporting
Running handlers complete
[2019-02-25T14:06:45+00:00] ERROR: Exception handlers complete
Chef Client failed. 69 resources updated in 02 minutes 24 seconds
[2019-02-25T14:06:45+00:00] FATAL: Stacktrace dumped to C:/chef.cloudreach/cache/chef-stacktrace.out
[2019-02-25T14:06:45+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2019-02-25T14:06:45+00:00] FATAL: SystemCallError: windows_service[signalfx-agent] (signalfx_agent::default line 57) ha
d an error: SystemCallError: The specified service does not exist as an installed service. - OpenService: The specified
service does not exist as an installed service.

kubernetes-events monitor not working with 3.0.2

I'm receiving this error when using the docker image: quay.io/signalfx/signalfx-agent:3.0.2

Unknown monitor type kubernetes-events

My configmap section looks like this:

    monitors:
      - type: kubernetes-events
        whitelistedEvents:
        - reason: ScaleDownEmpty
          involvedObjectKind: ConfigMap
        - reason: ScaledUpGroup
          involvedObjectKind: ConfigMap

Option to disable disk metrics from collectd/signalfx-metadata

Given that several golang monitors now officially replace their collectd counterparts, it would be great if we could do the same for the filesystems monitor.

Currently, filesystems duplicates some metrics sent by collectd/signalfx-metadata and, unfortunately, collectd/signalfx-metadata does not support omitting disk info like it does for process info.

While it is possible to use datapointsToExclude to omit sending the metrics from the collectd monitor, it would be great if there was a more obvious and simple way to ensure both monitors can be run simultaneously.
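For reference, the current workaround looks roughly like this. A sketch only: the duplicated metric names, assumed here to be disk.utilization and disk.summary_utilization, should be checked against the metrics actually seen in your data:

```yaml
monitors:
  - type: collectd/signalfx-metadata
    datapointsToExclude:
      - metricNames:
          - disk.utilization
          - disk.summary_utilization
  - type: filesystems
```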

Migration to non-scratch container base image

Hey guys,

I was looking through the repo and noted that the final docker image released for the signalfx agent uses FROM scratch as its base.

What was the reason / rationale for this?

From my point of view, I see this being an issue, as I cannot:

  • Install tools into the image at runtime to diagnose issues with networking, performance, or utilisation
  • Extend the image to include tools such as dumb-init without adding an additional layer

I would like to advocate for using an image that has a package manager by default, so that:

  • installed packages can be updated easily
  • dumb-init is supported, to avoid the zombie process problem
  • a complete Linux environment is available when debugging issues

Chef cookbook intermittently fails on first convergence on yum-based systems

yum occasionally fails to find the signalfx-agent package in the repository and fails the chef run. Re-running chef without changes solves this. A potential fix could be to move to the yum_repository resource instead of just writing the file. Or is this potentially an issue with the repository?

* yum_package[signalfx-agent] action install[2019-02-18T18:29:39+00:00] INFO: Processing yum_package[signalfx-agent] action install (signalfx_agent::default line 33)

* No candidate version available for signalfx-agent
================================================================================
Error executing action `install` on resource 'yum_package[signalfx-agent]'
================================================================================

    Chef::Exceptions::Package
-------------------------
No candidate version available for signalfx-agent

    Resource Declaration:
---------------------
# In /var/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/default.rb

     33: package 'signalfx-agent' do  # ~FC009
     34:   action :install
     35:   version node['signalfx_agent']['package_version'] if !node['signalfx_agent']['package_version'].nil?
     36:   options '--allow-downgrades' if platform_family?('debian') \
     37:     && node['packages']['apt'] \
     38:     && Gem::Version.new(node['packages']['apt']['version']) >= Gem::Version.new('1.1.0')
     39:   allow_downgrade true if platform_family?('rhel', 'amazon', 'fedora')
     40:   notifies :restart, 'service[signalfx-agent]', :delayed
     41: end
     42:

    Compiled Resource:
------------------
# Declared in /var/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/default.rb:33:in `from_file'

    yum_package("signalfx-agent") do
      package_name "signalfx-agent"
      action [:install]
      default_guard_interpreter :default
      declared_type :package
      cookbook_name "signalfx_agent"
      recipe_name "default"
      allow_downgrade true
    end

    System Info:
------------
chef_version=14.10.9
    platform=amazon
    platform_version=2018.03
    ruby=ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]
    program_name=/usr/bin/chef-client
    executable=/opt/chef/bin/chef-client

Ansible Role

The Ansible Role is not in Ansible Galaxy or mentioned on the main README.md page.

  1. In deployments/ansible/meta/main.yml you should change the author line to:
---
galaxy_info:
  author: signalfx
  description: Ansible role to install and configure the SignalFx Smart Agent
  company: SignalFx, Inc.
  license: Apache-2.0
  min_ansible_version: 1.9

This will allow you to submit the Smart Agent role to Ansible Galaxy for easier adoption by Ansible users. Also, your docs do not say how to install the role, which would be helpful to other users. I had to manually put the files in place; otherwise, someone would need to install the Ansible-specific folder using the GitHub method described under "Installing SignalFx for Ansible".

Once you're in Ansible Galaxy, the preferred method in the README.md would be to show a simple role installation and then playbook use. I realize you already have a sample playbook; the idea I'm suggesting is to make this a simpler install for onboarding new clients you prospect.

ansible-galaxy install signalfx.signalfx-agent

The above command assumes the role has been submitted to Ansible Galaxy and that the "author" line in meta/main.yml has been edited as suggested above.

Feature Request: Additional Host Metadata via config

I'd like to be able to list properties in the host-metadata monitor config (which currently has no config options) to pass through as host metadata. This would allow me to create this file from our configuration management system and provide additional host properties that we have in that system.

Example config yaml:

monitors:
  - type: host-metadata
    customproperties:
      property1: value1
      property2: value2

These could just be read from the config and passed through as another of the metadataFuncs.

Puppet module defaults are broken

Using this puppet module with its defaults results in the SignalFx agent getting 401s trying to send data. The problem seems to be with the apiUrl and ingestUrl; the current config is defined as:

---
signalfx_agent::config:
  signalFxAccessToken: "MY_TOKEN"
  ingestUrl: https://ingest.signalfx.com
  apiUrl: https://api.signalfx.com
  hostname: "%{facts.fqdn}"
  intervalSeconds: 10
  logging:
    level: info
  observers: [
    {"type": "host"}
  ]

The config schema defines these as optional; their values are derived from signalFxRealm. The schema doc also states that if these values are defined, they should use the following format:

ingestUrl: "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com"
apiUrl: "https://api.YOUR_SIGNALFX_REALM.signalfx.com"
traceEndpointUrl: "https://ingest.YOUR_SIGNALFX_REALM.signalfx.com/v1/trace"

A couple of options to fix this: either remove these defaults from the config template, or add the full endpoints as per the schema, which would also require adding signalFxRealm with us1 as a default value.
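Assuming the template passes extra keys through verbatim, the realm-based variant of the hiera config above would look something like this sketch:

```yaml
---
signalfx_agent::config:
  signalFxAccessToken: "MY_TOKEN"
  signalFxRealm: us1   # ingest/api endpoints are derived from the realm
  hostname: "%{facts.fqdn}"
```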

collectd vs golang-based default monitors?

First, I apologize if this is not the best place to ask this.

It seems that there are a set of go-based monitors that were added earlier this year that duplicate all the metrics offered by the default set of collectd monitors. These include:

  • cpu
  • filesystems
  • memory
  • disk-io
  • net-io
  • vmem

It appears that in at least some cases, the go-based monitors have more fidelity. For example, the filesystems monitor includes dimensions for both the device and mountpoint, while the df collectd plugin will only tag by one of them.

It also appears that the go-based monitors were primarily added for Windows support.

Is there a general recommendation on which to use? If the recommendation is to use the collectd plugins, should we just replace collectd plugins with their go-based equivalents when we want the extra dimensions?

Questions about python monitor

Hello,

I had a couple of questions about python monitor in signalfx.

  1. pip is not installed as part of the Python distribution that ships with signalfx-agent. That is understandable, since installing new packages or upgrading existing ones may have unintended side effects, but I wanted to understand the motivation for not including pip. pip would actually help us install packages that are required for our custom checks.

  2. Why are the modules in https://github.com/signalfx/signalfx-agent/tree/v4.6.0/python not packaged and published to PyPI? We could build a Python wheel and distribute it via a private Artifactory, or clone the repo and add the package to sys.path, but that is an ugly hack. If we could install the SignalFx packages mentioned above via pip, that would be ideal.

Thanks,
Raju Kadam

Specifying package version works differently across platforms on Chef, fails on apt

On Windows and yum-based systems, specifying 4.3.0 as the package_version works as expected. On apt-based systems it fails:

Recipe: signalfx_agent::default
  * apt_package[signalfx-agent] action install[2019-03-27T13:50:23+00:00] INFO: Processing apt_package[signalfx-agent] action install (signalfx_agent::default line 35)


    ================================================================================
    Error executing action `install` on resource 'apt_package[signalfx-agent]'
    ================================================================================

    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '100'
    ---- Begin output of ["apt-get", "-q", "-y", "-o", "Dpkg::Options::=--force-confdef", "-o", "Dpkg::Options::=--force-confold", "install", "signalfx-agent=4.3.0"] ----
    STDOUT: Reading package lists...
    Building dependency tree...
    Reading state information...
    STDERR: E: Version '4.3.0' for 'signalfx-agent' was not found
    ---- End output of ["apt-get", "-q", "-y", "-o", "Dpkg::Options::=--force-confdef", "-o", "Dpkg::Options::=--force-confold", "install", "signalfx-agent=4.3.0"] ----
    Ran ["apt-get", "-q", "-y", "-o", "Dpkg::Options::=--force-confdef", "-o", "Dpkg::Options::=--force-confold", "install", "signalfx-agent=4.3.0"] returned 100

    Resource Declaration:
    ---------------------
    # In /var/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/default.rb

     35:   package 'signalfx-agent' do # ~FC009
     36:     action :install
     37:     version node['signalfx_agent']['package_version'] unless node['signalfx_agent']['package_version'].nil?
     38:     flush_cache [ :before ] if platform_family?('rhel')
     39:     options '--allow-downgrades' if platform_family?('debian') \
     40:       && node['packages'] \
     41:       && node['packages']['apt'] \
     42:       && Gem::Version.new(node['packages']['apt']['version']) >= Gem::Version.new('1.1.0')
     43:     allow_downgrade true if platform_family?('rhel', 'amazon', 'fedora')
     44:     notifies :restart, 'service[signalfx-agent]', :delayed
     45:   end
     46: end

    Compiled Resource:
    ------------------
    # Declared in /var/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/default.rb:35:in `from_file'

    apt_package("signalfx-agent") do
      package_name "signalfx-agent"
      action [:install]
      default_guard_interpreter :default
      declared_type :package
      cookbook_name "signalfx_agent"
      recipe_name "default"
      version "4.3.0"
    end

    System Info:
    ------------
    chef_version=14.10.9
    platform=ubuntu
    platform_version=14.04
    ruby=ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]
    program_name=/usr/bin/chef-client
    executable=/opt/chef/bin/chef-client

This is likely because the package is versioned as 4.3.0-1. Could you either version your packages in the same format across platforms, or handle this so that specifying 4.3.0 gets the same build everywhere?

root@ip-10-0-1-127:~# apt-get install signalfx-agent=4.3.0
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Version '4.3.0' for 'signalfx-agent' was not found
root@ip-10-0-1-127:~# apt-get install signalfx-agent=4.3.0-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
  signalfx-agent

Helm chart configmap collectd rule missing top-level attributes

The helm chart still specifies procPath and etcPath under the collectd rule, not at a global level.

see:
https://docs.signalfx.com/en/latest/integrations/agent/monitors/host-metadata.html
https://docs.signalfx.com/en/latest/integrations/agent/monitors/collectd-signalfx-metadata.html

These should be defined at a global level with appropriate defaults:

procPath: /proc
etcPath: /etc
time="2019-04-02T12:28:16Z" level=info msg="Creating new monitor" discoveryRule= monitorID=9 monitorType="collectd/signalfx-metadata"
time="2019-04-02T12:28:16Z" level=warning msg="please set the `procPath` top level agent configuration instead of the monitor level configuration" monitorType="collectd/signalfx-metadata"
time="2019-04-02T12:28:16Z" level=error msg="Invalid module-specific configuration" error="yaml: unmarshal errors:
  line 1: field etcPath not found in struct hostmetadata.Config
  line 2: field procFSPath not found in struct hostmetadata.Config" otherConfig="(map[string]interface {}) (len=2) {
 (string) (len=7) "etcPath": (string) (len=11) "/hostfs/etc",
 (string) (len=10) "procFSPath": (string) (len=12) "/hostfs/proc"
}
(interface {}) <nil>
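Once fixed, the rendered agent.yaml would carry these settings at the top level, for example (values matching the hostfs mounts in the log above; a sketch, not the chart's actual output):

```yaml
# Top-level agent configuration, not under any monitor's rule
procPath: /hostfs/proc
etcPath: /hostfs/etc

monitors:
  - type: collectd/signalfx-metadata
  - type: host-metadata
```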

Windows service can not be restarted via chef

With the introduction of Python monitors it has become necessary to restart the agent once the monitors change, as the changes are not detected otherwise. Currently this does not seem possible due to the guards in the windows_service resource. The effect is that even though chef schedules the restart of the service, the resource decides there is nothing to be done because the agent version matches what's expected. In other words, it's only possible to restart the service via chef when the version changes:

So doing something like the following in a recipe

  cookbook_file "#{integration_directory}\\bitlocker.py" do
    source 'integrations/bitlocker.py'
    action :create
    notifies :restart, "windows_service[#{node['signalfx_agent']['service_name']}]", :delayed
  end

would yield:

[2019-04-15T14:14:33+00:00] INFO: cookbook_file[\Program Files\SignalFx\\integrations\bitlocker.py] sending restart acti
on to windows_service[signalfx-agent] (delayed)
Recipe: signalfx_agent::win
  * windows_service[signalfx-agent] action restart[2019-04-15T14:14:33+00:00] INFO: Processing windows_service[signalfx-
agent] action restart (signalfx_agent::win line 1)
 (skipped due to only_if)
[2019-04-15T14:14:33+00:00] INFO: Chef Run complete in 10.9068 seconds

(note the skipped due to only_if above)

Could you please either fix the agent so it detects changes to the custom python integrations without a restart, or the chef cookbooks so that the restart can be triggered on the service?

Redis ListLength expects DBIndex/KeyPattern dict but redis_info.py expects tuple

The ListLength struct (https://github.com/signalfx/signalfx-agent/blob/master/internal/monitors/collectd/redis/redis.go#L59) has keys DBIndex and KeyPattern.

However, redis_info.py (https://github.com/signalfx/redis-collectd-plugin/blob/master/redis_info.py#L245) expects a 2-item list (len(node.values) == 2). This causes the integration to break: you get the warning "redis_info plugin: monitoring length of keys requires both database index and key value".

If I had to guess, the fix could be to modify https://github.com/signalfx/signalfx-agent/blob/master/internal/monitors/collectd/redis/redis.go#L115 to just convert conf.SendListLengths into this tuple form.
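As a sketch of that conversion (written in Python for illustration; the actual fix would live in the Go config rendering at redis.go:115), mapping the agent's DBIndex/KeyPattern entries to the two-item form redis_info.py expects:

```python
def to_redis_info_pairs(send_list_lengths):
    """Convert [{'DBIndex': 0, 'KeyPattern': 'queue-*'}, ...] into the
    [(db_index, key_pattern), ...] form that redis_info.py checks with
    len(node.values) == 2."""
    return [(entry["DBIndex"], entry["KeyPattern"]) for entry in send_list_lengths]
```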

statsd monitor converter fails to parse `pattern` or `metricName` values not ending with a {pattern}

signalfx-agent config file -> monitors -> statsd -> converters:

If pattern or metricName doesn't contain any {pattern} fields, for example metric.count, the signalfx agent ignores the rule with the following error message in the logs:

level=error msg="Invalid pattern. Mismatched brackets : metric.count" monitorType=statsd

If pattern or metricName does have {pattern} fields but not at the end of the string, for example metric.{region}.count, signalfx-agent fails to start with the following stack trace:

Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: panic: runtime error: slice bounds out of range [39:38]
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: goroutine 97 [running]:
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/monitors/statsd.parseFields(0xc000ce47e0, 0x28, 0xc0009738e0)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/monitors/statsd/parser.go:54 +0x5db
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/monitors/statsd.initConverter(0xc000a3d128, 0xc00003e018)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/monitors/statsd/parser.go:18 +0x3b
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/monitors/statsd.(*Monitor).Configure(0xc000cfd4a0, 0xc00035f400, 0x0, 0x0)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/monitors/statsd/monitor.go:90 +0x168
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: reflect.Value.call(0x1b748c0, 0xc000cfd4a0, 0x213, 0x1e23a00, 0x4, 0xc000a3d520, 0x1, 0x1, 0x5, 0xc000a3d418, ...)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/local/go/src/reflect/value.go:460 +0x5f6
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: reflect.Value.Call(0x1b748c0, 0xc000cfd4a0, 0x213, 0xc000a3d520, 0x1, 0x1, 0xc000cfd4a0, 0x213, 0xc000cfd4a0)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/local/go/src/reflect/value.go:321 +0xb4
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/core/config.CallConfigure(0x1b748c0, 0xc000cfd4a0, 0x1cc4760, 0xc00035f400, 0x30, 0xc000cfd8c0)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/core/config/dynamic.go:111 +0x369
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/monitors.(*ActiveMonitor).configureMonitor(0xc00065a690, 0x21ba7a0, 0xc00035f400, 0x0, 0x21ba7a0)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/monitors/activemonitor.go:87 +0x381
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/monitors.(*MonitorManager).createAndConfigureNewMonitor(0xc000222f80, 0x21ba7a0, 0xc00035f040, 0x0, 0x0, 0xc000a3d868, 0x7b216d9a5a6a2bf1)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/monitors/manager.go:386 +0x6e5
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/monitors.(*MonitorManager).handleNewConfig(0xc000222f80, 0xc000cee400, 0xc1ca85db204e7540, 0xc00044fc48, 0x0, 0x0)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/monitors/manager.go:179 +0xa5
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/monitors.(*MonitorManager).Configure(0xc000222f80, 0xc000213400, 0xc, 0xc, 0xc0005bf208, 0xa, 0xc0009ac201)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/monitors/manager.go:106 +0x3a2
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/core.(*Agent).configure(0xc000222f00, 0xc0005bf000)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/core/agent.go:134 +0x3e9
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: github.com/signalfx/signalfx-agent/internal/core.Startup.func1(0xc0009610e0, 0x1e4ac0d, 0x18, 0xc000222f00, 0xc00092d200, 0x2206140, 0xc000130b40)
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/core/agent.go:191 +0x172
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]: created by github.com/signalfx/signalfx-agent/internal/core.Startup
Dec 03 01:30:18 ip-xx-xx-xxx-xx signalfx-agent[24212]:         /usr/src/signalfx-agent/internal/core/agent.go:178 +0x147
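For illustration, a bracket parser that tolerates both failing cases above (no {pattern} field at all, and a {pattern} field that is not at the end of the string) could be sketched like this in Python; the real parser lives in internal/monitors/statsd/parser.go:

```python
def parse_fields(pattern):
    """Split 'metric.{region}.count' into literal and field segments:
    [('lit', 'metric.'), ('field', 'region'), ('lit', '.count')].
    Raises ValueError (instead of crashing) on mismatched brackets."""
    segments, pos = [], 0
    while pos < len(pattern):
        open_ = pattern.find("{", pos)
        if open_ == -1:  # no more fields; the rest is a literal
            segments.append(("lit", pattern[pos:]))
            break
        close = pattern.find("}", open_)
        if close == -1:
            raise ValueError("mismatched brackets: " + pattern)
        if open_ > pos:
            segments.append(("lit", pattern[pos:open_]))
        segments.append(("field", pattern[open_ + 1:close]))
        pos = close + 1
    return segments
```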

Cookstyle Errors

When running Cookstyle on the chef cookbook I am receiving a lot of errors.

The majority of these can be fixed automatically with cookstyle -a.

Agent crashing on kubernetes cluster

Agent keeps crashing on Kubernetes. I'm using an EKS cluster:
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.12-eks-eb1860", GitCommit:"eb1860579253bb5bf83a5a03eb0330307ae26d18", GitTreeState:"clean", BuildDate:"2019-12-23T08:58:45Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

Here is the Error from the logs:
time="2020-01-19T21:47:49Z" level=info msg="Starting K8s API resource sync"
E0119 21:47:49.547438 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 2170 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x2260c00, 0x40b14c0)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x82
panic(0x2260c00, 0x40b14c0)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/signalfx/signalfx-agent/pkg/monitors/kubernetes/cluster/metrics.datapointsForJob(0xc00103e000, 0x2488e00, 0x2, 0x20ed403)
        /usr/src/signalfx-agent/pkg/monitors/kubernetes/cluster/metrics/jobs.go:25 +0x1ff
github.com/signalfx/signalfx-agent/pkg/monitors/kubernetes/cluster/metrics.(*DatapointCache).HandleAdd(0xc000cb3800, 0x2b75220, 0xc00103e000, 0x2b75220, 0xc00103e000)
        /usr/src/signalfx-agent/pkg/monitors/kubernetes/cluster/metrics/cache.go:109 +0x78d
github.com/signalfx/signalfx-agent/pkg/monitors/kubernetes/cluster.(*State).beginSyncForType.func3(0xc000f98480, 0x17, 0x17, 0xc000ddd808, 0x8, 0x0, 0x0)
        /usr/src/signalfx-agent/pkg/monitors/kubernetes/cluster/clusterstate.go:154 +0x2d4
k8s.io/client-go/tools/cache.(*FakeCustomStore).Replace(0xc0012482d0, 0xc000f98480, 0x17, 0x17, 0xc000ddd808, 0x8, 0x0, 0x0)
        /go/pkg/mod/k8s.io/[email protected]/tools/cache/fake_custom_store.go:91 +0x65
k8s.io/client-go/tools/cache.(*Reflector).syncWith(0xc00021e840, 0xc000f98300, 0x17, 0x17, 0xc000ddd808, 0x8, 0x0, 0xc000d74cc0)
        /go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:354 +0xf8
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func1(0xc00021e840, 0xc0008f4400, 0xc000d74000, 0xc001181bb8, 0x0, 0x0)
        /go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:250 +0x8fa
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc00021e840, 0xc000d74000, 0x0, 0x0)
        /go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:257 +0x1a9
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
        /go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:155 +0x33
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000b63778)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001181f78, 0x3b9aca00, 0x0, 0xc000c29901, 0xc000d74000)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc00021e840, 0xc000d74000)
        /go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:154 +0x16b
created by github.com/signalfx/signalfx-agent/pkg/monitors/kubernetes/cluster.(*State).beginSyncForType
        /usr/src/signalfx-agent/pkg/monitors/kubernetes/cluster/clusterstate.go:167 +0x2f6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19a587f]

Looking at the stack trace and the code, it seems that when collecting Kubernetes jobs this call dereferences a nil pointer:
datapoint.NewIntValue(int64(*job.Spec.Completions))

Here is the job description from kubernetes:
➜ ~ kubectl describe job automation-d9c368
Name:           automation-d9c368
Namespace:      automation-testing
Selector:       controller-uid=88b3bc31-3ac4-11ea-b7bc-0203bc266fa1
Labels:         controller-uid=88b3bc31-3ac4-11ea-b7bc-0203bc266fa1
                job-name=automation-d9c368
Annotations:    <none>
Parallelism:    1
Completions:    <unset>
Start Time:     Sun, 19 Jan 2020 16:03:56 +0200
Completed At:   Sun, 19 Jan 2020 16:28:41 +0200
Duration:       24m

You can see that the Completions attribute is unset.

From the Kubernetes docs, the value of job completions can be unset in some cases:

For a non-parallel Job, you can leave both .spec.completions and .spec.parallelism unset. When both are unset, both are defaulted to 1.

For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.

For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer
https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/

So an unset Completions attribute should be handled.
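A guard matching the defaulting rules quoted from the Kubernetes docs might look like this (sketched in Python for illustration; the actual fix belongs in jobs.go):

```python
def completions_datapoint_value(completions, parallelism):
    """Mirror the k8s defaulting rules quoted above: when both
    .spec.completions and .spec.parallelism are unset (None), both
    default to 1; when only completions is unset (work-queue jobs),
    there is no meaningful completion count, so emit no datapoint."""
    if completions is not None:
        return completions
    if parallelism is None:
        return 1      # non-parallel job: both default to 1
    return None       # work-queue job: skip the datapoint
```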

helm chart crashlooping for signalfx agent.

I commented out enableBuiltInFiltering: true in the helm chart, but I'm not sure what that line in the config does.

time="2019-05-13T20:52:38Z" level=info msg="Starting up agent version 4.6.3"
time="2019-05-13T20:52:38Z" level=info msg="Watching for config file changes"
time="2019-05-13T20:52:38Z" level=error msg="Could not unmarshal config file:

  readThreads: 5
  timeout: 40
  writeQueueLimitHigh: 500000
  writeQueueLimitLow: 400000
enableBuiltInFiltering: true
^^^^^^^
etcPath: /hostfs/etc
globalDimensions:
  kubernetes_cluster: sandbox
intervalSeconds: 20
logging:

yaml: unmarshal errors:
  line 7: field enableBuiltInFiltering not found in type config.Config
"
time="2019-05-13T20:52:38Z" level=error msg="Error loading main config" configPath="/etc/signalfx/agent.yaml" error="yaml: unmarshal errors:
  line 7: field enableBuiltInFiltering not found in type config.Config"

Panic when loading config referencing non-existent environment variable

When starting the agent with a config that references variables using ${} substitution, the agent panics. In our case, the following addition to the config seems to be the culprit:

globalDimensions:
  env: ${ENV_NAME}
  region: ${REGION}
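The expected behavior can be sketched as: resolve ${VAR} references before YAML parsing and fail with a readable error naming the missing variable, rather than panicking. A minimal Python illustration (the agent's real logic is in internal/core/config/loader.go):

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def substitute_env(config_text, env=os.environ):
    """Replace ${VAR} references in config text; raise a clear error
    for an undefined variable instead of crashing later on."""
    def repl(match):
        name = match.group(1)
        if name not in env:
            raise ValueError(
                f"config references undefined environment variable: {name}")
        return env[name]
    return _VAR.sub(repl, config_text)
```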

When starting the agent, it panicked with the following output:

Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: time="2019-09-28T01:37:52Z" level=info msg="Starting up agent version 4.11.1"
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: time="2019-09-28T01:37:52Z" level=info msg="Watching for config file changes"
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: panic: interface conversion: error is *errors.errorString, not *yaml.TypeError
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: goroutine 1 [running]:
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: github.com/signalfx/signalfx-agent/internal/core/config.loadYAML(0xc00056c000, 0x1048, 0x1300, 0x6f9, 0xc000082de0, 0xc00056c000)
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]:         /usr/src/signalfx-agent/internal/core/config/loader.go:107 +0x734
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: github.com/signalfx/signalfx-agent/internal/core/config.LoadConfig(0x223f020, 0xc000461500, 0x1e5462d, 0x18, 0xc000123d60, 0x18, 0xc000000180)
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]:         /usr/src/signalfx-agent/internal/core/config/loader.go:42 +0x41f
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: github.com/signalfx/signalfx-agent/internal/core.Startup(0x1e5462d, 0x18, 0xc00055fe20, 0x1)
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]:         /usr/src/signalfx-agent/internal/core/agent.go:148 +0x95
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: main.runAgent.func1()
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]:         /usr/src/signalfx-agent/cmd/agent/main.go:191 +0xac
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: main.runAgent(0xc00046dd10, 0xc00046a8a0, 0xc000082c60)
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]:         /usr/src/signalfx-agent/cmd/agent/main.go:194 +0xa6
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: main.runAgentPlatformSpecific(...)
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]:         /usr/src/signalfx-agent/cmd/agent/nonwindows.go:10
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]: main.main()
Sep 28 01:37:52 ip-10-4-12-42 docker[11742]:         /usr/src/signalfx-agent/cmd/agent/main.go:277 +0x497

Once we made the environment variables available, the agent started up successfully.

While the agent failing to start is not unreasonable, the lack of an appropriate error message is very confusing.

Fatal error with optional consul remote config source

Our config includes:

  - {"#from": "consul:signalfx/monitors/*", flatten: true, optional: true}

This may be intentional, but if the consul service is not available, the smart agent dies after logging the following error:

time="2019-10-01T06:40:39Z" level=error msg="Error loading main config" configPath="/etc/signalfx/agent.yaml" error="could not resolve path consul:signalfx/monitors/*: Get http://localhost:8500/v1/kv/signalfx/monitors?recurse=: dial tcp [::1]:8500: connect: connection refused"

Since the path is optional, should the appropriate behavior be to log the failure and continue?
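The behavior being asked for could be sketched like this (function and parameter names are hypothetical; the real logic lives in the agent's remote config source handling):

```python
import logging

def resolve_remote_source(fetch, path, optional):
    """Try a remote config source; when the source is marked optional,
    log connection failures and return no documents instead of dying."""
    try:
        return fetch(path)
    except ConnectionError as exc:
        if optional:
            logging.warning("optional config source %s unavailable: %s",
                            path, exc)
            return []
        raise
```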

python-monitor should log unhandled exceptions

Hello,

In my current environment (Debian 9), signalfx-agent logging works fine through systemd.

Nevertheless, while writing a new python-monitor script I noticed that errors raised during script execution by the agent do not show up in the agent logs.

For example, a Python stack trace from an unhandled exception is not visible anywhere (as far as I know), so the script fails silently.

It would be far better to forward logs from the Python script into the signalfx-agent logs.
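A wrapper along these lines around the monitor entry point would make such failures visible (a sketch; the monitor hook name is hypothetical):

```python
import logging
import traceback

def run_monitor_safely(collect, logger=logging.getLogger("python-monitor")):
    """Call a monitor's collect() hook and log any unhandled exception
    with its full traceback instead of letting it fail silently."""
    try:
        collect()
    except Exception:
        logger.error("unhandled exception in python monitor:\n%s",
                     traceback.format_exc())
```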

Emitting a metrics for failing plugins

I wanted to open a conversation about potentially reducing the amount of logs being written, by switching them to a counter/gauge sent to SignalFx instead.

For example, the collectd plugins can be rather noisy if they are interrupted, or if the service they are monitoring restarts, doesn't respond in time, etc.
My thought would be to emit an event that we could track or alert on in the UI to follow up on, rather than searching through all of our logs, since storing them all can get rather expensive.
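As a sketch of the idea (metric and dimension names are hypothetical), plugin failures would increment a counter datapoint rather than producing a log line each time:

```python
from collections import Counter

class PluginFailureTracker:
    """Count plugin failures and expose them as a counter metric
    (e.g. 'plugin.failures' with a 'plugin' dimension) instead of
    writing a log line per failure."""

    def __init__(self):
        self.failures = Counter()

    def record_failure(self, plugin_name):
        self.failures[plugin_name] += 1

    def datapoints(self):
        # One (metric, dimensions, value) tuple per failing plugin.
        return [("plugin.failures", {"plugin": name}, count)
                for name, count in self.failures.items()]
```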

Support setting proxy environment variables for service on rhel6

The section on Proxy Support in README.md does not detail how to set up the required environment variables for configuring proxy support.

With SysV init scripts these customisations are normally added to a file in /etc/default/, so for the SignalFx agent that would be the following file:

/etc/default/signalfx-agent

The code to check for and source the file is needed around these lines:

dir="/usr/lib/signalfx-agent"
cmd="/usr/bin/signalfx-agent"
user="signalfx-agent"
group="signalfx-agent"
rundir="/var/run/signalfx-agent"
name="signalfx-agent"
pidfile="/var/run/$name.pid"
logfile="/var/log/$name.log"
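The sourcing logic being asked for would look something like this, placed just after the variables above (the standard SysV pattern, using the proposed /etc/default path):

```shell
name="signalfx-agent"

# Pull in proxy settings (http_proxy, https_proxy, no_proxy) and any
# other overrides from /etc/default/signalfx-agent, if present.
if [ -f "/etc/default/$name" ]; then
    . "/etc/default/$name"
fi
```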

I'll be raising this with support directly too.

Chef cookbook fails on subsequent runs on windows

The chef cookbook fails on every run after the first as it tries to re-create a service that already exists:

Recipe: signalfx_agent::win
  * windows_zipfile[\Program Files\SignalFx\] action unzip[2019-02-25T14:14:02+00:00] INFO: Processing windows_zipfile[\
Program Files\SignalFx\] action unzip (signalfx_agent::win line 2)
 (skipped due to only_if)
  * file[\Program Files\SignalFx\\version.txt] action create[2019-02-25T14:14:02+00:00] INFO: Processing file[\Program F
iles\SignalFx\\version.txt] action create (signalfx_agent::win line 11)
 (up to date)
  * powershell_script[ensure service created] action run[2019-02-25T14:14:02+00:00] INFO: Processing powershell_script[e
nsure service created] action run (signalfx_agent::win line 23)

    [execute] time="2019-02-25T14:14:03Z" level=error msg="Failed to control the service" error="Failed to install Signa
lFx Smart Agent: service SignalFx Smart Agent already exists"
              E: 14:14:03 time="2019-02-25T14:14:03Z" level=error msg="Failed to control the service" error="Failed to i
nstall SignalFx Smart Agent: service SignalFx Smart Agent already exists"

    ================================================================================
    Error executing action `run` on resource 'powershell_script[ensure service created]'
    ================================================================================

    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of "C:\Windows\system32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NonInteractive -NoProfile
-ExecutionPolicy Bypass -InputFormat None -File "C:/Users/ADMINI~1/AppData/Local/Temp/2/chef-script20190225-3088-1hdsuas
.ps1" ----
    STDOUT: time="2019-02-25T14:14:03Z" level=error msg="Failed to control the service" error="Failed to install SignalF
x Smart Agent: service SignalFx Smart Agent already exists"
    STDERR: E: 14:14:03 time="2019-02-25T14:14:03Z" level=error msg="Failed to control the service" error="Failed to ins
tall SignalFx Smart Agent: service SignalFx Smart Agent already exists"
    ---- End output of "C:\Windows\system32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NonInteractive -NoProfile -E
xecutionPolicy Bypass -InputFormat None -File "C:/Users/ADMINI~1/AppData/Local/Temp/2/chef-script20190225-3088-1hdsuas.p
s1" ----
    Ran "C:\Windows\system32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NonInteractive -NoProfile -ExecutionPolicy
Bypass -InputFormat None -File "C:/Users/ADMINI~1/AppData/Local/Temp/2/chef-script20190225-3088-1hdsuas.ps1" returned 1

    Resource Declaration:
    ---------------------
    # In C:/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/win.rb

     23: powershell_script 'ensure service created' do
     24:     code <<-EOH
     25:       if (!((Get-CimInstance -ClassName win32_service -Filter "Name = '#{node['signalfx_agent']['service_name']
}'" | Select Name, State).Name)){
     26:           & "#{node['signalfx_agent']['install_dir']}\\SignalFxAgent\\bin\\signalfx-agent.exe" -service "instal
l" -logEvents -config "#{node['signalfx_agent']['conf_file_path']}"
     27:       }
     28:     EOH
     29: end

    Compiled Resource:
    ------------------
    # Declared in C:/chef.cloudreach/cache/cookbooks/signalfx_agent/recipes/win.rb:23:in `from_file'

    powershell_script("ensure service created") do
      action [:run]
      default_guard_interpreter :powershell_script
      command nil
      backup 5
      interpreter "powershell.exe"
      declared_type :powershell_script
      cookbook_name "signalfx_agent"
      recipe_name "win"
      code "      if (!((Get-CimInstance -ClassName win32_service -Filter \"Name = 'signalfx-agent'\" | Select Name, Sta
te).Name)){\n          & \"\\Program Files\\SignalFx\\\\SignalFxAgent\\bin\\signalfx-agent.exe\" -service \"install\" -l
ogEvents -config \"\\ProgramData\\SignalFxAgent\\agent.yaml\"\n      }\n"
      domain nil
      user nil
    end

    System Info:
    ------------
    chef_version=14.10.9
    platform=windows
    platform_version=6.1.7601
    ruby=ruby 2.5.3p105 (2018-10-18 revision 65156) [x64-mingw32]
    program_name=C:/opscode/chef/bin/chef-client
    executable=C:/opscode/chef/bin/chef-client

[2019-02-25T14:14:03+00:00] INFO: Running queued delayed notifications before re-raising exception

Running handlers:
[2019-02-25T14:14:03+00:00] ERROR: Running exception handlers
  - CRChefReporting
Running handlers complete
[2019-02-25T14:14:03+00:00] ERROR: Exception handlers complete
Chef Client failed. 7 resources updated in 16 seconds
[2019-02-25T14:14:03+00:00] FATAL: Stacktrace dumped to C:/chef.cloudreach/cache/chef-stacktrace.out
[2019-02-25T14:14:03+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2019-02-25T14:14:03+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: powershell_script[ensure service created] (sign
alfx_agent::win line 23) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but rece
ived '1'
---- Begin output of "C:\Windows\system32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NonInteractive -NoProfile -Exe
cutionPolicy Bypass -InputFormat None -File "C:/Users/ADMINI~1/AppData/Local/Temp/2/chef-script20190225-3088-1hdsuas.ps1
" ----
STDOUT: time="2019-02-25T14:14:03Z" level=error msg="Failed to control the service" error="Failed to install SignalFx Sm
art Agent: service SignalFx Smart Agent already exists"
STDERR: E: 14:14:03 time="2019-02-25T14:14:03Z" level=error msg="Failed to control the service" error="Failed to install
 SignalFx Smart Agent: service SignalFx Smart Agent already exists"
---- End output of "C:\Windows\system32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NonInteractive -NoProfile -Execu
tionPolicy Bypass -InputFormat None -File "C:/Users/ADMINI~1/AppData/Local/Temp/2/chef-script20190225-3088-1hdsuas.ps1"
----
Ran "C:\Windows\system32\WindowsPowerShell\v1.0\powershell.exe" -NoLogo -NonInteractive -NoProfile -ExecutionPolicy Bypa
ss -InputFormat None -File "C:/Users/ADMINI~1/AppData/Local/Temp/2/chef-script20190225-3088-1hdsuas.ps1" returned 1

Discovery Rule on container_labels not configuring any monitors

I'm having a few issues getting a global discovery rule working. Most of our microservices behave the same, so I want to write a rule that will work for 90% of them, but something's not working correctly and I can't quite pinpoint why.

I have this in my agent.yaml:

monitors:
  - type: prometheus-exporter
    discoveryRule: Get(container_labels, "metrics") == "true" && private_port == "9125"
    extraDimensionsFromEndpoint:
      app: Get(container_labels, "app")

and I have a Pod spec of:

apiVersion: v1
kind: Pod
metadata:
  name: my-service
  namespace: default
  labels:
    name: "my-service"
    metrics: "true"
spec:
  containers:
  - name: my-service
    image: my-service:latest

This isn't configuring any monitors for me 😕 If I change the discoveryRule to container_image =~ "my-service" it works fine.

Here's the output of signalfx-agent status endpoints, if that helps:

 - internalId: my-service-5ccd5bb89c-g577w-4995517-9125 (UNMONITORED)
   alternate_port: 0
   container_command: 
   container_id: f9387b00f26064c02c6cb5ffb13c5c1f10a429e857f5a2eb854e785c0cdb1c57
   container_image: my-service:latest
   container_labels: map[app:my-service metrics:true pod-template-hash:5ccd5bb89c release:my-service-1]
   container_name: my-service
   container_names: [my-service]
   container_spec_name: my-service
   container_state: running
   discovered_by: k8s-api
   has_port: true
   host: 100.96.5.36
   id: my-service-5ccd5bb89c-g577w-4995517-9125
   ip_address: 100.96.5.36
   kubernetes_annotations: map[artifact.spinnaker.io/location:default cni.projectcalico.org/podIP:100.96.5.36/32]
   kubernetes_namespace: default
   kubernetes_pod_name: my-service-5ccd5bb89c-g577w
   kubernetes_pod_uid: 4995517c-0d1b-11ea-9d1e-06033b20cc02
   name: metrics
   network_port: 9125
   orchestrator: 1
   pod_metadata: &ObjectMeta{Name:my-service-5ccd5bb89c-g577w,GenerateName:my-service-5ccd5bb89c-,Namespace:default,SelfLink:/api/v1/namespaces/default/pods/my-service-5ccd5bb89c-g577w,UID:4995517c-0d1b-11ea-9d1e-06033b20cc02,ResourceVersion:34107591,Generation:0,CreationTimestamp:2019-11-22 11:29:03 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: my-service,metrics: true,pod-template-hash: 5ccd5bb89c,release: my-service-1,},Annotations:map[string]string{cni.projectcalico.org/podIP: 100.96.5.36/32,},OwnerReferences:[{apps/v1 ReplicaSet my-service-5ccd5bb89c 498e8e1f-0d1b-11ea-9d1e-06033b20cc02 0xc00235838a 0xc00235838b}],Finalizers:[],ClusterName:,Initializers:nil,}
   pod_spec: &PodSpec{Volumes:[{logging-config {nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil ConfigMapVolumeSource{LocalObjectReference:LocalObjectReference{Name:my-service-logging-v000,},Items:[{content config.xml <nil>}],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil}} {default-token-qktlw {nil nil nil nil nil &SecretVolumeSource{SecretName:default-token-qktlw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{my-service my-service:latest [/bin/bash /start.sh] []  [{http 0 80 TCP } {metrics 0 9125 TCP }] [{ ConfigMapEnvSource{LocalObjectReference:LocalObjectReference{Name:my-service-environment-v000,},Optional:nil,} nil} { nil &SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:my-service-secrets,},Optional:nil,}}] [] {map[memory:{{566231040 0} {<nil>} 540Mi BinarySI}] map[cpu:{{600 -3} {<nil>} 600m DecimalSI} memory:{{566231040 0} {<nil>} 540Mi BinarySI}]} [{logging-config false /config/logging  <nil>} {default-token-qktlw true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] &Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/ping,Port:http,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:60,TimeoutSeconds:5,PeriodSeconds:30,SuccessThreshold:1,FailureThreshold:3,} &Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/ping,Port:http,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:60,TimeoutSeconds:5,PeriodSeconds:30,SuccessThreshold:1,FailureThreshold:3,} nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{kops.k8s.io/instancegroup: 
au-production,},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:ip-10-31-51-23.ap-southeast-2.compute.internal,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists  NoExecute 0xc002358670} {node.kubernetes.io/unreachable Exists  NoExecute 0xc0023586a0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],RuntimeClassName:nil,EnableServiceLinks:*true,}
   port: 9125
   port_labels: map[]
   port_type: TCP
   private_port: 9125
   public_port: 0
   target: hostport

Any ideas what's going on?

Pass signalFxAccessToken to consul monitor automatically?

The collectd/consul plugin configuration can optionally accept signalFxAccessToken to send leadership change events.

Currently, the token has to be passed manually. Would it be possible to pass it in automatically from the top-level signalFxAccessToken config? This would simplify configuration and let the default dashboard work out-of-the-box without any surprises.
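For reference, a minimal sketch of how the token currently has to be repeated in the monitor config (host and port values here are illustrative):

```yaml
monitors:
  - type: collectd/consul
    host: 127.0.0.1
    port: 8500
    # today this must be set again here, even though the same token is
    # already configured at the top level of agent.yaml
    signalFxAccessToken: MY_TOKEN
```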

Interface monitor does not collect metrics from other network namespaces

I currently have the agent running inside a Docker container in privileged mode, but not on the host network. The network statistics produced are only for the network namespace the container is currently running in, rather than the host's.

I was wondering if there is some magic that could be done either in code or config to have it collect from other namespaces?
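In the meantime, one workaround is to give the container the host's network namespace so the interface monitor sees the host's interfaces directly. A compose-style sketch (image tag and privilege settings are assumptions based on the setup described above):

```yaml
# docker-compose fragment: share the host network namespace with the agent
services:
  signalfx-agent:
    image: quay.io/signalfx/signalfx-agent:latest
    network_mode: host   # interface stats now reflect the host, not the container
    privileged: true
```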

Support for "Check from Only One Node" Monitors

At the moment I don't believe there is a way to scrape a specific endpoint only once (e.g. postgresql checks or a single host/port via prometheus-exporter).

Instead, the monitors are applied to all signalfx-agent instances, and each makes duplicate requests. In some circumstances (like with Postgres) the overhead of a per-database, per-agent connection is less than desirable (e.g. we have a shared RDS instance, and monitoring it from a 46-node Kubernetes cluster results in 46 * db-count TCP connections to the RDS instance).

I'm not familiar with the exact inner workings of the signalfx-agent codebase, but I had always assumed there was a single leader that actually aggregates data and sends it on to the ingestion endpoints. Is this correct? If so, it would be great if there were a way to have only the leader agent perform certain checks.

One suggestion we have had is to run a single pod with a specific configuration of the signalfx-agent to monitor specific endpoints only once, which seems a bit excessive.
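For completeness, a sketch of that single-instance workaround: a one-replica Deployment running the agent with its own config for the "check once" monitors (names and the ConfigMap mount are hypothetical):

```yaml
# Hypothetical one-replica Deployment so the "solo" monitors run exactly once
apiVersion: apps/v1
kind: Deployment
metadata:
  name: signalfx-agent-solo
spec:
  replicas: 1
  selector:
    matchLabels: {app: signalfx-agent-solo}
  template:
    metadata:
      labels: {app: signalfx-agent-solo}
    spec:
      containers:
        - name: signalfx-agent
          image: quay.io/signalfx/signalfx-agent:latest
          volumeMounts:
            - name: config
              mountPath: /etc/signalfx   # agent.yaml with only the solo monitors
      volumes:
        - name: config
          configMap: {name: signalfx-agent-solo-config}
```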

So two questions:

  • Does the agent have leader-election or am I getting confused with something else?
  • If there is leadership election, would it be feasible to add support for leader-only check support to prevent duplication?

High CPU Usage

We're seeing signalfx-agent use a large amount of CPU after running without problems for an undetermined length of time.

SignalFx Agent version:           4.14.0
ATOP - $HOST                          2019/11/06  23:07:15                          ------                           5s elapsed
PRC | sys    4.23s  | user  10.48s | #proc    161  | #trun      2 |  #tslpi   353 | #tslpu     0 |  #zombie    0 | clones    39 |  #exit     29 |
CPU | sys      78%  | user    203% | irq       3%  | idle    316% |  wait      0% | steal     0% |  guest     0% | curf 2.20GHz |  curscal   ?% |
cpu | sys      12%  | user     38% | irq       1%  | idle     48% |  cpu004 w  0% | steal     0% |  guest     0% | curf 2.20GHz |  curscal   ?% |
cpu | sys      14%  | user     34% | irq       0%  | idle     51% |  cpu000 w  0% | steal     0% |  guest     0% | curf 2.20GHz |  curscal   ?% |
cpu | sys      12%  | user     34% | irq       0%  | idle     53% |  cpu002 w  0% | steal     0% |  guest     0% | curf 2.20GHz |  curscal   ?% |
cpu | sys      14%  | user     31% | irq       1%  | idle     54% |  cpu005 w  0% | steal     0% |  guest     0% | curf 2.20GHz |  curscal   ?% |
cpu | sys      12%  | user     34% | irq       0%  | idle     54% |  cpu001 w  0% | steal     0% |  guest     0% | curf 2.20GHz |  curscal   ?% |
cpu | sys      12%  | user     32% | irq       0%  | idle     55% |  cpu003 w  0% | steal     0% |  guest     0% | curf 2.20GHz |  curscal   ?% |
CPL | avg1    4.82  | avg5    4.42 | avg15   4.75  |              |               | csw   264496 |  intr  150589 |              |  numcpu     6 |
MEM | tot    38.3G  | free   32.6G | cache   3.9G  | dirty   0.5M |  buff  160.6M | slab  378.2M |               |              |               |
SWP | tot     0.0M  | free    0.0M |               |              |               |              |               | vmcom   3.0G |  vmlim  19.2G |
DSK |          sda  | busy      0% | read       0  | write      2 |  KiB/r      0 | KiB/w     26 |  MBr/s   0.00 | MBw/s   0.01 |  avio 0.00 ms |
NET | transport     | tcpi     128 | tcpo     136  | udpi       2 |  udpo       2 | tcpao      6 |  tcppo      4 | tcprs      1 |  udpip      0 |
NET | network       | ipi     8784 | ipo     8784  | ipfrw   8654 |  deliv    130 |              |               | icmpi      0 |  icmpo      0 |
NET | ens4    ----  | pcki    4758 | pcko    3973  | si 1715 Kbps |  so 1690 Kbps | erri       0 |  erro       0 | drpi       0 |  drpo       0 |
NET | vethbaf ----  | pcki    3950 | pcko    4737  | si 1655 Kbps |  so 1701 Kbps | erri       0 |  erro       0 | drpi       0 |  drpo       0 |
NET | docker0 ----  | pcki    3950 | pcko    4737  | si 1566 Kbps |  so 1701 Kbps | erri       0 |  erro       0 | drpi       0 |  drpo       0 |
NET | lo      ----  | pcki      77 | pcko      77  | si   52 Kbps |  so   52 Kbps | erri       0 |  erro       0 | drpi       0 |  drpo       0 |

  PID   RUID        EUID        THR     SYSCPU    USRCPU     VGROW    RGROW     RDDSK    WRDSK    ST   EXC    S   CPUNR     CPU   CMD         1/3
 6842   signalfx    signalfx     21      1.94s     8.79s        0K       0K        0K       0K    --     -    S       5    234%   signalfx-agent
  447   root        root          1      1.74s     0.77s        0K       0K        0K       0K    --     -    S       3     55%   systemd-journa
 9930   root        root         20      0.27s     0.36s        0K       0K        0K       0K    --     -    S       4     14%   (censored)

Attaching to the process with strace doesn't reveal much:

root@asg-collector-prod-r7nx:~# strace -p 6842 -d
strace: ptrace_setoptions = 0x11
strace: attach to pid 6842 (main) succeeded
strace: Process 6842 attached
strace: [wait(0x80057f) = 6842] WIFSTOPPED,sig=SIGTRAP,EVENT_STOP (128)
strace: pid 6842 has TCB_STARTUP, initializing it
strace: [wait(0x00857f) = 6842] WIFSTOPPED,sig=133
futex(0x30f1a88, FUTEX_WAIT_PRIVATE, 0, NULL


^Cstrace: cleanup: looking at pid 6842
strace: detach wait: event:0 sig:133
strace: Process 6842 detached
strace: dropped tcb for pid 6842, 0 remain
 <detached ...>

Comparing that strace output to one from a host that has signalfx-agent but isn't experiencing the problem shows identical output, so there's likely nothing helpful here.

We've found that restarting the signalfx-agent process temporarily resolves the issue, however it returns about a week later in our environment.
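When it recurs, the agent's built-in Go profiler (the `profiling`, `profilingHost`, and `profilingPort` options in the config dump) may show where the CPU is going. A config fragment:

```yaml
# agent.yaml fragment: enable the built-in pprof endpoint
profiling: true
profilingHost: 127.0.0.1
profilingPort: 6060
```

A 30-second CPU profile could then be pulled with `go tool pprof http://127.0.0.1:6060/debug/pprof/profile`, assuming the agent exposes the standard Go `net/http/pprof` routes on that port.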

This is the slightly redacted output of signalfx-agent status all:

root@asg-collector-prod-d5gx:~# signalfx-agent status all
SignalFx Agent version:           4.14.0
Agent uptime:                     4m4s
Observers active:
Active Monitors:                  12
Configured Monitors:              12
Discovered Endpoint Count:        0
Bad Monitor Config:               None
Global Dimensions:                {host: $HOSTNAME, gcp_id: $GCP_PROJECT_ID}
Datapoints sent (last minute):    1044
Datapoints failed (last minute):  0
Events Sent (last minute):        12
Trace Spans Sent (last minute):   0
Agent Configuration:
  signalFxAccessToken: (censored)
  ingestUrl: https://ingest.us0.signalfx.com
  traceEndpointUrl: https://ingest.us0.signalfx.com/v1/trace
  apiUrl: https://api.us0.signalfx.com
  signalFxRealm: us0
  hostname: $HOSTNAME
  useFullyQualifiedHost:
  disableHostDimensions: false
  intervalSeconds: 5
  globalDimensions:
  sendMachineID: false
  cluster: ""
  syncClusterOnHostDimension: false
  validateDiscoveryRules: true
  observers:
  monitors:
    - type: collectd/cpu
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/cpufreq
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/df
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/disk
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/interface
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/load
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/memory
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/protocols
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/signalfx-metadata
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/uptime
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: collectd/vmem
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
    - type: docker-container-stats
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
  writer:
    datapointMaxBatchSize: 1000
    maxDatapointsBuffered: 5000
    traceSpanMaxBatchSize: 1000
    datapointMaxRequests: 10
    maxRequests: 10
    eventSendIntervalSeconds: 1
    propertiesMaxRequests: 20
    propertiesMaxBuffered: 10000
    propertiesSendDelaySeconds: 30
    propertiesHistorySize: 10000
    logDatapoints: false
    logEvents: false
    logTraceSpans: false
    logDimensionUpdates: false
    logDroppedDatapoints: false
    sendTraceHostCorrelationMetrics: true
    staleServiceTimeout: 5m0s
    traceHostCorrelationMetricsInterval: 1m0s
    maxTraceSpansInFlight: 100000
  logging:
    level: info
    format: text
  collectd:
    disableCollectd: false
    timeout: 40
    readThreads: 5
    writeThreads: 2
    writeQueueLimitHigh: 500000
    writeQueueLimitLow: 400000
    logLevel: notice
    intervalSeconds: 5
    writeServerIPAddr: 127.9.8.7
    writeServerPort: 0
    configDir: /var/run/signalfx-agent/collectd
  enableBuiltInFiltering: false
  metricsToInclude:
  metricsToExclude:
  propertiesToExclude:
  internalStatusHost: localhost
  internalStatusPort: 8095
  profiling: false
  profilingHost: 127.0.0.1
  profilingPort: 6060
  bundleDir: /usr/lib/signalfx-agent
  configSources:
    watch: true
    file:
      pollRateSeconds: 5
    zookeeper:
    etcd2:
    consul:
    vault:
  procPath: /proc
  etcPath: /etc
  varPath: /var
  runPath: /run
  sysPath: /sys

Active Monitors:
1. collectd/cpu
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/cpu
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:

2. collectd/cpufreq
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/cpufreq
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:

3. collectd/df
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/df
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
      hostFSPath: ""
      ignoreSelected: true
      fsTypes:
        - aufs
        - overlay
        - tmpfs
        - proc
        - sysfs
        - nsfs
        - cgroup
        - devpts
        - selinuxfs
        - devtmpfs
        - debugfs
        - mqueue
        - hugetlbfs
        - securityfs
        - pstore
        - binfmt_misc
        - autofs
      mountPoints:
        - /^/var/lib/docker/
        - /^/var/lib/rkt/pods/
        - /^/net//
        - /^/smb//
      reportByDevice: false
      reportInodes: false
      valuesPercentage: false

4. collectd/disk
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/disk
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
      disks:
        - /^loop[0-9]+$/
        - /^dm-[0-9]+$/
      ignoreSelected: true

5. collectd/interface
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/interface
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
      excludedInterfaces:
        - /^lo\d*$/
        - /^docker.*/
        - /^t(un|ap)\d*$/
        - /^veth.*$/
      includedInterfaces:

6. collectd/load
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/load
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:

7. collectd/memory
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/memory
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:

8. collectd/protocols
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/protocols
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:

9. collectd/signalfx-metadata
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/signalfx-metadata
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
      writeServerURL: (censored)
      procFSPath: /proc
      etcPath: /etc
      perCoreCPUUtil: false
      persistencePath: /var/run/signalfx-agent
      omitProcessInfo: false
      dogStatsDPort:
      token:
      dogStatsDIP: ""
      ingestEndpoint: ""
      verbose: false

10. collectd/uptime
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/uptime
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:

11. collectd/vmem
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: collectd/vmem
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:

12. docker-container-stats
    Reporting Interval (seconds): 5
    Enabled Metrics: []
    Not using auto-discovery
    Config:
      type: docker-container-stats
      discoveryRule: ""
      validateDiscoveryRule: true
      extraDimensions:
      extraDimensionsFromEndpoint:
      configEndpointMappings:
      intervalSeconds: 5
      solo: false
      metricsToExclude:
      datapointsToExclude:
      disableHostDimensions: false
      disableEndpointDimensions: false
      dimensionTransformations:
      extraMetrics:
      extraGroups:
      enableExtraBlockIOMetrics: false
      enableExtraCPUMetrics: false
      enableExtraMemoryMetrics: false
      enableExtraNetworkMetrics: false
      dockerURL: unix:///var/run/docker.sock
      timeoutSeconds: 5
      labelsToDimensions:
      envToDimensions:
      excludedImages:


Discovered Endpoints:

If there's any extra details we can provide just let us know, and thanks in advance for your help with this!

Chef breaks service management on CentOS 7, RHEL 7 and Amazon Linux 2

After installing the agent via Chef on RHEL 7, CentOS 7 and Amazon Linux 2, the agent is started, but it runs outside of the service manager. It can then be started a second time using the service manager. The instance launched by Chef cannot be stopped using the service manager, only by killing it.

[root@ip-10-0-2-189 ~]# service signalfx-agent status
Not running
[root@ip-10-0-2-189 ~]# ps aux | grep signalfx-agent
signalf+  2581  0.4 21.8 281656 221656 ?       Ssl  13:15   0:00 /usr/bin/signalfx-agent
signalf+  2691  0.7  2.1 784008 22160 ?        Sl   13:15   0:01 /usr/lib/signalfx-agent/lib64/ld-linux-x86-64.so.2 /usr/lib/signalfx-agent/bin/collectd -f -C /var/run/signalfx-agent/collectd/global/collectd.conf
root      3628  0.0  0.0 112640   964 pts/0    S+   13:18   0:00 grep --color=auto signalfx-agent
[root@ip-10-0-2-189 ~]# service signalfx-agent restart
Not running
Starting signalfx-agent
Started.  Logs will go to /var/log/signalfx-agent.log
[root@ip-10-0-2-189 ~]# service signalfx-agent status
Running with pid 3695
[root@ip-10-0-2-189 ~]# ps aux | grep signalfx-agent
signalf+  2581  0.4 21.8 281656 221656 ?       Ssl  13:15   0:00 /usr/bin/signalfx-agent
signalf+  2691  0.7  2.1 784008 22160 ?        Sl   13:15   0:01 /usr/lib/signalfx-agent/lib64/ld-linux-x86-64.so.2 /usr/lib/signalfx-agent/bin/collectd -f -C /var/run/signalfx-agent/collectd/global/collectd.conf
root      3695  0.0  0.2  77064  2132 pts/0    S    13:18   0:00 su -s /bin/sh -c exec "$0" "$@" signalfx-agent -- /usr/bin/signalfx-agent
signalf+  3696  1.3  3.0 145144 31048 ?        Ssl  13:18   0:00 /usr/bin/signalfx-agent
signalf+  3704  5.5  2.0 549660 21200 ?        Sl   13:18   0:00 /usr/lib/signalfx-agent/lib64/ld-linux-x86-64.so.2 /usr/lib/signalfx-agent/bin/collectd -f -C /var/run/signalfx-agent/collectd/global/collectd.conf
root      3728  0.0  0.0 112640   968 pts/0    S+   13:19   0:00 grep --color=auto signalfx-agent
[root@ip-10-0-2-189 ~]# service signalfx-agent stop
Stopping signalfx-agent....Stopped
[root@ip-10-0-2-189 ~]# ps aux | grep signalfx-agent
signalf+  2581  0.3 21.8 281656 221816 ?       Ssl  13:15   0:01 /usr/bin/signalfx-agent
signalf+  2691  0.6  2.1 784008 22160 ?        Sl   13:15   0:02 /usr/lib/signalfx-agent/lib64/ld-linux-x86-64.so.2 /usr/lib/signalfx-agent/bin/collectd -f -C /var/run/signalfx-agent/collectd/global/collectd.conf
root      3970  0.0  0.0 112640   968 pts/0    R+   13:21   0:00 grep --color=auto signalfx-agent

The issue might be that the package always installs an init.d service regardless of which service managers are available. Chef attempts to use the OS's native service manager when using the service resource, so it will try to use systemd to manage the service on these systems.

Redis ListLength does not seem to honor DBIndex

Hi there, I'm still troubleshooting to narrow this down to a more precise issue, but we're observing that the Redis integration only reports DBIndex=0 and ignores our configuration of DBIndex=2.

Our agent.yaml file:

signalFxAccessToken: {"#from": "env:ACCESS_TOKEN"}
ingestUrl: {"#from": "env:INGEST_URL", optional: true}
apiUrl: {"#from": "env:API_URL", optional: true}

intervalSeconds: {"#from": "env:INTERVAL_SECONDS", default: 10}

logging:
  level: {"#from": "env:LOG_LEVEL", default: "info"}

monitors:
  - type: collectd/redis
    host: {"#from": "env:REDIS_HOST"}
    port: {"#from": "env:REDIS_PORT", default: 6379}
    sendListLengths:
      - databaseIndex: 2
        keyPattern: '*'
    disableHostDimensions: true
    extraDimensions:
      deploy: {"#from": "env:DEPLOY"}

Within SignalFx, we're seeing keys being reported for db 0. Any chance the db number config is being ignored?

Python is not loading the correct certificate bundle in the docker image

I recently ran into issues running the mongodb monitor with TLS enabled. After several hours of debugging, I finally isolated the issue to the certificate bundle included in the Docker image.

If I download the latest bundle from https://mkcert.org/generate/ and set SSL_CERT_FILE, everything appears to work fine. Without the latest bundle, pymongo throws the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python2.7/site-packages/pymongo/database.py", line 1018, in authenticate
    connect=True)
  File "/lib/python2.7/site-packages/pymongo/mongo_client.py", line 439, in _cache_credentials
    writable_preferred_server_selector)
  File "/lib/python2.7/site-packages/pymongo/topology.py", line 210, in select_server
    address))
  File "/lib/python2.7/site-packages/pymongo/topology.py", line 186, in select_servers
    self._error_message(selector))
pymongo.errors.ServerSelectionTimeoutError: SSL handshake failed: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)

Unfortunately, the error doesn't appear in the SignalFx logs (presumably because collectd or the agent times out before the error is thrown).

For the time being, I've fixed the issue by adding the following commands to a wrapping Dockerfile that extends from the published agent docker image:

RUN curl https://mkcert.org/generate/ -o /etc/ssl/certs/updated-ca-bundle.pem
ENV SSL_CERT_FILE=/etc/ssl/certs/updated-ca-bundle.pem
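As a quick sanity check, Python's `ssl.get_default_verify_paths()` reports which CA bundle the interpreter will actually use, including any `SSL_CERT_FILE` override, so running this inside the image confirms the fix took effect:

```python
# Print the CA bundle locations Python's ssl module resolves at runtime.
# When SSL_CERT_FILE is set (as in the Dockerfile above), it takes precedence
# over the compiled-in OpenSSL default.
import ssl

paths = ssl.get_default_verify_paths()
print("cafile in use:", paths.cafile)          # effective bundle (or None)
print("openssl default:", paths.openssl_cafile)  # compiled-in default path
```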

Proposal/Discussion: SignalFx Kubernetes Controller

Hi,

Having recently dived into the dynamic configuration via Kubernetes annotations, I've personally found it quite painful to configure more complex monitors (via escaped JSON in annotations), and it got me thinking: is it worth implementing a native Kubernetes controller that automatically reconfigures the agents, allowing easier implementation/formatting of more complex monitors?

My thinking is to implement a CustomResourceDefinition of type SignalFxMonitor that would act as a wrapper around the current data structures in the agent. These SignalFxMonitors could then be watched by the controller, which would be configured to automatically rewrite the agent.yaml file stored in a given ConfigMap. This would then kick off the normal dynamic reload of the agent config.
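A purely hypothetical sketch of what such a resource could look like (the API group/version and all field names below are invented for illustration; the spec mirrors the agent's existing monitor config structure):

```yaml
apiVersion: signalfx.com/v1alpha1   # hypothetical API group/version
kind: SignalFxMonitor
metadata:
  name: redis-lists
spec:
  type: collectd/redis
  discoveryRule: container_image =~ "redis" && port == 6379
  config:
    sendListLengths:
      - databaseIndex: 2
        keyPattern: '*'
```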

The Helm chart for Grafana implements this concept through a sidecar that watches for specifically labeled ConfigMaps, but I figure implementing an actual type for this would be cleaner.

I figured I'd post this here instead of forging ahead, just in case this is something you would consider making "official".

Cheers
DD

EDIT: This could also be extended to other areas like SignalFxObserver objects for dynamically configuring observers.
