
newrelic / nri-prometheus


Fetch metrics in the Prometheus metrics format, inside or outside Kubernetes, and send them to the New Relic Metrics platform.

License: Apache License 2.0


nri-prometheus's Introduction

New Relic Open Source community plus project banner.

New Relic Prometheus OpenMetrics integration

🚧 Important Notice

The Prometheus OpenMetrics integration for Kubernetes has been replaced by the Prometheus Agent.

See how to install the Prometheus agent to understand its benefits and get full visibility into your Prometheus workloads running in a Kubernetes cluster.

If you need to migrate from the Prometheus OpenMetrics integration to the Prometheus agent, check the following migration guide.

Fetch metrics in the Prometheus metrics format, inside or outside Kubernetes, and send them to the New Relic platform.

Installation and usage

For documentation about how to use the integration, refer to our documentation website.

Find out more about Prometheus and New Relic in this blog post.

Helm chart

You can install this chart using nri-bundle located in the helm-charts repository or directly from this repository by adding this Helm repository:

helm repo add nri-prometheus https://newrelic.github.io/nri-prometheus
helm upgrade --install nri-prometheus nri-prometheus/nri-prometheus -f your-custom-values.yaml

For further information about the configuration needed for the chart, read the chart's README.
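As a sketch only, a minimal your-custom-values.yaml could look like the following; the value names (licenseKey, cluster) are assumptions here, so confirm them against the chart's README:

licenseKey: <YOUR_LICENSE_KEY>
cluster: <YOUR_CLUSTER_NAME>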

Building

Go is required to build the integration. We recommend Go 1.11 or higher.

This integration requires having a Kubernetes cluster available to deploy and run it. For development, we recommend using Docker, Minikube, and skaffold.

After cloning this repository, go to the directory of the Prometheus integration and build it:

make

The command above executes the tests for the Prometheus integration and builds an executable file called nri-prometheus under the bin directory.

To start the integration, run nri-prometheus:

./bin/nri-prometheus

To learn more about the usage of ./bin/nri-prometheus, pass the -help flag:

./bin/nri-prometheus -help

External dependencies are managed with the govendor tool. All external dependencies must be locked to a specific version (when possible) in the vendor directory.
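For example, to pin a dependency to a specific version (a sketch; check the govendor documentation for the exact syntax supported by your version):

govendor fetch github.com/prometheus/common@v0.4.0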

Build the Docker image

If you wish to push your own version of the image to a Docker registry, you can build it with:

IMAGE_NAME=<YOUR_IMAGE_NAME> make docker-build

Then push it with docker push:
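docker push <YOUR_IMAGE_NAME>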

Executing the integration in a development cluster

  • You need to configure how to deploy the integration in the cluster. Copy deploy/local.yaml.example to deploy/local.yaml and edit the placeholders.
  • To get the New Relic license key, visit: https://newrelic.com/accounts/<YOUR_ACCOUNT_ID>. It's located in the right sidebar.
  • After updating the yaml file, you need to compile the integration: GOOS=linux make compile-only.
  • Once you have it compiled, you need to deploy it in your Kubernetes cluster: skaffold run

Running the Kubernetes Target Retriever locally

It can be useful to run the Kubernetes Target Retriever locally against a remote or local cluster to debug which endpoints are discovered. The binary located in cmd/k8s-target-retriever is made for this.

To run the program, run the following command in your terminal:

# ensure your kubectl is configured correctly & against the correct cluster
kubectl config get-contexts
# run the program
go run cmd/k8s-target-retriever/main.go

Testing

To run the tests execute:

make test

At the moment, tests are totally isolated and you don't need a cluster to run them.

Support

Should you need assistance with New Relic products, you are in good hands with several support diagnostic tools and support channels.

New Relic offers NRDiag, a client-side diagnostic utility that automatically detects common problems with New Relic agents. If NRDiag detects a problem, it suggests troubleshooting steps. NRDiag can also automatically attach troubleshooting data to a New Relic Support ticket.

If the issue has been confirmed as a bug or is a feature request, please file a GitHub issue.

Support Channels

Privacy

At New Relic we take your privacy and the security of your information seriously, and are committed to protecting your information. We must emphasize the importance of not sharing personal data in public forums, and ask all users to scrub logs and diagnostic information for sensitive information, whether personal, proprietary, or otherwise.

We define “Personal Data” as any information relating to an identified or identifiable individual, including, for example, your name, phone number, post code or zip code, Device ID, IP address, and email address.

For more information, review New Relic’s General Data Privacy Notice.

Contribute

We encourage your contributions to improve this project! Keep in mind that when you submit your pull request, you'll need to sign the CLA via the click-through using CLA-Assistant. You only have to sign the CLA one time per project.

If you have any questions, or to execute our corporate CLA (which is required if your contribution is on behalf of a company), drop us an email at [email protected].

A note about vulnerabilities

As noted in our security policy, New Relic is committed to the privacy and security of our customers and their data. We believe that providing coordinated disclosure by security researchers and engaging with the security community are important means to achieve our security goals.

If you believe you have found a security vulnerability in this project or any of New Relic's products or websites, we welcome and greatly appreciate you reporting it to New Relic through our bug bounty program.

If you would like to contribute to this project, review these guidelines.

To all contributors, we thank you! Without your contribution, this project would not be what it is today.

License

nri-prometheus is licensed under the Apache 2.0 License.

nri-prometheus's People

Contributors

alejandrodnm, alvarocabanas, ardias, areina, arvdias, carlosroman, davidgit, dependabot-preview[bot], dependabot[bot], gsanchezgavier, invidian, jlegoff, jorik, juanjjaramillo, kang-makes, lorgan3, mangulonr, marcsanmi, matiasburni, mlong-nr, newrelic-coreint-bot, nr-security-github, paologallinaharbur, renovate[bot], roobre, sigilioso, smcavallo, snyk-bot, vihangm, xqi-nr


nri-prometheus's Issues

`make validate` fails on "golangci-lint --version"

The golangci-lint command doesn't appear to have a --version flag:

Error: unknown flag: --version

If this is an earlier version of golangci-lint that does have this flag, it doesn't appear to be vendored.

413 response on telemetry SDK

Description

Some requests are being rejected by the Telemetry API with a 413 error. These payloads are dropped and the metrics lost.

Expected Behavior

The payloads should be correctly chunked by the telemetry SDK.

NR Diag results

Steps to Reproduce

In a high-load environment (where the chance of reproducing it increases), look at the log messages. We started to see this error with more than 400 targets.

Your Environment

Additional context

A PR has been submitted to the SDK. After it is released, the dependency should be updated.
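For illustration, the expected chunking behaviour amounts to something like the following Go sketch; the metric type and send function are stand-ins, not the SDK's real API:

// postChunked splits a batch in half and retries each half whenever the
// Metric API answers 413 (payload too large), instead of dropping the data.
func postChunked(metrics []metric, send func([]metric) (status int)) {
    if len(metrics) == 0 {
        return
    }
    if send(metrics) != 413 {
        return // delivered, or failed for a reason splitting cannot fix
    }
    if len(metrics) == 1 {
        return // a single oversized metric cannot be split further
    }
    half := len(metrics) / 2
    postChunked(metrics[:half], send)
    postChunked(metrics[half:], send)
}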

issue with metrics on 2.0.0

Hello,

We've recently updated our clusters to use the new 2.0.0 image of nri-prometheus, and we're seeing a lot of errors being dumped in the pod logs, as well as an absence of metrics on our dashboards since the change.

We're seeing a large number of errors similar to this, for various different metric names:

2020/07/09 21:42:45.138144 {"err":"invalid float is NaN","message":"invalid gauge field","name":"apiserver_request_latencies_summary"}

I've left our configuration as it was in version 1.5.0. Here are the contents of the file:

    # The name of your cluster. It's important to match other New Relic products to relate the data.
    cluster_name: eks-observability-hvuv-npge
    # How often the integration should run. Defaults to 30s.
    scrape_duration: "1m"
    # The HTTP client timeout when fetching data from endpoints. Defaults to 5s.
    scrape_timeout: "20s"
    # Whether the integration should run in verbose mode or not. Defaults to false.
    verbose: false
    # Whether the integration should skip TLS verification or not. Defaults to false.
    insecure_skip_verify: false
    # The label used to identify scrapable targets. Defaults to "prometheus.io/scrape".
    scrape_enabled_label: "prometheus.io/scrape"
    # Whether k8s nodes need to be labelled to be scraped or not. Defaults to true.
    require_scrape_enabled_label_for_nodes: false

    targets:
     - description: Kubernetes API Server
       urls: []

    #targets:
    #  - description: Secure etcd example
    #    urls: ["https://192.168.3.1:2379", "https://192.168.3.2:2379", "https://192.168.3.3:2379"]
    #    tls_config:
    #      ca_file_path: "/etc/etcd/etcd-client-ca.crt"
    #      cert_file_path: "/etc/etcd/etcd-client.crt"
    #      key_file_path: "/etc/etcd/etcd-client.key"

    # Proxy to be used by the emitters when submitting metrics. It should be
    # in the format [scheme]://[domain]:[port].
    # The emitter is the component in charge of sending the scraped metrics.
    # This proxy won't be used when scraping metrics from the targets.
    # By default it's empty, meaning that no proxy will be used.
    emitter_proxy: http://proxy.ebiz.verizon.com:80/

    # Certificate to add to the root CA that the emitter will use when
    # verifying server certificates.
    # If left empty, TLS uses the host's root CA set.
    # emitter_ca_file: "/path/to/cert/server.pem"

    # Whether the emitter should skip TLS verification when submitting data.
    # Defaults to false.
    # emitter_insecure_skip_verify: false

    # Histogram support is based on New Relic's guidelines for higher
    # level metrics abstractions https://github.com/newrelic/newrelic-exporter-specs/blob/master/Guidelines.md.
    # To better support visualization of this data, percentiles are calculated
    # based on the histogram metrics and sent to New Relic.
    # By default, the following percentiles are calculated: 50, 95 and 99.
    #
    percentiles:
      - 50
      - 90
      - 95
      - 99

    transformations:
      - description: "General processing rules"
      #  rename_attributes:
      #    - metric_prefix: ""
      #      attributes:
      #        container_name: "containerName"
      #        pod_name: "podName"
      #        namespace: "namespaceName"
      #        node: "nodeName"
      #        container: "containerName"
      #        pod: "podName"
      #        deployment: "deploymentName"
      #  ignore_metrics:
      #    # Metrics on pods and containers are being ignored as they are already collected by the New Relic Kubernetes Integration.
      #    - except:
      #      - kube_hpa_
      #      - kube_daemonset_
      #      - kube_statefulset_
      #      - kube_endpoint_
      #      - kube_service_
      #      - kube_limitrange
      #      - kube_node_
      #      - kube_poddisruptionbudget_
      #      - kube_resourcequota
      #      - nr_stats
        copy_attributes:
          - from_metric: "kube_hpa_labels"
            to_metrics: "kube_hpa_"
            match_by:
              - namespace
              - hpa
          - from_metric: "kube_daemonset_labels"
            to_metrics: "kube_daemonset_"
            match_by:
              - namespace
              - daemonset
          - from_metric: "kube_statefulset_labels"
            to_metrics: "kube_statefulset_"
            match_by:
              - namespace
              - statefulset
          - from_metric: "kube_endpoint_labels"
            to_metrics: "kube_endpoint_"
            match_by:
              - namespace
              - endpoint
          - from_metric: "kube_service_labels"
            to_metrics: "kube_service_"
            match_by:
              - namespace
              - service
          - from_metric: "kube_node_labels"
            to_metrics: "kube_node_"
            match_by:
              - namespace
              - node

We have another deployment, purely scraping the Kubernetes API Server, that we've upgraded and which is facing the same problem. It has the same config, save for a different scrape tag and having the target URL filled in.

ignore_metrics doesn't work?

I'm trying to make use of the ignore_metrics parameter, but it doesn't seem to be working. For example, I'm trying to include only apiserver_requests_total as a test (in addition to the other kube* stuff). When I apply this to my cluster, I still see over 200 unique metric names showing up.

Am I misunderstanding the use of this option?

ignore_metrics:
  - except:
    - kube_hpa_
    - kube_daemonset_
    - kube_statefulset_
    - kube_endpoint_
    - kube_service_
    - kube_limitrange
    - kube_node_
    - kube_poddisruptionbudget_
    - kube_resourcequota
    - nr_stats
    - apiserver_request_total
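For reference, the except rule as described in the README's example configs is prefix-based: a metric is kept only when its name starts with one of the listed prefixes, and everything else is dropped. A minimal Go sketch of that rule (strings is from the standard library):

// keep reports whether a metric survives an "except" filter: true only when
// its name starts with one of the listed prefixes.
func keep(name string, except []string) bool {
    for _, prefix := range except {
        if strings.HasPrefix(name, prefix) {
            return true
        }
    }
    return false
}

One thing worth checking: in the example configurations elsewhere on this page, ignore_metrics is nested under a transformations entry, while the snippet above is top-level; that placement difference may be why the option appears to have no effect.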

How to get metrics\telemetry from nri-prometheus container itself?

Summary

We are moving the nri-prometheus component to production. Sometimes in our dev/qa clusters we observe many restarts of the nri-prometheus pod and high memory consumption. It would be great to have some telemetry from the component to get more insight. It would also be nice to have the standard Go runtime metrics.

Desired Behavior

The component has metrics for detailed monitoring. We can use those metrics for sizing predictions and troubleshooting.

Possible Solution

Expose an HTTP /metrics endpoint with metrics in the standard Prometheus format (Go runtime and component-specific).
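As a minimal sketch of what this request amounts to, using client_golang (the port is an assumption):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // promhttp.Handler serves the default registry, which already includes
    // the Go runtime and process collectors.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}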

Additional context

Metrics of type "counter" with labels of differing values are not shipped to New Relic

Background
I am currently attempting to scrape Pulsar component Prometheus metrics and push them into New Relic.

  • nri-prometheus image version: newrelic/nri-prometheus:1.3.0

  • config map:

scrape_duration: "20s"
verbose: false
insecure_skip_verify: true
scrape_enabled_label: "prometheus.io/scrape"
require_scrape_enabled_label_for_nodes: true

Issue 1 🚨
It seems that nri-prometheus is unable to process counter metrics whose labels have differing values:

Does not arrive at New Relic 👎

bookie_journal_JOURNAL_SYNC_count{success="false"} 0
bookie_journal_JOURNAL_SYNC_count{success="true"} 24684154

Arrives at New Relic 👍

bookie_WRITE_BYTES 20991304885

Issue 2 🚨
We also see tonnes of error messages in the logs for metrics of type summary with NaN values.

Example

{"err":"invalid float is NaN","message":"invalid gauge field","name":"bookie_journal_JOURNAL_CREATION_LATENCY.percentiles"}

And the corresponding metric values:

# TYPE bookie_journal_JOURNAL_CREATION_LATENCY summary
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.5"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.75"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.95"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.99"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.999"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="0.9999"} NaN
bookie_journal_JOURNAL_CREATION_LATENCY{success="false",quantile="1.0"} -Infinity

Help regarding understanding both the above behaviours would be greatly appreciated :)!

Make HarvestTimeout configurable

Is your feature request related to a problem? Please describe.

Sometimes the harvest takes more time than the default timeout. The SDK provides a parameter to increase/decrease this timeout that we are not leveraging.

Feature Description

Make HarvestTimeout configurable
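A fragment sketch of what the configuration could look like with the Go telemetry SDK (assuming the usual imports: os, time, and the SDK's telemetry package); the HarvestTimeout field name is an assumption based on the problem statement above:

h, err := telemetry.NewHarvester(
    telemetry.ConfigAPIKey(os.Getenv("NEW_RELIC_INSERT_KEY")),
    func(cfg *telemetry.Config) {
        // Assumption: the SDK's Config exposes the harvest timeout directly.
        cfg.HarvestTimeout = 30 * time.Second
    },
)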

Priority

[Nice to Have]

Document missing config options

Is your feature request related to a problem? Please describe.

There are some undocumented config options in the yaml files and in the docs. Some options could be helpful for customers.

Feature Description

Review the config options and document the missing ones.

Priority

Nice to Have

Prometheus Federation

Hi there,

Is it possible to use this to federate specific metrics from a Prometheus server instead of scraping from each service? I'd like to set this up so we don't have to configure/align multiple ways to scrape from services, and rather just maintain one scraping path to Prometheus, and then one path from Prometheus to New Relic.

I know one problem with this approach is that the Prometheus server doesn't store metric types -- how does New Relic handle metrics without types? Is this something that would be worth contributing?

Prometheus with Vault Integration

Hello,

I am trying to utilize this Docker image to scrape Prometheus metrics for a Vault cluster, which exposes the /sys/metrics endpoint. The only issue is that this endpoint is authenticated, which would require the scrape request to contain the X-Vault-Token: ... header.

It would be useful if we could specify a header to be set in the config.yaml file. Ideally this value may be dynamic -- in this case the Vault token expires every hour. I'm not sure how you'd want to handle this case. I'd be glad to submit a change upstream if you have a solution but don't have time to work on this.
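For illustration, the requested configuration could look like the hypothetical sketch below; the headers key is invented here and is not a supported option today:

targets:
  - description: Vault metrics (hypothetical syntax)
    urls: ["https://vault.example.com:8200/v1/sys/metrics?format=prometheus"]
    headers:
      X-Vault-Token: "<token>"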

Thanks!

Change the autodiscovery behaviour of services

Is your feature request related to a problem? Please describe.

Currently, when a service is discovered and has to be monitored, it is scraped directly.
nri-prometheus will monitor the endpoint returned by the service, which can change over time and can lead to misleading data.

I believe it would be better to retrieve the endpoints associated with the service and scrape those instead of the service.

(Screenshot: Screen Shot 2020-09-16 at 4.11.34 PM)

In purple, the value scraped from the service jumps between the values of different endpoints.

Feature Description

Have the possibility to scrape all the endpoints of a discovered service and not the service itself.

Describe Alternatives

Currently the alternative is to place the annotation at the pod level, but in some contexts that is not convenient.

Priority

[ Really Want ]

Update prometheus and telemetry-sdk

Right now POMI uses quite old versions of github.com/prometheus/common, github.com/prometheus/client_golang, and github.com/newrelic/newrelic-telemetry-sdk-go.

It would be nice to bump those versions to benefit from potential performance improvements and bug fixes.

Metrics not exported in 1.1.0

I've recently upgraded from version 0.10.3 to 1.0.0 (see issue #13) and I noticed that the metrics we provide are no longer scraped by NRI. Our service is annotated with prometheus.io/scrape: true and is being scraped by Prometheus and the NRI scraper version 0.10.3.

"metadata": {
    "name": "iot4i-prometheus-exporter-service",
    "namespace": "iot4i",
    "selfLink": "/api/v1/namespaces/iot4i/services/iot4i-prometheus-exporter-service",
    "uid": "5efc240f-2e10-11e9-8b01-0e1261484d27",
    "resourceVersion": "20017396",
    "creationTimestamp": "2019-02-11T15:19:05Z",
    "labels": {
      "app": "iot4i-prometheus-exporter",
      "chart": "iot4i-prometheus-exporter-0.1.1",
      "heritage": "Tiller",
      "release": "iot4iiot4i"
    },
    "annotations": {
      "prometheus.io/scrape": "true"
    }
  }

I deployed the nri-prometheus with all the default values (changing the cluster name and license key of course). So scrape_enabled_label is set to "prometheus.io/scrape". If I change this to use a label specific to my service (ex: iot4i-prometheus-exporter), then the NRI logs show Target list for fetching metrics is empty.

What am I doing wrong here? Thank you.

feature: support for replicas

I apologize if this isn't the right format or location for a feature request, but I didn't see any clear guidance on how to submit one.

Problem statement

Currently, the nri-prometheus deployment doesn't have any form of quorum or primary/secondaries. This prevents users from properly using replicas to manage additional load due to increases in scrapes and metric counts.

Current solution

In order to handle overload on the nri-prometheus pod, users need to create separate deployments with unique targets and/or scrape tags, manually slicing up their overall metric workload. Otherwise, the replicas will all try to scrape the same metrics and offer no real gain.

Suggestion

Implement a solution allowing nri-prometheus to support multiple replicas and intelligently divide up workloads so that users do not need to maintain separate deployments targeting unique endpoints.
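One possible shape of such a solution, sketched in Go (an illustration, not an implemented feature): each replica deterministically claims a shard of the target set.

package main

import "hash/fnv"

// shouldScrape returns true when a target's name hashes onto this replica's
// shard, so that N replicas partition the target set without coordination.
func shouldScrape(target string, replicaIndex, replicaCount int) bool {
    h := fnv.New32a()
    h.Write([]byte(target))
    return int(h.Sum32())%replicaCount == replicaIndex
}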

Modify pomi packaging

The current asset names don't exactly match the names used in other integrations. We need to change that in order to facilitate embedding POMI in the Infra agent.

[CICD] Currently not setting the version of the integration

When moving from 2.2.0 to 2.3.0, the integration version is no longer populated and is left with a generic "dev".

This can create confusion and make debugging more difficult.
Moreover, we should add a flag to simply print this information and exit.

(Screenshot: Screen Shot 2020-12-09 at 4.19.56 PM)

New Relic Prometheus OpenMetrics integration (Docker)

I am trying to investigate how to monitor containers and container hosts (Docker) via New Relic. In our present scenario, all container-host instances run the Prometheus node_exporter; additionally, they run the Docker daemon metrics exporter. I came across the following documentation, which explains how to scrape these metrics into New Relic: https://docs.newrelic.com/docs/integrations/prometheus-integrations/prometheus-docker/new-relic-prometheus-openmetrics-integration-docker.

I need some help configuring "nri-prometheus-latest.yaml".

  1. The node_exporter and the Docker daemon exporter are running on their standard ports: "port1" for node_exporter and "port 2" for the Docker daemon exporter. Where should I provide these ports in the config file?

Not getting data for "counter" metrics

Hey, I'm using an exporter with newrelic/nri-prometheus:1.2.2 and Docker. I am receiving plenty of data in metrics; however, none of the metric data types are counters.

When using verbose: true in my config.yaml, I don't see any of the counter metrics being sent to NR in the logs. Not even the counters on the nri-prometheus/metrics endpoint.

For example, I do get data for "promhttp_metric_handler_requests_in_flight" (gauge), but I don't get data for "promhttp_metric_handler_requests_total" (counter).

I am not using any emitter or transformations.

Any help is appreciated!

Workers load test

  • For a fixed number of targets and CPUs, increase the number of workers and chart the total scraping time.
    Run a test with a high number of targets and workers. Understand what a safe range for the number of workers is, given different CPU configurations (assuming no latencies).
  • Document suggestions for the recommended number of workers for a given number of targets. Add a target scrape duration check to the troubleshooting guide to identify targets with high latency that could block workers.

Load test scenarios

Load test scenarios to answer: how many targets per CPU core and per GB of RAM?

  • Fix the CPU of the POMI pod and keep increasing the number of services (exporters) until the scrape interval overflows. At that point, the number of services is the metric (related to response time and number of metrics). Tabulate the results to chart them and write the targets/CPU recommendation. We should also fix the target response time.
  • (low priority) For a large number of CPUs, look at the memory consumption of POMI for a large number of services.
  • Document the results in the form of suggestions to add to the docs.

Increasing the number of workers causes memory usage to skyrocket

Description

Increasing the number of workers to overcome network latency causes nri-prometheus' memory usage to skyrocket to several gigabytes.

Expected Behavior

Memory usage should remain reasonable with respect to the number of workers.

Additional context

This is caused by the Telemetry SDK harvester not harvesting the metrics scraped by the workers quickly enough. A workaround could be to reduce the harvest period to a very small value; however, that would needlessly put load on the client and server even in scenarios with fewer workers.

This was discovered as a part of #108

dropping metrics

2020/04/20 14:55:39.718963 {"err":"error posting data: Post https://metric-api.eu.newrelic.com/metric/v1/mynamespace: context deadline exceeded"}
2020/04/20 14:55:39.719000 {"context-error":"context deadline exceeded","event":"harvest cancelled or timed out","message":"dropping data"}
2020/04/20 14:55:51.682807 {"err":"error posting data: Post https://metric-api.eu.newrelic.com/metric/v1/mynamespace: context deadline exceeded"}
2020/04/20 14:55:51.682916 {"context-error":"context deadline exceeded","event":"harvest cancelled or timed out","message":"dropping data"}

Shouldn't it retry sending to the metric API instead of dropping the data? Is there a way to prevent this behavior? I see those errors very often.
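A retry with backoff, sketched below in Go, is roughly what is being asked for; this is a hypothetical wrapper, not current behavior (time is from the standard library):

// postWithRetry retries a failing post with exponential backoff instead of
// dropping the payload on the first context timeout.
func postWithRetry(post func() error, attempts int) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = post(); err == nil {
            return nil
        }
        time.Sleep(time.Duration(1<<uint(i)) * time.Second) // 1s, 2s, 4s, ...
    }
    return err
}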

Old metrics still visible

It has been reported that when targets are no longer present and not scraped anymore, they keep appearing in the metrics exposed.

We should check all metrics and decide whether it makes sense to update some of them.

This is confusing, since it might lead one to think that old targets are still being scraped.

Description

It has been reported that when targets are no longer present and not scraped anymore, they keep appearing in the metrics exposed.

This can lead to a situation where the size of the POMI /metrics output grows indefinitely, and significantly so in dynamic environments.

Expected Behavior

Old metrics should be cleaned away when no longer meaningful, by calling reset() as we already do with totalTimeseriesByTargetType.

# HELP nr_stats_metrics_total_timeseries_by_target_type Total number of metrics by type and target
# TYPE nr_stats_metrics_total_timeseries_by_target_type gauge
nr_stats_metrics_total_timeseries_by_target_type{target="kube-dns",type="counter"} 22
nr_stats_metrics_total_timeseries_by_target_type{target="kube-dns",type="gauge"} 39
nr_stats_metrics_total_timeseries_by_target_type{target="kube-dns",type="histogram"} 5
nr_stats_metrics_total_timeseries_by_target_type{target="kube-dns",type="summary"} 1
nr_stats_metrics_total_timeseries_by_target_type{target="test-kube-state-metrics",type="counter"} 21
nr_stats_metrics_total_timeseries_by_target_type{target="test-kube-state-metrics",type="gauge"} 1731

After deleting test-kube-state-metrics service

# HELP nr_stats_metrics_total_timeseries_by_target_type Total number of metrics by type and target
# TYPE nr_stats_metrics_total_timeseries_by_target_type gauge
nr_stats_metrics_total_timeseries_by_target_type{target="kube-dns",type="counter"} 22
nr_stats_metrics_total_timeseries_by_target_type{target="kube-dns",type="gauge"} 39
nr_stats_metrics_total_timeseries_by_target_type{target="kube-dns",type="histogram"} 5
nr_stats_metrics_total_timeseries_by_target_type{target="kube-dns",type="summary"} 1

Steps to Reproduce

Check the /metrics output of POMI and delete one of the targets. It will not be scraped anymore; however, the metric will still be there.

# HELP nr_stats_integration_fetch_target_duration_seconds The total time in seconds to fetch the metrics of a target
# TYPE nr_stats_integration_fetch_target_duration_seconds gauge
nr_stats_integration_fetch_target_duration_seconds{target="localhost:8080"} 0.002163734
nr_stats_integration_fetch_target_duration_seconds{target="test-kube-state-metrics"} 0.008961112

After deleting test-kube-state-metrics service

# HELP nr_stats_integration_fetch_target_duration_seconds The total time in seconds to fetch the metrics of a target
# TYPE nr_stats_integration_fetch_target_duration_seconds gauge
nr_stats_integration_fetch_target_duration_seconds{target="localhost:8080"} 0.001516932
nr_stats_integration_fetch_target_duration_seconds{target="test-kube-state-metrics"} 0.008961112

The metric is still present and not updated.
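A minimal sketch of the expected fix using client_golang (the prometheus package), assuming the self-metrics are backed by a GaugeVec: Reset() drops every labelled series, so targets that disappeared stop being exported instead of freezing at their last value.

var fetchDuration = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "nr_stats_integration_fetch_target_duration_seconds",
        Help: "The total time in seconds to fetch the metrics of a target",
    },
    []string{"target"},
)

func record(durations map[string]float64) {
    // Clear all previously exported series, then repopulate only the
    // targets seen in the current scrape cycle.
    fetchDuration.Reset()
    for target, seconds := range durations {
        fetchDuration.WithLabelValues(target).Set(seconds)
    }
}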

Summary metric should do delta calculation for summary metric

Description

Due to the "sum" not being delta calculated like "count" we cannot perform a proper average over boths values.

Expected Behavior

"sum" value in the summary metric should be delta calculated. See histogram for how it should be done

question regarding scrape interval attribute in nri-prometheus-latest.yaml

I have a question regarding an attribute in nri-prometheus-latest.yaml provided in the documentation: scrape interval. I want to know at what time interval I am receiving insights in New Relic, and if we want to increase the interval, which setting needs to be customized?

# How often the integration should run. Defaults to 30s.
# scrape_duration: "30s"
# The HTTP client timeout when fetching data from endpoints. Defaults to 5s.
# scrape_timeout: "5s"

EMITTERS=stdout logs some default struct format

The public docs suggest:

To see the exact data that is being sent to the Metric API, set the EMITTERS environment variable to "api,stdout".

I set EMITTERS=stdout, and observed that the output was affected, but it looks like some sort of default output format for a struct:

time="2019-10-28T23:01:36Z" level=debug msg="Starting fetch process..." component=fetcher
time="2019-10-28T23:01:36Z" level=debug msg="fetching URL: {http   localhost:8080 /metrics  false  }" component=Fetcher target="localhost:8080"
time="2019-10-28T23:01:36Z" level=debug msg="Finished fetch process." component=fetcher
[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}]

I tried playing with the verbose flag in config.yml and setting emitters to EMITTERS=api,stdout; neither had any effect on the logged default-struct output.

[Repolinter] Open Source Policy Issues

Repolinter Report

🤖This issue was automatically generated by repolinter-action, developed by the Open Source and Developer Advocacy team at New Relic. This issue will be automatically updated or closed when changes are pushed. If you have any problems with this tool, please feel free to open a GitHub issue or give us a ping in #help-opensource.

This Repolinter run generated the following results:

❗ Error: 0, ❌ Fail: 1, ⚠️ Warn: 0, ✅ Pass: 6, Ignored: 0, Total: 7

Fail #

code-of-conduct-file-does-not-exist #

New Relic has moved the CODE_OF_CONDUCT file to a centralized location where it is referenced automatically by every repository in the New Relic organization. Because of this change, any other CODE_OF_CONDUCT file in a repository is now redundant and should be removed. Note that you will need to adjust any links to the local CODE_OF_CONDUCT file in your documentation to point to the central file (README and CONTRIBUTING will probably have links that need updating). For more information please visit https://docs.google.com/document/d/1y644Pwi82kasNP5VPVjDV8rsmkBKclQVHFkz8pwRUtE/view. Found files. Below is a list of files or patterns that failed:

  • CODE_OF_CONDUCT.md
    • 🔨 Suggested Fix: Remove file

Passed #

Click to see rules

license-file-exists #

Found file (LICENSE). New Relic requires that all open source projects have an associated license contained within the project. This license must be permissive (e.g. non-viral or copyleft), and we recommend Apache 2.0 for most use cases. For more information please visit https://docs.google.com/document/d/1vML4aY_czsY0URu2yiP3xLAKYufNrKsc7o4kjuegpDw/edit.

readme-file-exists #

Found file (README.md). New Relic requires a README file in all projects. This README should give a general overview of the project, and should point to additional resources (security, contributing, etc.) where developers and users can learn further. For more information please visit https://github.com/newrelic/open-by-default.

readme-starts-with-community-plus-header #

The first 5 lines contain all of the requested patterns. (README.md). The README of a community plus project should have a community plus header at the start of the README. If you already have a community plus header and this rule is failing, your header may be out of date, and you should update your header with the suggested one below. For more information please visit https://opensource.newrelic.com/oss-category/.

readme-contains-link-to-security-policy #

Contains a link to the security policy for this repository (README.md). New Relic recommends putting a link to the open source security policy for your project (https://github.com/newrelic/<repo-name>/security/policy or ../../security/policy) in the README. For an example of this, please see the "a note about vulnerabilities" section of the Open By Default repository. For more information please visit https://nerdlife.datanerd.us/new-relic/security-guidelines-for-publishing-source-code.

readme-contains-discuss-topic #

Contains a link to the appropriate discuss.newrelic.com topic (README.md). New Relic recommends directly linking the your appropriate discuss.newrelic.com topic in the README, allowing developer an alternate method of getting support. For more information please visit https://nerdlife.datanerd.us/new-relic/security-guidelines-for-publishing-source-code.

third-party-notices-file-exists #

Found file (THIRD_PARTY_NOTICES.md). A THIRD_PARTY_NOTICES.md file can be present in your repository to grant attribution to all dependencies being used by this project. This document is necessary if you are using third-party source code in your project, with the exception of code referenced outside the project's compiled/bundled binary (ex. some Java projects require modules to be pre-installed in the classpath, outside the project binary and therefore outside the scope of the THIRD_PARTY_NOTICES). Please review your project's dependencies and create a THIRD_PARTY_NOTICES.md file if necessary. For JavaScript projects, you can generate this file using the oss-cli. For more information please visit https://docs.google.com/document/d/1y644Pwi82kasNP5VPVjDV8rsmkBKclQVHFkz8pwRUtE/view.

nri-prometheus is failing to read metrics compatible with v2 prometheus format

Hi!

We are getting errors when reading metrics compatible with the v2 Prometheus format.

text format parsing error in line <line number>: second HELP line for metric name <name>

This is probably caused by old libraries being used to parse the response; in vendor.json I see usage of the deprecated https://github.com/prometheus/client_model dated 2018.

Example metrics output:

# HELP connection_seconds Tracks time to obtain connection
# TYPE connection_seconds summary
connection_seconds{outcome="failure",quantile="0.5",} 0.0
connection_seconds{outcome="failure",quantile="0.75",} 0.0
connection_seconds{outcome="failure",quantile="0.95",} 0.0
connection_seconds{outcome="failure",quantile="0.98",} 0.0
connection_seconds{outcome="failure",quantile="0.99",} 0.0
connection_seconds{outcome="failure",quantile="0.999",} 0.0
connection_count{outcome="failure",} 0.0
connection_seconds_total{outcome="failure",} 0.0
# HELP connection_seconds Tracks time to obtain connection
# TYPE connection_seconds summary
connection_seconds{outcome="success",quantile="0.5",} 0.0
connection_seconds{outcome="success",quantile="0.75",} 0.0
connection_seconds{outcome="success",quantile="0.95",} 0.0
connection_seconds{outcome="success",quantile="0.98",} 0.0
connection_seconds{outcome="success",quantile="0.99",} 0.0
connection_seconds{outcome="success",quantile="0.999",} 0.0
connection_count{outcome="success",} 333.0
connection_seconds_total{outcome="success",} 0.05

Services are scraped directly instead of service endpoints

While services with the prometheus.io/scrape annotation can be discovered, it appears that nri-prometheus scrapes the service itself and not the service endpoints:

func serviceTarget(s *apiv1.Service, port, path string) Target {
    lbls := labels.Set{}
    hostname := fmt.Sprintf("%s.%s.svc", s.Name, s.Namespace)
    addr := url.URL{
        Scheme: "http",
        Host:   net.JoinHostPort(hostname, port),
        Path:   path,
    }
    for lk, lv := range s.Labels {
        lbls["label."+lk] = lv
    }
    lbls["serviceName"] = s.Name
    lbls["namespaceName"] = s.Namespace
    return New(s.Name, addr, Object{Name: s.Name, Kind: "service", Labels: lbls})
}

Usually you want the service endpoints to be discovered instead. This is useful for applications like kube-dns/CoreDNS that only expose the application port (53) on the service and not the metrics port (9153):

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9153"
    prometheus.io/scrape: "true"
  labels:
    eks.amazonaws.com/component: kube-dns
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: CoreDNS
  name: kube-dns
  namespace: kube-system
#...

In the current configuration, nri-prometheus cannot scrape metrics from kube-dns because the port indicated by prometheus.io/port isn't exposed on the service:

default/nri-prometheus-f99cfb76b-h7hpw[nri-prometheus]: time="2019-10-29T20:26:41Z" level=warning msg="fetching Prometheus: http://kube-dns.kube-system.svc:9153/metrics (kube-dns)" component=Fetcher error="Get http://kube-dns.kube-system.svc:9153/metrics: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"

This also presents a problem for applications that do expose metrics on the same port as the service: nri-prometheus won't have a complete view of the application because it'll only scrape a single pod each scrape interval instead of all pods that back the service.

I could manually update the target list every time a new pod gets deployed but that's not exactly ideal.
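For illustration, resolving the Endpoints object behind the service could look like the following sketch (the client-go usage is assumed, not taken from this repository, and the Get signature varies by client-go version):

import (
    "context"
    "fmt"
    "net"

    apiv1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// endpointTargets builds one scrape URL per ready address behind a service,
// instead of a single URL pointing at the service itself.
func endpointTargets(cs kubernetes.Interface, s *apiv1.Service, port, path string) ([]string, error) {
    ep, err := cs.CoreV1().Endpoints(s.Namespace).Get(context.TODO(), s.Name, metav1.GetOptions{})
    if err != nil {
        return nil, err
    }
    var urls []string
    for _, subset := range ep.Subsets {
        for _, addr := range subset.Addresses {
            urls = append(urls, fmt.Sprintf("http://%s%s", net.JoinHostPort(addr.IP, port), path))
        }
    }
    return urls, nil
}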

summary type metrics are not showing up.

I've noticed that none of the summary-type metrics from the following scrape (all scrapes on the pods in my cluster) show up in NR1/Insights.

https://gist.github.com/fieldju/90cf92976358853245ba1eb99a94523c#file-prom-scrap-txt-L413-L422

Again, I am not seeing any of the sum or count metrics from Micrometer timers.

.*-seconds_count
.*_seconds_sum

This post says that summaries were coming:
#14 (comment)

This pull request says it was adding support:
#15

I am running 1.5.0 of this integration

Controlled By:  ReplicaSet/nri-prometheus-67ff84fbf8
Containers:
  nri-prometheus:
    Container ID:  docker://4c7ea34dc7ac0658d8b3ec45c134da583c20f3fff9abfc038685bdfbc052b02b
    Image:         newrelic/nri-prometheus:1.5.0

Helm deployed agent does not fetch metrics until restart

We're deploying nri-prometheus as a subchart of our own app's Helm chart. For some reason nri-prometheus is not fetching our app's metrics until after we kubectl delete its pod. When the pod is recreated, we can see from its logs that our app's pod gets added to the fetcher.

Any ideas?

Relevant part of our Chart.yaml:

dependencies:
  - name: nri-prometheus
    version: 1.2.0
    repository: https://helm-charts.newrelic.com
    condition: nri-prometheus.install

[Repolinter] Open Source Policy Issues

Repolinter Report

🤖This issue was automatically generated by repolinter-action, developed by the Open Source and Developer Advocacy team at New Relic. This issue will be automatically updated or closed when changes are pushed. If you have any problems with this tool, please feel free to open a GitHub issue or give us a ping in #help-opensource.

This Repolinter run generated the following results:

❗ Error: 0, ❌ Fail: 1, ⚠️ Warn: 0, ✅ Pass: 5, Ignored: 0, Total: 6

Fail #

readme-starts-with-community-plus-header #

The README of a community plus project should have a community plus header at the start of the README. If you already have a community plus header and this rule is failing, your header may be out of date. For more information please visit https://opensource.newrelic.com/oss-category/. Below is a list of files or patterns that failed:

  • README.md: The first 1 lines do not contain the pattern(s): Open source Community Plus header (see https://opensource.newrelic.com/oss-category).
    • 🔨 Suggested Fix: prepend [![Community Plus header](https://github.com/newrelic/opensource-website/raw/master/src/images/categories/Community_Plus.png)](https://opensource.newrelic.com/oss-category/#community-plus) to file

Passed #

Click to see rules

license-file-exists #

Found file (LICENSE). New Relic requires that all open source projects have an associated license contained within the project. This license must be permissive (e.g. non-viral or copyleft), and we recommend Apache 2.0 for most use cases. For more information please visit https://docs.google.com/document/d/1vML4aY_czsY0URu2yiP3xLAKYufNrKsc7o4kjuegpDw/edit.

readme-file-exists #

Found file (README.md). New Relic requires a README file in all projects. This README should give a general overview of the project, and should point to additional resources (security, contributing, etc.) where developers and users can learn further. For more information please visit https://github.com/newrelic/open-by-default.

readme-contains-link-to-security-policy #

Contains a link to the security policy for this repository (README.md). New Relic recommends putting a link to the open source security policy for your project (https://github.com/newrelic/<repo-name>/security/policy or ../../security/policy) in the README. For an example of this, please see the "a note about vulnerabilities" section of the Open By Default repository. For more information please visit https://nerdlife.datanerd.us/new-relic/security-guidelines-for-publishing-source-code.

readme-contains-discuss-topic #

Contains a link to the appropriate discuss.newrelic.com topic (README.md). New Relic recommends directly linking the your appropriate discuss.newrelic.com topic in the README, allowing developer an alternate method of getting support. For more information please visit https://nerdlife.datanerd.us/new-relic/security-guidelines-for-publishing-source-code.

third-party-notices-file-exists #

Found file (THIRD_PARTY_NOTICES.md). A THIRD_PARTY_NOTICES.md file must be present in your repository to grant attribution to all dependencies being used by this project. For JavaScript projects, you can generate this file using the oss-cli. For more information please visit https://docs.google.com/document/d/1y644Pwi82kasNP5VPVjDV8rsmkBKclQVHFkz8pwRUtE/view.

Scraper uses license key in place of API key

Hello,

I am trying to scrape Prometheus metrics and send them to the Insights dashboard via a proxy we have set up. Looking at the code and the docs for the Metrics API endpoint, it looks like the endpoint requires either the Api-Key or X-Insert-Key header to be set, but this parameter gets set from the license_key option. Should this field have been set by an API key parameter in the YAML? In my testing I was able to POST a metric by specifying both the license key and the insert/API key, but received a 400 Bad Request when the API key was omitted.

Also, when I was debugging the above issue in verbose=true, debug=true mode, I found that the logs did not write any messages about emitting metrics; it seemed to fail silently. It would be very useful to have more verbose logging around the emitter. Here's a small snippet of the logs:

time="2019-12-02T22:36:21Z" level=debug msg="fetching URL: {http   localhost:8080 /metrics  false  }" component=Fetcher target="localhost:8080"
time="2019-12-02T22:36:21Z" level=debug msg="Finished fetch process." component=fetcher
[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}]
time="2019-12-02T22:36:21Z" level=debug msg="Starting fetch process..." component=fetcher
time="2019-12-02T22:36:21Z" level=debug msg="fetching URL: {https   localhost:443 /v1/sys/metrics  false format=prometheus }" component=Fetcher target="localhost:443"
time="2019-12-02T22:36:21Z" level=debug msg="Finished fetch process." component=fetcher
[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}]

Revert default emitter_harvest_period to 1s

Description

While merging #118, the default emitter_harvest_period was increased from 1s to 5s. This change was motivated by the fact that it was no longer necessary to harvest so often, since memory is now bounded anyway.

However, for small environments this can increase the memory usage which, while below the limit, might not be desirable.

use common attributes for infra sdk output

Is your feature request related to a problem? Please describe.

Common attributes of the integration SDK are not currently being used, and attributes shared by all metrics are being placed as metric labels instead of in the common section.
At the moment the SDK doesn't have setters implemented for these attributes; there is an issue for this.

Feature Description

Move shared attributes to the common section to avoid duplication.

    "data": [
        {
            "common": {},
            "entity": {
                "name": "ravendb:localhost:9440:database:DemoUser-2dd5861b-2371-43c8-9647-878a15b01af0",
                "displayName": "",
                "type": "RAVENDB_DATABASE",
                "metadata": {}
            },
            "metrics": [
                {
                    "timestamp": 1606819312,
                    "name": "ravendb_database_request_total",
                    "type": "count",
                    "attributes": {
                        "database": "DemoUser-2dd5861b-2371-43c8-9647-878a15b01af0",
                        "integrationName": "nri-prometheus",
                        "integrationVersion": "dev",
                        "nrMetricType": "count",
                        "promMetricType": "counter",
                        "scrapedTargetKind": "user_provided",
                        "scrapedTargetName": "localhost:9440",
                        "scrapedTargetURL": "http://localhost:9440/metrics",
                        "targetName": "localhost:9440"
                    },
                    "value": 696
                },
                {
                    "timestamp": 1606819312,
                    "name": "ravendb_database_indexes",
                    "type": "gauge",
                    "attributes": {
                        "database": "DemoUser-2dd5861b-2371-43c8-9647-878a15b01af0",
                        "integrationName": "nri-prometheus",
                        "integrationVersion": "dev",
                        "nrMetricType": "gauge",
                        "promMetricType": "gauge",
                        "scrapedTargetKind": "user_provided",
                        "scrapedTargetName": "localhost:9440",
                        "scrapedTargetURL": "http://localhost:9440/metrics",
                        "targetName": "localhost:9440"

Describe Alternatives

Keep the current implementation. This is not the best option, since it is much more verbose.

Additional context

The SDK should be documented first, and a clear idea of this feature should be formed before implementing it.

Priority

Please help us better understand this feature request by choosing a priority from the following options:
[Nice to Have, Really Want, Must Have, Blocker]

Prometheus values not queryable in New Relic?

I originally attempted to file this issue via the New Relic issue tracker, and was redirected here.

tl;dr -- the values from the OpenMetrics integration 1.2.2 are showing up in New Relic for me, but they look like a single, JSON-valued column, which doesn't appear to be queryable.

I followed the setup guide for the Prometheus OpenMetrics integration running in k8s ( https://docs.newrelic.com/docs/integrations/prometheus-integrations/prometheus-kubernetes/new-relic-prometheus-openmetrics-integration-kubernetes), and got the integration stood up and copying metrics from several different prometheus instances in our cluster, but the values seem to be coming out in a format that's not queryable in New Relic One or Insights.

As an example of the behavior I'm seeing, I followed some of the query suggestions in the linked setup:

FROM Metric SELECT keySet() WHERE metricName = 'kube_node_status_allocatable_cpu_cores'

I get back a response that includes:

"allKeys": [
"clusterName",
"endTimestamp",
"integrationName",
"integrationVersion",
"k8s.cluster.name",
"kube_node_status_allocatable_cpu_cores",
"label.k8s-app",
"metricName",
"namespaceName",
"newrelic.source",
"node",
"nodeName",
"nrMetricType",
"promMetricType",
"scrapedTargetKind",
"scrapedTargetName",
"scrapedTargetURL",
"serviceName",
"targetName",
"timestamp"
]

None of those keys appear to be the values themselves; however, it looks like kube_node_status_allocatable_cpu_cores is a JSON object that contains the values.

Running a query to select the values:

FROM Metric SELECT * WHERE metricName = 'kube_node_status_allocatable_cpu_cores'

I get, for example:

{
"clusterName": "aws1.k8s.redfintest.com",
"endTimestamp": 1572289479247,
"integrationName": "nri-prometheus",
"integrationVersion": "1.2.2",
"k8s.cluster.name": "aws1.k8s.redfintest.com",
"kube_node_status_allocatable_cpu_cores": {
    "type": "gauge",
    "count": 1,
    "sum": 16,
    "min": 16,
    "max": 16,
    "latest": 16
},
"label.k8s-app": "kube-state-metrics",
"metricName": "kube_node_status_allocatable_cpu_cores",
"namespaceName": "kube-system",
"newrelic.source": "metricAPI",
"node": "ip-10-196-95-237.us-west-2.compute.internal",
"nodeName": "ip-10-196-95-237.us-west-2.compute.internal",
"nrMetricType": "gauge",
"promMetricType": "gauge",
"scrapedTargetKind": "service",
"scrapedTargetName": "kube-state-metrics",
"scrapedTargetURL": "http://kube-state-metrics.kube-system.svc:8080/metrics",
"serviceName": "kube-state-metrics",
"targetName": "kube-state-metrics",
"timestamp": 1572289478247
},

The kube_node_status_allocatable_cpu_cores key has the values I want in it, but they don't appear to be queryable. Trying to query with SELECT kube_node_status_allocatable_cpu_cores.count ... results in an error.
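For what it's worth, dimensional metrics in NRQL are usually read through aggregator functions rather than by selecting the column directly; a hedged example of the kind of query that should return the gauge's value:

FROM Metric SELECT latest(kube_node_status_allocatable_cpu_cores) WHERE clusterName = 'aws1.k8s.redfintest.com' TIMESERIES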

I'm really excited about being able to consolidate all this data into New Relic -- I just have to figure out how to query it!

Software:

  • k8s: 1.11.x
  • nri-prometheus: 1.2.2

My config is:

---
apiVersion: v1
data:
  config.yaml: |
    # The name of your cluster. It's important to match other New Relic products to relate the data.
    cluster_name: "<removed>"

    # How often the integration should run. Defaults to 30s.
    # scrape_duration: "30s"

    # The HTTP client timeout when fetching data from endpoints. Defaults to 5s.
    # scrape_timeout: "5s"

    # Whether the integration should run in verbose mode or not. Defaults to false.
    verbose: true

    # Whether the integration should skip TLS verification or not. Defaults to false.
    insecure_skip_verify: false

    # The label used to identify scrapable targets. Defaults to "prometheus.io/scrape".
    scrape_enabled_label: "prometheus.io/scrape"

    # Whether k8s nodes need to be labelled to be scraped or not. Defaults to true.
    require_scrape_enabled_label_for_nodes: true

    # targets:
    #   - description: Secure etcd example
    #     urls: ["https://192.168.3.1:2379", "https://192.168.3.2:2379", "https://192.168.3.3:2379"]
    #     tls_config:
    #       ca_file_path: "/etc/etcd/etcd-client-ca.crt"
    #       cert_file_path: "/etc/etcd/etcd-client.crt"
    #       key_file_path: "/etc/etcd/etcd-client.key"

    # Proxy to be used by the emitters when submitting metrics. It should be
    # in the format [scheme]://[domain]:[port].
    # The emitter is the component in charge of sending the scraped metrics.
    # This proxy won't be used when scraping metrics from the targets.
    # By default it's empty, meaning that no proxy will be used.
    # emitter_proxy: "http://localhost:8888"

    # Certificate to add to the root CA that the emitter will use when
    # verifying server certificates.
    # If left empty, TLS uses the host's root CA set.
    # emitter_ca_file: "/path/to/cert/server.pem"

    # Whether the emitter should skip TLS verification when submitting data.
    # Defaults to false.
    # emitter_insecure_skip_verify: false

    # Histogram support is based on New Relic's guidelines for higher
    # level metrics abstractions https://github.com/newrelic/newrelic-exporter-specs/blob/master/Guidelines.md.
    # To better support visualization of this data, percentiles are calculated
    # based on the histogram metrics and sent to New Relic.
    # By default, the following percentiles are calculated: 50, 95 and 99.
    #
    # percentiles:
    #   - 50
    #   - 95
    #   - 99

    transformations:
      - description: "General processing rules"
        rename_attributes:
          - metric_prefix: ""
            attributes:
              container_name: "containerName"
              pod_name: "podName"
              namespace: "namespaceName"
              node: "nodeName"
              container: "containerName"
              pod: "podName"
              deployment: "deploymentName"
        ignore_metrics:
          # Ignore all the metrics except the ones listed below.
          # This is a list that complements the data retrieved by the New
          # Relic Kubernetes Integration, that's why Pods and containers are
          # not included, because they are already collected by the
          # Kubernetes Integration.
          - except:
            - kube_hpa_
            - kube_daemonset_
            - kube_statefulset_
            - kube_endpoint_
            - kube_service_
            - kube_limitrange
            - kube_node_
            - kube_poddisruptionbudget_
            - kube_resourcequota
            - nr_stats
    #     copy_attributes:
    #       # Copy all the labels from the timeseries with metric name
    #       # `kube_hpa_labels` into every timeseries with a metric name that
    #       # starts with `kube_hpa_` only if they share the same `namespace`
    #       # and `hpa` labels.
    #       - from_metric: "kube_hpa_labels"
    #         to_metrics: "kube_hpa_"
    #         match_by:
    #           - namespace
    #           - hpa
    #       - from_metric: "kube_daemonset_labels"
    #         to_metrics: "kube_daemonset_"
    #         match_by:
    #           - namespace
    #           - daemonset
    #       - from_metric: "kube_statefulset_labels"
    #         to_metrics: "kube_statefulset_"
    #         match_by:
    #           - namespace
    #           - statefulset
    #       - from_metric: "kube_endpoint_labels"
    #         to_metrics: "kube_endpoint_"
    #         match_by:
    #           - namespace
    #           - endpoint
    #       - from_metric: "kube_service_labels"
    #         to_metrics: "kube_service_"
    #         match_by:
    #           - namespace
    #           - service
    #       - from_metric: "kube_node_labels"
    #         to_metrics: "kube_node_"
    #         match_by:
    #           - namespace
    #           - node
kind: ConfigMap
metadata:
  name: nri-prometheus-cfg
  namespace: newrelic

Thanks in advance.

Modify POMI to report new instrumentation attributes:

We need to report these new attributes:

  • instrumentation.provider: newRelic
  • instrumentation.name: nri-prometheus
  • instrumentation.version: x.y.z

This will be done in the SDK but we also need to update the names of the vars in POMI and add the provider attribute

Load test environment Helm chart

Create a chart that will generate multiple services backed by a few exporter-mock pods.
The services created will be labeled as Prometheus endpoints so that POMI can automatically discover and scrape them. Since these are much lighter than pods, we can create hundreds of them in a local environment and simulate targets that can be auto-discovered.
The Prometheus auto-discovery labels prometheus.io/port and prometheus.io/scrape should be added to the services for the auto-discovery functionality.

If SCRAPE_ENABLED_LABEL is provided, the scraper won't start

I have successfully set up both the infrastructure agent and the NR Scraper using the provided documentation. Both got installed and worked. I checked the link you provided and everything there works for me as long as the scrape label is unset.

For the Scraper I noticed that many metrics get discarded, including metrics we generate, because too many metrics are sent. To solve that, I wanted to filter the metrics sent from specific deployments using the LABEL property. Once I set the label property in the scraper's configuration, the scraper stops working: it crashes at start with the following error:

time="2019-09-23T12:02:37Z" level=info msg="Starting New Relic Prometheus Integration version 0.10.3"
panic: runtime error: integer divide by zero
goroutine 86 [running]:
go.datanerd.us/p/fsi/nri-prometheus-scraper/internal/integration.(*prometheusFetcher).Fetch.func1(0xc000306b00, 0x1ee4398, 0x0, 0x0, 0xc0003670e0)
/go/src/go.datanerd.us/p/fsi/nri-prometheus-scraper/internal/integration/fetcher.go:168 +0x1c7

Removing the label property solves the issue, and the value provided makes no difference; a purely illustrative sketch of the likely failure mode follows.
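
For context, the actual fetcher code isn't shown in this report, but a panic of this shape usually means a computed count (for example, the number of targets left after label filtering) ends up zero and is then used as a divisor or modulo operand. A purely illustrative guard, not the integration's code:

package main

import "fmt"

// pickTarget is illustrative only: in Go, i % 0 panics with
// "integer divide by zero", so the divisor must be guarded.
func pickTarget(targets []string, i int) (string, error) {
    if len(targets) == 0 {
        return "", fmt.Errorf("no targets matched the scrape label")
    }
    return targets[i%len(targets)], nil
}

func main() {
    if _, err := pickTarget(nil, 3); err != nil {
        fmt.Println(err) // no targets matched the scrape label
    }
}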

Note: I was pointed to this repo from https://support.newrelic.com/tickets/370148

Issues when not running in standalone mode

When standalone is false:

  • the default emitter should be infra-sdk
  • the definition_files_path config option should have a default value
  • if you don't provide the definition files, a debug log message shows up for every metric
time="2020-09-23T14:22:12Z" level=debug msg="time=\"2020-09-23T14:22:12Z\" level=debug msg=\"failed to map metric to entity. using 'host' entity\" error=\"no spec files for service: ravendb\"" component=integrations.runner.Group integration_name=nri-prometheus

which spams the debug log and is quite annoying. We should only log once per namespace; a sketch of that pattern follows.
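
A minimal sketch of the once-per-key pattern, assuming a small helper wrapped around the existing debug call; the helper name is hypothetical:

package main

import (
    "log"
    "sync"
)

// seenNamespaces remembers which namespaces we already warned about.
var seenNamespaces sync.Map

// logOncePerNamespace emits the message only the first time a namespace
// fails to map to an entity, instead of once per metric.
func logOncePerNamespace(namespace, msg string) {
    if _, loaded := seenNamespaces.LoadOrStore(namespace, struct{}{}); !loaded {
        log.Printf("namespace=%s %s", namespace, msg)
    }
}

func main() {
    for i := 0; i < 1000; i++ {
        logOncePerNamespace("ravendb", "failed to map metric to entity, using 'host' entity")
    }
    // The message above is printed exactly once.
}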

Fails to run the integration in standalone=false when no definition_file folder is present

Description

If the definition files folder is not present, the integration exits with an error.

Expected Behavior

The integration should continue with no definition files loaded, just as it would for an exporter without a definition file.

NR Diag results

Steps to Reproduce

When the agent is installed, no definition files folder is created, so if you run POMI with standalone=false after a clean installation, the error message appears.

Your Environment

Additional context

The common scenario would be to run the integration with an exporter package installed, which would provide both the configuration file to run the integration and the definition files. A sketch of the expected tolerant loading follows.
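
A sketch of the expected behavior: treat a missing folder exactly like an empty one and keep going. The loader name and the path are hypothetical, not POMI's actual ones:

package main

import (
    "log"
    "os"
    "path/filepath"
)

// loadDefinitionFiles is hypothetical: a missing folder is treated the same
// as an empty folder, so the integration keeps running without definitions.
func loadDefinitionFiles(dir string) ([]string, error) {
    entries, err := os.ReadDir(dir)
    if os.IsNotExist(err) {
        log.Printf("definition files folder %q not found, continuing without definitions", dir)
        return nil, nil
    }
    if err != nil {
        return nil, err
    }
    var files []string
    for _, e := range entries {
        if ext := filepath.Ext(e.Name()); ext == ".yml" || ext == ".yaml" {
            files = append(files, filepath.Join(dir, e.Name()))
        }
    }
    return files, nil
}

func main() {
    files, err := loadDefinitionFiles("/etc/definition-files") // placeholder path
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("loaded %d definition files", len(files))
}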

Getting some, but not all, metrics from AWS MSK (Kafka) cluster using nri-prometheus

We set up the Prometheus OpenMetrics integration with our MSK cluster following the instructions in this article - https://blog.newrelic.com/product-news/monitor-amazon-msk/. The integration seems to be working, because we can see several of the MSK metrics when creating a chart.

Some metrics are missing, though, like the "kafka_server_BrokerTopicMetrics_Count" metrics mentioned in Example 2 of the article. If I go to the /metrics endpoint for the cluster, I can see values for that metric being returned, so I'm not sure why it's not showing up in New Relic.

Histograms and Summaries with no data are noisy

It appears the Prometheus client libraries may emit NaN for histogram buckets or summary quantiles that have no observations. nri-prometheus bails out when it encounters this, wrapping the error:

if !validNRValue(v) {
    err := fmt.Errorf("invalid percentile value for %s: %g", metric.name, v)
    if results == nil {
        results = err
    } else {
        results = fmt.Errorf("%v: %w", err, results)
    }
    continue
}

// validNRValue returns if v is a New Relic metric supported float64.
func validNRValue(v float64) bool {
    return !math.IsInf(v, 0) && !math.IsNaN(v)
}

This can cause the logs to be extremely noisy for high-cardinality series. Here's an example from cert-manager (which isn't easy to troubleshoot because only the metric name is logged, not its labels):

default/nri-prometheus-f99cfb76b-h7hpw[nri-prometheus]: time="2019-10-29T19:53:44Z" level=warning msg="error emitting metrics" component=integration.Execute emitter=telemetry error="invalid percentile value for certmanager_http_acme_client_request_duration_seconds: NaN: invalid percentile value for certmanager_http_acme_client_request_duration_seconds: NaN: invalid percentile value for certmanager_http_acme_client_request_duration_seconds: NaN: [the same wrapped error repeats dozens more times] ... invalid percentile value for certmanager_http_acme_client_request_duration_seconds: NaN"

Is there a way to ignore this?
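
One way this could be quieter (a sketch, not the integration's current behavior): skip NaN and Inf source values silently and summarize them in a single line per metric, instead of chaining one wrapped error per data point:

package main

import (
    "log"
    "math"
)

// emitPercentiles is a sketch: NaN/Inf values (empty histogram buckets or
// summary quantiles with no observations) are skipped and summarized once.
func emitPercentiles(name string, values map[float64]float64, emit func(string, float64, float64)) {
    dropped := 0
    for p, v := range values {
        if math.IsNaN(v) || math.IsInf(v, 0) {
            dropped++
            continue
        }
        emit(name, p, v)
    }
    if dropped > 0 {
        log.Printf("skipped %d empty percentile(s) for %s", dropped, name)
    }
}

func main() {
    emitPercentiles("certmanager_http_acme_client_request_duration_seconds",
        map[float64]float64{50: math.NaN(), 95: 0.25, 99: 0.9},
        func(name string, p, v float64) { log.Printf("%s p%.0f=%g", name, p, v) })
}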

Load test exporter mock

Modify or create the exporter mock so that the timeout is configurable, it handles multiple requests concurrently (each of these exporters will be mapped to multiple services and will receive concurrent connections from POMI), and it serves different sets of metrics from a file. A sketch follows.
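
A minimal sketch of such a mock, assuming flags for the artificial delay and the metrics file; Go's net/http already handles each request in its own goroutine, which covers the concurrency requirement:

package main

import (
    "flag"
    "log"
    "net/http"
    "os"
    "time"
)

func main() {
    delay := flag.Duration("delay", 0, "artificial response delay, e.g. 500ms")
    file := flag.String("metrics-file", "metrics.txt", "Prometheus text-format payload to serve")
    addr := flag.String("addr", ":8080", "listen address")
    flag.Parse()

    payload, err := os.ReadFile(*file)
    if err != nil {
        log.Fatal(err)
    }

    // net/http serves every request on its own goroutine, so the mock
    // accepts concurrent connections from POMI out of the box.
    http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(*delay) // simulate a slow exporter to exercise scrape_timeout
        w.Header().Set("Content-Type", "text/plain; version=0.0.4")
        w.Write(payload)
    })
    log.Fatal(http.ListenAndServe(*addr, nil))
}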

Not getting custom metrics from Prometheus endpoint

I have an app that runs inside a single container (and a single pod). There's also a Prometheus instance running inside that container that picks up metrics from various app processes.

I've deployed New Relic with nri-prometheus and generally I get all the infrastructure metrics, events, logs and so on; however, I do not get any custom metrics from my Prometheus instance.

Logs:
nri-prometheus-75b664b985-vfptw nri-prometheus time="2020-04-29T11:53:24Z" level=debug msg="fetching URL: {http 100.96.2.226:21090 /federate false }" component=Fetcher target=pzu-fwserver-sts-0

Pod annotations:
prometheus.io/path: /federate
prometheus.io/port: "21090"
prometheus.io/scrape: "true"

Config:

---
apiVersion: v1
data:
  config.yaml: |
    scrape_configs:
      - job_name: 'federate'
        scrape_interval: 30s
        honor_labels: true
        metrics_path: '/federate'

        params:
          'match[]':
            - '{job="prometheus"}'
            - '{__name__=~"job:.*"}'

    # The name of your cluster. It's important to match other New Relic products to relate the data.
    cluster_name: "k8s-dev.XXX.XXX"

    # How often the integration should run. Defaults to 30s.
    # scrape_duration: "30s"

    # The HTTP client timeout when fetching data from endpoints. Defaults to 5s.
    # scrape_timeout: "5s"

    # Whether the integration should run in verbose mode or not. Defaults to false.
    verbose: true

    # Whether the integration should skip TLS verification or not. Defaults to false.
    insecure_skip_verify: true

    # The label used to identify scrapable targets. Defaults to "prometheus.io/scrape".
    scrape_enabled_label: "prometheus.io/scrape"

    # Whether k8s nodes need to be labelled to be scraped or not. Defaults to true.
    require_scrape_enabled_label_for_nodes: true

    # targets:
    #   - description: Secure etcd example
    #     urls: ["https://192.168.3.1:2379", "https://192.168.3.2:2379", "https://192.168.3.3:2379"]
    #     tls_config:
    #       ca_file_path: "/etc/etcd/etcd-client-ca.crt"
    #       cert_file_path: "/etc/etcd/etcd-client.crt"
    #       key_file_path: "/etc/etcd/etcd-client.key"

    # Proxy to be used by the emitters when submitting metrics. It should be
    # in the format [scheme]://[domain]:[port].
    # The emitter is the component in charge of sending the scraped metrics.
    # This proxy won't be used when scraping metrics from the targets.
    # By default it's empty, meaning that no proxy will be used.
    # emitter_proxy: "http://localhost:8888"

    # Certificate to add to the root CA that the emitter will use when
    # verifying server certificates.
    # If left empty, TLS uses the host's root CA set.
    # emitter_ca_file: "/path/to/cert/server.pem"

    # Whether the emitter should skip TLS verification when submitting data.
    # Defaults to false.
    # emitter_insecure_skip_verify: false

    # Histogram support is based on New Relic's guidelines for higher
    # level metrics abstractions https://github.com/newrelic/newrelic-exporter-specs/blob/master/Guidelines.md.
    # To better support visualization of this data, percentiles are calculated
    # based on the histogram metrics and sent to New Relic.
    # By default, the following percentiles are calculated: 50, 95 and 99.
    #
    # percentiles:
    #   - 50
    #   - 95
    #   - 99

    # transformations:
    #   - description: "General processing rules"
    #     rename_attributes:
    #       - metric_prefix: ""
    #         attributes:
    #           container_name: "containerName"
    #           pod_name: "podName"
    #           namespace: "namespaceName"
    #           node: "nodeName"
    #           container: "containerName"
    #           pod: "podName"
    #           deployment: "deploymentName"
    #     ignore_metrics:
    #       # Ignore all the metrics except the ones listed below.
    #       # This is a list that complements the data retrieved by the New
    #       # Relic Kubernetes Integration, that's why Pods and containers are
    #       # not included, because they are already collected by the
    #       # Kubernetes Integration.
    #       - except:
    #         - kube_hpa_
    #         - kube_daemonset_
    #         - kube_statefulset_
    #         - kube_endpoint_
    #         - kube_service_
    #         - kube_limitrange
    #         - kube_node_
    #         - kube_poddisruptionbudget_
    #         - kube_resourcequota
    #         - nr_stats
    #     copy_attributes:
    #       # Copy all the labels from the timeseries with metric name
    #       # `kube_hpa_labels` into every timeseries with a metric name that
    #       # starts with `kube_hpa_` only if they share the same `namespace`
    #       # and `hpa` labels.
    #       - from_metric: "kube_hpa_labels"
    #         to_metrics: "kube_hpa_"
    #         match_by:
    #           - namespace
    #           - hpa
    #       - from_metric: "kube_daemonset_labels"
    #         to_metrics: "kube_daemonset_"
    #         match_by:
    #           - namespace
    #           - daemonset
    #       - from_metric: "kube_statefulset_labels"
    #         to_metrics: "kube_statefulset_"
    #         match_by:
    #           - namespace
    #           - statefulset
    #       - from_metric: "kube_endpoint_labels"
    #         to_metrics: "kube_endpoint_"
    #         match_by:
    #           - namespace
    #           - endpoint
    #       - from_metric: "kube_service_labels"
    #         to_metrics: "kube_service_"
    #         match_by:
    #           - namespace
    #           - service
    #       - from_metric: "kube_node_labels"
    #         to_metrics: "kube_node_"
    #         match_by:
    #           - namespace
    #           - node
kind: ConfigMap
metadata:
  name: nri-prometheus-cfg
  namespace: monitoring

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nri-prometheus
  namespace: monitoring
  labels:
    app: nri-prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nri-prometheus
  template:
    metadata:
      labels:
        app: nri-prometheus
        prometheus.io/scrape: "true"
    spec:
      serviceAccountName: nri-prometheus
      containers:
        - name: nri-prometheus
          image: newrelic/nri-prometheus:1.3.0
          args:
            - "--configfile=/etc/nri-prometheus/config.yaml"
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: config-volume
              mountPath: /etc/nri-prometheus/
          env:
            - name: "LICENSE_KEY"
              value: "XXX"
            - name: "BEARER_TOKEN_FILE"
              value: "/var/run/secrets/kubernetes.io/serviceaccount/token"
            - name: "CA_FILE"
              value: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
      volumes:
        - name: config-volume
          configMap:
            name: nri-prometheus-cfg
