metrico / otel-collector

OpenTelemetry Collector for qryn with preconfigured ingestors for Loki, Prometheus, Influx, OTLP and many more

Home Page: https://qryn.dev

License: Apache License 2.0

Go 97.26% Makefile 0.83% Dockerfile 0.32% JavaScript 1.59%
clickhouse otel-collector qryn opentelemetry-collector opentelemetry-contrib otel

otel-collector's Introduction

qryn-otel-collector

OpenTelemetry distribution for qryn

About

The qryn-otel-collector is designed to store observability data (Traces, Logs, Metrics) from multiple vendors/platforms into ClickHouse using qryn fingerprinting and table formats transparently accessible through qryn via LogQL, PromQL, Tempo and Pyroscope queries.

Popular ingestion formats (out of many more): Loki, Prometheus, InfluxDB, OTLP, Jaeger, Zipkin, Skywalking, Fluent Forward, Splunk HEC and Syslog.

Usage

otel-collector:
    container_name: otel-collector
    image: ghcr.io/metrico/qryn-otel-collector:latest
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "3100:3100"     # Loki/Logql HTTP receiver
      - "3200:3200"     # Loki/Logql gRPC receiver
      - "8088:8088"     # Splunk HEC receiver
      - "5514:5514"     # Syslog TCP Rereceiverceiver
      - "24224:24224"   # Fluent Forward receiver
      - "4317:4317"     # OTLP gRPC receiver
      - "4318:4318"     # OTLP HTTP receiver
      - "14250:14250"   # Jaeger gRPC receiver
      - "14268:14268"   # Jaeger thrift HTTP receiver
      - "9411:9411"     # Zipkin Trace receiver
      - "11800:11800"   # Skywalking gRPC receiver
      - "12800:12800"   # Skywalking HTTP receiver
      
      - "8086:8086"     # InfluxDB Line proto HTTP

    restart: on-failure

Config Template

The following template enables the popular log, metric and tracing ingestion formats supported by qryn.

receivers:
  loki:
    use_incoming_timestamp: true
    protocols:
      http:
        endpoint: 0.0.0.0:3100
      grpc:
        endpoint: 0.0.0.0:3200
  syslog:
    protocol: rfc5424
    tcp:
      listen_address: "0.0.0.0:5514"
  fluentforward:
    endpoint: 0.0.0.0:24224
  splunk_hec:
    endpoint: 0.0.0.0:8088
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  zipkin:
    endpoint: 0.0.0.0:9411
  skywalking:
    protocols:
      grpc:
        endpoint: 0.0.0.0:11800
      http:
        endpoint: 0.0.0.0:12800
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 5s
          static_configs:
            - targets: ['exporter:8080']
  influxdb:
    endpoint: 0.0.0.0:8086
connectors:
  servicegraph:
    latency_histogram_buckets: [ 100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms ]
    dimensions: [ cluster, namespace ]
    store:
      ttl: 2s
      max_items: 1000
    cache_loop: 2m
    store_expiration_loop: 2s
    virtual_node_peer_attributes:
      - db.name
      - rpc.service
  spanmetrics:
    namespace: span.metrics
    exemplars:
      enabled: false
    dimensions_cache_size: 1000
    aggregation_temporality: 'AGGREGATION_TEMPORALITY_CUMULATIVE'
    metrics_flush_interval: 30s
    metrics_expiration: 5m
    events:
      enabled: false
processors:
  batch:
    send_batch_size: 10000
    timeout: 5s
  memory_limiter:
    check_interval: 2s
    limit_mib: 1800
    spike_limit_mib: 500
  resourcedetection/system:
    detectors: ['system']
    system:
      hostname_sources: ['os']
  resource:
    attributes:
      - key: service.name
        value: "serviceName"
        action: upsert
  metricstransform:
    transforms:
      - include: calls_total
        action: update
        new_name: traces_spanmetrics_calls_total
      - include: latency
        action: update
        new_name: traces_spanmetrics_latency
exporters:
  qryn:
    dsn: tcp://clickhouse-server:9000/qryn?username=default&password=*************
    timeout: 10s
    sending_queue:
      queue_size: 100
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    logs:
      format: raw
  otlp/spanmetrics:
    endpoint: localhost:4317
    tls:
      insecure: true
extensions:
  health_check:
  pprof:
  zpages:

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    logs:
      receivers: [fluentforward, otlp, loki, syslog, splunk_hec]
      processors: [memory_limiter, resourcedetection/system, resource, batch]
      exporters: [qryn]
    traces:
      receivers: [otlp, jaeger, zipkin, skywalking]
      processors: [memory_limiter, resourcedetection/system, resource, batch]
      exporters: [qryn, spanmetrics, servicegraph]
    metrics:
      receivers: [prometheus, influxdb, spanmetrics, servicegraph]
      processors: [memory_limiter, resourcedetection/system, resource, batch]
      exporters: [qryn]

Kafka Receiver

When using Kafka (or another generic receiver), you have to promote selected fields to labels so they are fingerprinted correctly.

For example, this processor copies the severity JSON field to the severity label:

processors:
  logstransform:
    operators:
      - type: copy
        from: 'body.severity'
        to: 'attributes.severity'

Then use the processor inside the pipeline you want:

  pipelines:
    logs:
      receivers: [kafka]
      processors: [logstransform, memory_limiter, batch]
      exporters: [qryn]

Kafka Example

A stream containing {"severity":"info", "data": "a"} should produce the following fingerprint and log:

┌───────date─┬──────────fingerprint─┬─labels──────────────┬─name─┐
│ 2023-10-05 │ 11473756280579456548 │ {"severity":"info"} │      │
└────────────┴──────────────────────┴─────────────────────┴──────┘

┌──────────fingerprint─┬────────timestamp_ns─┬─value─┬─string─────────────────────────┐
│ 11473756280579456548 │ 1696502612955383384 │     0 │ {"data":"a","severity":"info"} │
└──────────────────────┴─────────────────────┴───────┴────────────────────────────────┘

otel-collector's People

Contributors

afzal-qxip, afzalabbasi, akvlad, cluas, dependabot[bot], dletta, lmangani, tomershafir, tr11


otel-collector's Issues

qryn + qryn-otel

Hello,
I have the following flow: logs are sent to Kafka in JSON format, then qryn-otel reads them and ingests them into ClickHouse, from where qryn reads them.
The current qryn-otel configuration is the following:

apiVersion: v1
kind: ConfigMap
metadata:
  name: qryn-opentelemetry-collector-configmap
data:
  config.yaml: |
    receivers:
      kafka:
        protocol_version: 2.8.0
        brokers: kafka-cluster-kafka-bootstrap.kafka.svc.cluster.local:9093
        encoding: json
        topic: logging-cloki
        group_id: cloki
        initial_offset: latest
        auth:
          tls:
            ca_file: /tmp/certificates/ca.crt
            insecure: true
        message_marking:
          on_error: false
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268
      zipkin:
        endpoint: 0.0.0.0:9411
      fluentforward:
        endpoint: 0.0.0.0:24224
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 5s
              static_configs:
                - targets: ['exporter:8080']
    processors:
      batch:
        send_batch_size: 10000
        timeout: 5s
      memory_limiter:
        check_interval: 2s
        limit_mib: 1800
        spike_limit_mib: 500
      resourcedetection/system:
        detectors: ['system']
        system:
          hostname_sources: ['os']
      resource:
        attributes:
          - key: service.name
            value: "serviceName"
            action: upsert
      spanmetrics:
        metrics_exporter: otlp/spanmetrics
        latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms]
        dimensions_cache_size: 1500
      servicegraph:
        metrics_exporter: otlp/spanmetrics
        latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms]
        dimensions: [cluster, namespace]
        store:
          ttl: 2s
          max_items: 200
      metricstransform:
        transforms:
          - include: calls_total
            action: update
            new_name: traces_spanmetrics_calls_total
          - include: latency
            action: update
            new_name: traces_spanmetrics_latency
    exporters:
      qryn:
        dsn: tcp://chi-log-analytics-log-analytics-0-0:9000/cloki?username=cloki&password=qwerty
        timeout: 10s
        sending_queue:
          queue_size: 100
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s
        logs:
          format: json

    extensions:
      health_check:
      pprof:
      zpages:
      memory_ballast:
        size_mib: 1000

    service:
      extensions: [pprof, zpages, health_check]
      pipelines:
        logs:
          receivers: [kafka]
          processors: [memory_limiter, batch]
          exporters: [qryn]

I see the logs in the ClickHouse table (screenshot omitted).

But I don't see anything in the qryn UI.
Here is how qryn is configured:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    cloki.cmd: cloki.org
  creationTimestamp: null
  labels:
    io.kompose.service: cloki
  name: cloki
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: cloki
  strategy: {}
  template:
    metadata:
      annotations:
        cloki.cmd: cloki.org
      creationTimestamp: null
      labels:
        io.kompose.service: cloki
    spec:
      containers:
        - env:
            - name: CLICKHOUSE_AUTH
              value: "cloki:qwerty"
            - name: CLICKHOUSE_DB
              value: "cloki"
            - name: CLICKHOUSE_PORT
              value: "8123"
            - name: CLICKHOUSE_SERVER
              value: "chi-log-analytics-log-analytics-0-0"
            - name: DEBUG
              value: "true"
          image: qxip/qryn:latest
          name: cloki
          ports:
            - containerPort: 3100
          resources: {}
      restartPolicy: Always
status: {}

Could someone point me to what might be wrong here?

Contributing otel-collector as an Exporter to OpenTelemetry Collector Contrib

I'm curious why the qryn exporter hasn't been contributed to the OpenTelemetry Collector Contrib project (https://github.com/open-telemetry/opentelemetry-collector-contrib).

Contributing your otel-collector to OpenTelemetry Collector Contrib could offer several benefits, including increased visibility, community-driven enhancements, and wider adoption. I would appreciate any insights you can provide on the decision to maintain the otel-collector separately and whether there are specific considerations involved.

Thank you for your time and any information you can share on this matter.

error decoding 'processors'

Hello!

It seems some dependencies got updated, and now I don't quite get how to write a working config file for the collector.

Can you please provide an example?

journalctl -u otel-collector -f
Apr 05 08:14:41 ip-10-151-1-19 otel-collector[7618]: * error decoding 'processors': unknown type: "spanmetrics" for id: "spanmetrics" (valid values: [filter groupbytrace redaction routing tail_sampling batch deltatorate memory_limiter span groupbyattrs k8sattributes experimental_metricsgeneration metricstransform resource schema transform attributes logstransform probabilistic_sampler resourcedetection cumulativetodelta])
Apr 05 08:14:41 ip-10-151-1-19 otel-collector[7618]: 2024/04/05 08:14:41 application run finished with error: failed to get config: cannot unmarshal the configuration: 1 error(s) decoding:
Apr 05 08:14:41 ip-10-151-1-19 otel-collector[7618]: * error decoding 'processors': unknown type: "spanmetrics" for id: "spanmetrics" (valid values: [filter groupbytrace redaction routing tail_sampling batch deltatorate memory_limiter span groupbyattrs k8sattributes experimental_metricsgeneration metricstransform resource schema transform attributes logstransform probabilistic_sampler resourcedetection cumulativetodelta])
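For context on this error: recent collector releases moved spanmetrics (and servicegraph) from processors to connectors, which is why the processor id is no longer recognized. A minimal sketch of the connector form, mirroring the template earlier in this README (receiver and exporter names are illustrative):

```yaml
connectors:
  spanmetrics:
    metrics_flush_interval: 30s
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]   # the connector consumes traces here...
    metrics:
      receivers: [spanmetrics]   # ...and re-emits them as metrics here
      exporters: [qryn]
```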

Add Loki Receiver

The OpenTelemetry Collector Contrib has a Loki receiver; we should add it to our distribution and make it work with our LogQL reader next.

Example:

receivers:
  loki:
    protocols:
      http:
      grpc:
    use_incoming_timestamp: true

Error log during a golang app profiling

A misleading error message appears in the logs of a Go application being profiled:

[DEBUG] uploading at http://localhost:8062/ingest?aggregationType=&from=1707942234560629781&name=otel-collector%7B__session_id__%3D73ae637eff3002f1%7D&sampleRate=0&spyName=gospy&units=&until=1707942249560416161
[DEBUG] content type: multipart/form-data; boundary=5f0ae53aa5f15f6528f2d7fdeb554fb3ff1bff68084e28af44c9cf549c00
[ERROR] upload profile: failed to upload. server responded with statusCode: '204' and body: ''

The profile nevertheless gets saved successfully.

Difficulty connecting to ClickHouse Cloud over HTTPS and/or via Chproxy

We are currently trying to configure the otel-collector to receive traces from Kafka and export them to a ClickHouse Cloud instance via Chproxy. We have tried different ways of connecting, but we fail to establish a connection to ClickHouse Cloud with or without Chproxy.

Context on our ClickHouse setup:

  • We have a ClickHouse Cloud instance which we have been able to connect to successfully over HTTPS via other tools.
  • We noticed latency and packet drops, so we put Chproxy with caching in front to alleviate the pain. We have been able to connect to Chproxy over HTTPS with other tools like the Jaeger-ClickHouse plugin.

Here is the qryn otel-collector configuration for reference:

receivers:
  kafka/qryn_traces:
    topic: otlp_traces
    encoding: otlp_proto
    group_id: qryn_traces
    auth:
      sasl:
        username: ${KAFKA_USER}
        password: ${KAFKA_PASSWORD}
        mechanism: ${KAFKA_MECH}
      tls:
        ca_file: /var/tls/ca.crt
    brokers:
      - kafka.monitoring.svc.cluster.local:9093
processors:
  resourcedetection:
    detectors: [env, system]
  cumulativetodelta:
  batch:
    send_batch_size: 1000
    timeout: 10s
exporters:
  qryn:
    dsn: https://${CH_ADDR}:8443/qryntraces?skip_verify=false&secure=true&username=${CH_USER}&password=${CH_PASS}
    timeout: 10s
    sending_queue:
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
service:
  telemetry:
    logs:
      level: "debug"
  pipelines:
    traces:
      receivers: [kafka/qryn_traces]
      processors: [batch]
      exporters: [qryn]

We have tried different DSNs:
Scenario 1 (secure):

DSN: https://${CH_ADDR}:8443/qryntraces?skip_verify=false&secure=true&username=${CH_USER}&password=${CH_PASS}

(With skip_verify=false and skip_verify=true)

Error: 
Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "traces", "name": "qryn", "error": "[handshake] unexpected packet [72] from server", "interval": "3.264886497s"}

Scenario 2 (insecure):

DSN: http://${CH_ADDR}:8443/qryntraces?skip_verify=false&secure=false&username=${CH_USER}&password=${CH_PASS}

DSN: tcp://${CH_ADDR}:9440/qryntraces?skip_verify=false&secure=false&username=${CH_USER}&password=${CH_PASS}

Error:
exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "traces", "name": "qryn", "error": "[read: EOF](read:%20EOF)", "errorVerbose": "read:\n    github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull\n        /go/pkg/mod/github.com/!click!house/[email protected]/proto/reader.go:62\n  - EOF", "interval": "10.972525583s"}

In the scenarios above, we have tried both the ClickHouse Cloud and the Chproxy addresses and credentials, with no success. We haven't faced any issues with the Kafka receiver; the failure is only in the qryn exporter trying to connect to ClickHouse over HTTP(S). Any help or guidance would be really appreciated. Please let me know if you need any further information.
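One observation on the handshake error in scenario 1: it is consistent with a native-protocol client talking to an HTTP(S) port, since 8443 speaks HTTPS while ClickHouse Cloud's secure native-protocol port is 9440. A hedged sketch of a DSN that keeps scheme and port consistent (all other parameters as in the original config):

```yaml
exporters:
  qryn:
    # Native protocol over TLS: tcp:// scheme, secure port 9440, secure=true.
    # Pointing a native-protocol client at 8443 (HTTPS) produces handshake errors.
    dsn: tcp://${CH_ADDR}:9440/qryntraces?secure=true&skip_verify=false&username=${CH_USER}&password=${CH_PASS}
```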

Consider change of SQL library

Current state

In order to send data into the tables, the github.com/ClickHouse/clickhouse-go/v2 library is used.
The library is used behind the database/sql abstraction, adding even more CPU overhead.

On the other hand, there is a much more efficient library, ch-go.

Proposition

As far as we know the schema, migrating to ch-go should not be complicated.
Please check the library's API and consider the migration.
