
apiconnect-trawler's Issues

Doesn't discover v10 based DataPower images

The annotations of a v10 DataPower look like this:

  annotations: "2020-09-28T14:29:43Z" "2020-09-28T14:29:43Z" "2020-09-28T14:29:43Z" "2020-09-28T14:29:43Z" ibm-privileged-psp
    productChargedContainers: datapower
    productID: 887a7b80dd7b40c9b978ff085230604e
    productName: IBM DataPower Gateway Virtual Edition - Production Edition
    productVersion: dpStatusMIB /snmp "63512" "true"

Whereas a v2018 DataPower looks like:

  annotations: ibm-privileged-psp
    productChargedContainers: ""
    productFlexpointBundle: ""
    productID: IBMDataPowerGatewayVirtualEdition_2018.4.1.13...
    productName: IBM DataPower Gateway Virtual Edition for Developers
    productVersion: 2018.4.1.13-324822-release-prod dpStatusMIB /snmp?target= "63512" "true"
    restPort: "5554"
    sshPort: "9022"
    webGUIPort: "9090"

Currently trawler looks for the restPort annotation, so it doesn't find the v10 pods.

I'm looking to move discovery to the productName annotation instead.
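
One way to sketch discovery by productName is shown below. It assumes the pod annotations have already been read into a plain dictionary; the helper name `is_datapower_pod` and the substring it matches are illustrative assumptions, not trawler's actual implementation.

```python
from typing import Mapping, Optional

# Assumption: both product names quoted above contain this substring.
DATAPOWER_NAME_MARKER = "IBM DataPower Gateway"


def is_datapower_pod(annotations: Optional[Mapping[str, str]]) -> bool:
    """Return True when the productName annotation identifies a DataPower pod."""
    if not annotations:
        return False
    product_name = annotations.get("productName") or ""
    return DATAPOWER_NAME_MARKER in product_name


# Annotation values taken from the two examples above.
v10 = {"productName": "IBM DataPower Gateway Virtual Edition - Production Edition"}
v2018 = {
    "productName": "IBM DataPower Gateway Virtual Edition for Developers",
    "restPort": "5554",
}

print(is_datapower_pod(v10), is_datapower_pod(v2018))
```

Matching on a substring rather than the full name keeps the check independent of the edition suffix, at the cost of relying on the product name staying stable across releases.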

Metrics for portal

At the moment trawler delivers no metrics for the portal subsystem. It would be nice to have some metrics that give an overview of the portal's health.

Missing metrics after first-time-install

Simply put, my problem is that I am not seeing the "promised" metrics in Prometheus that are linked here.

I have apiconnect-trawler installed in my "monitoring" namespace, in a cluster on which API Connect is installed in its own namespace. After some initial problems, it has been running for a week now. I now want to use metrics such as apiconnect_health_status or datapower_gateway_peering_primary_info.

API Connect:


DataPower: too much, more than documented ...


Cert monitoring: working as expected.

Is it possible that I have a configuration error? Or have the metrics simply not been implemented yet?

  frequency: 10
  use_kubeconfig: false
  level: debug
  format: json
  port: 63512
  enabled: true
  enabled: false
    enabled: true
    timeout: 5
    username: admin
    namespace: apic
    enabled: true
    username: admin
    namespace: apic
    max_frequency: 300
    process_org_metrics: true
    grant_type: password
    enabled: true
    namespace: apic
    enabled: true
    enabled: true
    username: admin
    namespace: apic

Add datapower gateway peering status

Gather the gateway peering status data so that it can be viewed, specifically the Primary node for a peer group.

The CLI command is show gateway-peering-status and the equivalent REST call is /mgmt/status/{domain}/GatewayPeeringStatus

The returned JSON format is

    {
      "_links": {
        "self": {
          "href": "/mgmt/status/apiconnect/GatewayPeeringStatus"
        },
        "doc": {
          "href": "/mgmt/docs/status/GatewayPeeringStatus"
        }
      },
      "GatewayPeeringStatus": [
        {
          "Address": "IP Node 1",
          "Name": "gwd",
          "PendingUpdates": 0,
          "ReplicationOffset": 5881225785,
          "LinkStatus": "ok",
          "Primary": "no"
        },
        {
          "Address": "IP Node 2",
          "Name": "gwd",
          "PendingUpdates": 0,
          "ReplicationOffset": 5881225785,
          "LinkStatus": "ok",
          "Primary": "no"
        },
        {
          "Address": "IP Node 3",
          "Name": "gwd",
          "PendingUpdates": 0,
          "ReplicationOffset": 5881225785,
          "LinkStatus": "ok",
          "Primary": "yes"
        }
      ]
    }
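
As a rough sketch of how this response could be flattened into per-node samples, the function below turns each entry into a set of labels plus a 0/1 value marking the primary. The function name, the label names, and the handling of a single entry returned as an object instead of a list are all assumptions for illustration; the real output format is still to be decided.

```python
from typing import Any, Dict, List, Tuple

Sample = Tuple[Dict[str, str], int]


def peering_primary_samples(domain: str, payload: Dict[str, Any]) -> List[Sample]:
    """Turn a GatewayPeeringStatus response into (labels, value) samples.

    The value is 1 for the primary node of a peer group and 0 otherwise.
    """
    entries = payload.get("GatewayPeeringStatus", [])
    # Assumption: a lone entry might arrive as an object rather than a list.
    if isinstance(entries, dict):
        entries = [entries]
    samples: List[Sample] = []
    for entry in entries:
        labels = {
            "domain": domain,
            "peer_group": entry["Name"],
            "address": entry["Address"],
            "link_status": entry["LinkStatus"],
        }
        samples.append((labels, 1 if entry["Primary"] == "yes" else 0))
    return samples


# Trimmed version of the response shown above.
payload = {
    "GatewayPeeringStatus": [
        {"Address": "IP Node 1", "Name": "gwd", "LinkStatus": "ok", "Primary": "no"},
        {"Address": "IP Node 2", "Name": "gwd", "LinkStatus": "ok", "Primary": "no"},
        {"Address": "IP Node 3", "Name": "gwd", "LinkStatus": "ok", "Primary": "yes"},
    ]
}

samples = peering_primary_samples("apiconnect", payload)
primaries = [labels["address"] for labels, value in samples if value == 1]
print(primaries)
```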

The output format needs to be determined, probably with a naming standard like

Connection errors cause trawler to crash

Example trace:

{"channel": "management", "exception": null, "level": "info", "message": "Getting data from API Manager", "num_indent": 0, "timestamp": "2022-10-24T11:40:09.618886"}
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib64/python3.8/http/", line 1348, in getresponse
  File "/usr/lib64/python3.8/http/", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.8/http/", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/", line 489, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib64/python3.8/http/", line 1348, in getresponse
  File "/usr/lib64/python3.8/http/", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.8/http/", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/", line 213, in <module>
  File "/usr/local/lib/python3.8/site-packages/click/", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/app/", line 209, in cli
  File "/usr/local/lib/python3.8/site-packages/alog/", line 798, in wrapper
    return func(*args, **kwargs)
  File "/app/", line 194, in trawl_metrics
  File "/usr/local/lib/python3.8/site-packages/alog/", line 798, in wrapper
    return func(*args, **kwargs)
  File "/app/", line 184, in fish
    response = requests.get(
  File "/usr/local/lib/python3.8/site-packages/requests/", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/", line 547, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
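
The trace shows the exception from `requests.get` in `fish` propagating out through `trawl_metrics` and ending the process. A minimal sketch of one way to keep the loop alive is to wrap each collection step and log the failure instead of letting it escape. The helper name `fish_safely` is hypothetical and this is not the project's actual fix; it catches `OSError` because `requests.exceptions.RequestException` derives from `IOError`, so the `ConnectionError` above would be covered.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("trawler")


def fish_safely(fish: Callable[[], T], name: str) -> Optional[T]:
    """Run one collection step, logging connection failures instead of crashing."""
    try:
        return fish()
    except OSError as err:
        # Skip this cycle; the next scheduled trawl gets another attempt.
        logger.warning("Skipping %s for this cycle: %s", name, err)
        return None


def broken_fish() -> int:
    # Stand-in for a request whose remote end drops the connection.
    raise ConnectionResetError("Remote end closed connection without response")


print(fish_safely(broken_fish, "management"))
print(fish_safely(lambda: 42, "management"))
```

Returning `None` leaves it to the caller to decide whether stale metric values should be kept or cleared when a cycle is skipped.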

Update cluster role to cover analytics api group

The cluster role in the repo was not complete (it was missing the API group for analytics):
*       - apiGroups: [""]
*         resources: ["analyticsclusters"]
*         verbs: ["get","list"]

Monitor API Connect Custom Resource status

Poll the custom resources in the cluster and expose the status conditions to Prometheus.

For example, for ManagementCluster:

    - lastTransitionTime: "2022-06-15T09:11:42Z"
      message: ""
      reason: na
      status: "False"
      type: Warning
    - lastTransitionTime: "2022-06-15T09:21:37Z"
      message: 17/17
      reason: na
      status: "True"
      type: Ready
    - lastTransitionTime: "2022-06-15T09:20:57Z"
      message: ""
      reason: na
      status: "False"
      type: Pending
    - lastTransitionTime: "2022-06-15T09:11:42Z"
      message: ""
      reason: na
      status: "False"
      type: Error
    - lastTransitionTime: "2022-06-15T09:11:42Z"
      message: ""
      reason: na
      status: "False"
      type: Failed
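
A small sketch of how such a conditions list could be reduced to numeric values suitable for gauges follows. It assumes the conditions have already been loaded into a list of dictionaries; the function name `condition_values` is hypothetical.

```python
from typing import Dict, Iterable, Mapping


def condition_values(conditions: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Map each condition type to 1 when its status is "True", else 0."""
    return {
        cond["type"]: 1 if cond.get("status") == "True" else 0
        for cond in conditions
    }


# The type and status fields from the ManagementCluster example above.
conditions = [
    {"type": "Warning", "status": "False"},
    {"type": "Ready", "status": "True"},
    {"type": "Pending", "status": "False"},
    {"type": "Error", "status": "False"},
    {"type": "Failed", "status": "False"},
]

print(condition_values(conditions))
```

Note that the status is the string "True" or "False" rather than a boolean, so it is compared as text.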

Question regarding datapower object count metrics


Thank you for making the timeout in datapower_net adjustable.
We noticed in our logs that something is being gathered now:

2023-03-16T08:05:59.902914 [trawl:INFO] Trawling for metrics...
2023-03-16T08:06:00.278873 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:00.443304 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:00.677276 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:00.825884 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:01.091210 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:01.275606 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:01.547945 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:01.720889 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:01.973523 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:02.140620 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:02.369868 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:02.523395 [datap:INFO] DataPowers in list: 6

Unfortunately, however, the data is not showing up in Grafana yet.
We noticed that ObjectInstanceCounts is now fetched instead of ObjectStatus.
Q1: Does this change the identifiers under which we would find the metrics?

Example identifiers we would expect to find:

  • datapower_GatewayPeeringManager_total
  • datapower_GatewayPeering_total
  • datapower_APIGateway_total
  • datapower_APICollection_total
  • datapower_APIPath_total

Update documentation for recent features

  • Ensure all new metrics are documented
    • Document Cache
    • New Analytics subsystem
    • CR monitoring
  • Org level metrics being optional
  • All namespace support for Datapower discovery
  • Cert checking

Correspondence between metrics

It would be nice if the "manager_" metrics made it possible to see the correspondences between them. For example, if I want to see how many spaces a POrg has, a "POrg" label on the spaces metric would help.

analytics - apicalls in 30 seconds window

At the moment the analytics module has a metric showing API calls by HTTP status code only for the last hour. This is not precise enough for uses such as alerting. We need these metrics over a window of the last 30 seconds.

Resolving objectstatus of datapower takes way more than 1s


In our deployment, trawler times out while fetching the ObjectStatus over the RMI of the DataPower gateways.
Currently, the timeout is hardcoded to 1s.

In our test environments, we observe times of 5s (environment A) and 15s (environment B) with a download size of roughly 25MiB.

Please provide a way to configure such timeouts from the outside, e.g. through environment variables that we can set in a ConfigMap.
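
A minimal sketch of what an environment-driven timeout could look like is given below. The variable name `DATAPOWER_TIMEOUT` and the helper `datapower_timeout` are assumptions made for illustration; the default of 1 second mirrors the hardcoded value described above.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 1.0  # the hardcoded value described above
TIMEOUT_ENV_VAR = "DATAPOWER_TIMEOUT"  # hypothetical variable name


def datapower_timeout(env: Optional[Mapping[str, str]] = None) -> float:
    """Read the request timeout in seconds, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get(TIMEOUT_ENV_VAR)
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        # Unparseable input should not stop the collector from starting.
        return DEFAULT_TIMEOUT_SECONDS
    return value if value > 0 else DEFAULT_TIMEOUT_SECONDS


print(datapower_timeout({"DATAPOWER_TIMEOUT": "15"}))
```

The resulting value could then be passed as the `timeout` argument of the HTTP call, and the variable set through a ConfigMap as requested.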

Affected line:

certificate net does not check secrets with only 'tls.crt'

I noticed that some endpoint certs were not getting checked by trawler

It's running foul of this bit of code

which assumes the secrets we're interested in will have both a ca.crt and a tls.crt. However, if the certs are generated by a trusted CA or are simply self-signed, there may not be a ca.crt in the data, so this code skips over them.

It's also possible that if you only have a ca.crt the code will blow up when trying to get the expiry for tls.crt.

Suggest refactoring so that the ca.crt and tls.crt each have their own existence check.
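
The suggested refactoring could look roughly like the sketch below, where each key is checked independently and only the certificates that are present get returned for expiry checking. The function name `certs_in_secret` is hypothetical, and the input is assumed to be the secret's `data` mapping with base64-encoded values, as Kubernetes stores them.

```python
import base64
from typing import Dict, Mapping, Optional

CERT_KEYS = ("ca.crt", "tls.crt")


def certs_in_secret(data: Optional[Mapping[str, str]]) -> Dict[str, bytes]:
    """Return the decoded certificate for each cert key present in the secret."""
    found: Dict[str, bytes] = {}
    for key in CERT_KEYS:
        value = (data or {}).get(key)
        if value:  # each key gets its own existence check
            found[key] = base64.b64decode(value)
    return found


pem = b"-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"
tls_only = {"tls.crt": base64.b64encode(pem).decode("ascii")}

print(sorted(certs_in_secret(tls_only)))
```

With this shape, a secret holding only tls.crt is still checked, and one holding only ca.crt no longer triggers a lookup of a missing tls.crt.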

KeyError: 'graphite' raised when graphite key is not in config.yaml

When running the project with a config.yaml file that doesn't include the graphite key, the following error is raised:

Traceback (most recent call last):
  File "/app/", line 55, in __init__
    if self.config['graphite']['enabled']:
KeyError: 'graphite'

This seems to indicate that the graphite key is expected in the config.yaml file, although it's not included in example configurations or documented.

Steps to Reproduce:

  1. Clone the project.
  2. Run the tests: SECRETS=test-assets coverage run --source . -m py.test
  3. Run the application with a config.yaml that doesn't include the graphite key: python3 --config deployment/config.yaml

Expected Behavior:

The application should either run without requiring the graphite key or should provide a more descriptive error message if the key is required.
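
For the first option, a tolerant lookup along the following lines would treat a missing graphite section as disabled. The helper name `graphite_enabled` is hypothetical; only the `graphite` and `enabled` keys come from the traceback above.

```python
from typing import Any, Mapping


def graphite_enabled(config: Mapping[str, Any]) -> bool:
    """Return True only when the graphite section exists and is enabled."""
    # "or {}" also covers a section that is present but empty (None in YAML).
    graphite = config.get("graphite") or {}
    return bool(graphite.get("enabled", False))


print(graphite_enabled({}))
print(graphite_enabled({"graphite": {"enabled": True}}))
```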

Suggested Solution:

Document the graphite Key: Update the example config.yaml file and documentation to include the graphite key. For example:

# Example configuration file
  frequency: 10
  use_kubeconfig: false
  level: debug
  filters: trawler:trace
  format: pretty
  port: 63512
  enabled: true
  enabled: false
    enabled: true
    username: admin
    namespace: apic
    enabled: true
    username: admin
    namespace: apic
    enabled: true
    namespace: apic
