

trawler


Trawler is a metrics exporter for IBM API Connect.


Deployment

Trawler is designed to run within the same Kubernetes cluster as API Connect, so that it can scrape metrics from the installed components and make them available. Metrics gathering is split into separate nets, one per type of metric, so you can choose which ones to enable for a particular environment.

It requires a service account with read access to list pods and services in the namespace(s) the API Connect components are deployed in.

More details on installing trawler

Configuring trawler

Trawler gets its config from a mounted configmap containing config.yaml, which looks like this:

trawler:
  frequency: 10
  use_kubeconfig: false
prometheus:
  port: 63512 
  enabled: true
logging: 
  level: debug
  filters: trawler:trace
  format: pretty
nets:
  datapower:
    enabled: true
    timeout: 5 
    username: trawler-monitor
    namespace: apic-gateway
  product:
    enabled: true
    username: trawler-monitor
    namespace: apic-management

General trawler settings:

  • frequency: number of seconds to wait between trawling for metrics
  • use_kubeconfig: use the current kubeconfig from the environment instead of the in-cluster config
  • logging: set the default logging level, output format and filters for specific components

Prometheus settings: The port specified in the prometheus block needs to match the prometheus annotations on the deployed trawler pod so that Prometheus can discover the exposed metrics.
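For example, the trawler pod template might carry annotations like the following (illustrative values; the port must match the prometheus block in config.yaml, and the scrape path here is an assumption):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "63512"
    prometheus.io/path: "/"
```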

Individual nets

Each area of metrics is handled by a separate net, which can be enabled or disabled independently. The configuration for each net currently consists of the namespace the relevant subsystem is deployed into and a username to use. Passwords are loaded separately from the following keys in a kubernetes secret mounted at the default location of /app/secrets - which can be overridden using the SECRETS environment variable:

  • datapower_password - password to use with the datapower net for accessing the DataPower REST management interface.
  • cloudmanager_password - password to use with the manager net to retrieve API Connect usage metrics.
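A minimal sketch of this loading scheme (illustrative only -- trawler's actual implementation may differ):

```python
import os


def load_password(name, secrets_path=None):
    """Load a password from a file in the mounted secrets directory.

    The directory defaults to /app/secrets and can be overridden with
    the SECRETS environment variable, mirroring the behaviour described
    above. Returns None if the secret file is absent.
    """
    path = secrets_path or os.environ.get("SECRETS", "/app/secrets")
    secret_file = os.path.join(path, name)
    try:
        with open(secret_file) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None
```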

Issues, enhancements and pull requests

Feature requests and issue reports are welcome as GitHub issues in this repository. Pull requests are also accepted; they should link to an issue explaining the reasoning behind the change, follow the existing code format standards, and include tests so that overall code coverage is not reduced.

More documentation

Development tips

Setting up your development environment

Install the prerequisites for trawler from requirements.txt, and the development and testing requirements from requirements-dev.txt:

pip install -r requirements.txt
pip install -r requirements-dev.txt

Initialise the pre-commit checks for Trawler using pre-commit

pre-commit install

Running test cases locally

Trawler uses py.test for its test cases; the test suite is intended to be run with the test-assets directory as the secrets path:

SECRETS=test-assets coverage run --source . -m py.test

Running locally

To run locally, point the config parameter at a local config file:

python3 trawler.py --config local/config.yaml

You can view the data being exposed to Prometheus at localhost:63512 (or at your custom port value if it has been changed).

Notes on developing with a running k8s pod:

kubectl cp datapower_trawl.py {trawler_pod}:/app/datapower_trawl.py
kubectl cp newconfig.yaml {trawler_pod}:/app/newconfig.yaml
kubectl exec {trawler_pod} -- sh -c 'cd /app;python3 trawler.py -c newconfig.yaml'

apiconnect-trawler's People

Contributors

alexisph, dependabot[bot], djarcan, finnribm, imgbot[bot], perryan, perryan-coder, rickymoorhouse, sauravsuresh, stevemar


apiconnect-trawler's Issues

Update documentation for recent features

  • Ensure all new metrics are documented
    • Document Cache
    • New Analytics subsystem
    • CR monitoring
  • Org level metrics being optional
  • All namespace support for Datapower discovery
  • Cert checking

certificate net does not check secrets with only 'tls.crt'

I noticed that some endpoint certs were not getting checked by trawler.

It's falling foul of this bit of code:

https://github.com/IBM/apiconnect-trawler/blob/main/certs_net.py#L52-L63

which assumes the secrets we're interested in will have both a ca.crt and a tls.crt. However, if the certs are generated by a trusted CA or are simply self-signed, there may be no ca.crt in the data, so this code skips over them.

It's also possible that if you only have a ca.crt, the code will blow up when trying to get the expiry for tls.crt.

Suggest refactoring so that ca.crt and tls.crt each get their own existence check.
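The suggested refactor could look something like this (a sketch, not the actual certs_net.py code -- the key names match what Kubernetes TLS secrets use, everything else is illustrative):

```python
import base64


def certs_to_check(secret_data):
    """Return the decoded certificates present in a secret's data.

    Each of ca.crt and tls.crt gets its own existence check, so a
    secret carrying only one of the two is still processed instead of
    being skipped or blowing up on the missing key.
    """
    found = []
    for key in ("ca.crt", "tls.crt"):
        if key in secret_data:
            # Kubernetes secret data values are base64-encoded strings.
            found.append((key, base64.b64decode(secret_data[key])))
    return found
```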

Resolving objectstatus of datapower takes way more than 1s

Hello,

in our deployment, trawler times out while fetching the objectstatus over the REST management interface of the DataPower gateways.
Currently, the timeout is hardcoded to 1s.

In our test environment, we observe times of 5s (environment A) and 15s (environment B) with a rough download size of 25MiB.

Please provide a way to configure such timeouts from the outside, e.g. via environment variables which we can set in a configmap.

Affected line:
https://github.com/IBM/apiconnect-trawler/blob/main/datapower_net.py#L264
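One possible shape for such a knob (the DATAPOWER_TIMEOUT variable name is hypothetical -- the issue only asks for some externally configurable mechanism):

```python
import os


def get_timeout(default=1.0):
    """Read the DataPower status-fetch timeout from the environment.

    Falls back to the default both when the variable is unset and when
    it does not parse as a number.
    """
    try:
        return float(os.environ.get("DATAPOWER_TIMEOUT", default))
    except ValueError:
        return float(default)
```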

Update cluster role to cover analytics api group

The one in the repo was not complete (missing the apiGroup for analytics):

      - apiGroups: ["analytics.apiconnect.ibm.com"]
        resources: ["analyticsclusters"]
        verbs: ["get","list"]

Missing metrics after first-time-install

Simply put, my problem is that I am not seeing the "promised" metrics in Prometheus that are linked here.

I have apiconnect-trawler installed in my "monitoring" namespace, in a cluster where API Connect is installed in its own namespace. After some initial problems, it has been running for a week now. Now I want to use metrics like apiconnect_health_status or datapower_gateway_peering_primary_info.

API Connect: (screenshot omitted)

Manager: (screenshot omitted)

DataPower: too much, more than documented ...

Analytics: (screenshot omitted)

Cert monitoring: working as expected.

Is it possible that I have a configuration error? Or have the metrics simply not been implemented yet?

trawler:
  frequency: 10
  use_kubeconfig: false
logging: 
  level: debug
  format: json
prometheus:
  port: 63512
  enabled: true
graphite:
  enabled: false
nets:
  datapower:
    enabled: true
    timeout: 5
    username: admin
    namespace: apic
  manager:
    enabled: true
    username: admin
    namespace: apic
    max_frequency: 300
    process_org_metrics: true
    grant_type: password
  analytics:
    enabled: true
    namespace: apic
  certs:
    enabled: true
  product:
    enabled: true
    username: admin
    namespace: apic

Doesn't discover v10 based DataPower images

The annotations of a v10 datapower look like this:

  annotations:
    datapower.ibm.com/domains.apiconnect.reconciled: "2020-09-28T14:29:43Z"
    datapower.ibm.com/domains.default.reconciled: "2020-09-28T14:29:43Z"
    datapower.ibm.com/user.admin.reconciled: "2020-09-28T14:29:43Z"
    datapower.ibm.com/username.commands.reconciled: "2020-09-28T14:29:43Z"
    kubernetes.io/psp: ibm-privileged-psp
    productChargedContainers: datapower
    productID: 887a7b80dd7b40c9b978ff085230604e
    productMetric: VIRTUAL_PROCESSOR_CORE
    productName: IBM DataPower Gateway Virtual Edition - Production Edition
    productVersion: 10.0.0.0
    prometheus.io/module: dpStatusMIB
    prometheus.io/path: /snmp
    prometheus.io/port: "63512"
    prometheus.io/scrape: "true"
    prometheus.io/target: 127.0.0.1:1161

Whereas a v2018 datapower looks like:

  annotations:
    kubernetes.io/psp: ibm-privileged-psp
    productChargedContainers: ""
    productFlexpointBundle: ""
    productID: IBMDataPowerGatewayVirtualEdition_2018.4.1.13...
    productName: IBM DataPower Gateway Virtual Edition for Developers
    productVersion: 2018.4.1.13-324822-release-prod
    prometheus.io/module: dpStatusMIB
    prometheus.io/path: /snmp?target=127.0.0.1:1161&module=dpStatusMIB
    prometheus.io/port: "63512"
    prometheus.io/scrape: "true"
    prometheus.io/target: 127.0.0.1:1161
    restPort: "5554"
    sshPort: "9022"
    webGUIPort: "9090"

Currently trawler is looking for restPort - so it doesn't find the v10 pods.

Looking to move to using the productName annotation for discovery instead.
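The proposed change could be as simple as matching on the productName annotation, which both annotation sets above carry (a sketch of the idea, not trawler's actual discovery code):

```python
def is_datapower_pod(annotations):
    """Decide whether a pod is a DataPower gateway from its annotations.

    Matching on productName works for both v2018 and v10 pods, unlike
    the old restPort check which v10 images no longer set.
    """
    return "DataPower Gateway" in annotations.get("productName", "")
```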

Connection errors cause trawler to crash

Example trace:

{"channel": "management", "exception": null, "level": "info", "message": "Getting data from API Manager", "num_indent": 0, "timestamp": "2022-10-24T11:40:09.618886"}
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib64/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/lib64/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib64/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/lib64/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/trawler.py", line 213, in <module>
    cli()
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/app/trawler.py", line 209, in cli
    trawler.trawl_metrics()
  File "/usr/local/lib/python3.8/site-packages/alog/alog.py", line 798, in wrapper
    return func(*args, **kwargs)
  File "/app/trawler.py", line 194, in trawl_metrics
    net.fish()
  File "/usr/local/lib/python3.8/site-packages/alog/alog.py", line 798, in wrapper
    return func(*args, **kwargs)
  File "/app/manager_net.py", line 184, in fish
    response = requests.get(
  File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 547, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
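A sketch of the kind of fix this implies -- catching per-net exceptions inside the trawl loop so that one failing net doesn't kill the whole process (DemoNet and trawl_once are illustrative stand-ins, not trawler's actual classes):

```python
class DemoNet:
    """Stand-in for a trawler net whose fish() may raise."""

    def __init__(self, fails):
        self.fails = fails
        self.calls = 0

    def fish(self):
        self.calls += 1
        if self.fails:
            raise ConnectionError("Remote end closed connection without response")


def trawl_once(nets, log=print):
    """Run every net, surviving a failure in any one of them."""
    for net in nets:
        try:
            net.fish()
        except Exception as exc:  # real code would narrow this to connection errors
            log(f"net {type(net).__name__} failed: {exc}")
```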

Add datapower gateway peering status

Gather the gateway peering status data so that it can be viewed, specifically the Primary node for a peer group.

The CLI command is show gateway-peering-status and the equivalent REST call is /mgmt/status/{domain}/GatewayPeeringStatus

The returned JSON format is

{
    "_links": {
      "self": {
        "href": "/mgmt/status/apiconnect/GatewayPeeringStatus"
      },
      "doc": {
        "href": "/mgmt/docs/status/GatewayPeeringStatus"
      }
    },
    "GatewayPeeringStatus": [
      {
        "Address": "IP Node 1",
        "Name": "gwd",
        "PendingUpdates": 0,
        "ReplicationOffset": 5881225785,
        "LinkStatus": "ok",
        "Primary": "no"
      },
      ....
      {
        "Address": "IP Node 2",
        "Name": "gwd",
        "PendingUpdates": 0,
        "ReplicationOffset": 5881225785,
        "LinkStatus": "ok",
        "Primary": "no"
      },
      ....
      {
        "Address": "IP Node 3",
        "Name": "gwd",
        "PendingUpdates": 0,
        "ReplicationOffset": 5881225785,
        "LinkStatus": "ok",
        "Primary": "yes"
      },
      ...
    ]
}

The output format needs to be determined, probably with a naming standard like https://prometheus.io/docs/practices/naming/
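As an illustration of how these entries might map onto Prometheus samples, this sketch loosely follows the naming guidance linked above; the metric and label names are assumptions, not necessarily what trawler exports:

```python
def peering_metrics(status, domain="apiconnect"):
    """Turn GatewayPeeringStatus entries into Prometheus-style sample lines.

    Emits 1 for the primary node of a peer group and 0 for the others,
    labelled by domain, peering object name and node address.
    """
    lines = []
    for entry in status.get("GatewayPeeringStatus", []):
        labels = (f'domain="{domain}",name="{entry["Name"]}",'
                  f'address="{entry["Address"]}"')
        primary = 1 if entry.get("Primary") == "yes" else 0
        lines.append(f"datapower_gateway_peering_primary_info{{{labels}}} {primary}")
    return lines
```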

KeyError: 'graphite' raised when graphite key is not in config.yaml

When running the project with a config.yaml file that doesn't include the graphite key, the following error is raised:

Traceback (most recent call last):
  ...
  File "/app/trawler.py", line 55, in __init__
    if self.config['graphite']['enabled']:
KeyError: 'graphite'

This seems to indicate that the graphite key is required in config.yaml, although it is not included in the example configurations or documented.

Steps to Reproduce:

  1. Clone the project.
  2. Run the tests: SECRETS=test-assets coverage run --source . -m py.test
  3. Run the application with a config.yaml that doesn't include the graphite key: python3 trawler.py --config deployment/config.yaml

Expected Behavior:

The application should either run without requiring the graphite key or should provide a more descriptive error message if the key is required.

Suggested Solution:

Document the graphite key: update the example config.yaml file and the documentation to include the graphite key. For example:

# Example configuration file
trawler:
  frequency: 10
  use_kubeconfig: false
logging: 
  level: debug
  filters: trawler:trace
  format: pretty
prometheus:
  port: 63512
  enabled: true
graphite:
  enabled: false
nets:
  datapower:
    enabled: true
    username: admin
    namespace: apic
  manager:
    enabled: true
    username: admin
    namespace: apic
  analytics:
    enabled: true
    namespace: apic
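Alternatively, trawler.py could treat the block as optional by using dict.get instead of direct indexing (a sketch of the "run without requiring the graphite key" option, not the actual change):

```python
def graphite_enabled(config):
    """Safely check the optional graphite block in trawler's config.

    dict.get avoids the KeyError when the block is absent, defaulting
    to graphite being disabled.
    """
    return config.get("graphite", {}).get("enabled", False)
```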

Monitor API Connect Custom Resource status

Poll the custom resources in the cluster and expose the status conditions to prometheus.

e.g. For ManagementCluster:

   conditions:
    - lastTransitionTime: "2022-06-15T09:11:42Z"
      message: ""
      reason: na
      status: "False"
      type: Warning
    - lastTransitionTime: "2022-06-15T09:21:37Z"
      message: 17/17
      reason: na
      status: "True"
      type: Ready
    - lastTransitionTime: "2022-06-15T09:20:57Z"
      message: ""
      reason: na
      status: "False"
      type: Pending
    - lastTransitionTime: "2022-06-15T09:11:42Z"
      message: ""
      reason: na
      status: "False"
      type: Error
    - lastTransitionTime: "2022-06-15T09:11:42Z"
      message: ""
      reason: na
      status: "False"
      type: Failed

Correspondence between metrics

It would be nice if the "manager_" metrics carried labels showing the correspondences between them. If, for example, I want to see how many spaces a provider organization has, a "POrg" label on the spaces metric would help.

analytics - apicalls in 30 seconds window

At the moment the analytics module exposes a metric for API calls with their HTTP status codes only for the last hour. This is not precise enough to use, e.g., for alerts. We need these metrics over a last-30-seconds window.

Question regarding datapower object count metrics

Hi,

thank you for providing the opportunity to adjust the timeout in the datapower_net.
We noticed in our logs that something is gathered now:

2023-03-16T08:05:59.902914 [trawl:INFO] Trawling for metrics...
2023-03-16T08:06:00.278873 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:00.443304 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:00.677276 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:00.825884 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:01.091210 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:01.275606 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:01.547945 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:01.720889 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:01.973523 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:02.140620 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:02.369868 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:02.523395 [datap:INFO] DataPowers in list: 6

Unfortunately, however, the data is not showing up in Grafana yet.
We noticed that instead of querying ObjectStatus, ObjectInstanceCounts is fetched now.
Q1: Does this change the identifiers under which we would find them?

Example identifiers we would expect to find:

  • datapower_GatewayPeeringManager_total
  • datapower_GatewayPeering_total
  • datapower_APIGateway_total
  • datapower_APICollection_total
  • datapower_APIPath_total

Metrics for portal

At the moment trawler delivers no metrics for the portal subsystem. It would be nice to have some metrics giving an overview of the portal's health.
