ibm/apiconnect-trawler
API Connect metrics exporter
License: MIT License
The annotations of a v10 DataPower look like this:
annotations:
  datapower.ibm.com/domains.apiconnect.reconciled: "2020-09-28T14:29:43Z"
  datapower.ibm.com/domains.default.reconciled: "2020-09-28T14:29:43Z"
  datapower.ibm.com/user.admin.reconciled: "2020-09-28T14:29:43Z"
  datapower.ibm.com/username.commands.reconciled: "2020-09-28T14:29:43Z"
  kubernetes.io/psp: ibm-privileged-psp
  productChargedContainers: datapower
  productID: 887a7b80dd7b40c9b978ff085230604e
  productMetric: VIRTUAL_PROCESSOR_CORE
  productName: IBM DataPower Gateway Virtual Edition - Production Edition
  productVersion: 10.0.0.0
  prometheus.io/module: dpStatusMIB
  prometheus.io/path: /snmp
  prometheus.io/port: "63512"
  prometheus.io/scrape: "true"
  prometheus.io/target: 127.0.0.1:1161
Whereas a v2018 DataPower looks like:
annotations:
  kubernetes.io/psp: ibm-privileged-psp
  productChargedContainers: ""
  productFlexpointBundle: ""
  productID: IBMDataPowerGatewayVirtualEdition_2018.4.1.13...
  productName: IBM DataPower Gateway Virtual Edition for Developers
  productVersion: 2018.4.1.13-324822-release-prod
  prometheus.io/module: dpStatusMIB
  prometheus.io/path: /snmp?target=127.0.0.1:1161&module=dpStatusMIB
  prometheus.io/port: "63512"
  prometheus.io/scrape: "true"
  prometheus.io/target: 127.0.0.1:1161
  restPort: "5554"
  sshPort: "9022"
  webGUIPort: "9090"
Currently trawler looks for the restPort annotation, so it doesn't find the v10 pods.
We are looking to move to using the productName annotation for discovery instead.
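Annotation-based discovery could be sketched as a pure predicate over a pod's annotations. This is a sketch only: the substring match on productName is an assumption, and the real implementation might compare exact product names or IDs.

```python
def is_datapower_pod(annotations):
    """Return True if a pod's annotations identify it as a DataPower gateway.

    Matching on productName (present on both v2018 and v10 pods) instead of
    restPort (absent on v10) keeps discovery working across versions.
    The substring check is illustrative, not the real trawler logic.
    """
    return "IBM DataPower Gateway" in (annotations or {}).get("productName", "")

# v10 pods carry productName but no restPort:
v10 = {"productName": "IBM DataPower Gateway Virtual Edition - Production Edition"}
# v2018 pods carry both:
v2018 = {"productName": "IBM DataPower Gateway Virtual Edition for Developers",
         "restPort": "5554"}
```

Both example pods above match, while a pod without the productName annotation is skipped.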
At the moment trawler delivers no metrics for the portal submodule. It would be nice to have some metrics giving an overview of the portal's health.
Screenshots and JSON
Simply put, my problem is that I am not seeing the "promised" metrics in Prometheus that are linked here.
I have apiconnect-trawler installed in my "monitoring" namespace, in a cluster where API Connect is installed in its own namespace. After some initial problems, it has been running for a week now. Now I wanted to use metrics like apiconnect_health_status or datapower_gateway_peering_primary_info.
DataPower: too many metrics, more than documented ...
Cert monitoring: working as expected.
Is it possible that I have a configuration error? Or have the metrics simply not been implemented yet?
trawler:
  frequency: 10
  use_kubeconfig: false
  logging:
    level: debug
    format: json
prometheus:
  port: 63512
  enabled: true
graphite:
  enabled: false
nets:
  datapower:
    enabled: true
    timeout: 5
    username: admin
    namespace: apic
  manager:
    enabled: true
    username: admin
    namespace: apic
    max_frequency: 300
    process_org_metrics: true
    grant_type: password
  analytics:
    enabled: true
    namespace: apic
  certs:
    enabled: true
  product:
    enabled: true
    username: admin
    namespace: apic
Add an option to use mTLS on the inbound communication from Prometheus.
Gather the gateway peering status data so that it can be viewed, specifically the Primary node for a peer group.
The CLI command is show gateway-peering-status
and the equivalent REST call is /mgmt/status/{domain}/GatewayPeeringStatus
The returned JSON format is
{
  "_links": {
    "self": {
      "href": "/mgmt/status/apiconnect/GatewayPeeringStatus"
    },
    "doc": {
      "href": "/mgmt/docs/status/GatewayPeeringStatus"
    }
  },
  "GatewayPeeringStatus": [
    {
      "Address": "IP Node 1",
      "Name": "gwd",
      "PendingUpdates": 0,
      "ReplicationOffset": 5881225785,
      "LinkStatus": "ok",
      "Primary": "no"
    },
    ....
    {
      "Address": "IP Node 2",
      "Name": "gwd",
      "PendingUpdates": 0,
      "ReplicationOffset": 5881225785,
      "LinkStatus": "ok",
      "Primary": "no"
    },
    ....
    {
      "Address": "IP Node 3",
      "Name": "gwd",
      "PendingUpdates": 0,
      "ReplicationOffset": 5881225785,
      "LinkStatus": "ok",
      "Primary": "yes"
    },
    ...
  ]
}
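One possible shape for flattening this JSON into Prometheus samples, sketched as a pure function. The metric and label names here are illustrative only; the final naming is still to be decided.

```python
def peering_metrics(status):
    """Convert a GatewayPeeringStatus response body into exposition-format lines.

    One sample per peer: a 0/1 primary flag plus the pending-update count.
    Metric and label names are placeholders, not trawler's actual names.
    """
    lines = []
    for entry in status.get("GatewayPeeringStatus", []):
        labels = 'name="%s",address="%s"' % (entry["Name"], entry["Address"])
        primary = 1 if entry.get("Primary") == "yes" else 0
        lines.append("datapower_gateway_peering_primary{%s} %d" % (labels, primary))
        lines.append("datapower_gateway_peering_pending_updates{%s} %d"
                     % (labels, entry.get("PendingUpdates", 0)))
    return lines
```

With the example response above, the node 3 entry would yield a primary sample of 1 and the others 0, which makes "which node is primary" a trivial PromQL query.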
The output format needs to be determined, probably following the naming standard at https://prometheus.io/docs/practices/naming/
For v5c gateways it is key to keep track of the document cache.
Trawler is supposed to be a read-only monitoring tool, but it is currently attempting to change the state of statistics on DataPower:
https://github.com/IBM/apiconnect-trawler/blob/main/datapower_net.py#L116
This should be replaced with code that just checks whether statistics are enabled.
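A minimal sketch of the read-only check. It assumes the parsed JSON body of a GET on the Statistics configuration object (something like /mgmt/config/{domain}/Statistics on the REST management interface) contains an mAdminState field; both the path and the field name are assumptions about the DataPower API, not verified against it.

```python
def statistics_enabled(config_response):
    """Return True if the Statistics config object reports mAdminState 'enabled'.

    config_response is the parsed JSON body of a GET on the Statistics
    configuration object (path and field names assumed). Only inspects
    state; never issues a write to change it.
    """
    stats = config_response.get("Statistics", {})
    return stats.get("mAdminState") == "enabled"
```

If statistics turn out to be disabled, trawler could log a warning and skip the dependent status providers instead of flipping the setting itself.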
For example:
manager_users{pod="gateway-2"} 3609.0
Example trace:
{"channel": "management", "exception": null, "level": "info", "message": "Getting data from API Manager", "num_indent": 0, "timestamp": "2022-10-24T11:40:09.618886"}
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib64/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python3.8/http/client.py", line 285, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib64/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python3.8/http/client.py", line 285, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/trawler.py", line 213, in <module>
cli()
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/app/trawler.py", line 209, in cli
trawler.trawl_metrics()
File "/usr/local/lib/python3.8/site-packages/alog/alog.py", line 798, in wrapper
return func(*args, **kwargs)
File "/app/trawler.py", line 194, in trawl_metrics
net.fish()
File "/usr/local/lib/python3.8/site-packages/alog/alog.py", line 798, in wrapper
return func(*args, **kwargs)
File "/app/manager_net.py", line 184, in fish
response = requests.get(
File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 547, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
The RBAC definition in the repo was not complete (missing the apiGroup for analytics):
- apiGroups: ["analytics.apiconnect.ibm.com"]
  resources: ["analyticsclusters"]
  verbs: ["get","list"]
Poll the custom resources in the cluster and expose their status conditions to Prometheus.
e.g. for a ManagementCluster:
conditions:
- lastTransitionTime: "2022-06-15T09:11:42Z"
  message: ""
  reason: na
  status: "False"
  type: Warning
- lastTransitionTime: "2022-06-15T09:21:37Z"
  message: 17/17
  reason: na
  status: "True"
  type: Ready
- lastTransitionTime: "2022-06-15T09:20:57Z"
  message: ""
  reason: na
  status: "False"
  type: Pending
- lastTransitionTime: "2022-06-15T09:11:42Z"
  message: ""
  reason: na
  status: "False"
  type: Error
- lastTransitionTime: "2022-06-15T09:11:42Z"
  message: ""
  reason: na
  status: "False"
  type: Failed
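One possible shape for exposing these conditions, sketched as a pure function that flattens them into exposition-format samples: one 0/1 gauge per condition type. The metric and label names are illustrative only.

```python
def condition_metrics(kind, name, conditions):
    """Flatten a custom resource's status conditions into gauge samples.

    One sample per condition type, value 1 when status is "True",
    labelled with the CR name. Naming is a suggestion, not trawler's.
    """
    samples = []
    for cond in conditions:
        value = 1 if cond.get("status") == "True" else 0
        samples.append(
            'apiconnect_%s_condition{name="%s",type="%s"} %d'
            % (kind.lower(), name, cond["type"], value))
    return samples
```

For the ManagementCluster above this would yield Ready=1 and Warning/Pending/Error/Failed=0, so an alert on `type="Error"` becoming 1 is straightforward.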
Hi,
thank you for providing the opportunity to adjust the timeout in the datapower_net.
We noticed in our logs that something is gathered now:
2023-03-16T08:05:59.902914 [trawl:INFO] Trawling for metrics...
2023-03-16T08:06:00.278873 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:00.443304 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:00.677276 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:00.825884 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:01.091210 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:01.275606 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:01.547945 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:01.720889 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:01.973523 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:02.140620 [datap:INFO] DataPowers in list: 6
2023-03-16T08:06:02.369868 [datap:INFO] Processing status provider ObjectInstanceCounts
2023-03-16T08:06:02.523395 [datap:INFO] DataPowers in list: 6
Unfortunately, however, the data are not showing up in Grafana yet.
We noticed that instead of querying ObjectStatus, ObjectInstanceCounts is now fetched.
Q1: Does this change the identifier where we would find them?
Example identifiers we would expect to find:
It would be nice if the "manager_" metrics offered a way to see correspondences between them. If, for example, I want to see how many spaces a provider organization has, a "porg" label on the spaces metric would help.
At the moment the analytics module only provides a metric for API calls with their HTTP status code over the last hour. This is not precise enough to use e.g. for alerts. We need these metrics with a 30-second window.
Hello,
in our deployment, trawler times out while fetching the ObjectStatus over the REST management interface (RMI) of the DataPower gateways.
Currently, the timeout is hardcoded to 1s.
In our test environment, we observe fetch times of 5s (environment A) and 15s (environment B) with a download size of roughly 25 MiB.
Please provide a way to configure such timeouts from the outside, e.g. via environment variables that we can set in a ConfigMap.
Affected line:
https://github.com/IBM/apiconnect-trawler/blob/main/datapower_net.py#L264
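The timeout resolution could be sketched with an environment-variable override, which a ConfigMap (or the pod spec) can set without rebuilding the image. The DATAPOWER_TIMEOUT name and the config key precedence are suggestions, not existing trawler behaviour.

```python
import os

def get_timeout(config=None, default=1.0):
    """Resolve the DataPower fetch timeout in seconds.

    Suggested precedence: DATAPOWER_TIMEOUT environment variable (easy to
    set from a ConfigMap), then a nets.datapower timeout config key, then
    the previously hardcoded 1s default. All names are illustrative.
    """
    env = os.environ.get("DATAPOWER_TIMEOUT")
    if env:
        return float(env)
    if config and "timeout" in config:
        return float(config["timeout"])
    return default
```

The resolved value would then be passed as the `timeout=` argument of the requests call at the affected line.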
Specify -Werror in the Python testing call, then identify and resolve the issues found.
Move to a model where the metric name is common but the certs are labelled by name
I noticed that some endpoint certs were not getting checked by trawler.
It's running foul of this bit of code:
https://github.com/IBM/apiconnect-trawler/blob/main/certs_net.py#L52-L63
which assumes the secrets we're interested in will have both a ca.crt and a tls.crt. However, if the certs are issued by a trusted CA or simply self-signed, there may not be a ca.crt in the data, so this code skips over them.
It's also possible that if you only have a ca.crt, the code will blow up when trying to get the expiry for tls.crt.
Suggest refactoring so that ca.crt and tls.crt each have their own existence check.
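The suggested refactor could look roughly like this: a generator that yields only the keys a secret actually contains, so each key gets an independent existence check (a sketch, not the actual certs_net.py code).

```python
def certs_to_check(secret_data):
    """Yield (key, value) pairs for whichever of ca.crt / tls.crt a secret has.

    Each key gets its own existence check, so secrets holding only a
    tls.crt (trusted-CA or self-signed) or only a ca.crt are no longer
    skipped, and missing keys never raise.
    """
    for key in ("ca.crt", "tls.crt"):
        if key in (secret_data or {}):
            yield key, secret_data[key]
```

The caller would then compute the expiry for each yielded cert individually instead of unconditionally reading both keys.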
When running the project with a config.yaml file that doesn't include the graphite key, the following error is raised:
Traceback (most recent call last):
...
File "/app/trawler.py", line 55, in __init__
if self.config['graphite']['enabled']:
KeyError: 'graphite'
This seems to indicate that the graphite key is expected in the config.yaml file, although it is not included in the example configurations or documented.
SECRETS=test-assets coverage run --source . -m py.test
python3 trawler.py --config deployment/config.yaml
The application should either run without requiring the graphite key or should provide a more descriptive error message if the key is required.
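A defensive lookup with dict.get would let the application run without the key (a sketch of the first option; the function name is illustrative):

```python
def graphite_enabled(config):
    """Safely read config['graphite']['enabled'], defaulting to False.

    dict.get with an empty-dict fallback avoids the KeyError raised when
    config.yaml omits the graphite section entirely.
    """
    return config.get("graphite", {}).get("enabled", False)
```

The same pattern would apply to any other optional top-level section, such as prometheus.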
Document the graphite key: update the example config.yaml file and documentation to include the graphite key. For example:
# Example configuration file
trawler:
  frequency: 10
  use_kubeconfig: false
  logging:
    level: debug
    filters: trawler:trace
    format: pretty
prometheus:
  port: 63512
  enabled: true
graphite:
  enabled: false
nets:
  datapower:
    enabled: true
    username: admin
    namespace: apic
  manager:
    enabled: true
    username: admin
    namespace: apic
  analytics:
    enabled: true
    namespace: apic