guilbaults / infiniband-exporter
Prometheus exporter for an Infiniband Fabric
License: Apache License 2.0
Hi,
it would be very helpful if the print calls included timestamps.
Better still, I would suggest using a logger directly for info and error messages.
Best
Gabriele
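A minimal sketch of what the logger suggestion above could look like, using only Python's standard logging module. The logger name, the make_logger helper and the format string are assumptions chosen for illustration, not the exporter's actual code.

```python
import logging
import sys


def make_logger(name="infiniband-exporter", verbose=False):
    """Return a logger that prefixes every message with a timestamp and level.

    The name and format string are illustrative assumptions.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = make_logger()
log.info("Starting scrape")                       # replaces a bare print()
log.error("Could not process line from STDERR")   # replaces print(..., file=sys.stderr)
```

With this in place, info and error messages carry a timestamp without each call site having to format one.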
At the moment, only the statistics of the switches are collected. The original idea was to limit interaction with the compute nodes: since every packet goes through a switch, the switch counters see all the traffic. As a result, this implementation does not capture errors that are localized on the HCA of a compute node; only the errors on the switches are seen.
ibqueryerrors is currently called with --switch; it should run without that flag, or run twice with the --ca flag on the second execution.
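The two options described above can be sketched as follows. This is an illustration only: the scrape_commands helper and the two_pass parameter are hypothetical, and the base flags are taken from an ibqueryerrors invocation quoted elsewhere on this page rather than from the exporter's source.

```python
# Base flags as seen in an ibqueryerrors call quoted on this page;
# the exporter's real argument list may differ.
BASE = ["ibqueryerrors", "--verbose", "--details", "--data"]


def scrape_commands(two_pass=False):
    """Return the list of commands to run for one scrape.

    two_pass=False: a single run with no node-type flag, as the issue suggests.
    two_pass=True:  one run restricted to switches, a second restricted to CAs.
    """
    if two_pass:
        return [BASE + ["--switch"], BASE + ["--ca"]]
    return [list(BASE)]


for cmd in scrape_commands(two_pass=True):
    print(" ".join(cmd))
```

A single run is simpler, while two runs keep switch and CA output separate at the cost of a second fabric sweep.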
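According to the exporter's notes on the scrape interval, a fabric of 60 switches and 900 servers takes about 3 s to scrape. How can this be optimized to complete within 1 s?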
Hi,
it looks like the metric symbolerrorcounter is not exported for positive values.
A short example:
Run ibqueryerrors and check for SymbolErrorCounter errors:
$ ibqueryerrors > ibqueryerrors.txt
$ grep "SymbolErrorCounter" ibqueryerrors.txt
GUID 0xc42a10300dcf8e2 port 1: [SymbolErrorCounter == 3] [PortXmitWait == 12097916]
GUID 0xc42a10300dd0bea port 1: [SymbolErrorCounter == 4] [PortXmitWait == 224462059]
$ ./infiniband-exporter.py --verbose 2>exporter_stderr.txt
$ wget localhost:9683/metrics
$ grep "symbolerrorcounter" metrics | cut -d "}" -f 2 | sort | uniq -c
2558 0.0
1 # HELP infiniband_symbolerrorcounter_total Total number of minor link errors detected on one or more physical lanes.
1 # TYPE infiniband_symbolerrorcounter_total counter
$ grep "0xc42a10300dcf8e2" metrics
$ grep "0xc42a10300dd0bea" metrics
As you can see, no positive values are exported for the symbolerrorcounter metric, and neither of the two GUIDs is listed.
For completeness, I have attached the exporter's redirected stderr output:
exporter_stderr.txt
The GUIDs from above are not listed in exporter_stderr.txt.
I would have expected the metrics for these GUIDs to be exported.
The metric for PortXmitWait is then missing as well.
Can you please verify?
If I am not mistaken, we should also check for other metrics that are not exported.
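To cross-check which ports report non-zero counters, the ibqueryerrors lines shown above can be parsed independently of the exporter. The sketch below is an assumption-laden illustration: the regular expressions and the parse_port_line helper are hypothetical and only cover the line format quoted in this issue.

```python
import re

# Hypothetical patterns matching lines such as:
# GUID 0xc42a10300dcf8e2 port 1: [SymbolErrorCounter == 3] [PortXmitWait == 12097916]
PORT_RE = re.compile(r"GUID (0x[0-9a-fA-F]+) port (\d+):")
COUNTER_RE = re.compile(r"\[(\w+) == (\d+)")


def parse_port_line(line):
    """Return (guid, port, counters) for a port line, or None if it is not one."""
    m = PORT_RE.search(line)
    if m is None:
        return None
    counters = {name: int(value) for name, value in COUNTER_RE.findall(line)}
    return m.group(1), int(m.group(2)), counters


line = "GUID 0xc42a10300dcf8e2 port 1: [SymbolErrorCounter == 3] [PortXmitWait == 12097916]"
guid, port, counters = parse_port_line(line)
print(guid, port, counters["SymbolErrorCounter"])
```

Comparing the GUIDs found this way against the GUIDs present in the scraped metrics file would show which ports were dropped.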
I would like to test the exporter with a local metrics file,
but I cannot get that working; I will create a separate issue for it.
Best
Gabriele
Hi, I can't find node_infiniband_port_data_received_bytes_total and node_infiniband_port_data_transmitted_bytes_total in the exported metrics. How can I calculate them?
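One way to derive byte totals, sketched under the assumption that the exporter exposes the raw PortXmitData and PortRcvData counters: InfiniBand reports these counters in units of four octets, so multiplying by 4 gives bytes. The helper name below is hypothetical; the sample values come from an ibqueryerrors report quoted elsewhere on this page, where the tool itself prints 24.412MB and 6.710MB for them.

```python
IB_DATA_WORD_BYTES = 4  # PortXmitData / PortRcvData count 4-octet words


def port_data_to_bytes(counter_value):
    """Convert a raw PortXmitData or PortRcvData counter value to bytes."""
    return counter_value * IB_DATA_WORD_BYTES


xmit_bytes = port_data_to_bytes(6399401)   # PortXmitData from the sample report
rcv_bytes = port_data_to_bytes(1758872)    # PortRcvData from the sample report
print(xmit_bytes, round(xmit_bytes / 2**20, 3))
print(rcv_bytes, round(rcv_bytes / 2**20, 3))
```

In a Prometheus query, the same factor of 4 could be applied to whichever metric carries these counters.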
Hi,
it would be very useful to have the following metrics exported as well:
What do you think?
Thank you for creating and releasing this package. I was surprised to see Version 0.0.6 after installing infiniband-exporter-0.0.7-1.el7.noarch.rpm.
# Download the latest release
wget https://github.com/guilbaults/infiniband-exporter/releases/download/v0.0.7/infiniband-exporter-0.0.7-1.el7.noarch.rpm
# Install the package
sudo rpm -ivh infiniband-exporter-0.0.7-1.el7.noarch.rpm
# Check the installation
infiniband-exporter --version
# Version 0.0.6
Hi,
I tried to run the exporter with its input data read from a file, but it crashes.
Execute exporter:
$ ./infiniband-exporter.py --from-file ../ibqueryerrors_nohostnames.txt 2>exporter_errors.txt
It prints error messages of the form Unknown link state on guid... and finally crashes with the following Python errors:
File "./infiniband-exporter.py", line 346, in collect
self.parse_switch(switch_name, item[0], item[1])
File "./infiniband-exporter.py", line 222, in parse_switch
guid = m_port.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
^CTraceback (most recent call last):
File "./infiniband-exporter.py", line 437, in <module>
Files:
It looks like match objects are not generated because of the Unknown link state errors, but the code expects them, so the exporter crashes when it accesses a NoneType object.
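The crash pattern described above (calling .group() on a failed match) can be avoided by checking the match result before using it. The following is a minimal sketch, not the exporter's code: the pattern and the collect_guids helper are hypothetical and stand in for whatever parse_switch actually does.

```python
import logging
import re

# Hypothetical pattern for illustration; the exporter's real regex may differ.
PORT_RE = re.compile(r"GUID (0x[0-9a-fA-F]+) port (\d+):")


def collect_guids(lines):
    """Return (guids, skipped) without raising on lines that do not match."""
    guids, skipped = [], 0
    for line in lines:
        m_port = PORT_RE.search(line)
        if m_port is None:
            # re.search returns None when nothing matches; calling .group()
            # on that None is what raises the AttributeError in the traceback.
            logging.warning("Skipping unparsable line: %s", line)
            skipped += 1
            continue
        guids.append(m_port.group(1))
    return guids, skipped


guids, skipped = collect_guids([
    "GUID 0xc42a10300dcf8e2 port 1: [SymbolErrorCounter == 3]",
    "some line the parser does not understand",
])
print(guids, skipped)
```

Counting skipped lines instead of crashing keeps the scrape alive and still makes the parsing problem visible.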
For completeness, a Labels column should be added to the table of exported STDERR metrics.
We also see errors on STDERR from ibqueryerrors that indicate that collecting metrics for specific CAs is failing.
These errors could be collected the same way as the bad status errors, probably as something like a query failed error metric.
The exporter prints the following right now:
2021-06-23 14:49:39,353 - ERROR - Could not process line from STDERR: ibwarn: [1203] query_and_dump: PortXmitDiscardDetails query failed on HOSTXXX, Lid 1051 port 1
...
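A possible starting point for such a query failed metric, sketched under the assumption that all affected lines look like the ibwarn line quoted above. The regular expression, the label choice (node, port, counter) and the count_query_failures helper are hypothetical; turning the result into a Prometheus metric family is left out.

```python
import re
from collections import Counter

# Hypothetical pattern derived from the single ibwarn line quoted above.
QUERY_FAILED_RE = re.compile(
    r"ibwarn: \[\d+\] \w+: (?P<counter>\w+) query failed on "
    r"(?P<node>.+), Lid (?P<lid>\d+) port (?P<port>\d+)"
)


def count_query_failures(stderr_lines):
    """Count failed queries per (node, port, counter) from ibqueryerrors stderr."""
    failures = Counter()
    for line in stderr_lines:
        m = QUERY_FAILED_RE.search(line)
        if m:
            key = (m.group("node"), int(m.group("port")), m.group("counter"))
            failures[key] += 1
    return failures


sample = [
    "ibwarn: [1203] query_and_dump: PortXmitDiscardDetails query failed on HOSTXXX, Lid 1051 port 1",
]
print(count_query_failures(sample))
```

Each key of the returned counter maps naturally onto a set of metric labels.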
Travis-ci.org is being shut down; the linting and RPM build need to be transferred to GitHub.
Dear Simon,
your tool is greatly appreciated: gathering all IB statistics in one place (instead of from a thousand separate "node-exporter" instances on all compute nodes) is the best approach, since all IB fabric counters are accessible from just one node.
However, infiniband-exporter.py fails to interpret our ibqueryerrors output. Firstly, it refused to work without a --node-name-map file (even though our nodes are explicitly named in the ibqueryerrors output):
Traceback (most recent call last):
File "./infiniband-exporter.py", line 359, in <module>
args.node_name_map))
File "./infiniband-exporter.py", line 34, in __init__
if self.node_name_map:
AttributeError: 'InfinibandCollector' object has no attribute 'node_name_map'
Secondly, creating such a node-name-map file and running again yields another error:
Traceback (most recent call last):
File "./infiniband-exporter.py", line 359, in <module>
args.node_name_map))
File "/home/phew/.local/lib/python2.7/site-packages/prometheus_client/registry.py", line 26, in register
names = self._get_names(collector)
File "/home/phew/.local/lib/python2.7/site-packages/prometheus_client/registry.py", line 66, in _get_names
for metric in desc_func():
File "./infiniband-exporter.py", line 318, in collect
self.parse_switch(switch_name, item[0], item[1])
File "./infiniband-exporter.py", line 239, in parse_switch
m_link.group('remote_GUID'),
AttributeError: 'NoneType' object has no attribute 'group'
I tried to fix it myself, but I am an administrator and in no way a Python programmer. Could you find the time to have a look and possibly fix the problem?
If you want, I can send you an example output of our fabric.
Thanks!
I just started using this exporter recently and found a total of three counters that showed up as missing in my error output while running it on my two clusters. The names are:
PortXmitConstraintErrors
PortMalformedPktErrors
PortSwLifetimeLimitDiscards
I was able to get rid of the errors by adding those to self.counter_info locally. I'd like it if they could be added to the upstream code.
Thanks!
Hi,
could you please add the discussions feature to the project page
(Settings->Options->Features: Enable Discussions),
so we could ask questions or start a discussion?
That would be great!
Best
Gabriele
Hello, I'm getting an error when attempting to start the infiniband exporter service:
Started Infiniband_exporter.
Aug 11 14:20:42 <nodename> python3[374627]: Traceback (most recent call last):
Aug 11 14:20:42 <nodename> python3[374627]: File "/usr/bin/infiniband_exporter.py", line 12, in
Aug 11 14:20:42 <nodename> python3[374627]: from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily
Aug 11 14:20:42 <nodename> python3[374627]: ModuleNotFoundError: No module named 'prometheus_client'
Aug 11 14:20:42 <nodename> systemd[1]: infiniband_exporter.service: main process exited, code=exited, status=1/FAILURE
Aug 11 14:20:42 <nodename> systemd[1]: Unit infiniband_exporter.service entered failed state.
Aug 11 14:20:42 <nodename> systemd[1]: infiniband_exporter.service failed.
Is there a dependency I am missing that provides CounterMetricFamily and GaugeMetricFamily? If so, where do I get it?
I'm using v0.0.4.
Thanks
After commit c6eef51 I got
src/query_smp.c:199; umad (DR path slid 0; dlid 0; 0,1,10,19 Attr 0x11:0) bad status 110; Connection timed out
infiniband_scrape_ok 0.0
Hello,
I found that some nodes were missing from my Grafana panels. I traced this to the behavior of ibqueryerrors, which does not report node information if it's not a "bad" node (a node with errors).
For example, here is the report for a node without errors:
# ibqueryerrors --verbose --details --data --report-port --switch --ca --threshold-file ./error_thresholds -G 0xb8cef60300a1d92a
## Summary: 1 nodes checked, 0 bad nodes found
## 1 ports checked, 0 ports have errors beyond threshold
## Thresholds: [SymbolErrorCounter = 0][LinkErrorRecoveryCounter = 0][LinkDownedCounter = 0][PortRcvErrors = 0][PortRcvRemotePhysicalErrors = 0][PortRcvSwitchRelayErrors = 0][PortXmitDiscards = 0][PortXmitConstraintErrors = 0][PortRcvConstraintErrors = 0][LocalLinkIntegrityErrors = 0][ExcessiveBufferOverrunErrors = 0][VL15Dropped = 0][PortXmitWait = 0]
## Suppressed:
And the report for a 'bad' node:
# ibqueryerrors --verbose --details --data --report-port --switch --ca --threshold-file ./error_thresholds -G 0x0c42a1030079989c
Errors for "maestro-3002 HCA-1"
GUID 0xc42a1030079989c port 1: [PortXmitWait == 2544] [PortXmitData == 6399401 (24.412MB)] [PortRcvData == 1758872 (6.710MB)] [PortXmitPkts == 13959 (13.632K)] [PortRcvPkts == 13514 (13.197K)] [PortUnicastXmitPkts == 13959 (13.632K)] [PortUnicastRcvPkts == 13514 (13.197K)]
Link info: 155 1[ ] ==( 4X 53.125 Gbps Active/ LinkUp)==> [ ] "" ( )
## Summary: 1 nodes checked, 1 bad nodes found
## 1 ports checked, 1 ports have errors beyond threshold
## Thresholds: [SymbolErrorCounter = 0][LinkErrorRecoveryCounter = 0][LinkDownedCounter = 0][PortRcvErrors = 0][PortRcvRemotePhysicalErrors = 0][PortRcvSwitchRelayErrors = 0][PortXmitDiscards = 0][PortXmitConstraintErrors = 0][PortRcvConstraintErrors = 0][LocalLinkIntegrityErrors = 0][ExcessiveBufferOverrunErrors = 0][VL15Dropped = 0][PortXmitWait = 0]
## Suppressed:
Indeed, the 'good' node does not report any errors at the moment:
# perfquery -G 0xb8cef60300a1d92a 1
# Port counters: Lid 160 port 1 (CapMask: 0x5A00)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............0
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................0
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
QP1Dropped:......................0
VL15Dropped:.....................0
PortXmitData:....................14804777
PortRcvData:.....................4168543
PortXmitPkts:....................32281
PortRcvPkts:.....................31220
PortXmitWait:....................0
In that case, I guess infiniband-exporter.py cannot extract information for this node. I can see the equivalent information from the other side of the link, using remote_name, so I can work around it if I really need to retrieve the values. But it somewhat breaks the global view of the fabric I've built in Grafana, since nodes without errors can be missing.
Maybe I've missed something? If not, do you have a suggestion?
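One conceivable workaround, offered only as a sketch: for nodes that ibqueryerrors omits, the perfquery output shown above could be parsed to recover their counters. The parse_perfquery helper and its regular expression are hypothetical and cover only the Name:....value layout quoted in this issue.

```python
import re

# Matches lines such as "PortXmitData:....................14804777"
PERF_RE = re.compile(r"^(\w+):\.+(\S+)$")


def parse_perfquery(text):
    """Parse perfquery output into a dict of counter name -> integer value."""
    counters = {}
    for line in text.splitlines():
        m = PERF_RE.match(line.strip())
        if m is None:
            continue  # header lines such as "# Port counters: ..." are skipped
        name, raw = m.groups()
        try:
            counters[name] = int(raw, 0)  # base 0 accepts decimal and 0x-prefixed hex
        except ValueError:
            continue  # ignore values that are not plain integers
    return counters


sample = """# Port counters: Lid 160 port 1 (CapMask: 0x5A00)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............0
PortXmitData:....................14804777
PortRcvData:.....................4168543
"""
print(parse_perfquery(sample))
```

Whether an extra perfquery call per healthy node is acceptable depends on the fabric size, so this trades completeness against scrape time.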
While our site plans on using NODE_NAME_MAP, I noticed that, by default, the service doesn't work if one isn't supplied.
As a workaround, we do something like
/usr/bin/infiniband-exporter --node-name-map /dev/null
Can the code act similarly if NODE_NAME_MAP is not set?
Thanks
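The requested behavior could be sketched as below, assuming NODE_NAME_MAP is an environment variable read by the service. The ibqueryerrors_cmd helper is hypothetical; the point is only that the --node-name-map argument is appended when a map is configured and omitted otherwise, so no /dev/null placeholder is needed.

```python
import os


def ibqueryerrors_cmd(node_name_map=None):
    """Build the ibqueryerrors command, adding the map only when one is given."""
    cmd = ["ibqueryerrors", "--switch"]
    if node_name_map:  # None and "" both mean "no map configured"
        cmd += ["--node-name-map", node_name_map]
    return cmd


# Assumption: the service passes the map via the NODE_NAME_MAP environment variable.
cmd = ibqueryerrors_cmd(os.environ.get("NODE_NAME_MAP"))
print(cmd)
```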