infiniband-exporter's People

Contributors

bensallen, gabrieleiannetti, guilbaults, jbd, jknedlik, likueimo, lramosrocha, mark-tomich, mglants, mrobbert

infiniband-exporter's Issues

Gather HCA statistics

At the moment, only the switch statistics are collected. The original idea was to limit interaction with the compute nodes: every packet goes through a switch, so the switch counters see all the traffic. As a result, this implementation does not capture errors that are local to the HCA of a compute node; only errors on the switches are seen.

ibqueryerrors is currently called with --switch. It should either run without that flag, or run a second time with the --ca flag.
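
A minimal sketch of the two-run option, assuming the exporter launches ibqueryerrors from a list of arguments. The helper name `build_ibqueryerrors_commands` and the base flag set are hypothetical; `--switch`, `--ca`, `--verbose` and `--details` are the flags that appear in these issues.

```python
def build_ibqueryerrors_commands(include_ca=True):
    """Return one argv list per ibqueryerrors run (hypothetical helper)."""
    # Base flags are an assumption; adjust to whatever the exporter really uses.
    base = ["ibqueryerrors", "--verbose", "--details"]
    # First run: switches only, as today.
    commands = [base + ["--switch"]]
    if include_ca:
        # Second run: channel adapters (HCAs) on the compute nodes.
        commands.append(base + ["--ca"])
    return commands
```

Each returned list could be passed to `subprocess.run`, and the outputs parsed by the same code path.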

SymbolErrorCounter with positive value not exported

Hi,

It looks like the symbolerrorcounter metric is not exported for positive values.

A short example:

  1. Run ibqueryerrors and check for SymbolErrorCounter errors:
$ ibqueryerrors > ibqueryerrors.txt

$ grep "SymbolErrorCounter" ibqueryerrors.txt

GUID 0xc42a10300dcf8e2 port 1: [SymbolErrorCounter == 3] [PortXmitWait == 12097916]
GUID 0xc42a10300dd0bea port 1: [SymbolErrorCounter == 4] [PortXmitWait == 224462059]
  2. Run the exporter, fetch the exported metrics, and check for the errors:
$ ./infiniband-exporter.py --verbose 2>exporter_stderr.txt

$ wget localhost:9683/metrics

$ grep "symbolerrorcounter" metrics | cut -d "}" -f 2 | sort | uniq -c
   2558  0.0
      1 # HELP infiniband_symbolerrorcounter_total Total number of minor link errors detected on one or more physical lanes.
      1 # TYPE infiniband_symbolerrorcounter_total counter

$ grep "0xc42a10300dcf8e2" metrics 
$ grep "0xc42a10300dd0bea" metrics 

As you can see, no positive values for the symbolerrorcounter metric are exported, and neither of the two GUIDs is listed.
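
The manual comparison above could be scripted. The following is only a sketch, assuming port lines look like the two shown in the grep output; the regular expressions and function names are hypothetical and are not taken from the exporter.

```python
import re

# Matches e.g. "GUID 0xc42a10300dcf8e2 port 1: [SymbolErrorCounter == 3] ..."
PORT_LINE = re.compile(r"GUID (0x[0-9a-fA-F]+) port (\d+):\s*(.*)")
# Matches the start of each "[Name == value" counter entry.
COUNTER = re.compile(r"\[(\w+) == (\d+)")


def parse_port_line(line):
    """Return (guid, port, counters) or None if the line is not a port line."""
    m = PORT_LINE.search(line)
    if m is None:
        return None
    counters = {name: int(value) for name, value in COUNTER.findall(m.group(3))}
    return m.group(1), int(m.group(2)), counters


def ports_with_positive(lines, counter_name):
    """Map (guid, port) to the counter value for every port where it is > 0."""
    found = {}
    for line in lines:
        parsed = parse_port_line(line)
        if parsed is None:
            continue
        guid, port, counters = parsed
        if counters.get(counter_name, 0) > 0:
            found[(guid, port)] = counters[counter_name]
    return found
```

The resulting GUIDs can then be searched for in the text returned by the /metrics endpoint.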

For completeness I have added the redirected messages to stderr from the exporter:
exporter_stderr.txt

The GUIDs from above are not listed in the exporter_stderr.txt.

I would have expected the metrics for these GUIDs to be exported.
The PortXmitWait metric for them is missing as well.

Can you please verify?

If I am not mistaken, we should also check for other metrics that are not exported.

I would like to test the exporter with a local metrics file,
but I cannot get it working. I will create a separate issue for that.

Best
Gabriele

Mismatch between infiniband-exporter --version and downloaded RPM

Thank you for creating and releasing this package. I was surprised to see Version 0.0.6 after installing infiniband-exporter-0.0.7-1.el7.noarch.rpm.

# Download the latest release
wget https://github.com/guilbaults/infiniband-exporter/releases/download/v0.0.7/infiniband-exporter-0.0.7-1.el7.noarch.rpm

# Install the package
sudo rpm -ivh infiniband-exporter-0.0.7-1.el7.noarch.rpm

# Check the installation
infiniband-exporter --version
# Version 0.0.6

Error processing from file

Hi,

I tried to run the exporter to process the input data from file, but it crashes.

Execute exporter:

$ ./infiniband-exporter.py --from-file ../ibqueryerrors_nohostnames.txt 2>exporter_errors.txt

It prints error messages such as "Unknown link state on guid..." and finally crashes with the following Python traceback:

  File "./infiniband-exporter.py", line 346, in collect
    self.parse_switch(switch_name, item[0], item[1])
  File "./infiniband-exporter.py", line 222, in parse_switch
    guid = m_port.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
^CTraceback (most recent call last):
  File "./infiniband-exporter.py", line 437, in <module>

Files:

It looks like the match objects are not created because of the Unknown link state errors,
but the code expects them, so the exporter crashes when it accesses a NoneType object.
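
One way to avoid this kind of crash is to check the result of the regular expression before calling `.group()`. The sketch below illustrates the guard only; the pattern `PORT_RE` and the function `collect_guids` are hypothetical and do not reproduce the exporter's actual parsing code.

```python
import logging
import re

# Hypothetical pattern for a port line such as "GUID 0x... port 1: ...".
PORT_RE = re.compile(r"GUID (0x[0-9a-fA-F]+) port (\d+):")


def collect_guids(lines):
    """Return (guids, skipped_lines) without raising on unparsable input."""
    guids, skipped = [], []
    for line in lines:
        m_port = PORT_RE.search(line)
        if m_port is None:
            # re.search returns None on no match; log and skip instead of
            # failing later with AttributeError on m_port.group(1).
            logging.warning("Could not parse port line: %r", line)
            skipped.append(line)
            continue
        guids.append(m_port.group(1))
    return guids, skipped
```

Skipped lines could also be counted and exported, so that parsing problems show up in monitoring rather than as a crash.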

Processing of query failed errors from STDERR retrieved by ibqueryerrors

We also see errors on STDERR from ibqueryerrors indicating that collecting metrics fails for specific CAs.

These errors could be collected in the same way as the bad status errors,
probably as something like a "query failed" error metric.

The exporter prints the following right now:

2021-06-23 14:49:39,353 - ERROR - Could not process line from STDERR: ibwarn: [1203] query_and_dump: PortXmitDiscardDetails query failed on HOSTXXX, Lid 1051 port 1   
...
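
A rough sketch of how such lines could be turned into countable events, assuming every affected STDERR line follows the format of the example above. The regular expression and the function names are hypothetical; wiring the result into a Prometheus metric is left out.

```python
import re
from collections import Counter

# Modeled on the single example line above; other ibwarn formats will not match.
QUERY_FAILED = re.compile(
    r"ibwarn: \[\d+\] query_and_dump: (?P<counter>\w+) query failed on "
    r"(?P<node>.+), Lid (?P<lid>\d+) port (?P<port>\d+)"
)


def parse_query_failed(line):
    """Return a dict describing the failed query, or None if no match."""
    m = QUERY_FAILED.search(line)
    if m is None:
        return None
    return {
        "counter": m.group("counter"),
        "node": m.group("node"),
        "lid": int(m.group("lid")),
        "port": int(m.group("port")),
    }


def count_query_failures(stderr_lines):
    """Count failures per (node, port, counter) across all STDERR lines."""
    failures = Counter()
    for line in stderr_lines:
        parsed = parse_query_failed(line)
        if parsed is not None:
            failures[(parsed["node"], parsed["port"], parsed["counter"])] += 1
    return failures
```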

AttributeError: 'NoneType' object has no attribute 'group'

Dear Simon,

Your tool is greatly appreciated. Gathering all IB statistics in one place, instead of from a thousand separate "node-exporter" instances on all compute nodes, is the best approach, given that all IB fabric counters are accessible from a single node.

However, infiniband-exporter.py fails to interpret our ibqueryerrors output. First, it refused to work without a --node-name-map file (even though our nodes are explicitly named in the ibqueryerrors output):

Traceback (most recent call last):
  File "./infiniband-exporter.py", line 359, in <module>
    args.node_name_map))
  File "./infiniband-exporter.py", line 34, in __init__
    if self.node_name_map:
AttributeError: 'InfinibandCollector' object has no attribute 'node_name_map'

Creating such a node-name-map file and running again yields a second error:

Traceback (most recent call last):
  File "./infiniband-exporter.py", line 359, in <module>
    args.node_name_map))
  File "/home/phew/.local/lib/python2.7/site-packages/prometheus_client/registry.py", line 26, in register
    names = self._get_names(collector)
  File "/home/phew/.local/lib/python2.7/site-packages/prometheus_client/registry.py", line 66, in _get_names
    for metric in desc_func():
  File "./infiniband-exporter.py", line 318, in collect
    self.parse_switch(switch_name, item[0], item[1])
  File "./infiniband-exporter.py", line 239, in parse_switch
    m_link.group('remote_GUID'),
AttributeError: 'NoneType' object has no attribute 'group'

I tried to fix it myself, but I am an administrator and in no way a Python programmer. Could you find the time to have a look at this and possibly fix the problem?

If you want, I can send you an example output of our fabric.
Thanks!

Missing counter names

I just started using this exporter recently and found a total of three counters that showed up as missing in my error output while running it on my two clusters. The names are:

PortXmitConstraintErrors
PortMalformedPktErrors
PortSwLifetimeLimitDiscards

I was able to get rid of the errors by adding those to self.counter_info locally. I'd like it if they could be added to the upstream code.
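
To spot such gaps ahead of time, the counter names seen in the ibqueryerrors output can be compared with the names the exporter knows about. This is only a sketch: `unknown_counters` is a hypothetical helper, and the internal structure of `self.counter_info` is not assumed here beyond it providing a collection of names.

```python
def unknown_counters(seen_names, known_names):
    """Return the sorted counter names that were seen but are not known."""
    return sorted(set(seen_names) - set(known_names))
```

For example, calling it with the names parsed from the output and the keys of the exporter's counter table would list every counter that still needs an entry.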

Thanks!

Add Feature for Discussions to Project Page

Hi,

Could you please enable the Discussions feature on the project page
(Settings->Options->Features: Enable Discussions),
so we could ask questions or start a discussion?

That would be great!

Best
Gabriele

ModuleNotFoundError: No module named 'prometheus_client'

Hello, I'm getting an error when attempting to start the infiniband exporter service:

Started Infiniband_exporter.
Aug 11 14:20:42 <nodename> python3[374627]: Traceback (most recent call last):
Aug 11 14:20:42 <nodename> python3[374627]: File "/usr/bin/infiniband_exporter.py", line 12, in <module>
Aug 11 14:20:42 <nodename> python3[374627]: from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily
Aug 11 14:20:42 <nodename> python3[374627]: ModuleNotFoundError: No module named 'prometheus_client'
Aug 11 14:20:42 <nodename> systemd[1]: infiniband_exporter.service: main process exited, code=exited, status=1/FAILURE
Aug 11 14:20:42 <nodename> systemd[1]: Unit infiniband_exporter.service entered failed state.
Aug 11 14:20:42 <nodename> systemd[1]: infiniband_exporter.service failed.

Am I missing a dependency that provides CounterMetricFamily and GaugeMetricFamily? If so, where do I get it?

I'm using v0.0.4.
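
The traceback says that the Python interpreter running the service cannot find the prometheus_client package; CounterMetricFamily and GaugeMetricFamily are classes inside that package. A small check like the sketch below, run with the same python3 that the service uses, shows whether the module is visible. The helper name `module_available` is hypothetical.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Prints False when the package is missing for this interpreter.
    print(module_available("prometheus_client"))
```

If it prints False, the package needs to be installed for that interpreter, for example from PyPI (where it is published as prometheus_client) or from a distribution package; which one applies depends on the system.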

Thanks

Nodes without errors are not reported

Hello,

I found that some nodes were missing from my Grafana panels. I traced this to the behavior of ibqueryerrors, which does not report node information unless the node is a "bad" node (a node with errors).

For example, here is the report for a node without errors:

# ibqueryerrors --verbose --details --data --report-port --switch --ca --threshold-file ./error_thresholds -G 0xb8cef60300a1d92a

## Summary: 1 nodes checked, 0 bad nodes found
##          1 ports checked, 0 ports have errors beyond threshold
## Thresholds: [SymbolErrorCounter = 0][LinkErrorRecoveryCounter = 0][LinkDownedCounter = 0][PortRcvErrors = 0][PortRcvRemotePhysicalErrors = 0][PortRcvSwitchRelayErrors = 0][PortXmitDiscards = 0][PortXmitConstraintErrors = 0][PortRcvConstraintErrors = 0][LocalLinkIntegrityErrors = 0][ExcessiveBufferOverrunErrors = 0][VL15Dropped = 0][PortXmitWait = 0]
## Suppressed:

And the report for a 'bad' node:

# ibqueryerrors --verbose --details --data --report-port --switch --ca --threshold-file ./error_thresholds -G 0x0c42a1030079989c
Errors for "maestro-3002 HCA-1"
   GUID 0xc42a1030079989c port 1: [PortXmitWait == 2544] [PortXmitData == 6399401 (24.412MB)] [PortRcvData == 1758872 (6.710MB)] [PortXmitPkts == 13959 (13.632K)] [PortRcvPkts == 13514 (13.197K)] [PortUnicastXmitPkts == 13959 (13.632K)] [PortUnicastRcvPkts == 13514 (13.197K)]
       Link info:    155   1[  ] ==( 4X        53.125 Gbps Active/  LinkUp)==>             [  ] "" ( )

## Summary: 1 nodes checked, 1 bad nodes found
##          1 ports checked, 1 ports have errors beyond threshold
## Thresholds: [SymbolErrorCounter = 0][LinkErrorRecoveryCounter = 0][LinkDownedCounter = 0][PortRcvErrors = 0][PortRcvRemotePhysicalErrors = 0][PortRcvSwitchRelayErrors = 0][PortXmitDiscards = 0][PortXmitConstraintErrors = 0][PortRcvConstraintErrors = 0][LocalLinkIntegrityErrors = 0][ExcessiveBufferOverrunErrors = 0][VL15Dropped = 0][PortXmitWait = 0]
## Suppressed:

Indeed, the 'good' node does not report any errors at the moment:

# perfquery -G 0xb8cef60300a1d92a 1
# Port counters: Lid 160 port 1 (CapMask: 0x5A00)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............0
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................0
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
QP1Dropped:......................0
VL15Dropped:.....................0
PortXmitData:....................14804777
PortRcvData:.....................4168543
PortXmitPkts:....................32281
PortRcvPkts:.....................31220
PortXmitWait:....................0

In that case, I guess infiniband-exporter.py cannot extract information for this node. I can see the equivalent information from the other side of the link, using remote_name, so I can work around it if I really need to retrieve the values. But it somewhat breaks the global view of the fabric I've built in Grafana, since nodes without errors can be missing.
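
For nodes that ibqueryerrors leaves out, the perfquery output shown above is easy to parse. The following sketch assumes the "Name:....value" layout of that output; falling back to perfquery is an idea for discussion, not something the exporter is known to do, and `parse_perfquery` is a hypothetical name.

```python
import re

# Matches e.g. "PortXmitData:....................14804777"
COUNTER_LINE = re.compile(r"^(\w+):\.+(\S+)$")


def parse_perfquery(output):
    """Parse perfquery text into a {counter_name: int} dictionary."""
    counters = {}
    for line in output.splitlines():
        m = COUNTER_LINE.match(line.strip())
        if m is None:
            continue  # header lines such as "# Port counters: ..." are skipped
        try:
            # Base 0 accepts both decimal values and hex values like 0x0000.
            counters[m.group(1)] = int(m.group(2), 0)
        except ValueError:
            continue  # ignore values that are not plain integers
    return counters
```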

Maybe I've missed something? If not, do you have a suggestion?

Must a NODE_NAME_MAP be specified?

While our site plans on using NODE_NAME_MAP, I noticed that the service doesn't work when one isn't supplied.

As a workaround, we do something like

/usr/bin/infiniband-exporter --node-name-map /dev/null

Can the code act similarly if NODE_NAME_MAP is not set?
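
A sketch of what that could look like, assuming NODE_NAME_MAP is an environment variable set for the service. The helper `node_name_map_args` is hypothetical; it only shows the idea of omitting the option when no map is configured.

```python
import os


def node_name_map_args(environ=None):
    """Return the --node-name-map arguments, or an empty list if unset."""
    environ = os.environ if environ is None else environ
    path = environ.get("NODE_NAME_MAP", "").strip()
    if not path:
        # No map configured: pass no option at all instead of /dev/null.
        return []
    return ["--node-name-map", path]
```

The returned list can be appended to the command line, so an unset or empty variable simply results in no extra arguments.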

Thanks
