gathering / gondul

Network management/monitoring system specialized for temporary events

Home Page: http://tech.gathering.org

License: GNU General Public License v2.0

Perl 19.50% Python 7.10% PHP 2.73% CSS 4.51% HTML 3.11% JavaScript 59.23% Shell 0.73% VCL 2.56% Dockerfile 0.53%

snmp monitoring templating dhcp pinger varnish

Gondul - The network monitoring/management system

This is the system used to monitor the network during The Gathering (a computer party with between 5000 and 10000 active clients - see http://gathering.org). It is now provided as a stand-alone application with the goal of being usable by any number of computer parties and events of a similar nature. The first non-TG user was Digitality X 2016 (http://digitalityx.no), which took place in June / July 2016.

Unlike other NMSes, Gondul is not designed to run perpetually but for a limited time, and it needs to be effective with minimal infrastructure in place because it is used during the initial installation of the network.

You should be able to install this on your own for other similar events of various scales. The system requirements are minimal, but here is some advice:

You can run it on a single VM or split it based on roles. Either works.
The database is used extensively, but careful attention has been paid to scaling it sensibly.
Do not (unless you like high CPU loads) ignore the caching layer (Varnish). We use it extensively and are able to invalidate the cache properly when needed, so it is not a hindrance.

Some facts from The Gathering 2016:

Non-profit.
5000+ participants, 400 volunteers/crew, plus numerous visitors.
Lasted 5 days during Easter 2016. Tech crew arrived on-site 5 days before.
Total of 10500+ unique network devices seen (unique MAC addresses).
Active network devices at 2016-03-22T12:00:00: 206
Active network devices at 2016-03-23T08:00:00: 346
Active network devices at 2016-03-23T20:00:00: 6467
180+ switches and routers. Pinged several times per second. Polled for SNMP every minute. Every reply (or lack thereof) is kept.
Collected roughly 300 million database rows, or 30GB of data in PostgreSQL.
Public NMS and API provided to all participants and the world at large.
The NMS saw between 200 and 500 requests per second during normal operation. Many were 304 "Not Modified".
99.99% cache hit rate (Varnish cache size: default 256MB).
~300 rows inserted per second. Most of these are COPY() of ping replies (thus performs well).
Biggest CPU hog was the SNMP polling, but not an issue.
Numerous features developed during the event with no database changes, mainly in the frontend, but also tweaking the API.
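
As a rough sanity check on the figures above, the sketch below relates the reported insert rate, row count and data volume. It only uses numbers quoted in this list; the variable names are made up for illustration, and the results are back-of-the-envelope estimates rather than measurements.

```python
# Figures quoted in the list above (approximate by nature).
rows_total = 300_000_000        # "roughly 300 million database rows"
bytes_total = 30 * 10**9        # "30GB of data" (assumed to mean decimal gigabytes)
rows_per_second = 300           # "~300 rows inserted per second"

seconds_per_day = 24 * 60 * 60

# Average size of one row as implied by the two totals.
bytes_per_row = bytes_total / rows_total

# How long a steady 300 rows/s would need to run to reach 300 million rows.
days_to_fill = rows_total / (rows_per_second * seconds_per_day)

print(f"{bytes_per_row:.0f} bytes per row")
print(f"{days_to_fill:.1f} days at {rows_per_second} rows/s")
```

This works out to about 100 bytes per row and a little under 12 days of steady inserts, which is in the same ballpark as the 5 days of setup plus 5 days of event mentioned above.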

Name

The name comes from the Norse Valkyrie Gondul, also known as the wand bearer.

Features

Some of Gondul's features are:

Collects SNMP and ping-data frequently.
Per-device configurable SNMP polling-interval
IPv4 and IPv6 support
Provides per-port statistics.
Client-counter (based on active DHCP leases)
Intelligent, easy-to-use and real-time device search based on name, description (e.g.: sysDescr, so also software versions/models), serial numbers, distribution switches, IP addresses, etc.
Low-effort operations log with optional device-association, using the same search pattern.
Intelligent health-map that will alert you of any error without overloading you with information.
- All "map handlers" evaluate a device and return a health score from 0 to 1000 to signify what their opinion of the device's health is. Whatever map handler provides the worst score will be shown.
- Map handlers are trivial to write and exist in pure javascript.
- Some map handlers include:
  - Latency (v4/v6)
  - Temperature (Cisco and Juniper)
  - SNMP sysname versus database sysname-mismatch
  - DHCP age (where a client subnet is set)
  - Lack of management info (e.g.: missing IPv6 management info)
  - Recent reboots
  - (more)
Replay capabilities: Easy to review the state of the network as it was at any given time Gondul was running. And "fast forward" to real time.
Modular JavaScript front-end that is reasonably easy to adapt
Templating (using jinja2 and all data available to Gondul, from management information to latency)
Graphing and dashboards through Graphite
Huge-ass README that is still not complete.
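
To illustrate the health-map idea from the list above, here is a minimal sketch of combining handler scores. Gondul's real map handlers are written in JavaScript; this Python version only mirrors the logic. The handler names, the device fields, the thresholds, and the assumption that a higher score means worse health are all hypothetical and not taken from Gondul's code.

```python
from typing import Callable, Dict, List, Tuple

Device = Dict[str, float]
# A handler looks at one device and returns (score, description).
Handler = Callable[[Device], Tuple[int, str]]


def latency_handler(device: Device) -> Tuple[int, str]:
    """Hypothetical handler: scale latency in ms to a 0-1000 score."""
    latency_ms = device.get("latency_ms", 0.0)
    score = min(1000, int(latency_ms * 10))  # 100 ms or more gives 1000
    return score, f"latency {latency_ms:.1f} ms"


def temperature_handler(device: Device) -> Tuple[int, str]:
    """Hypothetical handler: anything above 60 C is treated as critical."""
    temp_c = device.get("temp_c", 0.0)
    score = 1000 if temp_c > 60 else 0
    return score, f"temperature {temp_c:.0f} C"


def worst_health(device: Device, handlers: List[Handler]) -> Tuple[int, str]:
    """Ask every handler and keep the worst (assumed here: highest) score."""
    results = [handler(device) for handler in handlers]
    return max(results, key=lambda result: result[0])


switch = {"latency_ms": 12.5, "temp_c": 41.0}
score, reason = worst_health(switch, [latency_handler, temperature_handler])
print(score, reason)
```

Because every handler returns the same kind of value, adding a new check in this sketch only means writing one more small function and appending it to the list.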

Current state

Gondul is used at The Gathering and Digitality X among other places. It was spun off as a separate project from the big "Tech:Server misc tools" git repository in 2015. It was also used extensively at The Gathering 2017.

There is no "release" process for the time being since all development is directly linked to upcoming events and development continues throughout events.

The current state of deployment is that it is in the middle of a re-design. As such, the current documentation is slightly out-of-date.

Installation

See INSTALLING.rst.

Architecture

Gondul is split into multiple roles. At the very core is the database server (PostgreSQL).

The data is provided by three individual data collectors. They are found in collectors/. Two of these can run on any host with database access. The third, the dhcptailer, needs to run on your DHCP server, or on some server with access to the DHCP log. It is picky about log formatting (patches welcome).

All three of these collectors provide systemd service files, which should keep them running even if they fall over, which they might do if you fiddle with the database.

In addition to the collectors, there is the API. The API provides three different sets of endpoints. Two of these are considered moderately sensitive (e.g. they provide management information and port-specific statistics), while the third is considered public. The two private sets of endpoints are split into a read-only and a write-only namespace.
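
The split described above can be pictured as a small routing rule. The sketch below is illustrative only: the path prefixes `/api/public/`, `/api/read/` and `/api/write/`, the example endpoint names and the function names are assumptions made for the example, not a documented part of Gondul (see doc/API.rst for the actual endpoints).

```python
from enum import Enum


class Namespace(Enum):
    PUBLIC = "public"   # aggregated data, open to everyone
    READ = "read"       # private, read-only
    WRITE = "write"     # private, write-only


# Hypothetical URL prefixes; the real layout is described in doc/API.rst.
PREFIXES = {
    "/api/public/": Namespace.PUBLIC,
    "/api/read/": Namespace.READ,
    "/api/write/": Namespace.WRITE,
}


def classify(path: str) -> Namespace:
    """Map a request path to the API namespace it belongs to."""
    for prefix, namespace in PREFIXES.items():
        if path.startswith(prefix):
            return namespace
    raise ValueError(f"not an API path: {path}")


def needs_authentication(path: str) -> bool:
    """Only the public namespace is meant to be reachable without login."""
    return classify(path) is not Namespace.PUBLIC


print(needs_authentication("/api/public/ping"))          # hypothetical endpoint
print(needs_authentication("/api/write/switch-update"))  # hypothetical endpoint
```

Keeping the sensitivity decision tied to the path prefix is what makes it practical to leave authentication to the web server rather than to the application.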

Last is the frontend. This is written entirely in HTML and JavaScript and interacts with the API. It comes in two minimally different versions: one public and one "private". The only actual difference should be what they try to access.

The basic philosophy of Gondul is to have a generic and solid API, a database model that is somewhat agnostic to what we collect (so we can add more interesting SNMP communities on the fly) and a front end that does a lot of magic.

Recently, graphite/grafana was added, but as it failed to deliver during The Gathering 2017, the integration is being re-worked slightly. It is currently non-functional.

APIs

See doc/API.rst.

On the topic of the front-end....

The front end uses bootstrap and jquery, but not really all that extensively.

The basic idea is to push a ton of information to the front-end and exploit modern concepts such as "8MB of data is essentially nothing" and "your browser actually does client-side caching sensibly" and "it's easier to develop js than adapt a backend when the need arises". If you look in a developer console, you will see frequent requests, but if you look closer, they should almost all be client side cache hits. And those which aren't can either be 304 Not Modified's or server-side cache hits. Caching is absolutely crucial to the entire process.
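
The 304 Not Modified responses mentioned above come from conditional requests: the client sends back the validator it received earlier, and the server skips the body when nothing has changed. The sketch below shows that decision using an ETag derived from the response body. It illustrates the general HTTP mechanism in simplified form (a single exact-match validator); the function names are hypothetical and this is not how Gondul, Varnish or the web server implement it.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive an ETag from the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET request.

    If the client's If-None-Match value equals the current ETag, the
    payload is omitted and the status is 304 Not Modified.
    """
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match == etag:
        return 304, headers, b""
    return 200, headers, body


data = b'{"switches": 180}'
# First request: the client has nothing cached yet.
status, headers, payload = respond(data, None)
# Later request: the client revalidates with the ETag it was given.
status_again, _, payload_again = respond(data, headers["ETag"])
print(status, len(payload))
print(status_again, len(payload_again))
```

The second response carries no body at all, which is why a large number of frequent requests can still be cheap for both the server and the network.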

We need more user-documentation though.

Also, the front-end can be somewhat bandwidth intensive. Use gzip. Patches for variable polling frequency on mobile devices are welcome.

Security

Security is ensured in multiple ways. First of all, database passwords should obviously be kept secret. They are never visible in the frontend.

Secondly, APIs are clearly separated. Some data is actually duplicated because it has to be available both in a public API in an aggregated form, and in detailed form in the private API.

Gondul itself does not implement any actual authentication mechanisms for the API. That is left up to the web server. An example Apache configuration file is provided and the default Ansible recipes use it.

gondul's Issues

NMS: Linting of config and switches

From @KristianLyng on March 27, 2016 8:10

We have a picture in the database of how things are supposed to be. We should have a tool that can verify (preferably via SNMP, alternatively SSH) that this actually matches reality.

Copied from original issue: gathering/tgmanage#69

NMS: Move "now" out of the canvas and make it meaningful

From @KristianLyng on March 27, 2016 8:20

It jumps back and forth a lot. See also Nicco's nice "tv-mode" view.

Copied from original issue: gathering/tgmanage#75

NMS: Integration with Ansible

From @KristianLyng on March 27, 2016 8:42

The only problem is that this is Python.

That should not be a big problem, though.

Copied from original issue: gathering/tgmanage#86

NMS: Better storage of MAC addresses

From @KristianLyng on March 27, 2016 8:52

Had to special-case ifPhyAddr (?) this year because it turned into binary garble that JSON::XS did not like.

We need to find a better (and preferably general) solution for this, especially since we want more data fields containing MAC addresses.

Copied from original issue: gathering/tgmanage#92
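
One common way to avoid the binary problem described in this issue is to turn the raw six-byte value into text before it is serialized. The sketch below does that in Python purely for illustration; the function name and the sample address are made up, and this is not necessarily the fix that was applied in Gondul.

```python
import json


def format_mac(raw: bytes) -> str:
    """Render a raw 6-byte hardware address as lowercase colon-separated hex."""
    if len(raw) != 6:
        raise ValueError(f"expected 6 bytes, got {len(raw)}")
    return ":".join(f"{octet:02x}" for octet in raw)


# A raw value as an SNMP library might hand it over (made-up address).
raw_value = bytes([0x00, 0x1B, 0x21, 0xAB, 0xCD, 0xEF])

mac = format_mac(raw_value)
print(json.dumps({"mac": mac}))  # plain ASCII text, unproblematic for JSON
```

Since the function works on any six-byte value, the same conversion could be applied to every field known to hold a MAC address, which is the "general" part the issue asks for.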

NMS: API: Better error handling and feedback.

From @KristianLyng on March 27, 2016 8:36

There is practically no error handling now. There should at least be some.

Copied from original issue: gathering/tgmanage#81

NMS: Support for "tabs" for alternative views

From @KristianLyng on March 27, 2016 8:10

Instead of the map, we could have a tab for inventory, for example.

This would also make a possible address plan better.

Copied from original issue: gathering/tgmanage#68

NMS: Documentation of search filters

From @KristianLyng on March 27, 2016 8:45

There are many cool search filters. Right now they are only discoverable by me explaining them, even though most are fairly "obvious".

There should be some form of in-line documentation.

Example of an "invisible" thing: "active>4".

Copied from original issue: gathering/tgmanage#88

NMS: Error reports from participants

From @KristianLyng on March 27, 2016 8:38

There is no reason to go to info:desk to report small things. It would be nice to have a way for participants to easily report minor problems ("youtube is slow..."). Reports should be associated with a switch where possible so that we can see trends.

Copied from original issue: gathering/tgmanage#83

NMS: Wannabe integration?

From @KristianLyng on March 27, 2016 8:36

This is a bit scary since it ties the NMS even more closely to TG. It should at least be a module that is easy to opt out of.

Copied from original issue: gathering/tgmanage#82

Separate sensitive and non-sensitive config

From @KristianLyng on March 27, 2016 8:43

I have a long list of OIDs that is currently stored on obi-wan, but not in git, because it is configuration. Most of the config is not sensitive, while things like database passwords are.

We should split out the sensitive parts so that we can have revision control on the rest during the event.

Copied from original issue: gathering/tgmanage#87

NMS: Real-time search

From @KristianLyng on April 11, 2016 18:21

Update search matching as time ticks (especially useful during replay)

Copied from original issue: gathering/tgmanage#100

NMS: Split the API itself out to its own domain.

From @KristianLyng on March 27, 2016 8:46

It would make authentication and such easier. It also generally makes sense as a way to reduce copy/pasta.

Copied from original issue: gathering/tgmanage#89

NMS: Filtering of objects in the map

From @KristianLyng on March 27, 2016 8:17

It should be possible to show only edge switches, only APs, only servers, or a combination.

Copied from original issue: gathering/tgmanage#74

NMS: "Which row am I on"

From @KristianLyng on March 17, 2016 21:39

Both IPv4 and IPv6 support.

We must fetch the ARP table, ND table and MAC address table from all routers/switches, so that switch + switch port can be determined from the client's IP address.

The ARP/ND/MAC tables also have the benefit of letting us produce statistics on "number of IPv4 vs. IPv6 clients", as well as e.g. "number of unique clients" (based on MAC address). The latter can perhaps partly be done with the DHCP tail thing, but still.

Copied from original issue: gathering/tgmanage#51
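
The lookup described in this issue is essentially a join of two tables: ARP/ND maps an IP address to a MAC address, and the MAC address table maps a MAC address to a switch and port. A minimal sketch with made-up data and hypothetical names (the switch and port labels are invented for the example):

```python
from typing import Dict, Optional, Tuple

# IP address -> MAC address, as collected from ARP (IPv4) and ND (IPv6) tables.
neighbors: Dict[str, str] = {
    "10.0.1.23": "00:1b:21:ab:cd:ef",
    "2001:db8::23": "00:1b:21:ab:cd:ef",
}

# MAC address -> (switch, port), as collected from the MAC address tables.
mac_table: Dict[str, Tuple[str, str]] = {
    "00:1b:21:ab:cd:ef": ("e13-2", "ge-0/0/7"),
}


def locate(ip: str) -> Optional[Tuple[str, str]]:
    """Return (switch, port) for a client IP, or None if it is unknown."""
    mac = neighbors.get(ip)
    if mac is None:
        return None
    return mac_table.get(mac)


def unique_clients() -> int:
    """Count distinct clients by MAC address rather than by IP address."""
    return len(set(neighbors.values()))


print(locate("10.0.1.23"))
print(locate("2001:db8::23"))
print(unique_clients())
```

A real implementation would also have to deal with the same MAC address showing up on uplink ports of several switches; the sketch assumes the table already holds only the edge port.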

NMS: Linkable times/now=

From @KristianLyng on March 31, 2016 19:38

I want to be able to post a link that'll take people to a certain map at a certain time.

The map bit is OK, now we need the now.

Copied from original issue: gathering/tgmanage#98

NMS: Click somewhere outside the container to return to the "root" view

From @skandix on March 22, 2016 18:5

When you open one of the items in the menu list, e.g. "Keyboard Shortcuts", and then want to get out again, you should be able to just click somewhere outside the "container/box" to get back to the "root" view.

Copied from original issue: gathering/tgmanage#56

NMS: Store neighbor tables, client MACs and the like too!

From @KristianLyng on March 27, 2016 8:53

It could be very useful to know which port a MAC is attached to.

Copied from original issue: gathering/tgmanage#93

NMS: Introduce servers and "other"

From @KristianLyng on March 27, 2016 8:16

Today "everything" is a switch. This is a bit unnecessary. We also want to monitor servers, wireless controllers, "random" internet nodes, and more.

Copied from original issue: gathering/tgmanage#72

NMS: Speedometer!

From @KristianLyng on March 17, 2016 7:13

This was a big request last year. The JS is in old/, but must be updated to the modern API. Preferably it should also be updated to use the regular nms-js.

Copied from original issue: gathering/tgmanage#34

NMS: Alarms/reporting

From @jallakim on March 27, 2016 2:10

Put a system in place for alarms and notifications (primarily via push and/or SMS).

Which system should be used? Icinga2?

Copied from original issue: gathering/tgmanage#64

NMS: Documentation

From @jallakim on March 27, 2016 3:14

Document the NMS.

Put bluntly: you should be able to give the repo URL to a "random" dude, and that person should manage to get it running without having to reverse-engineer half the code to understand half of it.

Copied from original issue: gathering/tgmanage#67

NMS: Better/central handling of fields in switches

From @KristianLyng on March 27, 2016 8:42

Lots of overlap. It would be nice to have one central place that defines which fields exist, which can be read, which are "secret" and which can be changed via the API. This is currently spread across various API endpoints.

Copied from original issue: gathering/tgmanage#85

NMS: Address overview

From @KristianLyng on March 27, 2016 8:13

A long-term goal should be to replace most of the functionality we have put in the Confluence table.

That requires a lot of work, though, and should be split over several events so we can see what works. It must be very flexible.

Copied from original issue: gathering/tgmanage#71

NMS: Make a PoC dashboard

From @KristianLyng on March 17, 2016 9:53

We need a dashboard that shows anything at all, based on the same JS as in js/nms*js.

Copied from original issue: gathering/tgmanage#40

NMS: Operations log/backlog

From @jallakim on March 27, 2016 2:58

Put a very simple operations log/backlog in place that can be displayed in the NMS. It is used to log events that are relevant for the following shifts (e.g. "distro5 went limp, reduced to 2x in VC + 2x uplink towards edge").

Nice-to-have: multi-user support (via BasicAuth against Wannabe, or other things) so that messages can be marked as read or unread (so that messages remain unless you actively mark them as read).

Copied from original issue: gathering/tgmanage#66

NMS: Find out why 2-3 pings are dropped every time.

From @KristianLyng on March 17, 2016 7:16

Hopefully it is because of the test environment.

Copied from original issue: gathering/tgmanage#38

NMS: Combined health map

From @KristianLyng on March 19, 2016 21:37

A map showing a combined intelligent health.

Each map module would expose a function to determine its perspective of the health of a switch on a scale of 0 to 1000.

The combined map would poll each handler and display a "worst case" color.

Each handler would also expose a textual representation of the state, to be displayed in an info-box on clicking.

Copied from original issue: gathering/tgmanage#52

NMS: AP map

From @KristianLyng on March 27, 2016 8:16

Introduce APs in the map. Unproblematic in itself.

Copied from original issue: gathering/tgmanage#73

NMS: Set up the LLDP bits and bobs

From @KristianLyng on March 17, 2016 7:12

Copied from original issue: gathering/tgmanage#33

NMS: Clean up include/nms/snmp.pm

From @KristianLyng on March 27, 2016 8:33

Not used much anymore. It should either be reintroduced or deleted.

Copied from original issue: gathering/tgmanage#80

NMS: Test a fair bit in iceweasel/firefox

From @KristianLyng on March 17, 2016 7:14

Copied from original issue: gathering/tgmanage#35

NMS: Switch from apache to nginx or something?

From @KristianLyng on March 31, 2016 17:34

TG16 saw: nginx->varnish->apache

Should reduce that somewhat if possible.

Copied from original issue: gathering/tgmanage#95

NMS: Do not store empty SNMP data

From @KristianLyng on March 27, 2016 8:51

Unnecessary to insert '{}' when things have timed out.

Copied from original issue: gathering/tgmanage#91

NMS: DHCP map with custom colors (for the DHCP race)

From @KristianLyng on March 17, 2016 18:5

Jocke gets pink.

Copied from original issue: gathering/tgmanage#47

NMS: Scrub time travel (again)

From @KristianLyng on March 17, 2016 19:4

The core functionality has been changed/moved a bit, so this must be updated. It may be a bit more challenging because we have to explicitly request two data points. The new model is better, but it does require a bit more logic for time travel.

Setting this to Monday, so we can replay the opening.

Copied from original issue: gathering/tgmanage#49

NMS: Test cases

From @KristianLyng on March 27, 2016 8:40

There is a lot that should be fairly easy to test. ("Add a switch, check that it is there". "Modify a switch, check that it works", etc.)

Copied from original issue: gathering/tgmanage#84

Update the MIBs logic

From @KristianLyng on March 27, 2016 9:24

Fetch from Juniper instead of Cisco (they also seem "cleaner").

Must double-check that it is enough.

Copied from original issue: gathering/tgmanage#94

NMS: Stop playback if we jump in time

From @KristianLyng on March 31, 2016 19:37

Copied from original issue: gathering/tgmanage#97

NMS: Better SNMP browser

From @KristianLyng on March 17, 2016 7:15

Its own box
Not a huge dump in plain text
Live updating? Sparklines?

Copied from original issue: gathering/tgmanage#37

NMS: Test on mobile/tablet.

From @KristianLyng on March 17, 2016 7:14

Copied from original issue: gathering/tgmanage#36

NMS: Linknets that are not point to point

From @KristianLyng on April 11, 2016 18:15

This means adding more logic in the backend and drawing... But will look way better.

Copied from original issue: gathering/tgmanage#99

NMS: Pushing config to switches/routers

From @jallakim on March 27, 2016 2:23

Put a system in place for pushing config to switches/routers.

Ansible is probably the route that has been discussed, but look at how this should be integrated with the NMS.

Copied from original issue: gathering/tgmanage#65

NMS: Use UTC and epoch all over

From @KristianLyng on March 31, 2016 19:37

Seriously, people can just convert in their head, because I have no idea what we're supposed to do when someone asks for 2016-03-27T02:30:00.

We could possibly have a front-end thing for converting to/from UTC. But code-wise we should be UTC all over.

Copied from original issue: gathering/tgmanage#96
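
The timestamp in this issue is a good example of why local time is awkward: assuming the event ran on Norwegian local time, clocks jumped from 02:00 to 03:00 on 2016-03-27, so 02:30 never occurred locally. A minimal sketch of keeping everything in UTC and epoch seconds, with hypothetical helper names:

```python
from datetime import datetime, timezone


def to_epoch(iso_utc: str) -> int:
    """Parse an ISO 8601 timestamp without offset, interpreting it as UTC."""
    parsed = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def from_epoch(epoch: int) -> str:
    """Format epoch seconds as an ISO 8601 timestamp in UTC."""
    moment = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S")


epoch = to_epoch("2016-03-27T02:30:00")
print(epoch)
print(from_epoch(epoch))
```

Interpreted as UTC, the timestamp is unambiguous and round-trips cleanly; any conversion to local time can then live in the front-end, as the issue suggests.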

NMS: Tool for checking SNMP OID things.

From @KristianLyng on March 27, 2016 8:33

There was a fair amount of trial and error when I added new OIDs. We should have a simple tool that can check names and such.... Almost as simple as snmpwalk, but using our own setup.

Copied from original issue: gathering/tgmanage#79

NMS: Show DHCP statistics

From @KristianLyng on March 27, 2016 8:30

Such as: number of "active clients" and the like.

Copied from original issue: gathering/tgmanage#76

PXE IPv6 support

From @norrs on April 13, 2014 13:49

https://wiki.kubuntu.org/UEFI/SecureBoot-PXE-IPv6

Copied from original issue: gathering/tgmanage#7

NMS: Clean up the read APIs.

From @KristianLyng on March 27, 2016 8:30

They have gotten a bit murky. We should evaluate how things should be fetched, and what.

Copied from original issue: gathering/tgmanage#77

NMS: API support for fetching data for just one switch

From @KristianLyng on March 27, 2016 8:32

Applies to all APIs that are switch-based (switch-state, snmp, ping...).

It should be possible to fetch data for only telegw, for example.

We need to think a bit about how this should be done if one only wants infrastructure, for example.

Copied from original issue: gathering/tgmanage#78

NMS: Drag-and-drop style linknet definition

From @KristianLyng on March 27, 2016 8:12

Needs both frontend and backend.

Copied from original issue: gathering/tgmanage#70