gathering / gondul

Network management/monitoring system specialized for temporary events

Home Page: http://tech.gathering.org

License: GNU General Public License v2.0

Perl 19.50% Python 7.10% PHP 2.73% CSS 4.51% HTML 3.11% JavaScript 59.23% Shell 0.73% VCL 2.56% Dockerfile 0.53%
snmp monitoring templating dhcp pinger varnish

gondul's Introduction

Gondul - The network monitoring/management system

This is the system used to monitor the network during The Gathering (a computer party with between 5000 and 10000 active clients - see http://gathering.org). It is now provided as a stand-alone application, with the goal of being usable by any number of computer parties and similar events. The first non-TG user was Digitality X 2016 (http://digitalityx.no), which took place in June / July 2016.

Unlike other NMSs, Gondul is not designed to run perpetually but for a limited time. It needs to be effective with minimal infrastructure in place, as it is used during the initial installation of the network.

You should be able to install this on your own for similar events of various scales. The system requirements are minimal, but some advice:

  • You can run it on a single VM or split it based on roles. Either works.
  • The database is used extensively, but careful attention has been paid to scaling it sensibly.
  • Do not (unless you like high CPU loads) ignore the caching layer (Varnish). We use it extensively and are able to invalidate the cache properly when needed, so it is not a hindrance.

Some facts from The Gathering 2016:

  • Non-profit.
  • 5000+ participants, 400 volunteers/crew, plus numerous visitors.
  • Lasted 5 days during Easter 2016. Tech crew arrived on-site 5 days before.
  • Total of 10500+ unique network devices seen (unique MAC addresses).
  • Active network devices at 2016-03-22T12:00:00: 206
  • Active network devices at 2016-03-23T08:00:00: 346
  • Active network devices at 2016-03-23T20:00:00: 6467
  • 180+ switches and routers. Pinged several times per second. Polled for SNMP every minute. Every reply (or lack thereof) is kept.
  • Collected roughly 300 million database rows, or 30GB of data in PostgreSQL.
  • Public NMS and API provided to all participants and the world at large.
  • The NMS saw between 200 and 500 requests per second during normal operation. Many were 304 "Not Modified".
  • 99.99% cache hit rate (Varnish cache size: default 256MB).
  • ~300 rows inserted per second. Most of these are ping replies inserted with COPY, which is why this performs well.
  • Biggest CPU hog was the SNMP polling, but not an issue.
  • Numerous features developed during the event with no database changes, mainly in the frontend, but also tweaking the API.
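
The point above about ping replies being inserted with COPY can be illustrated with a minimal sketch. It only builds the tab-separated payload that PostgreSQL's COPY text format expects. The column layout (switch id, epoch timestamp, latency) and the function name are assumptions made for illustration, not Gondul's actual schema or code.

```python
import io


def build_copy_payload(replies):
    """Render (switch_id, epoch_seconds, latency_ms) tuples as COPY text rows.

    The three-column layout is hypothetical. A missing reply is kept as a
    row with a NULL latency, so the lack of a reply is recorded too.
    """
    buf = io.StringIO()
    for switch_id, epoch, latency_ms in replies:
        # The COPY text format uses tabs between columns and \N for NULL.
        latency = r"\N" if latency_ms is None else f"{latency_ms:.3f}"
        buf.write(f"{switch_id}\t{epoch}\t{latency}\n")
    return buf.getvalue()


payload = build_copy_payload([(1, 1458648000, 0.42), (2, 1458648000, None)])
```

A database driver would then stream such a payload in a single COPY statement (psycopg2, for instance, offers `cursor.copy_from()`), which avoids the per-row overhead of individual INSERT statements.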

Name

The name comes from the Norse Valkyrie Gondul, also known as the wand bearer.

Features

Some of Gondul's features are:

  • Collects SNMP and ping-data frequently.
  • Per-device configurable SNMP polling-interval
  • IPv4 and IPv6 support
  • Provides per-port statistics.
  • Client-counter (based on active DHCP leases)
  • Intelligent, easy-to-use and real-time device search based on name, description (e.g.: sysDescr, so also software versions/models), serial numbers, distribution switches, IP addresses, etc.
  • Low-effort operations log with optional device-association, using the same search pattern.
  • Intelligent health-map that will alert you of any error without overloading you with information.
    • All "map handlers" evaluate a device and return a health score from 0 to 1000 to signify what their opinion of the device's health is. Whatever map handler provides the worst score will be shown.
    • Map handlers are trivial to write and are written in pure JavaScript.
    • Some map handlers include:
      • Latency (v4/v6)
      • Temperature (Cisco and Juniper)
      • SNMP sysname versus database sysname-mismatch
      • DHCP age (where a client subnet is set)
      • Lack of management info (e.g.: missing IPv6 management info)
      • Recent reboots
      • (more)
  • Replay capabilities: easily review the state of the network as it was at any given time Gondul was running, then "fast forward" to real time.
  • Modular JavaScript front-end that is reasonably easy to adapt
  • Templating (using jinja2 and all data available to Gondul, from management information to latency)
  • Graphing and dashboards through Graphite
  • Huge-ass README that is still not complete.
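
As a rough illustration of the health-map scoring described in the feature list above, here is a minimal sketch of the "worst score wins" idea. The real map handlers are JavaScript; this Python version, the handler names, the thresholds and the device fields are all hypothetical. It also assumes that a higher score means worse health, which the text above does not state.

```python
def latency_handler(device):
    """Hypothetical handler: a missing ping reply is critical, a slow one a warning."""
    latency = device.get("latency_ms")
    if latency is None:
        return 1000, "no ping reply"
    if latency > 100:  # threshold chosen for illustration only
        return 400, f"high latency ({latency} ms)"
    return 0, "latency ok"


def temperature_handler(device):
    """Hypothetical handler: flag devices that report a high temperature."""
    temp = device.get("temp_c")
    if temp is not None and temp > 60:  # threshold chosen for illustration only
        return 800, f"temperature {temp} C"
    return 0, "temperature ok"


def combined_health(device, handlers):
    """Return the (score, reason) pair of whichever handler reports the worst score."""
    return max((handler(device) for handler in handlers), key=lambda r: r[0])


handlers = [latency_handler, temperature_handler]
result = combined_health({"latency_ms": 12, "temp_c": 71}, handlers)
```

Because every handler returns the same kind of value, adding a new check is a matter of writing one more function and appending it to the list.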

Current state

Gondul is used at The Gathering and Digitality X among other places. It was spun off as a separate project from the big "Tech:Server misc tools" git repository in 2015. It was also used extensively at The Gathering 2017.

There is no "release" process for the time being since all development is directly linked to upcoming events and development continues throughout events.

The current state of deployment is that it is in the middle of a re-design. As such, the current documentation is slightly out-of-date.

Installation

See INSTALLING.rst.

Architecture

Gondul is split into multiple roles. At the very core is the database server (PostgreSQL).

The data is provided by three individual data collectors, found in collectors/. Two of these can run on any host with database access. The third, the dhcptailer, needs to run on your DHCP server, or on some server with access to the DHCP log. It is picky about log formatting (patches welcome).

All three of these collectors provide systemd service files, which should keep them running even if they fall over, which they might do if you fiddle with the database.

In addition to the collectors, there is the API. The API provides three different sets of endpoints. Two of these are considered moderately sensitive (e.g.: they provide management information and port-specific statistics), while the third is considered public. The two private sets of endpoints are split into a read-only and a write-only namespace.
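
The split into one public and two private namespaces can be sketched as a simple access rule. This is only an illustration: the `/api/public/`, `/api/read/` and `/api/write/` prefixes, the allowed HTTP methods and the function names are assumptions, and in a real deployment such a rule would normally live in the web server configuration rather than in application code.

```python
# Hypothetical path prefixes mapped to (namespace, needs_auth, allowed methods).
NAMESPACES = {
    "/api/public/": ("public", False, {"GET"}),
    "/api/read/": ("read", True, {"GET"}),
    "/api/write/": ("write", True, {"POST"}),
}


def classify_endpoint(path):
    """Return the namespace entry for a request path, or None if it matches nothing."""
    for prefix, info in NAMESPACES.items():
        if path.startswith(prefix):
            return info
    return None


def is_allowed(path, method, authenticated):
    """Decide whether a request may proceed under the assumed namespace rules."""
    info = classify_endpoint(path)
    if info is None:
        return False
    _, needs_auth, methods = info
    return method in methods and (authenticated or not needs_auth)
```

Keeping the read-only and write-only namespaces apart makes it possible to grant the two kinds of access to different groups of users.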

Last is the frontend. This is written entirely in HTML and JavaScript and interacts with the API. It comes in two minimally different versions: one public and one "private". The only actual difference should be what they try to access.

The basic philosophy of Gondul is to have a generic and solid API, a database model that is somewhat agnostic to what we collect (so we can add more interesting SNMP communities on the fly), and a front end that does a lot of magic.

Recently, graphite/grafana was added, but as it failed to deliver during The Gathering 2017, the integration is being re-worked slightly. It is currently non-functional.

APIs

See doc/API.rst.

On the topic of the front-end....

The front end uses Bootstrap and jQuery, but not really all that extensively.

The basic idea is to push a ton of information to the front-end and exploit modern concepts such as "8MB of data is essentially nothing", "your browser actually does client-side caching sensibly" and "it's easier to develop JavaScript than to adapt a backend when the need arises". If you look in a developer console, you will see frequent requests, but if you look closer, almost all of them should be client-side cache hits. Those which aren't should be either 304 Not Modified responses or server-side cache hits. Caching is absolutely crucial to the entire process.
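
The 304 Not Modified behaviour mentioned above can be sketched as a conditional GET. This is a generic illustration of HTTP revalidation, not Gondul's code: the use of an ETag (rather than Last-Modified), the hash-based tag and the `max-age=5` lifetime are assumptions.

```python
import hashlib


def etag_for(body: bytes) -> str:
    """Derive a validator from the response body (one possible ETag scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match=None):
    """Return (status, headers, payload) for a GET request.

    If the client already holds the current version, answer 304 with an
    empty payload instead of sending the body again.
    """
    tag = etag_for(body)
    headers = {"ETag": tag, "Cache-Control": "max-age=5"}  # lifetime is illustrative
    if if_none_match == tag:
        return 304, headers, b""
    return 200, headers, body


data = b'{"switches": {}}'
first = respond(data)
second = respond(data, if_none_match=first[1]["ETag"])
```

Here `first` carries the full body, while `second` is a bodiless 304, which is why frequent polling stays cheap as long as the data has not changed.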

We need more user-documentation though.

Also, the front-end can be somewhat bandwidth intensive. Use gzip. Patches for variable polling frequency on mobile devices are welcome.

Security

Security is ensured in multiple ways. First of all, database passwords should obviously be kept secret. They are never visible in the frontend.

Secondly, APIs are clearly separated. Some data is actually duplicated because it has to be available both in a public API in an aggregated form, and in detailed form in the private API.

Gondul itself does not implement any actual authentication mechanisms for the API. That is left up to the web server. An example Apache configuration file is provided, and the default Ansible recipes use it.

gondul's People

Contributors

eriktm, foxboron, joachimtingvold, kristianlyng, lasseh, ldev, magnuskiro, niccofyren, olemathias, sjurtf, sklirg, slinderud, torstehu

gondul's Issues

NMS: Error reports from participants

From @KristianLyng on March 27, 2016 8:38

There is no reason to go to the info:desk to report small things. It would be good to have a way for participants to easily report minor problems ("YouTube is slow..."). Reports should be associated with a switch where possible, so that we can see trends.

Copied from original issue: gathering/tgmanage#83

Separate sensitive and non-sensitive configuration

From @KristianLyng on March 27, 2016 8:43

I have a long list of OIDs that is currently stored on obi-wan, but not in git, because it is configuration. Most of the configuration is not sensitive, while things like database passwords are.

We should split out the sensitive parts so that we can keep the rest under revision control during the event.

Copied from original issue: gathering/tgmanage#87

NMS: "Which row am I on"

From @KristianLyng on March 17, 2016 21:39

Both IPv4 and IPv6 support.

We must fetch the ARP table, the ND table and the MAC address table from all routers/switches, so that the switch and switch port can be determined from the client's IP address.

The ARP/ND/MAC tables also have the advantage that we can produce statistics on "number of IPv4 vs. IPv6 clients", as well as, for example, "number of unique clients" (based on MAC address). The latter can perhaps partly be done using the DHCP tail thing, though.
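
The lookup proposed in this issue can be sketched as two table lookups. Everything here is hypothetical: the function names, the table shapes, and the sample addresses and switch/port names (which use documentation address ranges). Collecting the tables from the devices is not shown.

```python
import ipaddress
from collections import Counter


def locate_client(ip, neighbours, mac_table):
    """Resolve a client IP to (switch, port), or None if it is unknown.

    neighbours: IP address -> MAC address (merged ARP and ND tables)
    mac_table:  MAC address -> (switch name, port name)
    """
    mac = neighbours.get(ip)
    if mac is None:
        return None
    return mac_table.get(mac)


def client_stats(neighbours):
    """Count IPv4 and IPv6 entries and unique clients (by MAC address)."""
    versions = Counter(ipaddress.ip_address(ip).version for ip in neighbours)
    return {
        "ipv4": versions[4],
        "ipv6": versions[6],
        "unique_macs": len(set(neighbours.values())),
    }


# Made-up sample data: one dual-stacked client behind one switch port.
neighbours = {
    "192.0.2.10": "02:00:00:00:00:01",
    "2001:db8::10": "02:00:00:00:00:01",
}
mac_table = {"02:00:00:00:00:01": ("e1-1", "ge-0/0/4")}
```

With this sample data, both the IPv4 and the IPv6 address of the client resolve to the same switch port, and the statistics count it as a single unique client.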

Copied from original issue: gathering/tgmanage#51

NMS: Alarms/reporting

From @jallakim on March 27, 2016 2:10

Put a system in place for alarms, as well as notifications (primarily via push and/or SMS).

Which system should be used? Icinga2?

Copied from original issue: gathering/tgmanage#64

NMS: Documentation

From @jallakim on March 27, 2016 3:14

Document the NMS.

To put it bluntly: we should be able to give the repo URL to a "random" dude, and that person should manage to get it running without having to reverse-engineer half the code to understand half of it.

Copied from original issue: gathering/tgmanage#67

NMS: Address overview

From @KristianLyng on March 27, 2016 8:13

The long-term goal should be to replace most of the functionality we have put in the Confluence table.

That requires a lot of work, however, and should be split across several events so that we can see what works. It must be very flexible.

Copied from original issue: gathering/tgmanage#71

NMS: Operations log/backlog

From @jallakim on March 27, 2016 2:58

Put a very simple operations log/backlog in place that can be shown in the NMS. It is used to log events that are relevant for the following shift (e.g. "distro5 went slack, reduced to 2x in VC + 2x uplink towards edge").

Nice-to-have: multi-user support (via BasicAuth against Wannabe, or something else) so that messages can be marked as read or unread (so that messages remain unless you actively mark them as read).

Copied from original issue: gathering/tgmanage#66

NMS: Combined health map

From @KristianLyng on March 19, 2016 21:37

A map showing a combined intelligent health.

Each map module would expose a function to determine its perspective of the health of a switch on a scale of 0 to 1000.

The combined map would poll each handler and display a "worst case" color.

Each handler would also expose a textual representation of the state, to be displayed in an info-box on clicking.

Copied from original issue: gathering/tgmanage#52

NMS: Scrub time travel (again)

From @KristianLyng on March 17, 2016 19:4

The core functionality has been changed/moved a bit, so this must be updated. It may be a bit more challenging because we have to explicitly request two data points. The new model is better, but it requires a bit more logic for time travel.

Scheduling this for Monday, so we can replay the opening.

Copied from original issue: gathering/tgmanage#49

NMS: Use UTC and epoch all over

From @KristianLyng on March 31, 2016 19:37

Seriously, people can just convert in their head, because I have no idea what we're supposed to do when someone asks for 2016-03-27T02:30:00.

We could possibly have a front-end thing for converting to/from UTC. But code-wise we should be UTC all over.
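
The timestamp mentioned in this issue falls in the hour that was skipped when Central European Summer Time began on that date, so it has no meaning as Norwegian local time, while it is unambiguous in UTC. A minimal sketch of the "UTC and epoch everywhere" idea follows; the function names are hypothetical and input strings are simply assumed to be UTC.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def to_epoch(iso_utc: str) -> int:
    """Parse a timestamp that is taken to be UTC and return epoch seconds."""
    dt = datetime.strptime(iso_utc, ISO_FORMAT).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def from_epoch(epoch: int) -> str:
    """Format epoch seconds as a UTC timestamp string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime(ISO_FORMAT)


epoch = to_epoch("2016-03-27T02:30:00")
round_trip = from_epoch(epoch)
```

Storing and comparing only epoch values keeps the backend free of time-zone logic; any conversion to local time can then be left to the front-end, as the issue suggests.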

Copied from original issue: gathering/tgmanage#96
