flapjack / flapjack

Monitoring notification routing + event processing system. For issues with the Flapjack packages, please see https://github.com/flapjack/omnibus-flapjack/

Home Page: http://flapjack.io

License: MIT License

Ruby 57.94% Gherkin 4.53% HTML 2.92% CSS 0.85% JavaScript 32.66% Go 1.04% Shell 0.07%

flapjack's Introduction

Flapjack

Flapjack is a flexible monitoring notification routing system that handles:

  • Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc.)
  • Alert summarisation (with per-user, per-media summary thresholds)
  • Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc.)

Flapjack will be immediately useful to you if:

  • You want to identify failures faster by rolling up your alerts across multiple monitoring systems.
  • You monitor infrastructures that have multiple teams responsible for keeping them up.
  • Your monitoring infrastructure is multitenant, and each customer has a bespoke alerting strategy.
  • You want to dip your toe in the water and try alternative check execution engines like Sensu, Icinga, or cron in parallel to Nagios.

Try it out with the Quickstart Guide

The Quickstart guide will help you get Flapjack up and running in a VM locally using Vagrant and VirtualBox.

The technical low-down

Flapjack provides a scalable method for dealing with events representing changes in system state (OK -> WARNING -> CRITICAL transitions) and alerting appropriate people as necessary.

At its core, Flapjack processes events received from external check execution engines, such as Nagios. Nagios provides a 'perfdata' event output channel, which writes to a named pipe. flapjack-nagios-receiver then reads from this named pipe, converts each line to JSON and adds it to the events queue.
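The conversion step might look roughly like the following. This is only a sketch: the tab-separated field order (time, entity, check, state, summary) and the method name line_to_event_json are assumptions made for illustration. The real layout depends on how Nagios is configured to write perfdata and on what flapjack-nagios-receiver expects.

```ruby
require 'json'

# Hypothetical field order for one tab-separated line read from the pipe.
FIELDS = %w[time entity check state summary].freeze

# Convert one line into a JSON event string, or nil if the line is malformed.
def line_to_event_json(line)
  values = line.chomp.split("\t", FIELDS.size)
  return nil unless values.size == FIELDS.size

  event = FIELDS.zip(values).to_h
  event['time'] = Integer(event['time'], 10) # raises ArgumentError on junk
  event['state'] = event['state'].downcase
  JSON.generate(event)
rescue ArgumentError
  nil
end
```

Returning nil for malformed lines lets the caller skip them rather than push half-parsed events onto the queue.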

Flapjack sits downstream of check execution engines (like Nagios, Sensu, Icinga, or cron), processing events to determine:

  • if a problem has been detected
  • who should know about the problem
  • how they should be told

Additional check engines can be supported by adding additional receiver processes similar to the nagios receiver.

Installing

NB: v2 packages will be ready soon -- for the moment these instructions will not work

Ubuntu Precise 64 (12.04):

Tell apt to trust the Flapjack package signing key:

gpg --keyserver keys.gnupg.net --recv-keys 803709B6
gpg -a --export 803709B6 | sudo apt-key add -

Add the Flapjack Debian repository to your Apt sources:

echo "deb http://packages.flapjack.io/deb/v2 precise main" | sudo tee /etc/apt/sources.list.d/flapjack.list

Install the latest Flapjack package:

sudo apt-get update
sudo apt-get install flapjack

Alternatively, download the deb and install using sudo dpkg -i <filename>

The Flapjack package is an Omnibus package and as such contains most dependencies under /opt/flapjack, including Redis.

Installing the package will start Redis (on a non-standard port) and Flapjack. You should now be able to access the Flapjack Web UI at:

http://localhost:3080/

And consume the REST API at:

http://localhost:3081/

N.B. The Redis installed by Flapjack runs on a non-standard port (6380), so it doesn't conflict with other Redis instances you may already have installed.

Other OSes:

Currently we only make a package for Ubuntu Precise (amd64). If you feel comfortable getting a Ruby environment going on your preferred OS, then you can also just install Flapjack from rubygems.org:

gem install flapjack

Using a tool like rbenv or rvm is recommended to keep your Ruby applications from interfering with one another.

Alternatively, you can add support for your OS of choice to omnibus-flapjack and build a native package. Pull requests welcome, and we'll help you make this happen!

You'll also need Redis >= 2.6.12.

Configuring

Have a look at the default config file and modify things as required. The package installer copies this to /etc/flapjack/flapjack_config.toml if it doesn't already exist.

# edit the config
sudo vi /etc/flapjack/flapjack_config.toml

# reload the config
sudo /etc/init.d/flapjack reload

Running

Ubuntu Precise 64:

After installing the Flapjack package, Redis and Flapjack should be automatically started.

First up, start Redis if it's not already started:

# status:
sudo /etc/init.d/redis-flapjack status

# start:
sudo /etc/init.d/redis-flapjack start

Operating Flapjack:

# status:
sudo /etc/init.d/flapjack status

# reload:
sudo /etc/init.d/flapjack reload

# restart:
sudo /etc/init.d/flapjack restart

# stop:
sudo /etc/init.d/flapjack stop

# start:
sudo /etc/init.d/flapjack start

Usage

Please see the documentation.

Developing Flapjack

Information on developing more Flapjack components or contributing to core Flapjack development can be found in the Developing section of the docs.

Note that the master branch corresponds to Flapjack 2; maintenance builds for Flapjack 1 are built from the maint/1.x branch.

Documentation Submodule

We have the documentation for this project on a GitHub wiki and also referenced as a submodule at /doc in this project. Run the following commands to populate the local doc/ directory:

git submodule init
git submodule update

If you make changes to the documentation locally, here's how to publish them:

  • Check out master within the doc subdir, otherwise you'll be committing to no branch, a.k.a. no man's land.
  • git add, commit and push from inside the doc subdir
  • Add, commit and push the doc dir from the root (this updates the pointer in the main git repo to the correct ref in the doc repo, we think...)

RTFM

All of the documentation.

flapjack's People

Contributors

ali-graham, alperkokmen, aussielunix, auxesis, clarkf, damncabbage, elmobp, ferrisoxide, giganteous, jbergler, jessereynolds, jsoriano, kbailey4444, masteinhauser, michaelneale, mkobel, mrichar1, portertech, rebyn, sarahriley, someword, stephenweber, vegetableman, zoran

flapjack's Issues

Expose internal Flapjack performance statistics as JSON

Provide a JSON endpoint that can be polled.

Metrics to expose:

  • events per second
  • min, avg, max, median, ne: time taken to process events
  • min, avg, max, median, ne: freshness of checks
  • number of entities
  • number of checks
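
As a rough illustration of what such an endpoint might return, the sketch below summarises a list of event processing times and renders the result as JSON. The method names summarise and stats_json and all key names are assumptions made for this example, not part of Flapjack. The 'ne' figure from the list above is left out because the issue does not define it.

```ruby
require 'json'

# Hypothetical helper: summarise a list of numeric samples (e.g. seconds
# taken to process each event). Returns nil when there are no samples.
def summarise(samples)
  return nil if samples.empty?

  sorted = samples.sort
  mid = sorted.size / 2
  median = sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0

  {
    'min' => sorted.first,
    'avg' => sorted.sum.to_f / sorted.size,
    'max' => sorted.last,
    'median' => median
  }
end

# Build the JSON document a polling client would receive.
# Key names here are illustrative only.
def stats_json(processing_times, entity_count, check_count)
  JSON.generate({
    'event_processing_time' => summarise(processing_times),
    'entities' => entity_count,
    'checks' => check_count
  })
end
```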

Create HTTP API endpoint for basic entity and contact import

This will accept a POSTed collection of data from an external source as JSON and import it into the Redis database. It will operate in a similar fashion to the current flapjack-populator import script.

In the interests of simplicity, this will be destructive:

  • it will nuke existing contact data, and apply what is POSTed
  • it won't nuke existing entity data, but will update any entities that already exist (deleting entities is not safe, as various other Redis records depend on the ID)

This could later be extended to support exporting the data as well, which would be useful for:

  • backups
  • upgrading + downgrading Flapjack when there are internal data structure schema changes

Longer term we should extend the API to be fully REST and support viewing + manipulating:

  • scheduled + unscheduled maintenance
  • contacts + entities
  • event processor statistics
  • gateway statistics

Show registered contacts for each entity and check

We need to show who will be notified when checks go into problem states; this is data that will be imported into Redis from a JSON source. A web page with a table showing name, contact method, contact details for each contact for each check should be sufficient.

flapjack-nagios-receiver stops sending events after nagios is restarted

It needs to detect when the writing end of the named pipe has gone away and then probably try closing and opening the named pipe until it's getting data again.

It could also do this by having an inactivity timeout - triggering re-opening of the pipe after say a minute of no events arriving.
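
A minimal sketch of both ideas, assuming a helper named read_line_with_timeout (hypothetical, not taken from flapjack-nagios-receiver). It uses IO.select to wait for data with a timeout and reports end-of-file separately, so the caller can decide when to reopen the pipe. It assumes the writer emits whole lines; a partial line would still make gets block.

```ruby
# Wait up to `timeout` seconds for a line from `io`.
# Returns the line, :timeout if nothing arrived in time, or :eof if the
# writing end has been closed.
def read_line_with_timeout(io, timeout)
  ready = IO.select([io], nil, nil, timeout)
  return :timeout if ready.nil?

  line = io.gets
  line.nil? ? :eof : line
end
```

A receiver loop could treat both :timeout and :eof as a signal to close the named pipe and open it again, with the timeout set to something like the one minute suggested above.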

Add rudimentary performance analysis reporting with ruby-prof

Look at getting some useful metrics out of flapjack with the help of ruby-prof.

This is a first step to more detailed and better presented / integrated reporting, eg perhaps by rendering statistics in newrelic compatible format for the RPM agent to send to newrelic (see #54).

api and web common logger is writing duplicate lines

eg:

127.0.0.1 - - [06/Nov/2012 13:11:15] "GET / HTTP/1.1" 200 13842 0.0990
127.0.0.1 - - [06/Nov/2012 13:11:15] "GET / HTTP/1.1" 200 13842 0.0992

The time taken in the last column is slightly longer on the second log entry, indicating the Rack::CommonLogger is firing twice for each request for some reason.

The hits are logged without duplicates when run from rspec, however.

Can't initialise notifier when notifying

Stephen Nelson-Smith/LordCope reported this:

root@ubuntu-test:~/flapjack-admin# flapjack-notifier -r
/etc/flapjack/recipients.yaml -c /etc/flapjack/flapjack-notifier.yaml
-d /root/flapjack-admin/production.db
DEBUG notifier: Loading the Mailer notifier
 INFO notifier: using the Mailer notifier
/var/lib/gems/1.8/gems/extlib-0.9.13/lib/extlib/inflection.rb:41:in
`underscore': undefined method `to_const_path' for nil:NilClass
(NoMethodError)
       from /var/lib/gems/1.8/gems/dm-core-0.9.11/lib/dm-core.rb:144:in `setup'
       from /var/lib/gems/1.8/gems/flapjack-0.4.11/lib/flapjack/cli/notifier.rb:136:in
`setup_database'
       from /var/lib/gems/1.8/gems/flapjack-0.4.11/bin/flapjack-notifier:27
       from /usr/local/bin/flapjack-notifier:19:in `load'
       from /usr/local/bin/flapjack-notifier:19

root@ubuntu-test:~/flapjack-admin# install-flapjack-systemwide
/usr/local/bin/install-flapjack-systemwide:19:in `load':
/var/lib/gems/1.8/gems/flapjack-0.4.11/bin/install-flapjack-systemwide:56:
unterminated string meets end of file (SyntaxError)
       from /usr/local/bin/install-flapjack-systemwide:19

Implement configurable notification intervals per media per contact

Different contacts on entities + checks may want to receive notifications at different intervals. For example, an on-call tech may want to receive a recurring alert every 15 minutes when a check is failing, while a manager only wants to receive a single alert when there is a state change.

Details:

  • allow a custom interval per media
  • allow setting an interval per contact
  • if no interval is set on the media inherit the interval from the contact

Implementation:

  • we may need to implement a second set of filters in the flapjack executive, called “notification filters”, and rename the existing filters to “event filters”
  • the event filters decide whether a notification should be sent out
  • the notification filters decide who should receive the notification
  • the event filters are run first, then the notification filters are run straight after
  • the notification filters would perform a map/reduce
    • the map phase gets a list of all contact medias who may want to be notified about this service
    • the reduce phase applies a list of filters to each contact media to determine if a notification should be sent right now
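
The map and reduce phases, together with the interval inheritance described under Details, could be sketched as follows. All names here (ContactMedia, contact_medias, apply_filters, interval_filter) and the shape of the contact hashes are hypothetical. The filter assumes it is only consulted for repeat alerts on a check that is already failing.

```ruby
ContactMedia = Struct.new(:contact, :media, :interval, :last_notified_at)

# Map phase: flatten contacts into one record per contact medium, inheriting
# the contact's interval when the medium has none of its own.
def contact_medias(contacts)
  contacts.flat_map do |contact|
    contact[:media].map do |media|
      ContactMedia.new(contact[:name], media[:type],
                       media[:interval] || contact[:interval],
                       media[:last_notified_at])
    end
  end
end

# Reduce phase: keep only the records that every filter accepts.
def apply_filters(medias, filters)
  medias.select { |cm| filters.all? { |filter| filter.call(cm) } }
end

# Example notification filter: allow a repeat alert only once the interval
# has elapsed. A nil interval means "no repeats after the first alert".
def interval_filter(now)
  lambda do |cm|
    return true if cm.last_notified_at.nil?
    return false if cm.interval.nil?

    now - cm.last_notified_at >= cm.interval
  end
end
```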

self stats - event processing rate is incorrect

The count of events processed is no longer being reset to zero it would seem, and so the 'Average rate' shown on the self stats page is incorrect (much higher than reality) as it's being calculated by total events processed over uptime.

Events processed: 41153810 (ok: 39788555, failure: 1351902, action: 0)
Average rate: 74284.85559566788 events per second
Total keys in redis: 81900
Uptime: 9 minutes, 14 seconds
Boot time: 2012-10-03 14:26:44 +1000
Current time: 2012-10-03 14:35:58 +1000
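
The arithmetic behind the inflated figure can be checked with a short sketch. The method average_rate and its count_at_boot baseline are hypothetical; recording a baseline at startup is just one possible way to avoid counting events from earlier runs.

```ruby
# Average events per second since boot. `count_at_boot` is a hypothetical
# baseline: the value of the persistent counter when the process started.
def average_rate(total_events, uptime_seconds, count_at_boot = 0)
  return 0.0 if uptime_seconds <= 0

  (total_events - count_at_boot).to_f / uptime_seconds
end

uptime = 9 * 60 + 14 # "9 minutes, 14 seconds" from the output above

# Without a baseline, the lifetime total is divided by the current uptime,
# which reproduces the inflated average shown on the self stats page.
inflated = average_rate(41_153_810, uptime)
puts inflated.round(2)
```

With a baseline recorded at boot, only the events processed since startup would be divided by the uptime.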

Flapping services have PROBLEM alerts masked for up to five minutes

When testing with a service that is flapping up / down every minute, the delays filter blocks the problem alerts such that there is a maximum of one every five minutes. This means that if a service does this:

up->down, 1 minute passes, down->up, 1 minute passes, up-> down

Then the PROBLEM alert for the second up->down state transition will be delayed by three minutes.

RECOVERY alerts are not subject to this so if the service keeps flapping you may see something like:

PROBLEM, RECOVERY, RECOVERY, RECOVERY, PROBLEM, RECOVERY, RECOVERY, RECOVERY, ...etc

Solution: The delays filter needs to be smarter and not block PROBLEM alerts when the last alert to be sent out was a RECOVERY.
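
A sketch of the proposed rule, assuming a hypothetical predicate block_problem_alert? and a five-minute minimum interval taken from the behaviour described above (this is not the actual delays filter code):

```ruby
MIN_PROBLEM_INTERVAL = 300 # seconds; the five-minute window described above

# Decide whether a PROBLEM alert should be held back.
# `last_alert_type` is :problem, :recovery, or nil if nothing was sent yet.
def block_problem_alert?(last_alert_type, seconds_since_last_alert)
  return false if last_alert_type.nil?         # nothing sent yet
  return false if last_alert_type == :recovery # a new failure after recovery

  seconds_since_last_alert < MIN_PROBLEM_INTERVAL
end
```

In the flapping example, the second up->down transition follows a RECOVERY, so its PROBLEM alert would go out immediately instead of waiting three minutes.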

Mass outage detection

When a service event’s status is a failure, for each tag associated with the check and entity, update a counter.

This can be used by filters later on to suppress + collapse notifications if the counter exceeds a threshold.
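
An in-memory sketch of the counting idea follows. The class name TagFailureCounter is hypothetical, and a real implementation would presumably keep the counters in Redis and decrement or expire them as checks recover, which this example does not attempt.

```ruby
# Counts failures per tag and reports which tags have crossed a threshold.
class TagFailureCounter
  def initialize(threshold)
    @threshold = threshold
    @counts = Hash.new(0)
  end

  # Call once per failing service event with the tags of its check and entity.
  def record_failure(tags)
    tags.uniq.each { |tag| @counts[tag] += 1 }
  end

  # Tags whose failure count exceeds the threshold.
  def mass_outage_tags
    @counts.select { |_tag, count| count > @threshold }.keys
  end
end
```

A later filter could suppress or collapse notifications for any event carrying one of the tags returned by mass_outage_tags.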

Creating scheduled maintenance via Web UI is broken

Adding scheduled maintenance via the web interface is broken: you get "String can't be coerced into Fixnum" in the browser. Here's the log:

TypeError - String can't be coerced into Fixnum:
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:378:in `+'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:378:in `block in maintenances'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:376:in `collect'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:376:in `maintenances'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:165:in `update_scheduled_maintenance'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:139:in `create_scheduled_maintenance'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/web.rb:136:in `block in <class:Web>'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:1265:in `call'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:1265:in `block in compile!'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:835:in `[]'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:835:in `block (3 levels) in route!'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:851:in `route_eval'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:835:in `block (2 levels) in route!'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:872:in `block in process_route'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:870:in `catch'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:870:in `process_route'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:834:in `block in route!'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:833:in `each'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:833:in `route!'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:936:in `dispatch!'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:769:in `block in call!'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:921:in `block in invoke'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:921:in `catch'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:921:in `invoke'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:769:in `call!'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/sinatra-1.3.3/lib/sinatra/base.rb:755:in `call'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/rack-1.4.1/lib/rack/methodoverride.rb:21:in `call'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/rack-fiber_pool-0.9.2/lib/rack/fiber_pool.rb:21:in `block in call'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/rack-fiber_pool-0.9.2/lib/fiber_pool.rb:48:in `call'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/rack-fiber_pool-0.9.2/lib/fiber_pool.rb:48:in `block (3 levels) in initialize'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/rack-fiber_pool-0.9.2/lib/fiber_pool.rb:47:in `loop'
        /opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/rack-fiber_pool-0.9.2/lib/fiber_pool.rb:47:in `block (2 levels) in initialize'

There is also a similar error from executive, I think when it tries to act on scheduled maintenance:

FATAL flapjack-coordinator: String can't be coerced into Fixnum
/opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:378:in `+'
/opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:378:in `block in maintenances'
/opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:376:in `collect'
/opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:376:in `maintenances'
/opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/data/entity_check.rb:165:in `update_scheduled_maintenance'
/opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/executive.rb:164:in `update_keys'
/opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/executive.rb:101:in `process_event'
/opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/executive.rb:74:in `main'
/opt/rbenv/versions/1.9.3-p125/lib/ruby/gems/1.9.1/gems/flapjack-0.6.33/lib/flapjack/coordinator.rb:133:in `block in build_pikelet'

Buffer XMPP messages when not connected

Sometimes the XMPP gateway tries to send a notification just before the presence in the group chat has been set up and so it is missed.

This is only an issue when first starting up Flapjack and processing initial problems. It might be good to buffer notifications in a list if we're not connected yet, and send them all when connected.
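
The buffering idea might be sketched like this. BufferedSender is a hypothetical wrapper and the block passed to it stands in for whatever actually delivers a message over XMPP; no real XMPP library API is shown.

```ruby
# Holds messages until the connection is ready, then flushes them in order.
class BufferedSender
  def initialize(&deliver)
    @deliver = deliver
    @connected = false
    @pending = []
  end

  # Deliver immediately when connected, otherwise queue for later.
  def send_message(message)
    if @connected
      @deliver.call(message)
    else
      @pending << message
    end
  end

  # Call once presence in the group chat has been established.
  def connected!
    @connected = true
    @deliver.call(@pending.shift) until @pending.empty?
  end
end
```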

Notifier event creation is non-functional

Stephen Nelson-Smith/LordCope reported this:


---
:retval: 2
:output: |
 ERROR: banana not found at http://172.16.47.130

:id: 1
Result is:
#<Flapjack::Result output="ERROR: banana not found at
http://172.16.47.130\n", retval=2, id=1>
 INFO notifier: Processing result for check '1'
 INFO notifier: Notifying on check '1'
 INFO notifier: Notifying Jane Doe via Flapjack::Notifiers::Mailer about check 1
 INFO notifier: Creating event for check '1'
/usr/lib64/ruby/gems/1.8/gems/flapjack-0.4.11/lib/flapjack/cli/notifier.rb:234:in
`process_result': uninitialized constant Flapjack::NotifierCLI::Event
(NameError)
       from /usr/lib64/ruby/gems/1.8/gems/flapjack-0.4.11/lib/flapjack/cli/notifier.rb:194:in
`process_loop'
       from /usr/lib64/ruby/gems/1.8/gems/flapjack-0.4.11/lib/flapjack/cli/notifier.rb:193:in
`loop'
       from /usr/lib64/ruby/gems/1.8/gems/flapjack-0.4.11/lib/flapjack/cli/notifier.rb:193:in
`process_loop'
       from /usr/lib64/ruby/gems/1.8/gems/flapjack-0.4.11/bin/flapjack-notifier:35
       from /usr/bin/flapjack-notifier:19:in `load'
       from /usr/bin/flapjack-notifier:19

Note I added some debug into the notifier to show the result body.

It seems to fall over at:

event = Event.new(:check_id => result.id)

mute all notifications matching a regex (via web, api, jabber)

eg via jabber:

flapjack: mute /foo-app-0[123]/ duration: 30 minutes

Possible implementation:

  • create a key 'mute_regexs' that is a list of regular expressions
  • create an event filter that looks to see if the entity_check key (ie the "$entity:$check" string for the service event) matches any of the regexs, if it does, block

The mute_regexs can be manipulated via web interface, api interface, and jabber gateway.

Note, not sure how to implement the auto expiry of each mute regex. Perhaps each one should be in a separate string key with expiry, and provide pointers to these from a sorted set ...
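
To make the matching and expiry behaviour concrete, here is an in-memory sketch. It deliberately swaps the Redis keys discussed above for a plain Ruby array, and the class name MuteList and its methods are hypothetical.

```ruby
# Keeps a list of mute patterns, each with its own expiry time (in seconds).
class MuteList
  Mute = Struct.new(:pattern, :expires_at)

  def initialize
    @mutes = []
  end

  def mute(pattern, duration_seconds, now)
    @mutes << Mute.new(Regexp.new(pattern), now + duration_seconds)
  end

  # True if the "entity:check" key matches any mute that has not yet expired.
  def muted?(entity, check, now)
    @mutes.reject! { |m| m.expires_at <= now }
    key = "#{entity}:#{check}"
    @mutes.any? { |m| m.pattern.match?(key) }
  end
end
```

An event filter would call muted? with the event's entity and check, and block the event when it returns true.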

Reject ACKs via jabber if the duration given is not acceptable to flapjack

Currently, if you give a duration but it fails to parse, flapjack chooses the 4 hour default. It would be better to reject the ACK with a message saying the duration specified couldn't be parsed.

Here's an example of current behaviour:

[5:05] PROBLEM ::: flapjack: ACKID 2210956 ::: "HTTP Port 80" on wiki.bar.net is CRITICAL ::: No route to host
[5:06] flapjack: ACKID 2210956 JR ignoring duration: 3 days 5 minutes
[5:06] ACKing HTTP Port 80 on wiki.bar.net (2210956)
[5:06] ACKNOWLEDGEMENT ::: "HTTP Port 80" on wiki.bar.net has been acknowledged, unscheduled maintenance created for 4 hours ::: JR ignoring
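
One way to get the stricter behaviour is a parser that returns nil instead of falling back to a default. The sketch below is not Flapjack's duration parser: the method names, the supported units and the reply wording are all assumptions, and this toy parser happens to accept "3 days 5 minutes" where the real one did not.

```ruby
UNIT_SECONDS = {
  'second' => 1, 'minute' => 60, 'hour' => 3600, 'day' => 86_400
}.freeze

# Parse text such as "30 seconds" or "3 days 5 minutes" into seconds.
# Returns nil as soon as any part cannot be understood.
def parse_duration(text)
  tokens = text.strip.split(/\s+/)
  return nil if tokens.empty? || tokens.size.odd?

  tokens.each_slice(2).sum do |amount, unit|
    seconds = UNIT_SECONDS[unit.downcase.sub(/s\z/, '')]
    return nil unless amount.match?(/\A\d+\z/) && seconds

    amount.to_i * seconds
  end
end

# Build the reply to an ACK command; the wording is illustrative only.
def ack_response(duration_text)
  seconds = parse_duration(duration_text)
  return "Could not parse duration '#{duration_text}', ACK rejected" if seconds.nil?

  "ACK accepted, unscheduled maintenance created for #{seconds} seconds"
end
```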

Dry up Rakefile - make use of bundler/rake tasks

Replace https://github.com/flpjck/flapjack/blob/master/Rakefile#L27-52 with the tasks included with Bundler.

require 'bundler'
Bundler::GemHelper.install_tasks

This gives:

lunix@glenmorangie] -> be rake -T
rake build    # Build flapjack-0.0.1.gem into the pkg directory
rake install  # Build and install flapjack-0.0.1.gem into system gems
rake release  # Create tag v0.0.1 and build and push flapjack-0.0.1.gem to Rubygems

Replace https://github.com/flpjck/flapjack/blob/master/Rakefile#L54-60 with the clean rake tasks from rake.

require 'rake/clean'

This gives:

be rake -T
rake clean    # Remove any temporary products.
rake clobber  # Remove any generated file.

Reduce gem dependencies

We need to keep Flapjack's footprint small by limiting the number of libraries it depends on.

In particular:

  • Use ERB instead of HAML for HTML templates - it's bundled with Ruby
  • Use built-in Ruby Logger rather than log4r
  • Is YAJL necessary? Ruby 1.9 incorporates a JSON library with C bindings, we could use the 'json' gem on 1.8, which has no external library dependency.
  • Perhaps split some components into separate gems? flapjack and em-flapjack? flapjack would run separate processes, older XMPP lib, different web-server (maybe unicorn?); em-flapjack would use EM, fibers, blather and thin.

Write user documentation

Should at a minimum cover:

  • Installation
  • What components do
  • Architecture
  • Configuring Flapjack components
  • Configuring Nagios
  • Starting Flapjack
  • Importing contacts and entities (CLI)
  • Format of JSON contact + entity data
  • Configuring New Relic integration

For the moment, we should:

  • Put all the above documentation into a single document (e.g. USING.markdown) in the Flapjack GitHub Wiki.
  • Make README.markdown a quickstart guide, with appropriate pointers to USING.markdown

Make doc/ a git submodule of the Flapjack GitHub Wiki.

Improve Flapjack daemon control (start/stop/restart/status)

Currently flapjack has a pretty appalling init script that sends a simple-minded TERM signal to the running process and hopes for the best; it doesn't wait for the process to stop, etc.

This needs to be addressed, perhaps by adding a separate control script, such as currently implemented with flapjack-nagios-receiver-control using the daemons gem.

flapjack does use the daemons gem but much of the functionality it can provide is not currently being leveraged.

Once this is done, we can then convert flapjack-nagios-receiver into a pikelet too.

Write developer documentation

Should at a minimum cover:

  • Coding standards
  • Running tests
  • Contribution process
  • Releasing
  • (placeholder text saying “read the architecture documentation first”)
  • Redis database instances
  • Redis data structures
  • Writing gateways
  • Writing pikelets
  • Importing contacts and entities (talking to the API)
  • How to benchmark

For the moment, we should:

  • Put all the above documentation into a single document (e.g. DEVELOPING.markdown) in the Flapjack GitHub Wiki.
  • Make README.markdown a quickstart guide, with appropriate pointers to DEVELOPING.markdown

Make doc/ a git submodule of the Flapjack GitHub Wiki.

Individual Jabber ID addressing

Flapjack currently decides whether to generate a jabber notification for a notification event by examining the contacts who care about the check (or the check's entity) and seeing if there is a Jabber ID included on the contact. But the content of the Jabber ID is ignored and the message goes to whatever Jabber conference room IDs are configured under the jabber_gateway section of the configuration file.

We need to use the Jabber IDs of each contact instead.

Group Chat vs Individual Jabber IDs

I'm thinking that we assume a Jabber ID on a contact is not the address of a group chat room, unless it matches one of the jabber rooms in the configuration file. So we still only join (set up presence) group chat rooms that are specified in the configuration file. But notifications are sent to Jabber IDs regardless of whether they are known group chat rooms or not.

One possible complication is that jabber servers usually will not allow posting to group chat rooms unless presence has been established. This will be mitigated somewhat by #12.
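
The room-versus-individual decision could be as small as the sketch below. The method name jabber_message_type is hypothetical, and configured_rooms stands in for the room list from the jabber_gateway section of the configuration file. The :groupchat and :chat symbols mirror the standard XMPP message types.

```ruby
# Decide how a notification to `jid` should be addressed: as a group chat
# message if the JID is one of the configured rooms, otherwise as a
# one-to-one chat message.
def jabber_message_type(jid, configured_rooms)
  configured_rooms.include?(jid) ? :groupchat : :chat
end
```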

Allow newlines in Jabber comments and commands

If you put a newline within comment text given to the ACKID command the newline and what's after it will be discarded, e.g. at the moment if you put the following as the input to Flapjack:

flapjack: ACKID 4146118 JR foo
trying a newline or two duration: 30 seconds

You'll get:

[10:47] ACKNOWLEDGEMENT ::: "PING" on foo.bar.com has been acknowledged, unscheduled maintenance created for 4 hours ::: JR foo

It may be useful to include newlines in comments / commands.
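
A sketch of a parser that keeps multi-line comments, assuming the bot's name prefix ("flapjack: ") has already been stripped. The pattern and the method name parse_ackid are hypothetical and are not taken from the jabber gateway.

```ruby
# Hypothetical pattern for "ACKID <id> <comment> [duration: <time>]".
# The /m flag lets "." match newlines, so a multi-line comment is kept whole.
ACKID_COMMAND = /\AACKID\s+(\d+)\s+(.*?)(?:\s+duration:\s*(.+))?\z/m

# Returns a hash with the ack id, comment and optional duration text,
# or nil if the input is not an ACKID command.
def parse_ackid(text)
  match = ACKID_COMMAND.match(text.strip)
  return nil unless match

  { ack_id: match[1], comment: match[2], duration: match[3] }
end
```

With the example input above, the comment would keep both lines and the duration text would be "30 seconds" rather than being lost.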

Add "disable per-check notifications" functionality

@jessereynolds brain dumped:

Some kind of permanent acknowledgement like 'i know this is screwed, i have no idea when it will be fixed, and nobody cares, so stop bugging everyone about it' ... needs a bit more specification ... perhaps a 'disable notifications' is ok, which could create an unscheduled outage with no end time, so it would be akin to a 'permanent acknowledgement' (or 'indefinite acknowledgement' more correctly), just don't set the expire on the unscheduled maintenance key. thoughts?

@auxesis + @ali-graham posited that ideologically it's pretty poor practice to be monitoring things that you don't take action on (if a tree falls in the forest...). @jessereynolds countered that while in principle he agrees, in the real world it's a needed feature.

Use case: you provision monitoring for a customer system that is in active development, and isn't going live for 2 weeks. You don't want to enable monitoring without disabling notifications, otherwise the on-call engineer will need to acknowledge alerts every 4 hours.

As a compromise:

  • We can achieve the desired "disable monitoring" outcome by using scheduled maintenance on a per check basis, rather than a global kill switch
  • Implement ability to cancel a currently active scheduled maintenance, in case you need to cut short the maintenance

Implement continuous out of band end-to-end testing

To ensure Flapjack is behaving correctly, Flapjack should perform some sort of self monitoring in the form of end-to-end testing.

Implementation details:

  • Make this a Pikelet
  • Run it as a separate process on a separate machine
  • Notify using XMPP + PagerDuty

@jessereynolds: can you add more details on how you think this should work?

Add logging of web and api requests

So we can see what web and api requests are being received and processed by flapjack, and the resulting response code, time served, length etc. These should be separate files, one for web and one for api requests.

Presumably Thin makes this real easy...

Crash - too many open files

Running the 'identify' command via jabber triggers a crash of flapjack. It shells out to get hostname -f and various other things, so it is probably the straw breaking the camel's back here.

Provide mechanism for simulating a failure to test notifications

When setting up monitoring for a new check, we want to be able to simulate a failure of the check in order to test that people are notified as expected.

Possible implementation:

  • keep a separate data structure for when checks are in simulated failure (so that these failure periods can be subtracted / masked out of the SLA reporting)
  • define a new kind of action event, similar to setting unscheduled maintenance, which when processed would
    • put a check into this simulated failure state with an expiry specified in the event (or a default of eg 5 minutes)
    • kick off an EM timer to generate the simulated failure events every 10 seconds until the simulated failure state expires
  • executive is set to ignore the real check updates from the checking engine (eg nagios) as long as they are OK
  • if a real failure arrives then terminate the simulated failure and start alerting for real (this last bit may be too messy to implement as we'd need to reset all the states)
