bun's Introduction

Bun

โš ๏ธ Please note that D2iQ supplies Bun "as is" and does not support it. We created this tool for internal D2iQ development and support teams to surfaces potential errors or misconfigurations for further analysis by trained individuals. As such, it can and does produce false-positive results.

Bun is a command-line program that detects the most common problems in a DC/OS cluster by analyzing its diagnostics bundle.

$ bun
+-------------+-------------------------------------------------------------+
| Check       | nscd-running                                                |
+-------------+-------------------------------------------------------------+
| Status      | [UNDEFINED]                                                 |
+-------------+-------------------------------------------------------------+
| Description | Detects if Name Service Cache Daemon (nscd) is running on a |
|             | DC/OS node                                                  |
+-------------+-------------------------------------------------------------+
| Summary     | Couldn't check any hosts because of the error(s). Launch    |
|             | this check with the -v flag to see the details.             |
+-------------+-------------------------------------------------------------+

+------------------------+------------------------------------------------------------+
| Check                  | oom-kills                                                  |
+------------------------+------------------------------------------------------------+
| Status                 | [PROBLEM]                                                  |
+------------------------+------------------------------------------------------------+
| Description            | Detects out of memory kills in dmesg log                   |
+------------------------+------------------------------------------------------------+
| Cure                   | The operating system is killing processes which exceed     |
|                        | system or container memory limits. Please check which      |
|                        | processes are getting killed. If it is a DC/OS container,  |
|                        | increase its memory limit.                                 |
+------------------------+------------------------------------------------------------+
| Summary                | Error pattern "invoked oom-killer" found.                  |
+------------------------+------------------------------------------------------------+
| [P] agent 10.10.10.104 | Error pattern occurred 3 time(s) in file dmesg-0.output.gz |
+------------------------+------------------------------------------------------------+
| [P] agent 10.10.10.105 | Error pattern occurred 2 time(s) in file dmesg-0.output.gz |
+------------------------+------------------------------------------------------------+

+-----------+----+
|  SUMMARY  |    |
+-----------+----+
| Failed    |  1 |
| Undefined |  1 |
| Passed    | 20 |
+-----------+----+
|   TOTAL   | 22 |
+-----------+----+

Installation

macOS

  1. Download and unpack the binary:
$ curl -O -L https://github.com/mesosphere/bun/releases/latest/download/bun_darwin_amd64.tar.gz && tar -zxvf bun_darwin_amd64.tar.gz
  2. Move the bun binary to one of the directories in the PATH.

Linux

  1. Download and unpack the binary:
$ curl -O -L https://github.com/mesosphere/bun/releases/latest/download/bun_linux_amd64.tar.gz && tar -zxvf bun_linux_amd64.tar.gz
  2. Move the bun binary to one of the directories in the PATH.

Windows

  1. Download the command
  2. Unzip it and move the bun binary to one of the folders in the PATH.

From sources

  1. Install the Go compiler.
  2. Run the following command in your terminal:
$ go get github.com/mesosphere/bun

Usage

$ bun -p <path to bundle directory>

or, if the working directory is the bundle directory, simply:

$ bun

Run the following command to learn more:

$ bun --help

Update

Bun checks for new versions and, with your permission, updates itself automatically.

How to contribute

Please report bugs and share your ideas for new features via the issue page.

The project is written in Go; please use the latest version of the compiler.

To add a new feature or fix a bug, clone the repository: git clone https://github.com/mesosphere/bun.git and use your favorite editor or IDE.

To test your changes, simply build the CLI and launch it against some bundle:

$ go build
$ ./bun -p <path to a bundle directory>

Bundle files

Names of DC/OS diagnostics bundle files may vary from one version of DC/OS to another, and they are not always descriptive or handy. That is why Bun gives each file a human-readable, self-explanatory ID and uses these IDs to refer to the bundle files. The file files_type_yaml.go contains the descriptions of the bundle files. The bundle.Bundle struct is a representation of the diagnostics bundle file structure; use it to browse through the bundle and access its files.
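
The idea of stable file IDs can be sketched as follows. This is only an illustration under assumptions, not Bun's actual implementation: the fileType struct, the resolve function, and the second candidate file name are hypothetical.

```go
package main

import "fmt"

// fileType is a hypothetical stand-in for an entry that describes one kind
// of bundle file; it is not Bun's real type.
type fileType struct {
	ID    string   // human-readable ID used by checks
	Paths []string // candidate file names across DC/OS versions
}

// resolve returns the first candidate path of the file type with the given
// ID that is present in the set of files found on a host.
func resolve(types []fileType, present map[string]bool, id string) (string, bool) {
	for _, ft := range types {
		if ft.ID != id {
			continue
		}
		for _, p := range ft.Paths {
			if present[p] {
				return p, true
			}
		}
	}
	return "", false
}

func main() {
	types := []fileType{
		// The second name is made up to illustrate a per-version variation.
		{ID: "dmesg", Paths: []string{"dmesg-0.output.gz", "dmesg-alt.output.gz"}},
	}
	present := map[string]bool{"dmesg-0.output.gz": true}
	path, ok := resolve(types, present, "dmesg")
	fmt.Println(path, ok)
}
```

A check written against the ID keeps working when a file is renamed in a later DC/OS version, because only the list of candidate paths has to change.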

How to add new checks

The core abstraction of the Bun tool is checks.Check:

package checks

type Check struct {
	Name           string 
	Description    string
	Cure           string
	OKSummary      string
	ProblemSummary string
	Run            CheckBundleFunc 
}

type CheckBundleFunc func(bundle.Bundle) Results

type Result struct {
	Status Status
	Value  interface{}
	Host   bundle.Host
}

To add a new check, create an instance of that struct, describe the check by filling in its string fields, and provide a Run function, which does the actual testing.
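
A minimal sketch of these steps is shown below. To keep it compilable on its own, it declares simplified stand-ins for the bundle and checks types; the Files map on Host, the string-based Status values, and the net-log-present check are assumptions made for illustration and are not taken from Bun.

```go
package main

import "fmt"

// Simplified stand-ins for Bun's types; the real definitions may differ.
type Status string

const (
	SOK      Status = "OK"
	SProblem Status = "PROBLEM"
)

type Host struct {
	IP    string
	Files map[string]string // file ID -> content (hypothetical)
}

type Bundle struct{ Hosts []Host }

type Result struct {
	Status Status
	Value  interface{}
	Host   Host
}

type Results []Result

type Check struct {
	Name           string
	Description    string
	Cure           string
	OKSummary      string
	ProblemSummary string
	Run            func(Bundle) Results
}

// newNetLogCheck describes the check with its string fields and supplies a
// Run function that reports a problem for hosts without a networking log.
func newNetLogCheck() Check {
	return Check{
		Name:           "net-log-present", // hypothetical check name
		Description:    "Checks that every host has a networking log",
		Cure:           "Regenerate the diagnostics bundle.",
		OKSummary:      "All hosts have a networking log.",
		ProblemSummary: "Some hosts have no networking log.",
		Run: func(b Bundle) Results {
			var results Results
			for _, h := range b.Hosts {
				status := SOK
				if _, ok := h.Files["net-log"]; !ok {
					status = SProblem
				}
				results = append(results, Result{Status: status, Host: h})
			}
			return results
		},
	}
}

func main() {
	b := Bundle{Hosts: []Host{
		{IP: "10.10.10.104", Files: map[string]string{"net-log": "..."}},
		{IP: "10.10.10.105", Files: map[string]string{}},
	}}
	check := newNetLogCheck()
	for _, r := range check.Run(b) {
		fmt.Println(check.Name, r.Host.IP, r.Status)
	}
}
```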

To make adding checks easier, Bun provides some help; for example, you can declare checks as a YAML object, or use the CheckFuncBuilder struct to simplify creation of the Check.Run function. Please see the next sections for the details.

Search check

Search checks look for specified strings or regular expressions in a bundle file to detect or rule out a specific problem. Search checks are also very easy to add: you don't even need to write any code.

To create a new search check, simply add a new object to the YAML document in the checks/search_checks_yaml.go file. For example:

- name: exhibitor-disk-space
  description: Checks for disk space errors in Exhibitor logs
  fileTypeName: exhibitor-log
  errorPattern: 'No space left on device'
  cure: Please check that there is sufficient free space on the disk.

To avoid false positives, you can specify a string or regular expression that indicates that the problem is gone. For example, the following check will not fail if the string "Time is in sync" appears in the networking log after the last line that matches the error pattern.

- name: time-sync
  description: Checks if time is synchronised on the host machine.
  fileTypeName: net-log
  errorPattern: '(internal consistency is broken|Unable to determine clock sync|Time is not synchronized|Clock is less stable than allowed|Clock is out of sync)'
  isErrorPatternRegexp: true
  curePattern: 'Time is in sync'
  cure: Check NTP settings and NTP server availability.

Check a condition on each node of a certain type

If you need to check that a certain condition is satisfied on each DC/OS node of a given type (that is, master, agent, or public agent), you can use the checks.CheckFuncBuilder. With its help, you only need to create a function which checks for the condition on one node. The builder will do the rest. For example, the following check detects a situation when Mesos mailboxes have too many messages:

...
	builder := checks.CheckFuncBuilder{
		CheckMasters:      collect,
		CheckAgents:       collect,
		CheckPublicAgents: collect,
	}
	check := checks.Check{
		Name: "mesos-actor-mailboxes",
		Description: "Checks if actor mailboxes in the Mesos process " +
			"have a reasonable amount of messages",
		Cure: "Check I/O on the correspondent hosts and if something is overloading Mesos agents or masters" +
			" with API calls.",
		OKSummary:      "All Mesos actors are fine.",
		ProblemSummary: "Some Mesos actors are backlogged.",
		Run:            builder.Build(),
	}
...

type MesosActor struct {
	ID     string `json:"id"`
	Events []struct{}
}

func collect(host bundle.Host) checks.Result {
	var actors []MesosActor
	if err := host.ReadJSON("mesos-processes", &actors); err != nil {
		return checks.Result{
			Status: checks.SUndefined,
			Value:  err,
		}
	}
	var mailboxes []string
	for _, a := range actors {
		if len(a.Events) > maxEvents {
			mailboxes = append(mailboxes, fmt.Sprintf("(Mesos) %v@%v: mailbox size = %v (> %v)",
				a.ID, host.IP, len(a.Events), maxEvents))
		}
	}
	if len(mailboxes) > 0 {
		return checks.Result{
			Host:   host,
			Status: checks.SProblem,
			Value:  mailboxes,
		}
	}
	return checks.Result{
		Status: checks.SOK,
	}
}

If your check needs to analyse the data collected on each node, you can implement an Aggregate function instead of using the default one; please see an example in the dcos-version (checks/dcosversion/check.go) check.
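
The split between a per-node function and an aggregate step might look like the sketch below. It does not reproduce the dcos-version check or the real signature of Aggregate, which this document does not show; the stand-in types, the Version field, and the run helper are all hypothetical.

```go
package main

import "fmt"

// Simplified stand-ins for Bun's types; the real definitions may differ.
type Status string

const (
	SOK      Status = "OK"
	SProblem Status = "PROBLEM"
)

type Host struct {
	IP      string
	Version string // hypothetical; stands in for data read from the bundle
}

type Result struct {
	Status Status
	Value  interface{}
	Host   Host
}

// collectVersion runs on one host and only gathers data.
func collectVersion(h Host) Result {
	return Result{Status: SOK, Value: h.Version, Host: h}
}

// aggregateVersions inspects all per-host results together and reports a
// problem when the hosts disagree on the version.
func aggregateVersions(results []Result) Result {
	seen := map[string]bool{}
	for _, r := range results {
		if v, ok := r.Value.(string); ok {
			seen[v] = true
		}
	}
	if len(seen) > 1 {
		return Result{
			Status: SProblem,
			Value:  fmt.Sprintf("%d different versions found", len(seen)),
		}
	}
	return Result{Status: SOK}
}

// run applies the per-host function to every host and then aggregates.
func run(hosts []Host, perHost func(Host) Result, aggregate func([]Result) Result) Result {
	results := make([]Result, 0, len(hosts))
	for _, h := range hosts {
		results = append(results, perHost(h))
	}
	return aggregate(results)
}

func main() {
	hosts := []Host{
		{IP: "10.10.10.104", Version: "1.13.0"},
		{IP: "10.10.10.105", Version: "1.13.1"},
	}
	res := run(hosts, collectVersion, aggregateVersions)
	fmt.Println(res.Status, res.Value)
}
```

The per-node function stays simple because it never decides on its own whether the cluster is healthy; that decision needs the data from all nodes and therefore belongs in the aggregate step.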

How to release

  1. Install GoReleaser.

  2. Create a GitHub personal access token with the repo scope and export it as an environment variable called GITHUB_TOKEN:

    $ export GITHUB_TOKEN=<your personal GitHub access token>

    Please find more information about this step here.

  3. Create a Git tag which adheres to semantic versioning and push it to GitHub:

    $ git tag -a v1.9.8 -m "Release v1.9.8"
    $ git push origin v1.9.8

    If you made a mistake on this step, you can delete the tag remotely and locally:

    $ git push origin :refs/tags/v1.9.8
    $ git tag --delete v1.9.8
  4. Test that the build works with the following command:

    $ goreleaser release --skip-publish --rm-dist
  5. If everything is fine, publish the build with the following command:

    $ goreleaser release --rm-dist

bun's People

Contributors

abudnik, adyatlov, bbannier, cneth, edgarlanting, fabs, rukletsov, wavesoft

bun's Issues

Rate Limit Detection

Rate limit messages, which indicate that the deployment of some services is being prevented, can be found in the following files:

<bundle_dir>/<agent_ip>_agent/var/log/mesos/mesos-agent.log and/or
<bundle_dir>/<agent_ip>_agent/dcos-mesos-slave.service

Search pattern:

You have reached your pull rate limit. You may increase the limit by authenticating and upgrading

Update README.md to resemble the current bun codebase

Since bun is now incorporated in the Mesosphere account, I think it's useful to change all references to Andrey's account (adyatlov), including a refresh of the advertised code snippets in this readme (bun.CheckBuilder, etc...)

new pattern: "clock synchronization error" in cockroachdb service's log

Bun currently doesn't detect this obvious pattern in the cockroachdb service log:

"2020-01-22 17:09:59.889798 +0100 CET F200122 16:09:59.889709 29 server/server.go:219 [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)"

Time drift in the Cockroachdb log on DC/OS 1.13

The pattern
"Sleeping till wall time 1579124038327600109 to catches up to 1579124038779591927 to ensure monotonicity. Delta: 451.991818ms"
in the cockroachdb service's log indicates a tiny time drift that prevents the CockroachDB instance from syncing with the other nodes of the CockroachDB cluster and, hence, prevents the corresponding Mesos master from joining the DC/OS cluster.
The pattern is relevant for DC/OS 1.13.
Details:
https://mesosphere.slack.com/archives/G09D33Q0Y/p1579192470358800
https://github.com/cockroachdb/cockroach/blob/master/pkg/server/server.go#L980

DCOS-19790

Unable to configure some of the overlays on this Agent:

Statsd message queue full for telegraf.services

2019-11-12 08:37:30.955952 +0100 CET 2019-11-12T07:37:30Z E! Error: statsd message queue full. We have dropped 243450000 messages so far. You may want to increase allowed_pending_messages in the config
