
Matrix

Welcome to the real world.

This is a test engine designed to validate that real-world software solutions function properly under a variety of adverse conditions. While it can be run in a way very similar to bundletester, this engine is designed around a different model. The idea is to bring up a running deployment, set up a pattern of application-level tests, and ensure that the system still functions after operations modelled with Juju are performed. In addition, the system supports large-scale failure injection, such as removal of units or machines while tests are executing. It is because of this async nature that the engine is written on a fresh codebase.

Every effort should be made by the engine to correlate mutations to failure states and produce helpful logs.

Interactive Mode

Extending Matrix from within a Bundle

The default suite of tests is designed to provide a general assertion of functionality for a given bundle based on the common operations and promises that Juju provides. This ensures that, even without any work on the part of the charm or bundle author, we can make some assurances of the quality of the bundle and its charms. However, each bundle will have its own specific functionality that cannot be tested generically.

Bundles can extend the testing that Matrix does in several ways. The goal is to allow the bundle author to focus on the things that are unique to their bundle, while Matrix handles the things that are common to testing all bundles.

End-to-end load

The best way for a bundle to verify the functionality unique to that bundle is to provide an end-to-end load generator that verifies that the stack as a whole is functioning as expected. This can be done in two ways:

  • tests/end_to_end.py This file should contain an async function called end_to_end that will be called with two arguments: a Juju model instance, and a standard logger instance where it can emit messages.

  • tests/end_to_end This file should be executable, and will be invoked with the name of the model being tested. The output will be logged, with stderr being logged as errors.

In either case, the load generator will be called after the model has been deployed and has settled out. It should run indefinitely, continually generating a reasonable amount of load sufficient to assure that the system as a whole is functioning properly. If the function returns or the executable terminates, it will be considered a test failure. Otherwise, it will be terminated automatically once the rest of the built-in tests have finished.
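For illustration, a minimal tests/end_to_end.py following the two-argument contract described above might look like this. The check_frontpage() probe is hypothetical — replace it with real load generation for your bundle:

```python
# Sketch of tests/end_to_end.py. Matrix calls end_to_end(model, log)
# with a libjuju Model and a stdlib logger; the function should run
# until it is cancelled. check_frontpage() is a made-up placeholder.
import asyncio


async def check_frontpage(model):
    # Hypothetical probe: a real bundle might issue an HTTP request
    # against an exposed application's public address here.
    return True


async def end_to_end(model, log):
    # Loop forever; returning (or raising) is treated as a failure.
    while True:
        ok = await check_frontpage(model)
        if not ok:
            raise RuntimeError("front page check failed")
        log.info("load iteration OK")
        await asyncio.sleep(5)
```

Matrix cancels the coroutine once the built-in tests finish, so the endless loop is the expected shape.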

Custom Suite

A bundle can also provide a custom Matrix suite in tests/matrix.yaml. This is a YAML file using the format described below. It can use any of the built-in Matrix tasks, and it can provide custom tasks as well. The bundle directory will be included on the Python path, so, for example, the bundle could provide a tests/matrix.py file with custom tasks and the YAML could refer to them via tests.matrix.task_name.
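For example, a minimal tests/matrix.yaml referencing a bundle-provided task might look like this (the test and task names here are illustrative):

```yaml
tests:
- name: bundle_specific
  description: Checks unique to this bundle
  rules:
    - do:
        action: matrix.tasks.deploy
    - do:
        action: tests.matrix.my_custom_task  # defined in tests/matrix.py
      after: deploy
```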

Running Matrix

Install and run Matrix as a juju plugin by doing the following:

git clone https://github.com/juju-solutions/matrix.git
cd matrix
sudo pip3 install . -f wheelhouse --no-index
juju matrix -p /path/to/bundle

This will run Matrix in interactive mode, with a terminal UI that shows the progress of each test, their results, the status of the Juju model, and the Juju debug log. If you prefer to use the non-interactive mode, invoke Matrix with the raw screen option:

juju matrix -p /path/to/bundle -s raw

By default, Matrix runs its built-in suite of tests, along with a matrix.yaml test case if found in the bundle. You can also pass in additional Matrix tests via the command line:

juju matrix -p /path/to/bundle /path/to/other/test.yaml

See juju matrix --help for more information and invocation options.

Running against bundles from the store

By itself, Matrix can only be run against local copies of bundles. To run against a bundle in the store, you can use bundletester:

sudo pip2 install bundletester
sudo pip3 install 'git+https://github.com/juju-solutions/matrix.git'
bundletester -t cs:bundle/wiki-simple

In addition to running the bundle and charm tests, bundletester will run Matrix on the bundle. Note that it will not run it in interactive mode, so you will only see the end result. The matrix.log and chaos_plan.yaml files will be available, however.

Running with the virtualenv

If you're developing on Matrix, or don't want to install it on your base system, you can use Tox to run Matrix's unit tests and build a virtualenv from which you can run Matrix:

git clone https://github.com/juju-solutions/matrix.git
cd matrix/
tox -r
. .tox/py35/bin/activate
juju matrix -p /path/to/bundle

Note that if any of the requirements change, you will need to rebuild the virtualenv:

deactivate
tox -r
. .tox/py35/bin/activate

Functional testing

Matrix also includes a full-stack test, which requires you to pass in a controller name to run:

tox -- -s --controller=lxd

Note that this takes some time, as it runs the default Matrix test suite against a trivial bundle.

If you want to do more thorough functional testing of the various quirks and corners of matrix, you can run the functional testing suite via tox like so:

tox -r -e functional -v

JaaS Support

You can also run juju-matrix against JaaS controllers. In JaaS, however, the cloud you intend to deploy to is ambiguous if not specified, so you must specify a cloud with -C/--cloud.

To deploy to aws, for example, you would use the following invocation:

juju matrix -p /path/to/bundle -C aws

High level Design

Tests are run by an async engine driven by a simple rule engine. Doing things this way lets us express the high-level test plan in terms of rules and states (similar to reactive and layer-cake).

tests:
- name: Traffic
  description: Traffic in the face of Chaos
  rules:
    - do:
        action: deploy
        version: current
    - do: test_traffic
      until: chaos.complete
      after: deploy
    - do:
        action: matrix.tasks.chaos
      while: test_traffic
    - do:
        action: matrix.tasks.health
        periodic: 5
      until: chaos.complete

Given this YAML test definition fragment, the intention is as follows. Define a test relative to a bundle. Deploying that bundle sets a state, triggering the next rule and invoking a traffic-generating test. The traffic-generating test runs "until" a state is set (chaos.complete) and may be invoked more than once by the engine. While the engine is running the traffic suite, a state (test_traffic, based on the test name) is set. This triggers the "while" rule, which launches another task (chaos) on the current deployment. When that task has done what it deems sufficient it can exit, which stops the execution of the traffic test.

Rules are evaluated continuously until the test completes and may run in parallel. Excessive use of parallelism can make failure analysis more complicated for the user, however.

For a system like this to function we must continuously assert the health of the running bundle. This means there is an implicit task checking agent/workload health after every state change in the system. State in this case means states set by rules and transitions between rules. As Juju grows a real health system, we'd naturally extend Matrix to depend on that.

Tasks

The system includes a number of built-in tasks that are resolved from any do clause when no matching file is found in the tests directory. Currently these tasks are:

matrix.tasks.deploy:
    version: *current* | prev

matrix.tasks.health

matrix.tasks.chaos:
    applications: *all* | [by_name]

Chaos internally might have a number of named components and mutation events that can be used to perturb the model. The configuration thereof is TBD.

Plugins

If there is no binary on the path matching a given do: action: name, the engine will attempt to load a Python object via a dotted import path. The last object should be callable with the signature handler(context, rule). The context object is the rule engine's Context object and rule is the current Rule instance. The callable should return a boolean indicating whether the rule is complete. If the task is designed to run via an 'until' condition, it will be marked as complete after its task has been cancelled.

Test failure can be indicated immediately by raising matrix.model.TestFailure, which will fail the test and cancel any pending Tasks related to it. If you wish to signal test failure from an executable (non-plugin) task, exit with any non-zero value and a TestFailure exception will automatically be raised.
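As a sketch, a plugin task following the handler(context, rule) contract and TestFailure convention above might look like this. The health check itself is a placeholder, and the fallback class exists only so the sketch runs outside Matrix:

```python
# Minimal plugin-task sketch. matrix.model.TestFailure is the real
# exception named above; the fallback lets this sketch run standalone.
try:
    from matrix.model import TestFailure
except ImportError:
    class TestFailure(Exception):
        pass


def handler(context, rule):
    # context: the rule engine's Context; rule: the current Rule.
    healthy = True  # placeholder -- inspect `context` in a real task
    if not healthy:
        raise TestFailure("application-level check failed")
    return True  # True means the rule is complete
```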

Interactions with other tools

Matrix can be used with existing testing tools. More work around integration is coming, but currently it is simple enough to have matrix run an existing testing tool and design your test plans around that. It is also possible to have an external runner call matrix and depend on its return value, such as running in bundletester mentioned above.

The advantages of a system like Matrix lie not only in a reusable suite of tests but in helping to extract information from failure cases that can be used to improve both the charms and, where it makes sense, their upstream software. Because the approach ties failures to properties of the model and the runtime, there is more to be gleaned than a simple pass/fail gate.

When Matrix is complete it should provide more information about the runtime of your deployments than you'd normally have access to and should be seen as part of the feedback loop DevOps depends on.

matrix's People

Contributors

arosales, bcsaller, johnsca, kwmonroe, lutostag, nskaggs, pengale


matrix's Issues

Matrix no longer works w/ bundles from the store

The following invocation of matrix no longer works:

matrix tests/test_2.matrix

It fails with the following error:

Task(command='matrix.tasks.deploy', args={'version': 'current', 'entity': 'cs:bundle/wiki-simple'})
Traceback (most recent call last):
  File "/home/petevg/Code/matrix/matrix/model.py", line 163, in execute
    result = await self.execute_plugin(context, cmd, rule)
  File "/home/petevg/Code/matrix/matrix/model.py", line 180, in execute_plugin
    result = await cmd(context, rule, self, event)
  File "/home/petevg/Code/matrix/matrix/tasks/deploy.py", line 6, in deploy
    new_apps = await context.juju_model.deploy(str(context.config.path))
  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/model.py", line 897, in deploy
    await client_facade.AddCharm(channel, entity_id)
  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/facade.py", line 317, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/_client.py", line 9371, in AddCharm
    reply = await self.rpc(msg)
  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/facade.py", line 436, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/connection.py", line 93, in rpc
    raise JujuAPIError(result)
juju.errors.JujuAPIError: charm or bundle URL has invalid form: "/home/petevg/Code/matrix"

This is related to the refactor that allows us to skip specifying a bundle in a .matrix test. If intentional, this is fine, but we need to revise that test, and possibly allow a user to pass in a bundle name from the store as an arg (close if I'm just missing the arg).

Add different gating rules for HA bundles

If a bundle is "High Availability", then we expect it to survive a chaos run. If a bundle is not -- e.g., it comprises a single db unit and a single service unit -- it may not survive a chaos run.

We'd like our automated testing tools to fail HA units if they fail a chaos run, but we don't need to be so harsh on simpler bundles.

I'm currently doing some work to allow a bundle author to flag a bundle as HA, triggering matrix to gate on failure differently. Will push a PR shortly.

Better/quicker validation of matrix plans

Use fake juju to validate a plan before we invest minutes (or tens of minutes) of time in executing it.

Current matrix validation is very basic, and doesn't catch a lot of issues.

Glitch can get us into a state where we cannot reset

Take the following glitch plan, and run it on wiki-simple to reproduce:

actions:
- action: destroy_machine
  selectors:
  - {selector: machines}
  - {selector: one}
- action: remove_unit
  selectors:
  - {application: mysql, selector: units}
  - {selector: leader, value: true}
  - {selector: one}
- action: kill_juju_agent
  selectors:
  - {application: wiki, selector: units}
  - {selector: leader, value: true}
  - {selector: one}
- action: kill_juju_agent
  selectors:
  - {application: wiki, selector: units}
  - {selector: leader, value: true}
  - {selector: one}
- action: destroy_machine
  selectors:
  - {selector: machines}
  - {selector: one}

You might get this Exception:

matrix:331:exception_handler: Traceback (most recent call last):

  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)

  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/application.py", line 164, in destroy
    return await app_facade.Destroy(self.name)

  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/facade.py", line 317, in wrapper
    reply = await f(*args, **kwargs)

  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/_client.py", line 7633, in Destroy
    reply = await self.rpc(msg)

  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/facade.py", line 436, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)

  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/connection.py", line 93, in rpc
    raise JujuAPIError(result)

juju.errors.JujuAPIError: cannot destroy application "wiki": state changing too quickly; try again soon

The simplest thing to do might be to just catch that error, and retry the reset.
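Sketching that suggestion (the retry_reset helper and its arguments are hypothetical, not part of Matrix today):

```python
# Illustrative retry wrapper for the "state changing too quickly"
# error above: retry the reset a few times before giving up.
import asyncio


async def retry_reset(reset, attempts=5, delay=2.0):
    """Call the async `reset` callable, retrying the known transient error."""
    for attempt in range(attempts):
        try:
            return await reset()
        except Exception as e:  # e.g. juju.errors.JujuAPIError
            transient = "state changing too quickly" in str(e)
            if not transient or attempt == attempts - 1:
                raise
            await asyncio.sleep(delay)
```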

Glitch should only apply changes to apps deployed by the test

While we should generally run matrix tests on clean models, for dev / testing it's often helpful to (re)run against a pre-deployed model. However, if the current model includes applications that are not part of the matrix test, we might cause mutations to them which are not relevant to the test.

I think glitch should start from context.apps instead of context.juju_model.applications.

Matrix should create a model to run tests in

Matrix breaks things when it runs, which is bad if you accidentally run it against production, and reset is unreliable.

Both of these issues would be addressed if Matrix spun up a custom, uniquely id'ed model.

A complete solution would allow the config to override this behavior, so that pre-existing models could be reused if desired.

Feature request: Snap Matrix

It will almost certainly need to be unconfined, at least for now, but installation and usage would still be easier with a snap.

Add non glitched gating test in the default test bundle

Currently, we do not want to make glitch into a gating test, because it fails on known issues, but we don't have any tests that actually gate once you take away glitch.

Need to tweak the default test so that it gates on deploy, without glitch.

Add compatibility with Jaas

I just started up matrix while juju was pointed at the Jaas controller. It failed pretty spectacularly.

We need to handle the situation where an operator is using Jaas. The best way might be to pass in some Jaas arguments as command line arguments to matrix.

Quickstart in README doesn't work

The Quick Start currently states:

git clone https://github.com/juju-solutions/matrix.git
cd matrix/
tox
. .tox/py35/bin/activate
matrix tests/test_prog

but that fails hard:

usage: matrix [-h] [-c CONTROLLER] [-m MODEL] [-k] [-l LOG_LEVEL]
[-L [LOG_NAME [LOG_NAME ...]]]
[-f [LOG_FILTER [LOG_FILTER ...]]] [-s {tui,raw}] [-x FILENAME]
[-i INTERVAL] [-p PATH] [-D] [-B]
[-t [TEST_PATTERN [TEST_PATTERN ...]]] [-g GLITCH_PLAN]
[-n GLITCH_NUM] [-o GLITCH_OUTPUT]
[additional_suites [additional_suites ...]]
matrix: error: Invalid bundle directory: /home/acollard/Projects/matrix

It seems like it wants a bundle.yaml file in $PWD?

Create a selector for models

@abentley wants to be able to perform glitch actions against models.

The following code snippet probably does what he wants:

@selector
async def model(rule: Rule, model: Model):
    return [model]

Make glitch plans more repeatable by simplifying selection process

Instead of our current branching path through selectors, we should have a single select function, that includes something like the following args:

hosted charm(s) ("application")
ram
cpu
workload_status
leadership

To select a machine, we take a weighted sum of the args (leadership weighs less; hosted charms weigh more) for each deployed machine, and select the one with the best match, choosing randomly in the case of a tie.

Actions would then take a machine as the third argument (which, for a selector that just acts on a model, is simply a random machine), and derive what they need to act from the properties of that machine.
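The weighted-sum idea could be sketched roughly as follows; the weights, attribute names, and dict-based machine representation are all invented for illustration:

```python
# Rough sketch of weighted-sum machine selection with random tie-breaking.
import random

# Hypothetical weights: hosted charms weigh more, leadership less.
WEIGHTS = {"application": 3.0, "workload_status": 1.5, "leadership": 0.5}


def select_machine(machines, criteria):
    # Score each machine by the weighted sum of criteria it matches.
    def score(m):
        return sum(weight for attr, weight in WEIGHTS.items()
                   if attr in criteria and m.get(attr) == criteria[attr])

    best = max(score(m) for m in machines)
    # Choose randomly among the best-scoring machines.
    return random.choice([m for m in machines if score(m) == best])
```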

Glitch is too aggressive on non HA bundles

Deploying a single instance of a web service, along with a single instance of a database is risky, but is not necessarily incorrect -- as long as you make regular backups, have good monitoring, and consider some hours of downtime to be acceptable for your blog/wiki/other small service, a deploy like this can be the correct, budget conscious choice.

Glitch will happily run destroy_unit when testing your bundle, however, creating a situation that needs human intervention, and that has been planned to need human intervention. Currently, glitch will mark this as a failure.

We should probably only run the destroy_unit action on bundles that promise high availability, either as a flag in glitch, or as a flag in the bundle yaml.

Cannot "juju ssh" to machines created in a matrix test

To reproduce:

  1. Run a matrix test.
  2. As the test is running, "juju switch" to the model that the test created.
  3. Try to "juju ssh" to one of the machines.

Note that you get a "permission denied (publickey)" error.

Either the path to the public key is not getting set in the terminal env (unlikely, because "juju ssh" to a model that I created works), or we're not copying the juju public key onto the machines that we create in a matrix test (weird, but it is possible that the websocket api is missing the command that creates those keys).

Crash while destroying a model on AWS

This happened in the cwr charm. Looks like we got a timeout trying to tear down the model, which means that a) our second test never ran, and b) we left a model shaped mess on the controller.

                                                                            matrix:124:load_suites: Parsing /usr/local/lib/python3.5/dist-packages/matrix/matrix.yaml
matrix:422:add_model: Creating model matrix-fun-chimp
Start Test deployment Basic Deployment
==============================================================================
deploy:4:deploy: Deploying /tmp/cwr-tmp-ViPWhT/bundletester-HKYRI6/cs__kwmonroe_bundle_java_devenv
deploy:6:deploy: Deploy COMPLETE
health:56:health: Health check: busy
[... "Health check: busy" repeated many more times ...]
health:56:health: Health check: settling
[... "Health check: settling" repeated several more times ...]
health:56:health: Health check: healthy
------------------------------------------------------------------------------
matrix:228:crashdump: Running crash dump
matrix:230:crashdump: Crashdump specified, but skipping juju-crashdump due to gh issue #59. Will still save off matrix log.
matrix:263:crashdump: Crashdump COMPLETE
matrix:461:destroy_model: Destroying model matrix-fun-chimp
matrix:403:run: Error destroying model: 
matrix:490:exception_handler: Top Level Exception Handler
Traceback (most recent call last):
  File "/usr/lib/python3.5/asyncio/selector_events.py", line 662, in _read_ready
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out
asyncio:1148:default_exception_handler: Unhandled error in exception handler
context: {'message': 'Fatal read error on socket transport', 'exception': TimeoutError(110, 'Connection timed out'), 'transport': <_SelectorSocketTransport fd=11 read=polling write=<idle, bufsize=0>>, 'protocol': <asyncio.sslproto.SSLProtocol object at 0x7ff0e5122240>}
Traceback (most recent call last):
  File "/usr/lib/python3.5/asyncio/base_events.py", line 1180, in call_exception_handler
    self._exception_handler(self, context)
  File "/usr/local/lib/python3.5/dist-packages/matrix/rules.py", line 498, in exception_handler
    raise e
  File "/usr/lib/python3.5/asyncio/selector_events.py", line 662, in _read_ready
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out
matrix:490:exception_handler: Top Level Exception Handler
Traceback (most recent call last):
  File "/usr/lib/python3.5/asyncio/selector_events.py", line 662, in _read_ready
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out
asyncio:1148:default_exception_handler: Unhandled error in exception handler
context: {'message': 'Fatal read error on socket transport', 'exception': TimeoutError(110, 'Connection timed out'), 'transport': <_SelectorSocketTransport fd=10 read=polling write=<idle, bufsize=0>>, 'protocol': <asyncio.sslproto.SSLProtocol object at 0x7ff0e3291fd0>}
Traceback (most recent call last):
  File "/usr/lib/python3.5/asyncio/base_events.py", line 1180, in call_exception_handler
    self._exception_handler(self, context)
  File "/usr/local/lib/python3.5/dist-packages/matrix/rules.py", line 498, in exception_handler
    raise e
  File "/usr/lib/python3.5/asyncio/selector_events.py", line 662, in _read_ready
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out
matrix:378:run: Error adding model: WebSocket connection is closed: code = 1006, no reason.
------------------------------------------------------------------------------
matrix:228:crashdump: Running crash dump
matrix:230:crashdump: Crashdump specified, but skipping juju-crashdump due to gh issue #59. Will still save off matrix log.
matrix:263:crashdump: Crashdump COMPLETE
matrix:461:destroy_model: Destroying model matrix-fun-chimp
matrix:403:run: Error destroying model: WebSocket connection is closed: code = 1006, no reason.
Run Complete
deployment         ✔
end_to_end         ✘

Halting matrix w/ Ctrl-C breaks my terminal

To reproduce:

  1. Run matrix in a Terminal (I'm using the default terminal on Ubuntu Xenial).
  2. Ctrl-C to cancel the run.
  3. Note that text the user inputs is no longer displayed (though it will still have an effect -- try typing "clear", then hitting [Enter] for a reasonably safe example).

matrix doesn't use lxd cred in 2.1

Juju 2.1 now requires credentials for lxd. In cwr-ci, we support this with the set-credentials action. This works for cwr models, but fails when matrix tries to create a model:

matrix:124:load_suites: Parsing /usr/local/lib/python3.5/dist-packages/matrix/matrix.yaml
matrix:461:add_model: Creating model matrix-happy-pig
matrix:380:run: Error adding model: no credential specified
------------------------------------------------------------------------------
matrix:431:cleanup: Error while running crashdump.

action is potentially too overloaded a term

In the execution plan, using action may be too similar to the concept of actions in charms. A term not already common in the existing Juju language, like task, might be better suited.

Resources in testing

We would like to be able to include resources in our automated testing.

We want to be able to run tests against local binaries that would be attached to the controller at the time of testing. This way we will be able to verify the correctness of the binaries/resources before we actually upload them to the store.

The need for this has come up during development of the CI for Kubernetes charms.

Thank you

matrix never finishes (sometimes)

I ran cwr on a bundle last night on my local provider as well as a gce cloud. Local finished quickly, but today, there's still a model running on my gce controller. The matrix.log on the unit where matrix is running reports the following over and over again:

health:56:health: Health check: busy

All my units are settled with workload status active and agent status idle:

$ juju status -m matrix-saving-osprey
Model                 Controller  Cloud/Region        Version
matrix-saving-osprey  gce-c       google/us-central1  2.0.3

App      Version  Status  Scale  Charm          Store       Rev  OS      Notes
devenv            active      1  ubuntu-devenv  jujucharms    4  ubuntu
openjdk           active      1  openjdk        jujucharms    5  ubuntu

Unit          Workload  Agent  Machine  Public address   Ports  Message
devenv/0*     active    idle   0        104.197.176.159         devenv ready with: java
  openjdk/0*  active    idle            104.197.176.159         OpenJDK 8 (jre) installed

Machine  State    DNS              Inst id        Series  AZ
0        started  104.197.176.159  juju-cc93dd-0  xenial  us-central1-a

Relation  Provides  Consumes  Type
java      devenv    openjdk   subordinate

The since times are all more than 30s before the current time, which should cause the health check to acknowledge that everything is healthy:

$ date
Thu Feb 16 16:53:23 UTC 2017

$ juju status -m matrix-saving-osprey --format=yaml | grep -i since
      since: 15 Feb 2017 23:44:10Z
      since: 15 Feb 2017 23:42:08Z
      since: 16 Feb 2017 16:50:37Z
          since: 16 Feb 2017 16:50:37Z
          since: 16 Feb 2017 16:50:37Z
              since: 15 Feb 2017 23:45:32Z
              since: 16 Feb 2017 16:49:59Z
      since: 15 Feb 2017 23:45:32Z

@johnsca thinks this might be caused by the connection being lost and matrix using stale data. Therefore the health check never sees the current workload/agent status as being active/idle.

Investigate using chaos-monkey with glitch for network chaos

The chaos-monkey repo contains some similar functionality to glitch, but also includes logic for creating network chaos. We should try to avoid duplication if possible and leverage that code. This may require some refactoring, though, since Matrix uses libjuju and chaos-monkey was written to be a subordinate.

Request: better define success test or failure

I ran into this while writing a Zookeeper test.

Here's what I want to have happen:

  1. We deploy a bundle, and wait until we are in a "healthy" state.
  2. We run glitch.
  3. We wait until we are in an "unhealthy" or "healthy" state.
  4. We fail if unhealthy, and pass if healthy

Right now, there isn't a great way to define this in a matrix.yaml -- after the health check gets set to healthy, and the health task gets marked as "completed", it's hard to get it to check for health again.

More generally, as @bcsaller brought up, we need a formal way of defining task success or failure, preferably with a way of waiting until the consequences of the test have settled out.

Timeout doesn't cover initial deploy

We timeout if a matrix test takes too long, but this timeout doesn't cover the initial deploy -- if we time out during that, matrix will never exit.

Chaos Selectors don't handle subordinate units well

I saw this while testing the hadoop-processing bundle. A lot of the selectors were failing, on what looked like subordinate units. I'm not sure whether it's the "unit" or "leadership" selector that causes the issue.

Will do more troubleshooting ...

Glitch should not attempt to add_unit to subordinates

Traceback for <Task finished coro=<RuleEngine.rule_runner() done, defined at /home/johnsca/juju/matrix/matrix/rules.py:138> exception=JujuAPIError('cannot add unit 1/1 to application "topbeat": cannot add unit to application "topbeat": application is a subordinate',)> (most recent call last):
  File "/usr/lib/python3.5/asyncio/tasks.py", line 292, in _step
    self = None  # Needed to break cycles when an exception occurs.
  File "/home/johnsca/juju/matrix/matrix/rules.py", line 223, in rule_runner
    break
  File "/home/johnsca/juju/matrix/matrix/model.py", line 343, in execute
    result = await self.task.execute(context, self)
  File "/home/johnsca/juju/matrix/matrix/model.py", line 170, in execute
    raise
  File "/home/johnsca/juju/matrix/matrix/model.py", line 180, in execute_plugin
    result = await cmd(context, rule, self, event)
  File "/home/johnsca/juju/matrix/matrix/tasks/glitch/main.py", line 100, in glitch
    await actionf(rule, model, objects, **action)
  File "/home/johnsca/juju/matrix/matrix/tasks/glitch/actions.py", line 38, in wrapped
    rule, model, obj, **kwargs))
  File "/home/johnsca/juju/matrix/matrix/tasks/glitch/actions.py", line 91, in add_unit
    await application.add_unit(count=count, to=to)
  File "/home/johnsca/juju/matrix/.tox/py35/lib/python3.5/site-packages/juju/application.py", line 87, in add_unit
    num_units=count,
  File "/home/johnsca/juju/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/facade.py", line 317, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/johnsca/juju/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/_client.py", line 7588, in AddUnits
    reply = await self.rpc(msg)
  File "/home/johnsca/juju/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/facade.py", line 436, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/johnsca/juju/matrix/.tox/py35/lib/python3.5/site-packages/juju/client/connection.py", line 93, in rpc
    raise JujuAPIError(result)
juju.errors.JujuAPIError: cannot add unit 1/1 to application "topbeat": cannot add unit to application "topbeat": application is a subordinate
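Glitch could filter out subordinates before choosing an add_unit target. A sketch with the subordinate check injected as a predicate, since the exact libjuju lookup (charm metadata) is an assumption not shown here:

```python
import random

def pick_add_unit_target(applications, is_subordinate):
    # applications: candidate application names for the add_unit glitch.
    # is_subordinate: predicate telling whether an application is a
    # subordinate charm -- injected because the real metadata lookup
    # is an assumption, not matrix's actual API.
    candidates = [a for a in applications if not is_subordinate(a)]
    if not candidates:
        return None  # nothing safe to scale; skip the action
    return random.choice(candidates)
```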

Matrix fails to deploy hadoop-processing

I duplicated test_2.matrix, but replaced the wiki-simple bundle with the hadoop-processing bundle.

Matrix fails on the deploy step due to the following error:

TypeError: addMachines() missing 3 required positional arguments: 'constraints', 'container_type', and 'parent_id'

Full logs here: http://pastebin.ubuntu.com/23426897/

This might actually be more of an issue with python-libjuju not setting sensible defaults for those arguments.

unit.run('sudo pkill juju') raises an exception

To reproduce:

Run a matrix plan that includes a kill_juju_agent action, or just write a script that runs unit.run('sudo pkill jujud') directly. You'll get a traceback indicating that the API is having a hard time because it gets None back rather than a tag for an action.

I'd actually rather handle the error mainly in matrix, as 'pkill jujud' is a very rude edge case, and I think that raising an exception is the correct thing for the API to do.

It might be worthwhile to add a more informative exception to python-libjuju at some point, though.
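Handling it on the matrix side could look like the sketch below: treat any exception from a deliberately agent-killing command as expected. The wrapper name and signature are illustrative, not matrix's actual code:

```python
async def run_agent_killer(unit, command, log):
    # kill_juju_agent deliberately severs the agent connection, so an
    # exception from unit.run here is expected rather than fatal. The
    # broad except is intentional: the API raising is, as argued above,
    # the correct behaviour, and matrix simply absorbs it.
    try:
        return await unit.run(command)
    except Exception as exc:
        log.warning("agent-killing command %r raised %s (expected)",
                    command, exc)
        return None
```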

Odd timestamp issue

I ran into this trace when I stepped away from my computer after leaving a matrix test against hadoop-processing running (my computer did not suspend during the test):

Traceback for <Task finished coro=<RuleEngine.rule_runner() done, defined at /home/petevg/Code/matrix/matrix/rules.py:138> exception=ValueError('unconverted data remains: Z',)> (most recent call last):
  File "/usr/lib/python3.5/asyncio/tasks.py", line 292, in _step
    self = None  # Needed to break cycles when an exception occurs.
  File "/home/petevg/Code/matrix/matrix/rules.py", line 223, in rule_runner
    break
  File "/home/petevg/Code/matrix/matrix/model.py", line 343, in execute
    result = await self.task.execute(context, self)
  File "/home/petevg/Code/matrix/matrix/model.py", line 170, in execute
    raise
  File "/home/petevg/Code/matrix/matrix/model.py", line 180, in execute_plugin
    result = await cmd(context, rule, self, event)
  File "/home/petevg/Code/matrix/matrix/tasks/health.py", line 22, in health
    workload_status_duration = now - unit.workload_status_since
  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/juju/unit.py", line 50, in workload_status_since
    return datetime.strptime(since, "%Y-%m-%dT%H:%M:%S.%f")
  File "/usr/lib/python3.5/_strptime.py", line 510, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/lib/python3.5/_strptime.py", line 346, in _strptime
    data_string[found.end():])
ValueError: unconverted data remains: Z
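The trailing "Z" (UTC marker) is exactly what "%Y-%m-%dT%H:%M:%S.%f" rejects. One defensive fix is to normalise the suffix before parsing; the wire format is an assumption based only on this traceback:

```python
from datetime import datetime, timezone

def parse_status_since(since):
    # Juju can emit a trailing 'Z' on workload-status timestamps,
    # which produces the "unconverted data remains: Z" error above.
    # Strip it and attach an explicit UTC tzinfo instead.
    if since.endswith("Z"):
        return datetime.strptime(
            since[:-1], "%Y-%m-%dT%H:%M:%S.%f").replace(tzinfo=timezone.utc)
    return datetime.strptime(since, "%Y-%m-%dT%H:%M:%S.%f")
```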

matrix not honoring -c <controller>

When matrix fails for whatever reason, I see crashdump failures as well. As an example, my jenkins log shows matrix being called like this:

2017-03-30 00:55:08 DEBUG call ['/usr/local/bin/matrix', '-s', 'raw', '-c', 'aws-w'] (cwd: /tmp/cwr-tmp-azQq_u/bundletester-DmLoVS/spark-processing-dev)

Note the -c aws-w there. Now see the failure by expanding AWS and clicking the matrix link:

http://bigtop.charm.qa/cwr_bundle__bigdata_dev_spark_processing/2/report.html

About 1/3 of the way down, see the crashdump failure:

matrix:228:crashdump: Running crash dump
matrix:216:execute_process: ERROR model gce-w:ci-70/job-2-matrix-pet-filly not found

This is AWS, yet crash dump is attempting to run on GCE. It seems matrix is not honoring the -c <controller>.

Add a "non terminal unit" selector to remove_unit

Came up in a meeting w/ @abentley

The simplest way to do this would be to rename the third argument of remove_unit to non_terminal_unit, then add special logic to the switch statement in fetch to handle non_terminal_unit. (This means that the generator could generate plans that use the non_terminal_unit selector.)

A more extensive refactor would be to rename the third argument to target, and then base the switch statement in fetch on the type annotation for the parameter passed to target, rather than the argument name.

This replaces the current check inside of remove_unit.
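The annotation-based variant could be sketched as follows. All names here are illustrative stand-ins for matrix's selectors, not its real types:

```python
import inspect

class Unit:  # marker/selector types; illustrative stand-ins only
    pass

class NonTerminalUnit(Unit):
    pass

def selector_for(action):
    # The refactor described above: dispatch on the *type annotation*
    # of the third parameter (rule, model, target, ...) instead of on
    # its name, so adding a selector only needs a new annotation.
    params = list(inspect.signature(action).parameters.values())
    target_type = params[2].annotation
    if target_type is NonTerminalUnit:
        return "non_terminal_unit"
    if target_type is Unit:
        return "unit"
    raise TypeError("unsupported target annotation: %r" % target_type)
```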

how to use a conjure-up spell?

I'm sure I'm doing something stupid, but running:

ubuntu@tupac:~/spells$ juju matrix -c conjure-up-localhost-d11 -m hai -p ./canonical-kubernetes
usage: juju-matrix [-h] [-c CONTROLLER] [-m MODEL] [-M MODEL_PREFIX] [-k]
                   [-l LOG_LEVEL] [-L [LOG_NAME [LOG_NAME ...]]]
                   [-f [LOG_FILTER [LOG_FILTER ...]]] [-d OUTPUT_DIR]
                   [-s {tui,raw}] [-x FILENAME] [-F] [-i INTERVAL] [-p PATH]
                   [-D] [-B] [-t [TEST_PATTERN [TEST_PATTERN ...]]]
                   [-g CHAOS_PLAN] [-n CHAOS_NUM] [-o CHAOS_OUTPUT]
                   [-z TIMEOUT] [-H]
                   [additional_suites [additional_suites ...]]
juju-matrix: error: Invalid bundle directory: canonical-kubernetes
ubuntu@tupac:~/spells$ ls canonical-kubernetes/
metadata.yaml  readme.md  steps  tests

This is with the snap version of juju-matrix

ubuntu@tupac:~/spells$ snap info juju-matrix
name:      juju-matrix
summary:   "Automatic testing of big software deployments under various failure conditions"
publisher: lutostag
contact:   https://github.com/juju-solutions/matrix/issues
description: |
  This is a test engine designed to validate proper function of real-world
  software solutions under a variety of adverse conditions. While this system
  can run in a way very similar to bundletester this engine is designed to a
  different model. The idea here is to bring up a running deployment, set of a
  pattern of application level tests and ensure that the system functions after
  operations modelled with Juju are performed. In addition the system supports
  large scale failure injection such a removal of units or machines while tests
  are executing.
  
commands:
  - juju-matrix
tracking:  edge
installed: 0.9.0 (8) 9MB classic
refreshed: 2017-03-30 19:48:27 +0000 UTC
channels:            
  edge:    0.9.0 (8) 9MB classic

Please let me know if I'm missing anything, thanks.

Glitch plan can attempt to operate on removed units

glitch:98:glitch: GLITCHING remove_unit: [<Unit entity_id="wiki/0">]
glitch:98:glitch: GLITCHING remove_unit: [<Unit entity_id="mysql/0">]
glitch:98:glitch: GLITCHING add_unit: [<Application entity_id="mysql">]
glitch:98:glitch: GLITCHING reboot: [<Unit entity_id="mysql/0">]

It tried to reboot the mysql/0 unit that was removed by a previous action, and got stuck there (no error, no end).
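A straightforward guard is to re-validate each action's targets against the live model before executing it, since earlier actions (e.g. remove_unit) may have deleted them. A sketch; in matrix the live set would be refreshed from the model between actions, and the names here are placeholders:

```python
def prune_dead_targets(plan_targets, live_unit_names):
    # Drop plan targets (unit names) that no longer exist in the
    # model, so a reboot never waits forever on a removed unit.
    return [t for t in plan_targets if t in live_unit_names]
```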

"model" param is not strictly necessary in glitch actions

All glitch/chaos actions take a "model" parameter.

This isn't strictly necessary, as any object we pass into the action as the third parameter already has a .model attribute.

We could simplify each glitch action by getting rid of the model param.

@abentley

Return to user if they have not bootstrapped

If I run something like:
matrix tests/rules.1.yaml

and I have not bootstrapped, then I just get blank output and no indication of what I should be doing. I suggest signalling to users that they need to bootstrap first, then run matrix.

-thanks,
Antonio
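A pre-flight check along these lines could print guidance instead of exiting silently. It assumes the `juju` CLI is on PATH (`juju controllers --format=json` is a real juju invocation); the function names are sketches:

```python
import json
import subprocess

def controllers_present(controllers_json):
    # Parse `juju controllers --format=json` output; an empty or null
    # "controllers" mapping means nothing is bootstrapped.
    data = json.loads(controllers_json or "{}")
    return bool(data.get("controllers"))

def check_bootstrapped():
    # Pre-flight check matrix could run before deploying, so users get
    # a "bootstrap first" hint rather than blank output.
    try:
        out = subprocess.check_output(
            ["juju", "controllers", "--format=json"],
            stderr=subprocess.DEVNULL)
    except (OSError, subprocess.CalledProcessError):
        return False
    return controllers_present(out.decode("utf-8"))
```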

matrix not honoring bundle constraints

Hi friends! I'm running the spark-processing bundle.yaml through cwr, which invokes bundletester/matrix. Here's my yaml:

https://api.jujucharms.com/charmstore/v5/spark-processing/archive/bundle.yaml

Note the constraints: "mem=7G root-disk=32G" on the spark application, for example. When matrix spins up my bundle for the first time (not chaotically), it seems to lose those constraints. I know this because a 7G machine on AWS should have 2 cores, while a 7G machine on GCE has 8. Here's an example of the matrix models that were created on both AWS and GCE. Note the Cores column:

ubuntu@juju-b47b48-ci-shared-0:~$ for i in aws-w gce-c; do juju models -c $i; done
Controller: aws-w

Model                          Cloud/Region   Status     Machines  Cores  Access  Last connection
ci-70/job-22-matrix-set-tiger  aws/us-west-1  available         6      6          never connected
ci-70/job-22-steady-mutt       aws/us-west-1  available         7      9  admin   27 minutes ago

Controller: gce-c

Model                           Cloud/Region        Status     Machines  Cores  Access  Last connection
ci-70/job-22-matrix-sought-asp  google/us-central1  available         6      6          never connected
ci-70/job-22-steady-mutt        google/us-central1  available         7     21  admin   5 minutes ago

The ci-70/job-22-steady-mutt models are correct (verified by ssh'ing to the spark/0 unit and seeing 8 cores on gce, for example). The *-matrix-* models are incorrect (verified by ssh'ing to the spark/0 unit and seeing only 1 core on gce).

Why you lose my constraints?

Crash in urwid while deploying hadoop-processing

I get this exception while running the default tests against hadoop-processing, in a branch where I've made some changes to the way we do glitch actions. I think the changes are clearing up some other crashes and making this one easier to see, rather than causing it.

This leaves urwid in a state where it no longer updates, meaning that we can't finish the deploy action.
matrix.log.zip

matrix:354:exception_handler: Exception in callback AsyncioEventLoop.enter_idle.<locals>.faux_idle_callback() at /home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/urwid/main_loop.py:1288
matrix:358:exception_handler: Traceback (most recent call last):

  File "/usr/lib/python3.5/asyncio/events.py", line 125, in _run
    self._callback(*self._args)

  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/urwid/main_loop.py", line 1289, in faux_idle_callback
    callback()

  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/urwid/main_loop.py", line 564, in entering_idle
    self.draw_screen()

  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/urwid/main_loop.py", line 579, in draw_screen
    self.screen.draw_screen(self.screen_size, canvas)

  File "/home/petevg/Code/matrix/.tox/py35/lib/python3.5/site-packages/urwid/raw_display.py", line 838, in draw_screen
    l = l.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data
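The lone 0xc3 byte is a UTF-8 lead byte split across a write boundary. Decoding defensively keeps the TUI rendering at the cost of a replacement glyph; this is a workaround sketch, not urwid's own fix:

```python
def safe_decode(raw_bytes):
    # Decode terminal output with errors='replace' so a truncated
    # multi-byte sequence yields U+FFFD instead of crashing the
    # draw loop and freezing the UI.
    return raw_bytes.decode("utf-8", errors="replace")
```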
