tendrilinc / marathon-autoscaler

A simple autoscaler for Marathon applications

Home Page: https://hub.docker.com/r/tendril/marathon-autoscaler/

License: Apache License 2.0

Languages: Python 98.09%, Dockerfile 0.80%, Makefile 0.63%, Shell 0.48%

marathon-autoscaler's Introduction

Marathon Autoscaler

Build Status: TravisCI | Docker Images: Docker Hub

Description

The aim of this project is to allow Marathon applications to scale to meet load requirements, without user intervention. To accomplish this, it monitors Marathon's application metrics and scales applications based on user-defined thresholds.


Build and deploy the Autoscaler

The Makefile requires the REGISTRY environment variable to be set to your Docker registry.

REGISTRY=fooreg.mydockerregistry.com make

To build the app manually, the following commands build and push the Autoscaler Docker container:

Build the python zipapp:

mkdir -p build/target
python -m zipapp lib/marathon-autoscaler -o build/target/marathon-autoscaler.pyz

Build the docker image:

docker build -t marathon_autoscaler .

Push the image to your registry:

docker push {{registry_url}}/marathon_autoscaler:latest
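If the image was built without the registry prefix, as in the build step above, it may need to be retagged before pushing; a minimal sketch reusing the {{registry_url}} placeholder:

# retag the locally built image for your registry ({{registry_url}} is a placeholder)
docker tag marathon_autoscaler {{registry_url}}/marathon_autoscaler:latest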

Run the Autoscaler

Deploying the Autoscaler to Marathon

In the scripts directory, deploy_autoscaler_to_marathon.py can be executed to deploy an instance of the Autoscaler to your Marathon system. The parameters needed are explained below:

CLI switch | Environment variable | Description
--interval | INTERVAL | The time duration in seconds between polling events
--mesos-uri | MESOS_URI | The Mesos HTTP endpoint
--mesos-agent-port | AGENT_PORT | The port your Mesos agent is listening on (defaults to 5051)
--marathon-uri | MARATHON_URI | The Marathon HTTP endpoint
--marathon-user | MARATHON_USER | The Marathon username for authentication on the marathon-uri
--marathon-pass | MARATHON_PASS | The Marathon password for authentication on the marathon-uri
--cpu-fan-out | CPU_FAN_OUT | Number of subprocesses to use for gathering and sending stats to Datadog
--dd-api-key | DATADOG_API_KEY | Datadog API key
--dd-app-key | DATADOG_APP_KEY | Datadog app key
--dd-env | DATADOG_ENV | Datadog environment tag used to separate metrics by environment
--log-config | LOG_CONFIG | Path to the logging configuration file (defaults to logging_config.json)
--enforce-version-match | ENFORCE_VERSION_MATCH | If set, applications must match the autoscaler's version to participate
--rules-prefix | RULES_PREFIX | The prefix for rule names

Run the scripts/deploy_autoscaler_to_marathon.py script:

cd scripts && python deploy_autoscaler_to_marathon.py {PARAMETERS}

Deploying a Marathon application to use the Autoscaler

Participation

The autoscaler is a standalone application that monitors Marathon for applications that carry specific labels. To make your application participate, set the use_marathon_autoscaler label to a truthy value or to a version number. To enable version matching, deploy the autoscaler with the --enforce-version-match command-line switch or the ENFORCE_VERSION_MATCH environment variable.

The Autoscaler considers the following list of strings as true:

["true", "t", "yes", "y", "1"]

Minimum and Maximum Instances

The minimum and maximum number of application instances are set with the min_instances and max_instances labels:

...
"labels": {
  "min_instances": 1,
  "max_instances": 10
}
...

Scaling Rules

Scaling rules are set as labels in a Marathon application's definition. To introduce them, let's jump right into an example:

...
"labels": {
	"mas_rule_fastscaleup": "cpu | >90 | PT2M | 3 | PT1M30S"
},
...
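Before unpacking this rule, here is the general shape of a rule label value; the angle-bracketed names are descriptive placeholders, not literal syntax:

"mas_rule_<rule name>": "<metric> | <comparison><threshold> | <tolerance, ISO 8601 duration> | <scale factor> | <backoff, ISO 8601 duration>"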

Explanation: the example rule is named "fastscaleup" and states: if cpu is greater than* 90 percent for 2 minutes, then scale up by 3 instances and back off for 1 minute and 30 seconds**. These label values correspond to the original upper and lower threshold settings, but you are no longer bound to stating both cpu and memory conditions. Exclusive conditions are now expressed by defining multiple rules with the same name. Here is the same rule combined with an additional condition:

...
"labels": {
	"mas_rule_fastscaleup_1": "cpu | >90 | PT2M | 3 | PT1M30S",
	"mas_rule_fastscaleup_2": "mem | >85 | PT2M | 3 | PT1M30S"
},
...

Notice that the tolerance, scale factor, and backoff values are repeated. This is for clarity; when the autoscaler sees two or more rules with the same name, it combines them into one rule and uses the tolerance, scale factor, and backoff of the first rule it sees. In the example above, the "_1" and "_2" suffixes exist because Marathon does not allow duplicate label names. If the suffix is numeric, the autoscaler orders the rules numerically and takes the tolerance, scale factor, and backoff from the mas_rule_fastscaleup_1 rule.

To complete the example above with scale-down rules, here it is extended:

...
"labels": {
	"mas_rule_fastscaleup_1": "cpu | >90 | PT2M | 3 | PT1M30S",
	"mas_rule_fastscaleup_2": "mem | >85 | PT2M | 3 | PT1M30S",
	"mas_rule_slowscaledown_1": "cpu | <=90 | PT1M | -1 | PT30S",
	"mas_rule_slowscaledown_2": "mem | <=85 | PT1M | -1 | PT30S"
},
...

Let's explore some other ideas... Maybe your application is only interested in scaling based on CPU:

...
"labels": {
	"mas_rule_fastscaleup": "cpu | >90 | PT2M | 3 | PT1M30S",
	"mas_rule_slowscaledown": "cpu | <=90 | PT1M | -1 | PT30S",
},
...

Perhaps you want your application to scale up and down differently for different conditions:

...
"labels": {
	"mas_rule_slowscaleup": "cpu | >40 | PT2M | 1 | PT1M30S",
	"mas_rule_fastscaleup": "cpu | >60 | PT1M | 3 | PT30S",
	"mas_rule_hyperscaleup": "cpu | >90 | PT1M | 5 | PT15S",
	"mas_rule_slowscaledown": "cpu | <90 | PT1M30S | -1 | PT30S",
	"mas_rule_fastscaledown": "cpu | <10 | PT3M | -5 | PT30S",
},
...

When multiple rules reference the same metric, the autoscaler should take the action of the rule whose threshold and tolerance most closely match the observed metric. Depending on your application's behavior, some rules may never trigger.
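For example, if CPU in the multi-rule example above stays at 95 percent long enough to fill every rule's tolerance window, the >90 hyperscaleup rule is the closest match and, assuming the closest-match behavior described here, would be the one acted on.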

* Comparisons can use >, <, <=, >=, = or ==

** See the Wikipedia reference on the ISO 8601 time duration format.

Testing the autoscaler with the Stress Tester app

To see how the Autoscaler behaves with an application's scaling settings in a controlled environment, build and deploy the stress test application to an environment running the Autoscaler.

cd tests/stress_tester_app && docker build -t autoscale_test_app .

Push the image to the registry:

docker push autoscale_test_app:latest
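If your registry requires a prefixed image name, a minimal sketch reusing the {{registry_url}} placeholder from the build section:

# retag and push the stress test image with the registry prefix ({{registry_url}} is a placeholder)
docker tag autoscale_test_app {{registry_url}}/autoscale_test_app:latest
docker push {{registry_url}}/autoscale_test_app:latest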

Run the scripts/test_autoscaler.py script:

cd scripts && python test_autoscaler.py --marathon-uri MARATHON_HTTP --marathon-user MARATHON_USER --marathon-pass MARATHON_PASS

marathon-autoscaler's People

Contributors

alexandernilsson, davidxarnold, eadderley, jangie, jmparra, johnjeffers, kernelpanek-segfault, uplightsag


marathon-autoscaler's Issues

Support disk and network I/O metrics

For Mesos configurations that enable the appropriate isolators, the marathon-autoscaler should honor the use of thresholds that pertain to those disk and network I/O metrics.

Not working with DCOS

Hi,

I am trying out the autoscaler on DCOS. The autoscaler is up and running and can fetch data from mesos and marathon as expected. However, it is not scaling the applications.

I tried to add logging around the autoscaler to see what might be wrong. Apparently the application definition always turns out to be an empty {}, so the is_app_participating function is always passed an empty dict:
app_def = ApplicationDefinition(metrics_summary.get("application_ returns null

My application's marathon JSON (condensed) is as follows:
{ "id": "/spark3", "backoffFactor": 1.15, "backoffSeconds": 1, "container": { "type": "DOCKER", "volumes": [], "docker": { "image": "mesosphere/spark:2.0.1-2.2.0-1-hadoop-2.6", "forcePullImage": true, "privileged": false, "parameters": [ { "key": "user", "value": "root" } ] } }, "cpus": 1, "disk": 0, "instances": 1, "labels": { "mas_rule_fastscaleup": "cpu | >50 | PT1M | 2 | PT1M30S", "min_instances": "1", "max_instances": "10", "use_marathon_autoscaler": "yes" } }

Is there something wrong here? Can you please point me in the right direction?

Thanks for the help.

Instructions to Deploy marathon_autoscaler don't seem to be working

Tried the command suggested:
cd scripts && python deploy_autoscaler_to_marathon.py {PARAMETERS}

The instructions are missing the 's' in scripts.

and I get this error:

Traceback (most recent call last):
File "deploy_autoscaler_to_marathon.py", line 15, in
from marathon_autoscaler.marathon import Marathon
ImportError: No module named marathon_autoscaler.marathon

Better integration with TravisCI

There is so much more we can do with TravisCI. In addition to running the unit tests, the following is a list of aspirations for a better integration:

  1. Install minimesos
  2. Configure and start multiple minimesos environments
  3. Build the marathon-autoscaler docker image
  4. Deploy the marathon-autoscaler container to marathon
  5. Deploy the stress testing image to marathon
  6. Test the conditions of running the marathon-autoscaler with the stress test container/app

Scaler not working

Hi,

Thanks for the project - and thanks for the many enhancements recently. I'm having trouble getting the autoscaler working. I just built the docker image and ran it with:

docker run -e MESOS_URI=http://x.x.x.x:5050 -e MARATHON_URI=http://x.x.x.x:8080 -e AGENT_PORT=5051 marathon-autoscaler

From the startup logs, it looks like everything is running ok:

2016-10-21 17:09:26,543 INFO supervisord started with pid 1
2016-10-21 17:09:27,547 INFO spawned: 'marathon_autoscaler' with pid 7
2016-10-21 17:09:28,142 | INFO | Namespace(agent_port=5051, cpu_fan_out=None, datadog_api_key=None, datadog_app_key=None, datadog_env=None, enforce_version_match=False, log_config='/app/logging_config.json', marathon_pass=None, marathon_uri='http://x.x.x.x:8080', marathon_user=None, mesos_uri='http://x.x.x.x:5050', rules_prefix='mas_rule', sleep_interval=5)  (<module>:92)
2016-10-21 17:09:28,144 | INFO | Mesos and Marathon Connections Established.  (start:157)
2016-10-21 17:09:29,147 INFO success: marathon_autoscaler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-10-21 17:09:34,543 | INFO | Stats differentials collected.  (poll:86)
2016-10-21 17:09:34,544 | INFO | Decision process beginning.  (decide:56)
2016-10-21 17:09:34,545 | INFO | Decisions are completed.  (start:165)
2016-10-21 17:09:45,836 | INFO | Stats differentials collected.  (poll:86)
2016-10-21 17:09:45,837 | INFO | Decision process beginning.  (decide:56)
2016-10-21 17:09:45,837 | INFO | Decisions are completed.  (start:165)
2016-10-21 17:09:57,089 | INFO | Stats differentials collected.  (poll:86)

However, the app never scales under load with the following labels (using small durations for testing):

  "labels": {
  "use_marathon_autoscaler": "true",
  "min_instances": "1",
  "max_instances": "10",
  "mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
  "mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"
  }

Do you see any reason why this wouldn't work?

HTTP Administration Endpoint

The autoscaler should provide an HTTP/S endpoint for administering:

  • Log viewing
  • Changing logging configuration
  • Enable/Disable scaling (all or per app, for maintenance)

Additionally, the autoscaler should provide a RESTful API to allow applications to be preemptively scaled up or down or to allow changes to an application's scaling rules without the need for updating the application's Marathon definition.

Unnecessary datadog errors

When a Mesos agent cannot be queried, the Datadog methods do not have all the data necessary to make a valid call in the send_datadog_metrics function. This is a minor issue, but it can be misleading as to what the actual problem is.

Perhaps this can be fixed by checking the data before trying to build the metric data structure?

2017-08-03 14:50:29,183 | ERROR | "'points' parameter is required" (send_datadog_metrics:81)
2017-08-03 14:50:26,180 | ERROR | HTTPConnectionPool(host='10.91.24.4', port=5051): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f9b5b1ad690>: Failed to establish a new connection: [Errno 111] Connection refused',)) (_do_request:69)

SSL support?

Hey guys,

Thanks for the great project. I'm trying to launch it in our environment, but our mesos_uri and marathon_uri are both https, so when I launch container with args I receive this:

2017-01-25 17:30:48,142 | INFO | Namespace(agent_port=5051, cpu_fan_out=None, datadog_api_key=None, datadog_app_key=None, datadog_env=None, enforce_version_match=False, log_config='/app/logging_config.json', marathon_pass=None, marathon_uri='http://marathon-dns/', marathon_user=None, mesos_uri='http://mesos_dns/', rules_prefix='mas_rule', sleep_interval=5)
2017-01-25 17:30:48,143 | INFO | Mesos and Marathon Connections Established.
2017-01-25 17:30:48,260 | ERROR | EOF occurred in violation of protocol (_ssl.c:590)
2017-01-25 17:30:48,260 | ERROR | 'NoneType' object has no attribute 'url'
2017-01-25 17:30:48,261 | CRITICAL | Marathon data could not be retrieved!
2017-01-25 17:30:48,261 | CRITICAL | Poller unable to reach Marathon/Mesos!

The same occurs if I change http to https in the args; any help is much appreciated.

"memory" should be "mem" in the rules examples

Hi, I am a new user to marathon-autoscaler in the last few days. It seems like the rules were not working for me when I used "memory" for the metric but they did work when I switched them to "mem." Thanks!

Oversubscription Detection

The autoscaler needs to be aware that it is in danger of oversubscribing the mesos/marathon cluster and to halt all scale up actions if this state is detected. Ideally, this feature should emit events for when all scaling is halted and restored. It should also be configurable to allow for custom notifications or actions on these events.

Make datadog optional

Datadog metric reporting should be optional. Please update the autoscaler to run and not report metrics to datadog if datadog api key etc.. is not defined.

'Latest' and 'fix_24' autoscalers are not working (DCOS 1.9.0 and Marathon 1.4.2)

Hello,

So, the marathon-autoscaler containers tagged latest and fix_24 were not able to identify applications with the use_marathon_autoscaler label on a DCOS 1.9.0 / Marathon 1.4.2 installation.

Unfortunately I'm having trouble retrieving some of the original logs, but on INFO it would just print this:

| INFO | Stats differentials collected.
| INFO | Decision process beginning.
| INFO | Decisions are completed.

Activating DEBUG level showed that it could communicate with the nodes just fine:

2017-06-30 19:06:48,227 | DEBUG | (u'10.5.5.101', [{u'source': u'maintenance_lb_external.17b2ce46-4176-11e7-a377-92d74b0bec98', u'executor_id': u'maintenance_lb_external.17b2ce46-4176-11e7-a377-92d74b0bec98', u'statistics': {u'cpus_nr_throttled': 0, u'timestamp': 1498849608.21988, u'cpus_throttled_time_secs': 0.0, u'cpus_user_time_secs': 1609.07, u'mem_rss_bytes': 38350848, u'mem_limit_bytes': 1107296256, u'cpus_system_time_secs': 10820.05, u'cpus_nr_periods': 0, u'cpus_limit': 1.1}, u'framework_id': u'41da1a1e-5d43-4c01-9f60-6a2d9d9e9745-0000', u'executor_name': u'Command Executor (Task: maintenance_lb_external.17b2ce46-4176-11e7-a377-92d74b0bec98) (Command: NO EXECUTABLE)'}])

However, downgrading to fix_23 fixed the issue.

2017-07-03 13:53:46,832 | INFO | Decision process beginning.
2017-07-03 13:53:46,833 | INFO | thumbor/core: metrics: {'mem': 10.408289292279411, 'cpu': 0.3179427788440855}
2017-07-03 13:53:46,833 | INFO | thumbor/core: last_triggered_rule set to: [{'ruleInfo': {'rulePart': None, 'ruleName': u'slowscaledown'}, 'ruleValue': {'scale_factor': u'-1', 'weight': 1.0, 'threshold': {'val': u'20', 'op': u'<'}, 'metric': u'cpu', 'tolerance': u'PT1M', 'backoff': u'PT1M'}}]
2017-07-03 13:53:46,833 | INFO | thumbor/core: vote: -1 ; scale_factor requested: -1
2017-07-03 13:53:46,834 | INFO | thumbor/core: application ready: True
2017-07-03 13:53:46,834 | INFO | thumbor/core: instances: min:1, running:1, max:16
2017-07-03 13:53:46,834 | INFO | thumbor_core: tolerance window filled: True / 13:52:46.834477
2017-07-03 13:53:46,835 | INFO | thumbor_core: tolerance reached: True / 13:52:46.834477 - 13:53:46.834477
2017-07-03 13:53:46,835 | INFO | thumbor_core: within backoff window: True / 13:52:46.835363 - 13:53:46.835363
2017-07-03 13:53:46,836 | INFO | Decisions are completed.

Note that the only change I did was downgrading the container.

Any ideas what could be the issue? I'm fine with using an older version but I thought you may want to look at this.

If you need any more information please let me know!

problem w/ app name that is a substring of another app name

Hi, I ran into a problem while testing marathon-autoscaler with two Marathon apps, one named "nginx" and the other "nginx-2". Only the "nginx" app was set with the marathon-autoscaler labels, and it looks like marathon-autoscaler was skipping over it because the label lookup was not working properly (I think due to using in instead of == in the app name lookups). I got around it by removing the "nginx-2" app from my Marathon instance. Thanks.

Mesos slave port needs to be configurable.

Currently the autoscaler is hard-coded to contact Mesos agents on TCP port 5050. This needs to be configurable, as the default agent port is TCP 5051 but it could be configured to use any port.

Configurable callback endpoints for scaling event notifications and actions

The marathon-autoscaler should provide a way for participating applications to set HTTP/HTTPS callbacks on the following events:

  • before scale up
  • after scale up
  • before scale down
  • after scale down

The configured endpoints should accept an HTTP POST action which will contain data about the event. Additionally, the before scale * callback configurations could be annotated with a flag to instruct the autoscaler to only perform the corresponding scaling action if the endpoint returns with an HTTP 2xx response.

Connection refused when starting

Hello,

Thank you for the project! I am having some trouble getting the container started correctly. First, it seems like parameters like DATADOG_API_KEY must be set, even though we won't be using Datadog.

Next, I'm also noticing that the script is trying to go out to my Mesos slaves on port 5050 for some reason. Here's how I started the container:

docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/docker:/var/log/docker -e MESOS_URI=http://node1:5050 -e MARATHON_URI=http://node1:8080 -e MARATHON_USER=user -e MARATHON_PASS=pass -e DATADOG_API_KEY=a -e DATADOG_APP_KEY=a -e DATADOG_ENV=a -e INTERVAL=30 myregistry/marathon_autoscaler

However, when it starts, I see calls to node2 and node3 which are my 2 slaves:

/usr/lib/python2.7/site-packages/supervisor/options.py:296: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  'Supervisord is running as root and it is searching '
2016-06-08 23:16:57,351 CRIT Supervisor running as root (no user in config file)
2016-06-08 23:16:57,351 WARN Included extra file "/etc/supervisor.d/marathon_autoscaler.ini" during parsing
2016-06-08 23:16:57,365 INFO RPC interface 'supervisor' initialized
2016-06-08 23:16:57,365 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2016-06-08 23:16:57,366 INFO supervisord started with pid 1
2016-06-08 23:16:58,370 INFO spawned: 'marathon_autoscaler' with pid 8
2016-06-08 23:16:58,622 | INFO | Namespace(cpu_fan_out=None, datadog_api_key='a', datadog_app_key='a', datadog_env='a', enforce_version_match=False, log_config='/app/logging_config.json', marathon_pass='pass', marathon_uri='http://node1:8080', marathon_user='user', mesos_uri='http://node1:5050', rules_prefix='mas_rule', sleep_interval=5)  (<module>:91)
2016-06-08 23:16:58,630 | INFO | Mesos and Marathon Connections Established.  (start:152)
2016-06-08 23:16:58,682 | ERROR | HTTPConnectionPool(host='node2', port=5050): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdf63351750>: Failed to establish a new connection: [Errno 111] Connection refused',))  (_do_request:69)
2016-06-08 23:16:58,683 | ERROR | HTTPConnectionPool(host='node3', port=5050): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdf63351690>: Failed to establish a new connection: [Errno 111] Connection refused',))  (_do_request:69)
2016-06-08 23:16:59,685 INFO success: marathon_autoscaler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-06-08 23:17:01,710 | ERROR | HTTPConnectionPool(host='node2', port=5050): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdf63350e10>: Failed to establish a new connection: [Errno 111] Connection refused',))  (_do_request:69)
2016-06-08 23:17:01,711 | ERROR | HTTPConnectionPool(host='node3', port=5050): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdf63350d90>: Failed to establish a new connection: [Errno 111] Connection refused',))  (_do_request:69)
2016-06-08 23:17:04,715 | INFO | Stats differentials collected.  (poll:81)
2016-06-08 23:17:04,716 | ERROR | "'points' parameter is required"  (send_datadog_metrics:77)
2016-06-08 23:17:04,716 | INFO | Decision process beginning.  (decide:56)
2016-06-08 23:17:04,716 | INFO | Decisions are completed.  (start:160)

Any ideas?
