tendrilinc / marathon-autoscaler

A simple autoscaler for Marathon applications

Home Page: https://hub.docker.com/r/tendril/marathon-autoscaler/

License: Apache License 2.0

Languages: Python 98.09%, Dockerfile 0.80%, Makefile 0.63%, Shell 0.48%

marathon-autoscaler's Introduction

Marathon Autoscaler

Build Status: TravisCI | Docker Images: Docker Hub

Description

The aim of this project is to allow Marathon applications to scale to meet load requirements, without user intervention. To accomplish this, it monitors Marathon's application metrics and scales applications based on user-defined thresholds.


Build and deploy the Autoscaler

The Makefile requires the REGISTRY environment variable to be set to your Docker registry.

REGISTRY=fooreg.mydockerregistry.com make

To build the app manually, the following commands build and push the Autoscaler Docker container:

Build the python zipapp:

mkdir -p build/target
python -m zipapp lib/marathon-autoscaler -o build/target/marathon-autoscaler.pyz

Build the docker image:

docker build -t marathon_autoscaler .

Push the image to your registry:

docker push {{registry_url}}/marathon_autoscaler:latest
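If the image was built without the registry prefix, as in the build step above, it may need to be retagged before pushing; a minimal sketch reusing the {{registry_url}} placeholder:

# retag the locally built image for your registry ({{registry_url}} is a placeholder)
docker tag marathon_autoscaler {{registry_url}}/marathon_autoscaler:latest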

Run the Autoscaler

Deploying the Autoscaler to Marathon

In the scripts directory, deploy_autoscaler_to_marathon.py can be executed to deploy an instance of the Autoscaler to your Marathon system. The parameters needed are explained below:

CLI switch | Environment variable | Description
--interval | INTERVAL | The time duration in seconds between polling events
--mesos-uri | MESOS_URI | The Mesos HTTP endpoint
--mesos-agent-port | AGENT_PORT | The port your Mesos agent is listening on (defaults to 5051)
--marathon-uri | MARATHON_URI | The Marathon HTTP endpoint
--marathon-user | MARATHON_USER | The Marathon username for authentication on the marathon-uri
--marathon-pass | MARATHON_PASS | The Marathon password for authentication on the marathon-uri
--cpu-fan-out | CPU_FAN_OUT | Number of subprocesses to use for gathering and sending stats to Datadog
--dd-api-key | DATADOG_API_KEY | Datadog API key
--dd-app-key | DATADOG_APP_KEY | Datadog app key
--dd-env | DATADOG_ENV | Datadog environment tag used to separate metrics by environment
--log-config | LOG_CONFIG | Path to the logging configuration file (defaults to logging_config.json)
--enforce-version-match | ENFORCE_VERSION_MATCH | If set, applications must match the autoscaler's version to participate
--rules-prefix | RULES_PREFIX | The prefix for rule names

Run the scripts/deploy_autoscaler_to_marathon.py script:

cd scripts && python deploy_autoscaler_to_marathon.py {PARAMETERS}

Deploying a Marathon application to use the Autoscaler

Participation

The autoscaler is a standalone application that monitors Marathon for applications that carry specific labels. To make your application participate, set the use_marathon_autoscaler label to a truthy value or to a version number. To enable version matching, deploy the autoscaler with the --enforce-version-match command-line switch or the ENFORCE_VERSION_MATCH environment variable.

The Autoscaler considers the following list of strings as true:

["true", "t", "yes", "y", "1"]

Minimum and Maximum Instances

The minimum and maximum number of application instances are set with the min_instances and max_instances labels:

...
"labels": {
  "min_instances": 1,
  "max_instances": 10
}
...

Scaling Rules

Scaling rules are set as labels in a Marathon application's definition. To introduce them, let's jump right into an example:

...
"labels": {
	"mas_rule_fastscaleup": "cpu | >90 | PT2M | 3 | PT1M30S"
},
...
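Before unpacking this rule, here is the general shape of a rule label value; the angle-bracketed names are descriptive placeholders, not literal syntax:

"mas_rule_<rule name>": "<metric> | <comparison><threshold> | <tolerance, ISO 8601 duration> | <scale factor> | <backoff, ISO 8601 duration>"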

Explanation: the example rule is named "fastscaleup" and states: if cpu is greater than* 90 percent for 2 minutes, then scale up by 3 instances and back off for 1 minute and 30 seconds**. These label values correspond to the original upper and lower threshold settings, but you are no longer bound to stating both cpu and memory conditions. Exclusive conditions are now expressed by defining multiple rules with the same name. Here is the same rule combined with an additional condition:

...
"labels": {
	"mas_rule_fastscaleup_1": "cpu | >90 | PT2M | 3 | PT1M30S",
	"mas_rule_fastscaleup_2": "mem | >85 | PT2M | 3 | PT1M30S"
},
...

Notice that the tolerance, scale factor, and backoff values are repeated. This is for clarity; when the autoscaler sees two or more rules with the same name, it combines them into one rule and uses the tolerance, scale factor, and backoff of the first rule it sees. In the example above, the "_1" and "_2" suffixes exist because Marathon does not allow duplicate label names. If the suffix is numeric, the autoscaler orders the rules numerically and takes the tolerance, scale factor, and backoff from the mas_rule_fastscaleup_1 rule.

To complete the example above with scale-down rules, here it is extended:

...
"labels": {
	"mas_rule_fastscaleup_1": "cpu | >90 | PT2M | 3 | PT1M30S",
	"mas_rule_fastscaleup_2": "mem | >85 | PT2M | 3 | PT1M30S",
	"mas_rule_slowscaledown_1": "cpu | <=90 | PT1M | -1 | PT30S",
	"mas_rule_slowscaledown_2": "mem | <=85 | PT1M | -1 | PT30S"
},
...

Let's explore some other ideas... Maybe your application is only interested in scaling based on CPU:

...
"labels": {
	"mas_rule_fastscaleup": "cpu | >90 | PT2M | 3 | PT1M30S",
	"mas_rule_slowscaledown": "cpu | <=90 | PT1M | -1 | PT30S",
},
...

Perhaps you want your application to scale up and down differently for different conditions:

...
"labels": {
	"mas_rule_slowscaleup": "cpu | >40 | PT2M | 1 | PT1M30S",
	"mas_rule_fastscaleup": "cpu | >60 | PT1M | 3 | PT30S",
	"mas_rule_hyperscaleup": "cpu | >90 | PT1M | 5 | PT15S",
	"mas_rule_slowscaledown": "cpu | <90 | PT1M30S | -1 | PT30S",
	"mas_rule_fastscaledown": "cpu | <10 | PT3M | -5 | PT30S",
},
...

When multiple rules reference the same metric, the autoscaler should take the action of the rule whose threshold and tolerance most closely match the observed metric. Depending on your application's behavior, some rules may never trigger.
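For example, if CPU in the multi-rule example above stays at 95 percent long enough to fill every rule's tolerance window, the >90 hyperscaleup rule is the closest match and, assuming the closest-match behavior described here, would be the one acted on.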

* Comparisons can use >, <, <=, >=, = or ==

** See the Wikipedia reference on the ISO 8601 time duration format.

Testing the autoscaler with the Stress Tester app

To see how the Autoscaler behaves with an application's scaling settings in a controlled environment, build and deploy the stress test application to an environment running the Autoscaler.

cd tests/stress_tester_app && docker build -t autoscale_test_app .

Push the image to the registry:

docker push autoscale_test_app:latest
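If your registry requires a prefixed image name, a minimal sketch reusing the {{registry_url}} placeholder from the build section:

# retag and push the stress test image with the registry prefix ({{registry_url}} is a placeholder)
docker tag autoscale_test_app {{registry_url}}/autoscale_test_app:latest
docker push {{registry_url}}/autoscale_test_app:latest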

Run the scripts/test_autoscaler.py script:

cd scripts && python test_autoscaler.py --marathon-uri MARATHON_HTTP --marathon-user MARATHON_USER --marathon-pass MARATHON_PASS

marathon-autoscaler's People

Contributors

alexandernilsson, davidxarnold, eadderley, jangie, jmparra, johnjeffers, kernelpanek-segfault, uplightsag


marathon-autoscaler's Issues

Support disk and network I/O metrics

For Mesos configurations that enable the appropriate isolators, the marathon-autoscaler should honor the use of thresholds that pertain to those disk and network I/O metrics.

Not working with DCOS

Hi,

I am trying out the autoscaler on DCOS. The autoscaler is up and running and can fetch data from mesos and marathon as expected. However, it is not scaling the applications.

I tried to add logging around the autoscaler to see what might be wrong. Apparently the application definition always turns out to be an empty {}, so the is_app_participating function is always passed an empty dict:
app_def = ApplicationDefinition(metrics_summary.get("application_ returns null

My application's marathon JSON (condensed) is as follows:
{ "id": "/spark3", "backoffFactor": 1.15, "backoffSeconds": 1, "container": { "type": "DOCKER", "volumes": [], "docker": { "image": "mesosphere/spark:2.0.1-2.2.0-1-hadoop-2.6", "forcePullImage": true, "privileged": false, "parameters": [ { "key": "user", "value": "root" } ] } }, "cpus": 1, "disk": 0, "instances": 1, "labels": { "mas_rule_fastscaleup": "cpu | >50 | PT1M | 2 | PT1M30S", "min_instances": "1", "max_instances": "10", "use_marathon_autoscaler": "yes" } }

Is there something wrong here? Can you please point me in the right direction?

Thanks for the help.

Instructions to Deploy marathon_autoscaler don't seem to be working

Tried the command suggested:
cd scripts && python deploy_autoscaler_to_marathon.py {PARAMETERS}

The instructions are missing the 's' in scripts.

and I get this error:

Traceback (most recent call last):
File "deploy_autoscaler_to_marathon.py", line 15, in
from marathon_autoscaler.marathon import Marathon
ImportError: No module named marathon_autoscaler.marathon

Better integration with TravisCI

There is so much more we can do with TravisCI. In addition to running the unit tests, the following is a list of aspirations for a better integration:

  1. Install minimesos
  2. Configure and start multiple minimesos environments
  3. Build the marathon-autoscaler docker image
  4. Deploy the marathon-autoscaler container to marathon
  5. Deploy the stress testing image to marathon
  6. Test the conditions of running the marathon-autoscaler with the stress test container/app

Scaler not working

Hi,

Thanks for the project - and thanks for the many enhancements recently. I'm having trouble getting the autoscaler working. I just built the docker image and ran it with:

docker run -e MESOS_URI=http://x.x.x.x:5050 -e MARATHON_URI=http://x.x.x.x:8080 -e AGENT_PORT=5051 marathon-autoscaler

From the startup logs, it looks like everything is running ok:

2016-10-21 17:09:26,543 INFO supervisord started with pid 1
2016-10-21 17:09:27,547 INFO spawned: 'marathon_autoscaler' with pid 7
2016-10-21 17:09:28,142 | INFO | Namespace(agent_port=5051, cpu_fan_out=None, datadog_api_key=None, datadog_app_key=None, datadog_env=None, enforce_version_match=False, log_config='/app/logging_config.json', marathon_pass=None, marathon_uri='http://x.x.x.x:8080', marathon_user=None, mesos_uri='http://x.x.x.x:5050', rules_prefix='mas_rule', sleep_interval=5)  (<module>:92)
2016-10-21 17:09:28,144 | INFO | Mesos and Marathon Connections Established.  (start:157)
2016-10-21 17:09:29,147 INFO success: marathon_autoscaler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-10-21 17:09:34,543 | INFO | Stats differentials collected.  (poll:86)
2016-10-21 17:09:34,544 | INFO | Decision process beginning.  (decide:56)
2016-10-21 17:09:34,545 | INFO | Decisions are completed.  (start:165)
2016-10-21 17:09:45,836 | INFO | Stats differentials collected.  (poll:86)
2016-10-21 17:09:45,837 | INFO | Decision process beginning.  (decide:56)
2016-10-21 17:09:45,837 | INFO | Decisions are completed.  (start:165)
2016-10-21 17:09:57,089 | INFO | Stats differentials collected.  (poll:86)

However, the app never scales under load with the following labels (using small durations for testing):

  "labels": {
  "use_marathon_autoscaler": "true",
  "min_instances": "1",
  "max_instances": "10",
  "mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
  "mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"
  }

Do you see any reason why this wouldn't work?

HTTP Administration Endpoint

The autoscaler should provide an HTTP/S endpoint for administering:

  • Log viewing
  • Changing logging configuration
  • Enable/Disable scaling (all or per app, for maintenance)

Additionally, the autoscaler should provide a RESTful API to allow applications to be preemptively scaled up or down or to allow changes to an application's scaling rules without the need for updating the application's Marathon definition.

Unnecessary datadog errors

When a Mesos agent cannot be queried, the Datadog methods do not have all the data necessary to make a valid call in the send_datadog_metrics function. This is a minor issue, but it can be misleading as to what the actual problem is.

Perhaps this can be fixed by checking the data before trying to build the metric data structure?

2017-08-03 14:50:29,183 | ERROR | "'points' parameter is required" (send_datadog_metrics:81)
2017-08-03 14:50:26,180 | ERROR | HTTPConnectionPool(host='10.91.24.4', port=5051): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f9b5b1ad690>: Failed to establish a new connection: [Errno 111] Connection refused',)) (_do_request:69)

SSL support?

Hey guys,

Thanks for the great project. I'm trying to launch it in our environment, but our mesos_uri and marathon_uri are both https, so when I launch container with args I receive this:

2017-01-25 17:30:48,142 | INFO | Namespace(agent_port=5051, cpu_fan_out=None, datadog_api_key=None, datadog_app_key=None, datadog_env=None, enforce_version_match=False, log_config='/app/logging_config.json', marathon_pass=None, marathon_uri='http://marathon-dns/', marathon_user=None, mesos_uri='http://mesos_dns/', rules_prefix='mas_rule', sleep_interval=5)
2017-01-25 17:30:48,143 | INFO | Mesos and Marathon Connections Established.
2017-01-25 17:30:48,260 | ERROR | EOF occurred in violation of protocol (_ssl.c:590)
2017-01-25 17:30:48,260 | ERROR | 'NoneType' object has no attribute 'url'
2017-01-25 17:30:48,261 | CRITICAL | Marathon data could not be retrieved!
2017-01-25 17:30:48,261 | CRITICAL | Poller unable to reach Marathon/Mesos!

The same occurs if I change http to https in the args; any help is much appreciated.

"memory" should be "mem" in the rules examples

Hi, I am a new user to marathon-autoscaler in the last few days. It seems like the rules were not working for me when I used "memory" for the metric but they did work when I switched them to "mem." Thanks!

Oversubscription Detection

The autoscaler needs to be aware that it is in danger of oversubscribing the mesos/marathon cluster and to halt all scale up actions if this state is detected. Ideally, this feature should emit events for when all scaling is halted and restored. It should also be configurable to allow for custom notifications or actions on these events.

Make datadog optional

Datadog metric reporting should be optional. Please update the autoscaler to run and not report metrics to datadog if datadog api key etc.. is not defined.

'Latest' and 'fix_24' autoscalers are not working (DCOS 1.9.0 and Marathon 1.4.2)

Hello,

So, the marathon-autoscaler containers tagged latest and fix_24 were not able to identify applications with the use_marathon_autoscaler label on a DCOS 1.9.0 / Marathon 1.4.2 installation.

Unfortunately I'm having trouble retrieving some of the original logs, but on INFO it would just print this:

| INFO | Stats differentials collected.
| INFO | Decision process beginning.
| INFO | Decisions are completed.

Activating DEBUG level showed that it could communicate with the nodes just fine:

2017-06-30 19:06:48,227 | DEBUG | (u'10.5.5.101', [{u'source': u'maintenance_lb_external.17b2ce46-4176-11e7-a377-92d74b0bec98', u'executor_id': u'maintenance_lb_external.17b2ce46-4176-11e7-a377-92d74b0bec98', u'statistics': {u'cpus_nr_throttled': 0, u'timestamp': 1498849608.21988, u'cpus_throttled_time_secs': 0.0, u'cpus_user_time_secs': 1609.07, u'mem_rss_bytes': 38350848, u'mem_limit_bytes': 1107296256, u'cpus_system_time_secs': 10820.05, u'cpus_nr_periods': 0, u'cpus_limit': 1.1}, u'framework_id': u'41da1a1e-5d43-4c01-9f60-6a2d9d9e9745-0000', u'executor_name': u'Command Executor (Task: maintenance_lb_external.17b2ce46-4176-11e7-a377-92d74b0bec98) (Command: NO EXECUTABLE)'}])

However, downgrading to fix_23 fixed the issue.

2017-07-03 13:53:46,832 | INFO | Decision process beginning.
2017-07-03 13:53:46,833 | INFO | thumbor/core: metrics: {'mem': 10.408289292279411, 'cpu': 0.3179427788440855}
2017-07-03 13:53:46,833 | INFO | thumbor/core: last_triggered_rule set to: [{'ruleInfo': {'rulePart': None, 'ruleName': u'slowscaledown'}, 'ruleValue': {'scale_factor': u'-1', 'weight': 1.0, 'threshold': {'val': u'20', 'op': u'<'}, 'metric': u'cpu', 'tolerance': u'PT1M', 'backoff': u'PT1M'}}]
2017-07-03 13:53:46,833 | INFO | thumbor/core: vote: -1 ; scale_factor requested: -1
2017-07-03 13:53:46,834 | INFO | thumbor/core: application ready: True
2017-07-03 13:53:46,834 | INFO | thumbor/core: instances: min:1, running:1, max:16
2017-07-03 13:53:46,834 | INFO | thumbor_core: tolerance window filled: True / 13:52:46.834477
2017-07-03 13:53:46,835 | INFO | thumbor_core: tolerance reached: True / 13:52:46.834477 - 13:53:46.834477
2017-07-03 13:53:46,835 | INFO | thumbor_core: within backoff window: True / 13:52:46.835363 - 13:53:46.835363
2017-07-03 13:53:46,836 | INFO | Decisions are completed.

Note that the only change I did was downgrading the container.

Any ideas what could be the issue? I'm fine with using an older version but I thought you may want to look at this.

If you need any more information please let me know!

problem w/ app name that is a substring of another app name

Hi, I ran into a problem while testing marathon-autoscaler with two Marathon apps, one named "nginx" and the other "nginx-2". Only the "nginx" app was set with the marathon-autoscaler labels, and it looks like marathon-autoscaler was skipping over it because the label lookup was not working properly (I think due to using in instead of == in the app name lookups). I got around it by removing the "nginx-2" app from my Marathon instance. Thanks.

Mesos slave port needs to be configurable.

Currently the autoscaler is hard-coded to contact Mesos agents on TCP port 5050. This needs to be configurable, as the default agent port is TCP 5051 but it could be configured to use any port.

Configurable callback endpoints for scaling event notifications and actions

The marathon-autoscaler should provide a way for participating applications to set HTTP/HTTPS callbacks on the following events:

  • before scale up
  • after scale up
  • before scale down
  • after scale down

The configured endpoints should accept an HTTP POST action which will contain data about the event. Additionally, the before scale * callback configurations could be annotated with a flag to instruct the autoscaler to only perform the corresponding scaling action if the endpoint returns with an HTTP 2xx response.

Connection refused when starting

Hello,

Thank you for the project! I am having some trouble getting the container started correctly. First, it seems like parameters like DATADOG_API_KEY must be set, even though we won't be using Datadog.

Next, I'm also noticing that the script is trying to go out to my Mesos slaves on port 5050 for some reason. Here's how I started the container:

docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/docker:/var/log/docker -e MESOS_URI=http://node1:5050 -e MARATHON_URI=http://node1:8080 -e MARATHON_USER=user -e MARATHON_PASS=pass -e DATADOG_API_KEY=a -e DATADOG_APP_KEY=a -e DATADOG_ENV=a -e INTERVAL=30 myregistry/marathon_autoscaler

However, when it starts, I see calls to node2 and node3 which are my 2 slaves:

/usr/lib/python2.7/site-packages/supervisor/options.py:296: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  'Supervisord is running as root and it is searching '
2016-06-08 23:16:57,351 CRIT Supervisor running as root (no user in config file)
2016-06-08 23:16:57,351 WARN Included extra file "/etc/supervisor.d/marathon_autoscaler.ini" during parsing
2016-06-08 23:16:57,365 INFO RPC interface 'supervisor' initialized
2016-06-08 23:16:57,365 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2016-06-08 23:16:57,366 INFO supervisord started with pid 1
2016-06-08 23:16:58,370 INFO spawned: 'marathon_autoscaler' with pid 8
2016-06-08 23:16:58,622 | INFO | Namespace(cpu_fan_out=None, datadog_api_key='a', datadog_app_key='a', datadog_env='a', enforce_version_match=False, log_config='/app/logging_config.json', marathon_pass='pass', marathon_uri='http://node1:8080', marathon_user='user', mesos_uri='http://node1:5050', rules_prefix='mas_rule', sleep_interval=5)  (<module>:91)
2016-06-08 23:16:58,630 | INFO | Mesos and Marathon Connections Established.  (start:152)
2016-06-08 23:16:58,682 | ERROR | HTTPConnectionPool(host='node2', port=5050): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdf63351750>: Failed to establish a new connection: [Errno 111] Connection refused',))  (_do_request:69)
2016-06-08 23:16:58,683 | ERROR | HTTPConnectionPool(host='node3', port=5050): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdf63351690>: Failed to establish a new connection: [Errno 111] Connection refused',))  (_do_request:69)
2016-06-08 23:16:59,685 INFO success: marathon_autoscaler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-06-08 23:17:01,710 | ERROR | HTTPConnectionPool(host='node2', port=5050): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdf63350e10>: Failed to establish a new connection: [Errno 111] Connection refused',))  (_do_request:69)
2016-06-08 23:17:01,711 | ERROR | HTTPConnectionPool(host='node3', port=5050): Max retries exceeded with url: /monitor/statistics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdf63350d90>: Failed to establish a new connection: [Errno 111] Connection refused',))  (_do_request:69)
2016-06-08 23:17:04,715 | INFO | Stats differentials collected.  (poll:81)
2016-06-08 23:17:04,716 | ERROR | "'points' parameter is required"  (send_datadog_metrics:77)
2016-06-08 23:17:04,716 | INFO | Decision process beginning.  (decide:56)
2016-06-08 23:17:04,716 | INFO | Decisions are completed.  (start:160)

Any ideas?
