zalando-zmon / zmon-aws-agent
AWS API crawler to auto discover running services in your account
License: Apache License 2.0
From zmon_cli
As we are getting pull requests, we should add automated tests on Travis CI + flake8 checking.
Required to support zalando-stups/senza#310
It could be useful to return the Spot instance price (maybe as a separate entity). This would help with alerting when spot instances are being used.
Our team is using AWS SQS (Simple Queue Service) a lot. For monitoring we were relying on manually created entities, but this is getting tedious and error-prone.
The ZMON AWS Agent should automatically pick up all SQS queues and create an entity for each of them.
The idea would be to invoke list-queues and then, for each entry in the array of queue URLs, invoke get_queue_attributes to retrieve the settings.
id: "sqs-<name-of-the-queue>[aws:111122223333]"
type: "aws_sqs"
name: "<name-of-the-queue>"
infrastructure_account: "aws:111122223333"
region: "eu-central-1"
url: "https://sqs.eu-central-1.amazonaws.com/111122223333/<name-of-the-queue>"
arn: "arn:aws:sqs:eu-central-1:111122223333:<name-of-the-queue>"
message_retention_period_seconds: 1209600
maximum_message_size_bytes: 262144
receive_messages_wait_time_seconds: 10
delay_seconds: 0
visibility_timeout_seconds: 30
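The discovery steps described above can be sketched roughly as follows. This is only an illustration, not the agent's implementation: the function name `get_sqs_entities` is hypothetical, and `sqs_client` is assumed to behave like `boto3.client('sqs')`. The attribute names (`MessageRetentionPeriod` and so on) are the ones SQS returns from get_queue_attributes.

```python
def get_sqs_entities(sqs_client, region, account):
    """Build one entity dict per SQS queue (hypothetical helper).

    sqs_client is assumed to behave like boto3.client('sqs', region_name=region);
    account is the infrastructure account string, e.g. 'aws:111122223333'.
    """
    def as_int(attrs, key):
        # SQS returns attribute values as strings; missing values stay None.
        value = attrs.get(key)
        return int(value) if value is not None else None

    entities = []
    # list_queues omits the 'QueueUrls' key entirely when no queue exists.
    response = sqs_client.list_queues()
    for url in response.get('QueueUrls', []):
        name = url.rsplit('/', 1)[-1]
        attrs = sqs_client.get_queue_attributes(
            QueueUrl=url, AttributeNames=['All'])['Attributes']
        entities.append({
            'id': 'sqs-{}[{}]'.format(name, account),
            'type': 'aws_sqs',
            'name': name,
            'infrastructure_account': account,
            'region': region,
            'url': url,
            'arn': attrs.get('QueueArn'),
            'message_retention_period_seconds': as_int(attrs, 'MessageRetentionPeriod'),
            'maximum_message_size_bytes': as_int(attrs, 'MaximumMessageSize'),
            'receive_messages_wait_time_seconds': as_int(attrs, 'ReceiveMessageWaitTimeSeconds'),
            'delay_seconds': as_int(attrs, 'DelaySeconds'),
            'visibility_timeout_seconds': as_int(attrs, 'VisibilityTimeout'),
        })
    return entities
```

Note that a single list_queues call returns a limited number of queues, so accounts with very many queues would need paging on top of this sketch.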
Some checks/alerts could use these values to raise alerts/notifications once the resource count approaches the set limits.
It is not clear yet how each AWS resource/service exposes those limits, but an example would be similar to https://github.com/jantman/awslimitchecker
From time to time we see the following exception occurring:
botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the DescribeInstances operation (reached max retries: 4): Request limit exceeded.
This can result in Auto Scaling groups added with no instances, for example.
We should handle this the same way we handle the Throttling error response from the AWS API.
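One way to treat both error codes alike is a small retry wrapper with exponential backoff. The sketch below is an assumption about how this could look, not the agent's existing retry code: `retry_on_throttle`, `RETRYABLE_CODES`, and the default delays are hypothetical. It relies only on the fact that botocore's ClientError carries the error code in `e.response['Error']['Code']`.

```python
import time

# 'Throttling' is the code already handled according to the text above;
# 'RequestLimitExceeded' is the proposed addition.
RETRYABLE_CODES = {'Throttling', 'RequestLimitExceeded'}


def retry_on_throttle(fn, *args, max_retries=5, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponential backoff on throttling-style errors."""
    attempt = 0
    while True:
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            # botocore.exceptions.ClientError exposes the parsed error here.
            response = getattr(e, 'response', None) or {}
            code = response.get('Error', {}).get('Code')
            if code not in RETRYABLE_CODES or attempt >= max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

A call such as `retry_on_throttle(ec2_client.describe_instances)` would then wait 1, 2, 4, ... seconds between attempts, and re-raise once the retries are exhausted or the error is of another kind.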
The server certificate name is not unique enough to identify the certificates returned from AWS.
To easily monitor certificate expiry dates, we should collect all IAM and ACM certificates as entities.
In some cases it would be useful to know the creation time of the Auto Scaling group. Currently the entities have only a last_modified date, but that changes, e.g. when scaling happens. An example use case for having the creation time of ASGs would be detecting old, obsolete stacks.
When running aws autoscaling describe-auto-scaling-groups, the Auto Scaling group has a property like "CreatedTime": "2018-06-08T11:31:09.480Z" which could be used for this purpose.
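A rough sketch of how the creation time could be read through the API is shown below. The helper name `asg_created_times` is hypothetical, and `autoscaling_client` is assumed to behave like `boto3.client('autoscaling')`, which returns `CreatedTime` as a datetime object. Only a single response page is read here.

```python
def asg_created_times(autoscaling_client):
    """Map each Auto Scaling group name to its creation time as an ISO string."""
    response = autoscaling_client.describe_auto_scaling_groups()
    created = {}
    for group in response['AutoScalingGroups']:
        created_time = group.get('CreatedTime')
        # Convert to a string so the value can be stored in a JSON entity.
        created[group['AutoScalingGroupName']] = (
            created_time.isoformat() if created_time else None)
    return created
```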
We should probably use the Name tag instead of the instance ID as the first part of the entity ID:
https://github.com/zalando-zmon/zmon-aws-agent/blob/master/zmon_aws_agent/aws.py#L247
We need to support Alias records for DNS traffic switching, i.e. we need to adapt the following code part to not only support CNAME records:
https://github.com/zalando-zmon/zmon-aws-agent/blob/master/zmon_aws_agent/aws.py#L53
Background: Senza switches from CNAME to Alias records (zalando-stups/senza#197).
The local entity has a field for the error count. We can use this field (or a similar one) to provide extra information for error reporting (recovered or not). This gives teams the ability to investigate issues as they occur, so that they do not go unnoticed.
The URL includes HTTPS for non-HTTPS listeners.
The scalyr_ts_id is not needed anymore and should be removed.
The commit ba70e59 introduced a fatal error: since this commit, ELB discovery is not working and the agent is completely broken for us:
Traceback (most recent call last):
File "/zmon-agent.py", line 522, in <module>
main()
File "/zmon-agent.py", line 405, in main
elbs = get_running_elbs(region, infrastructure_account)
File "/zmon-agent.py", line 178, in get_running_elbs
tag_desc = elb_client.describe_tags(LoadBalancerNames=elb_names)
File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 310, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 382, in _make_api_call
api_params, operation_model)
File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 423, in _convert_to_request_dict
api_params, operation_model)
File "/usr/local/lib/python2.7/dist-packages/botocore/validate.py", line 273, in serialize_to_request
raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid length for parameter LoadBalancerNames, value: 62, valid range: 1-20
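The validation error says that at most 20 load balancer names may be passed per describe_tags call, so one possible fix is to split the names into batches. The sketch below assumes `elb_client` behaves like `boto3.client('elb')`; the helper names `chunks` and `describe_elb_tags` are made up for illustration.

```python
def chunks(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def describe_elb_tags(elb_client, elb_names, batch_size=20):
    """Collect tag descriptions for all ELBs, 20 names per API call."""
    tag_descriptions = []
    for batch in chunks(elb_names, batch_size):
        response = elb_client.describe_tags(LoadBalancerNames=batch)
        tag_descriptions.extend(response['TagDescriptions'])
    return tag_descriptions
```

With the 62 names from the traceback this would issue four calls (20, 20, 20, and 2 names).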
As we no longer enforce new application versions, entity ids become hard to read.
I don't even know why this was ever done differently :D (blaming myself)
Events per instance do not yield anything interesting so far, so strip them.
Don't query for active members of ELB/ASG. Wrap this call in a check command, so it is only done if it matters.
Right now we skip adding weights if we have multiple DNS_ZONE
zmon-aws-agent/zmon_aws_agent/aws.py
Line 105 in fc59e75
As a ZMON grafana dashboard user, I would like to see the queryable stack name and version templating option. This would help users in parameterising their dashboards and monitor applications seamlessly for different deployment versions.
This would require the zmon aws agent to discover the stack names and versions as entities and store them in KairosDB.
I create a check using Trial Run and use this entity filter:
[
{
"type": "asg",
"infrastructure_account": "aws:1234567890123"
}
]
When I run the check, it returns 50 results, although there are 88 ASGs in the account. If I remove the infrastructure_account constraint, I get more than 1000 results (all ASGs from all the accounts).
I expect to get all matching ASGs when I use the infrastructure_account filter, not only the first 50.
One entity was reported to have ec2-used-instances > ec2-max-instances.
This might be due to different types or spot instances.
We can accept an extra set of fields that will be added to all discovered entities.
Example:
EXTRA_ENTITY_FIELDS="zmon_environment:staging"
Those fields will affect how we sync the entities.
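Parsing such a setting could look roughly like the sketch below. The issue only shows a single `key:value` pair, so the comma as separator between multiple pairs is purely an assumption, as is the function name `parse_extra_fields`.

```python
def parse_extra_fields(raw):
    """Parse 'key:value' pairs into a dict.

    Assumption: multiple pairs are separated by commas. Only the first ':'
    splits key from value, so values such as 'aws:123' stay intact.
    """
    fields = {}
    for pair in raw.split(','):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition(':')
        if not sep or not key.strip():
            raise ValueError('Expected key:value, got {!r}'.format(pair))
        fields[key.strip()] = value.strip()
    return fields
```

The resulting dict could then be merged into every discovered entity, for example with `entity.update(extra_fields)`.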
json.dumps with sort_keys fails: TypeError: unorderable types: int() < str()
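This error appears when a dict mixes int and str keys, because sort_keys has to compare them. A possible workaround, sketched here under the assumption that converting all keys to strings is acceptable for the entities in question, is to normalize the keys before dumping. The helper name `stringify_keys` is hypothetical.

```python
import json


def stringify_keys(value):
    """Recursively convert all dict keys to str so they can be sorted."""
    if isinstance(value, dict):
        return {str(k): stringify_keys(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [stringify_keys(v) for v in value]
    return value


mixed = {1: 'a', 'b': {2: 'c', 'd': 'e'}}
print(json.dumps(stringify_keys(mixed), sort_keys=True))
```

json.dumps writes int keys as strings anyway, so the serialized keys look the same as without the conversion; note however that keys such as 1 and '1' would collapse into one entry.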
Elastigroups are alternatives to Auto Scaling Groups which are typically used to manage Spot fleets.
The agent should also discover Elastigroups and provide the same level of functionality as the native Auto Scaling Groups.
Any delay/update of the tags will be missed. Introduced by #95
In certain cases API calls to AWS can fail and we then set a default value. It could make more sense to keep the old entity value instead of a default.
https://github.com/zalando-zmon/zmon-aws-agent/blob/master/zmon_aws_agent/aws.py#L744
https://github.com/zalando-zmon/zmon-aws-agent/blob/master/zmon_aws_agent/aws.py#L762
Zmon reported: zmon aws agent is not working.
Instance was terminated several times with no luck, mentioned instance ids do not exist.
Jan 24 13:41:17 docker/bb23870f7a52[837]: Traceback (most recent call last):
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/bin/zmon-aws-agent", line 9, in <module>
Jan 24 13:41:17 docker/bb23870f7a52[837]: load_entry_point('zmon-aws-agent==0.2', 'console_scripts', 'zmon-aws-agent')()
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/main.py", line 133, in main
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/aws.py", line 476, in get_auto_scaling_groups
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/common.py", line 29, in call_and_retry
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/aws.py", line 476, in <lambda>
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/paginate.py", line 291, in build_full_result
Jan 24 13:41:17 docker/bb23870f7a52[837]: for response in self:
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/paginate.py", line 102, in __iter__
Jan 24 13:41:17 docker/bb23870f7a52[837]: response = self._make_request(current_kwargs)
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/paginate.py", line 174, in _make_request
Jan 24 13:41:17 docker/bb23870f7a52[837]: return self._method(**current_kwargs)
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/client.py", line 251, in _api_call
Jan 24 13:41:17 docker/bb23870f7a52[837]: return self._make_api_call(operation_name, kwargs)
Jan 24 13:41:17 docker/bb23870f7a52[837]: File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/client.py", line 537, in _make_api_call
Jan 24 13:41:17 docker/bb23870f7a52[837]: raise ClientError(parsed_response, operation_name)
Jan 24 13:41:17 docker/bb23870f7a52[837]: botocore.exceptions.ClientError: An error occurred (InvalidInstanceID.NotFound) when calling the DescribeInstances operation: The instance IDs 'i-0160f4e89e5397511, i-0f612c91029abfaa9' do not exist
There is some hacky code to handle stups-cassandra nodes which should be removed
If we use the Health API we can show open and upcoming issues and maintenance events.
benefits:
$ aws --region us-east-1 health describe-events --filter "eventStatusCodes=open,upcoming"
{
"events": [
{
"lastUpdatedTime": 1486951398.0,
"eventTypeCategory": "scheduledChange",
"arn": "arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764",
"eventTypeCode": "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED",
"startTime": 1488160800.0,
"region": "eu-west-1",
"statusCode": "upcoming",
"service": "EC2",
"endTime": 1488160800.0
}
]
}
$ aws --region us-east-1 health describe-event-details --event-arns \
arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764
{
"failedSet": [],
"successfulSet": [
{
"event": {
"eventTypeCode": "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED",
"startTime": 1488160800.0,
"service": "EC2",
"arn": "arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764",
"statusCode": "upcoming",
"eventTypeCategory": "scheduledChange",
"lastUpdatedTime": 1486951398.0,
"region": "eu-west-1",
"endTime": 1488160800.0
},
"eventDescription": {
"latestDescription": "EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance associated with this event in the eu-west-1 region. Due to this degradation, your instance could already be unreachable. After 2017-02-27 02:00 UTC your instance, which has an EBS volume as the root device, will be stopped.\n\nYou can see more information on your instances that are scheduled for retirement in the AWS Management Console (https://console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Events)\n\n* How does this affect you?\nYour instance's root device is an EBS volume and the instance will be stopped after the specified retirement date. You can start it again at any time. Note that if you have EC2 instance store volumes attached to the instance, any data on these volumes will be lost when the instance is stopped or terminated as these volumes are physically attached to the host computer\n\n* What do you need to do?\nYou may still be able to access the instance. We recommend that you replace the instance by creating an AMI of your instance and launch a new instance from the AMI. For more information please see Amazon Machine Images (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html) in the EC2 User Guide. In case of difficulties stopping your EBS-backed instance, please see the Instance FAQ (http://aws.amazon.com/instance-help/#ebs-stuck-stopping).\n\n* Why retirement?\nAWS may schedule instances for retirement in cases where there is an unrecoverable issue with the underlying hardware. For more information about scheduled retirement events please see the EC2 user guide (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-retirement.html). 
To avoid single points of failure within critical applications, please refer to our architecture center for more information on implementing fault-tolerant architectures: http://aws.amazon.com/architecture\n\nIf you have any questions or concerns, you can contact the AWS Support Team on the community forums and via AWS Premium Support at: http://aws.amazon.com/support"
}
}
]
}
$ aws --region us-east-1 health describe-affected-entities --filter \
"eventArns=arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764"
{
"entities": [
{
"eventArn": "arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764",
"lastUpdatedTime": 1486951398.0,
"entityValue": "i-13e701d3",
"entityArn": "arn:aws:health:eu-west-1:786011980701:entity/AVo1Nb-EZUIH_-T6pbaj",
"awsAccountId": "786011980701"
}
]
}
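The CLI calls above map onto the describe_events and describe_affected_entities operations of the Health API. A minimal sketch of combining them is given below; `health_client` is assumed to behave like `boto3.client('health', region_name='us-east-1')`, and the function name and the shape of the returned dicts are hypothetical. Paging of results is left out.

```python
def get_health_issues(health_client):
    """Return open and upcoming AWS Health events with their affected entities."""
    events = health_client.describe_events(
        filter={'eventStatusCodes': ['open', 'upcoming']})['events']
    issues = []
    for event in events:
        affected = health_client.describe_affected_entities(
            filter={'eventArns': [event['arn']]})['entities']
        issues.append({
            'arn': event['arn'],
            'service': event.get('service'),
            'status': event.get('statusCode'),
            'event_type': event.get('eventTypeCode'),
            # entityValue holds the resource identifier, e.g. an instance ID.
            'affected': [e['entityValue'] for e in affected],
        })
    return issues
```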
When any operation fails (especially adding/removing entities), further inspection of the entity properties might be needed. Logging should provide us with such information.
The generated entity IDs are not sanitized, i.e. invalid characters in application_id or application_version might break the entity IDs of type "instance".
This is a serious bug as the agent essentially stops working (e.g. throws error on every entity DELETE).
[
{
"dns_traffic": "true",
"infrastructure_account": "aws:ACCOUNT_ID",
"type": "elb"
}
]
dns_traffic
, it we get the application elb info."{{StackName}}-{{Region}}.{{TeamID}}
Example log with stack trace:
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: INFO:zmon-aws-agent:Adding new instance entity with ID: xxx-test-cd95-xyz[aws:NNN:eu-central-1]
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: ERROR:zmon_cli.client:ZMON client failed in: add_entity
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: ERROR:zmon-aws-agent:Failed to add entity: {'instance_type': 'm4.large', 'events': [{'Code': 'instance-stop', 'Description': 'The instance is running on degraded hardware', 'NotBefore': datetime.datetime(2017, 3, 17, 13, 0, tzinfo=tzlocal())}], 'state_reason': '', 'name': 'xxx-test', 'application_version': 'cd95', 'id': 'xxx-test-cd95-xyz[aws:NNN:eu-central-1]', 'runtime': 'Docker', 'ports': {'9042': '9042', '7001': '7001'}, 'aws_id': 'i-12345', 'created_by': 'agent', 'region': 'eu-central-1', 'block_devices': {'/dev/xvdf': {'attach_time': '2017-02-22 10:33:53+00:00', 'volume_type': 'ebs', 'volume_id': 'vol-54321'}, '/dev/sda1': {'attach_time': '2017-02-22 10:33:17+00:00', 'volume_type': 'ebs', 'volume_id': 'vol-09876'}}, 'host': '172.31.xx.yy', 'type': 'instance', 'spot_instance': False, 'application_id': 'xxx-test', 'ip': '172.31.xx.yy', 'source_base': 'registry.opensource.zalan.do/stups/planb-cassandra-2', 'source': 'registry.opensource.zalan.do/stups/planb-cassandra-2:cd-67', 'infrastructure_account': 'aws:NNN'}
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: Traceback (most recent call last):
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/main.py", line 74, in add_new_entities
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: zmon_client.add_entity(entity)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/local/lib/python3.5/dist-packages/zmon_cli-1.1.46-py3.5.egg/zmon_cli/client.py", line 68, in wrapper
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: return f(*args, **kwargs)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/local/lib/python3.5/dist-packages/zmon_cli-1.1.46-py3.5.egg/zmon_cli/client.py", line 330, in add_entity
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: resp = self.session.put(self.endpoint(ENTITIES, trailing_slash=False), json=entity)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 533, in put
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: return self.request('PUT', url, data=data, **kwargs)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 461, in request
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: prep = self.prepare_request(req)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 394, in prepare_request
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: hooks=merge_hooks(request.hooks, self.hooks),
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 297, in prepare
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: self.prepare_body(data, files, json)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 428, in prepare_body
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: body = complexjson.dumps(json)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/lib/python3.5/json/__init__.py", line 230, in dumps
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: return _default_encoder.encode(obj)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/lib/python3.5/json/encoder.py", line 198, in encode
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: chunks = self.iterencode(o, _one_shot=True)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/lib/python3.5/json/encoder.py", line 256, in iterencode
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: return _iterencode(o, 0)
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: File "/usr/lib/python3.5/json/encoder.py", line 179, in default
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: raise TypeError(repr(o) + " is not JSON serializable")
Mar 6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: TypeError: datetime.datetime(2017, 3, 17, 13, 0, tzinfo=tzlocal()) is not JSON serializable
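The trace shows the 'NotBefore' datetime inside 'events' reaching the JSON encoder unchanged. One possible approach, sketched below with hypothetical helper names, is to convert the entity into a JSON-safe structure before handing it to the ZMON client, turning datetime values into ISO 8601 strings.

```python
import datetime
import json


def _json_default(value):
    """Fallback for json.dumps: serialize dates and datetimes as ISO strings."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    raise TypeError(repr(value) + ' is not JSON serializable')


def to_json_safe(entity):
    """Return a copy of entity containing only JSON-serializable values."""
    return json.loads(json.dumps(entity, default=_json_default))
```

Calling `to_json_safe(entity)` before `add_entity` would avoid the TypeError above, at the cost of one extra encode/decode round trip per entity.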
There are some calls that page results, but currently only the first page is effectively used. This should be reviewed and fixed.
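As a generic illustration of reading every page, the sketch below loops until no continuation token is returned. It assumes an API that pages via a `NextToken` field in both request and response; some AWS APIs use other names (for example a marker), and boto3 also offers paginators via `client.get_paginator(...)`. The helper name `collect_all_pages` is hypothetical.

```python
def collect_all_pages(fetch_page, result_key, token_key='NextToken'):
    """Call fetch_page repeatedly and concatenate result_key from all pages.

    fetch_page is any callable accepting the token as a keyword argument,
    e.g. a bound boto3 client method.
    """
    items = []
    kwargs = {}
    while True:
        page = fetch_page(**kwargs)
        items.extend(page.get(result_key, []))
        token = page.get(token_key)
        if not token:
            return items
        # Ask for the next page using the token from the previous response.
        kwargs = {token_key: token}
```

For example, `collect_all_pages(autoscaling_client.describe_auto_scaling_groups, 'AutoScalingGroups')` would gather the groups from all pages rather than only the first one.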
https://github.com/zalando/spilo
IndexError: list index out of range
In accounts/regions where not a single SQS queue exists, an exception is thrown that spams the logs:
ERROR:zmon_aws_agent.aws:Failed to list SQS queues.
File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/aws.py", line 798, in get_sqs_queues
for queue_url in list_queues_response['QueueUrls']:
KeyError: 'QueueUrls'
Empty responses should be handled more gracefully.
When API throttling is applied, the agent fails completely.
Might be related to #22
zmon-agent version: 0.46 running on AWS
Error inside the application.log
Traceback (most recent call last):
File "/zmon-agent.py", line 543, in
main()
File "/zmon-agent.py", line 423, in main
apps = get_running_apps(region)
File "/zmon-agent.py", line 162, in get_running_apps
ins['resource_id'] = tags['aws:cloudformation:logical-id']
KeyError: 'aws:cloudformation:logical-id'
sleeping...
[abc]-[aws:123] should be abc]-[aws:123], as only [a-z] are valid first chars.
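The rule stated above (only [a-z] is valid as first character) can be sketched as a small helper that drops any leading characters outside that range. The function name is hypothetical, and the sketch deliberately does nothing about characters later in the ID, since the text does not say which ones are valid there.

```python
import re


def strip_invalid_prefix(entity_id):
    """Remove leading characters until the ID starts with a lowercase letter."""
    return re.sub(r'^[^a-z]+', '', entity_id)


print(strip_invalid_prefix('[abc]-[aws:123]'))  # abc]-[aws:123]
```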