zalando-zmon / zmon-aws-agent

AWS API crawler to auto discover running services in your account

License: Apache License 2.0

Languages: Python 99.52%, Shell 0.28%, Dockerfile 0.19%
Topics: zmon, aws, ec2, discovery, monitoring

zmon-aws-agent's People

Contributors

a1exsh, aermakov-zalando, alexkorotkikh, arjunrn, avaczi, hjacobs, jan-m, lmineiro, malpi, marek-obuchowicz, mohabusama, oporkka, vetinari, vibhory2j

zmon-aws-agent's Issues

Sync spot instance price

It could be useful to return the Spot instance price (maybe as a separate entity). This would help with alerting when Spot instances are in use.

  • Should be optional

Integrate AWS SQS

Our team is using AWS SQS (Simple Queue Service) a lot. For monitoring we were relying on manually created entities, but this is getting tedious and error-prone.

The ZMON AWS Agent should automatically pick up all SQS queues and create an entity for each of them.

Implementation Strategy

The idea would be to invoke list-queues and then, for each entry in the array of queue URLs, invoke get_queue_attributes to retrieve the settings.

Entity Structure Proposal

id: "sqs-<name-of-the-queue>[aws:111122223333]"
type: "aws_sqs"
name: "<name-of-the-queue>"
infrastructure_account: "aws:111122223333"
region: "eu-central-1"
url: "https://sqs.eu-central-1.amazonaws.com/111122223333/<name-of-the-queue>"
arn: "arn:aws:sqs:eu-central-1:111122223333:<name-of-the-queue>"
message_retention_period_seconds: 1209600
maximum_message_size_bytes: 262144
receive_messages_wait_time_seconds: 10
delay_seconds: 0
visibility_timeout_seconds: 30
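
The strategy and entity layout above could be sketched roughly as follows. This is a minimal sketch, not the agent's actual code: the function names build_sqs_entity and discover_sqs_entities are hypothetical, and sqs_client is assumed to be a boto3 SQS client (or anything exposing list_queues and get_queue_attributes with the same call shape). The attribute keys are the names SQS returns from get_queue_attributes.

```python
def _int_attr(attributes, key):
    """SQS returns attribute values as strings; convert when present."""
    value = attributes.get(key)
    return int(value) if value is not None else None


def build_sqs_entity(queue_url, attributes, account_id, region):
    """Map one queue URL plus its attributes to the proposed entity layout."""
    name = queue_url.rstrip("/").rsplit("/", 1)[-1]
    account = "aws:{}".format(account_id)
    return {
        "id": "sqs-{}[{}]".format(name, account),
        "type": "aws_sqs",
        "name": name,
        "infrastructure_account": account,
        "region": region,
        "url": queue_url,
        "arn": attributes.get("QueueArn"),
        "message_retention_period_seconds": _int_attr(attributes, "MessageRetentionPeriod"),
        "maximum_message_size_bytes": _int_attr(attributes, "MaximumMessageSize"),
        "receive_messages_wait_time_seconds": _int_attr(attributes, "ReceiveMessageWaitTimeSeconds"),
        "delay_seconds": _int_attr(attributes, "DelaySeconds"),
        "visibility_timeout_seconds": _int_attr(attributes, "VisibilityTimeout"),
    }


def discover_sqs_entities(sqs_client, account_id, region):
    """List all queues, then fetch the attributes of each one."""
    response = sqs_client.list_queues()
    entities = []
    # "QueueUrls" can be absent when there are no queues at all.
    for queue_url in response.get("QueueUrls", []):
        attributes = sqs_client.get_queue_attributes(
            QueueUrl=queue_url, AttributeNames=["All"]
        )["Attributes"]
        entities.append(build_sqs_entity(queue_url, attributes, account_id, region))
    return entities
```

Note that a single list_queues call returns a bounded number of queues, so accounts with very many queues would need extra handling that this sketch leaves out.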

Sporadic RequestLimitExceeded errors are unhandled

From time to time we see the following exception occurring:

botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the DescribeInstances operation (reached max retries: 4): Request limit exceeded.

This can result in Auto Scaling groups being added with no instances, for example.

We should handle this the same way we handle the Throttling error response from the AWS API.
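
One way to treat both error codes alike is a small retry helper with exponential backoff. The sketch below is illustrative only and is not the agent's existing retry code: call_with_backoff, its parameters, and the RETRYABLE_CODES set are hypothetical names. It relies on the fact that botocore's ClientError carries the error code under response["Error"]["Code"], and it reads that attribute defensively so the helper does not need to import botocore.

```python
import time

# Error codes that are assumed to be safe to retry after a pause.
RETRYABLE_CODES = {"Throttling", "RequestLimitExceeded"}


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on retryable AWS errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # botocore's ClientError exposes the parsed error as exc.response.
            response = getattr(exc, "response", None) or {}
            code = response.get("Error", {}).get("Code")
            if code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

A call site would then look like call_with_backoff(lambda: ec2_client.describe_instances(InstanceIds=ids)), where ec2_client and ids are whatever the caller already has.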

Add creation time for auto scaling groups

In some cases it would be useful to know the creation time of an Auto Scaling group. Currently the entities only have a last_modified date, but that changes, e.g., when scaling happens. An example use case for the ASG creation time would be detecting old, obsolete stacks.

When running aws autoscaling describe-auto-scaling-groups, each Auto Scaling group has a property like "CreatedTime": "2018-06-08T11:31:09.480Z", which could be used for this purpose.
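
As a rough illustration, the value could be copied into the entity as an ISO 8601 string. The helper name asg_created_time is hypothetical. The sketch assumes asg is one item of the AutoScalingGroups list, where boto3 delivers CreatedTime as a datetime object, and it assumes that a datetime without timezone information is meant as UTC.

```python
from datetime import datetime, timezone


def asg_created_time(asg):
    """Return the group's CreatedTime as an ISO 8601 UTC string, or None."""
    created = asg.get("CreatedTime")
    if created is None:
        return None
    if isinstance(created, datetime):
        if created.tzinfo is None:
            # Assumption: naive timestamps are already in UTC.
            created = created.replace(tzinfo=timezone.utc)
        return created.astimezone(timezone.utc).isoformat()
    # Already a string (for example, parsed from CLI output): pass it through.
    return str(created)
```

Storing a string rather than the datetime object also keeps the entity JSON-serializable.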

Implement better error reporting

The local entity has a field for the error count. We can use this field (or a similar one) to provide extra information for error reporting (recovered or not). This gives teams the ability to investigate issues as they occur, instead of having them go unnoticed.

Fatal error on elb_client.describe_tags

Commit ba70e59 introduced a fatal error. Since this commit, ELB discovery no longer works and the agent is completely broken for us:

Traceback (most recent call last):
  File "/zmon-agent.py", line 522, in <module>
    main()
  File "/zmon-agent.py", line 405, in main
    elbs = get_running_elbs(region, infrastructure_account)
  File "/zmon-agent.py", line 178, in get_running_elbs
    tag_desc = elb_client.describe_tags(LoadBalancerNames=elb_names)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 310, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 382, in _make_api_call
    api_params, operation_model)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 423, in _convert_to_request_dict
    api_params, operation_model)
  File "/usr/local/lib/python2.7/dist-packages/botocore/validate.py", line 273, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid length for parameter LoadBalancerNames, value: 62, valid range: 1-20
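
The validation error above says that LoadBalancerNames accepts between 1 and 20 names per call. One possible fix, sketched here under that constraint, is to split the names into batches and merge the results. The helper names chunks and describe_elb_tags are hypothetical; elb_client is assumed to be a boto3 ELB client, whose describe_tags response carries the tags under "TagDescriptions".

```python
def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def describe_elb_tags(elb_client, elb_names, batch_size=20):
    """Fetch tag descriptions for any number of ELBs, 20 names per request."""
    tag_descriptions = []
    for batch in chunks(list(elb_names), batch_size):
        response = elb_client.describe_tags(LoadBalancerNames=batch)
        tag_descriptions.extend(response.get("TagDescriptions", []))
    return tag_descriptions
```

With the 62 load balancers from the report, this would issue four requests (20, 20, 20 and 2 names). An empty name list results in no request at all, which also avoids the lower bound of 1.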

Reduce number of AWS API invocations

Events per instance do not yield anything interesting so far; strip them.

Don't query for active members of ELB/ASG. Wrap this call in a check command, so it is only done when it matters.

Discover stack name and versions

As a ZMON Grafana dashboard user, I would like a queryable stack name and version as templating options. This would help users parameterise their dashboards and monitor applications seamlessly across deployment versions.

This would require the ZMON AWS agent to discover the stack names and versions as entities and store them in KairosDB.

Number of ASG entities is limited to 50

I create a check using Trial Run and use this entity filter:

[
    {
        "type": "asg",
        "infrastructure_account": "aws:1234567890123"
    }
]

When I run the check, it returns 50 results, although there are 88 ASGs in the account. If I remove the infrastructure_account constraint, I get more than 1000 results (all ASGs from all the accounts).

I expect to get all matching ASGs when I use the infrastructure_account filter, not only the first 50.

Add discovery of Elastigroups

Elastigroups are alternatives to Auto Scaling Groups which are typically used to manage Spot fleets.

The agent should also discover Elastigroups and provide the same level of functionality as the native Auto Scaling Groups.

zmon agent fails

ZMON reported that the ZMON AWS agent is not working.

The instance was terminated several times with no luck. The instance IDs mentioned in the error do not exist.

Jan 24 13:41:17  docker/bb23870f7a52[837]: Traceback (most recent call last):
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/bin/zmon-aws-agent", line 9, in <module>
Jan 24 13:41:17  docker/bb23870f7a52[837]:     load_entry_point('zmon-aws-agent==0.2', 'console_scripts', 'zmon-aws-agent')()
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/main.py", line 133, in main
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/aws.py", line 476, in get_auto_scaling_groups
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/common.py", line 29, in call_and_retry
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/aws.py", line 476, in <lambda>
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/paginate.py", line 291, in build_full_result
Jan 24 13:41:17  docker/bb23870f7a52[837]:     for response in self:
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/paginate.py", line 102, in __iter__
Jan 24 13:41:17  docker/bb23870f7a52[837]:     response = self._make_request(current_kwargs)
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/paginate.py", line 174, in _make_request
Jan 24 13:41:17  docker/bb23870f7a52[837]:     return self._method(**current_kwargs)
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/client.py", line 251, in _api_call
Jan 24 13:41:17  docker/bb23870f7a52[837]:     return self._make_api_call(operation_name, kwargs)
Jan 24 13:41:17  docker/bb23870f7a52[837]:   File "/usr/local/lib/python3.5/dist-packages/botocore-1.4.93-py3.5.egg/botocore/client.py", line 537, in _make_api_call
Jan 24 13:41:17  docker/bb23870f7a52[837]:     raise ClientError(parsed_response, operation_name)
Jan 24 13:41:17  docker/bb23870f7a52[837]: botocore.exceptions.ClientError: An error occurred (InvalidInstanceID.NotFound) when calling the DescribeInstances operation: The instance IDs 'i-0160f4e89e5397511, i-0f612c91029abfaa9' do not exist
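
One conceivable workaround, which the issue itself does not propose, is to drop the IDs that AWS reports as missing and retry with the rest. The sketch below is an assumption-laden illustration: describe_existing_instances is a hypothetical helper, ec2_client is assumed to be a boto3 EC2 client, and the missing IDs are parsed out of the error message text shown above, which is fragile if AWS ever changes that wording.

```python
import re

# Matches EC2 instance IDs such as i-0160f4e89e5397511 inside the message.
INSTANCE_ID_RE = re.compile(r"\bi-[0-9a-f]+\b")


def describe_existing_instances(ec2_client, instance_ids):
    """Describe instances, skipping IDs that AWS reports as nonexistent."""
    remaining = list(instance_ids)
    # Never call describe_instances with an empty list: without a filter
    # that call would describe every instance in the region.
    while remaining:
        try:
            return ec2_client.describe_instances(InstanceIds=remaining)
        except Exception as exc:
            error = (getattr(exc, "response", None) or {}).get("Error", {})
            if error.get("Code") != "InvalidInstanceID.NotFound":
                raise
            missing = set(INSTANCE_ID_RE.findall(error.get("Message", "")))
            pruned = [i for i in remaining if i not in missing]
            if len(pruned) == len(remaining):
                raise  # nothing recognisable to drop; avoid looping forever
            remaining = pruned
    return {"Reservations": []}
```

The design choice here is to degrade gracefully: a group whose members have just been terminated yields fewer (or no) instances instead of aborting the whole discovery run.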

Get AWS Health Events as Entity

If we use the AWS Health API, we can surface open and upcoming issues and maintenance events.

Benefits:

  • Covers more than just instances, unlike #86 and #87
  • Saves requests
$ aws --region us-east-1 health describe-events --filter "eventStatusCodes=open,upcoming"
{
    "events": [
        {
            "lastUpdatedTime": 1486951398.0,
            "eventTypeCategory": "scheduledChange",
            "arn": "arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764",
            "eventTypeCode": "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED",
            "startTime": 1488160800.0,
            "region": "eu-west-1",
            "statusCode": "upcoming",
            "service": "EC2",
            "endTime": 1488160800.0
        }
    ]
}
$ aws --region us-east-1 health describe-event-details --event-arns \
       arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764
{
    "failedSet": [],
    "successfulSet": [
        {
            "event": {
                "eventTypeCode": "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED",
                "startTime": 1488160800.0,
                "service": "EC2",
                "arn": "arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764",
                "statusCode": "upcoming",
                "eventTypeCategory": "scheduledChange",
                "lastUpdatedTime": 1486951398.0,
                "region": "eu-west-1",
                "endTime": 1488160800.0
            },
            "eventDescription": {
                "latestDescription": "EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance associated with this event in the eu-west-1 region. Due to this degradation, your instance could already be unreachable. After 2017-02-27 02:00 UTC your instance, which has an EBS volume as the root device, will be stopped.\n\nYou can see more information on your instances that are scheduled for retirement in the AWS Management Console (https://console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Events)\n\n* How does this affect you?\nYour instance's root device is an EBS volume and the instance will be stopped after the specified retirement date. You can start it again at any time. Note that if you have EC2 instance store volumes attached to the instance, any data on these volumes will be lost when the instance is stopped or terminated as these volumes are physically attached to the host computer\n\n* What do you need to do?\nYou may still be able to access the instance. We recommend that you replace the instance by creating an AMI of your instance and launch a new instance from the AMI. For more information please see Amazon Machine Images (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html) in the EC2 User Guide. In case of difficulties stopping your EBS-backed instance, please see the Instance FAQ (http://aws.amazon.com/instance-help/#ebs-stuck-stopping).\n\n* Why retirement?\nAWS may schedule instances for retirement in cases where there is an unrecoverable issue with the underlying hardware. For more information about scheduled retirement events please see the EC2 user guide (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-retirement.html). 
To avoid single points of failure within critical applications, please refer to our architecture center for more information on implementing fault-tolerant architectures: http://aws.amazon.com/architecture\n\nIf you have any questions or concerns, you can contact the AWS Support Team on the community forums and via AWS Premium Support at: http://aws.amazon.com/support"
            }
        }
    ]
}
$ aws --region us-east-1 health describe-affected-entities --filter \
     "eventArns=arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764" 
{
    "entities": [
        {
            "eventArn": "arn:aws:health:eu-west-1::event/a6540dbb-dbad-4c71-9940-9c8bc409f764",
            "lastUpdatedTime": 1486951398.0,
            "entityValue": "i-13e701d3",
            "entityArn": "arn:aws:health:eu-west-1:786011980701:entity/AVo1Nb-EZUIH_-T6pbaj",
            "awsAccountId": "786011980701"
        }
    ]
}
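
To make the idea concrete, the three responses above could be folded into a single entity. This is only a sketch: the function name health_event_entity, the entity type "aws_health_event", and every entity field name are hypothetical, since the issue does not define an entity layout. Timestamps are passed through as given; the CLI output above shows them as epoch seconds, while boto3 would return datetime objects that need converting before JSON serialization.

```python
def health_event_entity(event, affected_entities, description=None):
    """Combine one Health event with its affected resources into an entity."""
    arn = event["arn"]
    event_id = arn.rsplit("/", 1)[-1]
    return {
        "id": "aws-health-{}".format(event_id),  # hypothetical ID scheme
        "type": "aws_health_event",              # hypothetical entity type
        "arn": arn,
        "service": event.get("service"),
        "region": event.get("region"),
        "event_type_code": event.get("eventTypeCode"),
        "event_type_category": event.get("eventTypeCategory"),
        "status_code": event.get("statusCode"),
        "start_time": event.get("startTime"),
        "end_time": event.get("endTime"),
        # Only keep resources that belong to this particular event.
        "affected": sorted(
            item["entityValue"]
            for item in affected_entities
            if item.get("eventArn") == arn
        ),
        "description": description,
    }
```

Here event is one item of "events", affected_entities is the "entities" list, and description would be the "latestDescription" text from describe-event-details.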

More verbose logging in case of failure

When any operation fails (especially adding or removing entities), further inspection of the entity's properties might be needed. Logging should provide that information.

Entity ID might contain invalid characters

The generated entity IDs are not sanitized, i.e. invalid characters in application_id or application_version might break the entity IDs of type "instance".

This is a serious bug, as the agent essentially stops working (e.g. it throws an error on every entity DELETE).
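
A possible mitigation is to sanitize each component before the ID is assembled. The sketch below is hypothetical throughout: the issue does not say which characters ZMON accepts, so the allowed set (letters, digits, dot, underscore, hyphen) is an assumption, and sanitize_id_part and instance_entity_id are illustrative names rather than functions of the agent. The ID layout follows the instance ID format visible in the agent's logs, name-version-suffix[account:region].

```python
import re

# Assumption: anything outside this set is unsafe in an entity ID part.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def sanitize_id_part(value, replacement="-"):
    """Replace runs of unsafe characters and trim the replacement at the ends."""
    return _UNSAFE.sub(replacement, str(value)).strip(replacement)


def instance_entity_id(application_id, application_version, suffix, account, region):
    """Build an instance entity ID from sanitized components."""
    parts = [
        sanitize_id_part(part)
        for part in (application_id, application_version, suffix)
    ]
    return "{}[{}:{}]".format("-".join(parts), account, region)
```

For example, sanitize_id_part("my app/1.0") yields "my-app-1.0", so a stray space or slash in application_version no longer ends up in the ID.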

dns_traffic matches only application name and version

  • We are trying to extract the ELB info with the following filter rules, but we get no information for one specific application:
[
    {
        "dns_traffic": "true",
        "infrastructure_account": "aws:ACCOUNT_ID",
        "type": "elb"
    }
]
  • After removing the dns_traffic condition, we do get the application's ELB info.
  • After checking with @mohabusama, I found out that dns_traffic is tied to a specific pattern built from the application name and version. We use a different pattern for this application because we run it in a different AWS region.
    Here is the pattern we use: "{{StackName}}-{{Region}}.{{TeamID}}"

Adding EC2 instance entity with upcoming events fails due to JSON+datetime serialization

Example log with stack trace:

Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: INFO:zmon-aws-agent:Adding new instance entity with ID: xxx-test-cd95-xyz[aws:NNN:eu-central-1]  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: ERROR:zmon_cli.client:ZMON client failed in: add_entity  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: ERROR:zmon-aws-agent:Failed to add entity: {'instance_type': 'm4.large', 'events': [{'Code': 'instance-stop', 'Description': 'The instance is running on degraded hardware', 'NotBefore': datetime.datetime(2017, 3, 17, 13, 0, tzinfo=tzlocal())}], 'state_reason': '', 'name': 'xxx-test', 'application_version': 'cd95', 'id': 'xxx-test-cd95-xyz[aws:NNN:eu-central-1]', 'runtime': 'Docker', 'ports': {'9042': '9042', '7001': '7001'}, 'aws_id': 'i-12345', 'created_by': 'agent', 'region': 'eu-central-1', 'block_devices': {'/dev/xvdf': {'attach_time': '2017-02-22 10:33:53+00:00', 'volume_type': 'ebs', 'volume_id': 'vol-54321'}, '/dev/sda1': {'attach_time': '2017-02-22 10:33:17+00:00', 'volume_type': 'ebs', 'volume_id': 'vol-09876'}}, 'host': '172.31.xx.yy', 'type': 'instance', 'spot_instance': False, 'application_id': 'xxx-test', 'ip': '172.31.xx.yy', 'source_base': 'registry.opensource.zalan.do/stups/planb-cassandra-2', 'source': 'registry.opensource.zalan.do/stups/planb-cassandra-2:cd-67', 'infrastructure_account': 'aws:NNN'}  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: Traceback (most recent call last):  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/main.py", line 74, in add_new_entities  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     zmon_client.add_entity(entity)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/local/lib/python3.5/dist-packages/zmon_cli-1.1.46-py3.5.egg/zmon_cli/client.py", line 68, in wrapper  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     return f(*args, **kwargs)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/local/lib/python3.5/dist-packages/zmon_cli-1.1.46-py3.5.egg/zmon_cli/client.py", line 330, in add_entity  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     resp = self.session.put(self.endpoint(ENTITIES, trailing_slash=False), json=entity)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 533, in put  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     return self.request('PUT', url, data=data, **kwargs)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 461, in request  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     prep = self.prepare_request(req)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 394, in prepare_request  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     hooks=merge_hooks(request.hooks, self.hooks),  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 297, in prepare  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     self.prepare_body(data, files, json)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 428, in prepare_body  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     body = complexjson.dumps(json)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/lib/python3.5/json/__init__.py", line 230, in dumps  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     return _default_encoder.encode(obj)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/lib/python3.5/json/encoder.py", line 198, in encode  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     chunks = self.iterencode(o, _one_shot=True)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/lib/python3.5/json/encoder.py", line 256, in iterencode  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     return _iterencode(o, 0)  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:   File "/usr/lib/python3.5/json/encoder.py", line 179, in default  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]:     raise TypeError(repr(o) + " is not JSON serializable")  
Mar  6 15:24:39 ip-aa-bb-cc-dd docker/5ec84e28e5b3[810]: TypeError: datetime.datetime(2017, 3, 17, 13, 0, tzinfo=tzlocal()) is not JSON serializable  
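
The traceback shows the entity being encoded inside requests (via json=entity), so one option is to convert datetime values before the entity is handed to the client. The following is a minimal sketch of that idea; to_json_safe is a hypothetical helper, and choosing ISO 8601 strings as the target format is an assumption, not something the issue prescribes.

```python
import json
from datetime import date, datetime


def to_json_safe(value):
    """Recursively replace datetime/date objects with ISO 8601 strings."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_json_safe(item) for item in value]
    return value


# Usage with a cut-down version of the failing entity from the log above.
entity = {
    "type": "instance",
    "events": [{"Code": "instance-stop", "NotBefore": datetime(2017, 3, 17, 13, 0)}],
}
payload = json.dumps(to_json_safe(entity))  # no TypeError any more
```

Converting up front keeps the fix inside the agent and does not depend on how the HTTP client serializes its payload.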

SQS discovery: Gracefully handle empty responses.

In accounts or regions with no SQS queues at all, an exception is thrown and spams the logs:

ERROR:zmon_aws_agent.aws:Failed to list SQS queues.
File "/usr/local/lib/python3.5/dist-packages/zmon_aws_agent-0.2-py3.5.egg/zmon_aws_agent/aws.py", line 798, in get_sqs_queues 
  for queue_url in list_queues_response['QueueUrls']:
KeyError: 'QueueUrls'

Empty responses should be handled more gracefully.
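
The KeyError above indicates that the list_queues response simply has no "QueueUrls" key when there are no queues. A minimal sketch of a tolerant lookup follows; extract_queue_urls is a hypothetical helper name, not a function of the agent.

```python
def extract_queue_urls(list_queues_response):
    """Return the queue URLs, or an empty list when the key is missing."""
    return list_queues_response.get("QueueUrls") or []
```

A loop such as for queue_url in extract_queue_urls(response) then runs zero times in empty accounts instead of raising.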

zmon-agent KeyError

zmon-agent version: 0.46 running on AWS
Error inside the application.log

Traceback (most recent call last):
  File "/zmon-agent.py", line 543, in <module>
    main()
  File "/zmon-agent.py", line 423, in main
    apps = get_running_apps(region)
  File "/zmon-agent.py", line 162, in get_running_apps
    ins['resource_id'] = tags['aws:cloudformation:logical-id']
KeyError: 'aws:cloudformation:logical-id'
sleeping...
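
The aws:cloudformation:logical-id tag is set by CloudFormation, so instances launched any other way will not have it. A hedged sketch of a tolerant lookup is shown below; tags_to_dict and resource_id_from_tags are hypothetical helpers, and the Key/Value list shape is the one boto3 uses for EC2 tags.

```python
LOGICAL_ID_TAG = "aws:cloudformation:logical-id"


def tags_to_dict(tag_list):
    """Turn [{'Key': ..., 'Value': ...}, ...] into a plain dict."""
    return {tag["Key"]: tag["Value"] for tag in tag_list or []}


def resource_id_from_tags(tags, default=None):
    """Return the CloudFormation logical ID, or `default` when it is absent."""
    return tags.get(LOGICAL_ID_TAG, default)
```

Whether a missing tag should map to None, to an empty string, or cause the instance to be skipped is a decision the issue leaves open.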
