
Serverless application to monitor an AWS Batch architecture through dashboards.

License: MIT No Attribution

Python 100.00%
aws serverless dashboards aws-batch hpc

aws-batch-runtime-monitoring's Introduction

AWS Batch Runtime Monitoring Dashboards Solution

This SAM application deploys a serverless architecture that captures events from Amazon ECS, AWS Batch, and Amazon EC2 to visualize the behavior of your workloads running on AWS Batch and to provide insights on your jobs and the instances used to run them.

This application is designed to be scalable: it collects data from events and API calls using Amazon EventBridge and does not make API calls to describe your resources. Data collected through events and APIs is partially aggregated in Amazon DynamoDB to correlate information and generate Amazon CloudWatch metrics with the Embedded Metric Format. The application also deploys several dashboards displaying the job states, the Amazon EC2 instances belonging to your Amazon ECS clusters (AWS Batch Compute Environments), and the Auto-Scaling groups (ASGs) across Availability Zones. It also collects RunTask API calls to visualize job placement across instances.

Dashboard

A series of dashboards is deployed as part of the SAM application. They provide information on your ASG scaling and capacity in vCPUs and instances. You will also find dashboards providing statistics on Batch job state transitions, RunTask API calls, and job placement per Availability Zone, Instance Type, Job Queue, and ECS Cluster.


Architecture Diagram


Overview

This solution captures events from AWS Batch and Amazon ECS using Amazon EventBridge. Each event triggers an AWS Lambda function that adds metrics to Amazon CloudWatch and interacts with an Amazon DynamoDB table to tie a job to the instance it runs on.

  • ECS Instance Registration: captures when instances are added to or removed from an ECS cluster (a Compute Environment in Batch). This links the EC2 Instance ID to the ECS Container Instance ID.
  • ECS RunTask: the API called by AWS Batch to run the jobs. Calls can be successful (job placed on the cluster) or not, in which case the job is not placed due to a lack of free resources.
  • Batch Job Transitions: captures Batch job transitions between states and their placement across Job Queues and Compute Environments.

DynamoDB is used to retain the state of the instances joining the ECS clusters (which sit underneath the Batch Compute Environments) so we can identify the instances joining a cluster and the Availability Zone in which they are created. This also allows us to identify the instance on which each job is placed, succeeds, or fails.

When RunTask is called or Batch jobs transition between states, we can associate them with the Amazon EC2 instance on which they are running. Because RunTask API calls and Batch jobs are not associated directly with Amazon EC2, we use the ContainerInstanceID generated when an instance registers with a cluster to identify which Amazon EC2 instance is used to run a job (or place a task, in the case of ECS). This architecture does not make any explicit API call such as DescribeJobs or DescribeEC2Instance, which makes it relatively scalable and not subject to potential throttling on these APIs.
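The mapping described above can be sketched in plain Python (illustrative only: a dict stands in for the DynamoDB table, and all names and ARNs are made up — this is not the application's actual code):

```python
# Events only carry the ECS container-instance ARN, so a table keyed on
# that ARN recovers the EC2 instance a job or task landed on.

def register_instance(table, container_instance_arn, ec2_instance_id, az):
    # Stored when the instance registers with its ECS cluster.
    table[container_instance_arn] = {"InstanceId": ec2_instance_id, "AZ": az}

def resolve_placement(table, run_task_event):
    # On a RunTask event or job transition, recoup the EC2 instance.
    arn = run_task_event["responseElements"]["tasks"][0]["containerInstanceArn"]
    return table.get(arn)

table = {}
register_instance(
    table,
    "arn:aws:ecs:us-east-1:123456789012:container-instance/demo/abc123",
    "i-0123456789abcdef0",
    "us-east-1a",
)
event = {"responseElements": {"tasks": [
    {"containerInstanceArn":
     "arn:aws:ecs:us-east-1:123456789012:container-instance/demo/abc123"}
]}}
placement = resolve_placement(table, event)
print(placement)
```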

CloudWatch Embedded Metric Format (EMF) is used to collect the metrics displayed on the dashboards. AWS Step Functions is used to build the logic around the Lambda functions handling the events and the data stored in DynamoDB and pushed to CloudWatch EMF.
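As an illustration of how EMF works (the namespace, metric, and dimension names below are invented for the example, not the ones this application emits), a Lambda function only needs to print a JSON record carrying the EMF `_aws` envelope for CloudWatch to extract the metric — no PutMetricData call required:

```python
import json

def emf_record(namespace, metric_name, value, dimensions, timestamp_ms):
    # Build a CloudWatch Embedded Metric Format (EMF) log record.
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": "Count"}],
            }],
        },
        metric_name: value,  # the metric value itself
    }
    record.update(dimensions)  # dimension values live at the top level
    return json.dumps(record)

# Illustrative names; printing this from a Lambda function is enough for
# CloudWatch to turn it into a metric.
log_line = emf_record(
    "BatchMonitoring", "JobStateTransition", 1,
    {"JobQueue": "my-queue", "State": "RUNNABLE"},
    1654882975216,
)
print(log_line)
```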

How to run the SAM application

To build and deploy your application for the first time, run the following in your shell:

sam build
sam deploy --guided

You will be asked to provide a few parameters.

Parameters

When creating the stack for the first time you will need to provide the following parameters:

  • Stack Name: name of the CloudFormation stack that will be deployed.
  • AWS Region: region in which you will deploy the stack, the default region will be used if nothing is provided.

After the first launch, you can modify the function and deploy a new version with the same parameters through the following commands:

sam build
sam deploy # use sam deploy --no-confirm-changeset to force the deployment without validation

Cleanup

To remove the SAM application, go to the CloudFormation page of the AWS Console, select your stack and click Delete.

Adding Monitoring to Existing Auto-Scaling Groups

The Clusters Usage dashboards need monitoring activated for each Auto-Scaling group (ASG); it is not enabled by default. The ASGMonitoring Lambda function in the serverless application automatically adds monitoring for new ASGs created by AWS Batch. To add it for existing ASGs, run the following command in your terminal (requires jq) or in AWS CloudShell:

aws autoscaling describe-auto-scaling-groups | \
  jq -c '.AutoScalingGroups[] | select(.MixedInstancesPolicy.LaunchTemplate.LaunchTemplateSpecification.LaunchTemplateName | contains("Batch-lt")) | .AutoScalingGroupName' | \
  xargs -t -I {} aws autoscaling enable-metrics-collection  \
    --metrics GroupInServiceCapacity GroupDesiredCapacity GroupInServiceInstances \
    --granularity "1Minute" \
    --auto-scaling-group-name {}

Requirements

To run the serverless application you need to install the SAM CLI and have Python 3.8 installed on your host. Python 3.7 can be used as well by modifying the template.yaml file to replace the Lambda function runtimes from python3.8 to python3.7. You can do this quickly with the following sed command in the repository directory:

sed -i 's/3\.8/3\.7/g' template.yaml

If you plan to use AWS CloudShell to deploy the SAM template, modify the Lambda runtime to 3.7 as suggested above (unless 3.8 is available) and make your Python 3 command the default for python: alias python=/usr/bin/python3.7. You can check the Python version available in CloudShell with python3 --version.


aws-batch-runtime-monitoring's People

Contributors

adamantike, devendra-d-chavan, perifaws


aws-batch-runtime-monitoring's Issues

Add job definition tracking in the dashboard job states

We currently track job states across Compute Environments and Job Queues, but not Job Definitions.

Adding this feature requires pulling the data from the job state events and adding a new dimension when writing the metrics. A series of widgets needs to be added to the Jobs States dashboard.

JSONPath search expression syntax is not correct

The filter expression in two of the state machines is incorrect due to missing single quotes around the searched strings.

The original code

"$$.Execution.Input.detail.responseElements.containerInstance.attributes[?(@.name==ecs.availability-zone)].value"

What it should look like:

"$$.Execution.Input.detail.responseElements.containerInstance.attributes[?(@.name=='ecs.availability-zone')].value"

Failure in Adding Monitoring to Existing Auto-Scaling Groups

Hi,
if you have null values in some "LaunchTemplateName" fields, you get the following error when trying to add existing Auto-Scaling groups to the monitoring:
jq: error (at <stdin>:6896): null (null) and string ("Batch-lt") cannot have their containment checked

A possible workaround (replacing the command in the readme.md) could be:

aws autoscaling describe-auto-scaling-groups | \
  jq -c '.AutoScalingGroups[] | select(.MixedInstancesPolicy.LaunchTemplate.LaunchTemplateSpecification.LaunchTemplateName != null) | select(.MixedInstancesPolicy.LaunchTemplate.LaunchTemplateSpecification.LaunchTemplateName | contains("Batch-lt")) | .AutoScalingGroupName' | \
  xargs -t -I {} aws autoscaling enable-metrics-collection \
    --metrics GroupInServiceCapacity GroupDesiredCapacity GroupInServiceInstances \
    --granularity "1Minute" \
    --auto-scaling-group-name {}

Regards

Job transitions in states other than RUNNABLE are not collected

We have deployed this project to our AWS account, but after triggering AWS Batch jobs, we only get metrics for the RUNNABLE state.

This is an example of a CloudWatch error log from when the State Machine runs for a state transition.

{"id":"10","type":"ChoiceStateEntered","details":{"input":"{\"LastEventType\":\"SUCCEEDED\",\"JobQueue\":\"arn:aws:batch:us-east-1:123456789012:job-queue/prod-data-retrieval\",\"JobName\":\"RetrieveData\",\"Region\":\"us-east-1\",\"LastEventTime\":\"2022-06-10T17:42:54Z\",\"JobId\":\"096ad8ae-83c4-434e-b541-2114ecd19fa3\",\"JobDefinition\":\"arn:aws:batch:us-east-1:123456789012:job-definition/prod-retrieve-data:78\"}","inputDetails":{"truncated":false},"name":"Is Attempted?"},"previous_event_id":"9","event_timestamp":"1654882975216","execution_arn":"arn:aws:states:us-east-1:123456789012:express:JobStatesStateMachineServerless-DBT9CAIdCeAT:c9becabb-d991-4af7-a1ec-85586f8e1ea0_b47f626b-423e-4c9f-9380-14fd92de9c71:57741e9b-a01f-45ea-95ef-ad7e1f78eb72"}

{"id":"11","type":"ExecutionFailed","details":{"cause":"An error occurred while executing the state 'Is Attempted?' (entered at the event id #10). Unable to apply Path transformation to null or empty input.","error":"States.Runtime"},"previous_event_id":"0","event_timestamp":"1654882975216","execution_arn":"arn:aws:states:us-east-1:123456789012:express:JobStatesStateMachineServerless-DLU9BBIkCzEL:c9becabb-d991-4af7-a1ec-85586f8e1ea0_b47f626b-423e-4c9f-9380-14fd92de9c71:57741e9b-a01f-45ea-95ef-ad7e1f78eb72"}

It's worth mentioning that we only use Fargate clusters, so we haven't tested whether state transitions work correctly for EC2-based clusters.

CloudWatch Logs resource policy 5120-character limit error

Hi Team

Customers whose CloudWatch Logs resource policy is already large will get this error when deploying the SAM app:

"Invalid Logging Configuration: The CloudWatch Logs Resource Policy size was
exceeded. We suggest prefixing your CloudWatch log group name with
/aws/vendedlogs/states/. (Service: AWSStepFunctions; Status Code: 400; Error
Code: InvalidLoggingConfiguration;"

The recommendation given by the error message itself is to use a log group name prefixed with /aws/vendedlogs/states/ as the destination. Is this a change that could be made to the solution?

Change use of indices in state machines to retrieve fields from CloudWatch-generated events

The state machines use indices to retrieve fields of interest such as the AZ, instance ID, and instance type. The indices of these fields can vary in the nested arrays, which causes errors in the Step Functions and impacts the ability to add data to the dashboards as well as to the instance Amazon DynamoDB table.

A solution is to use JSONPath with transient Pass states to retrieve the data instead of using indices that can vary.
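The fragility can be sketched in a few lines of Python (sample data is illustrative, shaped like the attributes array in an ECS event):

```python
# Attribute order in the event's nested arrays is not guaranteed, so
# positional access can return the wrong field. A filter on the attribute
# name (the JSONPath [?(@.name=='ecs.availability-zone')] pattern) is
# order-independent.

attributes = [
    {"name": "ecs.cpu-architecture", "value": "x86_64"},
    {"name": "ecs.availability-zone", "value": "us-east-1f"},
]

# Fragile: assumes the AZ is always at index 1.
az_by_index = attributes[1]["value"]

# Robust: select by name, like the JSONPath filter expression does.
def get_attribute(attrs, name):
    return next((a["value"] for a in attrs if a["name"] == name), None)

az_by_name = get_attribute(attributes, "ecs.availability-zone")
print(az_by_name)
```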

AWS Batch custom dashboards do not display any data

The following custom dashboards return no data in a client's production environment, even when AWS Batch jobs ran within the selected time horizon:

  • Batch-ECS-InstancesRegistration
  • Batch-Jobs-Placement
  • Batch-Jobs-states

Batch-ASG-Utilization and Batch-EC2-Capacity dashboards display the correct data and activity.


Job placement dashboard no longer shows data for executed jobs

Problem

A recent change to the ECS RunTask CloudTrail event breaks the ability to parse AWS Batch job metadata (such as the job ID and compute environment name) from environment variables in the event, because the environment variables are now redacted.

Specifically, the ECS RunTask state machine relies on parsing the following metadata from the ECS RunTask CloudTrail event:

Metadata     ContainerOverride environment variable
JobId        AWS_BATCH_JOB_ID
CEName       AWS_BATCH_CE_NAME
JQName       AWS_BATCH_JQ_NAME
JobAttempt   AWS_BATCH_JOB_ATTEMPT

Redaction of the environment variables breaks this state machine, resulting in an error when the incoming CloudTrail event is processed by the step function.

A sample ECS RunTask event showing the redacted environment variables:

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        ...
        "invokedBy": "batch.amazonaws.com"
    },
    "eventTime": "2023-06-20T04:34:38Z",
    "eventSource": "ecs.amazonaws.com",
    "eventName": "RunTask",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "batch.amazonaws.com",
    "userAgent": "batch.amazonaws.com",
    "requestParameters": {
        ...
        "overrides": {
            "containerOverrides": [
                {
                    "name": "default",
                    "environment": "HIDDEN_DUE_TO_SECURITY_REASONS",
                    "cpu": 1024,
                    "memory": 128,
                    "resourceRequirements": []
                }
            ]
        },
        "count": 1,
        "launchType": "EC2",
        "tags": [
            {
                "key": "aws:batch:compute-environment",
                "value": "..."
            },
            {
                "key": "aws:batch:job-definition",
                "value": "..."
            },
            {
                "key": "aws:batch:job-queue",
                "value": "..."
            }
        ],
        "cluster": "...",
        "enableExecuteCommand": false,
        "taskDefinition": "...",
        ...
    },
    "responseElements": {
        "failures": [],
        "tasks": [
            {
                "attachments": [],
                "attributes": [
                    {
                        "name": "ecs.cpu-architecture",
                        "value": "x86_64"
                    }
                ],
                "availabilityZone": "us-east-1f",
                "clusterArn": "...",
                "containerInstanceArn": "...",
                "containers": [
                    {
                        "containerArn": "...",
                        "taskArn": "...",
                        "name": "default",
                        "image": "...",
                        "lastStatus": "PENDING",
                        "networkInterfaces": [],
                        "cpu": "0",
                        "memory": "1"
                    }
                ],
                "cpu": "1024",
                "createdAt": "Jun 20, 2023, 4:34:38 AM",
                "desiredStatus": "RUNNING",
                "enableExecuteCommand": false,
                "group": "family:...",
                "lastStatus": "PENDING",
                "launchType": "EC2",
                "memory": "128",
                "overrides": {
                    "containerOverrides": [
                        {
                            "name": "default",
                            "environment": "HIDDEN_DUE_TO_SECURITY_REASONS",
                            "cpu": 1024,
                            "memory": 128,
                            "resourceRequirements": []
                        }
                    ],
                    "inferenceAcceleratorOverrides": []
                },
                "tags": [
                    {
                        "key": "aws:batch:job-queue",
                        "value": "..."
                    },
                    {
                        "key": "aws:batch:compute-environment",
                        "value": "..."
                    },
                    {
                        "key": "aws:batch:job-definition",
                        "value": "..."
                    }
                ],
                "taskArn": "...",
                "taskDefinitionArn": "...",
                "version": 1
            }
        ]
    },
    "requestID": "...",
    "eventID": "...",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "...",
    "eventCategory": "Management"
}

This results in the following failure in the 'Select common fields' step of the ECS RunTask state machine, impairing the ability to post the metrics used in the job placement dashboard:

{
  "cause": "An error occurred while executing the state 'Select common fields' (entered at the event id #4). The JSONPath '$.detail.requestParameters.overrides.containerOverrides[0].environment[?(@.name=='AWS_BATCH_JOB_ID')].value' specified for the field 'JobId.$' could not be found in the input '...'",
  "error": "States.Runtime"
}

Proposed solution

  • Use ECS task state events as the source of truth for successfully placed Batch jobs (defined as ECS RunTask calls that have succeeded). Add a new state machine to process these events. The existing Lambda function that publishes the EMF metrics is reused.
  • Use the Batch system-created tags in the RunTask request to identify the compute environment and job queue associated with a RunTask request whose placement failed (for example, when container instances have not yet scaled up). Batch system-created tags use the aws:batch prefix in the ECS RunTask request (see the CloudTrail event example above).

AutoScaling Groups In-Service Capacity (Batch-EC2-Capacity) should not depend on the period

The following query for the "AutoScaling Groups In-Service Capacity" widget displays different values depending on the period chosen. When choosing a one-day interval from the console, the period becomes 5 minutes, which displays a much higher vCPU capacity than the actual values.

SEARCH('{AWS/AutoScaling,AutoScalingGroupName} MetricName="GroupInServiceCapacity"', 'Sum', 300)

Removing the period information from the query and using 'Maximum' should fix the issue.

SEARCH('{AWS/AutoScaling,AutoScalingGroupName} MetricName="GroupInServiceCapacity"', 'Maximum')

The goal is to display the concurrency at a given time instead of a result accumulated over the period. Other capacity-related widgets in the same dashboard have a similar issue.
Attached are two graphs covering the same one-day time frame with different periods (1 minute and 5 minutes).
