aws-observability / observability-best-practices
Observability best practices on AWS
Home Page: https://aws-observability.github.io/observability-best-practices/
License: MIT No Attribution
Hi Team,
You have indicated that apiserver_request_latencies_sum and apiserver_registered_watchers are valid control plane metrics. I am not able to see these in my cluster; are they only available in newer versions?
Any help will be much appreciated.
Is there a process to report and/or fix typos in the docs?
For example:
BTW, thanks for this amazing piece of work.
## Request for change
Kindly size the table correctly / wrap its contents so that they are fully visible to the user.
Hi team!
I recently published a tutorial (keeping best practices in mind) on monitoring an application's performance using CloudWatch Container Insights.
I would like to suggest that the EKS Observability best practices guide include a reference to the above tutorial. It adds a lot of value: beyond the setup process, it provides detailed guidance on using CloudWatch Logs Insights for effective searching and querying, as well as instructions for leveraging the CloudWatch dashboard for efficient monitoring.
I read through the EKS best practices guide and found that the section (highlighted in the screenshot below) would be a great place to reference it.
PR created.
Thanks in advance,
Ahmad (Containers Specialist SA, AWS)
VPC Flow Logs
The last 20 messages from the VPC Flow Logs, sorted in descending timestamp order. The $limit value in all the queries below can be adjusted to return a different number of messages as needed.
fields @timestamp, @message, @logStream, @log
| sort @timestamp desc
| limit 20
Filtering for messages with the REJECT action and aggregating them per 1-hour window.
fields @timestamp, @message, @logStream, @log | filter @message like /REJECT/
| stats count(*) as connCount by bin(1h)
| sort connCount desc
Filtering for messages with the REJECT action and aggregating them per 10-minute window.
fields @timestamp, @message, @logStream, @log | filter @message like /REJECT/
| stats count(*) as rejCount by bin(10m)
| sort rejCount desc
Filtering for messages within a date range. The timestamps must be converted to UNIX epoch time in milliseconds and used with the > and < range operators.
fields @timestamp, @message, @logStream, @log
| fields toMillis(@timestamp) as millisecs
| filter millisecs > 1675641600000 and millisecs < 1675718300000
| sort @timestamp desc
| limit 20
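The epoch values in the example above can be produced from human-readable dates before pasting them into the query. A minimal Python sketch (the two dates shown correspond to the example epoch values):

```python
from datetime import datetime, timezone

def to_epoch_millis(iso_date: str) -> int:
    """Convert an ISO-8601 date string (interpreted as UTC) to epoch milliseconds."""
    dt = datetime.fromisoformat(iso_date).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

print(to_epoch_millis("2023-02-06T00:00:00"))  # 1675641600000
print(to_epoch_millis("2023-02-06T21:18:20"))  # 1675718300000
```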
Filtering logs that have REJECT action in the flow logs.
fields @timestamp, @message, @logStream, @log | filter @message like /REJECT/
| sort @timestamp desc
| limit 20
Filtering logs that have ACCEPT action in the flow logs.
fields @timestamp, @message, @logStream, @log | filter @message like /ACCEPT/
| sort @timestamp desc
| limit 20
Filtering logs for selected source IP addresses with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter srcAddr like '192.168.20.' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for selected source IP addresses with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter srcAddr like '192.168.20.' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for specific source IP address with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter srcAddr = '192.168.20.42' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for specific source IP address, source port with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter srcAddr = '192.168.20.42' and srcPort = '1111' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for specific source IP address, source port with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter srcAddr = '192.168.20.42' and srcPort = '1111' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for specific destination IP address with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter dstAddr = '192.168.20.176' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for specific destination IP address with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter dstAddr = '192.168.20.176' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for specific destination IP address, destination port with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter dstAddr = '192.168.20.176' and dstPort = '1111' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for specific destination IP address, destination port with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter dstAddr = '192.168.20.176' and dstPort = '1111' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for a specific elastic network interface (interfaceId).
fields @timestamp, @message, @logStream, @log | filter interfaceId = 'eni-05c479afeb755895e'
| sort @timestamp desc
| limit 20
Filtering logs for a specific elastic network interface (interfaceId) with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter interfaceId = 'eni-05c479afeb755895e' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for a specific elastic network interface (interfaceId) with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter interfaceId = 'eni-05c479afeb755895e' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for a specific AWS account (accountId) with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter accountId = '133233069280' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for a specific AWS account (accountId) with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter accountId = '133233069280' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs with logStatus as OK, confirming data is being logged normally to the chosen destinations.
fields @timestamp, @message, @logStream, @log | filter logStatus = 'OK'
| sort @timestamp desc
| limit 20
Filtering logs with logStatus as NODATA, indicating there was no network traffic to or from the network interface during the aggregation interval.
fields @timestamp, @message, @logStream, @log | filter logStatus = 'NODATA'
| sort @timestamp desc
| limit 20
Filtering logs with logStatus as SKIPDATA, indicating some flow log records were skipped during the aggregation interval.
fields @timestamp, @message, @logStream, @log | filter logStatus = 'SKIPDATA'
| sort @timestamp desc
| limit 20
Filtering logs with protocol as 1, for ICMP traffic.
fields @timestamp, @message, @logStream, @log | filter protocol = '1'
| sort @timestamp desc
| limit 20
Filtering logs with protocol as 6, for TCP traffic.
fields @timestamp, @message, @logStream, @log | filter protocol = '6'
| sort @timestamp desc
| limit 20
Filtering logs with protocol as 17, for UDP traffic.
fields @timestamp, @message, @logStream, @log | filter protocol = '17'
| sort @timestamp desc
| limit 20
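The protocol values used in these filters are standard IANA protocol numbers. A small Python sketch for decoding them when reviewing results (the mapping covers only the protocols queried above):

```python
# IANA protocol numbers as they appear in the VPC Flow Logs protocol field
PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}

def protocol_name(number: int) -> str:
    """Return a readable name for a flow log protocol number."""
    return PROTOCOLS.get(number, f"protocol-{number}")

print(protocol_name(6))   # TCP
print(protocol_name(17))  # UDP
```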
While testing AWS CloudWatch Container Insights with Enhanced Observability, I felt the urge to provide a configuration better suited to our needs. Reading through your guides, the section about "Cost savings with Container Insights on Amazon EKS" was very interesting, in particular the subsection called "Customize Metrics and Dimensions".
The solution presented there seems easy to understand and very straightforward.
However, after the release of CloudWatch Container Insights with Enhanced Observability, the guides presented here seem out of date.
Why? Because I believe that if one follows the suggested installation method (EKS add-on), the documented otel-agent-conf is not used. But, looking at the log of the deployed cloudwatch-agent, it still uses the awsemfexporter. Hence, I believe it should still be possible to use the described metric_declaration:
I would be very thankful if someone could help me understand how we could move forward here.
There is an incorrect link, which is described here:
CoreDNS Metrics
CoreDNS is a flexible, extensible DNS server that can serve as the Kubernetes cluster DNS. The CoreDNS pods provide name resolution for all pods in the cluster. Running DNS intensive workloads can sometimes experience intermittent CoreDNS failures due to DNS throttling, and this can impact applications.
Check out the latest best practices for tracking key CoreDNS performance metrics here.
<< Refer above for the broken link >>
Hi, I am trying to import the dashboard linked below, but the file contains HTML source code instead of JSON.
sandbox/cure-grafana-dashboard/AmazonManagedGrafanaCUREDashboard.json
Thank you
Aurora PostgreSQL cluster log
Filtering logs that have ERROR messages in them
fields @timestamp, @message, @logStream, @log
| filter @message like "ERROR"
| sort @timestamp desc
Filtering logs that have ERROR messages and aggregating them over a required time window, such as 1 hour
fields @timestamp, @message, @logStream, @log
| filter @message like "ERROR"
| stats count(*) as errCount by bin(1h)
Filtering logs that have specific texts in the message field
fields @timestamp, @message, @logStream, @log
| filter @message like 'could not receive data from client: Connection reset by peer'
| sort @timestamp desc
Amazon RDS Error log - MySQL
Filtering logs that have Warning messages in them
fields @timestamp, @message, @logStream, @log
| filter @message like "Warning"
| sort @timestamp desc
MySQL Error Codes - https://dev.mysql.com/doc/mysql-errors/8.0/en/server-error-reference.html
Getting the logs based on specific MySQL error codes
fields @timestamp, @message, @logStream, @log
| filter @message like 'MY-010055'
| sort @timestamp desc
I implemented the monitor-aurora-with-grafana script as described without issue (all 4 steps in the CF stack were successful); however, each lambda invocation includes this error:
[ERROR] InvalidArgumentException: An error occurred (InvalidArgumentException) when calling the GetResourceMetrics operation: This group is not a known group: db.application is not valid for current resource
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 33, in lambda_handler
    pi_response = get_db_resource_metrics(instance)
  File "/var/task/lambda_function.py", line 78, in get_db_resource_metrics
    response = pi_client.get_resource_metrics(
  File "/var/runtime/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
and no metrics are published to CloudWatch.
I have confirmed the region is correct, and we have multiple databases in this region with Performance Insights enabled.
I am running Grafana 10.2.0 and trying to collect metrics from CloudWatch. I am running RDS Postgres 12.14 and using the lambda from the monitoring-aurora-with-grafana folder.
I ran into a first issue, where this line of code would fail due to sending over 1,000 data items:
if metric_data:
    logger.info('## sending data to cloudwatch...')
    try:
        cw_client.put_metric_data(
            Namespace=targetMetricNamespace,
            MetricData=metric_data)
    except ClientError as error:
        raise ValueError('The parameters you provided are incorrect: {}'.format(error))
I then changed the code to split the put call into batches of 500, and the error stopped happening:
result = []
max_elements = 500
for i in range(0, len(metric_data), max_elements):
    result.append(metric_data[i:i + max_elements])
if metric_data:
    for entry in result:
        logger.info('## sending data to cloudwatch...')
        try:
            cw_client.put_metric_data(
                Namespace=targetMetricNamespace,
                MetricData=entry)
        except ClientError as error:
            raise ValueError('The parameters you provided are incorrect: {}'.format(error))
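The batching logic above can be factored into a small reusable helper. A sketch (the 500-item batch size is conservative relative to the PutMetricData per-call quota; variable names such as cw_client and targetMetricNamespace are the reporter's and are assumed to exist):

```python
def chunked(items, size=500):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Usage against the reporter's code (sketch):
# for batch in chunked(metric_data):
#     cw_client.put_metric_data(Namespace=targetMetricNamespace, MetricData=batch)
```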
After this, it started erroring out from the dbSliceGroup line of code:
dbSliceGroup = { "db.sql_tokenized", "db.application", "db.wait_event", "db.user", "db.session_type", "db.host", "db", "db.application" }
I got some exceptions from the lambda, where initially db.session_type was considered invalid; then a couple of others, which I did not take note of, also failed. I then commented everything out, leaving only db.sql_tokenized:
dbSliceGroup = {
"db.sql_tokenized",
#"db.application",
#"db.wait_event",
#"db.user",
#"db.session_type",
#"db.host",
#"db",
#"db.application"
}
Things then worked out and I managed to see the metrics in the AuroraMonitoringGrafana/PerformanceInsightMetrics namespace. Unfortunately, the metrics do not show up in Grafana, despite adding the custom namespaces in the configs. I am using the dashboard included in this repo for the Aurora use case.
Is there any subtlety involved in making these metrics work? I do not know where to look further; some assistance would be appreciated.
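Rather than commenting dimension groups out by hand, one workaround is to probe each group once and keep only those the engine accepts, since supported groups vary by database engine and Performance Insights configuration. A hedged sketch: the probe callable here is a stand-in for a cheap pi_client.get_resource_metrics call against the instance.

```python
def supported_groups(probe, groups):
    """Return only the dimension groups for which probe(group) does not raise.

    probe: a callable that issues one inexpensive GetResourceMetrics request
    for the given group, raising (e.g. InvalidArgumentException) if the group
    is not valid for the instance.
    """
    ok = []
    for group in groups:
        try:
            probe(group)
            ok.append(group)
        except Exception:
            pass  # group not valid for this engine/instance; skip it
    return ok

# Example with a stub probe that rejects db.application, mirroring the error above:
def stub_probe(group):
    if group == "db.application":
        raise ValueError("This group is not a known group")

print(supported_groups(stub_probe, ["db.sql_tokenized", "db.application"]))
```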
I followed the guide https://docs.aws.amazon.com/prometheus/latest/userguide/integrating-cw-firehose.html to set up the solution with CDK. All resources were created successfully. However, when I create a Metric Stream for the Firehose delivery stream created by CDK, I get the following error message when the lambda is invoked from Firehose:
2024/03/11 07:09:34
{
"errorMessage": "invalid character 'Þ' looking for beginning of value",
"errorType": "SyntaxError",
"stackTrace": [
{
"path": "github.com/aws/[email protected]/lambda/errors.go",
"line": 39,
"label": "lambdaPanicResponse"
},
{
"path": "github.com/aws/[email protected]/lambda/invoke_loop.go",
"line": 116,
"label": "callBytesHandlerFunc.func1"
},
{
"path": "runtime/panic.go",
"line": 884,
"label": "gopanic"
},
{
"path": "lambda/main.go",
"line": 84,
"label": "HandleRequest"
},
{
"path": "reflect/value.go",
"line": 586,
"label": "Value.call"
},
{
"path": "reflect/value.go",
"line": 370,
"label": "Value.Call"
},
{
"path": "github.com/aws/[email protected]/lambda/handler.go",
"line": 293,
"label": "reflectHandler.func2"
},
{
"path": "github.com/aws/[email protected]/lambda/invoke_loop.go",
"line": 119,
"label": "callBytesHandlerFunc"
},
{
"path": "github.com/aws/[email protected]/lambda/invoke_loop.go",
"line": 75,
"label": "handleInvoke"
},
{
"path": "github.com/aws/[email protected]/lambda/invoke_loop.go",
"line": 39,
"label": "startRuntimeAPILoop"
},
{
"path": "github.com/aws/[email protected]/lambda/entry.go",
"line": 106,
"label": "start"
},
{
"path": "github.com/aws/[email protected]/lambda/entry.go",
"line": 69,
"label": "StartWithOptions"
},
{
"path": "github.com/aws/[email protected]/lambda/entry.go",
"line": 45,
"label": "Start"
},
{
"path": "lambda/main.go",
"line": 119,
"label": "main"
},
{
"path": "runtime/proc.go",
"line": 250,
"label": "main"
},
{
"path": "runtime/asm_amd64.s",
"line": 1598,
"label": "goexit"
}
]
}
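The "invalid character 'Þ'" message suggests binary data being fed to a JSON decoder; metric stream payloads in OpenTelemetry format are protobuf-encoded, and Firehose record contents may also be gzip-compressed. A Python sketch for checking what a record actually contains, purely as a debugging aid (the magic-byte check is a heuristic):

```python
import base64
import gzip
import json

def inspect_record(b64_data: str) -> str:
    """Classify the payload of a base64-encoded Firehose record."""
    raw = base64.b64decode(b64_data)
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes
        return "gzip"
    try:
        json.loads(raw)
        return "json"
    except (ValueError, UnicodeDecodeError):
        return "binary (possibly protobuf)"

print(inspect_record(base64.b64encode(b'{"ok": true}').decode()))  # json
```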
I believe it would be a benefit to have clear documentation on trace sampling best practices. Users who are new to trace instrumentation will usually start by sampling 100% of spans, which is also the default behavior of the ADOT Collector. This can become an issue when migrating their observability setup to a large-scale solution: users could hit rate limits when exporting to the X-Ray backend, and also face large resource constraints if sampling is not implemented properly.
To better onboard end users to sampling at scale, the best practices guide could have clear documentation or point to existing resources to answer questions like the ones listed below.
What is sampling?
What are the different types of sampling?
When should I sample?
What should I sample?
How do I sample in the OpenTelemetry ecosystem?
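To make one of these answers concrete: head-based probabilistic sampling, the simplest scheme, keeps a fixed fraction of traces by deriving the decision deterministically from the trace ID so that every service agrees. A Python sketch of the idea (the hashing scheme is illustrative, not the exact algorithm any particular OpenTelemetry sampler uses):

```python
import hashlib

def keep_trace(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces.

    Hashing the trace ID (instead of calling random()) means every service
    makes the same decision for the same trace, so kept traces stay complete.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < rate

# At a 10% rate, ~1 in 10 trace IDs are kept, and always the same ones:
print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.10))
```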
The idea is to combine data from Kubecost and CUR in a single Grafana dashboard. Kubecost data would be stored in AMP and CUR data will be retrieved through Athena. The queries in Grafana should allow the filtering on various k8s semantics, like labels, jobs... the CUR data will be used to provide cost for other AWS services based on Cost Allocation Tags, these tags can be defined by the end user through a list they provide.
https://aws-observability.github.io/observability-best-practices/recipes/amp/recipes/eks-observability-accelerator.md reports 404 error. Please add the page/blog behind this link.
For example:
Value in the table: the response latency distribution in microseconds for each verb, resource, and subresource.
Actual definition: a cumulative counter which tracks the total time taken by the Kubernetes API server to process requests.
NOTE: this is for [R53 Resolver Query Logs](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver-query-logs.html), not for [Public DNS Query Logging](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/query-logs.html)
stats count(*) as numRequests by query_name
| sort numRequests desc
| limit 10
Pulls the number of DNS queries per domain in the Resolver Query Logging configuration and lists the top 10 in descending order.
Use this query to find the most resolved domains in the selected Route 53 Resolver Query Logging log group. Each query logging configuration can cover a single VPC or multiple VPCs in a region.
stats count(*) as numRequests by srcaddr
| sort numRequests desc
| limit 10
Pulls the top generators of DNS queries on Route 53 Resolver and lists the top 10 in descending order.
Use this query to find the top talkers (the clients issuing the most queries) on Route 53 Resolver. Each query logging configuration can cover a single VPC or multiple VPCs in a region.
stats count(*) as numRequests by query_name, srcaddr
| sort numRequests desc
| limit 10
Pulls the top queried DNS names, grouped by source IP, listing the top 10 in descending order.
Use this query to find the top talkers for the top queried domains on Route 53 Resolver. It can be useful for identifying which hosts generate the most queries for the top-queried domains. Each query logging configuration can cover a single VPC or multiple VPCs in a region.
filter firewall_rule_action = "ALERT"
| stats count(*) as numRequests by query_name, srcaddr
| sort numRequests desc
| limit 10
Pulls the top queried DNS names, grouped by source IP, but only for those domains flagged as ALERT by the Route 53 DNS Firewall, listing the top 10 in descending order.
Use this query to find the top talkers for the top queried ALERT-flagged domains on Route 53 Resolver. It can be useful for identifying which hosts generate the most queries for domains flagged as ALERT by the DNS Firewall. Each query logging configuration can cover a single VPC or multiple VPCs in a region.
Hey there!
This recipe: https://aws-observability.github.io/observability-best-practices/recipes/recipes/amg-automation-tf/
has the user manually creating an API key to use in their Terraform deployment (and then hardcoding it later).
It would be great if it could be updated to use the Terraform API key resource, https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/grafana_workspace_api_key, to create the API key automatically instead of having the user do so by hand.
I've tried to set up the Prometheus and Firehose metric exporter, but the problem is that we are getting 400s from AMP as a response in the lambda. Since the payload is proto, it's hard to debug this from the requests. If anyone has faced the same issue, please help with the debugging.
stats count(errorCode) as eventCount by eventSource, eventName, awsRegion, userAgent, errorCode
| sort eventCount desc
This query provides a quick way to surface the highest API error counts, grouped by error category, in a particular time window.
Use this query against the CloudTrail log groups (when CloudTrail is configured to log to CloudWatch Logs). Select the desired time frame for the query (e.g., the past 3 hours).
Hi Team,
I would like to know whether the Performance Insights metrics can support Amazon RDS for SQL Server on:
https://github.com/aws-observability/observability-best-practices/tree/main/sandbox/monitor-aurora-with-grafana
I deployed your solution and it's amazing (thanks for the hard work), but I can see that it's not working on RDS for SQL Server. Is there a roadmap to support it?
While reading the best practices guides I came across this line stating:
Metrics collected by Container Insights are charged as custom metrics.
However, after the release of Container Insights with Enhanced Observability last year, I believe AWS changed the billing of Container Insights. On the AWS CloudWatch pricing page they state the following:
Container Insights with enhanced observability uses a tiered pricing model based on observations.
https://aws.amazon.com/cloudwatch/pricing/
Now, I am not sure I understood all the documentation regarding pricing, but since the specific line I am mentioning was pushed months before the enhanced observability release, I thought this information might be outdated. 🙂
When deploying monitor-aurora-with-grafana, I am running into the following error:
Failed to create the changeset: Waiter ChangeSetCreateComplete failed: Waiter encountered a terminal failure state: For expression "Status" we matched expected path: "FAILED" Status: FAILED. Reason: Template format error: Unrecognized resource types: [AWS::Scheduler::Schedule]
stats count(errorCode) as eventCount by eventSource, eventName, awsRegion, userAgent, errorCode
| filter errorCode = 'ThrottlingException'
| sort eventCount desc
This query provides a quick way to surface the highest API throttling error counts, grouped by error category, in a particular time window. It can narrow down the API call and the AWS service in the region where the API requests are being throttled.
Use this query against the CloudTrail log groups (when CloudTrail is configured to log to CloudWatch Logs). Select the desired time frame for the query (ex. past 3 hours).