aws-observability / observability-best-practices
Observability best practices on AWS
Home Page: https://aws-observability.github.io/observability-best-practices/
License: MIT No Attribution
Hi Team,
You have indicated that apiserver_request_latencies_sum and apiserver_registered_watchers are valid control plane metrics. I am not able to see these in my cluster; are they only available in newer versions?
Any help will be much appreciated.
Is there a process to report and/or fix typos in the docs?
For example:
BTW, thanks for this amazing piece of work.
## Request for change
Kindly size the table correctly / wrap its contents so that they are fully visible to the user.
Hi team!
I recently published a tutorial (keeping best practices in mind) on monitoring an application's performance using CloudWatch Container Insights.
I would like to suggest that the EKS Observability best practices guide include a reference to the above tutorial. It adds a lot of value: beyond the setup process, it provides detailed guidance on using CloudWatch Logs Insights for effective searching and querying, as well as instructions for leveraging the CloudWatch dashboard for efficient monitoring.
I read through the EKS best practices guide and found that the section (highlighted in the screenshot below) would be a great place to reference it.
PR created.
Thanks in advance,
Ahmad (Containers Specialist SA, AWS)
VPC Flow Logs
The last 20 messages from the VPC Flow Logs, sorted in descending timestamp order. The $limit value in all the queries below can be adjusted to return a different number of messages as needed.
fields @timestamp, @message, @logStream, @log
| sort @timestamp desc
| limit 20
Filtering for messages with the REJECT action and aggregating them per 1-hour window.
fields @timestamp, @message, @logStream, @log | filter @message like /REJECT/
| stats count(*) as connCount by bin(1h)
| sort connCount desc
Filtering for messages with the REJECT action and aggregating them per 10-minute window.
fields @timestamp, @message, @logStream, @log | filter @message like /REJECT/
| stats count(*) as rejCount by bin(10m)
| sort rejCount desc
Filtering for messages within a date range. The timestamps must be converted to UNIX epoch time in milliseconds and used with the > and < range operators.
fields @timestamp, @message, @logStream, @log
| fields toMillis(@timestamp) as millisecs
| filter millisecs > 1675641600000 and millisecs < 1675718300000
| sort @timestamp desc
| limit 20
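The epoch values in the example above can be produced from human-readable dates before pasting them into the query. A minimal Python sketch (the two dates shown correspond to the example epoch values):

```python
from datetime import datetime, timezone

def to_epoch_millis(iso_date: str) -> int:
    """Convert an ISO-8601 date string (interpreted as UTC) to epoch milliseconds."""
    dt = datetime.fromisoformat(iso_date).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

print(to_epoch_millis("2023-02-06T00:00:00"))  # 1675641600000
print(to_epoch_millis("2023-02-06T21:18:20"))  # 1675718300000
```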
Filtering logs that have REJECT action in the flow logs.
fields @timestamp, @message, @logStream, @log | filter @message like /REJECT/
| sort @timestamp desc
| limit 20
Filtering logs that have ACCEPT action in the flow logs.
fields @timestamp, @message, @logStream, @log | filter @message like /ACCEPT/
| sort @timestamp desc
| limit 20
Filtering logs for selected source IP addresses with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter srcAddr like '192.168.20.' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for selected source IP addresses with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter srcAddr like '192.168.20.' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for specific source IP address with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter srcAddr = '192.168.20.42' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for specific source IP address, source port with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter srcAddr = '192.168.20.42' and srcPort = '1111' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for specific source IP address, source port with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter srcAddr = '192.168.20.42' and srcPort = '1111' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for specific destination IP address with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter dstAddr = '192.168.20.176' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for specific destination IP address with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter dstAddr = '192.168.20.176' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for specific destination IP address, destination port with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter dstAddr = '192.168.20.176' and dstPort = '1111' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for specific destination IP address, destination port with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter dstAddr = '192.168.20.176' and dstPort = '1111' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for a specific elastic network interface (interfaceId).
fields @timestamp, @message, @logStream, @log | filter interfaceId = 'eni-05c479afeb755895e'
| sort @timestamp desc
| limit 20
Filtering logs for a specific elastic network interface (interfaceId) with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter interfaceId = 'eni-05c479afeb755895e' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs for a specific elastic network interface (interfaceId) with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter interfaceId = 'eni-05c479afeb755895e' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for a specific AWS account (accountId) with action as ACCEPT.
fields @timestamp, @message, @logStream, @log | filter accountId = '133233069280' and action = 'ACCEPT'
| sort @timestamp desc
| limit 20
Filtering logs for a specific AWS account (accountId) with action as REJECT.
fields @timestamp, @message, @logStream, @log | filter accountId = '133233069280' and action = 'REJECT'
| sort @timestamp desc
| limit 20
Filtering logs with logStatus as OK, confirming data is being logged normally to the chosen destinations.
fields @timestamp, @message, @logStream, @log | filter logStatus = 'OK'
| sort @timestamp desc
| limit 20
Filtering logs with logStatus as NODATA, indicating there was no network traffic to or from the network interface during the aggregation interval.
fields @timestamp, @message, @logStream, @log | filter logStatus = 'NODATA'
| sort @timestamp desc
| limit 20
Filtering logs with logStatus as SKIPDATA, indicating some flow log records were skipped during the aggregation interval.
fields @timestamp, @message, @logStream, @log | filter logStatus = 'SKIPDATA'
| sort @timestamp desc
| limit 20
Filtering logs with protocol as 1, for ICMP traffic.
fields @timestamp, @message, @logStream, @log | filter protocol = '1'
| sort @timestamp desc
| limit 20
Filtering logs with protocol as 6, for TCP traffic.
fields @timestamp, @message, @logStream, @log | filter protocol = '6'
| sort @timestamp desc
| limit 20
Filtering logs with protocol as 17, for UDP traffic.
fields @timestamp, @message, @logStream, @log | filter protocol = '17'
| sort @timestamp desc
| limit 20
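The protocol values used in these filters are standard IANA protocol numbers. A small Python sketch for decoding them when reviewing results (the mapping covers only the protocols queried above):

```python
# IANA protocol numbers as they appear in the VPC Flow Logs protocol field
PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}

def protocol_name(number: int) -> str:
    """Return a readable name for a flow log protocol number."""
    return PROTOCOLS.get(number, f"protocol-{number}")

print(protocol_name(6))   # TCP
print(protocol_name(17))  # UDP
```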
While testing AWS CloudWatch Container Insights with Enhanced Observability, I felt the urge to provide a configuration better suited to our needs. Reading through your guides, the section about "Cost savings with Container Insights on Amazon EKS" was very interesting, in particular the subsection called "Customize Metrics and Dimensions".
The solution presented there seems easy to understand and very straightforward.
However, after the release of CloudWatch Container Insights with Enhanced Observability, the guides presented here seem out of date.
Why? Because I believe that if one follows the suggested installation method (EKS add-on), the documented otel-agent-conf is not used. But, looking at the log of the deployed cloudwatch-agent, it still uses the awsemfexporter. Hence, I believe it should still be possible to use the described metric_declaration:
I would be very thankful if someone could help me understand how we could move forward here.
There is an incorrect link, which is described here:
CoreDNS Metrics
CoreDNS is a flexible, extensible DNS server that can serve as the Kubernetes cluster DNS. The CoreDNS pods provide name resolution for all pods in the cluster. Running DNS intensive workloads can sometimes experience intermittent CoreDNS failures due to DNS throttling, and this can impact applications.
Check out the latest best practices for tracking key CoreDNS performance metrics here.
<< Refer above for the broken link >>
Hi, I am trying to import the dashboard linked below, but the file contains HTML source code instead of JSON.
sandbox/cure-grafana-dashboard/AmazonManagedGrafanaCUREDashboard.json
Thank you
Aurora PostgreSQL cluster log
Filtering logs that have ERROR messages in them
fields @timestamp, @message, @logStream, @log
| filter @message like "ERROR"
| sort @timestamp desc
Filtering logs that have ERROR messages and aggregating them over a required time window, such as 1 hour
fields @timestamp, @message, @logStream, @log
| filter @message like "ERROR"
| stats count(*) as errCount by bin(1h)
Filtering logs that have specific texts in the message field
fields @timestamp, @message, @logStream, @log
| filter @message like 'could not receive data from client: Connection reset by peer'
| sort @timestamp desc
Amazon RDS Error log - MySQL
Filtering logs that have Warning messages in them
fields @timestamp, @message, @logStream, @log
| filter @message like "Warning"
| sort @timestamp desc
MySQL Error Codes - https://dev.mysql.com/doc/mysql-errors/8.0/en/server-error-reference.html
Getting the logs based on specific MySQL error codes
fields @timestamp, @message, @logStream, @log
| filter @message like 'MY-010055'
| sort @timestamp desc
I implemented the monitor-aurora-with-grafana script as described without issue (all 4 steps in the CF stack were successful); however, each lambda invocation includes this error:
[ERROR] InvalidArgumentException: An error occurred (InvalidArgumentException) when calling the GetResourceMetrics operation: This group is not a known group: db.application is not valid for current resource
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 33, in lambda_handler
    pi_response = get_db_resource_metrics(instance)
  File "/var/task/lambda_function.py", line 78, in get_db_resource_metrics
    response = pi_client.get_resource_metrics(
  File "/var/runtime/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
and no metrics are published to CloudWatch.
I have confirmed the region is correct, and we have multiple databases in this region with Performance Insights enabled.
I am running Grafana 10.2.0 and trying to collect metrics from CloudWatch. I am running RDS Postgres 12.14 and using the lambda from the monitoring-aurora-with-grafana folder.
I ran into a first issue, where this line of code would fail due to sending over 1,000 data items:
if metric_data:
    logger.info('## sending data to cloudwatch...')
    try:
        cw_client.put_metric_data(
            Namespace=targetMetricNamespace,
            MetricData=metric_data)
    except ClientError as error:
        raise ValueError('The parameters you provided are incorrect: {}'.format(error))
I then changed the code to split the put call into batches of 500, and the error stopped happening:
result = []
max_elements = 500
for i in range(0, len(metric_data), max_elements):
    result.append(metric_data[i:i + max_elements])
if metric_data:
    for entry in result:
        logger.info('## sending data to cloudwatch...')
        try:
            cw_client.put_metric_data(
                Namespace=targetMetricNamespace,
                MetricData=entry)
        except ClientError as error:
            raise ValueError('The parameters you provided are incorrect: {}'.format(error))
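The batching logic above can be factored into a small reusable helper. A sketch (the 500-item batch size is conservative relative to the PutMetricData per-call quota; variable names such as cw_client and targetMetricNamespace are the reporter's and are assumed to exist):

```python
def chunked(items, size=500):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Usage against the reporter's code (sketch):
# for batch in chunked(metric_data):
#     cw_client.put_metric_data(Namespace=targetMetricNamespace, MetricData=batch)
```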
After this, it started erroring out from the dbSliceGroup line of code:
dbSliceGroup = { "db.sql_tokenized", "db.application", "db.wait_event", "db.user", "db.session_type", "db.host", "db", "db.application" }
I got some exceptions from the lambda, where initially db.session_type was considered invalid; then a couple of others, which I did not take note of, also failed. I then commented everything out, leaving only db.sql_tokenized:
dbSliceGroup = {
"db.sql_tokenized",
#"db.application",
#"db.wait_event",
#"db.user",
#"db.session_type",
#"db.host",
#"db",
#"db.application"
}
Things then worked out and I managed to see the metrics in the AuroraMonitoringGrafana/PerformanceInsightMetrics namespace. Unfortunately, the metrics do not show up in Grafana, despite adding the custom namespaces in the configs. I am using the dashboard included in this repo for the Aurora use case.
Is there any subtlety involved in making these metrics work? I do not know where to look further; some assistance would be appreciated.
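Rather than commenting dimension groups out by hand, one workaround is to probe each group once and keep only those the engine accepts, since supported groups vary by database engine and Performance Insights configuration. A hedged sketch: the probe callable here is a stand-in for a cheap pi_client.get_resource_metrics call against the instance.

```python
def supported_groups(probe, groups):
    """Return only the dimension groups for which probe(group) does not raise.

    probe: a callable that issues one inexpensive GetResourceMetrics request
    for the given group, raising (e.g. InvalidArgumentException) if the group
    is not valid for the instance.
    """
    ok = []
    for group in groups:
        try:
            probe(group)
            ok.append(group)
        except Exception:
            pass  # group not valid for this engine/instance; skip it
    return ok

# Example with a stub probe that rejects db.application, mirroring the error above:
def stub_probe(group):
    if group == "db.application":
        raise ValueError("This group is not a known group")

print(supported_groups(stub_probe, ["db.sql_tokenized", "db.application"]))
```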
I followed the guide https://docs.aws.amazon.com/prometheus/latest/userguide/integrating-cw-firehose.html to set up the solution with CDK. All resources were created successfully. However, when I create a Metric Stream for the Firehose delivery stream created by CDK, I get the following error message when the lambda is invoked from Firehose:
2024/03/11 07:09:34
{
"errorMessage": "invalid character 'Þ' looking for beginning of value",
"errorType": "SyntaxError",
"stackTrace": [
{
"path": "github.com/aws/[email protected]/lambda/errors.go",
"line": 39,
"label": "lambdaPanicResponse"
},
{
"path": "github.com/aws/[email protected]/lambda/invoke_loop.go",
"line": 116,
"label": "callBytesHandlerFunc.func1"
},
{
"path": "runtime/panic.go",
"line": 884,
"label": "gopanic"
},
{
"path": "lambda/main.go",
"line": 84,
"label": "HandleRequest"
},
{
"path": "reflect/value.go",
"line": 586,
"label": "Value.call"
},
{
"path": "reflect/value.go",
"line": 370,
"label": "Value.Call"
},
{
"path": "github.com/aws/[email protected]/lambda/handler.go",
"line": 293,
"label": "reflectHandler.func2"
},
{
"path": "github.com/aws/[email protected]/lambda/invoke_loop.go",
"line": 119,
"label": "callBytesHandlerFunc"
},
{
"path": "github.com/aws/[email protected]/lambda/invoke_loop.go",
"line": 75,
"label": "handleInvoke"
},
{
"path": "github.com/aws/[email protected]/lambda/invoke_loop.go",
"line": 39,
"label": "startRuntimeAPILoop"
},
{
"path": "github.com/aws/[email protected]/lambda/entry.go",
"line": 106,
"label": "start"
},
{
"path": "github.com/aws/[email protected]/lambda/entry.go",
"line": 69,
"label": "StartWithOptions"
},
{
"path": "github.com/aws/[email protected]/lambda/entry.go",
"line": 45,
"label": "Start"
},
{
"path": "lambda/main.go",
"line": 119,
"label": "main"
},
{
"path": "runtime/proc.go",
"line": 250,
"label": "main"
},
{
"path": "runtime/asm_amd64.s",
"line": 1598,
"label": "goexit"
}
]
}
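The "invalid character 'Þ'" message suggests binary data being fed to a JSON decoder; metric stream payloads in OpenTelemetry format are protobuf-encoded, and Firehose record contents may also be gzip-compressed. A Python sketch for checking what a record actually contains, purely as a debugging aid (the magic-byte check is a heuristic):

```python
import base64
import gzip
import json

def inspect_record(b64_data: str) -> str:
    """Classify the payload of a base64-encoded Firehose record."""
    raw = base64.b64decode(b64_data)
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes
        return "gzip"
    try:
        json.loads(raw)
        return "json"
    except (ValueError, UnicodeDecodeError):
        return "binary (possibly protobuf)"

print(inspect_record(base64.b64encode(b'{"ok": true}').decode()))  # json
```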
I believe it would be a benefit to have clear documentation on trace sampling best practices. Users who are new to trace instrumentation will usually start by sampling 100% of spans, which is also the default behavior of the ADOT Collector. This can become an issue when migrating their observability setup to a large-scale solution: users could hit rate limits when exporting to the X-Ray backend, and also face large resource constraints if sampling is not implemented properly.
To better onboard end users to sampling at scale, the best practices guide could have clear documentation or point to existing resources to answer questions like the ones listed below.
What is sampling?
What are the different types of sampling?
When should I sample?
What should I sample?
How do I sample in the OpenTelemetry ecosystem?
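To make one of these answers concrete: head-based probabilistic sampling, the simplest scheme, keeps a fixed fraction of traces by deriving the decision deterministically from the trace ID so that every service agrees. A Python sketch of the idea (the hashing scheme is illustrative, not the exact algorithm any particular OpenTelemetry sampler uses):

```python
import hashlib

def keep_trace(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces.

    Hashing the trace ID (instead of calling random()) means every service
    makes the same decision for the same trace, so kept traces stay complete.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < rate

# At a 10% rate, ~1 in 10 trace IDs are kept, and always the same ones:
print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.10))
```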
The idea is to combine data from Kubecost and CUR in a single Grafana dashboard. Kubecost data would be stored in AMP and CUR data will be retrieved through Athena. The queries in Grafana should allow the filtering on various k8s semantics, like labels, jobs... the CUR data will be used to provide cost for other AWS services based on Cost Allocation Tags, these tags can be defined by the end user through a list they provide.
https://aws-observability.github.io/observability-best-practices/recipes/amp/recipes/eks-observability-accelerator.md reports 404 error. Please add the page/blog behind this link.
For example:
Value in the table: the response latency distribution in microseconds for each verb, resource, and subresource.
Actual definition: a cumulative counter which tracks the total time taken by the Kubernetes API server to process requests.
NOTE: this is for [R53 Resolver Query Logs](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver-query-logs.html), not for [Public DNS Query Logging](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/query-logs.html)
stats count(*) as numRequests by query_name
| sort numRequests desc
| limit 10
Pulls the number of DNS queries per domain in the Resolver Query Logging configuration and lists the top 10 in descending order.
Use this query to find the most resolved domains in the selected Route 53 Resolver Query Logging log group. Each query logging configuration can cover a single VPC or multiple VPCs in a region.
stats count(*) as numRequests by srcaddr
| sort numRequests desc
| limit 10
Pulls the top generators of DNS queries on Route 53 Resolver and lists the top 10 in descending order.
Use this query to find the top talkers (the clients issuing the most queries) on Route 53 Resolver. Each query logging configuration can cover a single VPC or multiple VPCs in a region.
stats count(*) as numRequests by query_name, srcaddr
| sort numRequests desc
| limit 10
Pulls the top queried DNS names, grouped by source IP, listing the top 10 in descending order.
Use this query to find the top talkers for the top queried domains on Route 53 Resolver. It can be useful for identifying which hosts generate the most queries for the top-queried domains. Each query logging configuration can cover a single VPC or multiple VPCs in a region.
filter firewall_rule_action = "ALERT"
| stats count(*) as numRequests by query_name, srcaddr
| sort numRequests desc
| limit 10
Pulls the top queried DNS names, grouped by source IP, but only for those domains flagged as ALERT by the Route 53 DNS Firewall, listing the top 10 in descending order.
Use this query to find the top talkers for the top queried ALERT-flagged domains on Route 53 Resolver. It can be useful for identifying which hosts generate the most queries for domains flagged as ALERT by the DNS Firewall. Each query logging configuration can cover a single VPC or multiple VPCs in a region.
Hey there!
This recipe: https://aws-observability.github.io/observability-best-practices/recipes/recipes/amg-automation-tf/
has the user manually creating an API key to use in their Terraform deployment (and then hardcoding it later).
It would be great if it could be updated to use the Terraform API key resource, https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/grafana_workspace_api_key, to create the API key automatically instead of having the user do so by hand.
I've tried to set up the Prometheus and Firehose metric exporter, but the problem is that we are getting 400s from AMP as a response in the lambda. Since the payload is proto, it's hard to debug this from the requests. If anyone has faced the same issue, please help with the debugging.
stats count(errorCode) as eventCount by eventSource, eventName, awsRegion, userAgent, errorCode
| sort eventCount desc
This query provides a quick way to surface the highest API error counts, grouped by error category, in a particular time window.
Use this query against the CloudTrail log groups (when CloudTrail is configured to log to CloudWatch Logs). Select the desired time frame for the query (e.g., the past 3 hours).
Hi Team,
I would like to know whether the Performance Insights metrics can support Amazon RDS for SQL Server on:
https://github.com/aws-observability/observability-best-practices/tree/main/sandbox/monitor-aurora-with-grafana
I deployed your solution and it's amazing (thanks for the hard work), but I can see that it's not working on RDS for SQL Server. Is there a roadmap to support it?
While reading the best practices guides I came across this line stating:
Metrics collected by Container Insights are charged as custom metrics.
However, after the release of Container Insights with Enhanced Observability last year, I believe AWS changed the billing of Container Insights. On the AWS CloudWatch pricing page they state the following:
Container Insights with enhanced observability uses a tiered pricing model based on observations.
https://aws.amazon.com/cloudwatch/pricing/
Now, I am not sure I understood all the documentation regarding pricing, but since the specific line I am mentioning was pushed months before the enhanced observability release, I thought this information might be outdated. 🙂
When deploying monitor-aurora-with-grafana, I am running into the following error:
Failed to create the changeset: Waiter ChangeSetCreateComplete failed: Waiter encountered a terminal failure state: For expression "Status" we matched expected path: "FAILED" Status: FAILED. Reason: Template format error: Unrecognized resource types: [AWS::Scheduler::Schedule]
stats count(errorCode) as eventCount by eventSource, eventName, awsRegion, userAgent, errorCode
| filter errorCode = 'ThrottlingException'
| sort eventCount desc
This query provides a quick way to surface the highest API throttling error counts, grouped by error category, in a particular time window. It can narrow down the API call and the AWS service in the region where the API requests are being throttled.
Use this query against the CloudTrail log groups (when CloudTrail is configured to log to CloudWatch Logs). Select the desired time frame for the query (ex. past 3 hours).