aws / amazon-codeguru-profiler-python-agent

Home Page: https://docs.aws.amazon.com/codeguru/latest/profiler-ug/what-is-codeguru-profiler.html

License: Apache License 2.0

amazon-codeguru-profiler-python-agent's Introduction

Amazon CodeGuru Profiler Python Agent

For more details, check the documentation: https://docs.aws.amazon.com/codeguru/latest/profiler-ug/what-is-codeguru-profiler.html

How to use it

This package is being released to PyPI as codeguru-profiler-agent, so use it as any Python package from PyPI.

For a demo application that uses this agent, check our demo application.

How to contribute

Check the GitHub repository at aws/amazon-codeguru-profiler-python-agent.

See CONTRIBUTING for details.

How to release

See DEVELOPMENT for more information.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License. See LICENSE for details.

amazon-codeguru-profiler-python-agent's Issues

Report profile before sample to avoid incorrect profile end time

profile_reporter attempts to sample before reporting the profile. This behavior may not be ideal for Lambda applications.

If we sample before reporting the profile, the following scenario can happen:

A profile contains 99% of its data for a certain time period (e.g. 11:00-11:05). If the Lambda container freezes just before the reporter should report the profile and resumes after 20 minutes, the profile would contain 1 sample for 11:25 and its start and end times would be 11:00 - 11:30, which gives a misleading picture of the profile.

We should also set the end time of the profile to the last sample time instead of the actual reporting time, to avoid the confusion described above.
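The proposed ordering can be sketched roughly as follows. This is only an illustration, not the agent's actual code: ProfileSketch and tick are hypothetical names, timestamps are plain seconds, and "reporting" is modeled as appending a tuple to a list.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProfileSketch:
    """Hypothetical stand-in for the agent's profile object."""
    start: float
    sample_times: List[float] = field(default_factory=list)

    def add_sample(self, timestamp: float) -> None:
        self.sample_times.append(timestamp)

    def end_time(self) -> float:
        # Use the last sample time as the end, not the reporting time.
        return self.sample_times[-1] if self.sample_times else self.start


def tick(profile: ProfileSketch, now: float, reporting_interval: float,
         reported: List[Tuple[float, float, int]]) -> ProfileSketch:
    """Report first (if due), then sample into the current profile."""
    if now - profile.start >= reporting_interval:
        # The stale profile is flushed before the new sample is taken,
        # so a sample taken after a long freeze cannot stretch it.
        reported.append((profile.start, profile.end_time(),
                         len(profile.sample_times)))
        profile = ProfileSketch(start=now)
    profile.add_sample(now)
    return profile
```

With this ordering, a sample taken after a 20-minute freeze lands in a fresh profile, and the reported profile ends at its last real sample.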

TypeError on Python 3.11

When running CPython 3.11 workloads on Fargate, CodeGuru Profiler sometimes crashes:

2023-11-14 20:27:16,412 - codeguru_profiler_agent.profiler_runner - INFO - An unexpected issue caused the profiling command to terminate.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/profiler_runner.py", line 74, in _profiling_command
    sample_result = self._run_profiler()
                    ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/metrics/with_timer.py", line 26, in timed
    result = fn(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/profiler_runner.py", line 98, in _run_profiler
    self._sample_and_aggregate()
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/metrics/with_timer.py", line 26, in timed
    result = fn(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/profiler_runner.py", line 103, in _sample_and_aggregate
    sample = self.sampler.sample()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/metrics/with_timer.py", line 26, in timed
    result = fn(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/sampler.py", line 46, in sample
    stacks = self._get_stacks(
             ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/sampling_utils.py", line 34, in get_stacks
    stacks.append(_extract_frames(end_frame, max_depth))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/sampling_utils.py", line 114, in _extract_frames
    stack_entries = _extract_stack(stack, max_depth)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/sampling_utils.py", line 76, in _extract_stack
    _maybe_append_synthetic_frame(result, last_frame, last_frame_line_no)
  File "/usr/local/lib/python3.11/site-packages/codeguru_profiler_agent/sampling_utils.py", line 100, in _maybe_append_synthetic_frame
    line = linecache.getline(frame.f_code.co_filename, line_no).strip()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/linecache.py", line 31, in getline
    if 1 <= lineno <= len(lines):
       ^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '<=' not supported between instances of 'int' and 'NoneType'

It appears as if the issue occurs in sampling_utils.py: last_frame_line_no may become None and that case is unhandled.

Could this be due to changes to the traceback module in Python 3.11?
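A defensive guard along these lines might avoid the crash. It is a minimal sketch and not the agent's code: safe_getline is a hypothetical helper, and it assumes that treating a missing line number as "no source line available" is acceptable to the caller.

```python
import linecache
from typing import Optional


def safe_getline(filename: str, line_no: Optional[int]) -> str:
    """Return the stripped source line, or "" when line_no is None."""
    if line_no is None:
        # linecache.getline compares the line number with integers and
        # raises TypeError on None, so bail out before calling it.
        return ""
    return linecache.getline(filename, line_no).strip()
```

linecache.getline already returns an empty string for unreadable files or out-of-range lines, so returning "" for None keeps the behavior consistent with those cases.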

requirements.txt missing in PyPI source release

Hi,

the setup.py file currently gets the necessary requirements from requirements.txt as specified here:

amazon-codeguru-profiler-python-agent/setup.py

Line 7 in ea6a2f1

REQUIREMENTS = [i.strip() for i in open("requirements.txt").readlines()]

This is not included with the source release in PyPI which makes building from source impossible.

Would it be possible to include that again to be able to build without using the wheel?

I noticed this because I maintain the conda-forge feedstock for this and it was caught in our CI, as can be seen here: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=278441&view=logs&j=656edd35-690f-5c53-9ba3-09c10d0bea97&t=e5c8ab1d-8ff9-5cae-b332-e15ae582ed2d

Edit: This was caught in the CI but I was able to verify this locally as well.
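For illustration, the requirements could be read in a way that does not depend on the current working directory. This is a sketch under assumptions: read_requirements is a hypothetical helper, not what setup.py currently does, and it does not by itself solve the packaging problem, since requirements.txt still has to be shipped in the source distribution (with setuptools, one common way is an "include requirements.txt" line in MANIFEST.in).

```python
from pathlib import Path
from typing import List


def read_requirements(path: str) -> List[str]:
    """Parse a requirements file, skipping blank lines and comments."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]
```

In setup.py this could be called with a path built from Path(__file__).parent, so the build works regardless of where it is invoked from.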

Missing data for processes started with ProcessPoolExecutor

We have a Flask application with Gunicorn running on AWS Fargate. Part of our business logic is executed in parallel using ProcessPoolExecutor. The executor and its worker processes are reused and long-running.
We have noticed that profiling data is missing for the logic executed in the worker processes. We attempted to start a new profiler in each worker process using the ProcessPoolExecutor initializer, but this failed with a message saying that we can't start the profiler twice within the same process.

Failing to get EC2 instance metadata from the IMDSv2 API when running on EC2 instances

The amazon-codeguru-profiler-python-agent fails to get EC2 instance metadata from the IMDSv2 API when running on EC2 instances because it uses a GET request to fetch the token when a PUT request is required. I don't really know what the impact is because, as the logs note: "Unable to get Ec2 instance metadata, this is normal when running in a different environment (e.g. Fargate), profiler will still work". So the profiler seems to still work on EC2 instances but just doesn't have the EC2 instance metadata.

Simplified program to reproduce issue:

import logging
import time

from codeguru_profiler_agent import Profiler

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    Profiler(profiling_group_name='MyProfilingGroup').start()
    time.sleep(30)

Relevant logs from running the program on an EC2 instance with IMDSv2, using Python 3.7 and codeguru-profiler-agent 1.2.4:

...
DEBUG:codeguru_profiler_agent.agent_metadata.fleet_info:Making a request to http://169.254.169.254/latest/api/token with headers set for these keys: dict_keys(['X-aws-ec2-metadata-token-ttl-seconds'])
INFO:codeguru_profiler_agent.agent_metadata.aws_ec2_instance:Unable to get Ec2 instance metadata, this is normal when running in a different environment (e.g. Fargate), profiler will still work
DEBUG:codeguru_profiler_agent.agent_metadata.aws_ec2_instance:Caught exception: 
Traceback (most recent call last):
  File "/opt/env/lib/python3.7/site-packages/codeguru_profiler_agent/agent_metadata/aws_ec2_instance.py", line 68, in look_up_metadata
    token = cls.__look_up_ec2_api_token()
  File "/opt/env/lib/python3.7/site-packages/codeguru_profiler_agent/agent_metadata/aws_ec2_instance.py", line 62, in __look_up_ec2_api_token
    headers={EC2_METADATA_TOKEN_TTL_HEADER_KEY: EC2_METADATA_TOKEN_TTL_HEADER_VALUE}) \
  File "/opt/env/lib/python3.7/site-packages/codeguru_profiler_agent/agent_metadata/fleet_info.py", line 26, in http_get
    return request.urlopen(req, timeout=METADATA_URI_TIMEOUT_SECONDS)  # nosec
  File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/local/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 405: Not Allowed
...

Looks like the issue was introduced in this PR here: #24 and the test didn't catch it because the mock doesn't match the IMDSv2 API responses.

For the instance metadata docs, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html#instance-metadata-v2-how-it-works and https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html. They note that only a PUT request is allowed for fetching the token, while the agent uses a GET request: https://github.com/aws/amazon-codeguru-profiler-python-agent/blob/main/codeguru_profiler_agent/agent_metadata/aws_ec2_instance.py#L61 which is why the request fails and returns HTTP Error 405: Not Allowed.

So the fix looks to be changing the agent to use a PUT request to get the token and fixing the corresponding test.
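A PUT-based token request could look roughly like this with urllib, which the agent already uses according to the traceback. This is a sketch rather than the actual patch: build_token_request and fetch_token are hypothetical names, and the 21600-second TTL and 1-second timeout are assumed values. The URL and header name are taken from the debug log above.

```python
from urllib import request

TOKEN_URL = "http://169.254.169.254/latest/api/token"
TTL_HEADER = "X-aws-ec2-metadata-token-ttl-seconds"


def build_token_request(ttl_seconds: int = 21600) -> request.Request:
    """Build the IMDSv2 token request; method must be PUT, not GET."""
    return request.Request(
        TOKEN_URL,
        method="PUT",
        headers={TTL_HEADER: str(ttl_seconds)},
    )


def fetch_token(timeout: float = 1.0) -> str:
    """Fetch the session token; only succeeds on an EC2 instance."""
    with request.urlopen(build_token_request(), timeout=timeout) as resp:  # nosec
        return resp.read().decode("utf-8")
```

Separating request construction from the network call also makes it straightforward for a test to assert on the HTTP method, which is the detail the existing mock missed.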

Congrats 🎉

Just wanted to say hi! It's really awesome to see this codebase out in the public, big congrats!
I still see some of my mistakes in it -- I won't tell you where, to make it interesting xD

Hope everything's going well with the team. Hoping to see the Java agent out in public as well!

Let the best agent and product win ;)

Conda

Hi,

this is more of an FYI than an issue but I added the profiler to conda-forge: https://github.com/conda-forge/codeguru_profiler_agent-feedstock

Feel free to ping me if anyone here would like to be added as a maintainer there.

Improvement in CPU usage checker

ProfilerDisabler's CPU usage checker calculates the profiler's CPU usage as the average runProfiler time divided by the sampling interval.

runProfiler includes the time spent in refreshConfig, submitProfile, sampling, and sample aggregation.

refreshConfig and submitProfile are called every report interval (e.g. 5 minutes), while sampling and sample aggregation are called every sampling interval (e.g. 1 second).

To get a better estimate of the CPU overhead, we should consider sampling + sample aggregation separately from refreshConfig + submitProfile.

The sampling + sample aggregation overhead can stick with what CPUUsageChecker is doing now. A separate mechanism should be used for estimating the CPU overhead of submitProfile and refreshConfig.
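One way to picture the split is to divide each kind of work by the interval at which it actually runs and then add the two fractions. The function below is only a sketch of that arithmetic: estimate_cpu_overhead and its parameter names are hypothetical, and it assumes average durations are already being measured for each group of operations.

```python
def estimate_cpu_overhead(avg_sampling_seconds: float,
                          sampling_interval_seconds: float,
                          avg_reporting_seconds: float,
                          reporting_interval_seconds: float) -> float:
    """Return the estimated fraction of time spent in the profiler."""
    # Sampling + aggregation runs once per sampling interval.
    sampling_overhead = avg_sampling_seconds / sampling_interval_seconds
    # refreshConfig + submitProfile run once per reporting interval.
    reporting_overhead = avg_reporting_seconds / reporting_interval_seconds
    return sampling_overhead + reporting_overhead
```

For example, 10 ms of sampling work every 1 second plus 3 seconds of reporting work every 5 minutes would each contribute about 1%, for roughly 2% in total.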

Profiling decorator overrides decorated function's name.

Hi all,

Found a small issue that has been bothering me for a while, so finally took the time to create an issue + PR.

When you use the with_lambda_profiler decorator, it overrides the name of the function with profiler_decorate. Normally this isn't very visible, but if you are using XRay as well, it becomes painfully obvious, as per this screenshot:

The fix is minor, so I'll just create a PR for it here now.

Thanks
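For reference, the usual way to preserve a decorated function's name in Python is functools.wraps. The snippet below is a generic sketch and not the agent's decorator or the actual PR: with_profiler_sketch is a hypothetical stand-in, and the place where a real decorator would start and stop profiling is only marked with a comment.

```python
import functools


def with_profiler_sketch(func):
    """Hypothetical decorator that keeps the wrapped function's metadata."""
    @functools.wraps(func)  # copies __name__, __doc__, etc. from func
    def profiler_decorate(event, context):
        # A real decorator would start and stop profiling around this call.
        return func(event, context)
    return profiler_decorate


@with_profiler_sketch
def handler(event, context):
    """Example Lambda-style handler."""
    return "ok"
```

With functools.wraps applied, handler.__name__ stays "handler" instead of becoming "profiler_decorate", so tools that read the function name report the original handler.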