o3de / sig-testing Goto Github PK

Documentation and materials for the Open 3D Engine Test Special Interest Group

sig-testing's Introduction

O3DE Testing Special Interest Group

SIG Testing Charter

This group exists to make testing O3DE software easy and automated. For more information on what we do and how to join us, read our Charter

Meetings

Meeting notes are available from our Meetings which take place on Discord. Anyone is welcome to join these meetings and politely listen or contribute.

The full table of previous meetings is available here.

Documentation

This repository contains our own documentation on how to operate this SIG, separate from the user-facing documentation at O3DE.org

General Resources

See O3DE Resources for additional information related to any SIG in O3DE.

Licensing

The O3DE/foundation repository and all contributions under this repository herein are licensed under Creative Commons Attribution 4.0 International License. Please note that, by contributing to this repository, whether via commit, pull request, issue, comment, or in any other fashion, you are explicitly agreeing that all of your contributions will fall under the same permissive license.

sig-testing's People

Contributors

Stargazers

Watchers

Forkers

evanchia-ly-sdets vincent6767 smurly lb-jaroslawgaweda lb-tomaszserwanski lb-michalrodzos

sig-testing's Issues

Define Metrics for Debugability

Summary

Currently, most design decisions and work for debugability are done through either grass-roots movements where small groups of developers make a change, or a contributor pushes for a specific improvement. It is consequently unclear what metrics have been important in past decisions. We want to design and implement metrics so that such decisions can be supported by appropriate data and to measure the effectiveness of changes made because of these decisions.

Selecting and measuring appropriate metrics for debugability is challenging. Ultimately, debugability has a significant human element that is difficult to quantify. However, it is still valuable in measuring what can be measured, as well as acknowledging the incomplete picture that these metrics provide.

There is also a wide range of improvements that can be done for debugability timelines that cross over into other areas. For example, faster build times also improves debugability timelines. However, to limit the scope of this RFC, we are going to limit this document to discussing elements that are unique to debugability.

We are suggesting to measure a few key debugging times and polling customers on their debugging experience.

What is the motivation for this suggestion?

Primarily, the motivation for this is to enable metrics-driven decisions around improvements to debugability. Diving a little deeper, these metrics will allow us to more intelligently focus our efforts in improving debugability, specifically in improving the timelines for discovering and fixing bugs. It will also allow us to prioritize which problems are most severe.

We also want to be able to track the impact of decisions we make to improve debugability. Measuring metrics and their subsequent improvement is one way to do this. This will allow us to provide visibility to the larger organization by a metrics-driven approach and provide support for claims of debugability improvements.

Design Description

Debugging Steps

Here are some common steps of finding and fixing a bug:

The error is reported. This is done by Jenkins sending out an e-mail, although developers can also view the Jenkins build through its web UI.
The log is investigated. This can be found through the web UI for Jenkins. (or in a GitHub issue)
1. This may include artifact investigation, which is also found through the web UI for Jenkins
The bug is reproduced locally. This step is not always required, but often is. Depending on the developer's workstation state, the nature of the bug, and the developer's preferences, the following sub-steps may be required:
1. Sync the commit where the bug was found.
2. Build the commit in the appropriate flavor.
3. Add breakpoints. This is a less common step.
4. Add new output. This is a less common step.
5. Run the appropriate executable(s).
6. Run the appropriate test(s).
Make changes to actually fix the bug.
1. If no test existed to catch the bug, add an appropriate test(s) to prevent the bug from regressing.

Measuring Debugging Time

Each of the steps above can be measured in time, although the value of those data points will vary wildly. Many of them are too dependent on the bug, can be done in parallel, or are out of scope for debugability. In order to ensure consistent measuring, we will perform a UX study where research subjects are monitored by a researcher. The researcher will measure the time each step takes with a stopwatch, record the results, and collect other feedback.

Local reproduction is a borderline case - things like build time are out of scope, but making sure that the bug can be reproduced locally reliably are in scope. Further, there's a ton of variables involved that will make any metrics we do measure exceptionally noisy. Did they have a partial sync already? Did they have a build already? Do they prefer to debug via logs or debug via the debugger? All of these will cause the time to vary wildly while not actually indicating the ease of the debugging as defined as in-scope for this RFC. Given that local reproduction is more of a binary question (Can we reproduce the bug locally?) given the scope of this RFC, we will not be including it as a suggested timed metric.

Debugging Time Metrics

Thus, this leaves us with 3 steps to measure for time:

Error reporting time: This step would have three parts. Currently, Jenkins only outputs an error e-mail when all builds in a run are finished, but changes could be made to report an issue as soon as a build in a run has failed.
- Time when failure is first signaled inside Jenkins job
- Time when Jenkins sends an e-mail reporting the failure to the user (or a bot posts a failure message)
- Time when user takes action on the notification and starts working on their machine (modifying code, or starting a build)
Log-finding time: This step would have two parts. We expect this to be a small portion of the total time consumed, but it is still valuable to reduce it where possible.
- Time it takes from notification to finding the correct log
- Time it takes (after finding the correct log) to parse enough information in the log to take action locally
Artifact-finding time: We expect this to be a small portion of the total time consumed, but it is still valuable to reduce it where possible.

The first one is a binary option - we can either send the e-mail at the end of the build, or upon the first failure. But measuring that difference could allow us to decide if the earlier messaging time is worthwhile. The second two options are going to have to be user-reported metrics.

Metric	Description	Measured By
Failure first signaled	Time to see first failure in Jenkins	Automated script
Failure e-mail sent	How long it takes for an e-mail/message to go out after a failure is found	Automated script
Log-finding	How long it takes & how many clicks to find the appropriate log from a failed Jenkins run main page	Polling, customer reports
Log-parsing	How long it takes to find the actual error message in the appropriate log	Polling, customer reports
Artifact-finding	How long it takes & how many clicks to find artifacts from a failed Jenkins run main page	Polling, customer reports

Steps to Achieve Goal

Even if a given objective is swift to complete, having too many steps (often in tedious clicks), can be a frustrating developer experience. Thus, we will measure the amount of steps a given goal took prior to changes and after changes. This metric has the advantage of not relying on polling, either, as we can independently measure this.

Polling

While potentially less reliable than absolute metrics such as measuring time, polling customers for their preferences and pain points is still valuable. We would reach out to the sig-ux in order to develop questions. Measuring user preferences is not a primary skill set of our team and we would want to lean on experts to ensure an effective solution for our customers.

In addition to broad polls, we will include deeper individual customer reports where we work with a single developer to examine their experience directly. For this step, we would work with a small sample of specific developers to get a detailed view of their debugging experience before and after changes. While this will take more time, it will provide more reliable results and allow us to guarantee accuracy in our polling.

Checks

As a part of polling, we want to ensure that certain key checks are hit. These are:

Were you able to reproduce the bug locally, the first time?
Were you able to reproduce the bug locally?
Were you able to reproduce the bug?

These are, in order, valuable benchmarks to hit. Reproduction of a bug is key in fixing it. Being able to reproduce it locally allows a developer to glean important information in fixing it. Finally, being able to reproduce it the first time is helpful in simply saving developer time by not forcing them to repeat steps.

Implementation

Write script that scrapes Jenkins build logs to measure Failure first signaled and Failure e-mail sent metrics
Perform external developer polling (this would be where most questions would be) (this would include log-finding, log-parsing, artifact-finding metrics)

Possible Outcomes / Scenarios

Between metrics and polling, we will be able to generate a better picture of our pain sources and problems than we have now. While we won't be able to predict every possible outcome of these metrics and polling, we will attempt to identify several likely ones and comment briefly on our plans for those outcomes.

Outcome/Scenario	Planned Steps
Problems are not being logged due to unclear issue-filing process	Provide/improve documentation on issue-filing, run trainings on issue-filing
Problems are not being logged due to effort required	Provide quick-links and other shortcuts, create issue templates
Problems are being logged, but are not being actioned on	Use metrics to provide data to prioritize issues higher, reach out to appropriate owners

What are the advantages of the suggestion?

Provides absolute metrics in debugability (see table above)
Polls customers directly to get their experiences
Focuses on the key aspects of debugability and provides a limited scope
Low investment cost

What are the disadvantages of the suggestion?

Not all aspects are measured with metrics, meaning that some pain points may be under-examined
Metrics are only a partial picture of the problem, potentially supporting some solutions more than they merit (this can be mitigated by seeking customer feedback)
Customer opinions may change over time, requiring multiple polls (may require setting a particular cadence in order to maximize responses)
Some issues found by metrics/polling may not be relevant to the scope

Are there any alternatives to this suggestion?

This suggestion has multiple components that are not coupled, so each component can be considered individually. This runs the risk of giving a partial picture, but perhaps the lesser amount of investment is worth the lesser result.

We could implement metrics on every step of the debugging process. This has a high risk of noise, as certain steps running long are way less impactful than others (builds can be run overnight, for example, while finding the logs & artifacts is an active step). Further, many of these metrics can be addressed by tasks out of scope for this RFC.

We could skip the detailed customer experience poll. Broad polling often has low engagement, however, so a more direct interaction is expected to provide better results, even if it is at a higher cost.

What is the strategy for adoption?

First, this would require reaching out to coordinate with the sig-ux to develop polling questions. Metrics and polling both require customer input, so this would require a customer outreach campaign. We would also have a small number of direct customer interactions. We would have to collate these results, then make our changes, then poll again to see if our changes have had the desired effects.

Define proactive outreach to other SIGs

While SIG-Testing is a group organized to simplify test-writing for other SIGs, our charter and RFCs are fairly passive. SIG-Testing should proactively reach out to other SIGs to help clarify how we can assist and/or collaborate with them.

Recommend the ways we can reach out, critical talking points, and a cadence for regularly performing outreach. This should result in creating new tasks to define the outreach content, and to perform the initial outreach.

Initially suggested points:

SIG-Testing can assist other SIGs by:
- Help provide, improve, or take ownership of shared test tools
- Consulting to help determine their level of risk / risk-acceptance
- Reviewing any new test automation in pull requests
Outreach approach could include:
- Attend meetings of other SIGs, to help clarify - meetings are listed in O3DE calendar
- Could clarify what we do in Discord text channel
- Designate ambassador to other SIGs (may personalize which is good, may pigeonhole which is less good)

RFC: EditorEntityComponentBehaviors Libraries

Summary:

A set of libraries that allow for Editor Entity Components to be interacted with through common user interactions that come packed with behavior validators (need to break validators out into their own RFC).

For a prototype please see Gem/PythonTests/EditorPythonTestTools/editor_python_test_tools/editor_component

What is the motivation for this suggestion?

Why is this important?

As we continue to stabilize the automation, looking into ways to provide standard behaviors for interactions with the AzLmbr API will allow us to write cleaner, more maintainable, and stable tests that we trust.
Abstracting the complexity of AzLmbr interactions provides an easy to use interface.
- Lowers the barrier of entry for writing automation.
- Prevents interacting with abstracted behaviors in a way that's been proven to be error prone or flaky.

What are the use cases for this suggestion?

Easily adds a component to an entity.
Easily allows a user to interact with a component entity.

    phsyx_collider_entity = EditorEntity.create_editor_entity("TestEntity")
    physx_collider = PhysxCollider(phsyx_collider_entity)
    physx_collider.shape.set(physics.ShapeType_Box)
    physx_collider.shape.box.height.set(value)
    var = physx_collider.box_shape.height.get()

What should the outcome be if this suggestion is implemented?

A library of editor entity component property types.
A Library of Entity Component behavior classes.

Suggestion design description:

On-Build Auto-generation of the following:

A library of editor entity component property types.
- A base class for a property and it's default expected behaviors to be overridden.
- A series of classes that define the behavior of those property types.
  - EG: A float property would inherit the base property class and would define the common behaviors for a float property.
A Library of Entity Component behavior classes.
- A base class that defines the standard behaviors for an entity component.
- A series of classes for each entity component that inherits the base entity component class that then registers a property class for each property supported by the entity component.

What are the advantages of the suggestion?

Libraries are always up to date
Property Classes could later be extended for other Entity/Component models throughout the engine (such as in the Character Editor's Inspector).

What are the disadvantages of the suggestion?

Effort required for writing the auto-gen system.
Effort required to define one-off entity component/property behaviors.
- Built-In Bus Calls such as Reflection Probe in Atom.
- Special handling for Properties that use built-in Enums.
- Special handling properties that are containers.

How will this be work within the O3DE project?

It could be generated every build and could be placed in a parallel folder to user\python_symbols\azlmbr called user\python_symbols\editor_tools\

Are there any alternatives to this suggestion?

Same implementation as above but not using auto-generation.

What is the strategy for adoption?

Explain how new and existing users will adopt this suggestion.
- TBD
Point out any efforts needed if it requires coordination with multiple SIGs or other projects.
- TBD
Explain how it would be taught to new and existing O3DE users.
- TBD

Proposed SIG-Testing meeting agenda for July-01-21

Meeting Details

Date/Time: July 1, 2021 @ 18:00 UTC / 2:00pm EDT / 11:00am PDT
Location: Discord SIG-Testing Voice Room
Moderator: Sean Sweeney (Kadino)
Note Taker Sean Sweeney (Kadino)

The SIG-Testing Meetings repo contains the history past calls, including a link to the agenda, recording, notes, and resources.

SIG Updates

There are no updates since the previous meeting, as this is the first meeting! We are currently trying to bootstrap this SIG, to prepare it for success when it is handed off to the Linux Foundation. The current meeting participants are expected to be the majority of the members of this public SIG through launch.

Meeting Agenda

Attendee introductions
Review and ratify SIG Charter at #1

Outcomes from Discussion topics

Charter review had significant discussion, pointed to additional changes

Action Items

Update charter based on feedback (Kadino)
Upload meeting minutes to repo (Kadino)
Re-review and ratify charter at next SIG-Testing meeting (sig-testing)

Open Discussion Items

List any additional items below!

Create a SIG Triage workflow

Describe the Issue
There should be a meeting-script for organizers follow, to help consistently cover the three locations where sig-testing owns issues:
https://github.com/o3de/o3de
https://github.com/o3de/o3de.org
https://github.com/o3de/sig-testing

This can be documented inside https://github.com/o3de/sig-testing/tree/main/meetings

Additional context
Links should include query filters into the issues

Create Roadmap View

Hi! SIG release is working to help the O3DE community to get visibility of the O3DE roadmap. Full context can be read here: o3de/sig-release#79.

In order to achieve that goal, we need your help to create the roadmap for your SIG by February 6th, 2023, and start giving brief updates about the roadmap items at the Joint TSC meeting on February 28th, 2023. Instruction to create the roadmap view can be read in the RFC section "Roadmap Review in TSC monthly meeting".

Let me know if you have any concerns with the dates or any questions about the ask!

Determine if additional Ubuntu tests are required and potentially put them in place.

During the 22.05.0 release we discovered this issue: o3de/o3de#9502

This issue was randomly discovered by someone testing the night before the release.
I am raising this here so that SIG-Platform can consider when/where this sort of testing (opening files from the file explorer) should be tested and potentially put the necessary processes in place.

Since it's platform-related testing it seems like it should fall under the responsibility of this sig to decide what should happen with this.

Investigate making results from Nightly AR runs to be publicly available

In order to hand off the ownership of RCA and reporting of Nightly AR failures to the SIGs we need to make the reporting of the nightly results available publicly.

This issue is to look into the requirements and dependencies to make this possibly. A couple of options we are considering:

Use tooling to auto-cut issues for failed tests into GHI under needs-sig and needs-triage label, then reporting the list of issues daily
Make the overall results of nightly runs publicly available, then hand ownership and responsibility of reporting issues from failures to the SIGs

There may be other options to enable this process, this issue is to perform the investigation into these options.

Proposed RFC Feature: Collect EditorTest tests across O3DE

Summary:

Currently, EditorTest tests can only be batched or parallelized within the module that declares them. This leads to a moderate amount of inefficiency, where relatively few tests can be launched in batches and/or in parallel. It also provides some convenience, where there are guarantees that other modules' tests won't interfere with one another. There could be significant gains by enabling all (or more) tests to be collected and executed together.

There may be issues such as temporary in-memory prefabs.

May be appropriate to allow tests to opt in to such an all-test parallelization and/or batching.

TODO: Fill out the rest of this RFC with a proposal.

What is the relevance of this feature?

Why is this important? What are the use cases? What will it do once completed?

Feature design description:

Explain the design of the feature with enough detail that someone familiar with the environment and framework can understand the concept and explain it to others.
It should include at least one end-to-end example of how a developer will use it along with specific details, including outlying use cases.
If there is any new terminology, it should be defined here.

Technical design description:

Explain the technical portion of the work in enough detail that members can implement the feature.
Explain any API or process changes required to implement this feature
This section should relate to the feature design description by reference and explain in greater detail how it makes the feature design examples work.
This should also provide detailed information on compatibility with different hardware platforms.

What are the advantages of the feature?

Explain the advantages for someone to use this feature

What are the disadvantages of the feature?

Explain any disadvantages for someone to use this feature

How will this be implemented or integrated into the O3DE environment?

Explain how a developer will integrate this into the codebase of O3DE and provide any specific library or technical stack requirements.

Are there any alternatives to this feature?

Provide any other designs that have been considered. Explain what the impact might be of not doing this.
If there is any prior art or approaches with other frameworks in the same domain, explain how they may have solved this problem or implemented this feature.

How will users learn this feature?

Detail how it can be best presented and how it is used as an extension or a standalone tool used with O3DE.
Explain if and how it may change how individuals would use the platform and if any documentation must be changed or reorganized.
Explain how it would be taught to new and existing O3DE users.

Are there any open questions?

What are some of the open questions and potential scenarios that should be considered?

Roadmap: Intermittent Failure Notifications

Summary:

Use metrics on historical test failure rates to send automated notifications before intermittent failures reach a critical mass of instability. A dashboard should also be available.

What is the relevance of this feature?

Provides SIGs awareness of any infrequent but repeated failures in tests which they maintain. When delivered, SIGs should be better prompted to take early action before intermittent failure rates rise critically high.

Tasks

Beta Give feedback

RFC: Intermittent Failure Intervention #43

priority/critical rfc-feature status/blocked triage/accepted
#62 (blocking prerequisite)
(RFC followup work TBD)
Options

Proposed RFC Feature: Automatically Discover Intermittently Failing Tests

Summary:

It can sometimes be difficult to know whether the test failure is related to their proposed change, or the failure is due to an inherently flaky test or unstable feature. To limit the pain from this ambiguity, the test infrastructure can automatically help identify the source of intermittent behavior.

What is the relevance of this feature?

Whenever a contributor encounters a test failure in AR, time is usually lost from one of two ways:

The contributor assumes the test is flaky, triggers a rerun with no changes, then after the multiple failures realizes the changelist contains a bug. This costs both time of an AR run and AR resources of one run.
The contributor spends time asking test owners or general chat if there is a known issue with the test because they are unsure if the failure is related to their changelist. This causes minor disruptions as people take time to assist answering the question and may also lead to people losing confidence in the testing pipeline.

Implementing this feature will allow contributors to determine whether a test failure is intermittent or deterministic. There may also be an intermittent bug with the product and the test is correctly catching the issue. This feature will help generate data for addressing intermittently failing tests whether the fault be the product or the test itself.

Feature design description:

Test failures triggers another step in Jenkins that reruns all failed tests M-number of times (proposing 10) and outputs a summary of the results. The results will provide information on whether the test is failing deterministically or if the failure is due to a flaky test. The rerun will not change the success output of the AR run and is meant to only provide additional information. There will also be an agreed upon timeout set (proposing 5 minutes) to ensure that we limit the amount of excess time spent in the pipeline.

For example, let's say that a user runs the Automated Review (AR) for their pull request that results in a single test failure. The pipeline will trigger an addition rerun step and isolate the failed test. The failed test will be rerun 10 times which can result in one of two scenarios:
-10/10 failures shows that the failure is deterministic and related to the pull request
-0-9/10 failures shows that the test is intermittently failing and further investigation is required. Data is collected.
The user is then notified of this result.

Technical design description:

A new step will be added to the Jenkinsfile for test reruns. This step can be further separated into two functions: Pytest reruns and googletest reruns. The Pytest reruns are trivial because a feature already exists. Run the command below which utilizes the pytest cache for the last test failures. The pytest cache may need to be preserved before starting the reruns because successive pytest reruns will overwrite the cache and may change the list of "previously failed tests".
The GoogleTests will need to mimic this cache property and store the failed test commands in another location. AzTestRunner will need to be modified to write to a "cache" file that contains all of the parameters required to run the each failed test. The new Jenkins step will iterate through each failed test inside of this "cache" and call AzTestRunner with the correct parameters. Once these tests are finished, Jenkins will determine whether the test failures are intermittent or not and add this information to the notifications it already sends out.

Pytest

Rerunning failed tests for pytest is simple, as it contains a rerun functionality already. The function utilizes contents stored in ~/o3de/.pytest_cache to track which tests failed:

# Pytest rerun command
~/o3de/python/python.cmd -m pytest --last-failed

GoogleTest

When a test fails, it will write a copy/paste rerun command for the individual Googletest. The rerun step will execute all of the saved test commands. AzTestRunner already compiles a rerun command to output, so this can probably be leveraged:

# Example output of test rerun command
/data/workspace/o3de/build/linux/bin/profile/AzTestRunner /data/workspace/o3de/build/linux/bin/profile/libAzCore.Tests.so AzRunUnitTests

What are the advantages of the feature?

Implementing this feature will improve the user experience for contributors when running into test failures. Knowing whether a test is intermittent or deterministic is useful information. This will also enable future features for intermittently failing test metric tracking. This can be useful for fixing intermittently failing tests by relating test failure rates against changelists.

What are the disadvantages of the feature?

Increased Cost from AR execution times and extra EC2 resources on test failures.
- This is mitigated by setting a timeout on the reruns which stops triggering reruns when the timeout is exceeded, but still finishes the current rerun. This will result in a worst case scenario extra (T + N) time, where T is the timeout and N is the time to take to run the entire suite (if all tests fail).
Only tests utilizing LyTestTools (Pytest) and AzTestRunner (Googletest) frameworks will have the rerun commands. If a test is executed in any other way, the framework will have to be extended for its rerun command.
A frequent flaky-failure case is when multiple tests are run at the same time, however this will run the test in isolation. This problem requires a considerably larger effort that is considered out of scope for this proposal.

How will this be implemented or integrated into the O3DE environment?

Covered in technical description.

Are there any alternatives to this feature?

Utilize CTest which has existing failed test retry functionality. This solution was not preferred because CTest cannot distinguish individual test cases within its module. It would rerun the entire module, which would increase the time to rerun.

We track the test history of each test and upon test failures, query the history to check the last N runs of the same test. This would add a small amount of time to the pipeline compared to rerunning the tests. However, this solution misses the value of focusing on the test history on the current change, and preventing new intermittent behavior.

How will users learn this feature?

Since this feature will be automatically added into Automated Review, a simple message/email notifying O3DE contributors about the changes should suffice. A section can be added in the existing Automated Review document for the reruns. This feature will be added to the documentation and will be automatically used during Automated Review.

Are there any open questions?

The largest value that can be obtained from this feature is only if contributors utilize the information as a tool to fix intermittently failing tests. This feature only provides additional information and will require further features and processes as a solution to intermittently failing tests.

Feedback Date

Discussion and decision should be made on 1/28/22

Proposed SIG-Testing meeting agenda for 2021-08-27

Meeting Details

Date/Time: August 27, 2021 @ 11:30AM PT / 2:30pm ET
Location: Discord SIG-Testing Voice Room
Moderator: David Kerwin (AMZN-dk)
Note Taker David Kerwin (AMZN-dk)

The SIG-Testing Meetings repo contains the history past calls, including a link to the agenda, recording, notes, and resources.

SIG Updates

What happened since the last meeting?
Last meeting we discussed the metrics we want to track from the Automated Review system, this week we will focus on metrics from the GHI repo related to bug tickets.

Meeting Agenda

Discuss agenda from proposed topics

Discuss new and changed reporting needs for bugs in the O3DE GHI repo

What should be reported (which bug clusters to report, how often etc.)
Define audience and ownership of reported data

Open floor for testing SIG related discussion and to propose items to add to the Testing SIG roadmap

Outcomes from Discussion topics

Discuss outcomes from agenda

Action Items

Create actionable items from proposed topics

Open Discussion Items

List any additional items below!

Generate a list of needed testing documentation, guides, tutorials, and execution instructions

This issue will be the starting point for the Testing SIG to identify test ware artifacts which are needed to be hosted in a public facing location such as this repository or the O3DE.org documentation guides.

Examples of test ware that are needed include:

How to run automated tests locally
How to write automated tests for each component or test suite
How to execute user workflow testing manually prior to submitting work

The output of this will be to turn each needed item into its own issue to be assigned and prioritized for publication.

SIG Reviewer Nomination

Nomination Guidelines

Reviewer Nomination Requirements

6+ contributions successfully submitted to O3DE
100+ lines of code changed across all contributions submitted to O3DE
2+ O3DE Reviewers or Maintainers that support promotion from Contributor to Reviewer
Requirements to retain the Reviewer role: 4+ Pull Requests reviewed per month

ReviewerNomination

I would like to nominate: SWMasterson, to become a Reviewer on behalf of sig-testing. I verify that they have fulfilled the prerequisites for this role.

Nominee's GitHub Profile https://github.com/swmasterson

Reviewers & Maintainers that support this nomination should comment in this issue.

SIG Maintainer Nomination

Nomination Guidelines

Reviewer Nomination Requirements

6+ contributions successfully submitted to O3DE
100+ lines of code changed across all contributions submitted to O3DE
2+ O3DE Reviewers or Maintainers that support promotion from Contributor to Reviewer
Requirements to retain the Reviewer role: 4+ Pull Requests reviewed per month

Maintainer Nomination Requirements

Has been a Reviewer for 2+ months
8+ reviewed Pull Requests in the previous 2 months
200+ lines of code changed across all reviewed Pull Request
2+ O3DE Maintainers that support the promotion from Reviewer to Maintainer
Requirements to retain the Reviewer role: 4+ Pull Requests reviewed per month

Maintainer Nomination

Fill out the template below including nominee GitHub user name, desired role and personal GitHub profile

I would like to nominate: michleza, to become a Maintainer on behalf of sig-testing. I verify that they have fulfilled the prerequisites for this role.

Nominee's GitHub Profile https://github.com/michleza
Note: michleza is currently a reviewer of sig-simulation

Reviewers & Maintainers that support this nomination should comment in this issue.

Proposed RFC Feature: In-Game Test Harness

Summary:

The ability to author and execute tests which target logic in the games produced by O3DE.

What is the relevance of this feature?

Customers in the games industry increasingly want to automate tests of their game logic, for example]. O3DE also has extremely few tests of the default game projects we build, due to both the overall difficulty of in-game automation and discarding outdated tests. Delivering a system that simplifies in-game automation will not only help O3DE prevent its own examples from breaking, but also help customers efficiently improve their own products. While it is always possible for customers to use game programming techniques to create their own in-game test automation, failing to serve this area leaves an unstandardized negative space.

Proposed RFC Feature: Test Impact Analysis Framework

Summary

This feature proposes a change based testing system called the Test Impact Analysis Framework (TIAF) that utilizes test impact analysis (TIA) for C++ and Python tests to replace the existing CTest system for test running and reporting in automated review (AR). Rather than running all of the tests in the suite for each AR run, this system uses the source changes in the pull request (PR) to select only the tests deemed relevant to those changes. By running only the relevant subset of tests for a given set of source changes, this system will reduce the amount of time spent in AR running tests and reduce the frequency at which flaky tests unrelated to the source changes can block PRs from being merged.

What is the relevance of this feature?

As of the time of writing, it takes on average three attempts at AR for a given PR to successfully pass prior to merging. During each run, it takes on average 15 to 20 minutes for all of the C++ and Python tests to be run. When a given AR run fails, sometimes this will be due to pipeline and build issues (beyond the scope of this RFC) and other times due to flaky failing tests that are not relevant to source changes. This system will reduce the amount of time spent in AR by reducing the number of tests being run for a given set of source changes as well as reducing the likelihood of irrelevant, flaky tests from causing false negative AR run failures.

Feature design description

TIA works on the principle that for a given set of incoming source changes to be merged into the repository, the tests that do not cover the impacted source changes are wasted as they are incapable of detecting new bugs and regression in those source changes. Likewise, the opposite is true: the tests that do cover those changes are the relevant subset of tests we should select to run in order to detect new bugs and regression in those incoming source changes.

In order to determine which tests are pertinent to a given set of incoming source changes we need to run our tests with instrumentation to determine the coverage of those tests so we can build up a map of tests to the production sources they cover (and vice versa). Each time we select a subset of tests we deem relevant to our incoming source changes we need to also run those tests with said instrumentation so we can update our coverage map with the latest coverage data. This continuous cycle of select tests → run tests with instrumentation → update coverage map for each set of source changes coming into the repository is the beating heart of TIA.

Test selection

The circle below represents the total set of tests in our repository:

The red circle below represents the relevant subset of tests selected for a given set of incoming source changes:

In this example, the red circle is notably smaller than the white circle so in theory we can speed up the test run time for our incoming source changes by only running the tests in the red circle that we have deemed relevant to those changes.

In practice

Unfortunately, OpenCppCoverage, the open source, toolchain independent coverage tool available on the Windows platform is far too slow for use in the standard implementation of TIA. Furthermore, there is no tooling available to perform TIA for our Python tests. As such, two new algorithms have been developed to speed up TIA enough for real world use in O3DE for both our C++ and Python tests.

Technical design description

C++ tests

Fast Test Impact Analysis (FTIA) is an algorithm developed as part of the TIAF that extends the concept of TIA to use cases where the time cost of instrumenting tests to gather their coverage data is prohibitively expensive.

FTIA works on the principle that if we were to generate an approximation of test coverage that is a superset of the actual coverage then we can play as fast and loose with our approximation as necessary to speed up the instrumentation so long as that approximation is a superset of the actual coverage.

Test selection

The yellow circle below represents the approximation of relevant subset of C++ tests in a debug build for a given set of incoming source changes:

Notice that the yellow circle is notably larger than the red circle. However, notice that it is also notably smaller than the white circle.

Test instrumentation

We achieve this by skipping the breakpoint placing stage of OpenCppCoverage. In effect, this acts as a means to enumerate all of the sources used to build the test target binary and any compile time and runtime build targets it depends on. The end result is that it runs anywhere from 10 to 100 times faster than vanilla OpenCppCoverage, depending on the source file count. Not only that but with some further optimizations (unrelated to instrumentation) we can bring the speed of running instrumented test targets to be at parity with those without instrumentation.

Is this not the same as static analysis of which sources comprise each build target?

Static analysis of the build system by walking build target dependency graphs to determine which test targets cover which production targets is another technique for test selection in the same family as TIA but with one important detail: it is incapable of determining runtime dependencies (e.g. child processes launched and runtime-linked shared libraries loaded by the test target, which some build targets in O3DE do). Not only that but with FTIA we get another optimization priced in for free: for profile builds, the number of sources used to build the targets in O3DE is typically considerably less than those of a debug build (the latter being somewhat a reflection of the sources specified for the build targets in CMake). This is due to the compiler being able to optimize away dead code and irrelevant dependencies etc., something that would be functionally impossible to do with static analysis alone.

The blue circle below represents the approximation of relevant subset of C++ tests in a profile build for a given set of incoming source changes:

Notice that the blue circle is notably smaller than the yellow circle and approaching a more accurate approximation of the red circle. In effect, we had to do nothing other than change our build configuration to leverage this performance optimization.

Python tests

Indirect Source Coverage (ISC) is the novel technique developed to integrate our Python end-to-end tests into TIAF. The goal of ISC is to infer which dependencies a given test has in order to facilitate the ability to map changes to these dependencies to their dependent tests. Such coverage is used to associate the likes of assets and Gems to Python tests, shaders and materials to Atom tests, and so on. Whilst the technique described in this RFC is general, the specific focus is on coverage for Python tests such that they can be integrated into TIAF.

Basic approach

The techniques for inferring source changes to C++ tests are not applicable to Python tests as FTIA would produce far too much noise to allow for pinpointed test selection and prioritization. Instead, Python test coverage enumerates at edit time components on any activated entities as their non-source coverage to determine which Python tests can be selected when the source files of these components are created, modified or deleted.

An additional optimization step is performed: for each known component, the build system is queried for any other build targets (and their dependencies) that any of the known component's parent build targets are an exclusive depender for. These exclusive dependers are thus also deemed to be known dependencies for any Python tests that depend on said parent build target.

Differences Between C++ Coverage and Test Selection

Python test selection differs from C++ test selection on a fundamental level: C++ test selection is additive, whereupon the selector optimistically assumes NO C++ tests should be unless a given test can conclusively be determined through analysis of coverage data whereas Python test selection is subtractive, whereupon the selector pessimistically assumes ALL Python tests should be run unless a given test can be conclusively determined through analysis of non-source coverage. This subtractive process allows for test selection to still occur with only tentative, less rigorous coverage data that can be expanded upon as more rules and non-source coverage options are explored without the risk of erroneously eliminating Python tests due to incomplete assumptions and limited data.

Limitations of ISC

The limitations of this proposed approach are as follows:

Only Python tests that have component dependencies are eligible for test selection.
a. All other Python tests must be run.
Only change lists exclusively containing CRUD operations for files belonging to the build targets for components (or their exclusive dependencies) that one or more Python test depends on are eligible for test selection.
b. Otherwise, all Python tests must be run.

Technical Design

The PythonCoverage Gem is a system Gem enabled for the AutomatedTesting project that enumerates at run-time all components on all activated entities and writes the parent module of each component to a coverage file for each Python test case. Using this coverage data, the build target for each parent module binary is inferred and thus the sources that compose that build target. Should a change list exclusively contain source files belonging to any of the enumerated components it is eligible for Python test selection. Otherwise, it is not an all Python tests must be run.

Coverage Generation

The Editor is launched with the Python test being run supplied as a command line argument.
A OnEntityActivated notification is received for an entity that has been activated.
The component descriptors for the recently activated entity are enumerated.
For each enumerated component descriptor, an attempt is made to discover the parent module that the descriptor belongs to.
a. As components belonging to static libraries do not have a parent module, only those belonging to shared libraries (i.e. Gems) can be used for coverage.
The parent modules discovered for the Python test are written out to a coverage file for that test.

Dynamic Dependency Map Integration

For each Python test coverage file, the build system is queried to find the build target for parent modules in that coverage file.
a. Determine the modules that this module is an exclusive depender for.
The source files for each build target are added to the Dynamic Dependency Map as being covered by the Python test(s) in question.

Test Selection

The change list is analysed to see if it exclusively contains source files for known Python test dependencies:
a. If yes, the Python tests dependent on those dependencies are selected for the test run (Python tests not dependent are skipped).
b. If no, all Python tests are run.

What are the advantages of the feature?

For the most part, the benefits will come without any need for input from team members as it will be running as part of AR to reduce the time your PRs spend in AR, as well as reducing the likelihood of you being blocked by flaky tests. However, the framework itself is a complete model of the build system so there is a lot of scope for using this information to implement even more AR optimizations. For example, as the framework knows exactly which sources belong to which build targets, and which of those build targets are test targets and production targets, it knows that, say, a text file is not part of the build system (or covered by any tests) so changes to that file can be safely ignored by AR. Although it’s hard to speculate what the long term benefits of this framework will be, we are confident that other teams and customers will make good use of this framework to implement all manner of novel optimizations and useful repository insights.

C++ tests

On average, we are currently rejecting 85% of tests to run for incoming source changes which has resulted in speedups of approximately 5x compared to that of running all of our tests (and there’s still a lot of room for improvement, an end goal of >95% test rejection and 15-25x overall speedup of test run times is realistic once all the other pieces are in place).

Above is the plot of test run duration for 130 builds in O3DE's AR servers using FTIA. The orange line represents the average duration of all of thees builds (which currently stands at just over 30 seconds). Notice that many builds take significantly less than this, and it is only the long pole test runs (those where the source changes touch root dependencies, e.g. AzCore) that are pushing the average higher.

Above is the selection efficiency (amount of tests rejected as being irrelevant for the source changes) for the same 130 builds. The average efficiency is about 85% (85% of tests being rejected) with many more being at or near 100%.

Python tests

We ran the Python test selector for the past 50 pull requests and measured the test selection efficiency, that is, the percentage of the 26 Python test targets in the Main suite that were rejected as being not relevant to the source changes in the PR. Our baseline efficiency to beat, as offered by our current CTest approach, is 0% (no test targets rejected), so anything above 0% would be considered a win.

In the graph above, the orange line represents the average test selection efficiency over all 50 PRs (currently sitting at 30%). That is to say that, on average, 30% of test targets are rejected for a given PR (even if the data itself is very polarized). Although this falls far short of the 85%+ efficiency of the C++ test selection under Fast Test Impact Analysis, even if no further improvements were to made we would still be cutting down the number of Python tests our customers need to run in AR by nearly a third. Over time, these numbers add up to significant, measurable savings in terms of server costs and velocity that PRs can be merged into the repository. Of course, we do have plans to improve the efficiency, but that’s a challenge for another day.

What are the disadvantages of the feature?

This new system undoubtedly introduces far more complexity than using the existing CTest system. This complexity in turn means responsibility to further develop and maintain the system, as well as more points of failure that could hinder AR should any breaking changes be introduced into the system.

How will this be implemented or integrated into the O3DE environment?

The average developer need do nothing but sit back and reap the benefits of time savings in AR when they wish to merge their PRs. For the teams responsible for maintaining the build system and AR pipeline, this system will require the CTest pipeline stages to be replaced with the TIAF pipeline stages. We have been testing this system with our own shadow pipeline to weed out any teething issues so will be able to provide a strategy for seamless replacement into O3DE's AR pipeline.

Currently, only Windows is supported. Should this system be rolled out with success, other platforms will be supported.

Are there any alternatives to this feature?

There are no free, open source and toolchain independent tools for providing TIA on the Windows platform.

How will users learn this feature?

Documentation will be provided but, for the most part, this feature will be transparent to the user. There is nothing they need to do in order to use this system other than open up a PR.

Are there any open questions?

How will other platforms be supported?
How can this system be integrated with the least amount of friction and disruption?
What are the unforeseen blind spots in coverage for both C++ and Python tests?

Proposed RFC Feature editor_test.py reconciles included Report.results against actual Report summary

Summary:

Currently when a test runs using editor_test.py (batch/parallel), the Editor Python Binding (EPB, hydra) script contains some number of Report.result and Report.critical_result lines. If the script ends early without having called those lines there is no accounting that tests were skipped or missed. Only if a log is successfully captured can we see which Report lines actually got logged.

What is the relevance of this feature?

In the event that the application under test (AUT) exits early but does not indicate error or is non-responsive and fails to log information before being terminated for timeout, we have no way of assessing test lines that were skipped or missed. We can only see that the overall test completed with status timeout or ended with exit code (expected or otherwise). If the test exits without indicating an explicit failing result, the overall assessment would then be passing even if some number of test lines are skipped.

Feature design description:

before executing the AUT we should parse the EPB script for expected results so that on completion we can reconcile expected to the actual Report summary. This would prevent us from failing to notice skipped test functionality.

Technical design description:

Parse the EPB script for a list of Report.result lines to create an accounting
- code branching will add complexity for e.g. if-statements
- asserts are also important to verify whether they were evaluated (may require spying, or profiling, or using a Report.Assert?)
After execution of the AUT with the EPB script, compare the result summary to the expected and reconcile the difference
Indicate discrepancies in tests reporting with a mechanism that alerts users to missed testing

What are the advantages of the feature?

Having this may prevent false perception about the completeness of testing; we will know if tests lines are not running.
Provides a consistency check that the full set of assertions in a script ran, and did not exit early.

What are the disadvantages of the feature?

Some tests may intentionally be within branching code that either intentionally doesn't run for various conditions or is meant to run multiple times for a list of values. This would make reconciliation difficult to achieve. (implementation complexity)
It may be more expensive in resources to build this than to use mechanisms to manually ensure testing is completed as expected. (auditing test health, unlikely to be done except when the test or feature fails for other reasons)

How will this be implemented or integrated into the O3DE environment?

Work would be done within the editor_test.py batch/parallel framework and should be transparent to test implementers.

Are there any alternatives to this feature?

manual audit of test results
A structured expectations library may simplify this: o3de/o3de#8874

How will users learn this feature?

possibly docstrings in editor_test.py or report summary output differences

Are there any open questions?

Proposed SIG-Testing meeting timeslot

In a recent SIG-Testing meeting, a change of meeting time was suggested. To determine a new time, please vote on the following poll!

All current and prospective members of SIG-Testing are welcome to reply. Proposed times will follow Daylight/Summer timezone changes, unless later voted against.

The current options were constrained to:

Tuesday-Thursday to avoid weekends and common holidays
Common daytime hours between North America and Europe, where the majority of current members reside

This poll will close on March 1, 2022. If "None of these times work for me" represents over 30% of the poll, the next SIG-Testing meeting in Discord will determine a better set of times.

Proposed SIG-testing meeting agenda for 10-8-21

Meeting Agenda

Proposed Topic:
To fail or not to fail an automated integration test on encountering any assert/error rather than failing only on those relevant to the test

2023 SIG-Testing Chair/Co-Chair Elections

SIG chair / co-chair elections for 2023

SIG-Testing should organize an election for the new year. Per the Elections Guide, at least one Election Official should be identified before proceeding. This official should be assigned this issue, and drive the process forward.

The chair / co-chair roles

The chair and co-chair serve equivalent roles in the governance of the SIG and are only differentiated by title in that the highest vote-getter is the chair and the second-highest is the co-chair. The chair and co-chair are expected to govern together in an effective way and split their responsibilities to make sure that the SIG operates smoothly and has the availability of a chairperson at any time.

Unless distinctly required, the term "chairperson" refers to either/both of the chair and co-chair. If a chair or co-chair is required to perform a specific responsibility for the SIG they will always be addressed by their official role title.

Chairs are not required to provide "on-call" support of any kind as part of their chairship. The O3D Foundation is responsible for addressing any high-severity issues related to documentation or community which would impact the immediate operation of the O3DE project or may result in legal liability of the O3D Foundation.

Responsibilities

Schedule and proctor regular SIG meetings on a monthly cadence, or as determined by the SIG.
Planning and strategy for test automation for O3DE.
Serve as a source of authority (and ideally wisdom) with regards to testing O3DE. Chairpersons are the ultimate arbiters of many documentation standards, processes, and practices.
Act as representatives of the community, and provide advice to other SIGs, the Technical Steering Committee (TSC), and the Foundation on community building and management.
Coordinate with partners and the Linux Foundation and Marketing Committee regarding official community events.
Regularly participate in O3DE discussion channels such as our mailing lists, GitHub discussions, and Discord.
Maintain a release roadmap for the O3DE SIG area of discipline.

Nominations and election

Nomination requirements

Nominees must be an active contributor to Open 3D Engine in some fashion.
Nominees are not required to be reviewers or maintainers, but reviewer/maintainer status is something nominees should highlight.
Nominees should have at least some familiarity with open source, and one of either technical writing or community management / building.

If you are nominated by somebody other than yourself, you're encouraged to provide your own statement (however brief). If you would like to decline a nomination, please do so on this issue before the nomination deadline.

How to nominate

Nominations will be accepted beginning with the date of posting of this issue, and will be open until 2023-03-31 12:00 PM Pacific Time.

Nominate somebody (including yourself) by responding to this issue with:

A statement that the nominee should be nominated for a chair position in the Testing SIG.
Include any information on their work in O3DE (whether that is actively participating in Discord/forums/email, producing code submissions, producing docs submissions, or providing external tutorials or resources). If you are nominating yourself, provide a brief statement on your reasons for running for chairperson..
If you are nominating somebody else, make sure that they are @-ed by their GitHub alias so they receive a notification and can respond.
The name under which the nominee should be addressed. Nominees are allowed to contact the election proctor to have this name changed.
The GitHub username of the nominee. (Self-nominations need not include this; it's on your post.)
Nominee's Discord username. Chairpersons must be active in the O3DE Discord if elected.

By accepting a nomination you acknowledge that you believe you can faithfully execute the role required of a chairperson for sig-testing for the next 6-12 months. Chairpersons may relinquish their chairship at any time and for any reason.

If only one nomination is received, nominations will remain open for an additional week.

Election Process

The election will be conducted for one week from 25/4/2023 through 2/5/2023 and held through an online poll. Votes will be anonymous and anyone invested in the direction of O3DE and the SIG holding the election may vote. If you choose to vote, we ask that you be familiar with the nominees.

If there is a current interim chair, they will announce the results in the Discord sig channel as well as the SIG O3DE mailing list. If there is no interim chair, the Election Official will announce the results utilizing the same communication channels. At that time if there is a dispute over the result or concern over vote tampering, voting information will be made public to the extent that it can be exported from the polling system and the SIG will conduct an independent audit under the guidance of a higher governing body in the foundation.

Proposed RFC Suggestion: Create a policy for SIGs to resolve AR test failures

O3DE Suggestion RFC

Summary:

We are currently looking into tooling which will make it easier for SIGs to more easily and quickly identify both real and intermittent test failures, however the work to create this tooling will be wasted if SIGs do not use the information they provide to resolve the identified issues.

In order to continuously improve the AR system and reduce contributor frustration with intermittent test failures, we need a policy around how SIGs address failures in AR so they do not end up in an inactive backlog. It is important to get other SIG involvement in drafting and rolling out this policy in order to increase the likelihood of contributors adopting and putting this policy into practice, as it will likely not have as big of an impact if it is created by the testing SIG in isolation.

Proposed RFC Feature editor_test.py supports pytest parameterization passed through to collected execution.

Summary:

The current framework for test execution utilizes collection concepts implemented in .\o3de\Tools\LyTestTools\ly_test_tools\o3de\editor_test.py. Tests and suites implement classes from editor_test.py which can then be dynamically collected at runtime for execution. Because these suites and test classes subclass parent classes which are used in the collection representation, dynamic pytest paramerterization is not currently supported. It is desirable that we add customizable parameterization to editor_test.py TestSuite and SingleTest classes so that more complex test concepts can be created. Render Hardware Layer (RHI) should replace or augment use_null_renderer as an option since -rhi=null is simply a specific rhi requested.

What is the relevance of this feature?

Tests may need to specify a reusable parameter (i.e. level name) or execute their common code with different command line parameters (i.e. specifying an -rhi=). Currently a duplicate test script must be created for each variation.

Feature design description:

editor_test.py test classes and suite classes would handle standard pytest parameterization decoration.
mechanisms would be available to define custom fixtures as allowable by either StandaloneTest or batch/parallel test classes.
The default use_null_renderer behavior should be a parameterized rhi call that can be replaced with specified list of desired rhi such as ["dx12", "vulkan"]. Backwards compatability param use_null_renderer could remain.
Any number of other command line argument parameters should be possible to pass through to the application under test.

Technical design description:

user defined classes which implement any of EditorTestSuite, EditorSingleTest, EdiditorSharedTests, or other parent classes used by collection will allow standard pytest decoration.
- There will need to be a way to only batch tests requesting the same editor arguments together. Possibly only supporting parameterization with SingleTest (and perhaps ParallelTest).
collection routines in editor_test will pass through parameterized decoration so that test execution will include the options
multiple execution of tests through parameterization will occur and be reflected in test suite summary results.

What are the advantages of the feature?

pytest typically is used with parameterization to avoid duplicate test code or provide for execution time options for tests.

What are the disadvantages of the feature?

Collection causes a difficult situation for passing parameterization.
Some test scenarios like batching may not permit parameterization since many tests share one editor session. Batched tests may need to be grouped by parameters or excluded from parameterization.

How will this be implemented or integrated into the O3DE environment?

this should be implemented in a way that standard pytest parameterization techniques would be familiar to a user of the test framework to some extent.

Are there any alternatives to this feature?

No. only a static assortment of parameters exist for some of the test classes such as workspace (a class containing methods and information about the execution environment)
Providing a mechanism to parameterize within the Editor may mitigate the need for this, but does not solve the same test writing issues.

How will users learn this feature?

Example tests will be written showing how this can be used. Most likely I envision some Atom tests using rhi parameterization to replace existing hard-coded rhi option calls.
Not having this feature increases the difficulty to write tests.

Are there any open questions?

Proposed RFC Feature: Batched tests with no results are recollected and run in a new batch rather than failed

Summary:

Currently in editor_test.py when a batched list of tests encounters a timeout or a test exits the editor prematurely the tests which did not execute have no result and currently report as failed. This results in confusion for many who see a list of failed tests and assume they ran and failed.

What is the relevance of this feature?

Actually running tests rather than failing them will provide more useful information to debug the actual timeout or unexpected exit.

Feature design description:

When a Report summary includes tests without results, those tests will be collected in a new batch
The new sub-batch of tests will be executed again to generate results
If a timeout was the cause we will not rerun the test which was running at the timeout (the one that caused the issue)
Tests which already had no results and were rerun will not be collected for a 3rd run.

Technical design description:

Recollection and execution occurs for tests which report no results
Recollected tests should not be run a 3rd time if they fail to report results
Results will clearly call out no result rather than failure. "Test did not run or had no results!"

What are the advantages of the feature?

Removes reports of failures which are really just did not run. This reduces initial confusion among developers troubleshooting the AR results.

What are the disadvantages of the feature?

Increased run time for AR
Need to be careful that we don't collect the test which caused the unexpected exit or timeout and just repeat the cycle

How will this be implemented or integrated into the O3DE environment?

This will be part of editor_test.py and the pytest framework around it. It should just work without extra effort.

Are there any alternatives to this feature?

Educate everyone who reads a log as to what the output means more fully and expect them to deeply understand pytest and editor_test output.

How will users learn this feature?

Documentation would need to be created
In practical use they should not see this or need to know much about it just the result of less "failures" on top of root failures to distract them.

Are there any open questions?

What are some of the open questions and potential scenarios that should be considered?

RFC: Extend EditorTest tools to MaterialEditor

Summary:

There are a set of tools inside C:\git\o3de\Tools\LyTestTools\ly_test_tools\o3de\editor_test.py that allow for parallel test runs that run in batched sets. The problem is, these tools only work for tests that utilize Editor but it could also support MaterialEditor. It currently cannot support other executables (such as Launcher executables).

This RFC proposal will discuss the viability of this and also a rough outline of how we would go about implementing it for O3DE. Once completed, the results of this RFC will be used to cut tasks and assign work to get it implemented.

What is the relevance of this feature?

Since the inception of the parallel & batch test run tools, there has been an increase in test speed without much sacrifice to test efficiency. This is because these tools utilize return codes to validate test results instead of log lines, which are often plagued by race conditions, file permission issues, and other technical problems whereas return codes are not.

This change will expand this option beyond just tests that utilize Editor and apply it to the MaterialEditor also.

Feature design description:

The feature will be the expansion of our existing parallel & batched test tools. Once implemented, it would allow the existing parallel & batched test runs for Editor.exe to be extended to MaterialEditor test runs as well, utilizing this return code approach to verifying tests. We will save time on test runs while also gaining more test efficiency, with no real sacrifice to our existing approach to automated tests.

Technical design description:

It will work the same as Editor tests that utilize this functionality. There is a "test launcher" file that contains all of the tests to batch, run in parallel, or run as a single non-batched non-parallel test. An example launcher file would look something like this:

# Full example can be found at C:\git\o3de\AutomatedTesting\Gem\PythonTests\Atom\TestSuite_Main.py

from ly_test_tools.o3de.editor_test import EditorSharedTest, EditorTestSuite

logger = logging.getLogger(__name__)
TEST_DIRECTORY = os.path.join(os.path.dirname(__file__), "tests")


@pytest.mark.parametrize("project", ["AutomatedTesting"])
@pytest.mark.parametrize("launcher_platform", ['windows_editor'])
class TestAutomation(EditorTestSuite):

    enable_prefab_system = True

    @pytest.mark.test_case_id("C36529679")
    class AtomLevelLoadTest_Editor(EditorSharedTest):
        from Atom.tests import hydra_Atom_LevelLoadTest as test_module

    @pytest.mark.test_case_id("C36525657")
    class AtomEditorComponents_BloomAdded(EditorSharedTest):
        from Atom.tests import hydra_AtomEditorComponents_BloomAdded as test_module

    @pytest.mark.test_case_id("C32078118")
    class AtomEditorComponents_DecalAdded(EditorSharedTest):
        from Atom.tests import hydra_AtomEditorComponents_DecalAdded as test_module

The above example would run 3 tests in parallel, and as 1 batch (batch limits are 8 tests at a time, which can be changed inside the C:\git\o3de\Tools\LyTestTools\ly_test_tools\o3de\editor_test.py file but AR will assume this value of 8):

    # Function to calculate number of editors to run in parallel, this can be overriden by the user
    @staticmethod
    def get_number_parallel_editors():
        return 8

Since it batches the tests up using Editor as its base, all of the classes are "EditorX" classes where X is the additional naming for the class (i.e. class EditorTestSuite). These classes also define values that are closely tied to the Editor meaning that we should be able to implement this for MaterialEditor very easily by simply using the same objects and then making sure it targets MaterialEditor instead of Editor when launching the tests:

Find the entry point where "Editor" is launched in the code then provide the user an option to utilize "MaterialEditor" instead. This appears to be primarily in the class EditorTestClass.collect().make_test_func() method or _run_parallel_batched_tests function, specifically the editor parameter would need to reference MaterialEditor instead. This editor parameter feeds into all of the other function calls within the script, so will work seamlessly with MaterialEditor once the switch is made.
Update all Editor hard coded references such as editor.log to change based on the executable used by the test.
Add some feature optimizations to match the options available to us, such as changing use_null_renderer to something like use_rhi and then pass in the options null, dx12, or vulkan for instance. Basically updating the code to be more agnostic so that it supports both Editor and MaterialEditor options rather than fit a narrowly defined run for non-GPU null renderer tests.

Also, not every test has to be run in parallel and batched. In fact, GPU tests have to be run as a single test using the class EditorSingleTest object so this option will exist for any tests that can't be run in parallel for whatever reason. The added benefit of converting is primarily for the return codes which are the most solid test approach (previously we used log lines).

Initial support of this feature would be only for Windows, as that is our current standard. This could change in the future, but this RFC will not cover any of that design.

What are the advantages of the feature?

Faster test runs, more test stability (return codes instead of log lines), and easier than before to add new tests to an existing python test launcher script. All existing tests are also ready today to be converted in this way (and many have already).

What are the disadvantages of the feature?

The disadvantage will be that some unknowns remain as to how easy this "switch flip" would be from Editor to MaterialEditor - it looks easy to do when reviewing the code, but will it be easy when we change it and try to run MaterialEditor tests in this manner? That will be the large unknown this task work would unveil for us - though I predict it being workable.

How will this be implemented or integrated into the O3DE environment?

Very easily by simply adding to the existing code inside C:\git\o3de\Tools\LyTestTools\ly_test_tools\o3de\editor_test.py or the C:\git\o3de\Tools\LyTestTools\ly_test_tools\o3de\ directory. It uses files that already exist and functionality that has already been tested with Editor.exe extensively.

The hard part will be testing, hardening, and getting other users to adopt and update their code to match this new functionality. This would ideally be done using an e-mail linking to a wiki that details for users how to update their tests for this framework change so that nothing breaks unexpectedly (especially when it comes to renames, they're trivial but they can easily break a lot of existing code).

Are there any alternatives to this feature?

The alternative we had were log line tests which are not as reliable as return code tests.

How will users learn this feature?

There is already an existing internal wiki, and we would probably have need of a public one since this will be publicly available. However, since the demand for test tools has appeared low and most of our sig-testing meetings are usually only internal participants, I don't think a public wiki would be required initial - though it certainly wouldn't hurt in the long run.

Are there any open questions?

An open question is whether GPU tests will ever be able to run as parallel & batched tests. So far it looks like the answer is no, but these types of tests are often cumbersome in automated testing so are usually avoided or greatly reduced in volume compared to other types of tests since they are considered "front end tests". "Front end tests" are usually the slowest and most prone to race conditions, so even though this is a problem I believe it will always be one regardless of which automated test tool we choose.

Proposed RFC Feature: Static Analysis via GitHub Actions

Summary:

Static Analysis tool(s) could execute during Automated Review OR periodically run and auto-cut issues based on the findings. Static analysis tools exist in GitHub actions, which O3DE has "free" credits for executing as part of being an open source project: https://github.com/marketplace/category/code-quality . Determine which are appropriate to run, and propose the cadence they should run.

TODO: This RFC is a stub, and needs to be further defined before it is ready for comment and further revision. Fill out the sections below, and bring this document to review with SIG-Testing

What is the relevance of this feature?

Why is this important? What are the use cases? What will it do once completed?

Feature design description:

Explain the design of the feature with enough detail that someone familiar with the environment and framework can understand the concept and explain it to others.
It should include at least one end-to-end example of how a developer will use it along with specific details, including outlying use cases.
If there is any new terminology, it should be defined here.

Technical design description:

Explain the technical portion of the work in enough detail that members can implement the feature.
Explain any API or process changes required to implement this feature
This section should relate to the feature design description by reference and explain in greater detail how it makes the feature design examples work.
This should also provide detailed information on compatibility with different hardware platforms.

What are the advantages of the feature?

Explain the advantages for someone to use this feature

What are the disadvantages of the feature?

Explain any disadvantages for someone to use this feature

How will this be implemented or integrated into the O3DE environment?

Explain how a developer will integrate this into the codebase of O3DE and provide any specific library or technical stack requirements.

Are there any alternatives to this feature?

Provide any other designs that have been considered. Explain what the impact might be of not doing this.
If there is any prior art or approaches with other frameworks in the same domain, explain how they may have solved this problem or implemented this feature.

How will users learn this feature?

Detail how it can be best presented and how it is used as an extension or a standalone tool used with O3DE.
Explain if and how it may change how individuals would use the platform and if any documentation must be changed or reorganized.
Explain how it would be taught to new and existing O3DE users.

Are there any open questions?

What are some of the open questions and potential scenarios that should be considered?

SIG-Testing 11/30 release notes

Please fill any info related to the below for the 11/30 release notes: Note this has a due date of Friday Nov 12th, 2021

Features
Bug fixes (GHI list if possible)
Deprecations
Known issues

SIG Reviewer/Maintainer Nomination: FuzzyCarterAWS

Nomination Guidelines

Reviewer Nomination Requirements

6+ contributions successfully submitted to O3DE
100+ lines of code changed across all contributions submitted to O3DE
2+ O3DE Reviewers or Maintainers that support promotion from Contributor to Reviewer
Requirements to retain the Reviewer role: 4+ Pull Requests reviewed per month

Maintainer Nomination Requirements

Has been a Reviewer for 2+ months
8+ reviewed Pull Requests in the previous 2 months
200+ lines of code changed across all reviewed Pull Request
2+ O3DE Maintainers that support the promotion from Reviewer to Maintainer
Requirements to retain the Reviewer role: 4+ Pull Requests reviewed per month

Reviewer/Maintainer Nomination

Fill out the template below including nominee GitHub user name, desired role and personal GitHub profile

I would like to nominate: FuzzyCarterAWS, to become a Reviewer on behalf of sig-testing. I verify that they have fulfilled the prerequisites for this role.

Nominee's GitHub Profile < https://github.com/FuzzyCarterAWS >

Reviewers & Maintainers that support this nomination should comment in this issue.

Roadmap: Public Pipeline Metrics

Summary:

Expose a metrics summary accessible by all public contributors. This should be a simple HTML dashboard, which is generated by scripts in the O3DE repo.

What is the relevance of this feature?

Helps contributors understand the health of automated build-and-test across O3DE's many CI pipelines, and where to contribute at a glance.

Tasks

Beta Give feedback

Clarify LFX Insights blockers for test metrics #34

priority/critical status/blocked triage/accepted
RFC: Public Pipeline Health Metrics #69
(RFC followup work TBD)
Options

RFC: Intermittent Failure Intervention

Summary

The Testing Special Interest Group (SIG-Testing) primarily serves a support and advisory role to other O3DE SIGs, to help them maintain their own tests. While this ownership model tends to function well, there are cases where instability in features owned by one SIG can interfere with the tests of all SIGs. This RFC proposes a runbook for SIG-Testing to follow during emergent cases where the intermittent failure rate approaches a critical level. It also proposes metrics with automated alarms that trigger proactively following this runbook on behalf of other SIGs, as well as improved automated failure warnings to all SIGs to reduce the need to manually follow the runbook:

Improved automated notifications should serve as the earliest warning, notifying a SIG of failures in shared branches such as Development and Stabilization
- Automated notifications are added based on failure rate metrics, to help inform the SIG when instability is growing in code they own
A runbook for SIG-Testing is intended as a fallback, to help intervene when instability in one SIG's area of ownership threatens the ability for all O3DE contributors to ship code
- Guide to stabilize or back-out code with critical instability that impacts testing (ideally before all engineering work becomes blocked)
  - SIG-Testing will temporarily take ownership to modify, disable, or remove misbehaving code and tests, then return ownership of feature functionality to the primary SIG
- Secondary automated notification for SIG-Testing to use the runbook exist via a second, less-sensitive threshold than the initial instability notifications sent to SIGs

Note: Investigation and intervention on behalf of other SIGs is currently outside of the stated responsibilities of SIG-Testing, and accepting this RFC would amend the charter.

What is the motivation for this suggestion?

Intermittent failures are a frustrating reality of complex software. O3DE SIGs already strive to deliver quality features, and do not intentionally merge new code that intermittently fails. And in many cases intermittently unsafe code is caught and fixed before it ships: during development, during code reviews, or by tests executed during Automated Review of pull requests. This RFC does not seek to change how code is developed, reviewed, or submitted. Regardless, some percentage of instability evades early detection and creates nondeterminism.

When a failure appears to be nondeterministic, it can initially pass the Automated Review pipeline only to later fail during verification of a future change. Since these failures can "disappear on rerun" and are easy to ignore, they tend to accumulate without being fixed. This debt of accumulated nondeterminism wastes time and hardware resources, and also frustrates contributors who investigate failures they cannot reproduce. While a policy exists to help contributors handle intermittent failures, its guidance has proven insufficient to prevent subtle issues from accumulating into a crisis. For example, documents such as this RFC are produced every 3-6 months when a pipeline stability crisis occurs. If this RFC is accepted, whenever the rate of intermittent failure rises above a threshold, an automated notification will prompt a SIG-Testing member to follow the runbook. Such interventions have regularly been necessary in the past, but had insufficient metrics, no automation, and no runbook.

To limit the frequency a human must manually follow this runbook, SIGs should also get automatically notified of failures well before a critical failure rate is reached. Existing autocut issues contain little information specific to the failure, and are not deduplicated based on this information. This can be improved to cut separate issues for different failure-causes. This can additionally combine with information from GitHub Codeowners to automatically find an appropriate SIG label to assign new issues for investigation.

The intended outcome is:

Improved detection of individual intermittent failures
Clearer automated notification of accumulated failures which are similar
A backup plan for maintaining functionality when the outcomes above are insufficient to maintain pipeline stability

Suggestion design description

Definitions

Automated Review (AR): the portion of the Continuous Integration Pipeline which gates merging code from pull requests into a shared branch such as Development or Stabilization. Test failures here include intentional rejections, where the system is functioning normally and rejecting bad code, as well as unintended intermittent failures. Due to this, AR metrics are not used in the automation proposed below, though they may still be a useful health metric.

Branch Update Run: These builds are post-submission health checks, executed against the current state of the shared branch. All failures here are unintentional intermittent failures, or are a sign of a merge error. If Branch Update runs are failing then Automated Review runs (of merging in a new change) should similarly be failing. This is the primary source of health metrics proposed for this RFC.

Periodic (Nightly) Builds: These are periodic health checks, which execute a broader and slower range of tests. Health metrics should also be reported from here.

Tolerable Failure Rates

Metrics on test failure rates are an inherently imprecise and fuzzy measurement, which try to demonstrate statistical confidence. A single piece of code may have one extremely rare intermittent failure, or it may have multiple simultaneous patterns of intermittent failure, or it may eventually become consistently failing due to complex environmental factors. The following confidence bands intend to simplify interpreting fuzzy data, starting from the most severe:

Consistent: A failure is occurring so often that maintainers are effectively unable to merge code. Failures at this threshold and higher become difficult to distinguish from a 100% failure rate.
Severe: A failure occurs fairly consistently, creating significant increases to the time and cost to merge code. At this rate of failure, intervention becomes necessary.
Warning: A failure occurs inconsistently, though not frequently enough to cause severe throughput issues. At this rate of failure it becomes more important to isolate and fix an issue.
Detected: A failure has occurred at least once, but may be relatively rare. In this case it may be important to fix, or may be considered a non-issue. (current notifications only trigger on each detection of failure)
Undetected: A failure occurs in zero of the recently sampled cases. This does not mean an issue does not exist, but that it is effectively undetectable by existing automation. If a bug does exist, it could theoretically be detected by expanding the sample set.

Within these categories, some already have obvious steps to follow. Consistently failing issues will continue to follow the GitHub issues workflow. Issues undetected by automated tests either prompt new automation to detect them, or may be safe to ignore. And for the purposes of this proposal, the "Detected" category is a threshold to not require additional action beyond continuing to auto-cut an issue to notify about the failure. The boundaries between the remaining categories are proposed as:

Severe: a failure rate between 1/20 and 1/2 (above this would be Consistent)
Warning: a failure rate between 1/100 and 1/20 (below this would be Detected)

These thresholds are subjective and are sensitive to the scope of the product, its tests, and its pipeline environment. Due to subjectivity, the values may need to change as O3DE changes in scope. To better handle the broad scope, metrics are proposed at three aggregated levels. Each level acts as a filter with different sensitivity, catching what the previous one misses. The intent of these categories is to accurately identify a problem area when possible, but still detect when small problems accumulate into a widespread issue. To keep the definitions simple, the same confidence bands are proposed for the metrics categories. This is described below in the section "Metrics for Failure Rates".

Autocut Issue Improvements

Existing autocut issues rarely result in action and pile up from failed Branch Update Runs. This proves that these issues are not effectively tracked beyond the push-notifications sent as instant messages. The runbook above calls for advance notifications sent to SIGs, and suggests the existing autocut issues are the appropriate medium. To effectively use them, the following improvements are recommended:

Autocut deduplication, where a preexisting issue is updated instead of cutting a new issue, needs to be based off additional failure information. Instead of targeting a single issue title, failure root cause and origin information should be included. Modifying deduplication to include failure analysis will result in dozens more autocut bugs. This analysis needs to be carefully made less-generic, as certain bugs have multiple failure patterns. Becoming overly sensitive (for example to logging statements with timestamps) could result in never deduplicating, and cutting multiple unrelated issues tracking the same bug.
A form of failure root cause analysis is already used within Jenkins. The output of this plugin can be used for deduplication. If there are additional patterns we want to use to dedupe issues, they can be added to the patterns used by this tool. If the output of this tool is too noisy, we can add a filter between it and the deduplication mechanism.
When a failure occurs that references a file of origin (compilation failure, test failure, etc.) the system can look up that file in GitHub Codeowners, and find most likely SIG to initially assign the new issue.
For convenience, a link will be included in the Jenkins results summary whenever any issue gets autocut/deduped from a build.

Failure Runbook

When any pipeline failure first occurs, it will result in an autocut issue. This issue should be auto-assigned to the SIG designated by the Codeowners file for investigation. As an issue continues to reproduce, existing issues should be auto-commented on. And if failure rates rise above a threshold, a second issue gets cut to SIG-Testing to intervene by following a runbook. The full runbook is not defined here, only an outline of how it is used.

This runbook will document both Automated and Manual processes to reduce intermittent failure. The automated processes are documented to clarify the steps that have already been taken. When the initial automated portions are insufficient, the automation prompts SIG-Testing to take action in the manual portion of the runbook by auto-cutting an extra issue in GitHub. The following steps apply to Branch Update Runs and Periodic Builds, with the intent to keep Automated Review runs only seeing newly-introduced failures. (After RFC, this runbook should exist as its own document in the SIG-Testing repo)

Pipeline Automation:
When any failure is detected in a Branch Update or Periodic Build, an issue is updated or auto-cut to track it (this is already implemented today). If tests were executed, then test metrics will also be uploaded. It is expected that autocut issues which are due to intermittent behavior may be claimed and then closed due to no reproduction, or ignored due to low priority.

Issues Automation with Metrics:
When new test metrics are uploaded, any new failure should prompt querying recent failure metrics. Based on the query results, take the following actions:

An individual test failure is above the Warning threshold:
- Update autocut issue with name of failing test, failure rate, and a link to the intermittent failure policy
- If also above Critical, add label priority/critical
- If also above the Consistent threshold, Autocut (or update) a second issue to SIG-Testing and relate it to the original with label priority/critical to investigate rising failure rates
A failing test-module is above the Warning threshold:
- Update autocut issue with name of failing module, failure rate, and a link to the intermittent failure policy
- If also above Critical, add label priority/critical
- If also above the Consistent threshold, Autocut (or update) a second issue to SIG-Testing and relate it to the original with label priority/critical to investigate rising failure rates
All pipeline runs are above the Critical threshold:
- Autocut (or update) a second issue to SIG-Testing and relate it to the original with label priority/critical, to investigate rising failure rates with unclear origin
- Include a link to this SIG-Testing runbook
- If also also above the Consistent threshold, update label priority/blocker on the investigation ticket

Note: Warning threshold is currently undefined at the pipeline level, as it is more sensitive to failures.
Note: Creating a pipeline critical will always occur before module or individual test critical. However it may not get investigated before other more-specific critical issues are logged. (This may result in nearly always having an investigation open)
Note: Can result in a SIG-Testing investigation being prompted on the first new failure shortly after a prolonged failure.

Metrics for Failure Rates

Three levels of test metrics are proposed, and each require alarms that trigger automated actions in GitHub Issues.

Test failures per pipeline run

Contributors are most directly impacted by the aggregate test failure rate of an entire run of the Automated Review Pipeline. This involves running tests across all modules for multiple variant builds in parallel, and O3DE has grown to nearly 200 test modules. Certain modules run in parallel with one another, and certain failures may only occur when all modules execute. There are around 100,000 tests which currently execute on each pipeline run, and a single intermittent failure across any of these tests results in a failed run.

Failures per test module

Test modules often contain hundreds of individual tests. In a test module with a hundred tests, if every test had only a 1% independent failure rate then the module would be statistically expected to almost always fail. Additionally, certain tests may only fail when run with the rest of their module. Due to the finer granularity and scale, these metrics are less sensitive than those for the full pipeline. For instance, if each of the current ~200 modules contain only a single 1/100 error rate, barely triggering a warning-level response for modules, then across the two variant test executions the pipeline would expect a 100% failure rate with around 4 failed modules in every run. ( 2 build variants * 200 modules * 1/100 fail rate = 4 failures per pipeline run). While an unhealthy state is possible with a low per-module failure rate, it should still be detected by the pipeline-wide metric above.

Failures per individual test

Individual tests must be highly stable. They are also the smallest, quickest data point to iterate on. With nearly 50,000 (and growing) tests across the Main and Smoke suites, even a one-in-a-million failure baked into each test could accumulate into severe pipeline-level failure rates ( 2 build variants * 50,000 tests * 1/1,000,000 fail rate = 1/10 runs fail). While this makes subtle issues in individual tests the least sensitive to accumulated failure, identifying a single problematic test is also the best case scenario for debugging. And while these per-test metrics detect only specific issues, other complex systemic issues should be caught by the aggregate metrics. An unhealthy state is again possible with a low per-test failure rate, but should l be detected by investigations into either the module-wide or pipeline-wide metrics above.

Metrics Requirements

A metrics backend system needs to track historical test failure data in Branch Update Runs and Periodic Builds, which will be queried to alarm on the recent failure trends.

The following metrics need to be collected from all tests executed in every run:

Test Name
Pass/Fail
Build Job ID
Pipeline run ID

The following needs to be collected from every Test Module run by CTest in a pipeline:

Test Module Name
Pass/Fail
Build Job ID
Pipeline run ID

The following needs to be collected from every test Build Job execution:

Build Job ID
Pipeline run ID
Pass/Fail (all tests passed / any failed)

The following needs to be collected from every Pipeline execution:

Pipeline run ID
Pass/Fail (combined across all Build Jobs in the run)

To ensure statistical accuracy, the metrics analysis should be conducted across the most recent 100 runs from within 1 week. This should provide a balance between recency and accuracy.

The heaviest of these metrics will be for individual tests in Branch Update Runs. Test name identifiers are often in excess of one hundred characters, and there are currently nearly 50,000 tests across the Smoke and Main suites. With around 12 branch update runs per day triggering two test-runs each, this can result in a sizeable amount of data. Periodic Builds currently execute a few hundred longer-running tests as often as twice per day, and would constitute less than 1% of the total data. Periodic builds may also change in frequency depending on the needs and scale of the O3DE project.

To reduce the volume of data, we can store only individual test failures and calculate passes based on total runs of a build job. This may result in builds that fail early (during machine setup, build, get aborted, etc.) artificially inflating the test pass rate, since they would not create test metrics. Newly added tests would similarly start with a inflated pass rate. Further analysis on this exists below in the Appendix on Metrics Estimates.

Example intermittent failure scenario

An individual test "A" has already failed a few times within the previous week during Branch Update Runs, and encounters two failures during some of the Branch Update Runs today. The automation would take the following steps on as new failures occur today:

Run 1
- Test A fails, uploading metrics
- Collect failure information from Jenkins "Identified Problems" root cause summary and "Test Result" summary
- Check GitHub Issues for an open autocut failure, with a description containing this root cause summary and test name, which is found already created
- Query the failure rate metrics for the test, which is now 4/100 and below the severe failure rate
- Query the failure rate metrics for the test-module which is now 5/100 (due to another intermittently failing test)
  - Update the existing issue with label priority/critical and add a comment with the name of the failing test module and its failure rate
Run 2
- Passes! Metrics are uploaded, and no further action is taken
Run 3
- Test B fails, uploading metrics
- Collect failure information from Jenkins "Identified Problems" root cause summary and "Test Result" summary
- Check GitHub Issues for an open autocut failure, with a description containing this root cause summary and test name, which is NOT found
- Create a new issue for the failure in Test B (which may not be intermittent)
- Query the failure rate for failing test, which is now 1/100
- Query the failure rate for the failing test module which is now 1/100
- Query the overall failure rate for all Pipeline runs, which is now 5/100
  - Create a new issue with label priority/critical assigned to SIG-Testing to investigate increasing overall failure rate
Run 4
- Passes! Metrics are uploaded, and no further action is taken
Run 5
- Test A fails, uploading metrics
- Collect failure information from Jenkins "Identified Problems" root cause summary and "Test Result" summary
- Check GitHub Issues for an open autocut failure, with a description containing this root cause summary and test name, which is found (same as Run 1)
- Query the failure rate for failing test, which is now 5/100
  - Update the existing issue with label priority/critical (re-added in case a user removed it) and add a comment with the name of the failing test and its failure rate

What are the advantages of the suggestion?

Stop more nondeterministic pipeline crises before they start.
- Expose metrics on failure rates which other SIGs can passively monitor.
- Actively warn other SIGs when their code is increasingly misbehaving, and they have not taken action.
- Take action when accumulated error cannot be addressed by SIGs.
Improved issue deduplication to help all SIGs track new issues.

What are the disadvantages of the suggestion?

Taking action on behalf of SIGs may act as an unintended pressure valve, reducing existing habits to take proactive action.
- Mitigated by: improving autocut tickets to accurately notify the owning SIG.
Runbook expands stated responsibilities of the relatively small SIG-Testing, competes with delivering and improving test automation tools.
Expanding issue automation increases its maintenance complexity.
- Current autocut issues are low-value, but also low-noise and low-frustration.
- Automated creation and deduplication can never be perfectly concise and accurate, this will only shape the patterns of noise.
- Increasing the volume of autocut issues (even with accurate deduplication) has diminishing returns on resolving more bugs.
Does not prevent new intermittent issues from passing through Automated Review, detects failures after submission.

Are there any alternatives to this suggestion?

AR Test failures can be automatically retried, bypassing current user pain points by ignoring intermittent failures.
- Succeeding after automated retry is effectively ignoring certain negative results, and accepting increased infrastructural cost they incur.
  - Increases cost and time to run tests.
- Does not prevent shipping intermittently failing code.
  - Increases the rate of false-pass failures propagating across branches, and then into projects which depend on O3DE.
  - A "shift-right" solution that forwards pain of intermittent product bugs toward release to the end user, where issues become more expensive to fix.
- Metrics on actual failure rate become even more important to monitor, since many failures become automatically ignored.
- Not a mutually exclusive solution. Could still be delivered alongside this RFC.
AR Test failures can be automatically retried, still failing if the initial test failed, but collecting additional testing metrics to display in AR.
- Already is proposed in #22
  - Not initially pursued, primarily due to Infrastructural cost constraints of rerunning tests.
- Does not prevent shipping intermittently failing code.
  - Helps clarify AR false-failures as intermittent.
  - Does not detect intermittent false-passes.
- Not a mutually exclusive solution. Could still be delivered alongside this RFC.
Tests in Automated Review could unconditionally run multiple times, to find intermittent behavior within a single change.
- Significantly increases the time and infrastructure cost of running Automated Review
  - Costs are multiplied when growing the test suite
- Increases the rate that existing intermittent failures block submission in AR
  - Increased blocking pain may increase priority on resolving intermittent issues, but is not guaranteed to. May instead lead to attempts to bypass automation.
Periodically run tests dozens to hundreds of times, collecting failure metrics separately from AR and Branch Updates.
- Would collect metrics identical to those proposed above for Branch Update Runs.
  - More metrics would improve statistical fidelity.
- Creates significant Infrastructural cost.
Pipeline failure metrics alone could be recorded and published for other SIGs to act on how they see fit, without a failsafe process for SIG-Testing to follow.
- Channels existing maintainer frustration toward "Automated Review" as a whole being unreliable, risks forfeiting trust in the tool.
  - Investing in this may not resolve any issues.
  - Metrics not tied to a specific policy are easy to ignore.
  - Other SIGs are not currently requesting these metrics.
Use this proposal, but set different stability standards for different test types. This could be between C++ and Python tests, between the Smoke and Main test suites, or another partition.
- Does not address the overall stability issues in AR.
  - Only divides where the budget of "tolerable" instability gets spent.
- Complex rules about "what can fail how often" obscure the actual problem: a feature or its test is too unstable to verify in Automated Review.

What is the strategy for adoption?

O3DE Maintainers begin receiving modified failure notifications in autocut issues.
SIG-Testing gets cut investigation issues, scheduled as maintainers find time.
- The runbook will contain documentation for SIG-Testing to understand the metrics which trigger using the runbook.
- Stability investigations should be completed within one month of being cut.
  *Approval of this RFC should involve input from all SIGs. Delivery of these changes may need coordination with SIG-Build.

Are there any open questions?

Which metrics solution do we onboard to?
- Public O3DE metrics are currently blocked by #34 and may need an interim solution. However this RFC intends to clarify the metrics requirements, and have a plan ready for when they become unblocked. All data handling considerations for test-data should be handled in that issue.

Appendix

Metrics Estimates

Below is an rough estimate of the volume of test metrics and their cost. Other SIGs may have additional metrics needs, which are not calculated here.

A. Jenkins Pipeline Metrics

There are currently a total of 29 stages across the seven parallel jobs in each of the Automated Review (Pull Requests) and Branch Indexing (Merge Consistency Checks) runs. There are approximately 40 pull request runs and 20 branch updates per day (across two active branches).

Periodic Builds have many more jobs, currently around 180 stages across 46 parallel jobs. It is difficult to estimate how this set of stages will change over time.

The daily Jenkins-metrics load factor of "Pipeline Run" and "Job-Stage Run" should be around 60x29 + 180x3 = 2280. If we remove Periodic Mac builds, this would be around 2200 Jenkins-level metrics per day. It is difficult to estimate how this set of pipelines and stages will change over time. However the top level metrics should stay a comparatively low volume.

B. Test Result Metrics

There are currently around 43,000 tests that run in each Automated Review and Branch Update test-job, and this is expected to slowly grow over time. One path to reducing the scope of test metrics is to bundle these metrics into only reporting on the module that contains sets of tests, for which there are currently 135 modules in Automated Review. This is a major tradeoff of data quality for a ~99.9% reduction in size, which should at least be paired with saving the raw number of pass, fail, error, and skipped tests.

Another way to reduce metrics is to not explicitly store test-pass data, and to add entries for only non-pass results. This has the negative effect of conflating (reconstructed) data on "pass" and "not run" and makes it unclear when a test becomes renamed or disabled, but otherwise stores explicit failure data with significantly reduced load. This would result in a variable load of metrics which increases as more tests fail per run. Currently around 1/10 of test runs encounter a test failure. When such failures occur, the current average number of failures is around 2.5.

The daily load factor for all test-metrics would be 43,000x2x60 + 2,000x4x3 = 5,184,000 test-level metrics per day.

If only modules are reported, this would be 135x2x60 + 34x4x3 = 16,608 module-level metrics per day.

If modules and test-failures are reported, this would be approximately 0.1x2.5x60 + 135x2x60 + 34x4x3 ~= 16,625 module-plus-failure metrics per day. Since this load is variable, it is rounded up to 20,000. This suggests that saving only failure data would be a ~99.6% reduction in storage.

C. Profiling Metrics

There are currently 1795 Micro-Benchmark metrics (across 10 modules), and 10 planned end-to-end benchmarks of workflows. Providing a metrics pipeline will encourage the current number of performance metrics to grow, as such profiling data otherwise has little utility. A wild estimate is that this will expand by 10x within a year.

These execute only in the three Periodic Builds, in each on Windows and Linux. This makes the daily profiling metrics load 1805x3x2 = 10,830, likely growing to ~110,000 daily within a year.

Estimated Total Metrics Load

Metrics systems commonly store metrics with dimensional values, grouping a "single metric" as multiple related values and not only as individual KVPs. Under this model, daily metrics would be around 5,200,000 which is heavily dominated by test-metrics. This reduces to around 35,000 (a greater than 99% reduction) if only test modules and failures are logged, and not all individual tests.

If test metrics are allowed to naturally grow, this could reach 6,000,000 daily metrics within a year, or perhaps 150,000 if only test modules and failures are recorded. This would be around 42,000,000 vs 1,050,000 metrics per week, 180,000,000 vs 4,500,000 per month, 2,200,000,000 vs 55,000,000 per year. These metrics would exist across four or five metrics types (Pipeline Run, Job-Stage Run, Test Result, Profiling Result) with Test-Module Result being important if the reduced load is selected.

Estimated Metrics Cost

While a metrics solution has not yet been selected, here is one off-the-shelf estimate:

AWS CloudWatch primarily charges per type of custom metric, as well as a small amount per API call that uploads metrics data. Each post request of custom metrics is limited to 20 gzip-packed metrics, for which we should see a nearly 1:1 ability to batch data from test-XMLs. Monthly CloudWatch cost estimates for metrics plus a dashboard and 100 alarms are $112 for full metrics (4 types with 10MM API calls) vs $14 for failure-only (5 types with 250k API calls). This is estimated across four or five metrics types (Pipeline Run, Job-Stage Run, Test Result, Profiling Result) with Test Module Result being added if the reduced load is selected.

However it is important to limit the types of custom metrics in CloudWatch (and likely any other backend as well). While it could make dashboard partitioning and alarm-writing easier, monthly costs would be extremely high if individual metric-types were all stored separately. For instance if every unique metric were accidentally stored with a unique key (10MM types), the monthly cost could be over $240,000!

While out of scope for this RFC, this is a critical dependency which must have its usage and access limited: #34

wiki sidebar in o3de and other related repositories needs a cohesive sidebar section for test tools documents

Test tool documentation that covers testing tools and practices are being added to various repository wiki pages. a sidebar linking group needs to be created to offer navigation to these documents
a new document was added for multi-test framework, but other than direct linking there is no sidebar discoverable link for it.

https://github.com/o3de/o3de/wiki/Multitest-Framework-(Running-multiple-tests-in-one-executable)

we need to create a section in the wiki sidebar to house links to this and future documents
https://github.com/o3de/o3de/wiki/_Sidebar/_edit

not certain what to name this section. possibly Test Practices and Tooling

Create Priority labels for the SIG-Testing repository

As a means to help prioritize and assign out issues from the SIG-Testing repo we are proposing the following labels be added which mirror the o3de repo's labels:

priority/blocker Highest priority. Must be actively worked on right now as it is blocking other work.
priority/critical Critical priority. Must be actively worked on as someone's top priority right now.
priority/major Major priority. Work that should be handled after all blocking and critical work is done.
priority/minor Lowest priority. Work that may be scheduled
needs-priority Indicates issue lacks a priority/foo label and requires one.

Proposed SIG-Testing meeting agenda for Jan-28-22

Meeting Details

Date/Time: January 28, 2022 @ 19:30 UTC / 14:30 ET
Location: Discord SIG-Testing Voice Room
Moderator: amzn-dk
Note Taker amzn-dk

The SIG-Testing Meetings repo contains the history past calls, including a link to the agenda, recording, notes, and resources.

Meeting Agenda

RFC Doc Read and comment: @evanchia-ly-sdets Read over and discuss Proposed RFC Feature: Automatically Discover Intermittently Failing Tests #22

Update on editortest.py extension work: @jromnoa Give update and further discussion on work to extend editortest.py to support more exes.

Outcomes from Discussion topics

TBD

Action Items

TBD

Open Discussion Items

TBD

RFC: Re-factor the pytest.Class objects in the editor_test framework code.

Summary:

A single pytest.Class object that shared/batched/parallel/single test classes can all inherit from and share.
This RFC was spawned as a byproduct of the work in #27 as the pytest.Class object proved far too difficult to convert over, requiring an RFC of its own.

What is the relevance of this feature?

Why is this important? This is important to help limit tech debt and ease the burden of long term maintenance for the test tools. As it stands now, most of the code that both the Editor and upcoming (o3de/o3de#9609) that is utilized for the pytest.Class portions are copypasta with the editor fixtures swapped out for material_editor fixtures instead on the ly_test_tools side.
What are the use cases? The use cases are for any new O3DE executables that we want to test using these batched/parallel test tools. For instance, if a team (let's say Atom) adds customizations to their code in the new MaterialEditor files that have the potential to benefit all shared tests but that code is on the pytest.Class object, then other test classes with their unique pytest.Class objects will not inherit these benefits (such as the Editor). We'd have to manually add the code to each object (MaterialEditorTestClass & EditorTestClass). Additionally, when we want to add a new feature it will be much easier than it is now since parsing through the nested for loops and callback functions within those for loops gets quite tedious (it mixes some fixtures in there as well to get data/values from further complicating things as far as re-factoring or adding new functionality).
What will it do once completed? Have a singular pytest.Class class that new classes inherit from will ease the burden of adoption as well as ease fixing future bugs or functionality within the code.

Feature design description:

The Design: I haven't had the chance yet to fully lay out the architectural discoveries I made with the editor_test framework when working on the other RFC, however this new class would ideally go into the newly created multi_test_framework.py file, similar to the other objects & functions that were moved and created. You can see a full list of changes in the PR for that work here: o3de/o3de#9609
After the new class is added to that file, the editor_test.py and material_editor_test.py modules will inherit from it for their specific EditorTestClass and MaterialEditorTestclass objects respectively. Not everything will be able to be flattened and inherited by both, but anything that can be should be moved over. Ideally some __init__() attribute setting, pytest fixture parametrization, or similar implementation can be utilized in the editor_test.py and material_editor_test.py modules for setting specific values for Editor or MaterialEditor.
Some pitfalls to watch out for discovered during the previous RFC work on MaterialEditor: Fixtures primarily. I tried to get the fixture calls in the correct order when converting the pytest.Class object but it become non-trivial when swapping between MaterialEditor and Editor tests. Make sure, at all phases of the re-factor, that you are actively running and checking the test code as you iterate. I failed to do this on my first attempt at the MaterialEditor RFC and it buried me in a messy callstack. Update the code, then run the code (mainly because of the fixtures) one step at a time.
For additional use cases, simply run the existing tests we have in https://github.com/o3de/o3de/tree/development/AutomatedTesting/Gem/PythonTests and make sure they work the same as they do now.

Technical design description:

Write out a new architectural chart for how the existing objects function and where we want our new pytest.Class object to go to modify this existing architecture. This will help keep the work focused and guided before diving deep into the coding aspects (perhaps a future RFC review can contain this effort, for all of @o3de/sig-testing to review together).
Create a new class in the new multi_test_framework.py module. Name it something similar to the other class objects in there (i.e. AbstractTestClass which we'll go with for the rest of this design description to make it easier to follow).
Have the AbstractTestClass inherit from pytest.Class similar to how EditorTestClass inside editor_test.py does: AbstractTestClass(pytest.Class).
Add all shareable functions and code into the new AbstractTestClass object.
MaterialEditorTestClass and EditorTestClass should inherit from AbstractTestClass: MaterialEditorTestClass(AbstractTestClass) & EditorTestClass(AbstractTestClass).
You will probably need to do a super().__init__() call for these classes after inheriting from AbstractTestClass.
Do not break any existing interfaces and do not rename anything that is used by the tests inside https://github.com/o3de/o3de/tree/development/AutomatedTesting/Gem/PythonTests since we want to make adoption of this for test writers as least disruptive as possible (emergent work kills sprint productivity).

What are the advantages of the feature?

Ease of creation of new features as well as maintenance or bug fixing for the editor_test framework code.
Ability to easily add any executable as a new potential test target with our test tools.
Implemented as part of ly_test_tools so we get to reap all of the benefits of the existing test functions and code by simply expanding upon it with these new objects.

What are the disadvantages of the feature?

Complication of the re-factor work required here means that the architectural planning for this needs to be accurate before the work is taken on. Make sure to chart out the objects in the existing framework code and ensure they are understood then share them with the team and how this new re-factor will play into that going forward.

How will this be implemented or integrated into the O3DE environment?

It will be a simple PR to update the existing code, then it will be automatically adopted behind the scenes (backend tech debt change).

Are there any alternatives to this feature?

An alternative in the meantime is to simply do a copypasta + tweak of the existing pytest.Class objects for new executables that need to utilize these framework tools. While it's not ideal and won't help with long term tech debt woes, it can solve the problem of having no functionality at all by taking this approach - now is better than never: "Now is better than never. Although never is often better than right now".

How will users learn this feature?

They can review the new framework design documentation that will go out and add new features or fix issues they see with the existing objects and functionality we have in place.
This will hopefully deepen existing users' understanding on the batched/parallel test framework code which is a net benefit for all users of these tools.

Are there any open questions?

Is this important to do right now or since it's tech debt do we wait before adding it?
How do users feel about existing documentation?
Should this integrate more seamlessly into the existing ly_test_tools or should it remain kind of inside its own section like it is now in ly_test_tools?

Propose change/disable of autocut to SIG-Build

Follow up on #43 Autocut Issue Improvements, to clarify tasks + ownership.

Proposed RFC Suggestion: Moving `editor_entity_utils` to the `EditorPythonBindings` gem

Summary:

I am proposing moving the editor_entity_utils python module out of the EditorPythonTestTools package, so that it can be used outside of its current project (AutomatedTesting). The proposed new location is the EditorPythonBindings gem.

What is the motivation for this suggestion?

This is important because currently, the editor_entity_utils module can only be used in the AutomatedTesting project. This is preventing users from leveraging this module in their own projects that are generally useful beyond the scope of just automated tests (e.g. helper methods for creating entities, managing components on those entities, modifying properties on components, etc...).

This is actually a continuation of a previous change made (o3de/sig-content#61) where we moved the pyside_utils module from the AutomatedTesting project to the QtForPython gem for the same motivation of using the module in any user project. There are other modules in EditorPythonTestTools that would be generally useful to move as well, but it will be easier to move specific modules at a time.

Making these modules more generally accessible has been raised several times in the past, most recently as a result of the following PR for the URDF Exporter Gem (o3de/o3de-technicalart#5). In this case, the POC had to replicate a lot of the functionality that is provided by the editor_entity_utils because it couldn't be used outside of the AutomatedTesting project.

Suggestion design description:

Move the editor_entity_utils.py file from EditorPythonTestTools to <O3DE>\Gems\EditorPythonBindings\Editor\Scripts
Update any automated testing scripts to import from the new location (should just require changing editor_python_test_tools.editor_entity_utils to editor_entity_utils)

What are the advantages of the suggestion?

Now the editor_entity_utils would be usable from any project, not limited to only the AutomatedTesting project
Lower the barrier to entry for users wanting to create custom tools that interact with entities/components/properties by providing convenient helper classes/methods
The EditorPythonBindings gem is a good landing spot since it provides the azlmbr python modules that these helpers are built upon

What are the disadvantages of the suggestion?

If there are any users working out of the AutomatedTesting project and using editor_entity_utils helper methods their imports would stop working. This should be a very small number of people since most users create their own projects, but nevertheless we should probably announce this as an impactful change.

Future considerations

As mentioned earlier, there are additional modules in EditorPythonTestTools that would be strong candidates to be moved outside of the AutomatedTesting project as well so that they could be used in any project. It might be beneficial to identify those as part of this work, even if we choose to only move the editor_entity_utils module right now.

Also, once the editor_entity_utils module has been moved, we should work with @shawstar to update the URDF Exporter gem (o3de/o3de-technicalart#5) as an initial customer/use-case to replace its current helper methods for entity/component/property actions with corresponding editor_entity_utils APIs.

Roadmap: In-Game Test Automation

Summary:

Tools to simplify creating automated tests which execute inside the game engine, such as the deployable Game and Server executables.

What is the relevance of this feature?

Helps O3DE contributors write tests for in-game (in-engine) functionality, separate from the Editor. Helps teams using O3DE write tests for their own production code.

Tasks

Beta Give feedback

Proposed RFC Feature: In-Game Test Harness #20

priority/critical rfc-feature triage/accepted
RFC: In-Engine Testing #66
(RFC followup work TBD)
Options

Clarify LFX Insights blockers for test metrics

The LFX Insights tool has been recommended for use with O3DE by the Linux Foundation. There are supposedly issues blocking its use for recording test metrics from O3DE. Clarify the gaps between current functionality and desired functionality, whether adoption is blocked, and what specific features are blocked.

RFC: AzLmbr Bus Event libraries for Python

Summary:

An extensive library of Enums for the AzLmbr Bus Events to be used in python.

This would be a library of a raw bus event call that will pass the string for that event along instead of using string literals.

EG:

azlmbr.editor.EditorComponentAPIBus(bus.Broadcast, bus.EditorComponenetBus.AddComponentOfType, object)

azlmbr.editor.EditorComponentAPIBus(bus.Broadcast, "AddComponentOfType", object)

What is the motivation for this suggestion?

Why is this important?

Prevents the use of using string literals that are prone to syntax errors.
Helps enforce SOLID and DRY software development by giving a single place for these libraries to live and usage of encapsulated data.
Makes updating a bus event structure easier by only having to change the value of the Enum vs finding each usage of the string in a bus event call.

What are the use cases for this suggestion?

Every azlmbr bus event call in the Python Automation scripts.

What should the outcome be if this suggestion is implemented?

An extensive library of enums to be used for calling bus events.

Suggestion design description:

AutoGen: Leverage the autogen to create the mappings for bus events automatically on every build
- Advantages:
  - Mappings are updated every build allowing for up-to-date bus event mappings whenever the bus events are exposed/added/updated.
  - Eliminates automation maintenance for bus event updates due to the bus event mappings being automatically updated and remain, allowing the automation to remain stable.
  - Disadvantages:
    - Development time to plan and implement the basic autogen system for generating these.
    - This solution will require creating a pattern for changes in bus event names that will remap/point to the new event name.

Are there any alternatives to this suggestion?

Manual Generation: A small group of people dive into all the existing bus events and manually create an enum library of all exposed bus events.
- Advantages:
  - Software design for this solution is minimal.
  - Updating bus events that had their name changed is easy and doesn't require a development pattern.
- Disadvantages:
  - This will require manual maintenance when new busses are exposed, new bus events are created, or bus events change names/patterns.
  - Manual creation of enums could result in syntax errors (but are easy to fix once discovered)

What is the strategy for adoption?

Add the usage of libraries to the Automation Best Practices doc.
Add docs about the libraries.
Give a discord talk about the libraries.
Use PRs to help enforce the usage and spread of the knowledge of the libraries.

SIG Maintainer Nomination for smurly

Nomination Guidelines

Maintainer Nomination Requirements

Has been a Reviewer for 2+ months
8+ reviewed Pull Requests in the previous 2 months
200+ lines of code changed across all reviewed Pull Request
2+ O3DE Maintainers that support the promotion from Reviewer to Maintainer
Requirements to retain the Reviewer role: 4+ Pull Requests reviewed per month

Maintainer Nomination

Fill out the template below including nominee GitHub user name, desired role and personal GitHub profile

I would like to nominate: smurly, to become a Maintainer on behalf of sig-testing. I verify that they have fulfilled the prerequisites for this role.

Nominee's GitHub Profile https://github.com/smurly

Reviewers & Maintainers that support this nomination should comment in this issue.

SIG-Testing Chair/Co-Chair Nominations 12/1 - 12/8 -- Elections 12/8 - 12/15

SIG chair / co-chair elections for 2022

Since the inception of O3DE, each SIG chair has been staffed as an interim position. It's time to hold some official elections, following some of the proposed guidance but with our own process due to the holiday season and in order to expedite the elections into next year.

The chair / co-chair roles

Unless distinctly required, the term "chairperson" refers to either/both of the chair and co-chair. If a chair or co-chair is required to perform a specific responsibility for the SIG they will always be addressed by their official role title.

In particular, if both chairpersons would be unavailable during a period of time, the chair is considered to be an on-call position during this period. As the higher vote-getter they theoretically represent more of the community and should perform in that capacity under extenuating circumstances. This means that if there is an emergency requiring immediate action from the SIG, the chair will be called to perform a responsibility.

Responsibilities

Schedule and proctor regular SIG meetings on a cadence to be determined by the SIG.
Serve as a source of authority (and ideally wisdom) with regards to O3DE SIG area of discipline. Chairpersons are the ultimate arbiters of many standards, processes, and practices.
Participate in the SIG Discord channel and on the GitHub Discussion forums.
Serve as a representative of the broader O3DE community to all other SIGs, partners, the governing board, and the Linux Foundation.
Represent the SIG to O3DE partners, the governing board, and the Linux Foundation.
Coordinate with partners and the Linux Foundation regarding official community events.
Represent (or select/elect representatives) to maintain relationships with all other SIGs as well as the marketing committee.
Serve as an arbiter in SIG-related disputes.
Coordinate releases with SIG Release.
Assist contributors in finding resources and setting up official project or task infrastructure monitored/conducted by the SIG.
Long-term planning and strategy for the course of the SIG area of discipline for O3DE.
Maintain a release roadmap for the O3DE SIG area of discipline.

Additionally, at this stage of the project, the SIG chairpersons are expected to act in the Maintainer role for review and merge purposes only, due to the lack of infrastructure and available reviewer/maintainer pool.

... And potentially more. Again, this is an early stage of the project and chair responsibilities have been determined more or less ad-hoc as new requirements and situations arise. In particular the community half of this SIG has been very lacking due to no infrastructural support, and a chairperson will ideally bring some of these skills.

Nomination

Nomination may either be by a community member or self-nomination. A nominee may withdraw from the election at any time for any reason until the election starts on 12/3.

Nomination requirements

For this election, nominees are required to have at minimum two merged submissions to http://github.com/o3de/o3de (must be accepted by 2022-01-31). This is to justify any temporary promotion to Maintainer as required by this term as chairperson. Submissions may be in-flight as of the nomination deadline (2021-12-08 12PM PT), but the nominee must meet the 2-merge requirement by the end of the election or they will be removed from the results.

Any elected chairperson who does not currently meet the Maintainer status will be required to work with contributors from the SIG to produce an appropriate number of accepted submissions by January 31, 2022 or they will be removed and another election will be held.

The only other nomination requirement is that the nominee agrees to be able to perform their required duties and has the availability to do so, taking into account the fact that another chairperson will always be available as a point of contact.

How to nominate

Nominations will be accepted for 1 week from 2021-12-01 12:00PM PT to 2021-12-08 12:00PM PT.
Nominate somebody (including yourself) by responding to this issue with:

A statement that the nominee should be nominated for a chair position in the specific SIG holding its election. Nominees are required to provide a statement that they understand the responsibilities and requirements of the role, and promise to faithfully fulfill them and follow all contributor requirements for O3DE.
The name under which the nominee should be addressed. Nominees are allowed to contact the election proctor to have this name changed.
The GitHub username of the nominee (self-nominations need not include this; it's on your post.)
Nominee's Discord username (sorry, but you must be an active Discord user if you are a chairperson.)

Election process

The election will be conducted for one week from 2021-12-08 12:00PM PT and 2021-12-15 12:00PM PT and held through an online poll. Votes will be anonymous and anyone invested in the direction of O3DE and the SIG holding the election may vote. If you choose to vote, we ask that you be familiar with the nominees.

If there is a current interim chair, they will announce the results in the Discord sig channel as well as the SIG O3DE mailing list no later than 2021-12-17 1:00PM PT. If there is no interim chair, the executive director will announce the results utilizing the same communication channels. At that time if there is a dispute over the result or concern over vote tampering, voting information will be made public to the extent that it can be exported from the polling system and the SIG will conduct an independent audit under the guidance of a higher governing body in the foundation.

The elected chairpersons will begin serving their term on 2022-01-01 at 12AM PT. Tentatively SIG chairs will be elected on a yearly basis. If you have concerns about wanting to replace chairs earlier, please discuss in the request for feedback on Governance.

Proposed SIG-Testing meeting agenda for Mar-15-22

Meeting Details

Date/Time: March 15, 2022 @ 18:00 (6pm) UTC / 11:00 PDT
Location: Discord SIG-Testing Voice Room
Moderator: Kadino
Note Taker TBA

The SIG-Testing Meetings repo contains the history past calls, including a link to the agenda, recording, notes, and resources.

SIG Updates

New meeting time, which follows North America daylight savings
Discord service outage delayed meeting time for a week

Meeting Agenda

Review #27
Open forum to discuss ongoing challenges

Outcomes from Discussion topics

Discuss outcomes from agenda

Action Items

Create actionable items from proposed topics

Open Discussion Items

List any additional items below!

RFC: Adding Github Codeowners File

Summary

O3DE currently does not have code ownership automation. This means that processes such as creating pull requests and GitHub issues requires manually adding the correct Special Interest Group (SIG). For a contributor who is not familiar with the code base this can be an intimidating task. The user will have to either scour documentation or ask and wait for responses in the O3DE discord chats. Not only is this process is tedious, but can potentially lead towards incorrect code owner reviews on pull requests.

This RFC will discuss utilizing GitHub Codeowners as a solution, along with answering implementation details.

What is the relevance of this feature?

Codeowners is a well known public tool that adds automation capabilities for code ownership in a GitHub repository. This will label directories and files to their corresponding SIG. The Codeowners file will also label unowned code, which can be used ensure that each file in O3DE has an owner. It's important that the tool is well known to the open source community since O3DE is also open source.

Feature design description:

A Codeowners file will be created in the O3DE repository. The file will contain manually added mappings between directories and GitHub users, which can include a SIG team alias. The file should be separated by SIG's, which will organize their own directory ownerships. Having a GitHub Codeowners file in the O3DE repository will automatically add all corresponding SIG's to for each file affected by a Pull Request. Viewing a file in the O3DE repository on the GitHub website will also indicate who the owning SIG is.

There are multiple CLI tools that contributors can take advantage of. These tools can quickly tell a user the owner of a specified file without having to open a Pull Request or look on GitHub. These tools can also scan the repository for all unowned files which can be utilized to ensure every line of code has an owner.

Technical design description:

Setting up the GitHub Codeowners file should follow the official documentation.

A file will be created at ~/o3de/.github/CODEOWNERS as per the documentation above. The file's content will be grouped by SIG's who will be responsible for specifying their own areas of ownership. Once the file is added to the repository, its effects will automatically take place.

# SIG-Core
Code\Editor @sig-core
Code\Foo @sig-core @sig-testing  # Example of Co-Owned code

# SIG-Testing
Tools\LyTestTools @sig-testing

The Codeowners file will cease to function if the size exceeds 3MB. The Early Warning System should add a new validator to ensure the size of the file is below the threshold.

Un-owned Code Policy

Initially, we should not enforce that newly added files to the O3DE repository have an owner because it will require some effort and coordination to reach a point where all files/directories have an owner. Eventually, the Early Warning System should block Pull Requests that adds new files with no owner added in the Codeowners file. This feature can be implemented as part of this initiative, but should only be enabled once the codebase already has an owner for all files.

GitHub Codeowners Ownership

The ownership of the GitHub Codeowners file itself should be owned or determined by the Technical Steering Committee as it's an important organizational key to the repository. It has been suggested that SIG-Ops could be a potential owner as the group is responsible for automation.

What are the advantages of the feature?

GitHub Codeowners automatically adds the designated owners of each file in a pull request, ensuring that the correct SIG reviews all changes on files they own. There are CLI tools that users can use to determine which SIG owns a file. This will reduce the amount of triaging for O3DE GitHub Issues with no SIG label. Most importantly, the Codeowners file can enable further types of automation such as targeted SIG notifications on Automated Review failures.

What are the disadvantages of the feature?

Automation can produce an overwhelming amount of noise and notifications. While this shouldn't be an issue during the initial implementation of the Codeowners file, it's important that any future automation keeps this in mind.

Are there any alternatives to this feature?

O3DE can implement its own custom ownership system by utilizing CMake. This would allow for full customization of code ownership automation for O3DE. This path is not suggested due to the implementation effort difference between a custom system and the GitHub Codeowners file. It is also important to use a well known public tool as previously mentioned above for public O3DE contributors.

How will users learn this feature?

Since the process is automatic, there is not a large amount of teaching required. An email and discord notification announcing the implementation and linking the official documentation should suffice.

Are there any open questions?

After the feature rollout, there should be an effort to establish ownership for all unowned code. Once all unowned is accounted for, then the new file code ownership validator should be enabled.

Feedback due 2/11