This section is intended for framework developers. After modifying the framework's core code, you can test it using mocked datasets by executing these commands:
# Generate mocked datasets:
# - Creates "_mocked_benchmark" case set in "data/case_set/"
# - Creates "_mocked_samples" sample set in "data/sample_set/"
# - Creates "_mocked_eval_task" evaluation task in "data/eval_task/"
python mock.py -case -sample -task
# Execute evaluation (mock mode):
# 1. Tasks prefixed with "_mocked_" will run in mock mode
# 2. Runtime profiles will be generated in "data/eval_task/task_running/"
# 3. Final results will be stored in "data/result/"
python evaluate.py _mocked_eval_task
# Cleanup mocked assets (optional):
# Note: This removes only the mocked datasets/tasks, not runtime profiles/results
python mock.py clean -case -sample -task
Before running an evaluation, you need a case set. If you do not already have one, create it with the following command:
# The name "my_benchmark" should be replaced with an appropriate name.
# Note: This command prompts for the title and description of the new case set; you can enter any text for them.
python create_case_set.py my_benchmark
This section walks through an example of creating cases for a resource-creation operation.
Create a directory named workshop_of_case.
Navigate into the workshop_of_case directory and create an HCL source file (e.g., main.tf) as shown below:
# A provider block is required. You can specify the provider requirement block if needed.
provider "alicloud" {
region = "cn-shanghai"
}
Next, run the init and apply commands to initialize the environment:
# The init command downloads the provider plugins and creates the .terraform.lock.hcl file.
tofu init
# The apply command generates the Terraform state file.
tofu apply
At this point, the environment is initialized and ready for resource creation.
Run the save_env.py script to capture the environment state before performing any operations:
python save_env.py pre_op
This will create a directory named pre_op inside the workshop directory, containing the necessary files to represent the pre-operation environment state.
Update the main.tf file as follows:
provider "alicloud" {
region = "cn-shanghai"
}
# Declare an OSS bucket.
resource "alicloud_oss_bucket" "demo_bucket" {
bucket = "demo-bucket-2025-0507-1348"
acl = "public-read"
}
Then, run the apply command to create the OSS bucket:
tofu apply
Run the save_env.py script to capture the environment state after performing the operation:
python save_env.py post_op
This will create a directory named post_op inside the workshop directory, containing the necessary files to represent the post-operation environment state.
Create a file named manifest.yaml in the workshop directory to define the test cases. Below is an example:
resource_type: alicloud_oss_bucket
operation_type: create
singularity: s0
essentials:
- >
count(alicloud_oss_bucket) == 1
- >
opt("alicloud_oss_bucket_acl") is None or count(alicloud_oss_bucket_acl) == 1
- >
provider_of(first_of(alicloud_oss_bucket)).region == "cn-shanghai"
- >
opt("alicloud_oss_bucket_acl") is None or
provider_of(first_of(alicloud_oss_bucket_acl)).region == "cn-shanghai"
- >
first_of(alicloud_oss_bucket).bucket == "demo-bucket-2025-0507-1348"
- >
len(list(filter(
lambda x: x is not None,
{
first_of(alicloud_oss_bucket).opt("acl"),
first_of(opt("alicloud_oss_bucket_acl"))
}
))) == 1
- >
first_of(alicloud_oss_bucket).opt("acl") == "public-read" or
opt("alicloud_oss_bucket_acl") is not None and
first_of(alicloud_oss_bucket_acl).opt("acl") == "public-read"
- >
opt("alicloud_oss_bucket_acl") is None or first_of(alicloud_oss_bucket_acl).bucket ==
"${" + first_of(alicloud_oss_bucket).self_path() + ".bucket}"
- >
first_of(alicloud_oss_bucket).opt("storage_class") in {None, "Standard"}
- >
first_of(alicloud_oss_bucket).opt("lifecycle_rule") is None
actions:
side_effect: deny
required:
- action: create
pattern: alicloud_oss_bucket.*
count: 1
optional:
- action: create
pattern: alicloud_oss_bucket_acl.*
count: 1
- action: create
pattern: alicloud_oss_bucket_versioning.*
count: 1
security: []
misc:
- >
alicloud_oss_bucket.opt("demo_bucket") is not None
- >
alicloud_oss_bucket.demo_bucket.opt("versioning") is None or
alicloud_oss_bucket.demo_bucket.versioning.opt("status") == "Suspended"
- >
opt("alicloud_oss_bucket_versioning") is None or
first_of(alicloud_oss_bucket_versioning).bucket ==
"${" + first_of(alicloud_oss_bucket).self_path() + ".bucket}" and
first_of(alicloud_oss_bucket_versioning).opt("status") in {None, "Suspended"}
- >
alicloud_oss_bucket.demo_bucket.opt("cors_rule") is None
- >
alicloud_oss_bucket.demo_bucket.opt("website") is None
- >
alicloud_oss_bucket.demo_bucket.opt("logging") is None
- >
alicloud_oss_bucket.demo_bucket.opt("server_side_encryption_rule") is None
- >
alicloud_oss_bucket.demo_bucket.opt("transfer_acceleration") is None
- >
alicloud_oss_bucket.demo_bucket.opt("redundancy_type") is None or
alicloud_oss_bucket.demo_bucket.redundancy_type == "LRS"
- >
alicloud_oss_bucket.demo_bucket.opt("access_monitor") is None or
alicloud_oss_bucket.demo_bucket.access_monitor.opt("status") == "Disabled"
user_inputs:
i3: >
Create an Alibaba Cloud OSS bucket in the cn-shanghai region.
Name the bucket demo-bucket-2025-0507-1348, and set the resource block name to demo_bucket.
Set the ACL to public-read.
Do not enable versioning or lifecycle management.
i2: >
Create an Alibaba Cloud OSS bucket in the cn-shanghai region.
Name the bucket demo-bucket-2025-0507-1348.
Set the ACL to public-read.
i1: >
Create an OSS bucket named demo-bucket-2025-0507-1348 in Shanghai that anyone can read.
i0: >
Create an OSS bucket.
clarity: [c1, c0]
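Each case is defined by a combination of five dimensions: resource type, operation type, integrity, clarity, and singularity.
This manifest.yaml fixes the resource type, the operation type, and the singularity (s0), and lists four integrity levels (i0 to i3, one per user input) and two clarity levels (c0 and c1), so it describes up to 4 × 2 = 8 unique cases.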
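The combinations can be enumerated as a Cartesian product of the values listed in the manifest. The sketch below assumes that this is how cases are derived; `enumerate_cases` and the tuple layout are illustrative names, not part of the framework.

```python
from itertools import product


def enumerate_cases(resource_type, operation_type, integrities, clarities, singularities):
    """Return every (resource, operation, integrity, clarity, singularity) combination."""
    return [
        (resource_type, operation_type, integrity, clarity, singularity)
        for integrity, clarity, singularity in product(integrities, clarities, singularities)
    ]


# Values taken from the example manifest.yaml above.
cases = enumerate_cases(
    "alicloud_oss_bucket",
    "create",
    integrities=["i3", "i2", "i1", "i0"],  # keys under user_inputs
    clarities=["c1", "c0"],                # the clarity list
    singularities=["s0"],                  # the single singularity value
)
print(len(cases))
```

Adding a second singularity value or another clarity level to the inputs would multiply the number of combinations accordingly.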
Use the save_cases.py script to save the cases defined in the workshop directory to a specified case set:
# "my_benchmark" refers to the name of an existing case set.
python save_cases.py my_benchmark
After saving the case, restore the real environment to the state captured in the "pre_op" snapshot.
Now that the cases have been created, you can repeat the steps above to add more cases as needed.
Create a sample set file in data/sample_set, e.g., samples_testing.yaml:
kind: sample_set
sample_set:
  title: Sample For Testing
  description: This sample set is for testing.
  samples:
    - [ qwen-max-2025-01-25, builtin_agent.d1.e1 ]
    - [ deepseek-chat, builtin_agent.d0.e1 ]
In the samples section of the YAML file, each sample is a pair consisting of a model name and a decorated agent name.
- The model name refers to a specific LLM.
- The decorated agent name includes the agent's name and two feature flags, one for documentation support and one for environment awareness:
  - "d0": the agent does not support documentation.
  - "d1": the agent does support documentation.
  - "e0": the agent is not aware of the IaC environment.
  - "e1": the agent is aware of the IaC environment.
  - "es": the agent is only aware of the IaC state.
  - "ec": the agent is only aware of the IaC code (HCL).
The builtin_agent is a special type of agent that can enable or disable specific features using feature flags.
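To make the naming scheme concrete, here is a minimal sketch that splits a decorated agent name into its parts. It assumes the format is always `<agent>.<d-flag>.<e-flag>`, as in the samples above; `parse_decorated_agent` is a hypothetical helper, not a function provided by the framework.

```python
# Flag meanings copied from the list above.
DOC_FLAGS = {"d0": False, "d1": True}
ENV_FLAGS = {
    "e0": "not aware of the IaC environment",
    "e1": "aware of the IaC environment",
    "es": "only aware of the IaC state",
    "ec": "only aware of the IaC code (HCL)",
}


def parse_decorated_agent(decorated):
    """Split '<agent>.<d-flag>.<e-flag>' into the agent name and its two flags."""
    parts = decorated.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected '<agent>.<d-flag>.<e-flag>', got {decorated!r}")
    agent, doc_flag, env_flag = parts
    if doc_flag not in DOC_FLAGS:
        raise ValueError(f"unknown documentation flag: {doc_flag!r}")
    if env_flag not in ENV_FLAGS:
        raise ValueError(f"unknown environment flag: {env_flag!r}")
    return {
        "agent": agent,
        "supports_documentation": DOC_FLAGS[doc_flag],
        "environment_awareness": ENV_FLAGS[env_flag],
    }


print(parse_decorated_agent("builtin_agent.d1.e1"))
```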
Finally, create an evaluation task file in data/eval_task, e.g., eval_testing.yaml:
kind: eval_task
eval_task:
  case_set: my_benchmark
  sample_set: samples_testing
  repeats: 3
  iac_tool:
    name: tofu
    version: 1.9.0
    path: /opt/homebrew/bin/tofu
Key fields explained:
- case_set: The name of the case set to be evaluated.
- sample_set: The name of the sample set to be used in this evaluation.
- repeats: The number of times each case will be evaluated.
- iac_tool: Configuration for the Infrastructure-as-Code (IaC) tool, including:
  - name: The name of the IaC tool (e.g., tofu)
  - version: The version to be used
  - path: The full path to the executable
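The fields above can be sanity-checked before a run. The following is a sketch only: it assumes the task file has already been loaded into a dictionary and that the fields are nested under `eval_task` with `iac_tool` as a sub-mapping. `validate_eval_task` is a hypothetical helper, not part of the framework.

```python
REQUIRED_KEYS = ("case_set", "sample_set", "repeats", "iac_tool")
IAC_TOOL_KEYS = ("name", "version", "path")


def validate_eval_task(doc):
    """Return a list of problems; an empty list means the task looks well-formed."""
    problems = []
    if doc.get("kind") != "eval_task":
        problems.append("kind must be 'eval_task'")
    task = doc.get("eval_task") or {}
    for key in REQUIRED_KEYS:
        if key not in task:
            problems.append(f"missing eval_task.{key}")
    repeats = task.get("repeats")
    if repeats is not None and (not isinstance(repeats, int) or repeats < 1):
        problems.append("repeats must be a positive integer")
    if "iac_tool" in task:
        tool = task["iac_tool"] or {}
        for key in IAC_TOOL_KEYS:
            if key not in tool:
                problems.append(f"missing eval_task.iac_tool.{key}")
    return problems


# The example task from above, written as the dictionary a YAML loader would produce.
task_doc = {
    "kind": "eval_task",
    "eval_task": {
        "case_set": "my_benchmark",
        "sample_set": "samples_testing",
        "repeats": 3,
        "iac_tool": {"name": "tofu", "version": "1.9.0", "path": "/opt/homebrew/bin/tofu"},
    },
}
print(validate_eval_task(task_doc))
```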
Before running the evaluation, make sure the following preparations are complete:
- Provider credentials: Store the credentials for the provider plugins in environment variables (e.g., ALICLOUD_ACCESS_KEY, ALICLOUD_SECRET_KEY).
- Configuration file: Create or update the config.yaml file. Example:
eval_concurrency: 4
models:
  - name: qwen-max-2025-01-25
    endpoint: https://dashscope.aliyuncs.com
    base_url_path: /compatible-mode/v1
    secret_key_env_var: LLM_SK
  - name: qwen-plus-2025-01-25
    endpoint: https://dashscope.aliyuncs.com
    base_url_path: /compatible-mode/v1
    secret_key_env_var: LLM_SK
  - name: qwen-turbo-2025-02-11
    endpoint: https://dashscope.aliyuncs.com
    base_url_path: /compatible-mode/v1
    secret_key_env_var: LLM_SK
  - name: deepseek-chat
    endpoint: https://api.deepseek.com
    base_url_path: /v1
    secret_key_env_var: DEEPSEEK_SK
  - name: qwq-plus
    endpoint: https://dashscope.aliyuncs.com
    base_url_path: /compatible-mode/v1
    secret_key_env_var: LLM_SK
builtin_agent:
  default_model: qwen-max-2025-01-25
informer_agent:
  c1_model: qwq-plus
  c0_model: qwq-plus
- Custom agents: If you're using agents other than builtin_agent, you must implement corresponding adapters and register them in adapter/factory.py.
To start the evaluation, run evaluate.py with the task name:
python evaluate.py eval_testing
If the evaluation is interrupted, you can resume it by running resume.py with the same task name:
python resume.py eval_testing
After the evaluation completes, a result file will be generated in the data/result directory. You can analyze the results as needed based on your specific criteria.
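As a pre-flight aid for the preparations listed earlier, the sketch below reports which credential variables are unset before evaluate.py is launched. The variable names are the ones that appear in this document's examples; `missing_env_vars` is a hypothetical helper, and your own config.yaml may reference different variables.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Names taken from the provider credentials and the secret_key_env_var entries above.
REQUIRED = ["ALICLOUD_ACCESS_KEY", "ALICLOUD_SECRET_KEY", "LLM_SK", "DEEPSEEK_SK"]

missing = missing_env_vars(REQUIRED)
print("missing:", ", ".join(missing) if missing else "none")
```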