
proteasome's Introduction

1. Testing The Framework

This section is intended for framework developers. After modifying the framework's core code, you can test it using mocked datasets by executing these commands:

# Generate mocked datasets:
# - Creates "_mocked_benchmark" case set in "data/case_set/"
# - Creates "_mocked_samples" sample set in "data/sample_set/"
# - Creates "_mocked_eval_task" evaluation task in "data/eval_task/"
python mock.py -case -sample -task

# Execute evaluation (mock mode):
# 1. Tasks prefixed with "_mocked_" will run in mock mode
# 2. Runtime profiles will be generated in "data/eval_task/task_running/"
# 3. Final results will be stored in "data/result/"
python evaluate.py _mocked_eval_task

# Cleanup mocked assets (optional):
# Note: This removes only the mocked datasets/tasks, not runtime profiles/results
python mock.py clean -case -sample -task

2. Creating The Case Set

Evaluation requires a case set. If you do not already have one, create it with the following command:

# The name "my_benchmark" should be replaced with an appropriate name.
# Note: This command prompts for the title and description of the new case set; you can enter any text for them.
python create_case_set.py my_benchmark

3. Making A Case

This section shows, by example, how to create cases for a resource-creation operation.

3.1. Creating Workshop Directory

Create a directory named workshop_of_case.

3.2. Initializing Environment

Navigate into the workshop_of_case directory and create an HCL source file (e.g., main.tf) as shown below:

# A provider block is required. Add a required_providers block as well if needed.
provider "alicloud" {
  region = "cn-shanghai"
}

Next, run the init and apply commands to initialize the environment:

# The init command downloads the provider plugins and creates the .terraform.lock.hcl file.
tofu init

# The apply command generates the Terraform state file.
tofu apply

At this point, the environment is initialized and ready for resource creation.

3.3. Saving The Pre-Operation Environment

Run the save_env.py script to capture the environment state before performing any operations:

python save_env.py pre_op

This will create a directory named pre_op inside the workshop directory, containing the necessary files to represent the pre-operation environment state.

3.4. Creating An OSS Bucket

Update the main.tf file as follows:

provider "alicloud" {
  region = "cn-shanghai"
}

# Declare an OSS bucket.
resource "alicloud_oss_bucket" "demo_bucket" {
  bucket = "demo-bucket-2025-0507-1348"
  acl = "public-read"
}

Then, run the apply command to create the OSS bucket:

tofu apply

3.5. Saving The Post-Operation Environment

Run the save_env.py script to capture the environment state after performing the operation:

python save_env.py post_op

This will create a directory named post_op inside the workshop directory, containing the necessary files to represent the post-operation environment state.

3.6. Preparing A Manifest File

Create a file named manifest.yaml in the workshop directory to define the test cases. Below is an example:

resource_type: alicloud_oss_bucket
operation_type: create
singularity: s0
essentials:
  - >
    count(alicloud_oss_bucket) == 1
  - >
    opt("alicloud_oss_bucket_acl") is None or count(alicloud_oss_bucket_acl) == 1
  - >
    provider_of(first_of(alicloud_oss_bucket)).region == "cn-shanghai"
  - >
    opt("alicloud_oss_bucket_acl") is None or
    provider_of(first_of(alicloud_oss_bucket_acl)).region == "cn-shanghai"
  - >
    first_of(alicloud_oss_bucket).bucket == "demo-bucket-2025-0507-1348"
  - >
    len(list(filter(
        lambda x: x is not None,
        {
            first_of(alicloud_oss_bucket).opt("acl"),
            first_of(opt("alicloud_oss_bucket_acl"))
        }
    ))) == 1
  - >
    first_of(alicloud_oss_bucket).opt("acl") == "public-read" or
    opt("alicloud_oss_bucket_acl") is not None and
    first_of(alicloud_oss_bucket_acl).opt("acl") == "public-read"
  - >
    opt("alicloud_oss_bucket_acl") is None or first_of(alicloud_oss_bucket_acl).bucket ==
    "${" + first_of(alicloud_oss_bucket).self_path() + ".bucket}"
  - >
    first_of(alicloud_oss_bucket).opt("storage_class") in {None, "Standard"}
  - >
    first_of(alicloud_oss_bucket).opt("lifecycle_rule") is None
actions:
  side_effect: deny
  required:
    - action: create
      pattern: alicloud_oss_bucket.*
      count: 1
  optional:
    - action: create
      pattern: alicloud_oss_bucket_acl.*
      count: 1
    - action: create
      pattern: alicloud_oss_bucket_versioning.*
      count: 1
security: []
misc:
  - >
    alicloud_oss_bucket.opt("demo_bucket") is not None
  - >
    alicloud_oss_bucket.demo_bucket.opt("versioning") is None or
    alicloud_oss_bucket.demo_bucket.versioning.opt("status") == "Suspended"
  - >
    opt("alicloud_oss_bucket_versioning") is None or
    first_of(alicloud_oss_bucket_versioning).bucket ==
    "${" + first_of(alicloud_oss_bucket).self_path() + ".bucket}" and
    first_of(alicloud_oss_bucket_versioning).opt("status") in {None, "Suspended"}
  - >
    alicloud_oss_bucket.demo_bucket.opt("cors_rule") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("website") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("logging") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("server_side_encryption_rule") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("transfer_acceleration") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("redundancy_type") is None or
    alicloud_oss_bucket.demo_bucket.redundancy_type == "LRS"
  - >
    alicloud_oss_bucket.demo_bucket.opt("access_monitor") is None or
    alicloud_oss_bucket.demo_bucket.access_monitor.opt("status") == "Disabled"
user_inputs:
  i3: >
    Create an Alibaba Cloud OSS bucket in the cn-shanghai region.
    Name the bucket demo-bucket-2025-0507-1348, and set the resource block name to demo_bucket.
    Set the ACL to public-read.
    Do not enable versioning or lifecycle management.
  i2: >
    Create an Alibaba Cloud OSS bucket in the cn-shanghai region.
    Name the bucket demo-bucket-2025-0507-1348.
    Set the ACL to public-read.
  i1: >
    Create an OSS bucket named demo-bucket-2025-0507-1348 in Shanghai that anyone can read.
  i0: >
    Create an OSS bucket.
clarity: [c1, c0]

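Each case is defined by a combination of five dimensions: resource type, operation type, integrity, clarity, and singularity. This manifest.yaml specifies one resource type, one operation type, one singularity level (s0), four integrity levels (the user_inputs keys i3 to i0), and two clarity levels (c1 and c0), so it can describe up to 1 × 1 × 1 × 4 × 2 = 8 unique cases.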
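
The count of 8 can be reproduced with a short Python sketch. It assumes that the keys of user_inputs (i3 to i0) are the integrity levels and that every combination of integrity, clarity, and singularity yields one case; the variable names are illustrative and not part of the framework.

```python
from itertools import product

# Values copied from the example manifest.yaml.
resource_type = "alicloud_oss_bucket"
operation_type = "create"
singularity_levels = ["s0"]

# Assumption: the keys of user_inputs are the integrity levels.
integrity_levels = ["i3", "i2", "i1", "i0"]
clarity_levels = ["c1", "c0"]

# One case per combination of the five dimensions.
cases = [
    (resource_type, operation_type, integrity, clarity, singularity)
    for integrity, clarity, singularity in product(
        integrity_levels, clarity_levels, singularity_levels
    )
]

for case in cases:
    print(".".join(case))
print(len(cases), "cases")
```

Adding a second singularity level or removing a clarity level in the manifest would change the product accordingly.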

3.7. Saving The Cases To The Case Set

Use the save_cases.py script to save the cases defined in the workshop directory to a specified case set:

# "my_benchmark" refers to the name of an existing case set.
python save_cases.py my_benchmark

After saving the cases, restore the real environment to the state captured in the "pre_op" snapshot.

Now that the cases have been created, you can repeat the steps above to add more cases as needed.

4. Setting Samples and Evaluation Task

Create a sample set file in data/sample_set, e.g., samples_testing.yaml:

kind: sample_set
sample_set:
  title: Sample For Testing
  description: This sample set is for testing.
  samples:
    - [ qwen-max-2025-01-25, builtin_agent.d1.e1 ]
    - [ deepseek-chat, builtin_agent.d0.e1 ]

In the samples section of the YAML file, each sample is a pair consisting of a model name and a decorated agent name.

  • The model name refers to a specific LLM.
  • The decorated agent name includes the agent's name and two feature flags:
    • "d0": the agent does not support documentation.
    • "d1": the agent does support documentation.
    • "e0": the agent is not aware of the IaC environment.
    • "e1": the agent is aware of the IaC environment.
    • "es": the agent is only aware of the IaC state.
    • "ec": the agent is only aware of the IaC code (HCL).

The builtin_agent is a special type of agent that can enable or disable specific features using feature flags.
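
To make the naming scheme concrete, here is a minimal Python sketch that splits a decorated agent name into its parts. It assumes the layout is always name.d-flag.e-flag, in that order, as in the two samples above; parse_decorated_agent and DecoratedAgent are hypothetical helpers and not part of the framework.

```python
from dataclasses import dataclass

# Flag meanings as listed in this section.
DOC_FLAGS = {
    "d0": "does not support documentation",
    "d1": "supports documentation",
}
ENV_FLAGS = {
    "e0": "not aware of the IaC environment",
    "e1": "aware of the IaC environment",
    "es": "aware of the IaC state only",
    "ec": "aware of the IaC code (HCL) only",
}


@dataclass(frozen=True)
class DecoratedAgent:
    name: str
    doc_flag: str
    env_flag: str


def parse_decorated_agent(decorated: str) -> DecoratedAgent:
    """Split 'agent.dX.eY' into its agent name and two feature flags."""
    # Assumption: the last two dot-separated parts are the flags.
    parts = decorated.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected '<agent>.<d-flag>.<e-flag>', got {decorated!r}")
    name, doc_flag, env_flag = parts
    if doc_flag not in DOC_FLAGS:
        raise ValueError(f"unknown documentation flag: {doc_flag!r}")
    if env_flag not in ENV_FLAGS:
        raise ValueError(f"unknown environment flag: {env_flag!r}")
    return DecoratedAgent(name, doc_flag, env_flag)


agent = parse_decorated_agent("builtin_agent.d1.e1")
print(agent.name, "|", DOC_FLAGS[agent.doc_flag], "|", ENV_FLAGS[agent.env_flag])
```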

Finally, create an evaluation task file in data/eval_task, e.g., eval_testing.yaml:

kind: eval_task
eval_task:
  case_set: my_benchmark
  sample_set: samples_testing
  repeats: 3
  iac_tool:
    name: tofu
    version: 1.9.0
    path: /opt/homebrew/bin/tofu

Key fields explained:

  • case_set: The name of the case set to be evaluated.
  • sample_set: The name of the sample set to be used in this evaluation.
  • repeats: The number of times each case will be evaluated.
  • iac_tool: Configuration for the Infrastructure-as-Code (IaC) tool, including:
    • name: The name of the IaC tool (e.g., tofu)
    • version: The version to be used
    • path: The full path to the executable
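
Before starting a long run, it can help to sanity-check the task definition. The sketch below works on the task as a plain Python dict (for example, after loading the YAML file with a parser of your choice). The framework's own validation rules are not described here, so the set of required fields and the check_eval_task helper are assumptions made for illustration.

```python
import os


def check_eval_task(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if doc.get("kind") != "eval_task":
        problems.append('kind must be "eval_task"')

    task = doc.get("eval_task") or {}
    # Assumption: all four documented fields are required.
    for key in ("case_set", "sample_set", "repeats", "iac_tool"):
        if key not in task:
            problems.append(f"missing field: {key}")

    repeats = task.get("repeats")
    if repeats is not None:
        is_int = isinstance(repeats, int) and not isinstance(repeats, bool)
        if not is_int or repeats < 1:
            problems.append("repeats must be a positive integer")

    tool = task.get("iac_tool") or {}
    for key in ("name", "version", "path"):
        if key not in tool:
            problems.append(f"missing field: iac_tool.{key}")

    # The path should point at an executable file on this machine.
    path = tool.get("path")
    if path is not None and not (os.path.isfile(path) and os.access(path, os.X_OK)):
        problems.append(f"iac_tool.path is not an executable file: {path}")
    return problems


example = {
    "kind": "eval_task",
    "eval_task": {
        "case_set": "my_benchmark",
        "sample_set": "samples_testing",
        "repeats": 3,
        "iac_tool": {"name": "tofu", "version": "1.9.0", "path": "/opt/homebrew/bin/tofu"},
    },
}
for problem in check_eval_task(example):
    print(problem)
```

On a machine where tofu is installed elsewhere, the only reported problem for the example is the path check.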

5. Before Evaluating

Before running the evaluation, make sure the following preparations are complete:

  • Provider credentials: Store the credentials for the provider plugins in environment variables (e.g., ALICLOUD_ACCESS_KEY, ALICLOUD_SECRET_KEY).

  • Configuration file: Create or update the config.yaml file. Example:

    eval_concurrency: 4
    models:
      - name: qwen-max-2025-01-25
        endpoint: https://dashscope.aliyuncs.com
        base_url_path: /compatible-mode/v1
        secret_key_env_var: LLM_SK
      - name: qwen-plus-2025-01-25
        endpoint: https://dashscope.aliyuncs.com
        base_url_path: /compatible-mode/v1
        secret_key_env_var: LLM_SK
      - name: qwen-turbo-2025-02-11
        endpoint: https://dashscope.aliyuncs.com
        base_url_path: /compatible-mode/v1
        secret_key_env_var: LLM_SK
      - name: deepseek-chat
        endpoint: https://api.deepseek.com
        base_url_path: /v1
        secret_key_env_var: DEEPSEEK_SK
      - name: qwq-plus
        endpoint: https://dashscope.aliyuncs.com
        base_url_path: /compatible-mode/v1
        secret_key_env_var: LLM_SK
    builtin_agent:
      default_model: qwen-max-2025-01-25
    informer_agent:
      c1_model: qwq-plus
      c0_model: qwq-plus

  • Custom agents: If you're using agents other than builtin_agent, you must implement corresponding adapters and register them in adapter/factory.py.
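
A quick pre-flight check for the models section of config.yaml can catch missing credentials early. The Python sketch below takes the model entries as plain dicts. It assumes that the request base URL is the endpoint followed by base_url_path, which the field names suggest but this document does not state; both helper functions are hypothetical.

```python
import os
from typing import Mapping


def model_base_url(model: Mapping[str, str]) -> str:
    """Join endpoint and base_url_path with exactly one slash between them."""
    # Assumption: the framework concatenates these two fields.
    return model["endpoint"].rstrip("/") + "/" + model["base_url_path"].lstrip("/")


def missing_secret_env_vars(models, environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of secret-key environment variables that are unset or empty."""
    names = {m["secret_key_env_var"] for m in models}
    return sorted(name for name in names if not environ.get(name))


# Two entries copied from the config.yaml example.
models = [
    {
        "name": "qwen-max-2025-01-25",
        "endpoint": "https://dashscope.aliyuncs.com",
        "base_url_path": "/compatible-mode/v1",
        "secret_key_env_var": "LLM_SK",
    },
    {
        "name": "deepseek-chat",
        "endpoint": "https://api.deepseek.com",
        "base_url_path": "/v1",
        "secret_key_env_var": "DEEPSEEK_SK",
    },
]

for m in models:
    print(m["name"], "->", model_base_url(m))
print("unset secrets:", missing_secret_env_vars(models))
```

The same pattern applies to the provider credentials (e.g., ALICLOUD_ACCESS_KEY and ALICLOUD_SECRET_KEY).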

6. Evaluating

To start the evaluation, run evaluate.py with the task name:

python evaluate.py eval_testing

If the evaluation is interrupted, you can resume it by running resume.py with the same task name:

python resume.py eval_testing

7. Analyzing The Result

After the evaluation completes, a result file will be generated in the data/result directory. You can analyze the results as needed based on your specific criteria.
