opencsgs / swe-bench-docker

This project forked from aorwall/swe-bench-docker

A Docker based solution of the SWE-bench evaluation framework

License: MIT License

Shell 4.17% Python 45.84% Dockerfile 49.99%

swe-bench-docker's Introduction

This is a Dockerfile-based solution for the SWE-bench evaluation framework.

The solution is designed so that each "testbed" for testing a version of a repository is built as a separate Docker image. Each test is then run in its own Docker container. This approach gives more stable test results because the environment is completely isolated and is reset for each test. Because a fresh container is created from the prebuilt image each time, nothing has to be reinstalled between tests, which speeds up the benchmark process.

Validation

SWE-Bench_Lite

Docker images for the testbeds used in the SWE-Bench_Lite dataset have been built and tested on gold predictions. Two benchmark instances are currently failing. See the results in the evaluations/SWE-bench_Lite_golden folder.

SWE-Bench

Docker images for the testbeds used in the SWE-Bench dataset have been built and tested on the check-harness predictions published by SWE-bench. Ten benchmark instances are currently failing. See the results in the evaluations/SWE-bench_check_harness folder.

Comparing results from other agents

I have tested running the Docker benchmarks on SWE-Agent's GPT-4 predictions and on AutoCodeRover's first benchmark run.

The SWE-Agent GPT-4 predictions yield exactly the same results of 18% (54) resolved issues as SWE-Agent's own results, which seems to show that the Docker image approach works with the same accuracy.

However, the Docker benchmark provides better results for AutoCodeRover. In AutoCodeRover's own benchmarks, they achieve 16.00% (48), 15.67% (47), and 16.67% (50) resolved issues. In swe-bench-docker, the same predictions result in 18.00% (54), 19.00% (57), and 19.00% (57) resolved issues. This gives a pass@3 of 26.00% (78), compared to the 22.33% (67) reported in the AutoCodeRover paper. This suggests that other agents' benchmarks may show lower results than the agents actually achieve, because it is difficult to run evaluations that are completely accurate.
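
The figures quoted above are consistent with a dataset of 300 instances, which is what 18% (54) implies. The pass@3 value is assumed here to count instances resolved in at least one of the three runs, so it is a set union rather than a sum (54 + 57 + 57 = 168, not 78). A small sketch of both calculations; the helper names and the instance IDs are made up for illustration and are not real results:

```python
def percent(resolved: int, total: int = 300) -> float:
    """Share of resolved instances, in percent, rounded to two decimals."""
    return round(100 * resolved / total, 2)


def pass_at_k(runs: list[set[str]]) -> set[str]:
    """Instances resolved in at least one run (union across runs)."""
    return set().union(*runs)


# Counts quoted in the text, checked against 300 instances.
for count in (48, 47, 50, 54, 57, 78, 67):
    print(count, percent(count))

# Hypothetical instance IDs: overlapping runs make the union smaller than the sum.
runs = [{"a-1", "b-2"}, {"b-2", "c-3"}, {"a-1", "c-3", "d-4"}]
print(len(pass_at_k(runs)))
```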

Docker image types

There are currently three different types of Docker image for running benchmarks.

Conda

Testbeds are set up in a Conda environment similar to the original SWE-bench environment.

Pyenv

Since each benchmark is tested in its own container, using Conda may be overkill. Testbeds are set up with only the correct Python version installed via Pyenv. This approach has been shown to result in fewer erroneous benchmark instances in repositories where it has been tested, and the image becomes smaller. Currently, django, psf/requests and scikit-learn use this type of Docker image. Hopefully, more repositories can be run this way.

Instance image

In scikit-learn, some benchmarks seem to fail because Cython code isn't compiled. To avoid building the project before each test, an image is built for each benchmark instance.

Run evaluation

Run run_evaluation.py to evaluate a predictions file. A log for each test is written to log_dir in the same format as in the SWE-bench evaluation tools, and the same tooling can then be used to generate a report.

python run_evaluation.py 
    --predictions_path [Required]  Path to the predictions file 
    --log_dir          [Required]  Path to directory to save evaluation log files 
    --swe_bench_tasks  [Required]  Path to SWE-bench task instances file or dataset 
    --namespace        [Optional]  Namespace of the Docker repository 
    --log_suffix       [Optional]  Suffix to append to log file names
    --skip_existing    [Optional]  Skip evaluating task instances with logs that already exist
    --timeout          [Optional]  Timeout for installation + test script execution
    --num_processes    [Optional]  Number of processes to run in parallel (-1 for unlimited)
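
As an illustration, a call to the script could be assembled like this. The flag names come from the listing above. The file paths, the `example` namespace, the process count, and the assumption that `--skip_existing` is a value-less switch are placeholders rather than documented defaults.

```python
import shlex


def evaluation_command(predictions_path, log_dir, swe_bench_tasks,
                       namespace=None, num_processes=None, skip_existing=False):
    """Assemble the argument list for run_evaluation.py from the documented flags."""
    cmd = [
        "python", "run_evaluation.py",
        "--predictions_path", predictions_path,
        "--log_dir", log_dir,
        "--swe_bench_tasks", swe_bench_tasks,
    ]
    if namespace is not None:
        cmd += ["--namespace", namespace]
    if num_processes is not None:
        cmd += ["--num_processes", str(num_processes)]
    if skip_existing:
        # Assumed to be a value-less switch; check the script's --help output.
        cmd.append("--skip_existing")
    return cmd


cmd = evaluation_command(
    predictions_path="predictions.json",  # placeholder path
    log_dir="logs",                       # placeholder path
    swe_bench_tasks="princeton-nlp/SWE-bench_Lite",
    namespace="example",                  # placeholder namespace
    num_processes=4,
    skip_existing=True,
)
print(shlex.join(cmd))
# To execute for real: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.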

Pull Docker images

It may be worth pulling all images before running the script so that timing is more consistent across the evaluation.

scripts/pull_docker_images.sh [Dockerfiles directory] [Namespace]

Build Docker images

Generate Dockerfiles

Generates Dockerfiles for all test beds in a SWE-Bench benchmark dataset. These can then be used to build Docker images.

python run_dockerfile_generator.py 
    --swe_bench_tasks  [Required]  Path to SWE-bench task instances file or dataset 
    --namespace        [Required]  Namespace of the Docker repository 
    --docker_dir       [Required]  Path to the directory where the Dockerfiles will be saved

Build Docker images

This script builds Docker images from all Dockerfiles.

scripts/build_docker_images.sh [Dockerfiles directory] [Namespace]

Push Docker images

This script pushes the Docker images built from all Dockerfiles.

scripts/push_docker_images.sh [Dockerfiles directory] [Namespace]
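
Taken together, the three steps form a pipeline: generate Dockerfiles, build images, push images. The sketch below shows only that ordering, using the script names and arguments listed in this section. The `build_pipeline` helper, the `docker` directory, and the `example` namespace are placeholders. It prints the commands and makes no claim about how the scripts behave internally.

```python
import shlex


def build_pipeline(swe_bench_tasks: str, namespace: str, docker_dir: str) -> list[list[str]]:
    """Return the three documented commands in the order they need to run."""
    return [
        ["python", "run_dockerfile_generator.py",
         "--swe_bench_tasks", swe_bench_tasks,
         "--namespace", namespace,
         "--docker_dir", docker_dir],
        ["scripts/build_docker_images.sh", docker_dir, namespace],
        ["scripts/push_docker_images.sh", docker_dir, namespace],
    ]


steps = build_pipeline("princeton-nlp/SWE-bench_Lite", "example", "docker")
for step in steps:
    print(shlex.join(step))
```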

Troubleshooting

Run single instance

Run a single instance and print logs to stdout.

python run_single_instance.py 
    --instance_id      [Required]  Instance ID of the task to run
    --swe_bench_tasks  [Optional]  Path to SWE-bench task instances file or dataset (default is princeton-nlp/SWE-bench_Lite)
    --namespace        [Optional]  Namespace of the Docker repository
    --predictions_path [Optional]  Path to the predictions file, if not set the golden patch will be used

Build single Docker image

scripts/build_docker_images.sh [Namespace] [Testbed directory]

swe-bench-docker's People

Contributors

aorwall, elliottlawrence
