A framework for execution-based evaluation of any dataset in any language from the paper Measuring The Impact Of Programming Language Distribution.
- Docker (ideally installed in sudoless mode.)
- Python 3.9+
- Install the requirements with
pip install -r requirements.txt
- Choose a dataset from the supported raw datasets.. For this example we will use HumanEval.
- You must first convert the dataset to the BabelCode format. To create BC-HumanEval, run
python convert_dataset.py --dataset_name="human_eval" --input_path="data/raw_datasets/human_eval_questions.jsonl"
This will create the parsed dataset located in data/parsed_datasets/human_eval.jsonl
. It will additionally create the files data/golden_predictions/human_eval.jsonl
and data/convert_failures/human_eval.txt
. The former is the gold solutions from the dataset that allows validation of BabelCode. The convert failures is the description of all failures that occured when trying to convert the dataset.
- To generate the testing code and prompt data for the dataset run
python generate_test_code.py --gin_file="configs/generate_code.gin" \
--input="data/parsed_datasets/human_eval.jsonl" \
--output="data/problem_code/human_eval"
This will create the data/problem_code/human_eval
directory. In it will be the following files:
testing_code.jsonl
: The jinja testing scripts and problem information for each problem in the dataset and each language supported by BabelCode.prompt_info.jsonl
: The translated signatures, both with and without docstrings, for each problem in the dataset and each language supported by BabelCode.failures
: A directory of JSONL files where each file contains the failures for each langauge.
- Finally run
bash scripts/docker_eval.sh configs/tutorial_eval.gin tutorial data/golden_predictions/human_eval.jsonl data/problem_code/human_eval
The outputs will be written to eval_outputs/tutorial
. It contains:
metrics.json
: The overall results for each languagepred_results.jsonl
: the results for each prediction{LANGUAGE}_execution_results.jsonl
: The raw execution result for each prediction in LANGUAGE
To make your predictions work with this framework, they must be in a jsonlines file where each line is its own prediction. Each prediction must have the follwing keys:
qid
: A question ID the prediction is for.language
: The language the prediction is in.code
: The code prediction to run.id
: The id of the prediction. For each question, no two predictions may have the same id.
Optional Keys are:
entry_fn_name
: The name of the function to call when evaluating.entry_cls_name
: If the language requires classes, this is the name of the prediction class to call when evaluating.
-
For each language, generate a json lines file from question specifications where each line contains both the question information and the raw code needed to check the correctness of a solution
- The code generated works by passing the test cases as arguments
- The actual execution is wrapped by
try-except
blocks so that we can return the exception name. - Then the resulting outcome string is printed in the format
TEST-{TEST_ID}...{RESULT_STR}
-
Using a provided directory where each file in it is a set of json lines predictions for a given language, create the code files in a temporary directory
-
Execute the predictions by simulating the shell and passing in language specific commands to compile and run a given file of code.
-
Parse out the
TEST-{TEST_ID}...{RESULT_STR}
lines from the stdout captured and compare this with the expected test ids in the question data. If theRESULT_STR
for each is notPASSED
, then the prediction is considered failing. -
Report the aggregated statistics from all the questions and save both the raw results (result for each test case, the stdout/stderr, commands used, etc.) and the metrics to a specified output directory.
Additional documentation can be found in the docs
directory.
This is not an officially supported Google product.