zorazrw / odex

[EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation

Home Page: https://code-eval.github.io

License: Creative Commons Attribution Share Alike 4.0 International

Python 89.80% Jupyter Notebook 10.02% Shell 0.18%

code-generation evaluation execution open-domain

odex's People

veerumehta lwaekfjlk loubnabnl nashid liujuncn chanhee-luke

odex's Issues

CodeBLEU file missing

The __init__.py in the metric module imports compute_codebleu, but the CodeBLEU code is not included in the repo. Could you provide the corresponding code?

Hugging face call is deprecated

I get this when using a codebase based on ODEX:

/Users/gneubig/work/gemini-benchmark/benchmarking/Code/verify.py:13: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate
  code_eval_metric = load_metric("code_eval")
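
Following the warning's own suggestion, the call can be moved to the Evaluate library. The sketch below is one possible migration, not the repository's code: the helper name load_code_eval is hypothetical, and the fallback branch assumes an older environment where only datasets is installed. Note that the code_eval metric runs model-generated code and refuses to compute unless HF_ALLOW_CODE_EVAL is set to "1", so only enable it in a sandboxed environment.

```python
import os


def load_code_eval():
    """Load the code_eval metric, preferring the newer `evaluate` library.

    `load_code_eval` is a hypothetical helper name used for illustration.
    """
    # code_eval executes untrusted generated code, so it needs explicit opt-in.
    os.environ.setdefault("HF_ALLOW_CODE_EVAL", "1")
    try:
        import evaluate  # pip install evaluate

        return evaluate.load("code_eval")
    except ImportError:
        # Fallback for older environments: the deprecated datasets API.
        from datasets import load_metric

        return load_metric("code_eval")
```

In verify.py, the deprecated line would then become code_eval_metric = load_code_eval().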

Unexpected Keyword Argument 'replace_function_name'

Hi, when I run the CodeGen code, I get the following error.

Command:

python nl2code_codegen.py --language en --model_size 350M --model_data mono --output_dir codegen_350M

Error:

Traceback (most recent call last):
  File "/home/rudra/odex/nl2code_codegen.py", line 203, in <module>
    main()
  File "/home/rudra/odex/nl2code_codegen.py", line 175, in main
    scores_dict = evaluate(model, eval_dataloader, tokenizer, args)
  File "/home/rudra/odex/nl2code_codegen.py", line 77, in evaluate
    for i, batch_inputs in enumerate(dataloader): 
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rudra/.cache/CGLLM/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rudra/odex/src/data.py", line 65, in __getitem__
    prompt = create_fewshot_prompt_nl2code(
TypeError: create_fewshot_prompt_nl2code() got an unexpected keyword argument 'replace_function_name'
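
The last line of the traceback indicates that the caller in src/data.py passes a replace_function_name keyword that the installed definition of create_fewshot_prompt_nl2code does not accept, which suggests the two files are out of sync. A minimal diagnostic sketch follows, assuming nothing about the real signature: the stand-in function below and its parameters are hypothetical, and call_with_supported_kwargs is an illustrative helper, not part of the repository.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means every keyword is accepted as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(*args, **kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    dropped = sorted(set(kwargs) - set(supported))
    if dropped:
        print(f"Ignoring unsupported keyword arguments: {dropped}")
    return func(*args, **supported)


# Hypothetical stand-in; the real function's parameters are not shown in the issue.
def create_fewshot_prompt_nl2code(sample, num_tests=0):
    return f"{sample} (num_tests={num_tests})"


# Inspect what the installed definition actually accepts.
print(inspect.signature(create_fewshot_prompt_nl2code))

prompt = call_with_supported_kwargs(
    create_fewshot_prompt_nl2code,
    "sort a list",
    num_tests=1,
    replace_function_name=True,  # the keyword rejected in the traceback
)
print(prompt)
```

Dropping the keyword only hides the mismatch; the cleaner fix is to make the call site and the function definition agree on the same parameter list.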

Welcome to join OpenCompass

Hi,
Great work!
You are welcome to join OpenCompass to reach more users.
https://github.com/open-compass/opencompass

OpenCompass Team

Stripping the prompt can improve model performance

It seems that the ODEX prompts fed to the model end with trailing whitespace, which degrades the performance of models (CodeGen here) on the benchmark. Stripping the prompt before generation improves performance. Here are some numbers:

python nl2code_codegen.py --language en --model_size 2B --model_data mono \
 --num_tests_input 0 --num_tests_eval 100 --num_examples 0 --temperature 0.8 \
 --top_p 0.95 --num_return_sequences 50

gives:

Overall Pass@K Scores:
[pass@1] 0.4137 (439)
[pass@2] 0.4662 (439)
[pass@3] 0.4920 (439)
[pass@4] 0.5078 (439)
[pass@5] 0.5188 (439)
[pass@6] 0.5270 (439)
[pass@7] 0.5335 (439)
[pass@8] 0.5387 (439)
[pass@9] 0.5431 (439)
[pass@10] 0.5467 (439)

as opposed to the following (reported as percentages)

  "pass@1": 14.28,
  "pass@2": 15.69,
  "pass@5": 16.99,
  "pass@10": 17.54

without stripping (these are also the numbers reported in the paper).
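
For readers comparing the two sets of scores, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that pass the tests. The sketch below assumes that estimator (the one used by the Hugging Face code_eval metric); the function name pass_at_k is illustrative, and n = 50 matches the --num_return_sequences 50 flag in the command above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: budget.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 50 samples for one problem, 5 of which pass.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(50, 5, k):.4f}")
```

The benchmark score is the mean of this value over all problems (439 in the run above).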

(thanks @murthyrudra for running the code)
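
A minimal sketch of the proposed change, under stated assumptions: the issue only says "strip", so the use of rstrip() here is a guess based on the problem being trailing whitespace, and strip_prompt and the sample prompt are hypothetical. A plausible explanation for the effect is that GPT-2 style BPE tokenizers attach a space to the start of the following token, so a prompt that ends in a space leaves the model with an unusual standalone whitespace token.

```python
def strip_prompt(prompt: str) -> str:
    """Remove trailing whitespace so generation starts right after the last real token."""
    return prompt.rstrip()


# Hypothetical prompt that ends with a trailing space.
raw_prompt = "def add(a, b):\n    return "
clean_prompt = strip_prompt(raw_prompt)

print(repr(raw_prompt))
print(repr(clean_prompt))
```

The stripped prompt would then be passed to the tokenizer in place of the raw one. Because rstrip() also removes trailing newlines and indentation, it is worth checking that prompts which rely on them still behave as intended.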

Dataset quality bugs

I have found several description errors and answer errors in your English data:

prompt: reverse the list that contains 1 to 10
- answer: list(reversed(list(range(10))))
- true answer: should use range(1, 11) instead of range(10)

prompt: print a list l and move first 3 elements to the end of the list
- answer: l[3:] + l[:3]
- true answer: print(l); return l[3:] + l[:3]
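
The two reports above can be checked directly. This is a small sketch of the difference between the dataset answers and the proposed ones; the function names rotate_dataset and rotate_proposed are illustrative only.

```python
# Prompt 1: "reverse the list that contains 1 to 10"
dataset_answer = list(reversed(list(range(10))))  # range(10) covers 0..9
proposed_answer = list(reversed(range(1, 11)))    # range(1, 11) covers 1..10

print(dataset_answer)   # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(proposed_answer)  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]


# Prompt 2: "print a list l and move first 3 elements to the end of the list"
def rotate_dataset(l):
    # Dataset answer: rotates the list but never prints it.
    return l[3:] + l[:3]


def rotate_proposed(l):
    # Proposed answer: prints the list first, as the prompt asks.
    print(l)
    return l[3:] + l[:3]


print(rotate_proposed([1, 2, 3, 4, 5]))
```

Both rotation functions return the same value, so a test that only checks the return value cannot tell them apart; only the printed output differs.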

Many other problems still contain bugs (semantic ambiguity or an incorrect answer).
I hope you can revise your dataset carefully: it covers a diverse set of libraries, so it can have a large impact on code generation research as a whole.

Why are the results the same?

This is kind of questionable...
