
This project forked from martin-wey/codeultrafeedback


CodeUltraFeedback for aligning large language models to coding preferences https://arxiv.org/abs/2403.09032

License: MIT License


CodeUltraFeedback

Aligning Large Language Models to Coding Preferences

🤔 About • 🚀 Getting Started • 🧠 Models • 🤗 Datasets • 📝 Citation

Note

[03-13-2024] ๐Ÿ† We are preparing a leaderboard for CODAL-Bench, stay tuned!

[03-13-2024] ๐Ÿ”ฅ We release the first version of CodeUltraFeedback and CODAL-Bench.

Contact: If you have any inquiries or want to raise an issue, please feel free to contact:

About


Overview of CodeUltraFeedback dataset construction (see Section II of our paper for more details).

Given the increasing coding capabilities of large language models (LLMs), the following question emerges:

How well do these capabilities align with the expectations of developers, particularly concerning non-functional requirements such as code readability, efficiency, and adherence to best practices?

We argue that existing benchmarks relying on automated metrics and static analysis tools are too rigid to evaluate the broader capabilities of LLMs. Instead, LLM-as-a-judge offers a more nuanced strategy (a proxy for human evaluation) that accounts for the intricacies of both natural and programming languages.

Our work features two main contributions: CodeUltraFeedback and CODAL-Bench, a dataset and benchmark for aligning LLMs to coding preferences and evaluating their alignment using LLM-as-a-judge.

CodeUltraFeedback is a preference dataset of complex coding instructions for aligning LLMs to coding preferences. Its construction procedure is analogous to that of UltraFeedback, featuring:

  • ✨ Complex instructions: CodeUltraFeedback is based on a 10k subset of MagiCoder Evol-Instruct comprising open-domain complex coding instructions.
  • ✨ Coding preferences: CodeUltraFeedback includes 5 coding preferences, which are crucial for evaluating the broader capabilities of LLMs: instruction-following, code explanation, code complexity and efficiency, code readability, and coding style.
  • ✨ Large pool of LLMs: We use a large pool of 14 LLMs from 8 model families to generate responses to the 10k instructions, covering diverse writing and coding styles.
  • ✨ LLM-as-a-judge and AI feedback: We use GPT-3.5 as a judge to evaluate LLM responses, annotating each response with both numerical and textual feedback. The AI feedback data can be leveraged for various applications, including model alignment through RLAIF, tuning a critic LLM, and more.
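The numerical AI feedback above can be turned into preference pairs for DPO-style alignment by contrasting the best- and worst-rated responses to each instruction. A minimal sketch, assuming illustrative field names (`response`, `rating`) rather than the dataset's actual schema:

```python
# Build a (chosen, rejected) preference pair from judge ratings.
# NOTE: the field names below are illustrative; check the actual
# CodeUltraFeedback schema on the Hugging Face Hub before use.

def build_preference_pair(responses):
    """Given a list of {'response': str, 'rating': float} dicts
    (one per LLM in the pool), return the highest-rated response
    as 'chosen' and the lowest-rated as 'rejected'."""
    ranked = sorted(responses, key=lambda r: r["rating"])
    return {
        "chosen": ranked[-1]["response"],
        "rejected": ranked[0]["response"],
    }

# Example: three model responses to the same instruction.
pair = build_preference_pair([
    {"response": "def add(a, b): return a + b", "rating": 8.5},
    {"response": "def add(a,b):return a+b", "rating": 6.0},
    {"response": "lambda a, b: a + b", "rating": 4.0},
])
```

Pairs in this shape are the standard input format for DPO training loops such as TRL's `DPOTrainer`.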

CODAL-Bench is a benchmark of 500 coding problems (100 per coding preference). We use LLM-as-a-judge with reference-guided single-answer grading using GPT-3.5 or GPT-4 to evaluate LLM alignment. The approach enables the judge LLM to provide consistent ratings and evaluate each LLM individually (similar to MT-Bench).
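Reference-guided single-answer grading means the judge LLM sees the instruction, a reference answer, and one candidate response, then returns a single rating. The prompt template below is an illustrative sketch of that setup, not the exact prompt used in the paper:

```python
# Illustrative reference-guided single-answer grading prompt.
# The exact wording used for CODAL-Bench may differ; see the paper
# and repository for the real prompts.

JUDGE_TEMPLATE = """\
[Instruction]
{instruction}

[Reference Answer]
{reference}

[Assistant's Answer]
{answer}

Rate the assistant's answer for the preference "{preference}"
on a scale of 1-10, using the reference answer as a guide.
Reply with the rating only."""

def build_judge_prompt(instruction, reference, answer, preference):
    """Fill the grading template for one (instruction, response) pair."""
    return JUDGE_TEMPLATE.format(
        instruction=instruction,
        reference=reference,
        answer=answer,
        preference=preference,
    )

prompt = build_judge_prompt(
    "Reverse a list in Python.",
    "Use list slicing: items[::-1].",
    "reversed_items = items[::-1]",
    "code readability",
)
```

The resulting string would be sent to the judge model (GPT-3.5 or GPT-4) as a single chat message; grading each response in isolation is what keeps ratings consistent across models.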

🚀 Getting Started

We provide all the source code used to build CodeUltraFeedback and to evaluate LLMs on CODAL-Bench.

Important

We are currently working on instructions to:

  1. Build CodeUltraFeedback or extend the dataset
  2. Tune your own SFT and DPO LLMs
  3. Evaluate LLMs on CODAL-Bench

Models

| Model | Checkpoint | Size | CODAL-Bench GPT-3.5 (G-3.5 / G-4) | CODAL-Bench GPT-4 (G-4) | HumanEval+ (k=1 / k=10) | License |
|-------|------------|------|-----------------------------------|-------------------------|-------------------------|---------|
| CodeLlama-7B-Instruct | 🤗 HF Link | 7B | 6.00 / 5.46 | 4.72 | 37.9 / 60.4 | Llama2 |
| CodeLlama-7B-Instruct-SFT | 🤗 HF Link | 7B | 6.51 / 5.83 | 5.84 | 51.2 / 82.9 | Llama2 |
| CodeLlama-7B-Instruct-DPO | 🤗 HF Link | 7B | 7.15 / 6.79 | 5.08 | 42.3 / 80.5 | Llama2 |
| CodeLlama-7B-Instruct-SFT+DPO | 🤗 HF Link | 7B | 7.36 / 7.08 | 5.85 | 43.1 / 75.6 | Llama2 |

Datasets and Benchmark

๐Ÿ“ Citation

@misc{weyssow2024codeultrafeedback,
  title={CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences}, 
  author={Martin Weyssow and Aton Kamanda and Houari Sahraoui},
  year={2024},
  eprint={2403.09032},
  archivePrefix={arXiv},
  primaryClass={cs.SE}
}

