
Large Language Models (LLMs) for Code

A collection of LLMs for code.

Continuously updated. :running:

If there are any errors or missing models, please contact us by opening a new issue or by e-mail: [email protected].

Contents

Code LLMs

Large Language Models (LLMs) for Code.

The Code LLMs referred to here are large language models specially trained for code-related tasks. Their training corpus may contain not only code but also natural language.

The Code LLMs listed here do not include general-purpose large language models, even though general-purpose LLMs are also able to complete code-related tasks.

Timeline of Code LLMs

[Image: timeline of Code LLMs]

Parameters of Code LLMs

[Image: parameters of Code LLMs]

Models

  • Codex [OpenAI] [2021.07] [Close]

    πŸ“ƒEvaluating Large Language Models Trained on Code

    🏴introduction

    Model Architecture: Decoder Only, GPT Family
    Params: 12B
    Training Data: Collected [Code: 159GB]
    Training Time: -
    Languages: Python [Multilingual]
    Evaluation: HumanEval, APPS
    Supported Tasks: Code Generation, Docstring Generation
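
The Codex paper also introduced HumanEval and the unbiased pass@k metric that most later entries in this list report. A minimal sketch of that estimator (the formula is from the paper; the function name is ours):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from the Codex paper: 1 - C(n-c, k) / C(n, k),
    with n samples per problem and c of them correct.
    Computed as a running product to avoid huge binomial coefficients."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples, 13 correct: pass@1 reduces to 13/200
print(pass_at_k(200, 13, 1))  # 0.065
```
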
  • Tabnine [Close]

    πŸ”—AI assistant for software developers

    🏴introduction

    Model Architecture: LLM
    Params: -
    Training Data: -
    Training Time: -
    Languages: -
    Evaluation: -
    Supported Tasks: Whole line completions, Full-function completions, Natural language to code completions
  • AlphaCode [DeepMind] [2022.03] [Close]

    πŸ“ƒCompetition-Level Code Generation with AlphaCode

    🏴introduction

    Model Architecture: Encoder-Decoder
    Params: 41B
    Training Data: Collected [Code: 715.1GB]
    Training Time: -
    Languages: 12 langs
    Evaluation: HumanEval, APPS, CodeContests
    Supported Tasks: Competition-Level Code Generation
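
AlphaCode's recipe pairs large-scale sampling with filtering on each problem's public example tests, followed by behavioural clustering of the survivors. Below is a toy sketch of the filtering step only; the `run` executor is a hypothetical sandbox, not code from the paper.

```python
from typing import Callable, List, Tuple

def filter_on_example_tests(
    programs: List[str],
    example_tests: List[Tuple[str, str]],  # (stdin, expected stdout) pairs
    run: Callable[[str, str], str],        # hypothetical sandboxed executor
) -> List[str]:
    """Discard sampled programs that fail any public example test.
    AlphaCode then clusters the survivors by behaviour on generated
    inputs and submits one representative per cluster (omitted here)."""
    return [
        prog for prog in programs
        if all(run(prog, stdin) == expected for stdin, expected in example_tests)
    ]
```
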
  • PaLM-Coder [Google] [2022.04] [Close]

    πŸ“ƒPaLM: Scaling Language Modeling with Pathways

    🏴introduction

    Model Architecture: Decoder Only
    Params: 8B, 62B, 540B
    Training Data: Collected [Text: 741B tokens, Code: 39GB (780B tokens trained)]
    Training Time: 6144 TPU v4 chips
    Languages: Multiple
    Evaluation: HumanEval, MBPP, TransCoder, DeepFix
    Supported Tasks: Code Generation, Code Translation, Code Repair
  • PolyCoder [CMU] [2022.02] [Open]

    πŸ“ƒA Systematic Evaluation of Large Language Models of Code

    🏴introduction

    Model Architecture: Decoder Only, GPT Family
    Params: 2.7B
    Training Data: Collected [Code: 253.6GB]
    Training Time: - 
    Languages: 12 langs
    Evaluation: HumanEval
    Supported Tasks: Code Generation
  • GPT-Neo [EleutherAI] [2021.03] [Open]

    πŸ“ƒGPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow

    🏴introduction

    Model Architecture: Decoder Only, GPT Family
    Params: 1.3B, 2.7B
    Training Data: The Pile [Text: 730GB, Code: 96GB (400B tokens trained)]
    Training Time: -
    Languages: Multiple
    Evaluation: HumanEval
    Supported Tasks: Code Generation
  • GPT-NeoX [EleutherAI] [2022.04] [Open]

    πŸ“ƒGPT-NeoX-20B: An Open-Source Autoregressive Language Model

    🏴introduction

    Model Architecture: Decoder Only, GPT Family
    Params: 20B
    Training Data: The Pile [Text: 730GB, Code: 95GB (473B tokens trained)]
    Training Time: -
    Languages: Multiple
    Evaluation: HumanEval
    Supported Tasks: Code Generation
  • GPT-J [EleutherAI] [2021.06] [Open]

    πŸ”—GPT-J-6B: 6B JAX-Based Transformer

    🏴introduction

    Model Architecture: Decoder Only, GPT Family
    Params: 6B
    Training Data: The Pile [Text: 730GB, Code: 96GB (473B tokens trained)]
    Training Time: -
    Languages: Multiple
    Evaluation: HumanEval
    Supported Tasks: Code Generation
  • InCoder [Meta] [2022.04] [Open]

    πŸ“ƒInCoder: A Generative Model for Code Infilling and Synthesis

    🏴introduction

    Model Architecture: Decoder Only
    Params: 1.3B, 6.7B
    Training Data: Collected [Code: 159GB, StackOverflow: 57GB (60B tokens trained)]
    Training Time: -
    Languages: 28 langs
    Evaluation: HumanEval, MBPP, CodeXGLUE
    Supported Tasks: Infilling Lines of Code (HumanEval), Docstring Generation (CodeXGLUE), Return Type Prediction, Variable Name Prediction
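
InCoder is trained with a causal-masking objective, so it can infill a span by replacing it with a sentinel token and generating the missing text at the end of the sequence. A rough sketch against the public facebook/incoder-1B checkpoint; the sentinel format follows the authors' released example code, so treat the exact prompt layout as an assumption.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/incoder-1B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Mark the span to infill with a sentinel, then repeat the sentinel at the
# end so the model generates the missing span there (until <|endofmask|>).
left = "def count_words(path):\n    "
right = "\n    return len(words)\n"
prompt = left + "<|mask:0|>" + right + "<|mask:0|>"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=48, do_sample=True, temperature=0.2)
print(tokenizer.decode(out[0]))  # infilled span appears after the final sentinel
```
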
  • CodeGen [Salesforce] [2022.03] [Open] 🌟popular

    πŸ“ƒCodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    🏴introduction

    Model Architecture: Decoder Only
    Params: 6.1B, 16.1B
    Training Data: The Pile, BigQuery, BigPython [Code: 150B tokens, Text: 355B tokens]
    Training Time: -
    Languages: CodeGen-Multi (6 langs), CodeGen-Mono (Python)
    Evaluation: HumanEval, MTPB
    Supported Tasks: Single-Turn Code Generation, Multi-Turn Code Generation
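
The CodeGen checkpoints are released on Hugging Face in Multi and Mono variants across several sizes. A minimal left-to-right generation sketch, assuming the Salesforce/codegen-350M-mono checkpoint name; the prompt and sampling settings are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-350M-mono"  # Mono = Python-only variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "# return the n-th Fibonacci number\ndef fib(n):"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.2,
    pad_token_id=tokenizer.eos_token_id,  # CodeGen has no dedicated pad token
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
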
  • CodeGeeX [THU] [2022.09] [Open]

    πŸ“ƒCodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X

    🏴introduction

    Model Architecture: Decoder Only, GPT Family
    Params: 13B
    Training Data: The Pile, CodeParrot, Collected [Code: 158B tokens (850B tokens trained)]
    Training Time: 1536 Ascend 910 AI processors (32GB) with MindSpore (v1.7.0), two months
    Languages: 23 langs
    Evaluation: HumanEval-X, HumanEval, MBPP, CodeXGLUE, XLCoST
    Supported Tasks: Multilingual Code Generation, Code Translation
  • aiXcoder [PKU] [Close]

    🔗aiXcoder

    🏴introduction

    Model Architecture: -
    Params: 13B?
    Training Data: -
    Training Time: -
    Languages: Multiple
    Evaluation: -
    Supported Tasks: Code Generation, Code Completion, Code Search
  • PanGu-Coder [Huawei Noah’s Ark Lab] [2022.07] [Close]

    πŸ“ƒPanGu-Coder: Program Synthesis with Function-Level Language Modeling

    🏴introduction

    Model Architecture: PanGu-α architecture, Decoder Only
    Params: 2.6B
    Training Data: Collected (147GB)
    Training Time: -
    Languages: Python
    Evaluation: HumanEval, MBPP
    Supported Tasks: Code Generation
  • ERNIE-Code [Baidu] [2022.12] [Close]

    ⚠️ We don't consider ERNIE-Code a Code LLM in the strict sense.

    πŸ“ƒERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

    🏴introduction

    Model Architecture: Encoder-Decoder, T5-base
    Params: 560M
    Training Data: CodeSearchNet, NL Corpus
    Training Time: -
    Languages: Multiple
    Evaluation: mCoNaLa, Bugs2Fix, Microsoft Docs
    Supported Tasks: Multilingual Code-to-Text, Text-to-Code, Code-to-Code, and Text-to-Text Generation.

Improve Code LLMs

Dataset

Benchmark

Future

Future development
