Comments (8)
I would answer your second question first.
Another question: is CodeGPT pre-trained and fine-tuned on the same code corpus, but in different formats? (raw code for pre-training, token format for fine-tuning?)
The pre-training dataset and fine-tuning dataset are different. CodeGPT is pre-trained on the Python and Java corpora from CodeSearchNet in raw code format, and fine-tuned on PY150 and the GitHub Java Corpus in token format. We'll add more details about CodeGPT to our repo.
For the first question.
But the performance of the fine-tuned model != the pre-trained model, right?
If I want to use the fine-tuned model in a production environment, I have to preprocess the input code before sending it into the model, and postprocess the output before presenting it in the IDE...
Yes. The fine-tuned model is more likely to predict code sequences in token-level format. If you would like to use it in a real code completion scenario, you could use our released pre-trained CodeGPT model, which is suitable for predicting raw code. Alternatively, you can try fine-tuning CodeGPT on more data (e.g. you could download PY150 and the GitHub Java Corpus to obtain the raw code of each file without running our preprocessing script). According to the time-cost analysis, it won't take much time or computing resources.
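The preprocessing step discussed above could be sketched like this; a minimal illustration using Python's standard `tokenize` module, not the actual CodeXGLUE preprocessing script (which may use different tokenization rules or special markers):

```python
import io
import tokenize

def to_token_format(source: str) -> str:
    """Flatten raw Python source into a single space-separated
    token sequence, dropping indentation and line breaks."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Skip layout-only tokens: newlines, indents, dedents, comments.
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                        tokenize.DEDENT, tokenize.COMMENT,
                        tokenize.ENDMARKER):
            continue
        tokens.append(tok.string)
    return " ".join(tokens)

print(to_token_format("a = 1\nif a > 0:\n    print(a)\n"))
# → a = 1 if a > 0 : print ( a )
```

Going the other way (restoring indentation and line breaks for display in an IDE) is the postprocessing step the question refers to, and it is lossy unless the token format keeps layout markers.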
from codexglue.
Yes, indeed. We do the preprocessing to drop all indentation and line breaks. CodeGPT is a pre-trained model for CodeXGLUE participants to play with. It's trained on a relatively small corpus and can be viewed as a baseline for the CodeXGLUE benchmark.
Thanks for your suggestion.
As in most related work on code completion, many researchers focus on token-level code completion and tokenize raw code into token sequences. We keep this setting in CodeXGLUE. To ensure consistency, we also use this input format for line-level code completion.
If a model works on the initial raw code, it is expected to perform well on the token format after fine-tuning. In fact, our CodeGPT is pre-trained on raw code and fine-tuned on the token format.
I see... thx for the reply. I understand that such a token-level format is easier for evaluation.
But the performance of the fine-tuned model != the pre-trained model, right?
If I want to use the fine-tuned model in a production environment, I have to preprocess the input code before sending it into the model, and postprocess the output before presenting it in the IDE...
Another question: is CodeGPT pre-trained and fine-tuned on the same code corpus, but in different formats? (raw code for pre-training, token format for fine-tuning?)
Thx in advance.
@celbree Thx for these details~
I have tried the pre-trained CodeGPT model and found that all indentation and line breaks are missing. Did I make any mistake when deploying the model?
Or did you preprocess the raw code by removing indentation and line breaks before pre-training the model? If so, is there any reason for this preprocessing?
@celbree Thanks~
From my point of view, it is better to keep indentation and line breaks in the input, as indentation is important to the semantics of code.
a simple example:

```python
a = 1
b = 2
if a > b:
    print("I don't agree")
print("Neither do I")
```

and

```python
a = 1
b = 2
if a > b:
    print("I don't agree")
    print("Neither do I")
```

are different.
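One way to see that the two snippets really mean different things is to parse them: the indentation decides whether the second `print` belongs to the `if` body. A quick check with Python's standard `ast` module (the snippets are hard-coded for illustration):

```python
import ast

# Second print OUTSIDE the if-body: two top-level statements.
outside = "if a > b:\n    print(1)\nprint(2)\n"
# Second print INSIDE the if-body: one top-level statement.
inside = "if a > b:\n    print(1)\n    print(2)\n"

print(len(ast.parse(outside).body))  # → 2 (the if, then the second print)
print(len(ast.parse(inside).body))   # → 1 (just the if)
```

A model that never sees indentation cannot distinguish these two programs unless the token format encodes the layout some other way.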
Totally agree. Thanks for your suggestion.