Implementation of the research paper "Evaluating Commit Message Generation: To BLEU Or Not To BLEU?"
Please install the necessary libraries before running our code:
- python==3.6.9
- nltk==3.4.5
- numpy==1.16.5
- scikit-learn==0.22.1
Our data is extracted from the MSR dataset. The data used for our experiments can be found in the "Dataset" folder.
- The "human_annotations.csv" file contains the human annotated scores for 100 pairs of reference and predicted commit messages.
- A subset of the original MCMD dataset has been used for our experiments. The .csv files for pairs of reference and predicted sentences is of the general form "model_MCMD(Number).csv", where "model" could be any of the CMG models listed below and "(Number)" takes one the values 1-5 according to the choice of the programming language (PL).
Number | PL |
---|---|
1 | C++ (C plus plus) |
2 | C# (C sharp) |
3 | Java |
4 | JS (JavaScript) |
5 | Py (Python) |
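
The exact filenames in the "Dataset" folder should always be checked directly; the sketch below only illustrates how the "model" and "(Number)" placeholders map to a concrete file name, assuming the placeholder is replaced directly by the digit and using a hypothetical model name.

```python
# Minimal sketch (hypothetical model name; check the "Dataset" folder for the
# exact spelling and whether the digit keeps the parentheses around it).
PL_NUMBER = {"C++": 1, "C#": 2, "Java": 3, "JS": 4, "Py": 5}

def mcmd_filename(model, pl):
    # Fill in the "model" and "(Number)" parts of model_MCMD(Number).csv
    return "{}_MCMD{}.csv".format(model, PL_NUMBER[pl])

print(mcmd_filename("NNGen", "Java"))  # e.g. "NNGen_MCMD3.csv"
```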
The Commit Message Generation (CMG) models considered in our experiments are:
The Machine Translation (MT) metrics considered in our experiments are: BLEU4, BLEUNorm, BLEUCC, METEOR, METEOR-NEXT, ROUGE-1, ROUGE-2, ROUGE-L, and TER.
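
As an illustration of how one of these metrics can be computed with the NLTK dependency listed above, the sketch below scores a made-up reference/prediction pair with BLEU4; it is not the exact implementation used in our scripts.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Made-up example pair; in our experiments these come from the .csv files.
reference = "fix null pointer exception in the parser"
predicted = "fix npe in parser"

# BLEU4: uniform weights over 1- to 4-grams. Smoothing (one of the factors
# studied in RQ1) avoids zero scores when short commit messages share no
# higher-order n-grams.
score = sentence_bleu(
    [reference.split()],
    predicted.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(score)
```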
- RQ1: Which factors affect commit message quality?
- RQ2: Which metric is best suited to evaluate commit messages?
- RQ3: How do the CMG tools perform on the new metric?
- The potential factors included in our study are Length, Word Alignment, Semantic Scoring, Case Folding, Punctuation Removal, and Smoothing (a small Case Folding and Punctuation Removal sketch is shown below).
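
The following is a minimal sketch of what Case Folding and Punctuation Removal look like when applied to a commit message before scoring; it is a generic illustration, not the exact preprocessing code from the `Effect of {Factor_name}.py` scripts.

```python
import string

def case_fold(text):
    # Case Folding: lowercase so "Fix" and "fix" count as the same token.
    return text.lower()

def remove_punctuation(text):
    # Punctuation Removal: strip punctuation so "parser." matches "parser".
    return text.translate(str.maketrans("", "", string.punctuation))

msg = "Fix NPE in Parser."
print(remove_punctuation(case_fold(msg)))  # -> "fix npe in parser"
```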
- For replication of the RQ1 section in the paper, simply run the `Effect of {Factor_name}.py` file under the "Experimental Results" folder to observe the effect of the {Factor_name} factor on the metrics.
- To run the code on your own human-annotated dataset, replace `human_annotations.csv` with your own human-annotated .csv file in the following code snippet of the `Effect of {Factor_name}.py` file under the "Experimental Results" folder, followed by minor alterations if the number of human annotators changes (here, #annotators = 3).
```python
import csv

with open('human_annotations.csv') as csvfile:
    reader = csv.reader(csvfile)
```
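
If it helps to adapt the snippet, the sketch below shows one way to collect and average per-row annotator scores; the column layout (reference, prediction, then one score per annotator) is an assumption here and should be adjusted to match your .csv file.

```python
import csv

human_scores = []
with open('human_annotations.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # Assumed layout: reference, prediction, then 3 annotator scores.
        scores = [float(s) for s in row[2:5]]
        human_scores.append(sum(scores) / len(scores))  # mean over annotators
```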
- For replication of the RQ2 section in the paper, simply run the `The Log-MNEXT metric.py` file to obtain the metric's correlation with the human evaluation scores.
- To compute the metric's score for any given reference and predicted sentence pair, run the `The Log-MNEXT metric.py` file and then call the function `log_mnext_score([reference], predicted)`. Note that both `reference` and `predicted` are of type string.
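
A minimal usage sketch, assuming `log_mnext_score` is in scope after running `The Log-MNEXT metric.py` (the sentence pair is made up):

```python
# log_mnext_score is defined in "The Log-MNEXT metric.py".
reference = "fix null pointer exception in the parser"
predicted = "fix npe in parser"

# The reference goes inside a list; the prediction is a plain string.
score = log_mnext_score([reference], predicted)
print(score)
```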
- For replication of the RQ3 section in the paper, i.e., observing the performance of the Log-MNEXT metric on a specific model for a particular PL of the MCMD dataset, simply update the `model_MCMD(Number).csv` part by putting the model name in place of `model` and the number 1, 2, 3, 4, or 5 in place of `(Number)` in the code snippet of the `Log-MNEXT performance on the models.py` file under the "Experimental Results" folder.
- For observing the performance of the metric on any other CMG model, replace the `model_MCMD(Number).csv` part of the code snippet below with the required .csv file containing the reference sentences and the predicted sentences generated by your specific model.
```python
import csv

refs = []   # reference commit messages
preds = []  # predicted commit messages
with open('model_MCMD(Number).csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        refs.append(row[0])
        preds.append(row[1])
```
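
Once the pairs are loaded, the per-pair Log-MNEXT scores can be aggregated, for example with a simple mean; this is a sketch that again assumes `log_mnext_score` from `The Log-MNEXT metric.py` is in scope.

```python
# Average Log-MNEXT over all reference/prediction pairs for this model and PL.
scores = [log_mnext_score([ref], pred) for ref, pred in zip(refs, preds)]
print(sum(scores) / len(scores))
```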