
deep-code-search's People

Contributors

billdex, guxd, shubhamugare


deep-code-search's Issues

How to interpret the contents of "phrases" in the use.methname.h5 file

Dear Author,

As I'm new to HDF, I opened the use.methname.h5 file with HDFView and can see that the HDF dataset has two members, "indices" and "phrases".
[screenshot: HDFView showing the two datasets]
For "indices" I can open the contents, but for "phrases" there is an error when opening it, so I can't see its contents.
However, by debugging into the code, I found that "phrases" is a list of numbers representing the method names, e.g. [82, 31, 262, 1212, 176], which stands for ['contains', 'key', 'store', 'registry', 'context'] respectively.

I want to ask how the strings are transformed into the corresponding numbers, and what the relationship between them is.
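In case it is useful, here is a minimal sketch of the kind of word-to-index mapping implied above (the pickle file name and the 'UNK' fallback are assumptions, not the repository's exact code):

# Sketch: a vocabulary pickle maps each word to an integer id; encoding and
# decoding are just dictionary lookups (file name is an assumption).
import pickle

with open('vocab.methname.pkl', 'rb') as f:
    vocab = pickle.load(f)                      # e.g. {'contains': 82, 'key': 31, ...}

inv_vocab = {idx: word for word, idx in vocab.items()}

def encode(words):
    return [vocab.get(w, vocab.get('UNK', 1)) for w in words]

def decode(ids):
    return [inv_vocab.get(i, 'UNK') for i in ids]

# decode([82, 31, 262, 1212, 176]) would give ['contains', 'key', 'store', 'registry', 'context']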

Thanks a lot for your continuous help.

Best Regards,
ttbyffey

Transform raw txt data to h5

[20, [8, [14, [73]], [14, [36]], [4, [28]]], [4, [1516], [660]], [19, [15, [11, [8, [4, [169], [66], [4]]], [4, [4]]]], [15, [11, [8, [4, [4, [6599]], [9, [7, [4]]]]], [4, [160]]]], [15, [11, [8, [4, [1534], [74], [1216]]], [4, [1216], [74]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [1534]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [74]]]], [15, [11, [8, [4, [1516], [196], [909]]], [4, [59]]]]], [12, [13]]]
Each of my data entries is a multi-level nested list, and I need to convert it into h5 format so that it can be used directly with your program, but np.array cannot handle this structure.
def save_hdf5(vecs, filename):
    '''save the processed data into a hdf5 file'''
    f = tables.open_file(filename, 'w')
    filters = tables.Filters(complib='blosc', complevel=5)
    earrays = f.create_earray(f.root, 'phrases', tables.Int16Atom(), shape=(0,), filters=filters)
    indices = f.create_table("/", 'indices', Index, "a table of indices and lengths")
    pos = 0
    line = 1
    for x in vecs:
        print(line)
        earrays.append(numpy.array(x))
        ind = indices.row
        ind['pos'] = pos
        ind['length'] = len(x)
        ind.append()
        pos += len(x)
        line = line + 1
    f.close()
How should I modify this code? Thanks.
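One possible direction (a sketch continuing the loop above, not the repository's code): flatten each nested entry into a flat list of integers before appending, since an Int16Atom EArray can only store a flat sequence. Note that flattening discards the nesting structure, so if the tree shape matters it would have to be encoded separately.

def flatten(x):
    # recursively yield the integers of an arbitrarily nested list
    for item in x:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

flat = list(flatten(x))                              # x is one nested-list record
earrays.append(numpy.array(flat, dtype=numpy.int16))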

use.rawcode.txt file encoding & Model running

Hello, first of all, you did impressive work!
I am using the PyTorch version. I took all of the data from the Google Drive folder, unzipped all the files, and replaced the files in \data\github with them.
I have several questions:

  1. Is the model already trained? (Can I simply run python codesearcher.py --mode search, or should I train the model first?)
  2. Does the file use.codevecs.normalized.h5 that already exists in \data\github match the full data files from Google Drive?
  3. While running python codesearcher.py --mode search I receive the following error (when using the use.rawcode.txt from Google Drive):
    [error screenshot]
    Is the use.rawcode.txt file from the Google Drive folder not encoded in UTF-8?
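A small diagnostic sketch for question 3 (an assumption, not the maintainer's answer): reading the file with errors='replace' makes undecodable bytes visible as U+FFFD, which helps locate any non-UTF-8 content.

# Sketch: find the first line of use.rawcode.txt containing bytes that do not decode as UTF-8.
with open('use.rawcode.txt', encoding='utf-8', errors='replace') as f:
    for i, line in enumerate(f):
        if '\ufffd' in line:                # U+FFFD marks an undecodable byte
            print('non-UTF-8 bytes around line', i)
            break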

The results are not good

dcs->epochStep=260000
top 10 ACC=0.767, MRR=0.32433587301587297, MAP=0.32433587301587297, nDCG=0.42961201846689157
top 5 ACC=0.6995, MRR=0.44343166666666667, MAP=0.44343166666666667, nDCG=0.5078651066901124
top 1 ACC=0.4761, MRR=0.4761, MAP=0.4761, nDCG=0.4761
The dataset is the Java data provided by CodeSearchNet. This is the best result my training reached, with poolsize set to 1000; I cannot reach the 0.9+ you mentioned before.
I split the dataset into train and valid parts as you described; the valid part seems to play the same role as a test set.
The results of running search are very poor. How can I fix this so that the search results become noticeably better?
When I previously ran the epoch-500 checkpoint you provided on the large codebase, the results there were also not very relevant. I couldn't find the cause then, and now with my own data the results are the same. Looking forward to your reply.

Dataset

Hi guys, awesome project. Would you mind releasing the original training and testing dataset, without any pickled or preprocessed files?

One question about pre-processing for input query string

Hi, Xiaodong
Do you do some pre-processing for the input query string?
I used input strings from the benchmark queries in Table 1 of your paper. I found that if I apply some suitable processing to the input string, the search step returns more related results, but it is not easy to do this kind of pre-processing.

For example:
Input Query: "how do I invoke a java method when given the method name as a string",
-------this input query return no much related result in 5 return search results.

but as input queries as following:
Input Query: "invoke a java method when given the method name as a string"
-------this input query return 1 related result in 5 return search results.
Input Query: "invoke a java method when given the method name"
-------this input query return 2 related result in 5 return search results.
------- maybe 2 best from my view.

Input Query: "invoke method when given the method name"
-------this input query return 2 related result in 5 return search results.
Input Query: "invoke method given method name"
-------this input query return 2 related result in 5 return search results.

From my understanding of your paper and code, after the training and repo steps, the search step simply compares the input query vector's distance with the vectors from the repo. Because the repo vectors are fixed, the input query becomes the key element for getting better results. If the input query includes function words such as "how", "do", "in", "as", "a", "string", they strongly affect the result and hide more relevant results driven by content words such as "invoke". I don't know if you have a good way to handle this kind of pre-processing of the input query.
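A possible pre-processing sketch (illustrative only; the stopword list and function name are assumptions, not part of the released code): strip common function words from the query before encoding it.

# Sketch: remove common English stopwords from a query before it is vectorized.
STOPWORDS = {'how', 'do', 'i', 'a', 'an', 'the', 'in', 'as', 'to', 'when', 'is'}

def preprocess_query(query):
    words = query.lower().split()
    return ' '.join(w for w in words if w not in STOPWORDS)

# preprocess_query('how do I invoke a java method when given the method name as a string')
# -> 'invoke java method given method name string'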

BR

load pretrained model issue

Hi, in the codesearcher.py source file, at line 274, when I modify the config file to load a specific epoch, there is an error because the model parameter is missing when invoking the load_model function. Would you please help fix this issue? Thanks in advance.

InvalidArgumentError: Shape [-1,30] has negative dimensions [[Node: desc = Placeholder[dtype=DT_INT32, shape=[?,30], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Why is this error reported during training? Why does it say that no data was fed for the desc placeholder?
The full error is as follows:
2019-09-09 09:18:56.138512: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\framework\op_kernel.cc:1148] Invalid argument: Shape [-1,30] has negative dimensions
2019-09-09 09:18:56.138757: E c:\tf_jenkins\home\workspace\release-win\m\windows\py\35\tensorflow\core\common_runtime\executor.cc:644] Executor failed to create kernel. Invalid argument: Shape [-1,30] has negative dimensions
[[Node: desc = Placeholder[dtype=DT_INT32, shape=[?,30], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
return fn(*args)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
status, run_metadata)
File "D:\anaconda\installs\envs\tensorflow\lib\contextlib.py", line 66, in exit
next(self.gen)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape [-1,30] has negative dimensions
[[Node: desc = Placeholder[dtype=DT_INT32, shape=[?,30], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:/pycharm/workplaces/prac/deep-code-search-master/keras/main.py", line 457, in
codesearcher.train(model)
File "D:/pycharm/workplaces/prac/deep-code-search-master/keras/main.py", line 228, in train
hist = model.fit([chunk_padded_methnames,chunk_padded_apiseqs,chunk_padded_tokens,chunk_padded_good_descs,chunk_padded_bad_descs], epochs=10, batch_size=batch_size, validation_split=split)
File "D:\pycharm\workplaces\prac\deep-code-search-master\keras\models.py", line 227, in fit
return self._training_model.fit(x, y, **kwargs)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1485, in fit
initial_epoch=initial_epoch)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1140, in _fit_loop
outs = f(ins_batch)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 2075, in call
feed_dict=feed_dict)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape [-1,30] has negative dimensions
[[Node: desc = Placeholder[dtype=DT_INT32, shape=[?,30], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op 'desc', defined at:
File "D:/pycharm/workplaces/prac/deep-code-search-master/keras/main.py", line 452, in
model.build()
File "D:\pycharm\workplaces\prac\deep-code-search-master\keras\models.py", line 144, in build
desc = Input(shape=(self.data_params['desc_len'],), dtype='int32', name='desc')
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\keras\engine\topology.py", line 1375, in Input
input_tensor=tensor)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\keras\engine\topology.py", line 1286, in init
name=self.name)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 349, in placeholder
x = tf.placeholder(dtype, shape=shape, name=name)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\ops\array_ops.py", line 1530, in placeholder
return gen_array_ops._placeholder(dtype=dtype, shape=shape, name=name)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 1954, in _placeholder
name=name)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "D:\anaconda\installs\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in init
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Shape [-1,30] has negative dimensions
[[Node: desc = Placeholder[dtype=DT_INT32, shape=[?,30], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Process finished with exit code 1

validation results while training the pytorch model do not match paper's results

Hi, I have trained the PyTorch model with the given Google Drive preprocessed dataset. Below are the validation logs taken every 30 epochs during training.

batch size is 128

epoch: 30
ACC=0.3093, MRR=0.3093, MAP=0.3093, nDCG=0.3093
epoch: 60
ACC=0.3685, MRR=0.3685, MAP=0.3685, nDCG=0.3685
epoch: 90
ACC=0.3808, MRR=0.3808, MAP=0.3808, nDCG=0.3808
epoch: 120
ACC=0.415, MRR=0.415, MAP=0.415, nDCG=0.415
epoch: 150
ACC=0.4223, MRR=0.4223, MAP=0.4223, nDCG=0.4223
epoch: 180
ACC=0.4306, MRR=0.4306, MAP=0.4306, nDCG=0.4306
epoch: 210
ACC=0.4516, MRR=0.4516, MAP=0.4516, nDCG=0.4516
epoch: 240
ACC=0.4597, MRR=0.4597, MAP=0.4597, nDCG=0.4597
epoch: 270
ACC=0.4647, MRR=0.4647, MAP=0.4647, nDCG=0.4647
epoch: 300
ACC=0.4686, MRR=0.4686, MAP=0.4686, nDCG=0.4686
epoch: 330
ACC=0.4718, MRR=0.4718, MAP=0.4718, nDCG=0.4718
epoch: 360
ACC=0.4704, MRR=0.4704, MAP=0.4704, nDCG=0.4704
epoch: 390
ACC=0.4643, MRR=0.4643, MAP=0.4643, nDCG=0.4643
epoch: 420
ACC=0.4804, MRR=0.4804, MAP=0.4804, nDCG=0.4804
epoch: 450
ACC=0.4741, MRR=0.4741, MAP=0.4741, nDCG=0.4741

I got a maximum Mean Reciprocal Rank of only 0.48, but in the Deep Code Search paper the MRR is 0.6 with the same dataset after 500 epochs (when trained with the Keras model). Could you help me understand what went wrong?

Search on pytorch AssertionError: inconsistent number of chunks, check whether the specified files for codebase and code vectors are correct!

In the repr_code.py:

parser.add_argument('--chunk_size', type=int, default=2000000, help='split code vector into chunks and store them individually. '\

In the search.py:

parser.add_argument('--chunk_size', type=int, default=2000000, help='codebase and code vector are stored in many chunks. '\

The values are the same.
When I ran "python search.py --model JointEmbeder --reload_from 1420000", it failed. (I used 1420000 because in /output/JointEmbeder/github/models I have 142 files, from "epo10000.h5" to "epo1420000.h5".)
What's wrong? Thank you!

About converting text to numbers

In your Keras version's config, the vocabulary size is set to 10000, because in the pkl file you provided dict['<s>']=0, dict['</s>']=0, dict['UNK']=1.
In fact <s> and </s> share the same index, so the length is 10000+1.
Keras version:
In the convert function, return [vocab.get(w, 0) for w in words]: when converting text to numbers, you set the default value for unknown words to 0, but in the pkl file UNK is 1.
Moreover, your return pad_sequences(data, maxlen=len, padding='post', truncating='post', value=0) pads with 0, and in the pkl file 0 is <s>/</s>; does that match the meaning of padding? Is there a problem here? I have only recently started with NLP, so I am not sure whether my understanding is correct.

In my understanding, the padding value should be the pad token, as in your PyTorch version's data where pad=0, <s>=1, </s>=2, <unk>=3; there pad is 0, which matches the JSON data.

Regarding the Keras version's data, when I build the text-to-number data myself (text -> number mapping -> pkl) with
dict['<s>']=0, dict['</s>']=0, dict['UNK']=1, is it reasonable to change the functions below like this?
(1) return [vocab.get(w, 1) for w in words], changing the default from vocab.get(w, 0) to vocab.get(w, 1)
(2) return pad_sequences(data, maxlen=len, padding='post', truncating='post', value=0), treating the padding value as <s>/</s>?
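If it helps, here is a sketch of the two functions with the special ids pulled out as named constants (the values are assumptions that depend on how the vocabulary pickle was built, e.g. UNK_ID=1 and PAD_ID=0 in the setup described above):

# Sketch: keep the unknown-word id and padding id as explicit constants so they
# can be matched to whatever the vocabulary pickle actually uses.
from keras.preprocessing.sequence import pad_sequences

UNK_ID = 1   # index of 'UNK' in the vocabulary pickle (assumption)
PAD_ID = 0   # index reserved for padding (assumption)

def convert(vocab, words):
    return [vocab.get(w, UNK_ID) for w in words]

def pad(data, maxlen):
    return pad_sequences(data, maxlen=maxlen, padding='post', truncating='post', value=PAD_ID)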

how to pre-process a python code snippet instead of java??

I would like to build a search engine for Python code snippets. For that, I want to prepare a dataset like the one you prepared for Java snippets; as far as I can tell, the rest of the pipeline stays the same. Can anyone suggest how to preprocess Python code snippets?

NameError: name 'F' is not defined

Hi, @guxd

I just cloned the repo and ran it on the data available in the repo, and I ran into this error:

larumuga@scspc743:~/workspace/deep-code-search/pytorch$ python codesearcher.py --mode train
Build Model
/home/larumuga/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
loading data...
10000 entries
/home/larumuga/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py:995: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
  warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
epo:[0/2000] itr:100 Loss=0.05644
epo:[0/2000] itr:200 Loss=0.05123
epo:[0/2000] itr:300 Loss=0.04307
epo:[0/2000] itr:400 Loss=0.03867
epo:[0/2000] itr:500 Loss=0.03428
epo:[0/2000] itr:600 Loss=0.03314
epo:[0/2000] itr:700 Loss=0.03253
epo:[0/2000] itr:800 Loss=0.03131
epo:[1/2000] itr:100 Loss=0.02819
epo:[1/2000] itr:200 Loss=0.02607
epo:[1/2000] itr:300 Loss=0.02635
epo:[1/2000] itr:400 Loss=0.02651
epo:[1/2000] itr:500 Loss=0.02542
epo:[1/2000] itr:600 Loss=0.02513
epo:[1/2000] itr:700 Loss=0.02473
epo:[1/2000] itr:800 Loss=0.02718
epo:[2/2000] itr:100 Loss=0.02327
epo:[2/2000] itr:200 Loss=0.01982
epo:[2/2000] itr:300 Loss=0.02143
epo:[2/2000] itr:400 Loss=0.02162
epo:[2/2000] itr:500 Loss=0.02315
epo:[2/2000] itr:600 Loss=0.02130
epo:[2/2000] itr:700 Loss=0.01996
epo:[2/2000] itr:800 Loss=0.02273
epo:[3/2000] itr:100 Loss=0.01937
epo:[3/2000] itr:200 Loss=0.02110
epo:[3/2000] itr:300 Loss=0.01906
epo:[3/2000] itr:400 Loss=0.01828
epo:[3/2000] itr:500 Loss=0.01787
epo:[3/2000] itr:600 Loss=0.01970
epo:[3/2000] itr:700 Loss=0.01939
epo:[3/2000] itr:800 Loss=0.02107
epo:[4/2000] itr:100 Loss=0.01615
epo:[4/2000] itr:200 Loss=0.01669
epo:[4/2000] itr:300 Loss=0.01728
epo:[4/2000] itr:400 Loss=0.01682
epo:[4/2000] itr:500 Loss=0.01564
epo:[4/2000] itr:600 Loss=0.01815
epo:[4/2000] itr:700 Loss=0.01740
epo:[4/2000] itr:800 Loss=0.01807
epo:[5/2000] itr:100 Loss=0.01421
epo:[5/2000] itr:200 Loss=0.01602
epo:[5/2000] itr:300 Loss=0.01493
epo:[5/2000] itr:400 Loss=0.01675
epo:[5/2000] itr:500 Loss=0.01464
epo:[5/2000] itr:600 Loss=0.01737
epo:[5/2000] itr:700 Loss=0.01627
epo:[5/2000] itr:800 Loss=0.01482
loading data...
10000 entries
Traceback (most recent call last):
  File "codesearcher.py", line 290, in <module>
    searcher.train(model)
  File "codesearcher.py", line 116, in train
    acc1, mrr = self.eval(model,1000,1)              
  File "codesearcher.py", line 193, in eval
    sims = F.cosine_similarity(code_repr, desc_repr).data.cpu().numpy()
NameError: name 'F' is not defined
Closing remaining open files:./data/github/test.desc.h5...done./data/github/test.tokens.h5...done./data/github/test.methname.h5...done./data/github/train.apiseq.h5...done./data/github/test.apiseq.h5...done./data/github/train.tokens.h5...done./data/github/train.methname.h5...done./data/github/train.desc.h5...done
larumuga@scspc743:~/workspace/deep-code-search/pytorch$ 
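A likely fix (a sketch based on the traceback: F.cosine_similarity belongs to torch.nn.functional, so the alias just needs to be imported at the top of codesearcher.py):

# codesearcher.py (top of file): make the functional API available as F,
# which is what the eval() call sims = F.cosine_similarity(...) expects.
import torch
import torch.nn.functional as F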

Training Data question

Dear Colleague,

For the training data, I want to ask where you get the descriptions for each code snippet collected from GitHub. Are the descriptions extracted from the GitHub code comments?
If so, how do you extract the simplified description from the comments?

Questions on Precision@k with "relevant results"

I am wondering about the "relevant results": does that mean that for every Stack Overflow question, when we label the data, we need to collect several relevant results? Otherwise, how can the returned k results be marked as relevant or not? (Previously, I only chose the single most relevant result for every question, so I am quite confused by the Precision@k metric.) Thanks a lot.
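For reference, here is a sketch of Precision@k as it is usually defined (illustrative, not the authors' evaluation script): the fraction of the top-k returned results that are labelled relevant, which is why each query needs relevance labels for all k returned results rather than a single best answer.

def precision_at_k(labels, k):
    # labels: 0/1 relevance judgments of the returned results, in ranked order
    return sum(labels[:k]) / float(k)

# precision_at_k([1, 0, 1, 0, 0], k=5) -> 0.4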

MRR Calculations

How was the MRR calculated on the 50 examples in results.xlsx? Do you perhaps have a script for the calculation? What file was used as the search base for the analysis of those 50 examples?
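In the absence of the original script, here is a sketch of how MRR is commonly computed over a set of labelled queries (illustrative only, not the authors' code):

def mean_reciprocal_rank(results_per_query):
    # results_per_query: one list of 0/1 relevance labels per query, in ranked order
    total = 0.0
    for labels in results_per_query:
        rr = 0.0
        for rank, rel in enumerate(labels, start=1):
            if rel:                      # reciprocal rank of the first relevant hit
                rr = 1.0 / rank
                break
        total += rr
    return total / len(results_per_query)

# mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]) -> (0.5 + 1.0) / 2 = 0.75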

One question about training time with tensorflow keras mode.

Hi, Guxd
I tried training the model with TensorFlow Keras.
I used the default settings from the GitHub repo without any change, on one Nvidia GTX 1080 or on one AWS K80.
For the default data (the small dataset with 10000 items), I find it takes about 45 minutes to finish one epoch. If I need to do 500 epochs, the whole training needs about 3350 hours, which I compared against your paper's statement of training on a K40 for 50 hours.
But if I use the larger training data from Google Drive (about 1800 times larger than the default data), I find it is impossible to finish the training on one K40 within 50 hours.

My question is whether your trained model on Google Drive is based on the larger data, or whether I need to change other settings in configs.py or modify the code of model.py and main.py so that I can finish training on the larger data within 50 hours.

BR

About the .h5 data

I have prepared text data in txt format, where each line is a method name or a sequence of tokens. How do I convert this data into .h5 format?
I am not very familiar with the tables (PyTables) API.
The data seems to be in this format:
/phrases (EArray(102966,), shuffle, blosc(5)) ''
/indices (Table(10000,)) 'a table of indices and lengths'
The data is then retrieved via the pos and length fields in the indices table.
Do you have the code for converting txt into .h5?
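For reference, a sketch of writing one-sentence-per-line text into the phrases/indices layout shown above (an illustration assuming a word-to-index vocabulary dict; the column types and the unknown-word id are assumptions):

import numpy as np
import tables

class Index(tables.IsDescription):
    pos = tables.UInt32Col()       # start offset of a sentence within /phrases
    length = tables.UInt32Col()    # number of tokens in the sentence

def txt_to_h5(txt_path, h5_path, vocab, unk_id=3):
    with tables.open_file(h5_path, 'w') as f:
        filters = tables.Filters(complib='blosc', complevel=5)
        phrases = f.create_earray(f.root, 'phrases', tables.Int32Atom(),
                                  shape=(0,), filters=filters)
        indices = f.create_table('/', 'indices', Index,
                                 'a table of indices and lengths')
        pos = 0
        for line in open(txt_path, encoding='utf-8'):
            ids = [vocab.get(w, unk_id) for w in line.strip().split()]
            phrases.append(np.array(ids, dtype=np.int32))
            row = indices.row
            row['pos'] = pos
            row['length'] = len(ids)
            row.append()
            pos += len(ids)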

about feature number

I am training your model on my own dataset.
(1) With tokens, apiseq, and methname -> desc, the best eval results so far are acc=0.672, mrr=0.413, map=0.413, ndcg=0.475.
(2) With only tokens -> desc, the best eval results so far are acc=0.63, mrr=0.4099, map=0.4099, ndcg=0.462.
Adding the methname and apiseq features does not improve the eval results noticeably, and I don't quite understand why.

I previously trained the model once on your full dataset for 2000 epochs, but the search results did not feel very relevant.

add "how to" before query, the searched results would be different

Hi Author,
I find that the results for the query "split a string in java" are totally different from those for "how to split a string in java"; do you know why this happens? It seems the results change a lot if we add "how to" before the query, and this also happens for other questions, such as "compare strings in java".

"split a string in java"


('public String [ ] split ( CharSequence input ) { return split ( input , 0 ) ; } \r\n', 0.38729167)

('public static String [ ] split ( String str , char escapeChar , char separator ) { if ( str == null ) { return null ; } ArrayList < String > strList = new ArrayList < String > ( ) ; StringBuilder split = new StringBuilder ( ) ; int index = 0 ;while ( ( index = findNext ( str , separator , escapeChar , index , split ) ) >= 0 ) { ++ index ; strList . add ( split . toString ( ) ) ; split . setLength ( 0 ) ; } strList . add ( split . toString ( ) ) ; int last = strList . size ( ) ; while( -- last >= 0 && "" . equals ( strList . get ( last ) ) ) { strList . remove ( last ) ; } return strList . toArray ( new String [ strList . size ( ) ] ) ; } \r\n', 0.38420376)

('public String [ ] split ( CharSequence input , int limit ) { int index = 0 ; boolean matchLimited = limit > 0 ; ArrayList < String > matchList = new ArrayList < > ( ) ; Matcher m = matcher ( input ) ; while ( m . find ( ) ) { if ( ! matchLimited || matchList . size ( ) < limit - 1 ) { String match = input . subSequence ( index , m . start ( ) ) . toString ( ) ; matchList . add ( match ) ; index = m . end ( ) ; } else if ( matchList . size ( ) == limit - 1 ) { String match = input . subSequence ( index , input . length ( ) ) . toString ( ) ; matchList . add ( match ) ; index = m . end ( ) ; } } if ( index == 0 ) return new String [ ] { input . toString ( ) } ; if ( ! matchLimited || matchList . size ( ) < limit ) matchList . add ( input . subSequence ( index , input . length ( ) ) . toString ( ) ) ; int resultSize = matchList . size ( ) ; if ( limit == 0 ) while ( resultSize > 0 && matchList . get ( resultSize - 1 ) . equals ( "" ) ) resultSize -- ; String [ ] result = new String [ resultSize ] ; return matchList . subList ( 0 , resultSize ) . toArray ( result ) ; } \r\n', 0.37135035)

('public static String [ ] splitKerberosName ( String fullName ) { return fullName . split ( "[/@]" ) ; } \r\n', 0.35848948)

'how to split a string in java '

('public int numberOfWords ( ) { return words . countTokens ( ) ; } \r\n', 0.35190165)

("public static String extractNetworkPortionAlt ( String phoneNumber ) { if ( phoneNumber == null ) { return null ; } int len = phoneNumber . length ( ) ; StringBuilder ret = new StringBuilder ( len ) ; boolean haveSeenPlus = false ; for ( int i= 0 ; i < len ; i ++ ) { char c = phoneNumber . charAt ( i ) ; if ( c == '+' ) { if ( haveSeenPlus ) { continue ; } haveSeenPlus = true ; } if ( isDialable ( c ) ) { ret . append ( c ) ; } else if ( isStartsPostDial ( c ) ) { break ; } } return ret . toString ( ) ; } \r\n", 0.33133727)

('public static String extractBinaryString ( final byte [ ] buffer , final int start , final int end ) { final StringBuilder r = new StringBuilder ( end - start ) ; for ( int i = start ; i < end ; i ++ ) r . append ( ( char ) ( buffer [ i ] & 0xff ) ) ; return r . toString ( ) ; } \r\n', 0.32867655)

("public static String extractNetworkPortion ( String phoneNumber ) { if ( phoneNumber == null ) { return null ; } int len= phoneNumber . length ( ) ; StringBuilder ret = new StringBuilder ( len ) ; for ( int i = 0 ; i < len ; i ++ ) { char c =phoneNumber . charAt ( i ) ; int digit = Character . digit ( c , 10 ) ; if ( digit != - 1 ) { ret . append ( digit ) ; } else if ( c == '+' ) { String prefix = ret . toString ( ) ; if ( prefix . length ( ) == 0 || prefix . equals ( CLIR_ON ) || prefix . equals ( CLIR_OFF ) ) { ret . append ( c ) ; } } else if ( isDialable ( c ) ) { ret . append ( c ) ; } else if ( isStartsPostDial ( c ) ) { break ; } } return ret . toString ( ) ; } \r\n", 0.32114816)

("private List calculateSuffixes ( Locale locale ) { List suffixes = new ArrayList ( 3 ) ; String language = locale . getLanguage ( ) ; String country = locale . getCountry ( ) ; String variant = locale . getVariant ( ) ; StringBuffer suffix = new StringBuffer ( ) ; suffix . append ( '_' ) ; suffix . append ( language ) ; if ( language . length ( ) > 0 ) { suffixes . add ( suffix . toString ( ) ) ; } suffix . append ( '_' ) ; suffix . append ( country ) ; if ( country . length ( ) > 0 ){ suffixes . add ( suffix . toString ( ) ) ; } suffix . append ( '_' ) ; suffix . append ( variant ) ; if ( variant . length ( ) > 0 ) { suffixes . add ( suffix . toString ( ) ) ; } return suffixes ; } \r\n", 0.29929727)

Sincerely hope you can answer.
Thanks very much

How can I get the evaluation benchmark ?

Hi @guxd

It is mentioned that two developers manually labelled relevant/non-relevant results for the top 50 Stack Overflow questions used as the benchmark. Could I get the manually labelled top-10 results for each query?

Thanks,
lxd

Keras training

Hello,

While training the Keras module, I get the following issue:

Epoch 1 ::
Train on 80000 samples, validate on 20000 samples
Epoch 1/1
Traceback (most recent call last):
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'desc' with dtype int32 and shape [?,30]
         [[Node: desc = Placeholder[dtype=DT_INT32, shape=[?,30], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "codesearcher.py", line 429, in <module>
    codesearcher.train(model)
  File "codesearcher.py", line 209, in train
    hist = model.fit([chunk_padded_methnames,chunk_padded_apiseqs,chunk_padded_tokens, chunk_padded_good_descs, chunk_padded_bad_descs], epochs=1, batch_size=batch_size, validation_split=split)
  File "/home-local/matan.pugach/deep-code-search/keras/models.py", line 225, in fit
    return self._training_model.fit(x, y, **kwargs)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1485, in fit
    initial_epoch=initial_epoch)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1140, in _fit_loop
    outs = f(ins_batch)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2075, in __call__
    feed_dict=feed_dict)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'desc' with dtype int32 and shape [?,30]
         [[Node: desc = Placeholder[dtype=DT_INT32, shape=[?,30], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'desc', defined at:
  File "codesearcher.py", line 424, in <module>
    model.build()
  File "/home-local/matan.pugach/deep-code-search/keras/models.py", line 146, in build
    desc = Input(shape=(self.data_params['desc_len'],), dtype='int32', name='desc')
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/keras/engine/topology.py", line 1375, in Input
    input_tensor=tensor)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/keras/engine/topology.py", line 1286, in __init__
    name=self.name)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 349, in placeholder
    x = tf.placeholder(dtype, shape=shape, name=name)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1777, in placeholder
    return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4521, in placeholder
    "Placeholder", dtype=dtype, shape=shape, name=name)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/home/matan.pugach/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'desc' with dtype int32 and shape [?,30]
         [[Node: desc = Placeholder[dtype=DT_INT32, shape=[?,30], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

I have installed all packages according to the requirements.

Why are the results totally different when the same query is searched twice?

Hi,
I ran the search twice with the same input query string, but the results are different.
My input is the first string from train.desc.txt, so I thought the result might be similar to the first entry in train.xxx.txt, but it is not. Even searching twice gives totally different results. From my understanding, if the model is fixed and the input is fixed, the result should be fixed.

The following is a copy of the test.

Input Query: determine whether the specified name has been used as a key in this table or any of its parents
How many results? 5
('public int getTransportTypeId ( ) { return Integer . parseInt ( transport_type_id ) ; } \n', 0.23941047)

('public Long ( java . lang . String name , InputStream i ) throws IOException , FormatException { this ( name , new DataInputStream ( i ) . readLong ( ) ) ; } \n', 0.29885766)

('public YearMonth withYear ( int year ) { int [ ] newValues = getValues ( ) ; newValues = getChronology ( ) . year ( ) . set ( this , YEAR , newValues , year ) ; return new YearMonth ( this , newValues ) ; } \n', 0.26326206)

('public static final Token newToken ( int ofKind ) { switch ( ofKind ) { default : return new Token ( ) ; } } \n', 0.24075004)

('public static int osisIdToVerseNum ( String osisID ) { if ( osisID != null ) { String [ ] parts = osisID . split ( "|." ) ; if ( parts . length > 1 ) { String verse = parts [ parts . length - 1 ] ; return Integer . valueOf ( verse ) ; } } return 0 ; } \n', 0.23932476)
Input Query: determine whether the specified name has been used as a key in this table or any of its parents
How many results? 5
('protected boolean hasUncompressedStrip ( ) { return mStripBytes . size ( ) != 0 ; } \n', 0.22734238)

('public static < T > Predicate < T > or ( final Predicate < T > ... that ) { return new OrPredicate < T > ( that ) ; } \n', 0.25138146)

('public Object visit ( ASTPrimaryExpression node , Object data ) { node . childrenAccept ( this , data ) ; return data ; } \n', 0.24480948)

('public static int randColor ( ) { return randColor ( new Random ( ) ) ; } \n', 0.22334205)

('@ Override public void close ( ) throws IOException { super . close ( ) ; __socket . close ( ) ; } \n', 0.2098054)

How to parse the Java code?

I tried some methods, but the results are a little different.
For example, when I encounter a method call o.m(), I do not know how to get the class of o so that I can add O.m to the API sequence.
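For illustration, a rough sketch using the javalang parser (an assumption, not the authors' extractor); it only resolves the types of locally declared variables, which is exactly the hard part mentioned above:

import javalang

source = '''
class A {
    void f() {
        StringBuilder o = new StringBuilder();
        o.append("x");
    }
}
'''
tree = javalang.parse.parse(source)

# map locally declared variable names to their declared types
var_types = {}
for _, decl in tree.filter(javalang.tree.LocalVariableDeclaration):
    for d in decl.declarators:
        var_types[d.name] = decl.type.name      # o -> StringBuilder

# rewrite each call o.m() as DeclaredType.m when the receiver's type is known
for _, call in tree.filter(javalang.tree.MethodInvocation):
    receiver = var_types.get(call.qualifier, call.qualifier)
    print(f'{receiver}.{call.member}')          # prints StringBuilder.append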

Keras model data usage

Hello!

I admire the work you did in this paper and have several questions regarding the data partitioning and the way the model uses it, mainly because each time I re-run the model search script I get different top results.

  1. Regarding the files starting with 'train.': 80k examples are sampled from them every epoch, is this correct? I also wanted to make sure this is the data the model trains on. How many examples do they contain?

  2. Regarding the files starting with 'test.': What are they used for (is it validation)? Is their usage also sampled (i.e., the data is not used entirely each time)? How many examples do they contain?

  3. Regarding the files starting with 'use.': Are they the base for running the search queries? Is their usage sampled (i.e., is a subset of them used every time)? How many examples do they contain?

  4. Is there an intersection/subset relation between the contents of any of the train./test./use. files?

Evaluation benchmark dataset top 50 manually inspected relevance labels?

Hi @guxd

It is mentioned in the paper that two developers manually labelled relevant/non-relevant results for the top 50 Stack Overflow questions used as the benchmark.

I see the 50 queries are available in the paper. Could you please share the corresponding manually labelled relevant/non-relevant snippets for each query?

Thanks,
Laksh

Pytorch search FileNotFoundError

Hi, Xiaodong Gu!
I successfully completed the training and the word embedding yesterday.
But when I ran "python search.py --model JointEmbeder",
the error came up as follows:
from collections import Mapping, defaultdict
Constructing Model..
Traceback (most recent call last):
File "search.py", line 109, in
model.load_state_dict(torch.load(ckpt, map_location=device))
File "/home/blx/.local/lib/python3.7/site-packages/torch/serialization.py", line 419, in load
f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: './output/JointEmbeder/github/models/epo-1.h5'

Then I opened search.py and found:

parser.add_argument('--reload_from', type=int, default=-1, help='epoch to reload from')
…………
if __name__ == '__main__':
args = parse_args()
device = torch.device(f"cuda:{args.gpu_id}" if torch.cuda.is_available() else "cpu")
config=getattr(configs, 'config_'+args.model)()

##### Define model ######
logger.info('Constructing Model..')
model = getattr(models, args.model)(config)#initialize the model
ckpt=f'./output/{args.model}/{args.dataset}/models/epo{args.reload_from}.h5'

…………
The default value of --reload_from is -1, but the files in './output/JointEmbeder/github/models/' are "epo10000.h5", "epo20000.h5", "epo30000.h5", etc.

So how should I change the code to make it work?
Thank you very much!
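A likely workaround (a guess based on the --reload_from argument shown above, not an official answer) is to pass the epoch number of one of the checkpoints that actually exist, so that ckpt resolves to an existing file, for example:

python search.py --model JointEmbeder --reload_from 30000   # loads ./output/JointEmbeder/github/models/epo30000.h5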

about eval func

def eval(self, model, poolsize, K):
    def ACC(real, predict):
        sum = 0.0
        for val in real:
            try: index = predict.index(val)
            except ValueError: index = -1
            if index != -1: sum = sum + 1
        return sum / float(len(real))
    def MAP(real, predict):
        sum = 0.0
        for id, val in enumerate(real):
            try: index = predict.index(val)
            except ValueError: index = -1
            if index != -1: sum = sum + (id + 1) / float(index + 1)
        return sum / float(len(real))
    def MRR(real, predict):
        sum = 0.0
        for val in real:
            try: index = predict.index(val)
            except ValueError: index = -1
            if index != -1: sum = sum + 1.0 / float(index + 1)
        return sum / float(len(real))
    def NDCG(real, predict):
        dcg = 0.0
        idcg = IDCG(len(real))
        for i, predictItem in enumerate(predict):
            if predictItem in real:
                itemRelevance = 1
                rank = i + 1
                dcg += (math.pow(2, itemRelevance) - 1.0) * (math.log(2) / math.log(rank + 1))
        return dcg / float(idcg)
    def IDCG(n):
        idcg = 0
        itemRelevance = 1
        for i in range(n):
            idcg += (math.pow(2, itemRelevance) - 1.0) * (math.log(2) / math.log(i + 2))
        return idcg

    if self._eval_sets is None:
        methnames,apiseqs,tokens,descs=self.load_valid_data_chunk(poolsize)
        self._eval_sets=dict()
        self._eval_sets['methnames']=methnames
        self._eval_sets['apiseqs']=apiseqs
        self._eval_sets['tokens']=tokens
        self._eval_sets['descs']=descs
    acc,mrr,map,ndcg=0,0,0,0
    data_len=len(self._eval_sets['descs'])
    for i in range(data_len):
        desc=self._eval_sets['descs'][i]#good desc
        descs=self.pad([desc]*data_len,self.data_params['desc_len'])
        methnames=self.pad(self._eval_sets['methnames'],self.data_params['methname_len'])
        apiseqs=self.pad(self._eval_sets['apiseqs'],self.data_params['apiseq_len'])
        tokens=self.pad(self._eval_sets['tokens'],self.data_params['tokens_len'])
        n_results = K          
        sims = model.predict([methnames, apiseqs,tokens, descs], batch_size=data_len).flatten()
        negsims=np.negative(sims)
        predict=np.argsort(negsims)#predict = np.argpartition(negsims, kth=n_results-1)
        predict = predict[:n_results]   
        predict = [int(k) for k in predict]
        real=[i]
        acc+=ACC(real,predict)
        mrr+=MRR(real,predict)
        map+=MAP(real,predict)
        ndcg+=NDCG(real,predict)                          
    acc = acc / float(data_len)
    mrr = mrr / float(data_len)
    map = map / float(data_len)
    ndcg= ndcg/ float(data_len)
    logger.info('ACC={}, MRR={}, MAP={}, nDCG={}'.format(acc,mrr,map,ndcg))
    return acc,mrr,map,ndcg

May I ask whether there is a problem with your eval function? In each iteration, real is a single item (real=[i]), but your MRR and MAP functions all return sum/float(len(real)), and the length of real is always 1.
My understanding is that one desc should correspond to a list of multiple real items as ground-truth data. Could you explain this eval function?

Result relevance for a larger codebase vs. a small codebase

Hi,
Based on the Keras model trained on the larger corpus of 18,233,872 entries, I did two kinds of code search: one on a small codebase and one on a larger codebase.
1) The larger-codebase search is based on the same use.* data from Google Drive.
2) The small-codebase search is based on the default 10000-entry use.* data from the git repo.

I used the queries from Table 1 in the paper. I find that the search results from 1) have higher similarity values than 2), but more relevant results can be found in the top 5 from 2) than from 1).

For example:
I input "convert string to date"
2) can get the the result such as ("public static Date convertStringToDate" , 0.41653952)

  1. get the result such as ("private booolean fileHasValidIdentifier" .... 0.43153855)

I think 1) has about 1000 times code items than 2), and 1) is very difficult to let more relevant code appeared in the top 5 result with similar value sort.

Have you observed the same behavior?

Problem loading the Data

Hi, I've been trying to run the full PyTorch model, but I had issues with the data.
I downloaded all of the data you provided and placed it at the assigned location path. Then, after train.py, I ran the following commands and got these errors; I hope you can help me out with them:

!python repr_code.py --model JointEmbeder --reload_from 340000

NumExpr defaulting to 2 threads.
Constructing Model..
loading data...
tcmalloc: large alloc 1116725248 bytes == 0xe93ea000 @  0x7f87e6b8d1e7 0x7f87e46f35e1 0x7f87e475a420 0x7f87e47e7f87 0x50a7f5 0x50c1f4 0x507f24 0x509277 0x594b01 0x54a17f 0x5517c1 0x5a9eec 0x50a783 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x50b053 0x634dd2 0x634e87 0x63863f 0x6391e1 0x4b0dc0 0x7f87e678ab97 0x5b26fa
tcmalloc: large alloc 1365450752 bytes == 0x140276000 @  0x7f87e6b8d1e7 0x7f87e46f35e1 0x7f87e475a420 0x7f87e47e7f87 0x50a7f5 0x50c1f4 0x507f24 0x509277 0x594b01 0x54a17f 0x5517c1 0x5a9eec 0x50a783 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x50b053 0x634dd2 0x634e87 0x63863f 0x6391e1 0x4b0dc0 0x7f87e678ab97 0x5b26fa
16262602 entries
 12% 199/1627 [03:24<25:10,  1.06s/it]tcmalloc: large alloc 4096000000 bytes == 0x7f83218e0000 @  0x7f87e6b8d1e7 0x7f87e46f35e1 0x7f87e4757c78 0x7f87e4757d93 0x7f87e47f5ea8 0x7f87e47f6704 0x7f87e47f6852 0x567193 0x59fe1e 0x7f87e47434ed 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x588e91 0x59fe1e 0x7f87e47434ed 0x50a47f 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x507f24
^C

!python search.py --model JointEmbeder --reload_from 340000

NumExpr defaulting to 2 threads.
Constructing Model..
Loading codebase (chunk size=2000000)..
Traceback (most recent call last):
  File "search.py", line 137, in <module>
    "inconsistent number of chunks, check whether the specified files for codebase and code vectors are correct!"    
AssertionError: inconsistent number of chunks, check whether the specified files for codebase and code vectors are correct!

Any suggestions about using this on javascript or python?

We're trying to apply this model to JavaScript and Python datasets, but we don't know how to get the class or type of a variable using static code analysis. So we just use the function name instead of <class name>.<function name> or <variable name>.<function name>.

Now the results do not seem good; do you think the class name of a function is very important in this model?

Do you have any suggestions about using this model with JavaScript or Python data? Thank you!

What role do the '<s>' and '</s>' labels play in the vocabulary?

I understand the '<pad>' and '<unk>' labels in the vocabulary, but what is the use of the '<s>' and '</s>' labels? Do you add '<s>' and '</s>' at the beginning and end of each sentence in the training set? I found that '</s>' labels exist in the api_seq sentences.

trained model

Can you publish the trained model on GitHub, based on the data on Google Drive?

Questions about the precision of the trained model

Dear Author,

When we run the model trained with the default dataset in deep-code-search-master/pytorch/data/github, we find that the relevance of the search results to the input question is not very high, with the highest similarity around 0.3.

  1. I want to confirm whether the results on the default dataset are indeed not that relevant. What similarity scores did you get?
    Question: "convert string to date "
    Result: ('public static String formatSeconds ( Object obj ) { long time = - 1L ; if ( obj instanceof Long ) { time = ( ( Long ) obj ) . longValue ( ) ; } else if ( obj instanceof Integer ) { time = ( ( Integer ) obj ) . intValue ( ) ; } return ( time + "-s" ) ; } \r\n', 0.31213856)

  2. How about using the larger dataset you provided on Google Drive: https://drive.google.com/drive/folders/1GZYLT_lzhlVczXjD6dgwVUvDDPHMB6L7?usp=sharing
    Will the precision based on the large dataset be much higher? We haven't got the result yet because it takes quite a long time to train.

Sincerely hope you can help answer. Thanks a lot.

Unused api tokens from vocabulary file vocab.apiseq.pkl

I checked the frequency of the apiseq tokens from the vocab.apiseq.pkl file in the training set (train.apiseq.h5) and the test set (test.apiseq.h5). I found that only 9869 of the 10000 tokens are used more than once. Is there a proper reason for this?
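For context, a sketch of the frequency check described above (illustrative; it assumes the h5 files use the /phrases EArray layout shown elsewhere in these issues):

from collections import Counter
import tables

counts = Counter()
for path in ['train.apiseq.h5', 'test.apiseq.h5']:
    with tables.open_file(path) as f:
        counts.update(int(t) for t in f.root.phrases[:])   # count every token id

used_more_than_once = sum(1 for c in counts.values() if c > 1)
print(used_more_than_once, 'of the 10000 vocabulary ids appear more than once')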

Training on Pytorch: NameError: name 'dims' is not defined

Sorry, I have a question. Maybe I did something wrong; I tried many times and it failed the same way each time.

from collections import Mapping, defaultdict
[nltk_data] Downloading package punkt to /home/blx/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
loading data...
18223872 entries
loading data...
10000 entries
Constructing Model..
epo:[1/5] itr:[0/284748] step_time:1s Loss=0.49624
epo:[1/5] itr:[100/284748] step_time:3s Loss=0.49965
epo:[1/5] itr:[200/284748] step_time:3s Loss=0.49992
epo:[1/5] itr:[300/284748] step_time:3s Loss=0.49965
epo:[1/5] itr:[400/284748] step_time:3s Loss=0.49847
epo:[1/5] itr:[500/284748] step_time:3s Loss=0.49712
epo:[1/5] itr:[600/284748] step_time:3s Loss=0.49503
epo:[1/5] itr:[700/284748] step_time:3s Loss=0.48588
epo:[1/5] itr:[800/284748] step_time:3s Loss=0.44364
epo:[1/5] itr:[900/284748] step_time:3s Loss=0.38405
epo:[1/5] itr:[1000/284748] step_time:3s Loss=0.35486
epo:[1/5] itr:[1100/284748] step_time:3s Loss=0.33373
epo:[1/5] itr:[1200/284748] step_time:3s Loss=0.31694
epo:[1/5] itr:[1300/284748] step_time:3s Loss=0.30157
epo:[1/5] itr:[1400/284748] step_time:3s Loss=0.28955
epo:[1/5] itr:[1500/284748] step_time:3s Loss=0.27922
epo:[1/5] itr:[1600/284748] step_time:3s Loss=0.27275
epo:[1/5] itr:[1700/284748] step_time:3s Loss=0.25574
epo:[1/5] itr:[1800/284748] step_time:3s Loss=0.25097
epo:[1/5] itr:[1900/284748] step_time:3s Loss=0.24439
epo:[1/5] itr:[2000/284748] step_time:3s Loss=0.24494
epo:[1/5] itr:[2100/284748] step_time:3s Loss=0.23787
epo:[1/5] itr:[2200/284748] step_time:3s Loss=0.22471
epo:[1/5] itr:[2300/284748] step_time:3s Loss=0.22553
epo:[1/5] itr:[2400/284748] step_time:3s Loss=0.20872
epo:[1/5] itr:[2500/284748] step_time:3s Loss=0.20546
epo:[1/5] itr:[2600/284748] step_time:3s Loss=0.20387
epo:[1/5] itr:[2700/284748] step_time:3s Loss=0.19426
epo:[1/5] itr:[2800/284748] step_time:3s Loss=0.18942
epo:[1/5] itr:[2900/284748] step_time:3s Loss=0.19045
epo:[1/5] itr:[3000/284748] step_time:3s Loss=0.18208
epo:[1/5] itr:[3100/284748] step_time:3s Loss=0.18582
epo:[1/5] itr:[3200/284748] step_time:3s Loss=0.17979
epo:[1/5] itr:[3300/284748] step_time:3s Loss=0.17698
epo:[1/5] itr:[3400/284748] step_time:3s Loss=0.17425
epo:[1/5] itr:[3500/284748] step_time:3s Loss=0.16560
epo:[1/5] itr:[3600/284748] step_time:3s Loss=0.16273
epo:[1/5] itr:[3700/284748] step_time:3s Loss=0.16250
epo:[1/5] itr:[3800/284748] step_time:3s Loss=0.15530
epo:[1/5] itr:[3900/284748] step_time:3s Loss=0.15151
epo:[1/5] itr:[4000/284748] step_time:3s Loss=0.15246
epo:[1/5] itr:[4100/284748] step_time:3s Loss=0.15054
epo:[1/5] itr:[4200/284748] step_time:3s Loss=0.15059
epo:[1/5] itr:[4300/284748] step_time:3s Loss=0.14527
epo:[1/5] itr:[4400/284748] step_time:3s Loss=0.14572
epo:[1/5] itr:[4500/284748] step_time:3s Loss=0.14195
epo:[1/5] itr:[4600/284748] step_time:3s Loss=0.13322
epo:[1/5] itr:[4700/284748] step_time:3s Loss=0.13429
epo:[1/5] itr:[4800/284748] step_time:3s Loss=0.12924
epo:[1/5] itr:[4900/284748] step_time:3s Loss=0.12834
validating..
100%|██████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.26s/it]
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 248, in
train(args)
File "train.py", line 124, in train
acc1, mrr, map1, ndcg = validate(valid_set, model, 10000, 1)
File "train.py", line 206, in validate
sims = np.squeeze(dims, axis=1)
NameError: name 'dims' is not defined
Closing remaining open files:./data/github/train.tokens.h5...done./data/github/valid.apiseq.h5...done./data/github/train.apiseq.h5...done./data/github/valid.name.h5...done./data/github/train.name.h5...done./data/github/valid.desc.h5...done./data/github/train.desc.h5...done./data/github/valid.tokens.h5...done

Evaluation Benchmark on the trained model

Hi, I am trying to reproduce the paper's results. I re-trained the Keras model for 500 epochs (as written in the paper) to replicate the originally returned code snippets. I followed the readme instructions and still couldn't reach the paper's results. I then tried updating the reload param in configs.py to 500 and got the following responses for queries from the benchmark:

For the query:
iterate through a hashmap
I got the responses:

('void startBlock ( final int buildingCount ) { SwingUtilities . invokeLater ( new Runnable ( ) { @ Override public void run ( ) { statusLabel . setText ( "Computing-blockades" ) ; if ( buildingCount == 0 ) { blockadeProgress . setMaximum ( 1 ) ; blockadeProgress . setValue ( 1 ) ; } else { blockadeProgress . setMaximum ( buildingCount ) ; blockadeProgress . setValue ( 0 ) ; block = 0 ; } } } ) ; } \n', 0.035571914)

('public static void main ( String [ ] args ) { long maxMemory = runtime . maxMemory ( ) ; long allocatedMemory = runtime . totalMemory ( ) ; long freeMemory = runtime . freeMemory ( ) ; System . out . println ( "free-memory:-" + freeMemory / 1024 ) ; System . out . println ( "allocated-memory:-" + allocatedMemory / 1024 ) ; System . out . println ( "max-memory:-" + maxMemory / 1024 ) ; System . out . println ( "total-free-memory:-" + ( freeMemory + ( maxMemory - allocatedMemory ) ) / 1024 ) ; } \n', 0.033702463)

('@ Override public boolean equals ( Object obj ) { if ( ! ( obj instanceof ByteOrderMark ) ) { return false ; } ByteOrderMark bom = ( ByteOrderMark ) obj ; if ( bytes . length != bom . length ( ) ) { return false ; } for ( int i = 0 ; i < bytes . length ; i ++ ) { if ( bytes [ i ] != bom . get ( i ) ) { return false ; } } return true ; } \n', 0.03228843)

('static public RtspSession create ( String sessionId ) { if ( sessions . get ( sessionId ) != null ) { log . error ( "Session-key-conflit!!" ) ; return null ; } RtspSession session = new RtspSession ( sessionId ) ; sessions . put ( sessionId , session ) ; log . debug ( "New-session-created---id=" + sessionId ) ; return session ; } \n', 0.03165227)

('public List < Resource > getResources ( ) { if ( resources == null ) { resources = new ArrayList < Resource > ( ) ; } return this . resources ; } \n', 0.031491578)

('public List < String > getSingleCommentDelimiters ( ) { return new ArrayList < String > ( singleCommentDelimiters ) ; } \n', 0.031074077)

('public String getCanonicalHostName ( ) { try { return getHostByAddrImpl ( this ) . hostName ; } catch ( UnknownHostException ex ) { return getHostAddress ( ) ; } } \n', 0.030964658)

('public static < K , V > Map < K , V > createNewMap ( ) { return new HashMap < K , V > ( ) ; } \n', 0.030577734)

('public static String separatorsToWindows ( String path ) { if ( path == null || path . indexOf ( UNIX_SEPARATOR ) == - 1 ) { return path ; } return path . replace ( UNIX_SEPARATOR , WINDOWS_SEPARATOR ) ; } \n', 0.030175861)

('public synchronized void shutdown ( ) { dataQueue . clear ( ) ; shutdown = true ; } \n', 0.030141229)

('@ Override public void run ( ) { while ( ! destroyed ) { StreamOutput processStream = StreamOutput . EMPTY ; try { CapturedStreamOutput capturedProcessStream = null ; while ( ! destroyed && capturedProcessStream == null ) { synchronized ( toCapture ) { if ( toCapture . containsKey ( key ) ) { capturedProcessStream = toCapture . remove ( key ) ; } else { toCapture . wait ( ) ; } } } if ( ! destroyed ) { processStream = capturedProcessStream ; capturedProcessStream . readAndClose ( ) ; } } catch ( InterruptedException e ) { logger . info ( "OutputCapture-interrupted,-exiting" ) ; break ; } catch ( IOException e ) { logger . error ( "Error-reading-process-output" , e ) ; } finally { synchronized ( fromCapture ) { fromCapture . put ( key , processStream ) ; fromCapture . notify ( ) ; } } } } \n', 0.029915098)

('public static Document parse ( File file ) throws IOException , ParserConfigurationException , SAXException { DocumentBuilderFactory factory = DocumentBuilderFactory . newInstance ( ) ; DocumentBuilder builder = factory . newDocumentBuilder ( ) ; return builder . parse ( file ) ; } \n', 0.02982778)

('private static String _latexFromHtml ( Collection col , String latex ) { latex = latex . replaceAll ( "<br(-/)?>|<div>" , "-" ) ; latex = Utils . stripHTML ( latex ) ; return latex ; } \n', 0.029724797)

('public synchronized boolean clearExpired ( final Date date ) { if ( date == null ) { return false ; } boolean removed = false ; for ( Iterator < Cookie > it = cookies . iterator ( ) ; it . hasNext ( ) ; ) { if ( it . next ( ) . isExpired ( date ) ) { it . remove ( ) ; removed = true ; } } return removed ; } \n', 0.029710343)

('private String getAssetSource ( String assetPath ) { if ( StringUtils . isNotBlank ( assetPath ) ) { File asset = new File ( rootDir + File . separator + assetPath ) ; if ( asset . exists ( ) && asset . canRead ( ) ) { try { return FileUtils . readFileToString ( asset ) ; } catch ( IOException e ) { LOG . error ( "Error-reading-asset-source:-" + asset . getPath ( ) , e ) ; } } } return null ; } \n', 0.02968416)

Only one of these was related to a hash map (and none involved iteration).

For the query:
convert an inputstream to a string
I got the responses:

('public static void registerTarget ( javax . rmi . CORBA . Tie tie , java . rmi . Remote target ) { if ( utilDelegate != null ) { utilDelegate . registerTarget ( tie , target ) ; } } \n', 0.059240054)

('public Quaternion normalise ( Quaternion dest ) { return normalise ( this , dest ) ; } \n', 0.05841601)

('public static IntFunction chain ( final IntFunction g , final IntFunction h ) { return new IntFunction ( ) { public final int apply ( int a ) { return g . apply ( h . apply ( a ) ) ; } } ; } \n', 0.057828866)

('void focusWheel ( ) { setFocusType ( 1 ) ; } \n', 0.05775229)

('public ServerSocket createServerSocket ( int port ) throws IOException { _logger . debug ( "PortalSocketFactory.createServerPort()---port=" + port ) ; ServerSocket srvsocket = new ServerSocket ( port ) ; _logger . debug ( "PortalSocketFactory-returned-new-ServerSocket(port)..." ) ; return ( srvsocket ) ; } \n', 0.05753772)

('@ Override public final long evaluateAdj ( final int [ ] adj ) { int a , old_i , cur_i ; long totalDist ; final double [ ] m ; double phi1 , phi2 , chi1 , chi2 , twoth1 , twoth2 , t ; old_i = 1 ; totalDist = 0l ; m = this . m_coords ; a = ( ( old_i - 1 ) * 3 ) ; phi2 = m [ a ++ ] ; chi2 = m [ a ++ ] ; twoth2 = m [ a ++ ] ; for ( ; ; ) { cur_i = adj [ old_i - 1 ] ; a = ( ( cur_i - 1 ) * 3 ) ; phi1 = m [ a ++ ] ; chi1 = m [ a ++ ] ; twoth1 = m [ a ++ ] ; t = Math . abs ( phi1 - phi2 ) ; totalDist += ( ( int ) ( 0.5d + ( Math . max ( Math . min ( t , Math . abs ( t - ( 360d ) ) ) , Math . max ( Math . abs ( chi1 - chi2 ) , Math . abs ( twoth1 - twoth2 ) ) ) ) ) ) ; phi2 = phi1 ; chi2 = chi1 ; twoth2 = twoth1 ; if ( cur_i == 1 ) { return totalDist ; } old_i = cur_i ; } } \n', 0.05679653)

('public boolean equals ( Object o ) { return m_Root . equals ( ( ( Trie ) o ) . getRoot ( ) ) ; } \n', 0.056255654)

('@ Override public P2pActivityCorrections fetchByPrimaryKey ( Serializable primaryKey ) throws SystemException { return fetchByPrimaryKey ( ( ( Long ) primaryKey ) . longValue ( ) ) ; } \n', 0.056060914)

('public int getWidth ( ) { init ( ) ; return width ; } \n', 0.056045968)

('public void dispose ( ) { logDebug ( "Disposing." ) ; mSetupDone = false ; if ( mServiceConn != null ) { logDebug ( "Unbinding-from-service." ) ; if ( mContext != null ) mContext . unbindService ( mServiceConn ) ; } mDisposed = true ; mContext = null ; mServiceConn = null ; mService = null ; mPurchaseListener = null ; } \n', 0.055970065)

('private List < PrivacyListItem > tranformPrivacyItemsToPrivacyListItems ( List < PrivacyItem > items ) { List < PrivacyListItem > rItems = new ArrayList < PrivacyListItem > ( ) ; for ( int i = 0 ; i < items . size ( ) ; i ++ ) { rItems . add ( new PrivacyListItem ( items . get ( i ) . getType ( ) . ordinal ( ) , items . get ( i ) . getValue ( ) ) ) ; } return rItems ; } \n', 0.055890348)

('public BigInteger multiply ( BigInteger val ) { if ( val . signum == 0 || signum == 0 ) return ZERO ; int [ ] result = multiplyToLen ( mag , mag . length , val . mag , val . mag . length , null ) ; result = trustedStripLeadingZeroInts ( result ) ; return new BigInteger ( result , signum == val . signum ? 1 : - 1 ) ; } \n', 0.05568312)

('public java . util . Iterator getAllMembers ( ) throws GroupsException { return primGetAllMembers ( new HashSet ( ) ) . iterator ( ) ; } \n', 0.055669665)

('public void checkTextOnly ( ) { for ( int i = 0 ; i < size ; i ++ ) { ( ( CharSequence ) parts [ i ] ) . getClass ( ) ; } } \n', 0.055539)

('public void drawBarsOnGraph ( Graphics2D g2d , ArrayList < ComparableLabel > orderedDateSet , HashMap < ComparableLabel , Integer [ ] > barDataPoints , long yMaxMark ) { int sectionWidth = this . graphWidth / orderedDateSet . size ( ) ; int xOffset = sectionWidth / 2 ; int yValue ; float yOffsetPerc ; int numberOfBars = barDataPoints . get ( orderedDateSet . get ( 0 ) ) . length ; int barWidth = sectionWidth / ( numberOfBars + 1 ) ; for ( int datePos = 0 ; datePos < orderedDateSet . size ( ) ; datePos ++ ) { for ( int barNumber = 0 ; barNumber < numberOfBars ; barNumber ++ ) { yValue = barDataPoints . get ( orderedDateSet . get ( datePos ) ) [ barNumber ] ; yOffsetPerc = yValue / ( float ) yMaxMark ; int xLeftBar = this . graphLeft + datePos * sectionWidth + barWidth / 2 + barWidth * barNumber ; drawBar ( g2d , Math . round ( this . graphHeight * yOffsetPerc ) , barWidth , xLeftBar , this . graphBottom , this . barColors [ barNumber ] ) ; if ( Math . round ( this . graphHeight * yOffsetPerc ) == 0 && yValue != 0 ) { g2d . setColor ( this . barColors [ barNumber ] ) ; g2d . drawLine ( xLeftBar , this . graphBottom , xLeftBar + barWidth , this . graphBottom ) ; } } } } \n', 0.0555174)

And similarly low-quality results that do not match the result table for the other queries.

Query output

Hi, given the query 'sort an array' and requesting 1 result, I received about 9 results.
According to the code, a thread is created for each code-representation chunk (leading to several outputs, one from each chunk). Is this correct? If so, what is considered to be the query response? If not, could you please provide an explanation?

Input Query: Sort an array
How many results? 1

('private String extractFlashFileFrom ( HtmlNavigator html ) { String script = html . firstElementOrNull ( "//script[contains(.,-\'hs:\')]" ) . getValue ( ) ; if ( script != null ) { Pattern pattern = Pattern . compile ( "hs:|"(.*?)|.flv" ) ; Matcher matcher = pattern . matcher ( script ) ; if ( matcher . find ( ) ) { return "http://video.ted.com/" + matcher . group ( 1 ) + ".flv" ; } } return null ; } \n', 0.02812706)

('public ObjectName getStatusLoggerObjectName ( final ObjectName loggerContextObjName ) { if ( ! isLoggerContext ( loggerContextObjName ) ) { throw new IllegalArgumentException ( "Not-a-LoggerContext:-" + loggerContextObjName ) ; } final String cxtName = loggerContextObjName . getKeyProperty ( "type" ) ; final String name = String . format ( StatusLoggerAdminMBean . PATTERN , cxtName ) ; try { return new ObjectName ( name ) ; } catch ( final MalformedObjectNameException ex ) { throw new IllegalStateException ( name , ex ) ; } } \n', 0.027286334)

('public static String getIdFromVersionedId ( String versionedId ) { if ( versionedId == null ) return null ; int versionDelimiterPos = versionedId . lastIndexOf ( VERSION_DELIMITER ) ; if ( versionDelimiterPos != - 1 ) { return versionedId . substring ( 0 , versionDelimiterPos ) ; } else { return null ; } } \n', 0.027727718)

('@ Override public void write ( String str , int off , int len ) { int newcount = count + len ; if ( newcount > buf . length ) { buf = Arrays . copyOf ( buf , Math . max ( buf . length << 1 , newcount ) ) ; } str . getChars ( off , off + len , buf , count ) ; count = newcount ; } \n', 0.027301062)

('private String reEncodeHtml ( String str ) { StringBuilder builder = new StringBuilder ( ) ; if ( str == null ) return "" ; String [ ] sources = new String [ ] { "<![CDATA[" , "]]>" , "&gt;" , "&lt;" , "&amp;" , "&#8217;" , "&#8220;" , "&#8221;" } ; String [ ] dests = new String [ ] { "" , "" , ">" , "<" , "&" , "\'" , """ , """ } ; builder . append ( TextUtils . replace ( str , sources , dests ) ) ; return builder . toString ( ) ; } \n', 0.028564211)

('public Float getLatitude ( GetMode mode ) { return obtainDirect ( FIELD_Latitude , Float . class , mode ) ; } \n', 0.028008113)

('public void xorOff ( ) { offscreen . setPaintMode ( ) ; } \n', 0.027301062)

('public State graphSearch ( BinaryPuzzle problem ) { final State initialState = new State ( problem , null ) ; LinkedList < State > frontier = new LinkedList < State > ( ) ; LinkedList < State > exploredStates = new LinkedList < State > ( ) ; frontier . addAll ( PuzzleExpander . expand ( initialState ) ) ; while ( ! frontier . isEmpty ( ) ) { State chosenNode = frontier . getFirst ( ) ; frontier . removeFirst ( ) ; if ( chosenNode . isGoalState ( ) ) { return chosenNode ; } exploredStates . addLast ( chosenNode ) ; ArrayList < State > expandedNodes = PuzzleExpander . expand ( chosenNode ) ; for ( State state : expandedNodes ) { if ( ! frontier . contains ( state ) && ! exploredStates . contains ( state ) ) { frontier . addLast ( state ) ; } } } return null ; } \n', 0.026943482)

('private Page ( final String token , final boolean skipHistory ) { this ( null , token , skipHistory ) ; } \n', 0.026943482)

Question about camel and lower case handling in raw data processing

In method name and token extraction, we split camel case into lowercase words.
I am wondering whether, in API sequence and description extraction, we should also split camel case and snake case and convert them to lower case. Since the Java code we crawled has method calls such as 'setIncludingFilterTopNode' and 'getOriginalSavedSearch', if we do not split them, the vocab file contains these long words, which I think is too specific and not general enough. I am not sure whether performance would improve if we did so.
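For illustration, a sketch of the kind of camel/snake-case splitting discussed here (the regex and function name are assumptions, not the repository's parser):

import re

def split_identifier(name):
    # split on underscores, then on camel-case boundaries, and lowercase everything
    parts = []
    for chunk in name.split('_'):
        parts.extend(re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', chunk))
    return [p.lower() for p in parts]

# split_identifier('setIncludingFilterTopNode') -> ['set', 'including', 'filter', 'top', 'node']
# split_identifier('getOriginalSavedSearch')   -> ['get', 'original', 'saved', 'search']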
