agemagician / codetrans
Pretrained Language Models for Source code
License: MIT License
Hi, I've been trying to reproduce the results of Code Documentation Generation but have failed to do so. Could you please explain how you process the input (do you directly use the provided tokenized data, or tokenize manually with tree_sitter), and how you calculate the smoothed BLEU-4 scores? See below for the details:
Take JavaScript for example: the results for CodeTrans-TF-Small/Base/Large reported in the paper are 17.23, 18.25, 18.98, respectively.
First, I directly used the tokenized data provided by CodeBERT (or CodeXGLUE); my reproduced results are 15.8, 16.96, and 17.67.
Second, I tokenized the source code with tree_sitter following your provided pipeline (i.e., CodeTrans/prediction/multitask/fine-tuning/function documentation generation/javascript/small_model.ipynb), and the obtained results are 15.28, 16.91, and 17.61.
Other facts: I calculate the smoothed BLEU-4 score following CodeXGLUE (https://github.com/microsoft/CodeXGLUE/blob/main/Code-Text/code-to-text/evaluator/evaluator.py). I truncate the source and target sequences to at most 512 tokens before feeding them to the model.
We also cannot reproduce the results for the other languages on the Code Documentation Generation task. Please help resolve this. Thanks in advance!
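For concreteness, here is a minimal sketch of the kind of smoothed BLEU-4 I am computing: a pure-Python approximation using add-one smoothing on the higher-order n-gram counts and whitespace tokenization. This is my own sketch, not the CodeXGLUE evaluator itself, so exact scores may differ slightly:

```python
# Minimal sketch of smoothed sentence-level BLEU-4 (add-one smoothing on
# n-gram match counts for n > 1). Approximates, but does not reproduce,
# the CodeXGLUE evaluator; tokens are assumed to be whitespace-separated.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_bleu4(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, 5):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        match = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if n > 1:  # smooth higher-order precisions so a zero count is not fatal
            match += 1
            total += 1
        if match == 0:
            return 0.0
        log_prec += math.log(match / total)
    # brevity penalty, as in standard BLEU
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec / 4)
```

An exact-match hypothesis scores 1.0; the reported corpus score is then the average over all test examples.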
For example, let's teach it LangChain with a carefully annotated dataset of question-answer pairs.
Thanks in advance.
I'm trying to fine-tune the model on a Kotlin dataset for code comment/code documentation tasks, but I'm getting RuntimeError: Could not infer dtype of dict.
More details are available at the link below:
https://stackoverflow.com/questions/75399318/runtimeerror-could-not-infer-dtype-of-dict
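For what it's worth, that error typically appears when a nested dict (e.g. a tokenizer's full output) ends up inside a dataset feature, so the collator effectively calls torch.tensor on a dict. A minimal sketch of a preprocessing shape that avoids this; fake_tokenize is a purely hypothetical stand-in for a real tokenizer call:

```python
# Hedged sketch of the usual cause of "RuntimeError: Could not infer dtype
# of dict": a map() function stores the tokenizer's output for the targets
# as a nested dict, so batching tries torch.tensor(dict). Copying out the
# id lists keeps every feature a flat list of ints.
# fake_tokenize is an assumption, standing in for a real tokenizer.

def fake_tokenize(text, max_length=8):
    # stand-in: pretend each character is a token id
    ids = [ord(c) for c in text][:max_length]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

def preprocess(example):
    model_inputs = fake_tokenize(example["code"])
    targets = fake_tokenize(example["doc"])
    # WRONG: model_inputs["labels"] = targets  -> nests a dict in a feature
    model_inputs["labels"] = targets["input_ids"]  # flat list of ints
    return model_inputs

row = preprocess({"code": "fun f()", "doc": "docs"})
```

If every feature in each mapped example is a flat list of numbers, tensor conversion during batching should no longer hit a dict.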
Hi, when playing with the example notebook provided at the following link: https://github.com/agemagician/CodeTrans/blob/main/prediction/single%20task/source%20code%20summarization/python/base_model.ipynb
I noticed the summary is an interrogative sentence.
But it seems that when the example was first created, the expected output was a declarative sentence, as follows:
Has the model been updated recently so that it outputs the summary differently?
Thank you!
Hi, I have a question about the difference between these three tasks: code documentation generation, code summarization, and code comment generation. My understanding is that all three tasks generate natural-language descriptions for a code snippet.
Rather than just using the pre-trained model for the single task Source Code Summarization, would it be better to integrate a recent LLM into it?
The many Jupyter notebooks with pipelines for the tasks your model can perform are great, but it would also be nice to have a fine-tuning script. Ideally it would be a slight modification of the transformers run_mlm.py, but a custom script would suffice.
Are you planning to provide checkpoints for your models?
Hi.
I was trying to run your code in single task/api generation/t5 interface/base_model.ipynb on Colab, and I am receiving the error below after calling model.predict.
model.predict(
input_file="input.txt",
output_file=predict_outputs_path,
checkpoint_steps=840000,
beam_size=4,
vocabulary=vocab,
# Select the most probable output token at each step.
temperature=0,
)
=======================================
INFO:tensorflow:Using config: {'_model_dir': 'base', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 5000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-e1bf683e0ef5> in <module>()
7 vocabulary=vocab,
8 # Select the most probable output token at each step.
----> 9 temperature=0,
10 )
4 frames
/usr/local/lib/python3.7/dist-packages/mesh_tensorflow/transformer/utils.py in infer_model(estimator, vocabulary, sequence_length, batch_size, model_type, model_dir, eval_checkpoint_step, checkpoint_paths, decode_fn)
1853 batch_size=batch_size,
1854 sequence_length=sequence_length,
-> 1855 checkpoint_path=checkpoint_path)
1856
1857
TypeError: 'str' object is not callable
In call to configurable 'infer_model' (<function infer_model at 0x7f68488db950>)
How can I fix this issue?