

Rubhus-Cross-Langauge-Code-Clone-Detector

Paper Link - https://www.computer.org/csdl/journal/ts/5555/01/10242168/1QdYTNsq4Xm

This repository contains the source code for the paper "Improving Cross-Language Code Clone Detection via Code Representation Learning and Graph Neural Networks".

📜 Code Organisation

The repository currently contains the model files (rubhusModel.py, baselineModel.py), the trainer files (trainerBaseline.py, trainerRubhus.py) and a helper functions file.

Repository
├── helper functions
├── models
└── trainers

Once the setup steps below are complete, the repository also contains the dataset files.

⚙ Setting Up

1. Clone the repo

   git clone https://github.com/Akash-Sharma-1/Rubhus-Cross-Langauge-Clone-Detector.git

2. Installing Dependencies

   pip install -r requirements.txt

Note - PyTorch and PyTorch Geometric (plus their associated dependencies) must be installed in versions that are compatible with your CUDA version and operating system.
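A quick way to see which versions ended up installed is a small check script. The sketch below is not part of the repository; the function name `describe_environment` is hypothetical. It only reports what it finds and does not decide whether a given combination is compatible.

```python
import importlib
import importlib.util


def describe_environment():
    """Report installed versions of the packages mentioned in the note."""
    info = {}
    for name in ("torch", "torch_geometric"):
        if importlib.util.find_spec(name) is None:
            info[name] = None  # package is not installed
            continue
        try:
            module = importlib.import_module(name)
            info[name] = getattr(module, "__version__", "unknown")
        except Exception as exc:  # e.g. a wheel built for another CUDA version
            info[name] = f"import failed: {exc}"

    info["cuda_build"] = None
    info["cuda_available"] = False
    torch_version = info["torch"]
    if isinstance(torch_version, str) and not torch_version.startswith("import failed"):
        import torch

        info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False on a machine with a GPU, the installed PyTorch build probably does not match the local CUDA setup.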

3. Setting up Datasets

The datasets used in the experiments could not be uploaded to the repository because of file size limits. They must be downloaded separately and can then be used independently for testing or running the models.

3.1 Extraction of Dataset Files

  • Java-Python Dataset - Link
  • C-Java Dataset - Link

3.2 Setting up Dataset Files

  • Unzip the downloaded files and extract the dataset files.
  • Place these extracted files in the root directory of this repository
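The unzip step can also be scripted. This is a minimal sketch using only the standard library; the function name `extract_dataset` and the archive name in the usage line are hypothetical, since the README does not name the downloaded files.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, destination="."):
    """Unzip one downloaded dataset archive into `destination`.

    Returns the sorted list of entries contained in the archive.
    """
    target = Path(destination)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(Path(archive_path)) as zf:
        zf.extractall(target)
        return sorted(zf.namelist())
```

Example call from the repository root, with an assumed archive name: `extract_dataset("java_python_dataset.zip")`.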

3.3 Configuration of file paths

  • Dataset paths - After extraction, the clone pair and non-clone pair text files must be stored in a folder named 'CloneDetectionSrc' in the root directory.
  • Processed Data folder - A folder named 'cloneDetectionData' must be created in the root directory. All processed data files used for training the model are stored there.
  • Trained Models folder - A folder named 'cloneDetectionModels' must be created in the root directory. All trained model files are stored there.
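The folder layout described above can be prepared with a few lines of Python. This is a sketch under the assumption that the script is run from the repository root; the function name `prepare_layout` is hypothetical, and only the three folder names come from this README.

```python
from pathlib import Path

# Folder names as given in this README.
SOURCE_DIR = "CloneDetectionSrc"       # clone / non-clone pair text files
PROCESSED_DIR = "cloneDetectionData"   # processed data written before training
MODELS_DIR = "cloneDetectionModels"    # trained model files


def prepare_layout(root="."):
    """Create the two output folders and report whether the dataset folder is populated."""
    root = Path(root)
    for name in (PROCESSED_DIR, MODELS_DIR):
        (root / name).mkdir(parents=True, exist_ok=True)

    source = root / SOURCE_DIR
    # The dataset folder is filled by hand from the downloads, so it is only checked here.
    source_ready = source.is_dir() and any(source.iterdir())
    return {
        "source": source,
        "processed": root / PROCESSED_DIR,
        "models": root / MODELS_DIR,
        "source_ready": source_ready,
    }


if __name__ == "__main__":
    layout = prepare_layout()
    if not layout["source_ready"]:
        print(f"Place the dataset pair files in {layout['source']} before training.")
```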

💫 Usage

1. Configuration of Hyperparameters

  • Hyperparameters are defined inside the trainer files and can be modified as needed.

The hyperparameter variables are explained in the following table:

| Var Name | Hyperparameter | Default Value |
|---|---|---|
| dim | Embedding size (dimension) for the model | 64 |
| epochs | Number of training epochs | 25 |
| batch_size | Size of the data batch | 32 |
| lamda | Regulariser | 0.001 |
| use_unsup_loss | Usage of unsupervised loss in model training | True |
| lr | Learning rate (initial) | 0.001 |
| optimizer | Optimizer of loss | Adam |
| scheduler | Learning rate scheduler | ReduceLROnPlateau |
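For experiments that vary these values, it can help to keep them in one place. The trainer files define them as plain variables, so the `HyperParams` class below is a hypothetical grouping for illustration, not code from the repository. Only the names and default values come from the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HyperParams:
    """Defaults copied from the hyperparameter table (hypothetical container)."""

    dim: int = 64                  # embedding size
    epochs: int = 25               # number of training epochs
    batch_size: int = 32           # size of the data batch
    lamda: float = 0.001           # regulariser (spelling follows the variable name)
    use_unsup_loss: bool = True    # whether the unsupervised loss is used
    lr: float = 0.001              # initial learning rate
    optimizer: str = "Adam"
    scheduler: str = "ReduceLROnPlateau"

    def __post_init__(self):
        # Basic sanity checks; these are assumptions, not rules from the paper.
        if min(self.dim, self.epochs, self.batch_size) < 1:
            raise ValueError("dim, epochs and batch_size must be positive integers")
        if self.lr <= 0 or self.lamda < 0:
            raise ValueError("lr must be positive and lamda non-negative")


# Example: override a single value while keeping the other defaults.
longer_run = replace(HyperParams(), epochs=50)
```

The values would still have to be copied into the trainer file by hand, because the trainers read their own variables.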

2. Training RUBHUS Model

   python3 trainerRubhus.py

3. Training Baseline Model

   python3 trainerBaseline.py

โญ About the original setup

  • In our experiments, we trained the Rubhus and Baseline models separately on the Java-Python dataset and on the C-Java dataset.
  • The hyperparameters used in the original experiments, which are also the ones set in this source code, are reported in the paper.
  • We ran our experiments on an RTX 2080 Ti GPU. The time analysis of the tool is also reported in the paper.

📑 Citing the project

If you use this project for academic work, we would be thankful if you could cite the following paper (BibTeX):

@{,
 author = {Nikita Mehrotra*, Akash Sharma*, Rahul Purandare},
 title = {Improving Cross-Language Code Clone Detection via Code Representation Learning and Graph Neural Networks},
 ....
}

rubhus-cross-langauge-code-clone-detector's People

Contributors

akash-sharma-1, nikitamehrotra12


rubhus-cross-langauge-code-clone-detector's Issues

Link to datasets is broken

Neither dataset link works. Each one returns:
Sorry, the file you have requested does not exist.

Would it be possible to provide a Zenodo link, or a link to an equivalent hosting repository, for the data?

Lack of raw data processing code

Hi there,

The dataset provided through the link cannot be used directly in this project. Could you also provide the source code for processing the raw data? That would be very helpful.
