code_embedding

Java code embeddings from compiled class files for code similarity tasks

Summary

A novel and simple approach for generating source code embeddings for code similarity tasks.

This compiler-in-the-loop approach works by compiling the high-level source code to a typed intermediate language. Here we demonstrate it for Java using the JVM instruction set; for other languages such as C/C++, the LLVM intermediate representation could be used.

We take the instruction sequence in each method and generate k-subsequences of instructions.

  • Extra type information is attached to 'invoke' instructions: function calls are abstracted by their parameter and return types, which are attached to the invoke instruction.
  • The class name is attached to the 'new' instruction.
  • Parameter and return types from the function definition are currently not used, since they are not part of the instruction stream.

k-subsequences of instructions:

  • For k = 1 .. N (currently N = 2):
    • We take every k-subsequence in the instruction sequence, generating a k-gram.
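The k-subsequence extraction described above can be sketched as follows. This is a minimal illustration; the instruction names and the type annotation on the 'invoke' instruction are simplified assumptions, not the repo's exact encoding:

```python
def k_grams(instructions, max_k=2):
    """Generate every contiguous k-subsequence (k-gram) for k = 1 .. max_k."""
    grams = []
    for k in range(1, max_k + 1):
        for i in range(len(instructions) - k + 1):
            grams.append(tuple(instructions[i:i + k]))
    return grams

# A toy JVM instruction sequence for one method; the type descriptor on
# 'invoke' mirrors the parameter/return-type abstraction described above.
seq = ["aload_0", "invoke(I)Ljava/lang/String;", "areturn"]
print(k_grams(seq, max_k=2))
```

With N = 2 this yields the three unigrams followed by the two bigrams of the sequence.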

I experiment with 4 approaches:

  • Subsequence k-gram embeddings generated by 3 methods:
    • Random walks on the control flow graph (CFG) of a method, similar to graph vertex embeddings.
    • On the entire instruction sequence of a method without any path sensitivity.
    • Multi-task learning jointly on the above 2 tasks.
  • TF-IDF-style method embeddings.

Path sensitive k-gram embeddings

Here, we generate path sequences via random walks on the control flow graph (CFG). If the number of paths is small, complete walks are performed. We then learn k-gram embeddings from these path sequences using a Skip-Gram model, similar to graph vertex embeddings.

Path embeddings are generated by summing all the k-gram embeddings in the path, and method embeddings are generated by summing all of a method's path embeddings.

Method similarity checking is done by computing vector similarity on method embeddings.
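A minimal sketch of the path-sensitive scheme, assuming a CFG given as an adjacency list over basic blocks and an already-trained embedding table (the `embeddings` values below are made up for illustration; the real vectors come from the Skip-Gram model trained on the walk sequences):

```python
import random

rng = random.Random(0)

def random_walk(cfg, start, max_len=10):
    """Walk the CFG from `start`, picking a random successor at each step,
    until a block with no successors (method exit) or max_len is reached."""
    path, node = [start], start
    while cfg.get(node) and len(path) < max_len:
        node = rng.choice(cfg[node])
        path.append(node)
    return path

# Toy CFG: B0 branches to B1 or B2; both join at B3.
cfg = {"B0": ["B1", "B2"], "B1": ["B3"], "B2": ["B3"], "B3": []}

# Hypothetical embeddings for the k-grams of each block (illustrative only).
embeddings = {"B0": [1.0, 0.0], "B1": [0.0, 1.0],
              "B2": [0.5, 0.5], "B3": [1.0, 1.0]}

def path_embedding(path):
    """Sum the embeddings along one path."""
    return [sum(vals) for vals in zip(*(embeddings[b] for b in path))]

paths = [random_walk(cfg, "B0") for _ in range(4)]
# Method embedding: sum of all its path embeddings.
method_embedding = [sum(vals) for vals in zip(*(path_embedding(p) for p in paths))]
print(method_embedding)
```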

Path insensitive k-gram embeddings

In this approach, embeddings for subsequence n-grams (of instruction sequences) are learnt using a Word2Vec-style skip-gram model (currently n <= 2). Method embeddings are generated by summing up the embeddings of subsequence n-grams contained in it.

Method similarity checking is done by computing vector similarity on method embeddings.
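The summation and similarity steps can be sketched as follows, assuming a hypothetical pre-trained n-gram vector table (the instruction names and vector values are illustrative, not the repo's trained vectors):

```python
import math

# Hypothetical pre-trained n-gram embedding table (n <= 2).
vecs = {
    ("iload_1",): [1.0, 0.0],
    ("iload_2",): [0.0, 1.0],
    ("iload_1", "iload_2"): [0.5, 0.5],
}

def method_embedding(instructions, max_k=2):
    """Sum the embeddings of every contiguous k-subsequence (k = 1 .. max_k)
    that has a trained vector."""
    emb = [0.0, 0.0]
    for k in range(1, max_k + 1):
        for i in range(len(instructions) - k + 1):
            gram = tuple(instructions[i:i + k])
            if gram in vecs:
                emb = [a + b for a, b in zip(emb, vecs[gram])]
    return emb

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

m1 = method_embedding(["iload_1", "iload_2"])
m2 = method_embedding(["iload_2", "iload_1"])
print(cosine(m1, m2))
```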

Multi-Task Learning (MTL) of k-gram embeddings

Here, k-gram embeddings are learnt by a Multi-Task Skip-Gram Model that jointly optimizes on the above two tasks: the path sensitive learning and path insensitive learning.

TF-IDF embeddings

In this approach, during the learning phase, the IDF values for the features are learnt and stored in a JSON file.

During similarity checking, the TF vectors are generated and scaled using the previously learnt IDF values. Cosine similarity is used as the similarity measure.
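A minimal sketch of the TF-IDF scheme. The feature names and IDF values below are made up; in the real pipeline the IDF values are loaded from the JSON file produced in the learning phase:

```python
import math

# Hypothetical learnt IDF values, as would be loaded from the stored JSON file.
idf = {"iload": 0.2, "invoke(I)I": 1.5, "new:java/lang/String": 2.0}

def tf_idf_vector(features):
    """Term-frequency vector over the known features, scaled by IDF."""
    return [features.count(f) * idf[f] for f in sorted(idf)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

m1 = tf_idf_vector(["iload", "iload", "invoke(I)I"])
m2 = tf_idf_vector(["iload", "invoke(I)I", "new:java/lang/String"])
print(cosine(m1, m2))
```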

Pre-requisites

  • A recent version of Python 3.
  • A recent version of the JDK (javap is used to generate JVM disassembly) - javap must be on the PATH.
  • scikit-learn: pip install scikit-learn
  • pytorch: pip install torch (not required for TF-IDF embeddings)

Running (n-subsequence embeddings)

Embedding generation (path insensitive)

python compute_nsubseqs.py <folder containing class files> <subseq output path>
cd word2vec
python trainer.py <subseq output path> <vec output file path>

Embedding generation (path sensitive)

python cfg_embedding.py <folder containing class files> <subseq output path>
cd word2vec
python trainer.py <subseq output path> <vec output file path>

MTL Embedding generation

cd word2vec-mtl
python mtl-trainer.py <path insensitive subseq output path> <path sensitive subseq output path> <vec output file path>

Similarity checking

python compute_nsubseq_emb_similarity.py <folder containing class files> <vec file path>

Similarity checking with SVD post processing

python compute_nsubseq_emb_similarity.py <folder containing class files> <vec file path> <scale> SVD <Desired dimensionality of output data>

Similarity checking with PCA post processing

python compute_nsubseq_emb_similarity.py <folder containing class files> <vec file path> <scale> PCA <Desired dimensionality of output data>

To run against the test files using the pretrained vectors from the commons-lang library:

cd test
javac *.java
cd ..
python compute_nsubseq_emb_similarity.py test test/commons_lang_ngrams.vec

Running (TF-IDF embeddings)

IDF generation:

python compute_idf.py <folder containing class files> <IDF output path>

The folder containing class files is recursively searched for class files and the IDF is computed by aggregating data from all methods in all the class files.
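The recursive search can be sketched as follows (an illustrative sketch using pathlib; the repo's actual traversal code may differ):

```python
import tempfile
from pathlib import Path

def find_class_files(root):
    """Recursively collect every .class file under `root`."""
    return sorted(Path(root).rglob("*.class"))

# Demo on a throwaway directory tree with one nested class file.
with tempfile.TemporaryDirectory() as d:
    nested = Path(d) / "com" / "example"
    nested.mkdir(parents=True)
    (nested / "Foo.class").write_bytes(b"\xca\xfe\xba\xbe")  # class-file magic
    found = find_class_files(d)
    print([p.name for p in found])
```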

Similarity checking

python compute_tf_idf_similarity.py <folder containing class files> <IDF path>

The IDF path must point to a previously computed IDF file. All the class files are read and the pair-wise similarity of all methods is printed.

To run against the test files:

cd test
javac *.java
cd ..
python compute_tf_idf_similarity.py test test/idf_commons_lang.json

Pre-computed

  • The file test/idf_commons_lang.json contains IDF computed from all the class files in the Apache Commons Lang library.
  • The file test/commons_lang_ngrams.vec contains unary- and binary-subsequence embeddings trained from all the class files in the Apache Commons Lang library.

Citing

If you are using or extending this work as part of your research, please cite as:

Poroor, Jayaraj, "Java code embeddings from compiled class files for code similarity tasks", (2021), GitHub repository, https://github.com/jayarajporoor/code_embedding

BibTeX:

@misc{Poroor2021,
   author = {Poroor, Jayaraj},
   title = {Java code embeddings from compiled class files for code similarity tasks},
   year = {2021},
   publisher = {GitHub},
   journal = {GitHub repository},
   howpublished = {\url{https://github.com/jayarajporoor/code_embedding}}
}

Related work

A few deep learning models have been proposed in recent years to generate source code embeddings.
