dargones / import_prediction


Code that allows predicting imports in Java code with GGNNs



Java Local Import Prediction with Gated Graph Neural Networks

Introduction

This repository contains the code for my Senior Thesis, in which I explore how Gated Graph Neural Networks (GGNNs) can be used to predict class-level imports in Java code. More specifically, I attempt to predict which classes (or, more properly, compilation units) defined in a project currently under development might be imported into a class newly defined in the same project, given the new class's name and, possibly, the initial import statements in that class.

Because the goal is to predict imports within a possibly unfinished project, one cannot gather enough import co-occurrence statistics to predict class-level imports effectively. Instead, one must rely exclusively on information that is locally available. To approach this problem, I propose to model relationships between compilation units in a project with a graph. A node in such a graph corresponds to a particular compilation unit. An edge between two nodes marks some relationship between the two corresponding compilation units.

Below is an example of such a graph built for one of the repositories in my dataset. Grey undirected edges connect compilation units defined within the same package. Black directed edges correspond to import statements. Blue edges mark class inheritance or interface implementation:

Graph Example
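
The three edge types described above can be sketched in a few lines of plain Python. This is only an illustration of the graph structure, not the code from ConvertingToGraphs.ipynb: the function name `build_project_graph`, the edge-type labels, and the input keys "imports" and "supertypes" are all hypothetical, and the choice to drop targets that are not defined in the project is an assumption.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical labels for the three relationships described in the text.
SAME_PACKAGE, IMPORTS, EXTENDS = "same_package", "imports", "extends_or_implements"


def build_project_graph(units):
    """Build typed edge sets from a dict of compilation-unit descriptions.

    `units` maps a fully qualified name to a dict with the (assumed) keys
    "imports" and "supertypes", each a list of fully qualified names.
    """
    edges = {SAME_PACKAGE: set(), IMPORTS: set(), EXTENDS: set()}

    # Undirected edges between units that share a package,
    # stored once per pair in sorted order.
    by_package = defaultdict(list)
    for name in units:
        package = name.rpartition(".")[0]
        by_package[package].append(name)
    for members in by_package.values():
        for a, b in combinations(sorted(members), 2):
            edges[SAME_PACKAGE].add((a, b))

    # Directed edges; targets defined outside the project are skipped.
    for name, info in units.items():
        for target in info.get("imports", []):
            if target in units:
                edges[IMPORTS].add((name, target))
        for target in info.get("supertypes", []):
            if target in units:
                edges[EXTENDS].add((name, target))
    return edges


units = {
    "app.model.User": {"imports": ["java.util.List"], "supertypes": ["app.model.Entity"]},
    "app.model.Entity": {},
    "app.service.UserService": {"imports": ["app.model.User"]},
}
graph = build_project_graph(units)
```

In this toy project, `java.util.List` produces no edge because it is not a compilation unit of the project itself, which matches the focus on local imports.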

The code in this repository allows building such graphs for arbitrary GitHub projects and running GGNNs on these graphs to learn to predict future imports, in a way similar to how Allamanis et al. solve the variable misuse task (they work on the level of a small piece of code and model relationships between variables, while I model relationships between files/compilation units in the context of a repository). An initial comparison with the baseline results I was able to achieve by other means (see notebooks/Baselines.ipynb) suggests that GGNNs are well suited for this task. For more details on the way I use GGNNs, see README.md in the python directory.
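
For readers unfamiliar with GGNNs, the following is a minimal sketch of one propagation step in the standard formulation: each node sums messages from its neighbours, using a separate weight matrix per edge type, and then updates its state with a GRU-style gate. It is written in plain Python for readability and is not the PyTorch implementation in the python directory. All names (`ggnn_step`, the parameter keys, the toy edge types) are hypothetical, bias terms are omitted, and the weights are random rather than trained.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def random_matrix(rng, dim):
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]


def ggnn_step(hidden, edges, params):
    """One propagation step: typed message passing, then a GRU-style update.

    hidden: {node: state vector}
    edges:  {edge_type: [(source, target), ...]}
    params: {"msg": {edge_type: matrix}, "Wz", "Uz", "Wr", "Ur", "Wh", "Uh": matrix}
    """
    dim = len(next(iter(hidden.values())))
    # Each node sums messages from its in-neighbours, one matrix per edge type.
    incoming = {node: [0.0] * dim for node in hidden}
    for edge_type, pairs in edges.items():
        for source, target in pairs:
            message = matvec(params["msg"][edge_type], hidden[source])
            incoming[target] = add(incoming[target], message)

    updated = {}
    for node, h in hidden.items():
        a = incoming[node]
        z = sigmoid(add(matvec(params["Wz"], a), matvec(params["Uz"], h)))  # update gate
        r = sigmoid(add(matvec(params["Wr"], a), matvec(params["Ur"], h)))  # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        pre = add(matvec(params["Wh"], a), matvec(params["Uh"], gated))
        candidate = [math.tanh(x) for x in pre]
        updated[node] = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]
    return updated


rng = random.Random(0)
dim = 4
# An undirected edge is represented here as two directed edges.
edges = {"imports": [("A", "B")], "same_package": [("B", "C"), ("C", "B")]}
params = {"msg": {edge_type: random_matrix(rng, dim) for edge_type in edges}}
for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh"):
    params[name] = random_matrix(rng, dim)
hidden = {node: [rng.uniform(-1, 1) for _ in range(dim)] for node in "ABC"}
new_hidden = ggnn_step(hidden, edges, params)
```

Repeating the step several times lets information travel along longer paths in the project graph, so a node's final state can reflect compilation units that are several edges away.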

Getting the Data

The data for this project can be obtained by running a set of SQL queries on a publicly available BigQuery dataset. The README.md file in the SQL directory contains detailed information about replicating this step and the criteria used to select the data.

Data Preprocessing and Baselines

The raw data can be downloaded from BigQuery via GCS as a set of .json files, each of which contains millions of lines of Java code. This code can be parsed by running Parser.java (located in java/src), passing the name of one .json file as the first argument and the name of the output file as the second argument. You will need JavaParser to run this code.
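
To give a rough idea of what this parsing step extracts, here is a small Python sketch that reads newline-delimited JSON records (the format BigQuery uses for JSON exports) and pulls import statements out of each Java source with a regular expression. It is a crude stand-in for Parser.java, not a port of it: the record keys "path" and "content" are assumptions about the export schema, and a regular expression, unlike JavaParser, can be fooled by imports that appear inside comments or string literals.

```python
import io
import json
import re

# Matches "import a.b.C;", "import a.b.*;" and "import static a.b.C.member;".
IMPORT_RE = re.compile(r"^\s*import\s+(static\s+)?([\w.]+(?:\.\*)?)\s*;", re.MULTILINE)


def extract_imports(source):
    """Return the imported names found in one Java source string, in order."""
    return [match.group(2) for match in IMPORT_RE.finditer(source)]


def read_records(lines, content_key="content", path_key="path"):
    """Yield (path, imports) for each JSON line; the key names are assumptions."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record.get(path_key), extract_imports(record.get(content_key) or "")


# A one-record stand-in for an exported .json file.
sample = io.StringIO(
    json.dumps({
        "path": "src/app/Main.java",
        "content": "package app;\nimport java.util.List;\nimport app.model.*;\nclass Main {}",
    })
    + "\n"
)
records = list(read_records(sample))
```

To process a real export, `open(filename, encoding="utf-8")` can be passed in place of the in-memory `sample`; reading line by line keeps memory use flat even for very large files.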

Next, the data goes through a series of preprocessing steps. To replicate them, run the Filtering.ipynb and ConvertingToGraphs.ipynb notebooks. You might need the following Python libraries to run this code: pytorch, numpy, tqdm, joblib, networkx, sklearn, matplotlib, pandas.

More information on data preprocessing and baselines used for import prediction can be found in the README.md file in the notebooks directory.

Running the GGNNs

A detailed description of the PyTorch implementation of GGNNs used for this project can be found in the README.md file in the python directory. The same directory contains the Python modules that define the network structure, the loss function, and the way the data loader works. wrapper.py in the python directory contains an example of how the network can be run.
