CodeDistillery

What?

CodeDistillery is a framework aimed at facilitating the mining of source code changes from version control systems.

What makes CodeDistillery a framework rather than a tool is its support for pluggable source code mining mechanisms, combined with the underlying infrastructure to apply these mechanisms efficiently to the entire revision histories of many software repositories.

Why?

Mining a single repository with 100 or so commits can be done without considering scale. That is no longer the case when the task involves dozens of repositories or more, each with thousands of revisions or more. Such workloads are quite standard in current empirical software engineering studies, so we thought the tool we had built for this task could make life easier for others doing similar work.

How?

We use JGit as the data access layer to turn the mining workload into a highly parallel one, and Apache Spark as the distributed compute engine to distribute and process it.

CodeDistillery would not have been possible (or at the very least, would have been much harder to build) without awesome people building awesome open source software, in particular the open source projects we built upon extensively: Apache Spark, JGit and ChangeDistiller.

Getting Started

git clone https://github.com/staslev/CodeDistillery  
cd CodeDistillery  
mvn clean install  

The project was developed and tested using JDK 8. If you have multiple JDKs installed, make sure Maven uses JDK 8 when executing mvn clean install. This can be done using the following command:

export JAVA_HOME=/my_jdk1.8/Contents/Home && mvn clean install

Setting up Maven dependencies

<dependencies>  
  <dependency> 
    <groupId>com.staslev.codedistillery</groupId>   
    <artifactId>distillery-core</artifactId>
    <version>0.5-SNAPSHOT</version>
  </dependency>
  <dependency>  
    <groupId>com.staslev.codedistillery</groupId>
    <artifactId>change-distiller-uzh</artifactId>
    <version>0.5-SNAPSHOT</version>
  </dependency>
</dependencies>  

Usage

CodeDistillery comes with out-of-the-box support for mining fine-grained Java source code changes from Git repositories, which we use here to demonstrate it:

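import java.nio.file.Paths

// Imports of CodeDistillery's own classes are omitted.
object Main {

  def main(args: Array[String]): Unit = {

    // Wire up the Git access layer, the change distiller and the CSV encoder.
    val codeDistillery =
      new CodeDistillery(
        vcsFactory = GitRepo.apply,
        distillerFactory = UzhSourceCodeChangeDistiller.apply,
        encoderFactory = () => UzhSourceCodeChangeCSVEncoder)
        with CrossRepoRevisionParallelism

    val repoPath = Paths.get("/path/to/my/repo")
    val output = Paths.get("/path/to/write/output")
    val branch = "master"

    // Bring LocalSparkParallelism's spark into scope for the distill call.
    import LocalSparkParallelism.spark

    // Mine the given (repository, branch) pairs and write the result to output.
    codeDistillery.distill(Set((repoPath, branch)), output)
  }
}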
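Since distill is called above with a set of (path, branch) pairs, several repositories can presumably be mined in a single run. The following is a minimal sketch of building such a set; RepoInputs and toDistillInput are hypothetical helper names, not part of CodeDistillery, and the paths and branch names are placeholders.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper, not part of CodeDistillery.
object RepoInputs {
  // Pairs each repository path with the branch to mine,
  // matching the shape of the set passed to distill in the example above.
  def toDistillInput(repos: Seq[(String, String)]): Set[(Path, String)] =
    repos.map { case (path, branch) => (Paths.get(path), branch) }.toSet
}
```

The resulting set would then take the place of Set((repoPath, branch)) in the call to codeDistillery.distill.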

Output

The output is a CSV file with a # delimiter, consisting of the following fields (in respective order):

  1. Project name
  2. Commit hash
  3. Author name
  4. Author email
  5. Fine-grained change type
  6. Unique name of changed entity
  7. Significance level
  8. Parent entity type
  9. Unique name of parent entity
  10. Root entity type
  11. Unique name of root entity
  12. Commit message
  13. Filename
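To work with this output programmatically, each line can be split on the # delimiter and mapped onto the 13 fields listed above. The following is a minimal sketch, not CodeDistillery's own reader: RawChange and RawChangeParser are hypothetical names, and the sketch assumes fields are not quoted or escaped, so any line that does not split into exactly 13 fields (for example, a commit message containing a #) is rejected rather than guessed at.

```scala
// Hypothetical record type mirroring the 13 documented fields; not part of CodeDistillery.
final case class RawChange(
    project: String,
    commitHash: String,
    authorName: String,
    authorEmail: String,
    changeType: String,
    changedEntity: String,
    significance: String,
    parentEntityType: String,
    parentEntity: String,
    rootEntityType: String,
    rootEntity: String,
    commitMessage: String,
    filename: String)

object RawChangeParser {
  // Splits on '#', keeping trailing empty fields (limit -1).
  // Returns None unless the line has exactly 13 fields.
  def parse(line: String): Option[RawChange] =
    line.split("#", -1) match {
      case Array(p, h, an, ae, ct, ce, s, pt, pe, rt, re, m, f) =>
        Some(RawChange(p, h, an, ae, ct, ce, s, pt, pe, rt, re, m, f))
      case _ => None
    }
}
```

Returning an Option keeps malformed lines visible to the caller, who can count or log them instead of silently shifting columns.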

Obtaining commit-level datasets

The output from the previous stage is a dataset of raw fine-grained source code changes as distilled from a software repository. It is often useful to aggregate this raw dataset into commit-level statistics. A commit-level dataset can be obtained as follows:

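import java.nio.file.Paths

// Raw datasets from the previous stage, and the destination of the aggregated dataset.
val input1 :: input2 :: output :: Nil =
  List("/path/to/input1", "/path/to/input2", "/path/to/output")
    .map(Paths.get(_))

import LocalSparkParallelism.spark

PerCommit.aggregate(Set(input1, input2), output)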

The output is a CSV file with a # delimiter, consisting of the following fields (in respective order):

  1. Project name
  2. Commit hash
  3. Author name
  4. Author email
  5. Date
  6. Non-test versatility
  7. Commit message
  8. Test cases added
  9. Test cases removed
  10. Test cases changed
  11. Test suites added
  12. Test suites removed
  13. Test suites affected
  14. Has issue ref
  15. Non-test files in commit
  16. Total files in commit
  17. Commit message length

These fields are followed by the fine-grained source code change type frequencies, listed in lexicographic order of the change type names.

The complete list of columns (a.k.a. header line) can be obtained using: PerCommit.headerLine.
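Because the number of change type columns depends on the set of change types, it can be easier to look columns up by name than by position. The following is a minimal sketch that pairs a header line with a data row; CommitRows and toMap are hypothetical names, and the sketch assumes that PerCommit.headerLine yields a single #-delimited string in the same column order as the data rows, which is not confirmed here.

```scala
// Hypothetical helper, not part of CodeDistillery.
object CommitRows {
  // Splits a '#'-delimited line, keeping trailing empty fields.
  private def fields(line: String): Array[String] = line.split("#", -1)

  // Pairs each column name from the header with the value at the same position.
  // Returns None when the row and the header have different field counts.
  def toMap(headerLine: String, row: String): Option[Map[String, String]] = {
    val names = fields(headerLine)
    val values = fields(row)
    if (names.length == values.length) Some(names.zip(values).toMap) else None
  }
}
```

Under that assumption, the header would be passed in as the first argument, and a row whose field count differs from the header's (for instance, because a commit message contains a #) yields None.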
