seart-group / dl4se

Building Training Datasets for Deep Learning Models in Software Engineering and Empirical Software Engineering Research

Home Page: https://seart-dh.si.usi.ch

License: MIT License

Java 68.55% Python 1.49% Shell 0.22% Dockerfile 0.35% HTML 0.62% JavaScript 2.54% Vue 23.01% Sass 3.23%
dataset-generation deep-learning software-engineering jsonl msr liquibase postgresql spring-boot docker-compose mining-software-repositories

dl4se's Introduction

SEART Data Hub

The SEART Data Hub platform makes it easy to create large-scale datasets that can be used either to run empirical MSR studies or to train Deep Learning models that automate software engineering tasks.

Contents

This project contains several modules:

  • dl4se-model: A module containing domain model classes used for mapping the relational database structure to the programming environment;
  • dl4se-analyzer: A module containing implementations of code analysis operations running on tree-sitter;
  • dl4se-transformer: A module containing implementations of code transformation operations running on tree-sitter;
  • dl4se-crawler: A standalone crawler application that we use to mine source code from GitHub repositories indexed by GitHub Search;
  • dl4se-server: A Spring Boot server application that acts as our platform back-end;
  • dl4se-spring: Common Spring Boot configuration and utilities used in both the server and the crawler;
  • dl4se-website: A front-end web-application written in Vue.

Installation and Usage

This section will detail the necessary actions for setting up and running the project locally on your machine.

License

MIT

FAQ

How do you implement language-specific analysis heuristics?

Heuristics used to identify test code in Java and Python can be found here and here. Heuristics used to identify boilerplate code can be found here and here respectively.

How can I request a feature or ask a question?

If you have ideas for a feature you would like to see implemented or if you have any questions, we encourage you to create a new discussion. By initiating a discussion, you can engage with the community and our team, and we'll respond promptly to address your queries or consider your feature requests.

How can I report a bug?

To report any issues or bugs you encounter, please create a new issue. Providing detailed information about the problem you're facing will help us understand and address it more effectively. Rest assured, we are committed to promptly reviewing and responding to the issues you raise, working collaboratively to resolve any bugs and improve the overall user experience.

dl4se's People

Contributors

dabico, davekeehl, gbavota

dl4se's Issues

Statistics tab

Implement the "Statistics" tab with visualizations of the amount of data available in the DB (files/functions/projects). These statistics should be designed from the start to work with multiple languages in the future. Also show the date on which the DB was last updated, and statistics about (i) the number of submitted requests and (ii) the size of the generated datasets.

Issue about Downloading Dataset from DL4SE website

Thank you very much for developing this tool! It is very helpful and beneficial for our research.

However, I ran into a problem when downloading the dataset after it had finished generating. I used Chrome to download the dataset, but the download always fails with the message "the file is not found" once 1 GB of content has been downloaded (the dataset is 5 GB in total).

You can also check this video, which reproduces the problem (https://www.youtube.com/watch?v=ptXrTrVUvXI). I would like to know whether there is any limit on the download file size.

New landing page

In the homepage, provide the choice between "Create a Generic Code Dataset" and "Create a Dataset for Code Completion". Design the landing page considering the addition of new types of datasets in the future.

Inconsistency between the Dataset Generation Interface and the Query Detail in the Dashboard

Hi @dabico,

I found that the information shown in the dataset generation interface is inconsistent with the information shown in the query detail of the dashboard.

For example, I selected "at least 10 contributors" in the dataset generation interface and clicked the button to generate the dataset. However, the query detail in the dashboard shows "min_contributors = 0" and "min_issues = 10" (see the following figures).

[Figures: Screenshot 2023-05-31 at 11 29 39; Screenshot 2023-05-31 at 11 29 55]

This is confusing. Could you kindly take a look at what is causing it? Is the problem in the front end?

Non-ASCII instances are present even when the "Instances with non-ASCII characters" option is flagged on the platform

Please find below the query and some examples I got from the mined dataset.

  • Language Name: Java
  • Granularity: function
  • Min Tokens: 11
  • Min Stars: 50
  • Min Contributors: 20
  • Min Commits: 500
  • Max Tokens: 512
  • Exclude Non Ascii
  • Exclude Forks
  • Exclude Duplicates
private static int specialChar ( char ch ) { if ( ( ch > 'ء' && ch < 'ئ' ) || ( ch == 'ا' ) || ( ch > 'خ' && ch < 'س' ) || ( ch > 'ه' && ch < 'ي' ) || ( ch == 'ة' ) ) { return 1 ; } else if ( ch >= 'ً' && ch <= 'ْ' ) { return 2 ; } else if ( ch >= 0x0653 && ch <= 0x0655 || ch == 0x0670 || ch >= 0xFE70 && ch <= 0xFE7F ) { return 3 ; } else { return 0 ; } }
public static String getDiffString ( int i ) { if ( i == 0 ) return "±0" ; String s = Integer . toString ( i ) ; if ( i > 0 ) return "+" + s ; else return s ; }
@ Override public void handleServer ( ServerSession session , ServerListPingMessage message ) { Server server = ( Server ) VanillaPlugin . getInstance ( ) . getEngine ( ) ; if ( PROTOCOL == null ) { PROTOCOL = VanillaPlugin . getInstance ( ) . getDescription ( ) . getData ( "protocol" ) ; MC_VERSION = VanillaPlugin . getInstance ( ) . getDescription ( ) . getVersion ( ) . trim ( ) . split ( " " ) [ 0 ] ; MOTD = VanillaConfiguration . MOTD . getString ( ) ; } ServerListPingEvent event = VanillaPlugin . getInstance ( ) . getEngine ( ) . getEventManager ( ) . callEvent ( new ServerListPingEvent ( session . getAddress ( ) . getAddress ( ) , MOTD , server . getOnlinePlayers ( ) . length , server . getMaxPlayers ( ) ) ) ; session . send ( new PlayerKickMessage ( '§' + "1" + '
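For reference, the behaviour expected from the "Exclude Non Ascii" option can be sketched as below. This is a minimal illustration only, not the platform's actual filter: the class `AsciiFilterSketch` and its methods are hypothetical names, and the check simply treats any character outside the 7-bit range (0 to 127) as non-ASCII. Under that definition the first two examples above would be dropped (they contain Arabic letters and `±`), and so would the third because of `§`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the dl4se code base.
class AsciiFilterSketch {

    // True when every UTF-16 code unit of the text lies in the 7-bit ASCII range.
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c < 128);
    }

    // Keeps only the instances that consist entirely of ASCII characters.
    static List<String> excludeNonAscii(List<String> instances) {
        return instances.stream()
                .filter(AsciiFilterSketch::isAscii)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> instances = List.of(
                "if ( i == 0 ) return \"\u00B10\" ;", // contains the plus-minus sign
                "if ( i > 0 ) return \"+\" + s ;"      // ASCII only
        );
        // Only the ASCII-only instance survives the filter.
        System.out.println(excludeNonAscii(instances));
    }
}
```

A check of this kind can be run over an exported dataset to confirm whether the exclusion option was applied.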

Functions of `CONSTRUCTOR` boilerplate type are omitted from the export

Until yesterday, I believed that JavaParser could parse ConstructorDeclaration strings using the parseMethodDeclaration method. Yesterday's tests revealed that parseBodyDeclaration should be used instead. To fix this, I just need to update the parser selection logic in TaskToProcessingPipelineConverter:

private CodeProcessingPipeline convert(CodeQuery codeQuery, CodeProcessing codeProcessing) {
    boolean includeAst = codeQuery.getIncludeAst();
    if (codeQuery instanceof FileQuery) {
        return convert(codeProcessing, includeAst, StaticJavaParser::parse);
    } else if (codeQuery instanceof FunctionQuery) {
        return convert(codeProcessing, includeAst, StaticJavaParser::parseMethodDeclaration);
    } else {
        throw new UnsupportedOperationException(
            "Converter not implemented for code granularity: " + codeProcessing.getClass().getName()
        );
    }
}
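The change described above can be sketched as follows. This is a simplified, dependency-free illustration rather than the actual patch: `ParserSelectionSketch`, `ParserChoice` and the empty query classes are hypothetical stand-ins, and the enum constants only name the `StaticJavaParser` method reference that the real converter would pass on. The one substantive difference from the snippet above is that function-level queries select `parseBodyDeclaration`, which accepts both method and constructor declarations.

```java
// Hypothetical, self-contained model of the parser selection; it does not call JavaParser.
class ParserSelectionSketch {

    // Stand-ins for the project's query types (names borrowed from the snippet above).
    interface CodeQuery {}
    static final class FileQuery implements CodeQuery {}
    static final class FunctionQuery implements CodeQuery {}

    // Names the JavaParser entry point that would be handed to the pipeline.
    enum ParserChoice {
        PARSE,                  // StaticJavaParser::parse
        PARSE_BODY_DECLARATION  // StaticJavaParser::parseBodyDeclaration
    }

    static ParserChoice select(CodeQuery codeQuery) {
        if (codeQuery instanceof FileQuery) {
            return ParserChoice.PARSE;
        } else if (codeQuery instanceof FunctionQuery) {
            // Previously parseMethodDeclaration, which rejects constructors.
            return ParserChoice.PARSE_BODY_DECLARATION;
        } else {
            throw new UnsupportedOperationException(
                "Converter not implemented for code granularity: " + codeQuery.getClass().getName()
            );
        }
    }
}
```

In this sketch the error message reports the class of the query, since the query type is what determines the granularity.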
