BugHunter is a Python project for bug analysis and prediction. Built on PySpark, it provides a pipeline that connects to Spark, reads, explores, cleans, transforms, and splits the data, and then trains, evaluates, and visualizes a random forest model for bug prediction in software projects.

BugHunter

Introduction

The analysis in the Python script BugHunter.py is based on a dataset produced by Rudolf Ferenc, Péter Gyimesi, Gábor Gyimesi, Zoltán Tóth, and Tibor Gyimóthy of the Department of Software Engineering and the MTA-SZTE Research Group on Artificial Intelligence at the University of Szeged, Hungary. In their article 'An automatically created novel bug dataset and its validation in bug prediction' [6], published in November 2020 in the 'Journal of Systems and Software', the authors built a dataset for predicting the presence or absence of bugs in source code. Specifically, they selected 15 Java projects on GitHub and analyzed the presence of bugs in their source code at three levels:

  1. Class
  2. Method
  3. File

For each level, datasets are available both per project and aggregated across all projects. The authors further divided the datasets by the method used to reduce noise in the learning set (No Filter, Removal, Subtract, Single, GDF); this noise results from the metrics changing over time as the bugs identified in the source code are fixed. Additionally, the authors produced a projected dataset that includes both method-level and class-level metrics, along with their relationships. They achieved better results on the projected dataset, where the RandomForest algorithm performed best, at about 74% Precision/Recall/F-measure.
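To make the Precision/Recall/F-measure figures concrete, here is a minimal Python sketch of how the three metrics are derived from confusion-matrix counts. The function name and the counts are hypothetical and are not taken from the paper or from BugHunter.py. The paper's reported figures may be averaged across classes, which this sketch does not attempt.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F-measure) from confusion-matrix counts.

    tp: buggy elements correctly flagged as buggy
    fp: clean elements wrongly flagged as buggy
    fn: buggy elements the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, f_measure = precision_recall_f1(tp=30, fp=10, fn=10)
print(f"precision={precision:.2f} recall={recall:.2f} F-measure={f_measure:.2f}")
```

When precision and recall are equal, the F-measure takes the same value, which is consistent with a single figure being quoted for all three metrics.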

For this reason, the projected dataset 'methodp.csv', filtered with the 'Removal' method and covering all projects, was chosen for the analysis: it is the dataset that performed best in the validation tests conducted by the authors. The RandomForest algorithm is run on the same dataset in an attempt to replicate the authors' results.

Selected Projects (Project column):

  1. Android Universal Image Loader: An Android library that assists with image loading.
  2. ANTLR v4: Popular software in the field of language processing. It is a powerful parser generator for reading, processing, executing, or translating structured texts or binary files.
  3. Broadleaf Commerce: A framework for building e-commerce websites.
  4. Elasticsearch: A very popular RESTful search engine.
  5. Eclipse plugin for Ceylon: An Eclipse plugin that provides an Integrated Development Environment (IDE) for the Ceylon programming language.
  6. Hazelcast: A platform for distributed data processing.
  7. jUnit: A Java framework for writing unit tests.
  8. MapDB: A versatile, fast, and easy-to-use database engine in Java.
  9. mcMMO: A Minecraft-based role-playing game (RPG).
  10. Mission Control Technologies: Initially developed by NASA for space flight operations. It is a real-time monitoring and visualization platform that can also be used for monitoring other data.
  11. Neo4j: The world's leading graph database with high performance.
  12. Netty: An asynchronous, event-driven network framework.
  13. OrientDB: A popular NoSQL, document-based graph database, known mainly for its speed and scalability.
  14. Oryx 2: Open-source software with machine learning algorithms for processing huge datasets.
  15. Titan: A highly scalable, high-performance graph database.

Description

BugHunter is a Python class for analyzing and modeling bug data with PySpark. It provides methods for connecting to Spark, for reading, exploring, cleaning, transforming, scaling, and splitting the data, and for training, evaluating, and visualizing a random forest model for bug prediction. Together, these methods cover the workflow from the raw CSV file to an evaluated model.

Key Features

  • Establishes a connection to Spark.
  • Reads bug data from a CSV file and returns a Spark DataFrame.
  • Explores the loaded bug data, printing its schema, summary statistics, and a sample.
  • Cleans the loaded bug data by removing duplicate rows and columns whose values are all zero.
  • Transforms the cleaned bug data by processing text columns and assembling features.
  • Scales the assembled bug data using MinMaxScaler.
  • Splits the scaled bug data into training and testing sets.
  • Trains a random forest model using the training data and returns the best model.
  • Evaluates the best model's performance on the test data.
  • Plots the results of the best model, including the confusion matrix, ROC curve, and most important features.
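The cleaning and scaling steps listed above can be sketched in plain Python. This is only an illustration of the ideas, not the PySpark code in BugHunter.py: the function names, column names, and sample rows are hypothetical. The scaling follows the usual min-max formula, and constant columns are mapped to 0.5, which is how Spark's MinMaxScaler is documented to behave with its default [0, 1] range.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique


def drop_all_zero_columns(rows, columns):
    """Remove every column whose value is zero in all rows."""
    keep = [i for i in range(len(columns)) if any(row[i] != 0 for row in rows)]
    kept_columns = [columns[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_rows, kept_columns


def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns map to 0.5."""
    column_values = list(zip(*rows))
    lows = [min(values) for values in column_values]
    highs = [max(values) for values in column_values]
    return [
        [(v - lo) / (hi - lo) if hi != lo else 0.5
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical metric columns and values, for illustration only.
columns = ["LOC", "McCC", "Unused"]
rows = [[10, 1, 0], [20, 3, 0], [10, 1, 0], [30, 5, 0]]

rows = drop_duplicate_rows(rows)
rows, columns = drop_all_zero_columns(rows, columns)
scaled = min_max_scale(rows)
print(columns)
print(scaled)
```

Scaling is done after cleaning so that duplicate rows and uninformative columns do not reach the model. In a real pipeline the minimum and maximum would normally be computed on the training split only and then reused on the test split.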

Dependencies

  • PySpark: Python library for Apache Spark
  • pandas: Python library for data manipulation and analysis
  • matplotlib: Python library for data visualization
  • scikit-learn: Python library for machine learning

Usage

  1. Clone the repository to your local machine.
  2. Ensure you have installed the necessary libraries: PySpark, pandas, matplotlib, and scikit-learn.
  3. Run the provided Python script for bug analysis and modeling: BugHunter.py.
  4. Input the path to the bug data CSV file.
  5. Follow the instructions in the script to explore, clean, transform, scale, split, train, evaluate, and visualize the bug data.

Contributions

Contributions to this project are welcome! Feel free to submit pull requests for bug fixes, enhancements, or additional features. Please adhere to the project's coding style and guidelines.

License

This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Disclaimer

While efforts have been made to ensure accurate bug analysis and modeling, no guarantees are provided regarding the accuracy or reliability of the system in all scenarios. Usage of this project is at the user's discretion.

References

Acknowledgments

The authors of the scientific paper used as a reference:

  • Rudolf Ferenc
  • Péter Gyimesi
  • Gábor Gyimesi
  • Zoltán Tóth
  • Tibor Gyimóthy

Author

  • Christian Putzu
