
varangian's Introduction

Applying Machine Learning to Static Analysis for the Fedora community

Alessandro Morari, IBM Research; Christoph Görn, Red Hat Office of the CTO

IBM Research has developed an Augmented Static Analyzer based on a deep learning model called C-BERT that analyzes C source code. It is currently used to identify vulnerabilities in source code, in this case without the help of a traditional static analyzer. C-BERT is among the best models in this field according to the CodeXGLUE leaderboard, and it could also be used for a variety of other source code tasks such as code completion, code search, clone detection, code translation, and code generation.

In cooperation with the AICoE, IBM Research wants to apply this to one of the projects or communities significant to Red Hat. The goal is to improve the code quality and developer workflows of the chosen project/community.

As C-BERT is well established for the C programming language, we are looking at the CentOS and Fedora communities first. All their source code is available via https://vault.centos.org/ and https://src.fedoraproject.org/. Because Fedora can be understood as the upstream community project of Red Hat Enterprise Linux and CoreOS, and its source code seems to be more accessible to automation, we will focus on the Fedora community.

Objective

Enhance the developer workflow by providing a machine-learning-backed application on GitHub. The application will automatically guide developers to the most relevant static analysis issues, so that they avoid spending time on false positives.

Key Results

Analyze the set of source code repositories and identify which repositories could benefit the most from the C-BERT application by ...

Deploy a model release pipeline on Operate First by ...

Create a prototype web service to categorize … using C-BERT, and deploy a CD pipeline for this app on Operate First by …

Create a GitHub app and Cronjob to use the web service to … by …

Project Planning

TBD

Timeline

TBD

varangian's People

Contributors

goern, harshad16, khebhut[bot], kpostoffice, saurabhpujar


varangian's Issues

Failed to update dependencies to their latest version

Automatic dependency update failed for the current master with SHA a922988.

The automatic dependency management cannot continue. Please fix the errors reported below.

Command
  $ pipenv update --dev
Standard output

Standard error
Warning: Python 3.9 was not found on your system...
Neither 'pyenv' nor 'asdf' could be found to install Python.
You can specify specific versions of Python with:
$ pipenv --python path/to/python

Environment details

Kebechet version: 1.5.2
Python version: 3.8.6
Platform: Linux-4.18.0-305.19.1.el8_4.x86_64-x86_64-with-glibc2.2.5
pipenv version: pipenv, version 2020.11.15


Dependency graph
Unable to obtain dependency graph:

Warning: Python 3.9 was not found on your system...
Neither 'pyenv' nor 'asdf' could be found to install Python.
You can specify specific versions of Python with:
$ pipenv --python path/to/python

Notes

For more information, see Pipfile and Pipfile.lock.

Once this issue is resolved, the issue will be automatically closed by bot.

/label thoth/potential-flake
/kind bug
/priority critical-urgent
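
One plausible reading of the report above is that the Pipfile requires Python 3.9 while the environment running the update only provides Python 3.8.6. A minimal sketch of a pre-flight check under that assumption follows. The helper names `required_python` and `interpreter_matches` and the sample Pipfile are hypothetical and are not part of Kebechet or pipenv.

```python
import re
import sys


def required_python(pipfile_text):
    """Return the python_version declared in a Pipfile's [requires] section, or None."""
    in_requires = False
    for raw in pipfile_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Track whether we are inside the [requires] table.
            in_requires = line == "[requires]"
            continue
        if in_requires:
            match = re.match(r'python_version\s*=\s*"([^"]+)"', line)
            if match:
                return match.group(1)
    return None


def interpreter_matches(required, version_info=sys.version_info):
    """Check whether an interpreter version satisfies a 'major.minor' requirement."""
    parts = tuple(int(p) for p in required.split("."))
    return tuple(version_info[: len(parts)]) == parts


# Hypothetical Pipfile content, used only for illustration.
SAMPLE_PIPFILE = """
[packages]
requests = "*"

[requires]
python_version = "3.9"
"""

if __name__ == "__main__":
    wanted = required_python(SAMPLE_PIPFILE)
    if wanted and not interpreter_matches(wanted):
        print(f"Pipfile requires Python {wanted}, which differs from the running interpreter")
```

If the check fails, either the required version has to be installed in the environment, or the Pipfile requirement has to be aligned with the interpreter that is available.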

user feedback loop

Is your feature request related to a problem? Please describe.
As a Data Scientist, I want to have a channel for user feedback, so that we can fine-tune the model.

Tasks List for Libtiff pipeline

Milestone 2: Implement the AugSA inference pipeline on Libtiff (https://gitlab.com/libtiff/libtiff)

Tasks List for first run of monolith prototype

Tasks for creation of the monolith prototype and its first run:

  • Determine target projects
  • Training data split
    • Temporal slicing of commits into training, dev splits [Burn/Saurabh]
  • Manual Verification of model output
  • ML
    • Feature Engineering:
      • Incorporate Yunchung's features in Feature Extractor [Burn]
      • Check if new features improve performance [Saurabh]
      • Add new features to feature extractor [Saurabh]
    • Separate training and inference [Burn]
    • Separate Voting from training [Burn]
  • C-BERT:
    • Create new train file for prototype [Luca]
    • Create tokenization script [Saurabh]
    • Inference mode tokenization [Luca/Saurabh]
    • Write script for artifacts package creation [Luca]
    • Run experiments to pick the right C-BERT model for prototype [Luca/Saurabh]
  • Engineering
    • Prepare system diagram [Saurabh]
    • Formalize output across components [Saurabh/Burn/Luca]
    • Create Infer output code extractor [Burn]
    • Write Python code for first run
    • Integration run
    • Model selector after training [Saurabh]
    • Inference output selector [Saurabh]
    • Get metrics at bug level
  • #13
    • Provide libtiff data to Kevin [Saurabh/Burn]
    • Bot Development @KPostOffice
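
The "temporal slicing of commits" task in the list above can be illustrated with a short sketch. It assumes each commit is a record with a `date` field and that the oldest commits go to training while the newest go to the dev split, so that the model is never evaluated on code older than what it was trained on. The function name, the 80/20 ratio, and the sample commits are hypothetical and do not reflect the project's actual data format.

```python
from datetime import datetime


def temporal_split(commits, train_fraction=0.8):
    """Split commits chronologically: oldest go to train, newest go to dev."""
    ordered = sorted(commits, key=lambda c: c["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical commit records, used only for illustration.
commits = [
    {"sha": "c3", "date": datetime(2021, 3, 1)},
    {"sha": "c1", "date": datetime(2021, 1, 1)},
    {"sha": "c5", "date": datetime(2021, 5, 1)},
    {"sha": "c2", "date": datetime(2021, 2, 1)},
    {"sha": "c4", "date": datetime(2021, 4, 1)},
]

train, dev = temporal_split(commits)

if __name__ == "__main__":
    print("train:", [c["sha"] for c in train])
    print("dev:  ", [c["sha"] for c in dev])
```

Splitting by time rather than at random avoids leaking information from later commits into the training set.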

Determine a candidate list of projects for the initial run

There was some discussion on this during the last meeting.

It was mentioned that we need a candidate list of projects which are:

  1. Active and "healthy"
  2. High number of users

Once we get the candidate list, we can pick the one most suited for differential analysis and begin.

Projects we worked with so far: Libtiff, Nginx, HTTPD, OpenSSL, Libav, FFMpeg

Failed to update dependencies to their latest version

Automatic dependency update failed for the current master with SHA 0b65211.

The automatic dependency management cannot continue. Please fix the errors reported below.

Command
  $ pipenv update --dev
Standard output

Standard error
Warning: Python 3.9 was not found on your system...
Neither 'pyenv' nor 'asdf' could be found to install Python.
You can specify specific versions of Python with:
$ pipenv --python path/to/python

Environment details

Kebechet version: 1.4.0
Python version: 3.8.6
Platform: Linux-4.18.0-193.56.1.el8_2.x86_64-x86_64-with-glibc2.2.5
pipenv version: pipenv, version 2020.11.15


Dependency graph
Unable to obtain dependency graph:

Warning: Python 3.9 was not found on your system...
Neither 'pyenv' nor 'asdf' could be found to install Python.
You can specify specific versions of Python with:
$ pipenv --python path/to/python

Notes

For more information, see Pipfile and Pipfile.lock.

Once this issue is resolved, the issue will be automatically closed by bot.

/label thoth/potential-flake
/kind bug
/priority critical-urgent

OKR and Tasks for Inference Scale Out (Milestone 3)

Milestone 3
Period: October to December 2021

OKRs

  • Objective: Scale Varangian to new OS projects 
    • Key Result: We have 6 repos by Jan 2022
    • Tasks:
      • Generate D2A V2 training data for 6 repos
      • Train models for the 6 repos
      • Automate D2A Retraining
      • Containerize and auto-start inference pipeline
      • Automate the training pipeline
      • Containerize and auto-start training pipeline. 
  • Objective: Scale models to repos with less data
    • Key Result: Improve Model Performance on cross project datasets
    • Tasks:
      • #30 Generate Ensemble baselines
      • #30 Generate C-BERT baselines
      • Improve ensemble performance:
        • #31 Feature Engineering
        • Separability, Segmentation
      • Improve C-BERT performance: More data, better model
  • Objective: Add features for better user engagement 
    • Key Result: Bot will have 5 new features
    • Tasks:
      • Tag/Assign users to issue
      • Access to full ranked list, traces
      • Issue management
        • Close resolved open issues
        • Issues marked FP are not opened again
        • Issues marked FP are reviewed
        • Duplicate issues are not opened
        • Resolved Issue Feedback to backend 
      • Aggregate issues 
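
The issue management tasks above (no duplicate issues, no reopening of issues marked FP) could be approached with a fingerprint per finding. The following is a minimal sketch under stated assumptions: a finding is identified by its bug type, file, and description; line numbers are left out because they shift between commits; and a finding whose earlier issue was closed as resolved may be opened again. All names and state labels are hypothetical and are not taken from the bot's code.

```python
import hashlib


def fingerprint(bug_type, file_path, description):
    """Stable key for a finding, independent of line numbers."""
    raw = "|".join((bug_type, file_path, description))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def should_open_issue(finding, known):
    """Decide whether to open an issue.

    `known` maps fingerprint -> state, one of 'open', 'closed', 'false_positive'.
    """
    key = fingerprint(finding["bug_type"], finding["file"], finding["description"])
    state = known.get(key)
    # Skip duplicates of open issues and anything a user marked as a false positive.
    return state not in ("open", "false_positive")


# Hypothetical finding, modeled on the sample issue in this document.
finding = {
    "bug_type": "UNINITIALIZED_VALUE",
    "file": "test/dtls_mtu_test.c",
    "description": "The value read from mtus[_] was never initialized.",
}
key = fingerprint(finding["bug_type"], finding["file"], finding["description"])
```

With an empty `known` mapping the finding is reported. Once its fingerprint is stored as `open` or `false_positive`, later runs skip it.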

MVP: prototype run of training, prediction and issue creating on new code

Is your feature request related to a problem? Please describe.
As a Product Owner, I want to see a manual demo of the monolith, so that new issues are created based on new code commits of one specific repository.

High-level Goals

  • show interaction between model and bot
  • new issues on new code
  • automated issue creation

Describe the solution you'd like
TBA

Describe alternatives you've considered
TBA

Additional context
TBA

Acceptance Criteria
TBA

Determine the contents of a sample issue

First draft of contents and a sample bug trace is given below.

We will probably need multiple iterations and feedback from users before finalizing the content.

Issue Contents:

Subject:

Infer bug type: UNINITIALIZED_VALUE

Location: /openssl/src/test dtls_mtu_test.c:100

Description: The value read from mtus[_] was never initialized.

The buggy line: 100. > for (s = mtus[0]; s <= mtus[29]; s++) {

Confidence:

Priority Rank:

Link to the full trace.


Sample Bug Trace:

/openssl/src/test dtls_mtu_test.c: 100 : error: UNINITIALIZED_VALUE

The value read from mtus[_] was never initialized.
Showing all 1 steps of the trace

/gpfs/automountdir/r92gpfs02/zhengyu/work/ai4code/benchmarks/openssl/src/test/dtls_mtu_test.c:100:10:
98. * that size and see what actual record size we end up with.
99. */
100. > for (s = mtus[0]; s <= mtus[29]; s++) {
101. size_t reclen;
102.
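
To fill the issue fields listed above from a trace, the first line of the sample bug trace can be split into location, line number, and bug type. The sketch below assumes the header always has the shape shown in the sample (`<path>: <line> : error: <BUG_TYPE>`). The regular expression and the function name are hypothetical and have not been checked against other Infer output.

```python
import re

# Matches "<path>: <line> : error: <BUG_TYPE>", tolerating spaces around colons.
HEADER = re.compile(
    r"^(?P<path>.+?)\s*:\s*(?P<line>\d+)\s*:\s*error:\s*(?P<bug_type>[A-Z_]+)\s*$"
)


def parse_header(line):
    """Extract path, line number, and bug type from a trace header, or return None."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "line": int(match.group("line")),
        "bug_type": match.group("bug_type"),
    }


# Header line copied from the sample bug trace above.
sample = "/openssl/src/test dtls_mtu_test.c: 100 : error: UNINITIALIZED_VALUE"
parsed = parse_header(sample)

if __name__ == "__main__":
    print(parsed)
```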


Updates to the bot

Issues created by the bot need the following updates:

  1. Title
    The title needs to be updated to: [Priority]-[Issue Type]-[Bug location]
    where Priority is the rank of the issue in the ML output.

  2. Include the priority in the description.

  3. Bug trace:
    Show only the first few lines, and hide the full bug trace in the markup.

  4. The bug location should point to the actual code so that the user can click through and examine it.

  5. If possible, create issues on actual clone of the Libtiff project so that the links work.

Edit: Edited the title to avoid markup error.
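
The title and bug trace updates requested above can be sketched as two small formatting helpers. The sketch assumes that the square brackets in the title pattern are placeholders rather than literal characters, and that the full trace is hidden with an HTML `<details>` element, which GitHub renders as a collapsible section. The function names, the preview length, and the sample values are hypothetical.

```python
def issue_title(priority, bug_type, location):
    """Build a title of the form Priority-IssueType-BugLocation."""
    return f"{priority}-{bug_type}-{location}"


def issue_body(priority, description, trace_lines, preview=3):
    """Show the first few trace lines and fold the full trace into <details>."""
    head = "\n".join(trace_lines[:preview])
    full = "\n".join(trace_lines)
    return (
        f"Priority rank: {priority}\n\n"
        f"{description}\n\n"
        f"{head}\n\n"
        "<details>\n<summary>Full bug trace</summary>\n\n"
        f"{full}\n\n"
        "</details>\n"
    )


# Hypothetical values, modeled on the sample issue in this document.
title = issue_title(1, "UNINITIALIZED_VALUE", "test/dtls_mtu_test.c:100")
body = issue_body(
    1,
    "The value read from mtus[_] was never initialized.",
    ["step 1", "step 2", "step 3", "step 4"],
)

if __name__ == "__main__":
    print(title)
    print(body)
```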
