
This project forked from softmaxer/cve-database-ingestion


A take-home exercise for ingesting data from the NVD feeds and the OSV vulnerability database.


Take home exercise - konvu

In this take-home exercise, I was asked to write a program that can create and update a database of vulnerabilities and their corresponding affected Java packages (so, primarily, Maven repositories). The CVE feeds are updated here, and the same vulnerabilities may also be indexed in the OSV database.

Methodology

I will briefly explain my thought process for this project.

  • First, I had to get the NVD feeds. I was explicitly asked to analyze the feeds for the years 2023 and 2024, so I downloaded the archives from the link given above. This was not my first approach, however: I saw that the feeds were going to be migrated to an API, so I initially tried to implement a search-after/pagination request to gather all the feeds from a start date to an end date. I soon realized this would take a lot of time, and I was told the test shouldn't take more than 3 hours, so I went ahead and just downloaded the feeds.

  • The feeds from NVD cover not only packages but also other applications, which do not concern us. The only fields common to the NVD and OSV databases are the IDs (which can be a GHSA id, an OSV id, or even a CVE if it exists among the aliases) or the commit SHA of the affected package; a record can have either one of them, or both. (I'm guessing OSV is backed by some sort of Elasticsearch/OpenSearch-ish database.) The problem is that the JSON from the feeds is deeply nested, and these fields sometimes appear in different parts of the document. So, to simplify extracting information from the JSON, I used an LLM, namely Llama-3 (8B parameters). In general, when parsing very dense JSON, an LLM seems like a good choice, as they have become very good at understanding JSON.

  • Luckily, the OSV database API is not rate limited. So the moment we receive the extracted information, we can pass it to the OSV API, check whether that package exists, extract the affected package's ecosystem, version ranges, etc., and filter on the desired ecosystem (in this case, Maven repositories).

  • Once that information is gathered, we ideally end up with each CVE id and its corresponding affected Maven packages and versions in a JSON file. Note: I did not use an actual database; I just used a JSON file called pkg_info.json.

Tech choices

  • I made this project in Go because of its concurrency patterns.
  • The LLM is served by an inference engine called Groq, as they have very generous rate limits in the free tier and it is extremely fast: ~0.26 seconds per request on average, which for a model like Llama-3 is extraordinary!

Special tools and libraries

  • I mostly stuck to the Go standard library, except for a scheduler package called gocron, used to start an update job every 2 hours, since that's roughly the interval at which the NVD database gets updated.

Installation

Make sure you have Go installed.

Then run:

go build

Get your Groq API key here. It should be very straightforward; just creating an account should do the job. However, if you want to use my API key, please shoot me an email and I'll send it over :)

Check the .env-dist file to fill in the necessary variables into a .env:

DATA_PATH=
GROQ_API_KEY=

DATA_PATH is the folder containing the 2023 and 2024 feeds in JSON format (they have to be downloaded in order to test the code, since the files are relatively large for GitHub), and GROQ_API_KEY is, well, the Groq API key.

To ingest current data from 2023 - 2024:

./nvdbase -c ingest

To launch a cron job that updates the last modified feeds from NVD:

./nvdbase -c update
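The `-c` flag and the `.env` variables could be wired together roughly as in this sketch with the standard `flag` and `os` packages (the actual binary's implementation is not shown in this README, so names here are assumptions):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseCommand reads the -c flag ("ingest" or "update") from args,
// mirroring the CLI shown above. A FlagSet keeps it testable.
func parseCommand(args []string) (string, error) {
	fs := flag.NewFlagSet("nvdbase", flag.ContinueOnError)
	cmd := fs.String("c", "", "command to run: ingest or update")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *cmd, nil
}

func main() {
	cmd, err := parseCommand(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	dataPath := os.Getenv("DATA_PATH")  // folder with the 2023/2024 feeds
	apiKey := os.Getenv("GROQ_API_KEY") // Groq API key from .env
	fmt.Println("command:", cmd, "data:", dataPath, "key set:", apiKey != "")
}
```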

Challenges faced

  • The Groq API, in spite of its generous free tier, is very limiting for a large and continuous database feed like the NVD, so I was rate limited quite a few times and had to wait.
  • There are significantly fewer Maven packages than in, say, PyPI, so to even test that my code was working, I first ran it against the PyPI ecosystem to see that the JSON file was being written properly, as I was rate limited before I even hit my first Maven package search. (I included some test results for the PyPI package ecosystem in a file called pkg_info_pypi_test.json.)
  • Since I initially spent time exploring the NVD feeds API, I lost some time before I started coding, and I could not finish the test within the given time frame of 2-3 hours (it took me roughly 5 hours 20 minutes in total).

Potential improvements

This code is by no means perfect and could use some improvements if the project scales up. Here are some things I think would be nice:

  • If a Mistral/OpenAI API key is available, integrating the tool-calling / function-calling capability of LLMs to call the OSV endpoint automatically would be a plus!
  • Actually implement the pagination search against the NVD API, so that ingestion does not rely solely on the downloadable feeds from the webpage.
  • Parameterize the ecosystem variable so this ingestion pipeline can run against many different package ecosystems.

Final thoughts

I really enjoyed doing this take-home exercise, and I would definitely love to hear your feedback on how I did and where I can improve.

