
Extending data dictionaries? (ossem, open)

otrf commented on July 27, 2024
Extending data dictionaries?

from ossem.

Comments (3)

hxnoyd commented on July 27, 2024

Hi @nicolasreich.

Thanks for the detailed explanation. It is now clearer what you mean by 'extending'. In a nutshell: deconstruct data dictionaries according to field prevalence, to avoid duplicates and keep the data dictionary YAML as clean as possible.

I see the benefit of such an approach for events in the same log source (keep it simple, reduce duplication), but it would mean an increase in the number of data dictionaries, since we would need to create the 'common fields' data dictionaries (e.g. src_ip, dest_ip, etc.). On the one hand we would have a schema with few duplicate fields; on the other hand, we would have more YAML data dictionaries to maintain.

Field name duplication has been raised multiple times in the past, but we always opted to keep the data dictionaries as close as possible to the original events, so that the community could customize them as needed. The main reason for this is to preserve the atomicity of the data dictionary: an absolutely independent object, or the source of truth in a single document if you like. By doing so we enable the community to model the data dictionaries as they like, to their own needs (e.g. Logstash pipelines).

Regardless, I think your suggestion is aligned with our vision for the improvement of data dictionaries, possibly with the creation of a separate dictionary that would provide a first layer of abstraction for data dictionaries, where the community would be able to better map events with entities, and/or the detection data model. This would allow us to keep the source of truth, at the expense of maintaining another dictionary with modeled/standardized events.

Unfortunately the last few months have been insanely busy, and we haven't had the time to work on a PoC for this... but it is on the roadmap :)


hxnoyd commented on July 27, 2024

Hi @nicolasreich. First of all, sorry for the late reply.

So far we have developed data dictionaries as independent documents, as close as possible to the raw events produced by the sensor. The main goal is that you will always be able to drill down (e.g. from the data model) to the source of truth of an event and its fields. One of the tradeoffs is, as you suggest, duplicate information, which becomes apparent when you consume multiple events from the same sensor.

We are, however, planning to improve Data Dictionaries in order to deal with situations where event fields can have different definitions depending on the event type, or where a field contains a nested JSON object, list, etc. that we could use to extend the fieldset of the event.

Regardless, I would be interested in further exploring your use case.
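As a rough illustration of the nested-field case, the sketch below flattens nested objects into dotted field names so that they extend the top-level fieldset. The dotted naming, the `flatten` helper, and the sample values are assumptions made for this example, not something the project has committed to.

```python
def flatten(event, prefix=""):
    """Flatten nested dicts into dotted field names; other values are kept as-is."""
    flat = {}
    for key, value in event.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the field name.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            # Scalars and lists are left untouched.
            flat[name] = value
    return flat


# Invented sample event, for illustration only.
sample = {"src_ip": "192.0.2.10", "alert": {"severity": 2}}
flat_fields = flatten(sample)
print(flat_fields)  # {'src_ip': '192.0.2.10', 'alert.severity': 2}
```

Lists are deliberately left as single values here; how to document list elements is a separate design question.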


nicolasreich commented on July 27, 2024

Hi @hxnoyd. No worries, it was the holidays for everyone.

The rationale for this question was Suricata EVE JSON logs, where you have common fields, then nested fields for specific data. So for any alert, you get common fields, like source and destination IP addresses, as well as an alert section, and a different section depending on the protocol that triggered the alert.

So for an alert triggered by a DNS request, you would get something like:

src_ip: ...,
dest_ip: ...,
...
other common fields
...
alert: { ... alert fields ... },
dns: { ... dns fields ... }

While for an alert triggered by an HTTP request:

src_ip: ...,
dest_ip: ...,
...
other common fields
...
alert: { ... alert fields ... },
http: { ... http fields ... }

So the common fields are present in every event; the alert object is present in every alert; and then, depending on the type of the underlying traffic, there might be other objects.
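To make the three layers concrete, here is a small sketch that splits an event into its common fields, its alert object, and whatever protocol-specific objects remain. The `COMMON_KEYS` set, the `split_event` helper, and the sample event values are invented for illustration; a real event carries more common fields than are listed here.

```python
# Assumed subset of the common fields, for illustration only.
COMMON_KEYS = {"timestamp", "event_type", "src_ip", "src_port",
               "dest_ip", "dest_port", "proto"}


def split_event(event):
    """Return (common fields, alert object or None, protocol-specific objects)."""
    common = {k: v for k, v in event.items() if k in COMMON_KEYS}
    alert = event.get("alert")
    extra = {
        k: v for k, v in event.items()
        if k not in COMMON_KEYS and k != "alert" and isinstance(v, dict)
    }
    return common, alert, extra


# Invented sample event; the nested contents are placeholders.
EVENT = {
    "event_type": "alert",
    "src_ip": "192.0.2.10",
    "dest_ip": "198.51.100.53",
    "alert": {"signature": "example signature"},
    "dns": {"placeholder": "dns fields"},
}

common, alert, extra = split_event(EVENT)
print(sorted(common))  # ['dest_ip', 'event_type', 'src_ip']
print(sorted(extra))   # ['dns']
```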

It's obviously possible to have a data dictionary for each alert type, each containing the common fields and the alert fields; but it means a lot of duplication, causing a lot of potential mistakes, and what seems like unnecessary verbiage.

I think it would make sense to be able to extend a data dictionary, much like it's possible for entities. The rendered markdown version of the Data Dictionary would still be an independent document containing all the data.
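A minimal sketch of what I have in mind, assuming a hypothetical `extends` key and a simple `fields` mapping (neither is part of the current data dictionary format). The resolver merges the parents' fields into the child, so the rendered output is still a single self-contained document.

```python
from copy import deepcopy

# Hypothetical dictionaries: the names and the "extends"/"fields" keys are
# assumptions for illustration, not the actual OSSEM YAML layout.
DICTIONARIES = {
    "suricata_common": {
        "fields": {
            "src_ip": "Source IP address",
            "dest_ip": "Destination IP address",
        },
    },
    "suricata_alert": {
        "extends": "suricata_common",
        "fields": {"alert": "Alert object"},
    },
    "suricata_alert_dns": {
        "extends": "suricata_alert",
        "fields": {"dns": "DNS object"},
    },
}


def resolve(name, dictionaries, _seen=()):
    """Return a standalone copy of `name` with all inherited fields merged in."""
    if name in _seen:
        raise ValueError(f"circular 'extends' chain at {name!r}")
    entry = dictionaries[name]
    fields = {}
    parent = entry.get("extends")
    if parent is not None:
        # Inherited fields come first, so the rendered order is parent to child.
        fields.update(resolve(parent, dictionaries, _seen + (name,))["fields"])
    fields.update(deepcopy(entry["fields"]))  # child definitions override parents
    return {"name": name, "fields": fields}


resolved = resolve("suricata_alert_dns", DICTIONARIES)
print(list(resolved["fields"]))  # ['src_ip', 'dest_ip', 'alert', 'dns']
```

The source files stay small and free of duplicates, while the resolved dictionary no longer contains an `extends` reference and can be rendered to markdown on its own.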
