
pi2schema / pi2schema


Describe your Data Protection rules and Personal Identifying Information as part of your schema

Home Page: https://github.com/gustavomonarin/schema-evolution-experiments

License: Apache License 2.0

Java 100.00%
gdpr gdpr-tracker lgpd schema protobuf avro kafka schema-registry schema-evolution governance

pi2schema's Introduction

build codecov dependabot

Intro

While testing the new schema support available in the ecosystem and its best practices, more specifically protobuf, I was surprised not to find open references for implementing personal data protection. Please see the kafka references and general information links below for the solutions found.

This repo intends to present some experimentation on GDPR which was not ...

Furthermore, it provides an open space to collaborate on such a complex subject, with so many possible combinations: for example, cloud KMS implementations and use cases such as ACLs, including the extensive Kafka ecosystem.

Project Goals

  • GDPR compliant / right to be forgotten
  • No deletion, event loss or data loss of non-personal data
  • Explicit data classification over implicit encryption (as part of the schema)
  • Composable with the current kafka clients / serializers
  • Composable with different key management systems
  • Composable with the kafka ecosystem (could be used directly by the client or by a kafka connect)
  • Yet, providing a simple implementation
  • Composability should enable different ACLs/ways to access data from different consumers

Background

  • Event-driven architectures and their persistence are finally becoming well known and are becoming the new core.
    • The new source of truth
    • Streaming platforms with long-term durability rather than data in transit, especially with KIP-405
    • Streaming platforms extending to provide database-like operations instead of the opposite - LSM ;)
  • Data governance at the center, driven by personal data laws (GDPR/LGPD)
    • Maturity levels - early, often mixed with bureaucracy and spreadsheets

Challenges

  • Multiple areas of knowledge:
    • Serializers (Avro, Protobuf, Json Schema, ...)
    • Schema registries (Confluent, Apicurio, ...)
    • Cryptography / shredding approach
    • Multiple kms implementations (aws, gcp, ...)

Getting started

Please see the kotlin-springboot code sample and video.

Concepts

The pi2schema project relies on the following three modules/components, which can be composed with each other. They are implemented for extensibility, to support multiple cloud providers, encryption mechanisms and security levels.

Schema

The schema is the central part of the pi2schema solution. All the metadata is intended to be described explicitly and naturally as part of the schema, even if the information itself comes from outside.

The core metadata information to be described in the schema consists of:

  • Subject Identifier: Identifies which subject the personal data belongs to. It can be, for instance, the user uuid, the user email or any other identifier.

  • Personal Information: The data related to the subject identifier which should be protected.

Although this project started as part of the confluent protobuf support exploration, the goal is to be extensible to any schema / serialization format. While the intention is to keep the definition / usage as close as possible across the implementations, they will inevitably differ depending on the schema capabilities. Please refer to the specific documentation for details:

protobuf
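
As an illustration of the two concepts above, a protobuf message could look roughly like the sketch below. The message, field and type names are hypothetical and do not reflect the exact pi2schema vocabulary, which is described in the format-specific documentation linked above.

```protobuf
syntax = "proto3";

// Illustrative sketch only: names are hypothetical.
message UserRegistered {
  // Subject identifier: links the personal data below to its data subject.
  string user_id = 1;

  // Personal information: modelled so the serialized message carries either
  // the clear value or its encrypted counterpart.
  oneof email {
    string email_clear = 2;
    EncryptedPersonalData email_encrypted = 3;
  }
}

message EncryptedPersonalData {
  string subject_id = 1;
  bytes data = 2;
}
```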

Crypto

Application

Next steps

  • DelegateSecretKey and cloud implementations/providers
  • Secret key wrapping and ACLs
  • Multi-language support, similar to librdkafka, implemented in Rust
  • Extending schema support/vocabulary

See also

Alternative approaches

kafka references

General implementations (mainly non-free) references

pi2schema's People

Contributors

dependabot-preview[bot], dependabot[bot], gustavomonarin, jomilanez, juliano, razorcd


pi2schema's Issues

Kafka based kms

Simple initial implementation based on symmetric keys persisted in kafka/kafka streams.

The crypto provider interfaces are already present; the kafka solution should implement them.
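
A minimal sketch of what such a kafka-backed key store could look like, assuming hypothetical class and method names (the actual crypto provider interfaces in the repository may differ): one AES key per subject, persisted in a compacted topic keyed by the subject identifier, so that deleting the key with a tombstone makes the subject's encrypted data unreadable.

```java
import java.util.Properties;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical sketch of a kafka-backed symmetric key store.
class KafkaSymmetricKeyStore {

    private final KafkaProducer<String, byte[]> producer;
    private final String topic;

    KafkaSymmetricKeyStore(String bootstrapServers, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        this.producer = new KafkaProducer<>(props);
        this.topic = topic; // expected to be configured with cleanup.policy=compact
    }

    /** Creates a fresh AES key for the subject and persists it in kafka. */
    SecretKey createKeyFor(String subjectId) throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        SecretKey key = generator.generateKey();
        producer.send(new ProducerRecord<>(topic, subjectId, key.getEncoded()));
        return key;
    }

    /** Forgetting a subject: a tombstone removes the key, leaving the encrypted data unreadable. */
    void forget(String subjectId) {
        producer.send(new ProducerRecord<>(topic, subjectId, (byte[]) null));
    }
}
```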

Add dependabot

Considering that the number of external dependencies can grow quite quickly depending on the directions the project takes, it would be great to have a bot checking the dependency versions.

Currently the code is already closely tied to apache kafka, confluent and protobuf artifacts.

This list can grow quickly to other serialization protocols as well as multiple key management systems.

PersonalMetadataProvider cache

Currently for each message to be published, the metadata/message descriptor is inspected for personal data fields.

The message descriptor for a producer or consumer should rarely change and could easily be cached. The cache could be a small in-memory LRU cache.

Both the LRU cache and the metadata/descriptor identification are already implemented in the kafka schema registry protobuf serializer, which could be consulted for inspiration and for consistency with the schema manipulation.
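
A minimal sketch of such a cache, assuming hypothetical type names: a LinkedHashMap in access order acts as a small in-memory LRU, keyed for instance by the protobuf descriptor's full name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a tiny LRU cache so each message descriptor is only
// inspected for personal data fields once, instead of on every publish.
final class MetadataCache<K, V> {

    private final Map<K, V> cache;

    MetadataCache(int maxEntries) {
        // accessOrder=true turns the LinkedHashMap into an LRU structure.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached inspection result, running the inspector only on a miss.
    synchronized V getOrInspect(K descriptorKey, Function<K, V> inspector) {
        return cache.computeIfAbsent(descriptorKey, inspector);
    }
}
```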

Adopt JPMS and ServiceLoader

Within #73 the metadata provider is defined as a standard configuration property.

This fits well within the ecosystem and requires only a small change within the code.

However, we could adopt JPMS and the ServiceLoader in order to autodiscover the available metadata providers and match them with the given payload type.

This would also introduce the building blocks in order to adopt pi2schema beyond the kafka ecosystem.
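
A minimal sketch of the discovery side, assuming a hypothetical shape for the PersonalMetadataProvider interface: implementations would be declared in META-INF/services or in a module-info `provides` clause and looked up through the standard java.util.ServiceLoader.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical provider interface: the real one in the repository may differ.
interface PersonalMetadataProvider {
    boolean supports(Class<?> payloadType);
}

class PersonalMetadataProviders {

    // Autodiscovers providers on the classpath/module path and picks the one
    // matching the given payload type.
    static Optional<PersonalMetadataProvider> forPayload(Class<?> payloadType) {
        return ServiceLoader.load(PersonalMetadataProvider.class)
                .stream()
                .map(ServiceLoader.Provider::get)
                .filter(provider -> provider.supports(payloadType))
                .findFirst();
    }
}
```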

Restructure of the project

Background

The project was created as part of experimentation with protobuf support on the confluent schema registry.

The idea of using the schema first appeared long ago (https://github.com/gustavomonarin/schema-evolution-experiments/commit/ce27f87de8a933f9683654f8116272dd9d43729b) and was implemented as a sub-project.

Moving forward

As the intention is to turn it into a library, this new repo should contain only data protection and schema-related code.

  • Remove protobuf experimentation code
  • Overall project structure and packages

Personal metadata provider abstraction

The personal metadata provider is a common concept which could extract the common PersonalDataDefinition and its implementations from different serialization formats, such as protobuf and json, or even from different strategies within protobuf itself.

It would be a nice place to apply the strategy pattern.

relates to #63
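
A minimal sketch of that strategy abstraction, with hypothetical interface and method names: a common PersonalDataDefinition produced by format-specific provider strategies.

```java
// Hypothetical sketch: the common definition of the personal data in a payload,
// produced by format-specific strategies (protobuf, json, ...).
interface PersonalDataDefinition<T> {
    String subjectIdentifierOf(T payload);
}

interface PersonalMetadataProvider {
    boolean supports(Class<?> payloadType);
    <T> PersonalDataDefinition<T> definitionFor(Class<T> payloadType);
}

// One strategy per serialization format, or per protobuf approach.
class ProtobufMetadataProvider implements PersonalMetadataProvider {

    @Override
    public boolean supports(Class<?> payloadType) {
        return com.google.protobuf.Message.class.isAssignableFrom(payloadType);
    }

    @Override
    public <T> PersonalDataDefinition<T> definitionFor(Class<T> payloadType) {
        // Inspect the protobuf descriptor for the subject identifier and
        // personal data fields (inspection omitted in this sketch).
        throw new UnsupportedOperationException("descriptor inspection omitted");
    }
}
```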

Set up an easy channel of conversation

Although github issues work great for features, there are many areas of uncertainty and learning required, as well as open discussions, that could be facilitated by a simple gitter.im, for example.

Ideally the simplest communication channel, and most probably a temporary one, just for the initial discussions.

Review cache usage on github actions

The current workflow uploads the artifacts after every build, taking about one minute in every single build.

After the other optimizations, the build went down from 5 minutes to 3 minutes.

Now the cache time represents over 30% of the build time, which most probably is not worth it at all.

Kafka Producer/Consumer Interceptor implementation

The current source code is using a composed serializer in order to add the encryption and decryption behavior.

Interceptors could provide a cleaner way to do the same.

We could keep both for the time being, or at least until we explore composing with the kafka platform (kafka connect, maybe ksqldb, etc...).
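
A minimal sketch of the interceptor alternative, using the standard kafka ProducerInterceptor API; the class name is hypothetical and the encryption logic itself is omitted.

```java
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Hypothetical sketch: encryption happens in onSend() (which runs before
// serialization) instead of inside a composed serializer.
class PersonalDataProducerInterceptor<K, V> implements ProducerInterceptor<K, V> {

    @Override
    public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
        // Inspect record.value() for personal data fields and replace them with
        // their encrypted counterpart before serialization (omitted here).
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // No-op: encryption does not depend on broker acknowledgements.
    }

    @Override
    public void close() {
        // Release any crypto/kms resources acquired in configure().
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // Wire the metadata provider and crypto provider from the client configs.
    }
}
```

Such an interceptor would be registered through the standard interceptor.classes producer configuration.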

Migrate to java 11

While working on CompletableFutures with juliano, there were bigger issues with using java 8.

Review @NotNull strategy

Currently there are some jetbrains annotations which should be replaced.

We should consider using findbugs jsr305 instead. It would also be great to have non-null as the default and annotate only the @Nullable cases.
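
A minimal sketch of the non-null-by-default approach with findbugs jsr305, using an illustrative package name:

```java
// package-info.java (package name is illustrative): with findbugs jsr305 on the
// classpath, every parameter in this package is treated as non-null by default,
// so only the nullable exceptions need an explicit @javax.annotation.Nullable.
@javax.annotation.ParametersAreNonnullByDefault
package pi2schema.example;
```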

Discuss usage of Missing fields instead of oneof

Feedback received was that the usage of the protobuf oneof container as an Either<Encrypted, OriginalOtherValues> makes the schema dirty, and that a normal property annotation with missing fields could be an option (sketched below).

There are several points on this:

  • Oneof would be a feature similar to the union types in avro
  • Explicit is always good

Please add your thoughts here, as well as downvote/upvote ...
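
For reference, a minimal sketch of the "missing field" alternative, with hypothetical names, in contrast to the oneof shape shown in the Schema section above:

```protobuf
syntax = "proto3";

// Illustrative sketch of the "missing field" alternative: instead of a oneof,
// the personal field stays a plain property carrying a personal-data
// annotation/option; once the subject is forgotten or the value is stored
// encrypted elsewhere, the field is simply left unset.
message Customer {
  string customer_id = 1;
  string email = 2; // would carry the personal-data option in this approach
}
```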
