
pi2schema / pi2schema


Describe your Data Protection rules and Personal Identifying Information as part of your schema

Home Page: https://github.com/gustavomonarin/schema-evolution-experiments

License: Apache License 2.0

Java 100.00%
gdpr gdpr-tracker lgpd schema protobuf avro kafka schema-registry schema-evolution governance

pi2schema's Introduction

build codecov dependabot

Intro

While testing the new schema support available in the ecosystem and its best practices, more specifically protobuf, I was surprised not to find open references for implementing personal data protection. Please see the kafka references and general information links below for the solutions found.

This repo intends to present some experimentation on GDPR which was not ...

Furthermore, it provides an open space to collaborate on such a complex subject, with so many possible combinations: for example, cloud KMS implementations and use cases such as ACLs, including the extensive Kafka ecosystem.

Project Goals

  • GDPR compliant / right to be forgotten
  • No deletion, event loss or data loss of non-personal data
  • Explicit data classification over implicit encryption (as part of the schema)
  • Composable with the current kafka clients / serializers
  • Composable with different key management systems
  • Composable with the kafka ecosystem (could be used directly by the client or by a kafka connect)
  • Yet, providing a simple implementation
  • Composability should enable different ACLs/ways to access data from different consumers

Background

  • Event-driven architectures and their persistence are finally becoming well known and are becoming the new core.
    • The new source of truth
    • Streaming platforms with long-term durability rather than data in transit, especially with KIP-405
    • Streaming platforms extending to provide database-like operations instead of the opposite - LSM ;)
  • Data governance at the center, driven by personal data laws (GDPR/LGPD)
    • Maturity levels - early, often mixed with bureaucracy and spreadsheets

Challenges

  • Multiple areas of knowledge:
    • Serializers (Avro, Protobuf, Json Schema, ...)
    • Schema registries (Confluent, Apicurio, ...)
    • Cryptography / shredding approach
    • Multiple kms implementations (aws, gcp, ...)

Getting started

Please see the kotlin-springboot code sample and video.

Concepts

The pi2schema project relies on the following three modules/components, which can be composed with each other. They are implemented for extensibility, to support multiple cloud providers, encryption mechanisms and security levels.

Schema

The schema is the central part of the pi2schema solution. All the metadata is intended to be described explicitly and naturally as part of the schema, even if the information itself comes from outside.

The core metadata information to be described in the schema consists of:

  • Subject Identifier: Identifies which subject the personal data belongs to. It can be, for instance, the user uuid, the user email or any other identifier.

  • Personal Information: The data related to the subject identifier which should be protected.

Although this project started as part of the confluent protobuf support exploration, the goal is to be extensible to any schema / serialization format. While the intention is to keep the definition / usage as close as possible across the implementations, they will inevitably differ depending on the schema capabilities. Please refer to the specific documentation for details:

protobuf
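
As an illustration of the two concepts above, a protobuf message could look roughly like the sketch below. The message, field and type names are hypothetical and do not reflect the exact pi2schema vocabulary, which is described in the format-specific documentation linked above.

```protobuf
syntax = "proto3";

// Illustrative sketch only: names are hypothetical.
message UserRegistered {
  // Subject identifier: links the personal data below to its data subject.
  string user_id = 1;

  // Personal information: modelled so the serialized message carries either
  // the clear value or its encrypted counterpart.
  oneof email {
    string email_clear = 2;
    EncryptedPersonalData email_encrypted = 3;
  }
}

message EncryptedPersonalData {
  string subject_id = 1;
  bytes data = 2;
}
```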

Crypto

Application

Next steps

  • DelegateSecretKey and cloud implementations/providers
  • Secret key wrapping and ACLs
  • Multi-language support, similar to librdkafka, implemented in Rust
  • Extending schema support/vocabulary

See also

Alternative approaches

kafka references

General implementations (mainly non-free) references

pi2schema's People

Contributors

dependabot-preview[bot], dependabot[bot], gustavomonarin, jomilanez, juliano, razorcd


pi2schema's Issues

Kafka based kms

Simple initial implementation based on symmetric keys persisted in kafka/kafka streams.

The crypto provider interfaces are already present; the kafka solution should implement them.
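
A minimal sketch of what such a kafka-backed key store could look like, assuming hypothetical class and method names (the actual crypto provider interfaces in the repository may differ): one AES key per subject, persisted in a compacted topic keyed by the subject identifier, so that deleting the key with a tombstone makes the subject's encrypted data unreadable.

```java
import java.util.Properties;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical sketch of a kafka-backed symmetric key store.
class KafkaSymmetricKeyStore {

    private final KafkaProducer<String, byte[]> producer;
    private final String topic;

    KafkaSymmetricKeyStore(String bootstrapServers, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        this.producer = new KafkaProducer<>(props);
        this.topic = topic; // expected to be configured with cleanup.policy=compact
    }

    /** Creates a fresh AES key for the subject and persists it in kafka. */
    SecretKey createKeyFor(String subjectId) throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        SecretKey key = generator.generateKey();
        producer.send(new ProducerRecord<>(topic, subjectId, key.getEncoded()));
        return key;
    }

    /** Forgetting a subject: a tombstone removes the key, leaving the encrypted data unreadable. */
    void forget(String subjectId) {
        producer.send(new ProducerRecord<>(topic, subjectId, (byte[]) null));
    }
}
```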

Add dependabot

Considering that the number of external dependencies can grow quite quickly depending on the directions the project takes, it would be great to have a bot checking the dependency versions.

Currently the code is already closely tied to apache kafka, confluent and protobuf artifacts.

This list can grow quickly to other serialization protocols as well as multiple key management systems.

PersonalMetadataProvider cache

Currently for each message to be published, the metadata/message descriptor is inspected for personal data fields.

The message descriptor for a producer or consumer should rarely change and could easily be cached. The cache could be a small in-memory LRU cache.

Both the LRU cache and the metadata/descriptor identification are already implemented in the kafka schema registry protobuf serializer, which could be consulted for inspiration and for consistency with the schema manipulation.
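
A minimal sketch of such a cache, assuming hypothetical type names: a LinkedHashMap in access order acts as a small in-memory LRU, keyed for instance by the protobuf descriptor's full name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a tiny LRU cache so each message descriptor is only
// inspected for personal data fields once, instead of on every publish.
final class MetadataCache<K, V> {

    private final Map<K, V> cache;

    MetadataCache(int maxEntries) {
        // accessOrder=true turns the LinkedHashMap into an LRU structure.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached inspection result, running the inspector only on a miss.
    synchronized V getOrInspect(K descriptorKey, Function<K, V> inspector) {
        return cache.computeIfAbsent(descriptorKey, inspector);
    }
}
```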

Adopt JPMS and ServiceLoader

Within #73 the metadata provider is defined as a standard configuration property.

This fits well within the ecosystem and requires only a small change within the code.

However, we could adopt JPMS and the ServiceLoader in order to autodiscover the available metadata providers and match them with the given payload type.

This would also introduce the building blocks in order to adopt pi2schema beyond the kafka ecosystem.
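
A minimal sketch of the discovery side, assuming a hypothetical shape for the PersonalMetadataProvider interface: implementations would be declared in META-INF/services or in a module-info `provides` clause and looked up through the standard java.util.ServiceLoader.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical provider interface: the real one in the repository may differ.
interface PersonalMetadataProvider {
    boolean supports(Class<?> payloadType);
}

class PersonalMetadataProviders {

    // Autodiscovers providers on the classpath/module path and picks the one
    // matching the given payload type.
    static Optional<PersonalMetadataProvider> forPayload(Class<?> payloadType) {
        return ServiceLoader.load(PersonalMetadataProvider.class)
                .stream()
                .map(ServiceLoader.Provider::get)
                .filter(provider -> provider.supports(payloadType))
                .findFirst();
    }
}
```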

Restructure of the project

Background

The project was created as part of experimentation with protobuf support on the confluent schema registry.

The idea of using the schema first appeared long ago (https://github.com/gustavomonarin/schema-evolution-experiments/commit/ce27f87de8a933f9683654f8116272dd9d43729b) and was implemented as a sub-project.

Moving forward

As the intention is to turn it into a library, this new repo should contain only data protection and schema-related code.

  • Remove protobuf experimentation code
  • Overall project structure and packages

Personal metadata provider abstraction

The personal metadata provider is a common concept which could extract the common PersonalDataDefinition and its implementations from different serialization formats, such as protobuf and json, or even from different strategies within protobuf itself.

It would be a nice place to apply the strategy pattern.

relates to #63
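
A minimal sketch of that strategy abstraction, with hypothetical interface and method names: a common PersonalDataDefinition produced by format-specific provider strategies.

```java
// Hypothetical sketch: the common definition of the personal data in a payload,
// produced by format-specific strategies (protobuf, json, ...).
interface PersonalDataDefinition<T> {
    String subjectIdentifierOf(T payload);
}

interface PersonalMetadataProvider {
    boolean supports(Class<?> payloadType);
    <T> PersonalDataDefinition<T> definitionFor(Class<T> payloadType);
}

// One strategy per serialization format, or per protobuf approach.
class ProtobufMetadataProvider implements PersonalMetadataProvider {

    @Override
    public boolean supports(Class<?> payloadType) {
        return com.google.protobuf.Message.class.isAssignableFrom(payloadType);
    }

    @Override
    public <T> PersonalDataDefinition<T> definitionFor(Class<T> payloadType) {
        // Inspect the protobuf descriptor for the subject identifier and
        // personal data fields (inspection omitted in this sketch).
        throw new UnsupportedOperationException("descriptor inspection omitted");
    }
}
```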

Set up an easy channel of conversation

Although github issues work great for features, there are many areas of uncertainty and learning required, as well as open discussions, that could be facilitated by a simple gitter.im, for example.

Ideally the simplest communication channel, and most probably a temporary one, just for the initial discussions.

Review cache usage on github actions

The current workflow uploads the artifacts after every build, taking about one minute in every single build.

After the other optimizations, the build went down from 5 minutes to 3 minutes.

Now the cache time represents over 30% of the build time, which most probably is not worth it at all.

Kafka Producer/Consumer Interceptor implementation

The current source code is using a composed serializer in order to add the encryption and decryption behavior.

Interceptors could provide a cleaner way to do the same.

We could keep both for the time being, or at least until we explore composing with the kafka platform (kafka connect, maybe ksqldb, etc...).
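
A minimal sketch of the interceptor alternative, using the standard kafka ProducerInterceptor API; the class name is hypothetical and the encryption logic itself is omitted.

```java
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Hypothetical sketch: encryption happens in onSend() (which runs before
// serialization) instead of inside a composed serializer.
class PersonalDataProducerInterceptor<K, V> implements ProducerInterceptor<K, V> {

    @Override
    public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
        // Inspect record.value() for personal data fields and replace them with
        // their encrypted counterpart before serialization (omitted here).
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // No-op: encryption does not depend on broker acknowledgements.
    }

    @Override
    public void close() {
        // Release any crypto/kms resources acquired in configure().
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // Wire the metadata provider and crypto provider from the client configs.
    }
}
```

Such an interceptor would be registered through the standard interceptor.classes producer configuration.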

Migrate to java 11

While working on CompletableFutures with juliano, there were bigger issues with using java 8.

Review @NotNull strategy

Currently there are some jetbrains annotations which should be replaced.

We should consider using findbugs jsr305 instead. It would also be great to have non-null as the default and annotate only the @Nullable cases.
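
A minimal sketch of the non-null-by-default approach with findbugs jsr305, using an illustrative package name:

```java
// package-info.java (package name is illustrative): with findbugs jsr305 on the
// classpath, every parameter in this package is treated as non-null by default,
// so only the nullable exceptions need an explicit @javax.annotation.Nullable.
@javax.annotation.ParametersAreNonnullByDefault
package pi2schema.example;
```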

Discuss usage of Missing fields instead of oneof

Feedback received was that the usage of the protobuf oneof container as an Either<Encrypted, OriginalOtherValues> makes the schema dirty, and that a normal property annotation with missing fields could be an option (sketched below).

There are several points on this:

  • Oneof would be a feature similar to the union types in avro
  • Explicit is always good

Please add your thoughts here, as well as downvote/upvote ...
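
For reference, a minimal sketch of the "missing field" alternative, with hypothetical names, in contrast to the oneof shape shown in the Schema section above:

```protobuf
syntax = "proto3";

// Illustrative sketch of the "missing field" alternative: instead of a oneof,
// the personal field stays a plain property carrying a personal-data
// annotation/option; once the subject is forgotten or the value is stored
// encrypted elsewhere, the field is simply left unset.
message Customer {
  string customer_id = 1;
  string email = 2; // would carry the personal-data option in this approach
}
```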
