Take a look at the companion blog for more info!
⚠️ THIS PROJECT IS PROVIDED AS-IS WITHOUT ANY GUARANTEES: we make no claims as to the accuracy of the PII detection provided here, and if you decide to use it, it's YOUR RESPONSIBILITY to ensure that the example regexes and detection/redaction/tagging meet your internal, legal or regulatory requirements.
# Using Delta Live Tables to detect and redact PII data
Delta Live Tables makes it easy to build and manage reliable data pipelines that deliver high-quality data on Delta Lake.
To get this pipeline running on your environment, please use the following steps:
- Clone this GitHub repo using Databricks Repos (see the docs for AWS, Azure, GCP)
- First, you need some input data to run the pipeline on. If you don't have any data, or just want to try it out, you can use the `00_generate_data.py` notebook to generate some, using the following options to customise the data generation:
  - `GENERATE_CLEAN_DATA`: Whether to generate 4 records of artificially created "clean" data specifically designed not to get evaluated as PII
  - `GENERATE_PII_DATA`: Whether to generate fake PII data
  - `NUM_ROWS`: The number of rows of fake PII data to generate
  - `OUTPUT_DIR`: The path on cloud storage to write the generated data to
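As a rough illustration of what these options control, a minimal data-generation sketch might look like the following. The real notebook's logic and schema differ; `fake_pii_row` and the field names here are illustrative stand-ins:

```python
import json
import random
import string

def fake_pii_row(row_id: int) -> dict:
    """Build one row of fake PII-like data (a stand-in for the notebook's generator)."""
    digits = lambda n: "".join(random.choices(string.digits, k=n))
    return {
        "customer_id": row_id,
        "email": f"user{row_id}@example.com",
        "phone_number": f"+1-{digits(3)}-{digits(3)}-{digits(4)}",
        "ip_address": ".".join(str(random.randint(0, 255)) for _ in range(4)),
    }

# Mirror the notebook's options
GENERATE_PII_DATA = True
NUM_ROWS = 10
OUTPUT_DIR = "/tmp/generated_pii"  # in the notebook this would be a cloud storage path

if GENERATE_PII_DATA:
    rows = [fake_pii_row(i) for i in range(NUM_ROWS)]
    payload = "\n".join(json.dumps(r) for r in rows)  # newline-delimited JSON
```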
- Create a new DLT pipeline, selecting `01_observability.py` and `02_detect_and_redact_pii.py` as Notebook Libraries (see the docs for AWS, Azure, GCP). You'll need to add the following pipeline settings:
  - `INPUT_PATH`: The path on cloud storage where the input data is located
  - `INPUT_FORMAT`: The format of the input data. One of "delta", "parquet", "json" or "csv" is supported
  - `TABLE_PATH`: The path to write all of the tables created by the pipeline to
  - `STORAGE_PATH`: A location on cloud storage where output data and metadata required for the pipeline execution are stored. This should match the `Storage Location` entered below.
  - `EXPECTATIONS_PATH`: The path to the `pii_firewall_rules.json` config file uploaded to a DBFS-accessible path (for example a mounted storage bucket). This is the main configuration file used to customise the behaviour of the detection/redaction/tagging of data. See Firewall Rules below for more details
  - `NUM_SAMPLE_ROWS`: To generate the SQL used to automatically redact the PII discovered, the pipeline will sample this many rows of data, evaluate them against your expectations, and generate SQL code to leave each column unchanged or redact it accordingly. The fewer rows sampled, the faster this initial stage of the pipeline will run, but the more likely it is that PII may make it through the firewall
  - `NESTED_DEPTH`: Because regexes don't perform well on nested data, the pipeline will try to flatten any struct, map or array types in the data. It will iterate over the data to this depth to achieve this
  - `Target`: The name of the database for persisting all of the output data
  - `Storage Location`: A location on cloud storage where output data and metadata required for the pipeline execution are stored. This should match the `STORAGE_PATH` entered above.
- Note: once you’ve edited the settings that are configurable via the UI, you’ll need to edit the pipeline JSON so that you can add the configuration needed to authenticate with your chosen cloud storage:
  - For AWS, add the `instance_profile_arn` to the `aws_attributes` object
  - For Azure, add the Service Principal secrets to the `spark_conf` object
  - For GCP, add the `google_service_account` to the `gcp_attributes` object
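Putting the settings together, the pipeline's JSON might look something like the sketch below (AWS shown). Every name, path, ARN and bucket here is a placeholder, and the exact JSON shape should be checked against the DLT settings docs for your cloud:

```json
{
  "name": "pii-detect-and-redact",
  "target": "pii_firewall",
  "storage": "s3://my-bucket/dlt/storage",
  "configuration": {
    "INPUT_PATH": "s3://my-bucket/raw/generated_pii",
    "INPUT_FORMAT": "json",
    "TABLE_PATH": "s3://my-bucket/dlt/tables",
    "STORAGE_PATH": "s3://my-bucket/dlt/storage",
    "EXPECTATIONS_PATH": "/dbfs/FileStore/pii_firewall_rules.json",
    "NUM_SAMPLE_ROWS": "1000",
    "NESTED_DEPTH": "1"
  },
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/my-dlt-profile"
      }
    }
  ],
  "libraries": [
    { "notebook": { "path": "/Repos/me/dlt-pii/01_observability" } },
    { "notebook": { "path": "/Repos/me/dlt-pii/02_detect_and_redact_pii" } }
  ]
}
```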
- As well as the DLT pipeline, the project contains the notebook `03_tag_pii.py`. This is designed to run after the DLT pipeline has finished, and to tag databases/tables/columns as appropriate to confirm that:
- They have been scanned for PII
- That PII has either been found or not found
- Where PII has been found, add a customisable comment to the column it was found in
- In order to get this to run straight after our DLT pipeline, we're going to create a multi-task job workflow (see the docs for AWS, Azure, GCP). You'll need to select the notebook `03_tag_pii.py` and pass in the following parameters:
  - `DATABASE_NAME`: The database to apply tagging (via properties) to. Should match the `Target` entered above.
  - `TABLE_NAMES`: The tables within the database to apply tagging (via properties) and column-level comments to. The DLT pipeline creates 3 main tables: `clean`, `redacted` and `output`; you can apply tagging to 1, 2 or all 3 of these.
  - `EXPECTATIONS_PATH`: The path to the `dynamic_firewall_rules.json` config file uploaded to a DBFS-accessible path (for example a mounted storage bucket). This is the main configuration file used to customise the behaviour of the detection/redaction/tagging of data. See Firewall Rules below for more details.
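To make the tagging step concrete, here's a hypothetical sketch of the kind of SQL statements `03_tag_pii.py` might issue. The property keys (`pii_scanned`, `pii_found`) and the comment text are illustrative assumptions, not the notebook's actual ones:

```python
# Illustrative sketch of the statements a tagging notebook might issue.
# The property keys and comment text below are made-up examples.
DATABASE_NAME = "pii_firewall"
TABLE_NAMES = ["clean", "redacted", "output"]
pii_columns = {"redacted": ["email", "phone_number"]}  # example detection result

def tagging_statements(database, tables, pii_columns):
    stmts = []
    for table in tables:
        found = pii_columns.get(table, [])
        # Record that the table was scanned and whether PII was found
        stmts.append(
            f"ALTER TABLE {database}.{table} "
            f"SET TBLPROPERTIES ('pii_scanned' = 'true', 'pii_found' = '{str(bool(found)).lower()}')"
        )
        # Add a comment to each column where PII was found
        for col in found:
            stmts.append(
                f"ALTER TABLE {database}.{table} "
                f"ALTER COLUMN {col} COMMENT 'May contain PII ({col})'"
            )
    return stmts

stmts = tagging_statements(DATABASE_NAME, TABLE_NAMES, pii_columns)
```

In a notebook, each generated statement would then be executed with `spark.sql(stmt)`.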
The following data tables and views are created by this pipeline:
NOTE: The pipeline will try to flatten the first level of any nested columns in your input data. This is for performance as well as accuracy reasons: you can't easily apply a regex to a `struct`, `array` or `map`, and while you could try casting the entire column to a string, the performance of this will be poor and it won't help you if you have lots of different nested PII. If you have more than one nested level you may want to consider what to do here: you could update the function `flatten_dataframe()` to recursively find nested columns, or you could update it to automatically drop any additional nested columns it finds after the first pass.
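The flattening idea is easy to illustrate on plain Python dicts. The project's `flatten_dataframe()` operates on Spark DataFrames and struct/array/map columns, so this is only an analogue:

```python
def flatten_one_level(record: dict) -> dict:
    """Flatten the first level of nesting in a record (a dict analogue of what a
    Spark-based flatten_dataframe() does with struct/array/map columns)."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):       # struct-like: promote each field
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        elif isinstance(value, list):     # array-like: promote each element
            for i, item in enumerate(value):
                flat[f"{key}_{i}"] = item
        else:
            flat[key] = value
    return flat

row = {"id": 1, "address": {"street": "1 Main St", "zip": "94105"}, "phones": ["555-0100"]}
flat = flatten_one_level(row)
# flat == {"id": 1, "address_street": "1 Main St", "address_zip": "94105", "phones_0": "555-0100"}
```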
| Name | Type | Description |
|---|---|---|
| `staging` | View | Initial view that data is loaded into. May contain PII and is therefore declared as a view (so that PII is not persisted after the pipeline has run). |
| `quarantine` | View | View containing data that has failed expectations. May contain PII and is therefore declared as a view (so that PII is not persisted after the pipeline has run). |
| `clean` | Table | Table containing data that has passed expectations and is therefore not expected to contain PII. |
| `redacted` | Table | Table containing data that has failed expectations and is therefore expected to contain PII, but in which that PII has been redacted based on the specified actions. |
| `output` | Table | A union of `clean` and `redacted`: a table containing either data that has passed expectations (and is therefore not expected to contain PII) or data that is expected to contain PII but has been redacted based on the specified actions. |
The following monitoring tables are created by this pipeline:
| Name | Type | Description |
|---|---|---|
| `event_logs` | Table | Raw DLT event logs relating to the running and management of the pipeline |
| `audit_logs` | Table | Logs capturing the management events relating to the pipeline |
| `data_quality_logs` | Table | Logs capturing the data quality metrics relating to the pipeline |
| `flow_logs` | Table | Logs capturing the runtime events relating to the pipeline |
When everything is set up correctly, run the multi-task job and you should see something like this...
The pipeline following a successful run (10M rows of data):
The expectations evaluated against our sample data:
The `dynamic_firewall_rules.json` file is the main way to customise the behaviour of the detection/redaction/tagging of data. Within the file you'll notice a number of rules defined as follows:
```json
{
  "name": "",
  "constraint": "",
  "action": "",
  "mode": "",
  "tag": ""
}
```
Every rule that you specify here will be applied against every column of your input data. If those columns contain structs of nested data, it will flatten the first layer of those so that the expectations can be applied uniformly across all of the fields contained within them. To add new rules, just add a new JSON object as follows:
| Element | Is Mandatory | Can Contain | Description |
|---|---|---|---|
| `name` | Yes | any string | The name of the expectation. `{}` will be replaced by the column name. |
| `constraint` | Yes | a valid SQL invariant | The expectation whose success or failure determines whether the row contains PII or not. |
| `action` | No | a valid SQL expression | The action that will be applied if any rows in the column fail when evaluated against their expectation AND the mode selected is `REDACT` or `REDACT_AND_TAG`. |
| `mode` | Yes | one of `REDACT`, `TAG`, or `REDACT_AND_TAG` | Whether to `REDACT`, just `TAG` or `REDACT_AND_TAG` all of the rows in a column if any rows in that column fail when evaluated against their expectation. |
| `tag` | No | any string | The comment to be added to any columns found to contain PII. `{}` will be replaced by the column name. |
| `redact_threshold` | Yes | a number between 0 and 100 | During the generation of redaction SQL, the pipeline will sample `NUM_SAMPLE_ROWS` rows to detect PII. If the % of failed expectations found in these sample rows is greater than the `redact_threshold`, that column will have the specified action applied to it. The threshold is useful for fine-tuning regexes which may generate a lot of false positives versus those you are confident will always find PII. |
| `tag_threshold` | Yes | a number between 0 and 100 | During the PII tagging step, all failed expectations will be used to determine whether to tag the column as potentially containing PII. If the % of failed expectations for each expectation is greater than the `tag_threshold`, we will tag this column as potentially containing PII. This threshold is useful for fine-tuning regexes which may generate a lot of false positives versus those you are confident will always find PII. |
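For example, a hypothetical rule for email addresses might look like the following. The regex, the threshold values and the assumption that a failing `constraint` marks a row as PII are illustrative only (note `[.]` is used in place of an escaped dot to avoid nested escaping in JSON and SQL):

```json
{
  "name": "{}_may_contain_email_addresses",
  "constraint": "NOT({} RLIKE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}')",
  "action": "regexp_replace({}, '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}', '[REDACTED]')",
  "mode": "REDACT_AND_TAG",
  "tag": "Column {} may contain email addresses",
  "redact_threshold": 5,
  "tag_threshold": 5
}
```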
- A firewall is only as good as its rules: the regular expressions here are provided on an as-is basis and without any guarantees. That said, I’d love to test them against more data, refine their accuracy and add new regexes for PII types which aren’t covered yet. If you’re interested in collaborating on this, I’d love to hear from you!
- In my experience most of the enterprise platforms that I encounter in this field still rely on regexes for ~90% of their PII detection. That said, it’s important to acknowledge that this approach is not without its limitations. If you used the notebook provided to generate your input data you’ll notice that there are two fields that our detection has failed pretty miserably on:
  - `name`: short of having a dictionary of every person’s name on the planet, in every language and every alphabet, detecting a person’s name via regex is pretty much an impossible job
  - `freetext`: it's hard to iterate expectations and the associated regex SQL across a column multiple times. Right now, only the first piece of PII detected in a column will be redacted, which may cause problems with columns that contain long strings of text
- So what’s the answer here? I’d like to look at using more sophisticated approaches like Named Entity Recognition (NER) or Natural Language Processing (NLP) to improve the detection of specific PII types like names, addresses or businesses, and also to apply identification and redaction to the contents of an entire column at a time. Again, if you’re interested in collaborating on this, please let me know!
- 80 - 90% of the world’s data is unstructured - whilst the approaches outlined here work well for structured and semi-structured data, unstructured data comes with its own, entirely different challenges!
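The name-detection problem above is easy to demonstrate: even a naive "two capitalised words" regex both misses real names and flags non-names (the pattern and examples here are illustrative only):

```python
import re

# A naive "person name" regex: two consecutive capitalised words.
NAIVE_NAME_REGEX = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

texts = [
    "please contact Jane Smith for details",  # a real name: matched
    "we flew to New York on Monday",          # a place: false positive
    "invoice sent to j. smith",               # lowercase name: missed entirely
]
matches = [NAIVE_NAME_REGEX.findall(t) for t in texts]
# matches == [["Jane Smith"], ["New York"], []]
```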