The anonymizer from thoughtworks-datakind

Add Docker and CircleCI

Build wheel package
Create docker-compose for spark cluster
Update SparkSession builder to use new configurations
Update Readme with working commands to spark-submit to cluster
Set up circleci
Run e2e test

Support for enabling different acquire and parse handlers

Design a contract for supporting different inputs and formats
Decide on a viable output format for the next step (e.g. a pandas DataFrame)

Provide the anonymized csv file in the output directory

As an anonymizer user,
When I run the application,
Anonymized output needs to be created in the output directory with the name <input_file_name>_anonymized_.csv.
input_file_name here is obtained from the config json that is passed as an input to the application.
output_directory needs to be obtained from the config json file as a parameter in the 'anonymize' section.

Considerations:

An empty output file needs to be created if the input file is empty
If no PII is detected in the fields, the same input file needs to be available as the anonymized version.
If an error occurs during anonymization, no file should be created and an appropriate error message should be shown to the user.
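
The naming rule above can be sketched as follows. This is a minimal illustration only: the config keys ("acquire", "file_path", "anonymize", "output_file_path") are assumptions made for the example and may not match the project's actual config schema.

```python
import json
from pathlib import Path


def build_output_path(config: dict) -> Path:
    """Return <output_directory>/<input_file_name>_anonymized_.csv.

    The key names used here are hypothetical, chosen for illustration.
    """
    input_path = Path(config["acquire"]["file_path"])
    output_dir = Path(config["anonymize"]["output_file_path"])
    # Path.stem drops the directory and the extension of the input file.
    return output_dir / f"{input_path.stem}_anonymized_.csv"


config = json.loads("""
{
  "acquire": {"file_path": "data/comma_delimited_file.csv"},
  "anonymize": {"output_file_path": "output"}
}
""")
output_path = build_output_path(config)
```

To honour the "no file on error" consideration, one option is to run the anonymization fully in memory and only open the output file once it has succeeded.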

Build a pool of regex matches for identifying Singapore (National Identification numbers) PII data

Capability to identify following PII data related to Singaporeans

NRIC / FIN
Work Permit Id
Passport
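
A possible starting point for the NRIC / FIN matcher is sketched below. It assumes the commonly documented shape of one prefix letter (S, T, F, G or M), seven digits and one trailing letter, and it checks the shape only, not the checksum letter. Work Permit and Passport patterns are left out because their formats are not stated in this issue.

```python
import re

# Assumed shape: prefix letter, seven digits, trailing letter.
# This does not validate the checksum letter.
NRIC_FIN_PATTERN = re.compile(r"\b[STFGM]\d{7}[A-Z]\b", re.IGNORECASE)


def find_nric_fin(text):
    """Return all substrings of text that look like an NRIC or FIN."""
    return NRIC_FIN_PATTERN.findall(text)


matches = find_nric_fin("ID S1234567A and nothing else")
```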

Main class returns with an error if output directory does not exist

As far as I can see, if the directory in the output path specified in the config file does not exist, the following error is raised:

FileNotFoundError: [Errno 2] No such file or directory: '<parent_directory>/anonymizer/output/comma_delimited_file_anonymized_.csv'

Adding a check that creates the directory if it does not exist would avoid this error.
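
A minimal sketch of such a check, using only the standard library. The helper name `ensure_parent_dir` is hypothetical and not part of the project.

```python
import os
import tempfile


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it is missing."""
    parent = os.path.dirname(file_path)
    if parent:
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs(parent, exist_ok=True)
    return file_path


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(
        tmp, "anonymizer", "output", "comma_delimited_file_anonymized_.csv"
    )
    with open(ensure_parent_dir(target), "w") as handle:
        handle.write("")
    created = os.path.exists(target)
```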

Changes in PIIDetector to trigger anonymisation actions

This story assumes that anonymisation happens right after detection, rather than stopping at the report generation step.

[SPIKE] - Comparing different libraries for PII detection

Compare support for :

basic PII like name, email, phone number, NRIC, etc
custom PII identification
localization (esp. Asia) - support for PDPA policies
free-text PII identification
Python

KPIs :

True positives vs false positives
Ease of use and configuration
Easy to extend
Subsets that are identified accurately

Tools to explore :

PII analyzer
CommonRegex
Spacy
FuzzyWuzzy
de-identify from Google
presidio from Microsoft

Requirements and findings consolidated in this document

Add an end to end integration test

Add a test that takes in csv data with PII in it and a json config, parses them and generates output data with no PII information

[SPIKE] Comparison of PII identification tools/libraries

Features to compare:

Support for basic PII like name, email, phone number, NRIC, etc
Support for custom PII identification
Support for localization (esp. Asia) - support for PDPA policies
Support for free text PII identification
Python preferably

KPIs

True positives vs false positives
Ease of use and configuration
Easy to extend
Subsets that are identified accurately

Tools to explore -

PII analyzer
CommonRegex
Spacy
FuzzyWuzzy
deidentify

Create a Lightweight Architecture Record

Update the context and provide sufficient documentation on how to set up and use this tool.
Also record why certain design decisions were made.

Migrate pandas to spark

acquire
analyze
anonymize
~~report~~
write

Acceptance criteria: output from spark is the same as from pandas previously

Make the columns to be scanned for PII analysis configurable

Add a section in the schema that can hold the information on whether a given column has to be taken into account for PII detection or not.
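
One way such a schema section could look is sketched below. The "columns" and "scan_for_pii" keys are illustrative assumptions, not the project's actual config format.

```python
# Hypothetical schema section: each column carries a flag that says
# whether it should be scanned for PII.
schema = {
    "columns": [
        {"name": "full_name", "scan_for_pii": True},
        {"name": "notes", "scan_for_pii": True},
        {"name": "order_total", "scan_for_pii": False},
    ]
}


def columns_to_scan(schema):
    """Return the names of columns flagged for PII detection.

    Columns without the flag are scanned by default (an assumption).
    """
    return [c["name"] for c in schema["columns"] if c.get("scan_for_pii", True)]


selected = columns_to_scan(schema)
```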

Support to run regex matchers against the parser output

Take parser output,
Split it if required in an efficient way
Run it against all the available regex matchers
Record the result whether a particular cell is PII or not
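
The steps above can be sketched with plain Python rows (for example, as produced by `csv.DictReader`). The two patterns and the `scan_rows` helper are simplified assumptions for illustration, not the project's matchers.

```python
import re

# Simplified, illustrative matchers; real ones would be stricter.
MATCHERS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "nric": re.compile(r"\b[STFG]\d{7}[A-Z]\b"),
}


def scan_rows(rows):
    """Return (row_index, column, matcher_name, token) for every hit."""
    findings = []
    for row_index, row in enumerate(rows):
        for column, value in row.items():
            # Split each cell on whitespace, then try every matcher.
            for token in str(value).split():
                for name, pattern in MATCHERS.items():
                    if pattern.search(token):
                        findings.append((row_index, column, name, token))
    return findings


rows = [
    {"name": "Alice", "note": "contact alice@example.com"},
    {"name": "Bob", "note": "id S1234567A on file"},
]
findings = scan_rows(rows)
```

Splitting on whitespace is the simplest choice; patterns that span spaces (such as formatted phone numbers) would need to run against the whole cell instead.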

Saving as parquet or csv

Allows specifying output file format to be csv or parquet

[Tech task] Log the report findings to an output file.

Support for anonymizer action 'hash'

As part of this story, we would build an anonymizer action that hashes PII elements in the column values. This might include a spike for choosing the default hashing algorithm.
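
A minimal sketch of a 'hash' action, assuming SHA-256 as the default algorithm (the story leaves that choice open) and an illustrative email pattern as the detector. The function names are hypothetical.

```python
import hashlib
import re

# Illustrative pattern only; the real detectors would supply the matches.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def hash_value(value, salt=""):
    """Return the SHA-256 hex digest of salt + value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def hash_matches(text, pattern):
    """Replace every match of pattern in text with its hash."""
    return pattern.sub(lambda m: hash_value(m.group(0)), text)


anonymized = hash_matches("mail alice@example.com now", EMAIL)
```

An unsalted hash of a low-entropy value such as a phone number can be reversed by brute force, so a salt or keyed hash is worth considering in the spike.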

Performance testing of pyspark code

Run the main.py file in the src_spark folder against datasets of increasing size, from 500 MB to 10 GB, and compare the results against runs of the pandas version in the src folder.

Add report generator functionality in spark migration

Done, with pandas used for formatting output.

Support for anonymizer action 'encrypt'

Handle null values

Running analyze on null values in the dataframe raises an exception and terminates the program.
Currently, the input file is checked for null values during the parsing stage, and the program terminates if any are found.
Ideally, the PiiDetectors should be able to handle null values themselves.
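
A sketch of how a detector could tolerate nulls, assuming they arrive as `None` or as a float NaN (which is how pandas represents missing values). The `contains_pii` helper and its pattern are illustrative, not the project's PiiDetector API.

```python
import math
import re

# Illustrative pattern only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def contains_pii(value):
    """Return True if value matches the pattern; nulls count as no PII."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return bool(EMAIL.search(str(value)))


results = [contains_pii(v) for v in [None, float("nan"), "a@b.com", "plain"]]
```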

Columns with PII data
Low level granularity - showing which cells have PII data, along with the values

Read and parse a delimited file

parse a delimited file
output as columns / pandas dataframe

Build a pool of regex matches for identifying personal contact information

Build a pool for identifying personal contact information

Phone number
Email
