thoughtworks-datakind / anonymizer
Library for identification, anonymization and de-anonymization of PII data
Design a contract for supporting different inputs and formats
Decide on a viable output format for the next step (e.g. a pandas DataFrame)
As an anonymizer user,
When I run the application,
Anonymized output needs to be created in the output directory with the name <input_file_name>_anonymized_.csv.
input_file_name here is obtained from the config json that is passed as an input to the application.
output_directory needs to be obtained from the config json file as a parameter in the 'anonymize' section.
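The naming rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the config keys `acquire.file_path` and `anonymize.output_file_path` are hypothetical names chosen for the example, and the real config layout may differ.

```python
import os


def build_output_path(config: dict) -> str:
    # Hypothetical key names; the actual config schema may differ.
    input_path = config["acquire"]["file_path"]
    output_dir = config["anonymize"]["output_file_path"]
    # Strip the directory and extension to get <input_file_name>.
    input_file_name = os.path.splitext(os.path.basename(input_path))[0]
    return os.path.join(output_dir, f"{input_file_name}_anonymized_.csv")
```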
Considerations:
Capability to identify following PII data related to Singaporeans
As far as I can see, if the directory in the output path specified in the config file does not exist, the application fails with an error like:
FileNotFoundError: [Errno 2] No such file or directory: '<parent_directory>/anonymizer/output/comma_delimited_file_anonymized_.csv'
Adding a check that creates the directory if it does not exist would avoid this error.
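One possible shape for such a check, assuming it runs just before the output file is written. The helper name `ensure_parent_directory` is hypothetical; `os.makedirs` with `exist_ok=True` is the standard-library call that creates missing directories without failing when they already exist.

```python
import os


def ensure_parent_directory(output_path: str) -> None:
    """Create the directory that will contain output_path, if it is missing."""
    parent = os.path.dirname(output_path)
    # A bare file name has no parent component; nothing to create then.
    if parent:
        os.makedirs(parent, exist_ok=True)
```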
This story assumes that anonymisation would happen right after detection and not stop at the report generation step
Compare support for :
KPIs :
Tools to explore :
Requirements and findings consolidated in this document
Add a test that takes in csv data with PII in it and a json config, parses them and generates output data with no PII information
Features to compare based on -
KPIs
Tools to explore -
Update the context and provide sufficient documentation on how to use and set up this tool.
Also document why certain design decisions were made.
Acceptance criteria: output from spark is the same as from pandas previously
Add a section in the schema that can hold the information on whether a given column has to be taken into account for PII detection or not.
Take parser output,
Split it if required in an efficient way
Run it against all the available regex matchers
Record the result whether a particular cell is PII or not
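The steps above could look roughly like this. The matcher names and patterns are hypothetical and deliberately simplified for illustration; they are not the project's detectors, and the NRIC pattern ignores checksum validation.

```python
import re
from typing import Dict, List

# Hypothetical, simplified patterns used only for this sketch.
MATCHERS: Dict[str, "re.Pattern[str]"] = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "nric": re.compile(r"\b[STFG]\d{7}[A-Z]\b"),
}


def detect_pii(rows: List[List[str]]) -> List[List[List[str]]]:
    """For every cell, record the names of the matchers that fired."""
    results = []
    for row in rows:
        results.append([
            [name for name, pattern in MATCHERS.items() if pattern.search(cell)]
            for cell in row
        ])
    return results
```

An empty list for a cell means no matcher flagged it as PII.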
Allows specifying output file format to be csv or parquet
As part of this story, we would build an anonymizer action that hashes out PII elements in the column values. This might include a spike for choosing the default Hashing algorithm
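A minimal sketch of such a hashing action, assuming SHA-256 as the algorithm (the story leaves the default open) and a hypothetical email pattern standing in for a real PII matcher.

```python
import hashlib
import re

# Hypothetical pattern standing in for a real PII matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def hash_pii(value: str, pattern: "re.Pattern[str]" = EMAIL) -> str:
    """Replace each match with the hex SHA-256 digest of the matched text."""
    return pattern.sub(
        lambda m: hashlib.sha256(m.group(0).encode("utf-8")).hexdigest(),
        value,
    )
```

Hashing keeps equal inputs mapped to equal outputs, so joins on the anonymized column still work. An unsalted hash of low-entropy data can be reversed by brute force, which is one thing the spike might weigh.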
Run the main.py file in the src_spark folder against datasets of increasing size, from 500 MB to 10 GB, and compare the results against runs of the pandas version in the src folder.
Done, with pandas used for formatting output.
Running analyze on null values in the dataframe raises an exception and forces the program to terminate.
Currently, the input file is checked for null values during parsing stage and the program terminates if null values are found.
Ideally, the PiiDetectors will be able to handle the null values.
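One way a detector could tolerate nulls is to treat them as "no PII" before any pattern is applied. The function names and the email pattern below are hypothetical; the sketch assumes nulls arrive either as `None` or as a float NaN, which is how pandas commonly represents missing values.

```python
import math
import re
from typing import Any

# Hypothetical pattern standing in for a real PII matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def is_null(cell: Any) -> bool:
    """True for None and for float NaN."""
    return cell is None or (isinstance(cell, float) and math.isnan(cell))


def cell_has_pii(cell: Any, pattern: "re.Pattern[str]" = EMAIL) -> bool:
    # Null cells cannot contain PII, so skip them instead of raising.
    if is_null(cell):
        return False
    return pattern.search(str(cell)) is not None
```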
When a glob is specified as an input, the anonymizer should ingest all files that match it.
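Resolving a glob into concrete input files can be done with the standard-library `glob` module. The helper name `resolve_inputs` is hypothetical.

```python
import glob
from typing import List


def resolve_inputs(pattern: str) -> List[str]:
    """Expand a glob pattern into a sorted list of matching file paths."""
    # sorted() makes the processing order deterministic across runs.
    return sorted(glob.glob(pattern))
```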
As part of this story, we would build an anonymizer action that deals with dropping of the PII elements in the column values.
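A drop action might simply remove every matched span from the cell value. This is a sketch under the assumption that "dropping" means deleting the matched text and keeping the rest of the cell; the pattern is a hypothetical stand-in for a real matcher.

```python
import re

# Hypothetical pattern standing in for a real PII matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def drop_pii(value: str, pattern: "re.Pattern[str]" = EMAIL) -> str:
    """Remove every span of the value that the pattern matches."""
    return pattern.sub("", value)
```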
Given the findings from the regex matchers, generate a user-friendly report that displays:
parse a delimited file
output as columns / pandas dataframe
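As a dependency-free illustration of the "output as columns" option, the sketch below parses a delimited file into a mapping of column name to values using the standard-library `csv` module. The function name is hypothetical, and a pandas-based parser would return a DataFrame instead.

```python
import csv
from typing import Dict, List, TextIO


def parse_delimited(handle: TextIO, delimiter: str = ",") -> Dict[str, List[str]]:
    """Read a delimited file with a header row into column-oriented lists."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    columns: Dict[str, List[str]] = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns
```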
Build a pool for identifying personal contact information