
thoughtworks-datakind / anonymizer


Library for identification, anonymization and de-anonymization of PII data

Shell 0.65% Python 99.35%
pii personal-identifiable-information python data-protection data-anonymization

anonymizer's People

Contributors

kyzers0ze, siddarthshahtw, siddharthlshah, sowmyasgkrishnan, wisuc-tw


anonymizer's Issues

Add Docker and CircleCI

  • Build the wheel package
  • Create a docker-compose file for the Spark cluster
  • Update the SparkSession builder to use the new configurations
  • Update the README with working commands to spark-submit to the cluster
  • Set up CircleCI
  • Run the e2e test

Provide the anonymized csv file in the output directory

As an anonymizer user,
when I run the application,
the anonymized output should be created in the output directory with the name <input_file_name>_anonymized_.csv.
input_file_name is taken from the config JSON that is passed as input to the application.
output_directory is read from the config JSON file as a parameter in the 'anonymize' section.

Considerations:

  1. If the input file is empty, an empty output file should still be created.
  2. If no PII is detected in the fields, the anonymized output should be identical to the input file.
  3. If an error occurs during anonymization, no file should be created and an appropriate error message should be shown to the user.
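The naming rule and the three considerations above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names and the `anonymize` callable are hypothetical, and the sketch assumes the output directory already exists.

```python
from pathlib import Path


def anonymized_output_path(input_file_path, output_directory):
    """Build <output_directory>/<input_file_name>_anonymized_.csv (hypothetical helper)."""
    stem = Path(input_file_path).stem
    return Path(output_directory) / f"{stem}_anonymized_.csv"


def write_anonymized(input_file_path, output_directory, anonymize):
    """Anonymize the whole input first, then write the result.

    `anonymize` is a hypothetical callable that takes the CSV text and
    returns the anonymized CSV text. Because the output file is only
    opened after `anonymize` returns, an exception raised during
    anonymization leaves no output file behind.
    """
    source_text = Path(input_file_path).read_text()
    anonymized_text = anonymize(source_text)  # may raise; nothing written yet
    target = anonymized_output_path(input_file_path, output_directory)
    target.write_text(anonymized_text)  # an empty input yields an empty file
    return target
```

With an `anonymize` callable that finds no PII and returns its input unchanged, the output file has the same content as the input, which matches consideration 2.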

Main class returns with an error if output directory does not exist

As far as I can see, if the directory in the output path specified in the config file does not exist, the application fails with an error like:

FileNotFoundError: [Errno 2] No such file or directory: '<parent_directory>/anonymizer/output/comma_delimited_file_anonymized_.csv'

Adding a check that creates the output directory if it does not exist would avoid this error.
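One possible shape for such a check, assuming the output path is available as a string or path object before writing (the helper name is hypothetical):

```python
from pathlib import Path


def ensure_output_directory(output_file_path):
    """Create the parent directory of the output file if it is missing.

    `parents=True` creates intermediate directories as needed, and
    `exist_ok=True` makes the call a no-op when the directory is
    already there.
    """
    directory = Path(output_file_path).parent
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Calling this helper just before the CSV is written would mean a missing output directory no longer raises FileNotFoundError.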

[SPIKE] - Comparing different libraries for PII detection

Compare support for:

  1. basic PII like name, email, phone number, NRIC, etc
  2. custom PII identification
  3. localization (esp. Asia) - support for PDPA policies
  4. free-text PII identification
  5. Python

KPIs:

  1. True positives vs false positives
  2. Ease of use and configuration
  3. Easy to extend
  4. Subsets that are identified accurately

Tools to explore:

  1. PII analyzer
  2. CommonRegex
  3. spaCy
  4. FuzzyWuzzy
  5. de-identify from Google
  6. Presidio from Microsoft

Requirements and findings consolidated in this document

[SPIKE] Comparison of PII identification tools/libraries

Features to compare:

  1. Support for basic PII like name, email, phone number, NRIC, etc
  2. Support for custom PII identification
  3. Support for localization (esp. Asia) - support for PDPA policies
  4. Support for free text PII identification
  5. Python preferably

KPIs

  1. True positives vs false positives
  2. Ease of use and configuration
  3. Easy to extend
  4. Subsets that are identified accurately

Tools to explore:

  1. PII analyzer
  2. CommonRegex
  3. spaCy
  4. FuzzyWuzzy
  5. deidentify

Migrate pandas to spark

  • acquire
  • analyze
  • anonymize
  • report
  • write

Acceptance criteria: the output from Spark is identical to the output previously produced by pandas.
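A rough way to check this criterion is to compare the two CSV outputs while ignoring row order, since Spark does not guarantee the order of written rows. The sketch below assumes both outputs have been read into strings and that each has a header row; the function name is hypothetical.

```python
import csv
import io
from collections import Counter


def rows_match(pandas_csv_text, spark_csv_text):
    """Return True when both CSV texts share a header and the same multiset of rows."""
    pandas_rows = list(csv.reader(io.StringIO(pandas_csv_text)))
    spark_rows = list(csv.reader(io.StringIO(spark_csv_text)))
    if not pandas_rows or not spark_rows:
        # Two empty outputs match; an empty and a non-empty output do not.
        return pandas_rows == spark_rows
    if pandas_rows[0] != spark_rows[0]:
        return False
    # Counter keeps duplicate rows distinct, unlike a plain set comparison.
    return Counter(map(tuple, pandas_rows[1:])) == Counter(map(tuple, spark_rows[1:]))
```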

Support for anonymizer action 'hash'

As part of this story, we will build an anonymizer action that hashes the PII elements found in column values. This may include a spike to choose the default hashing algorithm.
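A minimal sketch of what a 'hash' action could look like, using Python's standard hashlib. SHA-256 is only an assumed default here, since the issue leaves the algorithm choice to a spike, and the function names and the `pii_matches` argument are hypothetical.

```python
import hashlib
import re


def hash_value(pii_text, algorithm="sha256"):
    """Return the hex digest of a single PII string."""
    return hashlib.new(algorithm, pii_text.encode("utf-8")).hexdigest()


def hash_pii_in_text(text, pii_matches, algorithm="sha256"):
    """Replace every detected PII substring in `text` with its hash.

    `pii_matches` is assumed to be the list of substrings a detector
    flagged. A single regex pass is used so that an inserted digest is
    never scanned again for later matches.
    """
    unique = sorted({m for m in pii_matches if m}, key=len, reverse=True)
    if not unique:
        return text
    pattern = re.compile("|".join(re.escape(m) for m in unique))
    return pattern.sub(lambda m: hash_value(m.group(0), algorithm), text)
```

An unsalted hash of low-entropy values such as phone numbers can be reversed by brute force, so the spike may also want to weigh salted or keyed hashing.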

Performance testing of pyspark code

Run the main.py file in the src_spark folder against datasets of increasing size, from 500 MB to 10 GB, and compare the results against runs of the pandas implementation in the src folder.
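A small timing harness along these lines could collect the comparison. It is only a sketch: `pandas_run` and `spark_run` are hypothetical callables that each take a dataset path and run the corresponding pipeline, and wall-clock time is the only metric recorded.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def compare_runs(dataset_paths, pandas_run, spark_run, repeats=3):
    """Time both implementations on each dataset and return one record per dataset."""
    results = []
    for path in dataset_paths:
        results.append(
            {
                "dataset": path,
                "pandas_seconds": time_call(pandas_run, path, repeats=repeats),
                "spark_seconds": time_call(spark_run, path, repeats=repeats),
            }
        )
    return results
```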

Handle null values

Running analyze on null values in the dataframe raises an exception and terminates the program.
Currently, the input file is checked for null values during the parsing stage, and the program terminates if any are found.
Ideally, the PiiDetectors should handle null values themselves.
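One way a detector could tolerate nulls is to treat them as containing no PII and return early. The sketch below uses a simplified, hypothetical email detector rather than the project's PiiDetectors, and assumes nulls arrive as either None or a float NaN.

```python
import math
import re

# Deliberately simple pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def is_null(value):
    """Treat None and float NaN as null values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def detect_emails(value):
    """Return the email-like substrings in `value`, or an empty list for nulls."""
    if is_null(value):
        return []
    return EMAIL_PATTERN.findall(str(value))
```

With this guard in the detector, the up-front null check during parsing would no longer need to terminate the program.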
