
anonymize-it

A general utility for anonymizing data

anonymize-it can be run as a script that accepts a config file specifying the source type, anonymization mappings, and destination, and assembles an anonymizer pipeline. Individual pipeline components can also be imported into any Python program that wishes to anonymize data.

Currently, anonymize-it supports two methods of anonymization:

  1. Faker-based: Relies on providers from Faker to perform masking of fields. This method is suitable for one-off anonymization use cases, where correlation between data obtained from different sources (indices/clusters) is not necessary.

E.g.:

>>> from faker import Faker
>>> f = Faker()
>>> f.file_path()
'/break/Congress.json'
  2. Hash-based: Uses a unique user/customer ID as a salt to anonymize fields. This method is suitable when anonymization needs to be performed regularly and/or when correlation of data from different sources is crucial.

E.g.: a user wants to anonymize network events and process events stored in two separate indices, but still wants to correlate all activity for a particular host after anonymization (a minimal sketch follows).
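
A minimal sketch of the hash-based idea, assuming a user-supplied salt and using Python's standard hashlib (the project's actual implementation may differ):

import hashlib

def hash_mask(value, salt):
    """Deterministically mask a value: the same salt + value always
    yields the same digest, so correlations survive anonymization."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# The same host name maps to the same token across indices and runs.
salt = "customer-1234"
assert hash_mask("web-server-01", salt) == hash_mask("web-server-01", salt)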

Disclaimer

anonymize-it is intended to serve as a tool that replaces real data values with plausible artificial ones while retaining the semantics of the data. It is not intended to satisfy the anonymization requirements of GDPR policies, but rather to aid pseudonymization efforts. There may also be collisions in high-cardinality datasets when using the Faker implementation.

Instructions for use

Installation

This must be run in a virtual environment with the correct dependencies installed. These are enumerated in requirements.txt.

Install virtualenv globally:

[sudo] pip install virtualenv

Create a virtualenv and install the dependencies of anonymize-it:

virtualenv -p python3 venv
source venv/bin/activate
pip install -r requirements.txt

and run:

python anonymize.py configs/config.json

Quick Start

anonymize.py is reproduced below to walk through a simple anonymization pipeline.

First load and parse the config file.

config_file = sys.argv[1]
config = read_config(config_file) # opens json file and stores as python dict
config = utils.parse_config(config) # utility function for parsing configuration and setting variables

Then, create the reader as defined in the configuration. reader_mapping is a dispatcher that maps human-readable reader types (e.g. elasticsearch) to reader classes (e.g. ESReader()).

reader = reader_mapping[config.source['type']]
reader = reader(config.source['params'], config.masked_fields, config.suppressed_fields)

Next, create the writer in the same way.

writer = writer_mapping[config.dest['type']]
writer = writer(config.dest['params'])

Finally, create an anonymizer by passing the reader and writer instances and run anonymize().

anon = Anonymizer(reader=reader, writer=writer)
anon.anonymize()

Creating your own anonymizer pipeline

An anonymizer requires a reader and a writer. Currently, only an elasticsearch reader readers.ESReader() and a filesystem writer writers.FSWriter() are provided.

readers

Creating an instance of a reader requires the following:

  • a source object, which contains parameters about the source. Note that each reader class requires a different set of parameters; consult the docstrings for specifics.
  • masked_fields, a dictionary mapping each field name to be masked to the faker provider used for masking when using faker-based anonymization, e.g. {"user.name": "user_name", "user.email": "email"}. When using the hash-based implementation, masked_fields is simply a list of field names to be masked, e.g. ["user.name", "user.email"].
  • suppressed_fields, a list of fields that should NOT be included in the anonymized output at all.

masked_fields is required by the reader because the reader is responsible for enumerating the distinct values of each field, which are used as a lookup table when masking values in the faker-based anonymization.

suppressed_fields is required by the reader because these fields are explicitly excluded from the search query.

Readers must implement the following methods:

  • get_data(), which is responsible for returning data from the source and passing it to the anonymizer.
  • (If using Faker-based anonymization) create_mappings(), which is responsible for generating a dictionary to be used by the anonymizer object. The dictionary is structured as follows:
    {
      "field.1": {
          "val1.1": None,
          "val1.2": None,
          ...,
          "val1.n": None
        },
      "field.2": {
          "val2.1": None,
          "val2.2": None,
          ...,
          "val2.m": None
        }
    }

where field.1 and field.2 are the fields to be anonymized, and val1.1, val1.2, etc. are the distinct values of each field.
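
A minimal, self-contained reader sketch; the class name and constructor signature are illustrative assumptions, and the real abstract base class lives in readers.py:

# Illustrative only: an in-memory reader. The real interface is defined
# by the base reader class in readers.py.
class ListReader:
    def __init__(self, docs, masked_fields, suppressed_fields):
        self.docs = docs
        self.masked_fields = masked_fields
        self.suppressed_fields = suppressed_fields

    def get_data(self):
        # Yield documents with suppressed fields dropped.
        for doc in self.docs:
            yield {k: v for k, v in doc.items() if k not in self.suppressed_fields}

    def create_mappings(self):
        # Enumerate distinct values per masked field; the anonymizer later
        # fills the None placeholders with faker substitutes.
        mappings = {field: {} for field in self.masked_fields}
        for doc in self.docs:
            for field in self.masked_fields:
                if field in doc:
                    mappings[field][doc[field]] = None
        return mappings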

writers

Creating an instance of a writer requires the following:

  • A dest object, which contains parameters about the destination. Note that each writer class requires a different set of parameters; consult the docstrings for specifics.

Writers must implement the following methods:

  • write_data(), which sends anonymized data to the destination (a minimal sketch follows).
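
A matching writer sketch, with the same caveat that the class name is an illustrative assumption; the "directory" param mirrors the filesystem writer config described below:

import json
import os

# Illustrative only: a writer that appends anonymized documents to a
# newline-delimited JSON file in the configured directory.
class NDJSONWriter:
    def __init__(self, params):
        self.directory = params["directory"]

    def write_data(self, docs):
        os.makedirs(self.directory, exist_ok=True)
        with open(os.path.join(self.directory, "out.ndjson"), "a") as f:
            for doc in docs:
                f.write(json.dumps(doc) + "\n")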

Run as Script

anonymizers

python anonymize.py configs/config.json

config.json defines the work to be done; see the template file at configs/config.json for guidance (an illustrative example follows the list):

  • source defines the location of the original data to be anonymized, along with the type of reader that should be invoked.
    • source.type: a reader type. one of:
      • "elasticsearch"
      • "csv" (TBD)
      • "json" (TBD)
    • source.params: parameters allowing for access of data. specific to the reader type.
      • "elasticsearch":
        • host
        • index
        • use_ssl
        • auth (native auth; optional)
  • dest defines the location where the data should be written
    • dest.type: a writer type. one of:
      • "filesystem"
      • "csv" (TBD)
      • "elasticsearch" (TBD)
    • dest.params: parameters allowing for writing of data. specific to the writer type.
      • "filesystem":
        • directory: directory to write output json files
  • anonymization: the type of anonymization, i.e. faker or hash
  • include: the fields to mask, along with the faker provider to use in the case of faker-based anonymization. This is a dict with entries like {"field.name": "faker.provider.mask"}; see the Faker documentation for available providers. For hash-based anonymization, this can be a list of fields to be masked, like ["field.name"].
  • exclude: specific fields to exclude
  • sensitive: included fields (apart from the masked fields) that should not be completely replaced by a faker/hash substitute, but should be searched for sensitive information
  • include_rest: {true|false} if true, all fields except excluded fields will be written. if false, only fields specified in masks will be written.
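
For orientation, here is an illustrative faker-based config assembled from the fields above; the host, index, field names, and providers are placeholders, and configs/config.json remains the authoritative template:

{
  "source": {
    "type": "elasticsearch",
    "params": {
      "host": "https://localhost:9200",
      "index": "events-*",
      "use_ssl": true
    }
  },
  "dest": {
    "type": "filesystem",
    "params": {
      "directory": "out/"
    }
  },
  "anonymization": "faker",
  "include": {
    "user.name": "user_name",
    "user.email": "email"
  },
  "exclude": ["user.password"],
  "sensitive": ["message"],
  "include_rest": false
}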

Important notes for Faker-based anonymization

  1. Set the provider_map class attribute for the Anonymizer class, which is a dict with entries like {"field.name": self.faker.provider.mask}. Refer to anonymizers.py for a test configuration of provider_map.
  2. If the fields being anonymized have high cardinality, set the high_cardinality_fields class attribute for the Anonymizer class, which is a dict with entries like {"field.name": [self.faker.provider.mask(10) for _ in range(10)]}. A sketch of both attributes follows.
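
A sketch of both attributes, assuming faker is a plain Faker() instance (in the real Anonymizer they are set as class attributes referencing self.faker; see anonymizers.py):

from faker import Faker

faker = Faker()

# Entries map field names to bound faker provider methods.
provider_map = {
    "user.name": faker.user_name,
    "user.email": faker.email,
}

# For high-cardinality fields, pre-generate a fixed pool of substitutes
# instead of tracking every distinct value individually.
high_cardinality_fields = {
    "url.full": [faker.uri() for _ in range(1000)],
}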

Important notes for hash-based anonymization

  1. The user should have the monitor privilege for the Elastic environment in which the anonymization is run.
  2. If you are a Cloud user and want to perform hash-based anonymization, you'll need to create an API key in the Elasticsearch Service Console and provide it as input when prompted. To create an API key, follow the instructions in the Elastic Cloud documentation.

In addition to the above settings, for more fine-grained control over the anonymization, you can also set the following class attributes for Anonymizer (illustrated after the list):

  1. user_regexes, which is a dict with entries like {"regex.name": "regex"}. These regexes are used to redact PII (beyond secrets, which are already handled) from the sensitive fields.
  2. keywords, which is a list like ["keyword1", "keyword2"]. Documents containing any of the keywords in any of the sensitive fields are dropped.
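
Illustrative values for both attributes (the regexes and keywords here are examples only):

import re

# Named regexes used to redact PII from sensitive fields.
user_regexes = {
    "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
}

# Documents mentioning any of these in a sensitive field are dropped.
keywords = ["confidential", "do-not-share"]

# Conceptual redaction pass over one field value:
text = "contact admin@example.com"
for name, pattern in user_regexes.items():
    text = re.sub(pattern, "[" + name + "-REDACTED]", text)
print(text)  # contact [email-REDACTED]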

Adding Masks

For the faker-based anonymization, the anonymizer class only knows how to use providers that are enumerated in the provider_map class attribute. If you would like to add support for new faker providers, please add entries to this dict.

Adding Readers

Readers can be added to readers.py: extend the base reader class, implement all abstract methods, and add a new entry to reader_mapping.

Adding Writers

Writers can be added to writers.py: extend the base writer class, implement all abstract methods, and add a new entry to writer_mapping.

General Notes

https://stackoverflow.com/questions/17486578/how-can-you-bundle-all-your-python-code-into-a-single-zip-file

Running Tests

To run the unit tests,

  1. Create a virtual environment and install dependencies in requirements.txt
  2. Execute py.test from the top-level repository directory


anonymize-it's Issues

[Discuss] Supporting hash-based anonymization

The current Faker-based anonymization implementation works for one-off anonymization use cases. However, it has the following drawbacks in cases where users might want to correlate anonymized data collected periodically, via multiple queries, or across multiple sources (clusters):

  • Faker mappings are not persistent across multiple runs if the order in which the aggregated results appear while creating the mappings changes between runs. E.g.: if we want to correlate login and network activity for a particular host in a given environment, and the user makes two separate queries for login and network events, we can't be sure that the faker mapping for fields like hostname will be maintained across both queries
  • Faker mappings might not be persistent across queries run on multiple clusters

An alternative to Faker-based anonymization for such data-correlation use cases is a hash-based anonymization technique wherein the anonymizer accepts a "password" configured by the user. The user always uses the same password for all future anonymization runs. Fields to be anonymized are replaced by the sha256 of a combination of the password and the field value.

Discussion items:

  • What happens if the user forgets the password? Are there other ways in the Elastic framework to get a static but unique representation of a user/environment so as to avoid human error?
  • Do we want to support both anonymization techniques or just hash based? How do we go about this?

Anonymizer failing since no aggregations in query

When the query in the anonymizer doesn't include an aggregation, the anonymizer fails because the aggregation result will be null (a defensive sketch follows the traceback):

Traceback (most recent call last):
  File "anonymize.py", line 41, in <module>
    anon.anonymize(infer=True, include_rest=config.include_rest)
  File "/Users/christophercutajar/Documents/Tools/anonymize-it/anonymize_it/anonymizers.py", line 161, in anonymize
    self.field_maps = self.reader.create_mappings(high_cardinality_fields=self.high_cardinality_fields)
  File "/Users/christophercutajar/Documents/Tools/anonymize-it/anonymize_it/readers.py", line 126, in create_mappings
    term = response['aggregations']["my_buckets"]['buckets'][-1]['key'][field]
IndexError: list index out of range
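
One possible guard, assuming the response shape shown in the traceback (a sketch mirroring the names in readers.py, not a tested patch):

def last_bucket_term(response, field):
    # Walk the aggregation safely instead of indexing blindly.
    buckets = (
        response.get("aggregations", {})
        .get("my_buckets", {})
        .get("buckets", [])
    )
    if not buckets:
        raise ValueError(
            "query returned no aggregation buckets; "
            "ensure the query defines the expected composite aggregation"
        )
    return buckets[-1]["key"][field]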

Elasticsearch Destination

Hey guys,

Do you have any idea whether this integration will be developed? In my use case, I want to anonymize some fields of an index/index-pattern but ingest the result back into the same index. Basically, my anonymization should be time-based per document.

Chunk Output of FSWriter

Split the output of FSWriter into smaller .json files. Potentially, allow this to be set as a parameter in the config file.

API Key Auth Support

Support API key authentication, in addition to username and password, for connecting to the Elasticsearch cluster.

Add necessary GCP dependencies to requirements.txt and remove unused GCP deps

Currently, the code in the anonymize-it writers module imports the google package for interfacing with Google Cloud. Grepping through the codebase, we find the following two dependencies:

grep -R --exclude-dir=env google ./*
./anonymize_it/writers.py:from google.oauth2 import service_account
./anonymize_it/writers.py:from google.cloud import storage

The storage object is used once in the writers module

grep -R --exclude-dir=env storage ./*
./anonymize_it/writers.py:from google.cloud import storage
./anonymize_it/writers.py:        self.client = storage.Client()

whereas service_account appears to be unused

grep -R --exclude-dir=env service_account ./*
./anonymize_it/writers.py:from google.oauth2 import service_account

It seems that we could remove service_account from the imports and add the google-cloud-storage package to requirements.txt.
As for the tests, it would probably not be ideal to call into GCP, so we can simply mock out the storage.Client.get_bucket method to return a dummy object with the same interface the code expects (see the sketch below).

# example usage of google.cloud.storage in the codebase
grep -R --exclude-dir=env self.client ./*
./anonymize_it/writers.py:        self.client = storage.Client()
./anonymize_it/writers.py:        self.bucket = self.client.get_bucket(self.bucket)

Edit: the test methods do not actually call into Google Cloud (the only writer that is tested is the local file system writer), so there are no test-time calls into external systems :)

For reference:
google-cloud-storage package
https://googleapis.github.io/google-cloud-python/latest/storage/client.html
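
A sketch of the suggested mock, assuming anonymize_it.writers is importable in the test environment; patching storage.Client where writers.py imports it means nothing reaches GCP:

from unittest import mock

def test_writer_uses_mocked_bucket():
    with mock.patch("anonymize_it.writers.storage.Client") as client_cls:
        dummy_bucket = mock.MagicMock()
        client_cls.return_value.get_bucket.return_value = dummy_bucket
        # Code under test that calls storage.Client().get_bucket(...) inside
        # this block receives dummy_bucket instead of a real GCP bucket.
        assert client_cls().get_bucket("any-name") is dummy_bucket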

Dependency cytoolz fails to install on Mac OS Mojave with a Python 3.7.2 virtualenv

Hi everyone,
(documenting this in case any of our users comes across this issue)
I was testing a local installation of anonymize-it on Mojave with virtualenv and Python 3.7.2, and the installation of the dependency cytoolz failed with the following message:

  cytoolz/dicttoolz.c:8277:65: error: too many arguments to function call, expected 3, have 4
        return (*((__Pyx_PyCFunctionFast)meth)) (self, args, nargs, NULL);
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                     ^~~~
    /Library/Developer/CommandLineTools/usr/lib/clang/10.0.0/include/stddef.h:105:16: note: expanded from macro 'NULL'
    #  define NULL ((void*)0)
                   ^~~~~~~~~~
    cytoolz/dicttoolz.c:9053:21: error: no member named 'exc_type' in 'struct _ts'
        *type = tstate->exc_type;
                ~~~~~~  ^

This appears to be the same as the issue with spaCy here
explosion/spaCy#2490

The solution suggested by users in the spaCy thread is to downgrade to Python 3.6 or update cytoolz to 0.9.x.
I'm going to test to see if bumping the cytoolz dependency will work and open a PR to update the requirements.txt file.

Unable to install without changing requirements

It is currently impossible to install with the current requirements: thinc 8.0.15 depends on murmurhash<1.1.0 and >=1.0.2, while murmurhash==0.28.0 is explicitly required.

Changing the requirements.txt record for thinc to thinc==8.0.13 seems to fix the issue.

Make Anonymizer a Pipeline

an anonymizer should be a pipeline, i.e.:

Anonymizer = Reader + Mapper + Writer

The Reader gets data from the source.
The Mapper creates value mappings and applies them to the data to be anonymized.
The Writer writes data to the destination.

e.g.:

anonymizer = Anonymizer([
    ('es', ESReader()),
    ('faker', FakerMapper()),
    ('fs', FSWriter())
])

anonymizer.get_data(source.params).map_data(mapper.params).write_data(dest.params)
