microsoft / presidio-research

This package features data-science related tasks for developing new recognizers for Presidio. It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models.

License: MIT License

Languages: Python 67.57%, Jupyter Notebook 32.43%
Topics: natural-language-processing, nlp, pii, privacy, spacy, flair, named-entity-recognition, ner, deep-learning, machine-learning

presidio-research's Introduction

Presidio-research

This package features data-science related tasks for developing new recognizers for Presidio. It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models. In addition, it contains a fake data generator which creates fake sentences based on templates and fake PII.

Who should use it?

  • Anyone interested in developing or evaluating PII detection models, an existing Presidio instance or a Presidio PII recognizer.
  • Anyone interested in generating new data based on previous datasets or sentence templates (e.g. to increase the coverage of entity values) for Named Entity Recognition models.

Getting started

Note: Presidio evaluator requires Python>=3.9

From PyPI

conda create --name presidio python=3.9
conda activate presidio
pip install presidio-evaluator

# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg

From source

To install the package:

  1. Clone the repo
  2. Install all dependencies, preferably in a virtual environment:
# Install package+dependencies
pip install poetry
poetry install --with=dev

# To install with all additional NER dependencies (e.g. Flair, Stanza, CRF), run:
# poetry install --with='ner,dev'

# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg

# Verify installation
pytest

Note that some dependencies (such as Flair and Stanza) are not automatically installed to reduce installation complexity.

What's in this package?

  1. Fake data generator for PII recognizers and NER models
  2. Data representation layer for data generation, modeling and analysis
  3. Multiple Model/Recognizer evaluation files (e.g. for spaCy, Flair, CRF, Presidio API, Presidio Analyzer Python package, specific Presidio recognizers)
  4. Training and modeling code for multiple models
  5. Helper functions for results analysis

1. Data generation

See Data Generator README for more details.

The data generation process receives a file with templates, e.g. "My name is {{name}}". It then creates new synthetic sentences by sampling templates and PII values. It also tokenizes the data and creates tags (in the IO, BIO or BILUO scheme) and spans for the newly created samples.
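
The process described above can be illustrated with a small standard-library sketch. It is not the package's generator: the PII_VALUES pools, fill_template and bio_tags are hypothetical names, and tokenization is plain whitespace splitting, chosen only to show how placeholders become spans and BIO tags.

```python
import random
import re

# Hypothetical PII pools; the real generator draws values from Faker providers.
PII_VALUES = {
    "name": ["Allison Hill", "Philip Jessen"],
    "city": ["Seattle", "Lisbon"],
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template, rng):
    """Replace each {{entity}} placeholder and record the span of the inserted value."""
    parts, spans, cursor, out_len = [], [], 0, 0
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        value = rng.choice(PII_VALUES[match.group(1)])
        parts.append(literal)
        out_len += len(literal)
        spans.append(
            {"type": match.group(1), "value": value,
             "start": out_len, "end": out_len + len(value)}
        )
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


def bio_tags(text, spans):
    """Whitespace-tokenize the text and assign BIO tags from character spans."""
    tags = []
    for token in re.finditer(r"\S+", text):
        tag = "O"
        for span in spans:
            if span["start"] <= token.start() < span["end"]:
                prefix = "B" if token.start() == span["start"] else "I"
                tag = f"{prefix}-{span['type'].upper()}"
        tags.append((token.group(), tag))
    return tags


rng = random.Random(0)
text, spans = fill_template("My name is {{name}} and I live in {{city}}.", rng)
print(text)
print(spans)
print(bio_tags(text, spans))
```

Spans are recorded against the output text, not the template, so they stay valid however long the sampled values are.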

Once data is generated, it can be split into train/test/validation sets while ensuring that each template exists in only one set. See this notebook for more details.
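
A template-aware split can be sketched as follows. This is an illustration under assumptions, not the package's own splitting code: samples are plain dicts, the template_id key and the split_by_template function are hypothetical names, and ratios are applied to the number of templates, not the number of samples.

```python
import random
from collections import defaultdict


def split_by_template(samples, ratios, seed=0):
    """Assign whole templates to splits so no template appears in two splits.

    `samples` is a list of dicts with a `template_id` key (hypothetical name);
    `ratios` maps split name -> fraction of templates.
    """
    by_template = defaultdict(list)
    for sample in samples:
        by_template[sample["template_id"]].append(sample)

    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    splits = {name: [] for name in ratios}
    names = list(ratios)
    start = 0
    for i, name in enumerate(names):
        # The last split takes whatever is left, so rounding never drops a template.
        if i == len(names) - 1:
            end = len(template_ids)
        else:
            end = start + round(ratios[name] * len(template_ids))
        for template_id in template_ids[start:end]:
            splits[name].extend(by_template[template_id])
        start = end
    return splits


samples = [{"template_id": i % 10, "text": f"sample {i}"} for i in range(50)]
splits = split_by_template(samples, {"train": 0.7, "test": 0.2, "validation": 0.1})
for name, part in splits.items():
    print(name, len(part))
```

Splitting on templates instead of individual samples keeps near-identical sentences (same template, different PII values) from landing on both sides of the train/test boundary.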

2. Data representation

In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see data_objects.py.

The standardized structure, List[InputSample], can be translated into different formats:

  • CONLL
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
conll = InputSample.create_conll_dataset(dataset)
conll.to_csv("dataset.csv", sep="\t")
  • spaCy v3
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
  • Flair
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
flair = InputSample.create_flair_dataset(dataset)
  • json
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.to_json(dataset, output_file="dataset_json")

3. PII models evaluation

The presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model, or a specific PII recognizer. It reports precision and recall and supports error analysis.
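
To make the metrics concrete, here is a minimal token-level sketch. It is a simplified stand-in and not the presidio-evaluator Evaluator API: precision_recall, token_errors and the tag lists are illustrative names and data.

```python
from collections import Counter


def precision_recall(gold_tags, predicted_tags, entity):
    """Token-level precision and recall for one entity type."""
    counts = Counter()
    for gold, predicted in zip(gold_tags, predicted_tags):
        if predicted == entity and gold == entity:
            counts["tp"] += 1
        elif predicted == entity:
            counts["fp"] += 1
        elif gold == entity:
            counts["fn"] += 1
    predicted_total = counts["tp"] + counts["fp"]
    gold_total = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_total if predicted_total else 0.0
    recall = counts["tp"] / gold_total if gold_total else 0.0
    return precision, recall


def token_errors(tokens, gold_tags, predicted_tags):
    """List (token, gold, predicted) triples where the prediction is wrong."""
    return [
        (token, gold, predicted)
        for token, gold, predicted in zip(tokens, gold_tags, predicted_tags)
        if gold != predicted
    ]


tokens = ["My", "name", "is", "Allison", "Hill", "from", "Seattle"]
gold = ["O", "O", "O", "PERSON", "PERSON", "O", "LOCATION"]
pred = ["O", "O", "O", "PERSON", "O", "PERSON", "LOCATION"]

print(precision_recall(gold, pred, "PERSON"))
print(token_errors(tokens, gold, pred))
```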

Examples:

4. Training PII detection models

CRF

To train a vanilla CRF on a new dataset, see this notebook. To evaluate, see this notebook.

spaCy

To train a new spaCy model, first save the dataset in a spaCy format:

# dataset is a List[InputSample]
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")

To evaluate, see this notebook

Flair

  • To train Flair models, see this helper class or this snippet:
from presidio_evaluator.models import FlairTrainer
train_samples = "data/generated_train.json"
test_samples = "data/generated_test.json"
val_samples = "data/generated_validation.json"

trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)

corpus = trainer.read_corpus("")
trainer.train(corpus)

Note that the three json files are created using InputSample.to_json.

For more information

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Copyright notice:

Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.

presidio-research's People

Contributors

bdolor, benkilimnik, coderunrepeat, diwu1989, exilit, gillesdami, gustavz, kirkins, melmatlis, microsoft-github-operations[bot], microsoftopensource, msebragge, navalev, omri374, prvenk, robbie-palmer, tranguyen221

presidio-research's Issues

Generator template question

Is there an example showing how to make a provider that maps to a string for replacement in a template?

For example, if I have a snippet like this

class HospitalProvider(BaseProvider):
    def __init__(
        self,
        generator,
    ):
        super().__init__(generator=generator)
        self.hospital_name = "My Test Hospital"

fake = RecordsFaker(records=pd.DataFrame(), locale="en_US")

sentence_templates = [
    "My name is {{HOSPITAL}}",
    ]

fake.add_provider(HospitalProvider)

How do I tell the generator to replace every {{HOSPITAL}} placeholder with, in this case, My Test Hospital?

Integrate evaluation capabilities for PII Column Identification in tables or JSONs with presidio-structured

I would like to propose the enhancement of the presidio-research repository by introducing functionalities that enable the evaluation of how accurately columns in a table or a JSON containing Personally Identifiable Information (PII) are identified, utilizing the capabilities of the newly introduced package presidio-structured.

A starting point could be to simply assess the precision, recall, and F1 score of PII column identification.
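
As a rough illustration of that starting point, column-level scores can be computed from two sets of column names. This sketch does not use presidio-structured; column_scores and the example column names are hypothetical.

```python
def column_scores(true_pii_columns, predicted_pii_columns):
    """Precision, recall and F1 over sets of column names flagged as PII."""
    true_set = set(true_pii_columns)
    predicted_set = set(predicted_pii_columns)
    true_positives = len(true_set & predicted_set)

    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions for one table.
scores = column_scores(
    true_pii_columns=["name", "email", "phone"],
    predicted_pii_columns=["name", "email", "zip_code"],
)
print(scores)
```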

DataGenerator - templates containing square brackets treated as entities to be replaced

Templates that contain square brackets fail when generating new fake data. Square brackets are converted to curly brackets and then treated as entities, i.e. [entity] becomes "{{entity}}". The method _prep_templates is responsible for the replacement.
Is this the desired behavior?

How to reproduce the error:


from faker import Faker
fake = RecordsFaker(fake_data_df, locale="en_US")
data_generator = PresidioDataGenerator(
    custom_faker=fake, lower_case_ratio=0
)

provider_list = [
    HospitalProvider,
    IpAddressProvider,
    NationalityProvider,
    AgeProvider,
    AddressProviderNew,
    PhoneNumberProviderNew,
    OrganizationProvider,
]
for provider in provider_list:
    fake.add_provider(provider)

template = "My name is {{name}} and I [travel] to the factory"


fake_records = data_generator.generate_fake_data(
    templates=[template], n_samples=1
)
list(fake_records)

Returns:

AttributeError: Failed to generate fake data based on template "My name is {{name}} and I {{travel}} to the factory". You might need to add a new Faker provider! Unknown formatter 'travel'
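
The failure can be illustrated without the library. The sketch below assumes the conversion is a simple regex substitution; brackets_to_placeholders is a hypothetical stand-in, not the actual _prep_templates implementation.

```python
import re


def brackets_to_placeholders(template):
    """Hypothetical stand-in for the conversion described above: [x] -> {{x}}."""
    return re.sub(r"\[(\w+)\]", r"{{\1}}", template)


def placeholders(template):
    """Return the entity names a template asks the generator to fill."""
    return re.findall(r"\{\{(\w+)\}\}", template)


template = "My name is {{name}} and I [travel] to the factory"
converted = brackets_to_placeholders(template)

print(placeholders(template))
print(converted)
print(placeholders(converted))
```

After the conversion, the literal word in square brackets is indistinguishable from a real placeholder, so the generator looks for a provider with that name.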

Bug in PresidioAnalyzerWrapper: 'ORGANIZATION' is not removed by default

The method _update_recognizers_based_on_entities_to_keep adds the "ORGANIZATION" entity:

Add ORGANIZATION as it is removed by default
"""
supported_entities = analyzer_engine.get_supported_entities(
    language=self.language
)

if "ORGANIZATION" in self.entities and "ORGANIZATION" not in supported_entities:

But it is actually not removed by default, which can be checked by running:

"ORGANIZATION" in AnalyzerEngine().get_supported_entities("en")
True  # output

Using Flair Embeddings?

Thanks for your work on this project. We are trying to reproduce it locally and possibly expand on it for a graduate project.

We've been able to run all the notebooks for evaluating models except for Flair. We ran into some issues with embeddings being required in Evaluate flair models.ipynb.

flair_bert_embeddings = '../../models/presidio-ner/flair-bert-embeddings.pt'
glove_flair_embeddings = '../../models/presidio-ner/flair-embeddings.pt'

I was thinking we might be able to import them directly from flair with something like:

from flair.embeddings import WordEmbeddings, BertEmbeddings

flair_bert_embeddings = BertEmbeddings()
glove_flair_embeddings = WordEmbeddings('glove')

But this also resulted in an error.

Do you know where we can download these embedding files?

Handle Unmapped Faker Entity Types

Suggested by @melmatlis in #50 (comment)

The goal is to avoid the library breaking when Faker adds new entity types.

"Perhaps we can add to the faker_to_presidio mapping an additional default value 'other' for cases when a new unmapped key appears, then the mapping will become:"

                span.type = self.faker_to_presidio_entity_type.get(span.type,'other')

These unknown types would also need to be tracked and replaced in the template.
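
A sketch of the suggested fallback, including tracking of the unmapped types, might look like this. The mapping below is a hypothetical subset and map_span_types is an illustrative helper, not code from the library.

```python
# Hypothetical subset of a Faker-to-Presidio mapping.
faker_to_presidio_entity_type = {"name": "PERSON", "address": "LOCATION"}


def map_span_types(spans, mapping, default="other"):
    """Translate span types in place and return the set of unmapped Faker types."""
    unmapped = set()
    for span in spans:
        if span["type"] not in mapping:
            unmapped.add(span["type"])
        # dict.get falls back to the default instead of raising KeyError.
        span["type"] = mapping.get(span["type"], default)
    return unmapped


spans = [
    {"type": "name", "value": "Allison Hill"},
    {"type": "new_faker_type", "value": "something"},
]
unmapped = map_span_types(spans, faker_to_presidio_entity_type)
print(spans)
print(unmapped)
```

Returning the unmapped types gives the caller a way to log them or to decide how the corresponding template placeholders should be handled.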

Not possible to add extra params in analyze in PresidioAnalyzerWrapper

In the predict method of the PresidioAnalyzerWrapper class, it is not possible to pass extra parameters when calling analyze on its analyzer_engine:

def predict(self, sample: InputSample) -> List[str]:
    results = self.analyzer_engine.analyze(
        text=sample.full_text,
        entities=self.entities,
        language=self.language,
        score_threshold=self.score_threshold,
    )

This could be allowed by, for example, adding kwargs to the predict method:

    def predict(self, sample: InputSample, **kwargs) -> List[str]:

        results = self.analyzer_engine.analyze(
            text=sample.full_text,
            entities=self.entities,
            language=self.language,
            score_threshold=self.score_threshold,
            **kwargs,
        )

Question on Training / Evaluation Data

Hello again. In the last issue I opened, you mentioned that for training Flair we might want to use our own data.

This seems like good advice to me, as I've realized that there are only 125 templates included in the data generator. Thus the 300 samples used for training the model and the 300 samples used in evaluation inevitably include many near duplicates, for example:

The name in the account is not correct, please change it to Philip Jessen

The name in the account is not correct, please change it to Alexandra Dalgety

I believe this large overlap in data invalidates the evaluation scores for the custom trained Flair and spaCy models which use it (for example, we have one Flair model which scores over 98.5%, which seems too high).

My colleague is convinced that because this data was included as the default in a Microsoft repository that it must be ok to use as is.

Could you help me confirm what I'm thinking: that we need to use our own data or add more templates if we want to accurately assess our trained models?

Entity value is changed instead of entity type when translating tags in an input sample

There is a bug in the method translate_input_sample_tags of InputSample: when a span's entity_type is not in the provided dictionary and ignore_unknown is True, the method sets the span's entity_value to 'O' instead of its entity_type. As a result, these spans are not removed from the InputSample object; only their values are overwritten with 'O'. See:

# Translate spans
for span in self.spans:
    if span.entity_type in dictionary:
        span.entity_type = dictionary[span.entity_type]
    elif ignore_unknown:
        span.entity_value = "O"
# Remove spans if they were changed to "O"
self.spans = [span for span in self.spans if span.entity_type != "O"]
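
One possible fix is to set entity_type, not entity_value, so that the filter on the last line actually drops the unknown spans. The sketch below is standalone: the Span dataclass and translate_spans function are simplified stand-ins for the library's classes, written only to show the intended behavior.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Simplified stand-in for the library's span object."""
    entity_type: str
    entity_value: str


def translate_spans(spans, dictionary, ignore_unknown):
    """Translate entity types; drop unknown ones when ignore_unknown is True."""
    for span in spans:
        if span.entity_type in dictionary:
            span.entity_type = dictionary[span.entity_type]
        elif ignore_unknown:
            # Mark the type (not the value) so the filter below removes the span.
            span.entity_type = "O"
    return [span for span in spans if span.entity_type != "O"]


spans = [Span("PER", "Allison Hill"), Span("TITLE", "Dr.")]
kept = translate_spans(spans, {"PER": "PERSON"}, ignore_unknown=True)
print(kept)
```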

Export the trained data and import it into Presidio

I have been using presidio_analyzer on my local host.

I used presidio-research to generate a dataset with Fake and then ran the train/run/dev.

The output is 3 JSON files.
I used the same generated dataset to generate the .spacy file.

Now my question is how to integrate the trained data into my local presidio_analyzer so that it runs with the trained data.

I feel like the integration steps are missing ^^

Python 3.11 Support

Presidio works with Python v3.11, but Presidio Research does not install.

Installation fails because this package requires sklearn_crfsuite, which requires python-crfsuite, which throws an error on install related to its C bindings.

This should be in the process of being fixed, but two related improvements could be made in this repo:

  • Add Python 3.11 as a tested version in ci.yml
  • Utilise extras_require in setup.py for optional dependencies
    • So, bugs like this in downstream dependencies only affect people who want those dependencies. E.g. those using spacy vs stanza vs crf vs flair vs azure text analytics models only need to install a subset of the dependencies, and their install will succeed even if there are bugs related to the other model families

How can I use flair like an nlp-engine?

Hi,

I'm working with Presidio using a spaCy model, and I would like to test how Flair works with my data.
I'm following the steps defined in another issue to improve my detection rates.

For the spacy model, I use:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

configuration_spacy_es = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "es", "model_name": "es_core_news_lg"}],  # es_core_news_lg, es_dep_news_trf
}
provider = NlpEngineProvider(nlp_configuration=configuration_spacy_es)
nlp_engine_with_spanish = provider.create_engine()

analyzer_spacy = AnalyzerEngine(
    nlp_engine=nlp_engine_with_spanish,
    supported_languages=["es"],
)

I then use "analyzer_spacy" to add new recognizers and analyze data.

I tried to do something similar with the Flair model, without success.
How can I create a new nlp_engine for a Flair model and create an "analyzer_flair" engine to use in Presidio?

Thanks for your help.

Why translate tags in CRF model in predict method?

The tag translation in the predict method of the CRF model is performed twice, as can be seen in these lines:

sample.translate_input_sample_tags()

conll = sample.to_conll(translate_tags=True)

Also, translating labels here is inconsistent, since none of the other models translate labels in their predict method. Why does the CRF model do so?

Questions about evaluation

Hi, I'm starting to dive into Presidio and it's a great tool! I plan to expand the functionalities to detect other types of information; do you think I could use your evaluation code to test how well they work? And do I need to build a specific dataset for that? Thanks

Fine-tuning flair model

Can you provide more details on the fine tuning approach here that is discussed in the README? It says that Flair models can be trained with trainer.train(corpus), but no such method exists. Also, how would you take an existing flair model and fine tune it with FlairTrainer? An example would be great! Thanks!

Master branch differs from published PyPI version

TextAnalyticsWrapper is not exported in the current published version, 0.1.0. The following code fails when executing against the published version, but works fine when building from source:

from pathlib import Path
from copy import deepcopy
from pprint import pprint
from collections import Counter

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator, ModelError
from presidio_evaluator.models import TextAnalyticsWrapper
from presidio_evaluator.experiment_tracking import get_experiment_tracker
import pandas as pd

def main():
    print('hello world')

if __name__ == "__main__":
    main()

As a workaround, you can install from source with pip install git+https://github.com/microsoft/presidio-research.git. There are no versions or tags, so this builds whatever the current state of the master branch is, but you can always fork the repo and tag it yourself.

Cannot run presidio analyzer main

Hi,

First, congratulations on the great work.
I'm looking to benchmark my solution against Presidio, and in doing so I tried to run presidio_evaluator\presidio_analyzer.py.
However, it did not work as expected, for 2 reasons:

  • The import from presidio_analyzer ... could not resolve properly since the file itself is already named presidio_analyzer.
  • The main part was not updated even though the structure of some key objects changed.

I opened a pull request solving both issues: #16

Some punctuation such as hyphens (-) and slashes (/) gets removed while training the Flair model

When I tried to train the Flair model, some punctuation such as hyphens (-), slashes (/) and so on got removed during training. Could you please help me overcome this issue? I am using this code https://github.com/microsoft/presidio-research/blob/master/models/flair_train.py for training the Flair model.

For example, when I trained the date entity with the format 03-03-2020, detection works when I test with 03 03 2020 but does not work when I test with 03-03-2020.

How to load OntoNotes?

Hi,
I'm probably in the wrong place to ask this question because it is not about Presidio, but how do I conveniently load OntoNotes?
In one of the notebooks, there is a code block like this:

## Download OntoNotes data
ontonotes = ""

I have downloaded OntoNotes and there are a lot of files. I haven't yet found an easy way to load OntoNotes in a format that can be used by the rest of the notebook, so if anyone can shed some light, that would be awesome.

Thanks in advance
Alex

Faker based data generator should output spans of fake entities

Example:

from faker import Faker
from presidio_evaluator.data_generator.faker_extensions import SpanGenerator

generator = SpanGenerator()
faker = Faker(generator=generator)

pattern = "My name is {{name}} and i live in {{address}}."

res = faker.parse(pattern, add_spans=True)

print(res.spans)

[{"value": "819 Johnson Course\nEast William, OH 26563", "start": 38, "end": 79, "type": "address"},
{"value": "Allison Hill", "start": 11, "end": 23, "type": "name"}]

Data Generator produces empty values labelled as LOCATION

Here's a snippet of data I get from using the data generator. The problem is that it includes invalid entities.

{
        "full_text": "The title refers to  Street in Cite Ezzitoun 2. It was on this street that many of the clubs where Metallica first played were situated. \"Battery is found in me\" shows that these early shows on Rue de Ouerdanine Street were important to them. Battery is where \"lunacy finds you\" and you \"smash through the boundaries.\"",
        "masked": null,
        "spans": [
            {
                "entity_type": "LOCATION",
                "entity_value": "",
                "start_position": 20,
                "end_position": 20
            },
            {
                "entity_type": "LOCATION",
                "entity_value": "Cite Ezzitoun 2",
                "start_position": 31,
                "end_position": 46
            },
...

The very first span emitted by the generator is an empty LOCATION entity whose start_position is the same as its end_position. This pattern has been showing up repeatedly in my generated data.

This is the code I was using to generate the data. Since there is some randomness involved, you may be able to reproduce the issue.

from presidio_evaluator.data_generator.main import generate

import datetime

EXAMPLES = 500
SPAN_TO_TAG = True  # Whether to create tokens + token labels (tags)
TEMPLATES_FILE = 'presidio_evaluator/data_generator/raw_data/templates.txt'
KEEP_ONLY_TAGGED = False
LOWER_CASE_RATIO = 0.1
IGNORE_TYPES = None

cur_time = datetime.date.today().strftime("%B_%d_%Y")
OUTPUT = "data/generated_size_{}_date_{}.json".format(EXAMPLES, cur_time)

fake_pii_csv = 'presidio_evaluator/data_generator/' \
               'raw_data/FakeNameGenerator.com_3000.csv'
utterances_file = TEMPLATES_FILE
dictionary_path = None

generate(fake_pii_csv=fake_pii_csv,
         utterances_file=utterances_file,
         dictionary_path=dictionary_path,
         output_file=OUTPUT,
         lower_case_ratio=LOWER_CASE_RATIO,
         num_of_examples=EXAMPLES,
         ignore_types=IGNORE_TYPES,
         keep_only_tagged=KEEP_ONLY_TAGGED,
         span_to_tag=SPAN_TO_TAG)
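
Until the root cause is fixed, empty spans like the one above could be filtered out after generation. This is a workaround sketch operating on plain dicts shaped like the JSON snippet in this issue; drop_empty_spans is a hypothetical helper, not part of the package.

```python
def drop_empty_spans(sample):
    """Return a copy of the sample without zero-length or blank-valued spans."""
    kept = [
        span
        for span in sample["spans"]
        if span["entity_value"].strip()
        and span["end_position"] > span["start_position"]
    ]
    return {**sample, "spans": kept}


sample = {
    "full_text": "The title refers to  Street in Cite Ezzitoun 2.",
    "spans": [
        {"entity_type": "LOCATION", "entity_value": "",
         "start_position": 20, "end_position": 20},
        {"entity_type": "LOCATION", "entity_value": "Cite Ezzitoun 2",
         "start_position": 31, "end_position": 46},
    ],
}
cleaned = drop_empty_spans(sample)
print(cleaned["spans"])
```

This only hides the symptom: the template that produced the empty value still yields a sentence with a gap in it, so the generator itself would need the actual fix.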

Default installation fails pytest

Several tests fail with the following error:

Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

This is a clean virtual environment based on Python 3.9 following the "From Source" instructions. 113 tests pass (5 are skipped) after running python -m spacy download en_core_web_sm.

I will send a PR to clarify the documentation.
