Comments (7)
Hi @Idomingog, Presidio only supports spaCy and Stanza as NLP engines. However, you can use Flair as an additional recognizer within Presidio. In this scenario, Presidio would call the Flair recognizer the same way it calls any other recognizer and extract the results coming from Flair.
See some details here:
- https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/#leverage-frameworks-other-than-spacy-or-stanza-for-ml-based-pii-detection
- https://microsoft.github.io/presidio/analyzer/adding_recognizers/#creating-a-remote-recognizer
- Example third-party recognizer. Similar code could be used to translate the analyzer input into Flair's format and back.
from presidio-research.
Thanks for your advice. I have added a new Flair recognizer to my project.
Thanks! If you'd like to add this as a sample to the main repo, this would be a great contribution and I would be happy to help with reviewing or anything else.
Hi,
I'm very happy to share the class, but it's not "finished". It works for my needs, but there are a couple of things to improve.
- The code doesn't accept a language parameter; it's set up for Spanish only.
- I couldn't work out how to handle the Flair entity tags [MISC, ORG, PER, LOC]. I don't know where to find them; I only found the raw ones like E-LOC, I-LOC, S-LOC, etc. I tried defining them as a list, but with neither option could I use them in the analyzer.analyze(text, language, entities, score_threshold) call. Selecting entities doesn't work.
- I'm not sure this is the best way to load the model.
- There are surely other issues that I haven't noticed.
Best regards.
from typing import List, Optional

from flair.data import Sentence
from flair.models import SequenceTagger
from presidio_analyzer import (
    AnalysisExplanation,
    AnalyzerEngine,
    EntityRecognizer,
    RecognizerResult,
)
from presidio_analyzer.nlp_engine import NlpArtifacts


class EsFlairRecognizer(EntityRecognizer):
    def __init__(
        self,
        supported_language: str = "es",
        supported_entities: Optional[List[str]] = None,
        ner_strength: float = 0.85,
        name: str = "esflairRecognizer",
        version: str = "0.1",
        model: Optional[SequenceTagger] = None,
    ):
        self.ner_strength = ner_strength
        # Use the model passed in; otherwise load the Spanish NER model.
        self.model = model or SequenceTagger.load("flair/ner-spanish-large")
        # supported_entities is currently ignored: the list is read from the model.
        # The base class stores name, version, language and entities.
        super().__init__(
            supported_entities=self.get_supported_entities(),
            supported_language=supported_language,
            name=name,
            version=version,
        )

    def get_supported_entities(self) -> List[str]:
        """
        Supported Entities by flair/ner-spanish-large model.
        :return: List of the supported entities.
        """
        return self.model.tag_dictionary.get_items()  # ['E-LOC', 'I-LOC', 'S-LOC', ....]

    def load(self) -> None:
        """No loading is required here; the model is loaded in __init__."""
        pass

    def analyze(
        self,
        text: str,
        entities: Optional[List[str]] = None,
        nlp_artifacts: Optional[NlpArtifacts] = None,
    ) -> List[RecognizerResult]:
        """
        Analyze text using Flair.
        :param text: The text for analysis.
        :param entities: Not working properly for this recognizer.
        :param nlp_artifacts: Not used by this recognizer.
        :return: The list of Presidio RecognizerResult constructed from the recognized
        Flair detections.
        """
        sentence = Sentence(text)
        self.model.predict(sentence)
        return [
            self._convert_to_recognizer_result(categorized_entity)
            for categorized_entity in sentence.get_spans("ner")
        ]

    def _convert_to_recognizer_result(self, categorized_entity) -> RecognizerResult:
        entity_type = categorized_entity.tag
        score = round(categorized_entity.score, 2)
        explanation = EsFlairRecognizer._build_explanation(
            original_score=score,
            entity_type=entity_type,
        )
        return RecognizerResult(
            entity_type=entity_type,
            start=categorized_entity.start_pos,
            end=categorized_entity.end_pos,
            score=score,
            analysis_explanation=explanation,
        )

    @staticmethod
    def _build_explanation(
        original_score: float, entity_type: str
    ) -> AnalysisExplanation:
        """
        Create explanation for why this result was detected.
        :param original_score: Score given by this recognizer
        :param entity_type: Entity type assigned by the Flair model
        :return: The AnalysisExplanation for this result
        """
        return AnalysisExplanation(
            # EsFlairRecognizer.__class__.__name__ would evaluate to "type".
            recognizer=EsFlairRecognizer.__name__,
            original_score=original_score,
            textual_explanation=f"Identified as {entity_type} by Flair Recognizer",
        )
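On the entity-tag problem listed above, one possible cause is that the tag dictionary holds per-token tags (E-LOC, I-LOC, S-LOC, ...) while each detected span carries a plain label such as LOC, so the declared entities never match what is requested or returned. The following is a minimal sketch of collapsing those tags into labels and filtering results by the requested entities. It is plain Python with stand-in data; `entity_labels`, `filter_spans` and the tuple layout are hypothetical helpers, not part of Presidio or Flair, and the special items skipped (`O` and anything starting with `<`) are an assumption about what the dictionary contains.

```python
from typing import Iterable, List, Optional, Tuple

# Prefixes used by BIOES-style tagging schemes (e.g. "S-LOC", "E-LOC").
BIOES_PREFIXES = ("B-", "I-", "E-", "S-")

# Stand-in for a detected span: (label, start, end, score).
Span = Tuple[str, int, int, float]


def entity_labels(tag_items: Iterable[str]) -> List[str]:
    """Collapse tags such as 'S-LOC' or 'E-LOC' into unique labels such as 'LOC'."""
    labels = set()
    for tag in tag_items:
        # Assumption: "O" and items like "<unk>" are not real entity labels.
        if tag == "O" or tag.startswith("<"):
            continue
        labels.add(tag[2:] if tag.startswith(BIOES_PREFIXES) else tag)
    return sorted(labels)


def filter_spans(spans: Iterable[Span], requested: Optional[List[str]]) -> List[Span]:
    """Keep only spans whose label was requested; keep everything if nothing was."""
    if not requested:
        return list(spans)
    wanted = set(requested)
    return [span for span in spans if span[0] in wanted]


if __name__ == "__main__":
    tags = ["<unk>", "O", "S-LOC", "B-LOC", "E-LOC", "I-LOC", "S-PER", "B-MISC", "S-ORG"]
    print(entity_labels(tags))
    spans = [("PER", 0, 5, 0.99), ("LOC", 15, 21, 0.95)]
    print(filter_spans(spans, ["LOC"]))
```

In the recognizer, the collapsed labels could be what `get_supported_entities` returns, and the same filter could be applied to the spans inside `analyze` before converting them to results.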
This is great! I do have some suggestions to generalize this code. If you create a PR on the Presidio repo, I'd be happy to provide specific comments and improvements. Would that work?
If you do, consider putting it under the samples folder, as we would rather not have flair (and torch) as a dependency for Presidio at this point in time.
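As an illustration of what generalizing beyond Spanish might involve, here is a rough sketch of a per-language model cache that loads each model once on first use. It is not the code from the PR. `ModelCache` and `MODEL_NAMES` are hypothetical names; only `flair/ner-spanish-large` appears in this thread, and the English model name is an assumption. The loader is injected so the sketch runs without Flair installed; in a real recognizer it would presumably be `SequenceTagger.load`.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical language-to-model mapping. Only the Spanish entry comes from
# the thread; the English entry is an assumption for illustration.
MODEL_NAMES: Dict[str, str] = {
    "es": "flair/ner-spanish-large",
    "en": "flair/ner-english-large",
}


class ModelCache:
    """Load one model per language, lazily, and reuse it afterwards."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        model_names: Optional[Dict[str, str]] = None,
    ):
        self._loader = loader
        self._model_names = dict(MODEL_NAMES if model_names is None else model_names)
        self._models: Dict[str, Any] = {}

    def get(self, language: str) -> Any:
        if language not in self._model_names:
            raise ValueError(f"No model configured for language '{language}'")
        if language not in self._models:
            # First request for this language: load and remember the model.
            self._models[language] = self._loader(self._model_names[language])
        return self._models[language]


if __name__ == "__main__":
    # A fake loader stands in for SequenceTagger.load so this runs anywhere.
    cache = ModelCache(loader=lambda name: f"model:{name}")
    print(cache.get("es"))
    print(cache.get("es") is cache.get("es"))
```

Loading lazily keeps the expensive model download out of import time, and keeping the mapping in one dictionary makes it easier to add languages without touching the recognizer itself.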
Hi, I just put an example with the code in the samples folder and opened a PR.
Thanks for your help.
Thanks!! I took a quick look and it looks great. We'll do a more formal review in the next few days.