aphp / eds-pseudo

EDS-Pseudo is a hybrid model for detecting personally identifying entities in clinical reports

Home Page: https://aphp.github.io/eds-pseudo

License: Other

Python 100.00%
nlp pseudonymisation edsnlp

EDS-Pseudo

The EDS-Pseudo project aims to detect identifying entities in clinical documents. It was primarily tested on clinical reports from AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of edsnlp and is a hybrid model (rule-based + deep learning), for which we provide rules (eds-pseudo/pipes) and a training recipe (train.py).

We also provide some fictitious templates (templates.txt) and a script to generate a synthetic dataset (generate_dataset.py).

The entities that are detected are listed below.

Label           Description
ADRESSE         Street address, e.g. 33 boulevard de Picpus
DATE            Any absolute date other than a birthdate
DATE_NAISSANCE  Birthdate
HOPITAL         Hospital name, e.g. Hôpital Rothschild
IPP             Internal AP-HP identifier for patients, displayed as a number
MAIL            Email address
NDA             Internal AP-HP identifier for visits, displayed as a number
NOM             Any last name (patients, doctors, third parties)
PRENOM          Any first name (patients, doctors, etc.)
SECU            Social security number
TEL             Any phone number
VILLE           Any city
ZIP             Any zip code

Downloading the public pre-trained model

The public pretrained model is available on the HuggingFace model hub at AP-HP/eds-pseudo-public and was trained on synthetic data (see generate_dataset.py). You can also test it directly on the demo.

  1. Install the latest version of edsnlp

    pip install "edsnlp[ml]" -U

  2. Get access to the model at AP-HP/eds-pseudo-public

  3. Create and copy a huggingface token https://huggingface.co/settings/tokens?new_token=true

  4. Register the token (only once) on your machine

    import huggingface_hub

    YOUR_TOKEN = "..."  # paste the token created in step 3
    huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)

  5. Load the model

    import edsnlp
    
    nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
    doc = nlp(
        "En 2015, M. Charles-François-Bienvenu "
        "Myriel était évêque de Digne. C’était un vieillard "
        "d’environ soixante-quinze ans ; il occupait le "
        "siège de Digne depuis 2006."
    )
    
    for ent in doc.ents:
        print(ent, ent.label_, str(ent._.date))

To apply the model on many documents using one or more GPUs, refer to the documentation of edsnlp.
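Once entities have been detected, a typical next step is to replace each span with a placeholder. The following is a minimal sketch of that step, not part of eds-pseudo itself: the `mask_entities` helper and the `(start, end, label)` tuples are hypothetical stand-ins for what you would read from `doc.ents` (for example `ent.start_char`, `ent.end_char` and `ent.label_`).

```python
from typing import Iterable, Tuple

# (start_char, end_char, label) triples: a hypothetical stand-in for doc.ents
Entity = Tuple[int, int, str]


def mask_entities(text: str, entities: Iterable[Entity]) -> str:
    """Replace each entity span with a <LABEL> placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            # Skip a span that overlaps one that was already masked
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "En 2015, M. Myriel était évêque de Digne."
entities = [(3, 7, "DATE"), (12, 18, "NOM"), (35, 40, "VILLE")]
masked = mask_entities(text, entities)
print(masked)
# En <DATE>, M. <NOM> était évêque de <VILLE>.
```

Working on character offsets rather than tokens keeps the original whitespace and punctuation untouched around each replaced span.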

Installation to reproduce

If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:

git clone https://github.com/aphp/eds-pseudo.git
cd eds-pseudo

Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager like Poetry.

poetry install

How to use without machine learning

import edsnlp

nlp = edsnlp.blank("eds")

# Some text cleaning
nlp.add_pipe("eds.normalizer")

# Various simple rules
nlp.add_pipe(
    "eds_pseudo.simple_rules",
    config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
)

# Address detection
nlp.add_pipe("eds_pseudo.addresses")

# Date detection
nlp.add_pipe("eds_pseudo.dates")

# Contextual rules (requires a dict of info about the patient)
nlp.add_pipe("eds_pseudo.context")

# Apply it to a text
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_)

# 2015 DATE
# Charles-François-Bienvenu NOM
# Myriel PRENOM
# 2006 DATE
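To give an idea of what a rule-based detector does, here is a minimal regex sketch for two of the labels. The patterns and the `find_entities` helper are illustrative assumptions, not the rules shipped in `eds_pseudo.simple_rules`, which are more complete.

```python
import re

# Illustrative patterns only: NOT the ones shipped in eds_pseudo.simple_rules
PATTERNS = {
    "MAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # French-style phone number: 0X followed by four pairs of digits
    "TEL": re.compile(r"\b0[1-9](?:[ .-]?\d{2}){4}\b"),
}


def find_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


sample = "Contact : jean.valjean@example.org ou 01 23 45 67 89."
for start, end, label, value in find_entities(sample):
    print(value, label)

# jean.valjean@example.org MAIL
# 01 23 45 67 89 TEL
```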

How to train

Before training a model, you should update the configs/config.cfg and pyproject.toml files to fit your needs.

Put your data in the data/dataset folder (or edit the paths in the configs/config.cfg file to point to data/gen_dataset/train.jsonl).
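Before training, it can help to sanity-check the annotated files. The sketch below is only an illustration: the field names `note_text`, `entities`, `start`, `end` and `label` are assumptions about the JSONL schema (check the documentation for the exact format), and `check_record` is a hypothetical helper. The label set is the one from the entity table above.

```python
import json

# Labels listed in the entity table above
LABELS = {
    "ADRESSE", "DATE", "DATE_NAISSANCE", "HOPITAL", "IPP", "MAIL", "NDA",
    "NOM", "PRENOM", "SECU", "TEL", "VILLE", "ZIP",
}


def check_record(record):
    """Return a list of problems found in one annotated document."""
    problems = []
    text = record["note_text"]  # assumed field name
    for ent in record["entities"]:  # assumed field name
        if ent["label"] not in LABELS:
            problems.append(f"unknown label {ent['label']!r}")
        if not 0 <= ent["start"] < ent["end"] <= len(text):
            problems.append(f"bad offsets {ent['start']}-{ent['end']}")
    return problems


# Two in-memory JSONL lines standing in for the content of a dataset file
lines = [
    json.dumps({"note_text": "Vu le 12/03/2015.",
                "entities": [{"start": 6, "end": 16, "label": "DATE"}]}),
    json.dumps({"note_text": "Dr Myriel",
                "entities": [{"start": 3, "end": 20, "label": "PERSONNE"}]}),
]
for number, line in enumerate(lines, start=1):
    for problem in check_record(json.loads(line)):
        print(f"line {number}: {problem}")
```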

Then, run the training script:

python scripts/train.py --config configs/config.cfg --seed 43

This will train a model and save it in artifacts/model-last. You can evaluate it on the test set (defaults to data/dataset/test.jsonl) with:

python scripts/evaluate.py --config configs/config.cfg

To package it, run:

python scripts/package.py

This will create a dist/eds-pseudo-aphp-***.whl file that you can install with pip install dist/eds-pseudo-aphp-***.whl.

You can use it in your code:

import edsnlp

# Either from the model path directly
nlp = edsnlp.load("artifacts/model-last")

# Or from the wheel file
import eds_pseudo_aphp

nlp = eds_pseudo_aphp.load()

Documentation

Visit the documentation (https://aphp.github.io/eds-pseudo) for more information!

Publication

Please find our publication at the following link: https://doi.org/mkfv.

If you use EDS-Pseudo, please cite us as below:

@article{eds_pseudo,
  title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
  author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
  journal={Methods of Information in Medicine},
  year={2024},
  publisher={Georg Thieme Verlag KG}
}

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.

Contributors

bdura, percevalw

eds-pseudo's Issues

Feature request: [feature] RPPS anonymization

Feature type

Additional entity

Description

👋 Congrats on the repo, this is a crucial topic! Do you plan to add RPPS (doctors' national identifiers) to the entity list?

Preventing the identification of health professionals in health data increases patients' privacy, and it also protects the professionals' own privacy!

Missing files

Description

It appears that the repository is missing two files: infer.py and train.py. These files are mentioned in the project.yml file.
Have they been renamed, or is there another way to train the eds-pseudo module than what is mentioned in the docs?

How to reproduce the bug

After following the quickstart tutorial from the github.io site, I can't train the model on my dataset.

The error I get when running dvc repro is:

ERROR: failed to reproduce 'xp': [Errno 2] No such file or directory: 'eds-pseudo/scripts/infer.py'

I suspect the same error would occur with the train.py file.

Your Environment

  • Operating System: Ubuntu
  • Python Version Used: 3.10
  • EDS-Pseudo Version Used: 0.2.0

Feature request: date normalizer & format parser

Feature type

Once a date has been extracted by the ML NER module (e.g., eds.ner_crf), it still needs to be normalized and its format extracted, so that it can be shifted before being replaced in the pseudonymized report.

The normalizer should assign a date attribute (edsnlp's AbsoluteDate) and a date format (either strftime's %d %m %Y or Java's dd-mm-yyyy, compatible with pendulum). The choice of format syntax is not trivial: the C standard date format (strftime) does not support case modifiers (a month written in full is always rendered in title case), nor numbers written out in letters.
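To illustrate why both the normalized date and its format are needed, here is a minimal sketch using only Python's standard `datetime` module. The `shift_date` helper is hypothetical and is not the proposed normalizer: it assumes the format string is already known and purely numeric.

```python
from datetime import datetime, timedelta


def shift_date(text: str, fmt: str, days: int) -> str:
    """Parse `text` with a strftime format, shift it, and render it back."""
    parsed = datetime.strptime(text, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)


print(shift_date("12/03/2015", "%d/%m/%Y", 30))
# 11/04/2015
print(shift_date("2015-03-12", "%Y-%m-%d", -12))
# 2015-02-28
```

Reusing the same format for parsing and rendering keeps the shifted date visually consistent with the original. It does not cover the cases raised above, such as an upper-case month name or a number written out in letters, which strftime cannot express.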
