
philter-ucsf's People

Contributors

beaunorgeot, dependabot[bot], kmuenzen, paulheider, redchrists, tschaffter


philter-ucsf's Issues

--outputformat "asterisk" not producing deidentified text

I installed the package with pip, then ran the command as given in the README:

python -m philter_ucsf -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat "asterisk"

None of the PHI identified in tags was replaced by asterisks.

This happens because main.py, on line 70, sets the output format to i2b2 whenever the prod flag is set. Once I changed this to asterisk, the PHI was removed as expected.

Why does the output format need to be set when prod = True?

An alternative would be to check whether the config provides an output format and, if so, use it; otherwise fall back to i2b2.
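
A minimal sketch of that fallback, not the project's actual code. The function name `resolve_output_format` and the `SUPPORTED_FORMATS` tuple are hypothetical; only the two format names that appear in this issue are assumed to exist.

```python
# Hypothetical helper: names here are illustrative, not taken from main.py.
SUPPORTED_FORMATS = ("i2b2", "asterisk")  # assumption: the two formats mentioned in this issue


def resolve_output_format(requested, default="i2b2"):
    """Return the requested output format, or the default if none was given."""
    if not requested:
        # Nothing was passed on the command line or in the config: fall back.
        return default
    if requested not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {requested!r}")
    return requested
```

With this shape, the prod flag would no longer need to overwrite the format: `resolve_output_format("asterisk")` keeps the user's choice, and `resolve_output_format(None)` still yields i2b2.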

Issues with pathology terms

Here are examples where the current version corrupts pathology terms:

Molecular markers

Ki-67 shows a high proliferative index in the poorly differentiated region (which accounts for 50-75% of the tumor).
===>
*****-67 shows a high proliferative index in the poorly  differentiated region (which accounts for 50-75% of the tumor).
Results in the tumor cell nuclei are:
MLH1 expression:  Present.       
MSH2 expression:  Present.
MSH6 expression:  Present.
PMS2 expression:  Present.
===>
Results in the tumor cell nuclei are:
***** expression:  Present. 
***** expression:  Present. 
***** expression:  Present.
***** expression:  Present.

Tumour staging

AJCC/UICC stage:  pT3N2b.
===> 
- AJCC/UICC stage: *****.

Positive lymph nodes

These counts are misinterpreted as dates:

4.  Metastatic  adenocarcinoma in eleven of sixteen lymph nodes (11/16) 
and four satellite tumor nodules; see comment.
===>
4.Metastatic adenocarcinoma in eleven of sixteen lymph nodes (July 30) 
and four satellite tumor nodules; see comment.

Slide and Cassette numbers

CASSETTES:  Representative sections are submitted as follows:
B1:          Proximal surgical margin.
B2:          Distal surgical margin.
B3:          Appendix, cross-sections and distal tip, longitudinal.
B4-B5:          Surgical margins, separate small bowel segment.
B6:          Separate small bowel. 
B7:          Proximal terminal ileum.
B8:          Middle terminal ileum.
B9:          Distal terminal ileum.
B10:          Ascending colon polyps proximal to mass (smaller inked
blue).
B11-B12:     Closest approach of mass to radial surgical margin, full
thickness.
B13-B15:     Mass.

becomes:

CASSETTES:
Representative sections are submitted as follows:
B1: Proximal surgical margin.
B2: Distal surgical margin.
*****: Appendix, cross-sections and distal tip, longitudinal.
*****-*****: Surgical margins, separate small bowel segment.
B6: Separate small bowel.
*****: Proximal terminal ileum.
*****: Middle terminal ileum.
*****: Distal terminal ileum.
*****: Ascending colon polyps proximal to mass (smaller inked blue).
*****-B12: Closest approach of mass to radial surgical margin, full thickness.
*****-*****: Mass

Issues reproducing Precision/Recall/F1/F2 on the i2b2 dataset

Hi,

Thank you for developing and releasing this package. I followed steps 0, 2a, 1b, and 1c using the PHI config file, and then 2d with prod=True. To calculate the scores, following my understanding of the paper, I split all PHI text at the word level and sanitized edge cases such as "," and "." at the end of words (otherwise the scores are much lower). However, I was only able to achieve precision 0.696, recall 0.915, F1 0.791, and F2 0.861 on the test set, which is some way below the figures reported for the i2b2 test set in the paper. I am most likely missing something, but I am unsure what it is.
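
For reference, the reported F1 and F2 are consistent with the reported precision and recall under the standard F-beta formula. The sketch below is a generic implementation of that formula, not the evaluation code from the package; the function name `f_beta` is my own.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score: weights recall beta times as much as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.696, 0.915
print(round(f_beta(precision, recall, beta=1), 3))  # 0.791
print(round(f_beta(precision, recall, beta=2), 3))  # 0.861
```

So the arithmetic itself is not the source of the gap; any difference from the paper would have to come from how the true positives, false positives, and false negatives are counted.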

Update regular expressions to move global flags to the front

Up to Python 3.10, a global inline flag such as (?i) that is not at the start of a regular expression triggers a deprecation warning. From Python 3.11 onward, it raises an exception.

See this issue: python/cpython#83575

We have done this on our end because we have moved to Python 3.12, but due to corporate policy I can't easily propose those changes in a PR, so I wanted to let you know.

The easy solution is to move all global flags in all regular expressions to the front. As far as I can tell, (?i) is the only global flag in use. We wrote a script that removes (?i) from each regex and then prepends (?i) to those regexes from which it was removed.
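
A minimal sketch of that kind of rewrite, not the script we actually used. The function name `hoist_ignorecase_flag` is hypothetical, and the sketch assumes the literal text (?i) only ever appears in a pattern as an inline flag.

```python
import re

FLAG = "(?i)"


def hoist_ignorecase_flag(pattern):
    """Remove every inline (?i) and prepend a single one if any was present."""
    if FLAG not in pattern:
        return pattern  # nothing to move; leave the pattern untouched
    return FLAG + pattern.replace(FLAG, "")


fixed = hoist_ignorecase_flag(r"\broom(?i)\s+\d+")
print(fixed)  # (?i)\broom\s+\d+
print(bool(re.compile(fixed).search("ROOM 12")))  # True
```

Because a global flag already applied to the whole expression wherever it was written, moving it to the front should not change what a pattern matches.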

These are the regexes which need updating. Sorry I can't actually easily share the fixes. I wish I could.

addresses\at_street_dash_street_transformed.txt
addresses\at_street_number_dash_street_transformed.txt
addresses\box_room_transformed.txt
addresses\box_transformed.txt
addresses\corner_of_street_&_street_transformed.txt
addresses\county_name_transformed.txt
addresses\desk_#_transformed.txt
addresses\floor_box_transformed.txt
addresses\full_street_address_transformed.txt
addresses\full_street_address_with_concatenated_indicator_transformed.txt
addresses\num_streetname_city_transformed.txt
addresses\num_streetname_extension_transformed.txt
addresses\num_streetname_transformed.txt
addresses\number_streetname_noindicator_suite_transformed.txt
addresses\room_#_transformed.txt
addresses\room_box_transformed.txt
addresses\short_street_name_transformed.txt
addresses\state_indicator_transformed.txt
addresses\street_and_street_transformed.txt
addresses\streetname_floor_number_transformed.txt
addresses\streetname_only_transformed.txt
addresses\to_state_transformed.txt
addresses\waiting_room_transformed.txt
age\x_year_old.txt
age\x_yo.txt
contact\call_#.txt
dates\as_of_date.txt
mrn_id\accession_#.txt
mrn_id\activation_code.txt
mrn_id\cassette_#.txt
mrn_id\file_indicator.txt
mrn_id\id_verbose.txt
mrn_id\lot_#.txt
mrn_id\order_number.txt
mrn_id\specimen_#.txt
mrn_id\ssn.txt
mrn_id\tape_number.txt
safe\a1c_safe.txt
safe\able_to_safe.txt
safe\active_safe.txt
safe\age_safe.txt
safe\airway_safe.txt
safe\alert_safe.txt
safe\AM_safe.txt
safe\arms_safe.txt
safe\assessment_&_plan_safe.txt
safe\assessment_safe.txt
safe\at_measurement_safe.txt
safe\attending_safe.txt
safe\axis_safe.txt
safe\baby_safe.txt
safe\back_safe.txt
safe\balance_safe.txt
safe\barretts_esophagus_safe.txt
safe\base_safe.txt
safe\be_safe.txt
safe\beats_safe.txt
safe\bee_safe.txt
safe\below_safe.txt
safe\bile_safe.txt
safe\birth_safe.txt
safe\bp_safe.txt
safe\bruits_safe.txt
safe\call_safe.txt
safe\cancer_safe.txt
safe\case_safe.txt
safe\cava_safe.txt
safe\central_safe.txt
safe\chief_safe.txt
safe\child_safe.txt
safe\code_safe.txt
safe\colon_safe.txt
safe\commonly_safe.txt
safe\contact_safe.txt
safe\contractions_safe.txt
safe\cord_safe.txt
safe\crohns_disease_safe.txt
safe\day_safe.txt
safe\days_safe.txt
safe\dial_safe.txt
safe\disposition_safe.txt
safe\distance_safe.txt
safe\distensable_safe.txt
safe\do_safe.txt
safe\doppler_safe.txt
safe\down_syndrome_safe.txt
safe\drop_safe.txt
safe\due_safe.txt
safe\edge_safe.txt
safe\effort_safe.txt
safe\est_safe.txt
safe\fall_safe.txt
safe\fax_safe.txt
safe\few_safe.txt
safe\fiber_safe.txt
safe\file_safe.txt
safe\flexion_extension_safe.txt
safe\floor_safe.txt
safe\found_safe.txt
safe\go_safe.txt
safe\grade_safe.txt
safe\gross_safe.txt
safe\hearing_safe.txt
safe\hour_safe.txt
safe\how_safe.txt
safe\id_safe.txt
safe\independence_safe.txt
safe\index_safe.txt
safe\intake_safe.txt
safe\key_safe.txt
safe\knee_safe.txt
safe\lab_safe.txt
safe\last_safe.txt
safe\lb_safe.txt
safe\learn_safe.txt
safe\loop_safe.txt
safe\male_safe.txt
safe\max_safe.txt
safe\may_safe.txt
safe\measurement_safe.txt
safe\medical_safe.txt
safe\medication_safe.txt
safe\micro_safe.txt
safe\morning_safe.txt
safe\mri_safe.txt
safe\na_safe.txt
safe\night_safe.txt
safe\non_safe.txt
safe\not_safe.txt
safe\onset_safe.txt
safe\oral_safe.txt
safe\order_md_safe.txt
safe\ordering_md_safe.txt
safe\other_safe.txt
safe\pain_safe.txt
safe\pap_safe.txt
safe\parkinson_safe.txt
safe\peak_safe.txt
safe\pounds_safe.txt
safe\prn_safe.txt
safe\reviewed_by_safe.txt
safe\sci_notation_safe.txt
safe\scope_safe.txt
safe\smoking_safe.txt
safe\tablet_safe.txt
safe\the_safe.txt
safe\units_safe.txt
safe\young_safe.txt
salutations\code_status_name.txt
salutations\confirmed_by_name.txt
salutations\dr_ambiguous.txt
salutations\editor_name.txt
salutations\name_age.txt
salutations\name_indicator.txt
salutations\ordering_md.txt
salutations\sent_note_to.txt
salutations\wrote.txt
ucsf_regex\ucsf_apex_safe.txt
ucsf_regex\ucsf_bay_area.txt
ucsf_regex\ucsf_neighborhoods.txt

[feature request] ignoring input files

Hi,
I use DVC in my project for version control of data files:
https://dvc.org/
DVC creates a special text metafile with the .dvc file extension.
When I run Philter, I get an error related to the DVC files:

 File ".../python3.6/site-packages/philter_ucsf/philter.py", line 800, in transform
    contents = self.transform_text_i2b2(self.data_all_files[filename])
KeyError: '.../sample_notes/1.txt.dvc'

It would be nice to add an option for skipping files with a given file extension (for example, .dvc).

Best regards
Grzegorz
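
One way the requested option could work is to filter the input directory before any file is read. This is only a sketch under my own assumptions: `list_input_files` and its `skip_extensions` parameter are hypothetical names, not part of philter_ucsf.

```python
from pathlib import Path


def list_input_files(input_dir, skip_extensions=(".dvc",)):
    """Return the regular files in input_dir, skipping unwanted extensions."""
    skip = tuple(ext.lower() for ext in skip_extensions)
    return sorted(
        path
        for path in Path(input_dir).iterdir()
        # Compare case-insensitively so ".DVC" is skipped as well.
        if path.is_file() and not path.name.lower().endswith(skip)
    )
```

For a directory containing 1.txt and 1.txt.dvc, this would return only 1.txt, so the metafile never reaches the transform step.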

Is it possible to use Philter with non-English text?

Hi, thanks for releasing this software. I was just wondering whether there is any way of enabling Philter to process non-English text?

I had a quick try using default settings (python main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat "asterisk") and it seems to anonymise everything by default. For example:

This:

"pitkävuoroHengitys :alkuun vm <NAME> <NHS_NUMB> 4890262253 <NHS_NUMB> Margot <NAME> 40% , co2 nousee , vaihdettu 28% <NI_NUMB> <ADDRESS> 0487 Hull Village Suite 759, New Donald <ADDRESS>, <POSTCODE> EX13 5LY <POSTCODE> KK218196A <NI_NUMB> , jolla saturaatio laskee ad 84 ja co2 edelleen nousee , viikset , joilla saturoituu 90-91.<NAME> <ADDRESS> 94892 Garcia Cliffs, Thomasville <ADDRESS>, <POSTCODE> PO41 0SD <POSTCODE> <NHS_NUMB> <NI_NUMB> CJ389083D <NI_NUMB> 4890262253 <NHS_NUMB> Ibrahim <NAME> Hengitys pinnallista ja krohisevaa.

Became:

****ä************* :****** <NHS_NUMB> ********** <NHS_NUMB> <ADDRESS> **** **** ******* ***** ***, New ****** <ADDRESS>, <POSTCODE> *** 6BP <POSTCODE> vm 40% , co2 ****** , <NAME> ******** <NAME> ********* 28% , ***** <**_NUMB> ********* <**_NUMB> ********** ****** ad ** ** *** ******** ****** , ******* , ****** ********** 90-91.<ADDRESS> ***** ****** ******, Thomasville <ADDRESS>, <POSTCODE> *** 7BE <POSTCODE> <**_NUMB> ********* <**_NUMB> <NHS_NUMB> ********** <NHS_NUMB> ******** <NAME> **** <NAME> *********** ** **********.

Is there a way of modifying this so that only the regex patterns are anonymised?
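
One thing that could be tried is a trimmed config that keeps only the regex filters. The sketch below assumes the config file is a JSON list of filter entries that each carry a "type" field, with "regex" as one of its values; I have not confirmed this against the Philter source, and the function name `keep_only_types` is hypothetical. Whether this alone stops unmatched words from being masked depends on how Philter treats tokens that no filter claims, which I have not verified.

```python
import json


def keep_only_types(filters, keep=("regex",)):
    """Return only the filter entries whose "type" is in keep."""
    return [entry for entry in filters if entry.get("type") in keep]


# Illustrative entries only; titles and types are made up for this example.
sample = json.loads("""
[
  {"title": "example regex filter", "type": "regex"},
  {"title": "example word-list filter", "type": "set"},
  {"title": "another regex filter", "type": "regex"}
]
""")

regex_only = keep_only_types(sample)
print(json.dumps(regex_only, indent=2))
```

The resulting list could be written to a new JSON file and passed with -f in place of the default config.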
