chexpert-labeler's Introduction

chexpert-labeler

CheXpert NLP tool to extract observations from radiology reports.

Read more about our project here and our AAAI 2019 paper here.

Prerequisites

Please install the following dependencies, or use the Dockerized labeler (see below).

  1. Clone the NegBio repository:
     git clone https://github.com/ncbi-nlp/NegBio.git
  2. Add the NegBio directory to your PYTHONPATH:
     export PYTHONPATH={path to negbio directory}:$PYTHONPATH
  3. Make the virtual environment:
     conda env create -f environment.yml
  4. Activate the virtual environment:
     conda activate chexpert-label
  5. Install NLTK data:
     python -m nltk.downloader universal_tagset punkt wordnet
  6. Download the GENIA+PubMed parsing model:
     >>> from bllipparser import RerankingParser
     >>> RerankingParser.fetch_and_load('GENIA+PubMed')

Usage

Place reports in a headerless, single-column CSV {reports_path}. Each report must be enclosed in quotes if (1) it contains a comma or (2) it spans multiple lines. See sample_reports.csv (with output labeled_reports.csv) for an example.

python label.py --reports_path {reports_path}

Run python label.py --help for descriptions of all of the command-line arguments.
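
The quoting rules above can be satisfied programmatically. This is a minimal sketch (the reports shown are invented examples) that quotes every field so embedded commas and newlines are handled:

```python
import csv

# Hypothetical example reports; QUOTE_ALL handles embedded commas and newlines.
reports = [
    "Heart size normal and lungs are clear. No edema or pneumonia. No effusion.",
    "1. Left pleural effusion, with adjacent atelectasis.\n2. Cardiomegaly.",
]

# Write a headerless, single-column CSV as the labeler expects.
with open("reports.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for report in reports:
        writer.writerow([report])
```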

Dockerized Labeler

docker build -t chexpert-labeler:latest .
docker run -v $(pwd):/data chexpert-labeler:latest \
  python label.py --reports_path /data/sample_reports.csv --output_path /data/labeled_reports.csv --verbose

Contributions

This repository builds upon the work of NegBio.

This tool was developed by Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, and Silviana Ciurea-Ilcus.

Citing

If you're using the CheXpert labeling tool, please cite this paper:

@inproceedings{irvin2019chexpert,
  title={CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison},
  author={Irvin, Jeremy and Rajpurkar, Pranav and Ko, Michael and Yu, Yifan and Ciurea-Ilcus, Silviana and Chute, Chris and Marklund, Henrik and Haghgoo, Behzad and Ball, Robyn and Shpanskaya, Katie and others},
  booktitle={Thirty-Third AAAI Conference on Artificial Intelligence},
  year={2019}
}


chexpert-labeler's Issues

Compatibility with NegBio

Hello,
It seems NegBio underwent significant changes recently, which results in a number of errors. Could you please update the repo to absorb those changes?

ERROR:Cannot process sentence 0 in 0

Thanks for open sourcing your labeler! I'm running into the following error with the sample reports:

/root/miniconda3/envs/chexpert-label/lib/python3.8/site-packages/StanfordDependencies/JPypeBackend.py:160: UserWarning: This jar doesn't support universal dependencies, falling back to Stanford Dependencies. To suppress this message, call with universal=False
  warnings.warn("This jar doesn't support universal "
ERROR:root:Cannot process sentence 0 in 0
Traceback (most recent call last):
  File "/root/Y_chexpert-labeler-master/negbio/pipeline/ptb2ud.py", line 118, in convert_doc
    anns, rels = convert_dg(dependency_graph, sentence.text,
  File "/root/Y_chexpert-labeler-master/negbio/pipeline/ptb2ud.py", line 171, in convert_dg
    index = text.find(node_form, start)
TypeError: must be str, not java.lang.String
ERROR:root:Cannot process 0
Traceback (most recent call last):
  File "/root/Y_chexpert-labeler-master/negbio/pipeline/negdetect.py", line 75, in detect
    total_loc = ann.get_total_location()
AttributeError: 'BioCAnnotation' object has no attribute 'get_total_location'

(The same two tracebacks repeat for the remaining sentences and for reports 1, 2, and 3.)

One of the problems here is the 'TypeError: must be str, not java.lang.String'. However, when I print the type of each sample report, it is indeed str:

['Heart size normal and lungs are clear. No edema or pneumonia. No effusion', '1. Left pleural effusion with adjacent atelectasis. Right effusion is also present.\n\n2. Cardiomegaly without overt edema.', 'Minimal patchy airspace disease within the lingula, may reflect atelectasis or consolidation.', '1. Stable mild cardiomegaly. 2. Hyperexpanded but clear lungs.']
0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
3 <class 'str'>

How can I fix it? Is it an issue of incompatibility with the Conda environment?

Results

Hello and thank you very much for sharing your great work!
The prediction ran smoothly, but I am not sure how to interpret the resulting CSV file.
Some categories contain a "0", others a "-1", and others are blank.
What does that mean?
Here is an example:

Reports,No Finding,Enlarged Cardiomediastinum,Cardiomegaly,Lung Lesion,Lung Opacity,Edema,Consolidation,Pneumonia,Atelectasis,Pneumothorax,Pleural Effusion,Pleural Other,Fracture,Support Devices
"PA and moderate loss of the chest demonstrate stable moderate cardiomediastinal silhouette with atherosclerotic calcifications of the aortic XXXX and mild aortic ectasia. Emphysematous changes with flattening of the hemidiaphragms. Blunting of the costophrenic XXXX, and XXXX secondary to scarring/emphysematous changes. No evidence of focal airspace consolidation large pleural effusion or pneumothorax. Visualized osseous structures appear intact",,-1.0,,,-1.0,,0.0,,,-1.0,-1.0,,,

Thank you very much in advance!
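
For reference, the value conventions described in the CheXpert paper are: 1.0 marks an observation as positive (present), 0.0 as negative (absent), -1.0 as uncertain, and a blank cell means the observation was not mentioned in the report. As a small sketch (the `describe` helper is illustrative, not part of the labeler):

```python
# Output conventions from the CheXpert (AAAI 2019) paper.
LABEL_MEANING = {
    1.0: "positive - observation stated as present",
    0.0: "negative - observation stated as absent",
    -1.0: "uncertain - observation mentioned with uncertainty",
    None: "blank - observation not mentioned in the report",
}

def describe(value):
    """Map a raw CSV cell (float or empty string) to its meaning (illustrative helper)."""
    if value in ("", None):
        return LABEL_MEANING[None]
    return LABEL_MEANING[float(value)]
```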

Chexpert reports

Hello, I would like to ask whether the complete dataset includes the reports. Excuse me!

ResolvePackageNotFound: openssl==1.1.1=h7b6447c_0 (environment.yml)

(negbio3.7) mghenis@penguin:~/chexpert-labeler$ conda env create -f environment.yml                                      
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - openssl==1.1.1=h7b6447c_0

Looking through the openssl conda packages, this is associated with version 1.1.1f, and the following worked instead:

conda install -c anaconda openssl==1.1.1f=h7b6447c_0

run label.py ModuleNotFoundError: No module named '_JohnsonReranker'

Hi, when I try to run 'python label.py --help', it reports the following error:
[screenshot]
Then I uninstalled bllipparser and used 'pip install bllipparser' to install the latest bllipparser; after that, the script 'from bllipparser import RerankingParser' runs successfully.
However, when I try 'python label.py --help' again, it reports the error:
[screenshot]
I have no idea how to fix this error.

ERROR:root:Cannot process sentence 39 in 0

UserWarning: This jar doesn't support universal dependencies, falling back to Stanford Dependencies. To suppress this message, call with universal=False
  warnings.warn("This jar doesn't support universal "
Traceback (most recent call last):
  File "/media/zoule/Elements/chexpert-labeler/chexpert-labeler-master/negbio/pipeline/ptb2ud.py", line 120, in convert_doc
    has_lemmas=self._backend == 'jpype')
  File "/media/zoule/Elements/chexpert-labeler/chexpert-labeler-master/negbio/pipeline/ptb2ud.py", line 172, in convert_dg
    index = text.find(node_form, start)
TypeError: must be str, not java.lang.String

When I run 'python label.py --reports_path sample_reports.csv', this error happens. I checked the type of text with 'print(type(text))' before 'index = text.find(node_form, start)'; it prints <class 'str'>.

I don't know what the problem is or how to fix it. Can anybody help me?

Inexplicable negation mistake

Thanks for your work on this and making it available! However I have found a really bizarre edge case that I'd love to understand...

If the input CSV contains just this:

No rib fracture

Then the output CSV looks like this (as I would expect).

Reports,No Finding,Enlarged Cardiomediastinum,Cardiomegaly,Lung Lesion,Lung Opacity,Edema,Consolidation,Pneumonia,Atelectasis,Pneumothorax,Pleural Effusion,Pleural Other,Fracture,Support Devices
No rib fracture,1.0,,,,,,,,,,,,0.0,

If however I add a full stop at the end (and make no other change), the output switches to be positive for fracture!

No rib fracture.

with the output:

Reports,No Finding,Enlarged Cardiomediastinum,Cardiomegaly,Lung Lesion,Lung Opacity,Edema,Consolidation,Pneumonia,Atelectasis,Pneumothorax,Pleural Effusion,Pleural Other,Fracture,Support Devices
No rib fracture.,,,,,,,,,,,,,1.0,

Any insight would be much appreciated!

What's the meaning of Support Devices?

Excuse me! Could you tell me what 'Support Devices', one of the 14 labels, means? Google didn't give me the answer when I asked, and I know very little medical knowledge. Thanks!

Import error with bllipparser/_CharniakParser.so

Hi. I've followed the instructions for the NegBio install. Everything works fine till step 5 where I get an import error.

Download the GENIA+PubMed parsing model:
>>> from bllipparser import RerankingParser
>>> RerankingParser.fetch_and_load('GENIA+PubMed')

Full traceback

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/negbio2.7/lib/python2.7/site-packages/bllipparser/__init__.py", line 399, in <module>
    from .RerankingParser import RerankingParser, Tree, Sentence, tokenize
  File "/anaconda3/envs/negbio2.7/lib/python2.7/site-packages/bllipparser/RerankingParser.py", line 19, in <module>
    from . import CharniakParser as parser
  File "/anaconda3/envs/negbio2.7/lib/python2.7/site-packages/bllipparser/CharniakParser.py", line 28, in <module>
    _CharniakParser = swig_import_helper()
  File "/anaconda3/envs/negbio2.7/lib/python2.7/site-packages/bllipparser/CharniakParser.py", line 24, in swig_import_helper
    _mod = imp.load_module('_CharniakParser', fp, pathname, description)
ImportError: dlopen(/anaconda3/envs/negbio2.7/lib/python2.7/site-packages/bllipparser/_CharniakParser.so, 2): Symbol not found: __ZNKSt5ctypeIcE13_M_widen_initEv
  Referenced from: /anaconda3/envs/negbio2.7/lib/python2.7/site-packages/bllipparser/_CharniakParser.so
  Expected in: /usr/lib/libstdc++.6.dylib
 in /anaconda3/envs/negbio2.7/lib/python2.7/site-packages/bllipparser/_CharniakParser.so

It seems to be some issue with how the package is compiled?

Mac Version 10.14, Anaconda3

Unable to run Labeler - invalid package

Hello, I am trying to run the chexpert labeler on some reports. I'm first working out how to even get it going, so I'm using the sample_reports.csv file. Upon running label.py, the program quickly fails with the following error:

  File "/Users/xxx/opt/miniconda3/envs/chexpert-label/lib/python3.6/site-packages/StanfordDependencies/JPypeBackend.py", line 44, in __init__
    self.corenlp = jpype.JPackage('edu').stanford.nlp
AttributeError: Java package 'edu' is not valid

I've tried importing the edu.stanford.nlp package at the top of the file, using jpype.JClass(...), and a couple other things but this inevitably breaks something else further down the line. Any guidance or suggestions in resolving this problem and getting the labeler up and running is much appreciated!

ValueError: Parser model has not been loaded.

When I run with the command:
python label.py --reports_path {reports_path}
here is the error:

Traceback (most recent call last):
  File "label.py", line 51, in <module>
    label(parser.parse_args())
  File "label.py", line 42, in label
    classifier.classify(loader.collection)
  File "/Users/binerone/test/chexpert-labeler/stages/classify.py", line 100, in classify
    self.parser.parse_doc(document)
  File "/Users/binerone/test/NegBio/negbio/pipeline/parse.py", line 55, in parse_doc
    tree = self.parse(text)
  File "/Users/binerone/test/NegBio/negbio/pipeline/parse.py", line 34, in parse
    nbest = self.rrp.parse(str(s))
  File "/Users/binerone/anaconda3/envs/chexpert-label/lib/python3.6/site-packages/bllipparser/RerankingParser.py", line 614, in parse
    rerank = self.check_models_loaded_or_error(rerank)
  File "/Users/binerone/anaconda3/envs/chexpert-label/lib/python3.6/site-packages/bllipparser/RerankingParser.py", line 786, in check_models_loaded_or_error
    raise ValueError("Parser model has not been loaded.")
ValueError: Parser model has not been loaded.

My OS is macOS and I cannot solve this problem. Thanks for the help.

Probabilities

Hi, I was wondering whether it is somehow possible to extract the probabilities of the classification, in order to evaluate the AUROC.

Out of memory

My CSV is so big that it runs out of memory. Maybe there is a way to fix this?
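
One workaround (a sketch, not part of the labeler itself; `split_reports` is a hypothetical helper) is to split the input CSV into smaller chunk files and label each chunk separately:

```python
import csv

def split_reports(path, chunk_size=1000):
    """Split a headerless one-column reports CSV into smaller chunk files.

    Streams the input row by row, so the whole file is never held in memory.
    Returns the list of chunk file paths (hypothetical naming scheme).
    """
    paths = []
    with open(path, newline="") as f:
        chunk, idx = [], 0
        for row in csv.reader(f):
            chunk.append(row)
            if len(chunk) == chunk_size:
                paths.append(_write_chunk(path, idx, chunk))
                chunk, idx = [], idx + 1
        if chunk:  # flush the final partial chunk
            paths.append(_write_chunk(path, idx, chunk))
    return paths

def _write_chunk(path, idx, rows):
    out = f"{path}.chunk{idx}.csv"
    with open(out, "w", newline="") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)
    return out
```

Each chunk file can then be passed to label.py in turn, and the labeled outputs concatenated.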

How to run the labeler on GPU.

Hello:

When running the labeler on a laptop, it is taking too much time for a few thousand items. Is it possible to run the same on GPU?

Thanks

negative construction in clinical report

Thanks for open sourcing the CXR labelling tool! I tried using it on other CXR clinical reports and got a very bizarre result:

From here:
https://openi.nlm.nih.gov/detailedresult?img=CXR2016_IM-0665-1001

The text to be parsed is:
"The lungs are clear without evidence of focal airspace disease. There is no evidence of pneumothorax or large pleural effusion. The cardiac and mediastinal contours are within normal limits."

The output of the NLP labeler is:
The lungs are clear without evidence of focal airspace disease. There is no evidence of pneumothorax or large pleural effusion. The cardiac and mediastinal contours are within normal limits.,,1.0,,,1.0,,,,,1.0,1.0,,,

which has marked positive for the following; Enlarged Cardiomediastinum, Lung Opacity, Pneumothorax, Pleural Effusion.

I'm not from an NLP background, so I'm not sure what is causing these classes to be positively flagged when they should be negative.

Thanks!

Have stanford-corenlp-3.5.2.jar pre-downloaded in docker image

Hi, I'm using the dockerized version of the CheXpert labeler so as to run multiple instances in parallel, which is helping to save a lot of time. However, I've noticed that every time I run a docker container with cheXpert labeler, the file stanford-corenlp-3.5.2.jar gets downloaded. It happens every single time, so if I run 10 containers in parallel, that means 10 downloads in parallel. Even if I just want to run a cheXpert labeler container to label a single report, the file gets downloaded again.

When this happens, the log usually looks like this:

Downloading 'http://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/3.5.2/stanford-corenlp-3.5.2.jar' -> '/root/.local/share/pystanforddeps/stanford-corenlp-3.5.2.jar'

This usually is not a problem (I've been downloading this file thousands of times already), but it certainly adds some network bandwidth overhead, it makes labeling a single report slower (10 seconds for the first report, from the second report onwards the penalty does not apply because the file is downloaded only once per run), and from time to time the run fails because the download attempt receives a 502 bad gateway response from the server.

Question: Would it be possible to have stanford-corenlp-3.5.2.jar pre-downloaded within the CheXpert labeler docker image, so that it doesn't have to be downloaded again and again every time I run the container?

Thanks in advance.
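
One approach (a sketch, not an official change; it assumes the base image has curl available and reuses the URL and cache path shown in the log above) is to bake the jar into the image at build time:

```dockerfile
# Pre-fetch stanford-corenlp-3.5.2.jar at build time so containers
# don't re-download it on every run (URL/path taken from the log above).
RUN mkdir -p /root/.local/share/pystanforddeps && \
    curl -L -o /root/.local/share/pystanforddeps/stanford-corenlp-3.5.2.jar \
      'http://search.maven.org/remotecontent?filepath=edu/stanford/nlp/stanford-corenlp/3.5.2/stanford-corenlp-3.5.2.jar'
```

Since Docker caches this layer, the download happens once per image build rather than once per container run.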

Image labeler

Hello, I was looking for the script to label X-ray images; can I find it in this repository, or is there another one?

Issue while installing

Hello:

I am trying to install chexpert-labeler on an Ubuntu machine; however, when I run step 5 I get the following error:

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from bllipparser import RerankingParser
  File "/home/ubuntu/.conda/envs/chexpert-label/lib/python3.6/site-packages/bllipparser/__init__.py", line 399, in <module>
    from .RerankingParser import RerankingParser, Tree, Sentence, tokenize
  File "/home/ubuntu/.conda/envs/chexpert-label/lib/python3.6/site-packages/bllipparser/RerankingParser.py", line 20, in <module>
    from . import JohnsonReranker as reranker
  File "/home/ubuntu/.conda/envs/chexpert-label/lib/python3.6/site-packages/bllipparser/JohnsonReranker.py", line 28, in <module>
    _JohnsonReranker = swig_import_helper()
  File "/home/ubuntu/.conda/envs/chexpert-label/lib/python3.6/site-packages/bllipparser/JohnsonReranker.py", line 24, in swig_import_helper
    _mod = imp.load_module('_JohnsonReranker', fp, pathname, description)
  File "/home/ubuntu/.conda/envs/chexpert-label/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/ubuntu/.conda/envs/chexpert-label/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: /home/ubuntu/.conda/envs/chexpert-label/lib/python3.6/site-packages/bllipparser/_JohnsonReranker.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZNSt7__cxx1118basic_stringstreamIcSt11char_traitsIcESaIcEEC1Ev

I am not sure what is wrong here.
Can anybody help me here?

Thanks

Error - ValueError: Parser model has not been loaded.

I am getting the below error when running label.py:

  File "/home/user/anaconda3/envs/chexpert-label/lib/python3.6/site-packages/bllipparser/RerankingParser.py", line 786, in check_models_loaded_or_error
    raise ValueError("Parser model has not been loaded.")
ValueError: Parser model has not been loaded.

I have already downloaded the stanford-core-nlp jar and the bz2 file for the below.

from bllipparser import RerankingParser
RerankingParser.fetch_and_load('GENIA+PubMed')

Error when running label.py

Hi, I get this error when I run python label.py --reports_path reports.csv.

self.ptb2dep = ptb2ud.NegBioPtb2DepConverter(universal=True)
TypeError: __init__() missing 1 required positional argument: 'lemmatize'

How do I solve this?

Getting access to the Chexpert reports

Hi,

After downloading the light version of the dataset, I couldn't find the reports.
I want to develop a new NLP method to extract labels and for that reason I want to have the reports.
Is it possible to get access to them?

Thanks

gzip: stdout: broken pipeline

To whomever may solve our issue:

Hello, we failed to install the chexpert labeler in our environment and are here seeking help.

We found that the error happened in :

rrp = RerankingParser.fetch_and_load('GENIA+PubMed')

Furthermore, we found that the issue actually happens when JohnsonReranker tries to read reranker_features_filename and reranker_weights_filename.
Based on our current knowledge, we cannot make further progress and hope someone can help us...

Environment: Colab. A solution for a local environment is also acceptable, but please make sure it works on Win10.

Cheers

Error encountered on MIMIC:TypeError: 'NoneType' object is not iterable

Hello, I encountered the following error when I tried to use your work to extract labels from MIMIC data. The error is thrown during the run, but the program does not stop. I don't know how to solve it. I am looking forward to your reply.

TypeError: 'NoneType' object is not iterable
1%| | 2871/270791 [24:13<56:52:19, 1.31it/s]ERROR:root:Cannot process sentence 140 in 2871
Traceback (most recent call last):
File ".local/lib/python3.6/site-packages/negbio/pipeline/ptb2ud.py", line 130, in convert_doc
has_lemmas=self._backend == 'jpype')
TypeError: 'NoneType' object is not iterable

About Term Definition and Consistency

Thanks for your great work. I have a few questions and need your generous help.

As mentioned in "Structured dataset documentation: a datasheet for CheXpert", the terms are from "Fleischner Society: Glossary of Terms for Thoracic Imaging". However, I cannot find the definition of some labels, like Lung Lesion; this seems to be a vague label covering many situations.

Also, in the paper's Figure 1, Cardiomegaly seems to be a special case of Enlarged Cardiomediastinum, yet not all Cardiomegaly-positive cases have positive labels for Enlarged Cardiomediastinum.

Is there any detailed information on the term definitions, and is there any self-consistency check in the labeling procedure?

blank label

Thanks for your great work and datasets! I wonder how you treated blank labels while training the model, since the labeler should output only positive, negative, and uncertain. Also, did you use any tricks to handle the unbalanced dataset, such as data augmentation or adjusted training loss functions, to make the model less biased?

Error: cannot find previously pushed version

git checkout 2f29daf;
bash: cd: chexpert-labeler: No such file or directory
error: Your local changes to the following files would be overwritten by checkout:
loader/load.py
Please commit your changes or stash them before you switch branches.
Aborting

Running multiple chexpert labelers in parallel

The chexpert labeler is a bit slow. On my machine it labels about 4-5 reports per second on average, which is too slow if you want to label tens of thousands of reports quickly. As a workaround, I thought I could leverage the fact that my machine has multiple cores by running multiple instances of the chexpert labeler over disjoint splits of my report dataset. To this effect I tried the following:

def _invoke_chexpert_labeler_process(self, reports, tmp_suffix='', n_processes = 10):

    n = len(reports)
    if n < 100:
        n_processes = 1

    chunk_size = n // n_processes
    processes = []
    output_paths = []

    if self.verbose:
        print(f'Chexpert labeler: running {n_processes} processes in parallel')

    start = time.time()
    custom_env = _get_custom_env()

    for i in range(n_processes):
        # Define chunk range
        b = i * chunk_size
        e = n if i + 1 == n_processes else b + chunk_size
        
        # Define input & output paths for i-th chunk
        input_path = os.path.join(TMP_FOLDER, f'labeler-input{tmp_suffix}_{i}.csv')
        output_path = os.path.join(TMP_FOLDER, f'labeler-output{tmp_suffix}_{i}.csv')
        output_paths.append(output_path)

        # Create input file
        os.makedirs(TMP_FOLDER, exist_ok=True)
        in_df = pd.DataFrame(reports[b:e])
        in_df.to_csv(input_path, header=False, index=False, quoting=csv.QUOTE_ALL)

        # Build command & call chexpert labeler process
        cmd_cd = f'cd {CHEXPERT_FOLDER}'
        cmd_call = f'{CHEXPERT_PYTHON} label.py --reports_path {input_path} --output_path {output_path}'
        cmd = f'{cmd_cd} && {cmd_call}'
        if self.verbose:
            print(f'({i}) Running chexpert labeler over {len(in_df)} reports ...')
        processes.append(subprocess.Popen(cmd, shell=True, env=custom_env))
    
    out_labels = np.empty((n, len(CHEXPERT_LABELS)), np.int8)
    
    offset = 0        
    for i, p in enumerate(processes):
        # Wait for subprocess to finish
        if p.poll() is None:
            p.wait()
        if self.verbose: print(f'process {i} finished, elapsed time = {time.time() - start}')
        # Read chexpert-labeler output
        out_df = pd.read_csv(output_paths[i])
        out_df = out_df.fillna(-2)
        out_labels[offset : offset + len(out_df)] = out_df[CHEXPERT_LABELS].to_numpy().astype(np.int8)
        offset += len(out_df)
    
    assert offset == n
    
    return out_labels

Unfortunately, I'm getting this very strange behavior:

Chexpert labeler: running 10 processes in parallel

  (0) Running chexpert labeler over 29 reports ...
  (1) Running chexpert labeler over 29 reports ...
  (2) Running chexpert labeler over 29 reports ...
  (3) Running chexpert labeler over 29 reports ...
  (4) Running chexpert labeler over 29 reports ...
  (5) Running chexpert labeler over 29 reports ...
  (6) Running chexpert labeler over 29 reports ...
  (7) Running chexpert labeler over 29 reports ...
  (8) Running chexpert labeler over 29 reports ...
  (9) Running chexpert labeler over 34 reports ...
    process 0 finished, elapsed time = 9.482320785522461
    process 1 finished, elapsed time = 10.595801830291748
    process 2 finished, elapsed time = 203.73371744155884
    process 3 finished, elapsed time = 203.74254941940308
    process 4 finished, elapsed time = 203.7504105567932
    process 5 finished, elapsed time = 209.21588110923767
    process 6 finished, elapsed time = 209.2250039577484
    process 7 finished, elapsed time = 209.2326741218567
    process 8 finished, elapsed time = 209.23797416687012
    process 9 finished, elapsed time = 209.24284863471985

As you can see, the first two processes terminate relatively quickly (in about 10 seconds), but for some unknown reason processes 2 through 9 terminate about 200 seconds later. I've run my code several times and I always get the same result.

I have two questions:

  • Is it possible to run multiple instances of the chexpert labeler in parallel for performance gains?
  • If so, is there example code showing how this can be done? Maybe the way I'm doing it is not optimal (to be honest, I'm not even sure I'm doing it the right way; this is the first time I've attempted to parallelize a command using subprocess.Popen).

Thank you very much in advance.

Reports require end punctuation

Thanks for open sourcing your labeler! I'm running into the following error with the sample reports:

$ python label.py --reports_path sample_reports.csv
ERROR:root:Cannot process sentence 62 in 0
Traceback (most recent call last):
  File "NegBio/negbio/pipeline/ptb2ud.py", line 109, in convert_doc
    self.add_lemmas)
  File "NegBio/negbio/pipeline/ptb2ud.py", line 183, in convert_dg
    ann = annotations[annotation_id_map[node.index]]
IndexError: list index out of range

I believe the issue is due to the lack of punctuation at the end of the first sample report.

For example, if the input is:
Heart size normal and lungs are clear. No edema or pneumonia. No effusion,
then the labeled report output is:
Heart size normal and lungs are clear. No edema or pneumonia. No effusion,,,0.0,,,0.0,,0.0,,,1.0,,,

However, the example labeled_reports.csv has:
Heart size normal and lungs are clear. No edema or pneumonia. No effusion.,1.0,,0.0,,,0.0,,0.0,,,0.0,,,

We can achieve the example labels by modifying the input to Heart size normal and lungs are clear. No edema or pneumonia. No effusion. (added a period to the end of the report). The output is Heart size normal and lungs are clear. No edema or pneumonia. No effusion.,1.0,,0.0,,,0.0,,0.0,,,0.0,,,.

To summarize, do the radiology reports require punctuation at the end of each sentence?
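
One defensive workaround (a sketch, not an official fix; note that the "Inexplicable negation mistake" issue above shows punctuation can also change labels in surprising ways, so results should be spot-checked) is to normalize terminal punctuation before writing the reports CSV:

```python
def ensure_terminal_period(report):
    """Append a period if a report does not already end with sentence punctuation."""
    report = report.rstrip()
    if report and report[-1] not in ".!?":
        report += "."
    return report
```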
