
Binary classifier to identify scientific publications reporting pharmacokinetic parameters estimated in vivo

Home Page: https://pkpdai.com/pkdocsearch

License: MIT License


PKDocClassifier


PKDocClassifier | Data | Reproduce our results | Make new predictions | Citation

This repository contains custom pipes and models to classify scientific publications from PubMed according to whether they estimate pharmacokinetic (PK) parameters from in vivo studies. The final pipeline retrieved more than 121K PK publications and runs weekly updates, which are available at https://pkpdai.com/pkdocsearch. The architectures in this repository are described in the following publication: https://pubmed.ncbi.nlm.nih.gov/34381873/

Data

The labels assigned to each publication can be found in the labels folder. These labels are released under the terms of the Creative Commons Zero "No rights reserved" dedication (CC0 1.0 Public Domain Dedication).

Reproduce our results

1. Installing dependencies

You will need an environment with Python 3.7+. We strongly recommend using an isolated Python environment (such as virtualenv or conda) to install the packages for this project. Our default option is to create a virtual environment with conda:

  1. If you don't have Anaconda installed, follow the instructions here

  2. Create conda environment for this project and activate it

    conda create -n PKDocClassifier python=3.7
    conda activate PKDocClassifier
  3. Clone and access this repository on your local machine

    git clone https://github.com/fgh95/PKDocClassifier
    cd PKDocClassifier

    If you are on macOS, install LLVM's OpenMP runtime library, e.g.

    brew install libomp
  4. Install all project dependencies

    pip install .

2. Data download and parsing - Optional

If you would like to reproduce the data retrieval and parsing steps, you will need to download the whole MEDLINE dataset and store it as a Spark DataFrame: follow the steps on the pubmed_parser wiki and place the resulting medline_lastview.parquet file at data/medline_lastview.parquet. (Alternatively, you can skip this step and use the parsed data already available at data/subsets/.) Then, change the spark config file to match your Spark configuration and run:

python scripts/getready.py

This should generate the files at data/subsets/.
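
The Spark settings you need depend on your machine. As a rough, hypothetical illustration of the kind of configuration involved (the values below are assumptions, not the repository's actual config file), parsing MEDLINE locally typically needs several cores and generous driver memory:

from pyspark.sql import SparkSession

# Hypothetical local Spark setup; tune cores and memory to your machine.
spark = (
    SparkSession.builder
    .appName("pkdocclassifier-medline")
    .master("local[8]")                            # number of local cores
    .config("spark.driver.memory", "16g")          # MEDLINE parsing is memory hungry
    .getOrCreate()
)

# Sanity check that the parsed MEDLINE snapshot is readable
medline = spark.read.parquet("data/medline_lastview.parquet")
print(medline.count())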

3. Reproduce results

3.1. Field analysis and N-grams

  1. To generate the features run (~30min):

    python scripts/features_bow.py
  2. Bootstrap field analysis (~3h on 12 threads, requires at least 16GB of RAM; set overwrite to False if you want to skip this step). An illustrative sketch of a single bootstrap iteration is given after this list.

    python scripts/bootstrap_bow.py \
       --input-dir data/encoded/fields \
       --output-dir data/results/fields \
       --output-dir-bootstrap data/results/fields/bootstrap \
       --path-labels data/labels/training_labels.csv \
       --overwrite True

    Optional analysis using idf scores for reweighting:

    python scripts/bootstrap_bow.py \
       --input-dir data/encoded/fields \
       --output-dir data/results/fields \
       --output-dir-bootstrap data/results/fields/bootstrap \
       --path-labels data/labels/training_labels.csv \
       --use-idf True
  3. Bootstrap n-grams (set overwrite to False if you want to skip this step)

    python scripts/bootstrap_bow.py \
       --input-dir data/encoded/ngrams \
       --output-dir data/results/ngrams \
       --output-dir-bootstrap data/results/ngrams/bootstrap \
       --path-labels data/labels/training_labels.csv \
       --overwrite True
  4. Display results

    python scripts/display_results.py \
       --input-dir  data/results/fields\
       --output-dir data/final/fields
    python scripts/display_results.py \
       --input-dir  data/results/ngrams\
       --output-dir data/final/ngrams
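
For orientation, here is a minimal sketch of what a single bootstrap iteration amounts to, assuming documents are resampled with replacement and scored on the out-of-bag remainder; scripts/bootstrap_bow.py implements the full procedure over the different feature sets and may differ in its details:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def one_bootstrap_iteration(X, y, seed):
    """Illustrative only: train on a resample of the documents and
    score on the out-of-bag documents."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y), len(y))          # sample with replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)     # held-out remainder
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    return f1_score(y[oob], clf.predict(X[oob]))

# Toy data just to make the sketch runnable
X = np.random.default_rng(0).normal(size=(200, 10))
y = np.repeat([0, 1], 100)
scores = [one_bootstrap_iteration(X, y, s) for s in range(50)]
print(np.mean(scores), np.percentile(scores, [2.5, 97.5]))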

3.2. Distributed representations

  1. Encode using SPECTER. To generate the features with SPECTER, first preprocess the data by running:

    python preprocess_specter.py

     This will generate the input data as .ids and .json files at data/encoded/specter/ (a sketch of the assumed file format is given after this list). To generate the input features you will then need to clone the SPECTER repo and follow its instructions for using the pretrained model. After cloning and installing the SPECTER dependencies, we ran the following commands from the specter directory to encode the documents:

    python scripts/embed.py \
       --ids ../data/encoded/specter/training_ids.ids --metadata ../data/encoded/specter/training_meta.json \
       --model ./model.tar.gz \
       --output-file ../data/encoded/specter/training_specter.jsonl \
       --vocab-dir data/vocab/ \
       --batch-size 16 \
       --cuda-device -1
    python scripts/embed.py \
       --ids ../data/encoded/specter/test_ids.ids --metadata ../data/encoded/specter/test_meta.json \
       --model ./model.tar.gz \
       --output-file ../data/encoded/specter/test_specter.jsonl \
       --vocab-dir data/vocab/ \
       --batch-size 16 \
       --cuda-device -1

     This should output two files: data/encoded/specter/training_specter.jsonl and data/encoded/specter/test_specter.jsonl.

  2. Generate BioBERT representations:

    python scripts/features_dist.py
  3. Run bootstrap iterations for distributed representations (set overwrite to False if you want to skip this step):

    python scripts/bootstrap_dist.py \
       --is-specter True \
       --use-bow False \
       --input-dir data/encoded/specter \
       --output-dir data/results/distributional \
       --output-dir-bootstrap data/results/distributional/bootstrap \
       --path-labels data/labels/training_labels.csv \
       --path-optimal-bow data/encoded/ngrams/training_unigrams.parquet \
       --overwrite True
    python scripts/bootstrap_dist.py \
       --is-specter False \
       --use-bow False \
       --input-dir data/encoded/biobert \
       --output-dir data/results/distributional \
       --output-dir-bootstrap data/results/distributional/bootstrap \
       --path-labels data/labels/training_labels.csv \
       --path-optimal-bow data/encoded/ngrams/training_unigrams.parquet \
       --overwrite True
  4. Display results

    python scripts/display_results.py \
       --input-dir  data/results/distributional \
       --output-dir data/final/distributional \
       --convert-latex
    python scripts/display_results.py \
       --input-dir  data/results/distributional/bow_and_distributional \
       --output-dir data/final/distributional/bow_and_distributional \
       --convert-latex

    From these plots we can see that the best-performing architecture on the training data, on average, is the one using average embeddings from BioBERT and unigram features.
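
Returning to step 1 above, the .ids and .json files written by preprocess_specter.py follow SPECTER's expected input format. The sketch below shows the shape we assume (one ID per line in the .ids file and a JSON object keyed by ID with title and abstract); check the exact field names against the SPECTER repository's README:

import json

# Hypothetical shape of the SPECTER inputs; field names are assumptions.
metadata = {
    "28766389": {
        "paper_id": "28766389",
        "title": "Pharmacokinetics, efficacy and safety of rituximab ...",
        "abstract": "Rituximab, an anti-CD20 monoclonal antibody, is ...",
    }
}
with open("data/encoded/specter/training_meta.json", "w") as f:
    json.dump(metadata, f)

# The .ids file simply lists one ID per line
with open("data/encoded/specter/training_ids.ids", "w") as f:
    f.write("28766389\n")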

3.3. Final pipeline

Run the cross-validation analyses:

python scripts/cross_validate.py \
   --training-embeddings  data/encoded/biobert/training_biobert_avg.parquet \
   --training-optimal-bow  data/encoded/ngrams/training_unigrams.parquet \
   --training-labels  data/labels/training_labels.csv\
   --output-dir  data/results/final-pipeline 

Train the final pipeline (preprocessing, encoding, decoding) from scratch with optimal hyperparameters and apply it to the test set:

python scripts/train_test_final.py \
   --path-train  data/subsets/training_subset.parquet \
   --train-labels  data/labels/training_labels.csv \
   --path-test  data/subsets/test_subset.parquet \
   --test-labels  data/labels/test_labels.csv \
   --cv-dir  data/results/final-pipeline \
   --output-dir  data/results/final-pipeline \
   --train-pipeline  True 

Final results on the test set should be printed to the terminal.
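
As noted above, the best-performing architecture combines average BioBERT embeddings with unigram features before the classifier. The sketch below illustrates that idea only; the BioBERT checkpoint, the pooling, and the classifier are assumptions, and the repository's actual implementation lives in scripts/features_dist.py and scripts/train_test_final.py:

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed BioBERT checkpoint; the repository may load a different one.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def biobert_avg(texts):
    """Mean-pool the last hidden states into one vector per document."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=512, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state       # (docs, tokens, 768)
        mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding tokens
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

docs = ["Population pharmacokinetics of drug X in adults ...",
        "A narrative review of anti-CD20 antibodies ..."]
labels = [1, 0]

bow = CountVectorizer(ngram_range=(1, 1)).fit(docs)       # unigram features
features = np.hstack([biobert_avg(docs), bow.transform(docs).toarray()])
clf = LogisticRegression(max_iter=1000).fit(features, labels)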

Make new predictions

You can make new predictions in three simple steps:

import pandas as pd
import joblib

# 1. Import data
data = pd.read_csv('data/examples/to_classify.csv').reset_index(drop=True)
data['pmid'] = data['pmid'].fillna(0).astype(int)  # missing PMIDs become 0
data.head()
#>>>                                             abstract  mesh_terms  ...                                              title      pmid
#>>> 0  Rituximab, an anti-CD20 monoclonal antibody, i...         NaN  ...  Pharmacokinetics, efficacy and safety of the r...  28766389
#>>> 1  Background: Biosimilars are highly similar to ...         NaN  ...  A Randomized, Double-Blind, Efficacy and Safet...  31820339
#>>> 2  AIMS: Rituximab is standard care in a number o...         NaN  ...  Pharmacokinetics, exposure, efficacy and safet...  31050355
#>>> 3  BACKGROUND: Studies in patients with rheumatoi...         NaN  ...  Efficacy, pharmacokinetics, and safety of the ...  28712940
#>>> 4  Rituximab, a chimeric monoclonal antibody targ...         NaN  ...  Rituximab (monoclonal anti-CD20 antibody): mec...  14576843

# 2. Import trained model
pipeline_trained = joblib.load("data/results/final-pipeline/optimal_pipeline.pkl")

# 3. Make predictions
pred_test = pipeline_trained.predict(data)
print(pred_test)
#>>> array(['Not Relevant', 'Not Relevant', 'Relevant', 'Relevant',
#       'Not Relevant'], dtype=object)

You can find this example in a Jupyter notebook.
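
If your documents are not in a CSV yet, you can also build the DataFrame in memory. The columns below mirror the to_classify.csv example above (abstract, mesh_terms, title, pmid); whether the pipeline needs further columns is worth checking against that example file:

import pandas as pd
import joblib

# Hypothetical in-memory input following the column layout shown above
new_docs = pd.DataFrame([{
    "pmid": 0,
    "title": "Population pharmacokinetics of drug X in healthy volunteers",
    "abstract": "We estimated clearance and volume of distribution ...",
    "mesh_terms": "",
}])

pipeline_trained = joblib.load("data/results/final-pipeline/optimal_pipeline.pkl")
print(pipeline_trained.predict(new_docs))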

Citation

@article{hernandez2021automated,
  title={An automated approach to identify scientific publications reporting pharmacokinetic parameters},
  author={Hernandez, Ferran Gonzalez and Carter, Simon J and Iso-Sipil{\"a}, Juha and Goldsmith, Paul and Almousa, Ahmed A and Gastine, Silke and Lilaonitkul, Watjana and Kloprogge, Frank and Standing, Joseph F},
  journal={Wellcome Open Research},
  volume={6},
  number={88},
  pages={88},
  year={2021},
  publisher={F1000 Research Limited}
}
