cap-cde-extraction's Introduction

Extracting Common Data Elements

pulling common data elements from CAP reports

And for my next trick, I will attempt to extract all the Common Data Elements (CDEs) defined by the College of American Pathologists (CAP).

The task is simple, but a bit annoying: take the PDFs they provide that outline the CDEs and turn them into JSON structures. These JSON structures will be used to build templates later. Here is a brief example from a PDF in the data/ folder:

Histologic Type (Note A)
___ Adenocarcinoma (acinar, not otherwise specified) ___ Prostatic duct adenocarcinoma
___ Mucinous (colloid) adenocarcinoma
___ Signet-ring cell carcinoma
___ Adenosquamous carcinoma
___ Small cell carcinoma
___ Sarcomatoid carcinoma
___ Undifferentiated carcinoma, not otherwise specified ___ Other (specify): __________________

This should be turned into something like the following, but I don't have any religion on the data model... yet:

{
  "name": "Histologic Type",
  "multiple_choice": false,
  "valid_choices": [
    {"name": "Adenocarcinoma (acinar, not otherwise specified)", "type": "Boolean"},
    {"name": "Prostatic duct adenocarcinoma", "type": "Boolean"},
    {"name": "Mucinous (colloid) adenocarcinoma", "type": "Boolean"},
    {"name": "Signet-ring cell carcinoma", "type": "Boolean"},
    {"name": "Adenosquamous carcinoma", "type": "Boolean"},
    {"name": "Small cell carcinoma", "type": "Boolean"},
    {"name": "Sarcomatoid carcinoma", "type": "Boolean"},
    {"name": "Undifferentiated carcinoma, not otherwise specified", "type": "Boolean"},
    {"name": "Other", "type": "Boolean", "input": "Text"}
  ]
}
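The conversion from the checklist text to that structure could be sketched roughly as follows. This is only a minimal sketch, not the project's actual script: the function name `parse_cde`, the rule that three or more underscores mark a checkbox, and the handling of "(specify):" are all assumptions drawn from the single example above, and real CAP PDFs will likely need more cases.

```python
import json
import re

# Assumption: a run of three or more underscores marks a checkbox or a blank.
BLANK = re.compile(r"_{3,}")
# Assumption: trailing "(Note A)"-style references are not part of the name.
NOTE = re.compile(r"\s*\(Note [^)]*\)\s*$")

SAMPLE = """
Histologic Type (Note A)
___ Adenocarcinoma (acinar, not otherwise specified) ___ Prostatic duct adenocarcinoma
___ Mucinous (colloid) adenocarcinoma
___ Signet-ring cell carcinoma
___ Adenosquamous carcinoma
___ Small cell carcinoma
___ Sarcomatoid carcinoma
___ Undifferentiated carcinoma, not otherwise specified ___ Other (specify): __________________
"""


def parse_cde(block):
    """Turn one checklist block (heading plus choices) into a dict."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    name = NOTE.sub("", lines[0])
    choices = []
    for line in lines[1:]:
        # Two-column layouts can put several choices on one line,
        # so split on the blanks rather than on line breaks.
        for piece in BLANK.split(line):
            label = piece.strip()
            if not label:
                continue
            choice = {"name": label, "type": "Boolean"}
            if label.endswith("(specify):"):
                # Free-text option: keep the bare label, record the text input.
                choice["name"] = label[: -len("(specify):")].strip()
                choice["input"] = "Text"
            choices.append(choice)
    # multiple_choice cannot be read from the text, so it defaults to False.
    return {"name": name, "multiple_choice": False, "valid_choices": choices}


cde = parse_cde(SAMPLE)
print(json.dumps(cde, indent=2))
```

Splitting on the underscore runs, rather than on line breaks, is what lets the two choices that share a line in the PDF text come out as separate entries.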

There are literally hundreds of PDFs with the CDEs in them. You can collect them all from the CAP website. Some of the CDEs will occur across multiple forms. This, of course, means we also have to create a "form" object that lists the CDEs it contains.

For instance, the sample form is "PROSTATE GLAND: Radical Prostatectomy".

So ultimately, we'll need two large JSON files, one with the CDEs and one with the report templates.
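One way the two files could relate is sketched below. Everything about the layout is hypothetical: the slug-style key `histologic-type`, the `cdes` list on each form, and the helper names `resolve_form` and `missing_cdes` are illustrative choices, not a settled data model.

```python
import json

# Hypothetical layout: CDEs keyed by a slug, so several forms can share one.
cdes = {
    "histologic-type": {
        "name": "Histologic Type",
        "multiple_choice": False,
        "valid_choices": [
            {"name": "Small cell carcinoma", "type": "Boolean"},
            {"name": "Other", "type": "Boolean", "input": "Text"},
        ],
    }
}

# Hypothetical layout: each form only lists the slugs of the CDEs it uses.
forms = {
    "PROSTATE GLAND: Radical Prostatectomy": {"cdes": ["histologic-type"]}
}


def resolve_form(form_name, forms, cdes):
    """Return the full CDE definitions a form refers to, in form order."""
    return [cdes[cde_id] for cde_id in forms[form_name]["cdes"]]


def missing_cdes(forms, cdes):
    """List (form, cde_id) pairs that point at a CDE that is not defined."""
    return [
        (form_name, cde_id)
        for form_name, form in forms.items()
        for cde_id in form["cdes"]
        if cde_id not in cdes
    ]


# The two strings below are what would be written to the two JSON files.
cdes_json = json.dumps(cdes, indent=2)
forms_json = json.dumps(forms, indent=2)
```

Storing references instead of copies keeps a CDE that appears on many forms defined in one place, and `missing_cdes` gives a cheap consistency check between the two files.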

Reports

I added the report summary XLS file to the data directory; it has all the mandatory headings for the reports.

VirtualEnv

If you've never used virtualenv before, it's not hard. From the top-level directory of the project, run

source env/bin/activate
pip install -r requirements.txt

If you add any needed Python libraries, run

pip freeze > requirements.txt

and check in the new requirements file.

Running Scripts

I usually set them up to be run from the project root. So you should be able to do

python scripts/python/extract_cde_test.py

and get a load of output.

PDFMiner

For the moment, I'm using PDFMiner. It has a basic API.
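As a rough illustration of pulling text out of a PDF, here is a minimal sketch. It assumes the pdfminer.six fork, which provides `pdfminer.high_level.extract_text`; this README does not say which PDFMiner version the scripts use, so the installed API may differ. The helper name `pdf_to_lines` and its `extract` parameter are hypothetical.

```python
def pdf_to_lines(pdf_path, extract=None):
    """Return the non-empty, stripped text lines of a PDF.

    `extract` is any callable mapping a path to a text string. When it is
    omitted, pdfminer.six's extract_text is used (assumed to be installed).
    """
    if extract is None:
        # Imported lazily so the helper can be tried without PDFMiner present.
        from pdfminer.high_level import extract_text as extract
    text = extract(pdf_path)
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Called with just a path to one of the PDFs in the data/ folder, this would hand back lines ready for a checklist parser. Passing a different `extract` callable makes it easy to swap extraction backends or to test the line cleanup on plain strings.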

FHIR Options

So, the CAP formats are licensed by the College of American Pathologists. This is a pain. HOWEVER, it turns out that HL7 also maintains the fields we need for the reports, so all we have to do is translate the pathology reports into FHIR messages. Sounds simple... This page has some details on the prostate pathology. Yes!
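To give a feel for what such a translation might produce, here is a minimal sketch that wraps one answered CDE in a FHIR Observation resource. The choice of Observation, the helper name `cde_to_observation`, and the use of plain `text` fields instead of coded terminology are all assumptions; a real mapping would follow whatever profile HL7 publishes for pathology reports.

```python
import json


def cde_to_observation(cde_name, selected_choice, free_text=None):
    """Build a bare-bones FHIR Observation dict for one answered CDE."""
    # For "Other (specify)" style answers, append the typed-in text.
    value_text = selected_choice
    if free_text is not None:
        value_text = f"{selected_choice}: {free_text}"
    return {
        "resourceType": "Observation",
        "status": "final",
        # Assumption: no coding system is attached, only human-readable text.
        "code": {"text": cde_name},
        "valueCodeableConcept": {"text": value_text},
    }


observation = cde_to_observation("Histologic Type", "Small cell carcinoma")
print(json.dumps(observation, indent=2))
```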

cap-cde-extraction's People

Contributors

buddha314
