kaiko-ai / spark-dicom

Spark dicom (streaming) connector

License: Apache License 2.0

Scala 89.32% Nix 9.74% Shell 0.94%

spark-dicom's Issues

Use hash function (+salt) to pseudonymize columns

The de-identifier should use a salted hash function to generate pseudonyms for tags with the U action (as listed in the table linked from the original issue).
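
A minimal sketch of salted pseudonymization, assuming HMAC-SHA256 as the hash and a secret salt kept outside the data set. The function name `pseudonymize`, the salt value, and the 16-character truncation are illustrative assumptions, not part of spark-dicom. The hex output is not a valid DICOM UID, so a further encoding step would be needed wherever the pseudonym must itself be a UID.

```python
import hashlib
import hmac


def pseudonymize(value: str, salt: bytes, length: int = 16) -> str:
    """Return a deterministic pseudonym for `value`, keyed by a secret salt."""
    # HMAC-SHA256 keys the hash with the salt, so the mapping cannot be
    # recomputed by someone who knows the input values but not the salt.
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


salt = b"project-specific-secret"  # hypothetical; keep it out of the data set
a = pseudonymize("1.2.840.113619.2.55.3", salt)
b = pseudonymize("1.2.840.113619.2.55.3", salt)
c = pseudonymize("1.2.840.113619.2.55.3", b"another-salt")
print(a == b, a == c)
```

The same input and salt always give the same pseudonym, which keeps references between instances consistent, while a different salt gives an unrelated value.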

Read Private & Sequence tags as JSON with Keyword (str) as key instead of Tag (int)

To make the private and sequence tags more usable, the JSON should use each tag's keyword as the key instead of the tag number.
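
The re-keying could look roughly like the sketch below. It assumes a layout similar to the DICOM JSON Model (tags as 8-digit hex strings, each element carrying `vr` and `Value`); the JSON actually produced by spark-dicom may differ. `TAG_TO_KEYWORD`, `rekey`, and the `replace_keys` flag are hypothetical names, and the dictionary is a tiny illustrative subset of the standard data dictionary.

```python
import json

# Tiny illustrative subset of the DICOM data dictionary (tag -> keyword).
# A real implementation would use the full dictionary from a DICOM library.
TAG_TO_KEYWORD = {
    "00100010": "PatientName",
    "00100020": "PatientID",
    "00081140": "ReferencedImageSequence",
    "00081150": "ReferencedSOPClassUID",
}


def rekey(dataset: dict, replace_keys: bool = True) -> dict:
    """Replace tag keys with keywords, recursing into SQ items."""
    out = {}
    for tag, element in dataset.items():
        element = dict(element)  # shallow copy so the input is untouched
        if element.get("vr") == "SQ":
            element["Value"] = [
                rekey(item, replace_keys) for item in element.get("Value", [])
            ]
        # Tags missing from the dictionary (for example private tags)
        # keep the tag itself as the key.
        key = TAG_TO_KEYWORD.get(tag, tag) if replace_keys else tag
        out[key] = element
    return out


raw = {
    "00100020": {"vr": "LO", "Value": ["12345"]},
    "00081140": {
        "vr": "SQ",
        "Value": [{"00081150": {"vr": "UI", "Value": ["1.2.840.10008.5.1.4.1.1.2"]}}],
    },
}
print(json.dumps(rekey(raw), indent=2))
```

Passing `replace_keys=False` returns the structure with its original tag keys, which is one way the replacement could be made optional.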

Enable `deidentify` as an org.apache.spark.ml.Transformer with options as Parameters

This would allow de-identification to be incorporated into an end-to-end workflow.

Dependency mismatch?

Running

```scala
display(spark.read.format("dicomFile")
  .option("recursiveFileLookup", true)
  .option("includeContent", true)
  .load("/mnt/tcia/gcs-public-data--healthcare-tcia-apollo/"))
```

results in `NoSuchMethodError: org.apache.hadoop.fs.FSDataInputStream.readAllBytes()[B`, likely due to a dependency mismatch. `InputStream.readAllBytes()` was added in Java 9, so one possible cause is running on a Java 8 runtime.

Make tag-key replacement in SQ VR optional

As discussed in: #48 (comment)

Add a "parsing errors" column

Following comment from @robopoc

Currently, a parsing error stops the whole DataSource read.

Because of the nature of DICOM files, failures are bound to happen.

The goal would be to tolerate read failures and report them in a column.

Example:

An element with keyword "A" and VR TM contains an invalid time, `12:10:55.0` (the `:` separators do not comply with the standard).
In a column "errors", we get `[{ column: "A", message: "DateTimeParseException: ..." }]`.
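
The idea can be sketched in plain Python, independent of Spark: each element is parsed on its own, and a failure is recorded instead of aborting the row. `read_row`, `parse_tm_strict`, and the shape of the error entries are illustrative assumptions that mirror the example above, not the data source's actual code.

```python
from datetime import datetime


def parse_tm_strict(value: str):
    """Parse an HHMMSS time; raises ValueError on anything else."""
    return datetime.strptime(value, "%H%M%S").time()


def read_row(elements: dict, parsers: dict) -> dict:
    """Parse each element; on failure store None and record the error."""
    row, errors = {}, []
    for column, raw in elements.items():
        try:
            row[column] = parsers[column](raw)
        except Exception as exc:  # one bad element must not abort the read
            row[column] = None
            errors.append(
                {"column": column, "message": f"{type(exc).__name__}: {exc}"}
            )
    row["errors"] = errors
    return row


row = read_row(
    {"A": "12:10:55.0", "B": "121055"},
    {"A": parse_tm_strict, "B": parse_tm_strict},
)
print(row["errors"])
```

Column "A" ends up null with one entry in `errors`, while column "B" is parsed normally, so the rest of the read can continue.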

Read whole bytes

This would load the file's raw bytes into a column.

It needs to be optional, just like PixelData.

Read sequence tags (VR: SQ) as json formatted StringType column

Currently, sequence tags are ingested into BinaryType columns. To make the data in these tags easier to work with, we should read them as JSON-formatted StringType columns.

Recover from reading files which are not DICOM

Currently, when the data source reads a file which is not a DICOM file, the entire job fails.

The data source should just go to the next file.

One idea is to add a boolean column `isDicom`. Parsing a non-DICOM file would then still yield a row, but with no data other than `isDicom = false`. This would improve ease of use, since users could verify that every file was detected.
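
One way to detect non-DICOM input is to check for the DICOM Part 10 file signature: a 128-byte preamble followed by the four bytes `DICM`. The sketch below assumes that check is sufficient; some legacy files omit the preamble and would be reported as non-DICOM. `is_dicom` and `read_file` are hypothetical helpers, not the data source's API.

```python
def is_dicom(header: bytes) -> bool:
    """Check for the Part 10 signature: 128-byte preamble, then b"DICM"."""
    return len(header) >= 132 and header[128:132] == b"DICM"


def read_file(path: str, content: bytes) -> dict:
    """Return a row for any file; non-DICOM files carry only path and isDicom."""
    if not is_dicom(content):
        return {"path": path, "isDicom": False}
    # Real parsing of the data set would happen here.
    return {"path": path, "isDicom": True}


good = bytes(128) + b"DICM" + b"\x00" * 8
bad = b"%PDF-1.7 not a DICOM file"
rows = [read_file("a.dcm", good), read_file("b.pdf", bad)]
print(rows)
```

Every input file yields a row, so counting rows with `isDicom` set to false shows how many files were skipped.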

DateTimeParseException

I'm getting the error below when trying to read DICOM files from the TCIA gs bucket.

```
java.time.format.DateTimeParseException: Text '122734.625000' could not be parsed, unparsed text found at index 10
```

Code to reproduce (pyspark):

```python
APOLLO_FILE_PATH = "gs://gcs-public-data--healthcare-tcia-apollo/dicom/"
apollo_df = spark.read.format("dicomFile").option("recursiveFileLookup", "true").load(APOLLO_FILE_PATH)
apollo_df.collect()
```
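
The failing value has six fractional digits, and the message suggests the parser stopped after the first ten characters (`122734.625`). The DICOM TM format allows `HH`, `HHMM`, `HHMMSS`, and `HHMMSS.F` with up to six fractional digits, so a tolerant parser might look like the sketch below. It is written in plain Python for illustration and is not the fix applied in spark-dicom; `parse_tm` is a hypothetical name.

```python
import re
from datetime import time

# HH, HHMM, HHMMSS or HHMMSS.F... with up to six fractional digits;
# trailing padding spaces are tolerated.
_TM = re.compile(r"^(\d{2})(?:(\d{2})(?:(\d{2})(?:\.(\d{1,6}))?)?)?\s*$")


def parse_tm(value: str) -> time:
    """Parse a DICOM TM value into a datetime.time."""
    match = _TM.match(value)
    if match is None:
        raise ValueError(f"invalid TM value: {value!r}")
    hh, mm, ss, frac = match.groups()
    # Right-pad the fraction so ".5" becomes 500000 microseconds.
    micro = int(frac.ljust(6, "0")) if frac else 0
    return time(int(hh), int(mm or 0), int(ss or 0), micro)


print(parse_tm("122734.625000"))
```

Values with `:` separators are still rejected with a `ValueError`, since they do not match the TM format.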

Read private tags

Currently, private tags are not read at all.

We could add a StringType column, "private data", into which private tags are dumped as a JSON string.
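
As a rough sketch of that column: private elements can be recognized by their odd group number, then serialized together as a single JSON string. The input here is assumed to be a plain mapping from integer tag to an already decoded value; `is_private` and `private_data_json` are hypothetical helpers, and the private element shown is made up for illustration.

```python
import json


def is_private(tag: int) -> bool:
    """Private elements have an odd group number (upper 16 bits of the tag)."""
    return ((tag >> 16) & 1) == 1


def private_data_json(elements: dict) -> str:
    """Dump private elements as a JSON string keyed by 8-digit hex tag."""
    private = {
        f"{tag:08X}": value for tag, value in elements.items() if is_private(tag)
    }
    return json.dumps(private, sort_keys=True)


elements = {
    0x00100020: "12345",          # PatientID, standard (even group 0010)
    0x00091001: "vendor string",  # hypothetical private element (odd group 0009)
}
print(private_data_json(elements))
```

Only the element in group 0009 ends up in the JSON string; standard elements would continue to be read into their own columns.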
