
The phc-sdk-py is a developer kit for interfacing with the PHC API on Python 3.8 and above.

Home Page: https://lifeomic.github.io/phc-sdk-py/index.html

License: MIT License


phc-sdk-py's Introduction

PHC SDK for Python

The phc-sdk-py is a developer kit for interfacing with the PHC API on Python 3.7 and above.

Project Status

Badges: GitHub | PyPI status | Downloads | GitHub release | Docs | User Guides

Getting Started

Dependencies

Getting the Source

This project is hosted on GitHub.

Usage

A Session needs to be created first; it stores the token and account information needed to access the PHC API. One can currently use API Key tokens generated from the PHC Account, or OAuth tokens generated using the CLI.

from phc import Session

session = Session(token=<TOKEN VALUE>, account="myaccount")

Once a Session is created, you can then access the different parts of the platform.

from phc.services import Accounts

accounts = Accounts(session)
myaccounts = accounts.get_list()
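
Beyond the service classes, the SDK also exposes the higher-level phc.easy modules, which return pandas DataFrames. A minimal sketch of that style, assuming authentication has already been configured as above (the patient_id value is a placeholder):

import phc.easy as phc

# Pull Observation resources into a pandas DataFrame; the patient_id parameter
# follows the usage shown in the issues tracked further below.
observations = phc.Observation.get_data_frame(patient_id="<PATIENT ID>")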

Contributing

We encourage public contributions! Please review CONTRIBUTING.md and CODE_OF_CONDUCT.md for details on our code of conduct and development process.

License

This project is licensed under the MIT License - see LICENSE file for details.

Authors

See the list of contributors who participate in this project.

Acknowledgements

This project is built with the following:

  • aiohttp - Asynchronous HTTP Client/Server for asyncio and Python.

phc-sdk-py's People

Contributors

atolivero, cluebbehusen, dependabot[bot], epeters3, hemp, indigocarmen, jairav, joedimarzio, loscm, mjtieman, morpheusnephew, mschroering, rcdilorenzo, schaestewart, shawnzhu, simons5593, swain, taylordeatri


phc-sdk-py's Issues

Add cache_override to specify file location

If we have pre-loaded the data and simply want to use the same loading pipeline, we can specify this additional option and the SDK will not hit the FSS to get new data.

Example:

phc.Observation.get_data_frame(cache_override="/tmp/observations.csv")

Add GenomicShortVariant Enums

(See the UI code for the translation from labels to values.)

Example for clinVarSignificance

{
  id: 'omicsExplorer.filters.short.clinVarSignificance.pathogenic',
  searchValue: 'Pathogenic:like',
  intlLabel: {
    id: 'omicsExplorer.filters.short.clinVarSignificance.pathogenic',
    defaultMessage: 'Pathogenic or Likely Pathogenic',
  },
}
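
A rough sketch of what the corresponding Python enum could look like (the member names and values here are assumptions derived from the UI translation above, not actual SDK definitions):

from enum import Enum

# Hypothetical shape of the proposed enum; values mirror the searchValue
# strings used by the UI filters (assumptions, not SDK code).
class ClinVarSignificance(Enum):
    PATHOGENIC = "Pathogenic:like"
    BENIGN = "Benign:like"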

Auto-caching breaks when no results

It appears that this behavior happens because it's the "last batch" and the APICache callback expects that the file already exists at that point.

Stacktrace:

/opt/conda/lib/python3.7/site-packages/phc/easy/procedure.py in get_data_frame(all_results, raw, patient_id, query_overrides, auth_args, ignore_cache, expand_args)
     96             query_overrides,
     97             auth_args,
---> 98             ignore_cache,
     99         )

/opt/conda/lib/python3.7/site-packages/phc/easy/query/__init__.py in execute_fhir_dsl_with_options(query, transform, all_results, raw, query_overrides, auth_args, ignore_cache)
    157                 auth_args,
    158                 callback=APICache.build_cache_fhir_dsl_callback(
--> 159                     query, transform
    160                 ),
    161             )

/opt/conda/lib/python3.7/site-packages/phc/easy/query/__init__.py in execute_fhir_dsl(query, all_results, auth_args, callback)
    110             return with_progress(
    111                 lambda: tqdm(total=MAX_RESULT_SIZE),
--> 112                 lambda progress: recursive_execute_fhir_dsl(
    113                     {
    114                         "limit": [

/opt/conda/lib/python3.7/site-packages/phc/easy/query/fhir_dsl.py in with_progress(init_progress, func)
     20     if _has_tqdm:
     21         progress = init_progress()
---> 22         result = func(progress)
     23         progress.close()
     24         return result

/opt/conda/lib/python3.7/site-packages/phc/easy/query/__init__.py in <lambda>(progress)
    126                     progress=progress,
    127                     callback=callback,
--> 128                     auth_args=auth_args,
    129                 ),
    130             )

/opt/conda/lib/python3.7/site-packages/phc/easy/query/fhir_dsl.py in recursive_execute_fhir_dsl(query, scroll, progress, auth_args, callback, _scroll_id, _prev_hits)
     72         callback(current_results, False)
     73     elif callback and is_last_batch:
---> 74         return callback(current_results, True)
     75     elif is_last_batch:
     76         suffix = "+" if actual_count == MAX_RESULT_SIZE else ""

/opt/conda/lib/python3.7/site-packages/phc/util/api_cache.py in handle_batch(batch, is_finished)
     76             if is_finished:
     77                 print(f'Loading data frame from "{filename}"')
---> 78                 return APICache.read_csv(filename)
     79 
     80             df = pd.DataFrame(map(lambda r: r["_source"], batch))

/opt/conda/lib/python3.7/site-packages/phc/util/api_cache.py in read_csv(filename)
     85     @staticmethod
     86     def read_csv(filename: str) -> pd.DataFrame:
---> 87         df = pd.read_csv(filename)
     88         min_count = max(min(int(len(df) / 3), 5), 1)
     89 

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    446 
    447     # Create the parser.
--> 448     parser = TextFileReader(fp_or_buf, **kwds)
    449 
    450     if chunksize or iterator:

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    878             self.options["has_index_names"] = kwds["has_index_names"]
    879 
--> 880         self._make_engine(self.engine)
    881 
    882     def close(self):

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1112     def _make_engine(self, engine="c"):
   1113         if engine == "c":
-> 1114             self._engine = CParserWrapper(self.f, **self.options)
   1115         else:
   1116             if engine == "python":

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File ~/Downloads/phc/api-cache/fhir_dsl_procedure_where_********.csv does not exist: '~/Downloads/phc/api-cache/fhir_dsl_procedure_where_********.csv'
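
A minimal sketch of one possible guard, assuming the fix lives in APICache's batch callback (handle_batch and the cache filename come from the traceback above; returning an empty DataFrame is an assumption, not the implemented fix):

import os
import pandas as pd

def handle_batch(batch, is_finished, filename):
    if is_finished:
        # With zero results no batch ever wrote the cache file, so avoid the
        # FileNotFoundError from pd.read_csv and return an empty frame instead.
        if not os.path.exists(filename):
            return pd.DataFrame()
        return pd.read_csv(filename)
    # ...otherwise write the batch to the cache file as before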

Count by patient throws error when empty results

When using something like the following query where there are no results, the method throws an error since "subject.reference" doesn't exist:

phc.Observation.get_count_by_patient(patient_ids=["unknown-patient-id"])
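
A minimal sketch of the defensive behavior this asks for (the "subject.reference" column name comes from the error description; the helper itself is hypothetical, not the SDK's current code):

import pandas as pd

def count_by_patient(df: pd.DataFrame) -> pd.DataFrame:
    # Guard against empty results, where the "subject.reference" column is missing
    if len(df) == 0 or "subject.reference" not in df.columns:
        return pd.DataFrame(columns=["subject.reference", "count"])
    return df.groupby("subject.reference").size().reset_index(name="count")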

Update Data Lake Query Endpoint

The SDK is using a deprecated endpoint for interacting with the data-lake (query, list executions, etc). The deprecated path is analytics/query. The correct path is analytics/data-lake/query.

Add top-level API options (for Enums)

Extract options modules to be exposed in the following manner:

import phc.easy as phc

# Example for short variants
phc.Option.GenomicShortVariantInclude.VCF

# Example for genomic tests
phc.Option.GenomicTestStatus.ACTIVE

Add support for Cohorts

Examples:

phc.Patient.get_data_frame(cohort_name="MyCohort")
phc.Cohort.get_patient_ids(name="MyCohort")

(Note: Cohorts can be static or dynamic.)

Convert out of range dates to NA's and warn

In this case, we had a value of 0217-06-07 (Observation effectiveDateTime) that caused the entire parsing system to error out, preventing a researcher from continuing their work. Instead, we'd like to issue a warning and convert the value to NA (and not assume we know what the problem is).
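
A minimal sketch of the requested behavior using pandas' errors="coerce", which turns out-of-bounds dates like 0217-06-07 into NaT (the function name and warning text are placeholders):

import warnings
import pandas as pd

def parse_dates_with_warning(series: pd.Series) -> pd.Series:
    # errors="coerce" converts unparseable/out-of-bounds values to NaT
    parsed = pd.to_datetime(series, errors="coerce", utc=True)
    bad = series[parsed.isna() & series.notna()]
    if len(bad) > 0:
        warnings.warn(f"Converted {len(bad)} out-of-range date value(s) to NA")
    return parsed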

Multiple addresses in column breaks frame expansion

Pulling out the address column breaks when multiple values are present. For example:

[{'state': 'NC', 'postalCode': '27540', 'period': {'start': '2001'}}, {'use': 'old', 'state': 'SC', 'period': {'start': '1999', 'end': '2001'}}]

Error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-6-0355439bcf07> in <module>
----> 1 phc.Patient.get_data_frame()

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/__init__.py in get_data_frame(limit, all_results, raw, query_overrides, auth_args, ignore_cache, expand_args)
    101             query_overrides,
    102             auth_args,
--> 103             ignore_cache,
    104         )

/opt/conda/lib/python3.7/site-packages/phc/easy/query/__init__.py in execute_fhir_dsl_with_options(query, transform, all_results, raw, query_overrides, auth_args, ignore_cache)
    168             return df
    169 
--> 170         return transform(df)

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/__init__.py in transform(df)
     92 
     93         def transform(df: pd.DataFrame):
---> 94             return Patient.transform_results(df, **expand_args)
     95 
     96         return Query.execute_fhir_dsl_with_options(

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/__init__.py in transform_results(data_frame, **expand_args)
     32         }
     33 
---> 34         return Frame.expand(data_frame, **args)
     35 
     36     @staticmethod

/opt/conda/lib/python3.7/site-packages/phc/easy/frame.py in expand(frame, code_columns, date_columns, custom_columns)
     94             *[
     95                 column_to_frame(frame, key, func)
---> 96                 for key, func in custom_columns
     97             ],
     98             frame.drop([*codeable_col_names, *custom_names], axis=1),

/opt/conda/lib/python3.7/site-packages/phc/easy/frame.py in <listcomp>(.0)
     94             *[
     95                 column_to_frame(frame, key, func)
---> 96                 for key, func in custom_columns
     97             ],
     98             frame.drop([*codeable_col_names, *custom_names], axis=1),

/opt/conda/lib/python3.7/site-packages/phc/easy/frame.py in column_to_frame(frame, column_name, expand_func)
     29     "Converts a column (if exists) to a data frame with multiple columns"
     30     if column_name in frame.columns:
---> 31         return expand_func(frame[column_name])
     32 
     33     return pd.DataFrame([])

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/address.py in expand_address_column(address_col)
     32 
     33 def expand_address_column(address_col):
---> 34     return pd.DataFrame(map(expand_address_value, address_col.values))

/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    467         elif isinstance(data, abc.Iterable) and not isinstance(data, (str, bytes)):
    468             if not isinstance(data, (abc.Sequence, ExtensionArray)):
--> 469                 data = list(data)
    470             if len(data) > 0:
    471                 if is_list_like(data[0]) and getattr(data[0], "ndim", 1) == 1:

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/address.py in expand_address_value(value)
     22 
     23     # Value is always list of one item
---> 24     assert len(value) == 1
     25     value = value[0]
     26 

AssertionError:
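
A minimal sketch of handling multiple entries instead of asserting a single item (expand_address_value mirrors the function in the traceback above; the "prefer the non-old address" rule is an assumption):

def expand_address_value(value):
    if not isinstance(value, list) or len(value) == 0:
        return {}
    # Prefer the current address; fall back to the first entry
    current = next((a for a in value if a.get("use") != "old"), value[0])
    return {f"address_{key}": val for key, val in current.items() if isinstance(val, str)}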

Add variant_set_id when returning GenomicShortVariant

The id of the variants comes in the following form:

'55e945ec-57d1-4dde-9a59-bcdd6d7271e6:+0LLoDMx2dXBmvef9GcN4Dz+v4EMI87/FXW9X2mG72k=:TFB2M'

The first part of this ID is the variant_set_id which can be joined with the output of the GenomicTest to match a given mutation to the patient. This is a common use case.
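
A minimal sketch of extracting variant_set_id from that composite id (splitting on the first colon is an assumption based on the example above):

variant_id = "55e945ec-57d1-4dde-9a59-bcdd6d7271e6:+0LLoDMx2dXBmvef9GcN4Dz+v4EMI87/FXW9X2mG72k=:TFB2M"
variant_set_id = variant_id.split(":", 1)[0]
# => '55e945ec-57d1-4dde-9a59-bcdd6d7271e6'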

Fix blank progress lines when using all_results

When passing all_results=True, there are blank lines where the frame expand progress bars are created and then destroyed. This becomes particularly annoying when there are many batches of data. Perhaps we could use a shared transient progress bar that is reused until the entire query is finished.
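
A minimal sketch of the shared transient progress bar idea, assuming tqdm's leave=False behavior (batch sizes are hypothetical):

from tqdm import tqdm

# leave=False clears the bar on close instead of leaving a blank line per batch
progress = tqdm(total=300, leave=False)
for batch_size in [100, 100, 100]:
    progress.update(batch_size)
progress.close()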

Merge patient_ids with "must" query

Currently, patient_ids cannot be auto-merged with this type of custom query:

phc.Procedure.get_data_frame(patient_ids=["a", "b"], query_overrides={
    "where": {
        "type": "elasticsearch",
        "query": {
            "bool": {
                "must": [
                    {"term": {"code.coding.code.keyword": "blah"}},
                    {"term": {"code.coding.system.keyword": "http://loinc.org"}}
                ]
            }
        }
    }
})

Merge patient_id with FSS filter

The patient_id argument needs to merge just like with the must FSS query.

phc.Observation.get_data_frame(patient_id="41e6a5bc-7b8a-4434-b38b-0da652d6364e", query_overrides={
    "where": {
        "type": "elasticsearch",
        "query": {
            "bool": {
                "filter": [
                    {"term": {"code.coding.system.keyword": "http://my-system-example.org"}},
                    {"term": {"code.coding.code.keyword": "123456-7"}}
                ]
            }
        }
    }
})

Support providing patient_id in array type

Current behavior

When using GenomicShortVariant#get_data_frame(), patient_id can currently be provided only as a string:

df = get_data_frame(
    ...,
    patient_id='UUID_1,UUID_2',
)

Expected result

It should also support providing patient ids as an array:

df = get_data_frame(
    ...,
    patient_id=['UUID_1', 'UUID_2'],
)

Pretty print log with FSS

When passing log=True to get_data_frame or get_codes, I'd like to have the FSS query pretty printed as JSON so it's readable.
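
A minimal sketch of the requested output (the query dict is a stand-in, and where the SDK would emit this log is an assumption):

import json

query = {"type": "select", "from": [{"table": "observation"}]}
# Pretty-print the FSS query as indented JSON instead of a single dense line
print(json.dumps(query, indent=2))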

Add option to parse dates as local times

Right now, we lose all of the timezone information when the easy modules are used since everything gets auto-converted to UTC. We'd like to add an option to still remove the timezone (since that's how Pandas likes it) but force it into the recorded timezone.
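
A minimal sketch of the intended conversion, assuming pandas' tz_localize(None), which drops the timezone while keeping the recorded local wall time:

import pandas as pd

series = pd.Series(pd.to_datetime(["2019-03-11 12:00:10-04:00"]))
local_naive = series.dt.tz_localize(None)  # 2019-03-11 12:00:10, timezone removed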

Support Python 3.8+

I plan to drop support for Python 3.7 in order to adopt pandas 2.0.0+.

The jupyter/datascience-notebook:lab-3.6.3 notebook image ships Python 3.10 and pandas 2.0.1. In order to let customers use the phc package in that environment, the SDK needs to support pandas 2.0.1.

It will also benefit the maintenance of this repo by avoiding backport features used only to support Python 3.7.

Use previous logic for reading saved genomics files

We encountered the same error for a date outside of normal bounds (i.e. #64) with the genomics-related APIs. When reading the file version, we need to use the APICache and Frame.expand calls so that logic stays consolidated.

Thanks to Steven Bray for finding this issue. 👍

Positive time zones are not preserved as local time

Currently, the regex only truncates negative time zones before converting to UTC. This is a significant problem since positive time zones will get changed instead of truly being the local time.

Example:

import pandas as pd
from phc.easy.frame import TZ_REGEX

df = pd.DataFrame({
    "effectiveDateTime": pd.to_datetime([
        "2019-03-11 12:00:10.000000+02:00",
        "2019-03-11 12:00:10.000000-04:00",
    ])})

# Remove timezone and then mark as UTC date (building a local date)
pd.to_datetime(df.effectiveDateTime.astype(str).str.replace(TZ_REGEX, ""), utc=True)

# Result
# 0   2019-03-11 10:00:10+00:00
# 1   2019-03-11 12:00:10+00:00
# Name: effectiveDateTime, dtype: datetime64[ns, UTC]

The first date should be at noon rather than 10 AM.
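
A minimal sketch of a sign-agnostic truncation (the SDK's actual TZ_REGEX is not reproduced here, so this pattern is an assumption):

import re

# Strip a trailing +HH:MM or -HH:MM offset so both signs are treated as local time
SIGNED_TZ_REGEX = re.compile(r"[+-]\d{2}:\d{2}$")
SIGNED_TZ_REGEX.sub("", "2019-03-11 12:00:10.000000+02:00")  # '2019-03-11 12:00:10.000000'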

When `tag` appears in `meta`, it adds `tag_` as prefix to other `meta` attributes.

This occurs with the frequently used meta.lastUpdated field. If this is changed, it will break any code that currently uses the name meta_tag_lastUpdated.*.

E.g.

generic_codeable_to_dict({
    'tag': [
        {
            "system": "http://lifeomic.com/fhir/group",
            "code": "group-code-id",
        },
    ],
    'other': 'ok',
})
>>>
{'tag_other': 'ok',
 'tag_system__lifeomic.com/fhir/group__code': 'group-code-id'}

Code references:

if "tag" in codeable_dict:
return (
[without_keys(codeable_dict, ["tag"]), *codeable_dict["tag"]],
join_underscore([prefix, "tag"]),
)

"tag_lastUpdated": "2019-08-13T17:47:18.957Z",

See the HL7 FHIR Resources docs for Meta.lastUpdated and Meta.tag.

pdoc -> sphinx

The motivation is to improve the UX of the SDK documentation for its target users.

As we add more capability to this SDK for target users such as ML engineers, it needs a getting-started guide and a dev guide. The existing pdoc-based solution is focused on API docs, and its hierarchy comes from the README.md. I'm looking for a better tool, so I'd like to give Sphinx a try.

Things on my mind:

  1. generate documentation via Sphinx instead of pdoc
  2. improve the TOC for the target audiences (users and contributors/maintainers)
  3. include a new getting-started guide for ML engineers using the Patient ML feature

Add simple recipe for survey results (Observation, QuestionnaireResponse, Questionnaire)

Right now, we don't have Questionnaire or QuestionnaireResponse. Each of these has to be merged with the appropriate Observation resources in order to produce a nice frame of the results with the survey names and versions.

  • phc.Questionnaire
  • phc.Survey
  • phc.QuestionnaireResponse
  • phc.SurveyObservation (or SurveyResults, SurveyResponse, etc)

Example usage:

phc.SurveyObservation.get_data_frame() # => Returns observations that have the survey code

phc.Survey.get_data_frame() # => Returns surveys (questionnaires) and then you pick an ID
phc.SurveyObservation.get_data_frame(questionnaire_id="...", join_survey=True) # => (Allow multiple ids)

Here's some rough code of what's needed right now to accomplish this.

import pandas as pd

questionnaires_raw = phc.Query.execute_fhir_dsl({
    "type": "select",
    "columns": [
        {"expr": {"type": "column_ref", "column": c}}
        for c in ["id", "title", "meta", "version", "status", "date",
                  "subjectType", "identifier", "description", "contained"]
    ],
    "from": [{"table": "questionnaire"}],
}, page_size=100)

questionnaires = (
    phc.Frame
    .expand(pd.DataFrame([r["_source"] for r in questionnaires_raw]))
    .sort_values("version", ascending=False)
)

questionnaire_responses_raw = phc.Query.execute_fhir_dsl({
    "type": "select",
    "columns": [
        {"expr": {"type": "column_ref", "column": c}}
        for c in ["id", "status", "questionnaire"]
    ],
    "from": [{"table": "questionnaire_response"}],
}, all_results=True)

questionnaire_responses = phc.Frame.expand(
    pd.DataFrame([r["_source"] for r in questionnaire_responses_raw]),
    code_columns=["questionnaire"]
)

responses = phc.Observation.get_data_frame(
    code="...",
    system="http://lifeomic.com/fhir/primary-survey-id",
#     all_results=True
)

def join_observation_responses(
    observation_df: pd.DataFrame,
    questionnaire_response_df: pd.DataFrame,
    questionnaire_df: pd.DataFrame
):
    return observation_df.assign(**{
        "related.target_reference": observation_df["related.target_reference"].str.replace("QuestionnaireResponse/", "")
    }).join(
        questionnaire_response_df.set_index("id"),
        on="related.target_reference",
        rsuffix=".questionnaire_response"
    ).pipe(
        lambda df: df.assign(**{
            "questionnaire.reference": df["questionnaire.reference"].str.replace("Questionnaire/", "")
        })
    ).join(
        questionnaire_df.set_index("id"),
        on="questionnaire.reference",
        rsuffix=".questionnaire"
    )


join_observation_responses(responses, questionnaire_responses, questionnaires)[[
    "id",
    "subject.display",
    "status", # ?
    "title",
    "version",
    "status.questionnaire",
    "description",
    "status.questionnaire_response",
    "code.coding_system__lifeomic.com/fhir/questionnaire/item__code",
    "code.coding_system__lifeomic.com/fhir/questionnaire/item__display",
    "code.coding_system__lifeomic.com/fhir__code",
    "code.coding_system__lifeomic.com/fhir__display",
    "code.coding_system__lifeomic.com/fhir__userSelected",
    "valueCodeableConcept.coding_display",
    "valueCodeableConcept.coding_code",
    "valueString",
    "valueQuantity.value",
    "effectiveDateTime.local"
]].rename(columns={
    "code.coding_system__lifeomic.com/fhir/questionnaire/item__code": "item_code",
    "code.coding_system__lifeomic.com/fhir/questionnaire/item__display": "item", 
})

make -> poetry

I propose to use poetry to:

  • manage dependencies
  • manage virtualenv out of the box
  • manage the build/release process
  • replace the need for make (requires the poethepoet package)

It will improve the maintenance and contribution experience of this repo for future features.

see PR #189

Handle `raw` for GenomicVariant sub-classes

In this case, we can't join to the GenomicTest so we'll require variant_set_ids, but this will at least give an opportunity to do something custom (or fix something that blows up when it tries to expand the data frame).
