
The phc-sdk-py is a developer kit for interfacing with the PHC API on Python 3.8 and above.

Home Page: https://lifeomic.github.io/phc-sdk-py/index.html

License: MIT License


phc-sdk-py's Introduction

PHC SDK for Python

The phc-sdk-py is a developer kit for interfacing with the PHC API on Python 3.7 and above.

Project Status

Badges: GitHub | PyPI status | Downloads | GitHub release | Docs | User Guides

Getting Started

Dependencies

Getting the Source

This project is hosted on GitHub.

Usage

A Session needs to be created first; it stores the token and account information needed to access the PHC API. One can currently use API Key tokens generated from the PHC Account, or OAuth tokens generated using the CLI.

from phc import Session

session = Session(token=<TOKEN VALUE>, account="myaccount")

Once a Session is created, you can then access the different parts of the platform.

from phc.services import Accounts

accounts = Accounts(session)
myaccounts = accounts.get_list()
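
Beyond the service classes, the SDK also exposes the higher-level phc.easy modules, which return pandas DataFrames. A minimal sketch of that style, assuming authentication has already been configured as above (the patient_id value is a placeholder):

import phc.easy as phc

# Pull Observation resources into a pandas DataFrame; the patient_id parameter
# follows the usage shown in the issues tracked further below.
observations = phc.Observation.get_data_frame(patient_id="<PATIENT ID>")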

Contributing

We encourage public contributions! Please review CONTRIBUTING.md and CODE_OF_CONDUCT.md for details on our code of conduct and development process.

License

This project is licensed under the MIT License - see LICENSE file for details.

Authors

See the list of contributors who participate in this project.

Acknowledgements

This project is built with the following:

  • aiohttp - Asynchronous HTTP Client/Server for asyncio and Python.

phc-sdk-py's People

Contributors

atolivero, cluebbehusen, dependabot[bot], epeters3, hemp, indigocarmen, jairav, joedimarzio, loscm, mjtieman, morpheusnephew, mschroering, rcdilorenzo, schaestewart, shawnzhu, simons5593, swain, taylordeatri


phc-sdk-py's Issues

Add cache_override to specify file location

If we have pre-loaded the data and simply want to use the same loading pipeline, we can specify this additional option and the SDK will not hit the FSS to get new data.

Example:

phc.Observation.get_data_frame(cache_override="/tmp/observations.csv")

Add GenomicShortVariant Enums

(See the UI code for the translation from labels to values.)

Example for clinVarSignificance

{
  id: 'omicsExplorer.filters.short.clinVarSignificance.pathogenic',
  searchValue: 'Pathogenic:like',
  intlLabel: {
    id: 'omicsExplorer.filters.short.clinVarSignificance.pathogenic',
    defaultMessage: 'Pathogenic or Likely Pathogenic',
  },
}
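
A rough sketch of what the corresponding Python enum could look like (the member names and values here are assumptions derived from the UI translation above, not actual SDK definitions):

from enum import Enum

# Hypothetical shape of the proposed enum; values mirror the searchValue
# strings used by the UI filters (assumptions, not SDK code).
class ClinVarSignificance(Enum):
    PATHOGENIC = "Pathogenic:like"
    BENIGN = "Benign:like"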

Auto-caching breaks when no results

It appears that this behavior happens because it's the "last batch" and the APICache callback expects that the file already exists at that point.

Stacktrace:

/opt/conda/lib/python3.7/site-packages/phc/easy/procedure.py in get_data_frame(all_results, raw, patient_id, query_overrides, auth_args, ignore_cache, expand_args)
     96             query_overrides,
     97             auth_args,
---> 98             ignore_cache,
     99         )

/opt/conda/lib/python3.7/site-packages/phc/easy/query/__init__.py in execute_fhir_dsl_with_options(query, transform, all_results, raw, query_overrides, auth_args, ignore_cache)
    157                 auth_args,
    158                 callback=APICache.build_cache_fhir_dsl_callback(
--> 159                     query, transform
    160                 ),
    161             )

/opt/conda/lib/python3.7/site-packages/phc/easy/query/__init__.py in execute_fhir_dsl(query, all_results, auth_args, callback)
    110             return with_progress(
    111                 lambda: tqdm(total=MAX_RESULT_SIZE),
--> 112                 lambda progress: recursive_execute_fhir_dsl(
    113                     {
    114                         "limit": [

/opt/conda/lib/python3.7/site-packages/phc/easy/query/fhir_dsl.py in with_progress(init_progress, func)
     20     if _has_tqdm:
     21         progress = init_progress()
---> 22         result = func(progress)
     23         progress.close()
     24         return result

/opt/conda/lib/python3.7/site-packages/phc/easy/query/__init__.py in <lambda>(progress)
    126                     progress=progress,
    127                     callback=callback,
--> 128                     auth_args=auth_args,
    129                 ),
    130             )

/opt/conda/lib/python3.7/site-packages/phc/easy/query/fhir_dsl.py in recursive_execute_fhir_dsl(query, scroll, progress, auth_args, callback, _scroll_id, _prev_hits)
     72         callback(current_results, False)
     73     elif callback and is_last_batch:
---> 74         return callback(current_results, True)
     75     elif is_last_batch:
     76         suffix = "+" if actual_count == MAX_RESULT_SIZE else ""

/opt/conda/lib/python3.7/site-packages/phc/util/api_cache.py in handle_batch(batch, is_finished)
     76             if is_finished:
     77                 print(f'Loading data frame from "{filename}"')
---> 78                 return APICache.read_csv(filename)
     79 
     80             df = pd.DataFrame(map(lambda r: r["_source"], batch))

/opt/conda/lib/python3.7/site-packages/phc/util/api_cache.py in read_csv(filename)
     85     @staticmethod
     86     def read_csv(filename: str) -> pd.DataFrame:
---> 87         df = pd.read_csv(filename)
     88         min_count = max(min(int(len(df) / 3), 5), 1)
     89 

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    446 
    447     # Create the parser.
--> 448     parser = TextFileReader(fp_or_buf, **kwds)
    449 
    450     if chunksize or iterator:

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    878             self.options["has_index_names"] = kwds["has_index_names"]
    879 
--> 880         self._make_engine(self.engine)
    881 
    882     def close(self):

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1112     def _make_engine(self, engine="c"):
   1113         if engine == "c":
-> 1114             self._engine = CParserWrapper(self.f, **self.options)
   1115         else:
   1116             if engine == "python":

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File ~/Downloads/phc/api-cache/fhir_dsl_procedure_where_********.csv does not exist: '~/Downloads/phc/api-cache/fhir_dsl_procedure_where_********.csv'
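
A minimal sketch of one possible guard, assuming the fix lives in APICache's batch callback (handle_batch and the cache filename come from the traceback above; returning an empty DataFrame is an assumption, not the implemented fix):

import os
import pandas as pd

def handle_batch(batch, is_finished, filename):
    if is_finished:
        # With zero results no batch ever wrote the cache file, so avoid the
        # FileNotFoundError from pd.read_csv and return an empty frame instead.
        if not os.path.exists(filename):
            return pd.DataFrame()
        return pd.read_csv(filename)
    # ...otherwise write the batch to the cache file as before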

Count by patient throws error when empty results

When using something like the following query where there are no results, the method throws an error since "subject.reference" doesn't exist:

phc.Observation.get_count_by_patient(patient_ids=["unknown-patient-id"])
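
A minimal sketch of the defensive behavior this asks for (the "subject.reference" column name comes from the error description; the helper itself is hypothetical, not the SDK's current code):

import pandas as pd

def count_by_patient(df: pd.DataFrame) -> pd.DataFrame:
    # Guard against empty results, where the "subject.reference" column is missing
    if len(df) == 0 or "subject.reference" not in df.columns:
        return pd.DataFrame(columns=["subject.reference", "count"])
    return df.groupby("subject.reference").size().reset_index(name="count")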

Update Data Lake Query Endpoint

The SDK is using a deprecated endpoint for interacting with the data-lake (query, list executions, etc). The deprecated path is analytics/query. The correct path is analytics/data-lake/query.

Add top-level API options (for Enums)

Extract options modules to be exposed in the following manner:

import phc.easy as phc

# Example for short variants
phc.Option.GenomicShortVariantInclude.VCF

# Example for genomic tests
phc.Option.GenomicTestStatus.ACTIVE

Add support for Cohorts

Examples:

phc.Patient.get_data_frame(cohort_name="MyCohort")
phc.Cohort.get_patient_ids(name="MyCohort")

(Note: Cohorts can be static or dynamic.)

Convert out of range dates to NA's and warn

In this case, we had a value of 0217-06-07 (Observation effectiveDateTime) that caused the entire parsing system to error out, preventing a researcher from continuing their work. Instead, we'd like to issue a warning and convert the value to NA (and not assume we know what the problem is).
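
A minimal sketch of the requested behavior using pandas' errors="coerce", which turns out-of-bounds dates like 0217-06-07 into NaT (the function name and warning text are placeholders):

import warnings
import pandas as pd

def parse_dates_with_warning(series: pd.Series) -> pd.Series:
    # errors="coerce" converts unparseable/out-of-bounds values to NaT
    parsed = pd.to_datetime(series, errors="coerce", utc=True)
    bad = series[parsed.isna() & series.notna()]
    if len(bad) > 0:
        warnings.warn(f"Converted {len(bad)} out-of-range date value(s) to NA")
    return parsed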

Multiple addresses in column breaks frame expansion

Pulling out the address column breaks when multiple values are present. For example:

[{'state': 'NC', 'postalCode': '27540', 'period': {'start': '2001'}}, {'use': 'old', 'state': 'SC', 'period': {'start': '1999', 'end': '2001'}}]

Error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-6-0355439bcf07> in <module>
----> 1 phc.Patient.get_data_frame()

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/__init__.py in get_data_frame(limit, all_results, raw, query_overrides, auth_args, ignore_cache, expand_args)
    101             query_overrides,
    102             auth_args,
--> 103             ignore_cache,
    104         )

/opt/conda/lib/python3.7/site-packages/phc/easy/query/__init__.py in execute_fhir_dsl_with_options(query, transform, all_results, raw, query_overrides, auth_args, ignore_cache)
    168             return df
    169 
--> 170         return transform(df)

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/__init__.py in transform(df)
     92 
     93         def transform(df: pd.DataFrame):
---> 94             return Patient.transform_results(df, **expand_args)
     95 
     96         return Query.execute_fhir_dsl_with_options(

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/__init__.py in transform_results(data_frame, **expand_args)
     32         }
     33 
---> 34         return Frame.expand(data_frame, **args)
     35 
     36     @staticmethod

/opt/conda/lib/python3.7/site-packages/phc/easy/frame.py in expand(frame, code_columns, date_columns, custom_columns)
     94             *[
     95                 column_to_frame(frame, key, func)
---> 96                 for key, func in custom_columns
     97             ],
     98             frame.drop([*codeable_col_names, *custom_names], axis=1),

/opt/conda/lib/python3.7/site-packages/phc/easy/frame.py in <listcomp>(.0)
     94             *[
     95                 column_to_frame(frame, key, func)
---> 96                 for key, func in custom_columns
     97             ],
     98             frame.drop([*codeable_col_names, *custom_names], axis=1),

/opt/conda/lib/python3.7/site-packages/phc/easy/frame.py in column_to_frame(frame, column_name, expand_func)
     29     "Converts a column (if exists) to a data frame with multiple columns"
     30     if column_name in frame.columns:
---> 31         return expand_func(frame[column_name])
     32 
     33     return pd.DataFrame([])

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/address.py in expand_address_column(address_col)
     32 
     33 def expand_address_column(address_col):
---> 34     return pd.DataFrame(map(expand_address_value, address_col.values))

/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    467         elif isinstance(data, abc.Iterable) and not isinstance(data, (str, bytes)):
    468             if not isinstance(data, (abc.Sequence, ExtensionArray)):
--> 469                 data = list(data)
    470             if len(data) > 0:
    471                 if is_list_like(data[0]) and getattr(data[0], "ndim", 1) == 1:

/opt/conda/lib/python3.7/site-packages/phc/easy/patients/address.py in expand_address_value(value)
     22 
     23     # Value is always list of one item
---> 24     assert len(value) == 1
     25     value = value[0]
     26 

AssertionError:
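
A minimal sketch of handling multiple entries instead of asserting a single item (expand_address_value mirrors the function in the traceback above; the "prefer the non-old address" rule is an assumption):

def expand_address_value(value):
    if not isinstance(value, list) or len(value) == 0:
        return {}
    # Prefer the current address; fall back to the first entry
    current = next((a for a in value if a.get("use") != "old"), value[0])
    return {f"address_{key}": val for key, val in current.items() if isinstance(val, str)}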

Add variant_set_id when returning GenomicShortVariant

The id of the variants comes in the following form:

'55e945ec-57d1-4dde-9a59-bcdd6d7271e6:+0LLoDMx2dXBmvef9GcN4Dz+v4EMI87/FXW9X2mG72k=:TFB2M'

The first part of this ID is the variant_set_id which can be joined with the output of the GenomicTest to match a given mutation to the patient. This is a common use case.
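
A minimal sketch of extracting variant_set_id from that composite id (splitting on the first colon is an assumption based on the example above):

variant_id = "55e945ec-57d1-4dde-9a59-bcdd6d7271e6:+0LLoDMx2dXBmvef9GcN4Dz+v4EMI87/FXW9X2mG72k=:TFB2M"
variant_set_id = variant_id.split(":", 1)[0]
# => '55e945ec-57d1-4dde-9a59-bcdd6d7271e6'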

Fix blank progress lines when using all_results

When passing all_results=True, there are blank lines where the frame expand progress bars are created and then destroyed. This becomes particularly annoying when there are many batches of data. Perhaps we could use a shared transient progress bar that is reused until the entire query is finished.
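
A minimal sketch of the shared transient progress bar idea, assuming tqdm's leave=False behavior (batch sizes are hypothetical):

from tqdm import tqdm

# leave=False clears the bar on close instead of leaving a blank line per batch
progress = tqdm(total=300, leave=False)
for batch_size in [100, 100, 100]:
    progress.update(batch_size)
progress.close()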

Merge patient_ids with "must" query

Currently, patient_ids cannot be auto-merged with this type of custom query:

phc.Procedure.get_data_frame(patient_ids=["a", "b"], query_overrides={
    "where": {
        "type": "elasticsearch",
        "query": {
            "bool": {
                "must": [
                    {"term": {"code.coding.code.keyword": "blah"}},
                    {"term": {"code.coding.system.keyword": "http://loinc.org"}}
                ]
            }
        }
    }
})

Merge patient_id with FSS filter

The patient_id argument needs to merge just like with the must FSS query.

phc.Observation.get_data_frame(patient_id="41e6a5bc-7b8a-4434-b38b-0da652d6364e", query_overrides={
    "where": {
        "type": "elasticsearch",
        "query": {
            "bool": {
                "filter": [
                    {"term": {"code.coding.system.keyword": "http://my-system-example.org"}},
                    {"term": {"code.coding.code.keyword": "123456-7"}}
                ]
            }
        }
    }
})

Support providing patient_id in array type

Current behavior

When using GenomicShortVariant#get_data_frame(), patient_id can currently be provided only as a string:

df = get_data_frame(
    ...,
    patient_id='UUID_1,UUID_2',
)

Expected result

It should also support providing patient ids as an array:

df = get_data_frame(
    ...,
    patient_id=['UUID_1', 'UUID_2'],
)

Pretty print log with FSS

When passing log=True to get_data_frame or get_codes, I'd like to have the FSS query pretty printed as JSON so it's readable.
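
A minimal sketch of the requested output (the query dict is a stand-in, and where the SDK would emit this log is an assumption):

import json

query = {"type": "select", "from": [{"table": "observation"}]}
# Pretty-print the FSS query as indented JSON instead of a single dense line
print(json.dumps(query, indent=2))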

Add option to parse dates as local times

Right now, we lose all of the timezone information when the easy modules are used since everything gets auto-converted to UTC. We'd like to add an option to still remove the timezone (since that's how Pandas likes it) but force it into the recorded timezone.
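
A minimal sketch of the intended conversion, assuming pandas' tz_localize(None), which drops the timezone while keeping the recorded local wall time:

import pandas as pd

series = pd.Series(pd.to_datetime(["2019-03-11 12:00:10-04:00"]))
local_naive = series.dt.tz_localize(None)  # 2019-03-11 12:00:10, timezone removed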

Support Python 3.8+

I plan to drop support for Python 3.7 in order to adopt pandas 2.0.0+.

The jupyter/datascience-notebook:lab-3.6.3 notebook image ships Python 3.10 and pandas 2.0.1. In order to let customers use the phc package in that environment, the SDK needs to support pandas 2.0.1.

It will also benefit the maintenance of this repo by avoiding backport features used only to support Python 3.7.

Use previous logic for reading saved genomics files

We encountered the same error for a date outside of normal bounds (i.e. #64) with the genomics-related APIs. When reading the file version, we need to use the APICache and Frame.expand calls so that logic stays consolidated.

Thanks to Steven Bray for finding this issue. 👍

Positive time zones are not preserved as local time

Currently, the regex only truncates negative time zones before converting to UTC. This is a significant problem since positive time zones will get changed instead of truly being the local time.

Example:

import pandas as pd
from phc.easy.frame import TZ_REGEX

df = pd.DataFrame({
    "effectiveDateTime": pd.to_datetime([
        "2019-03-11 12:00:10.000000+02:00",
        "2019-03-11 12:00:10.000000-04:00",
    ])})

# Remove timezone and then mark as UTC date (building a local date)
pd.to_datetime(df.effectiveDateTime.astype(str).str.replace(TZ_REGEX, ""), utc=True)

# Result
# 0   2019-03-11 10:00:10+00:00
# 1   2019-03-11 12:00:10+00:00
# Name: effectiveDateTime, dtype: datetime64[ns, UTC]

The first date should be at noon rather than 10 AM.
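
A minimal sketch of a sign-agnostic truncation (the SDK's actual TZ_REGEX is not reproduced here, so this pattern is an assumption):

import re

# Strip a trailing +HH:MM or -HH:MM offset so both signs are treated as local time
SIGNED_TZ_REGEX = re.compile(r"[+-]\d{2}:\d{2}$")
SIGNED_TZ_REGEX.sub("", "2019-03-11 12:00:10.000000+02:00")  # '2019-03-11 12:00:10.000000'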

When `tag` appears in `meta`, it adds `tag_` as prefix to other `meta` attributes.

This occurs with the frequently used meta.lastUpdated field. If this is changed, it will break any code that currently uses the name meta_tag_lastUpdated.*.

E.g.

generic_codeable_to_dict({
    'tag': [
        {
            "system": "http://lifeomic.com/fhir/group",
            "code": "group-code-id",
        },
    ],
    'other': 'ok',
})
>>>
{'tag_other': 'ok',
 'tag_system__lifeomic.com/fhir/group__code': 'group-code-id'}

Code references:

if "tag" in codeable_dict:
return (
[without_keys(codeable_dict, ["tag"]), *codeable_dict["tag"]],
join_underscore([prefix, "tag"]),
)

"tag_lastUpdated": "2019-08-13T17:47:18.957Z",

See the HL7 FHIR Resources docs for Meta.lastUpdated and Meta.tag.

pdoc -> sphinx

The motivation is to improve the UX of the SDK documentation for its target users.

As we add more capability to this SDK for target users such as ML engineers, it needs a getting-started guide and a dev guide. The existing pdoc-based solution is focused on API docs, and its hierarchy comes from the README.md. I'm looking for a better tool, so I'd like to give Sphinx a try.

Things on my mind:

  1. generate documentation via Sphinx instead of pdoc
  2. improve the TOC for the target audiences (users and contributors/maintainers)
  3. include a new getting-started guide for ML engineers using the Patient ML feature

Add simple recipe for survey results (Observation, QuestionnaireResponse, Questionnaire)

Right now, we don't have Questionnaire or QuestionnaireResponse. Each of these has to be merged with the appropriate Observation resources in order to produce a nice frame of the results with the survey names and versions.

  • phc.Questionnaire
  • phc.Survey
  • phc.QuestionnaireResponse
  • phc.SurveyObservation (or SurveyResults, SurveyResponse, etc)

Example usage:

phc.SurveyObservation.get_data_frame() # => Returns observations that have the survey code

phc.Survey.get_data_frame() # => Returns surveys (questionnaires) and then you pick an ID
phc.SurveyObservation.get_data_frame(questionnaire_id="...", join_survey=True) # => (Allow multiple ids)

Here's some rough code of what's needed right now to accomplish this.

import pandas as pd

questionnaires_raw = phc.Query.execute_fhir_dsl({
    "type": "select",
    "columns": [
        {"expr": {"type": "column_ref", "column": c}}
        for c in ["id", "title", "meta", "version", "status", "date",
                  "subjectType", "identifier", "description", "contained"]
    ],
    "from": [{"table": "questionnaire"}],
}, page_size=100)

questionnaires = (
    phc.Frame
    .expand(pd.DataFrame([r["_source"] for r in questionnaires_raw]))
    .sort_values("version", ascending=False)
)

questionnaire_responses_raw = phc.Query.execute_fhir_dsl({
    "type": "select",
    "columns": [
        {"expr": {"type": "column_ref", "column": c}}
        for c in ["id", "status", "questionnaire"]
    ],
    "from": [{"table": "questionnaire_response"}],
}, all_results=True)

questionnaire_responses = phc.Frame.expand(
    pd.DataFrame([r["_source"] for r in questionnaire_responses_raw]),
    code_columns=["questionnaire"]
)

responses = phc.Observation.get_data_frame(
    code="...",
    system="http://lifeomic.com/fhir/primary-survey-id",
#     all_results=True
)

def join_observation_responses(
    observation_df: pd.DataFrame,
    questionnaire_response_df: pd.DataFrame,
    questionnaire_df: pd.DataFrame
):
    return observation_df.assign(**{
        "related.target_reference": observation_df["related.target_reference"].str.replace("QuestionnaireResponse/", "")
    }).join(
        questionnaire_response_df.set_index("id"),
        on="related.target_reference",
        rsuffix=".questionnaire_response"
    ).pipe(
        lambda df: df.assign(**{
            "questionnaire.reference": df["questionnaire.reference"].str.replace("Questionnaire/", "")
        })
    ).join(
        questionnaire_df.set_index("id"),
        on="questionnaire.reference",
        rsuffix=".questionnaire"
    )


join_observation_responses(responses, questionnaire_responses, questionnaires)[[
    "id",
    "subject.display",
    "status", # ?
    "title",
    "version",
    "status.questionnaire",
    "description",
    "status.questionnaire_response",
    "code.coding_system__lifeomic.com/fhir/questionnaire/item__code",
    "code.coding_system__lifeomic.com/fhir/questionnaire/item__display",
    "code.coding_system__lifeomic.com/fhir__code",
    "code.coding_system__lifeomic.com/fhir__display",
    "code.coding_system__lifeomic.com/fhir__userSelected",
    "valueCodeableConcept.coding_display",
    "valueCodeableConcept.coding_code",
    "valueString",
    "valueQuantity.value",
    "effectiveDateTime.local"
]].rename(columns={
    "code.coding_system__lifeomic.com/fhir/questionnaire/item__code": "item_code",
    "code.coding_system__lifeomic.com/fhir/questionnaire/item__display": "item", 
})

make -> poetry

I propose to use poetry to:

  • manage dependencies
  • manage virtualenv out of the box
  • manage the build/release process
  • replace the need for make (requires the poethepoet package)

It will improve the maintenance and contribution experience of this repo for future features.

see PR #189

Handle `raw` for GenomicVariant sub-classes

In this case, we can't join to the GenomicTest so we'll require variant_set_ids, but this will at least give an opportunity to do something custom (or fix something that blows up when it tries to expand the data frame).
