ohdsi / phenotypelibrary

A repository to store, organize and maintain the content of the OHDSI Phenotype library. OHDSI Forum post https://forums.ohdsi.org/t/ohdsi-phenotype-library-announcements/16910

Home Page: https://ohdsi.github.io/PhenotypeLibrary/

License: Other

R 96.57% Perl 1.44% Shell 1.98%

hades

phenotypelibrary's Introduction

The OHDSI phenotype library

PhenotypeLibrary is part of HADES.

Gowtham A Rao

Announcements on OHDSI forums
Release notes
Definition catalog

The Observational Health Data Sciences and Informatics (OHDSI) community has developed a publicly accessible, version-controlled Phenotype Library to guide real-world evidence towards the FAIR principles: Findability, Accessibility, Interoperability, and Reusability.[1] This library aims to foster the submission and retrieval of high-quality cohort definitions, cataloging of metadata, attribution, and promotion of discovery and reuse in scientific research.

Within the OHDSI Phenotype Library (OHDSI PL), each entry represents a unique cohort definition identifiable by a stable, externally referenceable ID. Comprehensive metadata about each cohort definition is cataloged and made searchable for researchers.[2] Content in the library is subject to version control, with each version assigned a specific DOI.

The OHDSI PL employs a community engagement and contribution process, crediting contributors via ORCID when provided. Submitted cohort definitions are subject to a voluntary, open peer review process managed by the OHDSI Phenotype Development and Evaluation Workgroup. All cohort definitions are computable and portable and conform to the specifications of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) promoting efficient implementation, standard terminology use, seamless conversions between computable and human-readable definitions, and consistent understanding of the logic.[3-6]

Metadata: The library has the capacity to collect a wide range of metadata:

a) User/community/author-submitted metadata. This includes the short and long names of the cohort definition; the names or ORCID IDs of the contributor(s) and peer reviewer(s); a clinical description of the phenotype for which the cohort definition was designed; a concise explanation of the cohort definition's technical logic to help others understand the underlying code; the recommended study applications for these definitions; pertinent external links, such as OHDSI forum posts discussing the definition; any contributed summary output from OHDSI software such as CohortDiagnostics and/or PheValuator; community recommended tags; any relevant evaluation or peer reviewer comments; and notes on the community's experiences implementing the definition in research.

b) Librarian-assigned metadata, including a managed taxonomy of tags to promote systematic discovery and content navigation; the status of the definition (whether it's accepted, pending peer review, or deprecated), and its Digital Object Identifier (DOI).

c) Computer-generated metadata, which is currently only available for cohort definitions that adhere to the Circe-defined phenotype definition object model. This data encompasses a human-readable, complete cohort definition logic; a list of domains used in the cohort definition; entry event code lists; comprehensive code lists; and resolved codes.

Maintenance:

a) Lifecycle: Once peer-reviewed and accepted, Cohort Definitions become immutable. This differs from Cohort Definitions that have not undergone peer review; these definitions have the potential to evolve in future released versions of the library. However, within a referenced released version with an associated DOI, all cohort definitions maintain stability. Cohort Definitions accepted in a release can be deprecated or marked as an error in subsequent versions. Deprecated cohorts [D] remain valid but an alternate cohort might be suggested based on peer review feedback. These cohorts continue to be relevant for OHDSI studies and remain immutable and referenceable. On the other hand, an Error cohort [E] refers to an accepted cohort identified to have a previously unrecognized error. This is akin to a soft deletion, and such cohorts are not recommended for use in OHDSI studies. Despite the error, as accepted cohorts, they will persist and maintain their immutable status in the OHDSI library.

b) Technical infrastructure and version control: The OHDSI Phenotype Library (PL) is hosted in a GitHub repository under the OHDSI organization (https://github.com/ohdsi/PhenotypeLibrary) and is packaged as the R package PhenotypeLibrary. This R package is part of the OHDSI HADES ecosystem (https://ohdsi.github.io/Hades/) and, in adherence with the HADES principles, is designed to integrate with other HADES packages. The repository can also be accessed directly through the GitHub APIs without R, as illustrated by https://dash.ohdsi.org/phenotype-explorer. The PhenotypeLibrary R package includes a function, getPhenotypeLog(), which retrieves the cohort definitions and related metadata in a tabular format. The release process follows the HADES convention of a three-segment version number. The first segment signifies major breaking changes, like a full library overhaul, although no such changes are anticipated in the foreseeable future. The second segment, which changes most often, signifies changes to the set of cohort definitions; once a cohort definition is accepted, it remains unchanged. The third segment is used when changes are limited to documentation and do not touch cohort definitions.

c) Release cycle: This library follows a regular release cycle, having launched approximately 15 releases since the establishment of major version 3 in 2022. Prior major versions, like version 2 (deprecated in 2022 - https://github.com/OHDSI/PhenotypeLibrary/tree/master-archive) and version 1 (deprecated in 2020 - https://github.com/OHDSI/PhenotypeLibrary/tree/legacy), are now archived and no longer in active use.

d) Quality checks: The library is subject to regular quality checks to ensure that the cohort definitions are executable across the OHDSI network. This is done by executing a study package, named PhenotypeLibraryDiagnostics (https://github.com/ohdsi-studies/PhenotypeLibraryDiagnostics), that executes a limited set of diagnostics within CohortDiagnostics. It is executed by volunteer data partners and the outcomes are uploaded to https://data.ohdsi.org/PhenotypeLibrary. This process ensures that every cohort definition can be executed on the OMOP CDM v5.x platform.
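
Two of the conventions described under Maintenance can be sketched in base R. This is only an illustration under stated assumptions: the helper names classify_release and cohort_status, the labels they return, and the sample cohort names are hypothetical and are not part of the PhenotypeLibrary package. The second helper assumes the status marker appears as a bracketed prefix of the cohort name, and it interprets only [D] and [E], the two markers defined above.

```r
# Hypothetical helper: name the kind of release from the first
# version segment that differs between two three-segment versions.
classify_release <- function(previous, current) {
  parse_version <- function(v) as.integer(strsplit(v, ".", fixed = TRUE)[[1]])
  prev <- parse_version(previous)
  curr <- parse_version(current)
  stopifnot(length(prev) == 3, length(curr) == 3)
  changed <- which(prev != curr)
  if (length(changed) == 0) {
    return("no change")
  }
  # Labels are assumptions summarising the three segments described above.
  c("breaking change", "cohort definitions", "documentation only")[changed[1]]
}

# Hypothetical helper: map a leading "[D]" or "[E]" tag to a status label.
cohort_status <- function(cohort_name) {
  has_tag <- grepl("^\\[[A-Za-z]\\]", cohort_name)
  tag <- ifelse(has_tag,
                sub("^\\[([A-Za-z])\\].*$", "\\1", cohort_name),
                NA_character_)
  status <- rep("unspecified", length(cohort_name))
  status[!is.na(tag) & tag == "D"] <- "deprecated"
  status[!is.na(tag) & tag == "E"] <- "error"
  status
}

release_kinds <- c(
  classify_release("3.1.6", "3.2.0"),
  classify_release("3.2.0", "3.2.1")
)

example_names <- c(
  "[D] Example deprecated cohort",
  "[E] Example cohort with a known error",
  "Example cohort without a tag"
)
statuses <- cohort_status(example_names)

# Error cohorts are not recommended for use; deprecated ones remain valid.
usable <- example_names[statuses != "error"]
```

Filtering on the status label rather than on the raw name keeps the rule "deprecated is still valid, error is not" in one place.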

Limitations:

Although the library attempts to guide real-world evidence towards the FAIR principles, it has several limitations, specifically in its conformance to a machine-readable semantic Resource Description Framework (RDF) standard.[1] In the absence of such conformance, it is hard to relate items in the metadata or to find related cohort definitions with ease. As the library is evolving towards such a desired future state, this limitation is perhaps best described as that of the current observational research standards. By promoting the standards set out in this paper, we aim to work towards a larger resource of information that will conform to the FAIR principles.

Conclusions

The OHDSI Phenotype Library is an open-science, version-controlled cohort definition repository with robust community engagement and contribution and an embedded open peer review process using DOI and ORCID. Comprehensive metadata and peer review processes help ensure that cohort definitions are of good quality and usable in observational research.

References:

1. Wilkinson, M.D., et al., The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 2016. 3(1): p. 160018.
2. Richesson, R.L., M.M. Smerek, and C. Blake Cameron, A Framework to Support the Sharing and Reuse of Computable Phenotype Definitions Across Health Care Delivery and Clinical Research Applications. EGEMS (Wash DC), 2016. 4(3): p. 1232.
3. Hripcsak, G., et al., Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform, 2015. 216: p. 574-8.
4. Overhage, J.M., et al., Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc, 2012. 19(1): p. 54-60.
5. Newton, K.M., et al., Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc, 2013. 20(e1): p. e147-54.
6. Hripcsak, G., et al., Facilitating phenotype transfer using a common data model. J Biomed Inform, 2019. 96: p. 103253.

Technology

PhenotypeLibrary is an R package.

System Requirements

Requires R (version 3.6.0 or higher).

Installation

See HADES instructions for configuring your R environment, including RTools and Java.
In R, use the following commands to download and install PhenotypeLibrary:

install.packages("remotes")
remotes::install_github("ohdsi/PhenotypeLibrary")

Contributing

The OHDSI Phenotype Development and Evaluation workgroup manages contributions to the OHDSI Phenotype Library.

License

PhenotypeLibrary is licensed under Apache License 2.0

phenotypelibrary's Issues

Add URLs in About

I had to hunt around to find these URLs and I think adding them to the GitHub ABOUT would be helpful.

The second link I do not have access to. Maybe if it is only admins, but if not, instructions on how to get access would be helpful too.

Publish subscribe mechanism for (e.g. WebSub)

Going forward, I see a potential issue when phenotypes are updated while different end users are still working with older versions. Implementing a form of publish-subscribe model for this would be ideal. For example, following the WebSub protocol would be one approach.

In the WebSub model, 'webhooks' are callbacks that run when updates are published to 'topics' that users intentionally follow. For OHDSI network collaborators, this would mean implementing tools that automatically and incrementally update generated cohorts across sites. It could also give librarians a way to keep track of changes to specific phenotypes of interest, notifying them when related cohorts are changed. They could then plan actions to update cohort diagnostics or results across different network sites.

I don't have a concrete idea of how to implement this exactly, but I do have the use case of wanting to regenerate cohorts when the PhenotypeLibrary definition is updated. Designing a clear, consistent, way to do this would be beneficial.
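
As a starting point for the use case above, here is a minimal sketch in base R of polling-based change detection. It is not an implementation of WebSub: it simply compares two snapshots of cohort definitions and reports which cohort IDs would need regenerating. The function name find_changed_cohorts, the two-column layout (cohortId, json), and the sample data are all hypothetical.

```r
# Hypothetical helper: compare two snapshots of cohort definitions.
# Each snapshot is a data frame with columns cohortId and json.
find_changed_cohorts <- function(old, new) {
  merged <- merge(old, new, by = "cohortId", all = TRUE,
                  suffixes = c("Old", "New"))
  in_both <- !is.na(merged$jsonOld) & !is.na(merged$jsonNew)
  list(
    added   = merged$cohortId[is.na(merged$jsonOld)],
    removed = merged$cohortId[is.na(merged$jsonNew)],
    # A definition counts as updated when its JSON text differs.
    updated = merged$cohortId[in_both & merged$jsonOld != merged$jsonNew]
  )
}

old_snapshot <- data.frame(
  cohortId = c(1, 2, 3),
  json = c("{\"a\":1}", "{\"b\":2}", "{\"c\":3}"),
  stringsAsFactors = FALSE
)
new_snapshot <- data.frame(
  cohortId = c(2, 3, 4),
  json = c("{\"b\":2}", "{\"c\":30}", "{\"d\":4}"),
  stringsAsFactors = FALSE
)

changes <- find_changed_cohorts(old_snapshot, new_snapshot)
```

A site could run a comparison like this after each library release and regenerate only the cohorts listed under added and updated.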

Incorrect name for cohort 1219

The cohort name of 1219 is "Acute Bronchitis", but this should be "Hyperlipidemia":

cohort <- getPlCohortDefinitionSet(1219)
cohort$cohortName
[1] "Acute Bronchitis"

writeLines(cohort$json)
{
	"cdmVersionRange" : ">=5.0.0",
	"PrimaryCriteria" : {
		"CriteriaList" : [
			{
				"ConditionOccurrence" : {
					"CodesetId" : 0,
					"ConditionTypeExclude" : false
				}
			}
		],
		"ObservationWindow" : {
			"PriorDays" : 0,
			"PostDays" : 0
		},
		"PrimaryCriteriaLimit" : {
			"Type" : "First"
		}
	},
	"ConceptSets" : [
		{
			"id" : 0,
			"name" : "Hyperlipidemia",
			"expression" : {
				"items" : [
					{
						"concept" : {
							"CONCEPT_ID" : 432867,
							"CONCEPT_NAME" : "Hyperlipidemia",
							"STANDARD_CONCEPT" : "S",
							"STANDARD_CONCEPT_CAPTION" : "Standard",
							"INVALID_REASON" : "V",
							"INVALID_REASON_CAPTION" : "Valid",
							"CONCEPT_CODE" : "55822004",
							"DOMAIN_ID" : "Condition",
							"VOCABULARY_ID" : "SNOMED",
							"CONCEPT_CLASS_ID" : "Clinical Finding"
						},
						"isExcluded" : false,
						"includeDescendants" : true,
						"includeMapped" : false
					},
					{
						"concept" : {
							"CONCEPT_ID" : 437530,
							"CONCEPT_NAME" : "Disorder of lipid metabolism",
							"STANDARD_CONCEPT" : "S",
							"STANDARD_CONCEPT_CAPTION" : "Standard",
							"INVALID_REASON" : "V",
							"INVALID_REASON_CAPTION" : "Valid",
							"CONCEPT_CODE" : "267431006",
							"DOMAIN_ID" : "Condition",
							"VOCABULARY_ID" : "SNOMED",
							"CONCEPT_CLASS_ID" : "Clinical Finding"
						},
						"isExcluded" : false,
						"includeDescendants" : false,
						"includeMapped" : false
					}
				]
			}
		}
	],
	"QualifiedLimit" : {
		"Type" : "First"
	},
	"ExpressionLimit" : {
		"Type" : "First"
	},
	"InclusionRules" : [],
	"CensoringCriteria" : [],
	"CollapseSettings" : {
		"CollapseType" : "ERA",
		"EraPad" : 0
	},
	"CensorWindow" : {}
}

Inclusion in HADES

Your package is almost ready for inclusion in HADES! I did find some issues that need resolving first:

There are no unit tests. Please add some. Since the package only has 2 functions, it should be easy to achieve 100% coverage ;-)
The package manual (PDF) and website are out of date. At the next release, don't forget to run these lines in PackageMaintenance.R before pushing to main.
The set of issue labels does not correspond to those recommended in HADES. You don't have to copy them exactly, but right now the issue labels are a bit of a mess.
Please clean up your issues (I think all can be closed).

Ideally, there should also be a vignette or maybe just some example code in the README demonstrating how to use the package, for example in combination with CohortDiagnostics, to instantiate cohorts in the Library.

Not a requirement for HADES, but I think many people will want to know

Which phenotypes are included in the library, and how to find out if they're any good for a new study someone wants to design.
By what process the Phenotype Library came about. What quality checks have been applied, etc.
Where to find the Cohort Diagnostics of these phenotypes.
How one can contribute new phenotypes.

This does not have to be documented in this repo, but pointers to websites where this is documented would be helpful.

Phenotype Library App not working

We noticed that the app is not working anymore:

https://data.ohdsi.org/PhenotypeLibrary/

Can someone look into that?

CKD Excluding Stage 1 & Stage 2 has those concept

Identify the cohort definition
46271022005 | Chronic kidney disease prevalent cohort: earliest CKD with evidence CKD 90d post-index, 2+ dialysis events 90d apart, 2+ kayexlate drug exposure 90d apart, or evidence of AV Fistula

Describe the issue
The CKD definition says it excludes Stage 1 & 2 but includes:
3185897 | Chronic kidney disease stage II

Currently this concept is not used (Nebraska Lexicon), according to PHOEBE. But if you are doing clean-up, it would look better on paper.

Additional context
Low Priority

Description flipped for Incident and Prevalent cohorts?

Identify the cohort definition
Looking at Malignant Melanoma of Skin. C1 Incident Cohort, but I believe this applies to most if not all Incident vs Prevalent cohorts.

Describe the issue
The Description does not show the 365 day prior observation period requirement, but the JSON code does. (The JSON appears correct to me, given my understanding of incident cohorts and the fact that the count is lower on C1 than C2.)

Screenshots

Additional context
The incident and prevalent descriptions need to be corrected (ie switched), potentially for every C1 and C2 in the library.

Misprint in cteOvul2StartDates

The join to ctePriorOutcomeDates uses pod.PERSON_ID = pod.EVENT_ID where it should use pod.PERSON_ID = e.PERSON_ID:

cteOvul2StartDates (PERSON_ID, EVENT_ID, EPISODE_START_DATE, CATEGORY, DATE_RANK) as
(
select PERSON_ID, EVENT_ID, EPISODE_START_DATE, CATEGORY, 4 as DATE_RANK
from
(
select e.PERSON_ID, e.EVENT_ID as EVENT_ID, dateadd(d,(-14) + 1, p.EVENT_DATE) as EPISODE_START_DATE, p.Category,
row_number() over (partition by e.person_id, e.event_id order by p.EVENT_DATE) as rn
from cteOutcomeEvents e
JOIN @target_database_schema.@tablestem_term_durations lb on e.Category = lb.CATEGORY
JOIN ctePriorOutcomeDates pod on pod.PERSON_ID = pod.EVENT_ID and pod.EVENT_ID = e.EVENT_ID
JOIN #pregnancy_events p on e.PERSON_ID = p.PERSON_ID
where p.CATEGORY = 'OVUL2'
and dateadd(d,(-14) + 1, p.EVENT_DATE) between
case when dateadd(d, lb.retry , pod.PRIOR_OUTCOME_DATE) > dateadd(d, -1* lb.MAX_TERM, e.EVENT_DATE) then dateadd(d, lb.retry , pod.PRIOR_OUTCOME_DATE)
else dateadd(d, -1* lb.MAX_TERM, e.EVENT_DATE) end
and dateadd(d, -1* lb.MIN_TERM, e.EVENT_DATE)
) Q
where rn=1
)

Create Date, Update Date

A Create Date and an Update Date on the Cohort Definitions would be helpful. Before the holidays I was working with a few definitions and now it looks like there are new ones in the library. These dates would help me know how much might be out of sync.
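
To illustrate how such dates could be used, here is a minimal sketch in base R. It assumes a tabular log with the columns addedDate and updatedDate, names that appear in the R check note further down this page; the data frame, its values, and the last_sync date are made up for the example.

```r
# Hypothetical phenotype log; column names follow the R check note on this page.
phenotype_log <- data.frame(
  cohortId = c(101, 102, 103),
  addedDate = as.Date(c("2022-06-01", "2022-11-15", "2023-01-20")),
  updatedDate = as.Date(c("2022-06-01", "2023-01-10", "2023-01-20"))
)

# Date on which the local copy of the definitions was last refreshed (assumed).
last_sync <- as.Date("2022-12-15")

is_new <- phenotype_log$addedDate > last_sync
is_updated <- !is_new & phenotype_log$updatedDate > last_sync

# One row per cohort that is out of sync, with the reason.
out_of_sync <- data.frame(
  cohortId = phenotype_log$cohortId[is_new | is_updated],
  reason = ifelse(is_new, "added", "updated")[is_new | is_updated],
  stringsAsFactors = FALSE
)
```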

https://dash.ohdsi.org/phenotype-explorer Not working

The documentation refers to https://dash.ohdsi.org/phenotype-explorer but this is not loading for me. Is this indeed not working anymore?

Some json files are not valid UTF8

Are the json files in the Phenotype library supposed to be valid UTF-8?
If not what encoding is used?

library(PhenotypeLibrary)

listPhenotypes()$cohortId |>
  getPlCohortDefinitionSet() |>
  dplyr::filter(!validUTF8(json)) 
#> # A tibble: 5 × 4
#>   cohortId cohortName                                              json    sql  
#>      <dbl> <chr>                                                   <chr>   <chr>
#> 1        6 [P] Fever (3Pe, 30Era)                                  "{\n\t… "CRE…
#> 2       16 [P] Exposure to SARS-Cov 2 and coronavirus (7Pe, 30Era) "{\n\t… "CRE…
#> 3       29 [W] Autoimmune condition (FP)                           "{\n\t… "CRE…
#> 4       64 [P] Flu-like symptoms (3P, 30Era)                       "{\n\t… "CRE…
#> 5       73 [W] Pregnancy (270P, 0Era)                              "{\n\t… "CRE…

Created on 2023-06-20 with reprex v2.0.2

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/London
#>  date     2023-06-20
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package          * version date (UTC) lib source
#>  backports          1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  bit                4.0.5   2022-11-15 [1] CRAN (R 4.2.0)
#>  bit64              4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  checkmate          2.2.0   2023-04-27 [1] CRAN (R 4.2.0)
#>  cli                3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
#>  crayon             1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
#>  digest             0.6.31  2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr              1.1.2   2023-04-20 [1] CRAN (R 4.2.0)
#>  evaluate           0.21    2023-05-05 [1] CRAN (R 4.2.0)
#>  fansi              1.0.4   2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap            1.1.1   2023-02-24 [1] CRAN (R 4.2.0)
#>  fs                 1.6.2   2023-04-25 [1] CRAN (R 4.2.0)
#>  generics           0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  glue               1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  hms                1.1.3   2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools          0.5.5   2023-03-23 [1] CRAN (R 4.2.0)
#>  knitr              1.43    2023-05-25 [1] CRAN (R 4.2.0)
#>  lifecycle          1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
#>  magrittr           2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  PhenotypeLibrary * 3.2.0   2023-06-20 [1] local
#>  pillar             1.9.0   2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig          2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr              1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache            0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3        1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo               1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils            2.12.2  2022-11-11 [1] CRAN (R 4.2.0)
#>  R6                 2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr              2.1.4   2023-02-10 [1] CRAN (R 4.2.0)
#>  reprex             2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang              1.1.1   2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown          2.22    2023-06-01 [1] CRAN (R 4.2.0)
#>  rstudioapi         0.14    2022-08-22 [1] CRAN (R 4.2.0)
#>  sessioninfo        1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  styler             1.10.1  2023-06-05 [1] CRAN (R 4.2.0)
#>  tibble             3.2.1   2023-03-20 [1] CRAN (R 4.2.0)
#>  tidyselect         1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  tzdb               0.4.0   2023-05-12 [1] CRAN (R 4.2.0)
#>  utf8               1.2.3   2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs              0.6.2   2023-04-19 [1] CRAN (R 4.2.0)
#>  vroom              1.6.3   2023-04-28 [1] CRAN (R 4.2.0)
#>  withr              2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun               0.39    2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml               2.3.7   2023-01-23 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

R check notes need solving

R check reveals some issues in the code that need to be fixed:

* checking R code for possible problems ... NOTE
getPhenotypeLog: no visible binding for global variable ‘cohortId’
getPhenotypeLog: no visible binding for global variable ‘addedVersion’
getPhenotypeLog: no visible binding for global variable ‘addedDate’
getPhenotypeLog: no visible binding for global variable
  ‘deprecatedVersion’
getPhenotypeLog: no visible binding for global variable
  ‘deprecatedDate’
getPhenotypeLog: no visible binding for global variable
  ‘updatedVersion’
getPhenotypeLog: no visible binding for global variable ‘updatedDate’
getPhenotypeLog: no visible binding for global variable ‘notes’
getPlCohortDefinitionSet: no visible binding for global variable
  ‘cohortId’
updatePhenotypeLog: no visible binding for global variable
  ‘createdDate’
updatePhenotypeLog: no visible binding for global variable
  ‘modifiedDate’
updatePhenotypeLog: no visible binding for global variable ‘name’
updatePhenotypeLog: no visible binding for global variable
  ‘description’
updatePhenotypeLog: no visible binding for global variable ‘cohortId’
updatePhenotypeLog: no visible binding for global variable ‘cohortName’
updatePhenotypeLog: no visible binding for global variable ‘addedDate’
updatePhenotypeLog: no visible binding for global variable
  ‘updatedDate’
updatePhenotypeLog: no visible binding for global variable ‘notes’
updatePhenotypeLog: no visible binding for global variable ‘getResults’
updatePhenotypeLog: no visible binding for global variable
  ‘addedVersion’
updatePhenotypeLog: no visible binding for global variable
  ‘deprecatedDate’
updatePhenotypeLog: no visible binding for global variable
  ‘deprecatedVersion’
updatePhenotypeLog: no visible binding for global variable
  ‘updatedVersion’
Undefined global functions or variables:
  addedDate addedVersion cohortId cohortName createdDate deprecatedDate
  deprecatedVersion description getResults modifiedDate name notes
  updatedDate updatedVersion

I think these are all caused by not using .data$ or quotes in dplyr calls. For example, this

cohorts <- listPhenotypes() %>%
    filter(cohortId %in% cohortIds)

should be

cohorts <- listPhenotypes() %>%
    filter(.data$cohortId %in% cohortIds)

Of course, nowadays select and rename need quotes instead of .data$.

Move `extra/PhenotypeDescription.csv` into `inst` for use by other packages

Might I ask that extra/PhenotypeDescription.csv get moved into inst so that its contents may be used for other (dependent) packages? Pretty please.

Add test to ensure that all characters are valid UTF-8

I would like to propose ensuring that all json and sql files in the Phenotype library contain Unicode encoded characters. Preferably all characters would be valid UTF-8. Ensuring that all characters are valid UTF8 will make files easier to read into R. See https://rdrr.io/rforge/stringi/man/stringi-encoding.html for some background.

I propose converting any non-UTF-8 encoded characters to their UTF-8 counterparts (e.g. "Sjögren's syndrome")
Adding a test that checks that all text files contain only valid UTF-8 characters
Adding a dataframe/tibble to the package that would contain one row per cohort with metadata, json (as a list column), and sql (as a list column), so that loading the package in R would provide instant access to all phenotype and cohort data the package provides.

Currently there are a number of files that contain non-utf8 (ansi) characters.

# verify that all text is valid utf-8
library(dplyr)
phenotypeIds <- list.files("inst") %>%
  stringr::str_subset("\\d+")

jsonFiles <- purrr::map(phenotypeIds, ~paste0(glue::glue("inst/{.}/") , list.files(glue::glue("inst/{.}")))) %>%
  unlist() %>%
  stringr::str_subset("\\.json")

df <- tibble::tibble(jsonFiles) %>%
  mutate(isutf8 = purrr::map(jsonFiles, ~stringi::stri_enc_isutf8(readr::read_lines_raw(.)))) %>%
  mutate(bad_lines = purrr::map_chr(isutf8, ~paste(which(!.), collapse = ","))) %>%
  filter(bad_lines != "") %>%
  mutate(notutf8 = paste("file:", jsonFiles, "lines:", bad_lines)) %>%
  select(notutf8)

if(nrow(df) != 0){
  message(paste0(c("The following lines contain non-utf8 characters", df$notutf8), collapse = "\n"))
}

#> The following lines contain non-utf8 characters
#> file: inst/254443000/254443001.json lines: 24,30
#> file: inst/254443000/254443002.json lines: 24,30
#> file: inst/254443000/254443003.json lines: 30
#> file: inst/254443000/254443004.json lines: 64
#> file: inst/378419000/378419003.json lines: 966,1227
#> file: inst/4098597000/4098597001.json lines: 24,30
#> file: inst/4098597000/4098597002.json lines: 24,30
#> file: inst/4101602000/4101602001.json lines: 24,30
#> file: inst/4101602000/4101602002.json lines: 24,30
#> file: inst/4101602000/4101602003.json lines: 23,29
#> file: inst/4101602000/4101602004.json lines: 73
#> file: inst/4137275000/4137275003.json lines: 23
#> file: inst/4164770000/4164770001.json lines: 24,30
#> file: inst/4164770000/4164770002.json lines: 24,30
#> file: inst/4164770000/4164770003.json lines: 63
#> file: inst/4266367000/4266367003.json lines: 173,929
#> file: inst/436642000/436642004.json lines: 81
#> file: inst/437663000/437663003.json lines: 129,885
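
The proposed check and conversion can be sketched in base R alone. This is an illustration, not the package's test: the helper name find_invalid_utf8_lines and the temporary file are hypothetical, and the re-encoding step assumes the stray bytes are Latin-1, which would need to be confirmed for the real files.

```r
# Hypothetical helper: line numbers in a file that are not valid UTF-8.
find_invalid_utf8_lines <- function(path) {
  lines <- readLines(path, warn = FALSE)
  which(!validUTF8(lines))
}

# Build a small sample file whose second line holds "Sjögren"
# encoded as Latin-1 (the byte 0xf6 stands for "ö").
path <- tempfile(fileext = ".json")
writeBin(
  c(charToRaw("{\n\"name\": \""),
    as.raw(c(0x53, 0x6a, 0xf6, 0x67, 0x72, 0x65, 0x6e)),
    charToRaw("\"\n}\n")),
  path
)

bad_lines <- find_invalid_utf8_lines(path)

# Re-encode the offending lines (assumed Latin-1) and write the file as UTF-8.
lines <- readLines(path, warn = FALSE)
lines[bad_lines] <- iconv(lines[bad_lines], from = "latin1", to = "UTF-8")
writeLines(lines, path, useBytes = TRUE)

remaining <- find_invalid_utf8_lines(path)
```

In a test suite, the same helper could be applied to every json and sql file, asserting that it returns no line numbers.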

Possible issue with existing cohort definition 441202003: Anaphylactic reaction by Walsh (Mini-sentinial validated definition)

Identify the cohort definition
Cohort 441202003: Anaphylactic reaction by Walsh (Mini-sentinial validated definition)

Describe the issue
This cohort definition seems to capture many more events than I would expect. I apologize for the lack of detail but have very limited clinical background to understand what the problem is. I do think that this cohort definition should be reviewed by experts though.

Screenshots
I cannot share any screenshots unfortunately but can tell you that I believe the problem has to do with the third part of the cohort definition.

Initial Event Cohort
People having any of the following:
  A visit occurrence of Outpatient visit
  Having all of the following criteria:
    Having any of the following criteria:
      at least 1 occurrences of a drug era of Diphenhydramine hydrochloride [5]
        where event starts between 0 days Before and 0 days After index start date
      or at least 1 occurrences of a condition occurrence of Bronchospasm [3]
        where event starts between all days Before and all days After index start date
      or at least 1 occurrences of a condition occurrence of Stridor [11]
        where event starts between all days Before and all days After index start date
    And having all of the following criteria:
      at least 1 occurrences of a drug era of Epinephrine [6]
        where event starts between 0 days Before and 0 days After index start date
      and at least 1 occurrences of a condition occurrence of Hypotension [7]
        where event starts between 0 days Before and 0 days After index start date
      and at least 1 occurrences of a procedure of Cardiopulmonary resuscition [4]
        where event starts between 0 days Before and 0 days After index start date
    And having all of the following criteria:
      at least 1 occurrences of a condition occurrence of Allergy unspecified [1]
        where event starts between 0 days Before and 0 days After index start date

Additional context
Just want to flag this cohort for a second look by the experts.

Project structure and contribution guidelines

Suuuuper thanks for working on this!

I want to contribute, but hate to step on others' toes. To that end, could you add a contribution file? Here's an example from GitHub (the organization, not the platform 😄).

I'm happy to try my hand at it, if you like.

Thanks!!

[BUG] Critical Error When Accessing Website

Error Description

No idea what is happening with the PhenotypeLibrary website, but I just tried accessing it this morning and got this error page:

Error Details

Here is the traceback:

su: ignoring --preserve-environment, it's mutually exclusive with --login
Connecting using PostgreSQL driver
Error in value[[3L]](cond) : 
  Error reading from results.database: org.postgresql.util.PSQLException: ERROR: permission denied for table database
Calls: runApp ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted

Additional Details

I tested on the Vivaldi browser:

Vivaldi | 4.2.2406.52 (Stable channel) (x86_64)
-- | --
Revision | 15706f801ff5b5ab0b97af359fe9b12707e81807
OS | macOS Version 10.15.7 (Build 19H1417)
JavaScript | V8 9.3.345.19
User Agent | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.99 Safari/537.36
Command Line | /Applications/Vivaldi.app/Contents/MacOS/Vivaldi --flag-switches-begin --flag-switches-end --disable-smooth-scrolling --save-page-as-mhtml
Executable Path | /Applications/Vivaldi.app/Contents/MacOS/Vivaldi
Profile Path | /Users/jzelko3/Library/Application Support/Vivaldi/Default
Linker | ld

And Safari Browser:

Version 14.1.2 (15611.3.10.1.7, 15611)

Error in IL-23 inhibitor cohorts I submitted

@gowthamrao , I made an error in the 'New users of IL-23 inhibitors' cohorts (ID 1042 and 1057 in the Phenotype Library). The conceptset is supposed to contain drugs that are IL-23 inhibitors. I incorrectly included ustekinumab (which is an IL12/23 inhibitor) and incorrectly omitted guselkumab (which is an IL23 inhibitor, alongside risankizumab and tildrakizumab). How can we correct these errors?

Unable to use Implementation files - T2DM

Hello,

I was trying to use the T2DM_broad.json file in my Atlas instance to generate the cohort. But when I copy the contents of the file from GitHub and paste them into my Atlas instance, the file doesn't load and I don't see any concept sets or cohort definition being generated. For example, when I copy the json file the screen looks like below, and even after I click on "Reload", I don't see any updates in the concept tab of Atlas. Can you help with this, please? What is the right way to do this?

The logic description for referent and incident cohorts are flipped

Cohort ID:		200219001
Cohort Name:		Abdominal pain referent concept incident cohort: First occurrence of referent concept + descendants with >=365d prior observation
Logic:		Persons with condition occurrence of referent concept (200219) or descendants, for the first time in the person's history. Persons exit cohort at the end of the observation period.

Cohort ID:		200219002
Cohort Name:		Abdominal pain referent concept prevalent cohort: First occurrence of referent concept + descendants
Logic:		Persons with condition occurrence of referent concept (200219) or descendants, for the first time in the person's history, with at least 365 days of prior continuous observation. Persons exit cohort at the end of the observation period.
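
A consistency check for this kind of mismatch can be sketched in base R. The snippet below uses the two records quoted above and flags any cohort whose name and logic description disagree about the 365-day prior observation requirement. The matching phrases are assumptions based on the wording of these two records and may not cover other cohorts.

```r
cohorts <- data.frame(
  cohortId = c(200219001, 200219002),
  cohortName = c(
    "Abdominal pain referent concept incident cohort: First occurrence of referent concept + descendants with >=365d prior observation",
    "Abdominal pain referent concept prevalent cohort: First occurrence of referent concept + descendants"
  ),
  logic = c(
    "Persons with condition occurrence of referent concept (200219) or descendants, for the first time in the person's history. Persons exit cohort at the end of the observation period.",
    "Persons with condition occurrence of referent concept (200219) or descendants, for the first time in the person's history, with at least 365 days of prior continuous observation. Persons exit cohort at the end of the observation period."
  ),
  stringsAsFactors = FALSE
)

# Does the name, and separately the logic text, mention the 365-day requirement?
name_requires_365 <- grepl("365d prior observation", cohorts$cohortName, fixed = TRUE)
logic_requires_365 <- grepl("365 days of prior continuous observation", cohorts$logic, fixed = TRUE)

# Cohorts where the name and the logic description disagree.
mismatched <- cohorts$cohortId[name_requires_365 != logic_requires_365]
```

For the two records above, both IDs are flagged, which is consistent with the descriptions having been swapped.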

Warning in utils::tar() storing paths of more than 100 bytes is not portable

When installing this package I get a number of warnings like this. Is non-portability of long file names a potential concern?

 Warning in utils::tar(filepath, pkgname, compression = compression, compression_level = 9L,  :
     storing paths of more than 100 bytes is not portable:
     'PhenotypeLibrary/inst/197320000/literature/Acute kidney injury or Acute renal failure syndrome literature review.xlsx'
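
To find every affected file rather than waiting for the warnings, the byte length of each path can be checked directly. A minimal sketch in base R follows; the helper name find_long_paths is hypothetical, and the 100-byte limit is the one named in the warning. The second path is included only as a short counterexample.

```r
# Hypothetical helper: return the paths whose length in bytes exceeds the limit.
find_long_paths <- function(paths, limit = 100) {
  paths[nchar(paths, type = "bytes") > limit]
}

paths <- c(
  "PhenotypeLibrary/inst/197320000/literature/Acute kidney injury or Acute renal failure syndrome literature review.xlsx",
  "PhenotypeLibrary/DESCRIPTION"
)

long_paths <- find_long_paths(paths)
```

In a package checkout, the paths argument could come from list.files(recursive = TRUE), prefixed with the package name as in the warning.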

T2DM --> Hemoglobin A1c (HbA1c) measurements

An issue for Cohort ID 228 - [P] Type 2 Diabetes Mellitus indexed on diagnosis, treatment or lab results

The concept set "Hemoglobin A1c (HbA1c) measurements" should also include the following CONCEPT_ID:
37393623 - HbA1c level (Diabetes Control and Complications Trial aligned)

The most common unit is percent (at least in my local DB that uses it)

v3.1.6 non-standard concepts to define concept sets

Hi, I am not sure if this was changed in newer versions but wanted to mention that for phenotypes 551 and 676, they are defined by non-standard concepts, and thus in the corresponding sql,
JOIN #Codesets cs on (co.condition_concept_id = cs.concept_id and cs.codeset_id = 0)
this most likely won't grab any patients since the non-standard concept ids will be in their co.condition_source_concept_id if I understand correctly. Thanks!
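
A rough way to spot such definitions is to scan the cohort JSON for concepts whose STANDARD_CONCEPT flag is anything other than "S". The sketch below uses base R regular expressions rather than a JSON parser, so it assumes every concept entry carries both CONCEPT_ID and STANDARD_CONCEPT, in matching order, as in the JSON shown earlier on this page. The second concept (ID 999999, flag "N") is invented for illustration.

```r
# Fragment in the same layout as the cohort JSON on this page.
# The second concept is hypothetical.
json <- paste(
  '{ "concept" : { "CONCEPT_ID" : 432867,',
  '  "STANDARD_CONCEPT" : "S", "STANDARD_CONCEPT_CAPTION" : "Standard" } },',
  '{ "concept" : { "CONCEPT_ID" : 999999,',
  '  "STANDARD_CONCEPT" : "N", "STANDARD_CONCEPT_CAPTION" : "Non-Standard" } }',
  sep = "\n"
)

# Pull out each key/value pair; the closing quote after the key name
# keeps STANDARD_CONCEPT_CAPTION from matching.
id_matches <- regmatches(json, gregexpr('"CONCEPT_ID"[[:space:]]*:[[:space:]]*[0-9]+', json))[[1]]
flag_matches <- regmatches(json, gregexpr('"STANDARD_CONCEPT"[[:space:]]*:[[:space:]]*"[^"]*"', json))[[1]]
stopifnot(length(id_matches) == length(flag_matches))

concept_ids <- as.numeric(sub(".*:[[:space:]]*", "", id_matches))
standard_flags <- sub('.*:[[:space:]]*"([^"]*)"', "\\1", flag_matches)

# Concepts that are not flagged as standard.
non_standard_ids <- concept_ids[standard_flags != "S"]
```

Any IDs returned this way would be candidates for remapping to standard concepts before the definition is run against condition_concept_id.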