Phenotype diagnostics

Introduction

In this vignette, we show how to run phenotypeDiagnostics().

We’ll use the following packages and mock data for example purposes:

library(CohortConstructor)
library(OmopSketch)
library(PhenotypeR)
library(dplyr)
library(DBI)
library(duckdb)
library(CDMConnector)

con <- dbConnect(duckdb(), 
                 eunomiaDir("synpuf-1k", "5.3"))

cdm <- cdmFromCon(con = con, 
                  cdmName = "Eunomia Synpuf",
                  cdmSchema   = "main",
                  writeSchema = "main", 
                  achillesSchema = "main")
cdm

Note that we have included achilles tables in our cdm reference, which will be used to speed up some of the analyses.

We need to create a set of cohorts to review. For this we are going to use the package CohortConstructor to generate cohorts with users of warfarin, acetaminophen and morphine.

# Create codelists
codes <- list("warfarin" = c(1310149, 40163554),
              "acetaminophen" = c(1125315, 1127078, 1127433, 40229134, 40231925, 40162522, 19133768),
              "morphine" = c(1110410, 35605858, 40169988))

# Instantiate cohorts with CohortConstructor
cdm$my_cohort <- conceptCohort(cdm = cdm,
                               conceptSet = codes, 
                               exit = "event_end_date",
                               overlap = "merge",
                               name = "my_cohort")
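
Before running any diagnostics, it can help to sanity-check the codelists themselves. The sketch below uses only base R; it repeats the codelist so that it runs on its own. The final step is optional and assumes that the cdm object created above is available and that the omopgenerics package (which provides cohortCount()) is installed.

```r
# Same codelists as above, repeated so this block runs on its own
codes <- list(
  "warfarin" = c(1310149, 40163554),
  "acetaminophen" = c(1125315, 1127078, 1127433, 40229134, 40231925,
                      40162522, 19133768),
  "morphine" = c(1110410, 35605858, 40169988)
)

# Number of concept IDs in each codelist
conceptsPerCodelist <- lengths(codes)
print(conceptsPerCodelist)

# Concept IDs that appear more than once across the codelists
allCodes <- unlist(codes, use.names = FALSE)
sharedCodes <- unique(allCodes[duplicated(allCodes)])
print(sharedCodes)

# Optional: records and subjects per cohort, only if the cdm from above exists
if (exists("cdm") && "my_cohort" %in% names(cdm) &&
    requireNamespace("omopgenerics", quietly = TRUE)) {
  print(omopgenerics::cohortCount(cdm$my_cohort))
}
```

An empty sharedCodes vector indicates that no concept ID is repeated across the three codelists.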

Running PhenotypeDiagnostics

Now that we have our cohorts, we will use phenotypeDiagnostics() to assess them. This runs the following diagnostics, which help us judge whether our cohorts are ready to be used in research with the OMOP CDM dataset we’re using:

Database diagnostics: This includes information about the size of the data, the time period covered, the number of people in the data, and other metadata of the CDM object. If only database diagnostics are of interest, these analyses can be run using databaseDiagnostics().
Codelist diagnostics: This includes information on the concepts included in our cohorts’ codelists. If only codelist diagnostics are of interest, these analyses can be run using codelistDiagnostics().
Cohort diagnostics: This summarises the characteristics of our cohorts and compares them to age and sex matched controls from the dataset. If only cohort diagnostics are of interest, these analyses can be run using cohortDiagnostics().
Population diagnostics: This calculates the frequency of our study cohorts in the database in terms of their incidence rates and period prevalence. If only population diagnostics are of interest, these analyses can be run using populationDiagnostics().

If we do not provide any specifications, the default values of the functions will be used. That is, the following call runs every diagnostic with the default values of its individual diagnostics function.

diagnostics <- phenotypeDiagnostics(cdm$my_cohort,
                                databaseDiagnostics = list(),
                                codelistDiagnostics = list(),
                                cohortDiagnostics = list(),
                                populationDiagnostics = list(),
                                stagingDirectory = NULL)

Notice that we can specify a directory in which to save a log file, so we can keep track of which incremental results are being run at each point.

If we don’t want to run one of the diagnostics we can switch it off by setting it to NULL.

phenotypeDiagnostics(cdm$my_cohort,
                     databaseDiagnostics = list(),
                     codelistDiagnostics = NULL,
                     cohortDiagnostics = list(),
                     populationDiagnostics = NULL)

Or, if we want to change the settings, we can pass arguments used in the sub-functions in a list. For example, survival analysis is not run by default (cohortSurvival is set to FALSE by default in cohortDiagnostics()). We can run it, leaving the other arguments at their defaults, like so:

diagnostics <- phenotypeDiagnostics(cdm$my_cohort,
                                databaseDiagnostics = list(),
                                codelistDiagnostics = list(),
                                cohortDiagnostics = list("cohortSurvival" = TRUE),
                                populationDiagnostics = list())
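
The three cases above (an empty list for defaults, NULL to skip, a named list to override) follow a common R pattern. The sketch below illustrates that pattern with two hypothetical toy functions, toyCohortDiagnostics() and toyRunner(); they are not part of PhenotypeR and are not meant to describe how the package is implemented internally.

```r
# Hypothetical stand-in for a diagnostics sub-function (NOT part of PhenotypeR)
toyCohortDiagnostics <- function(cohortSurvival = FALSE, cohortSample = 20000) {
  list(cohortSurvival = cohortSurvival, cohortSample = cohortSample)
}

# Hypothetical wrapper: NULL skips the step, a list supplies the arguments
toyRunner <- function(cohortDiagnostics = list()) {
  if (is.null(cohortDiagnostics)) {
    return(NULL)  # step switched off
  }
  # Arguments missing from the list fall back to the sub-function defaults
  do.call(toyCohortDiagnostics, cohortDiagnostics)
}

defaults   <- toyRunner(list())                       # all defaults
overridden <- toyRunner(list(cohortSurvival = TRUE))  # one argument changed
skipped    <- toyRunner(NULL)                         # step not run

str(defaults)
str(overridden)
print(skipped)
```

Only the arguments named in the list are changed; everything else keeps the default of the sub-function.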

Database diagnostics

Although we may have created our study cohort, informing analytic decisions and interpreting results requires an understanding of the dataset from which it was derived. Database diagnostics builds on the OmopSketch package to perform the following analyses:

Snapshot: Summarises the metadata of a CDM object using summariseOmopSnapshot().
Person table: Summarises the person table by using summarisePerson(). This provides demographic information including sex, race, ethnicity, year/month/day of birth distributions, and location/provider/care site information.
Observation periods: Summarises the observation period table by using summariseObservationPeriod(). This will allow us to see if there are individuals with multiple, non-overlapping, observation periods and how long each observation period lasts on average.
Clinical Records: The diagnostics will detect which domains appear in the codelist associated with your cohort (e.g., Drug), and use summariseClinicalRecords() to summarise the associated clinical table (e.g., “drug_exposure”).

Codelist diagnostics

Codelist diagnostics builds on the CodelistGenerator and MeasurementDiagnostics R packages to perform the following analyses:

Achilles code use: Summarises the counts of our codes in the database based on ACHILLES results using summariseAchillesCodeUse(). Notice that it will only run if ACHILLES tables are present in your CDM.
Orphan code use: Orphan codes are codes that we did not include in our cohort definition but that are related to the codes in our codelist. Although many can be false positives, we may identify some codes that we want to add to our cohort definitions. This analysis uses summariseOrphanCodes(). Notice that it will only run if ACHILLES tables are present in your CDM.
Cohort code use: Summarises the cohort code use in our cohort using summariseCohortCodeUse().
Measurement diagnostics: If any of the concepts used in our codelist is a measurement, it summarises its code use using summariseCohortMeasurementUse().
Drug diagnostics: If any of the concepts used in our codelist is a drug, it summarises its code use, including a summary of the exposure duration, the days between records, the daily dose, and the quantity.

Cohort diagnostics

Cohort diagnostics builds on the CohortCharacteristics and CohortSurvival R packages to perform the following analyses on our cohorts:

Cohort count: Summarises the number of records and persons in each one of the cohorts using summariseCohortCount() and summarises the attrition associated with the cohorts using summariseCohortAttrition().
Cohort characteristics: Summarises cohort baseline characteristics using summariseCharacteristics(). Results are stratified by sex and by age group (0 to 17, 18 to 64, 65 to 150). Age groups cannot be modified.
Cohort large scale characteristics: Summarises cohort large scale characteristics using summariseLargeScaleCharacteristics(). Results are stratified by sex and by age group (0 to 17, 18 to 64, 65 to 150). Time windows (relative to cohort entry) included are: -Inf to -1, -Inf to -366, -365 to -31, -30 to -1, 0, 1 to 30, 31 to 365, 366 to Inf, and 1 to Inf. The analysis is performed at standard and source code level.
Compare cohort: If there is more than one cohort in the cohort table supplied, it summarises the overlap between them using summariseCohortOverlap() and the timing between them using summariseCohortTiming().
Cohort survival: Summarises the survival until the event of death (if death table is present in the CDM) using estimateSingleEventSurvival().

For computational efficiency, cohort diagnostics will take a joint random sample of 20,000 people from across the study cohorts for describing cohort characteristics. The number sampled can be changed by altering the cohortSample argument (e.g. cohortSample = 40000 to double the number). Sampling can be switched off by setting cohortSample = NULL.

For each of the input cohorts, cohort diagnostics are also run on a set of age and sex matched controls taken from the dataset as a whole. Again random sampling is used for efficiency. By default 1,000 age and sex matched controls are identified for 1,000 individuals from each of the study cohorts. The number matched can be changed by altering the matchedSample argument (e.g. matchedSample = 2000 to double the number). Sampling can be switched off by setting matchedSample = NULL. Creation of age and sex matched controls can be skipped by setting matchedSample = 0.
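
As a sketch of how these sampling arguments might be supplied, the block below builds the settings as a named list and passes it through cohortDiagnostics, following the list mechanism described earlier. The values are the ones quoted in the text. The call itself is guarded, because it assumes the cdm object and my_cohort table created above and an installed PhenotypeR.

```r
# Double both sampling sizes relative to the defaults described above
cohortSettings <- list(
  cohortSample = 40000,
  matchedSample = 2000
)

# To switch sampling off, the list must keep its NULL entries.
# list(x = NULL) retains the element, so both names are still passed on.
noSampling <- list(cohortSample = NULL, matchedSample = NULL)
print(names(noSampling))

# Guarded call: runs only if the objects from this vignette are available
if (exists("cdm") && "my_cohort" %in% names(cdm) &&
    requireNamespace("PhenotypeR", quietly = TRUE)) {
  diagnostics <- PhenotypeR::phenotypeDiagnostics(
    cdm$my_cohort,
    databaseDiagnostics = list(),
    codelistDiagnostics = list(),
    cohortDiagnostics = cohortSettings,
    populationDiagnostics = list()
  )
}
```

Note that assigning NULL to an element of an existing list (for example cohortSettings$matchedSample <- NULL) removes the element instead of storing a NULL, so the list(...) form is the safer way to express "no sampling".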

Modify window, events in window, and episodes in window

By default, cohortDiagnostics() uses the following predefined window, eventInWindow, and episodeInWindow settings for the large-scale characterisation analysis:

window = list(c(-365, -31), c(-30, -1), c(0, 0), c(1, 30), c(31, 365))
tableEvents = c("condition_occurrence", "measurement", "procedure_occurrence", "device_exposure", "observation")
tableEpisodes = c("drug_exposure", "drug_era", "visit_occurrence")

(See the window, eventInWindow, and episodeInWindow arguments of the summariseLargeScaleCharacteristics() function in the CohortCharacteristics R package for further details.)

These defaults can be overridden by setting global options. For example, to restrict the large-scale characterisation to only the index date (i.e., modifying the window parameter), use:

options("PhenotypeR_summariseLargeScaleCharacteristics_window" = list(c(0, 0)))

Alternatively, to include only events from the condition_occurrence table and episodes from the drug_exposure table, use:

options("PhenotypeR_summariseLargeScaleCharacteristics_eventInWindow"   = c("condition_occurrence"))
options("PhenotypeR_summariseLargeScaleCharacteristics_episodeInWindow" = c("drug_exposure"))
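
Because these are global options, they stay in place for the rest of the R session unless they are reset. The base R sketch below sets the window option, reads it back with getOption(), and then restores the previous state. It only uses the option name shown above and does not call PhenotypeR.

```r
# options() returns the previous values, which we keep for restoring later
oldOptions <- options(
  "PhenotypeR_summariseLargeScaleCharacteristics_window" = list(c(0, 0))
)

# Confirm what is currently set
currentWindow <- getOption("PhenotypeR_summariseLargeScaleCharacteristics_window")
print(currentWindow)

# ... the diagnostics would be run here ...

# Restore the previous state (an option that was unset becomes unset again)
options(oldOptions)
restoredWindow <- getOption("PhenotypeR_summariseLargeScaleCharacteristics_window")
print(restoredWindow)
```

Restoring the options after the run avoids silently carrying a restricted window into later analyses in the same session.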

Population diagnostics

Population diagnostics builds on the IncidencePrevalence R package to perform the following analyses:

Incidence: It estimates the incidence of our cohorts using estimateIncidence().
Period Prevalence: It estimates the period prevalence of our cohort on a yearly basis using estimatePeriodPrevalence().
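
To make the two measures concrete, here is the basic arithmetic behind an incidence rate and a period prevalence, written in base R. All numbers are made up for illustration and do not come from the example dataset; the IncidencePrevalence functions additionally handle denominators, washout, and confidence intervals, which this sketch ignores.

```r
# Hypothetical counts for one calendar year (illustrative only)
newCases    <- 25      # people entering the cohort for the first time
personYears <- 12500   # time at risk contributed by the denominator population

# Incidence rate, expressed per 100,000 person-years
incidencePer100k <- newCases / personYears * 1e5
print(incidencePer100k)

# Hypothetical counts for period prevalence in the same year
prevalentCases <- 300    # people in the cohort at any point during the year
denominator    <- 10000  # people observed during the year

# Period prevalence as a proportion
periodPrevalence <- prevalentCases / denominator
print(periodPrevalence)
```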

By default, these analyses are performed for:

Overall, stratified by age groups (0 to 17, 18 to 64, 65 to 150) and by sex (Female, Male).
Including all individuals, and restricting the denominator population to those with 0 and 365 days of prior observation.

By default, incidence rates and period prevalence are calculated for all years captured in the dataset (based on the earliest observation period start date and the latest observation period end date). The date range can, however, be limited using the populationDateRange argument.

These analyses are also conducted on a random sample of the population captured in the dataset. By default this sample is set to 100,000 individuals and so will only be relevant for particularly large datasets. The sampling number can be changed via the populationSample argument (e.g. populationSample = 200000 to double the number) or switched off by setting populationSample = NULL.
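
A sketch of how these two arguments might be combined is shown below. It assumes that populationDateRange takes a length-two Date vector (start, end); that format, and the dates themselves, are assumptions made for illustration, so check the populationDiagnostics() documentation for your installed version. The call is guarded and only runs if the objects from this vignette are available.

```r
# Settings for population diagnostics
populationSettings <- list(
  populationSample = 200000,
  # Assumed format: start and end date as a Date vector (illustrative dates)
  populationDateRange = as.Date(c("2008-01-01", "2010-12-31"))
)
str(populationSettings)

# Guarded call: runs only if the objects from this vignette are available
if (exists("cdm") && "my_cohort" %in% names(cdm) &&
    requireNamespace("PhenotypeR", quietly = TRUE)) {
  diagnostics <- PhenotypeR::phenotypeDiagnostics(
    cdm$my_cohort,
    databaseDiagnostics = list(),
    codelistDiagnostics = list(),
    cohortDiagnostics = list(),
    populationDiagnostics = populationSettings
  )
}
```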

Save the results

To save our diagnostics results, we can use the exportSummarisedResult() function from the omopgenerics R package:

exportSummarisedResult(diagnostics, path = here::here(), minCellCount = 5)
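
Exported results can later be read back and inspected without rerunning the diagnostics. The sketch below writes to a dedicated folder and reloads it with importSummarisedResult() and settings() from omopgenerics. The folder name is a hypothetical choice, and the export and import steps are guarded because they assume the diagnostics object created above.

```r
# Dedicated output folder (hypothetical name) so only our results are read back
resultsPath <- file.path(tempdir(), "phenotyper_results")
dir.create(resultsPath, showWarnings = FALSE, recursive = TRUE)

# Guarded: runs only if the diagnostics object from this vignette exists
if (exists("diagnostics") && requireNamespace("omopgenerics", quietly = TRUE)) {
  omopgenerics::exportSummarisedResult(
    diagnostics,
    path = resultsPath,
    minCellCount = 5
  )

  # Read the exported file(s) back into a summarised result object
  reloaded <- omopgenerics::importSummarisedResult(path = resultsPath)

  # List which types of result the object contains
  print(unique(omopgenerics::settings(reloaded)$result_type))
}
```

Keeping the export in its own folder avoids importing unrelated CSV files that may already sit in the target directory.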

Visualisation of the results

Once we have our phenotype diagnostics result, we can use shinyDiagnostics() to easily create a shiny app and visualise our results:

shinyDiagnostics(diagnostics,
                 directory = tempdir(),
                 minCellCount = 5, 
                 open = TRUE)

Notice that we have specified the minimum cell count (minCellCount) below which results are suppressed in the shiny app, and also that we want the shiny app to be launched in a new R session (open). You can see the shiny app generated for this example here. See the Shiny diagnostics vignette for a full explanation of the shiny app.