Phenotype diagnostics

Introduction

In this vignette, we show how to run phenotypeDiagnostics().

We’ll use the following packages and mock data for example purposes:

library(CohortConstructor)
library(OmopSketch)
library(PhenotypeR)
library(dplyr)
library(DBI)
library(duckdb)
library(CDMConnector)

con <- dbConnect(duckdb(), 
                 eunomiaDir("synpuf-1k", "5.3"))

cdm <- cdmFromCon(con = con, 
                  cdmName = "Eunomia Synpuf",
                  cdmSchema   = "main",
                  writeSchema = "main", 
                  achillesSchema = "main")
cdm

Note that we have included the Achilles tables in our cdm reference; these are used to speed up some of the analyses.

We need to create a set of cohorts to review. For this we are going to use the package CohortConstructor to generate cohorts with users of warfarin, acetaminophen and morphine.

# Create codelists
codes <- list("warfarin" = c(1310149, 40163554),
              "acetaminophen" = c(1125315, 1127078, 1127433, 40229134, 40231925, 40162522, 19133768),
              "morphine" = c(1110410, 35605858, 40169988))

# Instantiate cohorts with CohortConstructor
cdm$my_cohort <- conceptCohort(cdm = cdm,
                               conceptSet = codes, 
                               exit = "event_end_date",
                               overlap = "merge",
                               name = "my_cohort")
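Before running any diagnostics, it can be worth confirming that every codelist produced a cohort with records. The following is a minimal sketch, assuming the `cdm$my_cohort` table created above and the `settings()` and `cohortCount()` accessors from omopgenerics; the `expected_cohorts` vector is a name introduced here purely for illustration.

```r
# Cohort names we expect, taken from the names of the codelists above
expected_cohorts <- c("warfarin", "acetaminophen", "morphine")

# Assumes `cdm$my_cohort` was created as shown above; skipped otherwise
if (exists("cdm")) {
  cohort_set <- omopgenerics::settings(cdm$my_cohort)
  cohort_counts <- omopgenerics::cohortCount(cdm$my_cohort)

  # Number of records and subjects per cohort
  print(cohort_counts)

  # Flag any codelist that did not end up as a named cohort
  missing_cohorts <- setdiff(expected_cohorts, cohort_set$cohort_name)
  if (length(missing_cohorts) > 0) {
    warning("Cohorts not found: ", paste(missing_cohorts, collapse = ", "))
  }
}
```

An empty cohort is not an error in itself, but diagnostics for it will be uninformative, so it is useful to know about it up front.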

Running PhenotypeDiagnostics

Now that we have our cohorts, we will use phenotypeDiagnostics() to assess them. This runs four sets of diagnostics (database, codelist, cohort and population diagnostics), which help us judge whether our cohorts are ready to be used in research with the OMOP CDM dataset we’re using.

If we do not provide any specifications, the default values are used. That is, when we pass an empty list for each diagnostic, as in the following script, each individual diagnostics function runs with its own default arguments.

diagnostics <- phenotypeDiagnostics(cdm$my_cohort,
                                databaseDiagnostics = list(),
                                codelistDiagnostics = list(),
                                cohortDiagnostics = list(),
                                populationDiagnostics = list(),
                                stagingDirectory = NULL)

Notice that we can specify a directory in which to save a log file, so that we can keep track of which incremental results are being run at any given time.

If we don’t want to run one of the diagnostics we can switch it off by setting it to NULL.

phenotypeDiagnostics(cdm$my_cohort,
                     databaseDiagnostics = list(),
                     codelistDiagnostics = NULL,
                     cohortDiagnostics = list(),
                     populationDiagnostics = NULL)

Or, if we want to change the settings, we can include arguments used in the sub-functions in a list. For example, survival analysis is not run by default (cohortSurvival is set by default to FALSE in cohortDiagnostics()). We can run this, leaving other arguments as their defaults, like so:

diagnostics <- phenotypeDiagnostics(cdm$my_cohort,
                                databaseDiagnostics = list(),
                                codelistDiagnostics = list(),
                                cohortDiagnostics = list("cohortSurvival" = TRUE),
                                populationDiagnostics = list())

Database diagnostics

Although we may have created our study cohort, informing analytic decisions and interpreting results requires an understanding of the dataset from which it has been derived. Database diagnostics builds on the OmopSketch package to perform the following analyses:

Codelist diagnostics

Codelist diagnostics builds on the CodelistGenerator and MeasurementDiagnostics R packages to perform the following analyses:

Cohort diagnostics

Cohort diagnostics builds on the CohortCharacteristics and CohortSurvival R packages to perform the following analyses on our cohorts:

For computational efficiency, cohort diagnostics will take a joint random sample of 20,000 people from across the study cohorts for describing cohort characteristics. The number sampled can be changed by altering the cohortSample argument (e.g. cohortSample = 40000 to double the number). Sampling can be switched off by setting cohortSample = NULL.

For each of the input cohorts, cohort diagnostics are also run on a set of age and sex matched controls taken from the dataset as a whole. Again, random sampling is used for efficiency. By default, 1,000 age and sex matched controls are identified for 1,000 individuals from each of the study cohorts. The number matched can be changed by altering the matchedSample argument (e.g. matchedSample = 2000 to double the number). Sampling can be switched off by setting matchedSample = NULL. Creation of age and sex matched controls can be skipped by setting matchedSample = 0.
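Both sampling arguments can be passed through phenotypeDiagnostics() in the same way as any other sub-function argument. The following is a minimal sketch, assuming the `cdm$my_cohort` table created earlier in this vignette; `cohort_args` is a name introduced here for illustration, and the values are the doubled sample sizes mentioned above.

```r
# Arguments forwarded to cohortDiagnostics(): double both default sample sizes
cohort_args <- list(cohortSample = 40000, matchedSample = 2000)

# Assumes `cdm$my_cohort` exists as created earlier; skipped otherwise
if (exists("cdm")) {
  diagnostics <- PhenotypeR::phenotypeDiagnostics(
    cdm$my_cohort,
    databaseDiagnostics = list(),
    codelistDiagnostics = list(),
    cohortDiagnostics = cohort_args,
    populationDiagnostics = list()
  )
}
```

To skip the matched controls entirely while keeping the cohort sample at its default, `cohort_args` would instead be `list(matchedSample = 0)`.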

Population diagnostics

Population diagnostics builds on the IncidencePrevalence R package to perform the following analyses:

By default, these analyses are performed for:

By default, incidence rates and period prevalence will be calculated for all years captured in the dataset (based on the earliest observation period start date and the latest observation period end date). The date range can, however, be limited using the populationDateRange argument.

These analyses are also conducted on a random sample of the population captured in the dataset. By default this sample is set to 100,000 individuals and so will only be relevant for particularly large datasets. The sampling number can be changed via the populationSample argument (e.g. populationSample = 200000 to double the number) or switched off by setting populationSample = NULL.
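As with the cohort settings, the population settings can be supplied as a list. The sketch below assumes that populationDateRange accepts a length-two Date vector (start and end) and that `cdm$my_cohort` exists as created earlier; the dates are illustrative only and `population_args` is a name introduced here.

```r
# Arguments forwarded to populationDiagnostics():
# a larger sample and a restricted study period (illustrative dates)
population_args <- list(
  populationSample = 200000,
  populationDateRange = as.Date(c("2008-01-01", "2010-12-31"))
)

# Assumes `cdm$my_cohort` exists as created earlier; skipped otherwise
if (exists("cdm")) {
  diagnostics <- PhenotypeR::phenotypeDiagnostics(
    cdm$my_cohort,
    databaseDiagnostics = list(),
    codelistDiagnostics = list(),
    cohortDiagnostics = list(),
    populationDiagnostics = population_args
  )
}
```

Restricting the date range can be useful when only part of the observation period is considered reliable for the outcome of interest.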

Save the results

To save our diagnostics results, we can use the exportSummarisedResult() function from the omopgenerics R package:

exportSummarisedResult(diagnostics, path = here::here(), minCellCount = 5)
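Exported results can later be read back in, for example to build the shiny app on another machine. The following is a minimal sketch, assuming a `diagnostics` object created as above and using importSummarisedResult() from omopgenerics; the `out_dir` folder is a temporary location chosen here for illustration.

```r
# Write results to a dedicated folder (a temporary one here, for illustration)
out_dir <- file.path(tempdir(), "phenotype_diagnostics")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Assumes `diagnostics` was created by phenotypeDiagnostics(); skipped otherwise
if (exists("diagnostics")) {
  omopgenerics::exportSummarisedResult(diagnostics,
                                       path = out_dir,
                                       minCellCount = 5)

  # Read the exported result files in the folder back into a single object
  reloaded <- omopgenerics::importSummarisedResult(path = out_dir)
}
```

Keeping each export in its own folder avoids accidentally importing result files left over from an earlier run.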

Visualisation of the results

Once we have our phenotype diagnostics result, we can use shinyDiagnostics() to create a shiny app and visualise our results:

shinyDiagnostics(diagnostics,
                 directory = tempdir(),
                 minCellCount = 5, 
                 open = TRUE)

Notice that we have specified the minimum cell count (minCellCount) below which results are suppressed in the shiny app, and that we want the shiny app to be launched in a new R session (open). You can see the shiny app generated for this example here. See the Shiny diagnostics vignette for a full explanation of the shiny app.