The invitroTKstats
R package is designed to work as a
data processing pipeline for in vitro assays measuring various
chemical-specific toxicokinetic (TK) parameters used in high-throughput
TK (HTTK) modeling. The invitroTKstats
pipeline is meant to
take mass spectrometry (MS) data and estimate the chemical-specific TK
parameters from the relevant in vitro assays. These include but
are not limited to:
The MS data are experimental measurements and the possibility for
variation in reporting exists between laboratories and technicians.
Preserving the data as reported by the original laboratory allows
reproducibility, transparency, and auditing of any numbers calculated
from these data by the invitroTKstats
R package. We refer
to the MS data files generated by the laboratory as “level-0” files.
An important distinction between invitroTkstats
and
other approaches for determining chemical-specific in vitro TK
parameters is that here we systematically analyze large data sets
containing measurements for many (potentially hundreds) of
chemicals.
Here, we discuss the initial steps of the data processing pipeline.
First, we create the data guide (also known as a data catalog). We then
use the data guide to compile the level-0 (raw) data into a single R
data.frame
. Note, these steps are meant to be
consistent across the various assays for TK parameters that may be
pipelined by this package, that is, they are not assay specific.
However, the pipeline is not set up to simultaneously process
data from different assays (for example \(f_{up}\) from RED and \(f_{up}\) from UC).
In this vignette, we are going to use data from the intrinsic hepatic clearance assays (\(Cl_{int}\)) collected as part of the Kreutz et al. (2023) PFAS manuscript for demonstration purposes.
The first thing we need to do for the pipelining process is to
construct a data guide itemizing which data files and sheets therein
contain relevant MS data, as well as other relevant meta-data
information. Before creating a data guide with
invitroTKstats
functions, let us first consider the raw
data files and their structure in the relevant data directory, called
“inst/extdata/Kreutz-Clint”.
NOTE: The terms “data catalog” and “data guide” are used interchangeably throughout the vignette.
For users to replicate this vignette, it is necessary to clone the invitroTKstats
package repository which contains the “inst/extdata/Kreutz-Clint”
sub-directory.
The “Kreutz-Clint” sub-directory contains a subset of raw data files
with gas chromatography-tandem mass spectrometry (GC-MS/MS) analyses of
a subset of PFAS alcohols, amides, and acrylates (Kreutz et al. (2023)). The entire set of raw
data files can be found within the “working/KreutzPFAS” sub-directory
tracked in the invitrotkstats_dataproc
repository (i.e. not within the invitroTKstats
package).
# the path to the applicable sub-directory after the `invitroTKstats` package repository is cloned and the R project is opened
raw_data_dir <- system.file("extdata/Kreutz-Clint", package = "invitroTKstats")
More than one experiment (that is, data collection run) is typically necessary to obtain all the relevant data for an analysis. Thus, there is likely to be more than one Excel file containing raw data that needs to be processed.
For this exercise, it is known that the Excel (.xlsx) files
containing intrinsic hepatic clearance data are denoted with prefixes of
“Hep” and “G
# identify the hepatic clearance raw datasets
hep_clint_xlsx <- list.files(raw_data_dir,
pattern = paste(
paste(c("^Hep","^G\\d"),"[.]xlsx$",sep = ".+"),
collapse = "|"))
# show the data file names
hep_clint_xlsx
#> [1] "Hep4_971_941_041122.xlsx" "Hep_745_949_959_082421_final.xlsx"
#> [3] "Hep_900_273_HC_082721_final.xlsx"
# add the raw data file directory path
hep_clint_xlsx <- paste(raw_data_dir,hep_clint_xlsx, sep = "/")
In cases where one is unfamiliar with the raw data Excel files, it may be useful to do some programmatic evaluation of the file structure. For example, listing the sheets within the Excel file to determine where data exists and should be pulled from. It should be noted this will likely, or may not always, replace the need to visually inspect the file. To demonstrate this type of investigation consider the first hepatic clearance dataset identified above.
# for the first dataset, see the sheets contained in the raw data file
readxl::excel_sheets(hep_clint_xlsx[1])
#> [1] "Cover Sheet (2)" "Hep Clearance Calcs (2)"
#> [3] "Hep Data for Prism" "All Hep04 data"
#> [5] "MDL Calculations" "WMH deletes for figs"
#> [7] "ValueList_Helper"
Here, it is already known that the “Data063021” data sheet contains MS data for parameter estimation. However, other data sheets (for example, “Cover Sheet”) may also provide valuable information about the original experiment, chemicals evaluated, meta data, etc. Clear communication between the lab staff and data analyst about the structure of the data files and general expectations is highly encouraged to enable accurate, efficient, and transparent data pipelining.
It should be noted, other exploratory steps may be necessary, but we leave this up to the reader to determine the necessary steps for raw data exploration given their data and use case.
Now that the Excel files and sheets containing the raw data are identified and we have a general idea of how to programatically explore the data, we need to create the data guide. The data guide creation step, as previously mentioned, can be thought of as logging all the relevant raw data files for a given toxicokinetic parameter we wish to pipeline (that is, data inventory). This is a crucial part of the pipeline in that it allows us to identify where all of the raw datafiles are coming from and where we are pulling raw (level-0) data from within those datasets. Thus, setting up a standard and transparent level-0 data compilation process. This also provides key meta-data information that can be used for tracking/evaluating data provenance.
Though the data guide (which is ultimately a data.frame
format) may be created manually through an Excel file, the
invitroTKstats
package includes a function called
create_catalog
that automatically generates and documents
the data guide creation. Furthermore, this function has built in checks
to ensure all the necessary information is provided and in the correct
format expected by the pipeline. This allows for maximum
reproducibility, transparency, and efficiency in compiling the data
guide. Thus, it is highly encouraged and considered best practice to
utilize the create_catalog
function for this step.
Typically, one will want to include all the datasets “delivered” to the data analyst. As mentioned before, the data catalog logs where the raw MS data is for the respective files and allows us to efficiently pull level-0 data (raw data) from those disparate files and compile them into a comprehensive level-0 dataset. In this sub-section, however, we are going to demonstrate how to put together the data catalog for a single Excel file.
Here, we are going to use the first Excel file (and only the
“Data063021” sheet) to demonstrate putting together the data catalog
programmatically using the create_catalog
function within
invitroTKstats
. It should be noted, the information
provided in create_catalog
is obtained by visual inspection
of the file of interest and identifying the corresponding input values
for each of the arguments.
DC_kreutz.pfas <- create_catalog(
# filename (no file path)
file = c(rep("Hep_745_949_959_082421_final.xlsx",3)),
# sheet name (or sheet number)
sheet = c(rep("Data063021",3)),
# number of rows to skip in L0 Excel file - start for compound/analyte samples
skip.rows = c(44,74,92),
# number of rows to read in from L0 Excel file for compound/analyte samples
num.rows = c(30,18,18),
# date the data was generated
# (MMDDYY: 2-digit month, 2-digit day, 2-digit year)
date = "063021",
# chemical id
compound = c("745","949","959"),
# internal standard compound (corresponding to chemical id)
istd = c("MFBET","MFOET","MFHET"),
# column name for sample names
sample = "Name",
# column name for sample types
type = "Type",
# column name(s) for analyte MS peak areas
peak = c("Area...13","Area...27","Area...20"),
# column name(s) for internal standard MS peak areas
istd.peak = c("Resp....16","Resp....30","Resp....23"),
# column name(s) for experimental concentration
conc = c("Final Conc....11","Final Conc....25","Final Conc....18"),
# column name(s) with analysis parameters
analysis.param = c("RT...12", "RT...26", "RT...19"),
# column name - row locations
col.names.loc = 2
# note = "RT...12"
)
#> ##################################
#> ## Data Catalog Checks
#> ##################################
#> All of the standard columns exist in the catalog.
#> All standard columns are data complete.
#> All of the standard columns in the catalog are of the correct class.
#> Your data catalog is ready for merge_level0.
#> ##################################
File | Sheet | Skip.Rows | Date | Chemical.ID | ISTD.Name | Col.Names.Loc | Sample.ColName | Type.ColName | Peak.ColName | ISTD.Peak.ColName | Conc.ColName | AnalysisParam.ColName | Number.Data.Rows |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Hep_745_949_959_082421_final.xlsx | Data063021 | 44 | 063021 | 745 | MFBET | 2 | Name | Type | Area...13 | Resp....16 | Final Conc....11 | RT...12 | 30 |
Hep_745_949_959_082421_final.xlsx | Data063021 | 74 | 063021 | 949 | MFOET | 2 | Name | Type | Area...27 | Resp....30 | Final Conc....25 | RT...26 | 18 |
Hep_745_949_959_082421_final.xlsx | Data063021 | 92 | 063021 | 959 | MFHET | 2 | Name | Type | Area...20 | Resp....23 | Final Conc....18 | RT...19 | 18 |
In cases where a data analyst is working with a set of raw data Excel files with a semi-standardized form (often lab-specific) it may be advantageous to have a wrapper function for streamlining the data guide creation process, such that the function automatically extracts the necessary meta-data to provide for populating the data catalog. This exercise is left up to the reader.
Once the data catalog is compiled, the only remaining experimental meta-data to collect prior to compiling the comprehensive level-0 data files is the chemical identification mapping table (that is chemical ID map). The chemical ID map provides the pipeline with the compound name used by the lab, EPA DSSTox identifier (see Distributed Structure-Searchable Toxicity (DSSTox) Database), and the common compound name. This typically will be provided by the lab within the Excel files with raw data and/or in a separate “meta-data” file, and this may be lab dependent given the potential for differing lab protocols.
For our purposes, the “Cover Sheet” contains a table with the chemical ID mapping along with some physico-chemical properties, the corresponding reference, and the corresponding internal standard.
# obtain the chemical identification mapping information from the MS-data
# cover sheet (that is raw data summary information)
assay_cover_sheet <-
readxl::read_xlsx(
paste(raw_data_dir,"Hep_745_949_959_082421_final.xlsx",sep = "/"),
sheet = "Cover Sheet",skip = 35,n_max = 4) %>%
as.data.frame()
Analyte | Name | Sample ID | IS | Reference Compound | Avg. MW |
---|---|---|---|---|---|
DTXSID70381090 | 1H,1H,8H,8H-Perfluoro-3,6-dioxaoctane-1,8-diol | 949 | MFOET | Ametryn | 294.010 |
DTXSID50381992 | Bis(1H,1H-perfluoropropyl)amine | 959 | MFHET | 281.097 | |
DTXSID3066215 | (Heptafluorobutanoyl)pivaloylmethane | 745 | MFBET | 296.185 |
The invitroTKstats
package has a function that can take
a data.frame
, such as assay_cover_sheet
, and
generate the required chemical mapping table necessary for the
merge_level0
function. This function is called
create_chem_table
. Though the chemical mapping table can be
manually constructed it is highly encouraged analysts using
invitroTKstats
leverage the utility of this function.
# create a chemical table necessary for the L0 compilation function, using the
# assay cover sheet chemical identification mapping information
chem.ids <- create_chem_table(
# input table (data.frame class) with information
input.table = assay_cover_sheet,
# column name with DSSTox chemical ID's
dtxsid.col = "Analyte",
# column name with formal compound names
compound.col = "Name",
# column name with lab chemical ID's
lab.compound.col = "Sample ID"
)
#> 3 chemicals.
Compound.Name | DTXSID | Lab.Compound.Name |
---|---|---|
(Heptafluorobutanoyl)pivaloylmethane | DTXSID3066215 | 745 |
Bis(1H,1H-perfluoropropyl)amine | DTXSID50381992 | 959 |
1H,1H,8H,8H-Perfluoro-3,6-dioxaoctane-1,8-diol | DTXSID70381090 | 949 |
Once the data guide and chemical ID map data.frame
s are
compiled we can use the merge_level0
function to pull in
all of the ‘raw’ level-0 data and compile it into a single
data.frame
.
It should be noted here we are not exporting the data
catalog or level-0 data file, as denoted by the arguments
catalog.out
and output.res
(respectively) both
being set to FALSE
. Furthermore, since we are not exporting
either of these data.frame
s the FILENAME
argument is not required and thereby not provided.
Any of the column names from the raw data not matching the
anticipated column name convention are specified to their corresponding
argument. For more details on using the merge_level0
function we direct the user to the help file, which can be accessed via
help("merge_level0")
or ?merge_level0
in the R
console.
# compile the l0 files
kreutz.pfas_L0 <- merge_level0(
level0.catalog = DC_kreutz.pfas, # data catalog
INPUT.DIR = raw_data_dir, # the path to your raw data files
num.rows.col = "Number.Data.Rows",
istd.col = "ISTD.Name",
type.colname.col = "Type.ColName",
chem.ids = chem.ids, # chemical ID mapping data
chem.lab.id.col = "Lab.Compound.Name",
chem.name.col = "Compound.Name",
catalog.out = FALSE, # do not export the data catalog during this function
output.res = FALSE # do not export the compiled L0 during this function
)
#> Hep_745_949_959_082421_final.xlsx, Data063021, 745
#> Hep_745_949_959_082421_final.xlsx, Data063021, 949
#> Hep_745_949_959_082421_final.xlsx, Data063021, 959
# show the dimension of the the Kruetz PFAS data
dim(kreutz.pfas_L0)
#> [1] 66 13
Compound | DTXSID | Lab.Compound.ID | Date | Sample | Type | Compound.Conc | Peak.Area | ISTD.Peak.Area | ISTD.Name | Analysis.Params | Level0.File | Level0.Sheet |
---|---|---|---|---|---|---|---|---|---|---|---|---|
(Heptafluorobutanoyl)pivaloylmethane | DTXSID3066215 | 745 | 063021 | G5-745 Hep062221 Inactive T0a | Sample | 707.89135 | 24,893.5958 | 95,208.54 | MFBET | 5.589200 | Hep_745_949_959_082421_final.xlsx | Data063021 |
(Heptafluorobutanoyl)pivaloylmethane | DTXSID3066215 | 745 | 063021 | G5-745 Hep062221 Inactive T0b | Sample | 848.90684 | 29,609.5852 | 94,542.88 | MFBET | 5.589217 | Hep_745_949_959_082421_final.xlsx | Data063021 |
(Heptafluorobutanoyl)pivaloylmethane | DTXSID3066215 | 745 | 063021 | G5-745 Hep062221 Inactive T0c | Sample | 779.10136 | 27,117.1085 | 94,288.24 | MFBET | 5.589200 | Hep_745_949_959_082421_final.xlsx | Data063021 |
(Heptafluorobutanoyl)pivaloylmethane | DTXSID3066215 | 745 | 063021 | G5-745 Hep062221 Inactive T240a | Sample | 23.00494 | 841.2405 | 98,452.29 | MFBET | 5.593517 | Hep_745_949_959_082421_final.xlsx | Data063021 |
(Heptafluorobutanoyl)pivaloylmethane | DTXSID3066215 | 745 | 063021 | G5-745 Hep062221 Inactive T240b | Sample | 28.26328 | 1,044.1819 | 99,471.52 | MFBET | 5.593550 | Hep_745_949_959_082421_final.xlsx | Data063021 |
(Heptafluorobutanoyl)pivaloylmethane | DTXSID3066215 | 745 | 063021 | G5-745 Hep062221 Inactive T240c | Sample | 26.58771 | 974.4343 | 98,675.85 | MFBET | 5.589200 | Hep_745_949_959_082421_final.xlsx | Data063021 |
The level-0 data file is now ready to undergo the assay-specific processing steps, which can be found in the assay-specific vignettes:
create_catalog
function, but to auto-generate
the input for this function programmatically (for example a lab-specific
processing function).check_catalog
to ensure the data catalog meets the
minimum meta-data reporting requirements to use
merge_level0
..R
script) and accompanied by the code script execution
output (for example .Rout
file).