Data Guide Creation and Level-0 Data Compilation

US EPA’s Center for Computational Toxicology and Exposure ccte@epa.gov

Overview

The invitroTKstats R package is designed to work as a data processing pipeline for in vitro assays measuring various chemical-specific toxicokinetic (TK) parameters used in high-throughput TK (HTTK) modeling. The invitroTKstats pipeline is meant to take mass spectrometry (MS) data and estimate the chemical-specific TK parameters from the relevant in vitro assays. These include but are not limited to:

The MS data are experimental measurements and the possibility for variation in reporting exists between laboratories and technicians. Preserving the data as reported by the original laboratory allows reproducibility, transparency, and auditing of any numbers calculated from these data by the invitroTKstats R package. We refer to the MS data files generated by the laboratory as “level-0” files.

An important distinction between invitroTkstats and other approaches for determining chemical-specific in vitro TK parameters is that here we systematically analyze large data sets containing measurements for many (potentially hundreds) of chemicals.

Here, we discuss the initial steps of the data processing pipeline. First, we create the data guide (also known as a data catalog). We then use the data guide to compile the level-0 (raw) data into a single R data.frame. Note, these steps are meant to be consistent across the various assays for TK parameters that may be pipelined by this package, that is, they are not assay specific. However, the pipeline is not set up to simultaneously process data from different assays (for example \(f_{up}\) from RED and \(f_{up}\) from UC).

In this vignette, we are going to use data from the intrinsic hepatic clearance assays (\(Cl_{int}\)) collected as part of the Kreutz et al. (2023) PFAS manuscript for demonstration purposes.

Suggested packages for use with this vignette

# Primary Package #
library(invitroTKstats)
# Data Formatting Packages #
library(dplyr)
library(magrittr)
library(stringr)
library(readxl)
# Table Formatting Package #
library(flextable)

Raw Data File Exploration

The first thing we need to do for the pipelining process is to construct a data guide itemizing which data files and sheets therein contain relevant MS data, as well as other relevant meta-data information. Before creating a data guide with invitroTKstats functions, let us first consider the raw data files and their structure in the relevant data directory, called “inst/extdata/Kreutz-Clint”.

NOTE: The terms “data catalog” and “data guide” are used interchangeably throughout the vignette.

For users to replicate this vignette, it is necessary to clone the invitroTKstats package repository which contains the “inst/extdata/Kreutz-Clint” sub-directory.

The “Kreutz-Clint” sub-directory contains a subset of raw data files with gas chromatography-tandem mass spectrometry (GC-MS/MS) analyses of a subset of PFAS alcohols, amides, and acrylates (Kreutz et al. (2023)). The entire set of raw data files can be found within the “working/KreutzPFAS” sub-directory tracked in the invitrotkstats_dataproc repository (i.e. not within the invitroTKstats package).

# the path to the applicable sub-directory after the `invitroTKstats` package repository is cloned and the R project is opened
raw_data_dir <- system.file("extdata/Kreutz-Clint", package = "invitroTKstats")

More than one experiment (that is, data collection run) is typically necessary to obtain all the relevant data for an analysis. Thus, there is likely to be more than one Excel file containing raw data that needs to be processed.

For this exercise, it is known that the Excel (.xlsx) files containing intrinsic hepatic clearance data are denoted with prefixes of “Hep” and “G”. Thus, we can programmatically search in the sub-directory for all the relevant Excel files we need to pull and compile raw data from.

# identify the hepatic clearance raw datasets
hep_clint_xlsx <- list.files(raw_data_dir,
                             pattern = paste(
                               paste(c("^Hep","^G\\d"),"[.]xlsx$",sep = ".+"),
                               collapse = "|"))
# show the data file names
hep_clint_xlsx
#> [1] "Hep4_971_941_041122.xlsx"          "Hep_745_949_959_082421_final.xlsx"
#> [3] "Hep_900_273_HC_082721_final.xlsx"
# add the raw data file directory path
hep_clint_xlsx <- paste(raw_data_dir,hep_clint_xlsx, sep = "/")

In cases where one is unfamiliar with the raw data Excel files, it may be useful to do some programmatic evaluation of the file structure. For example, listing the sheets within the Excel file to determine where data exists and should be pulled from. It should be noted this will likely, or may not always, replace the need to visually inspect the file. To demonstrate this type of investigation consider the first hepatic clearance dataset identified above.

# for the first dataset, see the sheets contained in the raw data file
readxl::excel_sheets(hep_clint_xlsx[1])
#> [1] "Cover Sheet (2)"         "Hep Clearance Calcs (2)"
#> [3] "Hep Data for Prism"      "All Hep04 data"         
#> [5] "MDL Calculations"        "WMH deletes for figs"   
#> [7] "ValueList_Helper"

Here, it is already known that the “Data063021” data sheet contains MS data for parameter estimation. However, other data sheets (for example, “Cover Sheet”) may also provide valuable information about the original experiment, chemicals evaluated, meta data, etc. Clear communication between the lab staff and data analyst about the structure of the data files and general expectations is highly encouraged to enable accurate, efficient, and transparent data pipelining.

It should be noted, other exploratory steps may be necessary, but we leave this up to the reader to determine the necessary steps for raw data exploration given their data and use case.

Data Catalog

Now that the Excel files and sheets containing the raw data are identified and we have a general idea of how to programatically explore the data, we need to create the data guide. The data guide creation step, as previously mentioned, can be thought of as logging all the relevant raw data files for a given toxicokinetic parameter we wish to pipeline (that is, data inventory). This is a crucial part of the pipeline in that it allows us to identify where all of the raw datafiles are coming from and where we are pulling raw (level-0) data from within those datasets. Thus, setting up a standard and transparent level-0 data compilation process. This also provides key meta-data information that can be used for tracking/evaluating data provenance.

Though the data guide (which is ultimately a data.frame format) may be created manually through an Excel file, the invitroTKstats package includes a function called create_catalog that automatically generates and documents the data guide creation. Furthermore, this function has built in checks to ensure all the necessary information is provided and in the correct format expected by the pipeline. This allows for maximum reproducibility, transparency, and efficiency in compiling the data guide. Thus, it is highly encouraged and considered best practice to utilize the create_catalog function for this step.

Typically, one will want to include all the datasets “delivered” to the data analyst. As mentioned before, the data catalog logs where the raw MS data is for the respective files and allows us to efficiently pull level-0 data (raw data) from those disparate files and compile them into a comprehensive level-0 dataset. In this sub-section, however, we are going to demonstrate how to put together the data catalog for a single Excel file.

Here, we are going to use the first Excel file (and only the “Data063021” sheet) to demonstrate putting together the data catalog programmatically using the create_catalog function within invitroTKstats. It should be noted, the information provided in create_catalog is obtained by visual inspection of the file of interest and identifying the corresponding input values for each of the arguments.

DC_kreutz.pfas <- create_catalog(
  # filename (no file path)
  file = c(rep("Hep_745_949_959_082421_final.xlsx",3)), 
  # sheet name (or sheet number)
  sheet = c(rep("Data063021",3)), 
  # number of rows to skip in L0 Excel file - start for compound/analyte samples
  skip.rows = c(44,74,92), 
  # number of rows to read in from L0 Excel file for compound/analyte samples
  num.rows = c(30,18,18), 
  # date the data was generated
  #   (MMDDYY: 2-digit month, 2-digit day, 2-digit year)
  date = "063021", 
  # chemical id
  compound = c("745","949","959"),
  # internal standard compound (corresponding to chemical id)
  istd = c("MFBET","MFOET","MFHET"), 
  # column name for sample names
  sample = "Name", 
  # column name for sample types
  type = "Type",
  # column name(s) for analyte MS peak areas
  peak = c("Area...13","Area...27","Area...20"), 
  # column name(s) for internal standard MS peak areas
  istd.peak = c("Resp....16","Resp....30","Resp....23"), 
  # column name(s) for experimental concentration
  conc = c("Final Conc....11","Final Conc....25","Final Conc....18"),
  # column name(s) with analysis parameters
  analysis.param = c("RT...12", "RT...26", "RT...19"),
  # column name - row locations
  col.names.loc = 2
  # note = "RT...12"
)
#> ##################################
#> ## Data Catalog Checks
#> ##################################
#> All of the standard columns exist in the catalog. 
#> All standard columns are data complete.
#> All of the standard columns in the catalog are of the correct class.
#> Your data catalog is ready for merge_level0.
#> ##################################
Table 1: Data Catalog for the Kruetz et al. (2023) PFAS Cl~int~ Experiment.

File

Sheet

Skip.Rows

Date

Chemical.ID

ISTD.Name

Col.Names.Loc

Sample.ColName

Type.ColName

Peak.ColName

ISTD.Peak.ColName

Conc.ColName

AnalysisParam.ColName

Number.Data.Rows

Hep_745_949_959_082421_final.xlsx

Data063021

44

063021

745

MFBET

2

Name

Type

Area...13

Resp....16

Final Conc....11

RT...12

30

Hep_745_949_959_082421_final.xlsx

Data063021

74

063021

949

MFOET

2

Name

Type

Area...27

Resp....30

Final Conc....25

RT...26

18

Hep_745_949_959_082421_final.xlsx

Data063021

92

063021

959

MFHET

2

Name

Type

Area...20

Resp....23

Final Conc....18

RT...19

18

In cases where a data analyst is working with a set of raw data Excel files with a semi-standardized form (often lab-specific) it may be advantageous to have a wrapper function for streamlining the data guide creation process, such that the function automatically extracts the necessary meta-data to provide for populating the data catalog. This exercise is left up to the reader.

Chemical ID Map

Once the data catalog is compiled, the only remaining experimental meta-data to collect prior to compiling the comprehensive level-0 data files is the chemical identification mapping table (that is chemical ID map). The chemical ID map provides the pipeline with the compound name used by the lab, EPA DSSTox identifier (see Distributed Structure-Searchable Toxicity (DSSTox) Database), and the common compound name. This typically will be provided by the lab within the Excel files with raw data and/or in a separate “meta-data” file, and this may be lab dependent given the potential for differing lab protocols.

For our purposes, the “Cover Sheet” contains a table with the chemical ID mapping along with some physico-chemical properties, the corresponding reference, and the corresponding internal standard.

# obtain the chemical identification mapping information from the MS-data
#   cover sheet (that is raw data summary information)
assay_cover_sheet <-
  readxl::read_xlsx(
    paste(raw_data_dir,"Hep_745_949_959_082421_final.xlsx",sep = "/"),
    sheet = "Cover Sheet",skip = 35,n_max = 4) %>% 
  as.data.frame()
Table 2: Chemical Mapping Information from the Raw Excel File.

Analyte

Name

Sample ID

IS

Reference Compound

Avg. MW

DTXSID70381090

1H,1H,8H,8H-Perfluoro-3,6-dioxaoctane-1,8-diol

949

MFOET

Ametryn

294.010

DTXSID50381992

Bis(1H,1H-perfluoropropyl)amine

959

MFHET

281.097

DTXSID3066215

(Heptafluorobutanoyl)pivaloylmethane

745

MFBET

296.185

The invitroTKstats package has a function that can take a data.frame, such as assay_cover_sheet, and generate the required chemical mapping table necessary for the merge_level0 function. This function is called create_chem_table. Though the chemical mapping table can be manually constructed it is highly encouraged analysts using invitroTKstats leverage the utility of this function.

# create a chemical table necessary for the L0 compilation function, using the
#   assay cover sheet chemical identification mapping information
chem.ids <- create_chem_table(
  # input table (data.frame class) with information
  input.table = assay_cover_sheet,
  # column name with DSSTox chemical ID's
  dtxsid.col = "Analyte",
  # column name with formal compound names
  compound.col = "Name",
  # column name with lab chemical ID's
  lab.compound.col = "Sample ID"
)
#> 3 chemicals.
Table 3: Chemical ID Mapping Information for the `merge_level0` Function.

Compound.Name

DTXSID

Lab.Compound.Name

(Heptafluorobutanoyl)pivaloylmethane

DTXSID3066215

745

Bis(1H,1H-perfluoropropyl)amine

DTXSID50381992

959

1H,1H,8H,8H-Perfluoro-3,6-dioxaoctane-1,8-diol

DTXSID70381090

949

Level-0 (L0) Data Compilation

Once the data guide and chemical ID map data.frames are compiled we can use the merge_level0 function to pull in all of the ‘raw’ level-0 data and compile it into a single data.frame.

It should be noted here we are not exporting the data catalog or level-0 data file, as denoted by the arguments catalog.out and output.res (respectively) both being set to FALSE. Furthermore, since we are not exporting either of these data.frames the FILENAME argument is not required and thereby not provided.

Any of the column names from the raw data not matching the anticipated column name convention are specified to their corresponding argument. For more details on using the merge_level0 function we direct the user to the help file, which can be accessed via help("merge_level0") or ?merge_level0 in the R console.

# compile the l0 files
kreutz.pfas_L0 <- merge_level0(
  level0.catalog = DC_kreutz.pfas, # data catalog
  INPUT.DIR = raw_data_dir, # the path to your raw data files
  num.rows.col = "Number.Data.Rows", 
  istd.col = "ISTD.Name",
  type.colname.col = "Type.ColName",
  chem.ids = chem.ids, # chemical ID mapping data
  chem.lab.id.col = "Lab.Compound.Name",
  chem.name.col = "Compound.Name",
  catalog.out = FALSE, # do not export the data catalog during this function
  output.res = FALSE # do not export the compiled L0 during this function
)
#> Hep_745_949_959_082421_final.xlsx, Data063021, 745
#> Hep_745_949_959_082421_final.xlsx, Data063021, 949
#> Hep_745_949_959_082421_final.xlsx, Data063021, 959
# show the dimension of the the Kruetz PFAS data
dim(kreutz.pfas_L0)
#> [1] 66 13
Table 4: Level-0 data resulting from the `merge_level0` Function.

Compound

DTXSID

Lab.Compound.ID

Date

Sample

Type

Compound.Conc

Peak.Area

ISTD.Peak.Area

ISTD.Name

Analysis.Params

Level0.File

Level0.Sheet

(Heptafluorobutanoyl)pivaloylmethane

DTXSID3066215

745

063021

G5-745 Hep062221 Inactive T0a

Sample

707.89135

24,893.5958

95,208.54

MFBET

5.589200

Hep_745_949_959_082421_final.xlsx

Data063021

(Heptafluorobutanoyl)pivaloylmethane

DTXSID3066215

745

063021

G5-745 Hep062221 Inactive T0b

Sample

848.90684

29,609.5852

94,542.88

MFBET

5.589217

Hep_745_949_959_082421_final.xlsx

Data063021

(Heptafluorobutanoyl)pivaloylmethane

DTXSID3066215

745

063021

G5-745 Hep062221 Inactive T0c

Sample

779.10136

27,117.1085

94,288.24

MFBET

5.589200

Hep_745_949_959_082421_final.xlsx

Data063021

(Heptafluorobutanoyl)pivaloylmethane

DTXSID3066215

745

063021

G5-745 Hep062221 Inactive T240a

Sample

23.00494

841.2405

98,452.29

MFBET

5.593517

Hep_745_949_959_082421_final.xlsx

Data063021

(Heptafluorobutanoyl)pivaloylmethane

DTXSID3066215

745

063021

G5-745 Hep062221 Inactive T240b

Sample

28.26328

1,044.1819

99,471.52

MFBET

5.593550

Hep_745_949_959_082421_final.xlsx

Data063021

(Heptafluorobutanoyl)pivaloylmethane

DTXSID3066215

745

063021

G5-745 Hep062221 Inactive T240c

Sample

26.58771

974.4343

98,675.85

MFBET

5.589200

Hep_745_949_959_082421_final.xlsx

Data063021

The level-0 data file is now ready to undergo the assay-specific processing steps, which can be found in the assay-specific vignettes:

Best Practices for Data Processing

References

Kreutz, Anna, Matthew S. Clifton, W. Matthew Henderson, Marci G. Smeltz, Matthew Phillips, John F. Wambaugh, and Barbara A. Wetmore. 2023. “Category-Based Toxicokinetic Evaluations of Data-Poor Per- and Polyfluoroalkyl Substances (PFAS) Using Gas Chromatography Coupled with Mass Spectrometry.” Toxics 11 (5): 463.