The Sparse Marginal Epistasis test (SME) requires a HDF5 file containing binary masking data. The effectiveness of the test relies on how informative the mask is for a specific trait. For instance, to test for marginal epistasis related to transcriptional activity, DNase I hypersensitivity sites data can be used to condition the detection on accessible chromatin regions.
Figure 1. The illustration shows open chromatin data.
DNase I hypersensitive sites indicate transcriptional activity. The
DNase-seq signal is converted to a binary mask, where genetic variants
in inaccessible chromatin regions are excluded from the covariance
matrix for marginal epistasis.
The sme() function expects the mask data to be in an
HDF5 file. The HDF5 format includes two primary object types:
The mask data should be organized into the following groups and datasets:
Groups:
ld: Contains SNPs in linkage disequilibrium (LD) with
the focal SNP, which will be excluded.gxg: Contains indices of SNPs used to condition the
marginal epistasis test, which will be included.The required group names can be configured as input parameters of
sme(). The defaults are ld and
gxg.
Datasets:
ld/<j>: For each focal SNP
<j>, this dataset contains indices of SNPs in the
same LD block as that SNP. These SNPs will be excluded
from the gene-by-gene interaction covariance matrix.gxg/<j>: For each focal SNP
<j>, this dataset contains indices of SNPs to
include in the the gene-by-gene interaction covariance
matrix for focal SNP <j>.Important: All indices in the mask file must be
zero-based and should correspond to the zero-based row
indices of the PLINK .bim file. This includes the dataset
index (<j> in gxg/<j>) and the
data itself. This zero-based indexing is necessary because the mask data
is read by a C++ subroutine in sme(), which uses zero-based
indexing, unlike R’s one-based indexing for SNP indices in the function
call.
The package provides utility functions to create, write, and read
valid mask files for sme().
hdf5_file <- tempfile()
# Group names
gxg_h5_group <- "gxg"
ld_h5_group <- "ld"
# Data (still in 1-based R indexing)
include_gxg_snps <- 1:10
exclude_ld_snps <- 5:6
# Focal SNP (still in 1-based R indexing)
focal_snp <- 4
# Dataset names
dataset_name_pattern <- "%s/%s"
# 0-based index!
gxg_dataset <- sprintf(dataset_name_pattern, gxg_h5_group, focal_snp - 1)
ld_dataset <- sprintf(dataset_name_pattern, ld_h5_group, focal_snp - 1)
# Create an empty HDF5 file
create_hdf5_file(hdf5_file)
# Write LD data
write_hdf5_dataset(hdf5_file, ld_dataset, exclude_ld_snps - 1) # 0-based index!
# Write GXG data
write_hdf5_dataset(hdf5_file, gxg_dataset, include_gxg_snps - 1)We can check that the data was written correctly.
ld_read <- read_hdf5_dataset(hdf5_file, ld_dataset)
gxg_read <- read_hdf5_dataset(hdf5_file, gxg_dataset)
print(sprintf("Zero-based indices of SNPs to exclude: %s", str(ld_read)))
#>  int [1:2] 4 5
#> character(0)
print(sprintf("Zero-based indices of SNPs to include: %s", str(gxg_read)))
#>  int [1:10] 0 1 2 3 4 5 6 7 8 9
#> character(0)sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Berlin
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1  IRanges_2.38.1      
#> [4] S4Vectors_0.42.1     BiocGenerics_0.50.0  ggplot2_3.5.1       
#> [7] dplyr_1.1.4          smer_0.0.1          
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9              generics_0.1.3          tidyr_1.3.1            
#>  [4] digest_0.6.37           magrittr_2.0.3          evaluate_1.0.3         
#>  [7] grid_4.4.2              iterators_1.0.14        CompQuadForm_1.4.3     
#> [10] mvtnorm_1.3-3           fastmap_1.2.0           foreach_1.5.2          
#> [13] genio_1.1.2             jsonlite_1.8.9          backports_1.5.0        
#> [16] httr_1.4.7              purrr_1.0.2             UCSC.utils_1.0.0       
#> [19] scales_1.3.0            codetools_0.2-20        jquerylib_0.1.4        
#> [22] cli_3.6.3               rlang_1.1.4             XVector_0.44.0         
#> [25] munsell_0.5.1           withr_3.0.2             cachem_1.1.0           
#> [28] yaml_2.3.10             FMStable_0.1-4          tools_4.4.2            
#> [31] parallel_4.4.2          checkmate_2.3.2         colorspace_2.1-1       
#> [34] GenomeInfoDbData_1.2.12 vctrs_0.6.5             R6_2.5.1               
#> [37] lifecycle_1.0.4         zlibbioc_1.50.0         mvMAPIT_2.0.3          
#> [40] pkgconfig_2.0.3         pillar_1.10.1           bslib_0.8.0            
#> [43] gtable_0.3.6            glue_1.8.0              Rcpp_1.0.14            
#> [46] harmonicmeanp_3.0.1     xfun_0.50               tibble_3.2.1           
#> [49] tidyselect_1.2.1        knitr_1.49              farver_2.1.2           
#> [52] htmltools_0.5.8.1       rmarkdown_2.29          labeling_0.4.3         
#> [55] compiler_4.4.2