---
title: "Genome-wide Melting Temperature Profiling: An E. coli Case Study"
author: "Junhui Li, Lihua Julie Zhu"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Genome-wide Melting Temperature Profiling: An E. coli Case Study}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo      = TRUE,
  message   = FALSE,
  warning   = FALSE,
  fig.align = "center",
  fig.width = 7,
  fig.height = 6
)
```

# Introduction

This vignette demonstrates how to compute melting temperature (Tm) across an
entire bacterial genome in a reproducible, end-to-end workflow using
**TmCalculator**. We use *Escherichia coli* K-12 MG1655
(NCBI assembly GCF\_000005845.2 / ASM584v2) as the reference organism,
calculating Tm for every non-overlapping 200 bp window along the single
circular chromosome.

The four steps covered here are:

1. **Build a BSgenome data package** from the NCBI assembly accession.
2. **Generate non-overlapping genomic windows** with `make_genomiccoord()`.
3. **Compute Tm and GC content** for all windows with `tm_calculate()`.
4. **Visualise** the genome-wide Tm profile as a circular (Circos) plot,
   optionally overlaid with multi-omics tracks.
5. **Test if the Tm values in the mutH peaks are significantly different from the rest of the genome** with `test_granges_targets()`.

By the end of this vignette you will have a publication-ready circular genome
map showing how sequence composition—quantified through Tm—varies around the
*E. coli* chromosome, and how it relates to independently measured genomic
features such as mismatch-repair hotspots, microsatellite density, and
cruciform-forming sequences.

---

# Prerequisites

## R packages

```{r packages}
# Bioconductor core
#BiocManager::install(c(
#  "BSgenomeForge",   # forge custom BSgenome packages from NCBI
#  "BSgenome",
#  "GenomicRanges"
#))

library(TmCalculator)
library(BSgenome)
library(GenomicRanges)
```

## Building the *E. coli* BSgenome data package {#build-bsgenome}

TmCalculator retrieves sequences from BSgenome data packages. Because an
*E. coli* BSgenome package is not on Bioconductor, we forge one locally from
the NCBI RefSeq assembly using **BSgenomeForge**. This step only needs to be
run once; thereafter, install the resulting package and load it like any other
BSgenome package.

```{r forge, eval=FALSE}
library(BSgenomeForge)

# Download the FASTA and build the data package (takes ~1–2 min)
forgeBSgenomeDataPkgFromNCBI(
  assembly_accession = "GCF_000005845.2",
  pkg_maintainer     = "Junhui Li <ljh.biostat@gmail.com>",
  destdir            = "."        # writes BSgenome.Ecoli.NCBI.ASM584v2/ here
)

# Install the freshly forged package
install.packages(
  "./BSgenome.Ecoli.NCBI.ASM584v2",
  repos = NULL,
  type  = "source"
)
```

The GitHub repo ships package metadata only; sequence data must be forged once.
The chunk below installs the package (if needed), forges it from NCBI when
the `Ecoli` genome object is still missing, then loads it. **Knit from the top**
so this chunk runs before `load-genome`.

```{r setup-ecoli-bsgenome, eval=FALSE, echo=FALSE, message=FALSE, warning=FALSE}
ecoli_pkg   <- "BSgenome.Ecoli.NCBI.ASM584v2"
genome_obj  <- "Ecoli"   # BSgenomeObjname in DESCRIPTION; not the package name

.ecoli_genome_ready <- function() {
  if (!requireNamespace(ecoli_pkg, quietly = TRUE)) return(FALSE)
  exists(genome_obj, envir = asNamespace(ecoli_pkg), inherits = FALSE)
}

if (!requireNamespace(ecoli_pkg, quietly = TRUE)) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    utils::install.packages("remotes", repos = "https://cloud.r-project.org")
  }
  remotes::install_github(
    "JunhuiLi1017/BSgenome.Ecoli.NCBI.ASM584v2",
    upgrade = "never",
    quiet = TRUE
  )
}

if (!.ecoli_genome_ready()) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    utils::install.packages("BiocManager", repos = "https://cloud.r-project.org")
  }
  if (!requireNamespace("BSgenomeForge", quietly = TRUE)) {
    BiocManager::install("BSgenomeForge", ask = FALSE, update = FALSE)
  }
  pkgdir <- BSgenomeForge::forgeBSgenomeDataPkgFromNCBI(
    assembly_accession = "GCF_000005845.2",
    pkg_maintainer     = "Junhui Li <ljh.biostat@gmail.com>",
    destdir            = tempdir()
  )
  utils::install.packages(pkgdir, repos = NULL, type = "source", quiet = TRUE)
  if (ecoli_pkg %in% loadedNamespaces()) {
    unloadNamespace(ecoli_pkg)
  }
}

if (!.ecoli_genome_ready()) {
  stop(
    "Could not load genome object '", genome_obj, "' from package '", ecoli_pkg, "'.\n",
    "Run the manual 'forge' chunk above, or forgeBSgenomeDataPkgFromNCBI() locally.",
    call. = FALSE
  )
}
```

Load the genome (object `Ecoli`; package `BSgenome.Ecoli.NCBI.ASM584v2` for coordinates):

```{r load-genome, eval=FALSE}
ecoli_pkg  <- "BSgenome.Ecoli.NCBI.ASM584v2"
genome_obj <- "Ecoli"

suppressPackageStartupMessages(library(ecoli_pkg, character.only = TRUE))
genome      <- base::get(genome_obj, envir = asNamespace(ecoli_pkg))
genome_name <- ecoli_pkg
chr_name    <- "U00096.3"
chr_length  <- length(genome[[chr_name]])

cat("Chromosome:", chr_name, "\n")
cat("Length:    ", format(chr_length, big.mark = ","), "bp\n")
```

---

# Step 1 — Generate Genomic Windows

`make_genomiccoord()` tiles the chromosome into non-overlapping 200 bp bins
and returns a character vector of coordinate strings accepted by
`coor_to_genomic_ranges()`. Setting `slide = window` (both 200 bp) ensures
the bins are contiguous and non-overlapping.

```{r make-coord, eval=FALSE}
bins_gc <- make_genomiccoord(
  bsgenome     = genome,
  chromosomes  = chr_name,
  window       = 200L,
  slide        = 200L,          # non-overlapping: slide == window
  start        = 1,
  end          = chr_length,
  strand       = "+"
)

cat("Total windows:", length(bins_gc), "\n")
cat("Last window:  ", bins_gc[length(bins_gc)], "\n")
```

`coor_to_genomic_ranges()` resolves each coordinate string against the
BSgenome package and returns a `GRanges` object carrying the full sequence
and its reverse complement:

```{r coor-to-gr, eval=FALSE}
input_new <- list(
  pkg_name = genome_name,
  seq      = bins_gc
)

runtime1 <- system.time({
  gr_batch <- to_genomic_ranges_fast(input_new)
})

cat(sprintf(
  "Coordinate resolution: %.2f s (elapsed)\n",
  runtime1["elapsed"]
))
```

---

# Step 2 — Compute Genome-wide Tm

We calculate Tm for every window using the nearest-neighbor (NN) method with
the SantaLucia & Hicks (2004) parameter set and a standard monovalent salt
concentration of 50 mM Na⁺. GC content is computed alongside Tm automatically.

```{r tm-calculate, eval=FALSE}
runtime2 <- system.time({
  tm_ASM584v2 <- tm_calculate(
    gr_batch,
    method   = "tm_nn",
    nn_table = "DNA_NN_SantaLucia_2004",
    Na       = 50            # mM; standard PCR-like conditions
  )
})

cat(sprintf(
  "Tm calculation: %.2f s (elapsed) for %s windows\n",
  runtime2["elapsed"],
  format(length(bins_gc), big.mark = ",")
))


```

The combined wall-clock time for both steps on a standard desktop is typically
under 10 seconds for the full *E. coli* genome at 200 bp resolution.

Extract the Tm and GC-content columns into a plain data frame for downstream
plotting:

```{r extract-tm, eval=FALSE}
Tm <- as.data.frame(tm_ASM584v2$gr[, c(4, 6)])

# Quick summary
summary(Tm[, c("Tm", "GC")])
```

---

# Step 3 — Visualise the Genome-wide Tm Profile

## Assemble multi-track data

We overlay the Tm/GC profile with four independently measured genomic
datasets bundled in the `ecoli_rep_hotspots` data object:

| Track | Source |
|-------|--------|
| MutL-AR | MutL protein ChIP-seq peaks (mismatch-repair activity) |
| Microsatellites | Tandem-repeat density per 200 bp bin |
| Cruciform | Cruciform-forming sequence density per 200 bp bin |
| ssDNA | Single-stranded DNA regions |
| GATC sites | GATC methylation-site density per 200 bp bin |

```{r load-hotspots, eval=FALSE}
data(ecoli_rep_hotspots)

# Reference labels: replication origin (ori) and terminus (dif)
label <- data.frame(
  seqnames = genome_name,
  start    = c(3925804, 1590777),
  end      = c(3925804, 1590777),
  label    = c("ori", "dif")
)
```

```{r track-list, eval=FALSE}
tracks <- list(
  # --- Protein interaction ---
  list(
    type   = "rect",
    data   = ecoli_rep_hotspots$all_peaks_IP_mutH,
    col    = "black",
    bg.col = "grey",
    name   = "MutL-AR"
  ),

  # --- Sequence thermodynamics ---
  list(
    type      = "line",
    data      = Tm,
    value_col = "GC",
    name      = "GC content"
  ),
  list(
    type      = "line",
    data      = Tm,
    value_col = "Tm",
    name      = "Melting temp"
  ),

  # --- Repeat / structural features ---
  list(
    type      = "line",
    data      = ecoli_rep_hotspots$bins_rep,
    value_col = "count",
    name      = "Microsatellites"
  ),
  list(
    type      = "line",
    data      = ecoli_rep_hotspots$bins_cru,
    value_col = "count",
    name      = "Cruciform"
  ),

  # --- ssDNA regions ---
  list(
    data = ecoli_rep_hotspots$ssdna,
    name = "ssDNA"
  ),

  # --- GATC methylation sites ---
  list(
    type      = "line",
    data      = ecoli_rep_hotspots$bins_gatc,
    value_col = "count",
    name      = "GATC sites"
  )
)
```

## Full circular genome map

`plot_circos_genome()` renders all tracks as concentric rings around the
chromosome, with the replication origin (ori) and terminus (dif) labelled:

```{r circos-full, eval=FALSE, fig.width=8, fig.height=8, fig.cap="Genome-wide Tm profile of *E. coli* K-12 MG1655 (ASM584v2). Concentric rings from outside in: MutL-AR ChIP-seq peaks (grey), GC content (%), melting temperature (°C), microsatellite density, cruciform-forming sequences, ssDNA regions, and GATC site density. Replication origin (ori, position 3,925,804) and terminus (dif, position 1,590,777) are labelled."}
plot_circos_genome(
  genome_name = genome_name,
  genome_size = chr_length,
  track_list  = tracks,
  canvas.xlim = c(-1, 1),
  canvas.ylim = c(-1, 1),
  label       = label
)
```

## Zoomed view — upper-right quadrant

To highlight a genomic region in detail, restrict `canvas.xlim` and
`canvas.ylim` to a quadrant. The plot below focuses on the first quadrant
(upper right), which spans roughly the first quarter of the chromosome
starting from ori:

```{r circos-zoom, eval=FALSE, fig.width=7, fig.height=7, fig.cap="Zoomed circular view of the upper-right quadrant of the *E. coli* chromosome. Track order and colour scheme are identical to the full-genome plot above."}
plot_circos_genome(
  genome_name = genome_name,
  genome_size = chr_length,
  track_list  = tracks,
  canvas.xlim = c(0.5, 1),
  canvas.ylim = c(0,   1)
)
```

# Step 4 — Test if the Tm values in the mutH peaks are significantly different from the rest of the genome

### Step 1 — Build the mutH peak GRanges 
```{r build-mutH-peaks, eval=FALSE}
mutH_peaks <- GRanges(
  seqnames = ecoli_rep_hotspots$all_peaks_IP_mutH$chr,         # NC_000913.3
  ranges   = IRanges(start = ecoli_rep_hotspots$all_peaks_IP_mutH$start,
                     end   = ecoli_rep_hotspots$all_peaks_IP_mutH$end)
)
```

### Step 2 — Align seqlevels: NC_000913.3 → U00096.3
```{r align-seqlevels, eval=FALSE}
# tm_ASM584v2 uses the BSgenome name "U00096.3"; remap the peaks to match.
seqlevels(mutH_peaks) <- "U00096.3"
```

### Step 3 — Annotate each Tm tile with which peak it falls into 
```{r annotate-mutH-peaks, eval=FALSE}
mutH_peaks$peak_id <- paste0("mutH_", seq_along(mutH_peaks))

tm_annot <- integrate_granges(
  gr_tm        = tm_ASM584v2$gr,
  gr_features  = mutH_peaks,
  strategy     = "overlap",
  feature_cols = "peak_id",
  keep_unmatched = TRUE   # keep non-overlapping tiles so they form the "non_peak" group
)
```

### Step 4 — Add a two-level grouping column 
```{r add-two-level-grouping-column, eval=FALSE}
tm_annot$in_mutH <- ifelse(is.na(tm_annot$peak_id), "non_peak", "peak")
```

```{r sanity-check, eval=FALSE}
table(tm_annot$in_mutH)
```
### Step 5 — Wilcoxon on Tm and GC, peaks vs the rest of the genome 
```{r wilcoxon-on-tm-and-gc, eval=FALSE}
res <- compare_groups(
  gr           = tm_annot,
  target       = c("Tm", "GC"),
  method       = "wilcoxon",
  group        = "in_mutH",
  alternative  = "greater",   # tests "peak > non_peak"
  posthoc      = FALSE     # only 2 groups, no post-hoc needed
)
res$results   # one row per target: test, statistic, p.value
res$summary   # per-group n, mean, sd, median
```

---

# Interpreting the Results

**MutL-AR-associated windows have lower Tm and higher GC content** than the rest of the genome
(p-value < 0.001 for both Tm and GC).

**MutL-AR peaks cluster near the replication terminus.** The mismatch-repair
protein MutL accumulates preferentially in the terminus region (around *dif*),
where replication forks converge and mismatch density is highest. This spatial
coincidence with locally elevated microsatellite density and cruciform-forming
sequence is consistent with the replication stress model of MMR recruitment.

**GATC methylation sites are distributed genome-wide** but show local density
fluctuations that partially anti-correlate with GC content, reflecting the
sequence context requirements for Dam methyltransferase (5′-GATC-3′).

---

For human-scale analyses (non-overlapping 200 bp windows across the ~3.1 Gb
haploid genome, ~15.5 M windows), we recommend running `tm_calculate()` in
parallelly using `BiocParallel` and splitting input by chromosome. Memory
requirements scale approximately linearly with the number of windows.

---

# Session Information

```{r session-info}
sessionInfo()
```