---
title: "Introduction to easyLSEA"
author: "easyLSEA authors"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Introduction to easyLSEA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  eval      = FALSE
)
```

## Overview

**easyLSEA** provides a complete pipeline for Lipid Set Enrichment Analysis
(LSEA) in R. Starting from a table of differential lipid abundances, it
annotates lipids into biological groups, tests whether those groups are
systematically shifted between conditions, and produces publication-ready
bubble plots, distribution plots, and fatty acid chain visualizations.

The package runs two complementary enrichment engines:

- **Kolmogorov–Smirnov (KS)** — tests whether the entire logFC distribution
  of a lipid set differs from the background. Sensitive to distributed moderate
  effects across many set members.
- **fgsea** — tests whether set members cluster at the extreme ends of the
  ranked list. Sensitive to a small number of strongly regulated lipids.

Running both engines and comparing their results (convergence analysis) gives
a more complete picture of lipid remodeling than either method alone.

---

## Installation

Install the stable version from CRAN:

```{r install-cran}
install.packages("easyLSEA")
```

Or the development version from GitHub:

```{r install-github}
# install.packages("remotes")
remotes::install_github("DavidGO464/easyLSEA")
```

To enable the fgsea engine, install the optional Bioconductor dependency:

```{r install-fgsea}
BiocManager::install("fgsea")
```

---

## Input data

`easyLSEA` expects a `data.frame` with at least:

| Column | Description |
|--------|-------------|
| `LipidName` | Lipid identifier in standard shorthand notation (e.g. `PC 36:4`, `TG 54:3`) |
| `logFC` | log2 fold-change (case vs reference) |
| `P.Value` | Raw (unadjusted) p-value — used for the fgsea pi-value rank metric |

Additional columns are used when present:

| Column | Used for |
|--------|----------|
| `adj.P.Val` | Counting significantly changed lipids |
| `Confidence_rank` | Annotation confidence filtering |
| `Shorthand` | Alternative lipid name fallback |

Column names are configurable via `lipid_col`, `fc_col`, and `pval_col`
arguments — only the defaults are shown above.

---

## Quick start

The entire pipeline runs in a single call to `easyLSEA()`:

```{r quickstart}
library(easyLSEA)

result <- easyLSEA(
  data      = my_lipid_data,
  lipid_col = "LipidName",
  fc_col    = "logFC",
  pval_col  = "P.Value",
  case_lbl  = "NASH",
  ref_lbl   = "Control",
  engine    = "both",   # run KS and fgsea
  min_rank  = "E"       # include all confidence ranks except P and NA (default)
)
```

This returns an `easyLSEA_result` object with five slots described in the
next section.

---

## Understanding the output

### `result$meta`

A named list with run metadata: date, comparison labels, engine used, number
of lipids, and the original function call.

```{r meta}
result$meta$date
result$meta$case_lbl
result$meta$n_lipids
```

### `result$lsea`

Contains the enrichment statistics. The key sub-elements are:

```{r lsea}
# KS results — one row per lipid set per grouping level
head(result$lsea$ks)

# fgsea results
head(result$lsea$fgsea)

# Combined table with Convergence column
head(result$lsea$combined)
```

**Key columns in the KS results:**

| Column | Description |
|--------|-------------|
| `Group` | Lipid set name (e.g. `PC`, `Glycerolipids`) |
| `Level` | Grouping level (`LipidClass`, `LipidCategory_LMAPS`, `LipidCategory_functional`) |
| `N_group` | Number of lipids in the set |
| `DirectionalScore` | Standardized mean difference (Cohen's *d* analog). Positive = up in case. |
| `KS_pval` | Two-sided KS p-value |
| `FDR_LSEA` | BH-adjusted FDR |
| `DS_perm_pval` | Permutation p-value for the DirectionalScore |
| `ContributingLipids_KS` | Lipids on the enriched side of the CDF divergence point |

**Key columns in the fgsea results:**

| Column | Description |
|--------|-------------|
| `NES` | Normalized Enrichment Score. Positive = enriched toward top of ranked list (up in case). |
| `FDR_fgsea` | BH-adjusted FDR from fgsea |
| `N_leading` | Leading edge size |
| `LeadingEdge` | Lipids in the leading edge |
| `rank_metric` | Rank metric used (`pi_value`, `logFC`, or `t_stat`) |

**The `Convergence` column** in the combined table classifies each set as:

| Value | Meaning |
|-------|---------|
| `KS+fgsea [strongest]` | Significant by both engines — highest confidence |
| `KS only [distributed effect]` | Moderate shift across many lipids |
| `fgsea only [extreme-driven]` | A few strongly regulated lipids drive the signal |
| `Neither` | Not significant by either engine |

### `result$chains`

Fatty acid chain analysis results, available when `run_chains = TRUE`:

```{r chains}
# Long format — one row per acyl chain per lipid
head(result$chains$parsed)

# Parsing status — one row per lipid
head(result$chains$summary)

# Wide format — one row per lipid with sn positions and totals
head(result$chains$wide)
```

The `wide` table is the most convenient for reporting. Each row is one lipid,
with columns `sn1`, `sn2`, `sn3`, `sn4` containing the individual acyl chain
positions (e.g. `"18:1"`), and `total_carbons` / `total_unsat` with the
summed totals. The `chain_type` column clarifies how to interpret the sn
columns:

| `chain_type` | `sn1` | `sn2` | `sn3` | `sn4` |
|---|---|---|---|---|
| `sn2` (PC, PE, PS...) | sn-1 chain | sn-2 chain | NA | NA |
| `nacyl` (Cer, SM...) | sphingoid base | N-acyl chain | NA | NA |
| `long_format` (TG) | chain 1 | chain 2 | chain 3 | NA |
| `long_format` (CL) | chain 1 | chain 2 | chain 3 | chain 4 |
| `single` (CAR, LPC) | the chain | NA | NA | NA |

### `result$plots`

A named list of `ggplot2` objects, available when `plots = TRUE`:

```{r plots-list}
# List all available plots
names(result$plots$lsea)
names(result$plots$chains)
```

Plot naming convention for LSEA bubble plots:

| Name pattern | Description |
|-------------|-------------|
| `bubble_ks_01_Class` | KS bubble plot — lipid class level |
| `bubble_ks_sig_01_Class` | KS bubble plot — significant sets only |
| `bubble_fgsea_01_Class` | fgsea bubble plot — lipid class level |
| `bubble_fgsea_sig_01_Class` | fgsea bubble plot — significant sets only |
| `dist_01_Class` | Distribution (boxplot) — lipid class level |

Levels: `01_Class` (lipid class), `02_LMAPS` (LIPID MAPS category),
`03_Functional` (functional category).

### `result$input`

The annotated input data and the grouping columns used:

```{r input}
# Annotated data with LipidClass, LipidCategory_LMAPS, etc.
head(result$input$data)

# Grouping columns tested
result$input$group_cols
```

---

## Viewing plots

Individual plots can be displayed directly:

```{r view-plots}
# KS bubble plot — all lipid classes
result$plots$lsea$bubble_ks_01_Class

# fgsea bubble plot — significant sets only
result$plots$lsea$bubble_fgsea_sig_01_Class

# Distribution plot — lipid class level
result$plots$lsea$dist_01_Class
```

To customize bubble labels:

```{r bubble-label}
# Regenerate plots showing only FDR and n
plots <- plot_lsea(
  result$lsea,
  case_lbl     = "NASH",
  ref_lbl      = "Control",
  bubble_label = c("FDR", "n")
)
```

---

## Exporting results

`export_lsea()` saves all results to a timestamped folder. Supply the output
directory explicitly via `dir` (here a temporary directory; for a real
analysis use a folder of your choice):

```{r export}
export_lsea(
  result,
  dir    = tempdir(),
  format = c("csv", "excel", "pdf")
)
```

This creates:

```
easyLSEA_NASH_vs_Control_2024-01-15_1430/
  tables/
    lsea_results_ks.csv
    lsea_results_fgsea.csv
    lsea_combined.csv
    chain_results.csv
  plots/
    lsea/
      01_Class/
        bubble_ks_01_Class.pdf
        bubble_ks_sig_01_Class.pdf
        bubble_fgsea_01_Class.pdf
        bubble_fgsea_sig_01_Class.pdf
        dist_01_Class.pdf
      02_LMAPS/  ...
      03_Functional/  ...
    chains/
      tile/  ...
      trend/  ...
  results.xlsx
```

---

## Advanced usage

### Running engines separately

```{r advanced-separate}
# Step 1: annotate
annotated <- annotate_lipids(my_lipid_data, lipid_col = "LipidName")

# Step 2: run enrichment
lsea_res <- run_lsea(
  data      = annotated,
  fc_col    = "logFC",
  engine    = "both",
  case_lbl  = "NASH",
  ref_lbl   = "Control"
)

# Step 3: generate plots manually
plots <- plot_lsea(
  lsea_res,
  case_lbl     = "NASH",
  ref_lbl      = "Control",
  fdr_thresh   = 0.05,
  bubble_label = c("FDR", "DS", "NES", "n")
)

# Step 4: distribution plot for a specific level
p_dist <- plot_distribution(
  data        = annotated,
  lsea_result = lsea_res,
  group_col   = "LipidClass",
  case_lbl    = "NASH",
  ref_lbl     = "Control"
)
```

### Changing the fgsea rank metric

```{r fgsea-rank}
# Default: pi-value = sign(logFC) * -log10(P.Value)
# Combines effect size and statistical evidence

# Alternative: logFC only
result_fc <- easyLSEA(
  data        = my_lipid_data,
  engine      = "fgsea",
  fgsea_rank  = "logFC"
)

# Alternative: LIMMA t-statistic (requires a 't' column)
result_t <- easyLSEA(
  data        = my_lipid_data,
  engine      = "fgsea",
  fgsea_rank  = "t_stat"
)
```

### Adjusting significance thresholds

```{r thresholds}
result <- easyLSEA(
  data      = my_lipid_data,
  min_n     = 5L,      # require at least 5 lipids per set
  n_perm    = 5000L,   # more permutations for DS_perm_pval
  fgsea_nperm = 20000L # more permutations for fgsea
)
```

### Filtering by annotation confidence rank

When lipid annotations include a confidence rank (A > B > C > D > E > P),
`min_rank` controls which lipids enter the chain analysis:

```{r min-rank}
# Default: include all except P and NA
result_all <- easyLSEA(data = my_lipid_data, min_rank = "E")

# Strict: include only high-confidence annotations (A and B)
result_strict <- easyLSEA(data = my_lipid_data, min_rank = "B")

# Or apply directly to parse_lipid_chains
chains_strict <- parse_lipid_chains(annotated, min_rank = "B")
table(chains_strict$summary$status)
```

Lipids excluded by `min_rank` appear in `result$chains$summary` with status
`excluded_rank_below_X` (where X is the chosen threshold), making it easy to
audit which lipids were filtered.

---

## Interpreting results

### KS vs fgsea — which to trust?

Both engines test the same null hypothesis (no systematic enrichment) but are
sensitive to different signal patterns:

- **KS** is powerful when many lipids in a class shift moderately in the same
  direction. A significant KS result with `DirectionalScore ≈ 0` suggests a
  variance or shape difference rather than a net directional shift — interpret
  with caution.
- **fgsea** is powerful when a small number of lipids drive an extreme signal.
  A significant fgsea result without KS support suggests the enrichment is
  carried by a few lipids rather than the class as a whole.

**Convergent results** (significant by both) provide the strongest evidence
for coordinated lipid class remodeling.

### The DirectionalScore

The `DirectionalScore` is a standardized mean difference (Cohen's *d* analog):

```
DS = (mean_logFC_set - mean_logFC_background) / SD_pooled
```

It quantifies the **direction and magnitude** of the shift, independent of the
KS p-value. A set can have a significant KS p-value (distributional difference)
with a near-zero DirectionalScore (no net direction) — this pattern suggests
heterogeneous within-set regulation.

### The pi-value rank metric

By default, fgsea ranks lipids by:

```
pi-value = sign(logFC) × −log10(P.Value)
```

This combines the direction of change with statistical confidence, giving more
weight to lipids that are both strongly and significantly regulated. It is
preferred over logFC alone when p-values are available.

---

## Session info

```{r session, eval = TRUE}
sessionInfo()
```