Phenological Analysis

Matthias Templ

Barbara Templ

2026-03-04

Vignette 3 of 4 for the pep725 R package. The most current version is available on GitHub and CRAN.

Introduction

This vignette teaches you how to perform core phenological analyses using the pep725 package. By the end, you’ll be able to:

  1. Calculate phenological normals (long-term baselines)
  2. Detect anomalies (years that deviate from normal)
  3. Assess data quality before analysis
  4. Analyze and visualize trends over time
  5. Link phenology to climate variables

These analyses form the foundation of phenological research, whether you’re studying climate change impacts, agricultural timing, or ecosystem dynamics.

What You’ll Learn

Analysis       What It Tells You           Key Function
Normals        “What is typical?”          pheno_normals()
Anomalies      “Which years are unusual?”  pheno_anomaly()
Quality        “Can I trust this data?”    pep_quality()
Trends         “Is timing changing?”       pheno_trend_turning()
Climate links  “Why is it changing?”       pheno_regional()

Prerequisites

This vignette assumes you’ve read the Getting Started vignette and are comfortable with the basics covered there (loading the dataset and subsetting it by species and phase).

Setup

Let’s load the package and prepare our data:

library(pep725)

# Download the synthetic dataset
pep <- pep_download()

# For this vignette, we'll focus on two well-represented species:
# 1. Apple (Malus domestica) - excellent coverage across Europe
# 2. Grapevine (Vitis vinifera) - longest historical records

apple <- pep[species == "Malus domestica"]
vine <- pep[species == "Vitis vinifera"]

# Verify subsets contain data
stopifnot(nrow(apple) > 0, nrow(vine) > 0)

# Quick overview of our data
cat("=== Apple (Malus domestica) ===\n")
#> === Apple (Malus domestica) ===
cat("Observations:", nrow(apple), "\n")
#> Observations: 899311
cat("Year range:", min(apple$year), "-", max(apple$year), "\n")
#> Year range: 1896 - 2025
cat("Stations:", length(unique(apple$s_id)), "\n\n")
#> Stations: 8403

cat("=== Grapevine (Vitis vinifera) ===\n")
#> === Grapevine (Vitis vinifera) ===
cat("Observations:", nrow(vine), "\n")
#> Observations: 146667
cat("Year range:", min(vine$year), "-", max(vine$year), "\n")
#> Year range: 1775 - 2025
cat("Stations:", length(unique(vine$s_id)), "\n")
#> Stations: 2145

Why these species?

  - Malus domestica (apple) has excellent coverage across Europe with clear phenological phases (bud burst, flowering, fruit development)
  - Vitis vinifera (grapevine) has the longest historical records (back to 1775) and is economically important for wine production

Part 1: Phenological Normals

What are Phenological Normals?

In climatology, “climate normals” are defined as 30-year averages of temperature or precipitation used to characterise typical climatic conditions. By analogy, we propose the term phenological normals to describe long-term averages of the timing of recurring biological events.

Why are normals important?

  1. Baseline for comparison: Describing an event as “early” or “late” is only meaningful relative to when it typically occurs.
  2. Detecting change: Comparing normals from different reference periods (e.g., 1961-1990 vs. 1991-2020) highlights long-term shifts in phenological timing.
  3. Spatial patterns: Normals vary across regions, reflecting differences in climate, latitude, and elevation.
  4. Model validation: Phenological models predict normals; observations validate them.
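Computationally, a normal is just a long-term summary of day-of-year (DOY) values. A minimal base-R sketch on made-up DOY values (not drawn from the PEP725 data) shows the core statistics that pheno_normals() reports per group:

```r
# Toy flowering DOYs (day of year) for one hypothetical station over ten years
doy <- c(118, 121, 119, 124, 117, 122, 120, 123, 119, 121)

c(mean = mean(doy), median = median(doy))  # classical vs. robust "normal"
quantile(doy, c(0.05, 0.95))               # thresholds for "early" and "late" years
```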

Standard Reference Periods

The World Meteorological Organization (WMO) defines standard 30-year periods for calculating normals:

Period     Use Case
1961-1990  Historical baseline, often used as “pre-warming” reference
1991-2020  Current WMO standard normal period
Custom     Any period relevant to your research question

Important: The period you choose affects your conclusions! If you compare 2023 to a 1961-1990 baseline, you might find larger anomalies than comparing to a 1991-2020 baseline (because the recent period already includes warming).
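A toy calculation makes this concrete (both normals below are hypothetical values chosen for illustration):

```r
obs_doy <- 110            # hypothetical flowering DOY observed in a recent year
normal_1961_1990 <- 122   # hypothetical "pre-warming" normal
normal_1991_2020 <- 116   # hypothetical recent normal, already shifted earlier

obs_doy - normal_1961_1990  # anomaly vs. historical baseline: -12 days
obs_doy - normal_1991_2020  # anomaly vs. recent baseline: only -6 days
```

Same observation, two different stories; always report which baseline you used.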

Calculating Normals with pheno_normals()

Let’s calculate normals for apple:

# Calculate normals for apple across countries and phases
# Using 1990-2015 for illustration (ensures enough data in synthetic dataset)
normals <- pheno_normals(
  pep = apple,
  period = 1990:2015,
  by = c("country", "phase_id"),
  min_years = 10
)
#> Note: 47 group(s) have fewer than 10 years of data and return NA values.

print(normals)
#> Phenological Normals
#> -------------------------------------------------- 
#> Reference period: 1990-2015
#> Minimum years required: 10
#> Groups with valid normals: 88 / 135
#> -------------------------------------------------- 
#> 
#>              country phase_id n_years n_obs mean_doy median_doy    sd_doy
#>               <char>    <int>   <int> <int>    <num>      <num>     <num>
#>  1:          Austria       10      16    16 100.1768   99.72368  6.648286
#>  2:          Austria       60      26    26 120.0063  121.61483  4.613005
#>  3:          Austria       65      26    26 124.4908  124.98810  4.524158
#>  4:          Austria       69      26    26 133.7162  135.30309  4.256209
#>  5:          Austria       87      26    26 253.7542  253.73135  3.656988
#>  6:          Austria       95      26    26 300.8767  301.42437  4.591024
#>  7:          Austria      205      26    26 284.4528  285.28720  4.520886
#>  8: Bosnia and Herz.       11      10    10 103.7233  104.00000 13.583708
#>  9: Bosnia and Herz.       60      10    10 108.8033  109.75000  5.558204
#> 10: Bosnia and Herz.       65      10    10 119.3617  113.83333 22.938199
#>       iqr_doy   mad_doy       q05      q10       q25      q75      q90      q95
#>         <num>     <num>     <num>    <num>     <num>    <num>    <num>    <num>
#>  1: 10.736547  8.231738  90.73387  92.6073  94.81338 105.5499 107.7067 108.3478
#>  2:  6.134029  3.981605 111.94225 113.1114 117.28287 123.4169 124.5456 125.4707
#>  3:  5.791667  4.800800 117.29167 119.9167 121.62500 127.4167 130.1548 130.4917
#>  4:  6.233295  3.997865 126.24906 127.2145 130.56262 136.7959 138.1406 138.6637
#>  5:  4.498309  3.616135 248.65787 249.4148 251.15189 255.6502 258.1226 260.1312
#>  6:  6.294203  4.795415 294.10897 294.9604 297.58550 303.8797 305.9588 307.2188
#>  7:  7.528800  6.053137 277.50401 278.2792 280.57616 288.1050 289.7455 290.1033
#>  8: 20.883333 16.605120  85.79000  86.7800  93.45000 114.3333 116.5500 121.2750
#>  9:  6.025000  5.782140 100.98333 102.6333 105.02500 111.0500 115.7900 116.6450
#> 10: 33.462500 18.433660  94.62000  99.8400 102.60000 136.0625 146.6500 154.0750
#>        period
#>        <char>
#>  1: 1990-2015
#>  2: 1990-2015
#>  3: 1990-2015
#>  4: 1990-2015
#>  5: 1990-2015
#>  6: 1990-2015
#>  7: 1990-2015
#>  8: 1990-2015
#>  9: 1990-2015
#> 10: 1990-2015
#> 
#> ... and 125 more rows

Understanding the parameters:

Parameter  What it does                Our choice
pep        Input data                  Apple observations
period     Years to include in normal  1990-2015 (26 years)
by         Grouping variables          Country and phase (separate normals for each)
min_years  Minimum years required      10 (groups with fewer years get NA)

Understanding the Output Statistics

The function returns several statistics for each group. Here’s what these mean and when to use each:

Statistic   Description                When to Use
n_years     Years with data            Check data availability
mean_doy    Arithmetic mean DOY        Standard measure, but sensitive to outliers
median_doy  Middle value               Recommended - robust to outliers
sd_doy      Standard deviation         Describes year-to-year variability
iqr_doy     Interquartile range        Robust measure of spread
mad_doy     Median absolute deviation  Most robust spread measure
q05-q95     Percentiles                Define “early” (q05) and “late” (q95) thresholds

Which measure should you use? For most purposes, prefer the median (median_doy) together with the MAD (mad_doy): both are robust to outliers. Report the mean and SD mainly when you need comparability with studies that use them.

# Get a summary of the normals
summary(normals)
#> Phenological Normals Summary
#> ================================================== 
#> 
#> Reference period: 1990-2015
#> Total groups: 135
#> Groups with valid normals: 88 (65.2%)
#> 
#> DOY Statistics (across all groups with valid normals):
#>   Mean DOY range: 100 - 320
#>   Median DOY range: 100 - 322
#>   Typical SD: 5.5 days
#>   Typical IQR: 6.9 days

Visualizing Normals

The plot() method shows the median DOY with interquartile range (Q25-Q75) for each group:

plot(normals)

A bar chart variant is also available with plot(normals, which = "bar").

Filtering to Specific Phases

Often you want normals for a specific phenological phase:

# Normals for apple first flowering (BBCH 60) only
flowering_normals <- pheno_normals(
  pep = pep,                          # Can use full dataset
  period = 1990:2015,
  species = "Malus domestica",         # Filter by species
  phase_id = 60,                      # Filter by phase (first flowers)
  by = "country",                     # Group only by country
  min_years = 5                       # Lower threshold for more countries
)
#> Note: 2 group(s) have fewer than 5 years of data and return NA values.

print(flowering_normals)
#> Phenological Normals
#> -------------------------------------------------- 
#> Reference period: 1990-2015
#> Minimum years required: 5
#> Species filter: Malus domestica
#> Phase filter: 60
#> Groups with valid normals: 20 / 22
#> -------------------------------------------------- 
#> 
#>              country n_years n_obs mean_doy median_doy   sd_doy   iqr_doy
#>               <char>   <int> <int>    <num>      <num>    <num>     <num>
#>  1:          Austria      26    26 120.0063   121.6148 4.613005  6.134029
#>  2: Bosnia and Herz.      10    10 108.8033   109.7500 5.558204  6.025000
#>  3:          Croatia      26    26 109.0705   108.2500 5.425643  7.416667
#>  4:          Czechia       7     7 132.7143   131.0000 8.769536 11.500000
#>  5:           France      16    16 109.7708   109.0000 7.145025 12.250000
#>  6:    Germany-North      26    26 119.9165   119.9906 2.367487  4.329320
#>  7:    Germany-South      26    26 117.7625   117.2110 2.499477  4.387785
#>  8:          Hungary      12    12 108.1403   108.5000 4.586159  4.150000
#>  9:            Italy      25    25 105.2377   107.0000 8.447191  9.666667
#> 10:           Latvia       3     3       NA         NA       NA        NA
#>      mad_doy      q05      q10      q25      q75      q90      q95    period
#>        <num>    <num>    <num>    <num>    <num>    <num>    <num>    <char>
#>  1: 3.981605 111.9423 113.1114 117.2829 123.4169 124.5456 125.4707 1990-2015
#>  2: 5.782140 100.9833 102.6333 105.0250 111.0500 115.7900 116.6450 1990-2015
#>  3: 5.003775 102.3125 103.6250 105.2708 112.6875 115.6250 119.3750 1990-2015
#>  4: 7.413000 122.8000 124.6000 127.0000 138.5000 143.0000 144.5000 1990-2015
#>  5: 9.636900 100.7500 101.0000 104.1250 116.3750 118.6667 119.7500 1990-2015
#>  6: 3.289164 116.9248 117.0149 117.7564 122.0857 123.0667 123.1908 1990-2015
#>  7: 3.273826 114.1460 114.3804 115.6648 120.0526 120.8275 121.4885 1990-2015
#>  8: 3.335850 100.0750 101.9750 107.0625 111.2125 112.9600 113.4500 1990-2015
#>  9: 6.989400  91.6000  94.8000 101.0000 110.6667 112.5714 116.1333 1990-2015
#> 10:       NA       NA       NA       NA       NA       NA       NA 1990-2015
#> 
#> ... and 12 more rows

Note on NA values: You’ll notice some countries show NA for all statistics. This happens when a country has fewer than min_years years of data. This is intentional - calculating a “normal” from just 2-3 years would be misleading.

The print output tells you “Groups with valid normals: X / Y” so you know how many groups had sufficient data.

Comparing Two Time Periods

A powerful application is comparing normals between periods to detect shifts:

# Calculate normals for two periods
# Period 1: Historical (1961-1990)
period1 <- pheno_normals(
  pep,
  period = 1961:1990,
  species = "Malus domestica",
  phase_id = 60,
  by = "country",
  min_years = 10
)

# Period 2: Recent (1991-2020)
period2 <- pheno_normals(
  pep,
  period = 1991:2020,
  species = "Malus domestica",
  phase_id = 60,
  by = "country",
  min_years = 10
)

# Compare the two periods by merging on country
# Negative shift = flowering got earlier
# Positive shift = flowering got later
if (nrow(period1) > 0 && nrow(period2) > 0) {
  comparison <- merge(
    period1[, .(country, mean_doy_early = mean_doy)],
    period2[, .(country, mean_doy_recent = mean_doy)],
    by = "country"
  )
  comparison[, shift := mean_doy_recent - mean_doy_early]

  cat("Countries compared:", nrow(comparison), "\n")
  cat("Mean shift:", round(mean(comparison$shift, na.rm = TRUE), 1), "days\n")
  cat("Range of shifts:", round(min(comparison$shift, na.rm = TRUE), 1), "to",
      round(max(comparison$shift, na.rm = TRUE), 1), "days\n")

  if (mean(comparison$shift, na.rm = TRUE) < 0) {
    cat("\nInterpretation: On average, flowering has shifted EARLIER\n")
  } else {
    cat("\nInterpretation: On average, flowering has shifted LATER\n")
  }
}
#> Countries compared: 20 
#> Mean shift: -7 days
#> Range of shifts: -11.5 to -1.6 days
#> 
#> Interpretation: On average, flowering has shifted EARLIER

Ecological interpretation: the mean shift of -7 days between these two 30-year normals, whose midpoints lie 30 years apart, corresponds to roughly 2.3 days earlier per decade.
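The per-decade figure is simple arithmetic; a sketch using the mean shift found above (the 30-year spacing is an assumption based on the two periods' midpoints):

```r
shift_days  <- -7   # mean shift between the 1961-1990 and 1991-2020 normals
years_apart <- 30   # midpoints of the two 30-year windows are 30 years apart

shift_days / (years_apart / 10)  # about -2.3 days per decade
```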

Comparing Species: Grapevine with Longer History

Grapevine records extend much further back, enabling comparison across longer time periods:

# Calculate normals for grapevine flowering
vine_normals <- pheno_normals(
  pep = vine,
  period = 1990:2015,
  phase_id = 65,  # Full flowering
  by = "country",
  min_years = 5
)

# Compare with apple for countries that have both
if (nrow(vine_normals) > 0) {
  cat("Grapevine flowering normals (BBCH 65):\n")
  print(vine_normals[!is.na(mean_doy), .(country, n_years, mean_doy, sd_doy)])
}
#> Grapevine flowering normals (BBCH 65):
#> Phenological Normals
#> -------------------------------------------------- 
#> Groups with valid normals: 9 / 9
#> -------------------------------------------------- 
#> 
#>          country n_years mean_doy    sd_doy
#>           <char>   <int>    <num>     <num>
#> 1:        France      16 155.4896  6.105468
#> 2: Germany-North      26 167.4531  3.757682
#> 3: Germany-South      26 166.5263  2.825585
#> 4:       Hungary      13 151.6538  8.191420
#> 5:         Italy      25 151.8000  8.596302
#> 6: Liechtenstein      17 167.0588  8.429640
#> 7:   Netherlands       6 154.3333 22.069587
#> 8:      Slovakia      16 158.4401  3.157515
#> 9:   Switzerland      26 163.4974  4.630641

Part 2: Phenological Anomalies

What are Anomalies?

An anomaly is the deviation of a single year from the long-term normal. Anomalies help you:

  1. Identify extreme years: Which years had unusually early or late phenology?
  2. Link to climate: Do years with early phenology correspond to warm springs?
  3. Assess trends: Are anomalies becoming more frequent or extreme?

How Anomalies are Calculated

For each year, the anomaly is calculated as:

Anomaly = Observed DOY - Normal DOY

The anomaly can be expressed in days (anomaly_days), as a standardized z-score (z_score), or as a percentile within the baseline distribution (percentile).
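These metrics can be sketched in a few lines of base R (toy numbers; this mirrors the robust defaults, median and MAD, but is not the package's internal code):

```r
baseline <- c(118, 121, 119, 124, 117, 122, 120, 123, 119, 121)  # toy baseline DOYs
obs <- 111                                                       # one observed year

anomaly_days <- obs - median(baseline)        # -9.5: early by about 10 days
z_score      <- anomaly_days / mad(baseline)  # standardized by robust spread
percentile   <- 100 * mean(baseline <= obs)   # 0: earlier than every baseline year
```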

Calculating Anomalies with pheno_anomaly()

# Calculate anomalies for apple
# Using 1990-2010 as baseline period
anomalies <- pheno_anomaly(
  pep = apple,
  baseline_period = 1990:2010,
  by = c("country", "phase_id"),
  min_years = 5
)
#> Note: 932 observation(s) have insufficient baseline data and return NA anomalies.

print(anomalies)
#> Phenological Anomalies
#> -------------------------------------------------- 
#> Baseline period: 1990-2010
#> Method: Robust (median/MAD)
#> Extreme threshold: |z| > 2.0
#> 
#> Total observations: 5639 (valid: 4707)
#> Extreme events: 1372 (29.1%) - 475 early, 897 late
#> Anomaly range: -193.0 to 84.8 days
#> -------------------------------------------------- 
#> 
#>     country phase_id  year observed_doy baseline_doy baseline_spread
#>      <char>    <int> <int>        <num>        <num>           <num>
#>  1: Austria       10  2000    103.37097      103.371        6.178726
#>  2: Austria       10  2001    104.23333      103.371        6.178726
#>  3: Austria       10  2002     99.94737      103.371        6.178726
#>  4: Austria       10  2003    107.53846      103.371        6.178726
#>  5: Austria       10  2004    109.76623      103.371        6.178726
#>  6: Austria       10  2005    106.26866      103.371        6.178726
#>  7: Austria       10  2006    107.87500      103.371        6.178726
#>  8: Austria       10  2007     95.01587      103.371        6.178726
#>  9: Austria       10  2008     94.20588      103.371        6.178726
#> 10: Austria       10  2009     98.98551      103.371        6.178726
#>     anomaly_days    z_score percentile is_extreme direction n_obs
#>            <num>      <num>      <num>     <lgcl>    <char> <int>
#>  1:    0.0000000  0.0000000  50.000000      FALSE    normal    62
#>  2:    0.8623656  0.1395701  55.550018      FALSE    normal    60
#>  3:   -3.4235993 -0.5540947  28.975703      FALSE    normal    57
#>  4:    4.1674938  0.6744908  75.000032      FALSE    normal    65
#>  5:    6.3952660  1.0350460  84.967629      FALSE    normal    77
#>  6:    2.8976890  0.4689784  68.045745      FALSE    normal    67
#>  7:    4.5040323  0.7289580  76.698634      FALSE    normal    56
#>  8:   -8.3550947 -1.3522358   8.814995      FALSE    normal    63
#>  9:   -9.1650854 -1.4833292   6.899348      FALSE    normal    68
#> 10:   -4.3854605 -0.7097677  23.892409      FALSE    normal    69
#>     n_baseline_years baseline_q05 baseline_q95
#>                <int>        <num>        <num>
#>  1:               11     94.61088     108.8206
#>  2:               11     94.61088     108.8206
#>  3:               11     94.61088     108.8206
#>  4:               11     94.61088     108.8206
#>  5:               11     94.61088     108.8206
#>  6:               11     94.61088     108.8206
#>  7:               11     94.61088     108.8206
#>  8:               11     94.61088     108.8206
#>  9:               11     94.61088     108.8206
#> 10:               11     94.61088     108.8206
#> 
#> ... and 5629 more rows

Understanding Anomaly Metrics

The function returns several ways to express the anomaly:

Metric        Description               Interpretation
anomaly_days  Difference in days        Negative = early, positive = late
z_score       Standardized anomaly      |z| > 2 is unusual, > 3 is extreme
percentile    Position in distribution  <10 very early, >90 very late
is_extreme    Boolean flag              TRUE if outside normal range
direction     Text label                “early”, “late”, or “normal”

Which metric to use? Use anomaly_days when communicating results (days are intuitive), z_score when comparing groups with different year-to-year variability, and percentile when ranking years within a group.

Identifying Extreme Years

Let’s find the most extreme years in our data:

# Find extreme years (is_extreme == TRUE)
extreme <- anomalies[is_extreme == TRUE]

if (nrow(extreme) > 0) {
  cat("Total extreme years found:", nrow(extreme), "\n\n")

  # Show the most extreme early years
  cat("Most extreme EARLY years:\n")
  early <- extreme[direction == "early"][order(anomaly_days)]
  if (nrow(early) > 0) {
    print(early[1:min(5, nrow(early)),
          .(country, phase_id, year, anomaly_days, z_score)])
  }

  # Show the most extreme late years
  cat("\nMost extreme LATE years:\n")
  late <- extreme[direction == "late"][order(-anomaly_days)]
  if (nrow(late) > 0) {
    print(late[1:min(5, nrow(late)),
          .(country, phase_id, year, anomaly_days, z_score)])
  }
} else {
  cat("No extreme years detected in this dataset\n")
}
#> Total extreme years found: 1372 
#> 
#> Most extreme EARLY years:
#> Phenological Anomalies
#> -------------------------------------------------- 
#> Baseline period: (from pre-computed normals)
#> Method: Unknown
#> 
#> Total observations: 5 (valid: 5)
#> Extreme events: 0 (0.0%) - 0 early, 0 late
#> Anomaly range: -193.0 to -123.7 days
#> -------------------------------------------------- 
#> 
#>    country phase_id  year anomaly_days    z_score
#>     <char>    <int> <int>        <num>      <num>
#> 1: Croatia       87  2023    -193.0000 -13.017672
#> 2: Croatia       87  2022    -163.0000 -10.994199
#> 3:   Spain       87  2008    -129.6667  -8.263840
#> 4:   Spain       87  2014    -128.6667  -8.200108
#> 5:   Spain       87  2007    -123.6667  -7.881451
#> 
#> Most extreme LATE years:
#> Phenological Anomalies
#> -------------------------------------------------- 
#> Baseline period: (from pre-computed normals)
#> Method: Unknown
#> 
#> Total observations: 5 (valid: 5)
#> Extreme events: 0 (0.0%) - 0 early, 0 late
#> Anomaly range: 44.8 to 84.8 days
#> -------------------------------------------------- 
#> 
#>        country phase_id  year anomaly_days   z_score
#>         <char>    <int> <int>        <num>     <num>
#> 1: Switzerland       87  2020        84.75  4.992410
#> 2: Switzerland       87  2012        56.25  3.313546
#> 3:     Hungary       87  1965        52.75 12.557431
#> 4:     Hungary       87  1960        44.75 10.652986
#> 5:     Hungary       87  1962        44.75 10.652986

What makes a year “extreme”? By default, years with |z-score| > 2 are flagged as extreme. Under a roughly normal distribution this corresponds to about the most unusual 5% of years (2.5% early + 2.5% late). The much higher rate here (29.1%) reflects the strong trend in the synthetic data: recent years sit far from the 1990-2010 baseline.
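The 5% figure follows from the standard normal distribution and can be checked in one line:

```r
# Share of a standard normal distribution with |z| > 2
2 * pnorm(-2)  # about 0.046, i.e. roughly 5%
```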

Summary of Anomalies

summary(anomalies)
#> Phenological Anomalies Summary
#> ================================================== 
#> 
#> Baseline period: 1990-2010
#> Total observations: 5639
#> Valid anomalies: 4707 (83.5%)
#> 
#> Anomaly Statistics:
#>   Mean anomaly: 0.3 days
#>   Median anomaly: 1.0 days
#>   SD of anomalies: 12.1 days
#>   Range: -193.0 to 84.8 days
#> 
#> Extreme Events:
#>   Total extreme: 1372 (29.1%)
#>   Extreme early: 475
#>   Extreme late: 897
#> 
#> Most Extreme Observations:
#>   1962: -30.7 days (early, z=-26.12)
#>   1954: 25.4 days (late, z=14.14)
#>   2023: -193.0 days (early, z=-13.02)
#>   1965: 52.8 days (late, z=12.56)
#>   2024: -49.6 days (early, z=-11.80)

Visualizing Anomalies

The plot() method shows anomalies over time, coloured by direction (early vs. late), with extreme events marked:

plot(anomalies)

A histogram of anomaly magnitudes is available with plot(anomalies, which = "histogram").

Robust vs. Classical Methods

Phenological data often contains outliers - observations that are clearly erroneous or affected by unusual local conditions. The anomaly function offers two approaches.

The robust method (robust = TRUE, default and recommended) uses the median as the baseline and the MAD (median absolute deviation) for standardization. This approach is less affected by outliers and should be preferred in most situations.

The classical method (robust = FALSE) uses the mean as the baseline and the standard deviation (SD) for standardization. This approach is more affected by outliers.

When to use each: keep the robust method (the default) for almost all analyses. Switch to the classical method only when the data have already been carefully cleaned, or when you need comparability with published results based on means and standard deviations.
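A toy example illustrates the difference (the vector and its outlier are made up):

```r
doy <- c(118, 120, 119, 121, 122, 117, 320)  # 320 is a data-entry error

c(mean = mean(doy), sd = sd(doy))        # both heavily distorted by the outlier
c(median = median(doy), mad = mad(doy))  # nearly unchanged
```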

Part 3: Data Quality Assessment

Why Quality Matters

Before running any analysis, you should understand your data’s quality. Poor quality data can lead to:

Quality Dimensions

The pep_quality() function assesses three dimensions:

Dimension     Question                   Metric
Completeness  How many years have data?  % of period covered
Continuity    Are there gaps?            Maximum consecutive missing years
Consistency   Are there outliers?        % of observations flagged
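Conceptually, the first two dimensions reduce to simple arithmetic on a station's vector of observation years. A base-R sketch with toy years (pep_quality() computes these per group; details may differ):

```r
years <- c(1990:1995, 1998, 2000:2005)  # toy station record with gaps

span         <- max(years) - min(years) + 1  # 16 calendar years
completeness <- 100 * length(years) / span   # 81.25% of the span has data
max_gap      <- max(diff(sort(years)) - 1)   # longest gap: 2 years (1996-1997)
```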

Running Quality Assessment

# Assess quality at the station level
quality <- pep_quality(
  pep = apple,
  by = c("s_id", "phase_id")
)

print(quality)
#> Phenological Data Quality Assessment
#> -------------------------------------------------- 
#> Outlier method: tukey
#> 
#> Quality Grade Distribution:
#>   Grade A: 16887 (46.3%)
#>   Grade B: 6363 (17.4%)
#>   Grade C: 5311 (14.6%)
#>   Grade D: 7927 (21.7%)
#> 
#> Total groups assessed: 36488
#> -------------------------------------------------- 
#> 
#>       s_id phase_id n_obs year_min year_max year_span n_years completeness_pct
#>     <fctr>    <int> <int>    <int>    <int>     <int>   <int>            <num>
#>  1:      1       60    40     1954     1999        46      36             78.3
#>  2:      1       65    36     1954     1998        45      34             75.6
#>  3:      1       69    37     1954     1999        46      36             78.3
#>  4:      1       87    63     1954     1999        46      34             73.9
#>  5:      1       95     7     1992     1999         8       7             87.5
#>  6:      2       60    23     1960     1982        23      23            100.0
#>  7:      2       65    23     1960     1982        23      23            100.0
#>  8:      2       69    23     1960     1982        23      23            100.0
#>  9:      2       87    45     1959     1982        24      24            100.0
#> 10:      3       60    21     1985     2005        21      21            100.0
#>     max_gap_years n_gaps n_outliers outlier_pct doy_mean doy_sd doy_range
#>             <int>  <int>      <int>       <num>    <num>  <num>     <int>
#>  1:             4      6          1         2.5    137.3    6.6        30
#>  2:             6      5          0         0.0    139.2    7.1        30
#>  3:             4      6          2         5.4    148.2    6.0        24
#>  4:             8      3          3         4.8    265.9   34.5       192
#>  5:             1      1          1        14.3    314.9    8.9        26
#>  6:             0      0          0         0.0    137.7    9.0        36
#>  7:             0      0          0         0.0    139.3    8.3        30
#>  8:             0      0          0         0.0    146.1    8.4        33
#>  9:             0      0          1         2.2    286.0   37.6       183
#> 10:             0      0          2         9.5    123.4    8.7        38
#>     quality_grade
#>            <char>
#>  1:             B
#>  2:             C
#>  3:             C
#>  4:             C
#>  5:             D
#>  6:             A
#>  7:             A
#>  8:             A
#>  9:             B
#> 10:             C
#> 
#> ... and 36478 more rows

Understanding Quality Grades

The function assigns letter grades based on thresholds:

Grade  Completeness  Max Gap     Outliers  Interpretation
A      ≥ 80%         ≤ 2 years   < 2%      Excellent - use confidently
B      ≥ 60%         ≤ 5 years   < 5%      Good - suitable for most analyses
C      ≥ 40%         ≤ 10 years  < 10%     Fair - use with caution
D      Below C       -           -         Poor - consider excluding

What grade do you need? For trend analyses that rely on long, continuous series, restrict to grades A and B; grade C can be acceptable for broad spatial summaries; grade D groups should usually be excluded.

# Summary of quality across all stations
summary(quality)
#> Phenological Data Quality Summary
#> ================================================== 
#> 
#> Total groups assessed: 36488
#> 
#> Quality Grade Distribution:
#>   Grade A: 16887 ( 46.3%) *********
#>   Grade B: 6363 ( 17.4%) ***
#>   Grade C: 5311 ( 14.6%) ***
#>   Grade D: 7927 ( 21.7%) ****
#> 
#> Completeness Statistics:
#>   Mean completeness: 80.9%
#>   Median completeness: 90.3%
#>   Range: 2.9% - 100.0%
#> 
#> Temporal Coverage:
#>   Mean years with data: 18.4
#>   Mean year span: 24.7
#>   Mean max gap: 4.6 years
#> 
#> Outlier Statistics:
#>   Mean outlier percentage: 2.29%
#>   Groups with no outliers: 26683 (73.1%)

Visualizing Quality Assessment

With many stations to assess, visual inspection is essential. The plot() method provides an overview of quality metrics:

# Overview plot: grade distribution + station map (requires pep for coordinates)
plot(quality, pep = apple)

Reading the plot:

You can also generate individual plots:

# Just the grade distribution (no pep data needed)
plot(quality, which = "grades")

# Just the map of station quality
plot(quality, which = "map", pep = apple)

Filtering Data by Quality

A critical step before analysis is filtering to high-quality data:

# Get high-quality stations
high_quality <- quality[quality_grade %in% c("A", "B")]

cat("Quality distribution:\n")
#> Quality distribution:
print(table(quality$quality_grade))
#> 
#>     A     B     C     D 
#> 16887  6363  5311  7927

cat("\nHigh quality (A or B):", nrow(high_quality), "of", nrow(quality),
    "(", round(100 * nrow(high_quality) / nrow(quality), 1), "%)\n")
#> 
#> High quality (A or B): 23250 of 36488 ( 63.7 %)

# Filter original data to these stations
if (nrow(high_quality) > 0) {
  good_stations <- unique(high_quality$s_id)
  apple_hq <- apple[s_id %in% good_stations]

  cat("\nOriginal apple observations:", nrow(apple), "\n")
  cat("After quality filter:", nrow(apple_hq),
      "(", round(100 * nrow(apple_hq) / nrow(apple), 1), "%)\n")
}
#> 
#> Original apple observations: 899311 
#> After quality filter: 819498 ( 91.1 %)

Assessing Quality at Different Levels

You can assess quality at various aggregation levels:

# Country-level quality (coarser assessment)
country_quality <- pep_quality(
  pep = pep,
  by = c("country", "species", "phase_id")
)

print(country_quality)
#> Phenological Data Quality Assessment
#> -------------------------------------------------- 
#> Outlier method: tukey
#> 
#> Quality Grade Distribution:
#>   Grade A: 967 (41.6%)
#>   Grade B: 429 (18.4%)
#>   Grade C: 363 (15.6%)
#>   Grade D: 568 (24.4%)
#> 
#> Total groups assessed: 2327
#> -------------------------------------------------- 
#> 
#>     country      species phase_id n_obs year_min year_max year_span n_years
#>      <char>       <fctr>    <int> <int>    <int>    <int>     <int>   <int>
#>  1:    <NA>         <NA>      100    92     1877     1910        34      29
#>  2:    <NA>         <NA>      109    49     1877     1910        34      27
#>  3:    <NA>         <NA>      111    97     1991     2021        31      29
#>  4:    <NA>         <NA>      131   552     1951     2021        71      71
#>  5:    <NA>         <NA>      182   187     1991     2021        31      31
#>  6:    <NA> Avena sativa        0   575     1951     2023        73      73
#>  7:    <NA> Avena sativa        1    94     1873     1909        37      32
#>  8:    <NA> Avena sativa       10   557     1951     2023        73      73
#>  9:    <NA> Avena sativa       31   259     1951     2023        73      61
#> 10:    <NA> Avena sativa       51   325     1951     2023        73      63
#>     completeness_pct max_gap_years n_gaps n_outliers outlier_pct doy_mean
#>                <num>         <int>  <int>      <int>       <num>    <num>
#>  1:             85.3             5      1          0         0.0    186.0
#>  2:             79.4             5      2          6        12.2    207.4
#>  3:             93.5             1      2          5         5.2    141.2
#>  4:            100.0             0      0         16         2.9    158.7
#>  5:            100.0             0      0          3         1.6     83.1
#>  6:            100.0             0      0          2         0.3     94.2
#>  7:             86.5             3      3          4         4.3    117.9
#>  8:            100.0             0      0          8         1.4    111.4
#>  9:             83.6            10      2         12         4.6    144.0
#> 10:             86.3            10      1          2         0.6    172.1
#>     doy_sd doy_range quality_grade
#>      <num>     <int>        <char>
#>  1:    9.3        43             B
#>  2:   14.0        69             D
#>  3:   10.2        51             C
#>  4:   11.3        92             B
#>  5:   17.1       105             A
#>  6:   13.3        82             A
#>  7:   10.6        55             B
#>  8:   12.1        78             A
#>  9:   13.7        88             C
#> 10:    9.5        55             C
#> 
#> ... and 2317 more rows

Part 4: Complete Analytical Workflow

Now let’s put it all together in a realistic workflow:

cat("=== STEP 1: Data Quality Assessment ===\n\n")
#> === STEP 1: Data Quality Assessment ===

# Assess quality
quality <- pep_quality(apple, by = c("s_id", "phase_id"))
cat("Quality grades:\n")
#> Quality grades:
print(table(quality$quality_grade))
#> 
#>     A     B     C     D 
#> 16887  6363  5311  7927

cat("\n=== STEP 2: Filter to Good Quality Data ===\n\n")
#> 
#> === STEP 2: Filter to Good Quality Data ===

# Keep grades A, B, and C (exclude only grade D)
good_stations <- quality[quality_grade %in% c("A", "B", "C"), s_id]
apple_clean <- apple[s_id %in% good_stations]

cat("Original observations:", nrow(apple), "\n")
#> Original observations: 899311
cat("After quality filter:", nrow(apple_clean), "\n")
#> After quality filter: 865445
cat("Retained:", round(100 * nrow(apple_clean) / nrow(apple), 1), "%\n")
#> Retained: 96.2 %

stopifnot(nrow(apple_clean) > 0)

cat("\n=== STEP 3: Calculate Baseline Normals ===\n\n")
#> 
#> === STEP 3: Calculate Baseline Normals ===

normals <- pheno_normals(
  apple_clean,
  period = 1990:2010,
  by = c("country", "phase_id"),
  min_years = 5
)
#> Note: 20 group(s) have fewer than 5 years of data and return NA values.

cat("Normals calculated for", sum(!is.na(normals$mean_doy)), "groups\n")
#> Normals calculated for 90 groups
cat("Mean DOY range:", round(min(normals$mean_doy, na.rm = TRUE), 0),
    "to", round(max(normals$mean_doy, na.rm = TRUE), 0), "\n")
#> Mean DOY range: 101 to 319

cat("\n=== STEP 4: Calculate Anomalies ===\n\n")
#> 
#> === STEP 4: Calculate Anomalies ===

anomalies <- pheno_anomaly(
  apple_clean,
  baseline_period = 1990:2010,
  by = c("country", "phase_id"),
  min_years = 5
)
#> Note: 884 observation(s) have insufficient baseline data and return NA anomalies.

# Convert to plain data.table so grouped aggregation works correctly
# (the pheno_anomaly S3 class can interfere with [.data.table dispatch)
anomalies_dt <- as.data.table(anomalies)

cat("Anomalies calculated for", sum(!is.na(anomalies_dt$anomaly_days)), "year-groups\n")
#> Anomalies calculated for 4694 year-groups
cat("Extreme year-groups:", sum(anomalies_dt$is_extreme, na.rm = TRUE), "\n")
#> Extreme year-groups: 1390

cat("\n=== STEP 5: Analyze Trends by Decade ===\n\n")
#> 
#> === STEP 5: Analyze Trends by Decade ===

decade_summary <- anomalies_dt[
  !is.na(anomaly_days),
  .(
    mean_anomaly = round(mean(anomaly_days, na.rm = TRUE), 1),
    sd_anomaly = round(sd(anomaly_days, na.rm = TRUE), 1),
    n_obs = .N,
    pct_extreme = round(100 * sum(is_extreme, na.rm = TRUE) / .N, 1)
  ),
  by = .(decade = floor(year / 10) * 10)
][order(decade)]

print(decade_summary)
#>     decade mean_anomaly sd_anomaly n_obs pct_extreme
#>      <num>        <num>      <num> <int>       <num>
#>  1:   1920          3.5        7.8     9        55.6
#>  2:   1930          0.2        9.6    19        42.1
#>  3:   1940         -0.8       10.7    51        51.0
#>  4:   1950          1.4       11.1   337        46.0
#>  5:   1960          3.0       12.6   537        44.7
#>  6:   1970          3.8        7.7   640        41.9
#>  7:   1980          3.9        8.9   630        40.2
#>  8:   1990          0.6        8.0   664         7.2
#>  9:   2000         -1.2       11.5   788         8.1
#> 10:   2010         -4.6       13.5   711        28.4
#> 11:   2020         -8.4       20.4   308        39.3

cat("\nInterpretation:\n")
#> 
#> Interpretation:
first_decade <- decade_summary$mean_anomaly[1]
last_decade <- decade_summary$mean_anomaly[nrow(decade_summary)]
change <- last_decade - first_decade

if (change < -3) {
  cat("- Strong shift toward EARLIER phenology\n")
} else if (change < 0) {
  cat("- Moderate shift toward earlier phenology\n")
} else if (change > 3) {
  cat("- Strong shift toward LATER phenology\n")
} else {
  cat("- No clear directional change\n")
}
#> - Strong shift toward EARLIER phenology
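The decade table compares group means, which is coarse. As a cross-check, you can fit a simple linear trend to the annual anomalies. This is a minimal base-R sketch reusing the `anomalies_dt` object from Step 4, not a pep725 function (for formal trend analysis, see `pheno_trend_turning()`):

```r
# Fit a linear trend: anomaly (days) as a function of year.
# A negative slope indicates advancing (earlier) phenology.
trend_fit <- lm(anomaly_days ~ year, data = anomalies_dt[!is.na(anomaly_days)])

cat("Trend:", round(coef(trend_fit)["year"] * 10, 2), "days per decade\n")

# 95% confidence interval for the slope, expressed per decade
confint(trend_fit, "year", level = 0.95) * 10
```

A linear fit assumes a constant rate of change, which the decade table suggests is not strictly true here; treat the slope as a rough summary rather than a model of the process.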

Best Practices Summary

Before Analysis

  1. Check data quality with pep_quality() - always!
  2. Filter to appropriate quality grades - usually A, B, or C
  3. Verify sample sizes - are there enough years/stations?
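A quick way to verify sample sizes is a per-series count of observed years. This sketch uses plain data.table on the `apple` subset created in Setup; the 10-year threshold is an illustrative choice, not a package default:

```r
# Count distinct years per station-phase series
coverage <- apple[, .(n_years = uniqueN(year)), by = .(s_id, phase_id)]

# How many series meet an (illustrative) 10-year minimum?
cat("Series with >= 10 years:",
    sum(coverage$n_years >= 10), "of", nrow(coverage), "\n")
```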

Calculating Normals

  1. Choose reference period carefully - it affects your conclusions
  2. Use median_doy for robust central tendency
  3. Set appropriate min_years - WMO recommends 30, but 10-20 often works
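Because the reference period affects every downstream anomaly, it is worth checking how sensitive your normals are to that choice. A minimal sketch, calling `pheno_normals()` with the same arguments as Step 3 but two standard WMO periods (the conversion to plain data.table mirrors the caveat noted for `pheno_anomaly` output above):

```r
# Compare normals from two candidate reference periods
norm_a <- as.data.table(pheno_normals(apple_clean, period = 1961:1990,
                                      by = c("country", "phase_id"),
                                      min_years = 5))
norm_b <- as.data.table(pheno_normals(apple_clean, period = 1991:2020,
                                      by = c("country", "phase_id"),
                                      min_years = 5))

# Merge and inspect the shift in mean DOY between the two baselines
cmp <- merge(norm_a, norm_b, by = c("country", "phase_id"),
             suffixes = c("_6190", "_9120"))
summary(cmp$mean_doy_9120 - cmp$mean_doy_6190)
```

If the shift between baselines is large relative to the anomalies you report, state the reference period prominently in your methods.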

Calculating Anomalies

  1. Use the robust method (default) - it handles outliers automatically
  2. Consider z-scores for cross-region/phase comparisons
  3. Check for extreme years - investigate them!
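Raw day anomalies can mislead when comparing regions or phases with different variability; standardizing to z-scores puts all series on a common scale. A hand-rolled sketch from the `anomalies_dt` object (check the options of `pheno_anomaly()` before rolling your own):

```r
# Standardize anomalies within each country-phase group
anomalies_dt[, anomaly_z := anomaly_days / sd(anomaly_days, na.rm = TRUE),
             by = .(country, phase_id)]

# Year-groups beyond |z| > 2 are candidates for closer inspection
extreme_z <- anomalies_dt[abs(anomaly_z) > 2][order(-abs(anomaly_z))]
head(extreme_z[, .(country, phase_id, year, anomaly_days, anomaly_z)])
```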

Reporting

  1. Document your methods - period, thresholds, filters
  2. Report sample sizes - readers need to assess reliability
  3. Include uncertainty - confidence intervals or standard errors
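As an example of reporting uncertainty, a normal-approximation confidence interval for the mean anomaly in the most recent decade can be computed directly from the workflow objects above:

```r
# 95% CI for the mean 2020s anomaly (normal approximation)
recent <- anomalies_dt[!is.na(anomaly_days) & year >= 2020]
m  <- mean(recent$anomaly_days)
se <- sd(recent$anomaly_days) / sqrt(nrow(recent))

cat(sprintf("2020s mean anomaly: %.1f days (95%% CI: %.1f to %.1f, n = %d)\n",
            m, m - 1.96 * se, m + 1.96 * se, nrow(recent)))
```

Note that observations within a decade are not independent (stations and phases repeat), so this interval is likely too narrow; a grouped or model-based approach is preferable for publication.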

Next Steps

You now understand the core analytical functions. Continue to:

vignette("spatial-patterns", package = "pep725")
vignette("data-quality", package = "pep725")

Session Info

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.0.1
#> 
#> Matrix products: default
#> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Zurich
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pep725_1.0.0        ggplot2_4.0.1       data.table_1.18.2.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10             generics_0.1.4          tidyr_1.3.2            
#>  [4] class_7.3-23            robustbase_0.99-7       KernSmooth_2.23-26     
#>  [7] lattice_0.22-7          digest_0.6.39           magrittr_2.0.4         
#> [10] evaluate_1.0.5          grid_4.5.2              RColorBrewer_1.1-3     
#> [13] rnaturalearth_1.2.0     fastmap_1.2.0           Matrix_1.7-4           
#> [16] jsonlite_2.0.0          e1071_1.7-17            DBI_1.2.3              
#> [19] mgcv_1.9-3              purrr_1.2.1             viridisLite_0.4.2      
#> [22] scales_1.4.0            jquerylib_0.1.4         cli_3.6.5              
#> [25] rlang_1.1.7             units_1.0-0             splines_4.5.2          
#> [28] withr_3.0.2             cachem_1.1.0            yaml_2.3.12            
#> [31] otel_0.2.0              tools_4.5.2             dplyr_1.2.0            
#> [34] vctrs_0.7.1             R6_2.6.1                proxy_0.4-29           
#> [37] lifecycle_1.0.5         classInt_0.4-11         pkgconfig_2.0.3        
#> [40] pillar_1.11.1           bslib_0.10.0            gtable_0.3.6           
#> [43] glue_1.8.0              Rcpp_1.1.1              sf_1.0-24              
#> [46] rnaturalearthdata_1.0.0 DEoptimR_1.1-4          xfun_0.56              
#> [49] tibble_3.3.1            tidyselect_1.2.1        rstudioapi_0.18.0      
#> [52] knitr_1.51              farver_2.1.2            nlme_3.1-168           
#> [55] htmltools_0.5.9         patchwork_1.3.2         rmarkdown_2.30         
#> [58] labeling_0.4.3          compiler_4.5.2          S7_0.2.1

  1. University of Applied Sciences Northwestern Switzerland (FHNW)↩︎

  2. Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)↩︎