Vignette 3 of 4 for the pep725 R package. The most current version is available on GitHub and CRAN.
This vignette teaches you how to perform core phenological analyses using the pep725 package. By the end, you’ll be able to:
These analyses form the foundation of phenological research, whether you’re studying climate change impacts, agricultural timing, or ecosystem dynamics.
| Analysis | What It Tells You | Key Function |
|---|---|---|
| Normals | “What is typical?” | pheno_normals() |
| Anomalies | “Which years are unusual?” | pheno_anomaly() |
| Quality | “Can I trust this data?” | pep_quality() |
| Trends | “Is timing changing?” | pheno_trend_turning() |
| Climate links | “Why is it changing?” | pheno_regional() |
This vignette assumes you’ve read the Getting Started vignette and understand:
pep objectLet’s load the package and prepare our data:
library(pep725)
# Download the synthetic dataset
pep <- pep_download()
# For this vignette, we'll focus on two well-represented species:
# 1. Apple (Malus domestica) - excellent coverage across Europe
# 2. Grapevine (Vitis vinifera) - longest historical records
apple <- pep[species == "Malus domestica"]
vine <- pep[species == "Vitis vinifera"]
# Verify subsets contain data
stopifnot(nrow(apple) > 0, nrow(vine) > 0)
# Quick overview of our data
cat("=== Apple (Malus domestica) ===\n")
#> === Apple (Malus domestica) ===
cat("Observations:", nrow(apple), "\n")
#> Observations: 899311
cat("Year range:", min(apple$year), "-", max(apple$year), "\n")
#> Year range: 1896 - 2025
cat("Stations:", length(unique(apple$s_id)), "\n\n")
#> Stations: 8403
cat("=== Grapevine (Vitis vinifera) ===\n")
#> === Grapevine (Vitis vinifera) ===
cat("Observations:", nrow(vine), "\n")
#> Observations: 146667
cat("Year range:", min(vine$year), "-", max(vine$year), "\n")
#> Year range: 1775 - 2025
cat("Stations:", length(unique(vine$s_id)), "\n")
#> Stations: 2145Why these species? - Malus domestica (apple) has excellent coverage across Europe with clear phenological phases (bud burst, flowering, fruit development) - Vitis vinifera (grapevine) has the longest historical records (back to 1775) and is economically important for wine production
In climatology, “climate normals” are defined as 30-year averages of temperature or precipitation used to characterise typical climatic conditions. By analogy, we propose the term phenological normals to describe long-term averages of the timing of recurring biological events.
Why are normals important?
The World Meteorological Organization (WMO) defines standard 30-year periods for calculating normals:
| Period | Use Case |
|---|---|
| 1961-1990 | Historical baseline, often used as “pre-warming” reference |
| 1991-2020 | Current WMO standard normal period |
| Custom | Any period relevant to your research question |
Important: The period you choose affects your conclusions! If you compare 2023 to a 1961-1990 baseline, you might find larger anomalies than comparing to a 1991-2020 baseline (because the recent period already includes warming).
Let’s calculate normals for apple:
# Calculate normals for apple across countries and phases
# Using 1990-2015 for illustration (ensures enough data in synthetic dataset)
normals <- pheno_normals(
pep = apple,
period = 1990:2015,
by = c("country", "phase_id"),
min_years = 10
)
#> Note: 47 group(s) have fewer than 10 years of data and return NA values.
print(normals)
#> Phenological Normals
#> --------------------------------------------------
#> Reference period: 1990-2015
#> Minimum years required: 10
#> Groups with valid normals: 88 / 135
#> --------------------------------------------------
#>
#> country phase_id n_years n_obs mean_doy median_doy sd_doy
#> <char> <int> <int> <int> <num> <num> <num>
#> 1: Austria 10 16 16 100.1768 99.72368 6.648286
#> 2: Austria 60 26 26 120.0063 121.61483 4.613005
#> 3: Austria 65 26 26 124.4908 124.98810 4.524158
#> 4: Austria 69 26 26 133.7162 135.30309 4.256209
#> 5: Austria 87 26 26 253.7542 253.73135 3.656988
#> 6: Austria 95 26 26 300.8767 301.42437 4.591024
#> 7: Austria 205 26 26 284.4528 285.28720 4.520886
#> 8: Bosnia and Herz. 11 10 10 103.7233 104.00000 13.583708
#> 9: Bosnia and Herz. 60 10 10 108.8033 109.75000 5.558204
#> 10: Bosnia and Herz. 65 10 10 119.3617 113.83333 22.938199
#> iqr_doy mad_doy q05 q10 q25 q75 q90 q95
#> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 10.736547 8.231738 90.73387 92.6073 94.81338 105.5499 107.7067 108.3478
#> 2: 6.134029 3.981605 111.94225 113.1114 117.28287 123.4169 124.5456 125.4707
#> 3: 5.791667 4.800800 117.29167 119.9167 121.62500 127.4167 130.1548 130.4917
#> 4: 6.233295 3.997865 126.24906 127.2145 130.56262 136.7959 138.1406 138.6637
#> 5: 4.498309 3.616135 248.65787 249.4148 251.15189 255.6502 258.1226 260.1312
#> 6: 6.294203 4.795415 294.10897 294.9604 297.58550 303.8797 305.9588 307.2188
#> 7: 7.528800 6.053137 277.50401 278.2792 280.57616 288.1050 289.7455 290.1033
#> 8: 20.883333 16.605120 85.79000 86.7800 93.45000 114.3333 116.5500 121.2750
#> 9: 6.025000 5.782140 100.98333 102.6333 105.02500 111.0500 115.7900 116.6450
#> 10: 33.462500 18.433660 94.62000 99.8400 102.60000 136.0625 146.6500 154.0750
#> period
#> <char>
#> 1: 1990-2015
#> 2: 1990-2015
#> 3: 1990-2015
#> 4: 1990-2015
#> 5: 1990-2015
#> 6: 1990-2015
#> 7: 1990-2015
#> 8: 1990-2015
#> 9: 1990-2015
#> 10: 1990-2015
#>
#> ... and 125 more rowsUnderstanding the parameters:
| Parameter | What it does | Our choice |
|---|---|---|
pep |
Input data | Apple observations |
period |
Years to include in normal | 1990-2015 (26 years) |
by |
Grouping variables | Country and phase (separate normals for each) |
min_years |
Minimum years required | 10 (groups with fewer years get NA) |
The function returns several statistics for each group. Here’s what these mean and when to use each:
| Statistic | Description | When to Use |
|---|---|---|
n_years |
Years with data | Check data availability |
mean_doy |
Arithmetic mean DOY | Standard measure, but sensitive to outliers |
median_doy |
Middle value | Recommended - robust to outliers |
sd_doy |
Standard deviation | Describes year-to-year variability |
iqr_doy |
Interquartile range | Robust measure of spread |
mad_doy |
Median absolute deviation | Most robust spread measure |
q05-q95 |
Percentiles | Define “early” (q05) and “late” (q95) thresholds |
Which measure should you use?
median_doy
(robust to outliers)mad_doy or
iqr_doy (robust)# Get a summary of the normals
summary(normals)
#> Phenological Normals Summary
#> ==================================================
#>
#> Reference period: 1990-2015
#> Total groups: 135
#> Groups with valid normals: 88 (65.2%)
#>
#> DOY Statistics (across all groups with valid normals):
#> Mean DOY range: 100 - 320
#> Median DOY range: 100 - 322
#> Typical SD: 5.5 days
#> Typical IQR: 6.9 daysThe plot() method shows the median DOY with
interquartile range (Q25-Q75) for each group:
A bar chart variant is also available with
plot(normals, which = "bar").
Often you want normals for a specific phenological phase:
# Normals for apple first flowering (BBCH 60) only
flowering_normals <- pheno_normals(
pep = pep, # Can use full dataset
period = 1990:2015,
species = "Malus domestica", # Filter by species
phase_id = 60, # Filter by phase (first flowers)
by = "country", # Group only by country
min_years = 5 # Lower threshold for more countries
)
#> Note: 2 group(s) have fewer than 5 years of data and return NA values.
print(flowering_normals)
#> Phenological Normals
#> --------------------------------------------------
#> Reference period: 1990-2015
#> Minimum years required: 5
#> Species filter: Malus domestica
#> Phase filter: 60
#> Groups with valid normals: 20 / 22
#> --------------------------------------------------
#>
#> country n_years n_obs mean_doy median_doy sd_doy iqr_doy
#> <char> <int> <int> <num> <num> <num> <num>
#> 1: Austria 26 26 120.0063 121.6148 4.613005 6.134029
#> 2: Bosnia and Herz. 10 10 108.8033 109.7500 5.558204 6.025000
#> 3: Croatia 26 26 109.0705 108.2500 5.425643 7.416667
#> 4: Czechia 7 7 132.7143 131.0000 8.769536 11.500000
#> 5: France 16 16 109.7708 109.0000 7.145025 12.250000
#> 6: Germany-North 26 26 119.9165 119.9906 2.367487 4.329320
#> 7: Germany-South 26 26 117.7625 117.2110 2.499477 4.387785
#> 8: Hungary 12 12 108.1403 108.5000 4.586159 4.150000
#> 9: Italy 25 25 105.2377 107.0000 8.447191 9.666667
#> 10: Latvia 3 3 NA NA NA NA
#> mad_doy q05 q10 q25 q75 q90 q95 period
#> <num> <num> <num> <num> <num> <num> <num> <char>
#> 1: 3.981605 111.9423 113.1114 117.2829 123.4169 124.5456 125.4707 1990-2015
#> 2: 5.782140 100.9833 102.6333 105.0250 111.0500 115.7900 116.6450 1990-2015
#> 3: 5.003775 102.3125 103.6250 105.2708 112.6875 115.6250 119.3750 1990-2015
#> 4: 7.413000 122.8000 124.6000 127.0000 138.5000 143.0000 144.5000 1990-2015
#> 5: 9.636900 100.7500 101.0000 104.1250 116.3750 118.6667 119.7500 1990-2015
#> 6: 3.289164 116.9248 117.0149 117.7564 122.0857 123.0667 123.1908 1990-2015
#> 7: 3.273826 114.1460 114.3804 115.6648 120.0526 120.8275 121.4885 1990-2015
#> 8: 3.335850 100.0750 101.9750 107.0625 111.2125 112.9600 113.4500 1990-2015
#> 9: 6.989400 91.6000 94.8000 101.0000 110.6667 112.5714 116.1333 1990-2015
#> 10: NA NA NA NA NA NA NA 1990-2015
#>
#> ... and 12 more rowsNote on NA values: You’ll notice some countries show
NA for all statistics. This happens when a country has
fewer than min_years years of data. This is intentional -
calculating a “normal” from just 2-3 years would be misleading.
The print output tells you “Groups with valid normals: X / Y” so you know how many groups had sufficient data.
A powerful application is comparing normals between periods to detect shifts:
# Calculate normals for two periods
# Period 1: Historical (1961-1990)
period1 <- pheno_normals(
pep,
period = 1961:1990,
species = "Malus",
phase_id = 60,
by = "country",
min_years = 10
)
# Period 2: Recent (1991-2020)
period2 <- pheno_normals(
pep,
period = 1991:2020,
species = "Malus",
phase_id = 60,
by = "country",
min_years = 10
)
# Compare the two periods by merging on country
# Negative shift = flowering got earlier
# Positive shift = flowering got later
if (nrow(period1) > 0 && nrow(period2) > 0) {
comparison <- merge(
period1[, .(country, mean_doy_early = mean_doy)],
period2[, .(country, mean_doy_recent = mean_doy)],
by = "country"
)
comparison[, shift := mean_doy_recent - mean_doy_early]
cat("Countries compared:", nrow(comparison), "\n")
cat("Mean shift:", round(mean(comparison$shift, na.rm = TRUE), 1), "days\n")
cat("Range of shifts:", round(min(comparison$shift, na.rm = TRUE), 1), "to",
round(max(comparison$shift, na.rm = TRUE), 1), "days\n")
if (mean(comparison$shift, na.rm = TRUE) < 0) {
cat("\nInterpretation: On average, flowering has shifted EARLIER\n")
} else {
cat("\nInterpretation: On average, flowering has shifted LATER\n")
}
}
#> Countries compared: 20
#> Mean shift: -7 days
#> Range of shifts: -11.5 to -1.6 days
#>
#> Interpretation: On average, flowering has shifted EARLIEREcological interpretation: A shift of -5 days (earlier) over 30 years corresponds to about 1.7 days per decade.
Grapevine records extend much further back, enabling comparison across longer time periods:
# Calculate normals for grapevine flowering
vine_normals <- pheno_normals(
pep = vine,
period = 1990:2015,
phase_id = 65, # Full flowering
by = "country",
min_years = 5
)
# Compare with apple for countries that have both
if (nrow(vine_normals) > 0) {
cat("Grapevine flowering normals (BBCH 65):\n")
print(vine_normals[!is.na(mean_doy), .(country, n_years, mean_doy, sd_doy)])
}
#> Grapevine flowering normals (BBCH 65):
#> Phenological Normals
#> --------------------------------------------------
#> Groups with valid normals: 9 / 9
#> --------------------------------------------------
#>
#> country n_years mean_doy sd_doy
#> <char> <int> <num> <num>
#> 1: France 16 155.4896 6.105468
#> 2: Germany-North 26 167.4531 3.757682
#> 3: Germany-South 26 166.5263 2.825585
#> 4: Hungary 13 151.6538 8.191420
#> 5: Italy 25 151.8000 8.596302
#> 6: Liechtenstein 17 167.0588 8.429640
#> 7: Netherlands 6 154.3333 22.069587
#> 8: Slovakia 16 158.4401 3.157515
#> 9: Switzerland 26 163.4974 4.630641An anomaly is the deviation of a single year from the long-term normal. Anomalies help you:
For each year, the anomaly is calculated as:
Anomaly = Observed DOY - Normal DOY
The anomaly can be expressed in:
# Calculate anomalies for apple
# Using 1990-2010 as baseline period
anomalies <- pheno_anomaly(
pep = apple,
baseline_period = 1990:2010,
by = c("country", "phase_id"),
min_years = 5
)
#> Note: 932 observation(s) have insufficient baseline data and return NA anomalies.
print(anomalies)
#> Phenological Anomalies
#> --------------------------------------------------
#> Baseline period: 1990-2010
#> Method: Robust (median/MAD)
#> Extreme threshold: |z| > 2.0
#>
#> Total observations: 5639 (valid: 4707)
#> Extreme events: 1372 (29.1%) - 475 early, 897 late
#> Anomaly range: -193.0 to 84.8 days
#> --------------------------------------------------
#>
#> country phase_id year observed_doy baseline_doy baseline_spread
#> <char> <int> <int> <num> <num> <num>
#> 1: Austria 10 2000 103.37097 103.371 6.178726
#> 2: Austria 10 2001 104.23333 103.371 6.178726
#> 3: Austria 10 2002 99.94737 103.371 6.178726
#> 4: Austria 10 2003 107.53846 103.371 6.178726
#> 5: Austria 10 2004 109.76623 103.371 6.178726
#> 6: Austria 10 2005 106.26866 103.371 6.178726
#> 7: Austria 10 2006 107.87500 103.371 6.178726
#> 8: Austria 10 2007 95.01587 103.371 6.178726
#> 9: Austria 10 2008 94.20588 103.371 6.178726
#> 10: Austria 10 2009 98.98551 103.371 6.178726
#> anomaly_days z_score percentile is_extreme direction n_obs
#> <num> <num> <num> <lgcl> <char> <int>
#> 1: 0.0000000 0.0000000 50.000000 FALSE normal 62
#> 2: 0.8623656 0.1395701 55.550018 FALSE normal 60
#> 3: -3.4235993 -0.5540947 28.975703 FALSE normal 57
#> 4: 4.1674938 0.6744908 75.000032 FALSE normal 65
#> 5: 6.3952660 1.0350460 84.967629 FALSE normal 77
#> 6: 2.8976890 0.4689784 68.045745 FALSE normal 67
#> 7: 4.5040323 0.7289580 76.698634 FALSE normal 56
#> 8: -8.3550947 -1.3522358 8.814995 FALSE normal 63
#> 9: -9.1650854 -1.4833292 6.899348 FALSE normal 68
#> 10: -4.3854605 -0.7097677 23.892409 FALSE normal 69
#> n_baseline_years baseline_q05 baseline_q95
#> <int> <num> <num>
#> 1: 11 94.61088 108.8206
#> 2: 11 94.61088 108.8206
#> 3: 11 94.61088 108.8206
#> 4: 11 94.61088 108.8206
#> 5: 11 94.61088 108.8206
#> 6: 11 94.61088 108.8206
#> 7: 11 94.61088 108.8206
#> 8: 11 94.61088 108.8206
#> 9: 11 94.61088 108.8206
#> 10: 11 94.61088 108.8206
#>
#> ... and 5629 more rowsThe function returns several ways to express the anomaly:
| Metric | Description | Interpretation |
|---|---|---|
anomaly_days |
Difference in days | Negative = early, positive = late |
z_score |
Standardized anomaly | |z| > 2 is unusual, > 3 is extreme |
percentile |
Position in distribution | <10 very early, >90 very late |
is_extreme |
Boolean flag | TRUE if outside normal range |
direction |
Text label | “early”, “late”, or “normal” |
Which metric to use?
anomaly_days
(“flowering was 10 days early”)z_score (allows
comparison across regions/phases)percentile
(“a 1-in-20 year event”)Let’s find the most extreme years in our data:
# Find extreme years (is_extreme == TRUE)
extreme <- anomalies[is_extreme == TRUE]
if (nrow(extreme) > 0) {
cat("Total extreme years found:", nrow(extreme), "\n\n")
# Show the most extreme early years
cat("Most extreme EARLY years:\n")
early <- extreme[direction == "early"][order(anomaly_days)]
if (nrow(early) > 0) {
print(early[1:min(5, nrow(early)),
.(country, phase_id, year, anomaly_days, z_score)])
}
# Show the most extreme late years
cat("\nMost extreme LATE years:\n")
late <- extreme[direction == "late"][order(-anomaly_days)]
if (nrow(late) > 0) {
print(late[1:min(5, nrow(late)),
.(country, phase_id, year, anomaly_days, z_score)])
}
} else {
cat("No extreme years detected in this dataset\n")
}
#> Total extreme years found: 1372
#>
#> Most extreme EARLY years:
#> Phenological Anomalies
#> --------------------------------------------------
#> Baseline period: (from pre-computed normals)
#> Method: Unknown
#>
#> Total observations: 5 (valid: 5)
#> Extreme events: 0 (0.0%) - 0 early, 0 late
#> Anomaly range: -193.0 to -123.7 days
#> --------------------------------------------------
#>
#> country phase_id year anomaly_days z_score
#> <char> <int> <int> <num> <num>
#> 1: Croatia 87 2023 -193.0000 -13.017672
#> 2: Croatia 87 2022 -163.0000 -10.994199
#> 3: Spain 87 2008 -129.6667 -8.263840
#> 4: Spain 87 2014 -128.6667 -8.200108
#> 5: Spain 87 2007 -123.6667 -7.881451
#>
#> Most extreme LATE years:
#> Phenological Anomalies
#> --------------------------------------------------
#> Baseline period: (from pre-computed normals)
#> Method: Unknown
#>
#> Total observations: 5 (valid: 5)
#> Extreme events: 0 (0.0%) - 0 early, 0 late
#> Anomaly range: 44.8 to 84.8 days
#> --------------------------------------------------
#>
#> country phase_id year anomaly_days z_score
#> <char> <int> <int> <num> <num>
#> 1: Switzerland 87 2020 84.75 4.992410
#> 2: Switzerland 87 2012 56.25 3.313546
#> 3: Hungary 87 1965 52.75 12.557431
#> 4: Hungary 87 1960 44.75 10.652986
#> 5: Hungary 87 1962 44.75 10.652986What makes a year “extreme”? By default, years with |z-score| > 2 are flagged as extreme. This corresponds to roughly the most unusual 5% of years (2.5% early + 2.5% late).
summary(anomalies)
#> Phenological Anomalies Summary
#> ==================================================
#>
#> Baseline period: 1990-2010
#> Total observations: 5639
#> Valid anomalies: 4707 (83.5%)
#>
#> Anomaly Statistics:
#> Mean anomaly: 0.3 days
#> Median anomaly: 1.0 days
#> SD of anomalies: 12.1 days
#> Range: -193.0 to 84.8 days
#>
#> Extreme Events:
#> Total extreme: 1372 (29.1%)
#> Extreme early: 475
#> Extreme late: 897
#>
#> Most Extreme Observations:
#> 1962: -30.7 days (early, z=-26.12)
#> 1954: 25.4 days (late, z=14.14)
#> 2023: -193.0 days (early, z=-13.02)
#> 1965: 52.8 days (late, z=12.56)
#> 2024: -49.6 days (early, z=-11.80)The plot() method shows anomalies over time, coloured by
direction (early vs. late), with extreme events marked:
A histogram of anomaly magnitudes is available with
plot(anomalies, which = "histogram").
Phenological data often contains outliers - observations that are clearly erroneous or affected by unusual local conditions. The anomaly function offers two approaches.
The robust method (robust = TRUE,
default and recommended) uses the median as the baseline and the MAD
(median absolute deviation) for standardization. This approach is less
affected by outliers and should be preferred in most situations.
The classical method (robust = FALSE)
uses the mean as the baseline and the standard deviation (SD) for
standardization. This approach is more affected by outliers.
When to use each:
Before running any analysis, you should understand your data’s quality. Poor quality data can lead to:
The pep_quality() function assesses three
dimensions:
| Dimension | Question | Metric |
|---|---|---|
| Completeness | How many years have data? | % of period covered |
| Continuity | Are there gaps? | Maximum consecutive missing years |
| Consistency | Are there outliers? | % of observations flagged |
# Assess quality at the station level
quality <- pep_quality(
pep = apple,
by = c("s_id", "phase_id")
)
print(quality)
#> Phenological Data Quality Assessment
#> --------------------------------------------------
#> Outlier method: tukey
#>
#> Quality Grade Distribution:
#> Grade A: 16887 (46.3%)
#> Grade B: 6363 (17.4%)
#> Grade C: 5311 (14.6%)
#> Grade D: 7927 (21.7%)
#>
#> Total groups assessed: 36488
#> --------------------------------------------------
#>
#> s_id phase_id n_obs year_min year_max year_span n_years completeness_pct
#> <fctr> <int> <int> <int> <int> <int> <int> <num>
#> 1: 1 60 40 1954 1999 46 36 78.3
#> 2: 1 65 36 1954 1998 45 34 75.6
#> 3: 1 69 37 1954 1999 46 36 78.3
#> 4: 1 87 63 1954 1999 46 34 73.9
#> 5: 1 95 7 1992 1999 8 7 87.5
#> 6: 2 60 23 1960 1982 23 23 100.0
#> 7: 2 65 23 1960 1982 23 23 100.0
#> 8: 2 69 23 1960 1982 23 23 100.0
#> 9: 2 87 45 1959 1982 24 24 100.0
#> 10: 3 60 21 1985 2005 21 21 100.0
#> max_gap_years n_gaps n_outliers outlier_pct doy_mean doy_sd doy_range
#> <int> <int> <int> <num> <num> <num> <int>
#> 1: 4 6 1 2.5 137.3 6.6 30
#> 2: 6 5 0 0.0 139.2 7.1 30
#> 3: 4 6 2 5.4 148.2 6.0 24
#> 4: 8 3 3 4.8 265.9 34.5 192
#> 5: 1 1 1 14.3 314.9 8.9 26
#> 6: 0 0 0 0.0 137.7 9.0 36
#> 7: 0 0 0 0.0 139.3 8.3 30
#> 8: 0 0 0 0.0 146.1 8.4 33
#> 9: 0 0 1 2.2 286.0 37.6 183
#> 10: 0 0 2 9.5 123.4 8.7 38
#> quality_grade
#> <char>
#> 1: B
#> 2: C
#> 3: C
#> 4: C
#> 5: D
#> 6: A
#> 7: A
#> 8: A
#> 9: B
#> 10: C
#>
#> ... and 36478 more rowsThe function assigns letter grades based on thresholds:
| Grade | Completeness | Max Gap | Outliers | Interpretation |
|---|---|---|---|---|
| A | ≥ 80% | ≤ 2 years | < 2% | Excellent - use confidently |
| B | ≥ 60% | ≤ 5 years | < 5% | Good - suitable for most analyses |
| C | ≥ 40% | ≤ 10 years | < 10% | Fair - use with caution |
| D | Below C | - | - | Poor - consider excluding |
What grade do you need?
# Summary of quality across all stations
summary(quality)
#> Phenological Data Quality Summary
#> ==================================================
#>
#> Total groups assessed: 36488
#>
#> Quality Grade Distribution:
#> Grade A: 16887 ( 46.3%) *********
#> Grade B: 6363 ( 17.4%) ***
#> Grade C: 5311 ( 14.6%) ***
#> Grade D: 7927 ( 21.7%) ****
#>
#> Completeness Statistics:
#> Mean completeness: 80.9%
#> Median completeness: 90.3%
#> Range: 2.9% - 100.0%
#>
#> Temporal Coverage:
#> Mean years with data: 18.4
#> Mean year span: 24.7
#> Mean max gap: 4.6 years
#>
#> Outlier Statistics:
#> Mean outlier percentage: 2.29%
#> Groups with no outliers: 26683 (73.1%)With many stations to assess, visual inspection is essential. The
plot() method provides an overview of quality metrics:
# Overview plot: grade distribution + station map (requires pep for coordinates)
plot(quality, pep = apple)Reading the plot:
You can also generate individual plots:
A critical step before analysis is filtering to high-quality data:
# Get high-quality stations
high_quality <- quality[quality_grade %in% c("A", "B")]
cat("Quality distribution:\n")
#> Quality distribution:
print(table(quality$quality_grade))
#>
#> A B C D
#> 16887 6363 5311 7927
cat("\nHigh quality (A or B):", nrow(high_quality), "of", nrow(quality),
"(", round(100 * nrow(high_quality) / nrow(quality), 1), "%)\n")
#>
#> High quality (A or B): 23250 of 36488 ( 63.7 %)
# Filter original data to these stations
if (nrow(high_quality) > 0) {
good_stations <- unique(high_quality$s_id)
apple_hq <- apple[s_id %in% good_stations]
cat("\nOriginal apple observations:", nrow(apple), "\n")
cat("After quality filter:", nrow(apple_hq),
"(", round(100 * nrow(apple_hq) / nrow(apple), 1), "%)\n")
}
#>
#> Original apple observations: 899311
#> After quality filter: 819498 ( 91.1 %)You can assess quality at various aggregation levels:
# Country-level quality (coarser assessment)
country_quality <- pep_quality(
pep = pep,
by = c("country", "species", "phase_id")
)
print(country_quality)
#> Phenological Data Quality Assessment
#> --------------------------------------------------
#> Outlier method: tukey
#>
#> Quality Grade Distribution:
#> Grade A: 967 (41.6%)
#> Grade B: 429 (18.4%)
#> Grade C: 363 (15.6%)
#> Grade D: 568 (24.4%)
#>
#> Total groups assessed: 2327
#> --------------------------------------------------
#>
#> country species phase_id n_obs year_min year_max year_span n_years
#> <char> <fctr> <int> <int> <int> <int> <int> <int>
#> 1: <NA> <NA> 100 92 1877 1910 34 29
#> 2: <NA> <NA> 109 49 1877 1910 34 27
#> 3: <NA> <NA> 111 97 1991 2021 31 29
#> 4: <NA> <NA> 131 552 1951 2021 71 71
#> 5: <NA> <NA> 182 187 1991 2021 31 31
#> 6: <NA> Avena sativa 0 575 1951 2023 73 73
#> 7: <NA> Avena sativa 1 94 1873 1909 37 32
#> 8: <NA> Avena sativa 10 557 1951 2023 73 73
#> 9: <NA> Avena sativa 31 259 1951 2023 73 61
#> 10: <NA> Avena sativa 51 325 1951 2023 73 63
#> completeness_pct max_gap_years n_gaps n_outliers outlier_pct doy_mean
#> <num> <int> <int> <int> <num> <num>
#> 1: 85.3 5 1 0 0.0 186.0
#> 2: 79.4 5 2 6 12.2 207.4
#> 3: 93.5 1 2 5 5.2 141.2
#> 4: 100.0 0 0 16 2.9 158.7
#> 5: 100.0 0 0 3 1.6 83.1
#> 6: 100.0 0 0 2 0.3 94.2
#> 7: 86.5 3 3 4 4.3 117.9
#> 8: 100.0 0 0 8 1.4 111.4
#> 9: 83.6 10 2 12 4.6 144.0
#> 10: 86.3 10 1 2 0.6 172.1
#> doy_sd doy_range quality_grade
#> <num> <int> <char>
#> 1: 9.3 43 B
#> 2: 14.0 69 D
#> 3: 10.2 51 C
#> 4: 11.3 92 B
#> 5: 17.1 105 A
#> 6: 13.3 82 A
#> 7: 10.6 55 B
#> 8: 12.1 78 A
#> 9: 13.7 88 C
#> 10: 9.5 55 C
#>
#> ... and 2317 more rowsNow let’s put it all together in a realistic workflow:
cat("=== STEP 1: Data Quality Assessment ===\n\n")
#> === STEP 1: Data Quality Assessment ===
# Assess quality
quality <- pep_quality(apple, by = c("s_id", "phase_id"))
cat("Quality grades:\n")
#> Quality grades:
print(table(quality$quality_grade))
#>
#> A B C D
#> 16887 6363 5311 7927
cat("\n=== STEP 2: Filter to Good Quality Data ===\n\n")
#>
#> === STEP 2: Filter to Good Quality Data ===
# Keep grades A, B, and C (exclude only grade D)
good_stations <- quality[quality_grade %in% c("A", "B", "C"), s_id]
apple_clean <- apple[s_id %in% good_stations]
cat("Original observations:", nrow(apple), "\n")
#> Original observations: 899311
cat("After quality filter:", nrow(apple_clean), "\n")
#> After quality filter: 865445
cat("Retained:", round(100 * nrow(apple_clean) / nrow(apple), 1), "%\n")
#> Retained: 96.2 %
stopifnot(nrow(apple_clean) > 0)
cat("\n=== STEP 3: Calculate Baseline Normals ===\n\n")
#>
#> === STEP 3: Calculate Baseline Normals ===
normals <- pheno_normals(
apple_clean,
period = 1990:2010,
by = c("country", "phase_id"),
min_years = 5
)
#> Note: 20 group(s) have fewer than 5 years of data and return NA values.
cat("Normals calculated for", sum(!is.na(normals$mean_doy)), "groups\n")
#> Normals calculated for 90 groups
cat("Mean DOY range:", round(min(normals$mean_doy, na.rm = TRUE), 0),
"to", round(max(normals$mean_doy, na.rm = TRUE), 0), "\n")
#> Mean DOY range: 101 to 319
cat("\n=== STEP 4: Calculate Anomalies ===\n\n")
#>
#> === STEP 4: Calculate Anomalies ===
anomalies <- pheno_anomaly(
apple_clean,
baseline_period = 1990:2010,
by = c("country", "phase_id"),
min_years = 5
)
#> Note: 884 observation(s) have insufficient baseline data and return NA anomalies.
# Convert to plain data.table so grouped aggregation works correctly
# (the pheno_anomaly S3 class can interfere with [.data.table dispatch)
anomalies_dt <- as.data.table(anomalies)
cat("Anomalies calculated for", sum(!is.na(anomalies_dt$anomaly_days)), "year-groups\n")
#> Anomalies calculated for 4694 year-groups
cat("Extreme years:", sum(anomalies_dt$is_extreme, na.rm = TRUE), "\n")
#> Extreme years: 1390
cat("\n=== STEP 5: Analyze Trends by Decade ===\n\n")
#>
#> === STEP 5: Analyze Trends by Decade ===
decade_summary <- anomalies_dt[
!is.na(anomaly_days),
.(
mean_anomaly = round(mean(anomaly_days, na.rm = TRUE), 1),
sd_anomaly = round(sd(anomaly_days, na.rm = TRUE), 1),
n_obs = .N,
pct_extreme = round(100 * sum(is_extreme, na.rm = TRUE) / .N, 1)
),
by = .(decade = floor(year / 10) * 10)
][order(decade)]
print(decade_summary)
#> decade mean_anomaly sd_anomaly n_obs pct_extreme
#> <num> <num> <num> <int> <num>
#> 1: 1920 3.5 7.8 9 55.6
#> 2: 1930 0.2 9.6 19 42.1
#> 3: 1940 -0.8 10.7 51 51.0
#> 4: 1950 1.4 11.1 337 46.0
#> 5: 1960 3.0 12.6 537 44.7
#> 6: 1970 3.8 7.7 640 41.9
#> 7: 1980 3.9 8.9 630 40.2
#> 8: 1990 0.6 8.0 664 7.2
#> 9: 2000 -1.2 11.5 788 8.1
#> 10: 2010 -4.6 13.5 711 28.4
#> 11: 2020 -8.4 20.4 308 39.3
cat("\nInterpretation:\n")
#>
#> Interpretation:
first_decade <- decade_summary$mean_anomaly[1]
last_decade <- decade_summary$mean_anomaly[nrow(decade_summary)]
change <- last_decade - first_decade
if (change < -3) {
cat("- Strong shift toward EARLIER phenology\n")
} else if (change < 0) {
cat("- Moderate shift toward earlier phenology\n")
} else if (change > 3) {
cat("- Strong shift toward LATER phenology\n")
} else {
cat("- No clear directional change\n")
}
#> - Strong shift toward EARLIER phenologyThe pheno_plot_timeseries() function creates
publication-ready time series:
# Prepare aggregated data for visualization
apple_annual <- apple[, .(
day = mean(day, na.rm = TRUE),
n = .N
), by = .(year, species, phase_id)]
# Add human-readable phase labels
apple_annual[, phase := factor(phase_id,
levels = c(60, 65, 100),
labels = c("First flowers", "Full flowering", "Harvest"))]
# Remove rows where phase mapping failed
apple_annual <- apple_annual[!is.na(phase)]
# Plot time series with trend lines
if (nrow(apple_annual) > 0) {
pheno_plot_timeseries(
apple_annual,
color_by = "phase",
smooth = TRUE,
title = "Apple Phenology Trends"
)
}
#> `geom_smooth()` using formula = 'y ~ x'Reading the plot:
Sometimes trends aren’t linear - they may accelerate, slow, or even
reverse. The pheno_trend_turning() function uses the
sequential Mann-Kendall test to detect such changes:
# Get apple flowering data
apple_flowering <- pep[species == "Malus domestica" & phase_id == 60]
# Detect turning points
turning <- pheno_trend_turning(apple_flowering, min_years = 10)
print(turning)
#> Phenological Trend Turning Point Analysis
#> ==================================================
#> Years analyzed: 100
#> Turning points found: 11
#> Turning point years: 1927, 1929, 1931, 1942, 1943, 1944, 2010, 2013, 2021, 2023, 2024
#>
#> Tau statistics:
#> Final progressive tau: -0.289
#> Final retrograde tau: -0.000
#>
#> Significance thresholds: |tau| > 1.96 (95%), |tau| > 2.58 (99%)
# Visualize the analysis
plot(turning)Understanding the plot:
For a single trend value, use kendall_tau():
# Aggregate to annual means
annual_doy <- apple_flowering[, .(
mean_doy = mean(day, na.rm = TRUE)
), by = year][order(year)]
# Calculate Mann-Kendall test statistic
# Note: kendall_tau() returns the standardized z-statistic, not Kendall's
# tau correlation. Values beyond ±1.96 indicate significant trends.
tau <- kendall_tau(annual_doy$mean_doy)
cat("Mann-Kendall z-statistic:", round(tau, 3), "\n")
#> Mann-Kendall z-statistic: -7.928
cat("\nInterpretation:\n")
#>
#> Interpretation:
if (tau < -2.58) {
cat("- Highly significant earlier trend (p < 0.01)\n")
} else if (tau < -1.96) {
cat("- Significant earlier trend (p < 0.05)\n")
} else if (tau > 2.58) {
cat("- Highly significant later trend (p < 0.01)\n")
} else if (tau > 1.96) {
cat("- Significant later trend (p < 0.05)\n")
} else {
cat("- No statistically significant trend\n")
}
#> - Highly significant earlier trend (p < 0.01)Why Kendall’s tau?
A central goal of phenological research is to understand how climate
drives the timing of biological events. The
pheno_regional() function compiles multi-station
phenological data into regional time series and links them to global
temperature anomalies, enabling you to assess climate
sensitivity - how many days earlier (or later) phenology occurs
per degree of warming.
# Compile regional data and link to temperature anomalies
regional_data <- pheno_regional(
pep = pep,
species_name = "Malus domestica",
year_min = 1961
)
# Visualize phenology alongside temperature anomalies
pheno_plot(regional_data)What the output shows:
Typical climate sensitivities for European spring phenology:
| Phase | Sensitivity | Meaning |
|---|---|---|
| Leaf unfolding | -2 to -4 days/°C | Moderate response to spring warming |
| Flowering | -3 to -5 days/°C | Strong response, especially for fruit trees |
| Harvest | -1 to -3 days/°C | Weaker response, partly compensated by summer processes |
These values mean that for every 1°C increase in spring temperature, flowering tends to occur 3-5 days earlier. Over the past 50 years, this has translated to an advance of roughly 2-5 days per decade across Central Europe.
pep_quality()
- always!pheno_trend_turning()You now understand the core analytical functions. Continue to:
pheno_gradient() and
pheno_synchrony().sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.0.1
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Zurich
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pep725_1.0.0 ggplot2_4.0.1 data.table_1.18.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 generics_0.1.4 tidyr_1.3.2
#> [4] class_7.3-23 robustbase_0.99-7 KernSmooth_2.23-26
#> [7] lattice_0.22-7 digest_0.6.39 magrittr_2.0.4
#> [10] evaluate_1.0.5 grid_4.5.2 RColorBrewer_1.1-3
#> [13] rnaturalearth_1.2.0 fastmap_1.2.0 Matrix_1.7-4
#> [16] jsonlite_2.0.0 e1071_1.7-17 DBI_1.2.3
#> [19] mgcv_1.9-3 purrr_1.2.1 viridisLite_0.4.2
#> [22] scales_1.4.0 jquerylib_0.1.4 cli_3.6.5
#> [25] rlang_1.1.7 units_1.0-0 splines_4.5.2
#> [28] withr_3.0.2 cachem_1.1.0 yaml_2.3.12
#> [31] otel_0.2.0 tools_4.5.2 dplyr_1.2.0
#> [34] vctrs_0.7.1 R6_2.6.1 proxy_0.4-29
#> [37] lifecycle_1.0.5 classInt_0.4-11 pkgconfig_2.0.3
#> [40] pillar_1.11.1 bslib_0.10.0 gtable_0.3.6
#> [43] glue_1.8.0 Rcpp_1.1.1 sf_1.0-24
#> [46] rnaturalearthdata_1.0.0 DEoptimR_1.1-4 xfun_0.56
#> [49] tibble_3.3.1 tidyselect_1.2.1 rstudioapi_0.18.0
#> [52] knitr_1.51 farver_2.1.2 nlme_3.1-168
#> [55] htmltools_0.5.9 patchwork_1.3.2 rmarkdown_2.30
#> [58] labeling_0.4.3 compiler_4.5.2 S7_0.2.1