Getting Started with NRMstatsML

Introduction

NRMstatsML is an R package for statistical and machine learning-based analysis of long-term Natural Resource Management (NRM) datasets. It is designed for datasets spanning 10–20 years covering soil health, water, crop yield, and climate variables.

The package is organised into seven core modules:

Module Functions Purpose
trendML nrm_trend, nrm_mann_kendall, nrm_sens_slope, nrm_structural_break Trend detection and structural change
multiSysML nrm_multivariate, nrm_pls, nrm_sem Multivariate system modelling
responseML nrm_response_curve, nrm_optimize_input Yield-response and optimisation
tsML nrm_forecast, nrm_arima Time series forecasting
panelML nrm_panel, nrm_did Panel data and treatment effects
uncertaintyML nrm_uncertainty, nrm_bootstrap, nrm_monte_carlo Uncertainty quantification
autoML nrm_automl, nrm_benchmark Automated model selection

Installation

# From CRAN (once published)
install.packages("NRMstatsML")

# Development version from GitHub
# install.packages("remotes")
remotes::install_github("yourorg/NRMstatsML")

Example Data

The package ships with nrm_example, a synthetic 20-year dataset that mimics a long-term fertiliser experiment.

library(NRMstatsML)

data(nrm_example)
head(nrm_example)
#>   year treatment crop_yield soil_OC    N    P    K rainfall runoff biomass
#> 1 2000   control       3.37   0.867 55.6 23.2 26.3    782.5   76.0    5.32
#> 2 2001       NPK       3.30   0.836 63.1 22.8 34.6    676.4   75.1    4.70
#> 3 2002   control       3.21   0.882 68.8 23.2 35.7    712.6   60.4    4.98
#> 4 2003       NPK       3.26   0.829 79.3 23.5 29.5    657.5   77.0    6.10
#> 5 2004   control       3.75   0.870 66.6 27.5 36.8    574.1   69.3    5.85
#> 6 2005       NPK       3.87   0.783 76.4 25.0 41.2    672.5   80.4    5.89

Always start by validating the data:

nrm_data_check(nrm_example)
#> === NRMstatsML: Data Check ===
#>   Rows       : 20
#>   Columns    : 10
#>   Numeric    : year, crop_yield, soil_OC, N, P, K, rainfall, runoff, biomass
#>   Status     : OK

Module 1 — Trend Analysis (trendML)

Mann-Kendall test and Sen’s slope

trend_res <- nrm_trend(nrm_example,
                       time_var  = "year",
                       value_var = "crop_yield",
                       breaks    = TRUE)
print(trend_res)
#> === NRMstatsML: Trend Analysis ===
#> Variable  : crop_yield 
#> Time index: year 
#> 
#> -- Mann-Kendall Test --
#>   tau = 0.4105,  p-value = 0.0125
#>   Trend: increasing (alpha = 0.05)
#> 
#> -- Sen's Slope --
#>   Slope = 0.0500  (95% CI: 0.0150, 0.0767)
#> 
#> -- Structural Breaks (Bai-Perron) --
#>   1 break(s) at: 2010

Visualise the trend

nrm_plot(trend_res)

Individual components

mk <- nrm_mann_kendall(nrm_example, time_var = "year",
                       value_var = "soil_OC")
print(mk)
#> -- Mann-Kendall Test --
#>   tau = 0.6000,  p-value = 0.0002
#>   Trend: increasing (alpha = 0.05)

ss <- nrm_sens_slope(nrm_example, time_var = "year",
                     value_var = "soil_OC")
print(ss)
#> -- Sen's Slope --
#>   Slope = 0.0082  (95% CI: 0.0052, 0.0121)

Module 2 — Multivariate System Modelling (multiSysML)

Scaled OLS regression

mv <- nrm_multivariate(nrm_example,
        formula = crop_yield ~ N + P + K + rainfall,
        scale   = TRUE)
print(mv)
#> === NRMstatsML: Multivariate Regression ===
#>   Formula   : crop_yield ~ N + P + K + rainfall
#>   Adj. R^2  : 0.2663
#>   Scaled    : TRUE
#> 
#>                Estimate Std. Error     t value     Pr(>|t|)
#> (Intercept)  3.75150000 0.08088321 46.38169192 1.290760e-17
#> N           -0.01451394 0.23920031 -0.06067693 9.524176e-01
#> P            0.20689436 0.18209712  1.13617591 2.737102e-01
#> K            0.08989712 0.23114178  0.38892629 7.027959e-01
#> rainfall     0.01982764 0.08441227  0.23489052 8.174700e-01

Partial Least Squares (PLS)

PLS is preferred when predictors are highly collinear, which is common in NRM data where N, P, K, and rainfall are often correlated.

pl <- nrm_pls(nrm_example,
              formula = crop_yield ~ N + P + K + rainfall + soil_OC,
              ncomp   = 3)
print(pl)
#> === NRMstatsML: PLS Regression ===
#>   Formula   : crop_yield ~ N + P + K + rainfall + soil_OC
#>   Components: 2
#>   RMSEP     : 0.3763

Module 3 — Response Curve & Optimisation (responseML)

rc <- nrm_response_curve(nrm_example,
        input_var    = "N",
        response_var = "crop_yield",
        type         = "quadratic")
print(rc)
#> === NRMstatsML: Response Curve ===
#>   Type      : quadratic
#>   Input     : N
#>   Response  : crop_yield
#>   R-squared : 0.3446
nrm_plot(rc)

Economic optimum

Find the nitrogen rate that maximises profit given a cost-to-price ratio:

opt <- nrm_optimize_input(rc, price_ratio = 0.6)
print(opt)
#> === NRMstatsML: Input Optimisation ===
#>   Price ratio          : 0.600
#>   Biophysical optimum  : NA
#>   Economic optimum     : NA

Module 4 — Time Series Forecasting (tsML)

ar <- nrm_arima(nrm_example, value_var = "crop_yield")
print(ar)
#> === NRMstatsML: ARIMA Model ===
#>   Variable : crop_yield
#>   Order    : ARIMA(0,1,0)
#>   AIC = 23.98  BIC = 24.93
fc <- nrm_forecast(nrm_example,
                   value_var = "crop_yield",
                   horizon   = 5)
print(fc)
#> === NRMstatsML: Time Series Forecast ===
#>   Variable : crop_yield
#>   Method   : auto_arima
#>   Horizon  : 5 steps
#>   Accuracy (in-sample):
#>                  ME   RMSE    MAE    MPE   MAPE   MASE    ACF1
#> Training set 0.0317 0.4206 0.2887 0.2445 7.5499 0.9506 -0.3701
nrm_plot(fc)


Module 5 — Panel Data & Treatment Effects (panelML)

Panel and DiD analyses require a multi-site, multi-year dataset. See ?nrm_panel and ?nrm_did for worked examples with appropriate data.

# Requires a panel dataset with site and year identifiers
pm <- nrm_panel(panel_data,
                formula = crop_yield ~ N + P + K + rainfall,
                index   = c("site", "year"),
                model   = "within")
nrm_summary(pm)

did <- nrm_did(panel_data,
               outcome   = "crop_yield",
               treat_var = "treatment_binary",
               time_var  = "post_period")
print(did)

Module 6 — Uncertainty & Sensitivity Analysis (uncertaintyML)

bs <- nrm_bootstrap(nrm_example,
        stat_fn = function(d) mean(d$crop_yield),
        n_iter  = 500)
print(bs)
#> -- Bootstrap --
#>   Mean : 3.7513  SD : 0.0916  CI : [3.5735, 3.9325]
mc <- nrm_monte_carlo(nrm_example,
        stat_fn   = function(d) mean(d$crop_yield),
        n_iter    = 500,
        noise_sd  = 0.1)
print(mc)
#> -- Monte Carlo --
#>   Mean : 3.7508  SD : 0.0094  CI : [3.7325, 3.7685]

Use nrm_uncertainty() as a unified entry point:

unc <- nrm_uncertainty(nrm_example,
         stat_fn = function(d) mean(d$crop_yield),
         method  = "bootstrap",
         n_iter  = 500)
print(unc)
#> === NRMstatsML: Uncertainty Analysis ===
#>   Method : bootstrap
#>   Mean   : 3.7558
#>   SD     : 0.0904
#>   95% CI : [3.5866, 3.9416]

Module 7 — AutoML & Model Benchmarking (autoML)

# Requires caret + method-specific packages (e.g. randomForest, gbm)
am <- nrm_automl(nrm_example,
                 formula  = crop_yield ~ N + P + K + rainfall + soil_OC,
                 methods  = c("lm", "rf", "gbm"),
                 cv_folds = 5,
                 seed     = 42)
nrm_summary(am)

Benchmarking on a hold-out set

n     <- nrow(nrm_example)
train <- nrm_example[seq_len(floor(0.8 * n)), ]
test  <- nrm_example[seq(floor(0.8 * n) + 1, n), ]

m_ols <- lm(crop_yield ~ N + P + K, data = train)

bm <- nrm_benchmark(
  models       = list(ols = m_ols),
  test_data    = test,
  response_var = "crop_yield"
)
print(bm)

Session Info

sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 26200)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_4.0.2    NRMstatsML_0.1.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] zoo_1.8-15          tidyselect_1.2.1    xfun_0.57          
#>  [4] bslib_0.10.0        urca_1.3-4          lattice_0.20-45    
#>  [7] colorspace_2.1-2    vctrs_0.7.2         generics_0.1.4     
#> [10] htmltools_0.5.9     viridisLite_0.4.3   yaml_2.3.12        
#> [13] rlang_1.1.7         isoband_0.3.0       jquerylib_0.1.4    
#> [16] pillar_1.11.1       glue_1.7.0          withr_3.0.2        
#> [19] forecast_9.0.2      RColorBrewer_1.1-3  S7_0.2.1           
#> [22] lifecycle_1.0.5     timeDate_4052.112   pls_2.9-0          
#> [25] gtable_0.3.6        evaluate_1.0.5      labeling_0.4.3     
#> [28] knitr_1.51          strucchange_1.5-4   fastmap_1.2.0      
#> [31] parallel_4.2.1      Rcpp_1.1.1          scales_1.4.0       
#> [34] cachem_1.1.0        jsonlite_2.0.0      farver_2.1.2       
#> [37] otel_0.2.0          fracdiff_1.5-3      Kendall_2.2.2      
#> [40] digest_0.6.39       dplyr_1.2.0         trend_1.1.6        
#> [43] grid_4.2.1          cli_3.6.5           tools_4.2.1        
#> [46] sandwich_3.1-1      magrittr_2.0.3      sass_0.4.10        
#> [49] tibble_3.3.1        pkgconfig_2.0.3     rmarkdown_2.31     
#> [52] extraDistr_1.10.0.2 rstudioapi_0.18.0   R6_2.6.1           
#> [55] boot_1.3-32         nlme_3.1-168        compiler_4.2.1