Outliers are observations that differ markedly from the rest of the data. Detecting them matters because they can distort analyses, model fits, and the conclusions drawn from them.
The MOutliers package provides tools to detect and visualize multivariate outliers using classical and robust statistical methods:
Mahalanobis distance
Minimum Covariance Determinant (MCD)
Principal Component Analysis (PCA)
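As a rough illustration of what these three distances measure, the sketch below computes each one with base R and the MASS package (assumed to be installed; it ships with standard R distributions). It is not the MOutliers implementation: the variable names are hypothetical, and the package may scale the data or choose the number of principal components differently.

```r
# Illustrative sketch only; not the code used inside MOutliers.
set.seed(123)
df <- data.frame(x = c(rnorm(50), 6), y = c(rnorm(50), 6))  # row 51 is a planted outlier
X <- as.matrix(df)

# Classical: squared Mahalanobis distance from the sample mean and covariance.
d_classical <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Robust: the same distance, but using the MCD estimates of location and
# scatter, which are less influenced by the outliers themselves.
mcd <- MASS::cov.mcd(X)
d_mcd <- mahalanobis(X, center = mcd$center, cov = mcd$cov)

# PCA: Euclidean distance of each observation from the origin in the space of
# principal component scores (all components kept, variables standardized;
# both choices are assumptions).
pc <- prcomp(X, center = TRUE, scale. = TRUE)
d_pca <- sqrt(rowSums(pc$x^2))

# Index of the most extreme observation under each distance.
c(classical = which.max(d_classical), mcd = which.max(d_mcd), pca = which.max(d_pca))
```

Note that `mahalanobis()` returns squared distances; whether MOutliers reports the squared distance or its square root is not stated here.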
Parameters
1. data (Required)
A numeric data frame containing the variables of interest. Each row corresponds to one observation and each column to one variable.
2. method (Optional)
A character value specifying the detection method. Options include:
“mahalanobis”: classical Mahalanobis distance
“mcd”: Minimum Covariance Determinant (robust method)
“pca”: Euclidean distances based on principal components
Default is “mahalanobis”.
3. alpha (Optional)
A numeric value representing the cutoff level for detecting outliers, based on the quantiles of the chi-squared distribution. Default is 0.975.
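To make the role of alpha concrete, the snippet below computes the chi-squared quantile for datasets with two and three variables. It assumes the degrees of freedom equal the number of variables, which is the usual convention for Mahalanobis-type distances but is not confirmed here for MOutliers.

```r
# Chi-squared cutoffs for alpha = 0.975 (assumption: df = number of variables).
alpha <- 0.975
p <- c(2, 3)                      # e.g. the simulated data (2) and mtcars subset (3)
cutoff <- qchisq(alpha, df = p)
round(cutoff, 2)
#> [1] 7.38 9.35
```

A larger alpha raises the cutoff, so fewer observations are flagged.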
Returns
The function returns a data frame that combines the original input dataset with the following additional columns:
Distance: the computed distance value for each observation (depends on the chosen method).
Outlier: TRUE if the observation is flagged as an outlier, FALSE otherwise.
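The shape of the returned data frame can be reproduced with a few lines of base R. The function below is a hypothetical stand-in, not `detect_multivariate_outliers()` itself: it covers only the classical Mahalanobis case and assumes squared distances are compared against the chi-squared quantile.

```r
# Hypothetical helper illustrating the return structure; not part of MOutliers.
flag_outliers_sketch <- function(data, alpha = 0.975) {
  X <- as.matrix(data)
  d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared distances
  cutoff <- qchisq(alpha, df = ncol(X))                     # assumed cutoff rule
  # Original columns first, then the two added columns.
  cbind(data, Distance = d2, Outlier = d2 > cutoff)
}

set.seed(123)
df <- data.frame(x = c(rnorm(50), 6), y = c(rnorm(50), 6))
res <- flag_outliers_sketch(df)
res[res$Outlier, ]   # rows flagged as outliers
```

Because the added columns sit alongside the original ones, flagged rows can be filtered with ordinary data frame subsetting, as in the last line.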
This example demonstrates detecting multivariate outliers using simulated data.
set.seed(123)
df <- data.frame(
x = c(rnorm(50), 6),
y = c(rnorm(50), 6)
)
head(df)
#> x y
#> 1 -0.56047565 0.25331851
#> 2 -0.23017749 -0.02854676
#> 3 1.55870831 -0.04287046
#> 4 0.07050839 1.36860228
#> 5 0.12928774 -0.22577099
#> 6 1.71506499 1.51647060
# Mahalanobis Distance
result_mahal <- detect_multivariate_outliers(df, method = "mahalanobis", alpha = 0.975)
head(result_mahal)
#> x y Distance Outlier
#> 1 -0.56047565 0.25331851 0.4024629 FALSE
#> 2 -0.23017749 -0.02854676 0.1081832 FALSE
#> 3 1.55870831 -0.04287046 1.9705648 FALSE
#> 4 0.07050839 1.36860228 1.0943377 FALSE
#> 5 0.12928774 -0.22577099 0.1909745 FALSE
#> 6 1.71506499 1.51647060 1.8800060 FALSE
# Minimum Covariance Determinant (MCD)
result_mcd <- detect_multivariate_outliers(df, method = "mcd", alpha = 0.975)
head(result_mcd)
#> x y Distance Outlier
#> 1 -0.56047565 0.25331851 0.4591213 FALSE
#> 2 -0.23017749 -0.02854676 0.1299266 FALSE
#> 3 1.55870831 -0.04287046 2.5319996 FALSE
#> 4 0.07050839 1.36860228 2.7497316 FALSE
#> 5 0.12928774 -0.22577099 0.2077008 FALSE
#> 6 1.71506499 1.51647060 6.5143416 FALSE
# Principal Component Analysis (PCA)
result_pca <- detect_multivariate_outliers(df, method = "pca", alpha = 0.975)
head(result_pca)
#> x y Distance Outlier
#> 1 -0.56047565 0.25331851 0.3295383 FALSE
#> 2 -0.23017749 -0.02854676 0.1515636 FALSE
#> 3 1.55870831 -0.04287046 1.3505140 FALSE
#> 4 0.07050839 1.36860228 0.8355279 FALSE
#> 5 0.12928774 -0.22577099 0.1610487 FALSE
#> 6 1.71506499 1.51647060 2.6579984 FALSE
This example demonstrates detecting multivariate outliers using a real dataset (mtcars) with three variables: mpg, hp, and wt.
df_mtcars <- mtcars[, c("mpg", "hp", "wt")]
head(df_mtcars)
#> mpg hp wt
#> Mazda RX4 21.0 110 2.620
#> Mazda RX4 Wag 21.0 110 2.875
#> Datsun 710 22.8 93 2.320
#> Hornet 4 Drive 21.4 110 3.215
#> Hornet Sportabout 18.7 175 3.440
#> Valiant 18.1 105 3.460
# Mahalanobis Distance
result_mahal <- detect_multivariate_outliers(df_mtcars, method = "mahalanobis", alpha = 0.975)
head(result_mahal)
#> mpg hp wt Distance Outlier
#> Mazda RX4 21.0 110 2.620 1.4554908 FALSE
#> Mazda RX4 Wag 21.0 110 2.875 0.6848547 FALSE
#> Datsun 710 22.8 93 2.320 1.8717032 FALSE
#> Hornet 4 Drive 21.4 110 3.215 0.5058688 FALSE
#> Hornet Sportabout 18.7 175 3.440 0.1960802 FALSE
#> Valiant 18.1 105 3.460 2.0085341 FALSE
# Minimum Covariance Determinant (MCD)
result_mcd <- detect_multivariate_outliers(df_mtcars, method = "mcd", alpha = 0.975)
head(result_mcd)
#> mpg hp wt Distance Outlier
#> Mazda RX4 21.0 110 2.620 1.4032515 FALSE
#> Mazda RX4 Wag 21.0 110 2.875 0.4356093 FALSE
#> Datsun 710 22.8 93 2.320 1.7928535 FALSE
#> Hornet 4 Drive 21.4 110 3.215 0.7528113 FALSE
#> Hornet Sportabout 18.7 175 3.440 1.8629727 FALSE
#> Valiant 18.1 105 3.460 3.1254814 FALSE
# Principal Component Analysis (PCA)
result_pca <- detect_multivariate_outliers(df_mtcars, method = "pca", alpha = 0.975)
head(result_pca)
#> mpg hp wt Distance Outlier
#> Mazda RX4 21.0 110 2.620 0.5460497 FALSE
#> Mazda RX4 Wag 21.0 110 2.875 0.3829775 FALSE
#> Datsun 710 22.8 93 2.320 1.5163542 FALSE
#> Hornet 4 Drive 21.4 110 3.215 0.3326773 FALSE
#> Hornet Sportabout 18.7 175 3.440 0.2723783 FALSE
#> Valiant 18.1 105 3.460 0.4647775 FALSE
Parameters
1. data (Required)
A numeric data frame with at least two continuous variables.
2. method (Optional)
A character value specifying the outlier detection approach. Options include:
“mahalanobis”: classical Mahalanobis distance
“mcd”: Minimum Covariance Determinant (robust method)
Default is “mahalanobis”.
3. alpha (Optional)
A numeric value specifying the cutoff quantile for identifying outliers from the chi-squared distribution. Default is 0.975.
Returns
A set of 2D scatterplots, one for each pair of variables in the dataset, arranged in a single frame. Only the Mahalanobis and MCD distances are supported. Outliers are highlighted in red, while inliers are shown in black.
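A comparable display can be sketched with base R graphics. The helper below is hypothetical and is not the package's plotting function: it takes a data frame and a logical outlier vector, draws one scatterplot per variable pair in a single row, and colors flagged points red. The outlier flags here come from classical Mahalanobis distances with an assumed chi-squared cutoff.

```r
# Hypothetical helper; not the plotting function provided by MOutliers.
plot_pairs_sketch <- function(data, outlier) {
  var_pairs <- combn(names(data), 2)            # one column per variable pair
  old <- par(mfrow = c(1, ncol(var_pairs)))     # all panels in one frame
  on.exit(par(old))
  for (k in seq_len(ncol(var_pairs))) {
    v <- var_pairs[, k]
    plot(data[[v[1]]], data[[v[2]]],
         xlab = v[1], ylab = v[2], pch = 19,
         col = ifelse(outlier, "red", "black")) # outliers red, inliers black
  }
  invisible(var_pairs)
}

df <- mtcars[, c("mpg", "hp", "wt")]
X <- as.matrix(df)
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
plot_pairs_sketch(df, d2 > qchisq(0.975, df = ncol(X)))
```

With three variables this produces three panels (mpg vs hp, mpg vs wt, hp vs wt).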
This example demonstrates visualizing 2D scatterplots for each pair of variables in the dataset using simulated data.
This example demonstrates visualizing 2D scatterplots for each pair of variables in the dataset using a real dataset (mtcars) with three variables: mpg, hp, and wt.