ztils

License: MIT

Various utilities meant to aid in speeding up common statistical operations, such as:

- removing outliers and extremes
- generating probability density and cumulative distribution graphs with ggplot2
- running one-sample Kolmogorov-Smirnov tests against multiple distributions at once
- generating prediction plots with ggplot2
- scaling data and performing principal component analysis (PCA)
- plotting PCA with ggplot2

Installation

To install from CRAN:

install.packages("ztils")

To install the development version:

remotes::install_github("zachpeagler/ztils")


no_outliers()

Description

This function keeps only the rows of the dataframe whose values for the given variable lie between the first quartile minus 1.5 times the interquartile range (IQR) and the third quartile plus 1.5 times the IQR.

Usage

This function has no defaults, as it is entirely dependent on the user input.

no_outliers(data,
            var
            )

Arguments

Returns

Returns the specified dataframe data minus the rows containing outliers in the var variable.

Examples

no_outliers(iris, Sepal.Length)

This isn’t a great example because Sepal.Length in the iris dataset does not contain any statistical outliers under this rule.
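As a rough illustration of the rule described above, the same filter can be sketched in base R. The helper name iqr_filter and its k argument are hypothetical and not part of ztils, and the sketch assumes R's default quantile method, which may differ from what no_outliers() uses internally. Sepal.Width is used here because, unlike Sepal.Length, it has a few values outside the 1.5 × IQR bounds.

```r
# Hypothetical helper (not part of ztils): keep rows whose value in `column`
# lies within [Q1 - k * IQR, Q3 + k * IQR].
iqr_filter <- function(data, column, k = 1.5) {
  x <- data[[column]]
  q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  keep <- !is.na(x) & x >= q[[1]] - k * iqr & x <= q[[2]] + k * iqr
  data[keep, , drop = FALSE]
}

trimmed <- iqr_filter(iris, "Sepal.Width")

# Number of rows dropped as outliers
nrow(iris) - nrow(trimmed)
```

Running the same helper on "Sepal.Length" drops no rows, which is why the example above returns the dataset unchanged.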

no_extremes()

Description

This function keeps only the rows of the dataframe whose values for the given variable lie between the first quartile minus 3.0 times the interquartile range (IQR) and the third quartile plus 3.0 times the IQR.

Usage

This function has no defaults, as it is entirely dependent on the user input.

no_extremes(data,
            var
            )

Arguments

Returns

Returns the specified dataframe data minus the rows containing extremes in the var variable.

Examples

no_extremes(iris, Sepal.Length)

This isn’t a great example because Sepal.Length in the iris dataset does not contain any statistical extremes.
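To make the difference between the 3.0 multiplier used here and the 1.5 multiplier used by no_outliers() concrete, here is a small base R sketch on a made-up vector. The data and the helper outside() are purely illustrative and not part of ztils; the sketch assumes R's default quantile method.

```r
# Made-up data: mostly tightly clustered, with one mild and one severe high value
x <- c(10, 11, 12, 12, 13, 13, 14, 15, 22, 40)

q <- stats::quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Hypothetical helper: TRUE where a value falls outside Q1 - k*IQR .. Q3 + k*IQR
outside <- function(k) x < q[[1]] - k * iqr | x > q[[2]] + k * iqr

x[outside(1.5)]  # 22 40 (outliers)
x[outside(3)]    # 40    (extremes)
```

Every extreme is also an outlier, but not the other way round: 22 is removed by the 1.5 rule and kept by the 3.0 rule.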


multipdf_cont()

Description

This function gets the probability density function (PDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).

Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.

Usage

multipdf_cont(var, 
              seq_length = 50, 
              distributions = "all"
              )

Returns

This function returns a dataframe with seq_length rows containing the real density and the probability density function of var for the selected distributions.

Arguments

Examples

multipdf_cont(iris$Petal.Length)

multipdf_cont(iris$Sepal.Length, 100, c("normal", "lognormal"))
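For readers who want to see what such a table might look like, the following base R sketch builds a comparable dataframe by hand. It is not the package's implementation: the column names are made up, the "real" density is approximated with a kernel density estimate, and the distribution parameters are estimated with simple moment-based formulas, which is an assumption about fitting that this README does not confirm.

```r
x <- iris$Petal.Length
seq_length <- 50

# Evaluation grid spanning the observed range
grid <- seq(min(x), max(x), length.out = seq_length)

# Kernel density estimate evaluated on the same grid
kde <- stats::density(x, n = seq_length, from = min(x), to = max(x))

# Moment-based gamma parameters (assumption, not necessarily what ztils does)
gamma_shape <- mean(x)^2 / var(x)
gamma_rate <- mean(x) / var(x)

pdfs <- data.frame(
  value       = grid,
  empirical   = kde$y,
  normal      = dnorm(grid, mean = mean(x), sd = sd(x)),
  lognormal   = dlnorm(grid, meanlog = mean(log(x)), sdlog = sd(log(x))),
  gamma       = dgamma(grid, shape = gamma_shape, rate = gamma_rate),
  exponential = dexp(grid, rate = 1 / mean(x))
)

head(pdfs)
```

Taking log(x) is what makes non-positive inputs a problem for the lognormal column in a sketch like this.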

multipdf_plot()

Description

This function extends multipdf_cont() and gets the probability density functions (PDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots them using ggplot2 and a scico palette, labeling the plot with var_name if specified and with var otherwise.

Usage

multipdf_plot(var, 
              seq_length = 50,
              distributions = "all", 
              palette = "oslo", 
              var_name = NULL
              )

Returns

A plot showing the PDF of the selected variable against the selected distributions over the selected sequence length.

Arguments

Examples


multipdf_plot(iris$Sepal.Length)

multipdf_plot(iris$Sepal.Length,
              seq_length = 100,
              distributions = c("normal", "lognormal", "gamma"),
              palette = "bilbao",
              var_name = "Sepal Length (cm)"
              )


multicdf_cont()

Description

This function gets the cumulative distribution function (CDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).

Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.

Usage

multicdf_cont(var, 
              seq_length = 50, 
              distributions = "all"
              )

Returns

This function returns a dataframe with seq_length rows containing the real cumulative distribution and the cumulative distribution function of var for the selected distributions.

Arguments

Examples

multicdf_cont(iris$Petal.Length)

multicdf_cont(iris$Sepal.Length,
              100, 
              c("normal", "lognormal")
              )
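A hand-rolled base R version of a similar table is sketched below. It is only an approximation of the idea: the column names are invented, the observed distribution is taken from stats::ecdf(), and the parameters are moment-based estimates, which may not match how multicdf_cont() fits each distribution.

```r
x <- iris$Petal.Length
seq_length <- 50

# Evaluation grid spanning the observed range
grid <- seq(min(x), max(x), length.out = seq_length)

cdfs <- data.frame(
  value       = grid,
  # Proportion of observations at or below each grid value
  empirical   = stats::ecdf(x)(grid),
  normal      = pnorm(grid, mean = mean(x), sd = sd(x)),
  lognormal   = plnorm(grid, meanlog = mean(log(x)), sdlog = sd(log(x))),
  # Moment-based gamma parameters (assumption, not necessarily what ztils does)
  gamma       = pgamma(grid, shape = mean(x)^2 / var(x), rate = mean(x) / var(x)),
  exponential = pexp(grid, rate = 1 / mean(x))
)

tail(cdfs)
```

The closer a distribution's column tracks the empirical column across the grid, the better that distribution describes the variable.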

multicdf_plot()

Description

This function extends multicdf_cont() and gets the cumulative distribution functions (CDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots them using ggplot2 and a scico palette, labeling the plot with var_name if specified and with var otherwise.

Usage

multicdf_plot(var, 
              seq_length = 50,
              distributions = "all", 
              palette = "oslo", 
              var_name = NULL
              )

Returns

A plot showing the CDF of the selected variable against the selected distributions over the selected sequence length.

Arguments

Examples


multicdf_plot(iris$Sepal.Length)

multicdf_plot(iris$Sepal.Length,
              seq_length = 100,
              distributions = c("normal", "lognormal", "gamma"),
              palette = "bilbao",
              var_name = "Sepal Length (cm)"
              )


multiks_cont()

Description

This function gets the distance and p-value from a one-sample Kolmogorov-Smirnov (KS) test for selected distributions against a continuous input variable. Possible distributions include “normal”, “lognormal”, “gamma”, “exponential”, and “all”.

Usage

multiks_cont(var,
             distributions = "all"   
             )

Note: If using “lognormal” or “gamma” distributions, the target variable must be non-negative.

Arguments

Returns

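Returns a dataframe with the distance and p-value for each performed KS test. The distance is the KS statistic, the largest absolute difference between the empirical CDF of the variable and the CDF of the tested distribution, so smaller distances indicate a closer fit. A p-value > 0.05 indicates that the target variable’s distribution is not significantly different from the specified distribution at the 5% level.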

Examples

multiks_cont(iris$Sepal.Length)

multiks_cont(iris$Sepal.Length, c("normal", "lognormal"))
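The kind of table described above can be approximated with stats::ks.test() directly. This is a sketch under assumptions, not the package's code: the column names are invented and the distribution parameters are simple moment-based estimates taken from the same data. Note that estimating parameters from the sample being tested tends to make KS p-values optimistic, so they are best read as a relative guide.

```r
x <- iris$Sepal.Length

# ks.test() warns about ties in this variable; the warnings are suppressed here
tests <- list(
  normal      = suppressWarnings(ks.test(x, "pnorm", mean = mean(x), sd = sd(x))),
  lognormal   = suppressWarnings(ks.test(x, "plnorm", meanlog = mean(log(x)),
                                         sdlog = sd(log(x)))),
  exponential = suppressWarnings(ks.test(x, "pexp", rate = 1 / mean(x)))
)

# Collect the KS statistic (distance) and p-value of each test in one dataframe
ks_table <- data.frame(
  distribution = names(tests),
  distance     = vapply(tests, function(res) unname(res$statistic), numeric(1)),
  p_value      = vapply(tests, function(res) res$p.value, numeric(1)),
  row.names    = NULL
)

ks_table
```

In this sketch the exponential distribution has by far the largest distance, reflecting how poorly it matches a variable clustered well away from zero.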


glm_pseudor2()

Description

This function calculates the pseudo R^2 (an approximation of the proportion of variance explained by the model) for a generalized linear model (GLM). GLMs don’t have a true R^2 due to the intrinsic difference between a linear model and a generalized linear model, but we can still calculate an approximation of the R^2 as (1 - (deviance/null deviance)).

Usage

glm_pseudor2(mod)

Arguments

Returns

Returns the pseudo R^2 value of the model.

Examples

gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)
glm_pseudor2(gmod)
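The formula from the description can be checked by hand using the deviance and null.deviance components of a fitted glm object. The variable names below are illustrative. Because glm() defaults to the gaussian family with an identity link, the deviance in this example is just the residual sum of squares, so the pseudo R^2 should agree with the ordinary R^2 from lm().

```r
gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Pseudo R^2 as 1 - (deviance / null deviance)
pseudo_r2 <- 1 - gmod$deviance / gmod$null.deviance

# Same model fitted with lm() for comparison
lmod <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

pseudo_r2
all.equal(pseudo_r2, summary(lmod)$r.squared)
```

For non-gaussian families, such as binomial or poisson, the two quantities no longer coincide, which is where a pseudo R^2 is actually useful.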


pca_plot()

Description

This function performs a principal component analysis (PCA) for the selected pcavars with the option to automatically scale the variables. It then graphs PC1 on the x-axis and PC2 on the y-axis using ggplot2, coloring the graph with a scico palette over the specified groups. This is similar to the biplot command from the stats package, but performs all the steps required in graphing a PCA for you.

Usage

pca_plot(group,
         pcavars,
         scaled = FALSE,
         palette = "oslo"
         )

Arguments

Returns

A ggplot object showing PC1 on the x-axis and PC2 on the y-axis, colored by group, with vectors and labels showing the individual PCA variables.

Examples

pca_plot(iris$Species, iris[,c(1:4)])

pca_plot(iris$Species, iris[,c(1:4)], FALSE, "bilbao")
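To get a feel for what the scaled option changes, the sketch below runs the PCA both ways with stats::prcomp(). Treating scaled as equivalent to prcomp's scale. argument is an assumption, since this README does not say which PCA routine the function uses, and var_explained() is a hypothetical helper.

```r
vars <- iris[, 1:4]

pca_raw    <- prcomp(vars, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(vars, center = TRUE, scale. = TRUE)

# Hypothetical helper: proportion of total variance carried by each component
var_explained <- function(p) p$sdev^2 / sum(p$sdev^2)

round(var_explained(pca_raw)[1:2], 3)
round(var_explained(pca_scaled)[1:2], 3)

# Loadings of each variable on PC1 and PC2; these are what the plotted vectors represent
pca_scaled$rotation[, 1:2]
```

Without scaling, variables with larger variance dominate PC1; with scaling, each variable contributes on an equal footing, so PC1 explains a smaller share of the total.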


pca_data()

Description

This function performs a principal component analysis (PCA) on the specified variables, pcavars, and attaches the resulting principal components to the specified dataframe, data, with optional variable scaling.

Usage

pca_data(data,
         pcavars,
         scaled = FALSE
         )

Arguments

Returns

Returns a dataframe with principal components as additional columns.

Examples

pca_data(iris, iris[,c(1:4)], FALSE)
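A rough base R equivalent of the idea is shown below, assuming the components come from stats::prcomp() and are appended with cbind(). Both are assumptions about the approach rather than a description of the package internals, and the column names PC1 to PC4 are simply what prcomp() produces.

```r
pcavars <- iris[, 1:4]

# Unscaled PCA, mirroring scaled = FALSE in the example above
pca <- prcomp(pcavars, center = TRUE, scale. = FALSE)

# Attach the component scores (one column per PC) to the original dataframe
iris_pca <- cbind(iris, pca$x)

names(iris_pca)
head(iris_pca[, c("Species", "PC1", "PC2")])
```

Having the scores in the same dataframe as the original columns makes it straightforward to plot or model the components against grouping variables such as Species.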


predict_plot()

Description

This function performs a prediction based on the supplied model, then graphs it using ggplot2. Options are available for predicting based on the confidence or prediction interval, as well as for applying corrections, such as exponential and logistic.

I would like to alter this function to reduce the number of required inputs, as all the information should be available from the model call, but that’s a work in progress.

Usage

predict_plot(mod,
             data,
             rvar,
             pvar,
             group = NULL,
             length = 50,
             interval = "confidence",
             correction = "normal",
             palette = "oslo"
             )

Arguments

Returns

Returns a plot with the observed (real) data plotted as points and the prediction plotted as lines, with a 95% confidence or prediction interval.

This function has a known issue with the colors on ungrouped predictions being kind of funky, as the function uses the predictor variable (x-axis) for the color, which works for the actual data (points), but doesn’t translate well to the predicted lines and ribbon.

Examples

mod1 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
predict_plot(mod1, iris, Sepal.Length, Petal.Length, Species)
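The data behind such a plot can be sketched with stats::predict() on a grid of predictor values. The object names below are illustrative, and the grid construction is an assumption about the general approach rather than the function's actual code; the plotting step itself is left out.

```r
mod1 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Grid of 50 predictor values for each group
newdata <- expand.grid(
  Petal.Length = seq(min(iris$Petal.Length), max(iris$Petal.Length),
                     length.out = 50),
  Species = levels(iris$Species)
)

# 95% bands: "confidence" covers the fitted mean, "prediction" covers new observations
conf_band <- predict(mod1, newdata, interval = "confidence", level = 0.95)
pred_band <- predict(mod1, newdata, interval = "prediction", level = 0.95)

# fit, lwr and upr columns alongside the grid, ready to draw as lines and ribbons
pred <- cbind(newdata, conf_band)
head(pred)
```

The prediction interval is always wider than the confidence interval at the same point, because it also accounts for the residual scatter of individual observations.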


Bug reporting

If you find any bugs, please report them at https://github.com/zachpeagler/ztils/issues.